Eggex Example: Recognizing Python Integer Literals

2019-12-22

Eggex is Oil's regular expression syntax. You may want to skim the headings of the eggex doc if it's unfamiliar.

This post demonstrates egg expressions on a realistic problem: recognizing integer literals like 42, 0xFF, or 0o755.

It also shows that eggexes are appropriate for two problems that traditionally use different regex syntax:

Ad-hoc text processing in languages like shell and Python. For example, I interactively refactor source code with regexes, and the /release/$VERSION/ page is essentially a big shell script.
Lexing programming languages. Under the hood, Oil's lexer uses re2c to turn regexes into efficient C code. But the source is in Python, using its Perl-style regex syntax.

The Oil language and its eggex subset are both still under development. I made a few changes motivated by this example, and I'm happy with the results. Let me know what you think in the comments or follow the links on the wiki!

Table of Contents

Three Syntaxes for Regular Languages

Python's Spec

Eggex

POSIX Syntax

Two Use Cases for Eggexes

Text Processing With Unix Tools

Lexing Programming Languages

Recent Syntax Changes

Summary

Appendix A: Oil's Lexer Uses Two Stages of Code Generation

Appendix B: Other Pattern Syntaxes

Three Syntaxes for Regular Languages

Python's Spec

Considered by themselves, integer literals are a regular language. You can write them in decimal, binary, octal, or hexadecimal; and add underscores for readability.

The reference manual describes them like this:

integer      ::=  decinteger | bininteger | octinteger | hexinteger
decinteger   ::=  nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
bininteger   ::=  "0" ("b" | "B") (["_"] bindigit)+
octinteger   ::=  "0" ("o" | "O") (["_"] octdigit)+
hexinteger   ::=  "0" ("x" | "X") (["_"] hexdigit)+
nonzerodigit ::=  "1"..."9"
digit        ::=  "0"..."9"
bindigit     ::=  "0" | "1"
octdigit     ::=  "0"..."7"
hexdigit     ::=  digit | "a"..."f" | "A"..."F"

Examples of valid integer literals:

7     2147483647                        0o177    0b100110111
3     79228162514264337593543950336     0o377    0xdeadbeef
      100_000_000_000                   0b_1110_0101

I'm unfamiliar with this metalanguage, but I can guess that ... is a character range, that [] means "optional", and that * and + denote repetition.

This syntax is meant for humans to read — i.e. it isn't executable.

Source: https://docs.python.org/3/reference/lexical_analysis.html#integer-literals

Eggex

Here's the same language in eggex syntax:

DecDigit = / [0-9] /
BinDigit = / [0-1] /
OctDigit = / [0-7] /
HexDigit = / [0-9 a-f A-F] /

DecInt   = / [1-9] ('_'? DecDigit)* | '0'+ ('_'? '0')* /
BinInt   = / '0' [b B] ('_'? BinDigit)+ /
OctInt   = / '0' [o O] ('_'? OctDigit)+ /
HexInt   = / '0' [x X] ('_'? HexDigit)+ /

Integer  = / DecInt | BinInt | OctInt | HexInt /

It shows some of the differences between eggex and commmon Perl-style regex syntax:

Patterns can be composed by naming them, e.g. BinDigit.
Literals are quoted, e.g. '0' and '_'. Quotes are optional in character classes for brevity, and to make them similar to Perl/POSIX syntax.
Terms are separated with whitespace.

Unlike the Python spec, this syntax is executable. Compared with POSIX regexes, it's more readable.

(Note that I also re-organized the rules: They're in bottom-up order rather than top-down, and I inlined the nonzerodigit rule because it's only used once.)

POSIX Syntax

An eggex in Oil evaluates to a POSIX extended regex (ERE) by default, so we simply use echo to see it.

$ echo $Integer
([1-9](_?[0-9])*|0+(_?0)*|0[bB](_?[0-1])+|0[oO](_?[0-7])+|0[xX](_?[0-9a-fA-F])+)

I wouldn't want to type this by hand! Even if I succeeded, reading it and modifying it later would be a challenge.

Perl's /x and Python's re.VERBOSE help to a degree, but they're not available in shell tools like grep and awk.

Two Use Cases for Eggexes

Text Processing With Unix Tools

The integer literal problem seems naturally closer to lexing, but let's try it with Unix tools first.

Here I grep for integer literals using Oil's seamless integration:

bash$ bin/oil

~/git/oilshell/oil$ egrep $Integer */*.py
...
core/process.py:    fd = posix.open(path, fd_mode, 0o666)
...
osh/string_ops.py:  return ''.join(chr(b & 0xFF) for b in bytes_)
...
osh/string_ops.py:  elif (starting_byte >> 3) == 0b11110:

It works! We found:

An octal literal 0o666 for file permissions (about the only thing they're good for).
A hex literal 0xFF for truncating an integer to a single byte.
A binary literal 0b11110 for recognizing with UTF-8. (UTF-8 is best understood in binary notation.)

Lexing Programming Languages

In addition to translating to POSIX ERE syntax, eggexes should also translate to:

Perl-style syntax, to use with the regex engines in nearly all other languages: Python, Ruby, Java, C++, etc. See issue 488.
The syntax that the re2c compiler accepts. This would let Oil's own lexer be expressed in eggex.

To be concrete, here's an except from Oil's frontend/lex.py:

VAR_NAME_RE = r'[a-zA-Z_][a-zA-Z0-9_]*'

rules = [ 
  R(r'\$' + VAR_NAME_RE,  Id.VSub_DollarName),
  ...
  R(VAR_NAME_RE + '\+?=', Id.Lit_VarLike),
  R(VAR_NAME_RE + '\[',   Id.Lit_ArrayLhsOpen),
]

As you can see, we use the syntax from Python's re module and string concatenation.

We do this because Oil started as a pure Python program, and because the lexer requires "metaprogramming" to remove duplication.

But Eggex could also remove that duplication with pattern composition. It's also improved by "first class" regex syntax, whitespace, and quoted literals.

VarName = / [a-z A-Z _] [a-z A-Z 0-9 _]* /

rules = [ 
  ( / '$' VarName /,      Id.VSub_DollarName),
  ...
  ( / VarName '+'? '=' /, Id.Lit_VarLike),
  ( / VarName '[' /,      Id.Lit_ArrayLhsOpen),
]

See Appendix A for details on the lexer's architecture.

Recent Syntax Changes

I used the integer literal example to test whether Eggex is a good syntax.

I removed punctuation in:

Pattern composition syntax: Subpattern is allowed in addition to @subpattern. The latter is still supported if you don't want to use capital letters.
Character literals: [b B] is allowed in addition to ['b' 'B']. This is consistent with ranges like [a-f], and with Perl/POSIX syntax.

I also implemented the capturing and grouping syntax, which contains less punctuation than the Perl-style syntax:

Eggex:

(mygroup+)
< mycapture+ >
< mycapture+ : name >

Perl/Python:

(?:mygroup+)
(mycapture+)
(?P<name>mycapture+)

These changes were motivated by Zulip chats. Thanks to Eric Bergstrome for discussing the grouping and capturing syntax, and to Aur Saraf for discussing the use of punctuation.

The Oil language is still under development, and I've made multiple changes based on user feedback. Feel free to lurk on Zulip and see what we're talking about!

Summary

Like Oil itself, eggexes are a seamless upgrade. You write patterns in a modern, composable syntax, but tools like egrep and awk are passed the ERE syntax they expect.

The same eggex syntax can be used for both ad-hoc text processing and writing fast and correct lexers.

Try porting regex to eggex and see if it's more maintainable. Start a thread on #oil-discuss on Zulip if you have problems.

Also see the Help Wanted section if you want to help make Oil happen! There are many opportunities to parallelize development.

Appendix A: Oil's Lexer Uses Two Stages of Code Generation

Since I didn't finish the #lexing series, now is a good time to explain the architecture of Oil's lexer.

I started publishing the following dense source files on each /release/$VERSION/ page. If anyone wants to implement Oil in another language, this portable and principled lexer will save them many hours of work.

frontend/lex.py is a pure Python source file using the re module syntax. The dynamic lists are evaluated into (pattern, Id) rules for 14 lexer modes.

Then we use the (undocumented) regex parse tree from Python's sre_parse.py to produce
osh-lex.re2c.h, a file in re2c syntax. This syntax has quoted literals like lex and Eggex.

Then re2c produces
osh-lex.h, a large C file with 20K lines of switch and goto statements, which nicely express a DFA.

Then the C compiler produces
Machine code. Even though the the source code is large, the machine code isn't, according to the metrics. This is because C compilers perform a switch lowering optimization.

Why go to all this trouble? Because regexes compiled to state machines are more correct than bash's style of groveling through backslashes and braces one at a time. It's also more concise: Oil's source code is many times smaller than bash's.

Appendix B: Other Pattern Syntaxes

Eggex obeys some rough refactoring properties. You might interactively type:

$ egrep [0-9]+ foo.txt

And then extract a pattern so you can test and enhance it:

$ pat = / [0-9]+ /
$ egrep $pat foo.txt

In general the eggex should resemble the regex, but be more readable. The integer literal example is intended to show that.

Eggex also uses postfix repetition operators like + and * so it's consistent with:

POSIX regex syntax
Perl regex syntax
Context Free Grammar syntax like EBNF and Python's pgen2, which Oil uses. Except that [optional] should be optional?.
Parsing Expression Grammar syntax, which Python is likely moving to.
The metalanguage for command line syntax.
- grep [OPTION]... PATTERN [FILE]... could be
- grep OPTION* PATTERN FILE* to be consistent.

Note that ... meant "character range" in Python's spec, but means "repetition" in command lines. One of Oil's main goals to avoid such punning across the many languages and metalanguages that compose a large software system.