# Eggex Example: Recognizing Python Integer Literals

2019-12-22

Eggex is Oil's regular expression syntax. You may want to skim the headings of the eggex doc if it's unfamiliar.

This post demonstrates egg expressions on a realistic problem: recognizing integer literals like `42`, `0xFF`, or `0o755`.

It also shows that eggexes are appropriate for two problems that traditionally use different regex syntax:

1. Ad-hoc text processing in languages like shell and Python. For example, I interactively refactor source code with regexes, and the /release/\$VERSION/ page is essentially a big shell script.
2. Lexing programming languages. Under the hood, Oil's lexer uses re2c to turn regexes into efficient C code. But the source is in Python, using its Perl-style regex syntax.

The Oil language and its eggex subset are both still under development. I made a few changes motivated by this example, and I'm happy with the results. Let me know what you think in the comments or follow the links on the wiki!

## Three Syntaxes for Regular Languages

### Python's Spec

Considered by themselves, integer literals are a regular language. You can write them in decimal, binary, octal, or hexadecimal; and add underscores for readability.

The reference manual describes them like this:

``````integer      ::=  decinteger | bininteger | octinteger | hexinteger
decinteger   ::=  nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
bininteger   ::=  "0" ("b" | "B") (["_"] bindigit)+
octinteger   ::=  "0" ("o" | "O") (["_"] octdigit)+
hexinteger   ::=  "0" ("x" | "X") (["_"] hexdigit)+
nonzerodigit ::=  "1"..."9"
digit        ::=  "0"..."9"
bindigit     ::=  "0" | "1"
octdigit     ::=  "0"..."7"
hexdigit     ::=  digit | "a"..."f" | "A"..."F"
``````

Examples of valid integer literals:

``````7     2147483647                        0o177    0b100110111
100_000_000_000                   0b_1110_0101
``````

I'm unfamiliar with this metalanguage, but I can guess that `...` is a character range, that `[]` means "optional", and that `*` and `+` denote repetition.

This syntax is meant for humans to read — i.e. it isn't executable.

### Eggex

Here's the same language in eggex syntax:

``````DecDigit = / [0-9] /
BinDigit = / [0-1] /
OctDigit = / [0-7] /
HexDigit = / [0-9 a-f A-F] /

DecInt   = / [1-9] ('_'? DecDigit)* | '0'+ ('_'? '0')* /
BinInt   = / '0' [b B] ('_'? BinDigit)+ /
OctInt   = / '0' [o O] ('_'? OctDigit)+ /
HexInt   = / '0' [x X] ('_'? HexDigit)+ /

Integer  = / DecInt | BinInt | OctInt | HexInt /
``````

It shows some of the differences between eggex and commmon Perl-style regex syntax:

• Patterns can be composed by naming them, e.g. `BinDigit`.
• Literals are quoted, e.g. `'0'` and `'_'`. Quotes are optional in character classes for brevity, and to make them similar to Perl/POSIX syntax.
• Terms are separated with whitespace.

Unlike the Python spec, this syntax is executable. Compared with POSIX regexes, it's more readable.

(Note that I also re-organized the rules: They're in bottom-up order rather than top-down, and I inlined the `nonzerodigit` rule because it's only used once.)

### POSIX Syntax

An eggex in Oil evaluates to a POSIX extended regex (ERE) by default, so we simply use `echo` to see it.

``````\$ echo \$Integer
([1-9](_?[0-9])*|0+(_?0)*|0[bB](_?[0-1])+|0[oO](_?[0-7])+|0[xX](_?[0-9a-fA-F])+)
``````

I wouldn't want to type this by hand! Even if I succeeded, reading it and modifying it later would be a challenge.

Perl's `/x` and Python's `re.VERBOSE` help to a degree, but they're not available in shell tools like grep and awk.

## Two Use Cases for Eggexes

### Text Processing With Unix Tools

The integer literal problem seems naturally closer to lexing, but let's try it with Unix tools first.

Here I grep for integer literals using Oil's seamless integration:

``````bash\$ bin/oil

~/git/oilshell/oil\$ egrep \$Integer */*.py
...
core/process.py:    fd = posix.open(path, fd_mode, 0o666)
...
osh/string_ops.py:  return ''.join(chr(b & 0xFF) for b in bytes_)
...
osh/string_ops.py:  elif (starting_byte >> 3) == 0b11110:
``````

It works! We found:

• An octal literal `0o666` for file permissions (about the only thing they're good for).
• A hex literal `0xFF` for truncating an integer to a single byte.
• A binary literal `0b11110` for recognizing with UTF-8. (UTF-8 is best understood in binary notation.)

### Lexing Programming Languages

In addition to translating to POSIX ERE syntax, eggexes should also translate to:

1. Perl-style syntax, to use with the regex engines in nearly all other languages: Python, Ruby, Java, C++, etc. See issue 488.
2. The syntax that the re2c compiler accepts. This would let Oil's own lexer be expressed in eggex.

To be concrete, here's an except from Oil's frontend/lex.py:

``````VAR_NAME_RE = r'[a-zA-Z_][a-zA-Z0-9_]*'

rules = [
R(r'\\$' + VAR_NAME_RE,  Id.VSub_DollarName),
...
R(VAR_NAME_RE + '\+?=', Id.Lit_VarLike),
R(VAR_NAME_RE + '\[',   Id.Lit_ArrayLhsOpen),
]
``````

As you can see, we use the syntax from Python's `re` module and string concatenation.

We do this because Oil started as a pure Python program, and because the lexer requires "metaprogramming" to remove duplication.

But Eggex could also remove that duplication with pattern composition. It's also improved by "first class" regex syntax, whitespace, and quoted literals.

``````VarName = / [a-z A-Z _] [a-z A-Z 0-9 _]* /

rules = [
( / '\$' VarName /,      Id.VSub_DollarName),
...
( / VarName '+'? '=' /, Id.Lit_VarLike),
( / VarName '[' /,      Id.Lit_ArrayLhsOpen),
]
``````

See Appendix A for details on the lexer's architecture.

## Recent Syntax Changes

I used the integer literal example to test whether Eggex is a good syntax.

I removed punctuation in:

• Pattern composition syntax: `Subpattern` is allowed in addition to `@subpattern`. The latter is still supported if you don't want to use capital letters.
• Character literals: `[b B]` is allowed in addition to `['b' 'B']`. This is consistent with ranges like `[a-f]`, and with Perl/POSIX syntax.

I also implemented the capturing and grouping syntax, which contains less punctuation than the Perl-style syntax:

Eggex:

``````(mygroup+)
< mycapture+ >
< mycapture+ : name >
``````

Perl/Python:

``````(?:mygroup+)
(mycapture+)
(?P<name>mycapture+)
``````

These changes were motivated by Zulip chats. Thanks to Eric Bergstrome for discussing the grouping and capturing syntax, and to Aur Saraf for discussing the use of punctuation.

The Oil language is still under development, and I've made multiple changes based on user feedback. Feel free to lurk on Zulip and see what we're talking about!

## Summary

Like Oil itself, eggexes are a seamless upgrade. You write patterns in a modern, composable syntax, but tools like `egrep` and `awk` are passed the ERE syntax they expect.

The same eggex syntax can be used for both ad-hoc text processing and writing fast and correct lexers.

Try porting regex to eggex and see if it's more maintainable. Start a thread on `#oil-discuss` on Zulip if you have problems.

Also see the Help Wanted section if you want to help make Oil happen! There are many opportunities to parallelize development.

## Appendix A: Oil's Lexer Uses Two Stages of Code Generation

Since I didn't finish the #lexing series, now is a good time to explain the architecture of Oil's lexer.

I started publishing the following dense source files on each /release/\$VERSION/ page. If anyone wants to implement Oil in another language, this portable and principled lexer will save them many hours of work.

1. frontend/lex.py is a pure Python source file using the `re` module syntax. The dynamic lists are evaluated into (pattern, `Id`) rules for 14 lexer modes.

Then we use the (undocumented) regex parse tree from Python's `sre_parse.py` to produce

2. osh-lex.re2c.h, a file in re2c syntax. This syntax has quoted literals like lex and Eggex.

Then re2c produces

3. osh-lex.h, a large C file with 20K lines of `switch` and `goto` statements, which nicely express a DFA.

Then the C compiler produces

4. Machine code. Even though the the source code is large, the machine code isn't, according to the metrics. This is because C compilers perform a switch lowering optimization.

Why go to all this trouble? Because regexes compiled to state machines are more correct than bash's style of groveling through backslashes and braces one at a time. It's also more concise: Oil's source code is many times smaller than bash's.

## Appendix B: Other Pattern Syntaxes

Eggex obeys some rough refactoring properties. You might interactively type:

``````\$ egrep [0-9]+ foo.txt
``````

And then extract a pattern so you can test and enhance it:

``````\$ pat = / [0-9]+ /
\$ egrep \$pat foo.txt
``````

In general the eggex should resemble the regex, but be more readable. The integer literal example is intended to show that.

Eggex also uses postfix repetition operators like `+` and `*` so it's consistent with:

• POSIX regex syntax
• Perl regex syntax
• Context Free Grammar syntax like EBNF and Python's pgen2, which Oil uses. Except that `[optional]` should be `optional?`.
• Parsing Expression Grammar syntax, which Python is likely moving to.
• The metalanguage for command line syntax.
• `grep [OPTION]... PATTERN [FILE]...` could be
• `grep OPTION* PATTERN FILE*` to be consistent.

Note that `...` meant "character range" in Python's spec, but means "repetition" in command lines. One of Oil's main goals to avoid such punning across the many languages and metalanguages that compose a large software system.