In December, I successfully compiled the high-level spec for the Oil lexer to native code. Then I started a series of five blog posts on #lexing, but unfortunately only finished two of them.
This post outlines the rest of the series. I wanted to keep going in the "friendly tutorial" style, but I haven't found the time.
Instead, I think of it like this:
If you like to read code, most of what we'll discuss is related to osh-lex.re2c.h and core/lexer.py.
I briefly recap the first two posts, and then we'll see new material.
Main points from The Oil Lexer: Introduction and Recap:
Main points from When are Lexer Modes Useful?:
The lexer takes a mode parameter and uses it to determine the next token. The parser supplies the mode based on its own state.
Lexing and parsing are fundamentally different, although the line between them isn't always obvious. I follow this guideline: the lexer recognizes the non-recursive structure of the language, and the parser recognizes the recursive structure.
Now that we understand language composition and lexer modes, let's illustrate the concept in shell. Warning: this section may give you a headache, but it's important to be concrete.
Here are some word parts:
'SQ' "d\$q" $var ${var} ${var:-X} $(echo command) $((2+3))
Word parts are made of tokens:
'SQ' # 1 token, single-quoted strings lack escapes/substitutions
"d\$q" # 5 tokens: " $d \$ q "
$var # 1 token
${var} # 3 tokens: ${ var }
${var:-X} # 5 tokens: ${ var :- X }
$(echo command) # 4 tokens: $( echo command )
$((2+3)) # 5 tokens: $(( 2 + 3 ))
Words are made of word parts:
$ echo 'SQ'"d\$q"$var${var}${c:-X}$(echo command)$((2+3))  # a single word
SQd$qXcommand5
Commands are made of words:
$ for x in 1 2 3; do
>   # 4 words, the first resolves to the echo builtin
>   "ec$(echo ho)" hel${unset:-lo} world $x
> done
hello world 1
hello world 2
hello world 3
Boolean expressions are made of words:
$ [[ ec$(echo ho) == ec${unset:-ho} ]] && echo bool
bool
Arithmetic expressions are made of words and operator tokens:
$ echo $(( 2${x:-2} + 3$(echo 3) ))  # 22 + 33
55
Now we've come full circle. There are word parts that contain commands, boolean expressions, and arithmetic expressions:
$ echo $(echo command)--$([[ foo == foo ]] && echo bool)--$((2+3))
command--bool--5
In other words, OSH has four mutually recursive parsers that read from a single modal lexer.
~/git/oilshell/oil$ scripts/count.sh parser
...
   526 osh/lex.py
...
   185 osh/arith_parse.py   # uses core/tdop.py
   272 osh/bool_parse.py    # shared between [ and [[ languages
  1239 osh/word_parse.py
  1555 osh/cmd_parse.py
...
There are also minor parsers, like the one for brace expansions such as *.{cc,h}.
Hopefully it's clear from the above examples that the shell language can't be tokenized in just one way. The parser must supply the mode as context. Appendix B lists the OSH lexer modes.
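To make the mode mechanism concrete, here is a minimal sketch in Python. The names and calling convention are hypothetical, not the real core/lexer.py API; the point is only that the parser, not the lexer, decides which mode to read in.

# Hypothetical sketch: the word parser reads the inside of "..." in DQ mode,
# then goes back to reading in OUTER mode after the closing quote.
def ParseDoubleQuoted(lexer):
    parts = []
    while True:
        tok = lexer.Read('DQ')              # the parser supplies the mode
        if tok.id == 'Right_DoubleQuote':   # the closing "
            return parts
        parts.append(tok)                   # literal chars, ${...}, $(...), etc.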
I tried to follow these principles:
Avoid writing the lexer by hand with goto statements. This might be OK if your language has a simple lexical structure, e.g. a template language, but it doesn't work for shell.
Shell obviously has sublanguages, but Python, JavaScript, and even Lisp do as well. They all have:
number literals like 1.2e10, string literals like "foo\n", and especially interpolated strings like "foo${var}\n". The latter is a trend with JavaScript template literals, Python f-strings, Swift, etc. I claim that these should be represented by non-recursive sublanguages rather than a single opaque token.
Why? Because otherwise you have to reparse your input. See Appendix A for details on this code duplication.
I prefer to have the lexer march strictly forward through the input. Every input byte is touched exactly once in the front end. After that, tokens are used.
Python's front end does not have this property.
NOTE: For further motivation, this style of writing lexers has already helped at least one person (someone who wanted to parse "foo $foo", a mini-language).
An alternative to "re-parsing", as Python does it, is to decode numbers and strings in the lexer. (This works better in dynamic languages, where tokens don't require a fixed type.)
However, decoding in the lexer loses information, which violates the principles of the lossless syntax tree. That is, it makes it hard or impossible to reuse the parser for other code transformations.
In other words, our lexer divides the input into fine-grained spans labeled by token IDs. The spans can be converted back into text.
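Here's a small illustration of that property, assuming a hypothetical span representation (the real one lives in core/lexer.py and the ASDL schemas): each token is a labeled (start, length) span over the input line, so joining the span text reproduces the source byte for byte.

def ReconstructSource(line, tokens):
    # tokens: list of (token_id, start, length) spans covering the line
    return ''.join(line[start:start + length] for _, start, length in tokens)

line = 'echo ${x:-default}'
tokens = [
    ('Lit_Chars', 0, 4), ('WS_Space', 4, 1), ('Left_VarSub', 5, 2),
    ('Lit_VarName', 7, 1), ('VTest_ColonHyphen', 8, 2),
    ('Lit_Chars', 10, 7), ('Right_VarSub', 17, 1),
]
assert ReconstructSource(line, tokens) == line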
Let's switch gears. With Oil, I'm trying to efficiently implement a real-world language in an abstract style. I don't want to write 100K lines of C or C++ by hand.
The lexer is a good example of this. It's specified at a high level, with a set of rules like this:
(current position, lexer mode, Python regex) -> token ID if it matches
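For illustration, here's roughly what such a rule table could look like in Python. The token IDs and patterns below are made up for this sketch; the real definitions are in osh/lex.py.

import re

LEXER_DEF = {
    'OUTER': [
        (r'\$\(\(', 'Left_ArithSub'),
        (r'\$\(',   'Left_CommandSub'),
        (r'\$\{',   'Left_VarSub'),
        (r"'",      'Left_SingleQuote'),
        (r'"',      'Left_DoubleQuote'),
        (r'[a-zA-Z0-9_/.-]+', 'Lit_Chars'),
    ],
    'DQ': [
        (r'\\[$`"\\]', 'Lit_EscapedChar'),
        (r'\$\{',      'Left_VarSub'),
        (r'"',         'Right_DoubleQuote'),
        (r'[^$`"\\]+', 'Lit_Chars'),
    ],
}

def MatchToken(lex_mode, line, pos):
    # Try each rule for the mode at the current position; first match wins.
    for pat, tok_id in LEXER_DEF[lex_mode]:
        m = re.match(pat, line[pos:])
        if m:
            return tok_id, pos + m.end()
    return 'Unknown', pos + 1

print(MatchToken('OUTER', 'echo ${x}', 0))   # ('Lit_Chars', 4)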
These rules are translated to native code by three compilers:
1. core/lexer_gen.py translates our rules to re2c regexes and semantic actions. I wrote this small translator using Python's undocumented sre_parse.py module.
2. re2c translates those regexes into C: a state machine expressed with switch and goto statements.
3. A C or C++ compiler turns that state machine into native code.
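As a rough illustration of the first step (not the actual lexer_gen.py code), sre_parse exposes the parsed structure of a Python regex, which can then be walked to emit equivalent re2c syntax:

import sre_parse

tree = sre_parse.parse(r'[0-9]*<<-?')
for op, args in tree:
    print(op, args)
# Each entry is an opcode plus arguments: MAX_REPEAT for '*' and '?',
# LITERAL for '<', and a nested IN/RANGE for the character class [0-9].
# That's enough structure to translate these regexes into re2c's notation.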
Why is the last step interesting? Because re2c relies on the compiler to do nontrivial switch lowering transformations. As I understand it, LLVM uses multiple strategies so that a 256-case switch statement isn't a linear search: finding closed-form expressions, jump tables, binary search, and a hybrid of all three techniques.
Related question: Is re2c output compatible with WebAssembly? WebAssembly apparently doesn't support unrestricted gotos.
So most of the OSH lexer is generated from regular expressions. However, it also has a stack to disambiguate multiple meanings of the right parenthesis.
I call this mechanism a lexer hint. Roughly speaking, the parser
calls PushHint() when it sees a token like $( or (.
In other words:
Turning right parens into multiple different token IDs like Id.Right_CasePat
and Id.Right_FuncDef makes the parser's job easier.
See Appendix D for details.
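Here's a sketch of the hint mechanism with hypothetical names (the real code lives in core/lexer.py and the parsers). The parser pushes a hint when it sees an opening token, and the lexer consults the stack to turn a plain right paren into a more specific token ID.

class HintedLexer:
    def __init__(self):
        self.hints = []   # stack of (char, token_id) pairs

    def PushHint(self, char, token_id):
        # e.g. the parser calls PushHint(')', 'Right_CommandSub') after '$('
        self.hints.append((char, token_id))

    def Translate(self, char, default_id):
        # Called when the lexer matches a right paren
        if self.hints and self.hints[-1][0] == char:
            _, token_id = self.hints.pop()
            return token_id
        return default_id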
The lexer marches strictly forward through the input, but I needed a small hack to make these cases work:
(case x in x) ;; esac )     # case statement in subshell
$(case x in x) ;; esac )    # case statement in command sub
It's complicated to explain without stepping through it, but the first right paren would be misunderstood without Unread(). The hack is pretty limited. Other shell lexers I've seen also have something like Unread(), though I haven't examined them in detail.
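Here is a hypothetical sketch of the Unread() idea, not the real Oil code: the lexer remembers where the most recent token started, so the parser can push exactly one token back when it discovers that a right paren was read in the wrong context (as in the case-statement examples above).

class LineLexer:
    def __init__(self, tokens):
        self.tokens = tokens    # pretend these came from the regex matcher
        self.pos = 0
        self.last_pos = 0

    def Read(self):
        self.last_pos = self.pos
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def MaybeUnreadOne(self):
        # Back up by exactly one token; anything more isn't supported.
        self.pos = self.last_pos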
How would you distinguish between the do and done keywords in shell? One
way is to do successive string comparisons:
if (strcmp(cursor, "do") == 0) {
  return DO;
}
if (strcmp(cursor, "done") == 0) {
  return DONE;
}
However, this means we're touching each character more than once. Another way is to do something like Python's tokenizer.c, and have nested if statements:
if (c == '.') {
    c = tok_nextc(tok);
    if (isdigit(c)) {
        goto fraction;
    } else if (c == '.') {
        c = tok_nextc(tok);
        if (c == '.') {
The tok_get() function has if statements three levels deep in some places.
The nice thing about re2c is that we just write a flat list of patterns,
and we automatically get a state machine that can distinguish do and done
without backtracking or nested if statements.
In other words, the flat list of patterns in osh-lex.re2c.h is compiled to
an NFA, then a DFA. The DFA receives one character at a time, and is
in an "ambiguous" intermediate state until it receives the character after d
and o.
TODO: show graphic of do/done state machine.
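In lieu of the graphic, here's a rough Python rendering of the idea; it's hypothetical and ignores word boundaries (the real version is generated C in osh-lex.re2c.h). Each character is examined in a single forward pass, with an intermediate state after "do" that may still become "done".

def MatchDoOrDone(s, i):
    n = len(s)
    if i < n and s[i] == 'd' and i + 1 < n and s[i + 1] == 'o':
        # "Ambiguous" state: we've read "do", which may continue as "done"
        if i + 3 < n and s[i + 2] == 'n' and s[i + 3] == 'e':
            return 'KW_Done', i + 4
        return 'KW_Do', i + 2
    return None, i

# MatchDoOrDone('done ...', 0) -> ('KW_Done', 4)
# MatchDoOrDone('do x', 0)     -> ('KW_Do', 2)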
For small lexers, the hand-written options are OK. But for the large OSH lexer, the concept of nondeterminism makes life easier.
Another example is matching redirects:
[0-9]* "<" { *id = id__Redir_Less; break; }
[0-9]* "<<" { *id = id__Redir_DLess; break; }
[0-9]* "<<<" { *id = id__Redir_TLess; break; }
[0-9]* "<<-" { *id = id__Redir_DLessDash; break; }
[0-9]* "<&" { *id = id__Redir_LessAnd; break; }
Writing an efficient state machine to distinguish these cases by hand would be tedious.
NOTE: This may be more applicable to messy languages like shell, HTML, and CommonMark. In Python and JavaScript, the lexical structure is clearer. Maybe Perl has these issues too.
Shell is an odd language, and there are several cases where the lexer and parser "conspire" to recognize a construct.
The lexer should make as many distinctions as possible, given its limited model of marching forward through the input with regular expressions.
This makes the rest of the job easier for the parser. The parser only looks at token IDs; it never looks at raw characters.
Sometimes I think of this as "pre-structuring" the input.
Examples:
Example from aosabook
Bash example: for for in for for; do echo for; done
We have no problem with this!!! It just worked the first time.
Pre-structuring means speculatively creating separate tokens for things that may or may not be "operators". I think of it as cheap disambiguation. In general, I want both the lexer and the parser to always walk forward: OSH is a predictive recursive descent parser, not a backtracking recursive descent parser.
Pre-structuring is a way to delay prediction until you have enough information. The lexer does a lot of work so that [OSH Only requires Two Tokens of Lookahead] (sketched below).
It creates distinctions that may or may not matter.
It's about communication between the lexer and the parser, whereas nondeterminism concerns the lexer alone.
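Here's a toy illustration of that, with made-up token IDs: deciding whether a command starts a function definition like "f() { ... }" by peeking at two token IDs, never re-reading characters and never backtracking.

def IsFuncDef(tokens, i):
    # tokens: list of (token_id, text) pairs from the lexer; i: current position
    if i + 1 >= len(tokens):
        return False
    return tokens[i][0] == 'Lit_Chars' and tokens[i + 1][0] == 'Op_LParen'

# IsFuncDef([('Lit_Chars', 'f'), ('Op_LParen', '(')], 0) -> True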
Teaching that lexing is regular languages and parsing is context-free grammars is actively misleading. That's not how the languages you care about actually work.
For a given programming language, the boundary between the lexer and parser is not obvious.
Also: the lexer can do a lot of work for the parser. It makes the parser's job easier.
I talked about:
Principles to keep the lexer clean
Implementation mechanisms:
Concepts:
And I hopefully gave examples of each.
I abbreviated many points made in this post, so if you'd like to hear more, leave a comment.
There were a couple TODOs as well. Let me know if you'd like to see those filled in.
Parser/tokenize.c -- search for 'j'
Python/ast.c -- search for 'j' (parsenumber)
What about string literal logic?
This problem also occurs in PyPy -- and presumably Jython, IronPython, etc., since they all use the Python grammar.
Why is async the only keyword that Python's lexer deals with? That is,
keywords like def, if, and for are unknown to the lexer, but async
is known.
Python has two separate pieces of code that recognize the structure of number and string literals.
How does it handle f-strings?
It's not regular because of indentation. However, I believe the "modes" model can easily handle this.
To guide you, look at the intro:
Scannerless parsing is bad; lexer modes are good.
Lexing and parsing are fundamentally different: lexing recognizes non-recursive structure, while parsing recognizes recursive structure.
Another reason to use a stack in the lexer: for recognizing Python-like
indentation, and for implementing the rule where expressions broken across
lines in ( ), [ ], and { } don't require backslash escapes.
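Here's a sketch of that idea (illustrative, not Oil or CPython code): opening brackets push onto a stack, closing brackets pop, and a physical newline only becomes a NEWLINE token when the stack is empty.

class BracketAwareLexer:
    def __init__(self):
        self.paren_stack = []

    def OnToken(self, tok_id):
        if tok_id in ('Op_LParen', 'Op_LBracket', 'Op_LBrace'):
            self.paren_stack.append(tok_id)
        elif tok_id in ('Op_RParen', 'Op_RBracket', 'Op_RBrace'):
            if self.paren_stack:
                self.paren_stack.pop()

    def EmitNewline(self):
        # Inside ( ), [ ], or { }, a newline is just whitespace
        return not self.paren_stack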
These lexer modes are found in osh/types.asdl:
lex_mode = NONE | COMMENT | OUTER | DBRACKET | SQ | DQ | DOLLAR_SQ | ARITH | EXTGLOB | VS_1 | VS_2 | VS_ARG_UNQ | VS_ARG_DQ | BASH_REGEX | BASH_REGEX_CHARS
In the command language:
( cd / && pwd ): subshell
case foo in foo) echo hi; ...: after a case pattern
(( x = 1 + 2 )): arithmetic statement
f() { echo f; }: function def
a=(1 2 3): array literal

In the word language:

echo $(echo hi): command substitution
echo $(( 1 + 2 )): arithmetic substitution
echo @(ext|glob): extended glob

In the arithmetic language:

echo $(( (1+2)*3 ))  # grouping for precedence

Recall that these languages are composed, which means the different meanings of right paren can be interleaved.