Tips, Observations, and Opinions On Lexing

2018-05-01

In December, I successfully compiled the high-level spec for the Oil lexer to native code. Then I started a series of five blog posts on #lexing, but unfortunately only finished two of them.

This post outlines the rest of the series. I wanted to keep going in the "friendly tutorial" style, but I haven't found the time.

Instead, this post summarizes each topic briefly. If you like to read code, most of what we'll discuss relates to osh-lex.re2c.h and core/lexer.py.

I briefly recap the first two posts, and then move on to new material.

Table of Contents
Recap: Why is Lexing Interesting?
Recap: When are Lexer Modes Useful?
Examples of Language Composition in Shell
OSH Lexer Design Principles
Language Composition is the Rule, Not the Exception
Lexer Modes Complement a Lossless Syntax Tree
A Chain of Three Compilers Generates the OSH Lexer
Mechanism: Lexer Hints
Mechanism: Unread()
Tip: Generated Lexers Can Express Nondeterminism, Which Simplifies Code
Examples of Speculative Distinctions Made by the OSH Lexer
Challenging the Traditional Architecture
High-Level Summary and Conclusion
Appendix A: Reparsing of Input in CPython (and PyPy)
Appendix B: Other Problems with Python's Lexer and Parser
Appendix C: List of OSH Lexer Modes
Appendix D: Meanings of Right Paren in Shell

Recap: Why is Lexing Interesting?

Main points from The Oil Lexer: Introduction and Recap:

Recap: When are Lexer Modes Useful?

Main points from When are Lexer Modes Useful?:

Lexing and parsing are fundamentally different, although the line between them isn't always obvious. I follow this guideline: the lexer should handle the non-recursive structure of the language, while the parser handles the recursive structure.

Examples of Language Composition in Shell

Now that we understand language composition and lexer modes, let's illustrate the concept in shell. Warning: this section may give you a headache, but it's important to be concrete.

Here are some word parts:

'SQ'  "d\$q"  $var  ${var}  ${var:-X}  $(echo command)  $((2+3))

Word parts are made of tokens:

'SQ'             # 1 token, single-quoted strings lack escapes/substitutions
"d\$q"           # 5 tokens:  "  $d  \$  q  "
$var             # 1 token
${var}           # 3 tokens:  ${  var  }
${var:-X}        # 5 tokens:  ${  var  :-  X  }
$(echo command)  # 4 tokens:  $(  echo  command  )
$((2+3))         # 5 tokens:  $(( 2  +  3 ))

Words are made of word parts:

$ echo 'SQ'"d\$q"$var${var}${c:-X}$(echo command)$((2+3))  # a single word
SQd$qXcommand5

Commands are made of words:

$ for x in 1 2 3; do
>   # 4 words, the first resolves to the echo builtin
>   "ec$(echo ho)" hel${unset:-lo} world $x
> done
hello world 1
hello world 2
hello world 3

Boolean expressions are made of words:

$ [[ ec$(echo ho) == ec${unset:-ho} ]] && echo bool
bool

Arithmetic expressions are made of words and operator tokens:

$ echo $(( 2${x:-2} + 3$(echo 3) ))  # 22 + 33
55

Now we've come full circle. There are word parts that contain commands, boolean expressions, and arithmetic expressions:

$ echo $(echo command)--$([[ foo == foo ]] && echo bool)--$((2+3))
command--bool--5

In other words, OSH has four mutually recursive parsers that read from a single modal lexer.

~/git/oilshell/oil$ scripts/count.sh parser
...
   526 osh/lex.py
...
   185 osh/arith_parse.py  # uses core/tdop.py
   272 osh/bool_parse.py   # shared between [ and [[ languages
  1239 osh/word_parse.py
  1555 osh/cmd_parse.py
...

There are also minor parsers, such as the one for brace expansion, e.g. *.{cc,h}.

Hopefully it's clear from the above examples that the shell language can't be tokenized in just one way. The parser must supply the mode as context. Appendix C lists the OSH lexer modes.
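To make that concrete, here's a minimal sketch of a modal lexer in Python. The mode names, token IDs, and regexes are made up for illustration; they don't match OSH's actual definitions in osh/lex.py.

import re

# Hypothetical rules: each lexer mode has its own list of
# (compiled regex, token ID) pairs.  Not OSH's actual definitions.
RULES = {
    'OUTER': [
        (re.compile(r'[a-zA-Z0-9_/.-]+'), 'Lit_Chars'),
        (re.compile(r'"'), 'Left_DoubleQuote'),
        (re.compile(r'\$\('), 'Left_CommandSub'),
        (re.compile(r'[ \t]+'), 'WS_Space'),
    ],
    'DQ': [  # inside "double quotes", fewer characters are special
        (re.compile(r'[^"$\\]+'), 'Lit_Chars'),
        (re.compile(r'\\.'), 'Lit_EscapedChar'),
        (re.compile(r'"'), 'Right_DoubleQuote'),
    ],
}

def read_token(line, pos, mode):
    """The parser supplies the mode; return (token ID, text, new position)."""
    for pat, tok_id in RULES[mode]:
        m = pat.match(line, pos)
        if m:
            return tok_id, m.group(0), m.end(0)
    raise AssertionError('no rule matched at %d in mode %r' % (pos, mode))

# The same character is tokenized differently depending on the mode:
print(read_token('echo "hi"', 5, 'OUTER'))  # ('Left_DoubleQuote', '"', 6)
print(read_token('hi" next', 0, 'DQ'))      # ('Lit_Chars', 'hi', 2)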

OSH Lexer Design Principles

I tried to follow these principles:

Language Composition is the Rule, Not the Exception

Shell obviously has sublanguages, but Python, JavaScript, and even Lisp do as well. They all have string literals with escape sequences like \n, and numeric literals like 0x20 or 1e100: little languages embedded in the main one.

I claim that these should be represented by non-recursive sublanguages rather than a single opaque token.

Why? Because otherwise you have to reparse your input. See Appendix A for details on this code duplication.

I prefer to have the lexer march strictly forward through the input. Every input byte is touched exactly once in the front end. After that, tokens are used.

Python's front end does not have this property.

NOTE: For further motivation: this style of writing lexers has already helped at least one person, who wanted to parse a mini-language like "foo $foo".

Lexer Modes Complement a Lossless Syntax Tree

An alternative to the "re-parsing" that Python does is to decode numbers and strings in the lexer. (This works better in dynamic languages, where tokens don't require a fixed type.)

However, decoding in the lexer loses information, which violates the principles of the lossless syntax tree. That is, it makes it hard or impossible to reuse the parser for other code transformations.

In other words, our lexer divides the input into fine-grained spans labeled by token IDs. The spans can be converted back into text.
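For example, here's a sketch of the idea (not OSH's actual data structures): if a token is just an ID plus a span into the original source, the exact text is always recoverable.

from collections import namedtuple

# A token is a label plus a span; text is never copied or decoded here.
# (A sketch of the idea, not OSH's actual data structures.)
Token = namedtuple('Token', 'id start length')

src = r'echo "a\$b"'
tokens = [
    Token('Lit_Chars', 0, 4),           # echo
    Token('WS_Space', 4, 1),
    Token('Left_DoubleQuote', 5, 1),    # "
    Token('Lit_Chars', 6, 1),           # a
    Token('Lit_EscapedChar', 7, 2),     # \$
    Token('Lit_Chars', 9, 1),           # b
    Token('Right_DoubleQuote', 10, 1),  # "
]

# Because the spans partition the input, the original text comes back exactly.
assert ''.join(src[t.start : t.start + t.length] for t in tokens) == src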

A Chain of Three Compilers Generates the OSH Lexer

Let's switch gears. With Oil, I'm trying to efficiently implement a real-world language in an abstract style. I don't want to write 100K lines of C or C++ by hand.

The lexer is a good example of this. It's specified at a high level, with a set of rules like this:

(current position, lexer mode, Python regex) -> token ID if it matches
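In Python, a handful of such rules might look like this. This is a hypothetical fragment in the spirit of osh/lex.py, not its actual contents; the current position is implicit in how the table is used.

# Hypothetical rules in the spirit of osh/lex.py (not its actual contents).
# Each lexer mode maps to a list of (Python regex, token ID) pairs.
LEXER_DEF = {
    'ARITH': [
        (r'[0-9]+', 'Lit_Digits'),
        (r'\+', 'Arith_Plus'),
        (r'\)\)', 'Arith_RParen2'),   # ends $(( ... ))
    ],
    'DQ': [
        (r'[^"$\\]+', 'Lit_Chars'),
        (r'\$\(', 'Left_CommandSub'),
    ],
}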

These rules are translated to native code by three compilers:

  1. core/lexer_gen.py translates our rules to re2c regexes and semantic actions. I wrote this small translator using Python's undocumented sre_parse.py module.
  2. re2c translates regexes to portable C code, with a DFA expressed as a series of switch and goto statements.
  3. A C compiler like Clang generates native code from the output of re2c.

Why is the last step interesting? Because re2c relies on the compiler to do nontrivial switch lowering transformations.

As I understand it, LLVM uses multiple strategies so that a 256-case switch statement isn’t a linear search: finding closed-form expressions, jump tables, binary search, and a hybrid of all three techniques.

Related question: Is re2c output compatible with WebAssembly? WebAssembly apparently doesn't support unrestricted gotos.

Mechanism: Lexer Hints

So most of the OSH lexer is generated from regular expressions. However, it also has a stack to disambiguate multiple meanings of the right parenthesis.

I call this mechanism a lexer hint. Roughly speaking, the parser calls PushHint() when it sees a token like $( or (.

In other words: turning right parens into distinct token IDs like Id.Right_CasePat and Id.Right_FuncDef makes the parser's job easier.

See Appendix D for details.
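Here's a sketch of the mechanism in Python. PushHint() and the Id names come from above; the stack of (old ID, new ID) pairs and the translation step are my guess at the shape of the code in core/lexer.py.

# A sketch of lexer hints (simplified, with strings instead of Id constants).
class Lexer(object):
    def __init__(self):
        self.hints = []  # stack of (old token ID, new token ID)

    def PushHint(self, old_id, new_id):
        # e.g. the parser began a case pattern, so the next ')' ends it
        self.hints.append((old_id, new_id))

    def _TranslateHint(self, tok_id):
        if self.hints and self.hints[-1][0] == tok_id:
            _, new_id = self.hints.pop()
            return new_id
        return tok_id

# The raw lexer matches ')' and produces a generic 'Op_RParen', which the
# hint turns into a specific ID like 'Right_CasePat' for the parser.
lx = Lexer()
lx.PushHint('Op_RParen', 'Right_CasePat')
print(lx._TranslateHint('Op_RParen'))  # Right_CasePat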

Mechanism: Unread()

The lexer marches strictly forward through the input, but I needed a small hack to make these cases work:

 (case x in x) ;; esac )     # case statement in subshell
$(case x in x) ;; esac )     # case statement in command sub

It's complicated to explain without stepping through it, but the first right paren would be misunderstood without Unread(). The hack is pretty limited.

Other shell lexers I've seen also have something like Unread(), though I haven't examined them in detail.
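Here's a sketch of the interface, assuming the pushback is limited to a single token; the class name and details are hypothetical.

# Sketch of Unread(): the caller can back up by exactly one token.
# (Hypothetical names; the real hack in OSH is similarly limited.)
class TokenReader(object):
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def Read(self):
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def Unread(self):
        assert self.i > 0, 'nothing to unread'
        self.i -= 1  # back up by at most one token

r = TokenReader(['x', ')'])
r.Read()    # 'x'
r.Read()    # ')' -- oops, this ends a case pattern, not what we expected
r.Unread()  # put it back so the case parser can consume it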

Tip: Generated Lexers Can Express Nondeterminism, Which Simplifies Code

How would you distinguish between the do and done keywords in shell? One way is to do successive string comparisons:

if (strcmp(cursor, "do") == 0) {
  return DO;
}
if (strcmp(cursor, "done") == 0) {
  return DONE;
}

However, this means we're touching each character more than once. Another way is to do something like Python's tokenizer.c, and have nested if statements:

    if (c == '.') {
        c = tok_nextc(tok);
        if (isdigit(c)) {
            goto fraction;
        } else if (c == '.') {
            c = tok_nextc(tok);
            if (c == '.') {

The tok_get() function has if statements three levels deep in some places.

The nice thing about re2c is that we just write a flat list of patterns, and we automatically get a state machine that can distinguish do and done without backtracking or nested if statements.

In other words, the flat list of patterns in osh-lex.re2c.h is compiled to an NFA, then a DFA. The DFA receives one character at a time, and is in an "ambiguous" intermediate state until it receives the character after d and o.

TODO: show graphic of do/done state machine.

For small lexers, the hand-written options are OK. But for the large OSH lexer, the concept of nondeterminism makes life easier.
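To see why, here's the DFA that re2c would derive from just the two flat patterns do and done, written out by hand in Python (re2c emits the equivalent as C switch and goto statements). The token IDs are illustrative.

# Each character is touched exactly once; no strcmp, no backtracking.
def match_keyword(word):
    state = 0
    for ch in word:
        if state == 0 and ch == 'd':
            state = 1
        elif state == 1 and ch == 'o':
            state = 2   # ambiguous: is this 'do' or the start of 'done'?
        elif state == 2 and ch == 'n':
            state = 3
        elif state == 3 and ch == 'e':
            state = 4
        else:
            return 'Lit_Word'   # e.g. 'door', 'dot'
    if state == 2:
        return 'KW_Do'
    if state == 4:
        return 'KW_Done'
    return 'Lit_Word'           # e.g. 'd', 'don'

assert match_keyword('do') == 'KW_Do'
assert match_keyword('done') == 'KW_Done'
assert match_keyword('don') == 'Lit_Word'

Maintaining code like this by hand for dozens of keywords would be error-prone; with re2c you only maintain the flat patterns.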

Another example is matching redirects:

[0-9]* "<"               { *id = id__Redir_Less; break; }
[0-9]* "<<"              { *id = id__Redir_DLess; break; }
[0-9]* "<<<"             { *id = id__Redir_TLess; break; }
[0-9]* "<<-"             { *id = id__Redir_DLessDash; break; }
[0-9]* "<&"              { *id = id__Redir_LessAnd; break; }

Writing an efficient state machine to distinguish these cases by hand would be tedious.
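For contrast, here's the same set of patterns as a single Python regex, which is roughly the level OSH's rules are written at before translation. Note one subtlety the sketch must handle manually: Python's re takes the first alternative that matches, not the longest, so the alternatives have to be ordered longest-first; re2c's generated DFA gives you the longest match automatically.

import re

# The redirect patterns as one Python regex (a sketch, not OSH's code).
# '<<<' and '<<-' must precede '<<', and '<<' must precede '<'.
REDIR = re.compile(r'[0-9]*(?:<<<|<<-|<<|<&|<)')

for s in ['<', '2<&', '<<-', '10<<']:
    print(s, '->', REDIR.match(s).group(0))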

Examples of Speculative Distinctions Made by the OSH Lexer

NOTE: This may be more applicable to messy languages like shell, HTML, and CommonMark. In Python and JavaScript the lexical structure is clearer. Maybe Perl has these issues too.

Shell is an odd language, and there are several cases where the lexer and parser "conspire" to recognize a construct.

The lexer should make as many distinctions as possible, given its limited model of marching forward through the input with regular expressions.

This makes the rest of the job easier for the parser. The parser only looks at token IDs; it never looks at raw characters.

Sometimes I think of this as "pre-structuring" the input.

Two examples:

The example from the aosabook chapter on bash.

Keywords like for are only recognized in command position, so this contrived loop is valid: for for in for for; do echo for; done

We have no problem with it! It just worked the first time.

Pre-structuring means speculatively creating separate tokens for things that may or may not turn out to be "operators". I think of it as cheap disambiguation. In general, I want both the lexer and the parser to always walk forward: OSH uses a predictive recursive descent parser, not a backtracking one.

Pre-structuring is also a way of delaying prediction until you have enough information. The lexer does a lot of work so that OSH only requires two tokens of lookahead. It creates distinctions that may or may not turn out to matter.

In other words, pre-structuring is about communication between the lexer and the parser, whereas nondeterminism (above) is a property of the lexer alone.
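As a sketch of the payoff (with illustrative token IDs, simplified from what OSH actually does): once the lexer has pre-structured the input, two token IDs are enough for the parser to commit to a grammar rule without backtracking.

# Sketch: two tokens of lookahead in a predictive parser (illustrative IDs).
# 'f()' starts a function definition, while 'f x' is a simple command; the
# decision is a comparison of token IDs, never a re-examination of characters.
def parse_command(tok_ids):
    if tok_ids[0] == 'Lit_Chars' and tok_ids[1] == 'Op_LParen':
        return 'FuncDef'
    return 'SimpleCommand'

assert parse_command(['Lit_Chars', 'Op_LParen']) == 'FuncDef'
assert parse_command(['Lit_Chars', 'Lit_Chars']) == 'SimpleCommand'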

Challenging the Traditional Architecture

High-Level Summary and Conclusion

I talked about how the OSH lexer is designed and generated: language composition, lexer modes, the lossless syntax tree, the chain of three compilers, lexer hints, Unread(), and nondeterminism.

And I hopefully gave concrete examples of each point.

I abbreviated many points made in this post, so if you'd like to hear more, leave a comment.

There were a couple TODOs as well. Let me know if you'd like to see those filled in.

Appendix A: Reparsing of Input in CPython (and PyPy)

Parser/tokenizer.c -- search for 'j': the tokenizer recognizes the suffix on imaginary literals like 3j.

Python/ast.c -- search for 'j' in parsenumber(): the same characters are examined a second time to construct the value.

String literal logic has the same problem: escape sequences are scanned once in the tokenizer and again when the string's value is decoded.

This problem also occurs in PyPy -- and presumably Jython, IronPython, etc., since they all use the Python grammar.
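To paraphrase the duplication in Python (the real code is C; the function bodies here are only a sketch of the two places that inspect the same characters):

# Pass 1: the tokenizer scans the characters of '3j' to produce one token.
def tokenize_number(src, i):
    start = i
    while i < len(src) and src[i].isdigit():
        i += 1
    if i < len(src) and src[i] in 'jJ':   # imaginary literal suffix
        i += 1
    return src[start:i], i

# Pass 2: parsenumber() re-examines the token's characters to build the
# value, so the 'j' logic is written a second time.
def parsenumber(tok):
    if tok[-1] in 'jJ':
        return complex(0, float(tok[:-1]))
    return int(tok)

tok, _ = tokenize_number('3j', 0)
print(parsenumber(tok))  # 3j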

Appendix B: Other Problems with Python's Lexer and Parser

Another reason to use a stack in the lexer: to recognize Python-like indentation, and to implement the rule where expressions broken across lines within ( ), [ ], and { } don't require backslash line continuations.
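Here's a sketch of the indentation part in Python, simplified from what CPython's tokenizer does (spaces only, no brackets or tabs):

# The stack of indentation levels behind INDENT/DEDENT tokens.
def indent_tokens(lines):
    stack = [0]
    for line in lines:
        n = len(line) - len(line.lstrip(' '))
        if n > stack[-1]:
            stack.append(n)
            yield 'INDENT'
        while n < stack[-1]:
            stack.pop()
            yield 'DEDENT'
        yield 'LINE'

print(list(indent_tokens(['if x:', '  y = 1', 'z = 2'])))
# ['LINE', 'INDENT', 'LINE', 'DEDENT', 'LINE']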

Appendix C: List of OSH Lexer Modes

The fifteen modes are found in osh/types.asdl:

lex_mode =
  NONE
| COMMENT
| OUTER
| DBRACKET
| SQ | DQ | DOLLAR_SQ
| ARITH
| EXTGLOB
| VS_1 | VS_2 | VS_ARG_UNQ | VS_ARG_DQ
| BASH_REGEX | BASH_REGEX_CHARS

Appendix D: Meanings of Right Paren in Shell

In the command language:

  1. Subshell: ( echo hi )
  2. End of a case pattern: case x in x) echo hi ;; esac
  3. End of a function signature: f() { echo hi; }

In the word language:

  1. End of a command substitution: $(echo hi)
  2. End of an extended glob pattern: @(*.py|*.sh)

In the arithmetic language:

  1. Grouping: $(( (1 + 2) * 3 ))
  2. End of an arithmetic substitution: $((1 + 2)), which has two right parens in a row

Recall that these languages are composed, which means the different meanings of right paren can be interleaved.