The Oil Lexer: Introduction and Recap

2017-12-15

Although the lexer is one of the most important parts of Oil, I've only written one post about it. And that was over a year ago.

This is the first of five posts I plan to publish about the lexer. I motivate the topic and recap what we know so far.

The rest of the series will discuss:

Lexical problems with the shell language.
How Oil's lexer solves these problems.
The tools I used to build the lexer.

Table of Contents

Why Is It Interesting?

Why Write About It Now?

Problems that Real Lexers Solve

Recap: Lexer Modes

Appendix: Lexing the C Language

Why Is It Interesting?

I claim that Oil's lexer is different than other lexers, which I'll divide into three categories:

Existing shell lexers. Most shells interleave parsing and execution. In contrast, Oil parses shell scripts up front, in a single pass.

Lexer modes are a key feature that enable such static parsing. We'll review this idea shortly.
Textbook Lexers recognize simple languages. They're designed to teach you the concept, without getting bogged down in details.

In contrast, Oil recognizes a big, hairy, pre-existing language. It solves several problems that you won't find in a textbook lexer.

Textbook lexers are either hand-written or generated by a tool like lex.
"Real" lexers recognize the languages you use every day, like Python or C. They also have to solve non-trivial problems that you won't find in textbooks.

However, they're almost always written by hand. I've looked at over 20 real lexers in C or C++, and I have yet to see a generated lexer.

In contrast, Oil's lexer is generated. I actually use two translators/compilers to create it, which I describe in this series.

I'll go further and say that Oil has one the most complex lexers you'll ever see. In terms of its lexical structure, shell is already a complex language, and Oil's lexer does more work than any other shell lexer.

If you don't believe this now, the examples in the coming posts may convince you of this.

Why Write About It Now?

In many ways, this topic is a throwback to the beginning of the blog, when I wrote a lot about parsing shell.

I put off writing about the lexer, because up until the last release, it was implemented with Python regular expressions. This allowed me to iteratively "discover" the shell language.

A few weeks ago, I successfully translated the lexer to native code via re2c — a tool you'll hear about shortly.

So now is a good time to write about the lexer. It's been tested on millions of lines of real code, and it's fast enough to be called "production-quality".

Problems that Real Lexers Solve

Let's be more specific about the distinction between textbook lexers and real lexers. Here are some examples:

In C, you can't parse an expression like (A)*B without knowing if A is a type or a variable. The solution to this problem is called the "lexer hack".
A C source file is actually a composition of two languages: the C preprocessor, and C itself. One thing I learned recently is that the lexer for the C preprocessor has slightly different tokenization rules.

Lexing shell doesn't have these exact problems, but it has similar ones that are a result of fundamental properties of the language:

Shell is not a single language. It's a composition of at least four mutually-recursive sublanguages, which I call Command, Word, Arith, and Bool.
Shell has command substitution. An entire program can appear in a string literal, which is not true of say C or Python. For example:

$ echo "hello $(echo world | tr a-z A-Z)"
hello WORLD

I'll give more examples of these problems in the following posts.

Recap: Lexer Modes

My last post on lexing asks: What do the characters :- mean in this code?

$ echo "foo:-bar  ${foo:-bar}  $(( foo > 1 ? 5:- 5 ))"
foo:-bar  bar  -5

Depending on the context, they mean three different things. I said that the mechanism we use to return different tokens depending on the context is "lexical state".

I now call this idea lexer modes, for two reasons:

(1) The lexer doesn't actually hold state. The parser holds the state, and the mode is a parameter:

Token Read(lex_mode_t mode)

(2) There's a different mechanism which I call lexer hints, which does involve state stored in the lexer. I explain this concept later in the series.

In this post, I explained why the Oil lexer is interesting, and I reviewed the idea of lexer modes.

The next post will give a bigger example of using lexer modes. The goal is to convince you that the Oil lexer is indeed powerful and tames much of the complexity of the shell language.

Appendix: Lexing the C Language

The "lexer hack" I mentioned above has a few other names:

typedef-name: identifier problem
Context-Sensitivity
Lexical Feedback

The Context-Sensitivity of C's Grammar describes the problem and gives more examples:

All that's needed is keeping a symbol table of types defined by typedef as the parsing goes. Whenever a new identifier is recognized in the lexer, it checks if this identifier is a defined type, and returns the correct token to the parser. As far as the parser is concerned, it has two distinct terminals — an identifier and a defined type.

However, in the context of a formally verified C compiler, the problem isn't that simple. The "Related Work" section in this paper is informative:

A simple, possibly correct LR parser for C11 (PDF)

They point out that bugs can arise due to the interaction between the lexer hack and parser lookahead. Their parser is generated from a grammar, rather than hand-written.

Their C compiler uses an interesting technique of parsing twice in order to resolve the context-sensitivity:

The CompCert C compiler contains two parsers.

The CompCert "pre-parser" is an LALR(1) parser equipped with a lexical feedback mechanism. The C11 parser presented in this paper is based on it. The pre-parser is in charge of detecting and reporting syntax errors, following a technique proposed by Jeffery and implemented in Menhir by the second author.

The CompCert "parser" is also an LALR(1) parser, but does not need lexical feedback: it parses the input again, exploiting the classification of identifiers that has been performed already by the pre-parser.

Comments with respect to Oil:

The "real world" mechanisms described in this series of posts could be called "lexical feedback". In each case, the lexer needs to know something that isn't contained in the input stream. However, I choose to use more specific names: lexer modes and lexer hints.
I took care to ensure that parsing Oil requires just two tokens of lookahead, with no backtracking. Although Oil doesn't require lexical feedback, so there is no potential for adverse interaction between the two mechanisms, this property is useful for other reasons. For example, we want to reuse the parser for context-sensitive command completion, and arbitrary lookahead would complicate that.