Home

The Oil Lexer: Introduction and Recap

2017-12-14

Although the lexer is one of the most important parts of Oil, I've only written one post about it, and that was over a year ago.

This is the first of five posts I plan to publish about the lexer. I'll provide context and show motivating examples.

The rest of the series will explain:

  1. Lexical problems with the shell language,
  2. How Oil's lexer solves these problems, and
  3. The tools I used to build the lexer.
Table of Contents

Why Is It Interesting?

Oil's lexer different than other lexers:

  1. Existing shell lexers. Most shells interleave parsing and execution. In contrast, Oil parses shell scripts up front, in a single pass. In other words, it performs static rather than dynamic parsing.

    The unique relationship between Oil's many parsers and its "modal" lexer enables static parsing.

  2. Textbook Lexers recognize simple languages. They're meant to teach you how to implement a programming language without getting bogged down in details. They're either hand-written or generated by a tool like lex.

    In contrast, Oil recognizes an existing, big, and hairy language. It solves several problems that you won't find in a textbook lexer.

  3. "Real" lexers recognize the languages you use every day, like Python or C. They have to solve non-trivial problems that you won't find in textbooks.

    They're almost always written by hand. I've looked at over 20 lexers and parsers in C or C++, and I have yet to see a generated lexer.

    In contrast, Oil's lexer is generated. I actually use two translators/compilers to create it, which I'll describe later in this series.

It's fair to say that Oil's lexer is one of the most complicated lexers you'll ever see. Shell is a complex language in terms of its lexical structure, and Oil's lexer does more work than any other shell lexer.

If you don't believe me now, the examples in the coming posts may convince you of this.

I've said before that writing a shell should be significantly easier than writing a C compiler, the lexer is harder.

Why Write About It Now?

In some ways, this is a throwback to the beginning of the blog.

Up until the last release, the lexer was written with Python regular expressions.

A few weeks ago, I successfully converted it to C code via re2c. So now is a good time to write about the lexer -- it works and it's fast!

I expect this series of posts to be the last on parsing shell.

Problems that Real Lexers Solve

Here's an example of a problem that real languages have to solve.

  1. In C, the "lexer hack" is used to distinguish between types and variables.
  2. Another thing I learned recently is that The C preprocessor has a separate lexer that uses slightly different tokenization rules.

Lexing shell doesn't have these problems, but it has different ones. They stem from fundamental language design features:

  1. Shell is not a single language. It's a composition of at least four recursive sublanguages, which I call Command, Word, Arith, and Bool.
  2. Shell has command substitution. An entire program can appear in a string literal.

I will give examples of these problems in the following posts.

Recap: Lexer Modes

In my last post on lexing,

I renamed it from lexical state to lexer modes.

Why not lexical state? Because it's not really state in the lexer.

And I have other state in the lexer: the hints that are pushed onto a stack.

This is to motivate the next post.

Where did I get it from? Ninja. I liked how Ninja uses lexer modes.

:- example.

Next

I explained why the Oil lexer is interesting, and I gave examples of lexer modes.

The next post will give a larger of example of lexer modes. This will show how powerful the lexer is and how much complexity it handles.

At the end of this series, you should:

This is generated code, but it's bigger than OSH itself, which I showed in the last release was 14K lines of Python.