blog | oilshell.org

When Are Lexer Modes Useful?

2017-12-16

The last post explained why the Oil lexer is interesting, and reviewed the idea of lexer modes.

If you want to peek ahead, and you're good at reading regular expressions and C code, take a look at the file osh-lex.re2c.h. It's generated from osh/lex.py, a Python script that is less readable due to all the metaprogramming.

Regardless of whether that code makes sense right now, the idea is simple. Most lexers look something like this:

Token Read()  // return the next token

In contrast, a modal lexer simply has an "enum" parameter:

// return the next token, using rules determined by the mode
Token Read(lex_mode_t mode)

You'll rarely see this technique in textbook lexers or generated lexers, but it solves several common problems. Read on for examples.

Table of Contents

String Literals With Recursive Structure

Consider string literals with \ escapes:

>>> print("Hello W\u00f4rld\n")
Hello Wรดrld

Although there is lexical structure within the double quotes, a lexer mode isn't required here. C-style string literals usually scanned with hand-written code, and treated as a single token. I prefer to treat \n and \u00f4 as separate tokens, but that's unusual.

Lexer modes become useful when you have a recursive language within double quotes, as shell does:

$ echo "$((1 + 2 * 3)) equals 7"
7 equals 7

This shell-like syntax appears in JavaScript as of version 6:

> console.log(`${1 + 2 * 3} equals 7`)
7 equals 7

as well as in Apple's Swift language:

> print("\(1 + 2 * 3) equals 7")
7 equals 7

Perl-Style Regex Literals

Consider these two lines of JavaScript:

var i = ab + c * d      // addition and multiplication
var pattern = /ab+c*d/  // 1 or more b's and 0 or more c's

The letters abcd and the characters + and * mean completely different things.

A lexer mode could be used here. When the JavaScript parser sees the opening /, it can tell the lexer to switch to a different mode.

Note that the regex language is arbitrarily recursive, e.g.:

var p = /(((a+b)+c)+d)+/  // matches abcd, aabcd, etc.

Regular Expressions

Not only do embedded regex literals lend themselves to a separate lexer modes, but regex syntax considered on its own has natural modes.

For example, the ^ character has three different meanings in these three regexes:

^line$    # ^ is a metacharacter that matches the beginning of a line
[^0-9]    # ^ is the negation operator
[ab^]     # ^ is a character literal

Depending on the context, the lexer can return different tokens for each instance of the ^ character. When the lexer can disambiguate characters based on their context, it makes the parser's job easier.

HTML, CSS, and JavaScript

If you squint, HTML resembles shell:

  1. It has references to "files" in a hierarchical namespace. URLs are essentially a generalization of the file system.
  2. It embeds other languages as opaque strings. CSS and JavaScript are embedded in HTML, while sed, awk, and other languages are embedded in shell.
  3. Its "child" languages change over time, but the core evolves relatively slowly. For example, Java used to be more commonly embedded in HTML than it is now; likewise for Perl in shell.

So HTML involves language composition, just like shell does. It happens in both tags and attributes:

<head>
  <!-- arbitrary JS here -->
  <script type="text/javascript">
    function foo() { return 5; }
  </script>

  <!-- arbitrary CSS here -->
  <style>
    body {
      width: 80em;
    }
  </style>
</head>

<body onload="doStuff();">       <!-- arbitrary JS here -->
  <p style="font-size: large;">  <!-- arbitrary CSS here -->
    Hello World
  </p>
</body>

I'm not claiming that lexer modes are required to parse this kind of document. For example, when you see a <script> tag, you could wait until you see the closing tag, then pass that opaque string to a completely separate lexer and parser. For attributes, you can wait for the closing double quote.

I do think lexer modes are a nice way to handle this problem, but I won't say much more, because I don't know much about web browser implementation. If you do, leave a comment.

Tradeoffs and Implementation Notes

The mode doesn't have to be a parameter to the lexer's Read() function. Instead, your lexer can be a state machine that makes mode transitions by itself, rather than needing the parser to tell it when to change modes.

In fact, I wrote a parser for regular expressions this way, and it worked well.

For Oil, I borrowed the mode-as-parameter style from Ninja's lexer, and it turned out well. This style is almost certainly more powerful, because the parser has more information about the structure of the document than the lexer does.

Generated Lexers and Parsers

One drawback is that you may have problems interfacing a generated parser with a modal lexer.

Likewise, generated lexers are most often "autonomous", because they expect to be called without a mode. However, re2c is a notable exception. Because it's a library and not a framework, it can easily accomodate the style of modal lexers.

This is yet another reason that parser generators are less than optimal for recognizing "real" languages: Parser generators don't handle language composition well.

Laurence Tratt points this out in his excellent article Parsing: The Solved Problem That Isn't.

One of the things that's become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition.

Lexing vs. Parsing

The article also addresses the problem of the lexer when composing languages. Some parsing techniques like Parsing Expression Grammars unify lexing and parsing, a style known as scannerless parsing.

Scannerless parsing does away with a separate tokenization phase; the grammar now contains the information necessary to dynamically tokenize text. Combining grammars with markedly different tokenization rules is now possible.

I believe that scannerless parsing is usually a mistake. (It's also uncommon; I've never seen it in a "real" parser.)

A simple argument I've made in the past is:

1. Most languages have lexical structure as well as grammatical structure.

In other words, parsers are detailed pieces of code, and if you can collapse substrings like while or 3.14159 into a single "thing" you have to keep track of, it makes the code easier to understand.

Another argument is that lexing and parsing are fundamentally different:

2. Recursion is the difference between lexing and parsing.

Another argument I've used against scannerless parsing can be summarized like this:

3. Do the easy thing with the fast algorithm and the hard thing with the slow algorithm.

To elaborate:

Lexing is easy and fast. You can write a switch statement in C, or you can match regular expressions in linear time and constant space. You don't need to look ahead or backtrack; you just march forward through the bytes.

Parsing is hard and resource-intensive. There are dozens of algorithms for parsing with different tradeoffs:

In this thread, Aidenn0 on reddit points out that scannerless parsing wastes time when combined with backtracking, and wastes space when combined with memoization (packrat parsing).

(This is somewhat similar to the issue mentioned in the appendix of the the last post: lookhead can interact with the "lexer hack".)

Parsing Real Languages is Difficult

I quoted Laurence Tratt above. He argued that that parsing isn't solved because language composition is common and not well understood.

To address a different issue, Trevor Jim argues that parsing isn't solved because most languages are not context-free, while parser generators assume that they are. Instead, they require Turing Complete parsers.

This is definitely true of the Oil parser or any other shell parser. There is no way to express shell as a context-free grammar. I ported the grammar from the POSIX spec to ANTLR back in March 2016, and found that it only covered one of the four major sublanguages in shell.

Summary

Here are the points I want you to take away from this post:

  1. Lexer modes are useful when you need to compose languages. The same lexer can be used in different modes with multiple parsers.
  2. Languages have recursive structure, which means that the parsers have recursive functions. Lexers are not recursive.
  3. Scannerless parsing techniques could be considered an alternative to lexer modes when composing languages, but I believe this style has significant drawbacks.
  4. Parsing real languages is full of unacknowledged difficulties.
  5. However, lexing is straightforward, and modal lexers are a simple and efficient mechanism that solves common problems.

Now we should be in a better position to language composition in shell. As I've said, it's not one language, but a composition of at least four mutually recursive sublanguages. I will show examples of this in the next post.