When Are Lexer Modes Useful?

2017-12-17 (Last updated 2022-05-06)

The last post explained why the Oil lexer is interesting, and reviewed the idea of lexer modes.

If you want to peek ahead, and know something about C and regexes, scroll through osh-lex.re2c.h. It's generated from osh/lex.py (a Python module that is less readable due to all the metaprogramming).

Regardless of whether that code makes sense right now, the idea is simple. Most lexers look something like this:

Token Read()  // return the next token

In contrast, a modal lexer has an "enum" parameter:

// return the next token, using rules determined by the mode
Token Read(lex_mode_t mode)

You'll rarely see this technique in textbook lexers or generated lexers, but it solves several common problems.

The first half of this post shows motivating examples. The second half discusses implementation issues, and makes several arguments about the distinction between lexing and parsing.

Table of Contents

Examples

String Literals With Recursive Structure

Perl-Style Regex Literals

Regular Expressions

HTML, CSS, and JavaScript

Tradeoffs and Implementation Notes

Generated Lexers and Parsers

Lexing versus Parsing

Parsing Real Languages is Difficult

Summary

Examples

String Literals With Recursive Structure

Consider string literals with \ escapes:

>>> print("Hello W\u00f4rld\n")
Hello Wôrld

Although there is lexical structure within the double quotes, a lexer mode is not required here. C-style string literals usually scanned with hand-written code, and treated as a single token. I prefer to treat \n and \u00f4 as separate tokens, but that's not the way it's usually done.

Lexer modes become useful when you have a recursive language within double quotes, as shell does:

$ echo "$((1 + 2 * 3)) equals 7"
7 equals 7

This shell-like syntax appears in JavaScript as of version 6:

> console.log(`${1 + 2 * 3} equals 7`)
7 equals 7

as well as in Apple's Swift language:

> print("\(1 + 2 * 3) equals 7")
7 equals 7

This is the most compelling use case for lexer modes because the language inside strings is not only recursive, but it's mutually recursive with the "main" language. The same expressions can appear both inside and outside the string literal.

The next three examples are recursive, but not mutually recursive.

Perl-Style Regex Literals

Consider these two lines of JavaScript:

var i = ab + c * d      // addition and multiplication
var pattern = /ab+c*d/  // 1 or more b's and 0 or more c's

The letters abcd and the characters + and * mean completely different things.

A lexer mode could be used here. When the JavaScript parser sees the opening /, it can tell the lexer to switch to a different mode.

Note that the regex language is arbitrarily recursive, e.g.:

var p = /(((a+b)+c)+d)+/  // matches abcd, aabcd, etc.

But again, it's not mutually recursive. Regex literals don't contain JavaScript code.

(However, Perl regexes are mutually recursive. Then again, parsing Perl is undecidable. I discuss this issue at the end of Parsing Bash is Undecidable.)

Regular Expressions

Not only do embedded regex literals lend themselves to a separate lexer modes, but regex syntax considered on its own has natural modes.

For example, the ^ character has three different meanings in these three regexes:

^line$    # ^ is a metacharacter that matches the beginning of a line
[^0-9]    # ^ is the negation operator
[ab^]     # ^ is a character literal

Depending on the context, the lexer can return different tokens for each instance of the ^ character. This makes the parser's job easier.

HTML, CSS, and JavaScript

If you squint, HTML resembles shell:

It has references to "files" in a hierarchical namespace. I think of URLs as a generalization of the file system.
It embeds other languages as opaque strings. CSS and JavaScript are embedded in HTML, while sed, awk, and other languages are embedded in shell.
Its "child" languages change over time, but the core evolves relatively slowly. For example, Java used to be more commonly embedded in HTML than it is now; likewise for Perl in shell.

So HTML involves language composition, just like shell does. It happens in both tags and attributes:

<head>
  <!-- arbitrary JS here -->
  <script type="text/javascript">
    function init() { return 42; }
  </script>

  <!-- arbitrary CSS here -->
  <style>
    body {
      width: 80em;
    }
  </style>
</head>

<body onload="init();">       <!-- arbitrary JS here -->
  <p style="font-size: large;">  <!-- arbitrary CSS here -->
    Hello World
  </p>
</body>

I'm not claiming that lexer modes are required to parse this kind of document. For example, when you see a <script> tag, you could wait until you see the closing tag, then pass that opaque string to a separate lexer and parser. In the case of attributes, you can wait for the closing double quote.

This is because there is recursion, but no mutual recursion. (Although something like JSX introduces that possibility again.)

In any case, I do think lexer modes are a nice way to handle this problem, but I won't say much more because I don't know much about web browser implementation. If you do, leave a comment.

Tradeoffs and Implementation Notes

The mode doesn't have to be a parameter to the lexer's Read() function. Instead, your lexer can be a state machine that makes mode transitions by itself, rather than needing the parser to tell it when to change modes.

In fact, I wrote a parser for regexes this way, and it worked well.

For Oil, I used the mode-as-parameter style — somewhat inspired by Ninja's lexer —and it turned out well. This style is almost certainly more powerful, because the parser knows more about the structure of the document than the lexer does.

Generated Lexers and Parsers

One drawback is that you may have problems interfacing a generated parser with a modal lexer. The parser expects to call the lexer without a parameter.

Likewise, generated lexers are most often "autonomous", because they expect to be called without a parameter.

However, re2c is a notable exception. Because it's a library and not a framework, it can easily accomodate the style of modal lexers. You write the function signatures; re2c generates a bunch of switch statements and goto statements that implement a DFA. (More about this later.)

This is yet another reason that parser generators are less than optimal for recognizing "real" languages: they don't handle language composition well.

Laurence Tratt points this out in his excellent article Parsing: The Solved Problem That Isn't.

One of the things that's become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition.

Lexing versus Parsing

The article also addresses the problem of the lexer when composing languages. Some parsing techniques like Parsing Expression Grammars unify lexing and parsing, a style known as scannerless parsing.

Scannerless parsing does away with a separate tokenization phase; the grammar now contains the information necessary to dynamically tokenize text. Combining grammars with markedly different tokenization rules is now possible.

I believe that this is the wrong approach. (It's also uncommon; most real languages are parsed with separate lexers.)

Rather than throwing out the lexer, you can use the technique I describe in this post.

Here are two arguments for separate lexing and parsing:

Lexing and parsing are fundamentally different. Parsers are recursive because languages are recursive. Lexers are not recursive. In other words, programming languages inherently have both lexical structure and grammatical structure.
It's more efficient to do the easy thing with the fast algorithm (lexing) and the hard thing with the slow algorithm (parsing).

To elaborate on the first argument, here are examples of non-recursive, lexical structure:

 while  3.14159  \n

Here are examples of recursive, grammatical structure

(1 + (2 + (3 + 4)))
/(((a+b)+c)+d)+/
if (x) { if (y) { print("x and y") } }

Parsers are detailed pieces of code. If you can collapse substrings like while or 3.14159 into a single "thing" you have to keep track of, it makes them easier to understand.

To elaborate on the second argument:

Lexing is easy and fast. You can write a switch statement in C, or you can match regular expressions in linear time and constant space. You don't need to look ahead or backtrack; you just march forward through the bytes.

Parsing is hard and resource-intensive. There are dozens of algorithms for parsing with different tradeoffs:

You can parse top-down or bottom-up.
You can predict (look ahead), backtrack, or memoize.
It often makes sense to combine multiple parsing algorithms, as I did with recursive descent and Pratt parsing in Oil.

In this thread, Aidenn0 on reddit points out that scannerless parsing wastes time when combined with backtracking, and wastes space when combined with memoization (packrat parsing).

(This is somewhat similar to the issue mentioned in the appendix of the the last post: lookahead can interact with the "lexer hack".)

2022 Update: I argued this further on the wiki:

Why Lexing and Parsing Should Be Separate. Also see the reddit discussion.

Parsing Real Languages is Difficult

I quoted Laurence Tratt above. He argued that that parsing isn't solved because language composition is common and not well understood.

To address a different issue, Trevor Jim argues that parsing isn't solved because most languages are not context-free, while parser generators assume that they are. Instead, their grammars are unrestricted, which means they require Turing Complete parsers.

This is true of the Oil parser, or any other shell parser. There is no way to express shell as a context-free grammar. I ported the POSIX shell grammar to ANTLR back in March 2016, and found that it only covered one of the four major sublanguages in shell.

Summary

Here are the points I want you to take away from this post:

Lexer modes are useful when you need to compose languages. The same lexer can be used in different modes with multiple parsers. I call it one lexer rather than multiple lexers because it maintains a single position in the document, and marches strictly forward.
Languages have recursive structure, which means that the parsers have recursive functions. Lexers are not recursive.
Scannerless parsing techniques could be considered an alternative to lexer modes when composing languages, but I believe this is a bad idea in general.
Parsing real languages is full of unacknowledged difficulties. For example, most real languages are not context-free.
However, lexing is straightforward. Modal lexers are a simple and efficient mechanism that solves common problems.

Now we should be in a better position to language composition in shell. As I've said, it's not one language, but a composition of at least four mutually recursive sublanguages. I will show examples of this in the next post.

Leave a comment if you have other good examples of language composition!