blog |

When Are Lexer Modes Useful?


The last post explained why the Oil lexer is interesting, and reviewed the idea of lexer modes.

If you're good at reading C and regular expressions, you can take a look at the generated file osh-lex.re2c.h to orient yourself. The idea is simple: add an enum lex_mode argument to the lexer.

Although the implementation is trivial, you won't see it in textbook lexers, and it solves some common problems.

Table of Contents

Here are some language constructs that could be handled with lexer modes:

String Literals

>>> print("Hello W\u00f4rld\n")
Hello Wรดrld

Although there is lexical structure within the double quotes, it's not a recursive language. C-style escapes are usually treated as a single token, and don't require an extra mode for the lexer.

By the way, I think of recursion (a tree-like self-similar structure) as the key difference between lexing and parsing. Lexing recognizes non-recursive structure, while parsing recognizes recursive structure like (1 + (2 + (3 + 4))).

Where lexer modes become useful is when you have a fully recursive language within double quotes, as in shell:

$ echo "$((1 + 2 * 3)) equals 7"
7 equals 7

This shell-like syntax appears in languages like JavaScript as of version 6:

> console.log(`${1 + 2 * 3} equals 7`)
7 equals 7

and Swift:

> print("\(1 + 2 * 3) equals 7")
7 equals 7

Embedded Regex Literals

Notice that the tokens a, b, c, and +

mean completely different things in these two lines of JavaScript:

var i = ab + c
var pattern = /ab+c/

Regular Expressions

As we saw above, changing from JavaScript to regex syntax lends itself to using a lexer mode.

However, there is naturally more than one mode within the regular expression syntax itself.

To see this, consider what the character ^ means in these three regular expressions:

^line$    # ^ is a metacharacter that matches the beginning of a line
[^0-9]    # ^ is the negation operator
[ab^]     # ^ is a character literal

Language Composition in Shell

Recursive Sublanguages

Language Composition in Shell

What do these examples have in common? They are languages embedded within other languages. As mentioned in the last post, shell takes this to an extreme.

I said there were at least four fully recursive languages, but there are even more:

(1) Brace expansion

$ type=sh; echo {build,test,release}/*.{py,$type}
build/*.py build/*.sh test/*.py test/*.sh release/*.py release/*.sh

(2) Extended globs. Globs are not a recursive language, but [extended globs][extglob] are.

$ [[ --help == --@(help|verbose=@(1|2)) ]] && echo TRUE

(3) Regular Expressions.

$ [[ --help == --@(help|verbose=@(1|2)) ]] && echo TRUE

Mutually Recursive Sublanguages

(1) Array Literal Syntax

(2) String Operators

Problems with Lexer Modes

You may also have problems interfacing a generated parser with a lexer that takes mode arguments. This is one reason that parser generators are often less than optimal for recognizing "real" languages.


We'll see an example of shell that is more in depth.