The last post explained why the Oil lexer is interesting, and reviewed the idea of lexer modes.
If you want to peek ahead, and you're good at reading regular expressions and C code, take a look at the file osh-lex.re2c.h. It's generated from osh/lex.py, a Python script that is less readable due to all the metaprogramming.
Regardless of whether that code makes sense right now, the idea is simple. Most lexers look something like this:
Token Read() // return the next token
In contrast, a modal lexer simply has an "enum" parameter:
// return the next token, using rules determined by the mode
Token Read(lex_mode_t mode)
You'll rarely see this technique in textbook lexers or generated lexers, but it solves several common problems. Read on for examples.
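To make the interface concrete, here's a minimal sketch in C. Every name here (the modes, the token fields) is hypothetical, invented for illustration; these are not Oil's actual definitions.

// A minimal sketch of a modal lexer interface.  All names are invented.

typedef enum {
  LEX_MODE_OUTER,    // normal shell words
  LEX_MODE_DQ,       // inside double quotes
  LEX_MODE_ARITH,    // inside $(( ... ))
} lex_mode_t;

typedef struct {
  int id;            // token type
  const char *start; // points into the source buffer
  int len;
} Token;

typedef struct {
  const char *src;   // entire input
  int pos;           // current byte offset
} Lexer;

// The parser passes the mode on every call, so the same position in the
// input can be tokenized under different sets of rules.
Token Read(Lexer *lex, lex_mode_t mode);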
Consider string literals with \ escapes:
>>> print("Hello W\u00f4rld\n")
Hello Wôrld
Although there is lexical structure within the double quotes, a lexer mode
isn't required here. C-style string literals are usually scanned with
hand-written code and treated as a single token. I prefer to treat \n and
\u00f4 as separate tokens, but that's unusual.
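For illustration, here's the token stream such a lexer might emit for the string above; the token names are made up:

TOK_DQ_BEGIN     "
TOK_LIT_CHARS    Hello W
TOK_ESC_UNICODE  \u00f4
TOK_LIT_CHARS    rld
TOK_ESC_CHAR     \n
TOK_DQ_END       "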
Lexer modes become useful when you have a recursive language within double quotes, as shell does:
$ echo "$((1 + 2 * 3)) equals 7"
7 equals 7
This shell-like syntax appears in JavaScript as of version 6:
> console.log(`${1 + 2 * 3} equals 7`)
7 equals 7
as well as in Apple's Swift language:
> print("\(1 + 2 * 3) equals 7")
7 equals 7
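Here's how a recursive-descent parser might drive the mode switch for the shell example, building on the hypothetical interface sketched earlier. The token ids and helper functions are all invented:

typedef struct {
  Lexer *lexer;
  // ... other parser state ...
} Parser;

void AppendLiteral(Parser *p, Token tok);  // hypothetical helper
void ParseArith(Parser *p);                // reads tokens in LEX_MODE_ARITH

enum { TOK_LIT_CHARS, TOK_ARITH_BEGIN, TOK_DQ_END };  // invented ids

void ParseDoubleQuoted(Parser *p) {
  for (;;) {
    Token tok = Read(p->lexer, LEX_MODE_DQ);  // double-quote rules
    switch (tok.id) {
      case TOK_LIT_CHARS:    // ordinary characters
        AppendLiteral(p, tok);
        break;
      case TOK_ARITH_BEGIN:  // saw $(( -- recurse into the arithmetic language
        ParseArith(p);       // ParseArith calls Read() with LEX_MODE_ARITH
        break;
      case TOK_DQ_END:       // saw the closing "
        return;
    }
  }
}

Note that the lexer never decides on its own that it's inside $(( ... )); the parser decides, because it's the one that knows where it is in the grammar.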
Consider these two lines of JavaScript:
var i = ab + c * d // addition and multiplication
var pattern = /ab+c*d/ // 1 or more b's and 0 or more c's
The letters abcd and the characters + and * mean completely different
things.
A lexer mode could be used here. When the JavaScript parser sees the opening
/, it can tell the lexer to switch to a different mode.
Note that the regex language is arbitrarily recursive, e.g.:
var p = /(((a+b)+c)+d)+/ // matches abcd, aabcd, etc.
Not only do embedded regex literals lend themselves to a separate lexer mode, but regex syntax considered on its own has natural modes.
For example, the ^ character has three different meanings in these three
regexes:
^line$   # ^ is a metacharacter that matches the beginning of a line
[^0-9]   # ^ is the negation operator
[ab^]    # ^ is a character literal
Depending on the context, the lexer can return different tokens for each
instance of the ^ character. When the lexer can disambiguate characters
based on their context, it makes the parser's job easier.
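As a sketch, the disambiguation can be a simple function of the mode; the mode and token names below are invented:

enum { TOK_ANCHOR_BOL, TOK_NEGATE, TOK_CHAR };  // invented token ids

typedef enum {
  RE_MODE_OUTER,        // top level of the regex
  RE_MODE_CLASS_FIRST,  // immediately after '[' in a character class
  RE_MODE_CLASS,        // inside a character class
} re_mode_t;

// Same byte, three different tokens, depending on the mode the parser
// asks for.
int ClassifyCaret(re_mode_t mode) {
  switch (mode) {
    case RE_MODE_OUTER:       return TOK_ANCHOR_BOL;  // ^line$
    case RE_MODE_CLASS_FIRST: return TOK_NEGATE;      // [^0-9]
    case RE_MODE_CLASS:       return TOK_CHAR;        // [ab^]
  }
  return TOK_CHAR;  // unreachable
}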
If you squint, HTML resembles shell. It involves language composition, just like shell does, in both tags and attributes:
<head>
  <!-- arbitrary JS here -->
  <script type="text/javascript">
    function foo() { return 5; }
  </script>

  <!-- arbitrary CSS here -->
  <style>
    body {
      width: 80em;
    }
  </style>
</head>

<body onload="doStuff();">  <!-- arbitrary JS here -->
  <p style="font-size: large;">  <!-- arbitrary CSS here -->
    Hello World
  </p>
</body>
I'm not claiming that lexer modes are required to parse this kind of document.
For example, when you see a <script> tag, you could wait until you see the
closing tag, then pass that opaque string to a completely separate lexer and
parser. For attributes, you can wait for the closing double quote.
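A sketch of that non-modal approach, using invented names for the parser state and the JavaScript entry point:

#include <string.h>

typedef struct { const char *pos; } HtmlParser;     // hypothetical state
void ParseJavaScript(const char *src, size_t len);  // separate lexer + parser

// After seeing <script>, skip to the closing tag and hand off the raw
// contents.  (Naive: assumes well-formed input; real HTML is messier.)
void ParseScriptElement(HtmlParser *p) {
  const char *end = strstr(p->pos, "</script>");
  ParseJavaScript(p->pos, (size_t)(end - p->pos));
  p->pos = end + strlen("</script>");
}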
I do think lexer modes are a nice way to handle this problem, but I won't say much more, because I don't know much about web browser implementation. If you do, leave a comment.
The mode doesn't have to be a parameter to the lexer's Read() function.
Instead, your lexer can be a state machine that makes mode transitions by
itself, rather than needing the parser to tell it when to change modes.
In fact, I wrote a parser for regular expressions this way, and it worked well.
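In that style, the mode lives inside the lexer, and transitions happen as a side effect of reading. Here's a sketch, reusing the invented regex modes and token ids from above (bracket tokens elided for brevity):

typedef struct {
  const char *src;
  int pos;
  re_mode_t mode;  // the mode is lexer state, not a parameter
} ReLexer;

Token ReadNext(ReLexer *lex) {
  char c = lex->src[lex->pos];
  Token tok = { .id = TOK_CHAR, .start = &lex->src[lex->pos], .len = 1 };
  switch (lex->mode) {
    case RE_MODE_OUTER:
      if (c == '^') tok.id = TOK_ANCHOR_BOL;
      if (c == '[') lex->mode = RE_MODE_CLASS_FIRST;  // transition by itself
      break;
    case RE_MODE_CLASS_FIRST:
      if (c == '^') tok.id = TOK_NEGATE;
      lex->mode = RE_MODE_CLASS;
      break;
    case RE_MODE_CLASS:
      if (c == ']') lex->mode = RE_MODE_OUTER;        // transition by itself
      break;
  }
  lex->pos++;
  return tok;
}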
For Oil, I borrowed the mode-as-parameter style from Ninja's lexer, and it turned out well. This style is almost certainly more powerful, because the parser has more information about the structure of the document than the lexer does.
One drawback is that you may have problems interfacing a generated parser with a modal lexer.
Likewise, generated lexers are most often "autonomous", because they expect to be called without a mode. However, re2c is a notable exception. Because it's a library and not a framework, it can easily accommodate modal lexers.
This is yet another reason that parser generators are less than optimal for recognizing "real" languages: they don't handle language composition well.
Laurence Tratt points this out in his excellent article Parsing: The Solved Problem That Isn't.
One of the things that's become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition.
The article also addresses the problem of the lexer when composing languages. Some parsing techniques like Parsing Expression Grammars unify lexing and parsing, a style known as scannerless parsing.
Scannerless parsing does away with a separate tokenization phase; the grammar now contains the information necessary to dynamically tokenize text. Combining grammars with markedly different tokenization rules is now possible.
I believe that scannerless parsing is usually a mistake. (It's also uncommon; I've never seen it in a "real" parser.)
A simple argument I've made in the past is:
1. Most languages have lexical structure as well as grammatical structure.
In other words, parsers are detailed pieces of code, and collapsing
substrings like while or 3.14159 into a single token gives you one less
thing to keep track of, which makes the code easier to understand.
Another argument is that lexing and parsing are fundamentally different:
2. Recursion is the difference between lexing and parsing.
Lexers recognize flat, non-recursive structure: tokens like while,
3.14159, or \n. Parsers recognize recursive structure, like the expression
(1 + (2 + (3 + 4))).
A third argument I've used against scannerless parsing can be summarized like this:
3. Do the easy thing with the fast algorithm and the hard thing with the slow algorithm.
To elaborate:
Lexing is easy and fast. You can write a switch statement in C, or you
can match regular expressions in linear time and constant space. You
don't need to look ahead or backtrack; you just march forward through the
bytes.
Parsing is hard and resource-intensive. There are dozens of parsing algorithms with different tradeoffs: LL(k), LR(k), LALR(1), GLR, Earley, PEG, and more.
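To make the first point concrete, here's what a switch-based lexer can look like, building on the earlier interface sketch. The token ids are invented, and pos only ever moves forward:

enum { TOK_EOF, TOK_PLUS, TOK_STAR, TOK_NUMBER, TOK_OTHER };  // invented ids

Token ReadSimple(Lexer *lex) {
  Token tok = { .id = TOK_OTHER, .start = &lex->src[lex->pos], .len = 1 };
  char c = lex->src[lex->pos];
  switch (c) {
    case '\0':
      tok.id = TOK_EOF;
      tok.len = 0;
      return tok;
    case '+': tok.id = TOK_PLUS; break;
    case '*': tok.id = TOK_STAR; break;
    default:
      if (c >= '0' && c <= '9') {
        tok.id = TOK_NUMBER;
        while (lex->src[lex->pos + tok.len] >= '0' &&
               lex->src[lex->pos + tok.len] <= '9') {
          tok.len++;  // march forward through the bytes; never back up
        }
      }
  }
  lex->pos += tok.len;
  return tok;
}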
In this thread, Aidenn0 on reddit points out that scannerless parsing
wastes time when combined with backtracking, and wastes space when
combined with memoization (packrat parsing).
(This is somewhat similar to the issue mentioned in the appendix of the last post: lookahead can interact with the "lexer hack".)
I quoted Laurence Tratt above. He argued that parsing isn't solved because language composition is common and not well understood.
To address a different issue, Trevor Jim argues that parsing isn't solved because most languages are not context-free, even though parser generators assume they are. Real languages instead require Turing-complete parsers.
This is definitely true of the Oil parser or any other shell parser. There is no way to express shell as a context-free grammar. I ported the grammar from the POSIX spec to ANTLR back in March 2016, and found that it only covered one of the four major sublanguages in shell.
Here are the points I want you to take away from this post:
1. A modal lexer is an ordinary lexer whose Read() function takes a mode parameter that selects the rules for matching the next token.
2. Lexer modes are useful when languages are composed: interpolation in shell, JavaScript, and Swift strings; regex literals; and HTML documents that embed JS and CSS.
3. Lexing and parsing are fundamentally different, so keep them separate: scannerless parsing is usually a mistake.
Now we should be in a better position to understand language composition in shell. As I've said, it's not one language, but a composition of at least four mutually recursive sublanguages. I will show examples of this in the next post.