Although the lexer is one of the most important parts of Oil, I've only written one post about it, and that was over a year ago.
This is the first of five posts I plan to publish about the lexer. I'll provide context and show motivating examples.
The rest of the series will explain:
Oil's lexer different than other lexers:
Existing shell lexers. Most shells interleave parsing and execution. In contrast, Oil parses shell scripts up front, in a single pass. In other words, it performs static rather than dynamic parsing.
The unique relationship between Oil's many parsers and its "modal" lexer enables static parsing.
Textbook Lexers recognize simple languages. They're meant to teach you how to implement a programming language without getting bogged down in details. They're either hand-written or generated by a tool like lex.
In contrast, Oil recognizes an existing, big, and hairy language. It solves several problems that you won't find in a textbook lexer.
"Real" lexers recognize the languages you use every day, like Python or C. They have to solve non-trivial problems that you won't find in textbooks.
They're almost always written by hand. I've looked at over 20 lexers and parsers in C or C++, and I have yet to see a generated lexer.
In contrast, Oil's lexer is generated. I actually use two translators/compilers to create it, which I'll describe later in this series.
It's fair to say that Oil's lexer is one of the most complicated lexers you'll ever see. Shell is a complex language in terms of its lexical structure, and Oil's lexer does more work than any other shell lexer.
If you don't believe me now, the examples in the coming posts may convince you of this.
I've said before that writing a shell should be significantly easier than writing a C compiler, the lexer is harder.
In some ways, this is a throwback to the beginning of the blog.
Up until the last release, the lexer was written with Python regular expressions.
A few weeks ago, I successfully converted it to C code via re2c. So now is a good time to write about the lexer -- it works and it's fast!
I expect this series of posts to be the last on parsing shell.
Here's an example of a problem that real languages have to solve.
Lexing shell doesn't have these problems, but it has different ones. They stem from fundamental language design features:
Command, Word, Arith,
and Bool.I will give examples of these problems in the following posts.
In my last post on lexing,
I renamed it from lexical state to lexer modes.
Why not lexical state? Because it's not really state in the lexer.
And I have other state in the lexer: the hints that are pushed onto a stack.
This is to motivate the next post.
Where did I get it from? Ninja. I liked how Ninja uses lexer modes.
:- example.
I explained why the Oil lexer is interesting, and I gave examples of lexer modes.
The next post will give a larger of example of lexer modes. This will show how powerful the lexer is and how much complexity it handles.
At the end of this series, you should:
Understand how a lexer for a "real" language differs from that of a textbook lexer
Understand what re2c is good for. Why does the lexer consist of 14,000 lines of C code?
This is generated code, but it's bigger than OSH itself, which I showed in the last release was 14K lines of Python.