Comments About Parsing: Theory vs. Practice
(Last updated 2021-01-07)
This post is part of "flushing the blog queue", described in yesterday's blog
roadmap. I link to comments and stories, and provide a
summary of themes, without making a full argument.
Let me know if this style is or isn't comprehensible / useful!
What Programmers Don't Understand About Grammars
My comments here connect #parsing theory to practice:
- Languages as (infinite) sets vs. languages as algorithms.
- The "sets" viewpoint is mathematical, but it has engineering consequences!
(The July and December 2020 posts on regular languages
make a similar point.)
- Generation vs. Recognition of strings in languages.
- Context-Free Grammars (CFGs) vs. Parsing Expression Grammars (PEGs)
- PEGs define away ambiguity.
- But PEGs are still sets! There are multiple algorithms for recognizing
the set: backtracking and packrat parsing (memoization). (See the sketch
after this list.)
- The original PEG paper is highly readable. It's an idea that made it from
research to production in ~15 years.
- Java has two grammars: one for specification, and one for efficient
recognition. A point I didn't make: In practice, context-free grammars are
used as and conflated with algorithms.
- Conclusion: Parsing is full of fundamental tradeoffs. You have to pick
the right solution for a specific application, which is hard.
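To make the sets-vs.-algorithms point concrete, here's a minimal sketch (not
Oil's code, and a made-up grammar) of one PEG recognized two ways. Flipping
PACKRAT switches between plain backtracking and packrat parsing (memoization),
but the set of strings accepted stays exactly the same:

    import functools
    import re

    TEXT = "1+20+3"
    PACKRAT = True   # False = plain backtracking; the recognized set is identical

    def rule(f):
        # Packrat parsing = memoize each rule on its start position.
        return functools.lru_cache(maxsize=None)(f) if PACKRAT else f

    @rule
    def term(pos):
        """Term <- [0-9]+   (returns the new position, or None on failure)"""
        m = re.match(r'[0-9]+', TEXT[pos:])
        return pos + m.end() if m else None

    @rule
    def expr(pos):
        """Expr <- Term '+' Expr / Term   (PEG ordered choice: no ambiguity)"""
        p = term(pos)
        if p is None:
            return None
        if p < len(TEXT) and TEXT[p] == '+':
            q = expr(p + 1)
            if q is not None:
                return q
        return p   # fall back to the second alternative: just Term

    print(expr(0) == len(TEXT))   # True iff TEXT is in the language (the set)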
"Returning" to LR Parsing (yacc)
Summary: I studied many different ways of parsing, and used several in
Oil. Like Tratt, I "returned" to the
textbook LR style to some degree:
Survey: What parser generator are you using, if any? (/r/ProgrammingLanguages)
42 points - 19 May 2019
Oil doesn't currently use LR parsing, but it would probably be appropriate for
the expression language. I see why it's a good compromise in some situations.
I also encourage implementers to make this distinction:
- Design a new language with a grammar and parser generator.
- Parse an existing language with a hand-written parser. The "default"
technique should be recursive descent. (Remember that Pratt parsing can be
considered a variant of recursive descent that handles operator precedence;
see the sketch below.)
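Here's a minimal sketch of that idea (not Oil's parser; the token set and
binding-power table are hypothetical): Pratt parsing / precedence climbing is
ordinary recursive descent plus a numeric binding power per operator.

    import re

    BINDING_POWER = {'+': 10, '-': 10, '*': 20, '/': 20}   # hypothetical table

    def tokenize(s):
        return re.findall(r'[0-9]+|[+\-*/()]', s)

    def parse(tokens, min_bp=0):
        """Parse an expression whose operators bind at least as tightly as min_bp."""
        tok = tokens.pop(0)
        if tok == '(':
            left = parse(tokens, 0)
            tokens.pop(0)   # consume ')'
        else:
            left = int(tok)
        while tokens and tokens[0] in BINDING_POWER:
            bp = BINDING_POWER[tokens[0]]
            if bp < min_bp:   # the caller binds tighter, so stop here
                break
            op = tokens.pop(0)
            right = parse(tokens, bp + 1)   # +1 makes operators left-associative
            left = (op, left, right)        # build an AST node as a tuple
        return left

    print(parse(tokenize('1 + 2 * 3')))   # ('+', 1, ('*', 2, 3))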
Oil's Error Handling Architecture
Multiple Comments on
What I wish compiler books would cover (/r/ProgrammingLanguages)
136 points - 30 Apr 2020
I describe how Oil uses spans, span IDs, and Python/C++ exceptions to provide
detailed errors, while keeping the code clean. And I link to related blog posts.
This design has worked well, but I don't claim it's the best one. I'd like to
hear about other approaches.
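As a rough illustration of the shape of that design (a sketch with my own
names, not Oil's actual classes): tokens record spans in an arena, errors
carry only a span ID, and one top-level handler turns the ID back into a
caret diagnostic.

    class Span(object):
        def __init__(self, line_num, col, length):
            self.line_num, self.col, self.length = line_num, col, length

    class Arena(object):
        """Owns all spans; a span ID is just an index into this list."""
        def __init__(self):
            self.spans = []
        def Add(self, span):
            self.spans.append(span)
            return len(self.spans) - 1   # the span ID

    class ParseError(Exception):
        """Raised anywhere in the parser; unwinds to one top-level handler."""
        def __init__(self, msg, span_id):
            self.msg, self.span_id = msg, span_id

    def Report(err, arena, lines):
        """The only place that formats errors; parsing code never does."""
        span = arena.spans[err.span_id]
        print(lines[span.line_num])
        print(' ' * span.col + '^' * span.length)
        print('Error: %s (line %d)' % (err.msg, span.line_num + 1))

    # Parsing code just does: raise ParseError('Unexpected token', span_id)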
Why Lexing and Parsing Should Be Separate
I've posted this link on the blog before, but #lexing is another
place where theory and practice meet.
- Lexing is non-recursive; parsing is recursive. (See the sketch after this list.)
- Do the easy thing with the fast algorithm, and the hard thing with the slower one.
- Scannerless parsing is a mistake for almost all problems.
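A minimal sketch of that division of labor (illustrative names, not Oil's
lexer): the lexer is one flat, non-recursive pass that regular-language tools
can handle, and only the parser recurses — here over nested parentheses.

    import re

    TOKEN_RE = re.compile(r'\s*(?:([0-9]+)|(.))')   # numbers, or any other char

    def lex(s):
        """Non-recursive: a single linear pass, no stack needed."""
        for number, other in TOKEN_RE.findall(s):
            yield ('NUM', int(number)) if number else ('OP', other)
        yield ('EOF', None)

    def parse_list(tokens):
        """Recursive: nested ( ) groups are where the stack comes in."""
        result = []
        for kind, value in tokens:
            if kind == 'EOF' or (kind == 'OP' and value == ')'):
                return result
            if kind == 'OP' and value == '(':
                result.append(parse_list(tokens))   # the recursive part
            else:
                result.append(value)
        return result

    print(parse_list(lex('1 (2 (3 4) 5) 6')))   # [1, [2, [3, 4], 5], 6]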
Appendix: Bootstrapping Case Studies
I updated this wiki page
based on this lobste.rs thread.
It's not strictly about parsing, but may be interesting to language designers.
Here are a few observations about the metalanguage for compilers, drawn from those case studies:
- Throwing out the code and starting again is, IMO, a sign that the metalanguage
was too inflexible. For example, it can be hard to quickly refactor C
code, in part because memory management is a concern at every function
boundary. The language encourages long functions for the same reason.
- There are multiple dimensions of rigor, and they can be at odds:
efficiency is one, but correctness is another. C is better for the former;
ML-family languages are better for the latter. (If you don't believe this, let me test your
- Oil uses Zephyr ASDL to get the same benefit. It has
been a consistent win for 4 years. It's now statically typed with MyPy.
(A sketch of the ASDL idea appears after this list.)
- Bootstrapping means two different things: building a compiler from source
without binaries, and writing the language in itself. Ironically, the latter
kind of bootstrapping makes the former much harder!
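For readers who haven't seen Zephyr ASDL, here's a sketch of the flavor of the
idea with a made-up two-constructor schema (not Oil's actual schema): each
constructor becomes a class, so the AST is an algebraic data type rather than
ad-hoc dicts and tuples, and MyPy-style type comments can check the code that
walks it.

    # Hypothetical ASDL-style schema (not Oil's):
    #     expr = Const(int value) | Binary(string op, expr left, expr right)
    # Each constructor becomes a class, so the AST is an algebraic data type.

    class expr(object):
        """Base class for the expr sum type."""
        pass

    class Const(expr):
        def __init__(self, value):
            # type: (int) -> None
            self.value = value

    class Binary(expr):
        def __init__(self, op, left, right):
            # type: (str, expr, expr) -> None
            self.op = op
            self.left = left
            self.right = right

    def Eval(node):
        # type: (expr) -> int
        """Dispatch over the variants of the sum type."""
        if isinstance(node, Const):
            return node.value
        elif isinstance(node, Binary):
            left, right = Eval(node.left), Eval(node.right)
            return left + right if node.op == '+' else left * right
        raise AssertionError('unknown variant')

    print(Eval(Binary('+', Const(1), Binary('*', Const(2), Const(3)))))   # 7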
Update: When is All This Useful?
This post is mainly for experienced language implementers. If you've never
written a parser, a good intro is Chapter 6: Parsing.
It will give you just enough theory to write your first parser. After
that, the theory above will be more useful (CFGs and PEGs, for example).
Why use theory? One reason is that writing a parser isn't the same thing as
designing the syntax for a language. For example, many language specifications include grammars.
And most languages have 2 or 3 widely used parsers, so it helps to be
"abstract" about the syntax. Bootstrapping is one reason that you will
need to write another parser, but here are more important use cases:
- IDEs usually have entirely separate parsers because they need to handle
incomplete code. After being knee-deep in CLion in December, I appreciate
how much work goes into a good IDE.
- Profiling and debugging rely on location info generated by your
compiler front end. For example, GDB knows how to step through your code
line-by-line because it reads debug info from binaries compiled with -g.
- Linters (pep8), autoformatters, and
automatic translators (#osh-to-oil) need different
representations of your code than compilers do.
An anecdote to show why this matters: While debugging the garbage collector, I
ran into an issue where GDB used incorrect location info when debugging
binaries compiled with Address Sanitizer. This led to confusing
and frustrating sessions where I was literally debugging the wrong code.
While I don't know the exact cause of this issue, the general point is that
good tools rely on good front ends. Front ends have non-obvious design
decisions that percolate throughout the interpreter or compiler. The above link
about error handling architecture elaborates on this.