Comments About Parsing: Theory vs. Practice

2021-01-06 (Last updated 2021-01-07)

This post is part of "flushing the blog queue", described in yesterday's blog roadmap. I link to comments and stories, and provide a summary of themes, without making a full argument.

Let me know if this style is or isn't comprehensible / useful!

Table of Contents

What Programmers Don't Understand About Grammars

"Returning" to LR Parsing (yacc)

Oil's Error Handling Architecture

Why Lexing and Parsing Should Be Separate

Appendix: Bootstrapping Case Studies

Update: When is All This Useful?

What Programmers Don't Understand About Grammars

My comments here connect #parsing theory to practice:

Themes:

Languages as (infinite) sets vs. languages as algorithms.
- The "sets" viewpoint is mathematical, but it has engineering consequences! (The July and December 2020 posts on regular languages make a similar point.)
Generation vs. Recognition of strings in languages.
- I wrote about this over 4 years ago at the end of Pratt Parsing and Precedence Climbing Are the Same Algorithm. Pratt parsing is another "recognition-based" parsing technique.
Context-Free Grammars (CFGs) vs. Parsing Expression Grammars (PEGs)
- PEGs define away ambiguity.
- But PEGs are still sets! There are multiple algorithms for recognizing the set: backtracking and packrat parsing (memoization).
- The original PEG paper is highly readable. It's an idea that made it from research to production in ~15 years (Python 3.8 and Zig).
Java has two grammars: one for specification, and one for efficient recognition. A point I didn't make: In practice, context-free grammars are used as and conflated with algorithms.
Conclusion: Parsing is full of fundamental tradeoffs. You have to pick the right solution for a specific application, which is hard.

"Returning" to LR Parsing (yacc)

Summary: I studied many different ways of parsing, and used several in Oil. Like Tratt, I "returned" to the textbook LR style to some degree:

Oil doesn't currently use LR parsing, but it would probably be appropriate for the expression language. I see why it's a good compromise in some situations.

I also encourage implementers to make this distinction:

Design a new language with a grammar and parser generator.
Parse an existing language with a hand-written parser. The "default" technique should be recursive descent. (Remember that Pratt parsing can be considered a variant of recursive descent that respects precedence.)

Oil's Error Handling Architecture

I describe how Oil uses spans, span IDs, and Python/C++ exceptions to provide detailed errors, while keeping the code clean. And I link to related blog posts.

This design has worked well, but I don't claim it's the best one. I'd like to hear about other approaches.

Why Lexing and Parsing Should Be Separate

I've posted this link on the blog before, but #lexing is another place where theory and practice meet.

Lexing is non-recursive; parsing is recursive.
Do the easy thing with the fast algorithm, and the hard thing with the slow algorithm.
Scannerless parsing is a mistake for almost all problems.

Appendix: Bootstrapping Case Studies

I updated this wiki page

based on this lobste.rs discussion. It's not strictly about parsing, but may be interesting to language designers.

Here are a few observations about the metalanguage for compilers from my comments:

Throwing out the code and starting again IMO is a sign that the metalanguage was too inflexible. For example, it can be hard to quickly refactor C code, in part because memory management is a concern at every function boundary. The language encourages long functions for the same reason.
There are multiple dimensions of rigor, and they can be at odds: efficiency is one, but correctness is another. C is better for the former; ML dialects are better for the latter. (If you don’t believe this, let me test your compiler.)
- Oil uses Zephyr ASDL to get the same benefit. It has been a consistent win for 4 years. It's now statically typed with MyPy.
Bootstrapping means two different things: building a compiler from source without binaries, and writing the language in itself. Ironically the latter bootstrapping makes the former much harder!

Update: When is All This Useful?

This post is mainly for experienced language implementers. If you've never written a parser, a good intro is Chapter 6: Parsing Expressions in Crafting Interpreters.

It will give you just enough theory to write your first parser. After that, the theory above will be more useful (CFGs and PEGs, for example).

Why use theory? One reason is that writing a parser isn't the same thing as designing the syntax for a language. For example, many language specifications contain grammars.

And most languages have 2 or 3 widely used parsers, so it helps to be "abstract" about the syntax. Bootstrapping is one reason that you will need to write another parser, but here are more important use cases:

IDEs usually have entirely separate parsers because they need to handle incomplete code. After being knee-deep in CLion in December, I appreciate how much work goes into a good IDE.
Profiling and debugging rely on location info generated by your compiler front end. For example, GDB knows how to step through your code line-by-line because it reads debug info from binaries compiled with -g.
Linters (pep8), autoformatters (go fmt, clang-format), and automatic translators (#osh-to-oil) need different representations of your code than compilers do.

An anecdote to show why this matters: While debugging the garbage collector, I ran into an issue a where GDB used incorrect location info when debugging binaries compiled with Address Sanitizer. This led to confusing and frustrating sessions where I was literally debugging the wrong code.

While I don't know the exact cause of this issue, the general point is that good tools rely on good front ends. Front ends have non-obvious design decisions that percolate throughout the interpreter or compiler. The above link about error handling architecture elaborates on this.