
Comments About Parsing: Theory vs. Practice

2021-01-06 (Last updated 2021-01-07)

This post is part of "flushing the blog queue", described in yesterday's blog roadmap. I link to comments and stories, and provide a summary of themes, without making a full argument.

Let me know if this style is or isn't comprehensible / useful!

Table of Contents
What Programmers Don't Understand About Grammars
"Returning" to LR Parsing (yacc)
Oil's Error Handling Architecture
Why Lexing and Parsing Should Be Separate
Appendix: Bootstrapping Case Studies
Update: When is All This Useful?

What Programmers Don't Understand About Grammars

My comments here connect #parsing theory to practice:

Themes:

"Returning" to LR Parsing (yacc)

Summary: I studied many different ways of parsing, and used several in Oil. Like Tratt, I "returned" to the textbook LR style to some degree:

Oil doesn't currently use LR parsing, but it would probably be appropriate for the expression language. I see why it's a good compromise in some situations.

I also encourage implementers to make this distinction:

Oil's Error Handling Architecture

I describe how Oil uses spans, span IDs, and Python/C++ exceptions to provide detailed errors, while keeping the code clean. And I link to related blog posts.

This design has worked well, but I don't claim it's the best one. I'd like to hear about other approaches.
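To make the idea concrete, here is a minimal Python sketch of the general pattern: the front end records spans in one place, hands out integer span IDs, and raises an exception that carries only a message and a span ID. This is not Oil's actual code. Every name in it (Span, Arena, ParseError, format_error, check_assignment) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """A contiguous region of one source line (hypothetical layout)."""
    line_num: int  # 1-based line number
    col: int       # 0-based starting column
    length: int    # number of characters covered


class Arena:
    """Owns the source lines and every span; hands out integer span IDs."""

    def __init__(self, source: str) -> None:
        self.lines: List[str] = source.splitlines()
        self.spans: List[Span] = []

    def add_span(self, line_num: int, col: int, length: int) -> int:
        self.spans.append(Span(line_num, col, length))
        return len(self.spans) - 1  # the span ID is just a list index


class ParseError(Exception):
    """Carries a message plus a span ID, not a pre-formatted location."""

    def __init__(self, msg: str, span_id: int) -> None:
        super().__init__(msg)
        self.msg = msg
        self.span_id = span_id


def format_error(arena: Arena, err: ParseError) -> str:
    """Resolve the span ID to a source location only at reporting time."""
    span = arena.spans[err.span_id]
    src = arena.lines[span.line_num - 1]
    caret = " " * span.col + "^" * span.length
    return f"line {span.line_num}: {err.msg}\n  {src}\n  {caret}"


def check_assignment(arena: Arena, line_num: int) -> None:
    """Toy 'parser': require NAME=VALUE, else blame the whole line."""
    src = arena.lines[line_num - 1]
    if "=" not in src:
        span_id = arena.add_span(line_num, 0, len(src))
        raise ParseError("expected '='", span_id)


arena = Arena("x=1\noops\n")
report = ""
try:
    for n in range(1, len(arena.lines) + 1):
        check_assignment(arena, n)
except ParseError as e:
    # One handler at the top level does all the formatting.
    report = format_error(arena, e)
print(report)
```

The point of the sketch is that the parsing code never formats locations itself. It only attaches a small integer to the exception, and a single handler at the top level turns that integer into a line, a column, and a caret.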

Why Lexing and Parsing Should Be Separate

I've posted this link on the blog before, but #lexing is another place where theory and practice meet.
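As a small illustration of the separation, here is a hedged Python sketch in which a regex-based lexer produces a flat list of tokens and a recursive descent parser consumes only those tokens, never raw characters. The toy arithmetic language and the names lex, Parser, and evaluate are assumptions made for this example and are not taken from Oil.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text)

# Lexer: non-recursive, described by a regular expression.
_TOKEN_RE = re.compile(r"\s*(?:(?P<NUM>\d+)|(?P<OP>[-+*()]))")


def lex(s: str) -> List[Token]:
    tokens: List[Token] = []
    s = s.rstrip()
    pos = 0
    while pos < len(s):
        m = _TOKEN_RE.match(s, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind = m.lastgroup  # name of the group that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens


class Parser:
    """Parser: recursive, sees tokens only (no whitespace, no characters)."""

    def __init__(self, tokens: List[Token]) -> None:
        self.tokens = tokens
        self.i = 0

    def peek(self) -> Token:
        return self.tokens[self.i]

    def next(self) -> Token:
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def expr(self) -> int:
        # expr = term (('+' | '-') term)*
        value = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            _, op = self.next()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self) -> int:
        # term = factor ('*' factor)*
        value = self.factor()
        while self.peek() == ("OP", "*"):
            self.next()
            value *= self.factor()
        return value

    def factor(self) -> int:
        # factor = NUM | '(' expr ')'
        kind, text = self.next()
        if kind == "NUM":
            return int(text)
        if (kind, text) == ("OP", "("):
            value = self.expr()
            if self.next() != ("OP", ")"):
                raise ValueError("expected ')'")
            return value
        raise ValueError(f"unexpected token {text!r}")


def evaluate(s: str) -> int:
    p = Parser(lex(s))
    value = p.expr()
    if p.peek()[0] != "EOF":
        raise ValueError("trailing tokens after expression")
    return value


print(evaluate("1 + 2 * (3 - 1)"))
```

Because whitespace handling and character classes live entirely in lex, the grammar in the comments above expr, term, and factor stays short, and either half can be tested or replaced without touching the other.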

Appendix: Bootstrapping Case Studies

I updated this wiki page based on this lobste.rs discussion. It's not strictly about parsing, but may be interesting to language designers.

Here are a few observations about the metalanguage for compilers from my comments:

Update: When is All This Useful?

This post is mainly for experienced language implementers. If you've never written a parser, a good intro is Chapter 6: Parsing Expressions in Crafting Interpreters.

It will give you just enough theory to write your first parser. After that, the theory above will be more useful (CFGs and PEGs, for example).

Why use theory? One reason is that writing a parser isn't the same thing as designing the syntax for a language. For example, many language specifications contain grammars.

And most languages have 2 or 3 widely used parsers, so it helps to be "abstract" about the syntax. Bootstrapping is one reason that you will need to write another parser, but here are more important use cases:

Related article:

An anecdote to show why this matters: While debugging the garbage collector, I ran into an issue where GDB used incorrect location info when debugging binaries compiled with Address Sanitizer. This led to confusing and frustrating sessions where I was literally debugging the wrong code.

While I don't know the exact cause of this issue, the general point is that good tools rely on good front ends. Front ends have non-obvious design decisions that percolate throughout the interpreter or compiler. The above link about error handling architecture elaborates on this.