Why Sponsor Oils? | blog | oilshell.org

Review of Pratt/TDOP Parsing Tutorials

2016-11-02

Yesterday's post took a turn into the more fundamental task of explaining algorithms. I might as well finish the thought, and show my code for Pratt parsing, since it differs in organization from what you'll see online.

I want to write a post called Pratt Parsing without Prototypal Inheritance, Global Variables, Java, or Virtual Dispatch, and this post will set that up.

Here is a review of popular expositions:

  1. Douglas Crockford deserves credit for reviving and explaining Pratt parsing with this article, also published in the book Beautiful Code. I worked with the code about 8 years ago when it was published, and remember being confused. I now I understand better why I was confused:

a) Prototypal inheritance. JavaScript and Lua both use prototypal inheritance, and I still don't see any advantage of this style over classes. It might be easier to implement, but that doesn't help me as a language user. In particular, it uses Object.create(), which I think has already fallen out of favor.

I like the idea behind Bob Nystrom's scripting language Wren: fill Lua's niche, but use classes.

b) The representation of tokens as objects with rich behavior. That is, they have nud and led methods. To me, tokens are data that are operated on with functions (or methods on a separate class.)

Pratt parsing is a top-down algorithm like recursive descent parsing, but using tokens this way obscures the connection. I've never seen a tutorial on recursive descent where tokens need nontrivial methods.

c) The reuse of token objects as AST node objects. This results in a lot of mutation in the code, which goes against the grain of the parser as a pure function from tokens to a tree. In my opinion, pure functions are the "default choice" when writing compilers, parsers, lexers, etc., and I see no reason to deviate from that here.

d) Mutating globals. Rather than this:

assignment("=");
assignment("+=");

I prefer this:

spec.assignment("=");
spec.assignment("+=");

This way, you just need to find the type of spec to see what's going on. Otherwise, you have to trace through multiple function calls to see what variable is being changed.

Crockford's style has bled into popular Python tutorials as well:

  1. TDOP in Python by Fredrikh Lundh: In this code, tokens still have nud and led methods. It still mutates tokens instances to become AST nodes. It uses an unusual style of registering methods using decorators, which looks more like Crockford's JavaScript than idiomatic Python.

The current token is a global variable, which is fine for a tutorial, but you wouldn't do that in a "production" parser these days.

  1. TDOP by Eli Bendersky: He says he found Crockford's exposition difficult, but understood the algorithm using Lundh's code. The code is similar to Lundh's, but uses classes in a more traditional manner. Tokens still have nud and led methods.

  2. Pratt Parsing by Bob Nystrom: His motivation is also to improve on Crockford's exposition. The tutorial is done in Java, as sort of the "lowest common denominator" among languages. In his own words,

You gotta love Java’s "please sign it in quadruplicate" level of bureaucracy here. Like I said, if you can do this in Java, you can do it in any language.

I disagree with the second sentence, but agree with this reader's comment:

I really wish that Java was not used in this tutorial. I feel that the parser is lost in the [verbosity] making it quite the poor teaching language.

I appreciate Java more than I used to, but looking at the code, it really does obscure the algorithm with extraneous classes, files, and directories.


So we've seen four tutorials, and two accomplished programmers that aren't satisfied with Crockford's exposition.

I chose to express my code yet a different way, which I'll show tomorrow.


Related: Pratt Parsing Index and Updates