Pratt Parsing Without Prototypal Inheritance, Global Variables, Virtual Dispatch, or Java

2016-11-03

To motivate today's post, yesterday I reviewed existing Pratt parsing tutorials. Now I can explain a different style for writing Pratt parsers.

The first thing to notice is that this style yields code that looks like a table. Compare it with this table of C operator precedence. In the styles I reviewed yesterday, this crucial information is spread throughout many lines of code.

spec = tdop.ParserSpec()

spec.Left(31, LeftIncDec, ['++', '--'])
spec.Left(31, LeftFuncCall, ['('])
spec.Left(31, LeftIndex, ['['])

# 29 -- binds to everything except function call, indexing, postfix ops
spec.Null(29, NullIncDec, ['++', '--'])
spec.Null(29, NullPrefixOp, ['+', '!', '~', '-'])

# Right associative: 2 ** 3 ** 2 == 2 ** (3 ** 2)
spec.LeftRightAssoc(27, LeftBinaryOp, ['**'])
spec.Left(25, LeftBinaryOp, ['*', '/', '%'])

spec.Left(23, LeftBinaryOp, ['+', '-'])
spec.Left(21, LeftBinaryOp, ['<<', '>>'])
spec.Left(19, LeftBinaryOp, ['<', '>', '<=', '>='])
spec.Left(17, LeftBinaryOp, ['!=', '=='])

spec.Left(15, LeftBinaryOp, ['&'])
spec.Left(13, LeftBinaryOp, ['^'])
spec.Left(11, LeftBinaryOp, ['|'])
spec.Left(9, LeftBinaryOp, ['&&'])
spec.Left(7, LeftBinaryOp, ['||'])

spec.Left(5, LeftTernary, ['?'])

# Right associative: a = b = 2 is a = (b = 2)
spec.LeftRightAssoc(3, LeftAssign, [
    '=',
    '+=', '-=', '*=', '/=', '%=',
    '<<=', '>>=', '&=', '^=', '|='])

spec.Left(COMMA_PREC, LeftComma, [','])

# 0 precedence -- doesn't bind until )
spec.Null(0, NullParen, ['('])  # for grouping

# -1 precedence -- never used
spec.Null(-1, NullConstant, ['name', 'number'])
spec.Null(-1, tdop.NullError, [')', ']', ':', 'eof'])

(Full code and tests are on Github)

There are more advantages to this style:

Because it uses a ParserSpec object rather than globals, defining two parsers in the same program is cleaner. This is important to me because I'm writing two expression parsers: a POSIX-compatible shell arithmetic parser, and a parser for the oil language.
There are no other global variables. The current token is a member on a Parser class, which is how you typically see parsing algorithms like recursive descent presented.
It's factored into the immutable ParserSpec object and the stateful Parser object. Concurrent parsers are easy.
There's no inheritance or virtual dispatch. In C++, this avoids the v-table pointer on a Token, which could increase its size by 50% or 100% on 64-bit platforms.

(I use virtual dispatch elsewhere in my shell parser, because it's concise and natural, but that isn't case here.)
Although I present it in Python, it's statically typed like Nystrom's code. It uses functions and classes, but mostly functions. Classes are used when there is state; functions are used when there is not.

API Sketch

The API is simple, but to understand it, you should have a rough idea of how the Pratt algorithm works. If not, the posts reviewed yesterday go into it.

To me, these are its salient characteristics:

It's mutually recursive between a core parse(int rbp) function and nud() and led() "plugin functions" for each operator. In my code, these are the p.ParseUntil(int precedence) method, and Null*() and Left*() functions.
Many recursive algorithms like tree traversal are purely functional. In contrast, Pratt parsing has the property that the recursive procedures also mutate state, i.e. the current token. This is just like recursive descent parsing, which is one reason they go together well.

The code has two main classes, ParserSpec and Parser. You "configure" the language by calling methods on the spec:

class ParserSpec:
  """Specification for a TDOP parser."""

  def Null(self, bp, nud, tokens):
    """Register a token that does NOT take an expression on the left.

    Examples: constant, prefix operator
    """
    # ...

  def Left(self, bp, led, tokens):
    """Register a token that takes an expression on the left.

    Examples: infix operator, postfix operator, 
              the ternary operator b ? 0 : 1, array indexing a[0].
    """
    # ...

  def LeftRightAssoc(self, bp, led, tokens):
    """Register a right associative operator.

    Examples: exponentiation, assignment, ternary operator.
    """
    # ...

These methods register the callbacks that are mutually recursive with the parser. Some people use the names Prefix() and Infix() instead of Null() and Left() (nud and led in Pratt's paper), but this is misleading because there are other kinds of operators.

(I suppose LeftRightAssoc is overly-specific as well, but it serves it purpose for now.)

Here is the Parser API:

class Parser(object):

  def AtToken(self, token_type):
    """Test if we are looking at a token."""
    # ...

  def Next(self):
    """Move to the next token."""
    # ...

  def Eat(self, val):
    """Assert the value of the current token, then move to the next token."""
    # ...

  def ParseUntil(self, rbp):
    """
    Parse to the right, eating tokens until we encounter a token with binding
    power LESS THAN OR EQUAL TO rbp.
    """
    # ...

  def Parse(self):
    """Initial entry point."""
    self.Next()
    return self.ParseUntil(0)

The first four methods are for the "plugin" Null/Left functions. The functions are passed a parser instance p and call the methods to parse, e.g.:

def LeftTernary(p, token, left, bp):
  """ e.g. a > 1 ? x : y """
  true_expr = p.ParseUntil(bp)
  p.Eat(':')
  false_expr = p.ParseUntil(bp)
  children = [left, true_expr, false_expr]
  return CompositeNode(token, children)

The end user need only call the Parse() method, which kicks off the maze of of mutually recursive functions.

spec = MakeSpec()  # Return ParserSpec instance
lexer = tdop.Tokenize(s)
p = tdop.Parser(spec, lexer)
tree = p.Parse()
print(tree)

So we have 2 classes, each with a few methods, and then a bunch of plugin functions to parse specific syntax. I'm happy with how the API turned out. See the code for details.

The current theme of this blog is to explain why implementing a shell is an interesting problem, and what's different about my implementation. The last three days have been a digression on expression parsing algorithms, but hopefully it will be useful to someone. The next post will return to the Unix shell.