Why Sponsor Oils? | blog | oilshell.org

Pratt Parsing Without Prototypal Inheritance, Global Variables, Virtual Dispatch, or Java

2016-11-03

To motivate today's post, yesterday I reviewed existing Pratt parsing tutorials. Now I can explain a different style for writing Pratt parsers.

The first thing to notice is that this style yields code that looks like a table. Compare it with this table of C operator precedence. In the styles I reviewed yesterday, this crucial information is spread throughout many lines of code.

spec = tdop.ParserSpec()

spec.Left(31, LeftIncDec, ['++', '--'])
spec.Left(31, LeftFuncCall, ['('])
spec.Left(31, LeftIndex, ['['])

# 29 -- binds to everything except function call, indexing, postfix ops
spec.Null(29, NullIncDec, ['++', '--'])
spec.Null(29, NullPrefixOp, ['+', '!', '~', '-'])

# Right associative: 2 ** 3 ** 2 == 2 ** (3 ** 2)
spec.LeftRightAssoc(27, LeftBinaryOp, ['**'])
spec.Left(25, LeftBinaryOp, ['*', '/', '%'])

spec.Left(23, LeftBinaryOp, ['+', '-'])
spec.Left(21, LeftBinaryOp, ['<<', '>>'])
spec.Left(19, LeftBinaryOp, ['<', '>', '<=', '>='])
spec.Left(17, LeftBinaryOp, ['!=', '=='])

spec.Left(15, LeftBinaryOp, ['&'])
spec.Left(13, LeftBinaryOp, ['^'])
spec.Left(11, LeftBinaryOp, ['|'])
spec.Left(9, LeftBinaryOp, ['&&'])
spec.Left(7, LeftBinaryOp, ['||'])

spec.Left(5, LeftTernary, ['?'])

# Right associative: a = b = 2 is a = (b = 2)
spec.LeftRightAssoc(3, LeftAssign, [
    '=',
    '+=', '-=', '*=', '/=', '%=',
    '<<=', '>>=', '&=', '^=', '|='])

spec.Left(COMMA_PREC, LeftComma, [','])

# 0 precedence -- doesn't bind until )
spec.Null(0, NullParen, ['('])  # for grouping

# -1 precedence -- never used
spec.Null(-1, NullConstant, ['name', 'number'])
spec.Null(-1, tdop.NullError, [')', ']', ':', 'eof'])

(Full code and tests are on Github)

There are more advantages to this style:

API Sketch

The API is simple, but to understand it, you should have a rough idea of how the Pratt algorithm works. If not, the posts reviewed yesterday go into it.

To me, these are its salient characteristics:

The code has two main classes, ParserSpec and Parser. You "configure" the language by calling methods on the spec:

class ParserSpec:
  """Specification for a TDOP parser."""

  def Null(self, bp, nud, tokens):
    """Register a token that does NOT take an expression on the left.

    Examples: constant, prefix operator
    """
    # ...

  def Left(self, bp, led, tokens):
    """Register a token that takes an expression on the left.

    Examples: infix operator, postfix operator, 
              the ternary operator b ? 0 : 1, array indexing a[0].
    """
    # ...

  def LeftRightAssoc(self, bp, led, tokens):
    """Register a right associative operator.

    Examples: exponentiation, assignment, ternary operator.
    """
    # ...

These methods register the callbacks that are mutually recursive with the parser. Some people use the names Prefix() and Infix() instead of Null() and Left() (nud and led in Pratt's paper), but this is misleading because there are other kinds of operators.

(I suppose LeftRightAssoc is overly-specific as well, but it serves it purpose for now.)

Here is the Parser API:

class Parser(object):

  def AtToken(self, token_type):
    """Test if we are looking at a token."""
    # ...

  def Next(self):
    """Move to the next token."""
    # ...

  def Eat(self, val):
    """Assert the value of the current token, then move to the next token."""
    # ...

  def ParseUntil(self, rbp):
    """
    Parse to the right, eating tokens until we encounter a token with binding
    power LESS THAN OR EQUAL TO rbp.
    """
    # ...

  def Parse(self):
    """Initial entry point."""
    self.Next()
    return self.ParseUntil(0)

The first four methods are for the "plugin" Null/Left functions. The functions are passed a parser instance p and call the methods to parse, e.g.:

def LeftTernary(p, token, left, bp):
  """ e.g. a > 1 ? x : y """
  true_expr = p.ParseUntil(bp)
  p.Eat(':')
  false_expr = p.ParseUntil(bp)
  children = [left, true_expr, false_expr]
  return CompositeNode(token, children)

The end user need only call the Parse() method, which kicks off the maze of of mutually recursive functions.

spec = MakeSpec()  # Return ParserSpec instance
lexer = tdop.Tokenize(s)
p = tdop.Parser(spec, lexer)
tree = p.Parse()
print(tree)

So we have 2 classes, each with a few methods, and then a bunch of plugin functions to parse specific syntax. I'm happy with how the API turned out. See the code for details.


The current theme of this blog is to explain why implementing a shell is an interesting problem, and what's different about my implementation. The last three days have been a digression on expression parsing algorithms, but hopefully it will be useful to someone. The next post will return to the Unix shell.


Related: Pratt Parsing Index and Updates