Why Sponsor Oils? | blog | oilshell.org

What is Zephyr ASDL?

2016-12-11 (Last updated 2019-02-06)

Because understanding Zephyr ASDL will be important in future posts, I'll describe it from three angles here. At a high level, my motivation is to concisely describe the backbone of the interpreter.

So what is it? As stated in the 1997 paper, Zephyr was a university compiler infrastructure project, and ASDL was a sub-project which stands for Abstract Syntax Description Language.

This is a rather plain name for a domain-specific language that describes an abstract syntax tree.

Table of Contents
It's Used to Describe the Structure of Python
It Describes a Data Structure, Not a Language
Why Use a DSL?
Algebraic Data Types
Serialization
Conclusion

It's Used to Describe the Structure of Python

Let's first see how it describes Python:

$ python3
>>> import ast
>>> t = ast.parse('1 + 2 * 3')
>>> print(ast.dump(t))
Module(body=[Expr(value=BinOp(left=Num(n=1),
             op=Add(),
             right=BinOp(left=Num(n=2), op=Mult(), right=Num(n=3))))])

This is how you parse an expression and print its AST, without executing it. Symbols like Module, Expr, and BinOp correspond to data structures defined in the file Python.asdl. The relevant excerpts from this file are:

mod = Module(stmt* body)
    | ...

stmt = ...
     | Expr(expr value)
     | ...

expr = ...
     | BinOp(expr left, operator op, expr right)
     | ...

operator = Add | Sub | Mult | MatMult | Div | Mod | Pow | LShift
             | RShift | BitOr | BitXor | BitAnd | FloorDiv

The Python build process invokes the tool asdl_c.py, which parses Python.asdl and generates C code that defines Python classes. At runtime, the Python parser instantiates these classes as its output.

The Design of CPython's Compiler goes into more detail about this.

It Describes a Data Structure, Not a Language

It's important to understand the difference between:

Python has separate mechanisms for these two tasks. In addition to asdl_c.py generating code from Python.asdl (131 lines), there's the separate but analogous process of Parser/pgenmain.c generating code from Grammar/Grammar (156 lines).

On the face of it, ASDL and a parser generator like yacc or ANTLR are similar:

I remember being confused about this: Why do we need both?

The answer relates to the difference between a parse tree and an abstract syntax tree. A parse tree has nodes corresponding directly to the rules of a grammar, while an AST is simpler and more suitable for code generation. Scroll down here for a nice graphic that shows the difference.

Oil will have at least two tree representations, but neither is a parse tree, strictly speaking. The osh parser is written by hand and thus can produce an AST on the fly. Subsequent representations will be even more simplified.

So the difference is that ASDL lets you describe a simplified tree, i.e. an AST, while a grammar only has enough information to describe a strict parse tree.

The documentation for Python's own AST module confuses the issue: it shows Python.asdl with the heading Abstract Grammar. An ASDL file isn't a grammar, because it doesn't describe a language. It defines a data structure for representing a language.

Why Use a DSL?

Aren't C++ and Python already good languages for defining data structures?

There are three reasons to use a DSL, and the first is concision. Remarkably, Python.asdl is expanded into 65x more code:

~/src/Python-3.5.2$ wc -l Parser/Python.asdl
123 Parser/Python.asdl

~/src/Python-3.5.2$ wc -l */Python-ast.*
   601 Include/Python-ast.h
  7497 Python/Python-ast.c
  8098 total

I suspect that much of this is due to the wordy nature of the Python-C API. Oil of course won't use this API, but ASDL is still worthwhile for concision, and for the following two reasons.

Algebraic Data Types

ASDL is almost a way of adding Algebraic Data Types (ADTs) to C++ or Python. I say "almost" because the shape of the data is the same, but the type system is not.

Briefly, ADTs are a feature from the ML family of strongly-typed functional languages, which have made it into recent non-functional languages like Rust and Swift. This data model was designed to represent languages: ML stands for MetaLanguage. A meta-language is the language you use to express your language.

Roughly speaking, product types are simply structs, while sum types can be represented as either:

(Not coincidentally, one of the authors on the ASDL paper is Andrew Appel, who's known for his work on ML.)

Serialization

Once you describe a data structure with a DSL, you can automatically generate code to serialize it, in the manner of protocol buffers. This isn't possible with plain C structs/arrays or Python dictionaries/lists.

Although the paper says that serialization is a primary feature of ASDL, Python never serializes the AST. But we'll see later that Oil will benefit from automatic serialization.

Conclusion

I described Zephyr ASDL in three ways:

  1. How it's used in Python.
  2. How it contrasts with a grammar and parser generator.
  3. Its motivations: making code more concise, borrowing the sum and product data model of algebraic data types, and enabling automatic serialization.

I expect that ASDL will be used in three ways:

  1. To describe osh.
  2. To describe oil.
  3. To describe a language with semantics that both osh and oil can be compiled to, and which oil can be bootstrapped in. I recently discovered the Perl 6 project NQP, or Not Quite Perl, which seems similar in spirit.

Right now, I'm adapting Python's ASDL implementation for oil. I'll describe these changes in detail when the code lands in the repo.