Analogies for ASDL

2016-12-16

If you haven't used Google's protocol buffer serialization technology, this analogy may be helpful:

JavaScript Data Model : JSON :: C Data Model : Protocol Buffers

Just as JSON is a language-independent serialization format extracted from JavaScript's data model (objects, heterogeneous arrays, strings, numbers, booleans), protocol buffers are a mostly language-independent serialization format extracted from C's data model:

structs (messages)
homogeneous arrays (repeated fields)
strings
enums
double and float
unsigned and signed integers of various widths.

A similar analogy explains Zephyr ASDL, which I explained from a few other angles in the last post:

C data model : Protocol Buffers :: ML data model : ASDL

ML is the language that introduced algebraic data types or ADTs. ADTs are a characteristic feature of strongly-typed functional languages like Standard ML, OCaml, and Haskell.

ASDL, like protocol buffers, is a domain-specific language that describes a language-independent serialization format for a particular data model -- in this case, the ML data model. It has the following constructs:

Product types, aka records, representable by structs in C and C++.
Sum types, aka variants, representable by tagged unions in C or subclasses in C++.
Optional fields, representable by a pointer that may be null. In ML-like languages, they're the Option type.
Repeated fields, representable by arrays in C++. In ML-like languages, they're lists.
Strings.
Integers of unspecified width.
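To make the mapping concrete, here is a minimal Python sketch of how these constructs could look in a host language. The schema in the comment and every name in it (arith_expr, Const, BinaryOp, assign) are hypothetical illustrations, not taken from Python's or oil's actual ASDL files, and dataclasses are just one convenient way to write records in current Python.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical ASDL schema being modeled (an assumption for illustration):
#
#   arith_expr = Const(int value)
#              | BinaryOp(string op, arith_expr left, arith_expr right)
#   assign = (string name, arith_expr? rhs, string* flags)


class arith_expr:
    """Sum type: each variant is a subclass of this base."""


@dataclass
class Const(arith_expr):
    value: int  # integer of unspecified width


@dataclass
class BinaryOp(arith_expr):
    op: str
    left: arith_expr
    right: arith_expr


@dataclass
class assign:
    """Product type: a plain record with no variants."""
    name: str
    rhs: Optional[arith_expr] = None                 # optional field ('?')
    flags: List[str] = field(default_factory=list)   # repeated field ('*')


def evaluate(node: arith_expr) -> int:
    """Dispatch on the variant, as pattern matching would in ML."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, BinaryOp):
        left, right = evaluate(node.left), evaluate(node.right)
        if node.op == '+':
            return left + right
        if node.op == '*':
            return left * right
        raise ValueError('unknown operator %r' % node.op)
    raise TypeError('not an arith_expr: %r' % node)


tree = BinaryOp('+', Const(1), BinaryOp('*', Const(2), Const(3)))
stmt = assign('x', rhs=tree, flags=['readonly'])
result = evaluate(stmt.rhs)
```

The isinstance() chain plays the role that a match expression plays in ML: the sum type says which variants exist, and the consumer handles each one.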

Oil will serialize its ASDL data with a custom format I developed, but Python doesn't serialize the data structures it describes with ASDL at all. Instead, Python uses ASDL to share the AST between languages, bridging the parser written in C and the AST module in Python.

Taking that into account, this analogy is also valid:

ASDL : Python :: WebIDL : Web Browser

WebIDL is an interface definition language that bridges C++ and JavaScript in the browser. It's similar to Microsoft's COM, but it's part of a single application rather than an OS-wide construct.

ASDL in Oil

Yesterday, I committed the first pass of oil's ASDL implementation. The schema parser is taken from Python, but these three features are new:

Dynamic generation of Python classes (as opposed to generating Python-C bindings in C).
The encoding of serialized trees into the oheap format, which I'll describe later.
Generation of C++ code that decodes the tree lazily.

Fortunately, not much code is required to implement these features:

~/git/oil$ asdl/run.sh count
  417 asdl/asdl.py
  249 asdl/py_meta.py
  462 asdl/gen_cpp.py
  268 asdl/encode.py
 1396 total

The py_meta.py file uses metaprogramming all over: Python metaclasses, but also things like dynamic kwargs and setattr().
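For readers who haven't seen this style, here is a minimal sketch of the three techniques named above working together: a metaclass, dynamic kwargs, and setattr(). It is not the code in py_meta.py. The SCHEMA dict, NodeMeta, Node, and make_classes are all hypothetical names, and a real implementation would build the schema from the parsed ASDL file rather than a literal dict.

```python
# Hypothetical schema as plain data: class name -> field names.
# A real implementation would derive this from the parsed ASDL schema.
SCHEMA = {
    'Const': ['value'],
    'BinaryOp': ['op', 'left', 'right'],
}


class NodeMeta(type):
    """Metaclass that freezes the declared field names onto each class."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        cls.FIELDS = tuple(namespace.get('_fields', ()))
        return cls


class Node(metaclass=NodeMeta):
    """Base class shared by every generated node type."""

    def __init__(self, **kwargs):
        unknown = sorted(set(kwargs) - set(self.FIELDS))
        if unknown:
            raise TypeError('%s has no field(s): %s'
                            % (type(self).__name__, ', '.join(unknown)))
        for name in self.FIELDS:
            # Fields that were not passed default to None in this sketch.
            setattr(self, name, kwargs.get(name))

    def __repr__(self):
        args = ', '.join('%s=%r' % (f, getattr(self, f)) for f in self.FIELDS)
        return '%s(%s)' % (type(self).__name__, args)


def make_classes(schema):
    """Create one Node subclass per schema entry at runtime."""
    return {
        name: NodeMeta(name, (Node,), {'_fields': fields})
        for name, fields in schema.items()
    }


classes = make_classes(SCHEMA)
Const, BinaryOp = classes['Const'], classes['BinaryOp']

fields = {'op': '+', 'left': Const(value=1), 'right': Const(value=2)}
node = BinaryOp(**fields)  # dynamic kwargs
```

Because the constructor rejects field names that the schema doesn't declare, a misspelled field fails at construction time instead of silently creating a new attribute. A static checker has little to work with here, since the classes only exist once make_classes() has run.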

I believe that type checking oil with mypy is now hopeless. It was thwarted by very simple metaprogramming, and this addition won't help. However, I believe that ASDL is more valuable than mypy for ensuring the structural integrity of the program.

Another thing to ponder: you could say this means I value Lisp over ML, though paradoxically the purpose of the metaprogramming is to use ML's data model in C++ and Python.

Snapshot of Line Count

I haven't used ASDL in oil yet -- that's the next step. Since I'm obsessed with the line count, let me snapshot the tree now:

$ ./count.sh parser
Lexer/Parser
    77 osh/parse_lib.py
   196 osh/arith_parse.py
   291 osh/bool_parse.py
   334 osh/lex.py
  1144 osh/word_parse.py
  1455 osh/cmd_parse.py
  3497 total

AST and IDs
   80 core/tokens.py
   99 core/expr_node.py
  441 core/id_kind.py
  491 core/cmd_node.py
  777 core/word_node.py
 1888 total

Common Algorithms
  228 core/lexer.py
  338 core/tdop.py
  566 total

Using ASDL will affect the middle section the most, but I'm not sure whether it will get bigger or smaller. On the one hand, ASDL provides impressive code compression. I mentioned in the last post that 123 lines of ASDL turn into ~8100 lines of C code in Python. (However, the oheap format needs just 907 lines of C++ generated from 107 lines of ASDL, an order of magnitude less code. More on that later.)

On the other hand, the Word and WordPart classes in word_node.py have nontrivial methods, which I need to attach to the classes generated from osh.asdl. Also, the tree will be more heterogeneous, because I'm representing osh very faithfully and then "lowering" it into what I'm calling ovm in my head. ovm is more homogeneous.
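One way to picture the problem of attaching hand-written methods to generated classes is the sketch below. It shows two options: passing a mixin as a base class when the class is generated, and assigning a function onto the class afterwards. This is only an illustration under assumptions. make_node_class, WordMethods, LiteralPart, VarSubPart, is_literal, and literal_text are hypothetical names, and the fields are not the real ones in osh.asdl or word_node.py.

```python
def make_node_class(name, fields, mixin=None):
    """Generate a record class, optionally inheriting hand-written methods."""
    fields = tuple(fields)

    def __init__(self, *args):
        if len(args) != len(fields):
            raise TypeError('%s expects %d field(s), got %d'
                            % (name, len(fields), len(args)))
        for field_name, value in zip(fields, args):
            setattr(self, field_name, value)

    bases = (mixin,) if mixin is not None else (object,)
    return type(name, bases, {'__init__': __init__, 'FIELDS': fields})


class WordMethods:
    """Hand-written behavior, kept apart from the generated structure."""

    def is_literal(self):
        return all(isinstance(p, LiteralPart) for p in self.parts)


# Stand-ins for classes that would be generated from a schema.
LiteralPart = make_node_class('LiteralPart', ['text'])
VarSubPart = make_node_class('VarSubPart', ['name'])

# Option 1: supply the methods as a mixin at generation time.
Word = make_node_class('Word', ['parts'], mixin=WordMethods)


def literal_text(self):
    """Concatenate literal parts; None if any part needs evaluation."""
    if not self.is_literal():
        return None
    return ''.join(p.text for p in self.parts)


# Option 2: attach a function to an already-generated class.
Word.literal_text = literal_text

static_word = Word([LiteralPart('foo'), LiteralPart('bar')])
dynamic_word = Word([LiteralPart('foo'), VarSubPart('x')])
```

Either way, the schema stays the single source of truth for the fields, while the methods live in ordinary Python code.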

But whether it gets bigger or smaller, the new AST representation brings us closer to the top priorities. It forms the backbone of both the interpreter and the tools to convert osh / bash to oil.

This conversion is, of course, the main reason I expect anyone to actually use oil!