Home

ASDL Implementation and Use Cases

2017-01-05

Yesterday I showed an example ASDL schema and data structure, and recapped the project priorities. Today I'll describe what oil's ASDL implementation does, and describe how it enables each of these three priorities.

ASDL Implementation

I think ASDL is a big win so far, but the implementation is quite short:

ASDL
  433 asdl/asdl_.py
  271 asdl/encode.py
  309 asdl/py_meta.py
  313 asdl/format.py
  471 asdl/gen_cpp.py
 1797 total

I copied asdl_.py from Python's ASDL implementation. It uses a simple regex-based lexer and recursive descent parser to turn the input foo.asdl schema into tree of Python objects — that is, the AST for the schema itself.

The other four files implement four recursive algorithms. Two of them walk this AST:

The other two algorithms operator on both the schema and the data, i.e. an oil AST for a particular program:

I'll discuss oheap in a subsequent post. For now, just think of it as a binary version of the text representation.

So it's possible to write a shell interpreter with a front end in Python and a back end in C++ like this:

[Lines of Code]               -> Lexer (osh/lex.py) ->

[Tokens]                      -> Parser (osh/*_parse.py) ->

[Tree of Python ASDL objects] -> Encoder (asdl/encode.py) ->

[oheap encoding of the tree]  -> Decoder (generated osh.asdl.h) ->

Runtime in C++

(Above, data is surrounded by brackets and code is surrounded by arrows.)

But as discussed below, oil will likely use a different structure.

Multiple Tree Representations

All shells I've looked at (including bash, dash, zsh, mksh) are tree interpreters. That is, after the parser builds the AST, the executor just traverses it and makes system calls like fork(), exec(), pipe(), and dup(). In contrast, Python is a bytecode interpreter.

If I were just writing a shell, I would write a tree interpreter as well. But oil is an interpreter with two languages: the compatible osh language and the new oil language.

And now that I'm closer to tackling this problem, it looks more complex. I naively thought that I could just compile both languages to the same AST, and use that same AST to convert shell source to oil source.

I now realize that I will need multiple representations of the code, for two reasons:

  1. Because ASTs are lossy. The whole point is to abstract away some details of the source code, but for source conversion we want a lossless representation of the code.
  2. Because oil is a superset of the shell language. For example, it's dynamically typed rather than "unityped". It doesn't make sense to use the same AST for two different languages, even if one is inspired by the other.

I'll write more about this when I've actually implemented it, because no doubt my thoughts will evolve.

For example, I wrote this post before I implemented ASDL or the new AST. I realized that the three use cases would impose different requirements on "the AST". But I didn't realize that we need multiple tree representations in oil.

Luckily, ASDL is perfect for this, and it's no accident. Compilers written in ML are typically composed of many small stages connected by typed trees (algebraic data types). The types enforce properties of the code at each stage.

Immediately after integrating ASDL, it caught a lot of bugs in oil, which I quickly fixed. Surprisingly, dynamic type checking plus a good test suite is a decent replacement for static type checking, at least for our prototype.

(Update on MyPy: Type checking with MyPy appears to be a lost cause, since ASDL uses extensive metaprogramming. But I wanted ML's type system and not MyPy's Java-like type system.)

I plan to use three AST schemas:

  1. osh.asdl — A lossless representation of shell source, already written.
  2. oil.asdl — A lossless representation of oil source, not yet written.
  3. ovm.asdl — A simpler language/tree that both osh and oil compile to, which is easily executed by C++.

Tackling the Top Priorities

Again, the top three project priorities are:

  1. Test the oil language design by converting shell scripts in the wild to it.
  2. Fill out the shell executor/runtime in Python.
  3. Write a production quality executor/runtime in C++.

ASDL will help with each of these tasks.

1) As mentioned, osh.asdl will be a lossless representation, so we should be able to convert any shell script to oil.

The main blocker is to clean up the source location schema, so we can preserve whitespace and comments. Right now I'm attaching a line_span object to each token, but this information should be propagated up to the node level.

2) The Python runtime has substantial functionality and can be easily extended. I hope to get help on this, but I need to get the spec tests in better shape before accepting contributions. Let me know in the comments if you're interested in contributing, and I'll increase the priority of this task!

3) The C++ executor is blocked on designing ovm.asdl, which is itself blocked on validating the oil language design. To design ovm, we need to have a good grasp on the languages that compile to it.

So this is further off, but I'm pleased to have implemented the oheap format, which makes progress toward this goal.

Conclusion

We talked about four tree-walking algorithms in oil's ASDL implementation, and then explained how ASDL helps with the top three priorities. And I suspect that we will have three ASDL schemas: osh, oil, and ovm.

I have one more post on the ASDL implementation, and then I'll share some thoughts that came from the experience of implementing ASDL and using it in oil.

For one, I'm surprised that ASDL seems to be uncommon. It's used in one of most popular languages in the world, but there are only a couple blog posts about it, and not too many modern successors.

Not to jump forward too much, but I want oil to be a shell for distributed computing, and schema languages are widely used in that domain. So this lesson from oil's implementation may inform the design of the oil language itself.


Discuss this post on Reddit