Yesterday I showed an example ASDL schema and data structure, and recapped the project priorities. Today I'll describe what oil's ASDL implementation does, and describe how it enables each of these three priorities.
I think ASDL is a big win so far, but the implementation is quite short:
ASDL 433 asdl/asdl_.py 271 asdl/encode.py 309 asdl/py_meta.py 313 asdl/format.py 471 asdl/gen_cpp.py 1797 total
I copied asdl_.py
from Python's ASDL implementation. It uses
a simple regex-based lexer and recursive
descent parser to turn the input foo.asdl
schema into
tree of Python objects — that is, the AST for the schema itself.
The other four files implement four recursive algorithms. Two of them walk this AST:
py_meta.py
dynamically generates classes from the schema using Python
metaprogramming (the type function). We get a class for each product
type and sum type (e.g. the token
and word
types in yesterday's example).
gen_cpp.py
generates C++ code that can read a binary encoding of ASTs,
called oheap
. For example, the 129 lines in osh.asdl produces ~1100
lines of C++.
The other two algorithms operator on both the schema and the data, i.e. an oil AST for a particular program:
encode.py
walks an instance of an AST in Python and encodes it in the
oheap
format.format.py
is similar, but instead of binary data, it produces the text
format that we saw at the end of yesterday's post.I'll discuss oheap
in a subsequent post. For now, just think of it as a
binary version of the text representation.
So it's possible to write a shell interpreter with a front end in Python and a back end in C++ like this:
[Lines of Code] -> Lexer (osh/lex.py) -> [Tokens] -> Parser (osh/*_parse.py) -> [Tree of Python ASDL objects] -> Encoder (asdl/encode.py) -> [oheap encoding of the tree] -> Decoder (generated osh.asdl.h) -> Runtime in C++
(Above, data is surrounded by brackets and code is surrounded by arrows.)
But as discussed below, oil will likely use a different structure.
All shells I've looked at (including bash, dash, zsh, mksh) are
tree interpreters. That is, after the parser builds the AST, the executor
just traverses it and makes system calls like fork()
, exec()
, pipe
(), and
dup()
. In contrast, Python is a bytecode interpreter.
If I were just writing a shell, I would write a tree interpreter as well. But oil is an interpreter with two languages: the compatible osh language and the new oil language.
And now that I'm closer to tackling this problem, it looks more complex. I
naively thought that I could just compile both languages to the same AST, and
use that same AST to convert shell source to oil
source.
I now realize that I will need multiple representations of the code, for two reasons:
oil
is a superset of the shell language. For example, it's
dynamically typed rather than "unityped". It doesn't make sense to use the
same AST for two different languages, even if one is inspired by the other.I'll write more about this when I've actually implemented it, because no doubt my thoughts will evolve.
For example, I wrote this post before I implemented ASDL or the new AST. I realized that the three use cases would impose different requirements on "the AST". But I didn't realize that we need multiple tree representations in oil.
Luckily, ASDL is perfect for this, and it's no accident. Compilers written in ML are typically composed of many small stages connected by typed trees (algebraic data types). The types enforce properties of the code at each stage.
Immediately after integrating ASDL, it caught a lot of bugs in oil
, which I
quickly fixed. Surprisingly, dynamic type checking plus a good test suite
is a decent replacement for static type checking, at least for our prototype.
(Update on MyPy: Type checking with MyPy appears to be a lost cause, since ASDL uses extensive metaprogramming. But I wanted ML's type system and not MyPy's Java-like type system.)
I plan to use three AST schemas:
oil.asdl
— A lossless representation of oil source, not yet written.ovm.asdl
— A simpler language/tree that both osh and oil compile to,
which is easily executed by C++.Again, the top three project priorities are:
ASDL will help with each of these tasks.
1) As mentioned, osh.asdl
will be a lossless representation, so we should be
able to convert any shell script to oil
.
The main blocker is to clean up the source location schema, so we can
preserve whitespace and comments. Right now I'm attaching a line_span
object
to each token, but this information should be propagated up to the node level.
2) The Python runtime has substantial functionality and can be easily extended. I hope to get help on this, but I need to get the spec tests in better shape before accepting contributions. Let me know in the comments if you're interested in contributing, and I'll increase the priority of this task!
3) The C++ executor is blocked on designing ovm.asdl
, which is itself blocked
on validating the oil language design. To design ovm
, we need to have a good
grasp on the languages that compile to it.
So this is further off, but I'm pleased to have implemented the oheap
format,
which makes progress toward this goal.
We talked about four tree-walking algorithms in oil's ASDL implementation,
and then explained how ASDL helps with the top three priorities. And I
suspect that we will have three ASDL schemas: osh
, oil
, and ovm
.
I have one more post on the ASDL implementation, and then I'll share some
thoughts that came from the experience of implementing ASDL and using it in
oil
.
For one, I'm surprised that ASDL seems to be uncommon. It's used in one of most popular languages in the world, but there are only a couple blog posts about it, and not too many modern successors.
Not to jump forward too much, but I want oil to be a shell for distributed computing, and schema languages are widely used in that domain. So this lesson from oil's implementation may inform the design of the oil language itself.