ASDL in Oil


Yesterday I showed an example of an ASDL schema and data structure, and recapped the project priorities. Today I will describe what OIl's ASDL implemetation does, and describe how it will help with each of the top three priorities.

Oil's ASDL Implementation

The implementation is very short:

[show line count]

How does it work? As mentioned the asdl_.py was copied from Python. This uses a simple lexer and recursive descent parser to turn the ASDL schema (example: osh.asdl) into a Python data structure.

Then we have two algorithms that walk this tree representation of the schema:

And two algorithms that uses the schema in addition to data:

I will discuss oheap in a subsequent post, but for now we can just say it's a binary version of the text representation I showed at the end of yesterday's post.

So the flow will be like this:

LINES -> lex -> TOKENS -> parser -> TREE OF ASDL OBJECTS -> encoder -> oheap version of the tree -> decoder -> runtime in C++

should I show python classes?

Relation to the Three Priorities Forward

Why Binary AST format?

For now, it's convenient. Compare with Python's ASDL. It generates 10K lines of code, as big as all of oil!

What I do NOT want: .pyc files or pycache files. This causes problems. Most scripts won't need this.

Libraries pre-cmopiled but scripts won't be.

Should be able to parse say 100,000 lines of code neglible time. as fast as |wc -l.

This work took awhile, around 6 weeks, but it unblocks the three top priorities:

1) converting shell scripts to oil 2) testing the semantics by executing the shell in Python The main reason for this is that I don't like iterating in C++. I am doing a lot of work to avoid writing C++ code :) 3) Writing a production quality version in C++

Using ASDL will help with all three.

1) We still need source location info. 2) ASDL found a lot of bugs. It really is the core of the interpreter. Dynamic type checking with exhaustive tests is as good as static type checking.

In oil, I want optional typing like Dart.

3) My bespoke ASDL implementation writes things to the oheap format.

One important realization that led me to use ASDL:

Originally I thought that I could have a very simple architecture. A "pure front-end replacement".

But for a number of reasons this won't work. oil is going to be a SUPERSET of the semantics of oil.

important point: All ERRORS are handled in the first stage.

ovm is a smaller language. It's a "lowering". It's a VM, but for now it will have a tree structure. It might be a little similar to NQP, although I don't know much about it.

ovm can also be used for other tools. I am not writing an awk interpreter or parser, but if someone wanted to write one, ovm might be a good target. Instead I am folding it

A Peek of ASDL

I've been working on using Zephyr ASDL. How is it used in Python?

design of CPython compiler

  1. Python.asdl is a big data structure -- a set of Algebraic Data Types. For people coming from a background of C, you can think of them like classes and tagged unions (and indeed Rust has made that connection, using the "enum" keyword for variants.)

  2. This file transformed via asdl_c.py to a header file and C code.

  3. The generated parser creates C structs.
  4. Then it creates Python/C bindings classes so you can do the following:

Eventually, oil won't depend on Python, so I'm not going to use Python/C bindings of course. Here is the plan for oil's architecture.

  1. osh.asdl
  2. oil.asdl
  3. Both compile to ovm.asdl

I was working on making the AST more homogeneous ([example commit]), and then I realized that I really need two tree data structures:

Heterogeneous means: represent everything faithfully, more types.

Homogeneous: fewer types. Most shells seems to do this. In C this is the natural thing because there is no subtyping.

I will just say what I have now, and describe how it's used once code is actually committed:

  1. Parse the .asdl file, creating its own tree. (And yes, ASDL can be described with ASDL, another form of bootstrapping.)

  2. Use Python metaprogramming to generate classes so that the Python parser can use.

  3. Able to create and DYNAMICALLY type check instances of those generated types. Example:

  4. Able to serialize them to binary.

  5. Able to GENERATE C++ code that turns. The API looks somewhawt like protocol buffers,

In fact ASDL and protocol buffers are sort of the same. There is a long history of serializing types:

It is divided into coarse-grained "kinds", which help make parsing decisions. I'm also using the "external visitor" pattern, i.e. this thing that Terrence Parr gave me permission to do. (TODO: maybe write a blog post on that?)

What does the AST Look like?

Clang AST. IPR for C++.

I'm appreciating how hard, or rather how many choices there, it is to write a programming language. LLVM is WAY cleaner than gcc; yet it seems that they are stuck with some decisinos.


Another reason for not blogging I also want to be more forward looking. If I blog after doing some more coding, that will be the case. I could write about Shell WTFs every day for 6 months, but it doesn't really move the project fowrard. Everyone is convinced. Still, some things are too weird not to mention, like what I found out yesterday (explained in the Addendum below).

(I need a way to link to test cases.)

Fun fact: literally nobody is arguing that shell is a good language. I've thought that programming is big, the internet is big, and someone would come out of the woodwork and claim we don't need to fix shell. I haven't gotten that feedback yet, which is sort of surprising.

Crucially, I think people understand the point of auto-converting osh to oil. The effort is quite large, but it's worth it, and actually improved the design of the oil language in several cases, which I will write about in a future post.

I still have a draft of the blog post listing posts which I will release.

I want a BALANCE of the AST format. AST formats are non-trivial: Clang talk which you can't print. Receive this question a lot. They are doing source edits. Clang AST is fuggin' enormous.

TODO: implement stack traces.

Code is smaller now

Two clients of AST: Shared osh/oil executor, written in C++ oil printer, written in Python

They have different requirements. Fleshing out a third one.

(This is not the same thing as the AST vs. parse-tree distinction. That difference is essentially whehter you can automatically generate a tree from a grammar, without semantic actions, like ANTLR 4 does.

My parser is hand-written and has "semantic actions" to prune the tree as it's being created; hence I'm producing an AST.

I don't like ANTLR framework.

Compiler class is kind of toy. I would like to see different representations. nano-pass compiler paper kind of talks about this.

Vertical slice: take my old C++ executor, and just execute execve()

Plumbing from Python classes with annotations -> Python AST writer -> binary AST format -> C++ AST reader -> C++ classes -> C++ executor.


Alternate roadmap:

Skipping over some posts. Skipping over re2c to post, waiting until I publish the code. The current code doesn't use it, but the original C++ implementation did. Python is missing abstractions for describing data.

Protocol Buffers and Gazelle.

Discuss this post on Reddit