Status Update: the Backbone of the Interpreter


Instead of outlining future blog posts as promised last time, I worked on implementing the two top priorities: printing the AST to oil, and executing the AST in Python, aided by spec test cases.

I also decided to start the third use case for the AST: serializing it to a binary format for execution in C++. In other words, I will build bridge to the C++ code I wrote as the first pass of the oil implementation.

These new use cases for the AST will change it drastically. Prior to this project, I had a cartoonish view of programming language implementation: isn't there a single, "obvious" AST representation for each language?

There is not; the representation depends heavily on what you need to do with it. In fact, this was the entire motivation for Clang: to develop an AST representation of C and C++ to support source code tools and IDEs, because the representation that GCC was only suited for code generation.

I've been watching a bunch of videos about Clang and its refactoring tools online:

One of the first e-mails I got about oil was from an LLVM developer that made this connection: you could say that oil/osh is to bash as Clang is to GCC. That's not the purpose of the project per se, but there's some truth there.

The Backbone of the Interpreter

Turing Award-winner Fred Brooks wrote:

Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

A more modern version is by Linus Torvalds:

I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

In that light, the AST is the most important part of an interpreter. It's the bridge between parsing and execution.

I'm about halfway through an editing pass over the AST. I reused the same expression nodes for the arithmetic sublanguage ( $((a + b)) ) and the boolean sublanguage ( [[ $a -eq $b ]] ), so it's looking more like a homogeneous AST rather than a heterogeneous AST. There are some other decisions to make, and I'll write more about them when it's settled down — that is, when it's closer to supporting the three primary use cases.

The Backbone of the Backbone of the Interpreter

However, I can show some code that I expect to be relatively stable.

If the AST backbone of the intepreter, the file core/id_kind.py is the backbone of the AST.

The functions _AddKinds() and _AddBoolKinds() dynamically generate values on the enums Id and Kind. An Id is for identifying tokens, words, and nodes. A Kind is for making coarse-grained parsing decisions in three recursive descent parsers for three of the sublanguages. (Pratt Parsing doesn't need the Kind.)

The takeaway is that you can find every construct that oil implements named in these ~200 lines. There are 219 Ids organized into 21 values of Kind. This gives a rough idea of how large the shell language is.


As mentioned, I had intended to outline a list of future blog topics, but I think blogging about what I'm actually implementing will give better results.

So I may be in heads-down mode with the AST and its three use cases for the next week or two. If I need a break, I'll surface the list of blog posts, and perhaps talk about two more topics: