Home

Success with ASDL

2017-01-04

In the last 2 days, I landed a few weeks' worth of big changes on the master branch. After writing my ASDL implementation, I replaced the backbone of the interpreter with dynamically-generated ASDL classes.

Recall that ASDL is a way of using ML's data model of algebraic data types inside another language, roughly analogous to how JSON lets you use JavaScript's data model in another language.

I'm happy with ASDL, so I plan to write a few posts about it. In this post, I'll review the progress on oil so far, and then show an example ASDL schema and data structure.

In subsequent posts, I'll go into detail on oil's ASDL implementation, describe how it concretely moves the project forward, and share some more abstract thoughts.

Recap of Priorities

Shortly after I released the code in November, I listed two top priorities, gave a detailed roadmap, and mentioned a third use case for the AST.

Looking back, we're pretty much on track. Replacing the backbone of the interpreter with ASDL took awhile, requiring changes to essentially every source file, but it was necessary for all three priorities:

  1. Test the oil language design by converting shell scripts in the wild to it.
  2. Fill out the shell executor/runtime in Python.
  3. Write a production quality executor/runtime in C++.

Why do the same work in both Python and C++? The first reason is that discovering the semantics of shell is the hard part, and we want to do that in an agile language. Once that's done, writing the code is easy.

The second reason is that the C++ executor will operate on a lower-level representation of code. I'll explain this in a future post.

An Example of ASDL

At 129 lines, osh.asdl compactly describes the osh language (which is nearly identical to bash). An excerpt:

token = (id id, string val, line_span? loc)

word =
  TokenWord(token token)
| CompoundWord(word_part* parts)

This ASDL schema syntax should be readable to programmers who know Haskell or ML. For others, it only involves a few concepts and can be read like this:

  1. token is a product type, with two required fields id and val, and an optional field loc.
  2. word is a sum type with two alternatives:

For those with C background, it's helpful to remember that a product type can be represented by a struct, and a sum type can be represented by a tagged union. However, structs and unions fall short of algebraic data types because of the static type system.

Consider this statement:

ls >> ~/git/$repo/listing.txt

It consists of three words:

  1. A CompoundWord representing ls
  2. A TokenWord with the token (Id.Redir_DGreat, ">>")
  3. A CompoundWord with 4 parts:
    1. TildeSubPart: to substitute $HOME
    2. LiteralPart("/git/")
    3. VarSubPart(repo): to substitute the repo variable
    4. LiteralPart("/listing.txt")

These three words are further parsed into a command node, which our ASDL implementation pretty-prints like this:

(SimpleCommand
  words: [
    (CompoundWord
      parts: [
        (LiteralPart
          token: (token id:Lit_Chars val:ls loc:(line_span pool_index:0 col:0 length:2))
        )
      ]
    )
  ]
  redirects: [
    (Redirect
      op_id: Redir_DGreat
      arg_word: 
        (CompoundWord
          parts: [
            (TildeSubPart prefix:"")
            (LiteralPart token:(token id:Lit_Chars val:/git/))
            (VarSubPart name:repo)
            (LiteralPart
              token: 
                (token
                  id: Lit_Chars
                  val: /listing.txt
                  loc: (line_span pool_index:0 col:17 length:12)
                )
            )
          ]
        )
      fd: 1
    )
  ]
)

This may look like a Lisp S-expression, but two features give it more structure:

  1. Explicitly named and typed fields
  2. Arrays, denoted with brackets [ ... ] above

In addition to this textual representation, which is useful for debugging, there's also a binary representation. I will describe that in tomorrow's post.