Home

The Thinner Waist of the Interpreter

2017-01-26

(This is the update on source code size mentioned in the Blog TODO stack.)

In Analogies for ASDL, I counted lines of code in the osh front end prior to integrating ASDL. There were ~3500 lines in the lexer and parser, and ~1900 lines to represent the AST, for a total of ~5400 lines.

A few weeks ago, I counted again. As shown below, there are ~1000 fewer lines, and tokens.py and {expr,cmd,word}_node.py are gone. That is, there are no more hand-written AST nodes, and the basic Token type is now represented in ASDL.

# as of 12/16/2016            # as of 1/6/2017
$ ./count.sh parser           $ ./count.sh parser

Lexer/Parser                  Lexer/Parser
    77 osh/parse_lib.py           77 osh/parse_lib.py
   196 osh/arith_parse.py        167 osh/arith_parse.py
   291 osh/bool_parse.py         264 osh/bool_parse.py
   334 osh/lex.py                334 osh/lex.py
  1144 osh/word_parse.py         406 core/word.py
  1455 osh/cmd_parse.py         1096 osh/word_parse.py
  3497 total                    1442 osh/cmd_parse.py
                                3786 total

AST and IDs                   AST and IDs            
   80 core/tokens.py            129 osh/osh.asdl
   99 core/expr_node.py         456 core/id_kind.py
  441 core/id_kind.py           585 total           
  491 core/cmd_node.py        
  777 core/word_node.py       
 1888 total                   

TOTAL: 5385                   TOTAL: 4371

I'm not counting the ASDL implementation itself, which takes ~1800 lines of code. On the other hand, it also implements new features like the oheap encoding, C++ code generation and generic pretty printing.

Observations:

(1) There is a new core/word.py file. What used to be methods on the CompoundWord class are now free functions:

# OLD:
w.HasArrayPart()
w.AsFuncName()
w.AsArithVarName()
w.LooksLikeAssignment()

# NEW (word is a module, w is an object):
word.HasArrayPart(w)
word.AsFuncName(w)
word.AsArithVarName(w)
word.LooksLikeAssignment(w)

In other words, I've switched from an object-oriented style with methods to a functional style. This might seem like a step backward from a syntactic point of view, but I think that's a minor difference compared to the architectural advantages.
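As a minimal sketch of the functional style (the `CompoundWord` class and the `LooksLikeAssignment` logic here are simplified stand-ins, not Oil's actual code):

```python
# Sketch: the node is plain data; the logic is a module-level free function.
# This is a toy version, not the real ASDL-generated CompoundWord.

class CompoundWord:
    def __init__(self, parts):
        self.parts = parts  # e.g. ['a', '=', 'b'] for an assignment-like word

def LooksLikeAssignment(w):
    """Free function taking the node as an argument, instead of a method."""
    return len(w.parts) >= 2 and w.parts[1] == '='

w = CompoundWord(['a', '=', 'b'])
print(LooksLikeAssignment(w))  # → True
```

The data type stays a dumb record, and any module can add new operations over it without touching the class definition.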

I've called the AST the backbone of the interpreter. It's now represented by just ~600 lines of code in core/id_kind.py and osh.asdl. Together, these two files form a concise description of every element of the language.

So now I think of them as the thin waist of the interpreter. They divide the code cleanly between the front end and the back end. This helps in a few ways:

  1. It's easier to test and modify the code. The only output of the parser is the AST, and the inputs for the runtime are the AST and command line flags. There are no global variables.
  2. It's possible to avoid the work of parsing with oheap. This is important for bootstrapping oil in oil.
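The testability claim in point 1 can be sketched as follows. If the parser is a pure function from source text to an AST, and the runtime takes only the AST and flags, each side can be tested in isolation. The names here are hypothetical illustrations, not Oil's actual API:

```python
# Sketch of the "thin waist": parser and executor share only the AST type.
# Parse() and Execute() are hypothetical stand-ins for Oil's real code.
from collections import namedtuple

Command = namedtuple('Command', ['argv'])  # stand-in for an ASDL node

def Parse(source):
    # Toy "parser": split a simple command into words, return an AST node.
    return Command(argv=source.split())

def Execute(node, flags=None):
    # Toy "executor": consumes only the AST (and flags), no globals.
    return 'exec: ' + ' '.join(node.argv)

# A parser test and an executor test need no shared setup.
ast = Parse('echo hi')
assert ast.argv == ['echo', 'hi']
print(Execute(ast))  # → exec: hi ... wait, prints 'exec: echo hi'
```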

Of course, this is exactly the kind of modular architecture you would learn about in school. But "real programs" rarely have a clean architecture.

For example, I've spent significant time reading the source of various shells, including bash, which has ~150K lines of C code. Bash even has a great article in the AOSA book. But I would say that the article only covers a small portion of the program, and the portion that it covers is necessarily inaccurate.

In any case, bash cannot have this clean "thin waist" architecture, because it is not statically parsed. Execution and parsing are interleaved, and the same functions are called in both stages.

For example, issue 3 on Github is a head-scratcher. It's an undocumented case of parsing and evaluating code inside strings at runtime, without an explicit eval.

I should note that bash author Chet Ramey was refreshingly honest about the parser, and his article was part of the motivation for oil.

Overall Code Size

When I open-sourced the code, it had ~11,500 lines of code. After the 1000-line reduction above, and a few other cleanups, the code dipped below 10,000 lines!

$ ./count.sh all
...
   546 core/process.py
   786 core/completion.py
   843 core/cmd_exec.py
   876 core/word_eval.py
  1096 osh/word_parse.py
  1442 osh/cmd_parse.py
  9992 total

This doesn't mean that much, especially since I'm not counting ASDL. But I'd like to celebrate this as the last time the code will ever be so small. I like that you can describe (most of) a large language like bash in a small amount of code.

I've always agreed that Size is Code's Worst Enemy. I have strategies to avoid writing 50K or 100K lines of C++, including metaprogramming and bootstrapping oil in oil.

Conclusion

Using the model of algebraic data types in osh.asdl is powerful and concise.
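To illustrate the algebraic data type model (with hypothetical variant names, not the actual osh.asdl schema), an ASDL sum type like `word = Token(...) | CompoundWord(...)` maps naturally onto Python classes:

```python
# Hypothetical sketch of how an ASDL sum type maps to Python.
# ASDL:  word = Token(string val) | CompoundWord(word_part* parts)

class word:  # the sum type; every variant is a subclass
    pass

class Token(word):
    def __init__(self, val):
        self.val = val

class CompoundWord(word):
    def __init__(self, parts):
        self.parts = parts

# Consumers dispatch on the variant, analogous to the integer tag
# that generated ASDL code uses.
def Describe(w):
    if isinstance(w, Token):
        return 'Token(%s)' % w.val
    elif isinstance(w, CompoundWord):
        return 'CompoundWord with %d parts' % len(w.parts)

print(Describe(Token('ls')))  # → Token(ls)
```

The schema stays small because each node is just a tag plus typed fields; all behavior lives in functions like `Describe`.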

I expect bootstrapping oil to be a future theme on the blog, although I'm also planning posts on the central topic of the oil language design.

Right now, I'm working on automatically converting osh to oil, and it's working better than expected. I've also been in touch with the authors of four different alternative shells, and I hope to get design feedback from them.

Links

Kartik Agaram is working on the problem of conveying the global structure of programs. After puzzling through the source code of many shells (bash, dash, mksh, zsh), this idea resonates with me.

I'd like to be able to easily answer questions like:

  1. What are the interaction points of the lexer and parser(s)? Shell is complicated, necessarily breaking out of the simple "lexer to parser" model you learn in school. For instance, entire subprograms are embedded in string literals, like "abc: $(for i in a b c; do echo $i; done)".

  2. How are the parsers "reused" for interactive completion, or is knowledge of the language duplicated?

  3. How do they solve the "$PS2 problem"? This is the problem of interactive parsing: when the user hits enter, do you print >, waiting for more input, or do you try to parse and run the command, possibly getting a syntax error?
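One way to frame the $PS2 decision in code: the parser must distinguish "complete", "incomplete" (print > and keep reading), and "syntax error". This toy example only balances parentheses and is not any real shell's algorithm:

```python
# Toy sketch of the $PS2 decision (not any shell's actual logic).
# After the user hits enter, classify the input so far.

def ps2_decision(line):
    depth = 0
    for ch in line:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return 'syntax error'  # more ')' than '(' -- fail now
    # Unclosed '(' means the command may continue on the next line.
    return 'run' if depth == 0 else '>'

print(ps2_decision('echo hi'))      # → run
print(ps2_decision('echo $(date'))  # → >
print(ps2_decision('echo )'))       # → syntax error
```

A real shell has to make this three-way call for quotes, here documents, compound commands, and more, which is why interactive parsing is hard to bolt on after the fact.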