Home

The Thinner Waist of the Interpreter

2017-01-25

(This is the status update on source code size I mentioned in the Blog TODO stack.)

In Analogies for ASDL, I counted lines of code in the osh front end prior to integrating ASDL. There were ~3500 lines in the lexer and parser, and ~1900 lines to represent the AST, for a total of ~5400 lines.

A few weeks ago, I made another snapshot. As shown below, the total count dropped by 1000 lines, and we've removed the files tokens.py and {expr,cmd,word}_node.py. That is, there are no longer any hand-written AST nodes, and even the basic Token is represented in ASDL.

# as of 12/16/2016            # as of 1/6/2017
$ ./count.sh parser           $ ./count_parser

Lexer/Parser                  Lexer/Parser
    77 osh/parse_lib.py           77 osh/parse_lib.py
   196 osh/arith_parse.py        167 osh/arith_parse.py
   291 osh/bool_parse.py         264 osh/bool_parse.py
   334 osh/lex.py                334 osh/lex.py
  1144 osh/word_parse.py         406 core/word.py
  1455 osh/cmd_parse.py         1096 osh/word_parse.py
  3497 total                    1442 osh/cmd_parse.py
                                3786 total

AST and IDs                   AST and IDs            
   80 core/tokens.py            129 osh/osh.asdl
   99 core/expr_node.py         456 core/id_kind.py
  441 core/id_kind.py           585 total           
  491 core/cmd_node.py        
  777 core/word_node.py       
 1888 total                   

TOTAL: 5385                   TOTAL: 4371

I'm not counting the ASDL implementation itself, which takes ~1800 lines of code. On the other hand, it also implements new features like the oheap encoding, C++ code generation and generic pretty printing.

Observations:

(1) There is a new core/word.py file. What used to be methods on the CompoundWord class are now free functions:

# OLD:
w.HasArrayPart()
w.AsFuncName()
w.AsArithVarName()
w.LooksLikeAssignment()

# NEW (word is a module, w is an object):
word.HasArrayPart(w)
word.AsFuncName(w)
word.AsArithVarName(w)
word.LooksLikeAssignment(w)

In other words, I've switched from and object-oriented style with methods to a functional style. This might seem like a step backward from the syntactic point of view, but I think that is less important than software architecture.

I've called the AST backbone of the interpreter. It's now represented just 600 lines of code in core/id_kind.py and osh.asdl. Together, they form a concise description of every element of the language.

Because of this, I think of these files as thin waist of the interpreter. They divide the code cleanly between front end and back end, which helps in a few ways:

  1. Make it easy to test and modify the code. The only output of the parser is the AST, and the inputs for the runtime are the AST and command line flags.
  2. Make it possible to avoid the work of parsing with oheap.

Of course, this is exactly the kind of modular architecture you would learn about in school. But "real programs" rarely have a clean architecture. I've spent significant time reading the source of various shells, including bash, which has ~150K lines of C code.

The article about bash in the AOSA book is great, but it misses a lot of things, probably due to the sheer size of the codebase.

Notably, bash cannot have this clean "thin waist" architecture, because it is not statically parsed. Explaining shell WTFs is exhausting, so I'll just point to issue 3 on Github as another head-scratcher. It appears to be an undocumented case of parsing and evaluating code in strings at runtime.

Overall Code Size

When I open-sourced the code, it had ~11,500 lines of After the 1000-line reduction above, and a few other cleanups, the code dipped below 10,000 lines:

$ ./count.sh all
...
   546 core/process.py
   786 core/completion.py
   843 core/cmd_exec.py
   876 core/word_eval.py
  1096 osh/word_parse.py
  1442 osh/cmd_parse.py
  9992 total

This doesn't have too much meaning, especially since I'm not counting ASDL. But I would like to celebrate this as the last time the code will ever be so small. I like that you can describe (most of) a relatively large language like bash in a small amount of code.

I've always agreed that Size is Code's Worst Enemy. However, I do have strategies to avoid writing 50K or 100K lines of C++, including metaprogramming and bootstrapping oil in oil.

Conclusion

Using the model of algebraic data types in osh.asdl is powerful and concise.

I expect bootstrapping oil to be a future theme in the blog, although I'm also planning posts on central topic of the [oil language][oil-language] design.

Right now, I'm working on automatically converting [osh][osh-langauge] to [oil][oil-language], and it's working better than expected. I've also been in touch with the authors of four different alternative shells, and I hope to get design feedback from them.


Discuss this post on Reddit