Oil Can Parse Real Shell Programs


Here is how you print the AST for a shell script using the current prototype:

$ sketch/pysh.py --no-exec --print-ast -c 'ls ~/src'

(Com {[LIT_CHARS ls]} {[TildeSub ''] [LIT_CHARS /src]})

The AST contains a single command node: (Com ...). It has two words, both of which are enclosed in { }. The first word has a literal token ls. The second one has a TildeSub part and then a literal /src part. LIT_CHARS is a token type.
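To make that structure concrete, here is a minimal sketch of a command that holds words, each of which holds parts. The class names (Command, Word, LiteralPart, TildeSubPart) are made up for illustration and are not the classes in pysh.py. I'm also assuming that the '' in TildeSub is the user name that follows the tilde, which is empty here.

```python
# Hypothetical model of the printed AST; not the actual code in pysh.py.

class LiteralPart(object):
    """A literal piece of a word, tagged with its token type."""
    def __init__(self, token_type, text):
        self.token_type = token_type
        self.text = text

    def __repr__(self):
        return '[%s %s]' % (self.token_type, self.text)


class TildeSubPart(object):
    """A tilde substitution like ~ or ~bob (assumption: prefix is the user name)."""
    def __init__(self, prefix):
        self.prefix = prefix

    def __repr__(self):
        return '[TildeSub %r]' % self.prefix


class Word(object):
    """A word is a sequence of parts, printed inside { }."""
    def __init__(self, parts):
        self.parts = parts

    def __repr__(self):
        return '{%s}' % ' '.join(repr(p) for p in self.parts)


class Command(object):
    """A simple command is a sequence of words, printed as (Com ...)."""
    def __init__(self, words):
        self.words = words

    def __repr__(self):
        return '(Com %s)' % ' '.join(repr(w) for w in self.words)


# Build the tree for: ls ~/src
tree = Command([
    Word([LiteralPart('LIT_CHARS', 'ls')]),
    Word([TildeSubPart(''), LiteralPart('LIT_CHARS', '/src')]),
])
print(tree)  # (Com {[LIT_CHARS ls]} {[TildeSub ''] [LIT_CHARS /src]})
```

The point of splitting a word into parts is that ~/src is one word to the shell, but the tilde and the /src are evaluated differently.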

I can test the parser out on any shell script, and the first thing I did was parse ~2500 lines of my own scripts. But what's more interesting is to run it on scripts written by someone else. Shell is a little like C++ -- everybody writes in their own subset of it.

I tried parsing debootstrap and Aboriginal Linux, two shell codebases that are thousands of lines long (and that teach you about the foundations of a Linux system).

They both revealed bugs and missing features in my parser. In the last few days I've fixed all the issues Aboriginal Linux hit.

For example, a command can continue onto the next line after && or ||:

ls / &&
ls /bin ||
echo ERROR

Another tricky issue was expressions like $((1 * (2+3))). I would write this as $((1 * (2+3) )), to make it clear that the first right paren is for expression grouping, and the last two right parens close the substitution.

But this isn't required, so the tokenizer has to take care not to interpret the first two of the three right parens as closing the arithmetic substitution.
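One way to get this right is to count grouping parens, and only treat )) as the end of the substitution when none are open. The sketch below shows that idea. The function name find_arith_end is made up, it is not how pysh.py is structured, and it ignores quoting and everything else a real lexer has to handle.

```python
# Hypothetical helper for illustration; not the tokenizer in pysh.py.

def find_arith_end(s, start):
    """Return the index just past the '))' that closes the '$((' at s[start].

    Grouping parens inside the expression are tracked with a depth counter,
    so a ')' that closes a group is never mistaken for part of the final '))'.
    """
    if not s.startswith('$((', start):
        raise ValueError('expected $(( at index %d' % start)

    depth = 0        # number of open grouping parens inside the expression
    i = start + 3    # skip past '$(('
    while i < len(s):
        c = s[i]
        if c == '(':
            depth += 1
        elif c == ')':
            if depth > 0:
                depth -= 1              # closes a grouping paren
            elif s.startswith('))', i):
                return i + 2            # closes the substitution
            else:
                raise ValueError('unbalanced right paren at index %d' % i)
        i += 1
    raise ValueError('unterminated arithmetic substitution')


compact = '$((1 * (2+3)))'
spaced = '$((1 * (2+3) ))'
print(compact[:find_arith_end(compact, 0)])
print(spaced[:find_arith_end(spaced, 0)])
```

Both spellings end at the last two right parens, which is the behavior the tokenizer needs.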

RESULTS: This directory contains the source code and abstract syntax trees for Aboriginal Linux, totalling 3,770 lines of code.

It works! But I also claim it's better than existing shell parsers. Tomorrow I will talk about why this is.