Oil Parses Shell Scripts Up Front in a Single Pass. Other Shells Don't.


The last post discussed parse-time vs. runtime errors. A goal for oil is to improve programmer productivity by emitting as many parse time errors as possible.

Therefore, we must we parse the whole program at once. I thought this was the simplest and more obvious way to parse, but it turns out that other shells do not do this. I will show that in this post.

Let's start with these 3 valid programs:

$ echo ${x:-VarSub}
$ echo $(echo CommandSub)  # `echo CommandSub` is equivalent
$ echo ArithSub $((1 + 2))
ArithSub 3

Now create syntax errors by adding or removing characters:

$ echo ${x:}
/bin/bash: line 1: ${x:}: bad substitution
$ echo $(echo CommandSub >)
/bin/bash: command substitution: line 2: syntax error near unexpected token `)'
/bin/bash: command substitution: line 2: `echo CommandSub >)'
$ echo ArithSub $((1 + ))
/bin/bash: line 1: 1 + : syntax error: operand expected (error token is "+ ")

Now use the same if false technique from the last post:

$ if false; then echo ${foo:};      else echo VarSub not parsed;     fi &&
> if false; then echo $(echo hi >); else echo CommandSub not parsed; fi &&
> if false; then echo $(( 1 + ));   else echo ArithSub not parsed;   fi
VarSub not parsed
CommandSub not parsed
ArithSub not parsed

Wow, bash doesn't parse any of them! It interleaves parsing and execution. Let's test dash, mksh, and zsh:

$ for sh in dash mksh zsh; do
>   echo -----
>   echo $sh
>   $sh -c 'if false; then echo ${foo:};      else echo VarSub not parsed;     fi'
>   $sh -c 'if false; then echo $(echo hi >); else echo CommandSub not parsed; fi'
>   $sh -c 'if false; then echo $(( 1 + ));   else echo ArithSub not parsed;   fi'
>   echo
> done
dash: 1: Syntax error: Missing '}'
dash: 1: Syntax error: ")" unexpected
ArithSub not parsed

VarSub not parsed
mksh: syntax error: ')' unexpected
ArithSub not parsed

VarSub not parsed
zsh:1: parse error near `)'
zsh:1: parse error near `$(echo hi >)'
ArithSub not parsed

Interesting! dash does the best, catching 2 of 3 errors, but missing the arithmetic substitution error. mksh only caches the the command sub error. And zsh misses all errors, like bash.

Let's try oil:

Line 1 of '<-c arg>'
  if false; then echo ${foo:};      else echo VarSub not parsed;     fi
Unknown token in arith context: <UNKNOWN_TOK ";">

Line 1 of '<-c arg>'
  if false; then echo $(echo hi >); else echo CommandSub not parsed; fi
Expected filename after redirect operator

Line 1 of '<-c arg>'
  if false; then echo $(( 1 + ));   else echo ArithSub not parsed;   fi
Expected value or parenthesized expression, got {ASOP_RPAREN ")"}

It catches all 3 errors up front. The error messages need a little polish, but I like that they point to the correct token like the Python parser and Clang. No other shell does this, even if manages to catch the error.

I think this is pretty cool!

Problems with Multi-Stage Parsing

Multi-stage parsing, i.e. parsing at execution time, is harder, not easier. The issue is that you have to "pre-parse" the code anyway to figure out where the closing } or ) or )) is.

This is because the non-recursive lexer needs knowledge from a recursive parser in order to be able to "skip over" text. A lexer can't run by itself -- it will start returning the wrong tokens without proper feedback. The lexer and parser are necessarily coupled in a feedback loop.

This is described quite well in the AOSA book chapter on bash:

Much of the work to recognize the end of the command substitution during the parsing stage is encapsulated into a single function (parse_comsub), which knows an uncomfortable amount of shell syntax and duplicates rather more of the token-reading code than is optimal. This function has to know about here documents, shell comments, metacharacters and word boundaries, quoting, and when reserved words are acceptable (so it knows when it's in a case statement); it took a while to get that right.

(This article by bash maintainer Chet Ramey has been very useful. I've read it several times, and I will continue to refer to it in future posts.)

I also believe that multi-stage parsing is slower. You end up doing more total work because you have to pre-parse as well as parse. And parsing at execution time will obviously slow execution down. I haven't seen any indication that existing shells cache parse trees, so they will end up repeatedly parsing the text inside ${} and $() if they appear inside in a loop.

I'll have more to say about the performance of the oil parser once it's ported to C++. But for now, I claim it is algorithmically more efficient.

Tomorrow I will talk about the simple technique that lets oil parse in one pass: lexical states.


Yesterday I showed that [[ is part of the language, while [ is a builtin. Thus the arguments to [ are opaque to the shell parser, while [[ expressions can be parsed up front.

Ironically, even though the ${}, $(), and $(( )) constructs are very much part of the language, they are not parsed up front. The parser skips over them, essentially treating them as builtins as well.