The last post discussed parse-time vs. runtime errors. A goal for oil is to improve programmer productivity by emitting as many parse time errors as possible.
Therefore, we must we parse the whole program at once. I thought this was the simplest and more obvious way to parse, but it turns out that other shells do not do this. I will show that in this post.
Let's start with these 3 valid programs:
$ echo ${x:-VarSub} VarSub
$ echo $(echo CommandSub) # `echo CommandSub` is equivalent CommandSub
$ echo ArithSub $((1 + 2)) ArithSub 3
Now create syntax errors by adding or removing characters:
$ echo ${x:} /bin/bash: line 1: ${x:}: bad substitution
$ echo $(echo CommandSub >) /bin/bash: command substitution: line 2: syntax error near unexpected token `)' /bin/bash: command substitution: line 2: `echo CommandSub >)'
$ echo ArithSub $((1 + )) /bin/bash: line 1: 1 + : syntax error: operand expected (error token is "+ ")
Now use the same if false technique from the last post:
$ if false; then echo ${foo:}; else echo VarSub not parsed; fi && > if false; then echo $(echo hi >); else echo CommandSub not parsed; fi && > if false; then echo $(( 1 + )); else echo ArithSub not parsed; fi VarSub not parsed CommandSub not parsed ArithSub not parsed
Wow, bash doesn't parse any of them! It interleaves parsing and
execution. Let's test dash, mksh, and zsh:
$ for sh in dash mksh zsh; do > echo ----- > echo $sh > $sh -c 'if false; then echo ${foo:}; else echo VarSub not parsed; fi' > $sh -c 'if false; then echo $(echo hi >); else echo CommandSub not parsed; fi' > $sh -c 'if false; then echo $(( 1 + )); else echo ArithSub not parsed; fi' > echo > done ----- dash dash: 1: Syntax error: Missing '}' dash: 1: Syntax error: ")" unexpected ArithSub not parsed ----- mksh VarSub not parsed mksh: syntax error: ')' unexpected ArithSub not parsed ----- zsh VarSub not parsed zsh:1: parse error near `)' zsh:1: parse error near `$(echo hi >)' ArithSub not parsed
Interesting! dash does the best, catching 2 of 3 errors, but missing the
arithmetic substitution error. mksh only caches the the command sub
error. And zsh misses all errors, like bash.
Let's try oil:
Line 1 of '<-c arg>'
if false; then echo ${foo:}; else echo VarSub not parsed; fi
^
Unknown token in arith context: <UNKNOWN_TOK ";">
Line 1 of '<-c arg>'
if false; then echo $(echo hi >); else echo CommandSub not parsed; fi
^
Expected filename after redirect operator
Line 1 of '<-c arg>'
if false; then echo $(( 1 + )); else echo ArithSub not parsed; fi
^
Expected value or parenthesized expression, got {ASOP_RPAREN ")"}
It catches all 3 errors up front. The error messages need a little polish, but I like that they point to the correct token like the Python parser and Clang. No other shell does this, even if manages to catch the error.
I think this is pretty cool!
Multi-stage parsing, i.e. parsing at execution time, is harder, not easier.
The issue is that you have to "pre-parse" the code anyway to figure out where
the closing } or ) or )) is.
This is because the non-recursive lexer needs knowledge from a recursive parser in order to be able to "skip over" text. A lexer can't run by itself -- it will start returning the wrong tokens without proper feedback. The lexer and parser are necessarily coupled in a feedback loop.
This is described quite well in the AOSA book chapter on bash:
Much of the work to recognize the end of the command substitution during the parsing stage is encapsulated into a single function (
parse_comsub), which knows an uncomfortable amount of shell syntax and duplicates rather more of the token-reading code than is optimal. This function has to know about here documents, shell comments, metacharacters and word boundaries, quoting, and when reserved words are acceptable (so it knows when it's in a case statement); it took a while to get that right.
(This article by bash maintainer Chet Ramey has been very useful. I've read it several times, and I will continue to refer to it in future posts.)
I also believe that multi-stage parsing is slower. You end up doing more
total work because you have to pre-parse as well as parse. And parsing at
execution time will obviously slow execution down. I haven't seen any
indication that existing shells cache parse trees, so they will end up
repeatedly parsing the text inside ${} and $() if they appear inside in a
loop.
I'll have more to say about the performance of the oil parser once it's ported to C++. But for now, I claim it is algorithmically more efficient.
Tomorrow I will talk about the simple technique that lets oil parse in one pass: lexical states.
Yesterday I showed that [[ is part of the language, while [ is a builtin.
Thus the arguments to [ are opaque to the shell parser, while [[
expressions can be parsed up front.
Ironically, even though the ${}, $(), and $(( )) constructs are very much
part of the language, they are not parsed up front. The parser skips over
them, essentially treating them as builtins as well.