What do the characters :- mean in this code?
$ echo "foo:-bar ${foo:-bar} $(( foo > 1 ? 5:- 5 ))"
foo:-bar bar -5
They mean 3 different things, depending on the context:
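- Inside double quotes: literal characters, printed to stdout
- Inside ${}: the operator that substitutes a default when the variable is unset or empty
- Inside $(()): the : in the C-style ternary operator, then the unary minus operator for negation

This illustrates the concept of lexical state. It's a pretty easy concept, but I guess it has a name because it deviates from what you might have learned in a compiler class, and because not all code generators support it.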
(I'm taking the name from the DSL Book by Martin Fowler. Chapter 28, "Alternative Tokenization" talks about "completely replacing the lexer" when you get a certain token.)
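
To make the idea concrete, here is a minimal sketch of a lexer whose rules depend on a state argument. It is not oil's actual code: the token names (`LITERAL`, `VOP_DEFAULT`, `TERNARY_COLON`, `MINUS`), the `RULES` table, and `read_token` are all invented for illustration, and only the three states needed for the `:-` example are modeled.

```python
import re
from enum import Enum, auto


class LexState(Enum):
    DQ = auto()                # inside "double quotes"
    BRACED_VAR_SUB_2 = auto()  # inside ${ after a variable name
    ARITH = auto()             # inside $(( ))


# One rule list per lexical state; the first matching rule wins.
# Token names are made up for this sketch.
RULES = {
    LexState.DQ: [("LITERAL", re.compile(r'[^$"\\]+'))],
    LexState.BRACED_VAR_SUB_2: [("VOP_DEFAULT", re.compile(r":-"))],
    LexState.ARITH: [
        ("TERNARY_COLON", re.compile(r":")),
        ("MINUS", re.compile(r"-")),
    ],
}


def read_token(state, text, pos=0):
    """Return (token name, matched text, new position) under `state`."""
    for name, pattern in RULES[state]:
        m = pattern.match(text, pos)
        if m:
            return name, m.group(0), m.end()
    raise SyntaxError(f"no token at {pos} in state {state.name}")


# The same two characters, lexed under each state.
for state in LexState:
    print(state.name, read_token(state, ":-"))
```

The same input yields one literal token in `DQ`, one operator token in `BRACED_VAR_SUB_2`, and in `ARITH` only the `:` (the `-` comes back on the next call). Whether that minus is unary or binary is left to the parser.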
In oil, there are currently 8 lexical states, named:
- UNQUOTED: the start state, for echo foo
- SQ: for 'single-quoted strings'
- DQ: for "double quoted strings, which allow $ expansions"
- ARITH: for arithmetic expressions, e.g. in $((1+2))
- BRACED_VAR_SUB_1: for the first token after ${
- BRACED_VAR_SUB_2: for a token in ${ after a name, like :- in ${foo:-bar}
- UNQUOTED_VAROP: for the argument after an operator, e.g. ${foo:-var op}
- DQ_VAROP: for the argument after an operator when double quoted, e.g. "${foo:-var op}"

And there are actually two more unimplemented lexical states:
- DOLLAR_SQ: for $'\n' -- literal strings that accept C escapes
- REGEX: for the right hand side of [[ foo =~ ^foo$ ]]. (I believe the existence of this state is a bug in bash, but let's discuss that later.)

In the implementation of many languages, you can get away with a single
lexical state -- that is, no state. Most languages have string literals, where
you could consider \t a token, but that can be worked around simply by
writing code to treat the whole double-quoted string (e.g. "a\tb\tc") as a
single token (and that seems to be what most people do).
You can't do this in shell, because a double-quoted string can contain an entire subprogram:
echo "BEGIN $(if grep pat input.txt; then echo 'FOUND'; fi) END"
That is, the "token" would have a recursive tree structure, which means it's not really a token anymore. The lexical state pattern lets us easily handle these more complicated string literals.
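
A small sketch of why the "whole string is one token" trick breaks down. The regex below is a typical single-token rule for double-quoted literals, not something taken from any particular lexer, and the shell line is a variant of the example above that nests double quotes inside `$( )`, which shell allows.

```python
import re

# A conventional "whole string is one token" rule: a quote, then any run of
# non-quote characters or backslash escapes, then the closing quote.
DQ_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

c_like = r'"a\tb\tc"'
shell = 'echo "BEGIN $(echo "inner") END"'

whole = DQ_STRING.match(c_like).group(0)
partial = DQ_STRING.search(shell).group(0)

print(whole == c_like)  # True when the C-like literal is matched whole
print(partial)          # the part of the shell string the rule thinks is the token
```

For the C-like literal the rule consumes the entire string. For the shell line it stops at the first inner quote, because a flat regex has no way to know that `$(` opened a nested program with its own quoting.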
Where do lexical states get used in the oil shell parser? Let's just take the
ARITH state. It gets used in all of these places:
- (( y = x + 2 )) (useful in if or while conditions)
- let syntax for arithmetic commands: let y=x+2
- echo $(( y = x + 2 )) and the bash alias $[y = x + 2]
- for ((i=0; i<5; ++i)); do echo; done. This is distinct from the (( command because it uses the ; token.
- echo ${a : i : i+length}
- echo ${a[x + 2]} : R-value subscript
- a[x + 2]=foo : L-value subscript
- a=([x + 2]=foo [y]=z) : literal subscript

So when the parser sees a $(( token, it knows it must start calling the lexer
with the LexState.ARITH state, rather than say LexState.UNQUOTED.
Likewise, when it sees a ${ it will switch to the BRACED_VAR_SUB_1 state.
The current state can be stored on the stack, since paired delimiters like
"quotes", ${}, and $(()) can be parsed naturally with recursive function
calls.
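
Here is a rough sketch of that arrangement: the parser decides which state to pass to the lexer on every call, and the "current state" lives in the recursive calls rather than in an explicit stack. Everything here is hypothetical (the `Lexer` class, the token names, and the three-state rule tables are not oil's code), and the arithmetic rules are deliberately tiny, with no parenthesized subexpressions.

```python
import re
from enum import Enum, auto


class LexState(Enum):
    UNQUOTED = auto()
    DQ = auto()
    ARITH = auto()


def _rules(*pairs):
    return [(kind, re.compile(pattern)) for kind, pattern in pairs]


# Hypothetical token tables, one per state; the first matching rule wins.
RULES = {
    LexState.UNQUOTED: _rules(
        ("ARITH_OPEN", r"\$\(\("), ("DQUOTE", r'"'), ("SPACE", r"\s+"),
        ("LIT", r'[^$"\s]+'), ("LIT", r"\$"),
    ),
    LexState.DQ: _rules(  # whitespace is literal inside double quotes
        ("ARITH_OPEN", r"\$\(\("), ("DQUOTE", r'"'),
        ("LIT", r'[^$"]+'), ("LIT", r"\$"),
    ),
    LexState.ARITH: _rules(
        ("ARITH_CLOSE", r"\)\)"), ("NUM", r"\d+"), ("NAME", r"[A-Za-z_]\w*"),
        ("OP", r"[-+*/?:<>=]"), ("SPACE", r"\s+"),
    ),
}


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def read(self, state):
        """Return the next (kind, value), tokenized under the caller's state."""
        if self.pos >= len(self.text):
            return ("EOF", "")
        for kind, pattern in RULES[state]:
            m = pattern.match(self.text, self.pos)
            if m:
                self.pos = m.end()
                return (kind, m.group(0))
        raise SyntaxError(f"unexpected input at {self.pos} in {state.name}")


def parse_word(lexer, state):
    """Parse until EOF (UNQUOTED) or until the closing quote (DQ)."""
    parts = []
    while True:
        kind, value = lexer.read(state)
        if kind == "EOF":
            if state is LexState.DQ:
                raise SyntaxError("unterminated double quote")
            return parts
        if kind == "DQUOTE":
            if state is LexState.DQ:
                return parts  # closing quote: return to the caller's state
            parts.append(("dq", parse_word(lexer, LexState.DQ)))
        elif kind == "ARITH_OPEN":
            parts.append(("arith", parse_arith(lexer)))
        else:
            parts.append((kind, value))


def parse_arith(lexer):
    """Collect tokens in the ARITH state until the closing ))."""
    tokens = []
    while True:
        kind, value = lexer.read(LexState.ARITH)
        if kind == "EOF":
            raise SyntaxError("unterminated $((")
        if kind == "ARITH_CLOSE":
            return tokens
        if kind != "SPACE":
            tokens.append((kind, value))


def parse(text):
    return parse_word(Lexer(text), LexState.UNQUOTED)


tree = parse('echo "x $(( 5:- 5 )) y"')
print(tree)
```

When `parse_arith` returns, the enclosing `parse_word` call resumes with its own `state` argument, so the lexer is back in `DQ` without anyone having to pop a stack by hand.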
This post hopefully gave you an idea of how oil parses shell scripts up front in a single pass. The lexer cannot run by itself -- it needs the parser to send it information so it knows what tokens to return.
The purpose in explaining this is to provide background for explaining the one
place where bash cannot be parsed up front: associative array indexing.
We'll see this tomorrow.