Grammar for Variable Substitutions

2016-10-26 (Last updated 2019-02-06)

At the bottom of this page is a grammar for the language accepted inside ${}. I wrote it to guide the implementation of a recursive descent parser.

I've tested it on over a hundred thousand lines of shell, and it now appears to parse everything correctly. The previous iteration had problems with code like ${@:1:2}.

Observations

This grammar describes part of what I call the "word language". Shell is actually composed of four interleaved sublanguages:
1. the command language: for, if, functions, ...
2. the word language: ${}, $(), $(()), ...
3. the arithmetic language: a**2 + b**2
4. the boolean language: [[ (which is statically parsed)

There are other mini-languages in shell, like globbing and brace expansion, but I don't consider them full-fledged languages because they're not recursive and there are no syntax errors.

I hope to publish grammars for all four languages at some point, but right now I'm doing the minimum to get a correct parser working.

The arithmetic language appears inside this grammar in two places: inside subscripts like ${a[x+1]} and inside slices like ${a:x+1:y+2}.

The grammar is recursive because a word can contain other words. Words like this are valid:

$ echo ${a-${a-${a-unset}}}
unset
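
To make the recursion concrete, here is a minimal Python sketch of a parser for just this one shape of word. It is an illustration under stated assumptions, not the OSH implementation: the names parse_var_sub and evaluate are hypothetical, only the '-' operator is handled, and a WORD is assumed to be either a nested ${...} or literal text containing no braces.

```python
import re

NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def parse_var_sub(s, i=0):
    """Parse '${' NAME ('-' WORD)? '}' at s[i]; return (tree, next_index).

    Hypothetical subset for illustration: WORD is either a nested ${...}
    or literal text that contains no braces.
    """
    if not s.startswith("${", i):
        raise SyntaxError(f"expected '${{' at position {i}")
    m = NAME_RE.match(s, i + 2)
    if m is None:
        raise SyntaxError(f"expected a name at position {i + 2}")
    name, i = m.group(), m.end()

    default = None
    if s.startswith("-", i):
        i += 1
        if s.startswith("${", i):
            # The recursive case: a word that contains another word.
            default, i = parse_var_sub(s, i)
        else:
            end = s.find("}", i)
            if end == -1:
                end = len(s)  # the missing '}' is reported just below
            default, i = s[i:end], end

    if not s.startswith("}", i):
        raise SyntaxError(f"expected '}}' at position {i}")
    return (name, default), i + 1


def evaluate(tree, env):
    """Evaluate a parsed tree: use the default only if the name is unset."""
    name, default = tree
    if name in env:
        return env[name]
    if default is None:
        return ""
    return evaluate(default, env) if isinstance(default, tuple) else default


tree, _ = parse_var_sub("${a-${a-${a-unset}}}")
print(tree)                # ('a', ('a', ('a', 'unset')))
print(evaluate(tree, {}))  # unset
```

The parse tree nests three levels deep, one per ${...}, and evaluating it with no variables set falls through to the innermost default, as in the transcript above.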

The # and ! tokens need two tokens of lookahead, i.e. LL(2). For example, the # token could be either a variable, as in ${#} (the number of arguments), or a prefix operator, as in ${#var} (the length of var).
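
The lookahead decision can be sketched in a few lines of Python. This is a simplified illustration, not the OSH lexer: tokenize and classify are hypothetical names, the returned strings just echo the production names from the grammar at the bottom of the page, and ambiguous inputs such as ${##} or ${#-} are deliberately left unhandled.

```python
import re

# NAME | NUMBER | any other single character
TOKEN_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*|[0-9]+|.")
NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
NUMBER_RE = re.compile(r"[0-9]+")


def tokenize(body):
    """Split the text between '${' and '}' into tokens, then append '}'."""
    return TOKEN_RE.findall(body) + ["}"]


def classify(body):
    """Decide which production applies by looking at no more than two tokens."""
    tokens = tokenize(body)
    first = tokens[0]
    second = tokens[1] if len(tokens) > 1 else None

    if first == "}":
        raise SyntaxError("empty substitution")
    if first not in ("#", "!"):
        return "VarExpr"      # one token of lookahead is enough here
    if second == "}":
        return "VarSymbol"    # ${#} or ${!}: the token is itself the variable
    if first == "#" and (
        NAME_RE.fullmatch(second)
        or NUMBER_RE.fullmatch(second)
        or second in ("@", "*")
    ):
        return "LengthExpr"   # ${#var}, ${#1}, ${#@}
    if first == "!" and NAME_RE.fullmatch(second):
        # ${!ref} vs ${!prefix*} needs more context; resolved later.
        return "RefOrKeys or PrefixQuery"
    raise NotImplementedError(f"not handled by this sketch: {body!r}")


for body in ["#", "#var", "!", "!ref", "var"]:
    print("${" + body + "}", "->", classify(body))
```

The point is that the first token alone never settles the question for # and !; the decision is made only after peeking at the token that follows.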

The grammar tries to strike a balance between being faithful to bash and following the philosophy of early errors. Bash accepts some code as syntactically valid, but then silently ignores part of it at runtime.

For example, bash allows multiple subscripts during parsing, but ignores them during execution:

$ array=(abc def ghi)
> echo ${array[0]}
> echo ${array[0][1]}
> echo ${array[0][1][2]}
> echo ${array[0][1][2] : 1 : 2}
abc
abc
abc
bc

Here is an example where a slice after the length operator is accepted, but ignored:

$ array=(abc def ghi jkl)
> echo ${#array[@]}  # length of array, OK
> echo ${array[@] : 1 : 2}  # slice of array, OK
> echo ${#array[@] : 1 : 2}  # why is this 4?
4
def ghi
4

The OSH parser disallows both of these constructs, since they don't seem to be implemented correctly.

The Grammar

NAME        = [a-zA-Z_][a-zA-Z0-9_]*
NUMBER      = [0-9]+                    # ${10}, ${11}, ...

Subscript   = '[' ('@' | '*' | ArithExpr) ']'
VarSymbol   = '!' | '@' | '#' | ...
VarOf       = NAME Subscript?
            | NUMBER   # no subscript allowed, none of these are arrays
            | VarSymbol

TEST_OP     = '-' | ':-' | '=' | ':=' | '+' | ':+' | '?' | ':?'
STRIP_OP    = '#' | '##' | '%' | '%%'
CASE_OP     = ',' | ',,' | '^' | '^^'

UnaryOp     = TEST_OP | STRIP_OP | CASE_OP | ...
Match       = ('/' | '#' | '%')? WORD  # match first / all / prefix / suffix
VarExpr     = VarOf
            | VarOf UnaryOp WORD
            | VarOf ':' ArithExpr (':' ArithExpr )?
            | VarOf '/' Match '/' WORD

LengthExpr  = '#' VarOf     # can't apply operators after length

RefOrKeys   = '!' VarExpr   # CAN apply operators after a named ref
                            # ${!ref[0]} vs ${!keys[@]} resolved later

PrefixQuery = '!' NAME ('*' | '@')   # list variable names with a prefix

VarSub      = LengthExpr
            | RefOrKeys
            | PrefixQuery
            | VarExpr

Again, this isn't the entire word grammar — it's the grammar for variable substitutions inside words.
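
As a rough illustration of how the productions map onto a recursive descent parser, here is a Python sketch covering a small subset: VarOf, Subscript, LengthExpr, and VarExpr with TEST_OP only. It is not the OSH parser. The class name VarSubParser, the parse helper, and the tuple shapes are all hypothetical. ArithExpr and WORD are kept as raw text, and slices, STRIP_OP, CASE_OP, Match, VarSymbol, RefOrKeys and PrefixQuery are left out.

```python
import re

NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
NUMBER_RE = re.compile(r"[0-9]+")
TEST_OP_RE = re.compile(r":?[-=+?]")


class VarSubParser:
    """One method per production; hypothetical subset of the grammar."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def fail(self, message):
        raise SyntaxError(f"{message} at position {self.pos} in {self.text!r}")

    def expect(self, literal):
        if not self.text.startswith(literal, self.pos):
            self.fail(f"expected {literal!r}")
        self.pos += len(literal)

    def match(self, regex):
        m = regex.match(self.text, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        return m.group()

    def take_until(self, closer):
        """Consume raw text up to, but not including, closer."""
        end = self.text.find(closer, self.pos)
        if end == -1:
            self.fail(f"missing {closer!r}")
        raw, self.pos = self.text[self.pos:end], end
        return raw

    def parse_subscript(self):
        # Subscript = '[' ('@' | '*' | ArithExpr) ']'
        self.expect("[")
        index = self.take_until("]")  # ArithExpr is kept as raw text
        self.expect("]")
        return index

    def parse_var_of(self):
        # VarOf = NAME Subscript? | NUMBER
        name = self.match(NAME_RE)
        if name is not None:
            index = None
            if self.text.startswith("[", self.pos):
                index = self.parse_subscript()  # at most one subscript
            return (name, index)
        number = self.match(NUMBER_RE)
        if number is not None:
            return (number, None)  # no subscript allowed on ${10}
        self.fail("expected a variable")

    def parse_var_sub(self):
        # VarSub = LengthExpr | VarExpr
        self.expect("${")
        if self.text.startswith("#", self.pos):
            # LengthExpr = '#' VarOf; no operators may follow
            self.pos += 1
            node = ("length", self.parse_var_of())
        else:
            # VarExpr = VarOf | VarOf TEST_OP WORD (WORD kept as raw text)
            var = self.parse_var_of()
            op = self.match(TEST_OP_RE)
            arg = self.take_until("}") if op is not None else None
            node = ("var", var, op, arg)
        self.expect("}")
        if self.pos != len(self.text):
            self.fail("unexpected trailing text")
        return node


def parse(text):
    return VarSubParser(text).parse_var_sub()


print(parse("${array[0]}"))    # ('var', ('array', '0'), None, None)
print(parse("${#array[@]}"))   # ('length', ('array', '@'))
print(parse("${a:-default}"))  # ('var', ('a', None), ':-', 'default')
```

Because parse_var_of accepts at most one subscript and the length branch goes straight to the closing brace, parse("${array[0][1]}") and parse("${#array[@] : 1 : 2}") both raise SyntaxError in this sketch, in line with the early-error choice described above.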