blog | oilshell.org

Grammar for Variable Substitutions

2016-10-26 (Last updated 2019-02-06)

At the bottom of this page is a grammar for the language accepted inside ${}. I wrote it to guide the implementation of a recursive descent parser.

I've tested it on over a hundred thousand lines of shell, and it now appears to parse everything correctly. The previous iteration had problems with code like ${@:1:2}.

Observations

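Substitutions can nest, because the WORD after an operator may itself contain another substitution:

$ echo ${a-${a-${a-unset}}}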
unset
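To make the nesting concrete, here is a minimal Python sketch of how the `-` operator could be expanded recursively. It is an illustration only, not how bash or OSH does it: the names `expand`, `find_close`, and `expand_sub` are hypothetical, only `${NAME}` and `${NAME-WORD}` are handled, and quoting and escaping are ignored.

```python
def find_close(s, start):
    """Return the index of the '}' matching a '${' whose body begins at start."""
    depth = 1
    i = start
    while i < len(s):
        if s.startswith("${", i):
            depth += 1          # a nested substitution opens
            i += 2
            continue
        if s[i] == "}":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    raise ValueError("unterminated ${")


def expand_sub(body, env):
    """Expand the text inside ${}: either NAME or NAME-WORD."""
    name, op, default = body.partition("-")
    if name in env:             # '-' (no colon) only tests for unset
        return env[name]
    # The default WORD is expanded recursively, which is what allows nesting.
    return expand(default, env) if op else ""


def expand(word, env):
    """Expand every ${...} in word, copying other characters through."""
    out = []
    i = 0
    while i < len(word):
        if word.startswith("${", i):
            end = find_close(word, i + 2)
            out.append(expand_sub(word[i + 2:end], env))
            i = end + 1
        else:
            out.append(word[i])
            i += 1
    return "".join(out)
```

With an empty environment, `expand("${a-${a-${a-unset}}}", {})` returns `"unset"`, matching the transcript above; with `{"a": "x"}` it returns `"x"`.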

The grammar tries to strike a balance between being faithful to bash and following the philosophy of early errors. Bash accepts some code as syntactically valid, but then ignores part of it or doesn't interpret it correctly at runtime.

For example, bash allows multiple subscripts during parsing, but ignores them during execution:

$ array=(abc def ghi)
> echo ${array[0]}
> echo ${array[0][1]}
> echo ${array[0][1][2]}
> echo ${array[0][1][2] : 1 : 2}
abc
abc
abc
bc

Here is an example where a slice after the length operator is accepted, but ignored:

$ array=(abc def ghi jkl)
> echo ${#array[@]}  # length of array, OK
> echo ${array[@] : 1 : 2}  # slice of array, OK
> echo ${#array[@] : 1 : 2}  # why is this 4?
4
def ghi
4

The OSH parser disallows both of these constructs, since they don't seem to be implemented correctly.
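As a rough illustration of rejecting both constructs early, the Python sketch below checks the text inside `${}` with a regular expression. It is not the OSH implementation: the function name `check` and the error strings are made up for this example, and subscripts are assumed not to contain nested brackets.

```python
import re

NAME = r"[a-zA-Z_][a-zA-Z0-9_]*"
SUBSCRIPT = r"\[[^\]]*\]"        # assumption: no nested brackets

# Optional '#', a name, any number of subscripts, then whatever follows.
VAR = re.compile(r"(#?)(%s)((?:%s)*)(.*)$" % (NAME, SUBSCRIPT), re.S)


def check(body):
    """Return an error string for the two constructs above, else None."""
    m = VAR.match(body)
    if m is None:
        return None              # not a named variable; outside this sketch
    is_length, _name, subscripts, rest = m.groups()
    if len(re.findall(SUBSCRIPT, subscripts)) > 1:
        return "multiple subscripts"
    if is_length and rest.strip():
        return "operator after length"
    return None
```

Under these assumptions, `check("array[0][1]")` returns `"multiple subscripts"`, `check("#array[@] : 1 : 2")` returns `"operator after length"`, and `check("array[@] : 1 : 2")` returns `None`.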

The Grammar

NAME        = [a-zA-Z_][a-zA-Z0-9_]*
NUMBER      = [0-9]+                    # ${10}, ${11}, ...

Subscript   = '[' ('@' | '*' | ArithExpr) ']'
VarSymbol   = '!' | '@' | '#' | ...
VarOf       = NAME Subscript?
            | NUMBER   # no subscript allowed, none of these are arrays
            | VarSymbol

TEST_OP     = '-' | ':-' | '=' | ':=' | '+' | ':+' | '?' | ':?'
STRIP_OP    = '#' | '##' | '%' | '%%'
CASE_OP     = ',' | ',,' | '^' | '^^'

UnaryOp     = TEST_OP | STRIP_OP | CASE_OP | ...
Match       = ('/' | '#' | '%')? WORD  # optional modifier: match all / prefix / suffix
VarExpr     = VarOf
            | VarOf UnaryOp WORD
            | VarOf ':' ArithExpr (':' ArithExpr )?
            | VarOf '/' Match '/' WORD

LengthExpr  = '#' VarOf     # can't apply operators after length

RefOrKeys   = '!' VarExpr   # CAN apply operators after a named ref
                            # ${!ref[0]} vs ${!keys[@]} resolved later

PrefixQuery = '!' NAME ('*' | '@')   # list variable names with a prefix

VarSub      = LengthExpr
            | RefOrKeys
            | PrefixQuery
            | VarExpr

Note that this isn't the entire word grammar; it covers only variable substitutions inside words.
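To show how the grammar maps onto a recursive descent parser, here is a small Python sketch with one method per rule. It is a simplified illustration, not the OSH parser. The assumptions: the input is the text between `${` and `}`; WORD and ArithExpr are kept as opaque strings; VarSymbol is limited to `@` and `*`; the `/` replacement form is omitted; and the class name `VarSubParser` and the tuple node shapes are invented for this example.

```python
import re

NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
NUMBER = re.compile(r"[0-9]+")
PREFIX_QUERY = re.compile(r"([a-zA-Z_][a-zA-Z0-9_]*)([*@])$")
# Longest operators first, so ':-' is tried before '-' and '##' before '#'.
UNARY_OPS = [":-", ":=", ":+", ":?", "##", "%%", ",,", "^^",
             "-", "=", "+", "?", "#", "%", ",", "^"]


class ParseError(Exception):
    pass


class VarSubParser:
    def __init__(self, body):
        self.s = body            # the text between '${' and '}'
        self.pos = 0

    def at_end(self):
        return self.pos == len(self.s)

    def eat(self, literal):
        if self.s.startswith(literal, self.pos):
            self.pos += len(literal)
            return True
        return False

    def rest(self):
        # Consume everything that remains (an opaque WORD or ArithExpr).
        text, self.pos = self.s[self.pos:], len(self.s)
        return text

    def subscript(self):         # Subscript = '[' ... ']', at most once
        if not self.eat("["):
            return None
        close = self.s.find("]", self.pos)
        if close == -1:
            raise ParseError("unterminated subscript")
        index, self.pos = self.s[self.pos:close], close + 1
        return index

    def var_of(self):            # VarOf = NAME Subscript? | NUMBER | VarSymbol
        m = NAME.match(self.s, self.pos)
        if m:
            self.pos = m.end()
            return ("Var", m.group(), self.subscript())
        m = NUMBER.match(self.s, self.pos)
        if m:
            self.pos = m.end()
            return ("Var", m.group(), None)    # no subscript allowed
        for symbol in "@*":
            if self.eat(symbol):
                return ("Var", symbol, None)
        raise ParseError("expected a variable at offset %d" % self.pos)

    def var_expr(self):          # VarExpr, minus the '/' replacement form
        var = self.var_of()
        if self.at_end():
            return var
        for op in UNARY_OPS:
            if self.eat(op):
                return ("Unary", op, var, self.rest())
        if self.eat(":"):        # slice: ':' ArithExpr (':' ArithExpr)?
            begin, _, length = self.rest().partition(":")
            return ("Slice", var, begin, length or None)
        raise ParseError("unexpected %r" % self.s[self.pos:])

    def var_sub(self):           # VarSub: the top-level rule
        if self.eat("#"):
            node = ("Length", self.var_of())   # no operators after length
        elif self.eat("!"):
            m = PREFIX_QUERY.match(self.s, self.pos)
            if m:
                self.pos = m.end()
                node = ("PrefixQuery", m.group(1), m.group(2))
            else:
                node = ("RefOrKeys", self.var_expr())
        else:
            node = self.var_expr()
        if not self.at_end():
            raise ParseError("unexpected %r" % self.s[self.pos:])
        return node


def parse(body):
    return VarSubParser(body).var_sub()
```

In this sketch, `parse("@:1:2")` returns `("Slice", ("Var", "@", None), "1", "2")`, while `parse("array[0][1]")` and `parse("#array[@]:1:2")` both raise `ParseError`, because the rules allow at most one subscript and nothing after a length expression.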