At the bottom of this page is a grammar for the language accepted inside ${}.
I used it to help me write a recursive descent parser.
I've tested it on over a hundred thousand lines of shell, and it appears to
parse everything correctly. (The previous iteration had problems with
constructs like ${@:1:2}.)
Some observations:
This grammar describes part of what I call the "word language". Shell is actually composed of four interleaved sublanguages:
1. The command language (for, if, functions, ...)
2. The word language (${}, $(), $(()), ...)
3. The arithmetic language (a**2 + b**2)
4. The boolean language ([[, which is statically parsed)
There are other mini-languages in shell, like globbing and brace expansion, but I don't consider them full-fledged languages because they're not recursive and there are no syntax errors.
I hope to publish grammars for all 4 languages at some point, but right now I'm just doing the minimum to get a correct parser working.
The arithmetic language is interleaved in two places: inside subscripts
${a[x+1]} and inside slicing ${a:x+1:y+2}.
It's a recursive grammar because it contains other words. Words like this are valid:
$ echo ${a-${a-${a-unset}}}
unset
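The recursion can be shown in a few lines of Python. This is a minimal sketch, not the Oil parser: it assumes words consist only of literal text and ${NAME-WORD} substitutions, and the function names eval_word and eval_var_sub are made up for illustration. Unlike a real shell, it also evaluates the default word even when the variable is set.

```python
def eval_word(s, i, env, stop=None):
    """Evaluate a word starting at s[i]; stop at the `stop` character if given.

    Returns (value, index just past the word).
    """
    out = []
    while i < len(s) and s[i] != stop:
        if s.startswith('${', i):
            value, i = eval_var_sub(s, i + 2, env)  # recursion: word -> var sub
            out.append(value)
        else:
            out.append(s[i])  # literal character
            i += 1
    return ''.join(out), i


def eval_var_sub(s, i, env):
    """Evaluate NAME or NAME '-' WORD; s[i] is the first char after '${'."""
    start = i
    while i < len(s) and (s[i].isalnum() or s[i] == '_'):
        i += 1
    name = s[start:i]
    if not name:
        raise SyntaxError('expected a variable name at %d' % start)
    default = None
    if i < len(s) and s[i] == '-':
        # recursion: var sub -> word
        default, i = eval_word(s, i + 1, env, stop='}')
    if i >= len(s) or s[i] != '}':
        raise SyntaxError('expected } at %d' % i)
    value = env[name] if name in env else default
    return (value if value is not None else ''), i + 1


print(eval_word('${a-${a-${a-unset}}}', 0, {})[0])          # unset
print(eval_word('${a-${a-${a-unset}}}', 0, {'a': 'x'})[0])  # x
```

The two functions call each other, which mirrors the way the grammar's VarExpr rule refers back to WORD.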
The # and ! tokens need LL(2) lookahead. For example, the # token
could be either a variable like ${#} (length of arguments array), or a
prefix operator like ${#var}.
The grammar tries to strike a balance between being faithful to bash and
following the philosophy of early errors. Bash accepts some code as
syntactically valid, but doesn't interpret it correctly.
For example, bash allows multiple subscripts during parsing, but ignores them during execution:
$ array=(abc def ghi)
> echo ${array[0]}
> echo ${array[0][1]}
> echo ${array[0][1][2]}
> echo ${array[0][1][2] : 1 : 2}
abc
abc
abc
bc
Here is an example where slices are accepted, but ignored:
$ array=(abc def ghi jkl)
> echo ${#array[@]}  # length of array, OK
> echo ${array[@] : 1 : 2}  # slice of array, OK
> echo ${#array[@] : 1 : 2}  # why is this 4?
4
def ghi
4
The oil parser disallows both of these constructs, since they don't seem to be implemented correctly.
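To make the "early errors" idea concrete, here is a rough Python sketch that rejects the two constructs above at parse time. It is an illustration under simplifying assumptions, not Oil's code: it only handles named variables (not ${#} or ${@}), it assumes subscripts contain no nested brackets, and check_var_sub is a hypothetical name.

```python
import re

NAME_RE = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*')
SUBSCRIPT_RE = re.compile(r'\[[^\[\]]*\]')  # assumption: no nested brackets


def check_var_sub(body):
    """Check the text between '${' and '}' for the two constructs above.

    Raises SyntaxError instead of silently ignoring part of the input.
    """
    is_length = body.startswith('#') and len(body) > 1
    pos = 1 if is_length else 0

    m = NAME_RE.match(body, pos)
    if not m:
        raise SyntaxError('expected a variable name in %r' % body)
    pos = m.end()

    m = SUBSCRIPT_RE.match(body, pos)
    if m:
        pos = m.end()
        if body.startswith('[', pos):
            # bash parses ${array[0][1]} but ignores the second subscript
            raise SyntaxError('multiple subscripts in %r' % body)

    rest = body[pos:].strip()
    if is_length and rest:
        # bash parses ${#array[@] : 1 : 2} but ignores the slice
        raise SyntaxError('operator after length in %r' % body)
    return True


for body in ['array[0]', 'array[0][1]', '#array[@]',
             'array[@] : 1 : 2', '#array[@] : 1 : 2']:
    try:
        check_var_sub(body)
        print('ok     ${%s}' % body)
    except SyntaxError as e:
        print('error  %s' % e)
```

The second and fifth inputs are reported as errors; the other three pass.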
(This is just part of the word grammar)
NAME = [a-zA-Z_][a-zA-Z0-9_]*
NUMBER = [0-9]+ # ${10}, ${11}, ...
Subscript = '[' ('@' | '*' | ArithExpr) ']'
VarSymbol = '!' | '@' | '#' | ...
VarOf = NAME Subscript?
      | NUMBER  # no subscript allowed, none of these are arrays
      | VarSymbol
TEST_OP = '-' | ':-' | '=' | ':=' | '+' | ':+' | '?' | ':?'
STRIP_OP = '#' | '##' | '%' | '%%'
CASE_OP = ',' | ',,' | '^' | '^^'
UnaryOp = TEST_OP | STRIP_OP | CASE_OP | ...
Match = ('/' | '#' | '%')? WORD # optional modifier: match all / prefix / suffix
VarExpr = VarOf
        | VarOf UnaryOp WORD
        | VarOf ':' ArithExpr (':' ArithExpr)?
        | VarOf '/' Match '/' WORD
LengthExpr = '#' VarOf # can't apply operators after length
RefOrKeys = '!' VarExpr  # CAN apply operators after a named ref
                         # ${!ref[0]} vs ${!keys[@]} resolved later
PrefixQuery = '!' NAME ('*' | '@') # list variable names with a prefix
VarSub = LengthExpr
       | RefOrKeys
       | PrefixQuery
       | VarExpr
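As an illustration of the two-token lookahead for # and !, here is a small Python sketch covering only a fragment of VarSub: ${#}, ${#var}, ${!}, ${!ref} and plain ${var}. The token set, the function names tokenize and parse_var_sub, and the (rule, variable) result shape are assumptions for this example. Subscripts, operators and PrefixQuery are deliberately left out and reported as errors.

```python
import re

# Assumption: a tiny token set, enough for the fragment described above.
TOKEN_RE = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*|[0-9]+|[#!@*]')


def tokenize(body):
    """Split the text between '${' and '}' into tokens; None marks the end."""
    tokens, pos = [], 0
    while pos < len(body):
        m = TOKEN_RE.match(body, pos)
        if not m:
            raise SyntaxError('unexpected %r in %r' % (body[pos], body))
        tokens.append(m.group())
        pos = m.end()
    return tokens + [None, None]  # padding so two tokens can always be read


def parse_var_sub(body):
    """Return (rule, variable) for the fragment of VarSub described above."""
    toks = tokenize(body)
    first, second = toks[0], toks[1]

    if first in ('#', '!') and second is not None:
        # A second token is present, so '#' or '!' is a prefix operator.
        rule = 'LengthExpr' if first == '#' else 'RefOrKeys'
        var, rest = second, toks[2:]
    else:
        # Nothing follows (or there is no prefix): the token is the variable.
        rule = 'VarOf'
        var, rest = first, toks[1:]

    if var is None:
        raise SyntaxError('empty substitution')
    if rest[0] is not None:
        raise SyntaxError('trailing %r not handled by this sketch' % rest[0])
    return rule, var


print(parse_var_sub('#'))     # ('VarOf', '#')
print(parse_var_sub('#var'))  # ('LengthExpr', 'var')
print(parse_var_sub('!'))     # ('VarOf', '!')
print(parse_var_sub('!ref'))  # ('RefOrKeys', 'ref')
```

The parser cannot tell what # means from that token alone; it has to look at the token after it, which is why one token of lookahead is not enough.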