Lexical State and How We Use It


What do the characters :- mean in this code?

$ echo "foo:-bar  ${foo:-bar}  $(( foo > 1 ? 5:- 5 ))"
foo:-bar  bar  -5

They mean 3 different things, depending on the context:

  1. Literal characters to be printed to stdout
  2. The "if empty or unset" operator within ${}
  3. The : in the C-style ternary operator, then the unary minus operator for negation

This illustrates the concept of lexical state: the same characters become different tokens depending on where the lexer is in the program. It's a pretty easy concept, but I suspect it has a name because it deviates from what you might have learned in a compiler class, and because not all lexer generators support it.
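Here is a minimal sketch of the idea in Python. It is not Oil's actual lexer: the state names, token names, and regexes are all made up for illustration, and each state only knows enough rules to handle the :- example above. The point is just that the lexer takes a state as an argument and picks its rule table from it.

```python
import re
from enum import Enum, auto


class LexState(Enum):
    # Hypothetical state names, chosen for this sketch only.
    DQ = auto()       # inside double quotes
    VAR_SUB = auto()  # inside ${ }
    ARITH = auto()    # inside $(( ))


# One ordered rule table per state: (regex, token type). First match wins.
RULES = {
    LexState.DQ: [
        (re.compile(r'[^$"\\]+'), 'LIT_CHARS'),
    ],
    LexState.VAR_SUB: [
        (re.compile(r'[A-Za-z_][A-Za-z0-9_]*'), 'NAME'),
        (re.compile(r':-'), 'OP_DEFAULT'),
    ],
    LexState.ARITH: [
        (re.compile(r'\s+'), 'SPACE'),
        (re.compile(r'[0-9]+'), 'NUMBER'),
        (re.compile(r':'), 'COLON'),
        (re.compile(r'-'), 'MINUS'),
    ],
}


def read_token(state, text, pos):
    """Return (token_type, value, new_pos) for the token at pos, lexed in `state`."""
    for pattern, tok_type in RULES[state]:
        m = pattern.match(text, pos)
        if m:
            return tok_type, m.group(0), m.end()
    raise ValueError('no rule matches %r in state %s' % (text[pos:], state.name))


def tokenize(state, text):
    """Tokenize text entirely in one state, which is enough to show the contrast."""
    pos, out = 0, []
    while pos < len(text):
        tok_type, value, pos = read_token(state, text, pos)
        out.append((tok_type, value))
    return out


print(tokenize(LexState.DQ, ':-'))       # [('LIT_CHARS', ':-')]
print(tokenize(LexState.VAR_SUB, ':-'))  # [('OP_DEFAULT', ':-')]
print(tokenize(LexState.ARITH, ':-'))    # [('COLON', ':'), ('MINUS', '-')]
```

The same two characters come back as one literal token, one operator token, or two separate operator tokens, which matches the three meanings listed above.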

(I'm taking the name from the DSL Book by Martin Fowler. Chapter 28, "Alternative Tokenization", talks about "completely replacing the lexer" when you get a certain token.)

In oil, there are currently 8 lexical states, named:

And there are actually two more unimplemented lexical states:

In the implementation of many languages, you can get away with a single lexical state -- that is, no state. Most languages have string literals, where you could consider \t a token. But you can work around that by writing code that treats the whole double-quoted string (e.g. "a\tb\tc") as a single token, and that seems to be what most people do.
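As a rough illustration of that workaround, here is how a stateless lexer might swallow a whole string literal with one regex. This is a generic sketch, not taken from any particular language implementation; the names STRING_LITERAL and find_string_literals are made up.

```python
import re

# A double-quoted literal: opening quote, then any mix of ordinary characters
# and backslash escapes, then the closing quote.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')


def find_string_literals(source):
    """Return every double-quoted literal in source, each as a single token."""
    return STRING_LITERAL.findall(source)


print(find_string_literals(r'x = "a\tb\tc" + "d"'))
```

The escapes inside the literal are never tokens of their own. They get decoded later, after lexing, which only works because nothing inside the quotes needs to be parsed.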

You can't do this in shell, because a double-quoted string can contain an entire subprogram:

echo "BEGIN $(if grep pat input.txt; then echo 'FOUND'; fi) END"

That is, the "token" would have a recursive tree structure, which means it's not really a token anymore. The lexical state pattern lets us easily handle these more complicated string literals.

Example: Where the ARITH state is used

Where do lexical states get used in the oil shell parser? Let's just take the ARITH state. It gets used in all of these places:

So when the parser sees a $(( token, it knows it must start calling the lexer with the LexState.ARITH state rather than, say, LexState.UNQUOTED.

Likewise, when it sees a ${ it will switch to the BRACED_VAR_SUB_1 state.

The current state can be stored on the stack, since paired delimiters like "quotes", ${}, and $(()) can be parsed naturally with recursive function calls.
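To make that concrete, here is a small sketch of a recursive parser that drives a stateful lexer. It is a toy, not Oil's code: it models only two made-up states (DQ and VAR_SUB), the token and function names are hypothetical, and it simplifies real shell by allowing only a name, a nested ${}, or a double-quoted string after the :- operator. Each parse function passes its own state to the lexer, so the "current state" is simply whichever function is on top of the call stack.

```python
import re
from enum import Enum, auto


class LexState(Enum):
    # Hypothetical names; only two states are modelled here.
    DQ = auto()       # inside double quotes
    VAR_SUB = auto()  # inside ${ }


RULES = {
    LexState.DQ: [
        (re.compile(r'\$\{'), 'LEFT_VAR_SUB'),
        (re.compile(r'"'), 'DQUOTE'),
        (re.compile(r'[^$"]+'), 'LIT_CHARS'),
    ],
    LexState.VAR_SUB: [
        (re.compile(r'\$\{'), 'LEFT_VAR_SUB'),
        (re.compile(r'\}'), 'RIGHT_BRACE'),
        (re.compile(r':-'), 'OP_DEFAULT'),
        (re.compile(r'"'), 'DQUOTE'),
        (re.compile(r'[A-Za-z_][A-Za-z0-9_]*'), 'NAME'),
    ],
}


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def read(self, state):
        """Return the next (type, value), matched with the rules of `state`."""
        for pattern, tok_type in RULES[state]:
            m = pattern.match(self.text, self.pos)
            if m:
                self.pos = m.end()
                return tok_type, m.group(0)
        raise ValueError('unexpected input at %d in %s' % (self.pos, state.name))


def expect(lx, state, wanted):
    """Read one token in `state` and insist on its type."""
    tok_type, value = lx.read(state)
    if tok_type != wanted:
        raise ValueError('expected %s, got %s' % (wanted, tok_type))
    return value


def parse_dq(lx):
    """Called after an opening quote. This frame always lexes in DQ."""
    parts = []
    while True:
        tok_type, value = lx.read(LexState.DQ)
        if tok_type == 'DQUOTE':
            return ('DQ', parts)
        if tok_type == 'LEFT_VAR_SUB':
            parts.append(parse_var_sub(lx))  # callee lexes in VAR_SUB
        else:
            parts.append(('LIT', value))


def parse_var_sub(lx):
    """Called after ${ . This frame always lexes in VAR_SUB."""
    name = expect(lx, LexState.VAR_SUB, 'NAME')
    tok_type, _ = lx.read(LexState.VAR_SUB)
    if tok_type == 'RIGHT_BRACE':
        return ('VAR', name, None)
    if tok_type != 'OP_DEFAULT':
        raise ValueError('expected } or :- after %s' % name)
    tok_type, value = lx.read(LexState.VAR_SUB)
    if tok_type == 'LEFT_VAR_SUB':
        default = parse_var_sub(lx)
    elif tok_type == 'DQUOTE':
        default = parse_dq(lx)  # back to DQ, one level deeper
    elif tok_type == 'NAME':
        default = ('LIT', value)
    else:
        raise ValueError('bad default value for %s' % name)
    expect(lx, LexState.VAR_SUB, 'RIGHT_BRACE')
    return ('VAR', name, default)


def parse(text):
    """Parse one double-quoted string, including its surrounding quotes."""
    lx = Lexer(text)
    expect(lx, LexState.DQ, 'DQUOTE')
    return parse_dq(lx)


print(parse('"x:-y ${foo:-"${bar}"}"'))
```

When the inner parse_dq returns, control falls back into parse_var_sub, which goes on reading in VAR_SUB without anyone having to restore a saved state. Note that the first :- comes back as part of a literal, while the second one is an operator.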


This post hopefully gave you an idea of how oil parses shell scripts up front in a single pass. The lexer cannot run by itself -- it needs the parser to send it information so it knows what tokens to return.

The purpose of explaining this is to provide background for the one place where bash cannot be parsed up front: associative array indexing. We'll see this tomorrow.