OSH uses a simple lexing technique to recognize the shell's many sublanguages in a single pass. I now call it modal lexing.
This is how we address the language composition problem.
What do the characters :- mean in this code?
$ echo "foo:-bar ${foo:-bar} $(( foo > 1 ? 5:- 5 ))"
foo:-bar bar -5
Three different things, depending on the context:
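- Literal characters to be printed to stdout.
- The operator within ${} that substitutes a default value when the variable is unset or empty.
- The : in the C-style ternary operator, then the unary minus operator for negation.
This is why we need lexer modes. Most lexers look like this: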
Token Read() // return the next token
But our modal lexer has an "enum" parameter:
// return the next token, using rules determined by the mode
Token Read(lex_mode_t mode)
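To make the signature above concrete, here is a minimal Python sketch of a lexer whose read method takes a mode. It is not the OSH implementation: the token names (Lit_Chars, VOp_ColonHyphen, Arith_Colon, Arith_Minus) and the tiny rule tables are assumptions made up for illustration, and only three of the modes are modeled.

```python
import re
from enum import Enum, auto


class LexMode(Enum):
    UNQUOTED = auto()
    BRACED_VAR_SUB_2 = auto()
    ARITH = auto()


# Each mode has its own ordered list of (token type, regex) rules.
# Hypothetical rules for illustration; a real shell lexer has many more.
RULES = {
    LexMode.UNQUOTED: [
        ("Space", re.compile(r"\s+")),
        ("Lit_Chars", re.compile(r"[^\s$\"']+")),
    ],
    LexMode.BRACED_VAR_SUB_2: [
        ("VOp_ColonHyphen", re.compile(r":-")),
        ("Right_Brace", re.compile(r"\}")),
    ],
    LexMode.ARITH: [
        ("Space", re.compile(r"\s+")),
        ("Arith_Number", re.compile(r"[0-9]+")),
        ("Arith_Name", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
        ("Arith_Colon", re.compile(r":")),
        ("Arith_Minus", re.compile(r"-")),
    ],
}


class Lexer:
    def __init__(self, line):
        self.line = line
        self.pos = 0

    def read(self, mode):
        """Return the next (type, text) token, using the rules for `mode`."""
        if self.pos >= len(self.line):
            return ("Eof", "")
        for tok_type, pattern in RULES[mode]:
            m = pattern.match(self.line, self.pos)
            if m:
                self.pos = m.end()
                return (tok_type, m.group())
        raise ValueError("no rule in %s matches at %d" % (mode.name, self.pos))


# The same two characters produce different tokens depending on the mode.
print(Lexer(":-").read(LexMode.UNQUOTED))          # ('Lit_Chars', ':-')
print(Lexer(":-").read(LexMode.BRACED_VAR_SUB_2))  # ('VOp_ColonHyphen', ':-')
arith = Lexer(":-")
print(arith.read(LexMode.ARITH))                   # ('Arith_Colon', ':')
print(arith.read(LexMode.ARITH))                   # ('Arith_Minus', '-')
```

The lexer itself keeps only a position. Which rule table applies is decided by the caller on every call, which is the whole idea.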
The concept is easy, but it needs a name because:
In OSH, there are currently 8 modes:
- UNQUOTED: the start mode, for echo foo
- SQ: for 'single-quoted strings'
- DQ: for "double quoted strings, which allow $ expansions"
- ARITH: for arithmetic expressions, e.g. in $((1+2))
- BRACED_VAR_SUB_1: for the first token after ${
- BRACED_VAR_SUB_2: for a token in ${ after a name, like :- in ${foo:-bar}
- UNQUOTED_VAROP: for the argument after an operator, e.g. ${foo:-var op}
- DQ_VAROP: for the argument after an operator when double quoted, e.g. "${foo:-var op}"
And there are two more unimplemented modes:
- DOLLAR_SQ: for $'\n' -- literal strings that accept C escapes
- REGEX: for the right hand side of [[ foo =~ ^foo$ ]]. (I believe the existence of this mode is a bug in bash, but let's discuss that later.)
(2019 Update: I published an updated list of lexer modes.)
In the implementation of many languages, you can get by without any modes.
Most languages have string literals, where you could consider \t a token, but
that can be worked around by writing code to treat the whole double-quoted
string (e.g. "a\tb\tc") as a single token (and that seems to be what most
languages do).
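As a rough illustration of that workaround, here is a Python sketch in which a single regular expression consumes a whole double-quoted literal, escapes included, so the lexer needs no mode for the inside of the string. The function name read_string_token and the "String" token type are hypothetical.

```python
import re

# Matches an opening quote, then any run of ordinary characters or
# backslash escapes, then the closing quote.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')


def read_string_token(source, pos):
    """Return (token, new_pos) for the string literal starting at `pos`."""
    m = STRING_LITERAL.match(source, pos)
    if m is None:
        raise ValueError("no string literal at position %d" % pos)
    return ("String", m.group()), m.end()


source = r'x = "a\tb\tc";'
token, end = read_string_token(source, 4)
print(token)  # ('String', '"a\\tb\\tc"')
```

This only works because the contents of the literal are flat. Nothing inside the quotes needs to be parsed as a nested program.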
You can't do this in shell, because a double-quoted string can contain an entire subprogram:
echo "BEGIN $(if grep pat input.txt; then echo 'FOUND'; fi) END"
That is, the "token" would have a recursive tree structure, which means it's not really a token anymore. The modal lexer pattern lets us easily handle these more complicated string literals.
For examples of modes in other languages, see When Are Lexer Modes Useful?. I observe that both Python and JavaScript have grown shell-like string interpolation in the last 10 years.
Where do modes get used in the OSH parser? Let's consider the ARITH mode.
It gets used in all of these places:
- (( y = x + 2 )) (useful in if or while conditions)
- let syntax for arithmetic commands: let y=x+2
- echo $(( y = x + 2 )) and the bash alias $[y = x + 2]
- for ((i=0; i<5; ++i)); do echo; done. This is distinct from the (( command because it uses the ; token.
- echo ${a : i : i+length}
- echo ${a[x + 2]} : R-value subscript
- a[x + 2]=foo : L-value subscript
- a=([x + 2]=foo [y]=z) : literal subscript
So when the parser sees a $(( token, it starts calling the lexer with lex_mode_t.ARITH, rather than, say, lex_mode_t.UNQUOTED.
Likewise, when it sees a ${ it will switch to lex_mode_t.BRACED_VAR_SUB_1.
The current mode can be stored on the stack, since paired delimiters like
"quotes", ${}, and $(()) are naturally parsed with recursive function
calls.
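Here is a small Python sketch of how the mode can live on the call stack. It handles only the body of a double-quoted string containing ${name} and ${name:-default}, and it is an assumption-laden toy rather than the OSH parser: the token names, the rule tables, and the function names are all made up for illustration.

```python
import re
from enum import Enum, auto


class LexMode(Enum):
    DQ = auto()                # inside "..."
    BRACED_VAR_SUB_1 = auto()  # first token after ${
    BRACED_VAR_SUB_2 = auto()  # after the name, e.g. :- or }
    DQ_VAROP = auto()          # operator argument inside "${x:-...}"


_LEFT_BRACE = ("Left_DollarBrace", r"\$\{")
_DOLLAR = ("Lit_Dollar", r"\$")  # a lone $ is kept as literal text
# Hypothetical rule tables, tried in order for the given mode.
RULES = {
    LexMode.DQ: [_LEFT_BRACE, ("Right_DQuote", r'"'),
                 ("Lit_Chars", r'[^$"]+'), _DOLLAR],
    LexMode.BRACED_VAR_SUB_1: [("VSub_Name", r"[A-Za-z_][A-Za-z0-9_]*")],
    LexMode.BRACED_VAR_SUB_2: [("VOp_ColonHyphen", r":-"),
                               ("Right_Brace", r"\}")],
    LexMode.DQ_VAROP: [_LEFT_BRACE, ("Right_Brace", r"\}"),
                       ("Lit_Chars", r'[^$}"]+'), _DOLLAR],
}


class Lexer:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def read(self, mode):
        for tok_type, pattern in RULES[mode]:
            m = re.compile(pattern).match(self.text, self.pos)
            if m:
                self.pos = m.end()
                return tok_type, m.group()
        raise SyntaxError("unexpected input at %d in mode %s"
                          % (self.pos, mode.name))


def parse_braced_var_sub(lexer):
    """Called after ${ was read. Returns ('VarSub', name, default_parts)."""
    _, name = lexer.read(LexMode.BRACED_VAR_SUB_1)
    tok_type, _ = lexer.read(LexMode.BRACED_VAR_SUB_2)
    if tok_type == "Right_Brace":
        return ("VarSub", name, None)
    # Saw :- so the argument is lexed in DQ_VAROP until the closing brace.
    return ("VarSub", name,
            parse_parts(lexer, LexMode.DQ_VAROP, "Right_Brace"))


def parse_parts(lexer, mode, closer):
    """Read parts in `mode` until `closer`; `mode` is local to this frame."""
    parts = []
    while True:
        tok_type, text = lexer.read(mode)
        if tok_type == closer:
            return parts
        if tok_type == "Left_DollarBrace":
            # The recursive call uses other modes; when it returns,
            # this frame resumes with its own `mode` unchanged.
            parts.append(parse_braced_var_sub(lexer))
        else:
            parts.append(("Lit", text))


def parse_double_quoted(body):
    """Parse text that follows an opening quote, up to the closing quote."""
    return parse_parts(Lexer(body), LexMode.DQ, "Right_DQuote")


print(parse_double_quoted('a ${foo:-x ${bar:-y} z} b"'))
```

There is no explicit mode stack anywhere. Each call to parse_parts holds its mode in a local variable, so returning from a nested ${...} restores the outer mode for free.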
This post described how the OSH lexer supports parsing shell scripts in a single pass. The lexer cannot run by itself — it needs the parser to send it information so it knows what tokens to return.
This is useful background for explaining the one place where bash cannot be
parsed up front: associative array indexing. We'll see this tomorrow.
Note: This post was formerly titled Lexical State and How We Use It.
I'm using "modal lexing" over "lexical state" because the OSH
lexer has other state, like a stack of hints to disambiguate the many
meanings of ).
I found the term "lexical state" in the DSL Book by Martin Fowler. The Alternative Tokenization pattern is about "completely replacing the lexer" when you get a certain token.