blog |

Lexer Modes in Shell


Last post. We saw that the composition of languages with different lexical structures leads to lexer modes.

Then we saw how shell composes 4 major sublanguages plus a 3 minor languages.

This post, we'll non-shell examples of using lexer modes

Words Have Alternative Lexical Structures

VS_ARG etc.

List of Lexer Modes

Here are the 14 lexer modes:

lex_mode =
| VS_1 | VS_2 | VS_ARG_UNQ | VS_ARG_DQ

Shell Has a Two-Level Structure: The Word Sublanguage

Surprisingly, I don't think I've mentioned this fact anywhere in the last 20 months.

Shell has an odd structure:

But then there yet another level of structure:

I don't believe any other programming language has this structure. I believe it comes from the transition from the simple Thompson shell to an ALGOL-like Bourne shell.


for x in 'ab' "ab"; do
  echo 'ab'

Below, echo and hi are words,

echo hi

But so is this:

echo $dir/*.{sh,py}
echo ${dir}/'My Documents'
echo ${dir:-"default"}/'My Documents'
echo "c $suffix"
echo @(*.py|"other pattern"|'other pattern')
a=(1 2 3)

Word language is left out of the POSIX spec, but it's pretty complicated.

Search for lex_mode= in

Example: The Word Parser Reuses the Lexer in Different Modes

The above examples don't take about the relationship between the lexer and the parser.



This post explained lexer modes in detail. The next posts explains another problem that a "real lexer" has to solve, and how we store state in the lexer to solve it.

I call this mechanism a lexer hint.

Appendix: Abstract Syntax for the Word Language

Word language is most complex: Quote osh/osh.asdl.

This basically says that a word consists of parts, and a part can be a bare literal, single quoted, double quoted, backslash-escaped, a tilde, variable substituion, command substitution, arithmetic substituion, brace expansion, an extended glob, or an array literal in an assignment word.

Full file is osh.asdl

  word_part = 
    ArrayLiteralPart(word* words)
  | LiteralPart(token token)
  | EscapedLiteralPart(token token)
  | SingleQuotedPart(token* tokens)
  | DoubleQuotedPart(word_part* parts)
  | SimpleVarSub(token token)
  | BracedVarSub(token token,
                id? prefix_op,  -- prefix # or ! operators
                bracket_op? bracket_op
                suffix_op? suffix_op)
  | TildeSubPart(string prefix)
    -- For command sub and process sub: $(...)  <(...)  >(...)
  | CommandSubPart(command command_list, token left_token)
  | ArithSubPart(arith_expr anode)
    -- {a,b,c}
  | BracedAltPart(word* words)
    -- {1..10} or {1..10..2}
  | BracedIntRangePart(int start, int end, braced_step? step)
    -- {a..f} or {a..f..2} or {a..f..-2}
  | BracedCharRangePart(string start, string end, braced_step? step)
    -- extended globs are parsed statically, unlike globs
  | ExtGlobPart(token op, word* arms)


To put this in the context of the project, Lexer modes with [re2c][] was the thing that convinced me that reimplementing bash was possible in a reasonable amount of time. I saw this style used in the lexer for the [Ninja][ninja] build system, and thought it would work well for for shell. (There's more history than that, which I may explain later.)