In the last post, we saw that composing languages with different lexical structures leads to lexer modes. Then we saw how shell composes four major sublanguages plus three minor ones.

In this post, we'll look at more examples of lexer modes.
Surprisingly, I don't think I've mentioned this fact anywhere in the last 20 months.
Shell has an odd structure:
The WordParser reads tokens and produces words. But then there's yet another level of structure:
I don't believe any other programming language has this structure. I believe it comes from the transition from the simple Thompson shell to an ALGOL-like Bourne shell.
Example:

```sh
for x in 'ab' "ab"; do
  echo 'ab'
done
```
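The point is that keywords like `for`, `in`, `do`, and `done` are themselves words; only the command level treats them specially. Here's a minimal Python sketch of that idea (the `classify` function and the `Keyword`/`Word` labels are my own illustration, not Oil's code):

```python
# Toy sketch: the word parser produces words, and the command parser
# decides which of those words are keywords.
KEYWORDS = {'for', 'in', 'do', 'done'}

def classify(words):
    """Label each word the way a command-level parser might."""
    return [('Keyword' if w in KEYWORDS else 'Word', w) for w in words]

classify(['for', 'x', 'in', "'ab'", '"ab"'])
# → [('Keyword', 'for'), ('Word', 'x'), ('Keyword', 'in'),
#    ('Word', "'ab'"), ('Word', '"ab"')]
```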
Below, `echo` and `hi` are words:

```sh
echo hi
```
But so are the arguments in each of these commands:

```sh
echo $dir/*.{sh,py}
echo ${dir}/'My Documents'
echo ${dir:-"default"}/'My Documents'
echo "c $suffix"
echo @(*.py|"other pattern"|'other pattern')
a=(1 2 3)
```
The word language is left out of the POSIX spec, but it's pretty complicated. Search for `lex_mode=` in word_parse.py.
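To make the idea concrete, here's a toy mode-dependent lexer in Python. It's a sketch under my own assumptions: the token names resemble Oil's style, but the `LEXER_SPEC` table and its patterns are invented for illustration.

```python
import re

# Each lexer mode has its own list of (pattern, token id) rules.
LEXER_SPEC = {
    'OUTER': [                         # word context: * and ? are glob chars
        (r'[a-zA-Z0-9_*?./]+', 'Lit_Chars'),
        (r'\s+', 'WS_Space'),
    ],
    'ARITH': [                         # inside $(( )): * is an operator
        (r'[0-9]+', 'Arith_Number'),
        (r'[-+*/]', 'Arith_Op'),
        (r'\s+', 'WS_Space'),
    ],
}

def tokenize(s, lex_mode):
    """Lex the whole string using the rules for lex_mode."""
    pos, tokens = 0, []
    while pos < len(s):
        for pattern, tok_id in LEXER_SPEC[lex_mode]:
            m = re.match(pattern, s[pos:])
            if m:
                tokens.append((tok_id, m.group(0)))
                pos += m.end()
                break
        else:
            raise ValueError('No rule matched at position %d' % pos)
    return tokens
```

The same input lexes differently depending on the mode:

```python
tokenize('2*3', 'OUTER')  # → [('Lit_Chars', '2*3')]
tokenize('2*3', 'ARITH')  # → [('Arith_Number', '2'), ('Arith_Op', '*'),
                          #    ('Arith_Number', '3')]
```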
The examples above don't talk about the relationship between the lexer and the parser.
`ReadCompoundWord` in word_parse.py reads word parts in several lexer modes:

- `OUTER` -- the default state, e.g. `echo 'ab'$ab"ab"${ab}`
- `ARITH`
- `EXTGLOB`
- `VS_ARG_UNQ` -- words are allowed here
- `VS_ARG_DQ`
- `BASH_REGEX_CHARS` -- `(`, `|`, and `)` are not operators!
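Here's a runnable toy sketch (my own simplification, not Oil's code) of how the word parser drives mode switching: it reads in an `OUTER`-like mode by default, and switches the lexer to a `VS_ARG`-like mode inside `${...}`:

```python
import re

# Invented patterns and token names for illustration only.
MODES = {
    'OUTER':  [(r'\$\{', 'Left_VarSub'), (r'[a-zA-Z0-9_]+', 'Lit_Chars')],
    'VS_ARG': [(r'\}', 'Right_VarSub'), (r':-', 'VTest_ColonHyphen'),
               (r'[a-zA-Z0-9_]+', 'Lit_Chars')],
}

class Lexer:
    def __init__(self, s):
        self.s, self.pos = s, 0

    def read(self, lex_mode):
        """Read one token using the rules for lex_mode."""
        for pattern, tok_id in MODES[lex_mode]:
            m = re.match(pattern, self.s[self.pos:])
            if m:
                self.pos += m.end()
                return tok_id, m.group(0)
        return 'Eof', ''   # toy: treat anything unmatched as end of input

def read_word(lexer):
    """Read word parts; switch lexer modes inside ${...}."""
    parts = []
    while True:
        tok_id, val = lexer.read('OUTER')
        if tok_id == 'Eof':
            return parts
        if tok_id == 'Left_VarSub':
            # Mode switch: the body of ${...} has different lexical rules.
            body = []
            while True:
                t, v = lexer.read('VS_ARG')
                if t in ('Right_VarSub', 'Eof'):
                    break
                body.append((t, v))
            parts.append(('BracedVarSub', body))
        else:
            parts.append((tok_id, val))
```

For example:

```python
read_word(Lexer('${dir:-default}suffix'))
# → [('BracedVarSub', [('Lit_Chars', 'dir'), ('VTest_ColonHyphen', ':-'),
#                      ('Lit_Chars', 'default')]),
#    ('Lit_Chars', 'suffix')]
```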
This post explained lexer modes in detail. The next post explains another problem that a "real lexer" has to solve, and how we store state in the lexer to solve it.
I call this mechanism a lexer hint.
The word language is the most complex. The excerpt below from osh.asdl (see the full file for context) says that a word consists of parts, and a part can be a bare literal, single-quoted, double-quoted, backslash-escaped, a tilde, a variable substitution, a command substitution, an arithmetic substitution, a brace expansion, an extended glob, or an array literal in an assignment word:
```
word_part =
  ArrayLiteralPart(word* words)
| LiteralPart(token token)
| EscapedLiteralPart(token token)
| SingleQuotedPart(token* tokens)
| DoubleQuotedPart(word_part* parts)
| SimpleVarSub(token token)
| BracedVarSub(token token,
               id? prefix_op,  -- prefix # or ! operators
               bracket_op? bracket_op,
               suffix_op? suffix_op)
| TildeSubPart(string prefix)
  -- For command sub and process sub: $(...) <(...) >(...)
| CommandSubPart(command command_list, token left_token)
| ArithSubPart(arith_expr anode)
  -- {a,b,c}
| BracedAltPart(word* words)
  -- {1..10} or {1..10..2}
| BracedIntRangePart(int start, int end, braced_step? step)
  -- {a..f} or {a..f..2} or {a..f..-2}
| BracedCharRangePart(string start, string end, braced_step? step)
  -- extended globs are parsed statically, unlike globs
| ExtGlobPart(token op, word* arms)
```
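To show what this schema means concretely, here's a hand-built tree using toy namedtuple stand-ins for a few of the node types above. The field shapes are my own simplifications; the real nodes hold tokens with source positions, not plain strings.

```python
from collections import namedtuple

# Toy stand-ins for a few ASDL node types (fields simplified).
LiteralPart      = namedtuple('LiteralPart', 'token')
SingleQuotedPart = namedtuple('SingleQuotedPart', 'tokens')
DoubleQuotedPart = namedtuple('DoubleQuotedPart', 'parts')
BracedVarSub     = namedtuple('BracedVarSub', 'token suffix_op')

# A plausible tree for the word ${dir:-"default"}/'My Documents':
word_parts = [
    BracedVarSub(token='dir',
                 suffix_op=(':-', [DoubleQuotedPart(parts=[LiteralPart('default')])])),
    LiteralPart('/'),
    SingleQuotedPart(tokens=['My Documents']),
]
```

One word, three parts, and the `:-` default itself contains a nested word part. That nesting is why the word language needs a recursive schema.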
To put this in the context of the project, lexer modes with [re2c][] were the thing that convinced me that reimplementing bash was possible in a reasonable amount of time. I saw this style used in the lexer for the [Ninja][ninja] build system, and thought it would work well for shell. (There's more history than that, which I may explain later.)