In the first post, I mentioned that I used Python to prototype the oil parser, in order to figure out an algorithm for parsing shell. After the algorithm is nailed down, implementing it in C++ should be easy.
Why is this a problem? First, the
POSIX grammar
only covers one of four sublanguages of shell -- the command
sublanguage. It doesn't formalize basic constructs like ${my_var} or
$(my-command) -- they are part of the word sublanguage.
Also, POSIX shell is a smaller language than what people use in practice, i.e. bash or ksh . I'd like to describe these larger languages.
Finally, the
AOSA chapter on Bash
notes several difficulties in parsing bash, mostly due to the unconventional
but entrenched choice of using
yacc
to generate code. Most shells use
top-down
hand-written parsers (recursive descent), not bottom-up generated parsers
(yacc).
I'm about to release the oil parser, which statically parses many real shell programs, so I'm confident I've contributed something to this problem.
I discuss the issue of token lookahead here. There is more to write about after the parser is released.
Let's define a new term first: the osh language .
Most people won't be able to tell the difference between osh and
bash
, because osh is designed for compatibility. But it's useful when talking
about the parser. From now on, I'll be careful to make the distinction
between the new
oil language
and the compatible
osh language
.
The main differences between bash and osh are:
${var[0][1][2]}. osh prefers to catch errors early where it won't
adversely affect compatibility.$((foo)) is always an arithmetic substitution, not a command
substitution with a subshell in it. I will talk more about this issue
tomorrow.Repeating the thesis: the
osh language
can be parsed with two tokens of lookahead. In other words, it's
expressable with an LL(2) grammar.
This is good because it means the parser is relatively efficient.
When the parser encounters tokens like for, if, while, $(, ${, etc.,
it doesn't have to look any further to know what to parse. If these were the
only constructs, it would be a language expressible with an LL(1) grammar.
But each of the four sublanguages requires two tokens of lookahead in certain cases:
(1) Command Language
Functions. When we see a token like my-func, we must peek ahead to see if
the next token is ( to know whether it's a function definition or a
command.
my-func() { echo hi; } # function definition
my-func x y # command
Coprocesses. In bash, after we see coproc foo, we must peek ahead to see if
the next word is { to know whether it's a coprocess definition or
command.
coproc foo bar
coproc foo {
echo hi
}
(2) Word language
When we see ${#, we have to look past the # token to know if it's a
variable or an operator. See this in-depth
discussion). The same issue occurs with ${!} vs ${!var}.
echo ${#} # # is a variable
echo ${#foo} # # is a prefix operator
When we see a token like var=, we have to check if the next token is the (
operator to see if the word is an array assignment, which requires
recursive word parsing, or a word assignment.
var=value
var=(1 2 3)
(3) Arithmetic Language
The arithmetic language uses Pratt Parsing, which I believe is a form of LL(2) parsing, although I've never heard anyone say that.
The nud methods (null denotation) are for parsing decisions that can be
made with one token of lookahead, while the led methods (left
denotation) are for decisions that can be made with two tokens.
This handles mathematical expressions easily, since the infix operator token is second, after the first operand, telling you the type of AST node to create. In Pratt parsing, there's no possibility of looking ahead three tokens.
(4) Boolean Language
Like the arithmetic language, the boolean language contains infix operators. So it has to look ahead one word when it sees an operand word.
[[ $var ]] # test if $var is non-empty
[[ $var -eq foo ]] # compare $var with another value
So osh has parsers for four sublanguages, each of which use two tokens of
lookahead. Arithmetic is handled with Pratt parsing, and the
other three languages are handled with
recursive descent
. These top-down techniques are easily interwoven.
(I thought about using Pratt parsing for the boolean language as well, but it
has two extra lexical states for regular expressions (operator
=~), which complicates things.)
In contrast, bash uses a generated, bottom-up parser for the command sublanguage, and ad hoc parsers for the other three. It also dynamically parses some sublanguages, as demonstrated in this post.
The parse.y file is 6,500 lines, containing both the generated command parser and the hand-written boolean parser.
The ${} parser is mixed in with the runtime in subst.c,
weighing in at 10,700 lines! Arithmetic in expr.c is another
1,500 lines (also including the runtime).
In oil, I've taken care to keep the definition of the language in one place,
and I believe it will be easier to read and extend.
A longer-term goal is to define the osh language in a meta-language or
grammar rather than in Python, but I believe that is beyond the state of the
art in parsing technology. I will have more to say about this.
Credit: Thanks to Daniel Marti for pointing out the coproc lookahead issue.