OSH Can Be Parsed With Two Tokens of Lookahead

2016-11-17 (Last updated 2019-02-06)

In the first post, I mentioned that I used Python to prototype the OSH parser, in order to figure out an algorithm for parsing shell. After the algorithm is nailed down, implementing it in C++ should be easy.

Why is this interesting?

The POSIX grammar only covers one of four sublanguages of shell: the command sublanguage. It doesn't formalize basic constructs like ${my_var} or $(my-command), which are part of the word sublanguage.
POSIX shell is a smaller language than what people use in practice, e.g. bash or ksh. I'd like to describe these larger languages.
Finally, the AOSA chapter on Bash notes several difficulties in parsing bash, mostly due to the unconventional but entrenched choice of using yacc to generate code. Most shells use top-down hand-written parsers (recursive descent), not bottom-up generated parsers (yacc).

I'm about to release the OSH parser, which statically parses many real shell programs, so I'm confident I've contributed something to this problem.

I discuss the issue of token lookahead here. There will be more to write about after the parser is released.

Table of Contents

The OSH Language

Cases that Require Two Tokens of Lookahead

The OSH Language

Let's define a new term first: the OSH language.

Most people won't be able to tell the difference between OSH and bash, because OSH is designed for compatibility. But it's useful when talking about the parser. From now on, I'll be careful to make the distinction between the new Oil language and the compatible OSH language.

The main differences between bash and OSH are:

Quotes are required for associative array keys, to avoid the undecidable parse.
bash parses but doesn't run certain constructs like ${var[0][1][2]}. OSH prefers to catch errors early where it won't adversely affect compatibility.
$((foo)) is always an arithmetic substitution, not a command substitution with a subshell in it. I will talk more about this issue tomorrow.

Repeating the thesis: the OSH language can be parsed with two tokens of lookahead. In other words, it's expressible with an LL(2) grammar.

This is good because it means the parser is relatively efficient.

Cases that Require Two Tokens of Lookahead

When the parser encounters tokens like for, if, while, $(, ${, etc., it doesn't have to look any further to know what to parse. If these were the only constructs, it would be a language expressible with an LL(1) grammar.

But each of the four sublanguages requires two tokens of lookahead in certain cases:

Command Language

Functions. When we see a token like my-func, we must peek ahead to see if the next token is ( to know whether it's a function definition or a command.

my-func() { echo hi; }  # function definition
my-func x y             # command

Coprocesses. In bash, after we see coproc foo, we must peek ahead to see if the next word is { to know whether it's a coprocess definition or command.

coproc foo bar
coproc foo {
  echo hi
}

Word language

When we see ${#, we have to look past the # token to know if it's a variable or an operator. See this in-depth discussion). The same issue occurs with ${!} vs. ${!var}. (The former is equivalent to $!, the last background process PID.)

echo ${#}      #   # is a variable
echo ${#foo}   #   # is a prefix operator

When we see a token like var=, we have to check if the next token is the ( operator to see if the word is an array assignment, which requires recursive word parsing, or a word assignment.

var=value
var=(1 2 3)

Arithmetic Language

The arithmetic language uses Pratt Parsing, which I believe is a form of LL(2) parsing, although I've never heard anyone say that.

The nud methods (null denotation) are for parsing decisions that can be made with one token of lookahead, while the led methods (left denotation) are for decisions that can be made with two tokens.

This handles mathematical expressions easily, since the infix operator token is second, after the first operand, telling you the type of AST node to create. In Pratt parsing, there's no possibility of looking ahead three tokens.

Boolean Language

Like the arithmetic language, the boolean language contains infix operators. So it has to look ahead one word when it sees an operand word.

[[ $var ]]            # test if $var is non-empty
[[ $var -eq foo ]]    # compare $var with another value

Comparison with Bash

So OSH has parsers for four sublanguages, each of which use two tokens of lookahead.

Arithmetic is handled with Pratt parsing, and the other three languages are handled with recursive descent. These top-down techniques are easily interwoven.

(I thought about using Pratt parsing for the boolean language as well, but it has two extra lexical states for regular expressions (operator =~), which complicates things.)

In contrast, bash uses a generated, bottom-up parser for the command sublanguage, and ad hoc parsers for the other three. It also dynamically parses some sublanguages, as demonstrated in this post.

The parse.y file is ~6500 lines, containing both the generated command parser and the hand-written boolean parser.
The ${} parser is mixed in with the runtime in subst.c, weighing in at ~10,700 lines! Arithmetic in expr.c is another ~1500 lines, also including the runtime.

In OSH, I've taken care to keep the definition of the language in one place, apart from the runtime, which should make it easier to read and extend.

A longer-term goal is to define the OSH language in a meta-language or grammar rather than in Python, but I believe that is beyond the state of the art in parsing technology. I will have more to say about this.

Credit: Thanks to Daniel Marti for pointing out the coproc lookahead issue.