Parsing Bash is Undecidable

2016-10-20 (Last updated 2019-02-06)

Writing a parser for the OSH language revealed an unusual feature of the bash language. This post explains the theory, and why it matters in practice.

Table of Contents

An Example of the Problem

A Solution

Examples of a Better Syntax and Semantics

Recap

Appendix: Parsing C++, Perl, and Make is Also Undecidable

An Example of the Problem

What does this line of bash code mean?

array[X=b+2*3]=c

It depends on what array is. Let's wrap it in some code:

if test $# -eq 0; then  # any arguments?
  declare -a array      # indexed array
else
  declare -A array      # associative array
fi

b=1
array[X=b+2*3]=c        # What does this mean?

echo "Index: ${!array[@]}"
echo "Value: ${array[@]}"
echo "X:     ${X:-undefined}"

download undecidable-parse.sh

If I run it with no arguments, I get:

$ bash undecidable-parse.sh

Index: 7
Value: c
X:     7

But if I run it with an argument, I get:

$ bash undecidable-parse.sh ZZZ

Index: X=b+2*3
Value: c
X:     undefined

In the first case, bash:

Creates an indexed array (see help declare).
Evaluates b+2*3 to 7. (The $ prefix for variables is optional in an arithmetic context, so b+2*3 is the same as $b+2*3.)
Assigns the string 7 to X (a side effect)
Assigns the string c to array[7].

In the second case, it

Creates an associative array
Associates the string c with the the string index 'X=b+2*3'
X remains undefined.

That is, parsing what's inside array[] depends on the dynamic type of the variable. We can't tell until runtime which one of these is the parse tree:

; indexed array case
(assign
  (L-value
    "array"
    (assign
      (L-value "X")
      (add
        (var "b")
        (multiply 2 3))))
  (string "c"))

; associative array case
(assign
  (L-value
    "array"
    (string "X=b+2*3"))
  (string "c"))

More formally, we can't make a parse tree without being able to statically determine properties of arbitrary programs, which means that it's undecidable. To emphasize this, you can replace the test $# -eq 0 line with arbitrary code — an interpreter for a Turing machine, or a function that tests the network latency to Google, etc. The parser can't say anything about the results of those computations.

A Solution

OSH has the philosophy of catching errors earlier, so it aims to parse shell programs entirely up front. It can't do this if it's mathematically impossible!

Fortunately, there's a simple solution: always quote string literals used as array keys. The parser will then know that unquoted strings are variables. That is:

array[mystring]=x    # BAD: is it a string or a variable?
array['mystring']=x  # GOOD: it's a string

This rule technically makes OSH an incompatible variant of bash, but you can turn any valid bash program into a valid OSH program without breaking it. In other words, it doesn't remove any functionality; it just tightens up the syntax by requiring quotes.

Associative arrays were added in bash 4.0, which means that they're one of the most non-portable parts of bash. Most shells don't implement associative arrays, and zsh uses an entirely different syntax.

Examples of a Better Syntax and Semantics

declare -a a  # indexed array
declare -A A  # associative array

a['1+2']=x    # ERROR: '1+2' doesn't look like a number
A['1+2']=x    # GOOD: a literal string key

a[1+2]=x      # GOOD: set index 3
A[1+2]=x      # OK: set index "3" ("type coercion")

a[3]=x        # GOOD: set index 3
A[3]=x        # OK: set index "3" ("type coercion")

a['3']=x      # OK: set index 3 ("type coercion")
A['3']=x      # GOOD: set index "3"

a[v]=x        # GOOD: substitute variable (runtime error if 
              # not integer-like)
A[v]=x        # CHANGED: substitute variable
              # Use A['v'] if you want a string literal

The only line with changed behavior is the last one. With this change, bash can be parsed up front, with only trivial modifications to existing programs. In practice, I see people following this style already.

The following are acceptable and equivalent to a[v]=x or A[v]=x, but the simpler forms are preferred:

a[$v]=x    # strings converted to integers
A[$v]=x
a["$v"]=x  # strings converted to integers
A["$v"]=x

There could be an even stricter mode to disallow "type coercions", but that's a bigger change at odds with existing shell semantics.

Recap

A major goal of Oil is to treat shell as a real programming language, rather than an OS command shell with sloppy macro processing.

Here's the narrative so far:

I showed that that [[ is the "compile-time" version of [, illustrating the difference between parsing up front vs. parsing during execution. The latter opens up the possibility of depending on arbitrary execution state, which makes parsing undecidable. All shells that implement [[ parse it up front.
Even when it's possible to parse up front, all popular shells defer some parsing to execution time, e.g. what's inside ${} or $().
OSH uses the technique of lexical state to parse up front, in a single pass. I believe this strategy is simpler, faster, and results in a better user experience.
In this post, I showed the only construct in bash that's impossible to parse up front. Then I proposed a simple fix: always quote string literals used as array keys.

Appendix: Parsing C++, Perl, and Make is Also Undecidable

Similar issues come up in other programming languages.

Parsing C++ is Literally Undecidable. As far as I understand, the issue is that C++ has two Turing-complete execution contexts: compile-time and runtime, and parsing sits between them. Parsing can depend on arbitrary compile-time computation, making it undecidable.

Parsing Perl Considered Undecidable.

What does this code mean?

whatever / 25 ; # / ; die "this dies!";

Kati internals. Kati is a modern parser for GNU Make, and it finds that parsing and execution are interleaved.

What does this code mean?

$(VAR)
    X=hoge echo $${X}

I'll let you decide whether these are language design errors. On the one hand, these are obscure corners that don't come up very often.

On the other hand, C++, Perl, Make, and bash all have a reputation for being difficult to learn, in a large part due to their syntax. Languages like Python and Java don't have these types of issues.

On a related note, parsing JavaScript is relatively easy, and gives the expected parse errors. But it does have a lack of runtime errors, preferring nonsensical type coercion to stopping the program. The shell shares this unfortunate property. Future posts will discuss how OSH will improve upon this situation.