blog | oilshell.org

Parsing Bash is Undecidable

2016-10-20

What does this line of bash code mean?

array[X=b+2*3]=c

It depends on what array is. Let's wrap it in some code:

if test $# -eq 0; then  # any arguments?
  declare -a array      # indexed array
else
  declare -A array      # associative array
fi

b=1
array[X=b+2*3]=c        # What does this mean?

echo "Index: ${!array[@]}"
echo "Value: ${array[@]}"
echo "X:     ${X:-undefined}"

download the script

If I run it with no arguments, I get:

$ bash undecidable-parse.sh

Index: 7
Value: c
X:     7

But if I run it with an argument, I get:

$ bash undecidable-parse.sh x

Index: X=b+2*3
Value: c
X:     undefined

In the first case, bash did this:

  1. Create an indexed array (see help declare).
  2. Evaluate b+2*3 to 7. (Inside an arithmetic context, the $ variable prefix is optional, so b+2*3 is the same as $b+2*3.)
  3. Assign the value 7 to X (a side effect)
  4. Assign the value c to array[7].

In the second case, instead of a regular indexed array, bash creates an associative array. The key is the string X=b+2*3; the value is c; and X is undefined.

That is, parsing what's inside array[] depends on the type of the variable, which is dynamic in bash. So we can't tell until runtime which one of these is the parse tree:

(assign
  (L-value
    "array"
    (string "X=b+2*3"))
  (string "c"))

(assign
  (L-value
    "array"
    (assign
      (L-value "X")
      (add
        (var "b")
        (multiply 2 3))))
  (string "c"))

More formally, we can't make a parse tree without being able to statically determine properties of arbitrary programs, which means that it's undecidable. To emphasize this, you can replace the test $# -eq 0 line with anything -- an interpreter for a Turing machine, or a function that tests the network latency to Google, etc. The parser cannot say anything about the results of those computations.

A Solution

oil has the philosophy of catching errors earlier, so it aims to parse shell programs entirely up front. It can't do this if it's mathematically impossible!

Fortunately, there's a simple solution: always quote string literals used as array keys. The parser will then know that unquoted strings are variables. That is:

array[mystring]=x    # BAD: is it a string or a variable?
array['mystring']=x  # GOOD: it's a string

This rule technically makes oil an incompatible variant of bash, but it has the nice property that you can make any valid bash program into a valid oil program without breaking it. In other words, it doesn't remove any functionality; it just tightens up the syntax by requiring quotes.

Associative arrays were added in bash 4.0, which means that they're one of the most non-portable parts of bash. Most shells don't implement associative arrays, and zsh uses an entirely different syntax.

More examples of the proposed syntax and semantics:

declare -a a  # indexed array
declare -A A  # associative array

a['1+2']=x    # ERROR: '1+2' doesn't look like a number
A['1+2']=x    # GOOD: a literal string key

a[1+2]=x      # GOOD: set index 3
A[1+2]=x      # OK: set index "3" ("type coercion")

a[3]=x        # GOOD: set index 3
A[3]=x        # OK: set index "3" ("type coercion")

a['3']=x      # OK: set index 3 ("type coercion")
A['3']=x      # GOOD: set index "3"

a[v]=x        # GOOD: substitute variable (runtime error if 
              # not integer-like)
A[v]=x        # CHANGED: substitute variable
              # Use A['v'] if you want a string literal

The following are acceptable and equivalent to a[v]=x or A[v]=x, but the simpler forms are preferred:

a[$v]=x    # strings converted to integers
A[$v]=x
a["$v"]=x  # strings converted to integers
A["$v"]=x

The only changed semantic is the last one. With this change, bash can be parsed up front, with only trivial modifications to existing programs. In practice, I see people following this style already.

(There could be an even stricter mode to disallow "type coercions", but that's a bigger change at odds with existing shell semantics.)

Recap

A major goal of oil is to treat shell as a real programming language, rather than an OS command shell with a bunch of macro processing thrown in.

Here is the narrative so far:

  1. I showed that that [[ is the "compile-time" version of [, illustrating the difference between parsing up front vs. parsing during execution. The latter opens up the possibility of depending on arbitrary execution state, which makes parsing undecidable. All shells that implement [[ parse it up front.

  2. Even when it's possible to parse up front, all popular shells defer some parsing to execution time, e.g. the stuff inside ${} or $().

  3. oil uses the technique of lexical state to parse up front in a single pass. I believe this strategy is simpler, faster, and results in a better user experience.

  4. In this post, I showed the only construct in bash that's impossible to parse up front. Then I proposed a simple fix: always quote string literals used as array keys.

Parsing C++, Perl, and Make is also undecidable

Similar issues come up in other programming languages.

Parsing C++ is Literally Undecidable. As far as I understand, the issue is basically that C++ has two Turing-complete execution contexts: compile-time and runtime, and parsing sits between them. Parsing can depend on arbitrary compile-time computation, making it undecidable.

Parsing Perl Considered Undecidable.

What does this code mean?

whatever / 25 ; # / ; die "this dies!";

Kati internals. Kati is a modern parser for GNU Make, and it finds that parsing and execution are interleaved.

What does this code mean?

$(VAR)
    X=hoge echo $${X}

I'll leave to you to decide if these are language design errors. On the one hand, these are obscure corners that don't come up very often.

On the other hand, C++, Perl, Make, and bash all have a reputation for being difficult to learn, in a large part due to their syntax. Languages like Python and Java don't have these types of issues.

On a related note, parsing JavaScript is relatively easy, and gives the expected parse errors. But it does have a lack of runtime errors, e.g. preferring nonsensical type coercion to stopping the program. The shell unfortunately shares this property. Future posts will discuss how oil will improve upon this situation.