Blog Roadmap and Project Roadmap

2016-10-15

Let me give a roadmap for the next few posts, so the narrative is coherent.

I want to explain an interesting fact I've discovered: parsing bash is not decidable. There's a language issue related to associative arrays, which were introduced in bash 4.0.

On the other hand, oil is a constructive proof that parsing almost all of shell and bash is decidable. The last few posts have described the idea of parsing shell scripts in a single pass.

Some background will be useful before explaining this. First, as promised, I'll describe the simple technique of using lexical state, which is related to the "start conditions" feature in tools like flex.
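To give a rough feel for the idea before the full post, here is a minimal sketch in Python. It is not oil's actual lexer: the state names (OUTER, ARITH), the token names, and the rules are all made up for illustration. The point is only that the lexer keeps a current state, and the state decides which rules apply, so the same character can become a different token depending on context.

```python
import re

# Hypothetical rule tables, one per lexical state. Each rule is
# (regex, token kind or None to skip, next state or None to stay).
RULES = {
    "OUTER": [
        (r"\$\(\(", "ARITH_BEGIN", "ARITH"),  # $(( enters the arithmetic state
        (r"\$", "DOLLAR", None),              # any other $ (simplified)
        (r"[^\s$]+", "WORD", None),           # here * is just part of a word
        (r"\s+", None, None),                 # skip whitespace
    ],
    "ARITH": [
        (r"\)\)", "ARITH_END", "OUTER"),      # )) returns to the outer state
        (r"\d+", "NUMBER", None),
        (r"[A-Za-z_]\w*", "NAME", None),
        (r"[-+*/]", "OP", None),              # here * is an operator
        (r"\s+", None, None),
    ],
}


def tokenize(text):
    """Return a list of (kind, value) pairs, switching rule tables by state."""
    state = "OUTER"
    pos = 0
    tokens = []
    while pos < len(text):
        for pattern, kind, next_state in RULES[state]:
            m = re.compile(pattern).match(text, pos)
            if m:
                if kind is not None:
                    tokens.append((kind, m.group()))
                pos = m.end()
                if next_state is not None:
                    state = next_state
                break
        else:
            # No rule in the current state matched at this position.
            raise ValueError(
                "unexpected character %r in state %s" % (text[pos], state)
            )
    return tokens
```

With these made-up rules, tokenize("echo *.py $((2*x))") lexes the first * as part of the WORD *.py, and the second * as an OP, because by then the lexer has switched to the ARITH state.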

Then I will show where lexical states are used. For example, one lexical state is devoted to arithmetic expressions, which are used in five different places in bash. I was surprised by that, even after many years and thousands of lines of shell programming.

This will lead into the problem where we can't parse bash up front. To give you a hint: What does this line of code mean?

array[b+2*3]=c

After I describe the problem, I'll propose an easy fix for shell script authors.

Project Roadmap

So far I've parsed Aboriginal Linux and my /etc/init.d directory.

I'm fixing a bug that blocks me from parsing debootstrap, and I'm also testing the parser on more shell scripts found in the wild.

Last night I ran it over my ~/git/other directory, which included a Bazel shell script. The script exposed surprising bash behavior related to the regex matching construct: [[ foo =~ ^foo$ ]]. I think this is essentially a bug, as zsh has much saner behavior, but unfortunately real scripts depend on it.

So far, all of the fixes have been easy and very localized — I'm convinced my architecture is "right". So after some more bug fixing and polish, I will release the code. There are enough shell users that a relatively complete parser should be interesting, even if it's slow.

And then I will hopefully devote more time to the fun part of the project, which I haven't mentioned yet. I'm designing a new shell language. It's kind of like CoffeeScript for shell, except I'm implementing an interpreter with two parsers, not just doing source-to-source transformation. I will have a lot more to say about this later.