Let me give a roadmap for the next few posts, so the narrative is coherent.
I want to explain an interesting fact I've discovered: parsing bash is not decidable. There's a language issue related to associative arrays, which were introduced in bash 4.0.
On the other hand, oil is a constructive proof that almost all of shell and
bash is decidable. The last few posts have described the idea of parsing
shell scripts in a single pass.
Some background will be useful before explaining this. First, as promised, I'll
describe the simple technique of using lexical state, which is
related to the "start conditions" feature in tools like
Then I will show where lexical states are used. For example, one lexical state
is devoted to arithmetic expressions, which are used in five different
places in bash. I was surprised by that, even after many years and thousands
of lines of shell programming.
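
To make the idea of lexical state concrete before the detailed post, here is a minimal sketch in Python. It is not the oil lexer: the mode names, token names, and rules are all hypothetical and cover only a tiny slice of shell syntax. The point is that the same character (here `*`) yields a different token depending on which state the lexer is in. For brevity the lexer switches state itself when it sees `$((` and `))`; a real design could just as well let the parser choose the state.

```python
import re

# Hypothetical lexical states (modes), for illustration only.
OUTER, ARITH = "OUTER", "ARITH"

# Each state has its own ordered list of (token name, regex) rules.
RULES = {
    OUTER: [
        ("ARITH_OPEN", r"\$\(\("),
        ("SPACE", r"[ \t]+"),
        ("WORD", r"[^ \t$]+"),   # '*' is just part of a word (a glob) here
        ("DOLLAR", r"\$"),
    ],
    ARITH: [
        ("ARITH_CLOSE", r"\)\)"),
        ("SPACE", r"[ \t]+"),
        ("NUMBER", r"[0-9]+"),
        ("NAME", r"[A-Za-z_][A-Za-z0-9_]*"),
        ("OP", r"[-+*/%]"),      # '*' is an arithmetic operator here
    ],
}

COMPILED = {
    mode: [(name, re.compile(pattern)) for name, pattern in rules]
    for mode, rules in RULES.items()
}


def tokenize(text):
    """Return a list of (state, token name, token text) tuples."""
    mode = OUTER
    pos = 0
    tokens = []
    while pos < len(text):
        # Try the current state's rules in order; the first match wins.
        for name, regex in COMPILED[mode]:
            m = regex.match(text, pos)
            if m:
                break
        else:
            raise ValueError("no rule matches at position %d in state %s" % (pos, mode))
        if name != "SPACE":
            tokens.append((mode, name, m.group()))
        pos = m.end()
        # Switch lexical state on the delimiters of arithmetic expansion.
        if name == "ARITH_OPEN":
            mode = ARITH
        elif name == "ARITH_CLOSE":
            mode = OUTER
    return tokens


if __name__ == "__main__":
    for token in tokenize("echo *.py $((2 * x))"):
        print(token)
```

Run on `echo *.py $((2 * x))`, this sketch tokenizes `*.py` as a single `WORD` in the outer state, while the `*` inside `$(( ))` comes out as an `OP` token in the arithmetic state.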
This will lead into the problem where we can't parse bash up front. To give you a hint: What does this line of code mean?
After I describe the problem, I'll propose an easy fix for shell script authors.
So far I've parsed Aboriginal Linux and my
I'm working on fixing a bug blocking me from parsing debootstrap. And I'm also testing the parser on more shell scripts found in the wild.
Last night I ran it over my
~/git/other directory, which included a Bazel
shell script. The script exposed surprising
bash behavior related to the
regex matching construct:
[[ foo =~ ^foo$ ]]. I think this is essentially a
zsh has much saner behavior, but unfortunately real scripts
depend on bash's behavior.
So far, all of the fixes have been easy and very localized — I'm convinced my architecture is "right". So after some more bug fixing and polish, I will release the code. There are enough shell users that a relatively complete parser should be interesting, even if it's slow.
And then I will hopefully devote more time to the fun part of the project, which I haven't mentioned yet. I'm designing a new shell language. It's kind of like CoffeeScript for shell, except I'm implementing an interpreter with two parsers, not just doing source-to-source transformation. I will have a lot more to say about this later.