debootstrap Parsed

2016-10-16

I fixed the bugs that prevented debootstrap from parsing. Here are the ASTs and line counts. So now I have three projects parsing: Aboriginal Linux, /etc/init.d and debootstrap.

The issue that was blocking it was this syntax:

echo >out.txt 1 2 3

I always thought the redirect had to go last:

echo 1 2 3 >out.txt

But actually all of these are valid ways of printing 3 numbers to a file:

>out.txt echo 1 2 3
echo >out.txt 1 2 3
echo 1 >out.txt 2 3
echo 1 2 >out.txt 3
echo 1 2 3 >out.txt

This is spelled out in the POSIX grammar, but I misread the recursive rules. Fixed now!
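A quick way to check this, as a minimal sketch: run the same command with the redirect in each position and compare the resulting files. The scratch directory and the file names `a.txt` through `e.txt` are made up for illustration.

```shell
# Scratch directory; the names used here are illustrative only.
tmp=$(mktemp -d)

# Same command, with the redirect in each of the five positions.
>"$tmp/a.txt" echo 1 2 3
echo >"$tmp/b.txt" 1 2 3
echo 1 >"$tmp/c.txt" 2 3
echo 1 2 >"$tmp/d.txt" 3
echo 1 2 3 >"$tmp/e.txt"

# Compare every file against the first one.
same=yes
for f in b c d e; do
  cmp -s "$tmp/a.txt" "$tmp/$f.txt" || same=no
done
echo "all identical: $same"
```

Note the space before each `>`. Written as `1>out.txt` with no space, the `1` would be read as a file descriptor number rather than as an argument to `echo`.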

Another lightbulb: this is where Python 2's "print to file" syntax comes from:

print >>sys.stderr, '1 2 3'

I hacked on debootstrap a few years ago when trying to build a package manager and bootstrap it with Debian tools. What I remember is that it's quite slow for what it does. For example, parsing all the Debian package metadata in shell/Perl seems to take forever, and is possibly done in an algorithmically inefficient way. The code is not very nice along multiple dimensions, even for sh code.

At some point, I want to build some profiling tools and hooks into my shell, which will help with monsters like this. But parsing everything is the first step.

I'm actually working on parsing git now! I knew that some of git was written in shell, but I didn't realize how much. I'm parsing 125K lines right now, and that's just a subset of an old copy of the git source tree! There are 10 or so errors to fix, and then I'll try parsing the entire tree.


I'll have to write the post on lexical states a little later. In the last few days, I've gotten into a good rhythm fixing bugs in the parser, which is more important. The error messages with the column number are really helping, e.g.

Line 33 of '/home/andy/git/other/chef-bcpc/zap-ceph-disks.sh'
  if ! echo "$disk" | egrep -q "${mounted_disk_regex:0:-1}"; then
                                                        ^
Unexpected token after slice: <AS_NUM_LITERAL 1>

(This error is because I'm not correctly implementing unary minus in the arithmetic language.)
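
For context, the slice in that script uses a negative length. This is a bash extension (supported in bash 4.2 and later, as far as I know) that counts from the end of the string, so the parser has to read `-1` as unary minus applied to `1`. Here is a small sketch with a made-up value for the variable. It runs the slice under `bash` explicitly, on the assumption that the surrounding shell may not be bash:

```shell
# Hypothetical value with a trailing '|' to strip; not taken from the real script.
mounted_disk_regex='sda|sdb|'

# ${var:0:-1} is not POSIX, so evaluate it in bash (assumed to be installed).
# Inside bash -c, "_" becomes $0 and the regex becomes $1.
trimmed=$(bash -c 'printf "%s" "${1:0:-1}"' _ "$mounted_disk_regex")
echo "$trimmed"
```

With a bash that supports negative lengths, this prints the value without its last character.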