blog | oilshell.org
I just fixed the hardest bug ever in OSH, and now it can run significant shell scripts that I didn't write:
(Appendix A explains how autoconf generates these files.)
I ran these programs with both bash and OSH and compared the respective
output files. They're essentially identical: the handful of one-line
differences appear to be due to autoconf checking
$BASH_VERSION for optional
configure scripts use a quirky subset of POSIX shell, as opposed to
the full bash language, I'm happy to reach this milestone.
At the end of How I Use Tests, I mentioned that I want to write a post called The Long Slog Through a Shell. It should give you a sense of how I filled out the OSH runtime.
But since this last bug happened to be the hardest one ever, and the one that let me run real shell programs, it's worth writing about.
Spoiler: I fixed it by simply calling
close() on a couple files. It was a
silly bug with a trivial fix. But that's unfortunately typical of bugs that
cost a lot of time!
Under OSH, but not bash, the execution of
config.status stopped with no error
message at the following line:
if sed "s/$ac_cr//" < /dev/null > /dev/null 2>&1; then ...
This code redirects
I suspected that execution didn't actually halt there. Rather, the script
continued executing with no output.
Before explaining the fix, here is why this bug was tricky:
(1) The obvious suspect was innocent. A shell takes three steps to execute
echo hi 1>&2:
1>&2redirect to the process state, using the
fcntl()system calls. These modify the file descriptor table of the process.
The observed behavior could be easily explained by a bug in step 3. If I
didn't restore the state correctly, then all output would continue to go to
/dev/null. I was convinced that the bug was in this code, and I spent a lot
of time instrumenting it and mentally tracing it.
But the bug was in a totally different place. You could say there was "action at a distance".
(2) It broke my debugging tools. To debug the redirect code, I added
(3) Debugging the bug changed the bug. To work around problem #2, I tried opening separate disk file for debug messages. But this triggered an unhandled exception in a different part of the program. I still don't understand exactly what happened. But opening a file changes the file descriptor table, so it's not too surprising.
I'm reminded of James Mickens' hilarious article The Night Watch (PDF), which just popped up again on Reddit:
I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS.
(4) My attempts to meaningfully reduce the test case failed. I cut the 1388-line script down to a 10-line script that contained the last statement, but the smaller script failed to reproduce the bug.
Attempting to read 1388 lines of shell generated by autoconf is not fun!
The only saving grace is that the bug was completely deterministic. (This reminds me that I hit a tough non-deterministic bytecode generation bug while developing OPy.)
So I was painstakingly tracing through the wrong code. And not being able to log any messages slowed me down a lot. I got stuck and moved onto other things for awhile.
After many failed debugging sessions, I had the idea to look in the
/proc/$PID/fd/ directory. I don't recall using this much, but it now seems
essential. I saw that extra files were open:
$ ls -l /proc/29405/fd total 0 lrwx------ 1 andy andy 64 Jul 1 18:37 0 -> /dev/pts/25 lrwx------ 1 andy andy 64 Jul 1 18:37 1 -> /dev/pts/25 lrwx------ 1 andy andy 64 Jul 1 18:37 2 -> /dev/pts/25 lr-x------ 1 andy andy 64 Jul 1 18:37 5 -> /home/andy/git/oil/core/runtime.asdl lr-x------ 1 andy andy 64 Jul 1 18:37 6 -> /home/andy/git/oil/osh/osh.asdl lr-x------ 1 andy andy 64 Jul 1 18:37 8 -> /dev/urandom
Oops, why are descriptors 5 and 6 open? They're connected to the two ASDL schemas I'm using.
(Oil's ASDL implementation generates classes at startup time
.asdl schema, so we need to open those two files.)
I also remembered that one line in
exec 6>&1, so this
conflict on descriptor 6 looked like the culprit.
I tried simply closing the files, and the problem went away! (Github commit) I was kicking myself at the waste of time. On the other hand, I learned a few things the hard way.
In Unix, calling
open() returns a file descriptor that the kernel chooses.
The descriptor is still open in the child process after
And this must be the case; otherwise you couldn't properly set up shell
But if the child program uses a hard-coded file descriptor that happens to be
the same as one the kernel assigned (e.g.
exec 6>&1), there's an inherent
conflict and your program will misbehave.
And, as I found, this misbehavior can be very hard to diagnose.
There's more I should understand about this bug. I don't understand why shell programs use hard-coded file descriptors. This seems like a bad idea because it's an implicit part of the "calling contract" that's not documented.
I know that the FD_CLOEXEC flag on file descriptors causes them to be closed
exec(). Unix shells, including OSH, use this flag to "save" file
descriptors in the table, generally at indices greater than
Nevertheless, I'm still confused about this shell programming technique. I've encountered it in both debootstrap and the shell scripts that autoconf generates.
Please leave a comment if you use or have seen this technique!
This post described the last bug I fixed. It also happened to be the one that enabled me to run the first significant shell programs I didn't write.
This was toward the end of a long period of filling out the shell runtime, which I mentioned in Working Toward an OSH Release. The next few posts will go back in time and document the beginning of this period.
configureshell script from
configure.ac, which contains hand-written M4 macros. . Then
configureis distributed with the source tarball.
./configure. It inspects system values, generates a
config.statusshell script, and then runs it. This typically produces
Makefile. In Python's case, it also files like
See the autoconf Wikipedia article for a nice diagram.