OSH Runs Real Shell Programs

2017-07-02

I just fixed the hardest bug ever in OSH, and now it can run significant shell scripts that I didn't write:

Python's autogenerated 16,000-line configure script.
The related 1400-line config.status script.

(Appendix A explains how autoconf generates these files.)

I ran these programs with both bash and OSH and compared the respective output files. They're essentially identical: the handful of one-line differences appear to be due to autoconf checking $BASH_VERSION for optional behavior.

Although configure scripts use a quirky subset of POSIX shell, as opposed to the full bash language, I'm happy to reach this milestone.

At the end of How I Use Tests, I mentioned that I want to write a post called The Long Slog Through a Shell. It should give you a sense of how I filled out the OSH runtime.

But since this last bug happened to be the hardest one ever, and the one that let me run real shell programs, it's worth writing about.

Spoiler: I fixed it by simply calling close() on a couple files. It was a silly bug with a trivial fix. But that's unfortunately typical of bugs that cost a lot of time!

The Trickiest Bug

Under OSH, but not bash, the execution of config.status stopped with no error message at the following line:

if sed "s/$ac_cr//" < /dev/null > /dev/null 2>&1; then  
  ...

This code redirects stdin, stdout, and stderr of sed to /dev/null, so I suspected that execution didn't actually halt there. Rather, the script continued executing with no output.

Before explaining the fix, here is why this bug was tricky:

(1) The obvious suspect was innocent. A shell takes three steps to execute something like echo hi 1>&2:

Apply the 1>&2 redirect to the process state, using the dup2() and fcntl() system calls. These modify the file descriptor table of the process.
Execute echo hi.
Restore the file descriptor table.

The observed behavior could be easily explained by a bug in step 3. If I didn't restore the state correctly, then all output would continue to go to /dev/null. I was convinced that the bug was in this code, and I spent a lot of time instrumenting it and mentally tracing it.

But the bug was in a totally different place. You could say there was "action at a distance".

(2) It broke my debugging tools. To debug the redirect code, I added print statements, as I usually do. But this bug was inherently about broken print statements! The script unexpectedly stopped printing all messages to the console — both the program's output and my own debug print statements!

(3) Debugging the bug changed the bug. To work around problem #2, I tried opening separate disk file for debug messages. But this triggered an unhandled exception in a different part of the program. I still don't understand exactly what happened. But opening a file changes the file descriptor table, so it's not too surprising.

I'm reminded of James Mickens' hilarious article The Night Watch (PDF), which just popped up again on Reddit:

I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS.

(4) My attempts to meaningfully reduce the test case failed. I cut the 1388-line script down to a 10-line script that contained the last statement, but the smaller script failed to reproduce the bug.

Attempting to read 1388 lines of shell generated by autoconf is not fun!

The only saving grace is that the bug was completely deterministic. (This reminds me that I hit a tough non-deterministic bytecode generation bug while developing OPy.)

So I was painstakingly tracing through the wrong code. And not being able to log any messages slowed me down a lot. I got stuck and moved onto other things for awhile.

Solution

After many failed debugging sessions, I had the idea to look in the /proc/$PID/fd/ directory. I don't recall using this much, but it now seems essential. I saw that extra files were open:

$ ls -l /proc/29405/fd
total 0
lrwx------ 1 andy andy 64 Jul  1 18:37 0 -> /dev/pts/25
lrwx------ 1 andy andy 64 Jul  1 18:37 1 -> /dev/pts/25
lrwx------ 1 andy andy 64 Jul  1 18:37 2 -> /dev/pts/25
lr-x------ 1 andy andy 64 Jul  1 18:37 5 -> /home/andy/git/oil/core/runtime.asdl
lr-x------ 1 andy andy 64 Jul  1 18:37 6 -> /home/andy/git/oil/osh/osh.asdl
lr-x------ 1 andy andy 64 Jul  1 18:37 8 -> /dev/urandom

Oops, why are descriptors 5 and 6 open? They're connected to the two ASDL schemas I'm using.

(Oil's ASDL implementation generates classes at startup time from an .asdl schema, so we need to open those two files.)

I also remembered that one line in config.status was exec 6>&1, so this conflict on descriptor 6 looked like the culprit.

I tried simply closing the files, and the problem went away! (Github commit) I was kicking myself at the waste of time. On the other hand, I learned a few things the hard way.

Analysis and Takeaway

In Unix, calling open() returns a file descriptor that the kernel chooses. The descriptor is still open in the child process after fork() and exec(). And this must be the case; otherwise you couldn't properly set up shell pipelines.

But if the child program uses a hard-coded file descriptor that happens to be the same as one the kernel assigned (e.g. exec 6>&1), there's an inherent conflict and your program will misbehave.

And, as I found, this misbehavior can be very hard to diagnose.

There's more I should understand about this bug. I don't understand why shell programs use hard-coded file descriptors. This seems like a bad idea because it's an implicit part of the "calling contract" that's not documented.

I know that the FD_CLOEXEC flag on file descriptors causes them to be closed upon exec(). Unix shells, including OSH, use this flag to "save" file descriptors in the table, generally at indices greater than 10.

Nevertheless, I'm still confused about this shell programming technique. I've encountered it in both debootstrap and the shell scripts that autoconf generates.

Please leave a comment if you use or have seen this technique!

Conclusion

This post described the last bug I fixed. It also happened to be the one that enabled me to run the first significant shell programs I didn't write.

This was toward the end of a long period of filling out the shell runtime, which I mentioned in Working Toward an OSH Release. The next few posts will go back in time and document the beginning of this period.

Appendix A: How GNU Autoconf Works

On the developer machine, autoconf generates a portable configure shell script from configure.ac, which contains hand-written M4 macros. . Then configure is distributed with the source tarball.
Users unpack the source tarball on their machines and run ./configure. It inspects system values, generates a config.status shell script, and then runs it. This typically produces config.h and a Makefile. In Python's case, it also files like Modules/Setup.config.

See the autoconf Wikipedia article for a nice diagram.