Three months ago, in Roadmap #5, I wrote that OSH will be a better shell for building Linux distributions. It will run existing code, including bash scripts, but it's stricter and easier to debug.
In the last month, I've made significant progress toward this goal. I fixed dozens of bugs, implemented new features, and simplified the codebase.
OSH can now run thousands of lines of shell scripts that build three distros: Aboriginal Linux, Alpine Linux, and Debian. This post describes what I did, and the technical work that was involved.
I haven't written about Linux distributions in a while. What happened?
OSH was able to run abuild -h back in October, but its parsing speed made
debugging sessions unpleasant. On a fast machine, it took more than 1600
milliseconds to parse abuild!
So I pushed two tasks onto the stack, for a total of three:
The two releases since October popped #3 and #2 off the stack:
Now OSH can parse abuild in about 250 milliseconds. That's still too slow, but it's not blocking progress.
I plan to release OSH 0.4 at the end of this month. It will be able to run not just abuild, but also shell scripts from Aboriginal Linux and Debian.
After that, the stack will be empty again. I had to shave some yaks, but I didn't lose sight of the goal!
I didn't understand how Linux distros worked until pretty recently. It's useful to think of them as having (at least) these four components:
apt; for Red Hat-based
systems (CentOS, Fedora, etc.), it's yum.
I like the diversity of the three distros I worked with, because it gives me confidence that OSH is correct:
So not only am I testing shell scripts by different authors, I'm also testing OSH for compatibility with scripts written for different shell dialects.
More background and detail on what I did:
(1) Debian
.deb packages, which,
roughly speaking, are tarballs of binaries, scripts, and metadata. I parsed
debootstrap with OSH back in October 2016.
(2) Alpine Linux
.apk packages and metadata.
Here is an excerpt.
.apk packages with abuild running under OSH-musl.
abuild verify to check that the packages looked reasonable.
(3) Aboriginal Linux
i686 target, which builds a complete
system from source. In contrast, debootstrap assembles an image from
binary packages. (I haven't yet run Debian's package build system, which is
based on GNU make.) I booted the resulting image in QEMU and got a
shell prompt!
In summary, I tested OSH on a diverse set of shell scripts found in the wild, and did whatever was necessary to make them run.
I started this process after the last release, and I honestly didn't know how long it would take. There were more problems than I expected, but I was also able to fix them more quickly than expected!
This section describes the holes I filled in to make these scripts work.
Some errors I ran into had obvious causes. For example, OSH would throw
NotImplementedError when it encountered ${s:1:2} (string slicing).
Implementing slicing and getting past the error was easy.
However, other errors required debugging thousands of lines of other people's shell scripts.
This motivated me to learn more about bash and debugging. In particular, this
tip on making xtrace useful by setting $PS4 helped me figure
out where scripts were going wrong.
I implemented these debugging features in OSH:
set -x / xtrace, with $PS4 support.
$SHELLOPTS, so you can inherit xtrace. Shell scripts
often invoke other shell scripts, and there needs to be a way to preserve
-x.
PS4 string: $LINENO, and my own
$SOURCE_NAME.
A recurring theme was relaxing OSH's strict behavior in order to accommodate
common shell usage. However, I added the ability to opt in to the strict
behavior. I added set -o strict-control-flow, strict-array, and
strict-errexit. I'll address this topic in another blog post.
POSIX has quirky rules for the $IFS variable, which determines both how words
are split and how the read builtin splits fields.
I rewrote the crappy regex-based version of IFS-splitting with an explicit state machine. This is an interesting piece of code which I may explain in another blog post. It's in core/legacy.py. It turned a lot of red tests green.
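To give a flavor of what "an explicit state machine" means here, below is a minimal sketch in Python. It is not the code in core/legacy.py: the function name `ifs_split` and the three states are my own illustration, and it assumes the POSIX field-splitting rules for unquoted expansions while ignoring quoting, backslashes, and the extra rules of the read builtin. IFS whitespace is collapsed and trimmed, while any other IFS character is a "hard" delimiter that can produce empty fields.

```python
def ifs_split(s, ifs=" \t\n"):
    """Split s into fields roughly as an unquoted shell expansion would.

    Illustrative sketch only: no quoting or backslash handling.
    """
    # IFS whitespace collapses; every other IFS character is a hard delimiter.
    ws = {c for c in ifs if c in " \t\n"}
    hard = set(ifs) - ws

    # START:    no field in progress (start of input, or just after a hard
    #           delimiter); another hard delimiter yields an empty field.
    # IN_FIELD: accumulating ordinary characters.
    # AFTER_WS: a field was just ended by whitespace; one following hard
    #           delimiter is absorbed into the same separator.
    START, IN_FIELD, AFTER_WS = range(3)
    state = START
    fields = []
    buf = []

    for ch in s:
        if ch in ws:
            if state == IN_FIELD:
                fields.append("".join(buf))
                buf = []
                state = AFTER_WS
            # In START or AFTER_WS, extra whitespace is simply skipped.
        elif ch in hard:
            if state == IN_FIELD:
                fields.append("".join(buf))
                buf = []
            elif state == START:
                fields.append("")  # e.g. the empty field in "a::b"
            state = START
        else:
            buf.append(ch)
            state = IN_FIELD

    if state == IN_FIELD:
        fields.append("".join(buf))
    return fields
```

With this sketch, `ifs_split("a::b", ":")` keeps the empty middle field, while `ifs_split(" a : b ", " :")` treats each run of whitespace plus the colon as a single separator.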
echo -e '1\n2' and echo $'1\n2' both print the lines 1 and 2. Their
relationship is the same as the relationship between [ and
[[ — the former is dynamically parsed, and the
latter is statically parsed. For example, dynamic parsing allows this:
char=n; echo -e "1\\${char}2", but static parsing doesn't.
I implemented these with similar but not identical lexers. Metaprogramming let me avoid duplication.
This is another feature that touches on some computer science. Originally, I
translated globs to Python regexes, in order to take advantage of Python's
non-greedy matching, e.g. the expression ${x%%*suffix} could be implemented
with the regex .*?(suffix).
However, abuild uses character classes in globs, e.g. ${i%%[<>=]*}
which isn't easy to translate reliably.
So instead I had to implement these operators using a linear number of
calls to fnmatch(), which makes the overall algorithm quadratic. If
fnmatch() isn't linear in the worst case, which it often isn't,
then the algorithm could be cubic.
However, this issue doesn't appear to arise in practice, as all shells use this slow algorithm. Strings are generally short.
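Here is a rough sketch of the "linear number of fnmatch() calls" idea for the suffix operators. It is not OSH's actual code: `strip_suffix` is a hypothetical name, and it uses Python's fnmatch.fnmatchcase(), which differs from libc's fnmatch() in some details (for example, it has no POSIX character classes like [[:alpha:]]). Each candidate suffix is tested in turn, longest first for %% and shortest first for %.

```python
import fnmatch


def strip_suffix(s, pattern, greedy):
    """Sketch of ${s%pattern} (greedy=False) and ${s%%pattern} (greedy=True)."""
    n = len(s)
    # The suffix s[i:] is longest at i == 0 and empty at i == n.
    if greedy:
        starts = range(0, n + 1)    # try the longest suffix first
    else:
        starts = range(n, -1, -1)   # try the shortest suffix first
    for i in starts:
        # One fnmatch call per candidate suffix: up to n + 1 calls in total.
        if fnmatch.fnmatchcase(s[i:], pattern):
            return s[:i]
    return s  # no suffix matched; the value is unchanged
```

For the abuild-style example, `strip_suffix("foo>=1.0", "[<>=]*", greedy=True)` tries the suffixes starting at each index until it reaches `>=1.0`, leaving `foo`.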
There were several other minor features to implement. In most cases, I had already done the hard part: representing code with the lossless syntax tree. The implementation often "falls out" after choosing a good representation.
${s:1:2} and ${a[@]:1:2}.
diff <(sort left.txt) <(sort right.txt). This
feature is inherently flaky because it doesn't wait() on the forked
process, and it didn't set $! until bash 4.4.
type builtin without -t. abuild unfortunately matches
the output of type with a regex.
test builtin:
-L and -h are aliases to check if a file is a symlink.
[ -t 1 ] to check if stdout is a TTY. There is no color in
abuild without this!
-nt and -ot to compare timestamps on files.
Reimplementing these shell quirks was both fun and depressing. As penance, I've been maintaining a wiki page of Shell WTFs (which is not well-organized).
I could blog every day about one of these and not be done for months. But I remind myself that my main goal is to improve shell with the Oil language, not dwell on the past. Legacy behavior is only useful as far as it gives people an upgrade path to Oil.
As far as I know, a shell must handle file descriptors differently than any other Unix program. It can't open any files in the descriptor range 3-9, because shell scripts may use them directly.
source'd scripts are now moved out of the way with
dup2() immediately after opening them.
echo hi 6>&1, which debootstrap
uses.
I used the /proc/$$/fd/ mechanism I mentioned in OSH Runs Real Shell
Programs to debug these problems. It's a very useful
way of showing the file descriptor state of a process.
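The idea of keeping the shell's own files out of the descriptor range that scripts may use can be sketched in a few lines of Python. This is an illustration, not OSH's code: the name `open_above_user_range` and the threshold of 10 are assumptions, and it uses fcntl's F_DUPFD (which picks the lowest free descriptor at or above a minimum) rather than the dup2() call mentioned above.

```python
import fcntl
import os


def open_above_user_range(path, min_fd=10):
    """Open path read-only on a descriptor >= min_fd (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY)
    if fd >= min_fd:
        return fd
    # Duplicate onto the lowest free descriptor >= min_fd, then release the
    # low-numbered one so that scripts can still use descriptors 3-9.
    new_fd = fcntl.fcntl(fd, fcntl.F_DUPFD, min_fd)
    os.close(fd)
    return new_fd
```

After a call like this, a script's own redirect such as 6>&1 can't collide with the descriptor the shell is reading the script from.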
In The Riskiest Part of the Project, I mentioned several difficulties with using CPython to write a Unix shell.
I encountered another problem: Python does its own buffering of file I/O. I believe this is on top of libc's buffering, although I haven't looked into it deeply.
sys.stdout.flush() is required after type; otherwise $() may be
incorrectly evaluated. Hat tip to timetoplatypus for mentioning this with
respect to the dirs builtin.
read builtin can't use Python's f.readline(). The descriptor that
underlies the sys.stdin file object changes whenever you redirect, which
interacts badly with buffering.
Instead, I have to read a byte at a time from file descriptor 0. This seems
inefficient, but I noticed that dash, mksh, and zsh also do this
(in C).
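A minimal sketch of the byte-at-a-time approach, assuming only os.read() on a raw descriptor (the function name `read_line_unbuffered` is mine, not OSH's). Because it never reads past the newline, whatever follows stays in the kernel's pipe or file offset, where the next command or child process can still see it.

```python
import os


def read_line_unbuffered(fd=0):
    """Read one line from fd, one byte per read() call (illustrative sketch)."""
    chunks = []
    while True:
        b = os.read(fd, 1)   # bypasses Python's file-object buffering
        if not b:            # empty bytes means EOF
            break
        chunks.append(b)
        if b == b"\n":
            break
    return b"".join(chunks)
```

A buffered f.readline() might instead pull a whole block of input into the process, so a later reader of the same descriptor would miss those bytes.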
&& and ||. Confusingly, they have equal precedence
in the command language, but the normal unequal precedence in the [[
expression language.
FOO=bar myfunc. Shells differ in behavior here!
${x/pat/replace} when x is undefined. (This case revealed a bug in
mksh.)
cd-ing away from a directory that's been removed.
readonly R; unset R should return 1 and respect errexit, not
unconditionally fail. Although I consider this a programming error,
errexit will be on by default in Oil. (It would also be nice to make this a
statically-detected error.)
I punted on a few things which weren't strictly necessary to build system images, or which had easy workarounds:
trap builtin is unimplemented; warnings are printed on stderr.
alias is also unimplemented. I changed a couple lines in
alpine-chroot-install. Trivia: bash is the only shell that doesn't
expand aliases by default; it requires shopt -s expand_aliases.
set -h / hashall is a stub that does nothing. This option is used by
Aboriginal and affects bash's $PATH cache, which I don't yet understand.
Also note that these OSH builds are in a sense "shallow". I changed the
shebang lines of thousands of lines of top-level scripts, but they often invoke
more shell scripts with a #!/bin/bash or #!/bin/sh shebang line.
For example, building any Linux distro will require running dozens of
configure scripts. Fortunately, OSH can already run those.
As mentioned, the upcoming OSH 0.4 release will include all this work.
I also have several writing tasks on my TODO list:
nickpsecurity
brought an interesting paper to my attention, and I followed the citations
and read two more papers. I responded in comments on lobste.rs
and reddit. There is more to say about them!
It would also be nice to get oil-dev@ going again. If you're interested in
contributing, e-mail me or leave a comment.