Dev Log #7: Hollowing Out the Python Interpreter

Success with Aboriginal, Alpine, and Debian Linux (January 2018). OSH can run real, hairy shell scripts. If I hadn't achieved this milestone in 21 months, I might have abandoned the project! Oil is a big project, and I'm feeling that right now.
Running Bash Completion Scripts with OSH (October 2018). This is the latest "user-facing" milestone. I discovered that completion scripts are written in a different shell dialect than the difficult Linux scripts above are. OSH can run parts of completion scripts, but there's still more work to do.

Project Roadmap

Why is this post #7? I wrote Roadmap #5 last year, but I mentioned a new roadmap in August.

So I skipped #6. And I don't think there will be more roadmap posts: they got us to the January milestone, but the project now has more design choices that can't be planned out ahead of time.

Unpublished Blog Post Drafts

I've only published 4 posts since the summer. These drafts are unpublished:

The rest of the lexer series. Unfortunately I'm abandoning this for now. I was eager to publish it because several language implementers told me that they changed their lexers based on the first two posts. And Jacquin at RC read the draft and got something out of it.

But I have to cut something, and this isn't essential to completing the shell.
Book Reviews. I mentioned this idea while at Recurse Center. These drafts are newer than the lexer series, so I hope to publish them. I want to introduce some higher-level ideas in the blog.

In my mind, shell isn't just a program you type commands into. It's also about distributed software architecture. I addressed that in this lobste.rs comment on "systems" programming.
Shell Implementation Difficulties. An outline of the hardest problems I've solved. These include:
- Word evaluation: splitting, joining, and globbing while respecting quotes and arrays took more than one try to get right. (This complexity will be gone in the Oil language.)
- Maintaining consistency (and inconsistency) between statically-parsed vs. dynamically parsed variants of the same sublanguage. For example, echo -e '\n' vs. $'\n'.
- Parallel processes and the Waiter(): this is a nice abstraction that solves a tricky problem.
- Managing file descriptors, which are global process state.
- Still not done yet: correctly handling signals and interactive completion.
A Status Update which addressed:
- A sketch of a new Oil VM, based on my experience with reusing CPython. The CPython cleanup work I recently finished — and describe below — will motivate this future work.
- A review of Oil "carrots". Why should anyone use OSH over bash?

Recent Releases: 0.6.pre6, pre7, and pre8

I've followed through on releasing more often. They're all pre-releases because I want to make version 0.6.0 a release that's meaningful to users.

Contributors

okay zed and I implemented $PS1 expansion. PS1 is a special variable that specifies the interactive prompt. See notes below.
okay zed and I fixed the semantics of cd -L and cd -P. The difference has to do with symlinks.
Brayden Banks implemented ${!myarray[@]} to get the keys of an array. (This means something completely different than ${!mystring}, which has also been a problem lately.)

Notes on $PS1

Describing the semantics of $PS1 would be a good episode of #shell-the-bad-parts. (Or really "bash-the-bad-parts", because the issue is bash-specific.)

There are two separate steps to the expansion:

Substitute special \ codes. For example, \h becomes your hostname. This is described in the bash manual.
Not described: Escape $ as \$ in the value of any variables. This is necessary because the resulting string is parsed as code and re-evaluated, so that variables like ${debian_chroot} can be expanded.

Instead, bash should work like this:

${hostname} should evaluate to the host name in the context of $PS1. There are no special \ codes.

Then there would be a single phase of expansion, and a single syntax for substitution.
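
To make the two phases concrete, here is a minimal Python sketch of the current behavior. It is an illustration under simplifying assumptions, not bash's or OSH's actual code: the function names are hypothetical, phase 1 only handles single-letter codes, and phase 2 only handles the ${name} form.

```python
import re


def substitute_codes(ps1, values):
    """Phase 1: replace backslash codes such as \\h with their values.

    Any '$' inside a substituted value is escaped as '\\$' so that
    phase 2 does not treat it as the start of a substitution.
    """
    def repl(match):
        code = match.group(1)
        if code not in values:
            return match.group(0)  # leave unknown codes untouched
        return values[code].replace("$", "\\$")
    return re.sub(r"\\([A-Za-z])", repl, ps1)


def expand_vars(text, env):
    """Phase 2: expand ${name}, and turn an escaped '\\$' back into '$'."""
    def repl(match):
        if match.group(1) is None:  # matched the escaped-dollar alternative
            return "$"
        return env.get(match.group(1), "")
    return re.sub(r"\\\$|\$\{(\w+)\}", repl, text)


def render_prompt(ps1, values, env):
    """Run both phases in order, as described above."""
    return expand_vars(substitute_codes(ps1, values), env)


if __name__ == "__main__":
    # A working directory whose name happens to contain '${x}'.
    print(render_prompt(r"${debian_chroot}\h:\w> ",
                        {"h": "box", "w": "/tmp/${x}"},
                        {"debian_chroot": "(chroot)", "x": "oops"}))
```

Without the escaping in phase 1, the directory name would be re-expanded in phase 2 and the prompt would show /tmp/oops instead of the real path. A single-phase design avoids the need for this step.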

Hollowing out the Python Interpreter

Most other changes in these three releases have been under the hood.

Oil ships with a subset of the CPython interpreter, and I went to great lengths to cut down its size. In effect, I was auditing which parts of CPython we use.

Building Oil with the OPy Bytecode Compiler gives background on this strategy, and motivates it.

Source Metrics

Let's explain the slimming-down of Oil through metrics. As usual, there's slightly more code in the core, to implement new features:

src for 0.6.pre5: 18,586 lines of Python
src for 0.6.pre8: 19,149 lines of Python

But we depend on far less of the Python standard library:

pydeps for 0.6.pre5: 29,228 lines of Python
pydeps for 0.6.pre8: 23,306 lines of Python

The runpy.py module is an unnecessary CPython implementation detail, and I replaced it with some C code. This broke many dependencies on the standard library, so we're shipping ~23K lines of Python total (including the 19K lines above, not in addition to them).

This is a great result! OSH can run real shell scripts, but it's still a small program.

Native Code Metrics

Removing Python source also lets us remove native C code that implements certain Python features.

I reduced the lines of native dependencies by ~9K lines:

native-deps for 0.6.pre5: 139,210 lines of C
native-deps for 0.6.pre8: 130,344 lines of C

This still feels too big, but it doesn't tell the whole story. Due to build changes in this release, many of those ~130K lines are no longer used.

The larger drop in compiled code size shows this:

ovm-build for 0.6.pre5: 1,356,344 bytes of native code (under GCC)
ovm-build for 0.6.pre8: 946,520 bytes of native code (under GCC)

Here's another view with Bloaty McBloatFace, a nice tool that accounts for code size in ELF files:

native-code/overview for 0.6.pre6: 3,590 C symbols
native-code/overview for 0.6.pre8: 2,514 C symbols

The reduction has two components:

Removing unused CPython files. I removed thousands of lines of code that implemented the newer .format() method, and thousands of lines of code that implement floating point to string conversions.
Filtering CPython functions and methods. The next section describes the unusual technique I used to do this. It was made possible by CPython's regular code structure.

(I also enabled the compiler flags for removing unused code, e.g. --gc-sections. I'm not sure why this isn't done by default in GCC and Clang.)

Filtering CPython Functions and Methods

I wrote a recursive-descent parser for PyMethodDef declarations, which look like this:

static PyMethodDef marshal_methods[] = {
  {"dump", marshal_dump, METH_VARARGS, dump_doc},
  {"load", marshal_load, METH_O,       load_doc},
  ...
};

There's at least one of these structures for every object like stringobject.c and every module like posixmodule.c.

After parsing these definitions, I re-printed them in build/oil-defs/ with two filters:

Omit methods that aren't used. A coarse heuristic got rid of most methods, and then I did some manual work and encoded it in build/cpython_defs.py.
Omit all docstrings, which removes hundreds of strings from the build.
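
The parse-filter-reprint idea can be sketched in a few lines of Python. This is not the code in build/cpython_defs.py: it swaps the recursive-descent parser for a regular expression, handles only the simple four-field entry shape shown above (no casts or preprocessor conditionals), and all names in it are hypothetical.

```python
import re

# Matches one entry of the form {"name", c_function, FLAGS, doc_symbol}.
ENTRY_RE = re.compile(
    r'\{\s*"(\w+)"\s*,\s*(\w+)\s*,\s*([\w| ]+?)\s*,\s*(\w+)\s*\}')


def parse_method_defs(c_source):
    """Return a list of (name, c_function, flags, doc) tuples."""
    return [m.groups() for m in ENTRY_RE.finditer(c_source)]


def print_filtered(array_name, entries, used):
    """Re-print the array, keeping only methods in 'used' and dropping docs."""
    lines = ["static PyMethodDef %s[] = {" % array_name]
    for name, c_function, flags, _doc in entries:
        if name in used:
            # Omitting the fourth field leaves the docstring pointer NULL in C.
            lines.append('  {"%s", %s, %s},' % (name, c_function, flags))
    # CPython expects a NULL sentinel entry; its exact spelling here is
    # an assumption of this sketch.
    lines.append("  {NULL, NULL},")
    lines.append("};")
    return "\n".join(lines)
```

Calling print_filtered("marshal_methods", entries, {"load"}) on the parsed example above would keep only the load entry, without load_doc.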

Some metrics about the filtered methods:

cpython-defs/overview for 0.6.pre8

OSH 0.6.pre5 used about 431 methods, and this process stripped it down to 128 methods. Most methods that we use are in posixmodule.c, e.g. posix.chdir(), which makes sense for a shell!

Since Oil still passes all its tests, and our test coverage is very high, I'm sure that all this code was unused. And I'm now more convinced that it's possible to write a small Python interpreter for Oil, which I'm calling OVM2. More on that later.

Reducing the Number of Unique Bytecodes Used

The file ceval.c in CPython is the core of the bytecode interpreter. It dispatches about 120 bytecodes like:

LOAD_FAST to load local variables
BINARY_ADD to do arithmetic on numbers, or concatenate strings

I reduced Oil's usage of certain bytecodes, so that we're using a simpler subset of Python. This was done in two ways:

Removing standard library modules, which I mentioned above. They use features like import *, implemented with the IMPORT_STAR bytecode, but Oil doesn't.
Small rewrites of Python code. For example, I rewrote 2 generator expressions as list comprehensions, which removed the only 2 instances of closures in Oil (LOAD_CLOSURE and MAKE_CLOSURE). I generally prefer explicit state with classes over closures.
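
As an illustration of that kind of rewrite, here is a hypothetical before/after pair (not taken from Oil's source). Under Python 2, a generator expression that refers to a local variable of the enclosing function is compiled as a nested code object with a closure, while a list comprehension is compiled inline. On Python 3 before 3.12, list comprehensions also get their own code object, so the bytecode difference is specific to the Python 2 setting described here.

```python
def add_prefix_gen(names, prefix):
    # 'prefix' is a free variable inside the generator expression,
    # so Python 2 builds a closure for it.
    return list(prefix + n for n in names)


def add_prefix_list(names, prefix):
    # Same result; in Python 2 this loop is compiled inline, with no closure.
    return [prefix + n for n in names]
```

Both functions return the same list; only the bytecode used to produce it differs.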

Another example was rewriting the exponentiation operator in shell with multiplication:

$ echo $(( 2 ** 30))  # 1 GB
1073741824

It used to rely on Python's ** operator (the BINARY_POWER bytecode).

However, Python's implementation accepts arbitrary floats, while shell has only integers. Exponentiation on floating point numbers depends on thousands of lines of C code, so we can remove all of that.
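
A minimal sketch of integer-only exponentiation by repeated multiplication follows. The name int_pow is hypothetical, and rejecting negative exponents is an assumption of this sketch rather than a description of Oil's code.

```python
def int_pow(base, exponent):
    """Compute base ** exponent for integers using only multiplication."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1
    remaining = exponent
    while remaining > 0:
        result *= base
        remaining -= 1
    return result


if __name__ == "__main__":
    print(int_pow(2, 30))  # prints 1073741824
```

The loop is linear in the exponent, which is fine for the small exponents typical of shell arithmetic. Exponentiation by squaring would be the usual choice if large exponents mattered.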

Metrics:

oil-with-opy for 0.6.pre5: 88 unique opcodes
oil-with-opy for 0.6.pre8: 80 unique opcodes

I've further reduced this number below 80 on the master branch.
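
As a rough illustration of what "unique opcodes" measures, the following sketch counts them with the standard dis module on Python 3. The numbers above come from a different toolchain (OPy and Python 2 bytecode), so opcode names and counts from this sketch are not comparable to them; unique_opcodes is a hypothetical helper.

```python
import dis
import types


def unique_opcodes(code):
    """Return the set of opcode names used by a code object and the
    code objects nested inside it (functions, lambdas, comprehensions)."""
    names = set()
    for instr in dis.get_instructions(code):
        names.add(instr.opname)
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names |= unique_opcodes(const)
    return names


if __name__ == "__main__":
    source = "def f(x):\n    return [y * 2 for y in x]\n"
    module_code = compile(source, "<example>", "exec")
    print(sorted(unique_opcodes(module_code)))
```

Running this over every module in an application and taking the union of the sets gives a single "unique opcodes" figure.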

(Related reading: Floating Point to Decimal Conversion is Easy. Python's implementation doesn't look easy! Again, I removed all of this code.)

Reducing Bytecode Size

Not only did I reduce the native code size and number of unique bytecodes, I also reduced the total bytecode size:

ovm-build for 0.6.pre5: 1,900,546 bytes of non-native code and data
ovm-build for 0.6.pre8: 1,335,664 bytes of non-native code and data

This reflects the smaller amount of Python source described above, as well as removing Python docstrings.

Some Python docstrings are specified in C, and some are specified in Python. I removed the latter ones by adding an -emit-docstring=0 flag to the OPy bytecode compiler.

(NOTE: The "bytecode size" column in the ovm-build metrics should be "architecture-independent files", as there are data files and .py source files in the app bundle.)

Praise for R's Tidyverse

There are many tables of metrics in the links above. I should mention that they're manipulated and analyzed with the R language, and in particular the dplyr library.

dplyr provides general data manipulation "verbs" on data frames and is part of the Tidyverse by Hadley Wickham.

In this release, I expanded my usage to include stringr, a related library for string manipulation. Surprisingly, it has some string manipulation concepts that may be useful in shell and the Oil language. Hadley's libraries are an inspiration in composable API design.

I will write more about this later.

Summary, and What's Next?

This post was a "catch-up" on several topics, but here's a concise takeaway:

Oil is now more like a C program, and less like a Python program.

I removed many parts of CPython that we're not using, both in the interpreter itself and in the standard library.

This gives me confidence that Oil can be moved to a VM that we control, which I'm calling OVM2.

There is still a lot to talk about, but I'll leave it for Dev Log #8:

Two Approaches to Shell Completion
Progress on the OPy bytecode compiler and OVM2
Research "Side Projects"
Recent Correspondence on Shell and Oil

Appendix A: Details on CPython Surgery

Removed small modules pystrcmp.c, pymath.c.
Replaced socket.gethostname() with our own copy. It's needed for \h in $PS1.
Used xrange() everywhere so we can remove range(). This is probably overly aggressive, but I really want to simplify OVM and eventually reimplement it.
Removed the os module in favor of the lower-level posix module. This removes a compatibility layer in Python that's unnecessary for a Unix shell.
- Functions like os.chdir() are now posix.chdir(). The posix module is implemented in posixmodule.c, and os is a thin wrapper around it on Unix.
Removed dtoa.c and pystrtod.c, which convert floats to strings and back.
- This means I had to move a single usage of %.3f from Python to C. Python's % is more general than C's printf(), but we don't need all of its power in Oil.
- This also required removing the ** exponentiation operator from Python (mentioned above).
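
For the os-to-posix item in the list above, a small sketch of calling the lower-level module directly. It assumes a Unix system, where posix is a built-in module; change_dir is a hypothetical helper, not a function in Oil.

```python
import posix  # built-in on Unix; not available on Windows


def change_dir(path):
    """Change the working directory via posix and return the new one."""
    posix.chdir(path)
    return posix.getcwd()


if __name__ == "__main__":
    print(change_dir("/"))  # prints /
```

On Unix, os.chdir and posix.chdir refer to the same underlying function, because os re-exports the names defined in posix.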

Already mentioned above:

Removed thousands of lines of code that implement the newer style of string formatting in Python 2, e.g. 'foo = {}'.format(foo).
Removed the runpy module and its many dependencies.
Removed docstrings in both Python and C.
Removed the os module in favor of the underlying posix module.
Filtered PyMethodDef in dozens of objects and modules, like listobject.c, import.c, and posixmodule.c.