
Dev Log #7: Hollowing Out the Python Interpreter

2018-11-15

I haven't published a post in several weeks, but the Oil project continues!

This post started as a free-form project update, but it ended up focusing on recent work to break free of the CPython interpreter.

Let's review the project status before getting into those details.

Table of Contents
Brief Project Recap
Project Roadmap
Unpublished Blog Post Drafts
Recent Releases: 0.6.pre6, pre7, and pre8
Contributors
Notes on $PS1
Hollowing out the Python Interpreter
Source Metrics
Native Code Metrics
Filtering CPython Functions and Methods
Reducing the Number of Unique Bytecodes Used
Reducing Bytecode Size
Praise for R's Tidyverse
Summary, and What's Next?
Appendix A: Details on CPython Surgery

Brief Project Recap

After the FAQ, these two posts summarize the current state of OSH:

Project Roadmap

Why is this post #7? I wrote Roadmap #5 last year, but I mentioned a new roadmap in August.

So I skipped #6. And I don't think there will be more roadmap posts: they got us to the January milestone, but the project now has more design choices that can't be planned out ahead of time.

Unpublished Blog Post Drafts

I've only published 4 posts since the summer. These drafts are unpublished:

Recent Releases: 0.6.pre6, pre7, and pre8

I've followed through on releasing more often. They're all pre-releases because I want to make version 0.6.0 a release that's meaningful to users.

Contributors

Notes on $PS1

Describing the semantics of $PS1 would be a good episode of #shell-the-bad-parts. (Or really "bash-the-bad-parts", because the issue is bash-specific.)

There are two separate steps to the expansion:

Instead, bash should work like this:

Then there would be a single phase of expansion, and a single syntax for substitution.

Hollowing out the Python Interpreter

Most other changes in these three releases have been under the hood.

Oil ships with a subset of the CPython interpreter, and I went to great lengths to cut down its size. In effect, I was auditing which parts of CPython we use.

Building Oil with the OPy Bytecode Compiler gives background on this strategy, and motivates it.

Source Metrics

Let's explain the slimming-down of Oil through metrics. As usual, there's slightly more code in the core, to implement new features:

But we depend on far less of the Python standard library:

The runpy.py module is an unnecessary CPython implementation detail, and I replaced it with some C code. This broke many dependencies on the standard library, so we're shipping ~23K lines of Python total (including the 19K lines above, not in addition to them).

This is a great result! OSH can run real shell scripts, but it's still a small program.

Native Code Metrics

Removing Python source also lets us remove native C code that implements certain Python features.

I reduced the lines of native dependencies by ~9K lines:

This still feels too big, but it doesn't tell the whole story. Due to build changes in this release, many of those ~130K lines are no longer used.

The larger drop in compiled code size shows this:

Here's another view with Bloaty McBloatFace, a nice tool that accounts for code size in ELF files:

The reduction has two components:

  1. Removing unused CPython files. I removed thousands of lines of code that implement the newer .format() method, and thousands of lines of code that implement floating point to string conversions.
  2. Filtering CPython functions and methods. The next section describes the unusual technique I used to do this. It was made possible by CPython's regular code structure.

(I also enabled the compiler and linker flags for removing unused code, e.g. the linker's --gc-sections. I'm not sure why this isn't done by default in GCC and Clang.)

Filtering CPython Functions and Methods

I wrote a recursive-descent parser for PyMethodDef declarations, which look like this:

static PyMethodDef marshal_methods[] = {
  {"dump", marshal_dump, METH_VARARGS, dump_doc},
  {"load", marshal_load, METH_O,       load_doc},
  ...
};

There's at least one of these structures for every object like stringobject.c and every module like posixmodule.c.

After parsing these definitions, I re-printed them in build/oil-defs/ with two filters:

  1. Omit methods that aren't used. A coarse heuristic got rid of most methods, and then I did some manual work and encoded it in build/cpython_defs.py.
  2. Omit all docstrings, which removes hundreds of strings from the build.
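
To make the two filters concrete, here is a minimal Python sketch. It is not the code in build/cpython_defs.py: it swaps the recursive-descent parser for a regular expression, it only handles simple four-field entries like the ones shown above, and the names SAMPLE, parse_method_defs, and filter_method_defs are made up for illustration.

```python
import re

# Hypothetical input, modeled on the marshal example above plus the usual
# NULL sentinel that terminates a PyMethodDef array.
SAMPLE = '''
static PyMethodDef marshal_methods[] = {
  {"dump", marshal_dump, METH_VARARGS, dump_doc},
  {"load", marshal_load, METH_O,       load_doc},
  {NULL, NULL}
};
'''

# Matches only innermost brace pairs, i.e. the individual entries.
ENTRY_RE = re.compile(r'\{([^{}]*)\}')


def parse_method_defs(c_source):
    """Return a list of [name, func, flags, doc] field lists."""
    entries = []
    for match in ENTRY_RE.finditer(c_source):
        fields = [f.strip() for f in match.group(1).split(',')]
        if fields[0] == 'NULL':  # skip the sentinel entry
            continue
        entries.append(fields)
    return entries


def filter_method_defs(entries, used_names):
    """Re-print entries, keeping used methods and dropping the docstring."""
    kept = []
    for name, func, flags, _doc in entries:
        if name.strip('"') in used_names:
            # Leaving out the fourth initializer makes the doc pointer NULL.
            kept.append('  {%s, %s, %s},' % (name, func, flags))
    return kept


if __name__ == '__main__':
    for line in filter_method_defs(parse_method_defs(SAMPLE), {'dump'}):
        print(line)
```

A regex is enough for this toy input, but real declarations also contain casts, preprocessor conditionals, and OR-ed flags, which is presumably where a real parser earns its keep.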

Some metrics about the filtered methods:

OSH 0.6.pre5 used about 431 methods, and this process stripped it down to 128 methods. Most methods that we use are in posixmodule.c, e.g. posix.chdir(), which makes sense for a shell!

Since Oil still passes all its tests, and our test coverage is very high, I'm sure that all this code was unused. And I'm now more convinced that it's possible to write a small Python interpreter for Oil, which I'm calling OVM2. More on that later.

Reducing the Number of Unique Bytecodes Used

The file ceval.c in CPython is the core of the bytecode interpreter. It dispatches about 120 bytecodes like:

I reduced Oil's usage of certain bytecodes, so that we're using a simpler subset of Python. This was done in two ways:

  1. Removing standard library modules, which I mentioned above. They use features like import *, implemented with the IMPORT_STAR bytecode, but Oil doesn't.
  2. Small rewrites of Python code. For example, I rewrote 2 generator expressions as list comprehensions, which removes the only 2 instances of closures in Oil (LOAD_CLOSURE and MAKE_CLOSURE). I generally prefer explicit state with classes over closures.
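
One way to audit which bytecodes a piece of code uses is to walk its code objects with the standard dis module. The sketch below is an illustration, not Oil's tooling: it runs on Python 3, where opcode names differ from the Python 2.7 ones mentioned above, and the functions unique_opnames, scale_all, and scale_all_list are invented for the example. The two scale functions show the kind of rewrite described in item 2. Under Python 2.7, a generator expression that refers to a local of the enclosing function is compiled as a closure, while a list comprehension is compiled inline.

```python
import dis
import types


def unique_opnames(code):
    """Collect opcode names used by `code` and any nested code objects."""
    names = set()
    for instr in dis.get_instructions(code):
        names.add(instr.opname)
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names |= unique_opnames(const)
    return names


def scale_all(xs, k):
    # Generator expression that refers to k from the enclosing function.
    return sum(x * k for x in xs)


def scale_all_list(xs, k):
    # Same result, written as a list comprehension.
    return sum([x * k for x in xs])


if __name__ == '__main__':
    print(sorted(unique_opnames(scale_all.__code__)))
    print(sorted(unique_opnames(scale_all_list.__code__)))
```

The exact opcode sets printed depend on the Python version, so treat the output as a starting point for an audit rather than a fixed list.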

Another example was rewriting the exponentiation operator in shell with multiplication:

$ echo $(( 2 ** 30))  # 1 GB
1073741824

It used to rely on Python's ** operator (the BINARY_POWER bytecode).

However, Python's implementation accepts arbitrary floats, while shell has only integers. Exponentiation on floating point numbers depends on thousands of lines of C code, so we can remove all of that.
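
As a rough illustration, integer exponentiation needs nothing more than a loop of multiplications. This is a sketch of the idea, not necessarily the code in Oil; the name int_pow and the error message are assumptions.

```python
def int_pow(base, exponent):
    """Raise an integer to a non-negative integer power by multiplication."""
    if exponent < 0:
        # Shell arithmetic has only integers, so negative exponents are rejected.
        raise ValueError("exponent can't be negative")
    result = 1
    for _ in range(exponent):
        result *= base
    return result


if __name__ == '__main__':
    print(int_pow(2, 30))
```

Square-and-multiply would need fewer iterations, but for the exponents that show up in shell arithmetic the simple loop is easier to read and only uses the multiply bytecode.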

Metrics:

I've further reduced this number below 80 on the master branch.

(Related reading: Floating Point to Decimal Conversion is Easy. Python's implementation doesn't look easy! Again, I removed all of this code.)

Reducing Bytecode Size

Not only did I reduce the native code size and number of unique bytecodes, I also reduced the total bytecode size:

This reflects the smaller amount of Python source described above, as well as removing Python docstrings.

Some Python docstrings are specified in C, and some are specified in Python. I removed the latter ones by adding an -emit-docstring=0 flag to the OPy bytecode compiler.
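
To show what dropping Python-level docstrings amounts to, here is a small sketch that uses the standard ast module from Python 3. It is not how OPy implements its flag; strip_docstrings and SOURCE are hypothetical names used only for this example.

```python
import ast

# Hypothetical module source with one documented function.
SOURCE = '''
def chdir(path):
    """Change the current directory."""
    return path
'''


def strip_docstrings(source):
    """Parse `source` and return a module AST with docstrings removed."""
    tree = ast.parse(source)
    doc_owners = (ast.Module, ast.ClassDef, ast.FunctionDef,
                  ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if (isinstance(node, doc_owners) and
                ast.get_docstring(node, clean=False) is not None):
            # A docstring is always the first statement of the body.
            # Keep the body non-empty so the tree still compiles.
            node.body = node.body[1:] or [ast.Pass()]
    return ast.fix_missing_locations(tree)


namespace = {}
exec(compile(strip_docstrings(SOURCE), '<demo>', 'exec'), namespace)

if __name__ == '__main__':
    print(namespace['chdir'].__doc__)
```

With the docstring gone, the string constant is no longer stored in the compiled code object, which is where the bytecode size savings come from.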

(NOTE: The "bytecode size" column in the ovm-build metrics should be "architecture-independent files", as there are data files and .py source files in the app bundle.)

Praise for R's Tidyverse

There are many tables of metrics in the links above. I should mention that they're manipulated and analyzed with the R language, and in particular the dplyr library.

dplyr provides general data manipulation "verbs" on data frames and is part of the Tidyverse by Hadley Wickham.

In this release, I expanded my usage to include stringr, a related library for string manipulation. Surprisingly, it has some string manipulation concepts that may be useful in shell and the Oil language. Hadley's libraries are an inspiration in composable API design.

I will write more about this later.

Summary, and What's Next?

This post was a "catch-up" on several topics, but here's a concise takeaway:

Oil is now more like a C program, and less like a Python program.

I removed many parts of CPython that we're not using, both in the interpreter itself and in the standard library.

This gives me confidence that Oil can be moved to a VM that we control, which I'm calling OVM2.

There is still a lot to talk about, but I'll leave it for Dev Log #8:

Appendix A: Details on CPython Surgery

Already mentioned above: