Building Oil with the OPy Bytecode Compiler

2018-03-04 (Last updated 2019-06-17)

In the very first post on Oil, I explained why Oil is written in Python: to have a chance of getting it done! I want to implement not just the bash-compatible OSH dialect, but also the Oil language, and that's a lot of work.

Bash alone is ~142K lines of C code, while OSH is ~16K lines of Python as of the last release. (When all's said and done, it might turn out to be a 5-7x ratio rather than 9x, but that's still huge.)

Of course, there's a problem: Python is slower than C, and I wrote benchmarks to show that it matters. For example, the OSH parser is 40-50 times slower than the bash parser, even after some optimization.

So I'm now working on making it faster and smaller. My plan involves OPy, a Python bytecode compiler written in Python.

This post shows what I've done with OPy, recaps what I wrote about it last year, and maps out future work. If you've implemented a VM, and especially if you've modified CPython, I'd love your feedback in the comments.

Table of Contents

Release 0.5.alpha2

Benchmarks and Metrics

How Big Is OPy?

Recap of Last Year's Work

Appendix: FAQs About Python

Why Python 2?

Why not use PyPy / RPython?

Why not use MicroPython?

Why not use Cython?

Release 0.5.alpha2

I've released Oil 0.5.alpha2, which you can download here:

https://www.oilshell.org/download/oil-0.5.alpha2.tar.xz (gzip version)

It has the same features as OSH 0.4, but its bytecode is built with OPy rather than CPython.

Benchmarks and Metrics

OPy generates slightly different bytecode, but it appears that OSH is unaffected. The unit tests and spec tests pass, and these benchmark results are roughly the same:

(That is, 6-7 lines/ms on a slow machine and 13-14 lines/ms on a fast machine.)

However, the bytecode is about 15% larger (130,610 bytes):

0.5.alpha1 bytecode size: 866,303 bytes
0.5.alpha2 bytecode size: 996,913 bytes

I'm not sure why this is, but I'll look into it as I optimize for both size and speed.

How Big Is OPy?

oil/opy$ ./count.sh all

LEXER, PARSER GENERATOR, AND GRAMMAR
  ... snip ...
  579 pgen2/tokenize.py
  827 pytree.py
 2574 total

COMPILER2
   ... snip ...
   410 compiler2/symbols.py
   764 compiler2/pyassem.py
  1547 compiler2/pycodegen.py
  1578 compiler2/transformer.py
  4909 total

It's around 8,000 lines of Python code, which I consider small and malleable. This is why I believe it's feasible to optimize Oil by forking the Python language.

Note that ~16K lines of Oil and ~8K lines of OPy is still a lot less than the ~142K lines of C code in bash.

Recap of Last Year's Work

Before explaining how I made this work, let's review what I wrote about OPy last year.

April 2017

(A) The Riskiest Part of the Project. I listed six reasons why a shell shouldn't be a Python program:

The size and complexity of the interpreter.
The extra dependency, which is especially undesirable on embedded systems.
Startup time.
Unicode in Python 3. (See the FAQ below.)
Issues with signal handling.
Using Oil as a library from C programs.

Two more reasons:

I/O buffering issues as mentioned here.
Significantly slower parsing and execution of shell.

In addition to the fact that Python programs allocate memory frequently, Python's garbage collector isn't "fork-friendly". Objects that are read-only at the Python level are mutated at the C level, in order to update their reference counts. This inhibits virtual memory page sharing. Ruby addressed this issue in 2012.

It might not matter for some Python programs, but it matters for a shell.
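Here's a minimal sketch of the reference-count effect, assuming a current CPython. The names data and alias are just for illustration. The tuple can't be modified from Python code, yet binding a second name to it changes the count stored in the object's header, which is a write to the memory page holding the object.

```python
import sys

# A tuple built at runtime: immutable as far as Python code is concerned.
data = tuple(range(3))

before = sys.getrefcount(data)
alias = data                    # no copy is made; this only adds a reference
after = sys.getrefcount(data)

# The count lives inside the object itself, so "just reading" the tuple
# through a new name still writes to the object's memory.
print(before, after)
```

After a fork(), a write like this is enough to force the kernel to copy the affected page into the child, even though no Python-level data changed.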

(B) Cobbling Together a Python Interpreter. I describe the components of a Python front end in Python:

tokenize, a regex-based lexer from the standard library.
Guido's pgen2 parser generator, written circa 2006 for the 2to3 conversion tool.
compiler2, a bytecode compiler that was removed from the standard library as of Python 3.
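To give a feel for the first component, here's a small sketch that runs the standard library's tokenize module on one line of source. This uses the stock module on a current CPython, not OPy's own copy under pgen2/, which may differ in details.

```python
import io
import token
import tokenize

source = u"x = 1 + 2\n"

# generate_tokens() takes a readline-style callable and yields tuples whose
# first two fields are the token type and the token text.
tokens = [
    (token.tok_name[tok[0]], tok[1])
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

for name, text in tokens:
    print(name, repr(text))
```

The parser generated by pgen2 consumes a token stream like this one and produces a parse tree, which the bytecode compiler then walks.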

(C) The OPy Front End is Working. I describe my attempts to make these components work together. I abandoned Python 3 and ported Oil back to Python 2.

(D) OVM will be a Slice of the CPython VM. Rather than writing a small C or C++ VM to complement this front end, I decide to hack off a chunk of the Python interpreter and call it "OVM". This shortcut let me make the first release back in July.

May 2017

(E) Rewriting Python's Build System From Scratch. Oil release binaries have two parts:

Native code: ~135K lines of the CPython VM, and Oil's own C code.
Architecture-independent bytecode. Python source code is now compiled to bytecode with OPy, rather than CPython's built-in compiler.
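As a rough illustration of the second part, the sketch below compiles a function to a code object, serializes it with marshal (the format used inside .pyc files), and runs the deserialized copy. It uses CPython's built-in compile() as a stand-in. OPy plays that role in the real build, and I'm not showing its API here.

```python
import marshal

source = "def add(a, b):\n    return a + b\n"

# Stand-in for the compiler step: source text -> code object.
code = compile(source, "<example>", "exec")

# Serialize the code object to bytes. The format doesn't depend on the
# CPU architecture, but it is tied to the Python version that wrote it.
blob = marshal.dumps(code)

# Later (for example at shell startup), load the bytes and run them
# without touching the source or the compiler.
restored = marshal.loads(blob)
namespace = {}
exec(restored, namespace)
result = namespace["add"](2, 3)
print(result)
```

Because the compile step happens once at build time, the VM that ships in the tarball only needs the loading half of this sketch.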

June 2017

(F) How I Use Tests: Transforming OSH. In summary, the idea is to:

Compile Oil to a different, more efficient bytecode by forking the subset of Python that it uses.
Avoid a big-bang rewrite in a new language. In addition to being tedious, this would cause the project to "go dark" for many months. I want to avoid that at all costs. (If you don't understand this dynamic, see Things You Should Never Do, by Joel Spolsky.)
Use tests to guide this gradual transformation.

Also, it doesn't really matter how fast the OPy compiler runs, since I compile bytecode ahead of time rather than on-demand. This gives more room for optimization.

(For those curious about details, the two appendices in this post may be interesting.)

Why Do It This Way?

Admittedly, this strategy is odd. I don't know of any other programs that were almost unusably slow in their original implementation, then sped up by writing a new compiler.

I was recently asked how I consistently get things done, and my answer may shed some light on this. Part of it was:

Use Python. Python lets me explore new problems quickly. If there were a C++ compiler in my edit-run cycle, many corners of the shell language would remain unexplored.

Being able to mold the language with metaprogramming was another unexpected benefit. I learned OCaml specifically to write compilers and interpreters, but I decided not to use it for Oil. In retrospect, I suspect this was a good decision. (We'll know more once I get further into OPy!)
Don't get stuck. I've made continuous progress for nearly two years, and this strategy of incrementally optimizing Oil also reduces the likelihood of getting stuck.

I'll also add: don't go backward. With tests, I have confidence making big changes, like completely changing the bytecode compiler. I know that the OPy compiler works because the spec tests for 0.5.alpha2 did not regress. The bottom of the page records the version I ran the tests with:

$ _tmp/oil-tar-test/oil-0.5.alpha2/_bin/osh --version
Oil version 0.5.alpha2
Release Date: 2018-03-02 02:13:34+00:00
...
Bytecode: bytecode-opy.zip

I'll also admit that I'd like to prove a point about high level languages vs. gobs of C and C++, though I was honestly surprised by how slow the initial version turned out to be. Python is not a good language for writing efficient parsers, but perhaps OPy will be.

Recent Progress

I had already done most of the work last year, so all I had to do in the last few weeks was:

Add Makefile support to build bytecode-opy.zip instead of bytecode-cpython.zip.
Compile a subset of the Python standard library with OPy. I hadn't done this last year. This mainly involved running 2to3 --fix print on a few files.

The OPy README.md records some minor differences between OPy and Python.

Future Work

I have many ideas for OPy, which fall in these categories:

Refactorings and cleanups to get familiar with the code.
Make releases of OPy built with OPy, so the bootstrapping sequence isn't lost.
Bug fixes. For example, neither CPython nor OPy bytecode generation is deterministic! I believe this is due to a late change to Python 2.7 involving PYTHONHASHSEED.
Move work from runtime to compile time. For example, Python resolves imports, classes, functions, and attributes at runtime, but for Oil all of this could be done at compile time. In other words, it would be nice to remove the possibility of ImportError, NameError, and AttributeError.
Convergence with ASDL. It's more accurate to say that Oil is written in Python + ASDL, rather than just Python. Oil's ASDL implementation is a small compiler, and it makes sense to unify it with OPy. I suspect that integrating the two compilers can make the generated code faster.
Optimization of the OPy compiler itself. The parse tree representing Python code is unnecessarily large; I'd prefer to use an AST-like LST instead.
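To make the "move work to compile time" idea concrete, here's a deliberately crude sketch using the standard ast module on CPython 3. It is not how OPy works or will work. The function find_unbound_names and the sample source are hypothetical, and the check ignores scoping entirely. It only shows that a misspelled name can be reported before the program runs, rather than surfacing as a NameError at runtime.

```python
import ast
import builtins


def find_unbound_names(source):
    """Return sorted (line, name) pairs for names that are read but never bound.

    Crude on purpose: a name counts as bound if it is assigned, imported,
    defined, used as a parameter, or is a builtin ANYWHERE in the module.
    """
    tree = ast.parse(source)
    bound = set(dir(builtins))
    loads = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                bound.add(node.id)
            else:
                loads.append((node.lineno, node.id))
    return sorted(pair for pair in loads if pair[1] not in bound)


# Hypothetical module with a typo: "nmae" is never defined.
SOURCE = '''
import os

def main(argv):
    path = os.path.join(argv[0], nmae)
    return path
'''

problems = find_unbound_names(SOURCE)
print(problems)
```

A real version would need proper scope analysis, and it only becomes sound if the language subset forbids dynamic tricks like writing to globals(). That restriction is exactly the kind of semantic change a forked compiler is free to make.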

These changes will lead to changes to OVM. For example, ASDL data structures can be represented more efficiently in memory. Unlike Python data types, ASDL types are statically declared.

Conclusion

I released a version of Oil built with OPy, showed benchmarks and metrics, recapped previous posts on OPy, and described recent progress.

It might take a long time to optimize Oil, but I have no doubt I'll learn a lot in the process.

Oil also doesn't need to be fully optimized before adding useful features. I called this release 0.5.alpha2 instead of 0.5, because I hope that 0.5 will be the first release with a feature that bash doesn't have.

Appendix: FAQs About Python

I've been asked these questions when I've written about OPy in the past.

Why Python 2?

October 2020 Update: A few readers have misinterpreted this FAQ. You don't need either Python 2 or Python 3 to use Oil. The answers below are for Oil developers to understand its implementation techniques: high level code with DSLs. The tarball that users download and build is plain C and C++ code, with no dependencies besides libc and a kernel (optionally GNU readline.)

Three answers:

It's better to think of the code as written in "OPy", a tiny subset of Python. This isn't theoretical: for over a year, every release has been compiled with a custom bytecode compiler and not Python's builtin compiler.
Oil doesn't require Python 2 to be installed on your system. It looks like a C program to system packagers. I forked the Python interpreter and include a portion of it in the tarball. I rewrote its build system from scratch.
I've maintained this fork of Python 2 since the beginning of the project, so the upcoming 2020 deprecation has no effect on Oil.
The nativedeps metric, published with every release, shows what portions of the Python interpreter we're using. It's gotten smaller and smaller over time; see Hollowing Out the Python Interpreter.

Oil started off in Python 2, was ported to Python 3, then back to Python 2. Both ports were easy.

Unicode is the primary reason I ported it back to Python 2.

Python 3 emphasizes Unicode strings, but in a shell, you don't know what the encoding of a string is. File system paths, argv, getenv(), stdin, etc. are all bytes in Unix. That's how the kernel works.

The bytes can of course be UTF-8-encoded. UTF-8 was designed to work with existing C functions like strstr(), rather than requiring Unicode variants of each function.
UCS vs UTF-8 as Internal String Encoding by Armin Ronacher discusses the issue of internal string encoding. It notes that Perl, Ruby, Go, and Rust use UTF-8 internally. Oil will follow that example, rather than the example of Python and bash, which used fixed-width multibyte characters.
This comment I wrote on /r/ProgrammingLanguages explains why manipulating UTF-8 text in memory is awkward with Python 3.
In a sense, Python acknowledged these issues in Python 3.7 with PEP 540: Add a New UTF-8 Mode. Python 3.7 was released in June 2018, long after the Oil project started.
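Here's a small sketch of both points, using a made-up file name. The byte string is mostly valid UTF-8 but contains one stray 0xFF byte, which Unix happily allows in a path. Searching at the byte level works without knowing the encoding, while strict decoding to a Unicode string fails.

```python
# Hypothetical file name: "caf" + UTF-8 for e-acute + space + a stray 0xFF byte.
raw = b"caf\xc3\xa9 \xff.txt"

# Byte-level substring search, like strstr() in C. No decoding required,
# because a UTF-8 encoded needle can only match at a character boundary.
needle = u"\u00e9".encode("utf-8")
position = raw.find(needle)

# Strict decoding rejects the whole string because of the single bad byte.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(position, decoded_ok)
```

A shell has to pass names like this through untouched, so treating strings as bytes that are usually UTF-8 is a better fit than insisting on decoded text.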

Another historical reason: Porting compiler2 to Python 3 was hard, and supporting both Python 2 and 3 at the same time was even harder. So I chose Python 2.

Why not use PyPy / RPython?

I wasn't excited about PyPy, but I tried it anyway. OSH under PyPy is slower than OSH under CPython, not faster.

JIT speedups depend on the workload. My understanding is that string-heavy workloads are dominated by allocation, and the JIT can't do much about that. Even when it's faster, PyPy uses more memory than CPython, which is not a good tradeoff for a shell. A shell should use less memory than CPython or PyPy.

PyPy optimizes unmodified Python programs, which is very hard. In contrast, OPy is optimizing just the subset of the language that Oil and OPy itself use. I'm also free to change the semantics of the language, e.g. make it more static.

Implementation trivia: OPy started from the same place that PyPy did. PyPy is also based on tokenize, pgen2, and compiler2. Takeaway: writing a Python front end is a lot of work, so it's best to reuse existing code.

November 2018 Update: A more detailed answer on lobste.rs.

Why not use MicroPython?

I answered this on lobste.rs. I mentioned MicroPython in this June 2017 post.

Why not use Cython?

I didn't try Cython, but I also don't see any evidence that it speeds up string-based workloads. I believe it also has the tradeoff of bloating the executable (which likely increases memory usage.)