blog | oilshell.org

Building Oil with the OPy Bytecode Compiler

2018-03-04

In the very first post on Oil, I explained why Oil is written in Python: to have a chance of getting it done! I want to implement not just the bash-compatible OSH dialect, but also the Oil language, and that's a lot of work.

Bash alone is ~160K lines of C code, while OSH is ~16K lines of Python as of the last release. (When all's said and done, it might turn out to be a 5-7x ratio rather than 10x, but that's still huge.)

Of course, there's a problem: Python is slower than C, and I wrote benchmarks to show that it matters. For example, the OSH parser is 40-50 times slower than the bash parser, even after some optimization.

So I'm now working on making it even faster and smaller. My plan involves OPy, a Python bytecode compiler written in Python.

This post shows what I've done with OPy, recaps what I wrote about it last year, and maps out future work. If you've implemented a VM, and especially if you've modified CPython, I'd love your feedback in the comments.

Table of Contents
Release 0.5.alpha2
Benchmarks and Metrics
How Big Is OPy?
Recap of Last Year's Work
April 2017
May 2017
June 2017
Why Do It This Way?
Recent Progress
Future Work
Conclusion
Appendix: FAQs About Python
Why Python 2?
Why not use PyPy?
Why not use Cython?

Release 0.5.alpha2

I've released Oil 0.5.alpha2, which you can download here:

It has the same features as OSH 0.4, but its bytecode is built with OPy rather than CPython.

Benchmarks and Metrics

OPy generates slightly different bytecode, but it appears that OSH is unaffected. The unit tests and spec tests pass, and these benchmark results are roughly the same:

(That is, 6-7 lines/ms on a slow machine and 13-14 lines/ms on a fast machine.)

However, the bytecode is larger:

I'm not sure why this is, but I'll look into it as I optimize for both size and speed.
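As a rough illustration of what "bytecode size" can mean, here is a minimal sketch that measures a code object two ways: the raw instruction bytes, and the serialized form including constants and names. It uses CPython's built-in `compile()` rather than OPy, and the `bytecode_size` helper and sample source are hypothetical, not the metric used for the release.

```python
import marshal
import types


def bytecode_size(code):
    """Sum the raw instruction bytes of a code object and its nested code objects."""
    total = len(code.co_code)
    for const in code.co_consts:
        # Function and class bodies are stored as nested code objects.
        if isinstance(const, types.CodeType):
            total += bytecode_size(const)
    return total


# Hypothetical sample module; any Python source would do.
SOURCE = "def add(a, b):\n    return a + b\n"

module_code = compile(SOURCE, "<demo>", "exec")
raw_bytes = bytecode_size(module_code)
# The serialized form also carries constants, names, and line tables.
serialized_bytes = len(marshal.dumps(module_code))
```

Comparing both numbers for two compilers would show whether a size difference comes from the instructions themselves or from the metadata around them.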

How Big Is OPy?

oil/opy$ ./count.sh all

LEXER, PARSER GENERATOR, AND GRAMMR
  ... snip ...
  579 pgen2/tokenize.py
  827 pytree.py
 2574 total

COMPILER2
   ... snip ...
   410 compiler2/symbols.py
   764 compiler2/pyassem.py
  1547 compiler2/pycodegen.py
  1578 compiler2/transformer.py
  4909 total

It's around 8,000 lines of Python code, which I consider small and malleable. This is why I believe it's feasible to optimize Oil by forking the Python language.

Note that ~16K lines of Oil and ~8K lines of OPy is still a lot less than the ~160K lines of C code in bash.

Recap of Last Year's Work

Before explaining how I made this work, let's review what I wrote about OPy last year.

April 2017

(A) The Riskiest Part of the Project. I listed six reasons why a shell shouldn't be a Python program:

  1. The size and complexity of the interpreter.
  2. The extra dependency, which is especially undesirable on embedded systems.
  3. Startup time.
  4. Unicode in Python 3. (See the FAQ below.)
  5. Issues with signal handling.
  6. Using Oil as a library from C programs.

Two more reasons:

  1. I/O buffering issues as mentioned here.
  2. Significantly slower parsing and execution of shell.

In addition to the fact that Python programs allocate memory frequently, Python's garbage collector isn't "fork-friendly". Objects that are read-only at the Python level are mutated at the C level, in order to update their reference counts. This inhibits virtual memory page sharing. Ruby addressed this issue in 2012.

It might not matter for some Python programs, but it matters for a shell.
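To make the reference-counting point concrete, here is a minimal sketch that assumes a standard CPython build. Merely storing another reference to an object changes its reference count, which is a write to the object's memory even though nothing changed at the Python level. The `refcount_delta` helper is hypothetical.

```python
import sys


def refcount_delta():
    """Return how much the refcount changes when one more reference is stored."""
    payload = object()           # a fresh object we never modify
    holders = []
    before = sys.getrefcount(payload)
    holders.append(payload)      # "read-only" use: just keep another reference
    after = sys.getrefcount(payload)
    return after - before


delta = refcount_delta()
```

After a `fork()`, a write like this touches the memory page holding the object, so the kernel has to copy that page instead of sharing it between parent and child.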

(B) Cobbling Together a Python Interpreter. I describe the components of a Python front end in Python:

  1. tokenize, a regex-based lexer from the standard library.
  2. Guido's pgen2 parser generator, written circa 2006 for the 2to3 conversion tool.
  3. compiler2, a bytecode compiler that was removed from the standard library as of Python 3.
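For a feel of what the first and last stages produce, here is a minimal sketch using only the modern standard library. It runs `tokenize` on a line of source and then uses the built-in `compile()` as a stand-in for the bytecode compiler. It does not use pgen2 or compiler2, and the `lex` helper is hypothetical.

```python
import io
import tokenize

SOURCE = "x = 1 + 2\n"


def lex(source):
    """Return (token type name, token text) pairs, skipping empty/newline tokens."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.string.strip()
    ]


tokens = lex(SOURCE)

# Stand-in for the parse and compile stages: CPython's built-in compiler.
code = compile(SOURCE, "<demo>", "exec")
names = code.co_names
```

In OPy, the token stream feeds a pgen2-generated parser and the resulting tree goes to compiler2, but the end product is the same kind of code object.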

(C) The OPy Front End is Working. I describe my attempts to make these components work together. I abandoned Python 3 and ported Oil back to Python 2.

(D) OVM will be a Slice of the CPython VM. Rather than writing a small C or C++ VM to complement this front end, I decide to hack off a chunk of the Python interpreter and call it "OVM". This shortcut let me make the first release back in July.

May 2017

(E) Rewriting Python's Build System From Scratch. Oil release binaries have two parts:

  1. Native code: ~135K lines of the CPython VM, and Oil's own C code.
  2. Architecture-independent bytecode. Python source code is now compiled to bytecode with OPy, rather than CPython's built-in compiler.

June 2017

(F) How I Use Tests: Transforming OSH. In summary, the idea is to:

Also, it doesn't really matter how fast the OPy compiler runs, since I compile bytecode ahead of time rather than on-demand. This gives more room for optimization.
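Compiling ahead of time can be sketched as follows: compile source to a code object once, serialize it, and later load and execute it without the compiler in the loop. This is a minimal illustration using CPython's `compile()` and `marshal`, not Oil's actual build. The `build` and `load_and_run` helpers are hypothetical, and the `marshal` format is specific to the Python version that wrote it.

```python
import marshal

# Hypothetical module source to compile ahead of time.
SOURCE = "def greet(name):\n    return 'hello ' + name\n"


def build(source, filename="<demo>"):
    """Build step: compile source and serialize the resulting code object."""
    return marshal.dumps(compile(source, filename, "exec"))


def load_and_run(blob):
    """Run step: deserialize the code object and execute it in a fresh namespace."""
    namespace = {}
    exec(marshal.loads(blob), namespace)
    return namespace


blob = build(SOURCE)            # could be written to disk at build time
namespace = load_and_run(blob)  # no compiler needed at this point
greeting = namespace["greet"]("oil")
```

Because the build step runs once, it can afford slow optimization passes that would be too expensive to run at every startup.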

(For those curious about details, the two appendices in this post may be interesting.)

Why Do It This Way?

Admittedly, this strategy is odd. I don't know of any other programs that were almost unusably slow in their original implementation, then sped up by writing a new compiler.

I was recently asked how I consistently get things done, and my answer may shed some light on this. Part of it was:

  1. Use Python. Python lets me explore new problems quickly. If there were a C++ compiler in my edit-run cycle, many corners of the shell language would remain unexplored.

    Being able to mold the language with metaprogramming was another unexpected benefit. I learned OCaml specifically to write compilers and interpreters, but I decided not to use it for Oil. In retrospect, I suspect this was a good decision. (We'll know more once I get further into OPy!)

  2. Don't get stuck. I've made continuous progress for nearly two years, and this strategy of incrementally optimizing Oil also reduces the likelihood of getting stuck.

    I'll also add: don't go backward. With tests, I have confidence making big changes, like completely changing the bytecode compiler. I know that the OPy compiler works because the spec tests for 0.5.alpha2 did not regress. The bottom of the page records the version I ran the tests with:

$ _tmp/oil-tar-test/oil-0.5.alpha2/_bin/osh --version
Oil version 0.5.alpha2
Release Date: 2018-03-02 02:13:34+00:00
...
Bytecode: bytecode-opy.zip

I'll also admit that I'd like to prove a point about high-level languages vs. gobs of C and C++, though I was honestly surprised by how slow the initial version turned out to be. Python is not a good language for writing efficient parsers, but perhaps OPy will be.

Recent Progress

I had already done most of the work last year, so all I had to do in the last few weeks was:

The OPy README.md records some minor differences between OPy and Python.

Future Work

I have many ideas for OPy, which fall in these categories:

These changes will lead to changes to OVM. For example, ASDL data structures can be represented more efficiently in memory. Unlike Python data types, ASDL types are statically declared.
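As a loose analogy in plain Python (not how OVM works or will work), declaring fields up front with `__slots__` lets the runtime drop the per-instance dictionary. The class names below are hypothetical and are not generated from ASDL.

```python
class DynamicToken:
    """Ordinary class: attributes live in a per-instance dict."""

    def __init__(self, id, val):
        self.id = id
        self.val = val


class StaticToken:
    """Fields are declared statically, so no per-instance dict is allocated."""

    __slots__ = ("id", "val")

    def __init__(self, id, val):
        self.id = id
        self.val = val


dynamic = DynamicToken(1, "echo")
static = StaticToken(1, "echo")

try:
    static.extra = True  # not a declared field
    rejected = False
except AttributeError:
    rejected = True
```

A VM that knows every field at compile time can go further than this, for example by storing fields at fixed offsets, but the principle is the same: static declarations remove the need for a dynamic lookup structure.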

Conclusion

I released a version of Oil built with OPy, showed benchmarks and metrics, recapped previous posts on OPy, and described recent progress.

It might take a long time to optimize Oil, but I have no doubt I'll learn a lot in the process.

Oil also doesn't need to be fully optimized before adding useful features. I called this release 0.5.alpha2 instead of 0.5, because I hope that 0.5 will be the first release with a feature that bash doesn't have.

Appendix: FAQs About Python

I've been asked these questions when I've written about OPy in the past.

Why Python 2?

Because I'm taking ownership of the code, Python 2 vs. Python 3 isn't a meaningful question from the user's point of view. It's an implementation detail.

For the curious, Oil started off in Python 2, was ported to Python 3, then back to Python 2. (Both ports were easy.)

Python 3 emphasizes Unicode strings, but in a shell, you almost never know what the encoding of a string is. File system paths, argv, getenv(), stdin, etc. are all bytes in Unix.

The bytes can of course be UTF-8-encoded. UTF-8 was designed to work with existing C functions like strstr(), rather than requiring Unicode variants of each function.
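Here is a minimal sketch of that property, using Python's `bytes.find()` as a stand-in for `strstr()`. A plain byte search on UTF-8 data finds the substring at a character boundary, with no decoding step. The sample strings are made up for illustration.

```python
# Non-ASCII text, written with escapes so the source file stays ASCII.
text = "na\u00efve caf\u00e9"
needle = "caf\u00e9"

haystack_bytes = text.encode("utf-8")
needle_bytes = needle.encode("utf-8")

# Byte-oriented search, analogous to strstr() in C.
byte_offset = haystack_bytes.find(needle_bytes)

# The match starts on a character boundary, so the tail decodes cleanly.
tail = haystack_bytes[byte_offset:].decode("utf-8")

# The offsets differ because the second character takes two bytes in UTF-8.
char_offset = text.find(needle)
```

The byte offset and the character offset differ, but a shell that treats strings as bytes only needs the former.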

This blog post discusses the issue of internal string encoding. It notes that Perl, Ruby, Go, and Rust use UTF-8 internally. Oil will follow that example, rather than the example of Python and bash, which use fixed-width multibyte characters.

This comment explains why manipulating UTF-8 text in memory is awkward with Python 3.

The other issues with Python 2 were:

Why not use PyPy?

I wasn't excited about PyPy, but I tried it anyway. OSH under PyPy is slower than OSH under CPython, not faster.

JIT speedups depend on the workload. My understanding is that string-heavy workloads are dominated by allocation, and the JIT can't do much about that. Even when it's faster, PyPy uses more memory than CPython, which is not a good tradeoff for a shell. A shell should use less memory than CPython or PyPy.

PyPy optimizes unmodified Python programs, which is very hard. In contrast, OPy is optimizing just the subset of the language that Oil and OPy itself use. I'm also free to change the semantics of the language, e.g. make it more static.

Implementation trivia: OPy started from the same place that PyPy did. PyPy is also based on tokenize, pgen2, and compiler2. Takeaway: writing a Python front end is a lot of work, so it's best to reuse existing code.

Why not use Cython?

I didn't try Cython, but I also don't see any evidence that it speeds up string-based workloads. I believe it also has the tradeoff of bloating the executable (which likely increases memory usage).