An Unexpected Solution: Cobbling Together a Python Interpreter

2017-04-09

Yesterday I listed six reasons that OSH shouldn't run on top of CPython. I had two ideas to break this dependency without rewriting thousands of lines of code:

Translate a subset of Python to C++.
Compile a subset of Python to a hypothetical shell runtime, "OVM".

These are both possible, but they're vague and only address the OSH front end. The third solution I've experimented with is more general:

Write a small bootstrapped Python interpreter to run OSH and Oil.

Does that sound crazy? I mentioned yesterday that Python's core alone is 185K lines of code.

In this post, I'll explain why it's not. In fact, it appears to be easier than any alternative I've thought of.

Table of Contents

Python Has Been Bootstrapped 1.57 Times

OPy

Summary

Python Has Been Bootstrapped 1.57 Times

For background, let's unpack this odd assertion. Bootstrapping Python means writing Python in Python. Although parts of CPython are written in Python, the core is written in C, including all of the following:

Lexer
Parser
Bytecode compiler
Interpreter loop
Runtime objects like strings, dictionaries and classes
Garbage Collector
Standard Library (there is some Python here, but a lot of C too.)

But, entirely separately from CPython, Python has been rewritten in Python 1.57 times.

The first 1.0 comes from PyPy, which is a very complete implementation of Python written in Python (including novel and sophisticated JIT technology). We won't be working with PyPy, so let's leave it aside for now. There may be more to say about it later; leave a comment if you're curious.

What accounts for the remaining 0.57? I'm referring to four Python reimplementations of the seven components above (4/7 ≈ 0.57):

The tokenize module from the standard library (~600 lines). It does the same thing as Parser/tokenizer.c in CPython, but it's written in pure Python.
The pgen2 parser generator from lib2to3 (~1800 lines). It does the same thing as Parser/pgen.c, but it's written in pure Python.
The deprecated compiler module in the Python 2.7 standard library, which I refer to as compiler2 from now on (~5,000 lines). It transforms the parse tree into an AST, generates a control flow graph, and then flattens it into bytecode. In other words, it does the same thing as Python/compile.c, but it's written in Python. The Design of CPython's Compiler describes this process, and it's largely accurate for compiler2 as well.
byterun, a Python bytecode interpreter in Python, which was described in the AOSA Book (~1300 lines). Although I need to write my own interpreter loop in C++, this code is important because it's small, and because I've run OSH unit tests under it.
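To make the first component concrete, here is a minimal sketch of lexing Python with the pure-Python tokenize module. It assumes the Python 3 version of the module rather than the 2.7 one, and the helper name token_names is hypothetical, made up for this illustration.

```python
import io
import tokenize


def token_names(source):
    """Return (token type name, token text) pairs for a source string.

    token_names is a hypothetical helper for illustration only.
    """
    # generate_tokens() wants a readline-style callable that returns str.
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


pairs = token_names("x = 1 + 2\n")
for name, text in pairs:
    print(name, repr(text))
```

Nothing here calls into a hand-written C lexer: the token stream comes from Python code in the standard library, which is the property that makes the module reusable in a bootstrapped interpreter.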

These components account for a large fraction of the Python interpreter in less than 10K lines of pure Python code! You could say that Python is more compiled than interpreted: there is a lot of C code to transform your source code into bytecode, but less code to actually run the bytecode.
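The split between compiling and running can be seen from inside Python itself. The sketch below assumes Python 3 and uses only the built-in compile() and exec() functions plus the dis module. The variable names are made up for the example, and the opcode names it prints differ between Python versions.

```python
import dis

source = "total = price * quantity\n"

# Front end: lexing, parsing and bytecode generation all happen in this call.
code = compile(source, "<example>", "exec")

# Back end: the interpreter loop runs the code object against a namespace.
namespace = {"price": 3, "quantity": 4}
exec(code, namespace)

# The opcode names are version-specific, so treat this list as illustrative.
opnames = [instr.opname for instr in dis.get_instructions(code)]
print(opnames)
print(namespace["total"])  # 12
```

The code object returned by compile() is the hand-off point between the two halves: tokenize, pgen2 and compiler2 cover the work of producing it, and byterun covers the work of executing it.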

I believe I can glue together these four components, write around 5K lines of native code for the remaining pieces, and end up with a Python interpreter that will run OSH and Oil. This isn't trivial, because the components were written at wildly different times and don't work together, but it's possible.

For comparison, tinypy has only 1,801 lines of C code and 2,185 lines of Python code. I've used and modified tinypy, and it's fantastic. However, the style is artificially dense, and it has less functionality than I want. (I've excluded it from the 1.57 times because it's more like a dialect of Python.)

OPy

In my head, I'm calling this collection of code OPy. It will initially be a hybrid of Python 2 and 3, but I expect it to quickly diverge. For example, after making good use of ASDL in the word evaluation pipeline, I said that OSH is no longer written in Python. It's written in Python+ASDL.

OPy will reflect this evolution. It could even end up a specialized language for writing languages — a meta-language — rather than a general-purpose language like Python.

Summary

This post outlined a feasible solution for the riskiest part of the project.

It's not done, but my experiments give me confidence that it will work. For example, I've run all the OSH unit tests with various combinations of pgen2, compiler2, and byterun.

Tomorrow I will go into detail on the benefits of this solution, and what remains to be done.