Brief Descriptions of a Python to C++ Translator

2022-05-10

This post is for Python experts! I briefly describe mycpp, a hacky Python-to-C++ translator I wrote on top of the MyPy type checker.

It works, but it needs improvement, or even a rewrite.

Context: We now have a NLnet grant of 50K euros, and I want to pay one or more engineers to help me translate Oil to C++. I wrote a longer post about this in March: Oil Is Being Developed "Middle Out".

Please share this post with anyone who may be interested!

A Hybrid of mypyc and Shed Skin (Table)

I've described mycpp as a hybrid of the the recent mypyc compiler and old Shed Skin compiler. So this table is an easy way to explain it to Python experts.

Project	Shed Skin (2005-2015?)	mypyc (2018-present, for MyPy)	mycpp (2019-present, for Oil)
Types	Inferred	Explicit Annotations	Explicit Annotations
Compiler Output	C++ source code	Python-C Extension	C++ source code
Garbage Collector	Boehm GC (imprecise, architecture-specific)	CPython GC (precise, portable)	Custom Cheney collector on `Str`, `List<T>`, `Dict<K, V>`, ... (precise, portable)

Like mypyc, mycpp uses explicit type annotations.
Like Shed Skin, it generates C++ source code.

The goal is to speed up Oil, and remove the dependency on CPython.

Trivia

I've never used Shed Skin, but someone on Hacker News told me that they use it in production! I think they used it for numeric code, as opposed to string- and dict-oriented code.
The first Shed Skin blog post says that srepmub worked on it for Google Summer of Code in 2005, mentored by Jeremy Hylton and Brett Cannon.
Wikipedia says that Shed Skin combines 2 type inference algorithms, one of which is John Plevyak's Iterative Flow Analysis (IFA). John was my teammate on index metrics at Google!

So there was a lot of Python expertise at Google, especially around 2005-2010.

Python -> C++ Translation Examples

This is the second brief way of explaining mycpp — by showing input / output pairs.

They're taken from the mycpp-examples benchmark, which is published with every release. Speedups range from ~5x to ~200x.

Fibonacci

This translation is trivial because the algorithm only uses integers. C++ has fixed width integers, unlike Python, but that's OK for Oil.

Python input:

def fib_recursive(n):
  # type: (int) -> int
  if n == 0:
    return 1
  if n == 1:
    return 1
  return fib_recursive(n-1) + fib_recursive(n-2)

C++ output:

int fib_recursive(int n) {
  if (n == 0) {
    return 1;
  }
  if (n == 1) {
    return 1;
  }
  return (fib_recursive((n - 1)) + fib_recursive((n - 2)));
}

String Backslash Escaping

This example uses Str* and List<Str*>*, which makes it more interesting. A big difference is that the translator has to register C stack roots with the garbage collector.

Python input:

def BackslashEscape(s, meta_chars):
  # type: (str, str) -> str  
  escaped = []  # type: List[str]
  for c in s:
    if c in meta_chars:
      escaped.append('\\')
    escaped.append(c)
  return ''.join(escaped)

C++ output:

Str* BackslashEscape(Str* s, Str* meta_chars) {
  List<Str*>* escaped = nullptr;
  StackRoots _roots({&s, &meta_chars, &escaped});

  escaped = Alloc<List<Str*>>();
  for (StrIter it(s); !it.Done(); it.Next()) {
    Str* c = it.Value();
    StackRoots _for({&c  });
    if (str_contains(meta_chars, c)) {
      escaped->append(str0);
    }
    escaped->append(c);
  }
 return str1->join(escaped);
}

This strategy may seem a bit naive, but it works and performs well! It runs a large portion of OSH, which I show below.

FAQ

Why Generate C++, and Not C?

It makes the translation more straightforward, and the output more readable:

MyPy's Dict[K, V] → our C++ gc_heap::Dict<K, V>
Python inheritance → C++ inheritance / virtual dispatch
Python exceptions → C++ exceptions
Python context managers → C++ constructors and destructors

Using C++ also makes the runtime and libc bindings easier to write. We get a little type safety when we manually write wrappers like:

List<Str*>* glob(Str* pattern)

for the C function:

int glob(const char *pattern, int flags,
         int (*errfunc) (const char *epath, int eerrno),
         glob_t *pglob);

(Why not Rust? The FAQ I mentioned is still upcoming. Here's a Reddit comment on that.)

Why Not Use PyPy, Cython, etc.?

None of them uses Python type annotations, which I found useful for generating predictably fast C++ code.

A key benefit of mycpp is being able to analyze and optimize performance in the "C++ domain", not in the Python domain. C++ has very good performance tools, and they work best on readable source code.

Most Python experts know this, but all Python compilers have a tradeoff between speed and generality, i.e. how much Python is understood. Our mycpp tool is way toward the left of that spectrum — it's designed to make one program fast!

There are dozens of other (partial) Python implementations listed on the python.org wiki: PythonImplementations.

How Big Is mycpp?

It's pretty small! The line counts for translation show:

The translator is ~4300 lines. It's built on top of MyPy and peeks into the private AST data structure, which is one reason it may need a rewrite.
The new garbage-collected runtime is ~3100 lines.
We have ~2400 lines of hand-written C++ code, like the OS bindings.

So I would like to hire one talented person to work on this. See Compiler Engineer Job.

Background

Graph of Translation Progress

As more evidence that this is doable, here's a plot of Oil spec tests passing under the C++ translation.

Progress on the Middle-Out Implementation of Oil

This is taken from Oil is Being Implemented "Middle Out", which also states caveats. In particular, we still have to hook up the garbage collector, which will require some rewriting.

Project History

Hopefully that gives you a good picture of mycpp and what I need help with. Here's some detail on how we got here.

Winter 2018: MyPy team lead Greg Price contributed to Oil, and told me about mypyc! I was unaware of its existence, and Oil wasn't using MyPy yet.
April 2019: The first release with the mycpp experiment.
Ongoing work
1. Add Python type annotations to Oil's code
2. Enhance mycpp to handle the subset of Python we use.
3. Rewrite Oil's use of Python dynamism/reflection into textual code generation. Translation requires 100% statically typed code.
4. Write C++ wrappers for OS functions like glob().
January 2020: Oil's Parser is 160x-200x Faster Than It Was 2 Years Ago
- The proof of concept works and performs well! But there's no garbage collector yet.
Ongoing work
- I refactor, type-annotate, and translate almost all of OSH. It's kept up to date with every release, as oil-native.
- But the implementation of the Oil language isn't yet statically typed, so it's not translated. It's still "metacircular" with CPython, which is convenient since it's a Python-like language.
March 2021: The garbage collector runs on small examples, but it's not yet hooked up yet.
March 2022: I show a big jump in progress on the middle out graph, copied above.

This would have gone faster if I was a better compiler engineer and a better C++ programmer!

Oil Is a Big Project

But I'm also doing many other things at the same time — e.g. see A Tour of the Oil Language, which is a completely separate task.

You could say that the Oil project is three languages:

The compatible OSH language. It's the most bash-compatible shell, by a mile.
The new Oil language, which has Python-like expressions and Ruby-like blocks.
This "mycpp" language, which has Python syntax.
- It seemed more like a "tool" at first, but now it feels like a language.
- The core problem here is type checking and garbage collection, whereas the core problem with shell is the syntax!

(I'm thinking of calling the mycpp rewrite the Pea compiler.)

Question: MyPy Subset with Python 3.10 Pattern Matching?

Python 3.10 was released late last year, with structural pattern matching.

One thing I found difficult when building on top of MyPy is the "inverted" visitor style. My brain doesn't like"losing the stack" — i.e. when local variables become member variables.

I also noticed that recent versions of the AST module support type comments, which Oil still uses.

So I wonder if anybody has tried to write a strictly typed subset of MyPy with Python 3.10 pattern matching and the new ast info? I think that could be a fun project. If it's limited to what Oil uses, it shouldn't be too big.

Gradual typing was crucial to get Oil to where it is now, but our translator can now assume strict typing.

Somewhat related repository: Exercises for Types and Programming Languages in Python, with Python 3.10 pattern matching

Conclusion

Let me know if this made sense. Feel free to reply in the comments, or e-mail me at andy at oilshell.org.

Please circulate this post so we can find the right person to help!

And check out Compiler Engineer Job on the wiki, which lists the 5 skills sought:

Hard-won C++ experience, including knowledge of portability.
Understanding of garbage collectors.
Comfort with a test-driven and terminal-based workflow. We need to make the red line go up!
Understanding of static type systems.
Python.

Appendix: More Posts

I hope to find somebody soon, but if not, I could write posts on these technical details.

Python Meta / Inadvertently Repeating PyPy

The first attempt to speed up Oil was a bit naive. I inadvertently repeated PyPy by cobbling together a Python interpreter in Python.

It was a fun learning experience, and didn't take too long. But from there, there was no clear path to speed up Oil without static types. So you can say MyPy "saved" Oil.

So Oil is using many of Python's metalanguages and tools, and I could elaborate on this more:

Zephyr ASDL to describe the language
pgen2 for parsing the Oil language, which is heavily influenced by Python
And now MyPy, with influence from mypyc

A Garbage-Collected Heap Shaped Like Statically Typed Python

How does our List<T> and Dict<K, V> implementation work? It's somewhat influenced by OCaml's runtime, which is covered in these chapters of Real World OCaml:

I also played with femtolisp, the language used to bootstrap the Julia language, giving it Lisp-like metaprogramming. It's one of the few "production" interpreters that uses a Cheney semi-space collector. (The es shell by Paul Haahr and Byron Rakitzis also does, but it appears defunct.)

Why Cheney, and not the more common mark-and-sweep?

The implementation is simple and short.
I believe the main downside of Cheney is its speed on big heaps. Oil programs should have small heaps: less than 10 MB in the common case, and 100 MB in uncommon ones.
But I learned the hard way that the Oil workload is very allocation heavy! This isn't surprising because it was written in Python for correctness, not speed.
- With the Cheney algorithm, allocation is just bumping a pointer.
- Collection time is proportional to the number of live objects, and our workload should have many dead objects.
It's moving and compacting, and that might increase page sharing when forking the shell (?). You can also imagine disabling GC in some subshells, relying instead on process death to reclaim resources.

Why not use C++ smart pointers and/or reference counting?

I made an early attempt, but I had problems with List[T] turning into types like shared_ptr<List<shared_ptr<T>>, i.e. nested shared pointers. Apparently there are other template parameters that don't compose / nest.

Object-Oriented Type Systems are Underexplained

I started this project with some knowledge about parsing, but I have no experience with type systems.

Somehow I assumed that books like the Dragon Book would explain how to write a type system, but they don't. Not even for a procedural language, let alone an object-oriented one!

These threads were part of my research and have interesting comments, and justification for the claim:

Good textbook with implementations of OO type system? (2 months ago)
An Accessible Introduction to Type Theory and Implementing a Type Checker. A tutorial on a Java-like type checker, written in OCaml.

How Static Types Relate to Garbage Collection

The connection between static types and (precise) garbage collection also feels underexplained — e.g. the "C stack roots" problem. I have the The Garbage Collector Handbook, but it's mainly about algorithms, and not engineering concerns.

This post about Mozilla's SpiderMonkey was useful conceptually: Clawing Our Way Back to Precision (2013)
Garbage Collection in LLVM. There also seems to be a lot of "implementer lore" here.

This means that translating statically typed Python to C++ is more difficult than translating say CoffeeScript to JavaScript. The translator has to understand more than just syntax!