Why Sponsor Oils? | blog | oilshell.org

Brief Descriptions of a Python to C++ Translator

2022-05-10

This post is for Python experts! I briefly describe mycpp, a hacky Python-to-C++ translator I wrote on top of the MyPy type checker.

It works, but it needs improvement, or even a rewrite.

Context: We now have a NLnet grant of 50K euros, and I want to pay one or more engineers to help me translate Oil to C++. I wrote a longer post about this in March: Oil Is Being Developed "Middle Out".

Please share this post with anyone who may be interested!

Table of Contents
A Hybrid of mypyc and Shed Skin (Table)
Trivia
Python -> C++ Translation Examples
Fibonacci
String Backslash Escaping
FAQ
Why Generate C++, and Not C?
Why Not Use PyPy, Cython, etc.?
How Big Is mycpp?
Background
Graph of Translation Progress
Project History
Oil Is a Big Project
Question: MyPy Subset with Python 3.10 Pattern Matching?
Conclusion
Appendix: More Posts
Python Meta / Inadvertently Repeating PyPy
A Garbage-Collected Heap Shaped Like Statically Typed Python
Object-Oriented Type Systems are Underexplained
How Static Types Relate to Garbage Collection

A Hybrid of mypyc and Shed Skin (Table)

I've described mycpp as a hybrid of the the recent mypyc compiler and old Shed Skin compiler. So this table is an easy way to explain it to Python experts.

Project

Shed Skin
(2005-2015?)

mypyc
(2018-present, for MyPy)

mycpp
(2019-present, for Oil)

Types Inferred Explicit Annotations Explicit Annotations
Compiler Output C++ source code Python-C Extension C++ source code
Garbage Collector Boehm GC
(imprecise, architecture-specific)
CPython GC
(precise, portable)

Custom Cheney collector on Str, List<T>, Dict<K, V>, ...
(precise, portable)

The goal is to speed up Oil, and remove the dependency on CPython.

Trivia

So there was a lot of Python expertise at Google, especially around 2005-2010.

Python -> C++ Translation Examples

This is the second brief way of explaining mycpp — by showing input / output pairs.

They're taken from the mycpp-examples benchmark, which is published with every release. Speedups range from ~5x to ~200x.

Fibonacci

This translation is trivial because the algorithm only uses integers. C++ has fixed width integers, unlike Python, but that's OK for Oil.

Python input:

def fib_recursive(n):
  # type: (int) -> int
  if n == 0:
    return 1
  if n == 1:
    return 1
  return fib_recursive(n-1) + fib_recursive(n-2)

C++ output:

int fib_recursive(int n) {
  if (n == 0) {
    return 1;
  }
  if (n == 1) {
    return 1;
  }
  return (fib_recursive((n - 1)) + fib_recursive((n - 2)));
}

String Backslash Escaping

This example uses Str* and List<Str*>*, which makes it more interesting. A big difference is that the translator has to register C stack roots with the garbage collector.

Python input:

def BackslashEscape(s, meta_chars):
  # type: (str, str) -> str  
  escaped = []  # type: List[str]
  for c in s:
    if c in meta_chars:
      escaped.append('\\')
    escaped.append(c)
  return ''.join(escaped)

C++ output:

Str* BackslashEscape(Str* s, Str* meta_chars) {
  List<Str*>* escaped = nullptr;
  StackRoots _roots({&s, &meta_chars, &escaped});

  escaped = Alloc<List<Str*>>();
  for (StrIter it(s); !it.Done(); it.Next()) {
    Str* c = it.Value();
    StackRoots _for({&c  });
    if (str_contains(meta_chars, c)) {
      escaped->append(str0);
    }
    escaped->append(c);
  }
 return str1->join(escaped);
}

This strategy may seem a bit naive, but it works and performs well! It runs a large portion of OSH, which I show below.

FAQ

Why Generate C++, and Not C?

It makes the translation more straightforward, and the output more readable:

Using C++ also makes the runtime and libc bindings easier to write. We get a little type safety when we manually write wrappers like:

List<Str*>* glob(Str* pattern)

for the C function:

int glob(const char *pattern, int flags,
         int (*errfunc) (const char *epath, int eerrno),
         glob_t *pglob);

(Why not Rust? The FAQ I mentioned is still upcoming. Here's a Reddit comment on that.)

Why Not Use PyPy, Cython, etc.?

None of them uses Python type annotations, which I found useful for generating predictably fast C++ code.

A key benefit of mycpp is being able to analyze and optimize performance in the "C++ domain", not in the Python domain. C++ has very good performance tools, and they work best on readable source code.


Most Python experts know this, but all Python compilers have a tradeoff between speed and generality, i.e. how much Python is understood. Our mycpp tool is way toward the left of that spectrum — it's designed to make one program fast!

There are dozens of other (partial) Python implementations listed on the python.org wiki: PythonImplementations.

How Big Is mycpp?

It's pretty small! The line counts for translation show:

So I would like to hire one talented person to work on this. See Compiler Engineer Job.

Background

Graph of Translation Progress

As more evidence that this is doable, here's a plot of Oil spec tests passing under the C++ translation.

Progress on the Middle-Out Implementation of Oil

This is taken from Oil is Being Implemented "Middle Out", which also states caveats. In particular, we still have to hook up the garbage collector, which will require some rewriting.

Project History

Hopefully that gives you a good picture of mycpp and what I need help with. Here's some detail on how we got here.

This would have gone faster if I was a better compiler engineer and a better C++ programmer!

Oil Is a Big Project

But I'm also doing many other things at the same time — e.g. see A Tour of the Oil Language, which is a completely separate task.

You could say that the Oil project is three languages:

  1. The compatible OSH language. It's the most bash-compatible shell, by a mile.
  2. The new Oil language, which has Python-like expressions and Ruby-like blocks.
  3. This "mycpp" language, which has Python syntax.

(I'm thinking of calling the mycpp rewrite the Pea compiler.)

Question: MyPy Subset with Python 3.10 Pattern Matching?

Python 3.10 was released late last year, with structural pattern matching.

One thing I found difficult when building on top of MyPy is the "inverted" visitor style. My brain doesn't like"losing the stack" — i.e. when local variables become member variables.

I also noticed that recent versions of the AST module support type comments, which Oil still uses.

So I wonder if anybody has tried to write a strictly typed subset of MyPy with Python 3.10 pattern matching and the new ast info? I think that could be a fun project. If it's limited to what Oil uses, it shouldn't be too big.

Gradual typing was crucial to get Oil to where it is now, but our translator can now assume strict typing.

Conclusion

Let me know if this made sense. Feel free to reply in the comments, or e-mail me at andy at oilshell.org.

Please circulate this post so we can find the right person to help!

And check out Compiler Engineer Job on the wiki, which lists the 5 skills sought:

  1. Hard-won C++ experience, including knowledge of portability.
  2. Understanding of garbage collectors.
  3. Comfort with a test-driven and terminal-based workflow. We need to make the red line go up!
  4. Understanding of static type systems.
  5. Python.

Appendix: More Posts

I hope to find somebody soon, but if not, I could write posts on these technical details.

Python Meta / Inadvertently Repeating PyPy

The first attempt to speed up Oil was a bit naive. I inadvertently repeated PyPy by cobbling together a Python interpreter in Python.

It was a fun learning experience, and didn't take too long. But from there, there was no clear path to speed up Oil without static types. So you can say MyPy "saved" Oil.

So Oil is using many of Python's metalanguages and tools, and I could elaborate on this more:

  1. Zephyr ASDL to describe the language
  2. pgen2 for parsing the Oil language, which is heavily influenced by Python
  3. And now MyPy, with influence from mypyc

A Garbage-Collected Heap Shaped Like Statically Typed Python

How does our List<T> and Dict<K, V> implementation work? It's somewhat influenced by OCaml's runtime, which is covered in these chapters of Real World OCaml:

I also played with femtolisp, the language used to bootstrap the Julia language, giving it Lisp-like metaprogramming. It's one of the few "production" interpreters that uses a Cheney semi-space collector. (The es shell by Paul Haahr and Byron Rakitzis also does, but it appears defunct.)

Why Cheney, and not the more common mark-and-sweep?

Why not use C++ smart pointers and/or reference counting?

Object-Oriented Type Systems are Underexplained

I started this project with some knowledge about parsing, but I have no experience with type systems.

Somehow I assumed that books like the Dragon Book would explain how to write a type system, but they don't. Not even for a procedural language, let alone an object-oriented one!

These threads were part of my research and have interesting comments, and justification for the claim:

How Static Types Relate to Garbage Collection

The connection between static types and (precise) garbage collection also feels underexplained — e.g. the "C stack roots" problem. I have the The Garbage Collector Handbook, but it's mainly about algorithms, and not engineering concerns.

This means that translating statically typed Python to C++ is more difficult than translating say CoffeeScript to JavaScript. The translator has to understand more than just syntax!