Why Sponsor Oils? | blog | oilshell.org
This post is for Python experts! I briefly describe mycpp, a hacky Python-to-C++ translator I wrote on top of the MyPy type checker.
It works, but it needs improvement, or even a rewrite.
Context: We now have a NLnet grant of 50K euros, and I want to pay one or more engineers to help me translate Oil to C++. I wrote a longer post about this in March: Oil Is Being Developed "Middle Out".
Please share this post with anyone who may be interested!
I've described mycpp as a hybrid of the the recent mypyc compiler and old Shed Skin compiler. So this table is an easy way to explain it to Python experts.
Project |
Shed Skin |
mypyc |
mycpp |
Types | Inferred | Explicit Annotations | Explicit Annotations |
Compiler Output | C++ source code | Python-C Extension | C++ source code |
Garbage Collector | Boehm GC (imprecise, architecture-specific) |
CPython GC (precise, portable) |
Custom Cheney collector on |
The goal is to speed up Oil, and remove the dependency on CPython.
So there was a lot of Python expertise at Google, especially around 2005-2010.
This is the second brief way of explaining mycpp — by showing input / output pairs.
They're taken from the mycpp-examples benchmark, which is published with every release. Speedups range from ~5x to ~200x.
This translation is trivial because the algorithm only uses integers. C++ has fixed width integers, unlike Python, but that's OK for Oil.
Python input:
def fib_recursive(n):
# type: (int) -> int
if n == 0:
return 1
if n == 1:
return 1
return fib_recursive(n-1) + fib_recursive(n-2)
C++ output:
int fib_recursive(int n) {
if (n == 0) {
return 1;
}
if (n == 1) {
return 1;
}
return (fib_recursive((n - 1)) + fib_recursive((n - 2)));
}
This example uses Str*
and List<Str*>*
, which makes it more interesting. A
big difference is that the translator has to register C stack roots with
the garbage collector.
Python input:
def BackslashEscape(s, meta_chars):
# type: (str, str) -> str
escaped = [] # type: List[str]
for c in s:
if c in meta_chars:
escaped.append('\\')
escaped.append(c)
return ''.join(escaped)
C++ output:
Str* BackslashEscape(Str* s, Str* meta_chars) {
List<Str*>* escaped = nullptr;
StackRoots _roots({&s, &meta_chars, &escaped});
escaped = Alloc<List<Str*>>();
for (StrIter it(s); !it.Done(); it.Next()) {
Str* c = it.Value();
StackRoots _for({&c });
if (str_contains(meta_chars, c)) {
escaped->append(str0);
}
escaped->append(c);
}
return str1->join(escaped);
}
This strategy may seem a bit naive, but it works and performs well! It runs a large portion of OSH, which I show below.
It makes the translation more straightforward, and the output more readable:
Dict[K, V]
→ our C++ gc_heap::Dict<K, V>
Using C++ also makes the runtime and libc bindings easier to write. We get a little type safety when we manually write wrappers like:
List<Str*>* glob(Str* pattern)
for the C function:
int glob(const char *pattern, int flags,
int (*errfunc) (const char *epath, int eerrno),
glob_t *pglob);
(Why not Rust? The FAQ I mentioned is still upcoming. Here's a Reddit comment on that.)
None of them uses Python type annotations, which I found useful for generating predictably fast C++ code.
A key benefit of mycpp is being able to analyze and optimize performance in the "C++ domain", not in the Python domain. C++ has very good performance tools, and they work best on readable source code.
Most Python experts know this, but all Python compilers have a tradeoff between speed and generality, i.e. how much Python is understood. Our mycpp tool is way toward the left of that spectrum — it's designed to make one program fast!
There are dozens of other (partial) Python implementations listed on the python.org wiki: PythonImplementations.
It's pretty small! The line counts for translation show:
So I would like to hire one talented person to work on this. See Compiler Engineer Job.
As more evidence that this is doable, here's a plot of Oil spec tests passing under the C++ translation.
This is taken from Oil is Being Implemented "Middle Out", which also states caveats. In particular, we still have to hook up the garbage collector, which will require some rewriting.
Hopefully that gives you a good picture of mycpp and what I need help with. Here's some detail on how we got here.
glob()
.This would have gone faster if I was a better compiler engineer and a better C++ programmer!
But I'm also doing many other things at the same time — e.g. see A Tour of the Oil Language, which is a completely separate task.
You could say that the Oil project is three languages:
(I'm thinking of calling the mycpp rewrite the Pea compiler.)
Python 3.10 was released late last year, with structural pattern matching.
One thing I found difficult when building on top of MyPy is the "inverted" visitor style. My brain doesn't like"losing the stack" — i.e. when local variables become member variables.
I also noticed that recent versions of the AST module support type comments, which Oil still uses.
So I wonder if anybody has tried to write a strictly typed subset of
MyPy with Python 3.10 pattern matching and the new ast
info? I think
that could be a fun project. If it's limited to what Oil uses, it shouldn't be
too big.
Gradual typing was crucial to get Oil to where it is now, but our translator can now assume strict typing.
Let me know if this made sense. Feel free to reply in the
comments, or e-mail me at andy at oilshell.org
.
Please circulate this post so we can find the right person to help!
And check out Compiler Engineer Job on the wiki, which lists the 5 skills sought:
I hope to find somebody soon, but if not, I could write posts on these technical details.
The first attempt to speed up Oil was a bit naive. I inadvertently repeated PyPy by cobbling together a Python interpreter in Python.
It was a fun learning experience, and didn't take too long. But from there, there was no clear path to speed up Oil without static types. So you can say MyPy "saved" Oil.
So Oil is using many of Python's metalanguages and tools, and I could elaborate on this more:
How does our List<T>
and Dict<K, V
> implementation work? It's somewhat
influenced by OCaml's runtime, which is covered in these chapters of Real
World OCaml:
I also played with femtolisp, the language used to bootstrap the Julia language, giving it Lisp-like metaprogramming. It's one of the few "production" interpreters that uses a Cheney semi-space collector. (The es shell by Paul Haahr and Byron Rakitzis also does, but it appears defunct.)
Why Cheney, and not the more common mark-and-sweep?
Why not use C++ smart pointers and/or reference counting?
List[T]
turning into types
like shared_ptr<List<shared_ptr<T>>
, i.e. nested shared pointers.
Apparently there are other template parameters that don't compose / nest.I started this project with some knowledge about parsing, but I have no experience with type systems.
Somehow I assumed that books like the Dragon Book would explain how to write a type system, but they don't. Not even for a procedural language, let alone an object-oriented one!
These threads were part of my research and have interesting comments, and justification for the claim:
The connection between static types and (precise) garbage collection also feels underexplained — e.g. the "C stack roots" problem. I have the The Garbage Collector Handbook, but it's mainly about algorithms, and not engineering concerns.
This means that translating statically typed Python to C++ is more difficult than translating say CoffeeScript to JavaScript. The translator has to understand more than just syntax!