What is oheap?


In the last post, I mentioned that oil's ASDL implementation can serialize ASTs into a binary format I'm calling oheap.

It's likely to change, but I'll continue to mention it, so it's valuable to explain now. At the end of this post, I'll make note of limitations and future work.


I wrote oheap to solve a fairly specific problem, but there will be more applications of it in the future.

First, we want to bridge languages: to integrate the osh parser in Python with a shell runtime in C++. Sharing the AST via an in-memory binary format is actually the simplest and most efficient way to do this. The other, perhaps more conventional, way is to use the Python-C API.

This is exactly how Python uses its ASDL implementation. But the Python-C API is actually quite complicated, due its large surface area and runtime overhead. I mentioned in this post that 123 lines of Python.asdl translates to 8100 lines of C code. In contrast, in oil's implementation, 129 lines of osh.asdl translates to 1100 lines of C++ code in osh.asdl.h.

As we'll see below, most these lines simply define types and do no work.

The second motivation is further in the future: I want much of oil to be written in oil. However, it doesn't make sense to pay the cost of parsing oil for every shell script it runs. Compiling oil code to oheap will avoid this work.

Thus oheap can be compared to the .pyc file in Python. But unlike Python, we won't use the file system as a cache. This scheme has caused a number of problems, and oil shouldn't litter your system with temp files simply because you ran a shell script (although surprisingly, almost all shells use temp files to implement here docs).


I call the format oheap because it's conceptually like the C heap — a block of bytes representing integers, pointers, strings, arrays, and records. However, it's a relocatable heap, which means you can store it in a file and load it with a single read() call.

A shortcut to understanding the format is through these two functions, which are essentially the "runtime library" for oheap:

inline int Obj::Int(int n) const {
    return bytes_[n] + (bytes_[n+1] << 8) + (bytes_[n+2] << 16);

inline const Obj& Obj::Ref(const uint32_t\* base, int n) const {
    int offset = Int(n);
    return reinterpret_cast<const Obj&>(base[offset]);

The Int method takes an offset n from the beginning of an Obj instance — i.e. its first bytes_ member — and treats that location as the beginning of a little-endian three-byte integer.

The Ref method takes a pointer to the oheap image, an offset n to be treated as a pointer, and returns a reference to another Obj. In other words, it looks up a field on an Obj which is a pointer to another Obj.

Here is the analogy to the C heap:

The rest of the osh.asdl.h header file, generated from osh.asdl, gives you a ASDL-typed API over the heap using static_cast<>. It resembles a subset of the protobuf API.

Also note that every function in the header is inline. We are doing very little work: juste array indexing, left shifts, and addition.


ASDL has these logical types:

They can be represented with these respective physical storage types in oheap's C-like model:

oheap can use integers and pointers of any width, but I chose three bytes for oil because any shell script should be representable with an AST of less than 224 = 16 Mi locations.

Not only does representing pointers as integers make the heap relocatable, it also compresses it.

For example, oheap can represent a sum type with five fields in just 16 bytes: one byte for the tag (assuming there are less than 256 variants), and then three bytes for each of five fields.

On a 32-bit machine, this would take 1 + 5*4 = 21 bytes. On a 64-bit machine, it would take 1 + 5*8 = 41 bytes. In the latter case, oheap is over 60% smaller.

Comparison to Other Serialization Formats

Together, ASDL and oheap look a bit like protocol buffers, with this architecture:

I mentioned a key difference above: oheap requires no decoding.

This reminds me a bit of capnproto, which is in some ways a successor to protocol buffers (having been developed by the author of "proto2"). capnproto also avoids the decoding step by using the in-memory format as the serialization format (albeit with some message-independent compression).

You could say that oheap is the opposite: we're using the file format as the in-memory format. The format is designed to be both compact and efficiently decoded on the fly.

Calling Ref() is undoubtedly faster than parsing shell, so we achieve our goal of avoiding parsing. It remains to be seen whether oheap is useful in other situations.


Right now, oil uses oheap in an immutable fashion. Everything is packed tightly together, and there's no operation for appending to an array, for example.

I have ideas for implementing mutating operations, but it's also possible that we'll stick with the ML-like model of transformations on immutable trees.

Another important difference between oheap and protobuf or capnproto is that there are no integer tags in the binary encoding to identify fields, This makes the data smaller, but it means we have less backward compatibility across schema changes.

It's possible to add tags, but we don't need currently need them.


oheap is a compact, lazily-decoded file format and type-safe API for pointer-based data structures. We're using it to represent ASDL trees, but it can also represent graphs — i.e. data structures with cyclic pointers.

It can be compared to .pyc files in Python, or serialization schemes like protobuf and capnproto, although there are important differences.

Its limitations are that the API is read-only, and the format isn't suitable for interchange because schema upgrades will break old data.

These limitations aren't a concern for oil right now, but they can be fixed. In that case, oheap may be generalized, and perhaps even play a role in the oil language itself. Binary formats are not Unix-y, but they can be useful and are widely used in practice.


Discuss this post on Reddit