In the last post, I mentioned that oil's ASDL implementation can
serialize ASTs into a binary format I'm calling oheap.
It's likely to change, but I'll continue to mention it, so it's valuable to explain now. At the end of this post, I'll make note of limitations and future work.
I wrote oheap to solve a fairly specific problem, but there will be more
applications of it in the future.
First, we want to bridge languages: to integrate the osh parser in Python
with a shell runtime in C++. Sharing the AST via an in-memory binary format is
actually the simplest and most efficient way to do this. The other, perhaps
more conventional, way is to use the Python-C API.
This is exactly how Python uses its ASDL implementation. But the Python-C API
is actually quite complicated, due its large surface area and runtime overhead.
I mentioned in this post that 123 lines of
Python.asdl translates to 8100 lines of C code. In contrast, in oil's
implementation, 129 lines of osh.asdl translates to 1100 lines of C++ code
in osh.asdl.h.
As we'll see below, most these lines simply define types and do no work.
The second motivation is further in the future: I want much of oil to be
written in oil. However, it doesn't make sense to pay the cost of parsing
oil for every shell script it runs. Compiling oil code to oheap will avoid
this work.
Thus oheap can be compared to the .pyc file in Python. But unlike
Python, we won't use the file system as a cache. This scheme has caused a
number of problems, and oil shouldn't litter your system with temp files simply
because you ran a shell script (although surprisingly, almost all shells use
temp files to implement here docs).
I call the format oheap because it's conceptually like the C heap — a
block of bytes representing integers, pointers, strings, arrays, and records.
However, it's a relocatable heap, which means you can store it in a file
and load it with a single read() call.
A shortcut to understanding the format is through these two functions, which are essentially the "runtime library" for oheap:
inline int Obj::Int(int n) const {
return bytes_[n] + (bytes_[n+1] << 8) + (bytes_[n+2] << 16);
}
inline const Obj& Obj::Ref(const uint32_t\* base, int n) const {
int offset = Int(n);
return reinterpret_cast<const Obj&>(base[offset]);
}
The Int method takes an offset n from the beginning of an Obj instance
— i.e. its first bytes_ member — and treats that location as the
beginning of a little-endian three-byte integer.
The Ref method takes a pointer to the oheap image, an offset n to be
treated as a pointer, and returns a reference to another Obj. In other
words, it looks up a field on an Obj which is a pointer to another Obj.
Here is the analogy to the C heap:
Obj, which makes Obj& analogous
to void*.Ref() call is analogous to deferencing a pointer. This means that
ASDL structures can be lazily traversed. We can load them anywhere in
memory and immediately begin computation, without a decoding step.The rest of the osh.asdl.h header file, generated from osh.asdl, gives
you a ASDL-typed API over the heap using static_cast<>. It resembles a
subset of the protobuf API.
Also note that every function in the header is inline. We are doing very
little work: juste array indexing, left shifts, and addition.
ASDL has these logical types:
They can be represented with these respective physical storage types in
oheap's C-like model:
0 or 1NUL-terminated sequence of charactersoheap can use integers and pointers of any width, but I chose three bytes for
oil because any shell script should be representable with an AST of less than
224 = 16 Mi locations.
Not only does representing pointers as integers make the heap relocatable, it also compresses it.
For example, oheap can represent a sum type with five fields in just 16
bytes: one byte for the tag (assuming there are less than 256 variants), and
then three bytes for each of five fields.
On a 32-bit machine, this would take 1 + 5*4 = 21 bytes. On a 64-bit
machine, it would take 1 + 5*8 = 41 bytes. In the latter case, oheap
is over 60% smaller.
Together, ASDL and oheap look a bit like protocol buffers, with
this architecture:
foo.asdl vs. foo.proto)oheap vs. a tree of variable-length integers and byte
strings)I mentioned a key difference above: oheap requires no decoding.
This reminds me a bit of capnproto, which is in some ways a successor to protocol buffers (having been developed by the author of "proto2"). capnproto also avoids the decoding step by using the in-memory format as the serialization format (albeit with some message-independent compression).
You could say that oheap is the opposite: we're using the file format as the
in-memory format. The format is designed to be both compact and efficiently
decoded on the fly.
Calling Ref() is undoubtedly faster than parsing shell, so
we achieve our goal of avoiding parsing. It remains to be seen whether oheap
is useful in other situations.
Right now, oil uses oheap in an immutable fashion. Everything is packed
tightly together, and there's no operation for appending to an array, for
example.
I have ideas for implementing mutating operations, but it's also possible that we'll stick with the ML-like model of transformations on immutable trees.
Another important difference between oheap and protobuf or capnproto
is that there are no integer tags in the binary encoding to identify fields,
This makes the data smaller, but it means we have less backward compatibility
across schema changes.
It's possible to add tags, but we don't need currently need them.
oheap is a compact, lazily-decoded file format and type-safe API for
pointer-based data structures. We're using it to represent ASDL trees, but
it can also represent graphs — i.e. data structures with cyclic
pointers.
It can be compared to .pyc files in Python, or serialization schemes like protobuf and capnproto, although there are important differences.
Its limitations are that the API is read-only, and the format isn't suitable for interchange because schema upgrades will break old data.
These limitations aren't a concern for oil right now, but they can be fixed.
In that case, oheap may be generalized, and perhaps even play a role in the
oil language itself. Binary formats are not Unix-y, but they
can be useful and are widely used in practice.
Emacs Dumper Dispute: (comments on Hacker News) — This article was very timely, appearing at exactly the time I was thinking about serializing ASTs.
Serializing the heap is common in Lisp implementations, but I didn't realize
that Emacs uses special glibc APIs to do it! oheap appears to solve the
same problem, and does so in a portable fashion.
High Performance Code 201: Hybrid Data Structures at CppCon 2016
— This talk emphasizes the importance of "small-size optimization" in
LLVM -- storing elements of small containers inline, to avoid dereferencing
pointers. oheap may benefit from similar techniques.