Why Sponsor Oils? | blog | oilshell.org
For the last few months, we've been translating Oil to fast native code, with help from an NLnet grant.
It's time to come up for air and give a status update! This post describes recent work on the garbage collector, which I checkpointed in recent releases.
Oil version 0.12.7 - Source tarballs and documentation.
We've learned a lot, changed a lot of code, and have a new collector working. I aim to convey a sense of that by sharing two narratives. Each one revolves around a technical design problem:
How do we solve the f(g(), h()) rooting problem? And how do we design a fork()-friendly collector?

For background, see Garbage-Collected Heap in C++, Shaped Like Typed Python.
We haven't finished our collector, but it passes many tests, and I believe we have good solutions to some tricky problems. I hope that experienced engineers will provide feedback, and help us with the code.
This post also mentions improvements to our dev tools, which make use of shell scripts, Ninja, and Docker.
If you appreciate this work, please sponsor Oil! We need the money, and are using it.
Jesse and I finished unifying the "old STL" and "Cheney" C++ experiments, mentioned in August's release notes.
It looks like the shell will have less than 7,000 lines of hand-written C++ code, which is a great result! The Line Counts for Translation show:
This is our "unsafe core". The remaining ~80K lines of C++ are translated from high-level languages. Comparisons:
One reason that we're writing a garbage collector is to eliminate the possibility of memory safety bugs in the interpreter logic. You can't express a buffer overflow or use-after-free in Python! (Or in regular languages.)
Another reason is that the Oil language itself has garbage-collected data structures.
(Our number will increase when we write C++ wrappers for libraries like GNU readline, but not by a lot.)
After unifying those experiments, we immediately switched to a non-moving mark-sweep collector. This is very different from the moving Cheney collector!
Why? When we revived the Cheney experiment, it was clear it never worked that well. Our stress tests exposed several crashes, which were mainly due to the "GC Rooting" problem.
I also realized that writing a moving collector for the metalanguage of C++ is quite different from writing one for Python or OSH. No reader has pointed this out, but I'm honestly not sure it's even possible in standard C++!

For example, it's disorienting that you have to copy and root the this pointer:
FunctionThatMayAllocate();
MyMethod();              // invalid: implicit this
this->MyMethod();        // invalid: explicit this
auto self = this;        // make a copy that can be moved
StackRoots _r({&self});  // tell the GC about it
self->MyMethod();        // happens to work
(I recall a StackOverflow post where someone said they asked Bjarne Stroustrup to add a "re-address me" operator to C++ to properly support moving collection.)
So we're now implementing a more straightforward garbage collector. But this means we'll need to do more optimization.
The reasoning behind Cheney was sound: our workload is allocation-heavy, and a few weeks ago we indeed saw free() at the top of perf profiles. (The Cheney algorithm doesn't free() dead objects; its collection time is proportional to the size of the live set.)
Switching algorithms had an unexpected benefit: we learned that the Cheney approach is not quite compatible with AddressSanitizer (ASAN), a critical tool for finding memory errors.
When we turned on the mark-sweep collector, bugs were more reliably flagged by ASAN. It appears that we were "missing" bugs with Cheney!
In hindsight, this is obvious: ASAN works by instrumenting malloc(), but Cheney uses its own bump allocator. In contrast, mark-sweep uses the standard malloc() interface.
So we now have a great testing strategy: ASAN combined with #ifdef GC_EVERY_ALLOC, which stress tests the GC.
I remember reading the following advice in the es shell paper: have a mode to collect on every allocation, and use guard pages to force segfaulting as early as possible. (es uses a Cheney collector, which was the inspiration for our first design.)
But it's one thing to read advice, and another to learn it the hard way!
So, after this experience, I want to pass along a modern version of that advice. If you're writing a (simple) garbage collector, make sure there's a build that uses the malloc() and free() interface, and test it with ASAN and GC_EVERY_ALLOC.
I call that the gold standard. It should find more bugs than creating your own guard pages. When the es shell paper was written in 1993, ASAN didn't exist, and I don't think Valgrind did either.
I find this to be an underexplained and tricky problem: How does the application tell the garbage collector about the root set of pointers to trace? In both the Cheney and mark-sweep algorithms, an object graph traversal starts from this set.
I've repeatedly referred to this 2013 blog post about precise rooting in Mozilla SpiderMonkey — in blog posts, in the job listing, and on video calls:
One takeaway is that they wrote a GCC plugin to find rooting problems. I don't think we need to do that, at least not yet.
But I wish there was a good survey of the strategies that different language implementations take. I've heard that Google V8 has changed their rooting policy over the years, but I don't know the details. (It's also a VM, which again differs from our approach of generating C++.)
Another way I think about the rooting problem: the value of the "conservative" or imprecise Boehm GC work is that it lets you punt on this problem!
Instead of having the application tell the garbage collector about pointers, the garbage collector scans for them. But we won't use this technique, because there's no way to do it in portable C or C++. That is, we don't want #ifdefs for x86, ARM, 32-bit, etc.
The f(g(), h()) Problem

The ASAN + GC_EVERY_ALLOC testing exposed a problem in our rooting logic!
With the Cheney collector, we had seen disorienting bugs when translating statements like obj.method(s[1:]). The problem is that the collector can move obj while the statement is being evaluated! We solved this problem with correct rooting, but there are many other cases to handle, and it felt error prone.
I thought that using mark-sweep would avoid most of this complexity. But there are similar problems with a non-moving collector. Consider this statement:
f(s[0], s[1:])
Or equivalently:
f(g(), h()) # does g() or h() return a live object?
The problem is that the result of g() is "temporary", and if h() allocates, then the collector may sweep that result away! This is the kind of bug you find with GC_EVERY_ALLOC, because the time window for the bug is small. It occurs while evaluating temporary expressions.
One way to fix this is with a bytecode- or SSA-like transformation like:
tmp1 = g()
tmp2 = h()
f(tmp1, tmp2)
However, doing this analysis correctly involves types and logic for each language construct. For example, the + and * operators in Python may or may not allocate depending on the type of the operands, so it's more work than it appears.
Right now, mycpp is halfway between an AST printer and a compiler, and this would require making it more like a real compiler. It would also make the generated code less readable.
I came up with a policy I call "return value rooting" that "inverts" the responsibility for rooting. In short, you tell the collector about a root object when you return it from a function:
Str* MyFunc(bool cond) {
  // Update a data structure that mirrors the C stack
  RootingScope _r;
  if (cond) {
    Str* s = OtherFunc();
    // Add to the root set; tell caller to remove it
    // when done
    gHeap.RootOnReturn(s);
    return s;
  }
  return nullptr;
}
It's not done yet, but it made all 5 crashes in our tests go away, which seems promising.
I generally try not to "invent" anything for Oil, instead relying on textbook knowledge, and techniques lifted from production codebases. But again, I haven't found good resources on this engineering problem.
Let us know if you have worked on similar problems!
This policy isn't compatible with the Cheney algorithm, so we'll probably never go back to Cheney.
fork()-friendly Collector Data Structures

So that was a story about finding a good bug with testing. It shows why the GC rooting problem is important.
Here's another story, organized around the following problem: How do we design the garbage collector's data structures so that they don't defeat the copy-on-write of fork()?
This is important, because forking the interpreter is idiomatic shell!
There are two issues, and both involve the common object header: where do we store the mark bits, and where do we store the next pointer? If the object header is mutated during collection (when the objects themselves aren't), then every memory page that has an object can be dirtied. The resulting page faults would make collection slow and expensive.
The CRuby interpreter addressed this problem in 2012, and it also occurs with CPython's reference counts.
In other words, we want the collector to be fork()-friendly. Simple mark-sweep designs generally aren't.

Recall that imprecise scanning can't be done in portable C++, which is one reason we avoid it. The portability constraint creates another problem: how do we store mark bits outside the object header?
The typical solution is a mark bitmap that's "parallel" to the heap. But that only works when you can assume a limited range of addresses that malloc() returns. In portable code, you have to deal with the fact that malloc() can return almost any 64-bit address, and 2^64 addresses divided by 8 bits per byte is still a very big bitmap!
Most garbage collectors include their own allocators, but we want to work with any allocator. (Remember how useful we found ASAN's instrumenting allocator! We also don't want to lose heap-profiling tools.)
I've looked at various GC implementations for years. I remembered that the Skew language has a nice design and small implementation, and that it was used at Figma. So I looked at the code again — in particular the C++ runtime.
It happens to be one of the most similar projects to our mycpp: a statically-typed Python- or JavaScript-like language that's translated to C++. The generated code makes use of garbage-collected data structures. (Vala also looks pretty similar.)
Interestingly, they have support for fork() (which I still don't understand, as it still blocks the mutator process), and a separate mark set! But how did they do that? By simply storing all the marked pointers in an std::unordered_set.
My intuition was that this is too slow for a garbage collector, but I know the author has done many performance-oriented projects like esbuild (not to mention Figma itself!).
unordered_set<void*>::insert Slower Than malloc(1)
So we tried unordered_set, since it was easy enough.

There's a long story here, including one of our own bugs, but the bottom line is that this blew up our benchmarks, and we learned the hard way why std::unordered_set is slow.
mattheium explains it well: the container must be slow by the design of C++, because it must implement methods that most hash table users don't care about.

I wrote a benchmark which reported that unordered_set<void*>::insert() is slower than malloc(1), which was baffling!
I thought the issue was that it forces implementers into a linked list approach, rather than a probing approach, but it seems even worse than that.
In any case, I think it's best to completely avoid hash tables in the garbage collector!
(Note that I never actually compiled or ran Skew — it's just compact code that was worth reading. It did seem that the C++ backend was more like a "fun hack" than production code. The advertised backends are JavaScript and C#.)
So how do we solve the "separate mark bits" problem portably, without a hash table? I think we'll simply assign a 24-bit ID to each object upon allocation, and use that as an index into a growable bitmap.
How do we avoid the "intrusive" linked list for sweeping? (An "intrusive" container is one that affects the items it contains.)
That was easy: we use a std::vector<Obj*> for the list of all objects, and compact it when we sweep. Most collectors use a singly- or doubly-linked list, but this seems just as good.
So with this separate metadata for both marking and sweeping, I believe the new object header will remain a compact 64 bits. One of its fields records whether an object is Scanned (contains pointers) or Opaque (no pointers).

In the GC heap post, I said that Cheney could have less per-object overhead than mark-sweep, at least for our problem. But having separate mark bits and a separate vector apparently makes them equal.
If you're interested in working on this, let us know!
The garbage collector isn't done, but I think the design issues are mostly sorted out. It works on substantial new tests.
Here are some logical milestones I see:
If that all works, then:
Writing a solid collector involves many tools: GC_EVERY_ALLOC, code coverage, and more.

As a result, I've probably spent more time working with Ninja and Docker than C++ lately!
Another goal is to improve dev tools, in response to feedback from Jesse and others. I've been updating the wiki: Where Contributors Have Problems.
We should make it as easy to contribute as possible.
When I started oil-native, I used shell scripts to build it. They invoked our custom translators and the C++ toolchain.
This is the simplest approach, but it became slow and disorganized, especially with many test binaries and build variants (ASAN, UBSAN, etc.)
We now have a small declarative wrapper around Ninja:
$ wc -l build/ninja*.{py,sh}
458 build/ninja_lib.py
253 build/ninja_lib_test.py
337 build/ninja_main.py
274 build/ninja-rules-cpp.sh
359 build/ninja-rules-py.sh
1681 total
I'll write about it later, but it's like a mini-Bazel on top of Ninja. It has declarative rules like cc_binary(), cc_library(), and asdl_library(). It correctly generates implicit dependencies for code generation, and it automatically finds Python's dynamic dependencies.
To support the garbage collector work, I put more measurements in Soil, our shell- and container-based CI configuration.
The docker build became slow, so I learned how to use features like cache mounts: use RUN --mount in the Dockerfile. This mechanism lets different container builds share apt packages on the host, so you don't download them repeatedly.
This was effective, but I find it odd that there's a runtime flag docker --mount, and a build-time RUN --mount. That motivated a thread:
Docker has been useful, and it works well with shell scripts. But I have some ideas to optimize image creation by relying more on a real shell language, and less on Docker! Optimizing the layer sizes would be our "north star" for this work. More on this later.
On the other hand, there's no need to migrate away from Ninja. It's composable in the Unix sense, while Docker isn't! It would be nice to write a blog post about this.
There are many more posts on #blog-ideas, and I should at least dump them into one or two "backlog" posts:
The sketches on #software-architecture I wrote in 2022 and 2021 ended up being the most popular posts on the blog!
Translating Oil to C++ is harder than I expected, and progress on it stalled in 2021. But working under the NLnet grant in 2022 has given new life to the project.
No promises, but it looks like we'll have a solid shell on the other side of this. We will likely have to crawl into a hole for another few months :-)
In the meantime, please sponsor Oil, and tell your friends and coworkers about it. The grant doesn't cover all expenses, and I just used sponsor donations to pay Melvin, who is doing excellent work.
If you made it here, you should help us! We will always need experienced C++ and Python programmers.
Join us on Zulip, or send me an e-mail at andy@oilshell.org!
Here's a partial list of issues closed for the 0.12.7 release.
#1322 - mycpp should generate field masks for classes (e.g. DirStack in examples/scoped_resource)
#1321 - mycpp generates invalid StackRoot for integer loop index
#1313 - changing mycpp/NINJA-steps.sh doesn't invalidate generated code
#1307 - _tmp/ does not get cleaned on `build/clean.sh`
#1304 - gBuf.dealloc() to avoid leaks
#1240 - Minor code generators should work in GC mode
#1206 - mycpp/examples use heap-allocated Tuple and crash
#1186 - eliminate other allocators from all bindings / libs
#1176 - generated ASDL code should support garbage collection
#1173 - Unified Ninja build
These metrics help me keep track of the project. Let's compare this release with Oil 0.12.4 in late August.
We've focused on the garbage collector, so the spec test numbers for OSH and Oil haven't moved as of 0.12.7. But Melvin just revived this work, so the next release will show progress.
The parser is slower after recent work on the garbage collector. We're making the code correct first, and then fast.
The parser workload is very allocation-heavy, so it will serve as a good benchmark.
(The "NA" cachegrind irefs measurement seems like a bug which I should look into.)
Let's look at our small, synthetic benchmarks:
As expected, the mark-sweep collector is slower than the Cheney collector. But the Cheney collector wasn't practical, mainly due to the GC rooting problem.
Those pages show that we fixed some memory leaks:
In 0.12.7, every example but the worst one has a ratio less than 1.0, which means the translated C++ is indeed faster than Python. I think we should be able to fix that one as well.
Even though we added field masks, the lines of generated code are down:
The binary size has been fluctuating, but it's reasonable now, and can be optimized when there's time:
I suspect this is because we removed many StackRoots in one release. If so, it shows that GC is expensive in more than one way!