
Oil 0.12.7 - Garbage Collector Problems

2022-10-20

For the last few months, we've been translating Oil to fast native code, with help from an NLnet grant.

It's time to come up for air and give a status update! This post describes recent work on the garbage collector, which I checkpointed in recent releases.

Oil version 0.12.7 - Source tarballs and documentation.

We've learned a lot, changed a lot of code, and have a new collector working. I aim to convey a sense of that by sharing two narratives. Each one revolves around a technical design problem:

  1. GC rooting: How does the app tell the garbage collector where to start tracing?
  2. What data structures should we use to make a portable, fork()-friendly collector?
Table of Contents
Background and Motivation
Recent Results and Changes
A Small Core in C++
Switching to Non-Moving Mark and Sweep
Testing Gold: ASAN with GC_EVERY_ALLOC
Problem: GC Rooting
The f(g(), h()) Problem
"Return Value Rooting" Policy
Problem: fork()-friendly Collector Data Structures
Can Mark Bits be Separate, Portably?
The Skew Language Has a Similar C++ Runtime
unordered_set<void*>::insert Slower Than malloc(1)
New Object Header
Future Milestones
Removing Dev Friction / Improving Tools
Declarative Ninja / Mini Bazel
Soil CI: Unifying Shell and Containers?
Unix Philosophy: Ninja vs. Docker
Conclusion
Please Sponsor
Help Wanted
Appendix
Closed Issues
Metrics for the 0.12.7 Release

Background and Motivation

For background, see Garbage-Collected Heap in C++, Shaped Like Typed Python.

We haven't finished our collector, but it passes many tests, and I believe we have good solutions to some tricky problems. I hope that experienced engineers will provide feedback, and help us with the code.

This post also mentions improvements to our dev tools, which make use of shell scripts, Ninja, and Docker.


If you appreciate this work, please sponsor Oil! We need the money, and are using it.

Recent Results and Changes

A Small Core in C++

Jesse and I finished unifying the "old STL" and "Cheney" C++ experiments, mentioned in August's release notes.

It looks like the shell will have less than 7,000 lines of hand-written C++ code, which is a great result! The Line Counts for Translation show the details.

This is our "unsafe core". The remaining ~80K lines of C++ are translated from high-level languages.

One reason that we're writing a garbage collector is to eliminate the possibility of memory safety bugs in the interpreter logic. You can't express a buffer overflow or use-after-free in Python! (Or in regular languages.)

Another reason is that the Oil language itself has garbage-collected data structures.

(Our number will increase when we write C++ wrappers for libraries like GNU readline, but not by a lot.)

Switching to Non-Moving Mark and Sweep

After unifying those experiments, we immediately switched to a non-moving mark-sweep collector. This is very different than the moving Cheney collector!

Why? When we revived the Cheney experiment, it was clear it never worked that well. Our stress tests exposed several crashes, which were mainly due to the "GC Rooting" problem.

I also realized that writing a moving collector with C++ as the metalanguage is quite different from writing one for Python or OSH. No reader has pointed this out, but I'm honestly not sure it's even possible in standard C++! For example, it's disorienting that you have to copy and root the this pointer:

FunctionThatMayAllocate();  // a collection may move *this*

MyMethod();              // invalid implicit this
this->MyMethod();        // invalid explicit this

auto self = this;        // make a copy that can be moved
StackRoots _r({&self});  // tell the GC about it
self->MyMethod();        // happens to work

(I recall a StackOverflow post where someone said they asked Bjarne Stroustrup to add a "re-address me" operator to C++ to properly support moving collection.)

So we're now implementing a more straightforward garbage collector. But this means we'll need to do more optimization.

The reasoning behind Cheney was sound: our workload is allocation-heavy, and a few weeks ago we indeed saw free() at the top of perf profiles. (The Cheney algorithm doesn't free() dead objects; its collection time is proportional to the size of the live set.)

Testing Gold: ASAN with GC_EVERY_ALLOC

Switching algorithms had an unexpected benefit: we learned that the Cheney approach is not quite compatible with AddressSanitizer (ASAN), a critical tool for finding memory errors.

When we turned on the mark-sweep collector, bugs were more reliably flagged by ASAN. It appears that we were "missing" bugs with Cheney!

In hindsight, this is obvious because ASAN works by instrumenting malloc(), but Cheney uses its own bump allocator. In contrast, mark-sweep uses the standard malloc() interface.
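
To make that concrete, here's a minimal sketch of Cheney-style bump allocation (the names are illustrative, not Oil's exact code):

char* alloc_ptr;  // next free byte in the current semispace
char* space_end;  // end of the current semispace

void* Allocate(size_t n) {
  if (alloc_ptr + n > space_end) {
    CheneyCollect();  // copy live objects into the other semispace,
                      // then reset alloc_ptr and space_end
  }
  void* p = alloc_ptr;  // no malloc(): ASAN can't see individual objects
  alloc_ptr += n;
  return p;
}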


So we now have a great testing strategy: ASAN combined with #ifdef GC_EVERY_ALLOC, which stress tests the GC.

I remember reading the following advice in the es shell paper: have a mode to collect on every allocation, and use guard pages to force segfaulting as early as possible. (es uses a Cheney collector, which was the inspiration for our first design.)

But it's one thing to read advice, and another to learn it the hard way!

So, after this experience, I want to pass along a modern version of that advice. If you're writing a (simple) garbage collector, make sure there's a build that uses the malloc() and free() interface, and test it with ASAN and GC_EVERY_ALLOC.

I call that the gold standard. It should find more bugs than creating your own guard pages. When the es shell paper was written in 1993, ASAN didn't exist, and I don't think Valgrind did either.
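
In other words, the stress-test hook can be a single #ifdef in the allocator. A minimal sketch, with assumed names like Heap::Allocate() and Collect(); Oil's real allocator does more:

void* Heap::Allocate(size_t num_bytes) {
#ifdef GC_EVERY_ALLOC
  Collect();  // any object that isn't properly rooted dies right here
#endif
  void* p = malloc(num_bytes);  // plain malloc(), so ASAN instruments it
  // ... record p so the sweep phase can find it ...
  return p;
}

Because dead objects go through free(), any later use of a stale pointer shows up in the ASAN report as a heap-use-after-free.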

Problem: GC Rooting

I find this to be an underexplained and tricky problem: How does the application tell the garbage collector about the root set of pointers to trace? In both the Cheney and mark-sweep algorithms, an object graph traversal starts from this set.

I've repeatedly referred to this 2013 blog post about precise rooting in Mozilla SpiderMonkey — in blog posts, in the job listing, and on video calls.

One takeaway is that they wrote a GCC plugin to find rooting problems. I don't think we need to do that, at least not yet.

But I wish there was a good survey of the strategies that different language implementations take. I've heard that Google V8 has changed their rooting policy over the years, but I don't know the details. (It's also a VM, which again differs from our approach of generating C++.)

Another way I think about the rooting problem: the value of the "conservative" or imprecise Boehm GC work is that it lets you punt on this problem!

Instead of having the application tell the garbage collector about pointers, the garbage collector scans for them. But we won't use this technique, because there's no way to do it in portable C or C++. That is, we don't want #ifdefs for x86, ARM, 32-bit, etc.
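
For reference, here's roughly what conservative scanning looks like, and where portability breaks down. LooksLikeHeapObject() and MarkObject() are hypothetical, and there's no standard way to obtain the stack bounds lo and hi:

void ScanStackConservatively(char* lo, char* hi) {
  for (char* p = lo; p + sizeof(void*) <= hi; p += sizeof(void*)) {
    void* candidate;
    memcpy(&candidate, p, sizeof candidate);  // any word might be a pointer
    if (LooksLikeHeapObject(candidate)) {
      MarkObject(candidate);  // may over-retain, never under-retain
    }
  }
}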

The f(g(), h()) Problem

The ASAN + GC_EVERY_ALLOC testing exposed a problem in our rooting logic!

With the Cheney collector, we had seen disorienting bugs when translating statements like obj.method(s[1:]). The problem is that:

  1. The slice operation allocates, and allocating may collect garbage.
  2. Collection moves obj while the statement is being evaluated!

We solved this problem with correct rooting, but there are many other cases to handle, and it felt error prone.

I thought that using mark-sweep would avoid most of this complexity. But there are similar problems with a non-moving collector. Consider this statement:

f(s[0], s[1:])

Or equivalently:

f(g(), h())  # does g() or h() return a live object?

The problem is that the result of g() is "temporary", and if h() allocates, then the collector may sweep that result away! This is the kind of bug you find with GC_EVERY_ALLOC, because the time window for the bug is small. It occurs while evaluating temporary expressions.

One way to fix this is with a bytecode- or SSA-like transformation like:

tmp1 = g()
tmp2 = h()
f(tmp1, tmp2)
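
In C++ terms, the named temporaries give the rooting machinery something to hold onto. A sketch, using the StackRoots helper shown earlier:

Str* tmp1 = g();
StackRoots _r({&tmp1});  // now the collector can trace tmp1
Str* tmp2 = h();         // may trigger a collection; tmp1 survives
f(tmp1, tmp2);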

However, doing this analysis correctly involves types and logic for each language construct. For example, the + and * operators in Python may or may not allocate depending on the type of the operands, so it's more work than it appears.

Right now, mycpp is halfway between an AST printer and a compiler, and this would require making it more like a real compiler. It would also make the generated code less readable.

"Return Value Rooting" Policy

I came up with a policy I call "return value rooting" that "inverts" the responsibility for rooting. In short, you tell the collector about a root object when you return it from a function:

Str* MyFunc(bool cond) {
  // Update a data structure that mirrors the C stack
  RootingScope _r;

  if (cond) {
    Str* s = OtherFunc();
    // Add to the root set; tell caller to remove it
    // when done
    gHeap.RootOnReturn(s);
    return s;
  }
  return nullptr;
}

It's not done yet, but it made all 5 crashes in our tests go away, which seems promising.
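
Here's a minimal sketch of one way RootingScope and RootOnReturn can fit together, using a flat root list plus per-scope base indices; the real data structures may differ:

#include <vector>

struct Obj;  // the GC object header (details omitted)

class Heap {
 public:
  void PushScope() { frame_base_.push_back(roots_.size()); }

  void PopScope() {
    roots_.resize(frame_base_.back());  // unroot this scope's values
    frame_base_.pop_back();
  }

  // Called just before returning obj: insert it BELOW this scope's
  // base, so it belongs to the caller's scope and survives our own
  // PopScope().  The caller's RootingScope removes it "when done".
  void RootOnReturn(Obj* obj) {
    size_t base = frame_base_.back();
    roots_.insert(roots_.begin() + base, obj);
    frame_base_.back() = base + 1;
  }

 private:
  std::vector<Obj*> roots_;         // the root set marking starts from
  std::vector<size_t> frame_base_;  // where each scope's roots begin
};

Heap gHeap;  // global heap, as in the generated code

// The RAII helper emitted at the top of translated functions.
class RootingScope {
 public:
  RootingScope() { gHeap.PushScope(); }
  ~RootingScope() { gHeap.PopScope(); }
};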

I generally try not to "invent" anything for Oil, instead relying on textbook knowledge and techniques lifted from production codebases. But again, I haven't found good resources on this engineering problem.

Let us know if you have worked on similar problems!

This policy isn't compatible with the Cheney algorithm, so we'll probably never go back to Cheney.

Problem: fork()-friendly Collector Data Structures

So that was a story about finding a good bug with testing. It shows why the GC rooting problem is important.

Here's another story, organized around the following problem: How do we design the garbage collector's data structures so that they don't defeat the copy-on-write of fork()?

This is important, because forking the interpreter is idiomatic shell!

There are two issues, both involving the common object header.

  1. Are the mark bits stored in the header of every object, or stored elsewhere?
  2. How does the sweep phase find all objects? Does the object header contain a next pointer?

If the object header is mutated during collection (when the objects themselves aren't), then every memory page that has an object can be dirtied. The resulting page faults would make collection slow and expensive.

The CRuby interpreter addressed this problem in 2012, and it also occurs with CPython's reference counts.

Can Mark Bits be Separate, Portably?

Recall that imprecise scanning can't be done in portable C++, which is one reason we avoid it. The portability constraint creates another problem: how do we store mark bits outside the object header?

The typical solution is a mark bitmap that's "parallel" to the heap. But that only works when you can assume a limited range of addresses that malloc() returns.

In portable code, you have to deal with the fact that malloc() can return almost any 64-bit address. Even at one bit per byte of address space, that's a 2^64 / 8 = 2^61 byte bitmap, which is clearly impractical!

Most garbage collectors include their own allocators, but we want to work with any allocator. (Remember how useful we found ASAN's instrumenting allocator! We also don't want to lose heap-profiling tools.)

The Skew Language Has a Similar C++ Runtime

I've looked at various GC implementations for years. I remembered that the Skew language has a nice design and small implementation, and that it was used at Figma. So I looked at the code again — in particular the C++ runtime.

It happens to be one of the most similar projects to our mycpp -- a statically-typed Python- or JavaScript-like language that's translated to C++. This generated code makes use of garbage-collected data structures. (Vala also looks pretty similar.)

Interestingly, they have support for fork() (which I still don't understand, as it still blocks the mutator process), and a separate mark set!

But how did they do that? By simply storing all the marked pointers in an std::unordered_set.

My intuition was that this is too slow for a garbage collector, but I know the author has done many performance-oriented projects like esbuild (not to mention Figma itself!).

unordered_set<void*>::insert Slower Than malloc(1)

So we tried unordered_set, since it was easy enough.

There's a long story here, including one of our own bugs, but the bottom line is that this blew up our benchmarks, and we learned the hard way why std::unordered_set is slow.

I wrote a benchmark which reported that unordered_set<void*>::insert() is slower than malloc(1), which was baffling!
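
For reference, a comparison along those lines only takes a few lines of code. This is a simplified sketch, not the exact benchmark:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <unordered_set>

int main() {
  const int N = 1000000;

  // Baseline: N tiny allocations.
  auto t0 = std::chrono::steady_clock::now();
  volatile void* sink;
  for (int i = 0; i < N; ++i) {
    sink = malloc(1);  // volatile sink keeps the call from being elided
  }
  (void)sink;
  auto t1 = std::chrono::steady_clock::now();

  // The same allocations, plus inserting each pointer into a set,
  // roughly as a mark phase would.
  std::unordered_set<void*> marked;
  for (int i = 0; i < N; ++i) {
    marked.insert(malloc(1));
  }
  auto t2 = std::chrono::steady_clock::now();

  std::chrono::duration<double> alloc_only = t1 - t0;
  std::chrono::duration<double> alloc_and_insert = t2 - t1;
  printf("malloc(1) only:      %.3f s\n", alloc_only.count());
  printf("malloc(1) + insert:  %.3f s\n", alloc_and_insert.count());
  return 0;
}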

I thought the issue was that the standard's requirements force implementers into separate chaining (a linked list per bucket) rather than open addressing with probing, but it seems even worse than that.

In any case, I think it's best to completely avoid hash tables in the garbage collector!

(Note that I never actually compiled or ran Skew — it's just compact code that was worth reading. It did seem that the C++ backend was more like a "fun hack" than production code. The advertised backends are JavaScript and C#.)

New Object Header

So how do we solve the "separate mark bits" problem portably, without a hash table? I think we'll simply assign a 24-bit ID to each object upon allocation, and use that as an index into a growable bitmap.
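
A sketch of that idea, with assumed names (the real code may differ):

#include <cstdint>
#include <vector>

// Mark bits live in a bitmap indexed by per-object IDs, not by
// addresses.  It grows with the number of objects allocated, not with
// the range of addresses malloc() can return, and it's separate memory,
// so marking doesn't dirty the pages that hold objects.
class MarkSet {
 public:
  void ReInit(int num_ids) {  // called at the start of marking
    bits_.assign((num_ids + 7) / 8, 0);
  }
  void Mark(int obj_id) {
    bits_[obj_id >> 3] |= (1 << (obj_id & 7));
  }
  bool IsMarked(int obj_id) {
    return bits_[obj_id >> 3] & (1 << (obj_id & 7));
  }

 private:
  std::vector<uint8_t> bits_;
};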

How do we avoid the "intrusive" linked list for sweeping? (An "intrusive" container is one that affects the items it contains.)

That was easy — we use a std::vector<Obj*> for the list of all objects, and compact it when we sweep. Most collectors use a singly- or doubly-linked list, but this seems just as good.
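
Sweeping is then a single pass over that vector (a sketch, assuming the MarkSet above and a hypothetical obj_id() accessor):

// Free unmarked objects, and compact the survivors in place.
void Heap::Sweep() {
  size_t n = 0;
  for (Obj* obj : all_objects_) {
    if (mark_set_.IsMarked(obj->obj_id())) {
      all_objects_[n++] = obj;  // keep the survivor
    } else {
      free(obj);  // plain free(), which ASAN instruments
    }
  }
  all_objects_.resize(n);  // all_objects_ is the std::vector<Obj*>
}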


So with this separate metadata for both marking and sweeping, I believe the new object header will remain a compact 64 bits:

  1. First 32 bits
  2. Second 32 bits

In the GC heap post, I said that Cheney could have less per-object overhead than mark-sweep, at least for our problem. But having separate mark bits and a separate vector apparently makes them equal.

If you're interested in working on this, let us know!

Future Milestones

The garbage collector isn't done, but I think the design issues are mostly sorted out. It works on substantial new tests.

Here are some logical milestones I see:

  1. Make all tests pass under ASAN with the new rooting policy.
  2. Make the parser run with the mark-sweep collector.
  3. Make the shell run batch programs.

If that all works, then:

  1. Make the shell run interactive programs.

Removing Dev Friction / Improving Tools

Writing a solid collector involves many tools!

As a result, I've probably spent more time working with Ninja and Docker than C++ lately!

Another goal is to improve dev tools, in response to feedback from Jesse and others. I've been updating the wiki: Where Contributors Have Problems.

We should make it as easy to contribute as possible.

Declarative Ninja / Mini Bazel

When I started oil-native, I used shell scripts to build it. They invoked our custom translators and the C++ toolchain.

This is the simplest approach, but it became slow and disorganized, especially with many test binaries and build variants (ASAN, UBSAN, etc.)

We now have a small declarative wrapper around Ninja:

$ wc -l build/ninja*.{py,sh}
  458 build/ninja_lib.py
  253 build/ninja_lib_test.py
  337 build/ninja_main.py
  274 build/ninja-rules-cpp.sh
  359 build/ninja-rules-py.sh
 1681 total

I'll write about it later, but it's like a mini-Bazel on top of Ninja. It has declarative rules like cc_binary(), cc_library(), and asdl_library(). It correctly generates implicit dependencies for code generation. It automatically finds Python's dynamic dependencies.

Soil CI: Unifying Shell and Containers?

To support the garbage collector work, I put more measurements in Soil, our shell- and container-based CI configuration.

The docker build became slow, so I learned how to use features like cache mounts: use RUN --mount in the Dockerfile. This mechanism lets different container builds share apt packages on the host, so you don't download them repeatedly.

This was effective, but I find it odd that there's a runtime flag docker --mount, and a build-time RUN --mount. That motivated a discussion thread.

Docker has been useful, and it works well with shell scripts. But I have some ideas to optimize image creation by relying more on a real shell language, and less on Docker! Optimizing the layer sizes would be our "north star" for this work. More on this later.

Unix Philosophy: Ninja vs. Docker

On the other hand, there's no need to migrate away from Ninja. It's composable in the Unix sense, while Docker isn't! It would be nice to write a blog post about this.

There are many more posts on #blog-ideas, and I should at least dump them into one or two "backlog" posts.

The sketches on #software-architecture I wrote in 2022 and 2021 ended up being the most popular posts on the blog!

Conclusion

Translating Oil to C++ is harder than I expected, and progress on it stalled in 2021. But working under the NLnet grant in 2022 has given new life to the project.

No promises, but it looks like we'll have a solid shell on the other side of this. We will likely have to crawl into a hole for another few months :-)

Please Sponsor

In the meantime, please sponsor Oil, and tell your friends and coworkers about it. The grant doesn't cover all expenses, and I just used sponsor donations to pay Melvin, who is doing excellent work.

Help Wanted

If you made it here, you should help us! We will always need experienced C++ and Python programmers.

Join us on Zulip, or send me an e-mail at andy@oilshell.org!

Appendix

Closed Issues

Here's a partial list of issues closed for the 0.12.7 release.

#1322 mycpp should generate field masks for classes (e.g. DirStack in examples/scoped_resource)
#1321 mycpp generates invalid StackRoot for integer loop index
#1313 changing mycpp/NINJA-steps.sh doesn't invalidate generated code
#1307 _tmp/ does not get cleaned on `build/clean.sh`
#1304 gBuf.dealloc() to avoid leaks
#1240 Minor code generators should work in GC mode
#1206 mycpp/examples use heap-allocated Tuple and crash
#1186 eliminate other allocators from all bindings / libs
#1176 generated ASDL code should support garbage collection
#1173 Unified Ninja build

Metrics for the 0.12.7 Release

These metrics help me keep track of the project. Let's compare this release with Oil 0.12.4 in late August.

Spec Tests

We've focused on the garbage collector, so the spec test numbers for OSH and Oil haven't moved as of 0.12.7. But Melvin just revived this work, so the next release will show progress.

Benchmarks

The parser is slower after recent work on the garbage collector. We're making the code correct first, and then fast.

The parser workload is very allocation-heavy, so it will serve as a good benchmark.

(The "NA" cachegrind irefs measurement seems like a bug which I should look into.)

Let's look at our small, synthetic benchmarks:

As expected, the mark and sweep collector is slower than the Cheney collector. But the Cheney collector wasn't practical, mainly due to the GC rooting problem.

Those pages show that we fixed some memory leaks:

In 0.12.7, every example but the worst one has a ratio less than 1.0, which means the translated C++ is indeed faster than Python. I think we should be able to fix that one as well.

Code Size

Even though we added field masks, the lines of generated code are down:

The binary size has been fluctuating, but it's reasonable now, and can be optimized when there's time: