
Oil 0.12.7 - Garbage Collector Problems

2022-10-20

For the last few months, we've been translating Oil to fast native code, with help from an NLnet grant.

It's time to come up for air and give a status update! This post describes recent work on the garbage collector, which I checkpointed in recent releases.

Oil version 0.12.7 - Source tarballs and documentation.

We've learned a lot, changed a lot of code, and have a new collector working. I aim to convey a sense of that by sharing two narratives. Each one revolves around a technical design problem:

  1. GC rooting: How does the app tell the garbage collector where to start tracing?
  2. What data structures should we use to make a portable, fork()-friendly collector?
Table of Contents
Background and Motivation
Recent Results and Changes
A Small Core in C++
Switching to Non-Moving Mark and Sweep
Testing Gold: ASAN with GC_EVERY_ALLOC
Problem: GC Rooting
The f(g(), h()) Problem
"Return Value Rooting" Policy
Problem: fork()-friendly Collector Data Structures
Can Mark Bits be Separate, Portably?
The Skew Language Has a Similar C++ Runtime
unordered_set<void*>::insert Slower Than malloc(1)
New Object Header
Future Milestones
Removing Dev Friction / Improving Tools
Declarative Ninja / Mini Bazel
Soil CI: Unifying Shell and Containers?
Unix Philosophy: Ninja vs. Docker
Conclusion
Please Sponsor
Help Wanted
Appendix
Closed Issues
Metrics for the 0.12.7 Release

Background and Motivation

For background, see Garbage-Collected Heap in C++, Shaped Like Typed Python.

We haven't finished our collector, but it passes many tests, and I believe we have good solutions to some tricky problems. I hope that experienced engineers will provide feedback, and help us with the code.

This post also mentions improvements to our dev tools, which make use of shell scripts, Ninja, and Docker.


If you appreciate this work, please sponsor Oil! We need the money, and are using it.

Recent Results and Changes

A Small Core in C++

Jesse and I finished unifying the "old STL" and "Cheney" C++ experiments, mentioned in August's release notes.

It looks like the shell will have less than 7,000 lines of hand-written C++ code, which is a great result! The Line Counts for Translation show the details.

This is our "unsafe core". The remaining ~80K lines of C++ are translated from high-level languages.

One reason that we're writing a garbage collector is to eliminate the possibility of memory safety bugs in the interpreter logic. You can't express a buffer overflow or use-after-free in Python! (Or in regular languages.)

Another reason is that the Oil language itself has garbage-collected data structures.

(Our number will increase when we write C++ wrappers for libraries like GNU readline, but not by a lot.)

Switching to Non-Moving Mark and Sweep

After unifying those experiments, we immediately switched to a non-moving mark-sweep collector. This is very different than the moving Cheney collector!

Why? When we revived the Cheney experiment, it was clear it never worked that well. Our stress tests exposed several crashes, which were mainly due to the "GC Rooting" problem.

I also realized that writing a moving collector with C++ as the metalanguage is quite different from writing one for Python or OSH. No reader has pointed this out, but I'm honestly not sure it's even possible in standard C++! For example, it's disorienting that you have to copy and root the this pointer:

FunctionThatMayAllocate();  // a collection may move *this*

MyMethod();              // invalid implicit this
this->MyMethod();        // invalid explicit this

auto self = this;        // make a copy that can be moved
StackRoots _r({&self});  // tell the GC about it
self->MyMethod();        // happens to work

(I recall a StackOverflow post where someone said they asked Bjarne Stroustrup to add a "re-address me" operator to C++ to properly support moving collection.)

So we're now implementing a more straightforward garbage collector. But this means we'll need to do more optimization.

The reasoning behind Cheney was sound: our workload is allocation-heavy, and a few weeks ago we indeed saw free() at the top of perf profiles. (The Cheney algorithm doesn't free() dead objects; its collection time is proportional to the size of the live set.)

Testing Gold: ASAN with GC_EVERY_ALLOC

Switching algorithms had an unexpected benefit: we learned that the Cheney approach is not quite compatible with AddressSanitizer (ASAN), a critical tool for finding memory errors.

When we turned on the mark-sweep collector, bugs were more reliably flagged by ASAN. It appears that we were "missing" bugs with Cheney!

In hindsight, this is obvious because ASAN works by instrumenting malloc(), but Cheney uses its own bump allocator. In contrast, mark-sweep uses the standard malloc() interface.
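
To make that concrete, here's a minimal sketch of Cheney-style bump allocation (the names are illustrative, not Oil's exact code):

char* alloc_ptr;  // next free byte in the current semispace
char* space_end;  // end of the current semispace

void* Allocate(size_t n) {
  if (alloc_ptr + n > space_end) {
    CheneyCollect();  // copy live objects into the other semispace,
                      // then reset alloc_ptr and space_end
  }
  void* p = alloc_ptr;  // no malloc(): ASAN can't see individual objects
  alloc_ptr += n;
  return p;
}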


So we now have a great testing strategy: ASAN combined with #ifdef GC_EVERY_ALLOC, which stress tests the GC.

I remember reading the following advice in the es shell paper: have a mode to collect on every allocation, and use guard pages to force segfaulting as early as possible. (es uses a Cheney collector, which was the inspiration for our first design.)

But it's one thing to read advice, and another to learn it the hard way!

So, after this experience, I want to pass along a modern version of that advice. If you're writing a (simple) garbage collector, make sure there's a build that uses the malloc() and free() interface, and test it with ASAN and GC_EVERY_ALLOC.

I call that the gold standard. It should find more bugs than creating your own guard pages. When the es shell paper was written in 1993, ASAN didn't exist, and I don't think Valgrind did either.
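
In other words, the stress-test hook can be a single #ifdef in the allocator. A minimal sketch, with assumed names like Heap::Allocate() and Collect(); Oil's real allocator does more:

void* Heap::Allocate(size_t num_bytes) {
#ifdef GC_EVERY_ALLOC
  Collect();  // any object that isn't properly rooted dies right here
#endif
  void* p = malloc(num_bytes);  // plain malloc(), so ASAN instruments it
  // ... record p so the sweep phase can find it ...
  return p;
}

Because dead objects go through free(), any later use of a stale pointer shows up in the ASAN report as a heap-use-after-free.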

Problem: GC Rooting

I find this to be an underexplained and tricky problem: How does the application tell the garbage collector about the root set of pointers to trace? In both the Cheney and mark-sweep algorithms, an object graph traversal starts from this set.

I've repeatedly referred to this 2013 blog post about precise rooting in Mozilla SpiderMonkey — in blog posts, in the job listing, and on video calls.

One takeaway is that they wrote a GCC plugin to find rooting problems. I don't think we need to do that, at least not yet.

But I wish there was a good survey of the strategies that different language implementations take. I've heard that Google V8 has changed their rooting policy over the years, but I don't know the details. (It's also a VM, which again differs from our approach of generating C++.)

Another way I think about the rooting problem: the value of the "conservative" or imprecise Boehm GC work is that it lets you punt on this problem!

Instead of having the application tell the garbage collector about pointers, the garbage collector scans for them. But we won't use this technique, because there's no way to do it in portable C or C++. That is, we don't want #ifdefs for x86, ARM, 32-bit, etc.
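
For reference, here's roughly what conservative scanning looks like, and where portability breaks down. LooksLikeHeapObject() and MarkObject() are hypothetical, and there's no standard way to obtain the stack bounds lo and hi:

void ScanStackConservatively(char* lo, char* hi) {
  for (char* p = lo; p + sizeof(void*) <= hi; p += sizeof(void*)) {
    void* candidate;
    memcpy(&candidate, p, sizeof candidate);  // any word might be a pointer
    if (LooksLikeHeapObject(candidate)) {
      MarkObject(candidate);  // may over-retain, never under-retain
    }
  }
}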

The f(g(), h()) Problem

The ASAN + GC_EVERY_ALLOC testing exposed a problem in our rooting logic!

With the Cheney collector, we had seen disorienting bugs when translating statements like obj.method(s[1:]). The problem is that:

  1. The slice operation allocates, and allocating may collect garbage.
  2. Collection moves obj while the statement is being evaluated!

We solved this problem with correct rooting, but there are many other cases to handle, and it felt error prone.

I thought that using mark-sweep would avoid most of this complexity. But there are similar problems with a non-moving collector. Consider this statement:

f(s[0], s[1:])

Or equivalently:

f(g(), h())  # does g() or h() return a live object?

The problem is that the result of g() is "temporary", and if h() allocates, then the collector may sweep that result away! This is the kind of bug you find with GC_EVERY_ALLOC, because the time window for the bug is small. It occurs while evaluating temporary expressions.

One way to fix this is with a bytecode- or SSA-like transformation like:

tmp1 = g()
tmp2 = h()
f(tmp1, tmp2)
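
In C++ terms, the named temporaries give the rooting machinery something to hold onto. A sketch, using the StackRoots helper shown earlier:

Str* tmp1 = g();
StackRoots _r({&tmp1});  // now the collector can trace tmp1
Str* tmp2 = h();         // may trigger a collection; tmp1 survives
f(tmp1, tmp2);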

However, doing this analysis correctly involves types and logic for each language construct. For example, the + and * operators in Python may or may not allocate depending on the type of the operands, so it's more work than it appears.

Right now, mycpp is halfway between an AST printer and a compiler, and this would require making it more like a real compiler. It would also make the generated code less readable.

"Return Value Rooting" Policy

I came up with a policy I call "return value rooting" that "inverts" the responsibility for rooting. In short, you tell the collector about a root object when you return it from a function:

Str* MyFunc(bool cond) {
  // Update a data structure that mirrors the C stack
  RootingScope _r;

  if (cond) {
    Str* s = OtherFunc();
    // Add to the root set; tell caller to remove it
    // when done
    gHeap.RootOnReturn(s);
    return s;
  }
  return nullptr;
}

It's not done yet, but it made all 5 crashes in our tests go away, which seems promising.
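
Here's a minimal sketch of one way RootingScope and RootOnReturn can fit together, using a flat root list plus per-scope base indices; the real data structures may differ:

#include <vector>

struct Obj;  // the GC object header (details omitted)

class Heap {
 public:
  void PushScope() { frame_base_.push_back(roots_.size()); }

  void PopScope() {
    roots_.resize(frame_base_.back());  // unroot this scope's values
    frame_base_.pop_back();
  }

  // Called just before returning obj: insert it BELOW this scope's
  // base, so it belongs to the caller's scope and survives our own
  // PopScope().  The caller's RootingScope removes it "when done".
  void RootOnReturn(Obj* obj) {
    size_t base = frame_base_.back();
    roots_.insert(roots_.begin() + base, obj);
    frame_base_.back() = base + 1;
  }

 private:
  std::vector<Obj*> roots_;         // the root set marking starts from
  std::vector<size_t> frame_base_;  // where each scope's roots begin
};

Heap gHeap;  // global heap, as in the generated code

// The RAII helper emitted at the top of translated functions.
class RootingScope {
 public:
  RootingScope() { gHeap.PushScope(); }
  ~RootingScope() { gHeap.PopScope(); }
};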

I generally try not to "invent" anything for Oil, instead relying on textbook knowledge and techniques lifted from production codebases. But again, I haven't found good resources on this engineering problem.

Let us know if you have worked on similar problems!

This policy isn't compatible with the Cheney algorithm, so we'll probably never go back to Cheney.

Problem: fork()-friendly Collector Data Structures

So that was a story about finding a good bug with testing. It shows why the GC rooting problem is important.

Here's another story, organized around the following problem: How do we design the garbage collector's data structures so that they don't defeat the copy-on-write of fork()?

This is important, because forking the interpreter is idiomatic shell!

There are two issues, both involving the common object header.

  1. Are the mark bits stored in the header of every object, or stored elsewhere?
  2. How does the sweep phase find all objects? Does the object header contain a next pointer?

If the object header is mutated during collection (when the objects themselves aren't), then every memory page that has an object can be dirtied. The resulting page faults would make collection slow and expensive.

The CRuby interpreter addressed this problem in 2012, and it also occurs with CPython's reference counts.

Can Mark Bits be Separate, Portably?

Recall that imprecise scanning can't be done in portable C++, which is one reason we avoid it. The portability constraint creates another problem: how do we store mark bits outside the object header?

The typical solution is a mark bitmap that's "parallel" to the heap. But that only works when you can assume a limited range of addresses that malloc() returns.

In portable code, you have to deal with the fact that malloc() can return almost any 64-bit address. Even at one bit per byte of address space, that's a 2^64 / 8 = 2^61 byte bitmap, which is clearly impractical!

Most garbage collectors include their own allocators, but we want to work with any allocator. (Remember how useful we found ASAN's instrumenting allocator! We also don't want to lose heap-profiling tools.)

The Skew Language Has a Similar C++ Runtime

I've looked at various GC implementations for years. I remembered that the Skew language has a nice design and small implementation, and that it was used at Figma. So I looked at the code again — in particular the C++ runtime.

It happens to be one of the most similar projects to our mycpp -- a statically-typed Python- or JavaScript-like language that's translated to C++. This generated code makes use of garbage-collected data structures. (Vala also looks pretty similar.)

Interestingly, they have support for fork() (which I still don't understand, as it still blocks the mutator process), and a separate mark set!

But how did they do that? By simply storing all the marked pointers in an std::unordered_set.

My intuition was that this is too slow for a garbage collector, but I know the author has done many performance-oriented projects like esbuild (not to mention Figma itself!).

unordered_set<void*>::insert Slower Than malloc(1)

So we tried unordered_set, since it was easy enough.

There's a long story here, including one of our own bugs, but the bottom line is that this blew up our benchmarks, and we learned the hard way why std::unordered_set is slow.

I wrote a benchmark which reported that unordered_set<void*>::insert() is slower than malloc(1), which was baffling!
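
For reference, a comparison along those lines only takes a few lines of code. This is a simplified sketch, not the exact benchmark:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <unordered_set>

int main() {
  const int N = 1000000;

  // Baseline: N tiny allocations.
  auto t0 = std::chrono::steady_clock::now();
  volatile void* sink;
  for (int i = 0; i < N; ++i) {
    sink = malloc(1);  // volatile sink keeps the call from being elided
  }
  (void)sink;
  auto t1 = std::chrono::steady_clock::now();

  // The same allocations, plus inserting each pointer into a set,
  // roughly as a mark phase would.
  std::unordered_set<void*> marked;
  for (int i = 0; i < N; ++i) {
    marked.insert(malloc(1));
  }
  auto t2 = std::chrono::steady_clock::now();

  std::chrono::duration<double> alloc_only = t1 - t0;
  std::chrono::duration<double> alloc_and_insert = t2 - t1;
  printf("malloc(1) only:      %.3f s\n", alloc_only.count());
  printf("malloc(1) + insert:  %.3f s\n", alloc_and_insert.count());
  return 0;
}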

I thought the issue was that the standard's requirements force implementers into separate chaining (a linked list per bucket) rather than open addressing with probing, but it seems even worse than that.

In any case, I think it's best to completely avoid hash tables in the garbage collector!

(Note that I never actually compiled or ran Skew — it's just compact code that was worth reading. It did seem that the C++ backend was more like a "fun hack" than production code. The advertised backends are JavaScript and C#.)

New Object Header

So how do we solve the "separate mark bits" problem portably, without a hash table? I think we'll simply assign a 24-bit ID to each object upon allocation, and use that as an index into a growable bitmap.
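
A sketch of that idea, with assumed names (the real code may differ):

#include <cstdint>
#include <vector>

// Mark bits live in a bitmap indexed by per-object IDs, not by
// addresses.  It grows with the number of objects allocated, not with
// the range of addresses malloc() can return, and it's separate memory,
// so marking doesn't dirty the pages that hold objects.
class MarkSet {
 public:
  void ReInit(int num_ids) {  // called at the start of marking
    bits_.assign((num_ids + 7) / 8, 0);
  }
  void Mark(int obj_id) {
    bits_[obj_id >> 3] |= (1 << (obj_id & 7));
  }
  bool IsMarked(int obj_id) {
    return bits_[obj_id >> 3] & (1 << (obj_id & 7));
  }

 private:
  std::vector<uint8_t> bits_;
};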

How do we avoid the "intrusive" linked list for sweeping? (An "intrusive" container is one that affects the items it contains.)

That was easy — we use a std::vector<Obj*> for the list of all objects, and compact it when we sweep. Most collectors use a singly- or doubly-linked list, but this seems just as good.
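
Sweeping is then a single pass over that vector (a sketch, assuming the MarkSet above and a hypothetical obj_id() accessor):

// Free unmarked objects, and compact the survivors in place.
void Heap::Sweep() {
  size_t n = 0;
  for (Obj* obj : all_objects_) {
    if (mark_set_.IsMarked(obj->obj_id())) {
      all_objects_[n++] = obj;  // keep the survivor
    } else {
      free(obj);  // plain free(), which ASAN instruments
    }
  }
  all_objects_.resize(n);  // all_objects_ is the std::vector<Obj*>
}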


So with this separate metadata for both marking and sweeping, I believe the new object header will remain a compact 64 bits:

  1. First 32 bits
  2. Second 32 bits

In the GC heap post, I said that Cheney could have less per-object overhead than mark-sweep, at least for our problem. But having separate mark bits and a separate vector apparently makes them equal.

If you're interested in working on this, let us know!

Future Milestones

The garbage collector isn't done, but I think the design issues are mostly sorted out. It works on substantial new tests.

Here are some logical milestones I see:

  1. Make all tests pass under ASAN with the new rooting policy.
  2. Make the parser run with the mark-sweep collector.
  3. Make the shell run batch programs.

If that all works, then:

  1. Make the shell run interactive programs.

Removing Dev Friction / Improving Tools

Writing a solid collector involves many tools!

As a result, I've probably spent more time working with Ninja and Docker than C++ lately!

Another goal is to improve dev tools, in response to feedback from Jesse and others. I've been updating the wiki: Where Contributors Have Problems.

We should make it as easy to contribute as possible.

Declarative Ninja / Mini Bazel

When I started oil-native, I used shell scripts to build it. They invoked our custom translators and the C++ toolchain.

This is the simplest approach, but it became slow and disorganized, especially with many test binaries and build variants (ASAN, UBSAN, etc.)

We now have a small declarative wrapper around Ninja:

$ wc -l build/ninja*.{py,sh}
  458 build/ninja_lib.py
  253 build/ninja_lib_test.py
  337 build/ninja_main.py
  274 build/ninja-rules-cpp.sh
  359 build/ninja-rules-py.sh
 1681 total

I'll write about it later, but it's like a mini-Bazel on top of Ninja. It has declarative rules like cc_binary(), cc_library(), and asdl_library(). It correctly generates implicit dependencies for code generation. It automatically finds Python's dynamic dependencies.

Soil CI: Unifying Shell and Containers?

To support the garbage collector work, I put more measurements in Soil, our shell- and container-based CI configuration.

The docker build became slow, so I learned how to use features like cache mounts: use RUN --mount in the Dockerfile. This mechanism lets different container builds share apt packages on the host, so you don't download them repeatedly.

This was effective, but I find it odd that there's a runtime flag docker --mount, and a build-time RUN --mount. That motivated a discussion thread.

Docker has been useful, and it works well with shell scripts. But I have some ideas to optimize image creation by relying more on a real shell language, and less on Docker! Optimizing the layer sizes would be our "north star" for this work. More on this later.

Unix Philosophy: Ninja vs. Docker

On the other hand, there's no need to migrate away from Ninja. It's composable in the Unix sense, while Docker isn't! It would be nice to write a blog post about this.

There are many more posts on #blog-ideas, and I should at least dump them into one or two "backlog" posts.

The sketches on #software-architecture I wrote in 2022 and 2021 ended up being the most popular posts on the blog!

Conclusion

Translating Oil to C++ is harder than I expected, and progress on it stalled in 2021. But working under the NLnet grant in 2022 has given new life to the project.

No promises, but it looks like we'll have a solid shell on the other side of this. We will likely have to crawl into a hole for another few months :-)

Please Sponsor

In the meantime, please sponsor Oil, and tell your friends and coworkers about it. The grant doesn't cover all expenses, and I just used sponsor donations to pay Melvin, who is doing excellent work.

Help Wanted

If you made it here, you should help us! We will always need experienced C++ and Python programmers.

Join us on Zulip, or send me an e-mail at andy@oilshell.org!

Appendix

Closed Issues

Here's a partial list of issues closed for the 0.12.7 release.

#1322 mycpp should generate field masks for classes (e.g. DirStack in examples/scoped_resource)
#1321 mycpp generates invalid StackRoot for integer loop index
#1313 changing mycpp/NINJA-steps.sh doesn't invalidate generated code
#1307 _tmp/ does not get cleaned on `build/clean.sh`
#1304 gBuf.dealloc() to avoid leaks
#1240 Minor code generators should work in GC mode
#1206 mycpp/examples use heap-allocated Tuple and crash
#1186 eliminate other allocators from all bindings / libs
#1176 generated ASDL code should support garbage collection
#1173 Unified Ninja build

Metrics for the 0.12.7 Release

These metrics help me keep track of the project. Let's compare this release with Oil 0.12.4 in late August.

Spec Tests

We've focused on the garbage collector, so the spec test numbers for OSH and Oil haven't moved as of 0.12.7. But Melvin just revived this work, so the next release will show progress.

Benchmarks

The parser is slower after recent work on the garbage collector. We're making the code correct first, and then fast.

The parser workload is very allocation-heavy, so it will serve as a good benchmark.

(The "NA" cachegrind irefs measurement seems like a bug which I should look into.)

Let's look at our small, synthetic benchmarks:

As expected, the mark and sweep collector is slower than the Cheney collector. But the Cheney collector wasn't practical, mainly due to the GC rooting problem.

Those pages show that we fixed some memory leaks:

In 0.12.7, every example but the worst one has a ratio less than 1.0, which means the translated C++ is indeed faster than Python. I think we should be able to fix that one as well.

Code Size

Even though we added field masks, the lines of generated code are down:

The binary size has been fluctuating, but it's reasonable now, and can be optimized when there's time: