I'm about halfway through replacing the core of the interpreter with classes generated by ASDL. While I finish that, here's a fun math problem that Reddit user Veedrac solved for me.
The problem is to replace a lookup table with a handful of arithmetic instructions. Instead of:

    def LookupKind(id):
        return KIND_TABLE[id]

we want something like this:

    def LookupKind(id):
        return 175 & id & ((id ^ 173) + 11)
The second implementation is faster because it doesn't require any memory accesses.
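Whatever formula you end up with, it's cheap to confirm that it really reproduces the table, since IDs fit in a byte. A minimal sketch of that check, where the table and the set of valid IDs are hypothetical stand-ins for the real data:

```python
def formula(id):
    # a candidate arithmetic replacement (illustrative constants)
    return 175 & id & ((id ^ 173) + 11)

def matches(f, table, valid_ids):
    # brute-force check: the formula must agree with the table on
    # every valid ID; unused byte values don't matter
    return all(f(i) == table[i] for i in valid_ids)
```

Because the domain is only 256 values, this exhaustive check takes microseconds.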
I'm not claiming that this optimization is important — I haven't measured it. For now, I'm only claiming that this problem is fun! I'll also follow it into short digressions on cache misses and branch mispredictions.
The problem at hand is motivated by the osh parser.
In this status update, I mentioned that the file core/id_kind.py knows about all constructs in shell. Specifically, it contains 221 token IDs grouped into 21 "kinds", represented by the enums Id and Kind.
Kinds are used to make coarse-grained parsing decisions. For example, each of these redirect operators has a distinct ID, but they're all of the same kind because they're parsed the same way:
    cat > output.txt
    cat < input.txt
    cat >> append.txt
    cat >| clobber.txt
Likewise, boolean operators like -eq and == have distinct IDs, but they're all of kind Kind.BoolBinary.
Initially, I used this function to avoid a lookup table:

    def LookupKind(id):
        return id >> 4  # kind is the top 4 bits of 8
An ID is a single byte, and I used metaprogramming to group them into 16 kinds. So IDs that share the top 4 bits are of the same kind, and the maximum number of IDs in any kind is 16.
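The packing can be sketched like this; MakeId is a hypothetical helper for illustration, not the actual metaprogramming in core/id_kind.py:

```python
def MakeId(kind, index):
    # pack the kind into the top 4 bits and the index within the
    # kind into the bottom 4, so an ID fits in one byte
    assert 0 <= kind < 16 and 0 <= index < 16
    return (kind << 4) | index

def LookupKind(id):
    return id >> 4  # recover the kind without touching memory
```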
Because we only test IDs for identity, the extra kind bits on top don't matter. We use Id enum values as unique identifiers, not as integers.
Unfortunately, this simple scheme broke after implementing shell arithmetic, because there are 43 different arithmetic tokens, which don't fit in 4 bits. There are also 21 different kinds, so the scheme broke in two ways.
If I were willing to use more than one byte to represent an ID, I could have used five bits for the kind (2^5 = 32 > 21) and six bits for the ID (2^6 = 64 > 43). But it's convenient for a number of reasons to use a single byte, so I switched to a simple lookup table (generated by metaprogramming, of course).
This is where programmers who are paid money to solve problems would have stopped. A lookup table is a perfectly readable and efficient mechanism, especially when it only has 221 entries.
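Generating such a table takes only a few lines if IDs are assigned consecutively within each kind — a sketch under that assumption, not the actual generator in core/id_kind.py:

```python
# Kind sizes: 21 kinds covering 221 IDs (listed later in the post).
KIND_SIZES = [43, 25, 20, 16, 15, 13, 12, 12, 11, 10, 10,
              8, 8, 4, 4, 3, 3, 1, 1, 1, 1]

# Assumption: IDs are numbered consecutively, kind by kind.
KIND_TABLE = []
for kind, size in enumerate(KIND_SIZES):
    KIND_TABLE.extend([kind] * size)

def LookupKind(id):
    return KIND_TABLE[id]
```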
But why not remove that memory access? It's in the inner loop of the parser, and it would be fun.
In 2002, at my first job at Electronic Arts, a coworker told me that "lookup tables are slow". He was referring to the fact that you can do hundreds of mathematical operations on registers in the time it takes to read a single byte from main memory (in other words, a cache miss). This was true on game consoles of that era like the PlayStation 2, and is still true on modern CPUs.
Around 2007, I attended a talk by Jeff Dean, which led to the popular "latency numbers every programmer should know", which have since been published and updated on the web in various places. The idea behind the talk was that most programmers — even the ones at Google back then — were having trouble designing low latency systems because they weren't paying attention to these different orders of magnitude.
According to this interactive chart (use the slider), main memory latency hasn't changed much since 2002.
The number of IDs under each of the 21 kinds is:
43 25 20 16 15 13 12 12 11 10 10 8 8 4 4 3 3 1 1 1 1
We can write this as a sequence of 21 conditionals that divide the space of integers into ranges:
    def LookupKind(id):
        if id < 43:
            return 0
        if id < 68:  # 43 + 25
            return 1
        if id < 88:  # 43 + 25 + 20
            return 2
        # ... 18 more conditionals
Once id is loaded into a register, this function doesn't need any more memory accesses.
But while we're taking things too far, let's get rid of branch mispredictions too. According to the latency numbers, a branch misprediction might cost you 5 nanoseconds, and you could have likely executed a dozen or more instructions in that time.
Jeff Dean also posed this question in his talk: How long does it take to quicksort 1 billion 4-byte integers?
It surprised me that he estimated it by simply counting the number of branch mispredictions, which is described in this post.
In the nine or so years since, I've never actually tried to estimate performance by counting branch mispredictions. But why not start now? If half of the 21 branches above mispredict, and each misprediction costs 5 ns, then the function takes roughly 50 ns on average.
Aside: The 2003 paper The Structure and Performance of Efficient Interpreters says that branch misprediction is a big factor in interpreter performance:
We find that [interpreters] perform an exceptionally high number of indirect branches: each of them has a higher proportion of indirect branches than any of the benchmarks used by researchers on indirect branch prediction. This may be obvious to many interpreter writers, but not necessarily to hardware designers, and even the most prominent paper on interpreter performance characteristics misses this point completely.
On the one hand, a shell is a unique interpreter because it starts processes, and starting a process is orders of magnitude more expensive than a branch misprediction. On the other hand, I want Oil to subsume the functionality of awk and make. Awk programs process a lot of data without starting processes, so an awk interpreter should be fast.
I suspected that you could express the relationship between IDs and kinds with a short mathematical function. And it would be nicer to do it without any conditionals, avoiding the possibility of branch misprediction.
Why did I suspect this? It's a function from 8 bits to 5 bits, and there are only so many of those. And while I've never worked with binary decision diagrams, I had a vague notion that you can express any boolean function concisely with them. A 5 bit output is the same as 5 boolean functions (predicates) on 8 bits.
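To make that last point concrete, here's a toy demonstration that any function from 8 bits to 5 bits decomposes into 5 one-bit predicates. The stand-in function f is arbitrary, not the real LookupKind:

```python
def f(id):
    return id >> 3  # arbitrary stand-in with a 5-bit output

def bit_predicate(k):
    # the k-th output bit of f, viewed as a boolean function of 8 bits
    return lambda id: (f(id) >> k) & 1

predicates = [bit_predicate(k) for k in range(5)]

def recombine(id):
    # the 5 predicates together carry the same information as f
    return sum(predicates[k](id) << k for k in range(5))
```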
But our problem is easier than representing arbitrary functions: we only care
about the structure of the function between 221 IDs and 21 kinds. The
concrete integer values are immaterial:
-eq could be 1 and == could be 2, or -eq could be 194 and == could be 31. Likewise,
Kind.BoolBinary can be
any number from 0 to 255 as long as 25 different IDs are mapped to it.
In addition, the function only needs to be "correct" on 221 out of 256 values. The other 35 inputs can evaluate to anything at all.
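One way to state this relaxed requirement in code is a checker that accepts any labeling where IDs of the same kind map to one value and different kinds stay distinct. This is a sketch with hypothetical names, not code from the project:

```python
def is_valid_labeling(id_to_kind, id_to_int, f):
    # id_to_kind: symbolic ID -> symbolic kind
    # id_to_int:  the candidate byte assigned to each ID
    # f:          the candidate kind function on bytes
    kind_to_value = {}
    for id_, kind in id_to_kind.items():
        value = f(id_to_int[id_])
        if kind_to_value.setdefault(kind, value) != value:
            return False  # two IDs of the same kind disagree
    values = list(kind_to_value.values())
    # distinct kinds must map to distinct values; the 35 unused
    # byte values are never consulted, so they're don't-cares
    return len(values) == len(set(values))
```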
To recap, we want this grouping of 221 IDs into 21 kinds:
43 25 20 16 15 13 12 12 11 10 10 8 8 4 4 3 3 1 1 1 1
Let's give this problem a fancy name: function-directed enum labeling.
The key idea is to work backwards from a function on an enum to the concrete values that represent the enum.
Specifically, how do we assign small integers to the enums Id and Kind in order to make the LookupKind function as short as possible? How short can such a function be? What mathematical operations and CPU instructions does it use?
These questions sat in the back of my mind for a few weeks. In the next post, I'll describe how Reddit user Veedrac solved this problem and present example code.