Status Update: Parser Correctness and Performance


In the last post, I said I would use Alpine Linux's abuild script as a motivating use case for OSH.

The good news is that OSH can now run abuild -h to show help. But the bad news is that, even before running it, OSH takes nearly two seconds to parse abuild. That's more than 100 times slower than shells like bash!

So in the last few weeks, I've been making a plan to fix this. It resulted in a lot of problems being "pushed on the stack", taking priority over running abuild.

I think I can pop the stack by early November, and then make a useful OSH 0.2 release.

This post is as much for me as it is for readers. There are many simultaneous threads running in the project, and I need to organize my thoughts. It's a bit like the post on build system observations: a laundry list of things I may write about later.

These topics deserve a more thorough treatment, but that will have to wait until well after the OSH release.

Table of Contents

Performance Tools and Parser Bottlenecks

(1) I researched the space of Python performance tools and tried a handful them, including a tool to make flame graphs and what could be called "flame traces" (pytracing).

I noticed that tracing tools written in Python have significant instrumentation error. That is, the sys.settrace() and sys.setprofile() hooks allow you to insert Python callbacks into the main interpreter loop, but this significantly slows down your program.

This is something to remember for OVM — can I do better?

(2) I now understand the relationship between flame graphs, tree maps, and the Graphviz output of Google's pprof tool. Brendan Gregg has a good post comparing Flame Graphs and TreeMaps.

(3) I ended up just using cProfile in the Python standard library. I should have tried the simple approach first!

It found three (!) performance bottlenecks, which I mentioned in this status update on lobste.rs:

  1. The AST is traversed at parse time to determine which here docs need to be read. The algorithm is correct but inefficient.
  2. The lexer needs to be translated to C code via re2c, instead of using a large set of interpreted regexes.
  3. The ASDL classes also need code generation. They currently perform dynamic type checks in the inner loop of the parser, which is slow.

Polishing the OSH Parser

The OSH parser has to be correct before it can be fast, so I reviewed and rewrote what I call the wild tests.

These tests torture the parser with shell scripts found in the wild. For example, see Parsing 183,000 Lines of Git's Shell Source Code.

The wild test harness now runs in parallel and produces nicer HTML:

I also expanded the corpus of testdata. The results showed that I should implement extended globs.

So I explored the extended glob syntax with spec test cases, and added them to the OSH parser. I'm happy that I can still make quick progress on the parser after a few months away for it. This indicates to me that its architecture is solid.

Project Infrastructure

To support my testing and performance tools, I need more tools. It's tools all the way down!

Serving Many Rarely Used Static Files

Notice that the URL above has a .wwz path component. This is simply a .zip file whose contents are served by a CGI script I wrote.

Why did I bother writing it, instead of just unzipping the files on Dreamhost? Well, there are tens of thousands of shell scripts and HTML files in the archive, and I want to publish the wild tests on every release. All this file system metadata is a sys admin headache not just for my web host, but for me. For example, it makes rsync take a long time.

The text files also compress by 10-20x.

I also have a side project in mind that will require linking to potentially millioins of source files in projects like CPython, Clang, and various Unix shells.

There's Now R Code in the Repository

The benchmarks/osh-parser.R script analyzes performance data. R is a language for statistics, data manipulation, and visualization.

I have a Wiki page called Oil and the R Language where I've been keeping notes on the relationship between R in shell. For example, they're both languages used interactively.

Unfortunately, this work is far in the future.

JSON Template in the Repository

In 2009 I designed a template language called JSON Template. It was inspired by Google's goog-ctemplate, which was used on the home page of Google, and EZT, which I used on Google Code.

I'm embarrassed that I haven't maintained a website or documentation since Google Code was turned down. There's an ancient fork of it here.

The Python script that produces the wild test report above uses JSON Template.

I should write An Evaluation of JSON Template. The language has flaws, but after 8 years I still like using it for certain tasks.

Planning a Faster Parser

In addition to measuring performance, I'm thinking about how to improve performance!

Regex Theory for the Lexer

To prepare for generating C code from the OSH lexer using re2c, I went back and studied Russ Cox's authoritative and exhaustive articles on regular expressions.

One thing that wasn't clear on skimming is that the virtual machine approach is the slower NFA approach. It has abstract "threads" to represent nondeterminism.

His re2 engine — not related to re2c! — also implements the faster DFA approach. That is, re2 has a multi-engine execution strategy.

On the other hand, re2c uses the DFA approach. It can do this because it doesn't need to implement all the bells and whistles of Perl-style regexps. It translates regular expressions into a series of switch and goto statements in C. Each jump location is a DFA state.

This definitely deserves a future blog post.

Parsing Tools / Metalanguages

I have a clear path to making the lexer faster, but the parser is a more complex problem.

I believe I can make it 10 times faster by addressing the three problems described above (here docs, lexer, and ASDL), but it might still be too slow after that.

I may have underestimated the fact that the OSH parser fundamentally does more work than other shell parsers:

To recap the problem:

So I want to create a better metalanguage. After all, I'm not just writing a bash parser, I'm writing Oil, which is supposed to subsume shell, awk, make, and maybe more.

Release Plans

So there are many different threads going now. However I want to pop the stack and focus on delivering something for the next release.

Right now, this looks like:

That will be enough for 0.2 release in early November. The 0.3 release can then include more extensive optimizations.

Conclusion and Blog Backlog

I diagnosed parser performance problems parser using tools, and enhanced its test suite.

I described some pieces of project infrastructure that I started using, and which I think will be important in the future.

And I also speculated on future plans for the lexer and parser.

If you'd like more details on any of these topics, please leave a comment.