Status Update: Parser Correctness and Performance

2017-10-24

In the last post, I said I would use Alpine Linux's abuild script as a motivating use case for OSH.

The good news is that OSH can now run abuild -h. The bad news is that, before running it, OSH takes nearly two seconds to parse abuild. That's more than 100 times slower than shells like bash!

The fundamental reason for this is that I wrote the parser in a high-level, abstract style, so I wouldn't end up with a mess of ad hoc if statements. I focused on correctness over performance.

Table of Contents

Intro

Performance Tools and Parser Bottlenecks

Polishing the OSH Parser

Project Infrastructure

Serving Many Rarely Used Static Files

There's Now R Code in the Repository

JSON Template in the Repository

Planning a Faster Parser

Regex Theory for the Lexer

Parsing Tools / Metalanguages

Release Plans

Summary

Intro

Of course, performance is important too. This post meanders over topics that are relevant to addressing this problem. It's a bit like the post on build system observations: a laundry list of things I may write in detail about later.

I also make a digression into project infrastructure. There are many simultaneous threads running in the project, and writing things down helps me organize my thoughts and prioritize.

Performance Tools and Parser Bottlenecks

(1) I researched the space of Python performance tools and tried a handful them, including a tool to make flame graphs and what could be called "flame traces" (pytracing).

I noticed that tracing tools written in Python have significant instrumentation error. That is, the sys.settrace() and sys.setprofile() hooks allow you to insert Python callbacks into the main interpreter loop, but this significantly slows down your program.

This is something to remember for OVM — can I do better?

(2) I now understand the relationship between flame graphs, tree maps, and the Graphviz output of Google's pprof tool. Brendan Gregg has a good post comparing Flame Graphs and TreeMaps.

(3) I ended up just using cProfile in the Python standard library. I should have tried the simple approach first! Flame graphs seem more appropriate for programs which are already somewhat optimized.

cProfile found three (!) performance bottlenecks, which I mentioned in this status update on lobste.rs:

The AST is traversed at parse time to determine which here docs need to be read. This algorithm is correct but inefficient.
The lexer should be translated to C code via re2c, instead of using a large set of interpreted regexes.
The ASDL classes also need code generation. They currently perform dynamic type checks in the inner loop of the parser, which is slow.

(4) Using these tools has also made me appreciate the utility of unifying stack traces between C and languages like Python and JavaScript. Python and C have fairly good integration, but when you start using performance code coverage tools, the gap between them appears wider.

Joyent has done some interesting work on observability that helps bridge this gap.

Polishing the OSH Parser

However, the OSH parser has to be correct before it can be fast, so I reviewed and rewrote what I call the wild tests.

These tests torture the parser with shell scripts found in the wild. For example, see Parsing 183,000 Lines of Git's Shell Source Code.

The wild test harness now runs in parallel and produces nicer HTML:

http://www.oilshell.org/git-branch/master/21552818/wild.wwz/index.html

I also expanded the corpus of testdata. The results showed that I should implement extended globs.

So I explored the extended glob syntax with spec test cases, and added them to the OSH parser. I'm happy that I can still make quick progress on the parser after a few months away for it. This indicates to me that its architecture is solid.

Possible blog post: A Critique of the Extended Glob Syntax.

Project Infrastructure

To support my testing and performance tools, I need more tools. It's tools all the way down!

Serving Many Rarely Used Static Files

Notice that the URL above has a .wwz path component. This is simply a .zip file whose contents are served by a CGI script I wrote.

Why did I bother writing it, instead of just unzipping the files on Dreamhost? Well, there are tens of thousands of shell scripts and HTML files in the archive, and I want to publish the wild tests on every release. All this file system metadata is a sys admin headache not just for my web host, but for me.

For example, it makes rsync take a long time. The text files also compress by 10-20x.

I also have a side project in mind that will require linking to potentially millions of source files in projects like CPython, Clang, and various Unix shells.

There's Now R Code in the Repository

The benchmarks/osh-parser.R script analyzes performance data. R is a language for statistics, data manipulation, and visualization.

I have a Wiki page called Oil and the R Language where I've been keeping notes on the relationship between R and shell. For example, both languages are used interactively!

Unfortunately, this work is far in the future.

JSON Template in the Repository

In 2009 I designed a template language called JSON Template. It was inspired by:

Google's goog-ctemplate, which was used on the home page of Google, and
EZT, which I used on Google Code.

It was designed to be "polyglot" in the sense that I implemented it in Python and JavaScript simultaneously. I felt that templates should be portable between the client and server, which wasn't a common at the time.

I'm embarrassed that I haven't maintained a website or documentation since Google Code was turned down. There's an ancient fork of it here.

I used JSON Template in the Python script that produces the wild test report above.

After more than 8 years, I should write An Evaluation of JSON Template. The language has flaws, but I still like using it for certain tasks. It will influence the design of double-quoted strings in the Oil language.

Planning a Faster Parser

In addition to measuring performance and digressions into tools, I'm thinking about how to improve performance.

Regex Theory for the Lexer

To prepare for generating a lexer in C using re2c, I went back and studied Russ Cox's authoritative and exhaustive articles on regular expressions.

One thing that wasn't clear on skimming is that the "virtual machine approach" is the NFA approach. It has abstract "threads" to represent nondeterminism.

His re2 engine — not related to re2c! — also implements the faster DFA approach. That is, re2 has a multi-engine execution strategy.

On the other hand, re2c just uses the DFA approach. It can do this because it doesn't need to implement all the bells and whistles of Perl-style regexps. It translates regular expressions into a series of switch and goto statements in C. Each jump location is a DFA state.

This definitely deserves a future blog post.

Parsing Tools / Metalanguages

I have a clear path to making the lexer faster, but the parser is a more complex problem.

I believe I can make it 10 times faster by addressing the three problems listed above (here docs, lexer, and ASDL), but it might still be too slow after that.

I may have underestimated the fact that the OSH parser fundamentally does more work than other shell parsers:

It parses everything up front. That is, code within delimiters like (echo subshell), $(echo "command" "sub"), ${var:-default}, etc. is not skipped over, which means it finds more errors in your scripts. It also means the parser is more consistent and less complex overall.
In order to support translating shell to Oil, it produces a more detailed representation than other shells do.

To recap the problem:

Yacc and ANTLR aren't appropriate for describing the OSH language.

(These tools are usually called parser generators, but I prefer the term metalanguage. They're languages for describing languages.)

The state of the art is to write a parser by hand in C++.

But I'm not excited about translating 5,000+ lines of Python to C++ by hand. I would rather generate code.

So I want to create a better metalanguage. After all, I'm not just writing a bash parser, I'm writing Oil, which is supposed to subsume shell, awk, make, and maybe more.

Related: Guy Steele at ClojureConf on Computer Science Metanotation. This talk is aimed at people who read papers on programming languages, rather than programmers. He talks a bit about describing language syntax, but it's more about describing type systems. However, it's still interesting and is rich with history.

Release Plans

So there are many different threads going now. However I want to pop the stack and focus on delivering something for the next release.

Right now, this looks like:

Publish tests showing the parser is correct. This is basically done; see the preview above.
Publish parser benchmarks. Also almost done.
Fix one performance problem with the parser, and measure the improvement.

That will be enough for a 0.2 release in early November. The 0.3 release can then include more extensive optimizations.

Summary

I diagnosed parser performance problems using tools, and enhanced its test suite.

I described project infrastructure that I started using: wwz, R, and JSON Template.

I also speculated on future plans for the lexer and parser.

If you'd like more details on any of these topics, please leave a comment.