Parsing 183,000 Lines of Git's Shell Source Code


Oil can now parse 183,000 lines of git source code, with just two remaining issues, explained below. See the source files and serialized ASTs.

Git is a good test case for a shell. It's big: 183,000 lines is more than three times the size of the next biggest project I've found.

Though much of this is test code, it hits corner cases in the language. For example, its usage of here docs taught me that they're processed in a post-order traversal of the AST.
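For instance, two here docs can start on the same line, and their bodies appear afterward, in the order the operators appeared. A small demo (not taken from git):

```shell
# Both '<<' operators are parsed first; the bodies are then read
# after the newline, in the order the operators appeared.
cat <<ONE; cat <<TWO
first body
ONE
second body
TWO
```

This prints "first body" and then "second body": the whole command line is parsed before any here doc body is consumed, which is why the bodies are naturally gathered in a traversal after the line's AST is built.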

It also taught me that comments are not lexical constructs in shell, as they are in most languages. Recognizing them correctly depends on knowing what words and operators are, which the lexer alone doesn't handle. Consider:

$ echo foo:#not-comment
foo:#not-comment
$ echo foo;#comment
foo

The colon is not an operator, so it and the next # are part of a single word. In contrast, the semi-colon is an operator. This means the next token must begin a new word, and in this case it's a comment.
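Whitespace before the # matters for the same reason. These two commands differ only by a space:

```shell
echo foo#bar    # '#' is in the middle of a word: prints foo#bar
echo foo #bar   # the space ends the word, so '#' starts a comment: prints foo
```

A pure lexer sees the same characters in both cases; only the word/operator context distinguishes them.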

The two remaining issues are:

Unicode encodings other than UTF-8

The parser uses Python 3 strings, so it has no problem with code in the UTF-8 encoding. But git has two test scripts with non-UTF-8 Unicode (t4201 and t7831).

I'm torn on the issue of supporting other encodings, and the best way to resolve it is to examine real-world usage.

On the one hand, compatibility with bash is good. On the other hand, if there are only two files with non-UTF-8 encodings out of dozens of projects totaling a million lines of code, then I'll be tempted to follow Go's approach of supporting only UTF-8.

This approach is simpler, uses less memory, and should reduce portability problems stemming from libc. (Bash uses various libc functions to support multiple encodings and locales.)

A Subtlety with Static Parsing

The second remaining issue that git uncovered relates to static parsing. If you look at line 10 of git-gui.sh, you'll see something odd:

exec wish "$argv0" -- "$@"

set appvers {@@GITGUI_VERSION@@}
set copyright [string map [list (c) \u00a9] {
# ...

Wait, that's not shell anymore; it's Tcl code. Even when non-interactive, the shell behaves like a REPL, parsing and executing each top-level command in sequence. When it hits exec, the shell process is replaced by wish, so the lines after it are never parsed as shell.
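You can reproduce the effect with a tiny script (the file name here is hypothetical); everything after exec can be arbitrary text:

```shell
# demo_exec.sh (hypothetical name): the shell never reads past 'exec',
# because the process is replaced by the exec'd program.
exec echo 'replaced the shell'
this line is never parsed ( | not shell at all
```

Running sh demo_exec.sh prints "replaced the shell" and exits; the garbage on the last line is never read, just as the Tcl code in git-gui.sh is never read.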

Here's a similar example with exit:

$ echo "This script runs, despite bad syntax after exit"
> exit
> | invalid |
> ; syntax ;
This script runs, despite bad syntax after exit

In contrast, Python parses everything up front. Given a script demo/py_parse_before_run.py containing:

import sys
| invalid |
; syntax ;

running it fails before any line executes:

$ demo/py_parse_before_run.py
  File "demo/py_parse_before_run.py", line 4
    | invalid |
SyntaxError: invalid syntax

Oil can't parse everything up front without breaking scripts like git-gui.sh, but that's not a problem in practice, because there's a difference between "executing" a function definition and calling the function. When the shell "executes" a definition, it just parses the body and stores it in a lookup table:

$ sh <<'EOF'
> echo before
> f() { echo 'not called, but parsed and stored'; }
> echo after
> EOF
before
after

So, as long as all code is in functions, and there is a single top-level main "$@" call, Oil will statically parse all of the code.
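In other words, a statically parseable script looks like this sketch (function names are made up):

```shell
# All code lives in functions; defining them parses the bodies
# without running them.
deploy() {
  echo "deploying $1"
}

main() {
  deploy "$1"
}

# The only top-level command. By the time the shell reaches this
# line, the entire file has already been parsed.
main "$@"
```

Because the only top-level command is the final call, the shell must parse every function body before anything interesting runs.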


In past status updates, I've shown Oil parsing several other projects.

The parser is converging pretty quickly. Git is the eighth project I've discussed, but there are now many more projects that it handles correctly.

The nice thing is that there have been no architectural changes for a while; it's all been polish around the edges. I will write about this architecture in detail later, but a core observation is that it's four interleaved parsers for four sublanguages.

I've also consciously avoided any "clever" features of Python, so this parser can be ported almost line-for-line to many languages, including C++.

I want to set it up with a few more blog posts, but otherwise there's no reason not to release the code so people can play with it. I expect that to be this month.