Oil Now Parses My /etc/init.d Directory

2016-10-14

I was going to talk about lexical states today, but I reached a milestone last night that I'll share instead.

This directory contains the source code and abstract syntax trees for the /etc/init.d directory on my Ubuntu machine, totalling 5,456 lines of code.

These scripts tickled a few corner cases that Aboriginal Linux did not. They were all small and localized fixes -- cases where my parser didn't match the grammar. Writing parsers by hand is pretty error prone, but unfortunately it's still the best way to parse "real" languages.

Here are some of the bugs I encountered and fixed:

Line continuation with \ wasn't handled within a double-quoted string. This simple omission was easily fixed.

I messed up parsing a for loop without in. This was an easy fix too.

$ set -- 1 2 3  # set "$@"
> for i         # implicitly loop over "$@"
> do 
>   echo $i
> done
1
2
3

Here's some related shell trivia. What is the output of this program?

$ set -- 1 2 3  # set "$@"
> for i in
> do 
>   echo $i
> done
... no output ...

It's exactly like the last program, except there's an in after i. It turns out it has no output. It iterates over nothing rather than "$@".

Shell is a language full of subtleties like this.

The last case item is allowed to omit the ;; terminator. From /etc/init.d/acpid:

case "$1" in
  start)
    load_modules || true
    ;;
  *)
    log_success_msg "Usage: /etc/init.d/acpid ..."
    exit 1
    ### it's valid to omit ;; here
esac

This POSIX grammar expresses this as follows:

case_clause      : Case WORD linebreak in linebreak case_list    Esac
                 | Case WORD linebreak in linebreak case_list_ns Esac
                 | Case WORD linebreak in linebreak              Esac
                 ;
case_list_ns     : case_list case_item_ns
                 |           case_item_ns
                 ;
case_list        : case_list case_item
                 |           case_item
                 ;
case_item_ns     :     pattern ')' linebreak
                 |     pattern ')' compound_list
                 | '(' pattern ')' linebreak
                 | '(' pattern ')' compound_list
                 ;
case_item        :     pattern ')' linebreak     DSEMI linebreak
                 |     pattern ')' compound_list DSEMI linebreak
                 | '(' pattern ')' linebreak     DSEMI linebreak
                 | '(' pattern ')' compound_list DSEMI linebreak

Notice that there are pairs of case_list{,_ns} and case_item{,_ns} rules.

I find recursive context-free grammars like this hard to read. I ported the grammar to ANTLR, which allows using Kleene operators like *, +, and ? to express the rule iteratively:

case_item   : '('? pattern ('|' pattern)* ')' newline_ok
              command_term? trailer? ;

case_list   : case_item (DSEMI newline_ok case_item)* DSEMI?
              newline_ok;

case_clause : Case WORD newline_ok in newline_ok case_list? Esac ;

(I renamed linebreak to newline_ok, and compound_list to command_term.)

Incidentally, this is where ANTLR has problems with the grammar. My code works as follows:

If you see an esac statement where you expected a ;;, end the case item and the whole case statement.

It's trivial to express imperatively, but it it's awkward to express in the formalism of a context-free grammar. ANTLR emits a bunch of warnings around these case statement rules which I don't completely understand.

As far as I understand, ANTLR 3 tries to generate a faster lookahead algorithm for LL(k) grammars (even though it does LL(*) in general).

LL(1) is the easiest to parse: you only have to look at the current token to decide what rule to "call". I think that shell is basically LL(2), and shouldn't require too much lookahead, but I don't think ANTLR agrees.

I want to look into this issue more, but it turns out not to matter much for implementing the parser.

I think that an abstraction like PEGs will work better for expressing the shell language. It seems clear that the grammar was derived from imperative code to begin with. The shell language was not designed with a context-free grammar. So it's bit like translating a Chinese translation of an Emerson poem back to English.

PEGs are more imperative, in that they have ordered choice and a "cut" operator. Despite that, they are principled, and I believe it's easier to reason about their performance (whether you use Packrat parsing or backtracking).

Next Steps

This was a bit of a digression into minutiae, but hopefully it gives you a taste of what implementing a shell parser is like.

Tomorrow I will return to the theme of parsing shell in a single pass.