I was going to talk about lexical states today, but I reached a milestone last night that I'll share instead.
This directory contains the source code and abstract syntax trees for the
/etc/init.d directory on my Ubuntu machine, totalling
5,456 lines of code.
These scripts tickled a few corner cases that Aboriginal Linux did not. They were all small and localized fixes -- cases where my parser didn't match the grammar. Writing parsers by hand is pretty error prone, but unfortunately it's still the best way to parse "real" languages.
Here are some of the bugs I encountered and fixed:
1) Line continuation with
\ wasn't handled within a double-quoted string.
This simple omission was easily fixed.
2) I messed up parsing a for loop without
in. This was an easy fix too.
$ set -- 1 2 3 # set "$@" > for i # implicitly loop over "$@" > do > echo $i > done 1 2 3
Here's some related shell trivia. What is the output of this program?
$ set -- 1 2 3 # set "$@" > for i in > do > echo $i > done ... no output ...
It's exactly like the last program, except there's an
i. It turns
out it has no output. It iterates over nothing rather than
Shell is a language full of subtleties like this.
3) The last case item is allowed to omit the
;; terminator. From
case "$1" in start) load_modules || true ;; *) log_success_msg "Usage: /etc/init.d/acpid ..." exit 1 ### it's valid to omit ;; here esac
This POSIX grammar expresses this as follows:
case_clause : Case WORD linebreak in linebreak case_list Esac | Case WORD linebreak in linebreak case_list_ns Esac | Case WORD linebreak in linebreak Esac ; case_list_ns : case_list case_item_ns | case_item_ns ; case_list : case_list case_item | case_item ; case_item_ns : pattern ')' linebreak | pattern ')' compound_list | '(' pattern ')' linebreak | '(' pattern ')' compound_list ; case_item : pattern ')' linebreak DSEMI linebreak | pattern ')' compound_list DSEMI linebreak | '(' pattern ')' linebreak DSEMI linebreak | '(' pattern ')' compound_list DSEMI linebreak
Notice that there are pairs of
I find recursive context-free grammars like this hard to read. I ported the
grammmar to ANTLR, which allows using Kleene operators like
to express the rule iteratively:
case_item : '('? pattern ('|' pattern)* ')' newline_ok command_term? trailer? ; case_list : case_item (DSEMI newline_ok case_item)* DSEMI? newline_ok; case_clause : Case WORD newline_ok in newline_ok case_list? Esac ;
Incidentally, this is where ANTLR has problems with the grammar. My code works as follows:
If you see an
esacstatement where you expected a
;;, end the case item and the whole case statement.
It's trivial to express imperatively, but it it's awkward to express in the formalism of a context-free grammar. ANTLR emits a bunch of warnings around these case statement rules which I don't completely understand.
As far as I understand, ANTLR 3 tries to generate a faster lookahead algorithm
LL(k) grammars (even though it does
LL(*) in general).
LL(1) is the easiest to parse: you only have to look at the current token to
decide what rule to "call". I think that shell is basically
shouldn't require too much lookahead, but I don't think ANTLR agrees.
I want to look into this issue more, but it turns out not to matter much for implementing the parser.
I think that an abstraction like PEGs will work better for expressing the shell language. It seems clear that the grammar was derived from imperative code to begin with. The shell language was not designed with a context-free grammar. So it's bit like translating a Chinese translation of an Emerson poem back to English.
PEGs are more imperative, in that they have ordered choice and a "cut" operator. Despite that, they are principled, and I believe it's easier to reason about their performance (whether you use Packrat parsing or backtracking).
This was a bit of a digression into minutiae, but hopefully it gives you a taste of what implementing a shell parser is like.
Tomorrow I will return to the theme of parsing shell in a single pass.