Correction: How to Parse Here Documents

2017-11-28

Last year, in How to Parse Here Documents, I showed some curious examples of here documents.

Then I described this algorithm for handling them:

When parsing, save the here terminators you encounter in the AST. After a newline, walk the AST, reading the lines associated with the terminators using a post-order traversal.

OSH used this algorithm for over a year, and it successfully parsed millions of lines of shell scripts.

However, it's not technically accurate! I was able to construct synthetic examples of here docs that every shell could parse, but that OSH could not.

The good news is that the new algorithm is both simpler and faster. The last post has measurements. I was overcomplicating things.

The New Algorithm

Create an array of pending here docs before parsing.
Whenever you see something like <<EOF, append a HereDocRedirect(terminator='EOF') node to the array. This can happen multiple times per line.
Whenever you see a newline, fill the HereDocRedirect() nodes in order, and then clear the array.

In retrospect, this is the obvious algorithm. I was being too clever.

My reasoning was that, in shell, you format any program so it fits on a single line, like this:

while cat <<EOF1; read line; do echo "  -> read '$line'"; cat <<EOF2; done <<EOF3 ...

But what about the here docs? They happen to be in the order of a post-order traversal.

However, I didn't specify which subtree needs to be traversed. And that is what turned out to be wrong about the algorithm.

Leave a comment if this doesn't make sense. I think that understanding the odd examples and the new algorithm is enough. The old algorithm is no longer interesting.