How to Parse Here Documents

2016-10-17

Update: I issued a correction on 2017/11/28. OSH no longer uses the algorithm described in this post, but the examples are still useful.

The shell scripts in the git source tree use all the bells and whistles of the here doc syntax:

q_to_nul <<-\EOF | test-line-buffer >actual &&
skip 2
EOF

First, the E in the here terminator is escaped -- this is equivalent to <<'EOF' or <<"EOF", and it means that $vars aren't expanded in the body. That is, the body is treated like a single-quoted string rather than a double-quoted string.
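A minimal sketch to check the quoting rule in any POSIX-style shell. The variable `name` and the function names `expanded` and `literal` are made up for this illustration; they don't come from the git scripts.

```shell
# Hypothetical variable, used only for this demonstration.
name=world

# Unquoted terminator: the body behaves like a double-quoted string.
expanded() {
  cat <<EOF
hello $name
EOF
}

# Escaped terminator: the body behaves like a single-quoted string.
literal() {
  cat <<\EOF
hello $name
EOF
}

expanded   # prints: hello world
literal    # prints: hello $name
```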

Second, they use the <<- operator, which strips leading tabs, so as not to mess up the code's indentation.
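Tabs are easy to lose when copying text, so this sketch builds a tiny script with printf (where \t is a tab) and pipes it to sh instead of embedding literal tabs. The function names `strip_demo` and `keep_demo` are hypothetical, and it assumes an `sh` on your PATH.

```shell
# <<- removes leading tabs from the body lines and from the terminator line.
strip_demo() {
  printf 'cat <<-EOF\n\tindented body\n\tEOF\n' | sh
}

# Plain << keeps the tab, and the terminator must start in column one.
keep_demo() {
  printf 'cat <<EOF\n\tindented body\nEOF\n' | sh
}

strip_demo   # prints the line without its leading tab
keep_demo    # prints the line with its leading tab intact
```

Only tabs are stripped; leading spaces are left alone.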

Third, there is a pipe and command after the here terminator. How do we parse that? It looks weird to me, and Vim's syntax highlighting doesn't understand it.

Some months ago, I was reading through the POSIX spec for here docs (section 2.7.4) and I noticed a similarly odd example:

cat <<eof1; cat <<eof2
Hi,
eof1
Helene.
eof2

OUTPUT:
Hi,
Helene.

The spec has a single sentence mentioning the possibility of multiple here docs on a single line, but it doesn't go into detail.

Well, today the answer dawned on me while I was trying to get the git scripts to parse. First, I created this example:

if cat <<EOF1; then echo THEN; cat <<EOF2; fi
here doc 1
EOF1
here doc 2
EOF2
echo

OUTPUT:
here doc 1
THEN
here doc 2

This is even weirder -- there's a here doc in the if condition, and another one in the if body. All shells I tested run this correctly (bash, dash, mksh).

A useful thought experiment: can you take any shell script and write it on a single line? Yes, just replace all newlines with semicolons. (C has this property too, but Python doesn't.)

Except we can't put the here docs themselves on that one line. Their bodies and terminators still have to go on their own lines, concatenated in order after the line of commands.
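To make that concrete, here is a small sketch comparing the usual multi-line form with the one-line form. The function names `multi_line` and `one_line` are invented for this comparison.

```shell
# The usual layout: each command is followed directly by its here doc.
multi_line() {
  cat <<EOF1
one
EOF1
  cat <<EOF2
two
EOF2
}

# Commands joined with a semicolon: the here doc bodies are
# concatenated, in order, after the command line.
one_line() {
  cat <<EOF1; cat <<EOF2
one
EOF1
two
EOF2
}

multi_line
one_line
```

Both functions should print the same two lines, "one" and then "two".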

Compound commands can have their own here docs:

while read line; do echo "-> $line"; done <<EOF
1
2
EOF
 
OUTPUT:
-> 1
-> 2

That is, the here doc is for the entire while loop, and not for an individual statement. Putting these two things together, I realized that the rule is:

When parsing, save the here terminators you encounter in the AST (while_node, if_node, simple_command_node, etc.). After a newline, walk the AST, reading the lines associated with the terminators using a post-order traversal.

That is, the here docs for parent nodes come after their children. Siblings go in the expected order.

Here's an example:

while cat <<EOF1; read line; do echo "  -> read '$line'"; cat <<EOF2; done <<EOF3
condition here doc
EOF1
  body here doc
EOF2
while loop here doc 1
while loop here doc 2
EOF3

OUTPUT:
condition here doc
  -> read 'while loop here doc 1'
  body here doc
condition here doc
  -> read 'while loop here doc 2'
  body here doc
condition here doc

So there are three here docs -- one for the condition, one for the body, and one for the while loop. They go in that order: child, child, parent.
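As far as I can tell, the same ordering holds for deeper nesting. This sketch uses brace groups nested two levels deep; the function name `nested` and the terminators INNER, MIDDLE, and OUTER are made up for illustration. If the bodies were expected in any other order, the terminators wouldn't line up with their bodies.

```shell
nested() {
  # cat <<INNER reads its own here doc. The second cat inherits stdin
  # from the inner group (MIDDLE), and the third from the outer group (OUTER).
  { { cat <<INNER; cat; } <<MIDDLE; cat; } <<OUTER
inner body
INNER
middle body
MIDDLE
outer body
OUTER
}

nested
```

The bodies appear innermost first: grandchild, child, parent. That is also the left-to-right order of the << operators on the command line.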

All shells I tested respect this subtle behavior, but I've never seen it documented, let alone used in a real shell script. (I didn't actually see it in git.)

In my shell design, I'm thinking about separating here docs into two concepts: multiline strings, and here strings (<<<), both of which already exist in sh. More on that later.