Shells Use Temp Files to Implement Here Documents

2016-10-18

Consider this program:

$ for i in $(seq 10); do
>   cat <<EOF
>   here doc $i
> EOF
> done
  here doc 1
  here doc 2
  here doc 3
  here doc 4
  here doc 5
  here doc 6
  here doc 7
  here doc 8
  here doc 9
  here doc 10

To run it, bash does this 10 times:

fork() a child process
open() a temp file for write
write() the expanded here doc to it. The contents depend on the iteration.
close() it
open() it again read-only
unlink() it, so it will be deleted once the last descriptor to it is closed
dup2(4, 0) the resulting descriptor so that the new process has the temp file as stdin
execve() the /bin/cat program
cat reads the file from disk, writing its contents to stdout
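
The same sequence can be imitated by hand from a shell script, which makes the unlink() trick easier to see. This is only a rough sketch, not what bash literally runs: emulate_here_doc is a made-up name, mktemp stands in for the shell's internal temp file naming, descriptor 4 is chosen just to mirror the steps above, and everything happens in the current shell rather than in a forked child.

```shell
# Hypothetical helper: feed cat a "here doc" by hand, using the same
# temp-file steps listed above.
emulate_here_doc() {
  tmp=$(mktemp) || return 1                # create a temp file
  printf '  here doc %s\n' "$1" > "$tmp"   # write the expanded text, then close
  exec 4< "$tmp"                           # open it again read-only as fd 4
  rm -f "$tmp"                             # unlink; the data survives while fd 4 is open
  cat <&4                                  # give cat the open file as stdin
  exec 4<&-                                # close fd 4, so the file can be freed
}

for i in 1 2 3; do
  emulate_here_doc "$i"
done
```

By the time cat runs, the file no longer has a name in /tmp, yet it is still readable through the open descriptor. That is why no cleanup step is needed afterwards.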

strace listing:

strace -ff -e open,close,unlink,read,write,execve,dup2 \
   -- $sh ./here_doc_disk.sh

Process 4090 attached
[pid  4090] open("/tmp/sh-thd-865008962", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3
[pid  4090] write(4, "    here doc 1\n", 15) = 15
[pid  4090] close(4)                    = 0
[pid  4090] open("/tmp/sh-thd-865008962", O_RDONLY) = 4
[pid  4090] close(3)                    = 0
[pid  4090] unlink("/tmp/sh-thd-865008962") = 0
[pid  4090] dup2(4, 0)                  = 0
[pid  4090] close(4)                    = 0
[pid  4090] execve("/bin/cat", ["cat"], [/* 68 vars */]) = 0
...
[pid  4090] read(0, "    here doc 1\n", 65536) = 15
[pid  4090] write(1, "    here doc 1\n", 15    here doc 1
) = 15

zsh and mksh do the same thing, which surprised me.

dash does something more expected and elegant: it starts cat with one end of a pipe() as stdin, rather than a temp file. A string larger than the pipe's buffer will cause write() to block until the reader drains it, but I think that just requires a little extra care in the implementation.
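
The pipe approach can be approximated at the script level with an ordinary pipeline. This is a sketch under assumptions, not dash's actual code: here_string_via_pipe is a hypothetical name, and the 100000-byte payload is an arbitrary size picked to be larger than the 64 KiB pipe buffer that Linux uses by default.

```shell
# Hypothetical helper: hand a string to a command's stdin through a pipe,
# with no temp file involved.
here_string_via_pipe() {
  text=$1
  shift
  # The writer (printf) runs in its own process, so when the text is bigger
  # than the pipe buffer it just blocks until the reader catches up.
  printf '%s\n' "$text" | "$@"
}

here_string_via_pipe "here string 1" cat

# A payload larger than a typical pipe buffer still arrives intact.
big=$(head -c 100000 /dev/zero | tr '\0' 'x')
here_string_via_pipe "$big" wc -c
```

Putting the writer in a separate process is one simple form of the "extra care": the shell never blocks on its own write() while the reader has yet to start.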

Curiously, the "here string" construct in bash also uses temp files:

cat <<< "here string $i"

I don't see a reason to use temp files in either case, other than the fact that in ancient computing history people didn't want to hold entire "files" in memory. Compilers used to work a line at a time too.

Based on parsing real shell scripts, here docs are generally tiny, so I don't expect string size to be an issue.

I think my shell language will only have the here string operator, and implement it with pipes like dash does. From a programmer's perspective, here docs are just a weird kind of multiline string. These two cat invocations have the same output:

s="\
one
two"

cat <<< "$s"

cat <<EOF
one
two
EOF

That is, shell strings are already multiline. I guess I should allow some kind of line-based delimiter in the string literal syntax, because the \ is a bit ugly. But this special syntax for multiline strings doesn't need to be coupled with the notion of piping to stdin.

I showed in the last post that here doc syntax is unintuitive in other ways: quoted delimiters to eliminate expansion; the <<- variant to strip leading tabs; and the post-order traversal rule for multiple here docs on a line.
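
For reference, here is a small sketch of two of those behaviors. The function name here_doc_quirks and the delimiter names are made up for illustration, and the <<- variant is left out because it depends on literal tab characters that are easy to lose when copying text.

```shell
# Hypothetical demo function; the names are for illustration only.
here_doc_quirks() {
  name=world

  # Unquoted delimiter: the body is expanded like a double-quoted string.
  cat <<EOF
hello $name
EOF

  # Quoted delimiter: the body is taken literally, with no expansion.
  cat <<'EOF'
hello $name
EOF

  # Two here docs on one line: the bodies follow the line, in the same
  # order as the operators appear on it.
  cat <<FIRST; cat <<SECOND
one
FIRST
two
SECOND
}

here_doc_quirks
```

The first cat prints "hello world", the second prints "hello $name" unchanged, and the last line prints "one" followed by "two".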

oil implements all of this in its sh parser. But now that I fully understand the traditional syntax, I want to design something nicer for the oil language, as well as improve its implementation.