Shells Use Temp Files to Implement Here Documents


Consider this program:

$ for i in $(seq 10); do
>   cat <<EOF
>   here doc $i
> EOF
> done
  here doc 1
  here doc 2
  here doc 3
  here doc 4
  here doc 5
  here doc 6
  here doc 7
  here doc 8
  here doc 9
  here doc 10

To run it, bash does this 10 times:

  1. fork() a child process
  2. open() a temp file for write
  3. write() the expanded here doc to it. The contents depend on the iteration.
  4. close() it
  5. open() it again read-only
  6. unlink(), so it will be deleted after it's closed
  7. dup2(4, 0) the resulting descriptor so that the new process has the temp file as stdin
  8. execve() /bin/cat
  9. cat reads the file from disk, writing its contents to stdout
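
The same sequence can be approximated by hand in a shell script. This is only a rough sketch of the idea, not what bash does internally; the function name emulate_here_doc and the choice of descriptor 3 are mine, for illustration.

```shell
# Emulate the temp-file strategy: write, reopen read-only, unlink, read.
emulate_here_doc() {
  tmp=$(mktemp) || return 1      # step 2: create a temp file
  printf '%s\n' "$1" > "$tmp"    # steps 3-4: write the body, then close
  exec 3< "$tmp"                 # step 5: open it again read-only as fd 3
  rm -f "$tmp"                   # step 6: unlink; the data lives until fd 3 closes
  cat <&3                        # steps 7-9: cat runs with the file as stdin
  exec 3<&-                      # close fd 3, which frees the unlinked file
}

for i in 1 2 3; do
  emulate_here_doc "  here doc $i"
done
```

The unlink-while-open trick is what lets the file disappear from /tmp immediately while cat can still read it.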

strace listing:

strace -ff -e open,close,unlink,read,write,execve,dup2 \
   -- $sh ./here_doc_disk.sh

Process 4090 attached
[pid  4090] open("/tmp/sh-thd-865008962", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3
[pid  4090] write(3, "    here doc 1\n", 15) = 15
[pid  4090] close(4)                    = 0
[pid  4090] open("/tmp/sh-thd-865008962", O_RDONLY) = 4
[pid  4090] close(3)                    = 0
[pid  4090] unlink("/tmp/sh-thd-865008962") = 0
[pid  4090] dup2(4, 0)                  = 0
[pid  4090] close(4)                    = 0
[pid  4090] execve("/bin/cat", ["cat"], [/* 68 vars */]) = 0
[pid  4090] read(0, "    here doc 1\n", 65536) = 15
[pid  4090] write(1, "    here doc 1\n", 15    here doc 1
) = 15

zsh and mksh do the same thing, which surprised me.

dash does something more expected and elegant, which is to start cat with one end of a pipe() as stdin, rather than a temp file. Strings longer than the pipe's buffer (64 KiB by default on Linux) will cause write() to block until the reader drains the pipe, but I think that just requires a little extra care in the implementation.
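
The pipe idea is easy to try as an ordinary pipeline. This is not dash's implementation, just the same idea with a separate writer process, which is one way to make blocking harmless. The 100000-byte size is an arbitrary value chosen to exceed a 64 KiB pipe buffer.

```shell
i=1
# printf writes into the pipe while cat reads from it; no temp file is created.
printf '  here doc %s\n' "$i" | cat

# 100000 bytes is more than a 64 KiB pipe holds at once. The writer blocks
# until the reader drains the pipe, and every byte still arrives.
big=$(head -c 100000 /dev/zero | tr '\0' 'x')
n=$(printf '%s' "$big" | wc -c | tr -d ' ')
echo "$n"
```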

Curiously, the "here string" construct in bash also uses temp files:

cat <<< "here string $i"

I don't see a reason to use temp files in either case, other than the fact that in ancient computing history people didn't want to hold entire "files" in memory. Compilers used to work a line at a time too.

Based on parsing real shell scripts, here docs are generally tiny, so I don't expect string size to be an issue.

I think my shell language will only have the here string operator, and implement it with pipes like dash does. From a programmer's perspective, here docs are just a weird kind of multiline string. These two cat invocations have the same output:


cat <<< "$s"

cat <<EOF

That is, shell strings are already multiline. I guess I should allow some kind of line-based delimiter in the string literal syntax, because the \ is a bit ugly. But this special syntax for multiline strings doesn't need to be coupled with the notion of piping to stdin.
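
Here is a small bash check of that equivalence. The variable names and the one/two contents are made up for illustration, and the <<< operator needs bash, zsh, or mksh rather than a plain POSIX sh.

```shell
# A multiline string held in an ordinary variable.
s='one
two'

# Feed it to cat as a here string.
from_here_string=$(cat <<< "$s")

# Feed the same text to cat as a here doc.
from_here_doc=$(cat <<EOF
one
two
EOF
)

if [ "$from_here_string" = "$from_here_doc" ]; then
  echo same
fi
```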

I showed in the last post that here doc syntax is unintuitive in other ways: quoted delimiters to eliminate expansion; the <<- variant to strip leading tabs; and the post-order traversal rule for multiple here docs on a line.
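
As a quick reminder of two of those rules, here is a minimal sketch. The function name demo_here_docs is made up, and it exists only to group the cases. The <<- variant is left out because it depends on literal tab characters, which are easy to lose when copying text.

```shell
# demo_here_docs is a hypothetical name; it just groups the cases.
demo_here_docs() {
  name=world

  # Unquoted delimiter: $name is expanded in the body.
  cat <<EOF
hello $name
EOF

  # Quoted delimiter: the body is passed through literally.
  cat <<'EOF'
hello $name
EOF

  # Two here docs on one line: the bodies follow in operator order.
  cat <<A; cat <<B
first
A
second
B
}

demo_here_docs
```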

oil implements all of this in its sh parser. But now that I fully understand the traditional syntax, I want to design something nicer for the oil language, as well as improve its implementation.