
Nine Meanings of Backslash in Shell

2018-02-27

notes:

In my post on CommonMark, I mentioned that markdown.pl sometimes leaves MD5 checksums in its HTML output, which is apparently caused by an incorrect attempt to avoid double-escaping.

This reminded me of complex escaping issues I've dealt with in Oil. In particular, the backslash is the most problematic character to deal with.

Before writing a shell, I didn't fully appreciate that it's not a single construct:

  1. There are at least three distinct evaluation stages where the backslash is significant.
  2. It has multiple meanings in each of those evaluation stages.

I show examples below. I don't expect that every reader wants all the gory details, but laying it out like this does shed light on the unusual evaluation model of shell.

At the end, I'll discuss lessons for the Oil language.

Table of Contents
Recap
Backslashes at Parse Time
Unquoted Words
Double Quotes
Single Quotes
C-Style Strings
Backslashes at Runtime
In Globs
In Regexes
Bash is Confused About Regexes
Backslashes at Command Time
In Builtins (echo, printf, read)
In External Commands
Backslashes For Substitution Rather than Escaping
Lessons for the Oil Language
Conclusion

Recap

My previous posts on shell-the-bad-parts only talked a

Backslashes at Parse Time

Unquoted Words

In an unquoted context, $ starts a substitution, and * is a glob character. To make them literals, escape them with \. Note that this means that \ also needs to be escaped as \\.

$ echo \\ \$ \* \t t
\ $ * t t

Note that \t is an invalid escape, and becomes t.
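To see exactly what the parser hands to a command, it can help to avoid echo and print each argument separately. The following is a minimal sketch; `show_args` is a hypothetical helper, not something from the shell itself.

```shell
# Hypothetical helper: print each argument on its own line, wrapped in <>.
# printf's %s does not interpret backslashes in its arguments.
show_args() { printf '<%s>\n' "$@"; }

# Each backslash removes the special meaning of the next character,
# and is itself removed by the parser.
unquoted=$(show_args \\ \$ \* \t)
printf '%s\n' "$unquoted"
```

Each of the four words arrives as a single character, so the backslash in \t is simply dropped.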

THREE kinds of quoting:

Regexes have a different lexer mode, though!

Double Quotes

In double quotes, $, ", and ` need to be escaped, so \\ is again a single backslash. An invalid escape like \t is left alone, backslash included; to get that output reliably, it should be written \\t.

$ echo "\\ \$ \" \t \\t"  # \t should be written \\t
\ $ " \t \t
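The same check can be done without echo, which rules out any backslash processing by the builtin. This is a sketch using a hypothetical `show_arg` helper.

```shell
# Hypothetical helper: print one argument verbatim, wrapped in <>.
show_arg() { printf '<%s>\n' "$1"; }

# Inside double quotes, the backslash is only removed before $ " \ and `.
# Before any other character (here t), it is kept.
double_quoted=$(show_arg "\\ \$ \" \t \\t")
printf '%s\n' "$double_quoted"
```

Both \t and \\t end up as the two characters \t in the argument, which matches the echo output above.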

Single Quotes

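In single quotes, backslashes aren't special to the parser:

$ echo '\t \\'
\t \\

The same command prints something different in other shells. The quoting rules are the same everywhere; the difference is that the echo builtin in dash, mksh, and zsh interprets backslash escapes by default: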

$ for sh in bash dash mksh zsh; do
>   $sh <<'EOF'
> echo $0: '\t \\'
> EOF
> done
bash: \t \\
dash: 	 \
mksh: 	 \
zsh: 	 \
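One way to separate what the parser does from what echo does is to print the string with printf and %s, which leaves its arguments alone. A minimal sketch, with `raw` as a hypothetical helper name:

```shell
# Hypothetical helper: print the argument exactly as the parser produced it.
raw() { printf '%s\n' "$1"; }

single_quoted=$(raw '\t \\')
printf '%s\n' "$single_quoted"

# The string is 5 characters long: backslash, t, space, backslash, backslash.
length=${#single_quoted}
```

Run this way, the single-quoted string should come out unchanged, which suggests the differing output above comes from echo rather than from single-quote parsing.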

C-Style Strings

In C-style strings like $'\n', the backslash is special. An invalid escape \Z is left alone; it should be written \\Z.

$ echo $'\\ a\tb \Z \\Z'  # \Z should be written \\Z
\ a	b \Z \Z

With regard to invalid \ escapes, I noticed that Julia 0.7 will disallow them. This stricter rule is inconsistent with C, Python, and JavaScript, but it's consistent with the philosophy of the Oil language.

That is, we issue more errors at parse time. echo \Z and echo "\Z" are both invalid, yet they give different results!

Backslashes at Runtime

In Globs

libc's Mini-Languages

Consider these three shell statements:

ls */'*notglob*'   # 1. single-quoted filename
ls */"*notglob*"   # 2. double-quoted filename
ls */\*notglob\*   # 3. backslash-quoted glob characters

They all do the same thing: invoke ls with files named *notglob* in any child directory. For example, dir1/*notglob* and dir2/*notglob*.

And they all "compile" down to the same glob call:

// 4. glob() takes a string pattern, which respects backslash-escaping
glob("*/\*notglob\*")

Note that the first * is not escaped, but the second and third one are.

Here's what I didn't understand before implementing a shell: although lines 3 and 4 appear identical, they're really puns. The backslash means two different things:

  1. Remove the significance of special shell characters.
  2. Remove the significance of special glob() characters. In this case, $ would not need to be escaped with \, because it's not a glob metacharacter.
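The two layers can be observed in a scratch directory. This is a sketch, assuming mktemp -d is available; the directory and file names are made up for illustration.

```shell
# Scratch directory with one literally named file and one decoy.
tmp=$(mktemp -d)
mkdir "$tmp/dir1"
touch "$tmp/dir1/*notglob*" "$tmp/dir1/xnotglobx"

# Escaped: the first * is a wildcard, the other two are literal characters.
literal=$(cd "$tmp" && printf '%s\n' */\*notglob\*)

# Unescaped: all three * are wildcards, so both files match.
wild=$(cd "$tmp" && printf '%s\n' */*notglob*)
wild_count=$(printf '%s\n' "$wild" | wc -l | tr -d ' ')

printf '%s\n' "$literal"
rm -rf "$tmp"
```

Only the escaped form singles out the file whose name really contains asterisks.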

In Regexes

Similarly, libc's regcomp() function is used to implement bash's [[ str =~ $regex ]] construct. It parses and compiles a POSIX regular expression.

It also accepts backslash escapes, but for a different set of characters than glob(). For example, ( needs to be escaped in a regular expression, but not in a glob.
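The difference in metacharacters can be sketched with two standard tools. The assumption here is that a case pattern stands in for glob-style matching and grep -E stands in for POSIX extended regular expressions; neither is necessarily the code path bash itself uses.

```shell
s='f(x)'

# Glob-style matching: ( is an ordinary character, so f(* matches f(x).
glob_pat='f(*'
case $s in
  $glob_pat) glob_result=match ;;
  *)         glob_result=nomatch ;;
esac

# ERE: unescaped parentheses form a group, so 'f(x)' looks for the text "fx".
if printf '%s\n' "$s" | grep -Eq 'f(x)'; then
  ere_plain=match
else
  ere_plain=nomatch
fi

# Escaping the parentheses makes them literal.
if printf '%s\n' "$s" | grep -Eq 'f\(x\)'; then
  ere_escaped=match
else
  ere_escaped=nomatch
fi

printf '%s %s %s\n' "$glob_result" "$ere_plain" "$ere_escaped"
```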

Bash is Confused About Regexes

In conclusion, all this backslash escaping makes implementing a correct shell pretty confusing. In fact, bash itself is confused, and has this text in its manual:

Storing the regular expression in a shell variable is often a useful way to avoid problems with quoting characters that are special to the shell. It is sometimes difficult to specify a regular expression literally without using quotes, or to keep track of the quoting used by regular expressions while paying attention to the shell’s quote removal.

In other words, rather than:

[[ $s =~ (a|b) ]]

use:

pat='(a|b)'
[[ $s =~ $pat ]]  # $pat shouldn't be quoted

because you'll avoid bash's quirky rules about (, ), and | in a regular expression context.
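The note that $pat shouldn't be quoted matters because quoting changes the meaning of the match. The sketch below assumes bash 3.2 or later with default compatibility settings; it will not run in a plain POSIX sh.

```shell
# Assumes bash >= 3.2: a quoted right-hand side of =~ is matched literally.
pat='(a|b)'

# Unquoted: $pat is treated as a regular expression.
if [[ a =~ $pat ]]; then unquoted=match; else unquoted=nomatch; fi

# Quoted: $pat is treated as the literal string (a|b), which "a" lacks.
if [[ a =~ "$pat" ]]; then quoted=match; else quoted=nomatch; fi

# Quoted against a subject that really contains the literal text.
if [[ '(a|b)' =~ "$pat" ]]; then quoted_literal=match; else quoted_literal=nomatch; fi

printf '%s %s %s\n' "$unquoted" "$quoted" "$quoted_literal"
```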

My translation of the manual:

"The lexer mode for regex literals is messed up. You can avoid this part of the language by using the lexer mode for assignments instead. Then you don't have to reason about multiple levels of escaping, which are arguably broken in this case."

Backslashes at "Command Time"

In Builtins (echo, printf, read)

Builtins also interpret backslashes in their arguments:

$ echo -e '\\ \t \Z \\Z'  # \Z should be written \\Z
\ 	 \Z \Z
$ printf '\\ \t \Z \\Z'  # \Z should be written \\Z
\ 	 \Z \Z
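With printf, whether a backslash is interpreted depends on where it appears. The following sketch compares the format string with %s and %b arguments; the variable names are only for illustration.

```shell
tab=$(printf '\t')

# In the format string, printf itself turns \t into a tab.
in_format=$(printf 'a\tb')

# In a %s argument, the text is printed verbatim.
as_arg=$(printf '%s' 'a\tb')

# In a %b argument, escapes are interpreted again.
with_b=$(printf '%b' 'a\tb')

printf '%s\n' "$as_arg"
```

Passing untrusted or unknown data through %s rather than through the format string avoids one of the layers of backslash processing.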

With the read builtin, backslashes in input data escape $IFS characters (e.g. space, tab, newline), as well as the backslash itself:

$ read x y <<< 'x\ x y\\y'; echo "-$x-$y-"
-x x-y\y-
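The -r flag turns this backslash processing off. A minimal sketch comparing the two modes on the same input, using a pipe instead of a here-string and hypothetical helper names:

```shell
# Hypothetical helpers: read two fields from stdin and print them.
read_plain() { read x y;    printf '%s\n' "-$x-$y-"; }
read_raw()   { read -r x y; printf '%s\n' "-$x-$y-"; }

# Without -r, "\ " is an escaped space and "\\" is a single backslash.
plain=$(printf '%s\n' 'x\ x y\\y' | read_plain)

# With -r, backslashes are ordinary characters, so the first space splits.
raw=$(printf '%s\n' 'x\ x y\\y' | read_raw)

printf '%s\n' "$plain" "$raw"
```

With -r the first field is x\ and the second field takes the rest of the line, backslashes included.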

If you want to write unreadable code, you can compose backslashes at parse time and at runtime:

$ echo -e $'\\\\'
\
$ printf $(echo -e $'\\\\\\\\')
\

In External Commands

These commands aren't part of the shell proper, but they understand backslashes too:

$ echo $'a\tb c' | awk -F $'\t' '{print $2}';  # shell creates a tab
> echo $'a\tb c' | awk -F '\t' '{print $2}'  # awk creates a tab
b c
b c
$ echo $'a\tb c' | xargs -n 1 -d $'\t' echo .;  # shell creates a tab
> echo $'a\tb c' | xargs -n 1 -d '\t' echo .  # xargs creates a tab
. a
. b c

. a
. b c
$ find css js -maxdepth 0 -printf $'%s %p\n';  # shell creates newline
> find css js -maxdepth 0 -printf '%s %p\n'  # find creates newline
4096 css
4096 js
4096 css
4096 js

However, GNU cut apparently doesn't understand backslashes:

$ echo $'a\tb c' | cut -d $'\t' -f 2  # shell creates a tab
> echo $'a\tb c' | cut -d '\t' -f 2  # cut doesn't understand tab
b c
cut: the delimiter must be a single character
Try 'cut --help' for more information.
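When a tool doesn't interpret backslashes itself and $'\t' isn't available, one option is to let printf create the tab and pass it in a variable. This is a sketch, not the only way to do it; `tab` and `field` are illustrative names.

```shell
# printf interprets \t in its format string; the result is a real tab.
tab=$(printf '\t')

# cut receives a single tab character as its delimiter.
field=$(printf 'a\tb c\n' | cut -d "$tab" -f 2)
printf '%s\n' "$field"
```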

Backslashes For Substitution Rather than Escaping

Lessons for the Oil Language

Conclusion

Did I miss any?