
Eight Different Meanings of Backslash in Shell


I'm not sure exactly what problem has with double escaping, but it reminded me of complex escaping issues I've dealt with in Oil. There are two problems:

  1. The backslash means different things in different contexts.
  2. These contexts are intertwined in various ways.

The examples below attempt to disentangle the cases.

Table of Contents
Backslashes at Parse Time
Backslashes at Runtime
Backslashes in Internal Mini-Languages
Bash is Confused
Lessons for the Oil Language

Backslashes at Parse Time

In an unquoted context, $ starts a substitution, and * is a glob character. For them to be literal characters, they need to be escaped with \. This means \ should be escaped as \\ too.

$ echo \\ \$ \* \t  # Invalid escape \t should be written t
\ $ * t

In double quotes, $ and " need to be escaped, so \\ is again a single backslash. An invalid escape \t is left alone; it should be written \\t.

$ echo "\\ \$ \" \t \\t"  # \t should be written \\t
\ $ " \t \t

In single quotes, backslash isn't special:

$ echo '\\ \t'
\\ \t

In C-style strings like $'\n', the backslash is special. An invalid escape \Z is left alone; it should be written \\Z.

$ echo $'\\ a\tb \Z \\Z'  # \Z should be written \\Z
\ a	b \Z \Z
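To make the four parse-time contexts easier to compare, here is a small bash sketch that spells one string four ways. The variable names are made up for illustration.

```shell
# One 8-character string, a\b $x *, spelled in each parse-time context.
unquoted=a\\b\ \$x\ \*      # backslash escapes the backslash, spaces, $ and *
double="a\\b \$x *"         # only the backslash and $ need escaping here
single='a\b $x *'           # nothing is special, so nothing is escaped
cstyle=$'a\\b $x *'         # the backslash is special, but $ and * are not

for s in "$unquoted" "$double" "$single" "$cstyle"; do
  printf '%s\n' "$s"        # %s prints the value without reinterpreting it
done
```

All four lines of output should be identical, which is the point: the backslashes are consumed by the parser and never reach the program.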

With regard to invalid \ escapes, I noticed that Julia 0.7 will disallow them. This stricter rule would be consistent with the philosophy of the Oil language as well. We want to give the user more error messages.

Backslashes at Runtime

Builtins like echo -e and printf also interpret backslashes, this time in their arguments at runtime:

$ echo -e '\\ \t \Z \\Z'  # \Z should be written \\Z
\ 	 \Z \Z
$ printf '\\ \t \Z \\Z'  # \Z should be written \\Z
\ 	 \Z \Z
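One practical consequence is that the same data is treated differently depending on which printf slot it lands in. A minimal sketch, with hypothetical variable names:

```shell
data='a\tb'                       # 4 characters: a, backslash, t, b

as_format=$(printf "$data")       # data used as the format: \t becomes a tab
as_s=$(printf '%s' "$data")       # %s copies the argument verbatim
as_b=$(printf '%b' "$data")       # %b interprets escapes in the argument

# Print the three lengths to show which forms collapsed \t into one character.
printf '%s %s %s\n' "${#as_format}" "${#as_s}" "${#as_b}"
```

This is why passing untrusted data as the format string is usually a mistake: printf '%s' is the form that leaves backslashes alone.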

With the read builtin, backslashes in input data escape $IFS characters (e.g. space, tab, newline), as well as the backslash itself:

$ read x y <<< 'x\ x y\\y'; echo "-$x-$y-"
-x x-y\y-
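For contrast, here is a sketch of the same input read with the -r flag, which turns off backslash processing in read. The variable names are just for illustration.

```shell
input='x\ x y\\y'

read x y <<< "$input"         # backslashes escape the space and the backslash
read -r rx ry <<< "$input"    # -r: backslashes are ordinary characters

printf '%s\n' "-$x-$y-" "-$rx-$ry-"
```

With -r the first field ends at the first space, so the backslash stays in the data and the rest of the line lands in the second variable.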

If you want to write unreadable code, you can compose backslashes at parse time and at runtime:

$ echo -e $'\\\\'
$ printf $(echo -e $'\\\\\\\\')
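To see how the layers in the second command peel off, this sketch runs each stage separately and records the string length after it. The intermediate variable names are hypothetical.

```shell
parsed=$'\\\\\\\\'              # parse time: 8 backslashes typed, 4 remain
echoed=$(echo -e "$parsed")     # echo -e at runtime: 4 become 2
final=$(printf "$echoed")       # printf format string: 2 become 1

# Lengths after each stage.
printf '%s %s %s\n' "${#parsed}" "${#echoed}" "${#final}"
```

Each stage halves the number of backslashes, so eight typed characters end up as a single backslash.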

Backslashes in Internal Mini-Languages

Consider these three shell statements:

ls */'*notglob*'   # 1. single-quoted filename
ls */"*notglob*"   # 2. double-quoted filename
ls */\*notglob\*   # 3. backslash-quoted glob characters

They all do the same thing: invoke ls with files named *notglob* in any child directory. For example, dir1/*notglob* and dir2/*notglob*.
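Here is a sketch that checks this claim in a scratch directory. The directory layout and the extra file xnotglobx are assumptions added for the demo; the three patterns are the ones from the statements above.

```shell
tmp=$(mktemp -d)                       # scratch directory for the demo
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/*notglob*" "$tmp/dir2/*notglob*" "$tmp/dir1/xnotglobx"

# Each spelling is expanded in a subshell whose working directory is $tmp.
single=$(cd "$tmp" && printf '%s\n' */'*notglob*')
double=$(cd "$tmp" && printf '%s\n' */"*notglob*")
escaped=$(cd "$tmp" && printf '%s\n' */\*notglob\*)

# With nothing quoted, every * is a wildcard, so xnotglobx matches too.
unescaped_count=$(cd "$tmp" && set -- */*notglob* && echo "$#")

printf '%s\n' "$single"
rm -rf "$tmp"
```

The three quoted spellings expand to the same two paths, while the fully unquoted pattern also picks up the decoy file.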

And they all "compile" down to the same glob call:

// 4. glob() takes a string pattern, which respects backslash-escaping
glob("*/\*notglob\*")

Note that the first * is not escaped, but the second and third one are.

Here's what I didn't understand before implementing a shell: although lines 3 and 4 appear identical, they're really puns. The backslash means two different things:

  1. Remove the significance of special shell characters.
  2. Remove the significance of special glob() characters. In this case, $ would not need to be escaped with \, because it's not a glob metacharacter.
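The second meaning is easier to see when the backslashes live in data rather than in source code. In this sketch the pattern is stored in a variable, so the shell parser never touches its backslashes; only the pattern matcher does. The helper matches is hypothetical, and it uses case rather than a real glob() call.

```shell
pat='\*notglob\*'     # single quotes: both backslashes are stored in $pat

# Hypothetical helper: succeeds if string $1 matches pattern $2.
matches() {
  case $1 in
    $2) return 0 ;;   # unquoted, so $2 is interpreted as a pattern
    *)  return 1 ;;
  esac
}

for name in '*notglob*' 'xnotglobx'; do
  if matches "$name" "$pat"; then
    echo "$name: match"
  else
    echo "$name: no match"
  fi
done
```

Only the name with literal stars matches, because the backslashes in $pat were consumed by the matcher, not by the shell.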

Similarly, libc's regcomp() function is used to implement bash's [[ str =~ $regex ]] construct. It parses and compiles a POSIX regular expression.

It also accepts backslash escapes, but for a different set of characters than glob(). For example, ( needs to be escaped in a regular expression, but not in a glob.
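A small sketch of that difference, assuming bash and keeping both patterns in variables so that shell quoting stays out of the picture. The variable names are made up.

```shell
s='a(b'
glob_pat='a(b'      # ( is an ordinary character in a glob
regex_ok='a\(b'     # ( must be escaped in a POSIX extended regex

case $s in
  $glob_pat) glob_result=match ;;
  *)         glob_result=nomatch ;;
esac

if [[ $s =~ $regex_ok ]]; then regex_result=match; else regex_result=nomatch; fi

# The glob spelling is not a valid regex: the ( is unbalanced, so the
# regex fails to compile and [[ returns a nonzero status.
[[ $s =~ $glob_pat ]] && bad_status=0 || bad_status=$?

echo "$glob_result $regex_result $bad_status"
```

The bash manual documents a return value of 2 for a syntactically incorrect regular expression, which is what I'd expect bad_status to be here.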

Bash is Confused

In conclusion, all this backslash escaping makes implementing a correct shell pretty confusing. In fact, bash itself is confused, and has this text in its manual:

Storing the regular expression in a shell variable is often a useful way to avoid problems with quoting characters that are special to the shell. It is sometimes difficult to specify a regular expression literally without using quotes, or to keep track of the quoting used by regular expressions while paying attention to the shell’s quote removal.

In other words, rather than:

[[ $s =~ (a|b) ]]


store the pattern in a variable and write:

pat='(a|b)'
[[ $s =~ $pat ]]  # $pat shouldn't be quoted

because you'll avoid bash's quirky rules about (, ), and | in a regular expression context.
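Here is a sketch of the variable approach in bash, including what happens if you do quote the variable. The test strings are arbitrary.

```shell
pat='(a|b)'    # single quotes: the shell leaves ( | ) alone

for s in a b c; do
  if [[ $s =~ $pat ]]; then echo "$s: match"; else echo "$s: no match"; fi
done

# Quoting $pat changes the meaning: bash then matches it as a literal string.
if [[ a =~ "$pat" ]]; then quoted=match; else quoted=nomatch; fi
echo "quoted: $quoted"
```

So the quotes around the variable are yet another layer: unquoted, the contents are a regex; quoted, they are plain text.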

My translation of the manual:

"The lexer mode for regex literals is messed up. You can avoid this part of the language by using the lexer mode for assignments instead. Then you don't have to reason about multiple levels of escaping, which are arguably broken in this case."

Lessons for the Oil Language