I'm not sure exactly what problem markdown.pl has with double escaping, but
it reminded me of complex escaping issues I've dealt with in Oil. There are
two problems:
The examples below attempt to disentangle the cases.
In an unquoted context, $ starts a substitution, and * is a glob character.
For them to be literal characters, they need to be escaped with \. This
means \ should be escaped as \\ too.
$ echo \\ \$ \* \t  # Invalid escape \t should be written t
\ $ * t
In double quotes, $ and " need to be escaped, so \\ is again a single
backslash. An invalid escape \t is left alone; it should be written \\t.
$ echo "\\ \$ \" \t \\t"  # \t should be written \\t
\ $ " \t \t
In single quotes, backslash isn't special:
$ echo '\\ \t'
\\ \t
In C-style strings like $'\n', the backslash is special. An invalid escape
\Z is left alone; it should be written \\Z.
$ echo $'\\ a\tb \Z \\Z'  # \Z should be written \\Z
\ a b \Z \Z
With regard to invalid \ escapes, I noticed that Julia 0.7 will disallow
them. This stricter rule would be consistent with the
philosophy of the Oil language as well. We want to give the user
more error messages.
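To make the idea of a stricter rule concrete, here is a minimal sketch of a check for the double-quoted case only. The function name `has_invalid_dq_escape` is hypothetical, and this is not how Oil or Julia implement anything; it only assumes that `\$`, `` \` ``, `\"` and `\\` are the valid escapes inside double quotes, and it ignores backslash-newline.

```bash
# Hypothetical helper: succeed (exit 0) if the body of a double-quoted
# string contains a backslash that is not part of \$ \` \" or \\.
has_invalid_dq_escape() {
  # Strip the valid two-character escapes left to right, then look for
  # any backslash that is left over.
  _stripped=$(printf '%s\n' "$1" | sed 's/\\[$`"\\]//g')
  case $_stripped in
    *\\*) return 0 ;;
    *) return 1 ;;
  esac
}

# The argument is single-quoted, so the function sees the raw source text.
if has_invalid_dq_escape '\\ \$ \" \t'; then
  printf '%s\n' 'invalid escape found'
fi
```

A real implementation would do this in the lexer rather than with `sed`, but the rule being checked is the same.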
Builtins also interpret backslashes:
$ echo -e '\\ \t \Z \\Z'  # \Z should be written \\Z
\ \Z \Z
$ printf '\\ \t \Z \\Z'  # \Z should be written \\Z
\ \Z \Z
With the read builtin, backslashes in input data escape $IFS characters
(e.g. space, tab, newline), as well as the backslash itself:
$ read x y <<< 'x\ x y\\y'; echo "-$x-$y-"
-x x-y\y-
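For comparison, here is a small sketch of the same input read with and without the `-r` flag, which turns off backslash processing in `read`. The variable names `rx` and `ry` are just for illustration, and a here-document is used instead of `<<<` so the snippet doesn't depend on bash.

```bash
# Without -r: the backslash escapes the space and the second backslash.
read x y <<'EOF'
x\ x y\\y
EOF
printf '%s\n' "-$x-$y-"     # prints -x x-y\y-

# With -r: backslashes are data, and only $IFS splits the fields.
read -r rx ry <<'EOF'
x\ x y\\y
EOF
printf '%s\n' "-$rx-$ry-"   # prints -x\-x y\\y-
```

`printf '%s\n'` is used for output so that no further backslash interpretation happens when printing.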
If you want to write unreadable code, you can compose backslashes at parse time and at runtime:
$ echo -e $'\\\\'
\
$ printf $(echo -e $'\\\\\\\\')
\
Consider these three shell statements:
ls */'*notglob*' # 1. single-quoted filename
ls */"*notglob*" # 2. double-quoted filename
ls */\*notglob\* # 3. backslash-quoted glob characters
They all do the same thing: invoke ls with files named *notglob* in any
child directory. For example, dir1/*notglob* and dir2/*notglob*.
And they all "compile" down to the same glob call:
// 4. glob() takes a string pattern, which respects backslash-escaping
glob("*/\*notglob\*")
Note that the first * is not escaped, but the second and third one are.
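The claim that the three spellings are equivalent can be checked with a short experiment. This is only a sketch: it creates a scratch directory with `mktemp -d`, uses `printf` rather than `ls` so the output format doesn't vary between systems, and the variable names are made up for the example.

```bash
# Build a scratch tree containing files literally named *notglob*.
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/*notglob*" "$tmp/dir2/*notglob*" "$tmp/dir1/other"

# Expand each spelling inside the scratch directory (in a subshell,
# so the current directory of the calling shell is unchanged).
single=$(cd "$tmp" && printf '%s\n' */'*notglob*')
double=$(cd "$tmp" && printf '%s\n' */"*notglob*")
backslash=$(cd "$tmp" && printf '%s\n' */\*notglob\*)

printf '%s\n' "$single"
# dir1/*notglob*
# dir2/*notglob*

rm -rf "$tmp"
```

All three variables end up holding the same two paths, and `dir1/other` is never matched because the quoted stars are literal.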
Here's what I didn't understand before implementing a shell: although lines 3 and 4 appear identical, they're really puns. The backslash means two different things:
1. In line 3, the backslash escapes shell metacharacters.
2. In line 4, the backslash escapes glob() characters. In this case, $
would not need to be escaped with \, because it's not a glob
metacharacter.

Similarly, libc's regcomp() function is used to implement bash's [[ str =~ $regex ]] construct. It parses and compiles a POSIX regular expression.
It also accepts backslash escapes, but for a different set of characters
than glob(). For example, ( needs to be escaped in a regular expression,
but not in a glob.
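The difference between the two pattern languages can be seen by feeding the same string to both. The sketch below is an approximation: it uses a `case` pattern as a stand-in for glob-style matching and `grep -E` as a stand-in for a POSIX extended regex, rather than calling glob() or regcomp() directly. The function names are hypothetical.

```bash
pat='(a|b)'

# Glob-style matching: ( | ) are ordinary characters, so the pattern
# only matches the literal five-character string.
matches_glob() {
  case $1 in
    $pat) return 0 ;;
    *) return 1 ;;
  esac
}

# Extended-regex matching: ( | ) are operators, so the pattern means
# "a or b". -x requires the whole line to match.
matches_regex() {
  printf '%s\n' "$1" | grep -Eqx -e "$pat"
}

matches_glob '(a|b)' && printf '%s\n' 'glob: matches the literal text'
matches_regex 'a' && printf '%s\n' 'regex: matches a'
```

To match a literal parenthesis with the regex, it would have to be written `\(`, which is exactly the extra layer of escaping described above.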
In conclusion, all this backslash escaping makes implementing a correct shell pretty confusing. In fact, [bash][] itself is confused, and has this text in [its manual][bash-manual]:
Storing the regular expression in a shell variable is often a useful way to avoid problems with quoting characters that are special to the shell. It is sometimes difficult to specify a regular expression literally without using quotes, or to keep track of the quoting used by regular expressions while paying attention to the shell’s quote removal.
In other words, rather than:
[[ $s =~ (a|b) ]]
use:
pat='(a|b)'
[[ $s =~ $pat ]] # $pat shouldn't be quoted
because you'll avoid bash's quirky rules about (, ), and | in a regular
expression context.
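The comment about not quoting `$pat` matters because bash (3.2 and later) treats a quoted right-hand side of `=~` as a literal string rather than a regex. Here is a minimal bash-only sketch; the result variable names are made up for the example.

```bash
#!/usr/bin/env bash
pat='(a|b)'
s='a'

# Unquoted: $pat is compiled as a regular expression, so "a" matches.
if [[ $s =~ $pat ]]; then unquoted=match; else unquoted=nomatch; fi

# Quoted: the pattern is matched as literal text, so "a" does not match...
if [[ $s =~ "$pat" ]]; then quoted=match; else quoted=nomatch; fi

# ...but a string containing the literal text (a|b) does.
if [[ 'x(a|b)x' =~ "$pat" ]]; then literal=match; else literal=nomatch; fi

printf '%s\n' "unquoted=$unquoted quoted=$quoted literal=$literal"
# unquoted=match quoted=nomatch literal=match
```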
My translation of the manual:
"The lexer mode for regex literals is messed up. You can avoid this part of the language by using the lexer mode for assignments instead. Then you don't have to reason about multiple levels of escaping, which are arguably broken in this case."