In my post on CommonMark, I mentioned that markdown.pl sometimes
leaves MD5 checksums in its HTML output, which is apparently caused by an
incorrect attempt to avoid double-escaping.
This reminded me of complex escaping issues I've dealt with in Oil. In particular, the backslash is the most problematic character to deal with.
Before writing a shell, I didn't fully appreciate that the backslash isn't a single construct: it means different things depending on where it appears.
I show examples below. I don't expect that every reader wants all the gory details, but laying it out like this does shed light on the unusual evaluation model of shell.
At the end, I'll discuss lessons for the Oil language.
In an unquoted context, $ starts a substitution, and * is a glob
character. To make them literals, escape them with \. Note that this
means that \ also needs to be escaped as \\.
$ echo \\ \$ \* \t t
\ $ * t t
Note that \t is an invalid escape, and becomes t.
In double quotes, $ and " need to be escaped, so \\ is again a
single backslash. An invalid escape \t is left alone; it should be written
\\t.
$ echo "\\ \$ \" \t \\t"  # \t should be written \\t
\ $ " \t \t
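As a minimal sketch of the double-quote rule, the snippet below prints each word on its own line with printf '%s\n', so that echo's own escape handling can't interfere. It also includes the backtick, which POSIX lists alongside $, " and \ as a character that a backslash escapes inside double quotes. The variable name out is only for illustration.

```shell
# Inside double quotes, a backslash only escapes $, `, ", \ and newline.
# Before any other character (such as t) the backslash is kept literally.
out=$(printf '%s\n' "\\" "\$" "\"" "\`" "\t")
printf '%s\n' "$out"
```

The first four words come out as single characters, while the last one keeps both the backslash and the t.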
In single quotes in bash, backslashes aren't special:
$ echo '\t \\'
\t \\
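In other shells, the same command prints something different, because their echo builtin interprets backslash escapes even without -e:
$ for sh in bash dash mksh zsh; do
>   $sh <<'EOF'
> echo $0: '\t \\'
> EOF
> done
bash: \t \\
dash: 	 \
mksh: 	 \
zsh: 	 \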
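One way to sidestep these differences is to print with printf '%s\n' instead of echo. This is a sketch rather than something the examples above prescribe: POSIX printf only interprets backslashes in its format string, so an argument consumed by %s should pass through unchanged in each of the shells listed. The variable name literal is illustrative.

```shell
# The single-quoted argument reaches printf as-is, and %s does not
# interpret backslash escapes in its argument.
literal=$(printf '%s\n' '\t \\')
printf '%s\n' "$literal"
```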
In C-style strings like $'\n', the backslash is special. An invalid escape
\Z is left alone; it should be written \\Z.
$ echo $'\\ a\tb \Z \\Z'  # \Z should be written \\Z
\ a	b \Z \Z
With regard to invalid \ escapes, I noticed that Julia 0.7 will disallow
them. This stricter rule is inconsistent with C, Python, and
JavaScript, but it's consistent with the philosophy of the Oil
language.
That is, we issue more errors at parse time. echo \Z and echo "\Z" are
both invalid, yet they give different results!
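To make the difference concrete, here is a small sketch that captures both forms with printf '%s', which keeps echo's own escape handling out of the picture. The variable names are illustrative.

```shell
# Unquoted: the shell removes the backslash and passes a plain Z.
unquoted=$(printf '%s' \Z)
# Double-quoted: \Z is not a recognized escape, so the backslash is kept.
quoted=$(printf '%s' "\Z")
printf 'unquoted=%s quoted=%s\n' "$unquoted" "$quoted"
```

The same two characters in the source produce a one-character word in the first case and a two-character word in the second.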
libc's Mini-Languages
Consider these three shell statements:
ls */'*notglob*' # 1. single-quoted filename
ls */"*notglob*" # 2. double-quoted filename
ls */\*notglob\* # 3. backslash-quoted glob characters
They all do the same thing: invoke ls with files named *notglob* in any
child directory. For example, dir1/*notglob* and dir2/*notglob*.
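The equivalence of the three spellings can be checked with a scratch directory. This is a sketch under a few assumptions: mktemp -d is available, the filesystem allows * in file names, and the directory and variable names are made up for the test. It uses printf '%s\n' in place of ls so the expanded words are visible directly.

```shell
# Build a scratch tree containing files literally named *notglob*.
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/*notglob*" "$tmp/dir2/*notglob*" "$tmp/dir1/other"

# Expand each spelling in a subshell so the caller's directory is unchanged.
single=$(cd "$tmp" && printf '%s\n' */'*notglob*')
double=$(cd "$tmp" && printf '%s\n' */"*notglob*")
backslash=$(cd "$tmp" && printf '%s\n' */\*notglob\*)

printf '%s\n' "$single"
rm -r "$tmp"
```

All three variables should hold the same two paths, and dir1/other is not matched because the quoted stars are literal.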
And they all "compile" down to the same glob call:
// 4. glob() takes a string pattern, which respects backslash-escaping
glob("*/\*notglob\*")
Note that the first * is not escaped, but the second and third one are.
Here's what I didn't understand before implementing a shell: although lines 3 and 4 appear identical, they're really puns. The backslash means two different things:
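In line 3, the backslash is shell syntax: it quotes characters that would otherwise be special to the shell.
In line 4, the backslash is glob() syntax: it escapes glob() characters. In this case, $ would not need to be escaped with \, because it's not a glob metacharacter.
Similarly, libc's regcomp() function is used to implement bash's [[ str =~ $regex ]] construct. It parses and compiles a POSIX regular expression.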
It also accepts backslash escapes, but for a different set of characters
than glob(). For example, ( needs to be escaped in a regular expression,
but not in a glob.
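The two escaping rules can be compared side by side. The sketch below is an approximation, not bash's actual code path: it uses a case pattern for glob matching and grep -E as a stand-in for the POSIX extended regular expression syntax that regcomp() accepts. The variable names are hypothetical, and the rejection of a lone ( is what GNU grep does.

```shell
s='('

# In a glob pattern, ( is an ordinary character and matches itself.
glob_pat='('
case $s in
  $glob_pat) glob_result=match ;;
  *)         glob_result=nomatch ;;
esac

# In an extended regular expression, a literal ( must be written \(.
if printf '%s\n' "$s" | grep -Eq '\('; then
  regex_escaped=match
else
  regex_escaped=nomatch
fi

# Unescaped, ( opens a group that is never closed; GNU grep rejects it.
if printf '%s\n' "$s" | grep -Eq '(' 2>/dev/null; then
  regex_unescaped=match
else
  regex_unescaped=nomatch
fi

printf '%s %s %s\n' "$glob_result" "$regex_escaped" "$regex_unescaped"
```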
In conclusion, all this backslash escaping makes implementing a correct shell pretty confusing. In fact, bash itself is confused, and has this text in its manual:
Storing the regular expression in a shell variable is often a useful way to avoid problems with quoting characters that are special to the shell. It is sometimes difficult to specify a regular expression literally without using quotes, or to keep track of the quoting used by regular expressions while paying attention to the shell’s quote removal.
In other words, rather than:
[[ $s =~ (a|b) ]]
use:
pat='(a|b)'
[[ $s =~ $pat ]] # $pat shouldn't be quoted
because you'll avoid bash's quirky rules about (, ), and | in a regular
expression context.
My translation of the manual:
"The lexer mode for regex literals is messed up. You can avoid this part of the language by using the lexer mode for assignments instead. Then you don't have to reason about multiple levels of escaping, which are arguably broken in this case."
Builtins also have backslashes:
$ echo -e '\\ \t \Z \\Z'  # \Z should be written \\Z
\ 	 \Z \Z
$ printf '\\ \t \Z \\Z'  # \Z should be written \\Z
\ 	 \Z \Z
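A related detail, sketched here from POSIX printf behavior rather than from the examples above: printf interprets backslashes in its format string, but not in arguments consumed by %s. The %b conversion opts an argument into escape processing. The variable names are illustrative.

```shell
# A real tab character, produced by the format string.
tab=$(printf '\t')

# %s passes the argument through unchanged: the backslash and t survive.
with_s=$(printf '%s' 'a\tb')
# %b interprets backslash escapes in the argument: \t becomes a tab.
with_b=$(printf '%b' 'a\tb')

printf '%s\n' "$with_s"
```

So whether a given backslash is interpreted depends on which printf position it lands in, not only on how the shell quoted it.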
With the read builtin, backslashes in input data escape $IFS
characters (e.g. space, tab, newline), as well as the backslash itself:
$ read x y <<< 'x\ x y\\y'; echo "-$x-$y-"
-x x-y\y-
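For comparison, the -r flag turns this processing off. The sketch below feeds the same input line to read with and without -r, using a here-document instead of <<< so it also runs in shells without here-strings. The names rx and ry are made up for the example.

```shell
# Without -r, a backslash in the input escapes the next character.
read x y <<'EOF'
x\ x y\\y
EOF

# With -r, backslashes are ordinary data and the first space splits fields.
read -r rx ry <<'EOF'
x\ x y\\y
EOF

printf '[%s] [%s]\n' "$x" "$y"
printf '[%s] [%s]\n' "$rx" "$ry"
```

Without -r the first field contains an escaped space. With -r the first field ends at that space and keeps its trailing backslash.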
If you want to write unreadable code, you can compose backslashes at parse time and at runtime:
$ echo -e $'\\\\'
\
$ printf $(echo -e $'\\\\\\\\')
\
These commands aren't part of the shell proper, but they understand backslashes too:
$ echo $'a\tb c' | awk -F $'\t' '{print $2}';  # shell creates a tab
> echo $'a\tb c' | awk -F '\t' '{print $2}'    # awk creates a tab
b c
b c
$ echo $'a\tb c' | xargs -n 1 -d $'\t' echo .;  # shell creates a tab
> echo $'a\tb c' | xargs -n 1 -d '\t' echo .    # xargs creates a tab
. a
. b c
. a
. b c
$ find css js -maxdepth 0 -printf $'%s %p\n';  # shell creates newline
> find css js -maxdepth 0 -printf '%s %p\n'    # find creates newline
4096 css
4096 js
4096 css
4096 js
However, GNU cut apparently doesn't understand backslashes:
$ echo $'a\tb c' | cut -d $'\t' -f 2  # shell creates a tab
> echo $'a\tb c' | cut -d '\t' -f 2   # cut doesn't understand tab
b c
cut: the delimiter must be a single character
Try 'cut --help' for more information.
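In shells without $'...' strings, one possible workaround is to let printf create the tab and pass the real character to cut. This is a sketch with illustrative variable names, not a recommendation from the examples above.

```shell
# printf interprets \t in its format string; command substitution only
# strips trailing newlines, so the tab survives in the variable.
tab=$(printf '\t')
field=$(printf 'a\tb c\n' | cut -d "$tab" -f 2)
printf '%s\n' "$field"
```

Here cut never sees a backslash, only the single tab character it requires as a delimiter.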
Did I miss any?