9/29 Update: I slightly modified the solution given here.
In this post, I explain a short solution to a small text processing problem. Then I use it to illustrate the strengths and drawbacks of the Unix style of programming.
For the OSH 0.1 release, I generated this HTML changelog:
At first, I used a single command, which was roughly:
$ git log --pretty=format:"<tr> <td>%H</td> <td>%s</td> </tr>"
Notice that git log uses printf-style substitution, e.g. where %s is the commit subject and %H is the commit hash.
However, two commit descriptions happen to use the HTML metacharacters < and &. The format string above gives the wrong results:
<td>Implement <&, with test.</td>
<td>Alternative to using <& $fd</td>
Instead, they should be escaped like this:
<td>Implement &lt;&amp;, with test.</td>
<td>Alternative to using &lt;&amp; $fd</td>
Some programmers might stop here and say, Let's switch to a real programming language. Do it the right way.
In other words: Use a git API to retrieve structured data from the repository, then call an HTML escaping function on each string.
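As a rough sketch of that structured approach, assume the commits have already been retrieved as (hash, subject) pairs by some git library. The `commits` list and the `render_rows` helper below are hypothetical and stand in for whatever a real git API would return; only the escaping step is shown, in Python 3.

```python
import html


def render_rows(commits):
    """Return HTML table rows for (hash, subject) pairs, escaping each field."""
    rows = []
    for commit_hash, subject in commits:
        rows.append(
            "<tr> <td>%s</td> <td>%s</td> </tr>"
            % (html.escape(commit_hash), html.escape(subject))
        )
    return "\n".join(rows)


# Hypothetical data standing in for what a git library would return.
commits = [
    ("abc123", "Implement <&, with test."),
    ("def456", "Alternative to using <& $fd"),
]
print(render_rows(commits))
```

Because each value is escaped as a separate string before it is placed in the template, there is no need for delimiters inside the output.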
That solution seems "right", but it's onerous. Finding and installing a proper git API feels like yak shaving for this small task, and creates a new problem: dependencies.
We can have this debate later, but I still want to use shell. I don't want to turn a single git log invocation into a big program.
On the other hand, I don't want to be sloppy about escaping. That's a dangerous habit to get into.
(Related: Master Foo and the Ten Thousand Lines.)
Here's how I solved the problem:
Surround the % substitutions with the bytes 0x01 and 0x02. Bash supports C-style escapes within strings, e.g. $'\x01'. In this case we'll use:
$'<td>\x01%s\x02</td>'
Pipe to Python to HTML-escape everything inside pairs of 0x01 and 0x02 bytes.
In other words, it's of the form:
$ git log --pretty="format:..." | python -c 'print re.sub(...)'
Here's a simplified version of the code I used. It's also available in the oilshell/blog-code repository.
# Escape portions of standard input delimited by special bytes
escape-segments() {
  python -c '
import cgi, re, sys

print re.sub(
  r"\x01(.*)\x02",
  lambda match: cgi.escape(match.group(1)),
  sys.stdin.read())
'
}

# Write an HTML table to stdout
git-log-html() {
  echo '<table>'

  local format=$'
  <tr>
    <td> <a href="https://example.com/commit/%H">%h</a> </td>
    <td>\x01%s\x02</td>
  </tr>'
  git log -n 5 --pretty="format:$format" | escape-segments

  echo '</table>'
}
The second argument to Python's re.sub is a replacement function. It takes a match object and calls cgi.escape() on the first captured group.
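For readers on recent Python 3 releases, where the print statement and cgi.escape are gone, here is a minimal sketch of the same idea as a function. The name `escape_segments` is my own, and `html.escape(..., quote=False)` is used as the closest stand-in for the default behavior of cgi.escape. This is an illustration, not the code from the repository.

```python
import html
import re


def escape_segments(text):
    """HTML-escape whatever lies between a 0x01 byte and a 0x02 byte on a line."""
    return re.sub(
        r"\x01(.*)\x02",
        # quote=False leaves quotation marks alone, like cgi.escape's default.
        lambda match: html.escape(match.group(1), quote=False),
        text,
    )


line = "<td>\x01Implement <&, with test.\x02</td>"
print(escape_segments(line))
```

The delimiter bytes are part of the match but not part of the captured group, so they are dropped from the output while the text between them is escaped.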
This solution correctly escapes the shell operators I used in my description, e.g. Implement <& becomes Implement &lt;&amp;.
At least some of you are skeptical right now. Didn't I just push the problem around? I've prevented the obvious XSS attack:
git commit -m '<script>alert("hi")</script>'
But now I have a problem with 0x01 and 0x02, which can occur in git commit descriptions.
However, I tried to attack my code with
git commit -m $'\x02<script>alert("hi")</script>\x01'
but it doesn't work because I used a greedy regex match (.*) and not a non-greedy one (.*?). A related subtlety is that, in this particular case, %s yields a single line of text, which helps because (.*) doesn't match newline characters.
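To make the greedy versus non-greedy difference concrete, here is a small Python 3 sketch that applies both patterns to the line git log would emit for that hostile subject. The helper name `escape_with` and the sample `line` are assumptions for illustration; `html.escape` stands in for cgi.escape.

```python
import html
import re


def escape_with(pattern, text):
    """Escape the first captured group of every match of pattern in text."""
    return re.sub(pattern, lambda m: html.escape(m.group(1), quote=False), text)


# The hostile subject, wrapped in the delimiters from the format string.
attack = "\x02<script>alert(\"hi\")</script>\x01"
line = "<td>\x01" + attack + "\x02</td>"

greedy = escape_with(r"\x01(.*)\x02", line)
lazy = escape_with(r"\x01(.*?)\x02", line)

# Greedy spans from the first 0x01 to the last 0x02, so the tag is escaped.
print("<script>" in greedy)
# Non-greedy stops at the attacker's 0x02, leaving the tag outside any match.
print("<script>" in lazy)

# Without re.DOTALL, "." does not match a newline, so a segment that
# spans two lines is not matched at all.
print(re.search(r"\x01(.*)\x02", "\x01first\nsecond\x02"))
```

The greedy version still leaves the attacker's stray 0x02 and 0x01 bytes in the output, which is one more reason to treat this as a demonstration rather than a proof of safety.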
This feels too clever to call secure. If you can construct a git message that causes the alert to fire, leave a comment. You can test your exploit by generating an HTML page with git-changelog/demo.sh in the oilshell/blog-code repository.
A bug in my solution would allow an Oil committer to run arbitrary code in the context of oilshell.org, which may or may not seem like a big deal.
But there was a recent case where trusting data in a git repository had far worse consequences:
In short, many developers display the current git branch in their bash prompt (including me). An attacker could create a branch name that would cause git-prompt.sh to execute arbitrary code:
Here's a proof of concept:
(Note: I didn't try this code.)
I like the git log | python solution, at least for this particular task. It's short and efficient.
However, it illustrates the downsides of using strings for everything. Although I wasn't able to construct an exploit, I'd hesitate to use this technique in an adversarial context. The CVE is further evidence that shell is a dangerous language.
I aim to rectify this with the Oil language. You should be able to have your cake and eat it too. There shouldn't be a tradeoff between "quick but unsafe" and "correct but clunky"!
Here's an outline of a future post on this topic:
How should git log be designed? The mini % language is a common pattern, but it's clunky and potentially unsafe, as this example shows.
If you'd like to help me make a better Unix shell, please try the latest release on real shell scripts, and file bugs if it doesn't work. Thanks!
find and stat both have % languages that print file system metadata.
$ find . -printf 'relative path: %P\n'
...
$ stat -c 'name: %n'
...
curl's variant is more readable, with multi-character variable names:
$ curl -o /dev/null --write-out 'url: %{url_effective}' http://example.com
...
\h \W for $PS1, the prompt string.
date uses strftime strings, but the values look safe to substitute without escaping.
/usr/bin/time --format and the TIMEFORMAT variable for bash's time builtin use printf-style formatting, but the values are usually numbers, which don't need escaping in any context I can think of.
Leave a comment if you know of other tools that follow this pattern.