How to Quickly and Correctly* Generate a Git Log in HTML

2017-09-19

9/29 Update: I slightly modified the solution given here.

In this post, I explain a short solution to a small text processing problem. Then I use it to illustrate the strengths and drawbacks of the Unix style of programming.

The Problem

For the OSH 0.1 release, I generated this HTML changelog:

At first, I used a single command, which was roughly:

$ git log --pretty=format:"<tr> <td>%H</td> <td>%s</td> </tr>"

Notice that git log uses printf-style substitution: %s is the commit subject and %H is the commit hash.

However, two commit descriptions happen to use the HTML metacharacters < and &. The format string above gives the wrong results:

<td>Implement <&, with test.</td>
<td>Alternative to using <& $fd</td>

Instead, they should be escaped like this:

<td>Implement &lt;&amp;, with test.</td>
<td>Alternative to using &lt;&amp; $fd</td>
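As a quick check, here's a minimal sketch showing that a standard HTML escaping function produces exactly the escaped form above. It assumes Python 3's html.escape; quote=False is chosen here because the text lands inside a <td>, not inside an attribute.

```python
import html

# The two commit subjects quoted above.
subjects = [
    "Implement <&, with test.",
    "Alternative to using <& $fd",
]

# quote=False leaves " and ' untouched; only &, < and > are rewritten.
escaped = [html.escape(s, quote=False) for s in subjects]

for line in escaped:
    print("<td>%s</td>" % line)
```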

The Pedantic Solution

Some programmers might stop here and say, "Let's switch to a real programming language. Do it the right way."

In other words: Use a git API to retrieve structured data from the repository, then call an HTML escaping function on each string.
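To make the comparison concrete, here's a rough sketch of that structured approach. It isn't from the original post. To avoid a third-party dependency it doesn't use a real git API; instead it shells out to git log and separates the fields with a NUL byte (%x00). The names read_log and render_rows are hypothetical, and the sketch assumes commit messages decode as UTF-8.

```python
import html
import subprocess

# %x00 asks git to emit a NUL byte between the hash and the subject.
LOG_FORMAT = "format:%H%x00%s"


def read_log(max_count=5):
    """Run git log in the current repository and return its output as text."""
    result = subprocess.run(
        ["git", "log", "-n", str(max_count), "--pretty=" + LOG_FORMAT],
        check=True, capture_output=True,
        encoding="utf-8", errors="replace",  # assumption: UTF-8 messages
    )
    return result.stdout


def render_rows(log_output):
    """Turn 'hash NUL subject' lines into HTML rows, escaping every field."""
    rows = []
    for line in log_output.split("\n"):
        if not line:
            continue
        commit_hash, subject = line.split("\x00", 1)
        rows.append("<tr> <td>%s</td> <td>%s</td> </tr>" % (
            html.escape(commit_hash), html.escape(subject)))
    return "\n".join(rows)
```

Calling print(render_rows(read_log())) inside a repository would print the rows. Even this dependency-free version is noticeably longer than a single git log invocation.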

That solution seems "right", but it's onerous. Finding and installing a proper git API feels like yak shaving for this small task, and creates a new problem: dependencies.

We can have this debate later, but I still want to use shell. I don't want to turn a single git log invocation into a big program.

On the other hand, I don't want to be sloppy about escaping. That's a dangerous habit to get into.

(Related: Master Foo and the Ten Thousand Lines.)

A Short and Efficient Solution

Here's how I solved the problem:

  1. Surround the % substitutions with the bytes 0x01 and 0x02. Bash supports C-style escapes within strings, e.g. $'\x01'. In this case we'll use:

    $'<td>\x01%s\x02</td>'

  2. Pipe to Python to HTML-escape everything inside pairs of 0x01 and 0x02 bytes.

In other words, it's of the form:

$ git log --pretty="format:..." | python -c 'print re.sub(...)'

Here's a simplified version of the code I used. It's also available in the oilshell/blog-code repository.

# Escape portions of standard input delimited by special bytes
escape-segments() {
  python -c '
import cgi, re, sys

print re.sub(
  r"\x01(.*)\x02", 
  lambda match: cgi.escape(match.group(1)),
  sys.stdin.read())
'
}

# Write an HTML table to stdout
git-log-html() {
  echo '<table>'

  local format=$'
  <tr>
    <td> <a href="https://example.com/commit/%H">%h</a> </td>
    <td>\x01%s\x02</td>
  </tr>'
  git log -n 5 --pretty="format:$format" | escape-segments

  echo '</table>'
}

The second argument to Python's re.sub is a replacement function. It takes a match object and calls cgi.escape() on the first captured group.
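The listing above is Python 2: it relies on the print statement, and cgi.escape was removed in Python 3.8. Here's a minimal sketch of the same filter in Python 3, assuming html.escape with quote=False is an acceptable stand-in for cgi.escape. The name escape_segments and the sample rows are illustrative, not taken from the repository.

```python
import html
import re

# Same pattern as the shell function above: bytes 0x01 and 0x02 bracket
# the text that needs escaping.
SEGMENT = re.compile(r"\x01(.*)\x02")


def escape_segments(text):
    """HTML-escape whatever sits between 0x01 and 0x02 on each line."""
    # The replacement function receives a match object; group(1) is the
    # delimited text. quote=False leaves quote characters alone.
    return SEGMENT.sub(
        lambda match: html.escape(match.group(1), quote=False), text)


sample = (
    '<td> <a href="https://example.com/commit/abc">abc</a> </td>\n'
    "<td>\x01Implement <&, with test.\x02</td>\n"
)
print(escape_segments(sample), end="")
```

The static markup in the first row passes through untouched, while the delimited subject in the second row is escaped and the delimiter bytes are dropped.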

This solution correctly escapes the shell operators I used in my description, e.g. Implement <& becomes Implement &lt;&amp;.

Is it Correct? Truly Adversarial Input

At least some of you are skeptical right now. Didn't I just push the problem around? I've prevented the obvious XSS attack:

git commit -m '<script>alert("hi")</script>'

But now I have a problem with 0x01 and 0x02, which can occur in git commit descriptions.

However, I tried to attack my code with

git commit -m $'\x02<script>alert("hi")</script>\x01'

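but it doesn't work because I used a greedy regex match (.*) and not a non-greedy one (.*?). The greedy match runs from the first 0x01 on the line to the last 0x02, so the injected delimiters land inside the captured group and get escaped along with the script tag. A related subtlety is that, in this particular case, %s yields a single line of text, which helps because (.*) doesn't match newline characters.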
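The difference between the two patterns can be checked with a short experiment. This is a sketch in Python 3, with html.escape(quote=False) standing in for cgi.escape; the helper name escape_with is made up for the example.

```python
import html
import re


def escape_with(pattern, text):
    """Escape group 1 of every match of pattern in text."""
    return re.sub(
        pattern, lambda m: html.escape(m.group(1), quote=False), text)


# The attack subject, wrapped the way the format string wraps %s.
attack = '\x02<script>alert("hi")</script>\x01'
row = "<td>\x01" + attack + "\x02</td>"

greedy = escape_with(r"\x01(.*)\x02", row)   # pattern used in the post
lazy = escape_with(r"\x01(.*?)\x02", row)    # non-greedy variant

print(repr(greedy))
print(repr(lazy))

# "." does not match a newline, so each row is matched on its own line.
two_rows = "<td>\x01a<b\x02</td>\n<td>\x01c&d\x02</td>"
print(repr(escape_with(r"\x01(.*)\x02", two_rows)))
```

With the greedy pattern the script tag comes out escaped, with the stray 0x02 and 0x01 bytes left in the text. With the non-greedy pattern, the two empty 0x01/0x02 pairs are consumed and the script tag passes through unescaped.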

This feels too clever to call secure. If you can construct a git message that causes the alert to fire, leave a comment. You can test your exploit by generating an HTML page with git-changelog/demo.sh in the oilshell/blog-code repository.

A More Severe Escaping problem

A bug in my solution would allow an Oil committer to run arbitrary code in the context of oilshell.org, which may or may not seem like a big deal.

But there was a recent case where trusting data in a git repository had far worse consequences:

In short, many developers display the current git branch in their bash prompt (including me). An attacker could create a branch name that would cause git-prompt.sh to execute arbitrary code:

Here's a proof of concept:

(Note: I didn't try this code.)

Conclusion

I like the git log | python solution, at least for this particular task. It's short and efficient.

However, it illustrates the downsides of using strings for everything. Although I wasn't able to construct an exploit, I'd hesitate to use this technique in an adversarial context. The CVE is further evidence that shell is a dangerous language.

I aim to rectify this with the Oil language. You should be able to have your cake and eat it too. There shouldn't be a tradeoff between quick but unsafe and correct but clunky!

Here's an outline of a future post on this topic:

Reminder

If you'd like to help me make a better Unix shell, please try the latest release on real shell scripts, and file bugs if it doesn't work. Thanks!


Appendix A: Other Unix Tools that Use Field Substitution

$ find . -printf 'relative path: %P\n'
...

$ stat -c 'name: %n'
...

$ curl -o /dev/null --write-out 'url: %{url_effective}' http://example.com
...

Leave a comment if you know of other tools that follow this pattern.