CommonMark is a Useful, High-Quality Project

2018-02-14 (Last updated 2018-02-15)

I write every page on this site in Markdown syntax. At first, I used the original markdown.pl to generate HTML.

But I've just switched to cmark, the C implementation of CommonMark. I had a great experience, which I document here.

We need more projects like this: ones that fix existing, widely-deployed technology rather than create new technology.

Table of Contents

What is CommonMark?

Why Did I Switch?

How Did it Go?

Tip: Check your charset in both HTTP and HTML

cmark Uses re2c, AFL, and AddressSanitizer

Conclusion

Update: An Incompatibility in Embedded HTML

Did I need to switch to CommonMark?

What is CommonMark?

The home page says:

We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests ...

Much like Unix shell, Markdown is a complex language with many implementations. I happened to use markdown.pl, but another popular implementation is pandoc. Sites like Reddit, GitHub, and Stack Overflow have their own variants as well.

However, shell has a POSIX spec. It specifies many non-obvious parts of the language, and shells widely agree on these cases. (Caveat: there are many things that POSIX doesn't specify, as mentioned in the FAQ on POSIX.)

But CommonMark goes further. In addition to a detailed written specification, the project provides:

An executable test suite, embedded in the source for the spec.
cmark, a high-quality C implementation that I'm now using.
commonmark.js, an implementation in JavaScript.

Perfect!

CommonMark's tests and Oil's spec tests follow the same philosophy. In order to specify the OSH language, I test over a thousand shell snippets against bash, dash, mksh, busybox ash, and zsh. (See blog posts tagged #testing.)
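To make the idea concrete, here is a minimal sketch of running one snippet under several shells and collecting the results side by side. It is not Oil's actual test harness: the function names `run_snippet` and `compare_shells` are hypothetical, and the default list of shells is only an example.

```python
import shutil
import subprocess


def run_snippet(shell, snippet):
    """Run a snippet under one shell; return (exit status, stdout)."""
    proc = subprocess.run(
        [shell, "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        timeout=5,
    )
    return proc.returncode, proc.stdout


def compare_shells(snippet, shells=("bash", "dash", "mksh", "zsh")):
    """Map each installed shell to its result, so disagreements stand out."""
    results = {}
    for shell in shells:
        if shutil.which(shell) is None:
            continue  # skip shells that are not installed on this machine
        results[shell] = run_snippet(shell, snippet)
    return results


if __name__ == "__main__":
    for shell, (status, out) in sorted(compare_shells("echo $((1 + 2))").items()):
        print(shell, status, repr(out))
```

A spec test is then just a snippet plus the result that every shell is expected to produce; any shell whose entry differs is either a bug or an unspecified corner of the language.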

I'd like to see executable specs for more data formats and languages. Of course, POSIX has to specify not just the shell, but an entire operating system, so it's perhaps understandable that they don't provide exhaustive tests. However, some tests would be better than none.

Why Did I Switch?

I wanted to parse <h1>, <h2>, ... headers in the HTML output in order to generate a table of contents, like the one at the top of this post. That is, the build process now starts like this:

Markdown → HTML.
HTML → HTML with an optional table of contents inserted.
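The second step can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the script this site uses: the names `HeadingCollector` and `make_toc` and the `toc-hN` class names are assumptions, and it omits the anchors a real table of contents would link to.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for <h1>..<h6> elements."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []
        self._level = None  # level of the heading we are inside, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag in self.HEADING_TAGS:
            text = "".join(self._parts).strip()
            self.headings.append((self._level, text))
            self._level = None


def make_toc(html_text):
    """Return a flat <ul> listing every heading found in html_text."""
    collector = HeadingCollector()
    collector.feed(html_text)
    collector.close()
    items = [
        '<li class="toc-h%d">%s</li>' % (level, escape(text))
        for level, text in collector.headings
    ]
    return "<ul>\n%s\n</ul>" % "\n".join(items)
```

Because the parser only sees the renderer's output, this step is only as reliable as that HTML is well-formed, which is why the choice of Markdown implementation matters here.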

The TOC used to be generated on the client side by traversing the DOM, using JavaScript borrowed from AsciiDoc. But it caused a noticeable rendering glitch. Since switching to static HTML, my posts no longer "flash" at load time.

I could have simply parsed the output of markdown.pl, but I didn't trust it. I knew it was a Perl script that was last updated in 2004, and Perl and shell share a similar sloppiness with text. They like to confuse code and data. This is one of the things I aim to fix with Oil. (See blog posts tagged #escaping-quoting.)

I had a more concrete reason for this suspicion, too. A few months ago, I noticed markdown.pl producing MD5 checksums in the HTML output, when none were in the input. I believe I "fixed" this bug by moving whitespace around, but I still don't know what the cause was. I see several calls to the Perl function md5_hex() in the source code, but there's no explanation for them.

This 2009 reddit blog post has a clue: it says that MD5 checksums are used to prevent double-escaping. But this makes no sense to me: checksums seem irrelevant to that problem, precisely because you can't tell apart checksums that the user wrote and checksums that the rendering process inserted. These bugs feel predictable — almost inevitable.

(However, I have some sympathy, because there are multiple kinds and multiple layers of escaping in shell. Most of these cases took more than one try to get right. The next post will list the different meanings of \ in shell.)
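The objection to checksum placeholders can be shown with a toy example. This is not markdown.pl's code, only a sketch of the general technique of swapping a special character for its MD5 digest and swapping it back later; `placeholder`, `protect`, and `restore` are hypothetical names.

```python
import hashlib


def placeholder(char):
    """The token that stands in for a protected character."""
    return hashlib.md5(char.encode("utf-8")).hexdigest()


def protect(text, char):
    """Hide every occurrence of char behind its digest."""
    return text.replace(char, placeholder(char))


def restore(text, char):
    """Turn every digest back into char, whoever wrote it."""
    return text.replace(placeholder(char), char)


if __name__ == "__main__":
    # The normal case round-trips.
    print(restore(protect("a*b", "*"), "*") == "a*b")

    # But text that already contains the digest is silently rewritten.
    user_text = "digest of star: " + placeholder("*")
    print(restore(protect(user_text, "*"), "*") == user_text)
```

The second check prints False: `restore` has no way to know that this digest came from the user rather than from `protect`. A placeholder that leaks into the output without being restored would be the opposite failure, and could look like the stray checksums described above.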

How Did it Go?

I changed the oilshell.org Makefile to use cmark instead of markdown.pl, and every blog post rendered the same way! When I looked at the underlying HTML, there were a few differences, which were either neutral changes or improvements:

Unicode characters are represented as themselves rather than HTML entities. For example, &mdash; turned into a literal "—". I like this change, but it means that the output HTML is now UTF-8 rather than ASCII. See the next section for a tip about charset declarations.
Insignificant whitespace in the HTML output changed.
<p>"Oil"</p> → <p>&quot;Oil&quot;</p>. ~~The former might be valid HTML, but the latter is better.~~ Correction: Readers have pointed out that escaping " here is unnecessary in both HTML and XML. I thought that being explicit about &quot; would be easier to parse in Python, but I now doubt that too.

A better example is >:

$ echo 'a > b' | markdown
<p>a > b</p>

$ echo 'a > b' | cmark
<p>a &gt; b</p>

I believe cmark's output is better. (However, I couldn't find an occurrence of this problem in my site, since markdown.pl does escape > within <code> tags.)
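For comparison, Python's standard `html.escape` makes the same choices by default. This says nothing about how cmark is implemented; it only shows the two escaping policies side by side.

```python
from html import escape

# Default policy: escape &, <, >, and both kinds of quotes.
print(escape("a > b"))               # a &gt; b
print(escape('"Oil"'))               # &quot;Oil&quot;

# With quote=False, quotes pass through unchanged, which is enough for
# text that appears between tags rather than inside an attribute value.
print(escape('"Oil"', quote=False))  # "Oil"
```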

~~So every blog post rendered correctly~~. Correction: I found another cmark incompatibility after publishing this post. See the update below.

But when I rendered the blog index, which includes generated HTML, I ran into a difference. A markdown heading between HTML tags was rendered literally, rather than with an <h3> tag:

<table>
  ...
</table>
### Heading
<table>
  ...
</table>

I fixed it by adding whitespace. I wouldn't write markdown like this anyway; it was arguably an artifact of generating HTML inside markdown.

Still, I'm glad that I have a git repository for the generated HTML as well as the source Markdown, so I can do a git diff after a build and eyeball changes.

Tip: Check your `charset` in both HTTP and HTML

As noted above, the HTML output now has UTF-8 characters, rather than using ASCII representations like &mdash;.

This could be a problem if your web server isn't properly configured. I checked and my web host is not sending a charset in the Content-Type header:

$ curl --head http://www.oilshell.org/
HTTP/1.1 200 OK
...
Content-Type: text/html

But I remembered that the default charset for HTTP is ISO-8859-1, not UTF-8. Luckily, my HTML boilerplate already declared UTF-8. If you "View Source", you'll see this line in the <head> of this document:

<meta charset=utf-8>

So I didn't need to change anything. When there's no encoding in the HTTP Content-Type header, the browser will use the HTML encoding.
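Both declarations can be checked mechanically. The following is a minimal sketch using only the Python standard library; the function and class names are hypothetical, and it only recognizes the short `<meta charset=...>` form shown above, not the older `http-equiv` form.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Record the first <meta charset=...> declaration."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.charset is None:
            value = dict(attrs).get("charset")
            if value:
                self.charset = value.lower()


def charset_from_html(html_text):
    """Return the charset declared in a <meta> tag, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


if __name__ == "__main__":
    print(charset_from_content_type("text/html"))
    print(charset_from_html("<head><meta charset=utf-8></head>"))
```

With the header and `<meta>` tag shown in this section, the first call finds no charset and the second finds utf-8, which matches the situation described: the HTTP layer is silent, so the HTML declaration decides.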

In summary, if you use markdown.pl, I recommend switching to CommonMark, but be aware of the encoding you declare in both HTTP and HTML.

`cmark` Uses `re2c`, AFL, and AddressSanitizer

I haven't yet looked deeply into the cmark implementation, but I see three things I like:

It uses re2c, a tool to generate state machines in the form of switch and goto statements from regular expressions.

I also used this code generator to implement the OSH lexer. For example, see osh-lex.re2c.h, which I describe in my (unfinished) series of posts on lexing.

It uses American Fuzzy Lop, a relatively new fuzzer that has uncovered many old bugs.

The first time I used it, I found a null pointer dereference in toybox sed in less than a second. Roughly speaking, it relies on compiler technology to know what if statements are in the code. This means it can cover more code paths with less execution time than other fuzzers.

It uses AddressSanitizer, a compiler option that adds dynamic checks for memory errors to the generated code.

I used it to find at least one bug in Brian Kernighan's awk implementation, as well as several bugs in toybox. It's like Valgrind, but it has less overhead.

In summary, these are exactly the tools you should use if you're writing a parser in C that needs to be safe against adversarial input.

Fundamentally, parsers have a larger state space than most code you write. It's impossible to reason about every case, so you need tools:

Generating state machines from regular expressions is more reliable and readable than writing them by hand.
re2c also has exhaustiveness checks at compile time. They're similar to the ones that languages like OCaml and Haskell provide for pattern matching constructs over algebraic data types. These checks found bugs in my lexer statically — for example, what happens when an entire shell script ends with a single backslash?
As mentioned, American Fuzzy Lop finds novel code paths very quickly.
AddressSanitizer complements AFL, because it automatically makes assertions when exploring new code paths. Something like a 1-byte buffer overflow may not cause your program to crash, so fuzzing alone won't detect it.

Another technique I've wanted to explore, but haven't yet, is property-based testing. As far as I understand, it's related to and complementary to fuzzing.
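As a rough illustration of the idea, here is a hand-rolled stand-in for a property-based test, using only the standard library. The property (unescaping escaped text gives back the original), the alphabet, and the function names are assumptions chosen for this sketch; a real property-based testing library would also shrink any counterexample to a minimal one.

```python
import html
import random

# Characters chosen to provoke escaping bugs: markup, quotes, entity pieces.
ALPHABET = "ab <>&\"';#x27"


def random_text(rng, max_len=20):
    """Generate a short random string over ALPHABET."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def check_round_trip(trials=1000, seed=0):
    """Return the first s where unescape(escape(s)) != s, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = random_text(rng)
        if html.unescape(html.escape(s)) != s:
            return s
    return None


if __name__ == "__main__":
    print(check_round_trip())
```

The difference from fuzzing is in what counts as a failure: a fuzzer typically looks for crashes, while a property check states a relationship that must hold for every generated input.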

Conclusion

I had a great experience with CommonMark, and I'm impressed by its thoroughness. I created oilshell.org/site.html to acknowledge it and all the other projects I depend on.

What other open source projects are fixing widely-deployed technology? Let me know in the comments.

Update: An Incompatibility in Embedded HTML

After publishing this post, I noticed that some of my posts had been broken for a while. I shell out to Pygments to render code blocks like:

def Foo():
  pass

def Bar():
  pass

Its output is piped back into the Markdown document as embedded HTML:

<div class="highlight">
  <!-- Python code highlighted with <span>.
       View source to see it. -->
</div>

However, the blank line triggers issue 490, an intentional incompatibility that allows Markdown in embedded HTML blocks.

I fixed it with this Awk filter:

# Replace blank lines with an HTML comment
awk '
/^[ \t]*$/ { print "<!-- blank -->"; next }
           { print }
'

So unfortunately, a few of my posts were broken, and I didn't notice for a while. I had inspected the diffs, but trivial changes drowned out these more important ones.

On the other hand, I've often wanted to use Markdown inside HTML tables, so I may intentionally use this feature of CommonMark.

Did I need to switch to CommonMark?

A few readers asked me this. The answer is technically no: I probably could have generated the TOC with the output of markdown.pl.

But I want firmer foundations for my blog's source text, and more rigorously defined HTML output. CommonMark has a spec, tests, and multiple implementations, while markdown.pl is a Perl script that hasn't been updated since 2004, and has known bugs.

I also learned that the author of pandoc works on CommonMark, which gives me confidence that CommonMark is "grounded in reality" and not inventing something too divergent.

Also, note that Markdown has no syntax errors. Every text file is a valid Markdown document. So, in theory, every divergence from markdown.pl breaks a document.

In that sense, fixing Markdown is harder than fixing shell. In OSH, if I can generate a good error at parse time, which leads the author to a trivial fix, I worry less about the incompatibility.