
CommonMark is a Useful, High-Quality Project


Every page on this site is written in Markdown syntax. At first, I used the original Markdown.pl to convert pages to HTML, but I just switched to cmark, the C implementation of CommonMark.

I had a great experience, which I document here. We need more projects like CommonMark: ones that fix existing, widely-deployed technology rather than create new technology.

Table of Contents
What is CommonMark?
Why Did I Switch?
How Did it Go?
Tip: Check your charset in both HTTP and HTML
cmark Uses re2c, AddressSanitizer, and AFL

What is CommonMark?

The home page says:

We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations against this specification.

Much like Unix shell, Markdown is a complex language with many implementations. I happened to use Markdown.pl, but another popular implementation is pandoc. Reddit, GitHub, and Stack Overflow also have their own variants.

However, unlike Markdown, shell has a POSIX spec. It specifies many non-obvious parts of the language, and shells widely agree on these cases. (Caveat: there are many things that POSIX doesn't specify, as mentioned in the FAQ on POSIX.)

But CommonMark goes further. In addition to a detailed written specification, the project provides:

  1. An executable test suite, embedded in the source for the spec.
  2. cmark, a high-quality C implementation that I'm now using.
  3. commonmark.js, an implementation in JavaScript.
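
As a quick illustration of the second item, the cmark binary converts CommonMark on stdin (or in files) to HTML on stdout. This snippet is mine, not from the project's documentation:

$ echo 'Hello, *world*' | cmark
<p>Hello, <em>world</em></p>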

CommonMark's tests and Oil's spec tests follow the same philosophy. In order to specify the OSH language, I test over a thousand shell snippets against bash, dash, mksh, busybox ash, and zsh. (See blog posts tagged #testing.)
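
Reduced to a one-liner (this is not Oil's actual test harness, and it assumes the shells are installed), the idea is to run the same snippet under several shells and compare the output:

$ for sh in bash dash mksh zsh; do echo "--- $sh"; $sh -c 'echo $((1 + 2))'; done
--- bash
3
--- dash
3
--- mksh
3
--- zsh
3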

I'd like to see executable specs for more data formats and languages. Of course, POSIX has to specify not just the shell, but an entire operating system, so it's perhaps understandable that they don't provide exhaustive tests.

Why Did I Switch?

I wanted to parse <h1>, <h2>, ... headers in the HTML output in order to generate a table of contents, like the one at the top of this post. That is, the build process now starts like this:

  1. Markdown → HTML.
  2. HTML → HTML with an optional table of contents inserted.
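
In shell terms, that pipeline looks roughly like the following. The file names and the TOC-inserting script are hypothetical; I'm not showing the actual Makefile rules here:

$ cmark blog/commonmark.md > _tmp/commonmark.html             # step 1: Markdown -> HTML
$ ./insert-toc.py _tmp/commonmark.html > web/commonmark.html  # step 2: insert the TOC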

The TOC used to be generated on the client side, using JavaScript borrowed from AsciiDoc that traverses the DOM, but it caused a noticeable rendering glitch. Since switching to static HTML, my posts no longer "flash" at load time.

I could have simply parsed the output of Markdown.pl, but I didn't trust it. I knew it was a Perl script that was last updated in 2004, and Perl and shell share a similar sloppiness with text. This is one of the things I'm trying to fix with Oil. (See blog posts tagged #escaping-quoting.)

This suspicion wasn't without evidence: I ran into a bug a few months ago where mysterious MD5 checksums appeared in the HTML output! I believe I "fixed" it by moving whitespace around, but I still don't know what the cause was. In Markdown.pl, you can see several calls to the Perl function md5_hex(), but the code doesn't explain why they're there.

This 2009 Reddit blog post has a clue: it says that MD5 checksums are used to prevent double-escaping. But this makes no sense to me: checksums seem irrelevant to that problem, precisely because you can't tell checksums that the user wrote apart from checksums that the rendering process inserted.

However, I have some sympathy, because there are multiple layers of backslash escaping in shell, and it took me more than one try to get it right. In the appendix, I list different meanings for the \ in shell.

How Did it Go?

I changed the Makefile to use cmark instead of Markdown.pl, and every blog post rendered the same way! When I looked at the underlying HTML, there were a few differences, but they were either neutral changes or improvements. For example, the output now contains literal UTF-8 characters rather than ASCII entities like &mdash;.

So every blog post rendered correctly. But when I rendered the blog index, which includes generated HTML, I ran into a difference. A markdown heading between HTML tags was rendered literally, rather than with an <h3> tag:

### Heading

I fixed it by adding whitespace. I wouldn't write markdown like this anyway; it was arguably an artifact of generating HTML inside markdown.
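
As I understand it, this is CommonMark's HTML-block rule at work: a line starting with a block-level tag like <div> begins a raw HTML block, and everything up to the next blank line is passed through verbatim. Here's a minimal reproduction with cmark (not the exact snippet from my blog index):

$ printf '<div>\n### Heading\n</div>\n' | cmark
<div>
### Heading
</div>

With blank lines separating the heading from the tags, the HTML block ends early and the heading is parsed normally:

$ printf '<div>\n\n### Heading\n\n</div>\n' | cmark
<div>
<h3>Heading</h3>
</div>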

Still, I'm glad that I have a git repository for the generated HTML as well as the source Markdown, so I can do a git diff after a build and eyeball changes.
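
That check is nothing fancy; it amounts to rebuilding and diffing in the repository that holds the generated HTML (these commands are illustrative, not my exact setup):

$ make               # regenerate the HTML
$ git diff --stat    # which generated pages changed?
$ git diff           # eyeball the actual changes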

Tip: Check your charset in both HTTP and HTML

As noted above, the output HTML now has UTF-8 characters, rather than using ASCII representations like &mdash;.

This could be a problem if your web server doesn't correctly declare the Content-Type. I checked with curl:

$ curl --head
HTTP/1.1 200 OK
Content-Type: text/html

I remembered that the default charset for HTTP is ISO-8859-1, not UTF-8. Luckily, my HTML boilerplate already declared UTF-8. If you "View Source", you'll see this line in the <head> of this document:

<meta charset=utf-8>

So I didn't need to reconfigure my web server. When there's no encoding in the HTTP Content-Type header, the browser will use the HTML encoding.
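
If you do want to declare the encoding at the HTTP level as well, the header needs a charset parameter; how you set it depends on your web server:

Content-Type: text/html; charset=utf-8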

In summary, if you use Markdown.pl, I recommend switching to CommonMark, but be aware of the encoding you declare in both HTTP and HTML.

cmark Uses re2c, AddressSanitizer, and AFL

I haven't yet looked deeply into the cmark implementation, but I see three things I like:

  1. It uses re2c. I also used this code generator to implement the OSH lexer. For example, see osh-lex.re2c.h, which I describe in my unfinished series of posts on lexing.
  2. It uses American Fuzzy Lop, a fuzzer that uses compiler technology. The first time I used it, I found a null pointer dereference in toybox sed in less than a second. Since it essentially knows what if statements are in the code, it can cover more code paths with less execution time than other fuzzers. (See the sample invocation after this list.)
  3. It uses AddressSanitizer, a compiler option that instruments code with dynamic checks for memory errors. I used it to find at least one bug in Brian Kernighan's awk implementation, as well as several bugs in toybox. It's like Valgrind, but it has less overhead.
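
For the last two items, here is roughly what the setup looks like for a parser written in C. These commands are generic illustrations, not cmark's actual build steps:

# Compile with AFL instrumentation plus AddressSanitizer, then fuzz.
# ('parser', 'seeds/', and 'findings/' are hypothetical names.)
$ AFL_USE_ASAN=1 CC=afl-gcc make
$ afl-fuzz -m none -i seeds/ -o findings/ -- ./parser @@

# AddressSanitizer on its own is just a compiler flag:
$ cc -g -fsanitize=address parser.c -o parser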

In summary, these are exactly the tools you should use if you're writing a parser in C that needs to be safe against adversarial input.

Fundamentally, parsers have a larger state space than most code you write. It's impossible to reason about every case, so you need tools:

  1. Generating state machines from regular expressions is more reliable and readable than writing them by hand. re2c has exhaustiveness checks at compile time. They are similar to the ones that languages like OCaml and Haskell provide for pattern matching constructs over algebraic data types.
  2. American Fuzzy Lop finds new code paths.
  3. AddressSanitizer effectively adds assertions that are checked while exploring these new code paths. Something like a 1-byte buffer overflow may not cause your program to crash, so fuzzing alone won't detect it.

Another technique I've wanted to explore, but haven't yet, is property-based testing. As far as I understand, it's related to and complementary to fuzzing.


I had a great experience with CommonMark, and I'm impressed by its thoroughness. I created a page to acknowledge it and all the other projects I depend on.