Every page on this site is written in Markdown syntax. At first, I used
the original markdown.pl to convert pages to HTML, but I just switched to
cmark, the C implementation of CommonMark.
I had a great experience, which I document here. We need more projects like CommonMark: ones that fix existing, widely-deployed technology rather than create new technology.
The home page says:
We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations against this specification.
Much like the Unix shell, Markdown is a complex language with many implementations.
I happened to use markdown.pl, but another popular implementation is
pandoc. Reddit, GitHub, and Stack Overflow also have their own variants.
However, shell has a POSIX spec. It specifies many non-obvious parts of the language, and shells widely agree on these cases. (Caveat: there are many things that POSIX doesn't specify, as mentioned in the FAQ on POSIX).
But CommonMark goes further. In addition to a detailed written specification, the project provides:

- hundreds of example snippets embedded in the spec itself, which double as conformance tests, and
- reference implementations that pass those tests: cmark in C and commonmark.js in JavaScript.
CommonMark's tests and Oil's spec tests follow the same philosophy. In order to specify the OSH language, I test over a thousand shell snippets against bash, dash, mksh, busybox ash, and zsh. (See blog posts tagged #testing.)
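As a rough illustration of the idea (this is not Oil's actual spec-test framework, and the snippet is made up), a spec test amounts to running one snippet under several shells and comparing the results:

```python
# Sketch of the spec-test idea: run one shell snippet under several shells
# and compare exit status and output.  Assumes these shells are installed.
import subprocess

SHELLS = ['bash', 'dash', 'mksh', 'zsh']
SNIPPET = 'x=hello; echo "${x:-default}"'   # a made-up test case

for sh in SHELLS:
    proc = subprocess.run([sh, '-c', SNIPPET], capture_output=True, text=True)
    print(f'{sh:8} exit={proc.returncode} stdout={proc.stdout!r}')
```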
I'd like to see executable specs for more data formats and languages. Of course, POSIX has to specify not just the shell, but an entire operating system, so it's perhaps understandable that they don't provide exhaustive tests.
I wanted to parse <h1>, <h2>, ... headers in the HTML output in order to
generate a table of contents, like the one at the top of this post. That
is, the build process now starts by converting Markdown to HTML with cmark;
a Python script then parses the headers out of that HTML and inserts the
table of contents.
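Here's a minimal sketch of what that header-extraction step could look like, using Python's built-in html.parser (my illustration, not the actual oilshell.org build script):

```python
# Sketch only: collect <h1>/<h2>/<h3> headers from cmark's HTML output
# so a table of contents can be generated.
from html.parser import HTMLParser

class HeaderCollector(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headers = []    # list of (level, text)
        self._level = None   # header level we're currently inside, if any
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2', 'h3'):
            self._level = int(tag[1])
            self._chunks = []

    def handle_data(self, data):
        if self._level is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == 'h%d' % self._level:
            self.headers.append((self._level, ''.join(self._chunks).strip()))
            self._level = None

# Usage: feed it the HTML that cmark produced.
parser = HeaderCollector()
parser.feed('<h1>Title</h1><p>intro</p><h2>First Section</h2>')
print(parser.headers)   # [(1, 'Title'), (2, 'First Section')]
```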
The TOC used to be generated on the client side, using JavaScript borrowed from AsciiDoc that traverses the DOM, but it caused a noticeable rendering glitch. Since switching to static HTML, my posts no longer "flash" at load time.
I could have simply parsed the output of markdown.pl, but I didn't trust it.
I knew it was a Perl script that was last updated in 2004, and Perl and shell
share a similar sloppiness with text. This is one of the things I'm trying to
fix with Oil. (See blog posts tagged #escaping-quoting.)
This suspicion wasn't without evidence: a few months ago, I ran into a bug where
mysterious MD5 checksums appeared in the HTML output! I believe I "fixed" it by moving
whitespace around, but I still don't know what the cause was. In
markdown.pl, you can see several calls to the Perl function md5_hex(), but
the code doesn't explain why they're there.
This 2009 reddit blog post has a clue: it says that MD5 checksums are used to prevent double-escaping. But this makes no sense to me: checksums seem irrelevant to that problem, precisely because you can't distinguish checksums that the user wrote from checksums that the rendering process inserted.
However, I have some sympathy, because there are multiple layers of
backslash escaping in shell, and it took me more than one try to get it right.
In the appendix, I list different meanings for the \ in shell.
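As a taste of why this is easy to get wrong, here's a small illustration I put together (mine, not the appendix): the same four backslashes produce different output depending on quoting and on how many layers of shell evaluation they pass through.

```python
# Illustration: how quoting and layers of shell evaluation change what \\\\ means.
# Assumes bash is installed; uses bash's builtin echo (which doesn't interpret escapes).
import subprocess

COMMANDS = [
    r'echo \\\\',             # bash turns each \\ into \ ; echo prints 2 backslashes
    r"echo '\\\\'",           # single quotes preserve all 4 backslashes
    r'bash -c "echo \\\\"',   # two layers of evaluation; echo prints 1 backslash
]

for cmd in COMMANDS:
    out = subprocess.run(['bash', '-c', cmd], capture_output=True, text=True).stdout
    print(f'{cmd!r:28} -> {out!r}')
```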
I changed the oilshell.org Makefile to use cmark instead of
markdown.pl, and every blog post rendered the same way! When I looked at the
underlying HTML, there were a few differences, which were either neutral
changes or improvements:
<p>"Oil"</p> → <p>"Oil"</p>. The former might be valid
HTML, but the latter is better. (The former is also not valid XML.) Being
explicit about " and & makes parsing simpler. Remember, I'm not
using the browser to parse HTML; I'm using a Python script at build time.
- Unicode characters are represented as themselves rather than as HTML entities.
For example, &mdash; turned into a literal "—". I like this change,
but it means that the output HTML is now UTF-8 rather than ASCII. See the
short sketch after this list for the difference at the byte level, and the
next section for a tip about charset declarations.
- Insignificant whitespace in the HTML output changed.
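To make that concrete, here's a tiny illustration (mine, not from the post) of an entity-encoded em dash versus the literal UTF-8 bytes:

```python
# Illustration: the same em dash as an HTML entity vs. as literal UTF-8 bytes.
import html

with_entity = 'Oil &mdash; a new Unix shell'
literal = html.unescape(with_entity)    # 'Oil \u2014 a new Unix shell'

print(literal.encode('utf-8'))          # the em dash becomes the 3 bytes e2 80 94
print(literal.encode('ascii', 'xmlcharrefreplace'))  # ASCII-safe again, as &#8212;
```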
So every blog post rendered correctly. But when I rendered the blog
index, which includes generated HTML, I ran into a difference. A markdown
heading between HTML tags was rendered literally, rather than with an <h3>
tag:
```
<table>
  ...
</table>
### Heading
<table>
  ...
</table>
```
I fixed it by adding whitespace: in CommonMark, a raw HTML block like the <table> continues until a blank line, so a heading butted right up against the closing tag is passed through as literal HTML rather than parsed as Markdown. I wouldn't write markdown like this anyway; it was arguably an artifact of generating HTML inside markdown.
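Presumably any fix that separates the heading from the HTML block works; with blank lines added, the input would look something like this:

```
<table>
  ...
</table>

### Heading

<table>
  ...
</table>
```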
Still, I'm glad that I have a git repository for the generated HTML as well
as the source Markdown, so I can do a git diff after a build and eyeball
changes.
## Charset in Both HTTP and HTML

As noted above, the output HTML now has UTF-8 characters, rather than using
ASCII representations like &mdash;.
This could be a problem if your web server doesn't correctly declare the
Content-Type. I checked with curl:
```
$ curl --head http://www.oilshell.org/
HTTP/1.1 200 OK
...
Content-Type: text/html
```
I remembered that the default charset for HTTP is ISO-8859-1, not UTF-8.
Luckily, my HTML boilerplate already declared UTF-8. If you "View Source",
you'll see this line in the <head> of this document:
```
<meta charset=utf-8>
```
So I didn't need to reconfigure my web server. When there's no encoding in the
HTTP Content-Type header, the browser will use the HTML encoding.
In summary, if you use markdown.pl, I recommend switching to CommonMark,
but be aware of the encoding you declare in both HTTP and HTML.
## cmark Uses re2c, AddressSanitizer, and AFL

I haven't yet looked deeply into the cmark implementation, but I see three things I like:

- It uses re2c to generate its scanners (lexer).
- It's tested with AddressSanitizer, which detects memory errors at runtime.
- It's fuzzed with AFL (American Fuzzy Lop).
In summary, these are exactly the tools you should use if you're writing a parser in C that needs to be safe against adversarial input.
Fundamentally, parsers have a larger state space than most code you write. It's impossible to reason about every case by hand, so you need tools like these to find the cases you'd never think of.
Another technique I've wanted to explore, but haven't yet, is property-based testing. As far as I understand, it's related to and complementary to fuzzing.
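For a flavor of what that might look like, here's a sketch using the hypothesis library (my example, not something from cmark or Oil): it asserts that the cmark binary exits cleanly and emits valid UTF-8 for arbitrary input.

```python
# A property-based test sketch.  Assumes the `cmark` binary is on PATH
# and the `hypothesis` package is installed.
import subprocess

from hypothesis import given, settings
from hypothesis import strategies as st

@settings(max_examples=100, deadline=None)  # subprocesses are slow; disable per-example deadline
@given(st.text())
def test_cmark_handles_any_input(md):
    """Property: cmark exits cleanly and emits valid UTF-8 for arbitrary input."""
    proc = subprocess.run(['cmark'], input=md.encode('utf-8'),
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    assert proc.returncode == 0
    proc.stdout.decode('utf-8')   # raises UnicodeDecodeError if the output isn't UTF-8
```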
I had a great experience with CommonMark, and I'm impressed by its thoroughness. I created oilshell.org/site.html to acknowledge it and all the other projects I depend on.