Converting HTML blog posts to Markdown

I am reviving quite a few things - this blog as well as some open source projects. They desperately need some TLC…

Since this blog has travelled through various hosting options and technologies, I still had quite a few legacy posts formatted in HTML. I’ve taken the plunge and refactored them into much cleaner Markdown syntax.

It was easier than expected. With a couple of good libraries to lean on I wrote a quick Node application to do the dirty work. This aspect of the Node.js community is the same reason I fell in love with Ruby in the first place - the abundance of small, beautifully crafted libraries, gems and packages that each focus on solving one very specific problem elegantly.

We’ll design the app with stream input and output, taking HTML input from stdin and writing the resulting Markdown to stdout. This way we can easily use it in conjunction with other tools, true to the Unix philosophy. First, let’s start with the package.json file. We can create one with npm init and fill in the blanks:

{
  "name": "htmlconvert",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "dependencies": {
    "turndown": "^4.0.1",
    "get-stdin": "^5.0.1",
    "turndown-plugin-gfm": "^1.0.1",
    "gray-matter": "^3.1.1"
  },
  "bin": {
    "htmlconvert": "index.js"
  },
  "author": "Riaan Hanekom",
  "license": "ISC"
}

We make use of these great packages:

get-stdin : gets stdin as a string or buffer
turndown : converts HTML to Markdown using JavaScript
turndown-plugin-gfm : a plugin for turndown that enables GitHub Flavoured Markdown
gray-matter : a library that parses various types of front matter

Note the bin section in the package.json file. This allows us to run npm link and have the convenience of not having to type node index.js every time we want to run the app. If we decided to publish this application as an official npm package, we could also set the preferGlobal flag to true, which in older npm versions warned users who installed the package without the --global flag; newer npm versions have deprecated the flag and no longer warn [further reading].

Here is the index.js file marked for execution in all its (quick and dirty) glory:

#! /usr/bin/env node
const getStdin = require('get-stdin');
const TurndownService = require('turndown');
const turndownPluginGfm = require('turndown-plugin-gfm');
const matter = require('gray-matter');

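// Read all of stdin, then convert it once the whole post is available.
getStdin().then(str => {

  // Override only the options where the defaults are not what I want.
  const turndown = new TurndownService({
    codeBlockStyle: 'fenced',
    linkStyle: 'referenced'
  });

  // Enable GitHub Flavoured Markdown support (tables, strikethrough, ...).
  turndown.use(turndownPluginGfm.gfm);

  // Split the front matter from the HTML body so turndown never sees it.
  const parsed = matter(str);

  // Convert only the body, then re-attach the front matter on output.
  const markdown = turndown.turndown(parsed.content);
  console.log(matter.stringify(markdown, parsed.data));
});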

First we retrieve all the input from stdin. The turndown package is nicely customizable; I’ve only set the two options where my preference differs from the defaults, namely fenced code blocks and reference-style links.
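For the curious, reading all of stdin is not much work even without a helper package. The snippet below is a minimal sketch using only Node’s standard library; it is not how get-stdin is implemented, and the readAll name is made up for this illustration.

```javascript
// Hypothetical helper (not part of get-stdin): collect a readable
// stream into one string using only Node's standard library.
async function readAll(stream) {
  const chunks = [];
  // Readable streams are async iterable, so chunks arrive in order.
  for await (const chunk of stream) {
    // Chunks may be Buffers or strings; normalise everything to Buffers.
    chunks.push(Buffer.from(chunk));
  }
  // Decode once at the end so multi-byte characters that were split
  // across two chunks are still decoded correctly.
  return Buffer.concat(chunks).toString('utf8');
}

// Usage: readAll(process.stdin).then(html => { /* convert html here */ });
```

Decoding only after all the chunks are joined is the one design choice worth noting: decoding chunk by chunk can mangle a character that straddles a chunk boundary.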

We ask the turndown library to use the gfm plugin in order to support GitHub Flavoured Markdown. Unfortunately, turndown currently strips the newlines out of the Jekyll front matter. This issue provides a simple workaround: use the gray-matter library to split off the front matter, convert only the content, and then stitch the two back together on output.
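To make the idea behind the workaround concrete, here is a simplified sketch of separating the front matter from the body using only a regular expression. It is an illustration, not a replacement for gray-matter, which also parses the YAML and supports other delimiters and formats; the splitFrontMatter name is an assumption of mine.

```javascript
// Simplified illustration only: split a post into its raw front-matter
// block and its body. splitFrontMatter is a made-up name, and unlike
// gray-matter this does not parse the YAML at all.
function splitFrontMatter(text) {
  // Front matter must open on the very first line with '---' and be
  // closed by a line that contains only '---'.
  const match = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/.exec(text);
  if (!match) {
    // No front matter found: everything is body.
    return { frontMatter: '', body: text };
  }
  return { frontMatter: match[1], body: text.slice(match[0].length) };
}

const post = '---\ntitle: stufff\n---\n<b>ICanHazBold!</b>\n';
const { frontMatter, body } = splitFrontMatter(post);
console.log(frontMatter); // title: stufff
console.log(body.trim()); // <b>ICanHazBold!</b>
```

Only the body would then be handed to the HTML converter, with the untouched front matter written back in front of the result.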

Now that the input and output mechanisms are handled we can simply write:

echo '<b>ICanHazBold!</b>' | htmlconvert

The output will be:

**ICanHazBold!**

Let’s pipe a sample blog post with front matter in:

printf '%s\n' '---' 'title: stufff' '---' '<b>ICanHazBold!</b>' | htmlconvert

Output:

---
title: stufff
---
**ICanHazBold!**

Beautiful!

We can now convert our posts in bulk by iterating over all the HTML posts:

for file in *.html; do
    htmlconvert < "$file" > "$(basename "$file" .html).md"
done

And then we dutifully proceed with QA on each and every converted post before deleting the original. Unfortunately this code will not handle some of the edge cases that I was coding up in HTML back in 2005. To be honest, I’m not sure whether this was intentionally bad markup or scars received while fighting with WordPress, but inline styles for italic and bold text and self-closing paragraph tags (<p/>) are just some examples. Luckily those cases are few and far between, so I chose to fix them by hand rather than diving into the insanity that is HTML parsing.
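If those cases had been more common, a small pre-pass over the HTML before conversion might have been worth a try. The following is a rough sketch of that idea and not something I ran over the real posts. The normalizeLegacyHtml name and the exact style patterns are assumptions, and regular expressions will trip over nested or unusual markup.

```javascript
// Rough sketch only: regular expressions are a fragile way to edit HTML,
// so this is only suitable for simple, well-behaved cases.
// normalizeLegacyHtml is a hypothetical helper, not part of turndown.
function normalizeLegacyHtml(html) {
  return html
    // HTML parsers ignore the slash on non-void elements, so <p/> acts
    // like an opening <p>; make that explicit.
    .replace(/<p\s*\/>/gi, '<p>')
    // <span style="font-weight: bold">x</span>  ->  <b>x</b>
    .replace(
      /<span\s+style="[^"]*font-weight:\s*bold[^"]*"\s*>([\s\S]*?)<\/span>/gi,
      '<b>$1</b>'
    )
    // <span style="font-style: italic">x</span>  ->  <i>x</i>
    .replace(
      /<span\s+style="[^"]*font-style:\s*italic[^"]*"\s*>([\s\S]*?)<\/span>/gi,
      '<i>$1</i>'
    );
}

const legacy =
  '<p/>Hello <span style="font-weight: bold;">world</span> and ' +
  '<span style="font-style:italic">more</span>';
console.log(normalizeLegacyHtml(legacy));
// <p>Hello <b>world</b> and <i>more</i>
```

A span that is both bold and italic, or spans nested inside one another, would not survive this intact, which is exactly the kind of rabbit hole that made fixing things by hand the saner option.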

Photo by Pankaj Patel on Unsplash
