Exporting the contents of a Blogger blog is easy: just click a button on the administrator’s site to download an XML file with a very complete representation of the blog contents and its configuration. But making use of this exported data for any purpose other than reimporting it into Blogger is difficult, which may be why the export functionality is called backup and not export. This is a bit sad but, with a bit of effort, one can extract the contents of the XML backup into a more manageable form.

In this post, I will describe the process I followed to turn the contents of a Blogger XML backup file into a collection of standalone .html pages, one per post, with Front Matter in them to describe their metadata. The results of the conversion can be directly served by Jekyll as a blog.

The conversion

Outline

The conversion process consists of three steps, which are necessary due to my use of XSLT:

  1. Take the Blogger XML file and process all posts within it. Output a single text file, with delimiters for each post, that contains the verbatim representation of each post’s HTML contents with Front Matter.

  2. With a simple script, chunk the contents of the text file to split them into independent posts.

  3. Optionally convert the individual posts from HTML to Markdown. I opted not to do this for now: leaving the original HTML in the Blogger posts untouched is good enough for an initial conversion. This is suboptimal because the HTML generated by Blogger is not particularly nice, and even if you typed it directly, Blogger may have mangled it (particularly <pre> tags that had indentation). The actual conversion is easy (see pandoc or html2text), but polishing the resulting documents to be “clean” markup-wise would involve much more effort than I am willing to spend for now. I’ll probably convert select articles to Markdown at a later stage.

From Blogger XML to a text file with all posts

The Blogger XML export is an XML file that uses the Atom schema. The file includes a good representation of the whole blog, including the blog’s metadata, the blog’s templates, and the blog’s actual contents—posts and comments. For our purposes, we can ignore everything that is not a post.
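To make "everything that is not a post" concrete, here is a minimal Python sketch that picks the post entries out of an export. It is not the XSLT approach used in this article, only an illustration of the selection rule. It assumes that Blogger tags every entry with a "kind" category whose term ends in `#post`, `#comment`, `#settings` and so on; check the scheme and term URIs against your own backup file before relying on them. The helper names `is_post` and `iter_posts` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by ElementTree for Atom elements.
ATOM = "{http://www.w3.org/2005/Atom}"

# Assumption: these are the URIs Blogger uses to mark the kind of an entry.
KIND_SCHEME = "http://schemas.google.com/g/2005#kind"
POST_KIND = "http://schemas.google.com/blogger/2008/kind#post"


def is_post(entry):
    """Return True if an Atom <entry> carries the 'post' kind category."""
    for category in entry.findall(ATOM + "category"):
        if (category.get("scheme") == KIND_SCHEME
                and category.get("term") == POST_KIND):
            return True
    return False


def iter_posts(xml_text):
    """Yield (title, published) for each post entry in the export."""
    root = ET.fromstring(xml_text)
    for entry in root.findall(ATOM + "entry"):
        if is_post(entry):
            title = entry.findtext(ATOM + "title", default="")
            published = entry.findtext(ATOM + "published", default="")
            yield title, published
```

Entries for comments, templates and settings fall through `is_post` and are skipped, which is the same filtering the stylesheet has to perform.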

To do the conversion, we rely on an XSLT stylesheet and on the small-ish xsltproc processor. The stylesheet will descend into all entries that match the post properties, collect their metadata as variables for later formatting, and then write out the Front Matter for the post followed by the unquoted HTML contents.

Unfortunately, xsltproc has a limitation: it is an XSLT 1.0 processor, and XSLT 1.0 has no standard support for multi-file (chunked) output. As a result, we have to output everything as a single text document and then do separate post-processing. (I’m not sure this is completely true though: the DocBook XSL does chunking, and if I recall correctly does it just fine with xsltproc…)

To work around this, each post in the output will be delimited by unique start/end strings. These strings will let us do post-processing later on to actually split the output into the final posts.
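To give an idea of what the delimited output could look like, here is a small Python sketch that renders one post in that shape. It stands in for the stylesheet only to illustrate the format. The delimiter strings, the `render_post` helper and the choice of Front Matter keys (`title` and `date`) are all assumptions made for this example.

```python
import json

# Hypothetical delimiters; anything unlikely to appear in a post body works.
POST_START = "@@@ POST START: {name} @@@"
POST_END = "@@@ POST END @@@"


def render_post(name, title, date, html):
    """Return one post as delimited text: start marker, Front Matter, HTML, end marker."""
    front_matter = "\n".join([
        "---",
        # json.dumps yields a double-quoted string, which is also valid YAML
        # and keeps titles containing colons or quotes from breaking the header.
        "title: " + json.dumps(title),
        "date: " + date,
        "---",
    ])
    lines = [POST_START.format(name=name), front_matter, html, POST_END]
    return "\n".join(lines) + "\n"
```

Carrying the target file name in the start marker means the later splitting step does not need to parse the Front Matter to decide where each post goes.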

TODO

From the text file to individual HTML files per post

TODO
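Until the real script is written up here, the following Python sketch shows one way the chunking could work. It assumes hypothetical delimiters of the form `@@@ POST START: <file name> @@@` and `@@@ POST END @@@`, each on a line of its own; adapt the pattern to whatever strings the stylesheet actually emits. The function names are also made up for this example.

```python
import os
import re

# Assumed delimiter format: the start marker names the output file.
POST_RE = re.compile(
    r"^@@@ POST START: (?P<name>[^\n]+?) @@@\n"
    r"(?P<body>.*?)"
    r"^@@@ POST END @@@$",
    re.DOTALL | re.MULTILINE,
)


def split_posts(text):
    """Return a list of (file name, contents) pairs found in the delimited text."""
    return [(m.group("name"), m.group("body")) for m in POST_RE.finditer(text)]


def write_posts(text, out_dir):
    """Write each post to its own file under out_dir and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, body in split_posts(text):
        # basename() keeps a marker that carries a path from escaping out_dir.
        path = os.path.join(out_dir, os.path.basename(name))
        with open(path, "w", encoding="utf-8") as output:
            output.write(body)
        written.append(path)
    return written
```

Anything outside a start/end pair is ignored, so stray whitespace between posts in the intermediate file does no harm.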

From HTML to Markdown

This is not an avenue I have pursued yet.
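For whenever that happens, here is an untested-on-real-posts Python sketch of how the mechanical part might go. The one thing to watch out for is that the Front Matter must not be fed through the converter, so the sketch splits it off first and converts only the HTML body. It assumes the post starts with a Front Matter block fenced by `---` lines and that the pandoc binary is on the PATH; the function names are made up for this example.

```python
import subprocess


def split_front_matter(text):
    """Split a post into (front_matter, body); front_matter keeps its '---' fences."""
    if text.startswith("---\n"):
        # Start at index 3 so that an empty block ("---\n---\n") is also found.
        end = text.find("\n---\n", 3)
        if end != -1:
            cut = end + len("\n---\n")
            return text[:cut], text[cut:]
    return "", text


def html_to_markdown(html):
    """Convert an HTML fragment to Markdown by piping it through pandoc."""
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=markdown"],
        input=html, capture_output=True, text=True, check=True,
    )
    return result.stdout


def convert_post(text, convert=html_to_markdown):
    """Return the post with its Front Matter untouched and its body converted."""
    front_matter, body = split_front_matter(text)
    return front_matter + convert(body)
```

The `convert` parameter makes it easy to swap pandoc for another converter such as html2text. Either way, the output would still need the manual polishing mentioned earlier.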