The Little Machine


I spent the bulk of the past two weekends trying to migrate the contents of Unto.net's WordPress blog into an installation of the MediaWiki wiki software.

For the most part I succeeded. Not that it was trivial. In order to migrate I had to do the following:



But I persevered and kept moving on. Yet for all of this effort the process was still highly lossy. The old blog comments were left behind. Much of the original XHTML data was stripped in conversion. Old revisions were never to be seen again.

And there was still more to do:



After a solid 10 or so total hours of work I finally did an import and there it was -- several years' worth of Unto.net blog content inside an editable wiki. Beautiful!

For a few minutes, anyway.

I then noticed that I had made a mistake somewhere in my conversion scripts. I needed to back out the import, make a fix, and import again. But how to bulk delete? Was there a maintenance script for that? What tables needed to be updated if I modified the MySQL db by hand? What dependencies were there?
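
For the morbidly curious, the by-hand version would have looked something like the rough sketch below. Python and MySQLdb are my choice here purely for illustration; the page, revision, and text tables come from MediaWiki's documented schema, but the connection details and the namespace filter are made up, and the link and search tables that also reference these rows are conveniently ignored -- which is exactly the dependency problem that had me worried.

```python
# Rough sketch only: bulk-deleting imported pages by hand, assuming the
# MediaWiki page/revision/text tables with no table prefix. Connection
# details and the namespace filter are illustrative, not real.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="wiki", passwd="secret", db="wikidb")
cur = conn.cursor()

# Find the pages that were imported (filtering on the main namespace is a
# stand-in; a real cleanup would key off something more reliable).
cur.execute("SELECT page_id FROM page WHERE page_namespace = 0")
page_ids = [row[0] for row in cur.fetchall()]

for page_id in page_ids:
    # Delete the text blobs behind each revision, then the revisions, then the page.
    cur.execute("""DELETE text FROM text
                   JOIN revision ON revision.rev_text_id = text.old_id
                   WHERE revision.rev_page = %s""", (page_id,))
    cur.execute("DELETE FROM revision WHERE rev_page = %s", (page_id,))
    cur.execute("DELETE FROM page WHERE page_id = %s", (page_id,))

conn.commit()
```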

Only then did it hit me that perhaps I didn't really want to be running MediaWiki after all. Which isn't to say that it isn't well written and well designed software. MediaWiki seems to be high quality all around.

But what did I really want out of a wiki? Convenient inter-article linking? Interwiki linking? The friendly MediaWiki syntax?

Or was I simply looking for software that emphasized that nothing is permanent, that anything and everything can and should be revisited and revised later?

Seems like an awfully large amount of work for some hyperlinks and an emphasis on the ephemeral.

So I needed to make a batch data fix; yet I wasn't exactly sure how to modify the underlying data in the MediaWiki database. That's by design -- the data is intended to be opaque to users. Unfortunately it's also opaque to administrators. It is reverse-engineerable of course and the schemas are all documented. But is playing with the underlying database data a good idea? Years of practice have taught me that writing a hack around something is often a sign of trying to solve the wrong problem.

So what is the underlying problem? The problem is that I am migrating data from one format to another in the first place. And why should I? I already have data in a standard, reusable, and lossless format -- Atom 1.0. The Atom data has my blog entries in their original marked-up formats (a mix of HTML and XHTML), all of the author and timestamp metadata, and all of the comments going back for years. Why was I running this data through a meat grinder and reassembling it on the other end?
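
As a sanity check on that claim, pointing a parser at the feed shows the metadata really is all there. A quick sketch, assuming Python's feedparser library, a local dump of the feed saved as atom.xml (the filename is illustrative), and entries that carry full content:

```python
# Quick sanity check: everything I need is already in the Atom 1.0 data.
# Assumes the feedparser library and a local dump named atom.xml.
import feedparser

feed = feedparser.parse("atom.xml")

for entry in feed.entries:
    # Each entry already carries its own timestamp, author, and original markup.
    content = entry.content[0]          # the first <content> element
    print(entry.title)
    print("  author: ", entry.get("author", "unknown"))
    print("  updated:", entry.updated)
    print("  type:   ", content.type)   # e.g. "text/html" or "application/xhtml+xml"
    print("  length: ", len(content.value))
```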

Perhaps the worst part was feeling out of touch with the canonical version of the content. Some Unto.net readers will remember when I migrated from Movable Type to Blosxom a few years ago. One of the things I liked best about Blosxom was that it stored the content as plain text files on a normal filesystem. One of the things I didn't like as much was that it had a difficult time handling metadata, so I ended up migrating a year later to WordPress.

Which all leads me (back) to wishing that I had a universal indexed file store for a backend -- particularly one that could be customized to take into account the particularities of the metadata associated with blog-like or wiki-like content. This project (codenamed Houston) is something I've been working on, off and on, for many months. During my time away from work several weeks ago I dug much deeper into the distributed and synchronization sides of building such an application. Unfortunately the fruits of the Houston effort aren't yet ready to be picked.

But what if I built a small piece of software that did one thing well -- read, index, and respond to queries about Atom 1.0 content? This software would run as a daemon and constantly watch a filesystem for incoming Atom data. When an Atom file appeared it would be validated, parsed, and stored directly in memory. The daemon would listen for HTTP GET queries and generate quick responses from the in-memory tables. And those responses would be Atom 1.0 feeds (with OpenSearch extensions).
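
In rough strokes -- and this is only a sketch, with the spool directory, the port, the polling loop, and the lack of real validation all standing in for the real thing -- the core of the little machine might look like this:

```python
# Minimal sketch of the little machine: poll a spool directory for Atom 1.0
# files, parse them into an in-memory table, and answer HTTP GET queries
# from that table. Directory and port names are illustrative; validation,
# OpenSearch output, and error handling are omitted.
import os
import threading
import time
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
from xml.sax.saxutils import escape

ATOM = "{http://www.w3.org/2005/Atom}"
SPOOL_DIR = "atom-spool"      # directory to watch for incoming Atom files
ENTRIES = {}                  # atom:id -> dict of entry fields

def index_file(path):
    """Parse one Atom file and merge its entries into the in-memory table."""
    tree = ET.parse(path)
    for entry in tree.getroot().iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        ENTRIES[entry_id] = {
            # Treats <content> as plain text for simplicity; real XHTML
            # content would need more careful handling.
            "title": entry.findtext(ATOM + "title", default=""),
            "updated": entry.findtext(ATOM + "updated", default=""),
            "content": entry.findtext(ATOM + "content", default=""),
        }

def watch_spool(interval=2.0):
    """Poll the spool directory and (re)index anything new or modified."""
    seen = {}
    while True:
        for name in os.listdir(SPOOL_DIR):
            path = os.path.join(SPOOL_DIR, name)
            mtime = os.path.getmtime(path)
            if seen.get(path) != mtime:
                index_file(path)
                seen[path] = mtime
        time.sleep(interval)

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /?q=wiki returns matching entries as a bare-bones Atom feed
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0].lower()
        hits = [e for e in ENTRIES.values()
                if query in e["title"].lower() or query in e["content"].lower()]
        lines = ['<?xml version="1.0" encoding="utf-8"?>',
                 '<feed xmlns="http://www.w3.org/2005/Atom">']
        for e in hits:
            lines.append("<entry><title>%s</title><updated>%s</updated></entry>"
                         % (escape(e["title"]), escape(e["updated"])))
        lines.append("</feed>")
        payload = "\n".join(lines).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/atom+xml")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    threading.Thread(target=watch_spool, daemon=True).start()
    HTTPServer(("", 8080), QueryHandler).serve_forever()
```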

What if I used this little machine as a backend for a wiki/blog? If I wanted to add a new post, all I would need to do is drop it on the filesystem in Atom 1.0 format. Old revisions would still be scanned and indexed alongside the newest revision -- no need to optimize for a dataset the size of mine. Take it one step further and add support for the PUT, POST, and DELETE operations (a la the Atom Publishing Protocol). No need to worry about state -- if the machine dies, just restart it and wait for the index to rebuild.
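
The write half might look something like the sketch below, loosely in the spirit of the Atom Publishing Protocol -- real APP adds collections, service documents, and proper edit URIs, and the spool directory, port, and member-naming scheme here are again just illustrative:

```python
# Sketch of the write side: POST drops a new Atom entry document into the
# spool directory (where the watcher will pick it up), PUT overwrites an
# existing member, DELETE removes it. Loosely APP-shaped, not actual APP.
import os
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

SPOOL_DIR = "atom-spool"      # same spool directory the indexer polls

class EditHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Create a new member with a server-assigned name.
        length = int(self.headers.get("Content-Length", 0))
        entry_xml = self.rfile.read(length)
        name = "%s.atom" % uuid.uuid4().hex
        with open(os.path.join(SPOOL_DIR, name), "wb") as f:
            f.write(entry_xml)
        self.send_response(201)                  # Created
        self.send_header("Location", "/%s" % name)
        self.end_headers()

    def do_PUT(self):
        # Update a member in place by overwriting its file.
        length = int(self.headers.get("Content-Length", 0))
        target = os.path.join(SPOOL_DIR, os.path.basename(self.path))
        with open(target, "wb") as f:
            f.write(self.rfile.read(length))
        self.send_response(200)
        self.end_headers()

    def do_DELETE(self):
        # Remove a member; the index rebuilds from whatever files remain.
        target = os.path.join(SPOOL_DIR, os.path.basename(self.path))
        if os.path.exists(target):
            os.remove(target)
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8081), EditHandler).serve_forever()
```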

So how long would it take to fully parse and build inbound and outbound link tables and a full-text index of all of the data? On a site like Unto.net -- seconds. Minutes in the extreme case.
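
Building those tables is not much more than a single pass over the parsed entries. A back-of-the-envelope sketch, assuming a map of entry id to HTML content (a flattened view of the in-memory table above); the regexes are crude, but crude is fine at this scale:

```python
# Build outbound links, inbound links, and a full-text index from a map of
# entry id -> HTML content. For a few years of blog posts this is a matter
# of seconds, not minutes.
import re
from collections import defaultdict

def build_indexes(entries):
    outbound = {}                      # entry id -> set of linked URLs
    inbound = defaultdict(set)         # URL -> set of entry ids linking to it
    fulltext = defaultdict(set)        # token -> set of entry ids containing it
    for entry_id, html in entries.items():
        links = set(re.findall(r'href="([^"]+)"', html))
        outbound[entry_id] = links
        for url in links:
            inbound[url].add(entry_id)
        # Strip tags, lowercase, and tokenize for the full-text index.
        text = re.sub(r"<[^>]+>", " ", html).lower()
        for token in re.findall(r"[a-z0-9]+", text):
            fulltext[token].add(entry_id)
    return outbound, inbound, fulltext
```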

This is a far simpler application than what Houston is trying to be. There is no backend for this little machine aside from the filesystem. All indices are managed in memory. There is no RDBMS or external index to worry about. The permanent storage is just flat files, and state can be recreated simply by reparsing the source data itself.

Is this sufficient for a professional-grade blog or wiki engine? Does it scale well? No, of course not. Servers go up and down frequently and without warning, and it is impractical to rebuild huge indices every time. RDBMSs such as MySQL and PostgreSQL can handle powerful SQL query grammars and full-text indexing, and can arrange the data far better than anything I can code by hand. External indexes such as Lucene can offer sophisticated text search relevance. This tiny application won't be partitioned or distributed, it won't support failover or rollover, it won't do very many things at all.

It will however provide fast, reliable, reproducible, and convenient read/search/write access to Atom data by speaking nothing more than "unix filesystem", Atom Syndication Format, Atom Publishing Protocol, and OpenSearch. It will be single-threaded and single-purposed. This may be the simplest possible instance of what Joe Gregorio calls an Atom Store. And it may be even more than I need.
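
On the wire, a query response would just be an ordinary Atom feed with the OpenSearch 1.1 response elements (totalResults, startIndex, itemsPerPage) mixed in. A sketch, with the paging defaults and the shape of the hit records made up for illustration:

```python
# Render a page of search hits as an Atom feed with OpenSearch 1.1 response
# elements. Assumes hits is a list of dicts with "title" and "id" keys, as
# produced by the in-memory table sketched earlier.
from xml.sax.saxutils import escape

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"

def render_results(query, hits, start_index=1, per_page=20):
    page = hits[start_index - 1:start_index - 1 + per_page]
    lines = ['<?xml version="1.0" encoding="utf-8"?>',
             '<feed xmlns="http://www.w3.org/2005/Atom" xmlns:opensearch="%s">'
             % OPENSEARCH_NS,
             "<title>Search results for %s</title>" % escape(query),
             "<opensearch:totalResults>%d</opensearch:totalResults>" % len(hits),
             "<opensearch:startIndex>%d</opensearch:startIndex>" % start_index,
             "<opensearch:itemsPerPage>%d</opensearch:itemsPerPage>" % per_page]
    for hit in page:
        lines.append("<entry><title>%s</title><id>%s</id></entry>"
                     % (escape(hit["title"]), escape(hit["id"])))
    lines.append("</feed>")
    return "\n".join(lines)
```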

I think I'll prototype this little machine after work today.