Little progress on the little machine

DeWitt Clinton
May 2006

Three days later and I have made only a small amount of progress writing a prototype for the little machine.

The first setback was a minor power blackout in my neighborhood. It was not even remotely as fun as the three-day blackout we had a few years ago in NYC. But it did keep me off the computer for several hours that evening. We walked around the neighborhood a bit. (Link goes to a short film my neighbor made about the Upper Haight.)

The second factor was an email exchange I had with Joe Gregorio. Among other things, Joe brought up the book Managing Gigabytes. Fantastic book. So good in fact that I was intentionally trying to avoid referencing it so that I could focus on simple prototyping and not production techniques. But Joe's comment prompted me to crack it open again and I spent a night thinking about what would be necessary to do an indexed Atom store right. Managing Gigabytes really is a valuable text -- highly recommended for anyone seriously interested in search programming.

On a related note, I picked up the second edition of Code Complete. I haven't done more than browse through it but it looks like an even better book than the first edition. Which is impressive because the first edition is probably the book I'd recommend to a programmer if I could only recommend one.

Getting further off-topic: As I looked up the URLs for those books I was reminded how clunky the Amazon.com website is for simple searches. (Disclosure: Amazon is the parent company of my employer, A9.com. As always on Unto.net I am speaking for myself, not for them, though grains of salt should be applied to taste.) The frustrating part is that the underlying product search technology at Amazon is both fast and powerful. And developed at A9.com, btw. We're hiring, in case you're interested in a kick-ass job.

Overall, Amazon and A9 could do a better job at exposing the core product search functionality through web services. The Amazon E-Commerce Service is capable and groundbreaking but (in my opinion) it reflects too many of the underlying quirks and legacies that come from years of building the world's biggest product catalog. We are still a few technical hurdles away from building out an official OpenSearch-based interface into Amazon product search. In the meantime you could try the unofficial wrapper I wrote at aws.unto.net. It still needs to be upgraded to 1.1, but it should work. (Disclosure 2: I put my associates ID in there to help pay the hosting costs.)

Back on-topic: The third event slowing down the "little machine" prototype is perhaps the most frustrating -- lack of Atom 1.0 support in WordPress. The idea behind the little machine is to use plain text files that contain Atom data as the canonical store for a high-speed index. I started writing a parser and indexer using Unto.net's Atom feed as produced by WordPress. As I quickly learned, the file I had exported contained Atom 0.3 data rather than Atom 1.0. Atom 0.3 was an interim draft intended to roughly outline the Atom Syndication Format. A good draft -- good enough that most aggregators can read it and most syndicators can write it. But it isn't complete and was never intended to become a standard on its own.
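
A minimal sketch of how a parser might tell the two versions apart before indexing anything, assuming the check is done on the namespace of the root feed element. Atom 0.3 used the namespace http://purl.org/atom/ns# and Atom 1.0 uses http://www.w3.org/2005/Atom. The function name and the two sample feeds are made up for illustration; this is not the little machine's actual parser.

```python
import xml.etree.ElementTree as ET

ATOM_03_NS = "http://purl.org/atom/ns#"      # interim draft namespace
ATOM_10_NS = "http://www.w3.org/2005/Atom"   # Atom 1.0 namespace


def atom_version(xml_text):
    """Return "0.3", "1.0", or None based on the root element's namespace.

    Hypothetical helper: only <feed> documents are recognized here.
    """
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified tags as "{namespace}localname".
    if root.tag == "{%s}feed" % ATOM_03_NS:
        return "0.3"
    if root.tag == "{%s}feed" % ATOM_10_NS:
        return "1.0"
    return None


# Made-up sample documents, trimmed down to the root element and a title.
old_feed = '<feed version="0.3" xmlns="http://purl.org/atom/ns#"><title>Unto.net</title></feed>'
new_feed = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Unto.net</title></feed>'

print(atom_version(old_feed))
print(atom_version(new_feed))
```

Checking the namespace rather than the version attribute seems like the safer choice, since Atom 1.0 feeds carry no version attribute at all.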

Atom 1.0, on the other hand, is now an officially accepted standard and it should be supported by all feed readers. Moreover it should now be the default output format for whatever syndicatable content you develop. You can down-convert Atom 1.0 into RSS or just about any other format you need. However, you cannot completely up-convert RSS into Atom 1.0, though some services will do their best to do so.
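
To make the down-conversion concrete, here is a rough sketch that maps a single Atom 1.0 entry onto an RSS 2.0 item. The field mapping (id to guid, updated to pubDate, summary to description) is one reasonable choice and not the only one. The function name, the sample entry, its URL, and its timestamp are all invented for illustration, and the date handling assumes a UTC timestamp with a trailing "Z" and no fractional seconds.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

ATOM = "{http://www.w3.org/2005/Atom}"


def atom_entry_to_rss_item(entry):
    """Build an RSS 2.0 <item> element from an Atom 1.0 <entry> element."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = entry.findtext(ATOM + "title", default="")

    # Atom links are attributes; a link with no rel counts as "alternate".
    for link in entry.findall(ATOM + "link"):
        if link.get("rel", "alternate") == "alternate":
            ET.SubElement(item, "link").text = link.get("href")
            break

    # The Atom id is an identifier, not necessarily a URL you can visit.
    guid = ET.SubElement(item, "guid", isPermaLink="false")
    guid.text = entry.findtext(ATOM + "id", default="")

    # Assumption: "YYYY-MM-DDTHH:MM:SSZ" only. RSS wants an RFC 822 style date.
    updated = entry.findtext(ATOM + "updated")
    if updated:
        moment = datetime.strptime(updated, "%Y-%m-%dT%H:%M:%SZ")
        moment = moment.replace(tzinfo=timezone.utc)
        ET.SubElement(item, "pubDate").text = format_datetime(moment, usegmt=True)

    # The summary's type attribute has no RSS equivalent and is dropped here.
    summary = entry.find(ATOM + "summary")
    if summary is not None:
        ET.SubElement(item, "description").text = summary.text or ""
    return item


# Made-up entry; the URL, id, and timestamp are placeholders.
sample = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Little progress on the little machine</title>
  <link href="http://example.org/little-machine"/>
  <id>tag:example.org,2006:little-machine</id>
  <updated>2006-05-04T12:00:00Z</updated>
  <summary type="text">Only a small amount of progress.</summary>
</entry>"""

rss_item = atom_entry_to_rss_item(ET.fromstring(sample))
print(ET.tostring(rss_item, encoding="unicode"))
```

Note what gets thrown away on the way down: the distinction between the id and the link, and the declared type of the summary. That lost information is exactly what a converter going the other direction would have to guess at.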

The reason to use Atom 1.0 as an output format goes far beyond simple conversion concerns. Unlike RSS, Atom 1.0 is designed to be unambiguous and meaningful when read by a machine. RSS is a remarkable achievement -- responsible more than anything else for the explosion of syndicated content on the web. But RSS was always best when used to move opaque data around -- i.e., encoded HTML intended only to be displayed verbatim in a web browser to a human. RSS was never particularly good at structuring the data so that a machine could do anything useful with it. Atom, on the other hand, strikes the balance between being something easy to produce/consume and being something structured enough for machines to use.

Another way to say it is this:

If you want humans to be able to read your content anywhere, publish valid RSS. If you want both humans and computers to be able to read your content anywhere, publish valid Atom.
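
One small illustration of what "readable by computers" buys you: Atom 1.0 text constructs such as title and summary carry a type attribute ("text", "html", or "xhtml", defaulting to "text"), so a program knows how to interpret the characters it finds. The sketch below, with hypothetical helper names and made-up sample titles, pulls plain text out of a title in each of the three forms. It is an illustration under those assumptions rather than a complete Atom reader.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

ATOM = "{http://www.w3.org/2005/Atom}"


class _TextOnly(HTMLParser):
    """Collects character data and ignores the tags around it."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(construct):
    """Return the plain text of an Atom text construct, honoring its type."""
    kind = construct.get("type", "text")
    if kind == "text":
        # The characters are literal, even if they look like markup.
        return construct.text or ""
    if kind == "html":
        # The characters are escaped HTML; parse them and keep only the text.
        stripper = _TextOnly()
        stripper.feed(construct.text or "")
        stripper.close()
        return "".join(stripper.parts)
    if kind == "xhtml":
        # The content is real XML (a wrapping XHTML div); gather its text nodes.
        return "".join(construct.itertext()).strip()
    raise ValueError("unexpected text construct type: %r" % kind)


def title_of(fragment):
    """Hypothetical helper: wrap a <title> fragment in an entry and parse it."""
    entry = ET.fromstring(
        '<entry xmlns="http://www.w3.org/2005/Atom">%s</entry>' % fragment
    )
    return entry.find(ATOM + "title")


as_text = title_of('<title type="text">Hello &lt;b&gt;world&lt;/b&gt;</title>')
as_html = title_of('<title type="html">Hello &lt;b&gt;world&lt;/b&gt;</title>')
as_xhtml = title_of(
    '<title type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">'
    'Hello <b>world</b></div></title>'
)

for title in (as_text, as_html, as_xhtml):
    print(plain_text(title))
```

The first two samples contain the same escaped characters, yet the declared type tells the program whether the angle brackets are meant literally or as markup. RSS gives a consumer no comparable signal, which is why an indexer has to guess.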

I am now trying to hack together a working Atom 1.0 output from WordPress, thanks mostly to work done by volunteers. Until I get this figured out, the rest of the prototype is on hold...

(A cool thing I just noticed via Dare is that Word 2007 can publish to blogs using the Atom Pub API. I don't even use Windows and I think that's hot. Microsoft "gets it" today more than most people realize.)