Update on the little machine

DeWitt Clinton
May 2006

Bit by bit, I have made progress on the demo/prototype for the little machine. The "little machine" is an experiment in how one could a) use Atom 1.0 files as a canonical store for content, b) offer read/write access via the Atom Publishing Protocol (APP), c) build a real-time in-memory index of that content, and d) surface that index via Atom+OpenSearch.

Building a simple demo of something like this is tricky. Not so much because it is hard to develop, but because it is hard to keep perspective and not try to achieve too much all at once. I found myself constantly wandering off and getting wrapped up in solving problems that I thought I would face somewhere down the road. But I had no time to squander; I was only able to spend an hour or two a night at best on the project.

My first version of "tlm" was a command-line Perl application. The script would first scan a directory for XML files, build a queue of potentially indexable files, and process that queue sequentially. Well-formed and valid Atom 1.0 feeds were parsed with the XML::LibXML library and stored as a DOM tree. Each "doc" id could be resolved via a lookup table to a reference to an individual in-memory DOM node. The indexer used XPath expressions to extract elements of each Atom type (title, subtitle, summary, content, link, etc.). Each type of Atom element had its own separate index.
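As a rough illustration of that extraction step, here is a minimal sketch. It is in Python with the standard library's ElementTree rather than Perl and XML::LibXML, so it is not the tlm code; the sample feed, the TEXT_FIELDS list, and the extract_fields name are all made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
# Hypothetical subset of the Atom text elements to index.
TEXT_FIELDS = ("title", "summary", "content")

# Made-up sample feed, inlined so the sketch is self-contained.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <id>urn:example:feed</id>
  <updated>2006-05-01T00:00:00Z</updated>
  <entry>
    <title>First post</title>
    <id>urn:example:1</id>
    <updated>2006-05-01T00:00:00Z</updated>
    <link rel="alternate" href="http://example.org/1"/>
    <summary>Indexing Atom entries</summary>
  </entry>
  <entry>
    <title>Second post</title>
    <id>urn:example:2</id>
    <updated>2006-05-02T00:00:00Z</updated>
    <link rel="alternate" href="http://example.org/2"/>
    <content type="text">Searching the index</content>
  </entry>
</feed>"""


def extract_fields(feed_xml):
    """Return (docs, fields): a doc id -> entry node lookup table,
    and one separate store per Atom element type."""
    root = ET.fromstring(feed_xml)
    docs = {}
    fields = defaultdict(dict)
    for doc_id, entry in enumerate(root.findall("atom:entry", ATOM_NS)):
        docs[doc_id] = entry  # keep a reference to the in-memory node
        for name in TEXT_FIELDS:
            node = entry.find(f"atom:{name}", ATOM_NS)
            if node is not None:
                # itertext() also collects text nested inside xhtml content
                fields[name][doc_id] = "".join(node.itertext()).strip()
        # Links are kept "as is" (here: just the href attribute).
        fields["link"][doc_id] = [
            link.get("href") for link in entry.findall("atom:link", ATOM_NS)
        ]
    return docs, fields


docs, fields = extract_fields(SAMPLE_FEED)
```

After this runs, fields["title"] maps doc ids to title text, and docs[0] resolves back to the first entry's node.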

The method of indexing depended on the type of each field. For example, atom:link elements were stored "as is", whereas Atom "text constructs" (text, html, xhtml) were stripped of markup and entities, tokenized, and stemmed before being added to a term-doc reverse index table (what the search literature usually calls an inverted index). Additionally, each field was stored as a normalized term vector.
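The text-construct pipeline can be sketched roughly as follows. This is Python, not the original Perl, and toy_stem is a deliberately crude stand-in for a real stemmer such as Snowball; every function name and the two sample documents are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict
from html import unescape


def strip_markup(text):
    """Drop tags with a naive regex, then decode entities."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def toy_stem(token):
    """Crude suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def terms(text):
    """Strip markup, lowercase, tokenize, and stem."""
    tokens = re.findall(r"[a-z0-9]+", strip_markup(text).lower())
    return [toy_stem(t) for t in tokens]


def build_index(docs):
    """Return (inverted, vectors): a term -> doc ids table and a
    unit-length term vector per document."""
    inverted = defaultdict(set)
    vectors = {}
    for doc_id, text in docs.items():
        counts = Counter(terms(text))
        norm = math.sqrt(sum(c * c for c in counts.values()))
        vectors[doc_id] = {t: c / norm for t, c in counts.items()} if norm else {}
        for term in counts:
            inverted[term].add(doc_id)
    return inverted, vectors


# Made-up sample documents.
DOCS = {
    0: "<p>Indexing Atom feeds &amp; entries</p>",
    1: "Searching indexed feeds",
}
inverted, vectors = build_index(DOCS)
```

With these two documents, "Indexing" and "indexed" both reduce to the term "index", so inverted["index"] contains both doc ids.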

The tlm script opened an HTTP listener after indexing the content. Each incoming request spawned a new handler thread. (I eventually removed the threading; Perl's thread support has trouble sharing blessed references between threads. Also, I needed to remind myself this was just a quick demo script, not a production version.) Initially, search used lookups in the term-doc tables and sorted results by TF-IDF scores. However, I found that for my limited corpus (hundreds of documents), ranking by the cosine between the query vector and each document vector worked equally well with less effort.
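To make the two ranking approaches concrete, here is a small Python sketch of both. It is not the tlm code: the three-document corpus is invented, the TF-IDF variant shown (raw term frequency times log(N/df)) is only one common formulation and an assumption about what was used, and the function names are hypothetical.

```python
import math
from collections import Counter

# Made-up corpus of already tokenized and stemmed documents.
DOCS = {
    0: ["atom", "feed", "index"],
    1: ["atom", "search", "search"],
    2: ["perl", "thread"],
}


def unit_vector(tokens):
    """Term-frequency vector scaled to unit length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}


VECTORS = {doc_id: unit_vector(toks) for doc_id, toks in DOCS.items()}

# Term -> {doc id: term frequency} table.
POSTINGS = {}
for doc_id, toks in DOCS.items():
    for term, tf in Counter(toks).items():
        POSTINGS.setdefault(term, {})[doc_id] = tf


def cosine_search(query_tokens):
    """Score every document by its cosine with the query vector."""
    query = unit_vector(query_tokens)
    scores = {}
    for doc_id, vec in VECTORS.items():  # visits the whole corpus
        score = sum(w * vec.get(t, 0.0) for t, w in query.items())
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def tfidf_search(query_tokens):
    """Score only the documents found in the term-doc table."""
    n_docs = len(DOCS)
    scores = Counter()
    for term in query_tokens:
        hits = POSTINGS.get(term, {})
        if not hits:
            continue
        idf = math.log(n_docs / len(hits))
        for doc_id, tf in hits.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, score in scores.most_common() if score > 0]
```

For the query ["atom", "search"], both functions rank document 1 ahead of document 0. The difference is in the work done: cosine_search touches every document vector, while tfidf_search only touches documents that contain a query term.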

The results on the Unto.net Atom 1.0 feed were reasonable. The parsing and indexing phase handled about 20 to 30 documents a second. The bulk of that time was spent stemming and building vectors. Perl isn't optimal here, even after I switched to the XS-based Snowball stemmer and PDL vector libraries. But for a demo it was fine. Search could be performed in roughly 50 ms (vector based), or 5 ms (reverse index based). Of course, the vector-based search method I was using scaled linearly (it computed the cosine against every document), so it wasn't directly viable for a large corpus.

That's where I stopped on my first version of tlm. Next up would have been snippet extraction + term highlighting, a more comprehensive query parser, and APP write support.

(If this sounds complicated to some readers, it's not, really. That part is kindergarten level in the information storage, search and retrieval world. Unfortunately I'm only at the kindergarten level myself.)

I started wondering why I was putting so much time into covering the basics just to build a simple demo. I figured it would make more sense to use one of the "off-the-shelf" indexers instead. I was impressed with the benchmarks I read regarding the Lucene-inspired KinoSearch library. KinoSearch is also Perl-based (well, mostly C/XS with Perl APIs) and looks like a very solid backend for applications like this.

The KinoSearch-backed version of tlm was both easier to write (less code) and significantly faster at indexing. The tlm2 script indexed hundreds of Atom entries per second and performed searches (with snippet extraction and highlighting) in tens of milliseconds.

I've decided not to take the demo any further. I suspect that it could be improved to the point where it could be a back-end for a generic personal feed search with read/write access. However, I did learn what I wanted to: Atom 1.0 is particularly well suited to be a canonical storage format for indexable data.

Next up: a sketch of the real deal. I.e., an outline for a production version of a read/write indexed Atom Store.