A New Project, Part 6

DeWitt Clinton
May 2005

sticky, part 6

[This is a continuation of part 5 of a series of articles on transparently building a new web application.]

I'll be the first to admit that I don't fully understand REST. I've read and reread Fielding's dissertation. I've read most of the top web search results on REST. But still there are questions and ambiguities. In fact, the more I read, the more I'm convinced that nobody really understands REST and how to implement it in a precise way. That's okay -- services like Flickr and Amazon Web Services are still using REST-like concepts with a clear benefit for users. But the exact semantic meaning of certain concepts -- even ostensibly well defined ones like POST and PUT -- seems open to multiple interpretations. Yet I push forward, because REST feels right, and certainly avoids some of the issues that I've experienced in building RPC-based services in the past.

Picture this -- say we have a very simple notion of a user. The user has a username and a name. We decide to represent that user with a clear XML document (ignoring schemas and namespaces and all of that for the moment):

<?xml version="1.0" encoding="UTF-8"?>

<user>
  <username>dewitt</username>
  <name>DeWitt Clinton</name>
</user>

If we designed a web service that retrieved users, we could give clients an API that allowed them to make a normal HTTP GET on a normal URL (such as /users/dewitt).

And this URL would return exactly the XML document we described above. Moreover, if the client instead sent the document above using an HTTP PUT request, then we would replace the version on the server with their copy. Likewise for an HTTP DELETE request -- we would remove our copy. Now, there is a little more to it (see HttpMethods on the RestWiki) but that is the basic idea.
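As a rough sketch of that basic idea, the mapping from HTTP methods to operations on a resource might look like the following in Python. The `ResourceStore` class and its `handle` method are hypothetical names invented for illustration, an in-memory dict stands in for whatever the server really uses, and the status codes are one reasonable choice rather than the only correct one.

```python
class ResourceStore:
    """Toy in-memory store mapping URL paths to documents (illustration only)."""

    def __init__(self):
        self._docs = {}

    def handle(self, method, path, body=None):
        """Return a (status, body) pair for a GET, PUT, or DELETE request."""
        if method == "GET":
            if path in self._docs:
                return 200, self._docs[path]
            return 404, None
        if method == "PUT":
            # PUT replaces whatever is stored at this URL with the client's copy.
            created = path not in self._docs
            self._docs[path] = body
            return (201 if created else 204), None
        if method == "DELETE":
            if path not in self._docs:
                return 404, None
            del self._docs[path]
            return 204, None
        # Anything else (POST, for example) is not handled by this sketch.
        return 405, None
```

A client that PUTs a document to /users/dewitt and then GETs the same URL gets back exactly what it sent, and a DELETE makes the next GET return 404.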

If it were all that simple, you could imagine that instead of having some fancy data persistence layer, you could just use the filesystem. In fact, if that filesystem happened to be within your standard "htdocs" directory under your web server, then any old web server could handle the GET request. In other words, a REST-ful GET request of a static piece of data is no different than serving a file off the filesystem. This is something we're able to do very quickly and very easily with any web server. Even better, all the usual caching and acceleration mechanisms (from browser caching to squid to Akamai) work as advertised when the proper HTTP cache headers are set. It is a plain-old web query -- something that the infrastructure of the Internet has become rather efficient at.
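To show how little is involved, here is a minimal Python sketch of what a web server does when it maps a GET on a URL to a file under its document root. The function name `resolve_static` is made up for this example, and a real server does far more (content types, cache headers, directory indexes).

```python
from pathlib import Path


def resolve_static(htdocs, url_path):
    """Map a URL path to a file under htdocs.

    Returns a Path, or None if the file is missing or the path
    would escape the document root.
    """
    root = Path(htdocs).resolve()
    candidate = (root / url_path.lstrip("/")).resolve()
    # Refuse anything that climbs out of the document root (e.g. via "..").
    if root not in candidate.parents:
        return None
    return candidate if candidate.is_file() else None
```

With a file stored at htdocs/users/dewitt, a GET on /users/dewitt resolves to that file, and the server simply streams its contents back.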

And if, instead of just a vanilla web server, we were using a WebDAV enabled web server, then we could even handle the PUTs and DELETEs without doing anything special. (WebDAV even makes use of HTTP LOCK and UNLOCK if we wanted to manage some of the concurrency issues.) This is very cool, and it means that the rudiments of REST are already well supported by the tools.

All of this leads one to start dreaming about ways to avoid writing a web service at all. I mean, if you can just store your user data under htdocs/users/dewitt/ and store those notes under htdocs/users/dewitt/notes/, etc., etc, why would one even bother doing anything other than just WebDAV? A filesystem would be fast -- faster than we need -- and the application support is already there.

So what's missing? For starters, we'd need to add validation. Each of those incoming PUT requests had better be valid users and notes, because you are going to be serving them back out again, and the client expects good data. Plus, how would you handle POST events, such as sending and approving invitations? And we'd need to add a RESTful way of doing searches, as there is nothing inherent in DAV that would work there. And our permission model doesn't overlap exactly with HTTP AUTH. So as nice an idea as it was on the surface, we're not going to be able to use stock WebDAV. We're clearly going to require application logic that handles those, and many other, details.
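The validation step alone is easy to picture. The following Python sketch checks an incoming user document against the simple username-plus-name shape described earlier. The function name `validate_user` and the exact rules (both fields required and non-empty) are assumptions for illustration, not a schema this project has settled on.

```python
import xml.etree.ElementTree as ET


def validate_user(xml_text):
    """Return (username, name) for a valid user document, else raise ValueError."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    if root.tag != "user":
        raise ValueError("root element must be <user>")
    fields = {}
    for tag in ("username", "name"):
        text = root.findtext(tag)
        # Assumed rule: each field must be present and non-blank.
        if text is None or not text.strip():
            raise ValueError(f"missing or empty <{tag}>")
        fields[tag] = text.strip()
    return fields["username"], fields["name"]
```

A PUT handler would call something like this before writing anything to disk, and answer with a 4xx response when it raises.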

But perhaps we can still leverage the filesystem behind the web application's interface. The data certainly seems hierarchical on the surface, insofar as users have notes, groups have users, etc., etc. If the data itself is not relational, then don't we benefit from avoiding the overhead of a relational database? First, let us consider how we will need to access that data.

(1) Via unique key, in a hierarchical fashion. E.g., /users/dewitt
(2) Via value. E.g., /users/dewitt/notes/tag/todo
(3) Via relationship. E.g., /users/dewitt/groups/ or /groups/a9

There are probably more, but each of those three is indicative of particular storage demands. For instance, the first, access by unique key (1), can be largely solved with a hierarchical file system, though it is admittedly complicated a bit when a single entity appears in more than one place. (Maintaining the integrity of aliases/symlinks is hard when the target of a link has no back reference.)

Whereas access by value (2) could be efficiently implemented using an attributed file store of some sort. Or another way to phrase that is (2) can be done if you are willing to traverse the hierarchy, read each file in turn, and look for the matching values. Since you obviously aren't willing to do that, you need to be able to maintain an index of values each time an entity is added to or removed from the system. This however is still a tricky problem, and I expect the most interesting work in filesystems and search to be done in this area over the next few years.
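For a sense of what maintaining such an index involves, here is a small Python sketch of an in-memory inverted index from tag to note path. The `TagIndex` class and the note paths used with it are hypothetical, and the hard parts (keeping the index consistent with the files on disk, and surviving restarts) are deliberately left out.

```python
from collections import defaultdict


class TagIndex:
    """Inverted index from tag to note paths, updated on every add or remove."""

    def __init__(self):
        self._by_tag = defaultdict(set)   # tag -> set of note paths
        self._tags_of = {}                # note path -> set of its tags

    def add(self, path, tags):
        self.remove(path)                 # re-adding a note replaces old entries
        tags = set(tags)
        self._tags_of[path] = tags
        for tag in tags:
            self._by_tag[tag].add(path)

    def remove(self, path):
        for tag in self._tags_of.pop(path, ()):
            self._by_tag[tag].discard(path)
            if not self._by_tag[tag]:
                del self._by_tag[tag]     # drop tags that no note uses anymore

    def lookup(self, tag):
        """Return the sorted note paths carrying this tag, without scanning files."""
        return sorted(self._by_tag.get(tag, ()))
```

A request for /users/dewitt/notes/tag/todo then becomes a single `lookup("todo")` instead of a walk over every note file.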

Lastly (3), which gets to the heart of the matter. Should I be able to ask for a list of groups that a particular user belongs to? Certainly. Should I be able to ask for all the users that belong to a group? Of course. Should I be able to ask for all the notes that have been written by users of a group? Well... That gets a bit harder, doesn't it? It's still possible via traversal of a tree, so sure. How about getting a list of all the notes that a particular user has permission to read, either because the note is public, because the note is visible within a group that the user belongs to, or because the user wrote it? That's a mouthful, but again yes, that's necessary functionality from the perspective of the user interface. And it's also relational functionality, and while it can be implemented by walking files and building up in-memory structures (well, anything can), you start needing some of the functionality available in relational databases to make it easy and efficient to program.
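That last question is where a relational database earns its keep. As a sketch only, here is how it could be expressed with Python's built-in sqlite3 module. The table names, column names, and the three visibility values are all assumptions made up for this example, not the schema the project will actually use.

```python
import sqlite3

# Hypothetical schema: who belongs to which group, and who can see each note.
SCHEMA = """
CREATE TABLE memberships (username TEXT, group_name TEXT);
CREATE TABLE notes (
    id INTEGER PRIMARY KEY,
    author TEXT,
    visibility TEXT,      -- 'public', 'group', or 'private'
    group_name TEXT       -- set only when visibility = 'group'
);
"""

# A note is readable if it is public, if the user wrote it, or if it is
# visible to a group that the user belongs to.
READABLE_NOTES = """
SELECT DISTINCT n.id
FROM notes AS n
LEFT JOIN memberships AS m
       ON m.group_name = n.group_name AND m.username = :user
WHERE n.visibility = 'public'
   OR n.author = :user
   OR (n.visibility = 'group' AND m.username IS NOT NULL)
ORDER BY n.id
"""


def readable_note_ids(conn, user):
    """Return the ids of every note the given user is allowed to read."""
    return [row[0] for row in conn.execute(READABLE_NOTES, {"user": user})]
```

The whole permission rule fits in one declarative query, whereas the file-walking version would need to load memberships and note metadata into memory and join them by hand.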

But even if you need a relational database for some data, perhaps it is possible to create a hybrid approach in which parts of the data are persisted to disk in a hierarchical filesystem, and other parts are stored in the database. In reality that's happening anyway -- log files are written directly and linearly to disk for example. But is it worth the overhead to write two data access layers -- one for disk, one for db? In my opinion no, probably not right now.

Part of the reason I'm going over this in such detail is that I wanted to make it absolutely clear that the decision to use a relational database was a considered one -- not a predetermined one. In fact, I'm inclined not to use a DB whenever I can help it. My train of thought usually goes -- can I get away with this on a filesystem? No? Well then how about inside a key/value "db" like Berkeley DB? No? Fine. I'll use MySQL. (If that train of thought progresses and it gets to the point where I start saying "fine, I'll use Oracle," then my mind starts wandering to other more interesting projects...)

And sorry about missing the post earlier today -- it's been a little hectic, but I'm happy to be able to dedicate at least an hour to this each day. And again, feel free to comment on unto.net -- I noticed that people seem more comfortable emailing me privately. But the comments are the most valuable part if you ask me.