On RSS and Atom

DeWitt Clinton
July 2006

RSS is great. No, I'll go further than that. RSS, as a representation of an idea, is perhaps the single most influential cultural shift of the post-2001 technical and business community. RSS is the embodiment of the notion of sharing and syndication. Businesses will do well to heed the lessons being taught by people like Dave Winer and Robert Scoble. Users and customers alike want open access to data, and the ideas behind RSS will go a long way toward realizing their needs.

That said, RSS (the format) itself isn't always the answer. I worry that people are sometimes pushing a particular implementation (RSS 2.0) over the ideas behind the technology (content syndication). That's not to say that the marketing message of "adopt RSS, your users will love you" is a bad one. It's not; it certainly helps drive the concepts home in a concrete way that anyone, even the non-technical, can understand.

As important as it is as a cultural shift, RSS 2.0 as a format does have a few shortcomings. And one of those shortcomings is particularly worrisome, as it diminishes the overall value of syndicating content to begin with.

The issue is technical, but hopefully a simple illustration can demonstrate it.

Can you spot the differences between the following snippets of RSS?

  <description>
    The quick brown fox jumps over the lazy dog.
  </description>

  <description>
    The quick brown fox &lt;em&gt;jumps&lt;/em&gt; over the lazy dog.
  </description>

  <description>
    <![CDATA[The quick brown fox <em>jumps</em> over the lazy dog.]]>
  </description>

If you're a human then you'll probably have no problem spotting that the first one is plain text, the second one is XML-escaped HTML, and the third is HTML wrapped in an XML CDATA section. If presented in a web browser, in an HTML <div/> tag perhaps, then a human will have no trouble interpreting the content.

But if you're a computer, it isn't quite that easy. To a computer, the contents of an RSS <description/> element are opaque. The best a computer can do with them is hope to render them for a human to interpret.
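To make that concrete, here is a minimal Python sketch of what a program actually receives once the three snippets above have been parsed. It is only an illustration: it uses the standard library's xml.etree.ElementTree, and the helper name description_text is made up for this example.

```python
import xml.etree.ElementTree as ET

# The three RSS snippets from above, with surrounding whitespace removed.
SNIPPETS = [
    "<description>The quick brown fox jumps over the lazy dog.</description>",
    "<description>The quick brown fox &lt;em&gt;jumps&lt;/em&gt; over the lazy dog.</description>",
    "<description><![CDATA[The quick brown fox <em>jumps</em> over the lazy dog.]]></description>",
]


def description_text(xml_snippet: str) -> str:
    """Return the character data an XML parser sees inside <description>.

    Hypothetical helper for this illustration only.
    """
    return ET.fromstring(xml_snippet).text


texts = [description_text(s) for s in SNIPPETS]
for text in texts:
    # Every result is a bare string; nothing says "plain text" or "HTML".
    print(repr(text))
```

The escaped and CDATA versions come out as the identical string, and none of the three results carries any hint as to whether the characters are plain text or markup. The program is left to guess.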

This works fine for the bulk of syndicated content on the web today. Blogs can spit out XML-escaped content and blog readers can display that content for a person to read.

But what if you wanted to put something interesting inside a syndicated content feed? What if you wanted to put valid XHTML in a feed? You went to the trouble of writing XHTML; why should it be flattened to an opaque blob of "maybe plain text maybe escaped HTML but I'm not really sure"?

What if you added semantic microformat markup to your HTML? If you're using an opaque data format, then you may as well have spared yourself the effort, as no client will know it's there.

Or what if you wanted to put some other structured data in your syndicated content feed? Geospatial data, perhaps. Product data. Or perhaps Google's GData format. If it's syndicated over RSS, no one will ever know.

So the problem with the RSS syndication format is that it is lossy. Lossy insofar as information you had when writing the data is lost when it is passed over the wire.

Again, this isn't a problem for many of the early scenarios in the blogging world. But as we learn that more and more content can and should be syndicated, the format itself can either help or hinder our applications' capabilities.

Fortunately all is not lost. While I don't want to get embroiled in a format war, I will say that I've found the Atom 1.0 standard to meet the needs of nearly every single problem that I've thrown at it. Amazingly so, actually. I've been consistently impressed with how well the authors of the Atom syndication format anticipated the needs of the advanced content syndication community. There has yet to be a use-case that I've explored -- and I work with some thorny ones -- in which Atom has let me down.

That, and the Atom Syndication Format specification is the single best technical spec I've ever read. Seriously, give it a read just to see good spec writing in action. It's concise, accurate, unambiguous, and contains the right amount of illustrative detail.

Atom 1.0 addresses the issue of opaque content by including a very simple, but fully-defined, "type" attribute on all elements that can contain content.

For example:

  <content type="text">
    The quick brown fox jumps over the lazy dog.
  </content>

  <content type="html">
    The quick brown fox &lt;em&gt;jumps&lt;/em&gt; over the lazy dog.
  </content>

Or even:

  <content type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xhtml:div>
      The quick brown fox <xhtml:em>jumps</xhtml:em> over the lazy dog.
    </xhtml:div>
  </content>

The content generator (a human editor, a blog authoring tool, a publishing protocol such as the Atom Publishing Protocol (APP), or anything else) has full knowledge of the type of content being syndicated. If you syndicate that content via a format like Atom, then that knowledge is not lost.
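Here is a rough Python sketch of how a consumer might use that "type" attribute. It is an illustration under assumptions, not a full Atom processor: the function name read_content is hypothetical, the snippets declare the Atom namespace so they parse on their own, and only the three types shown above are handled.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"  # namespace prefix in ElementTree notation

TEXT_DOC = (
    '<content xmlns="http://www.w3.org/2005/Atom" type="text">'
    "The quick brown fox jumps over the lazy dog.</content>"
)
HTML_DOC = (
    '<content xmlns="http://www.w3.org/2005/Atom" type="html">'
    "The quick brown fox &lt;em&gt;jumps&lt;/em&gt; over the lazy dog.</content>"
)
XHTML_DOC = (
    '<content xmlns="http://www.w3.org/2005/Atom" type="xhtml" '
    'xmlns:xhtml="http://www.w3.org/1999/xhtml">'
    "<xhtml:div>The quick brown fox <xhtml:em>jumps</xhtml:em> "
    "over the lazy dog.</xhtml:div></content>"
)


def read_content(xml_snippet: str):
    """Return (type, payload) for an Atom <content> element.

    Hypothetical helper: the payload is a string for "text" and "html",
    and the parsed XHTML <div> element for "xhtml".
    """
    elem = ET.fromstring(xml_snippet)
    # Atom treats a missing type attribute as "text".
    content_type = elem.get("type", "text")
    if content_type == "xhtml":
        return content_type, elem.find(f"{XHTML}div")
    return content_type, (elem.text or "").strip()


for doc in (TEXT_DOC, HTML_DOC, XHTML_DOC):
    content_type, payload = read_content(doc)
    print(content_type, type(payload).__name__)
```

Because the type travels with the content, the consumer no longer has to guess. In the "xhtml" case it even gets a real element tree back, so any structure inside it (the <em>, or microformat markup) is still there to be queried.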

For more details regarding the technical differences between RSS and Atom, I recommend reading this page on RSS 2.0 and Atom 1.0 compared. In the article the authors outline several other important advantages of Atom. Though to me personally it is the simple issue of content type that makes the rest of the issues pale in comparison.

Put it this way -- I couldn't be doing half of the work that I'm doing right now on search syndication without Atom. Sending back search results snippets over RSS is one thing. Syndicating rich search content is an entirely different thing, and that requires a non-lossy syndication format.

My recommendation to application developers today is to use Atom 1.0, not RSS, as the basis for your content syndication.

You should absolutely continue to read and support RSS of all flavors. As Postel said, "be conservative in what you do; be liberal in what you accept from others." In the context of content (or search) syndication, this may mean being able to read all sorts of formats, but only writing the one format that preserves the data the best. And today, I believe that format to be Atom 1.0.

Fortunately, as Atom 1.0 preserves more information than RSS 2.0, it is trivial to transform an Atom feed into an RSS feed. A simple XSLT suffices to provide RSS output when you absolutely need it. Of course, now that Atom 1.0 has been ratified as a standard, it is unlikely that any major application won't support it natively.
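The same downgrade can be sketched in Python instead of XSLT. This is only an illustration under assumptions: the function name atom_entry_to_rss_item and the example.org values are made up, only "text" and "html" content is handled, and treating the RSS description as escaped HTML is a common convention rather than something the format itself tells a reader.

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM = "{http://www.w3.org/2005/Atom}"  # namespace prefix in ElementTree notation

# Hypothetical sample entry; the URL and id are placeholders.
ENTRY = """
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Fox</title>
  <link href="http://example.org/fox"/>
  <id>tag:example.org,2006:fox</id>
  <content type="html">The quick brown fox &lt;em&gt;jumps&lt;/em&gt; over the lazy dog.</content>
</entry>
"""


def atom_entry_to_rss_item(entry_xml: str) -> ET.Element:
    """Build an RSS 2.0 <item> from one Atom <entry> (text/html content only)."""
    entry = ET.fromstring(entry_xml.strip())
    item = ET.Element("item")

    ET.SubElement(item, "title").text = entry.findtext(f"{ATOM}title", default="")

    link = entry.find(f"{ATOM}link")
    if link is not None:
        ET.SubElement(item, "link").text = link.get("href", "")

    # An Atom id need not be a fetchable URL, so mark the guid accordingly.
    guid = ET.SubElement(item, "guid", isPermaLink="false")
    guid.text = entry.findtext(f"{ATOM}id", default="")

    content = entry.find(f"{ATOM}content")
    if content is not None:
        body = (content.text or "").strip()
        if content.get("type", "text") == "text":
            # Assumption: readers will treat the description as HTML,
            # so plain text is escaped to keep it from being misread.
            body = escape(body, quote=False)
        # The type attribute has no home in RSS 2.0; it is dropped here.
        ET.SubElement(item, "description").text = body

    return item


item = atom_entry_to_rss_item(ENTRY)
print(ET.tostring(item, encoding="unicode"))
```

Going in this direction is easy precisely because information is only being thrown away. Going back from the RSS item to the Atom entry would mean guessing at the content type again.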

I don't really care if RSS becomes a generic brand name for content syndication, just like "Kleenex" has for tissues. I think it is fine if engineers recommend to their directors, "we should support RSS in our applications. Content syndication is what our customers want."

Though it might be a good idea to add "well, we're actually building it using Atom. It's the same thing, only a little better, and we can still speak RSS with people who need it. But don't worry, that's just a technical detail."

It is just a technical detail. But it is a technical detail that we, the engineers, should be very concerned with...