Electricmonk

Ferry Boender

Programmer, DevOpper, Open Source enthusiast.

The mess that is RSS

Sunday, May 16th, 2004

RSS, sometimes claimed to mean Really Simple Syndication and other times said to mean RDF Site Summary, is an XML format for listing new ‘items’ on, for instance, a website. Items can describe just about anything, from news items to new forum posts.

I particularly like the meaning ‘Really Simple Syndication’. When you first look at RSS it is indeed rather simple. It consists of a channel, which describes the source that the RSS feed applies to, and items. Items contain a title, a link and an optional description. Example:


<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="0.91">
 <channel>
  <link>http://site.com</link>
  <title>Site.com New Articles</title>
  <description>Newest articles on Site.com</description>
  
  <item>
   <title>Site.com now supports RSS</title>
   <link>http://www.site.com/rss</link>
   <description>Site.com now supports RSS feeds.</description>
  </item>
  <item>
   <title>Title</title>
   <link>http://www.site.com/</link>
   <description>Description</description>
  </item>
 </channel>
</rss>
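For a feed shaped exactly like the example above, reading it really is simple. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The function name `read_items` and the shortened `FEED` sample are illustrative assumptions, not part of any RSS specification, and the code only handles the plain-XML layout where items sit inside the channel.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the example feed (bytes, so the encoding
# declaration in the XML prolog is honoured by the parser).
FEED = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="0.91">
 <channel>
  <link>http://site.com</link>
  <title>Site.com New Articles</title>
  <description>Newest articles on Site.com</description>
  <item>
   <title>Site.com now supports RSS</title>
   <link>http://www.site.com/rss</link>
   <description>Site.com now supports RSS feeds.</description>
  </item>
 </channel>
</rss>
"""


def read_items(xml_bytes):
    """Return the items of a plain-XML feed as a list of dicts.

    Assumes the 0.91-style layout: <rss><channel><item>...</item></channel></rss>.
    """
    root = ET.fromstring(xml_bytes)
    channel = root.find("channel")
    if channel is None:
        return []
    items = []
    for item in channel.findall("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # description is optional, so fall back to an empty string
            "description": item.findtext("description", default=""),
        })
    return items


for entry in read_items(FEED):
    print(entry["title"], entry["link"])
```

This only works as long as every feed follows this one layout, which is exactly the assumption that does not hold in practice.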

Simple right? Wrong!

Though its purpose is extremely simple, from which it gains its power, its implementation and basis as a standard are a friggin’ mess. I’m currently working on a small program which will merge various RSS feeds into a single feed whilst maintaining the correct order of new items across different newsfeeds. And boy, am I running into a LOT of problems.
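To make the merging idea concrete, here is one possible sketch. It is not the merger described in this post, just an illustration under stated assumptions: items are plain dicts with a `link` key, early RSS items carry no date, so the order is derived from when each link was first seen, and the `first_seen` dict is a hypothetical store that would have to be saved between runs. It also assumes each feed lists its newest item first.

```python
import time


def merge_feeds(feeds, first_seen, now=None):
    """Merge several feeds into one list, newest items first.

    feeds      -- dict mapping a feed name to its list of item dicts
    first_seen -- dict mapping item link -> timestamp of first sighting
                  (hypothetical persistent store, updated in place)
    now        -- timestamp to assign to links not seen before
    """
    now = time.time() if now is None else now
    entries = []
    for feed_name, items in feeds.items():
        for position, item in enumerate(items):
            link = item["link"]
            # Remember when this link first showed up; keep older values.
            first_seen.setdefault(link, now)
            entries.append((first_seen[link], position, feed_name, item))
    # Newest first; ties are broken by position within the feed, then by
    # feed name so the result is deterministic.
    entries.sort(key=lambda e: (-e[0], e[1], e[2]))
    return [dict(item, source=feed_name) for _, _, feed_name, item in entries]
```

Called once per polling run with the same `first_seen` dict, items that appeared in an earlier run sort below items that are new in the current run, regardless of which feed they came from.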

A little summary to show what a mess RSS really is:

  • A minimum of four different organisations are all working, or have worked, on their own RSS standard: Netscape (who originally came up with it), UserLand (who took over from Netscape once they stopped development of RSS), the RSS-DEV working group (Purl) and the Berkman Center for Internet & Society at Harvard Law School.
  • None of the different organisations’ versions of RSS are compatible with each other.
  • There are a grand total of 9 incompatible versions of RSS.
  • The RSS 2.0 standards (one by UserLand, one by Harvard Law School) have no DTD.
  • The RSS 2.0 standards have no namespace (this is possible, but ugly)
  • v0.90 (by Netscape) is based on RDF (Resource Description Framework). v0.91 (Netscape version) isn’t based on RDF but is plain XML. v1.0 (by RSS-DEV) is RDF based again. v2.0… not RDF again.
  • Various versions of the standard say that item elements should occur inside the channel element. Other versions say the items should be siblings of the channel element. Writing up a simple DTD could cure this problem once and for all, but for 2.0 they couldn’t be bothered to spit out some lines to form a DTD.
  • Nowhere in any RSS standard does it say if multiple channels in a single RSS feed are permitted. Sloppy. The behaviour of feedreaders is therefore unspecified.
  • v0.90 and 1.0 put ridiculous constraints on the number of items that may occur in a single channel, namely 15 items. Other weird constraints apply to the allowed lengths of titles and links. Instead of simply stating that a link element contains a URL/URI, they specify that it can only contain http:// and ftp:// URLs.
  • v0.9 (and possibly 1.0) didn’t allow the use of namespaces. Other versions suddenly did allow namespaces. Why allow namespaces in such a simple standard? Seems like asking for incompatibilities to me. It’s the HTML browser war all over again. But the best part is that people are totally ignoring the fact that 0.9 doesn’t allow namespaces and are doing it anyway. I can’t blame them either. They probably didn’t know what to do anymore either.
  • Some versions of the RSS standard define entities for things like &nbsp;, others don’t. In one DTD they simply forgot to add the entity definitions. Some versions of RSS don’t even have a DTD so the entities can’t be used. You’ll be stuck with using &#10291029029203293020010010; styled XML entities.
  • The RSS 2.0 standard allows the use of HTML tags inside RSS elements WITHOUT HAVING TO DEFINE A NAMESPACE! That’s brilliant. Do they even have a clue about what kind of terrible parsing errors this is going to cause? Well, now that I think of it… none. Since there isn’t a DTD the feeds can never be validated, so all kinds of weird tags are indeed allowed. Way to define a standard! “Let’s not think about how to do it the good-old-w3c-xml-way, let’s just not define a DTD.” Good thinking guys.
  • RSS 2.0, since it doesn’t have a DTD, has no entities other than the default XML ones (&#101;, etc). But, in their own examples supplied with the standard they just use all kinds of unidentified entities. Ouch. Clearly done by super XML experts *cough*.
  • Due to different organisations all putting out their own versions of the specs, v1.0 PREDATES v0.92. Time and space do not exist anymore, we’re going back in time!
  • Specifications for the standards are scattered all over the web (which is the reason I’m not linking to any of them), including versions by the same authoring groups.
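One way to cope with the "items inside the channel" versus "items next to the channel" problem from the list above is to stop caring where the items are. The sketch below is an assumption about how a tolerant reader could work, not something any of the specifications prescribe: it walks the whole tree and picks up every element whose local name is `item`, ignoring whatever namespace it is in. The helper names `local_name` and `find_items` are made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def find_items(xml_bytes):
    """Collect all <item> elements, wherever they sit in the document.

    Each item becomes a dict of its direct children, keyed by local name,
    so 'title' is found whether or not the feed uses a namespace.
    """
    root = ET.fromstring(xml_bytes)
    items = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments / processing instructions, if any
        if local_name(element.tag) != "item":
            continue
        fields = {
            local_name(child.tag): (child.text or "").strip()
            for child in element
            if isinstance(child.tag, str)
        }
        items.append(fields)
    return items
```

The price of this leniency is that namespaced extension elements with the same local name as a core element overwrite each other, and a feed with several channels is flattened into one list.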

And this is just a short list of all the (possible) problems I’ve found while trying to implement a simple RSS parser which Just Works™. There’s more where that came from. Disclaimer: The list above might contain errors. I haven’t done extensive research into all the problems, but I do know it’s a pretty big mess.
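The entity problem from the list is easy to reproduce. A feed without a DTD that uses `&nbsp;` is not well-formed XML, and Python's `xml.etree.ElementTree` refuses to parse it. The workaround sketched here is an assumption on my part, not something the specifications suggest: before parsing, rewrite named HTML entities into numeric character references using the standard `html.entities.name2codepoint` table. The function name `replace_html_entities` is hypothetical, and the regex approach is crude (it will also rewrite entity-like text inside CDATA sections).

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser knows without a DTD.
XML_BUILTINS = {"amp", "lt", "gt", "quot", "apos"}

RAW = '<rss version="2.0"><channel><title>a&nbsp;b &amp; c</title></channel></rss>'


def replace_html_entities(text):
    """Turn named HTML entities (e.g. &nbsp;) into numeric references."""
    def to_numeric(match):
        name = match.group(1)
        if name in XML_BUILTINS or name not in name2codepoint:
            return match.group(0)  # leave XML built-ins and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", to_numeric, text)


try:
    ET.fromstring(RAW)
except ET.ParseError as error:
    print("strict parse failed:", error)

root = ET.fromstring(replace_html_entities(RAW))
print(repr(root.findtext("channel/title")))
```

The function works on text, so a feed fetched as bytes has to be decoded first, which in turn means guessing or reading its encoding.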

I’ve composed a little table to show the mess in a clear form:

Version | Creator  | RDF? | Items inside Channel? | Summary of problems
--------+----------+------+-----------------------+-----------------------------------------------------------
v0.90   | Netscape | Yes  | No                    | Idiotic limitations
v0.91   | Netscape | No   | Yes                   | Two versions
v0.91   | Userland | No   | Yes                   | Two versions
v0.92   | Userland | No   | Yes                   | HTML allowed without NS, no NS
v1.0    | RSS-DEV  | Yes  | No                    | HTML allowed without NS
v2.0    | Userland | No   | Yes                   | No DTD, no NS
v2.0    | Harvard  | No   | Yes                   | No DTD, no NS, same as Userland’s, different copyright
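Going by the table, a parser first has to work out which flavour it is looking at. A rough sketch of such a check follows, assuming the commonly used namespace URIs for RDF, RSS 0.90 and RSS 1.0 and the `version` attribute on the plain-XML `<rss>` root. The function name `guess_version` is made up, and real feeds may well lie about their version.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_090_NS = "http://my.netscape.com/rdf/simple/0.9/"
RSS_10_NS = "http://purl.org/rss/1.0/"


def guess_version(xml_bytes):
    """Guess the RSS flavour from the root element and its namespaces."""
    root = ET.fromstring(xml_bytes)
    if root.tag == "{%s}RDF" % RDF_NS:
        # RDF-based feeds: the namespace of <channel> tells 0.90 from 1.0.
        child_tags = {child.tag for child in root}
        if "{%s}channel" % RSS_10_NS in child_tags:
            return "1.0"
        if "{%s}channel" % RSS_090_NS in child_tags:
            return "0.90"
        return "unknown RDF"
    if root.tag == "rss":
        # Plain-XML feeds: trust the version attribute, if there is one.
        return root.get("version", "unknown")
    return "not RSS"


print(guess_version(b'<rss version="2.0"><channel/></rss>'))
```

Note that the `version` attribute alone cannot tell the two 0.91 rows of the table apart, nor the two 2.0 rows.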

Really Simple Syndication kids!

But.. other than the small problems listed above, RSS (or at least the idea behind it) rocks. :) My merger (written in Python) will be released soon.

Here are some links to various rants about RSS:
diveintomark.com,
XML.com.

Okay and now to cover my ass:

  • Userland and Harvard blah’s v2.0 may be the same standard. Only their copyright statements differ from each other, so I’m not sure. I think they’re in on this together or something. Probably a conspiracy.
  • Various sources on the internet on which I based the information in this article may not have gotten it right either. It’s a real mess, so it’s kinda hard to find out how exactly all these pieces fit together to form the big picture.
  • I’m no XML expert, so some statements I’ve made might be wrong.
  • Don’t trust the information in this article. I’m sure I am wrong. Never trust anything written on the Web anyway.

E-mail me for corrections.

The text of all posts on this blog, unless specifically mentioned otherwise, is licensed under this license.