Deconstructing RSS 2.0
Volume Number: 21 (2005)
Issue Number: 12
Column Tag: Programming
Understanding How RSS Feeds Work
by Dave Wooldridge
If you're one of the millions of people who maintain their own weblog, then odds are you're already familiar with RSS or Atom feeds, since most blog tools include support for offering your blog as a syndicated feed. If you don't write a blog, then you've undoubtedly seen the RSS icon displayed online (see Figure 1) or have been invited by web sites to subscribe to their free RSS feeds. In fact, it's such a hot technology these days that you would have had to have been living "off the Grid" for the last few years to have not heard about RSS.
With aggressive spam filters making e-mail communication difficult for even legitimate marketers and businesses, web feeds have become a safe and dependable method for you to successfully deliver news to your audience, as well as allow third-party sites to syndicate your feed content for expanded reach to new viewers. It's a win-win situation for everyone involved. Users can subscribe to only the feeds they wish to receive and third-party sites are provided with free content for their sites that ultimately drives additional traffic to your site via your feed's links.
Figure 1. Typical RSS buttons seen on web sites and blogs, commonly referred to as "chicklets"
While the RSS and Atom formats are both used throughout the Internet for content syndication, this article will focus on the RSS 2.0 specification. With support for multimedia enclosures and other multi-purpose features, RSS 2.0 is quickly becoming one of the most popular flavors for content syndication, providing site owners with a powerful vehicle for delivering a lot more than just news. Have you ever subscribed to a podcast? Yup, you guessed it... podcasts are RSS 2.0 feeds.
How RSS Works
Before we dive in, let's dispel the most common misconception about web feeds. While many people use the word "broadcast" to describe online content syndication, web feeds are not transmitted like radio or television signals. They are not beamed or sent to subscribers. A web feed is nothing more than an XML document that resides on a web server. This means that your news reader software or web browser is fetching the RSS feed from a specified URL, just like it would do to access an HTML web page. The reader/browser software then parses (translates) that XML document, providing it to you in a display format that's easy to read. When using a news reader application (such as NetNewsWire or NewsFire) or an RSS-savvy browser (such as Safari or Firefox), subscribing to an RSS feed is like adding a bookmark to your favorites list. The only difference is that an RSS feed will get automatically checked for new updates on a regular basis, which requires the reader/browser to go fetch the XML document from the feed's URL at each timed interval.
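To make that fetch-and-parse cycle concrete, here is a minimal Python sketch that uses only the standard library. The function names fetch_feed and feed_title, along with the sample document, are made up for illustration. A real news reader would add caching, timed refreshes, and error handling on top of this.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url):
    """Request the feed document from its URL, just like fetching a web page."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def feed_title(xml_bytes):
    """Parse the XML document and return the text of the channel's title tag."""
    root = ET.fromstring(xml_bytes)  # the root <rss> element
    return root.findtext("channel/title")


# A tiny stand-in for the document a server might return (hypothetical content).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    <link>http://example.com/</link>
    <description>An example feed</description>
  </channel>
</rss>"""

print(feed_title(SAMPLE))
```

With a live feed, you would call feed_title(fetch_feed(url)) each time the refresh interval comes around.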
Since we'll be taking a look at the structure of an RSS feed and the specific functionality of each XML tag in the RSS specification, it's important to understand the basic syntax of XML (Extensible Markup Language). Like HTML code, XML consists of tags such as:
<title>My RSS Feed</title>
Unlike HTML, which helps define styles and formatting for text, XML tags strictly define the meaning and context of information, keeping all of the text neatly organized with tag names. You don't have to be a master of XML in order to write or modify your own RSS feeds, but there are a few basic rules that you should keep in mind.
- Every valid RSS feed needs to include <?xml version="1.0" encoding="UTF-8"?> as the very first tag at the top of the document. UTF-8 is what most feeds use as the text encoding and most RSS parsers assume UTF-8 as the default if no encoding is specified, but if you need to use a different encoding for a special purpose, then change the encoding attribute accordingly.
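- The nesting order of XML tags is very important in order for your code to be valid. For example, <item><title>My News</title></item> is properly nested because title closes before item does, while <item><title>My News</item></title> is not.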
- XML is case-sensitive, so <skipdays> is not the same as <skipDays>. While some RSS readers and browsers may be smart enough to overlook case typos, you certainly want to avoid any potential problems with your feed, so if you're modifying your RSS code by hand in a text editor, make sure your tags conform to the RSS 2.0 specification.
- Be very careful when including HTML code within XML. If you need to include HTML code within the title or description tags of a news item, you need to either enclose all of the text within a CDATA block such as:
<description><![CDATA[My <b>Big</b> News]]></description>
or encode the angle brackets of the HTML tags as entities such as:
<description>My &lt;b&gt;Big&lt;/b&gt; News</description>
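If you generate your feed from a script, both ways of protecting HTML are easy to produce. The following Python sketch (the variable names are just for illustration) builds a description tag each way and confirms that an XML parser hands back the same text from both.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html = "My <b>Big</b> News"

# Option 1: wrap the HTML in a CDATA block so the parser leaves it alone.
cdata_form = "<description><![CDATA[" + html + "]]></description>"

# Option 2: encode the markup characters (&, <, >) as entities.
escaped_form = "<description>" + escape(html) + "</description>"

print(escaped_form)

# Either way, the parser returns the original HTML string as the tag's text.
assert ET.fromstring(cdata_form).text == html
assert ET.fromstring(escaped_form).text == html
```

One caveat with the CDATA approach: the text inside the block cannot itself contain the sequence ]]>, since that is what ends the block.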
Reviewing an Example
To get a feel for what we are talking about here, let's take a look at an example RSS feed. RSS is short for "Really Simple Syndication," and as you'll see from the XML code in Listing 1, the format really is quite simple overall. RSS 2.0 feeds can be saved with a file extension of either .rss or .xml. Listing 1 shows an RSS feed that includes two items. The first one is a typical news item, while the second one contains a media enclosure, similar to what you would find in a podcast.
Listing 1: An RSS 2.0 Feed Example
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<!-- the required title and link, plus any other channel elements, go here -->
<description>Quality eBooks and Printed Books from Respected
Authors at a Great Price!</description>
<copyright>Copyright 2005 SpiderWorks, LLC. All rights reserved.</copyright>
<pubDate>Mon, 24 Oct 2005 05:53:07 GMT</pubDate>
<lastBuildDate>Mon, 17 Oct 2005 08:13:02 GMT</lastBuildDate>
<cloud domain="http://www.spiderworks.com" port="80" path="/rpc"
registerProcedure="rssPleaseNotify" protocol="XML-RPC" />
<image>
<!-- the url, title, link, width, and height sub-elements go here -->
<description>Quality eBooks and Printed Books from Respected Authors
at a Great Price!</description>
</image>
<textInput>
<!-- the title, name, and link sub-elements go here -->
<description>Search the SpiderWorks Web Site</description>
</textInput>
<item>
<title>SpiderWorks Releases Danny Goodman's New Dashboard Book</title>
<description>Danny Goodman
shows you how to build rock-solid, professional Dashboard widgets for Mac OS X Tiger
in his new book, Mac OS X Technology Guide to Dashboard. Includes exclusive widget
debugging tool, The Evaluator! Available as an eBook and printed edition at
...</description>
<category>Mac OS X Technology Guides</category>
<pubDate>Tue, 05 Jul 2005 02:37:01 GMT</pubDate>
</item>
<item>
<title>SpiderWorks Interview with Ben Waldie</title>
<description>SpiderWorks recently sat down with author Ben Waldie to discuss
his new book, Mac OS X Technology Guide to Automator. Listen to the full interview
...</description>
<enclosure url="http://www.spiderworks.com/audio/waldie.mp3" length="243108"
type="audio/mpeg" />
<category>Mac OS X Technology Guides</category>
<pubDate>Mon, 02 May 2005 00:00:01 GMT</pubDate>
</item>
</channel>
</rss>
At first glance, Listing 1 looks like a typical XML document, but what sets it apart as RSS is the way the XML data is structured. The root element is, of course, the rss tag. Nested within that tag is the channel tag. It's within the channel tag that the meat of the document is stored. Figure 2 breaks down the main ingredients of this RSS example into groups, providing you with an easy way to visualize the XML code in Listing 1.
Figure 2. The RSS example broken down into basic groups of elements.
The contents of the channel tag fall into two groups: (1) channel elements that describe details about the RSS feed itself, and (2) items that hold your news stories, podcast audio tracks, etc. While Listing 1 and Figure 2 only include two items for example purposes, you can add as many items as you want to your own RSS feed.
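That two-group split is easy to see in code. The Python sketch below walks the children of the channel tag in a small, made-up feed and sorts them into channel elements and items. The feed content and variable names are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# A made-up feed, trimmed to the bare structure.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>A made-up feed</description>
    <item><title>First story</title></item>
    <item><title>Second story</title></item>
  </channel>
</rss>"""

channel = ET.fromstring(FEED).find("channel")

# Group 1: channel elements that describe the feed itself.
channel_elements = {child.tag: child.text for child in channel if child.tag != "item"}

# Group 2: the items that carry the actual content.
item_titles = [item.findtext("title") for item in channel.findall("item")]

print(channel_elements)
print(item_titles)
```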
Defining Your RSS Feed
As shown in Figure 2, before your actual news items are listed, your XML needs to include some basic information about the RSS feed itself. These tags are called channel elements and include important information like the title of your feed, the content's copyright, the date the feed was last updated, etc. Only a few of these elements are required, but the more information you provide, the more efficiently the receiving RSS readers (known as aggregators) can handle and process your data. For example, including optional tags like lastBuildDate and ttl can help relieve the server load from your feed being requested unnecessarily since those tags specify when the feed was last updated and how long the data should be cached (stored temporarily) before refreshing with a new HTTP request of the feed.
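As a rough illustration of how an aggregator might use lastBuildDate, the Python sketch below compares that date against the time the aggregator last saw the feed. It leans on the standard email.utils module, which reads and writes the date format that grew out of RFC 822. The function name feed_changed is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def feed_changed(last_build_date, last_seen):
    """Return True if the feed's lastBuildDate is newer than the time we last saw it."""
    return parsedate_to_datetime(last_build_date) > last_seen


last_seen = datetime(2005, 10, 1, tzinfo=timezone.utc)
print(feed_changed("Mon, 17 Oct 2005 08:13:02 GMT", last_seen))

# Going the other way: format a timestamp for a date tag in your own feed.
stamp = format_datetime(datetime(2005, 10, 24, 5, 53, 7, tzinfo=timezone.utc), usegmt=True)
print(stamp)
```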
While all of the channel elements are defined here, please refer to the code examples in Listing 1 for the proper XML syntax of these tags.
title
REQUIRED. The name of your feed, which is usually the same as the name of the blog or web site that's related to your feed.
link
REQUIRED. The URL of your blog or web site (not the URL of your feed).
description
REQUIRED. A very brief phrase or sentence that describes your feed's overall content.
language
Optional. The language that the feed is written in. For a list of possible language codes, please refer to: http://blogs.law.harvard.edu/tech/stories/storyReader$15
copyright
Optional. The copyright notice for the feed's content. Do not use the actual copyright symbol since that special character may not display properly in RSS readers.
managingEditor
Optional. The e-mail address of the editor of the content. This is not necessarily the author of the content, since the content may have come from multiple sources, but this is the person who is managing the feed.
webMaster
Optional. The e-mail address of the webmaster who oversees all technical issues related to the feed.
rating
Optional. The PICS rating of the feed, which helps adults control what online content is accessible by children. This tag is rarely used, but for more information on PICS, please visit: http://www.w3.org/PICS/
pubDate
Optional. The publication date of the feed, which states the earliest date that the content can be publicly displayed. Most aggregators ignore this tag and instead focus on the lastBuildDate tag. The date should be formatted to conform to RFC 822, which can be found at: http://asg.web.cmu.edu/rfc/rfc822.html
lastBuildDate
Optional. The date that the feed was last updated. This is often one of the first tags that aggregators check to see if any new content has been added or updated since the last time the feed was requested. Like pubDate, this tag's date should be formatted to conform to RFC 822.
category
Optional. If your blog or web site organizes blog entries and articles into specific categories, then this tag may help aggregators to categorize items accordingly. Unfortunately, there is no standard cataloging system, so often this tag only proves useful for your own site needs. The domain attribute typically refers to a URL for that category online, but if your site does not include unique web pages for each category, then you may want to just link to your home page.
docs
Optional. This tag should link to the official RSS specification online. As an RSS 2.0 feed, this tag should point to: http://blogs.law.harvard.edu/tech/rss
generator
Optional. If you used feed generator software to create your feed, then the application would give itself credit in this tag. For example, if you used FeedForAll to generate your feed, then this tag may read:
<generator>FeedForAll</generator>
skipDays
Optional. This tag informs aggregators that the feed should not be read on certain days. Within the skipDays tag is a nested day sub-element, so that you can include more than one day within skipDays. Acceptable day values are: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, or Sunday. For example, to list both Sunday and Monday as days to skip, you would use the following XML syntax:
<skipDays><day>Sunday</day><day>Monday</day></skipDays>
skipHours
Optional. This tag informs aggregators that the feed should not be read during certain hours. The skipHours tag syntax works exactly like the skipDays tag, except that the sub-element is hour instead of day. An acceptable hour value is any whole number between 0 and 23. Like skipDays and its day sub-element, skipHours can include multiple hour sub-elements.
ttl
Optional. This tag represents the "time to live" with the value being in minutes. It tells aggregators how long the feed content should be cached on their end before refreshing with a new HTTP request of the feed. For example, if you use 60 as the value, then aggregators will know that they need to wait 60 minutes before requesting a fresh copy of the feed. This can help alleviate some of your server load since it will decrease the number of redundant feed requests.
cloud
Optional. This is probably the most rarely used and most confusing tag in the RSS 2.0 specification. Many developers who encounter this tag either don't understand how it works or have not figured out how to best utilize it. This tag represents a lightweight publish and subscribe feature that includes five essential attributes: domain, port, path, registerProcedure, and protocol. It is a way for RSS to leverage the power of web services like SOAP or XML-RPC to conserve server bandwidth, allowing aggregators to register to be automatically notified of any updates made to the feed. If you're not using SOAP or XML-RPC web services on your server, then you won't have a need for this tag. For more information on using SOAP with the optional cloud tag, visit: http://blogs.law.harvard.edu/tech/soapMeetsRss
image
Optional. You may have noticed that some aggregators display a logo or badge from the feed they are displaying. They get this data from the image element, which includes six sub-elements: title, url, link, description, width, and height. For example, if you publish a sports news RSS feed called "Primo Sports Plus" which has its own unique logo, you can supply aggregators with that logo to help brand your syndicated content. There's no guarantee that aggregators will use the logo, but the extra marketing potential makes it worth including. title represents the alt attribute if the image is rendered in HTML and link represents the URL of your site if someone clicks on the image. The url sub-element should be the direct URL to the actual image file itself. The image's description value is usually the same as your feed's description tag. It's important to note that the maximum image width value is 144 (defaults to 88 if unspecified) and the maximum image height value is 400 (defaults to 31 if unspecified). See Listing 1 for the XML code syntax of the image tag and its nested sub-elements.
textInput
Optional. This rarely used tag is ignored by most aggregators. It provides a way for aggregators to submit a search (if your site has a search engine) or submit user comments (if your blog accepts user input). If you decide to include this tag in your feed, it requires four sub-elements: title, description, name, and link. title is the name of the submit button for your text input box. description provides user instructions for the text input box. name is the form name of the text input box. link is the URL of your server-side script (such as PHP or Perl) that should process these text input submissions. See Listing 1 for the XML code syntax of the textInput tag and its nested sub-elements.
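To see how the ttl, skipDays, and skipHours tags might work together on the aggregator's side, consider the following Python sketch. The should_fetch function and the sample channel are hypothetical, and the sketch assumes the aggregator keeps its clock in GMT when checking skipHours. Real aggregators differ in how strictly they honor these tags.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# A made-up channel fragment holding only the three scheduling tags.
CHANNEL = ET.fromstring("""<channel>
  <ttl>60</ttl>
  <skipDays><day>Sunday</day><day>Monday</day></skipDays>
  <skipHours><hour>0</hour><hour>1</hour></skipHours>
</channel>""")


def should_fetch(channel, last_fetch, now):
    """Decide whether a considerate aggregator would request the feed again at `now`."""
    ttl_minutes = int(channel.findtext("ttl", default="0"))
    if now - last_fetch < timedelta(minutes=ttl_minutes):
        return False  # the cached copy is still within its time to live
    skip_days = {day.text for day in channel.findall("skipDays/day")}
    if DAY_NAMES[now.weekday()] in skip_days:
        return False  # the feed asked not to be read today
    skip_hours = {int(hour.text) for hour in channel.findall("skipHours/hour")}
    return now.hour not in skip_hours


last_fetch = datetime(2005, 10, 18, 9, 0, tzinfo=timezone.utc)  # a Tuesday
print(should_fetch(CHANNEL, last_fetch, last_fetch + timedelta(minutes=30)))
print(should_fetch(CHANNEL, last_fetch, last_fetch + timedelta(hours=2)))
```

The first call falls inside the 60-minute ttl, so the cached copy is reused. The second falls outside it on a day and hour that are not skipped, so a fresh request is made.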
The Anatomy of RSS Feed Items
After you've defined your channel elements within the channel tag, it's now time to add your actual content items. These are the items that are displayed by aggregators as news stories, blog entries, podcast items, etc. (depending on what content you wish to include in your feed). Each item is encapsulated in its own item tag (that is nested within the channel tag below the channel elements). See Listing 1 for the XML code syntax of the item tag and its nested sub-elements. The sub-elements that describe an item's content are defined here.
While most feed items include a link back to the full online version of the article or blog entry, a feed item does not require a link if you wish to include all of the content in its description sub-element. In fact, no single item sub-element is required, as long as each item includes at least a title or a description.
title
Optional. This is the title of the item. Although it's optional, most aggregators look for this item sub-element, so it's highly recommended to include it.
link
Optional. This is the URL to the web page version of the item on your web site or blog.
description
Optional. This is the description of the item. For news stories and blog entries, it's your choice to include the entire text or only a summary. If you only include a summary, then be sure to also include the link sub-element in your item, so that users can click through to your site to read the entire story. For podcasts, the description sub-element usually holds text information about the song track and artist. Although it's optional, most aggregators look for this item sub-element, so it's highly recommended to include it.
enclosure
Optional. This sub-element defines a multimedia file. If your feed is a podcast, then the enclosure sub-element of each item would identify the related audio file. It requires three attributes: url, length, and type. The url attribute should be the direct URL to the actual media file itself. The value of length should be the file size of the media object in bytes. The value of type should refer to the media file's MIME type. See Listing 1 for the XML code syntax of the enclosure sub-element and its attributes.
category
Optional. This sub-element works exactly like the channel's category tag, except that it defines a unique category for the individual item. Like the channel's category tag, there is no standard cataloging system, so often this tag only proves useful for your own site needs.
source
Optional. This should name the source of the item's content if you are not the original author. For example, if the content is from MacTech's RSS feed, you would list "MacTech" as the source value and the URL for MacTech's RSS feed as the value of the url attribute.
author
Optional. The e-mail address of the author of the item's content. For example, if the source is credited to MacTech's RSS feed and David Sobsey is the author of the piece, then list David Sobsey's e-mail address as the value of this sub-element.
pubDate
Optional. This is the publication date of the item. Like the channel's pubDate and lastBuildDate tags, this sub-element's date should be formatted to conform to RFC 822.
comments
Optional. This is the URL for the item's related web page of user comments. This sub-element is usually only relevant for blogs that allow web-based user comments.
guid
Optional. This is an interesting sub-element that is used as a unique identifier for the item. There is no set convention for how this sub-element should be used, but most aggregators expect it to be a unique URL string that no other item can have, making it a valid item ID. For news stories and blog entries, this works great since they would have their own unique URLs, but if the item's URL is referred to more than once in your feed, then it cannot be used as the unique identifier here. For a unique URL that will always be available for viewing online, you should include the isPermaLink="true" attribute.
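Here is one way an aggregator might put the guid and enclosure sub-elements to work, sketched in Python. The helper names (item_key, enclosure_info, unseen_items), the fallback order, and the example.com URLs are all assumptions for illustration. Since every item sub-element is optional, the code falls back to other sub-elements when an item has no guid.

```python
import xml.etree.ElementTree as ET

# A made-up channel fragment with two items (all URLs are placeholders).
CHANNEL = ET.fromstring("""<channel>
  <item>
    <title>Interview</title>
    <guid isPermaLink="true">http://example.com/news/1</guid>
    <enclosure url="http://example.com/audio/1.mp3" length="243108" type="audio/mpeg" />
  </item>
  <item>
    <description>An item with no title, link, or guid</description>
  </item>
</channel>""")


def item_key(item):
    """Identify an item by its guid, falling back to link, title, then description."""
    for tag in ("guid", "link", "title", "description"):
        value = item.findtext(tag)
        if value:
            return value
    return None


def enclosure_info(item):
    """Return (url, size in bytes, MIME type) for the item's enclosure, or None."""
    enclosure = item.find("enclosure")
    if enclosure is None:
        return None
    return enclosure.get("url"), int(enclosure.get("length")), enclosure.get("type")


def unseen_items(channel, seen):
    """Return the items whose key is not yet in `seen`, recording each new key."""
    fresh = []
    for item in channel.findall("item"):
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh


seen = set()
for item in unseen_items(CHANNEL, seen):
    print(item_key(item), enclosure_info(item))
```

Calling unseen_items a second time with the same seen set returns nothing, which is how a reader avoids showing you the same story twice.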
Now that you've stepped through the entire RSS 2.0 specification, you're ready to put RSS to good use. If you're only interested in building your own RSS feeds, then check out the Resources section of this article for links to Mac-compatible feed generator applications that can help streamline the creation process. If you're comfortable working with XML, you can also use a standard text editor like BBEdit (http://www.barebones.com/) or an XML editor like <oXygen/> (http://www.oxygenxml.com/) to roll your own RSS code. To ensure that your feed adheres to the RSS specification and does not include any errors, always test your feed with one of the many online validators (see the Resources section for some helpful URLs).
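If you'd rather script your feed than type it by hand, a few lines of Python can assemble the XML for you. The sketch below is only a starting point. The build_feed function and its arguments are hypothetical, and it writes just the three required channel elements plus whatever sub-elements you pass for each item.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, description, items):
    """Build a minimal RSS 2.0 document; `items` is a list of {tag: text} dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text
    for entry in items:
        item = ET.SubElement(channel, "item")
        for tag, text in entry.items():
            ET.SubElement(item, tag).text = text
    # Serializing encodes any HTML in the text as entities automatically.
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


feed = build_feed(
    "My RSS Feed",
    "http://example.com/",
    "An example feed",
    [{"title": "My Big News", "description": "My <b>Big</b> News"}],
)
print(feed.decode("utf-8"))
```

Even with generated XML, it's still worth running the result through a validator.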
If you're interested in parsing RSS in your own software projects or syndicating third-party feeds on your web site, the Resources section also includes links to a few of the most popular RSS parsers for various programming languages. When developing your own RSS-savvy application or web site, it's important to assume that all RSS feeds are invalid until proven otherwise. Because many of the feeds out there are hand-coded, they often contain the wrong text encoding attribute in the xml tag and/or just bad XML, so it's a good idea to include a lot of error handling in your own parsing code to safeguard your users from problems. Defensive programming is definitely the name of the game here. It's also recommended to support caching in your RSS applications. If third parties are kind enough to offer you free content for syndication, be considerate of their server bandwidth by caching the feed content temporarily on your end and only retrieving a fresh feed at timed intervals.
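As a small taste of what defensive parsing and caching can look like, here is a Python sketch. The names safe_titles and cached_titles, the 30-minute interval, and the in-memory cache are assumptions for the example, not a recommendation for any particular parser library. The fetch argument stands in for whatever function actually downloads the feed.

```python
import time
import xml.etree.ElementTree as ET

CACHE_SECONDS = 30 * 60  # assumed refresh interval
_cache = {}              # url -> (time fetched, list of item titles)


def safe_titles(xml_bytes):
    """Return the item titles, or an empty list if the feed cannot be parsed."""
    try:
        root = ET.fromstring(xml_bytes)
    except (ET.ParseError, LookupError, ValueError):
        return []  # bad XML or an unusable encoding declaration
    titles = []
    for item in root.iter("item"):
        title = item.findtext("title")
        if title:  # items are allowed to omit the title
            titles.append(title.strip())
    return titles


def cached_titles(url, fetch, now=None):
    """Reuse the stored titles until CACHE_SECONDS have passed, then fetch again."""
    now = time.time() if now is None else now
    hit = _cache.get(url)
    if hit and now - hit[0] < CACHE_SECONDS:
        return hit[1]
    titles = safe_titles(fetch(url))
    _cache[url] = (now, titles)
    return titles


# A hand-coded feed with mismatched tags is rejected instead of crashing the app.
print(safe_titles(b"<rss><channel><item><title>Oops</item></channel></rss>"))
```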
For those of you who want more out of RSS, the 2.0 specification allows you to extend RSS with namespace-defined modules. If you're not familiar with namespaces, they are part of the XML specification that RSS 2.0 supports as an easy way to extend the feed format for custom purposes. For more information on using namespaces to extend RSS 2.0, visit: http://www.reallysimplesyndication.com/howToExtendRss
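To give you an idea of what a namespace-defined extension looks like in practice, the Python sketch below reads a creator element from the Dublin Core namespace, one of the modules commonly seen in feeds. The feed content is made up for the example.

```python
import xml.etree.ElementTree as ET

# The xmlns:dc attribute declares the namespace that the dc: prefix refers to.
FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>A story</title>
      <dc:creator>Jane Doe</dc:creator>
    </item>
  </channel>
</rss>"""

# Map the prefix used in our queries to the namespace URI.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

item = ET.fromstring(FEED).find("channel/item")
creator = item.findtext("dc:creator", namespaces=NS)
print(creator)
```

Aggregators that don't know about a given namespace simply skip its elements, so the core RSS tags keep working as usual.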
And last, but not least... once you have your own RSS feed available on your site, it's easy enough to add a link to it on your home page using one of the popular "chicklet" buttons shown in Figure 1, but how do you get Safari and Firefox to automatically recognize your feed (see Figure 3) with that dynamic feed icon in the location bar and status bar respectively?
Figure 3. Dynamic feed icons in Safari and Firefox.
The answer is actually quite simple, requiring only a single line of HTML code. Nested within the head tag of your web pages, add the following link tag with the href attribute set as the direct URL to your RSS feed:
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://yoursite.com/feed.rss">
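If you're curious how software can spot that line, the Python sketch below scans a page's HTML for link tags of this kind. It's a rough approximation of feed auto-discovery rather than what any particular browser actually does, and the FeedLinkFinder class name and the sample page are made up.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect the href of every link tag that advertises an RSS feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel_values = (attrs.get("rel") or "").lower().split()
        if (tag == "link" and "alternate" in rel_values
                and attrs.get("type") == "application/rss+xml"):
            self.feeds.append(attrs.get("href"))


PAGE = """<html><head>
<title>My Site</title>
<link rel="stylesheet" type="text/css" href="style.css">
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://yoursite.com/feed.rss">
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(PAGE)
print(finder.feeds)
```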
So what are you waiting for? Now that you've gained some insight on how RSS 2.0 works, start taking advantage of one of the hottest new Internet technologies that can greatly increase exposure for your products and services, and drive additional traffic to your web site or blog.
Resources
The following list is by no means comprehensive, but should serve as a good starting point for learning more about RSS 2.0 online.
RSS Readers/Browsers for Mac OS X
RSS Feed Generators
Online RSS Validators
Dave Wooldridge is the founder of Electric Butterfly (www.ebutterfly.com), the developer of the Web Services Library for REALbasic and the award-winning HelpLogic. He is also co-founder of the new eBook publisher, SpiderWorks (www.spiderworks.com).