What is RSS?

by Mark Pilgrim
December 18, 2002

RSS is a format for syndicating news and the content of news-like sites, including major news sites like Wired, news-oriented community sites like Slashdot, and personal weblogs. But it's not just for news. Pretty much anything that can be broken down into discrete items can be syndicated via RSS: the "recent changes" page of a wiki, a changelog of CVS checkins, even the revision history of a book. Once information about each item is in RSS format, an RSS-aware program can check the feed for changes and react to the changes in an appropriate way.

RSS-aware programs called news aggregators are popular in the weblogging community. Many weblogs make content available in RSS. A news aggregator can help you keep up with all your favorite weblogs by checking their RSS feeds and displaying new items from each of them.
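The "displaying new items" step can be sketched in a few lines. This is only an illustration, not how any particular aggregator works: the function name newItems and the idea of tracking items by their link are assumptions made for the example.

```python
def newItems(seenLinks, feedLinks):
    """Hypothetical helper: return links not seen before, in feed order,
    and remember them in seenLinks for the next poll."""
    fresh = []
    for link in feedLinks:
        if link not in seenLinks:
            seenLinks.add(link)
            fresh.append(link)
    return fresh

seen = set()

# First poll: everything in the feed is new.
firstPoll = newItems(seen, [
    'http://www.xml.com/pub/a/2002/12/04/normalizing.html',
    'http://www.xml.com/pub/a/2002/12/04/som.html',
])

# Second poll: one old item and one new item; only the new one is reported.
secondPoll = newItems(seen, [
    'http://www.xml.com/pub/a/2002/12/04/som.html',
    'http://www.xml.com/pub/a/2002/12/04/svg.html',
])
```

A real aggregator would also need to store the seen links between runs; that part is left out here.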

A brief history

So which one do I use?

That's 7 -- count 'em, 7! -- different formats, all called "RSS". As a coder of RSS-aware programs, you'll need to be liberal enough to handle all the variations. But as a content producer who wants to make your content available via syndication, which format should you choose?

RSS versions and recommendations
Version	Owner	Pros	Status	Recommendation
0.90	Netscape		Obsoleted by 1.0	Don't use
0.91	UserLand	Drop dead simple	Officially obsoleted by 2.0, but still quite popular	Use for basic syndication. Easy migration path to 2.0 if you need more flexibility
0.92, 0.93, 0.94	UserLand	Allows richer metadata than 0.91	Obsoleted by 2.0	Use 2.0 instead
1.0	RSS-DEV Working Group	RDF-based, extensibility via modules, not controlled by a single vendor	Stable core, active module development	Use for RDF-based applications or if you need advanced RDF-specific modules
2.0	UserLand	Extensibility via modules, easy migration path from 0.9x branch	Stable core, active module development	Use for general-purpose, metadata-rich syndication

What does RSS look like?

Imagine you want to write a program that reads RSS feeds, so that you can publish headlines on your site, build your own portal or homegrown news aggregator, or whatever. What does an RSS feed look like? That depends on which version of RSS you're talking about. Here's a sample RSS 0.91 feed (adapted from XML.com's RSS feed):

    <rss version="0.91">
      <channel>
        <title>XML.com</title>
        <link>http://www.xml.com/</link>
        <description>XML.com features a rich mix of information and services for the XML community.</description>
        <language>en-us</language>
        <item>
          <title>Normalizing XML, Part 2</title>
          <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
          <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
        </item>
        <item>
          <title>The .NET Schema Object Model</title>
          <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
          <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
        </item>
        <item>
          <title>SVG's Past and Promising Future</title>
          <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
          <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
        </item>
      </channel>
    </rss>

Simple, right? A feed comprises a channel, which has a title, link, description, and (optional) language, followed by a series of items, each of which has a title, link, and description.

Now look at the RSS 1.0 version of the same information:

    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns="http://purl.org/rss/1.0/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
      <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
        <title>XML.com</title>
        <link>http://www.xml.com/</link>
        <description>XML.com features a rich mix of information and services for the XML community.</description>
        <language>en-us</language>
        <items>
          <rdf:Seq>
            <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/normalizing.html"/>
            <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/som.html"/>
            <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/svg.html"/>
          </rdf:Seq>
        </items>
      </channel>
      <item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html">
        <title>Normalizing XML, Part 2</title>
        <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
        <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
        <dc:creator>Will Provost</dc:creator>
        <dc:date>2002-12-04</dc:date>
      </item>
      <item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html">
        <title>The .NET Schema Object Model</title>
        <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
        <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
        <dc:creator>Priya Lakshminarayanan</dc:creator>
        <dc:date>2002-12-04</dc:date>
      </item>
      <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
        <title>SVG's Past and Promising Future</title>
        <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
        <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
        <dc:creator>Antoine Quint</dc:creator>
        <dc:date>2002-12-04</dc:date>
      </item>
    </rdf:RDF>

Quite a bit more verbose. People familiar with RDF will recognize this as an XML serialization of an RDF document; the rest of the world will at least recognize that we're syndicating essentially the same information. In fact, we're including a bit more information: item-level authors and publishing dates, which RSS 0.91 does not support.

Despite being RDF/XML, RSS 1.0 is structurally similar to previous versions of RSS -- similar enough that we can simply treat it as XML and write a single function to extract information out of either an RSS 0.91 or RSS 1.0 feed. However, there are some significant differences that our code will need to be aware of:

The root element is rdf:RDF instead of rss. We'll either need to handle both explicitly or just ignore the name of the root element altogether and blindly look for useful information inside it.
RSS 1.0 uses namespaces extensively. The RSS 1.0 namespace is http://purl.org/rss/1.0/, and it's defined as the default namespace. The feed also uses http://www.w3.org/1999/02/22-rdf-syntax-ns# for the RDF-specific elements (which we'll simply be ignoring for our purposes) and http://purl.org/dc/elements/1.1/ (Dublin Core) for the additional metadata of article authors and publishing dates.

We can go in one of two ways here: if we don't have a namespace-aware XML parser, we can blindly assume that the feed uses the standard prefixes and default namespace and look for item elements and dc:creator elements within them. This will actually work in a large number of real-world cases; most RSS feeds use the default namespace and the same prefixes for common modules like Dublin Core. This is a horrible hack, though. There's no guarantee that a feed won't use a different prefix for a namespace (which would be perfectly valid XML and RDF). If or when it does, we'll miss it.

If we have a namespace-aware XML parser at our disposal, we can construct a more elegant solution that handles both RSS 0.91 and 1.0 feeds. We can look for items in no namespace; if that fails, we can look for items in the RSS 1.0 namespace. (Not shown: RSS 0.90 feeds also use a namespace, though not the same one as RSS 1.0. So what we really need is a list of namespaces to search.)
Less obvious but still important, the item elements are outside the channel element. (In RSS 0.91, the item elements were inside the channel. In RSS 0.90, they were outside; in RSS 2.0, they're inside. Whee.) So we can't be picky about where we look for items.
Finally, you'll notice there is an extra items element within the channel. It's only useful to RDF parsers, so we're going to ignore it and assume that the order of the items within the RSS feed is given by the order of the item elements.
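To see why the prefix-based shortcut is fragile, here is a small sketch. The feed fragment is hypothetical: it is a cut-down RSS 1.0 item that binds the Dublin Core namespace to the prefix dublin instead of dc, which is perfectly legal. The variable names are assumptions for the example.

```python
from xml.dom import minidom

DC = 'http://purl.org/dc/elements/1.1/'

# Hypothetical fragment: Dublin Core is bound to "dublin" rather than "dc".
FEED = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dublin="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
    <dublin:creator>Antoine Quint</dublin:creator>
  </item>
</rdf:RDF>"""

doc = minidom.parseString(FEED)

# Prefix-based lookup matches the literal name "dc:creator", so it misses
# the element entirely in this feed.
byPrefix = doc.getElementsByTagName('dc:creator')

# Namespace-aware lookup matches on namespace URI plus local name,
# whatever prefix the feed happens to use.
byNamespace = doc.getElementsByTagNameNS(DC, 'creator')
author = byNamespace[0].firstChild.data
```

Here byPrefix comes back empty while byNamespace finds the author, which is the whole argument for searching by namespace.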

But what about RSS 2.0? Luckily, once we've written code to handle RSS 0.91 and 1.0, RSS 2.0 is a piece of cake. Here's the RSS 2.0 version of the same feed:

    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <channel>
        <title>XML.com</title>
        <link>http://www.xml.com/</link>
        <description>XML.com features a rich mix of information and services for the XML community.</description>
        <language>en-us</language>
        <item>
          <title>Normalizing XML, Part 2</title>
          <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
          <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
          <dc:creator>Will Provost</dc:creator>
          <dc:date>2002-12-04</dc:date>
        </item>
        <item>
          <title>The .NET Schema Object Model</title>
          <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
          <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
          <dc:creator>Priya Lakshminarayanan</dc:creator>
          <dc:date>2002-12-04</dc:date>
        </item>
        <item>
          <title>SVG's Past and Promising Future</title>
          <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
          <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
          <dc:creator>Antoine Quint</dc:creator>
          <dc:date>2002-12-04</dc:date>
        </item>
      </channel>
    </rss>

As this example shows, RSS 2.0 uses namespaces like RSS 1.0, but it's not RDF. Like RSS 0.91, there is no default namespace and items are back inside the channel. If our code is liberal enough to handle the differences between RSS 0.91 and 1.0, RSS 2.0 should not present any additional wrinkles.
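If we ever do want to know which flavor of RSS we are looking at, the root element and its namespaces are enough for a rough guess. The following is a sketch only: guessVersion is a hypothetical helper, and it assumes the feed follows the conventions shown in the samples above (an rss root with a version attribute, or an rdf:RDF root whose channel lives in the RSS 1.0 or RSS 0.90 namespace).

```python
from xml.dom import minidom

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
RSS10_NS = 'http://purl.org/rss/1.0/'
RSS090_NS = 'http://my.netscape.com/rdf/simple/0.9/'

def guessVersion(rssDocument):
    """Hypothetical helper: return a best-guess version string, or None."""
    root = rssDocument.documentElement
    # RSS 0.91-0.94 and 2.0: un-namespaced <rss> root with a version attribute.
    if root.namespaceURI is None and root.tagName == 'rss':
        return root.getAttribute('version') or None
    # RSS 0.90 and 1.0: RDF root; tell them apart by the channel's namespace.
    if root.namespaceURI == RDF_NS and root.localName == 'RDF':
        if root.getElementsByTagNameNS(RSS10_NS, 'channel'):
            return '1.0'
        if root.getElementsByTagNameNS(RSS090_NS, 'channel'):
            return '0.90'
    return None

sample = minidom.parseString('<rss version="2.0"><channel/></rss>')
version = guessVersion(sample)  # '2.0'
```

The parsing code in the rest of this article does not need this; it deliberately ignores the root element and just searches for items.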

How can I read RSS?

Now let's get down to actually reading these sample RSS feeds from Python. The first thing we'll need to do is download some RSS feeds. This is simple in Python; most distributions come with both a URL retrieval library and an XML parser. (Note to Mac OS X 10.2 users: your copy of Python does not come with an XML parser; you will need to install PyXML first.)

    from xml.dom import minidom
    import urllib

    def load(rssURL):
        return minidom.parse(urllib.urlopen(rssURL))

This takes the URL of an RSS feed and returns a parsed representation of the DOM, as native Python objects.
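To get a feel for what "a parsed representation of the DOM" means without touching the network, we can parse a feed held in a string. This is a sketch: loadString is a hypothetical counterpart to load, and SAMPLE is a trimmed copy of the RSS 0.91 feed shown earlier.

```python
from xml.dom import minidom

# Trimmed copy of the sample RSS 0.91 feed.
SAMPLE = """<rss version="0.91">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
    </item>
    <item>
      <title>The .NET Schema Object Model</title>
      <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
    </item>
  </channel>
</rss>"""

def loadString(rssText):
    """Hypothetical helper: parse feed text instead of fetching a URL."""
    return minidom.parseString(rssText)

rssDocument = loadString(SAMPLE)
root = rssDocument.documentElement

rootName = root.tagName                                    # 'rss'
feedVersion = root.getAttribute('version')                 # '0.91'
itemCount = len(rssDocument.getElementsByTagName('item'))  # 2
```

The document object, its root element, and the list of item elements are all ordinary Python objects that we can inspect and loop over.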

The next bit is the tricky part. To compensate for the differences in RSS formats, we'll need a function that searches for specific elements in any number of namespaces. Python's XML library includes a getElementsByTagNameNS method, which takes a namespace and a tag name, so we'll use that to make our code general enough to handle RSS 0.91 through 0.94 and 2.0 (which have no default namespace), RSS 1.0, and even RSS 0.90. This function will find all elements with a given name, anywhere within a node. That's a good thing; it means that we can search for item elements within the root node and always find them, whether they are inside or outside the channel element.

    DEFAULT_NAMESPACES = (
        None,  # RSS 0.91, 0.92, 0.93, 0.94, 2.0
        'http://purl.org/rss/1.0/',  # RSS 1.0
        'http://my.netscape.com/rdf/simple/0.9/',  # RSS 0.90
    )

    def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
        for namespace in possibleNamespaces:
            children = node.getElementsByTagNameNS(namespace, tagName)
            if len(children):
                return children
        return []
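A quick way to convince ourselves that this works regardless of namespace or item placement is to run it against two cut-down feeds. The sketch below repeats the function so that it is self-contained; the two feed strings are trimmed versions of the samples above, and the variable names are assumptions for the example.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (
    None,  # RSS 0.91, 0.92, 0.93, 0.94, 2.0
    'http://purl.org/rss/1.0/',  # RSS 1.0
    'http://my.netscape.com/rdf/simple/0.9/',  # RSS 0.90
)

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children):
            return children
    return []

# Items inside the channel, no namespace (RSS 0.91 style).
RSS091 = """<rss version="0.91"><channel><title>XML.com</title>
<item><title>Normalizing XML, Part 2</title></item>
<item><title>The .NET Schema Object Model</title></item>
</channel></rss>"""

# Items outside the channel, in the RSS 1.0 default namespace.
RSS10 = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.xml.com/cs/xml/query/q/19"><title>XML.com</title></channel>
<item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html"><title>Normalizing XML, Part 2</title></item>
<item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html"><title>The .NET Schema Object Model</title></item>
</rdf:RDF>"""

counts = [len(getElementsByTagName(minidom.parseString(feed), 'item'))
          for feed in (RSS091, RSS10)]
# counts is [2, 2]: both layouts yield the same two items.
```

Note that the function stops at the first namespace that produces a match, so a feed is assumed to keep all of its items in a single namespace.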

Finally, we need two utility functions to make our lives easier. First, our getElementsByTagName function will return a list of elements, but most of the time we know there's only going to be one. An item only has one title, one link, one description, and so on. We'll define a first function that returns the first element of a given name (again, searching across several different namespaces). Second, Python's XML libraries are great at parsing an XML document into nodes, but not that helpful at putting the data back together again. We'll define a textOf function that returns the entire text of a particular XML element.

    def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
        children = getElementsByTagName(node, tagName, possibleNamespaces)
        return len(children) and children[0] or None

    def textOf(node):
        return node and "".join([child.data for child in node.childNodes]) or ""
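The two helpers are designed to chain: first returns None when an element is missing, and textOf turns None into an empty string. The sketch below repeats the definitions so that it stands alone; the item fragment is a trimmed version of one of the sample items and deliberately has no description.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (
    None,  # RSS 0.91, 0.92, 0.93, 0.94, 2.0
    'http://purl.org/rss/1.0/',  # RSS 1.0
    'http://my.netscape.com/rdf/simple/0.9/',  # RSS 0.90
)

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children):
            return children
    return []

def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    children = getElementsByTagName(node, tagName, possibleNamespaces)
    return len(children) and children[0] or None

def textOf(node):
    return node and "".join([child.data for child in node.childNodes]) or ""

# Trimmed sample item with a title and link but no description.
ITEM = """<item>
  <title>SVG's Past and Promising Future</title>
  <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
</item>"""

item = minidom.parseString(ITEM).documentElement

title = textOf(first(item, 'title'))          # the title text
missing = first(item, 'description')          # None: no such element
missingText = textOf(missing)                 # "": safe to print as-is
```

This is why the main loop can ask every item for a date and an author without checking first whether the feed supplies them.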

That's it. The actual parsing is easy. We'll take a URL on the command line, download it, parse it, get the list of items, and then get some useful information from each item:

    DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)

    if __name__ == '__main__':
        import sys
        rssDocument = load(sys.argv[1])
        for item in getElementsByTagName(rssDocument, 'item'):
            print 'title:', textOf(first(item, 'title'))
            print 'link:', textOf(first(item, 'link'))
            print 'description:', textOf(first(item, 'description'))
            print 'date:', textOf(first(item, 'date', DUBLIN_CORE))
            print 'author:', textOf(first(item, 'creator', DUBLIN_CORE))
            print

Running it with our sample RSS 0.91 feed prints only title, link, and description (since the feed didn't include any other information on dates or authors):

    $ python rss1.py http://www.xml.com/2002/12/18/examples/rss091.xml.txt
    title: Normalizing XML, Part 2
    link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
    description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
    date:
    author:

    title: The .NET Schema Object Model
    link: http://www.xml.com/pub/a/2002/12/04/som.html
    description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
    date:
    author:

    title: SVG's Past and Promising Future
    link: http://www.xml.com/pub/a/2002/12/04/svg.html
    description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
    date:
    author:

For both the sample RSS 1.0 feed and sample RSS 2.0 feed, we also get dates and authors for each item. We reuse our custom getElementsByTagName function, but pass in the Dublin Core namespace and appropriate tag name. We could reuse this same function to extract information from any of the basic RSS modules. (There are a few advanced modules specific to RSS 1.0 that would require a full RDF parser, but they are not widely deployed in public RSS feeds.)
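As an illustration of reusing the same helpers for another module, here is a sketch that reads update hints from the RSS 1.0 Syndication module. The helpers are repeated so the example stands alone. The feed fragment and its values (hourly, 2) are made up for the example; only the namespace URI and element names come from the module itself.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (
    None,  # RSS 0.91, 0.92, 0.93, 0.94, 2.0
    'http://purl.org/rss/1.0/',  # RSS 1.0
    'http://my.netscape.com/rdf/simple/0.9/',  # RSS 0.90
)
SYNDICATION = ('http://purl.org/rss/1.0/modules/syndication/',)

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children):
            return children
    return []

def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    children = getElementsByTagName(node, tagName, possibleNamespaces)
    return len(children) and children[0] or None

def textOf(node):
    return node and "".join([child.data for child in node.childNodes]) or ""

# Hypothetical channel carrying Syndication module elements.
FEED = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
  <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
    <title>XML.com</title>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>2</sy:updateFrequency>
  </channel>
</rdf:RDF>"""

rssDocument = minidom.parseString(FEED)
channel = first(rssDocument, 'channel')

period = textOf(first(channel, 'updatePeriod', SYNDICATION))        # 'hourly'
frequency = textOf(first(channel, 'updateFrequency', SYNDICATION))  # '2'
```

The only thing that changes from the Dublin Core case is the namespace tuple and the tag name we pass in.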

Here's the output against our sample RSS 1.0 feed:

    $ python rss1.py http://www.xml.com/2002/12/18/examples/rss10.xml.txt
    title: Normalizing XML, Part 2
    link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
    description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
    date: 2002-12-04
    author: Will Provost

    title: The .NET Schema Object Model
    link: http://www.xml.com/pub/a/2002/12/04/som.html
    description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
    date: 2002-12-04
    author: Priya Lakshminarayanan

    title: SVG's Past and Promising Future
    link: http://www.xml.com/pub/a/2002/12/04/svg.html
    description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
    date: 2002-12-04
    author: Antoine Quint

Running against our sample RSS 2.0 feed produces the same results.

This technique will handle about 90% of the RSS feeds out there; the rest are ill-formed in a variety of interesting ways, mostly caused by non-XML-aware publishing tools building feeds out of templates and not respecting basic XML well-formedness rules. Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML.
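In the meantime, we can at least detect an ill-formed feed instead of crashing on it. This is a minimal sketch, not a fix for the problem: parseOrNone is a hypothetical helper, and the broken feed is a made-up example of a common template mistake, an unescaped ampersand.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def parseOrNone(rssText):
    """Hypothetical helper: return a parsed document, or None if the
    text is not well-formed XML."""
    try:
        return minidom.parseString(rssText)
    except ExpatError:
        return None

# Well-formed: the ampersand is escaped.
GOOD = '<rss version="0.91"><channel><title>Q&amp;A</title></channel></rss>'

# Ill-formed: a bare ampersand, as a careless template might emit.
BAD = '<rss version="0.91"><channel><title>Q&A</title></channel></rss>'

goodDocument = parseOrNone(GOOD)  # a parsed document
badDocument = parseOrNone(BAD)    # None
```

An aggregator could use a check like this to skip a broken feed and move on to the next one.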
