Developing Feeds with RSS and Atom. Chapter 8: Parsing and Using Feeds | WebReference

Written by Ben Hammersley

This content is excerpted from Chapter 8 of the new book, "Developing Feeds with RSS and Atom", authored by Ben Hammersley, published by O'Reilly, copyright 2005. To learn more, please visit: http://www.oreilly.com/catalog/deveoprssatom/index.html.

Parsing for Programming

Displaying a feed on a web page is important, no doubt about it, but it's not going to excite anyone. For that, you need to be able to parse feeds inside your own programs. In this section, we'll look at the two major alternatives, MagpieRSS and the Universal Feed Parser. Both parsers are libraries; both convert feeds into native data structures; and neither cares whether a feed is RSS 1.0, RSS 2.0, or Atom. That, really, is the final word on the Great Battle of the Standards: most of the time, at a programmatic level, no one cares.
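
To make the "native data structures" point concrete, here is a minimal sketch, in Python and using only the standard library, of how a parser can hide the differences between formats. It is not how MagpieRSS or the Universal Feed Parser work internally. The names parse_feed and read_item and the two sample feeds are made up for illustration, and real feeds need far more care than this.

```python
import xml.etree.ElementTree as ET

# Two tiny sample feeds (invented for this sketch) carrying the same content.
RSS_20 = """<rss version="2.0"><channel>
  <title>Example</title>
  <item><title>First post</title><link>http://example.org/1</link></item>
</channel></rss>"""

ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry><title>First post</title><link href="http://example.org/1"/></entry>
</feed>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def read_item(element):
    """Turn one <item> or <entry> element into a plain dict."""
    item = {"title": None, "link": None}
    for child in element:
        name = local_name(child.tag)
        if name == "title":
            item["title"] = (child.text or "").strip()
        elif name == "link" and item["link"] is None:
            # RSS puts the URL in the element text; Atom uses an href attribute.
            item["link"] = child.get("href") or (child.text or "").strip()
    return item


def parse_feed(xml_text):
    """Return {'title': ..., 'items': [...]} for RSS 1.0, RSS 2.0, or Atom."""
    root = ET.fromstring(xml_text)
    # RSS keeps feed metadata in <channel>; Atom keeps it on the root <feed>.
    channel = next(
        (e for e in root.iter() if local_name(e.tag) == "channel"), root
    )
    title = next(
        (c.text for c in channel if local_name(c.tag) == "title"), None
    )
    items = [
        read_item(e)
        for e in root.iter()
        if local_name(e.tag) in ("item", "entry")
    ]
    return {"title": title, "items": items}


if __name__ == "__main__":
    # Both formats come out as the same dictionary.
    print(parse_feed(RSS_20))
    print(parse_feed(ATOM))
```

Once the feed is a dictionary, the calling code no longer needs to know which format it started out as.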

PHP: MagpieRSS

The most popular parser in PHP, and arguably the most popular in use on the Web right now, is Kellan Elliott-McCrea's MagpieRSS. As I write this, it stands at version 0.7, a low number indicative of modesty rather than product immaturity. MagpieRSS is a very refined product indeed.

To use MagpieRSS, first download the latest build from its web page at http://sourceforge.net/projects/magpierss/. There is also a weblog at http://laughingmeme.org/magpie_blog/.

Once downloaded, you're presented with a load of READMEs and example scripts, plus five include files:

  • rss_fetch.inc is the library you call from scripts. It deals with retrieving the feed, and marshals the other files into parsing it, before returning the results to your code.

  • rss_parse.inc deals with the nitty-gritty of feed parsing. MagpieRSS is a liberal parser, which means it doesn't validate the feed it is given. It can also deal with any arbitrarily invented element as long as it follows the right sort of format, which makes it quite future-proof.

  • rss_cache.inc lets you make rss_fetch.inc cache feeds instead of continually requesting new ones.

  • rss_utils.inc currently contains only one internal function, which converts a W3CDTF standard date to Unix epoch time.

  • extlib/Snoopy.class.inc provides the network support for the other included functions.
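
The idea behind caching feeds, as rss_cache.inc does, can be sketched in a few lines. The following is a minimal in-memory illustration in Python, not a description of how rss_cache.inc is implemented. The FeedCache class, its fetch callable, and the one-hour default are all assumptions made for this example.

```python
import time


class FeedCache:
    """Return a stored copy of a feed until it is older than max_age seconds."""

    def __init__(self, fetch, max_age=3600, clock=time.time):
        self._fetch = fetch      # hypothetical callable: url -> feed text
        self._max_age = max_age  # assumed default of one hour
        self._clock = clock      # injectable so the cache can be tested
        self._store = {}         # url -> (time_fetched, feed_text)

    def get(self, url):
        now = self._clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._max_age:
            # Still fresh: spare the publisher's server another request.
            return cached[1]
        text = self._fetch(url)
        self._store[url] = (now, text)
        return text
```

A script that polls a feed every few minutes then hits the network only when the stored copy has expired, which is what keeps feed publishers happy.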

To install these include files, place them in the same directory as the script that is going to use them.
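
The date conversion that rss_utils.inc performs, from a W3CDTF date to Unix epoch time, is worth seeing once. Here is a rough sketch of the same idea in Python. The function name w3cdtf_to_epoch is made up for this example, it handles only the full-date and date-plus-time forms of W3CDTF, and treating a date with no time zone as UTC is an assumption of the sketch rather than documented MagpieRSS behavior.

```python
from datetime import datetime, timezone


def w3cdtf_to_epoch(value):
    """Convert a W3CDTF string such as '2005-06-03T12:30:00Z' to epoch seconds."""
    text = value.strip()
    if text.endswith("Z"):
        # Older Python versions do not accept a literal 'Z' for UTC.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: a value with no time zone designator is taken as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())
```

Epoch seconds are handy because they sort and subtract like ordinary numbers, so ordering items by date or checking how old one is becomes simple arithmetic.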

Using MagpieRSS

MagpieRSS is simple to use and comes well-documented. Included in the distribution is an example script called magpie_simple.php. It looks like Example 8-3.

Example 8-3. magpie_simple.php

<?php
define('MAGPIE_DIR', '../');
require_once(MAGPIE_DIR.'rss_fetch.inc');
$url = $_GET['url'];
if ( $url ) {
        $rss = fetch_rss( $url );
        
        echo "Channel: " . $rss->channel['title'] . "<p>";
        echo "<ul>";
        foreach ($rss->items as $item) {
                $href = $item['link'];
                $title = $item['title'];        
                echo "<li><a href=$href>$title</a></li>";
        }
        echo "</ul>";
}
?>
<form>
        RSS URL: <input type="text" size="30" name="url" value="<?php echo $url ?>"><br />
        <input type="submit" value="Parse RSS">
</form>

Running this on my own weblog's RSS 1.0 feed produces a page that looks like Figure 8-3.


Figure 8-3. A very basic display using MagpieRSS

As you can see, it's very straightforward. Taken line by line, the meat of the script goes like this:

define('MAGPIE_DIR', '../');
require_once(MAGPIE_DIR.'rss_fetch.inc');
$url = $_GET['url'];

Here, you tell PHP where Magpie's files are kept (in this case, the parent directory of the script), load the rss_fetch.inc library, and place the url parameter from the query string, as a string, into the variable $url. The next lines do the real work:

if ( $url ) {
        $rss = fetch_rss( $url );
        
        echo "Channel: " . $rss->channel['title'] . "<p>";
        echo "<ul>";

If a URL was supplied, you pass the contents of $url to fetch_rss(), which retrieves and parses the feed, and then print out a headline for the web page, containing the <title> of the <channel> and the start of an HTML list. (The HTML in this example isn't very compliant, but no matter.)

The rest is easy to follow. It loops over the items in the feed and, for each one, creates a link inside an HTML list element. Looping through the 15 or so items in a feed like this is very typical:

foreach ($rss->items as $item) {
        $href = $item['link'];
        $title = $item['title'];        
        echo "<li><a href=$href>$title</a></li>";
}

Once that's done, you can close off the list and get on with other things:

echo "</ul>";
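
One caveat about the example: it writes titles and links from the feed straight into the page, unquoted and unescaped, and feed content comes from someone else's server. In PHP, the usual remedy is to pass each value through htmlspecialchars() before echoing it. The same loop with escaping looks roughly like this in Python. The render_feed function and the dictionary layout are made up for this sketch and are not part of MagpieRSS.

```python
from html import escape


def render_feed(channel_title, items):
    """Build the same heading and list as the example, escaping every value."""
    lines = ["<p>Channel: " + escape(channel_title) + "</p>", "<ul>"]
    for item in items:
        # escape() also converts quotes, so the value is safe inside href="...".
        href = escape(item["link"], quote=True)
        title = escape(item["title"])
        lines.append('<li><a href="' + href + '">' + title + "</a></li>")
    lines.append("</ul>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [{"link": "http://example.org/?a=1&b=2", "title": "Tom & Jerry"}]
    print(render_feed("An example feed", sample))
```

Escaping stops markup in a title from breaking the page, but it does not check where a link actually points, so a script that republishes feeds from unknown sources may also want to check the URL scheme.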

Python: The Universal Feed Parser

Mark Pilgrim's Universal Feed Parser, hosted at http://sourceforge.net/projects/feedparser/, is perhaps the best feed application ever written. It is incredibly well done and magnificently well documented. Furthermore, it is released under the GPL and comes with over 2,000 unit tests. Those unit tests alone would save anyone writing their own parser months of screaming; then again, why would you write one when the UFP already exists?

It's well-documented, so the following sections will serve only to demonstrate its power.


Created: March 27, 2003
Revised: June 3, 2005

URL: http://webreference.com/programing/rss_atom/1