Developing Feeds with RSS and Atom. Chapter 8: Parsing and Using Feeds
Example 8-5. Using XML::Simple to parse RSS
#!/usr/local/bin/perl
use strict;
use warnings;
use LWP::Simple;
use XML::Simple;
my $url=$ARGV[0];
# Retrieve the feed, or die gracefully
my $feed_to_parse = get ($url) or die "I can't get the feed you want";
# Parse the XML
my $parser = XML::Simple->new( );
my $rss = $parser->XMLin("$feed_to_parse");
# Decide on name for outputfile
my $outputfile = "$rss->{'channel'}->{'title'}.html";
# Replace any spaces within the title with an underscore
$outputfile =~ s/ /_/g;
# Open the output file
open (OUTPUTFILE, ">$outputfile") or die "Can't open $outputfile for writing: $!";
# Print the Channel Title
print OUTPUTFILE '<div class="channelLink">'."\n".'<a href="';
print OUTPUTFILE "$rss->{'channel'}->{'link'}".'">';
print OUTPUTFILE "$rss->{'channel'}->{'title'}</a>\n</div>\n";
# Print the channel items
print OUTPUTFILE '<div class="linkentries">'."\n"."<ul>";
print OUTPUTFILE "\n";
foreach my $item (@{$rss->{channel}->{'item'}}) {
next unless defined($item->{'title'}) && defined($item->{'link'});
print OUTPUTFILE '<li><a href="';
print OUTPUTFILE "$item->{'link'}";
print OUTPUTFILE '">';
print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
foreach my $item (@{$rss->{'item'}}) {
next unless defined($item->{'title'}) && defined($item->{'link'});
print OUTPUTFILE '<li><a href="';
print OUTPUTFILE "$item->{'link'}";
print OUTPUTFILE '">';
print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
print OUTPUTFILE "</ul>\n</div>\n";
# Close the OUTPUTFILE
close (OUTPUTFILE);
This script highlights various issues regarding the parsing of RSS, so it is worth dissecting closely. Start with the opening statements:
#!/usr/local/bin/perl
use strict;
use warnings;
use LWP::Simple;
use XML::Simple;
my $url=$ARGV[0];
# Retrieve the feed, or die gracefully
my $feed_to_parse = get ($url) or die "I can't get the feed you want";
This is nice and standard Perl: the usual use strict;
and use warnings; for good programming karma. Next,
load the two necessary modules: XML::Simple (which you've been
introduced to) and LWP::Simple, which retrieves the RSS feed from the
remote server. The script then takes the first command-line argument as
the URL of the feed you want to parse, fetches it, and places the entire feed in the scalar
$feed_to_parse, ready for the next section of the script:
# Parse the XML
my $parser = XML::Simple->new( );
my $rss = $parser->XMLin("$feed_to_parse");
This section creates a new instance of the XML::Simple module
and calls the newly initialized object $parser. It then reads the
retrieved RSS feed and parses it into a tree, with the root of the tree called
$rss. This tree is actually a set of nested hashes, with the element names
as hash keys. In other words, you can do this:
# Decide on name for outputfile
my $outputfile = "$rss->{'channel'}->{'title'}.html";
# Replace any spaces within the title with an underscore
$outputfile =~ s/ /_/g;
# Open the output file
open (OUTPUTFILE, ">$outputfile") or die "Can't open $outputfile for writing: $!";
Here, you take the value of the title element within the
channel, add the string .html, and make it the value
of $outputfile. This is done for a simple reason: I wanted to make
the user interface to this script as simple as possible. You can change it to
let the user supply the output filename, but I like the script to
work one out automatically from the title element. Of course, many
title elements use spaces, which makes a nasty mess of filenames, so you can use
a regular expression to replace spaces with underscores. Now open up the file
handle, creating the file if necessary (and overwriting any existing file of the same name).
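Spaces are not the only characters in a title that can upset a filename; a slash or a colon can do worse. As a minimal sketch, assuming you are happy to discard anything outside a conservative character set, the filename step could be written as follows. The helper name title_to_filename and the sample title are invented for illustration and are not part of the original script; unlike the original, this version also collapses runs of whitespace into a single underscore.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a feed title into a conservative filename.
sub title_to_filename {
    my ($title) = @_;
    $title = 'feed' unless defined $title && length $title;
    $title =~ s/\s+/_/g;             # runs of whitespace become one underscore
    $title =~ s/[^A-Za-z0-9_.-]//g;  # drop slashes, colons, and other awkward characters
    $title =~ s/^\.+//;              # no leading dots, so no hidden files or ".."
    $title = 'feed' unless length $title;
    return "$title.html";
}

# Sample title made up for this sketch.
my $outputfile = title_to_filename('Ben Hammersley: Dangerous/Precedent');
print "$outputfile\n";

# Three-argument open with a lexical handle, reporting the reason on failure.
open(my $out, '>', $outputfile) or die "Can't write $outputfile: $!";
print {$out} "<!-- output goes here -->\n";
close($out) or die "Can't close $outputfile: $!";
```

With the sample title above, the script prints Ben_Hammersley_DangerousPrecedent.html and creates a file of that name in the current directory.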
With a file ready for filling, and an RSS feed parsed in memory, let's fill in some of the rest:
# Print the Channel Title
print OUTPUTFILE '<div class="channelLink">'."\n".'<a href="';
print OUTPUTFILE "$rss->{'channel'}->{'link'}".'">';
print OUTPUTFILE "$rss->{'channel'}->{'title'}</a>\n</div>\n";
Here, you start to make the XHTML version. Take the link and
title elements from the channel and create a title
that is a hyperlink to the destination of the feed. Wrap it in a
div, so you can format it later with CSS, and include some
newlines to make the XHTML source as pretty as can be:
# Print the channel items
print OUTPUTFILE '<div class="linkentries">'."\n"."<ul>";
print OUTPUTFILE "\n";
foreach my $item (@{$rss->{channel}->{'item'}}) {
next unless defined($item->{'title'}) && defined($item->{'link'});
print OUTPUTFILE '<li><a href="';
print OUTPUTFILE "$item->{'link'}";
print OUTPUTFILE '">';
print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
foreach my $item (@{$rss->{'item'}}) {
next unless defined($item->{'title'}) && defined($item->{'link'});
print OUTPUTFILE '<li><a href="';
print OUTPUTFILE "$item->{'link'}";
print OUTPUTFILE '">';
print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
print OUTPUTFILE "</ul>\n</div>\n";
# Close the OUTPUTFILE
close (OUTPUTFILE);
The last section of the script deals with the biggest issue for all RSS
parsing: the differences between RSS 2.0 and RSS 1.0. With
XML::Simple, or any other tree-based parser, this is especially
crucial, because the item element appears in a different place in each
specification. Remember: in RSS 2.0, item is a subelement of
channel, but in RSS 1.0, item and channel are siblings, both sitting directly under the root element.
So, in the preceding snippet you can see two foreach loops. The
first one takes care of RSS 2.0 feeds, and the second covers RSS 1.0. Either
way, the items are encased inside another div and rendered as a
ul unordered list. The script finishes by closing the file handle.
Our work is done.
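The two foreach loops can also be folded into one by looking for item in both places first. The sketch below is one way to do that, not the book's method: the helper feed_items is a name invented here, and the two feeds are hand-built hashes standing in for what XML::Simple might return, so the code runs without fetching anything. It also covers a case the original loops do not, a feed with a single item, which may arrive as one hash reference rather than an array reference.

```perl
use strict;
use warnings;

# Hypothetical helper: return the feed's items as a list, wherever they live.
sub feed_items {
    my ($rss) = @_;
    my $items;
    # RSS 2.0: item is nested inside channel.
    $items = $rss->{channel}{item} if ref($rss->{channel}) eq 'HASH';
    # RSS 1.0: item sits beside channel, directly under the root.
    $items = $rss->{item} unless defined $items;
    return ()      unless defined $items;
    return @$items if ref($items) eq 'ARRAY';
    return ($items);    # a lone item may arrive as a single hash reference
}

# Hand-built stand-ins for parsed feeds; titles and URLs are made up.
my $rss2 = {
    channel => {
        title => 'An RSS 2.0 feed',
        item  => [
            { title => 'First',  link => 'http://example.org/1' },
            { title => 'Second', link => 'http://example.org/2' },
        ],
    },
};
my $rss1 = {
    channel => { title => 'An RSS 1.0 feed' },
    item    => { title => 'Only', link => 'http://example.org/only' },
};

for my $feed ($rss2, $rss1) {
    for my $item (feed_items($feed)) {
        next unless defined $item->{title} && defined $item->{link};
        print qq{<li><a href="$item->{link}">$item->{title}</a></li>\n};
    }
}
```

XML::Simple's ForceArray option (for example, ForceArray => ['item'] passed to XMLin) is another way to make sure item is always an array reference.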
Running the script from the command line, with the RSS feed from http://rss.benhammersley.com/index.xml, produces the result shown in Example 8-6.
You can then include this inside another page using server-side inclusion (described later in this chapter).
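Because the generated fragment ends up inside another XHTML page, it is worth remembering that the parser hands back decoded text: an ampersand that was written as an entity in the feed is a bare & by the time it reaches your print statement. A minimal sketch of an escaping step, assuming you want titles and links treated as plain text, might look like this. The helper escape_html and the sample item are invented for illustration; whether to escape description, which often carries markup on purpose, is a separate decision.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XHTML.
sub escape_html {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/&/&amp;/g;    # must come first, or the entities below get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Sample item made up for this sketch.
my $item = {
    title => 'Cats & "Dogs" <live>',
    link  => 'http://example.org/?a=1&b=2',
};

my $li = sprintf '<li><a href="%s">%s</a></li>',
    escape_html($item->{link}), escape_html($item->{title});
print "$li\n";
```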
After all this detailing of additional elements, I hear you cry, where are
they? Well, including extra elements in a script of this sort is rather simple.
Here I've taken another look at the second foreach loop from the
previous example. The new lines are the ones that handle dc:creator and description:
foreach my $item (@{$rss->{'item'}}) {
next unless defined($item->{'title'}) && defined($item->{'link'});
print OUTPUTFILE '<li><a href="';
print OUTPUTFILE "$item->{'link'}";
print OUTPUTFILE '">';
print OUTPUTFILE "$item->{'title'}</a>";
if ($item->{'dc:creator'}) {
print OUTPUTFILE '<span class="dccreator">Written by ';
print OUTPUTFILE "$item->{'dc:creator'}";
print OUTPUTFILE '</span>';
}
print OUTPUTFILE "<ol><blockquote>$item->{'description'}</blockquote></ol>";
print OUTPUTFILE "\n</li>\n";
}
This section now looks inside the RSS feed for a dc:creator
element and displays it if it finds one. It also retrieves the contents of the
description element and displays it as a nested item in the list.
You might want to change this formatting, obviously.
By repeating the dc:creator test, it is easy to add support for different
elements as you see fit, and it's also simple to give each new element its own
div or span class to control the onscreen formatting.
For example:
if ($item->{'dc:creator'}) {
print OUTPUTFILE '<span class="dccreator">Written by ';
print OUTPUTFILE "$item->{'dc:creator'}";
print OUTPUTFILE '</span>';
}
if ($item->{'dc:date'}) {
print OUTPUTFILE '<span class="dcdate">Date: ';
print OUTPUTFILE "$item->{'dc:date'}";
print OUTPUTFILE '</span>';
}
if ($item->{'annotate:reference'}) {
print OUTPUTFILE '<span class="annotation"><a href="';
print OUTPUTFILE "$item->{'annotate:reference'}->{'rdf:resource'}";
print OUTPUTFILE '">Comment on this</a></span>';
}
TIP: Most XML parsers found in scripting languages (Perl, Python, etc.) are really interfaces for Expat, the powerful XML parsing library. They therefore require Expat to be installed. Expat is available from http://expat.sourceforge.net/ and is released under the MIT License.
As you can see, the final extension prints the contents of the
annotate:reference element. This element, as mentioned in Chapter 7, carries a
single rdf:resource attribute and nothing else. Note the way I get
XML::Simple to read the attribute. It treats the attribute as just
another leaf on the tree; you call it the same way you would a subelement. You can
use the same syntax for any attribute-only element.
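Once the list of optional elements grows, the repeated if blocks can be driven from a small table instead. The following is a sketch under stated assumptions rather than part of the original script: the table @optional, the helper optional_spans, and the sample item are all invented here, and the item is a hand-built hash shaped the way XML::Simple presents an attribute-only element (a hash keyed by attribute name).

```perl
use strict;
use warnings;

# Each entry: element name, span class, and a code ref that builds the span body.
my @optional = (
    [ 'dc:creator', 'dccreator', sub { "Written by $_[0]" } ],
    [ 'dc:date',    'dcdate',    sub { "Date: $_[0]" } ],
    [ 'annotate:reference', 'annotation',
      sub {
          my ($el) = @_;
          # An attribute-only element arrives as a hash keyed by attribute name.
          return unless ref($el) eq 'HASH' && defined $el->{'rdf:resource'};
          return qq{<a href="$el->{'rdf:resource'}">Comment on this</a>};
      } ],
);

# Hypothetical helper: build the spans for whichever optional elements are present.
sub optional_spans {
    my ($item) = @_;
    my $html = '';
    for my $entry (@optional) {
        my ($name, $class, $render) = @$entry;
        next unless defined $item->{$name};
        my $body = $render->($item->{$name});
        next unless defined $body;
        $html .= qq{<span class="$class">$body</span>};
    }
    return $html;
}

# Sample item made up for this sketch; it has no dc:date, so that span is skipped.
my $item = {
    title                => 'A post',
    'dc:creator'         => 'Ben',
    'annotate:reference' => { 'rdf:resource' => 'http://example.org/comments/1' },
};
my $spans = optional_spans($item);
print "$spans\n";
```

Adding support for another element then means adding one line to the table rather than another if block.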
This content is excerpted from Chapter 8 of the new book, "Developing Feeds with RSS and Atom", authored by Ben Hammersley, published by O'Reilly, copyright 2005. To learn more, please visit: http://www.oreilly.com/catalog/deveoprssatom/index.html.
Created: March 27, 2003
Revised: June 3, 2005
URL: http://webreference.com/programing/rss_atom/1
