Developing Feeds with RSS and Atom. Chapter 8: Parsing and Using Feeds

Example 8-5. Using XML::Simple to parse RSS

#!/usr/local/bin/perl
   
use strict;
use warnings;
   
use LWP::Simple;
use XML::Simple;
   
my $url=$ARGV[0];
   
# Retrieve the feed, or die gracefully
my $feed_to_parse = get ($url) or die "I can't get the feed you want";
   
# Parse the XML
my $parser = XML::Simple->new( );
my $rss = $parser->XMLin("$feed_to_parse");
   
# Decide on name for outputfile
my $outputfile = "$rss->{'channel'}->{'title'}.html";
   
# Replace any spaces within the title with an underscore
$outputfile =~ s/ /_/g;
   
# Open the output file
open (OUTPUTFILE, ">$outputfile") or die "I can't open $outputfile: $!";
   
# Print the Channel Title
print OUTPUTFILE '<div class="channelLink">'."\n".'<a href="';
print OUTPUTFILE "$rss->{'channel'}->{'link'}".'">';
print OUTPUTFILE "$rss->{'channel'}->{'title'}</a>\n</div>\n";
   
# Print the channel items
print OUTPUTFILE '<div class="linkentries">'."\n"."<ul>";
print OUTPUTFILE "\n";
   
foreach my $item (@{$rss->{channel}->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
           
foreach my $item (@{$rss->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
           
print OUTPUTFILE "</ul>\n</div>\n";
  
# Close the OUTPUTFILE
close (OUTPUTFILE);

This script highlights various issues regarding the parsing of RSS, so it is worth dissecting closely. Start with the opening statements:

#!/usr/local/bin/perl
   
use strict;
use warnings;
   
use LWP::Simple;
use XML::Simple;
   
my $url=$ARGV[0];
   
# Retrieve the feed, or die gracefully
my $feed_to_parse = get ($url) or die "I can't get the feed you want";

This is nice, standard Perl: the usual use strict; and use warnings; for good programming karma. Next, load the two necessary modules: XML::Simple (which you've already been introduced to) and LWP::Simple, which retrieves the RSS feed from the remote server. The script then takes the first command-line argument as the URL of the feed you want to parse and places the entire feed in the scalar $feed_to_parse, ready for the next section of the script:

# Parse the XML
my $parser = XML::Simple->new( );
my $rss = $parser->XMLin("$feed_to_parse");

This section fires up a new instance of the XML::Simple module and stores the newly initialized object in $parser. It then reads the retrieved RSS feed and parses it into a tree, with the root of the tree called $rss. This tree is actually a set of nested hashes, with the element names as hash keys. In other words, you can do this:

# Decide on name for outputfile
my $outputfile = "$rss->{'channel'}->{'title'}.html";
   
# Replace any spaces within the title with an underscore
$outputfile =~ s/ /_/g;
   
# Open the output file
open (OUTPUTFILE, ">$outputfile") or die "I can't open $outputfile: $!";

Here, you take the value of the title element within the channel, add the string .html, and make it the value of $outputfile. This is done for a simple reason: I wanted to make the user interface to this script as simple as possible. You can change it to let the user supply the output filename, but I like the script to work one out automatically from the title element. Of course, many title elements contain spaces, which make a nasty mess of filenames, so you can use a regular expression to replace spaces with underscores. Now open up the file handle, creating the file if necessary.
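Spaces are not the only characters in a title that make awkward filenames; slashes, colons, and ampersands turn up too. Here is a minimal sketch of a stricter cleanup. The filename_from_title subroutine and the fallback name feed are my own inventions for illustration, not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a channel title into a tame output filename.
sub filename_from_title {
    my ($title) = @_;

    # Fall back to a fixed name when the feed has no usable title.
    $title = 'feed' unless defined($title) && $title =~ /\S/;

    # Trim leading and trailing whitespace.
    $title =~ s/^\s+|\s+$//g;

    # Replace each run of characters other than letters, digits,
    # dot, underscore, and hyphen with a single underscore.
    $title =~ s/[^A-Za-z0-9._-]+/_/g;

    return "$title.html";
}

print filename_from_title('Content Syndication with XML and RSS'), "\n";
# prints Content_Syndication_with_XML_and_RSS.html

print filename_from_title('AC/DC: News & Views'), "\n";
# prints AC_DC_News_Views.html
```

The whitelist approach is a design choice: it is blunter than escaping individual characters, but it means a feed you don't control can't put a path separator into the filename.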

With a file ready for filling, and an RSS feed parsed in memory, let's fill in some of the rest:

# Print the Channel Title
print OUTPUTFILE '<div class="channelLink">'."\n".'<a href="';
print OUTPUTFILE "$rss->{'channel'}->{'link'}".'">';
print OUTPUTFILE "$rss->{'channel'}->{'title'}</a>\n</div>\n";

Here, you start to make the XHTML version. Take the link and title elements from the channel and create a title that is a hyperlink to the destination of the feed. Wrap it in a div, so you can format it later with CSS, and include some newlines to make the XHTML source as pretty as can be:

# Print the channel items
print OUTPUTFILE '<div class="linkentries">'."\n"."<ul>";
print OUTPUTFILE "\n";
   
foreach my $item (@{$rss->{channel}->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
           
foreach my $item (@{$rss->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
           
print OUTPUTFILE "</ul>\n</div>\n";
  
# Close the OUTPUTFILE
close (OUTPUTFILE);

The last section of the script deals with the biggest issue for all RSS parsing: the differences between RSS 2.0 and RSS 1.0. With XML::Simple, or any other tree-based parser, this is especially crucial, because the item element appears in a different place in each specification. Remember: in RSS 2.0, item is a subelement of channel, but in RSS 1.0, item is a sibling of channel, sitting directly under the root element.

So, in the preceding snippet you can see two foreach loops. The first one takes care of RSS 2.0 feeds, and the second covers RSS 1.0. Either way, the items are encased inside another div and made into a ul (unordered list). The script finishes by closing the file handle. Our work is done.
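One wrinkle the two loops don't cover: when a feed contains only a single item, XML::Simple (with its default options) hands back a hash reference rather than an array reference, and dereferencing that with @{...} fails. Here is a sketch of one way to fold both item locations and both shapes into a single list. The items_of subroutine is a hypothetical helper of mine, and the two data structures are hand-built stand-ins for parsed feeds so the example runs without a network connection:

```perl
use strict;
use warnings;

# Hypothetical helper: return the feed's items as a list of hash refs,
# wherever the parser put them.
sub items_of {
    my ($rss) = @_;

    # RSS 2.0 nests item inside channel; RSS 1.0 puts it beside channel.
    # exists() avoids creating keys in the parsed tree as a side effect.
    my $items;
    if (ref($rss->{channel}) eq 'HASH' && exists $rss->{channel}{item}) {
        $items = $rss->{channel}{item};
    }
    elsif (exists $rss->{item}) {
        $items = $rss->{item};
    }
    return () unless defined $items;

    # A feed with one item yields a single hash ref, not an array ref.
    my @list = ref($items) eq 'ARRAY' ? @$items : ($items);

    # Keep only entries that have both a title and a link.
    return grep {
        ref($_) eq 'HASH' && defined($_->{title}) && defined($_->{link})
    } @list;
}

# Hand-built stand-ins for parsed feeds (assumed shapes, example.org URLs).
my $rss2 = {
    channel => {
        title => 'Two items',
        item  => [
            { title => 'A', link => 'http://example.org/a' },
            { title => 'B', link => 'http://example.org/b' },
        ],
    },
};
my $rss1 = {
    channel => { title => 'One item' },
    item    => { title => 'Only', link => 'http://example.org/only' },
};

my @from_rss2 = items_of($rss2);
my @from_rss1 = items_of($rss1);
print "RSS 2.0 shape: ", scalar(@from_rss2), " item(s)\n";  # prints 2
print "RSS 1.0 shape: ", scalar(@from_rss1), " item(s)\n";  # prints 1
```

If you would rather fix this at parse time, XML::Simple's ForceArray option (for example, ForceArray => ['item']) makes the named element an array reference even when it occurs only once.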

Running this from the command line, with the RSS feed from http://rss.benhammersley.com/index.xml, produces the result shown in Example 8-6.

Example 8-6. Content_Syndication_with_RSS.html

<div class="channelLink">
<a href="http://rss.benhammersley.com/">Content Syndication with XML and RSS</a>
</div>
<div class="linkentries">
<ul>
<li><a href="http://rss.benhammersley.com/archives/001150.html">PHP parsing of RSS</a></li>

<li><a href="http://rss.benhammersley.com/archives/001146.html">RSS for Pocket PC</a></li>

<li><a href="http://rss.benhammersley.com/archives/001145.html">Syndic8 is One</a></li>

<li><a href="http://rss.benhammersley.com/archives/001141.html">RDF mod_events</a></li>

<li><a href="http://rss.benhammersley.com/archives/001140.html">RSS class for cocoa</a></li>

<li><a href="http://rss.benhammersley.com/archives/001131.html">Creative Commons RDF</a></li>

<li><a href="http://rss.benhammersley.com/archives/001129.html">RDF events in Outlook.</a></li>

<li><a href="http://rss.benhammersley.com/archives/001128.html">Reading Online News</a></li>

<li><a href="http://rss.benhammersley.com/archives/001115.html">Hep messaging server</a></li>

<li><a href="http://rss.benhammersley.com/archives/001109.html">mod_link</a></li>

<li><a href="http://rss.benhammersley.com/archives/001107.html">Individual Entries as RSS
1.0</a></li>
<li><a href="http://rss.benhammersley.com/archives/001105.html">RDFMap</a></li>

<li><a href="http://rss.benhammersley.com/archives/001104.html">They're Heeereeee</a></li>

<li><a href="http://rss.benhammersley.com/archives/001077.html">Burton Modules</a></li>

<li><a href="http://rss.benhammersley.com/archives/001076.html">RSS within XHTML documents
UPDATED</a></li>
</ul>
</div>

You can then include this inside another page using server-side inclusion (described later in this chapter).
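Before you drop the output into another page, bear in mind that an XML parser hands you text with the entities already decoded, so a title such as Q&amp;A arrives as Q&A and needs escaping again before it goes into XHTML. The following is a minimal sketch, assuming a hypothetical escape_html subroutine of my own and a made-up item; the HTML::Entities module from CPAN offers a more complete encode_entities function if you prefer not to roll your own:

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XHTML
# text and attribute values.
sub escape_html {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/&/&amp;/g;     # first, so the entities added below survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# A made-up item with awkward characters in both fields.
my %item = (
    title => 'Q&A: RSS <1.0> vs "2.0"',
    link  => 'http://example.org/?a=1&b=2',
);

printf qq{<li><a href="%s">%s</a></li>\n},
    escape_html($item{link}), escape_html($item{title});
# prints <li><a href="http://example.org/?a=1&amp;b=2">Q&amp;A: RSS &lt;1.0&gt; vs &quot;2.0&quot;</a></li>
```

This suits title and link. A description element that deliberately carries markup is a different matter, and you may want to pass it through untouched or filter it separately.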

After all this detailing of additional elements, I hear you cry, where are they? Well, including extra elements in a script of this sort is rather simple. Here I've taken another look at the second foreach loop from the previous example. Notice the lines that handle dc:creator and description:

foreach my $item (@{$rss->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a>";
    if ($item->{'dc:creator'}) {
        print OUTPUTFILE '<span class="dccreator">Written by ';
        print OUTPUTFILE "$item->{'dc:creator'}";
        print OUTPUTFILE '</span>';
    }
    print OUTPUTFILE "<ol><blockquote>$item->{'description'}</blockquote></ol>";

    print OUTPUTFILE "\n</li>\n";
}

This section now looks inside the RSS feed for a dc:creator element and displays it if it finds one. It also retrieves the contents of the description element and displays it as a nested item in the list. You might want to change this formatting, obviously.

By repeating the dc:creator test, it is easy to add support for different elements as you see fit, and it's also simple to give each new element its own div or span class to control the onscreen formatting. For example:

if ($item->{'dc:creator'}) {
    print OUTPUTFILE '<span class="dccreator">Written by ';
    print OUTPUTFILE "$item->{'dc:creator'}";
    print OUTPUTFILE '</span>';
}
if ($item->{'dc:date'}) {
    print OUTPUTFILE '<span class="dcdate">Date: ';
    print OUTPUTFILE "$item->{'dc:date'}";
    print OUTPUTFILE '</span>';
}
if ($item->{'annotate:reference'}) {
    print OUTPUTFILE '<span class="annotation"><a href="';
    print OUTPUTFILE "$item->{'annotate:reference'}->{'rdf:resource'}";
    print OUTPUTFILE '">Comment on this</a></span>';
}
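Once you have more than two or three of these blocks, a small table of element names can save some typing. This is only a sketch of that idea: the @optional table and the optional_spans subroutine are names I've made up, and the item is hand-built rather than parsed. It also skips values that are references, because XML::Simple by default represents an empty element as an empty hash, which a plain truth test would let through:

```perl
use strict;
use warnings;

# Each entry: element name as it appears as a hash key, CSS class, label.
my @optional = (
    [ 'dc:creator', 'dccreator', 'Written by ' ],
    [ 'dc:date',    'dcdate',    'Date: '      ],
);

# Hypothetical helper: build the span markup for whichever optional
# elements this item actually carries.
sub optional_spans {
    my ($item) = @_;
    my $html = '';
    for my $spec (@optional) {
        my ($element, $class, $label) = @$spec;
        my $value = $item->{$element};

        # Skip missing elements, empty strings, and anything that is
        # not plain text (such as the empty hash for an empty element).
        next unless defined($value) && !ref($value) && length($value);

        # Values go in unescaped here to keep the sketch short.
        $html .= qq{<span class="$class">$label$value</span>};
    }
    return $html;
}

# Hand-built stand-in for one parsed item.
my $item = {
    title        => 'An entry',
    'dc:creator' => 'Ben Hammersley',
    'dc:date'    => '2005-06-03',
};
print optional_spans($item), "\n";
# prints <span class="dccreator">Written by Ben Hammersley</span><span class="dcdate">Date: 2005-06-03</span>
```

Supporting another plain-text element then means adding one line to the table rather than another if block.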

TIP: Most XML parsers found in scripting languages (Perl, Python, etc.) are really interfaces for Expat, the powerful XML parsing library. They therefore require Expat to be installed. Expat is available from http://expat.sourceforge.net/ and is released under the MIT License.

As you can see, the final extension prints the contents of the annotate:reference element. This element, as mentioned in Chapter 7, carries a single rdf:resource attribute. Note the way I get XML::Simple to read the attribute. It treats the attribute as just another leaf on the tree; you access it the same way you would a subelement. You can use the same syntax for any attribute-only element.
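Because the attribute lives one level down, it pays to check that the element really is a hash before reaching into it; under use strict, dereferencing a plain string stops the script. A minimal sketch, assuming a hypothetical attribute_of subroutine and a hand-built item standing in for what the parser would return for an annotate:reference element with an rdf:resource attribute:

```perl
use strict;
use warnings;

# Hypothetical helper: read one attribute of a child element,
# or return nothing if the element is absent or is not a hash.
sub attribute_of {
    my ($item, $element, $attribute) = @_;
    my $node = $item->{$element};
    return unless ref($node) eq 'HASH';
    return $node->{$attribute};
}

# Hand-built stand-in for one parsed item (example.org URL is made up).
my $item = {
    title                => 'An entry',
    'annotate:reference' => {
        'rdf:resource' => 'http://example.org/comments/1',
    },
};

my $target = attribute_of($item, 'annotate:reference', 'rdf:resource');
print "Comment link: $target\n" if defined $target;
# prints Comment link: http://example.org/comments/1
```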

This content is excerpted from Chapter 8 of the new book, "Developing Feeds with RSS and Atom", authored by Ben Hammersley, published by O'Reilly, copyright 2005. To learn more, please visit: http://www.oreilly.com/catalog/deveoprssatom/index.html.

Created: March 27, 2003
Revised: June 3, 2005

URL: http://webreference.com/programing/rss_atom/1