To that end, here is a complete aggregator in only 40 lines, written by Jonas Galvez of http://jonasgalvez.com/. Jonas released it under the GPL, so it is free to use within the usual GPL bounds. Many thanks to him for that.
The full code is listed later in Example 8-4, but let's step through it section by section. The program stores the list of URLs to fetch and parse in a text file, feeds.txt, so to start, you import the required modules, pull in the contents of feeds.txt, and define a list to hold the items once you parse them:
import time
import feedparser

# One feed URL per line
sourceList = open('feeds.txt').readlines()
# Collects Entry objects from every feed
postList = []
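A feeds.txt file maintained by hand may well pick up blank lines along the way. As a minimal sketch in Python 3 (the helper name clean_feed_urls is hypothetical, and treating lines that start with # as comments is an assumption, not part of the original program), the list could be tidied before any URL is fetched:

```python
def clean_feed_urls(lines):
    """Return feed URLs from an iterable of text lines, one URL per line."""
    urls = []
    for line in lines:
        url = line.strip()
        # Skipping blank lines and '#' comment lines is an assumption of
        # this sketch; the original simply reads every line of feeds.txt.
        if url and not url.startswith("#"):
            urls.append(url)
    return urls
```

It accepts any iterable of lines, so it can be fed an open file directly, for example clean_feed_urls(open('feeds.txt')).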
Next, define the Entry class to act as a wrapper for the entry object the
Universal Feed Parser will return. The modified_parsed property
contains the entry date as a tuple of nine elements, of which the first six are
the year, month, day, hour, minute, and second. This tuple can be converted to
seconds since the Unix epoch with the time module's mktime() function:
class Entry:
    def __init__(self, data, blog):
        self.blog = blog
        self.title = data.title
        # Seconds since the Unix epoch
        self.date = time.mktime(data.modified_parsed)
        self.link = data.link
    def __cmp__(self, other):
        # Newest first; cmp() returns the integer that __cmp__ must return
        return cmp(other.date, self.date)
The __cmp__ method defines the standard comparison behavior of
the class. Once you have a list of Entry instances and call sort()
on it, the __cmp__ method determines the order. Because it compares
other against self, the most recent entries sort first.
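Python 3 no longer calls __cmp__ when sorting, so the same ordering has to be expressed differently there. What follows is a minimal sketch under stated assumptions, not Jonas's code: it sorts with a key function, and it swaps mktime() for calendar.timegm() on the assumption that the parsed date tuple is in UTC. The newest_first helper and the SimpleNamespace stand-ins for feed entries are hypothetical.

```python
import calendar
import time
from types import SimpleNamespace


class Entry:
    """Wrapper around one parsed feed entry (Python 3 sketch)."""

    def __init__(self, data, blog):
        self.blog = blog
        self.title = data.title
        # timegm() reads the nine-element tuple as UTC; this replaces the
        # mktime() call, which would read it as local time.
        self.date = calendar.timegm(data.modified_parsed)
        self.link = data.link


def newest_first(entries):
    """Return the entries sorted by date, most recent first."""
    return sorted(entries, key=lambda entry: entry.date, reverse=True)


# Stand-ins for parsed entries; the titles and links are made up.
old = SimpleNamespace(title="Old", link="http://example.com/1",
                      modified_parsed=time.gmtime(0))
new = SimpleNamespace(title="New", link="http://example.com/2",
                      modified_parsed=time.gmtime(86400))
posts = newest_first([Entry(old, "Example Blog"), Entry(new, "Example Blog")])
print([post.title for post in posts])  # prints ['New', 'Old']
```

Sorting with a key keeps the ordering rule next to the sort call instead of inside the class, which is the usual Python 3 idiom.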
Here is where the UFP comes in. Since we want to show entries ordered by date, it's prudent to at least verify that each entry actually includes a date. With the UFP, you can also check for a "bozo bit" and refuse invalid feeds altogether. The package's documentation gives details on that:
for uri in sourceList:
    xml = feedparser.parse(uri.strip())
    blog = xml.feed.title
    for e in xml.entries[:10]:
        # Skip entries that carry no date
        if not e.has_key('modified_parsed'):
            continue
        postList.append(Entry(e, blog))
postList.sort()
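The bozo check mentioned above could be folded into the same filtering step. The following Python 3 sketch assumes only that the parsed result exposes a bozo flag and an entries list, as the UFP's result does. The dated_entries function, its reject_bozo parameter, and the SimpleNamespace stand-ins are hypothetical names used for illustration.

```python
from types import SimpleNamespace


def dated_entries(parsed, limit=10, reject_bozo=True):
    """Return those of the first `limit` entries that carry a parsed date.

    `parsed` is expected to look like a Universal Feed Parser result:
    an object with a `bozo` flag and an `entries` list.
    """
    # A true bozo flag means the parser found the feed not well-formed.
    if reject_bozo and getattr(parsed, "bozo", 0):
        return []
    kept = []
    for entry in parsed.entries[:limit]:
        # Entries without a date cannot be placed in the timeline.
        if getattr(entry, "modified_parsed", None) is None:
            continue
        kept.append(entry)
    return kept


# Stand-ins for parser results; real code would pass feedparser.parse(url).
good = SimpleNamespace(bozo=0, entries=[
    SimpleNamespace(title="dated",
                    modified_parsed=(2005, 6, 3, 0, 0, 0, 4, 154, 0)),
    SimpleNamespace(title="undated"),
])
broken = SimpleNamespace(bozo=1, entries=good.entries)
print([entry.title for entry in dated_entries(good)])  # prints ['dated']
print(len(dated_entries(broken)))                      # prints 0
```

Whether to reject a bozo feed outright is a policy choice. Passing reject_bozo=False keeps the original behavior of accepting whatever the parser could salvage.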
To finish, print everything out as an XHTML list:
print 'Content-type: text/html\n'
print '<ul style="font-family: monospace;">'
for post in postList[:20]: # 20 most recent items
    date = time.gmtime(post.date)
    date = time.strftime('%Y-%m-%d %H:%M:%S', date)
    item = '\t<li>[%s] <a href="%s">%s</a> (%s)</li>'
    print item % (date, post.link, post.title, post.blog)
print '</ul>'
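The listing above interpolates titles and links into the markup as they arrive, so a title containing & or < would produce invalid XHTML. One way to guard against that, sketched in Python 3 with the standard library's html.escape(), is shown here. The render_item helper is a hypothetical name and the sample values are invented.

```python
import html
import time


def render_item(date, link, title, blog):
    """Return one <li> line with every interpolated value HTML-escaped."""
    stamp = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(date))
    return '\t<li>[%s] <a href="%s">%s</a> (%s)</li>' % (
        stamp,
        html.escape(link, quote=True),  # also escapes quotes for the attribute
        html.escape(title),
        html.escape(blog),
    )


print(render_item(0, "http://example.com/?a=1&b=2",
                  "Tips & <tricks>", "Example Blog"))
```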
Example 8-4 shows the entire aggregator.
Because of the all-conquering success of Magpie and the UFP, Perl programmers haven't really moved on with the evolution of their feed-parsing tools. The UFP package can be called from Perl if need be, and many people have used the UFP as an excuse to try to learn Python anyway.
Certainly, there is no all-encompassing module for Perl that can parse all the flavors of RSS 1.0, RSS 2.0, and Atom with as much aplomb as the tools available for the other scripting languages.
XML::RSS provides basic RSS parsing, as does Timothy Appnel's
XML::RAI module framework, but neither supports Atom. Ben Trott's
XML::Atom is really designed for use with the Atom Publishing
Protocol but can be used with the Syndication Format as well, once it is
properly up to date. At the time of writing, it is lagging behind the
specification somewhat; this situation should improve once both Atom standards
are at version 1.0. Timothy Appnel has also written an Atom module,
XML::Atom::Syndication, which is very promising indeed.
With this mishmash of options for parsing feeds, and the need to write code to identify the feed's standard and pass it off to the correct functions, things can get too complicated too quickly with Perl. Let's take it back a notch, then, and resort to first principles. The following hasn't changed from the first edition of this book. I omit Atom to wait for the specification to settle down, but you will be able to see quite plainly how it would work with this structure.
The disadvantage of RSS's split into two separate but similar specifications is that you can never be sure which of the standards your desired feeds will arrive in. If you restrict yourself to using only RSS 2.0, it is very likely that the universe will conspire to make the most interesting stuff available solely in RSS 1.0, or vice versa. So, no matter what you want to do with the feed, your approach must be able to handle both standards equally well. With that in mind, simple parsing of RSS can be done with a standard general XML parser.
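As a rough illustration of handling both standards with a general XML parser, here is a minimal sketch in Python (the language of the aggregator earlier, rather than Perl) using the standard library's xml.etree.ElementTree. It tells the two formats apart by their root element: rdf:RDF for RSS 1.0, rss for RSS 2.0. The parse_items function and the sample feed are hypothetical, and the sketch reads only each item's title and link.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS1_NS = "http://purl.org/rss/1.0/"


def parse_items(xml_text):
    """Return (flavor, [(title, link), ...]) for an RSS 1.0 or 2.0 document."""
    root = ET.fromstring(xml_text)
    if root.tag == "{%s}RDF" % RDF_NS:
        # RSS 1.0: items are namespaced children of the rdf:RDF root.
        flavor, prefix = "RSS 1.0", "{%s}" % RSS1_NS
        items = root.findall(prefix + "item")
    elif root.tag == "rss":
        # RSS 2.0 (and its 0.9x ancestors): items are not namespaced and
        # sit inside <channel>. This sketch labels them all "RSS 2.0".
        flavor, prefix = "RSS 2.0", ""
        items = root.findall("channel/item")
    else:
        raise ValueError("unrecognized feed root element: %s" % root.tag)
    results = []
    for item in items:
        title = item.findtext(prefix + "title", default="")
        link = item.findtext(prefix + "link", default="")
        results.append((title, link))
    return flavor, results


# A made-up RSS 2.0 document for demonstration.
SAMPLE = """<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><link>http://example.com/a</link></item>
</channel></rss>"""
print(parse_items(SAMPLE))  # prints ('RSS 2.0', [('A', 'http://example.com/a')])
```

Branching once on the root element keeps the rest of the code identical for both formats, which is the point of using a generic parser.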
XML parsers are useful tools to have around when dealing with either RSS 2.0
or 1.0. While RSS 2.0 is quite a simple format, and using a full-fledged XML
parser on it can sometimes seem like overkill, doing so has a distinct
advantage over the other methods: futureproofing. Either way, for the majority
of purposes, the simplest XML parsers are perfectly adequate. The Perl module
XML::Simple is a good example. Example 8-5 is a simple script that
uses XML::Simple to parse both RSS 2.0x and RSS 1.0 feeds into
XHTML that is ready for server-side inclusion.
Created: March 27, 2003
Revised: June 3, 2005
URL: http://webreference.com/programing/rss_atom/1