Developing Feeds with RSS and Atom. Chapter 8: Parsing and Using Feeds

A complete aggregator in 40 lines

To that end, here is a complete aggregator in only 40 lines, written by Jonas Galvez of http://jonasgalvez.com/. Jonas released it under the GPL, so it is free to use within the usual GPL bounds. Many thanks to him for that.

The full code is listed later in Example 8-4, but let's step through it section by section. The program stores the list of URLs to fetch and parse in a text file, feeds.txt, so to start, you import the required modules, pull in the contents of feeds.txt, and define an array to hold the items once you parse them:

import time
import feedparser
 
sourceList = open('feeds.txt').readlines()
postList = []

Next, define the Entry class to act as a wrapper for the entry object the Universal Feed Parser will return. The modified_parsed property contains the entry date as a tuple of nine elements, of which the first six are the year, month, day, hour, minute, and second. This tuple can be converted to seconds since the Unix epoch with the mktime() function from the standard time module:

class Entry:
    def __init__(self, data, blog):
        self.blog = blog
        self.title = data.title
        self.date = time.mktime(data.modified_parsed)
        self.link = data.link
    def __cmp__(self, other):
        # cmp() returns the integer __cmp__ must produce; newest entries sort first
        return cmp(other.date, self.date)

The __cmp__ method defines the standard comparison behavior of the class. Once you have a list of Entry instances and call sort() on it, the __cmp__ method determines the order, which here is newest first.
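
For readers on Python 3, where __cmp__ and cmp() no longer exist, the same ordering can be expressed with a sort key. The following is only a sketch, not part of Jonas's program: the constructor signature and the newest_first() helper are illustrative names. It also swaps time.mktime() for calendar.timegm(), on the assumption that the parser's time tuples are in UTC (mktime() would read them as local time).

```python
import calendar
from operator import attrgetter


class Entry:
    """Python 3 counterpart of the Entry wrapper (illustrative only)."""

    def __init__(self, title, link, blog, parsed_date):
        self.title = title
        self.link = link
        self.blog = blog
        # parsed_date is a nine-element time tuple, assumed to be in UTC.
        # timegm() converts it to epoch seconds without applying the
        # local timezone offset.
        self.date = calendar.timegm(parsed_date)


def newest_first(entries):
    """Return a new list of entries, most recent first."""
    return sorted(entries, key=attrgetter("date"), reverse=True)
```

Sorting with a key leaves the class free of comparison logic, so the display order can be changed at the call site without touching Entry.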

Here is where the UFP comes in. Since we want to show entries ordered by date, it's prudent to at least verify that each entry actually includes a date. With the UFP, you can also check for a "bozo bit" and refuse invalid feeds altogether. The package's documentation gives details on that:

for uri in sourceList:
    xml = feedparser.parse(uri.strip())
    blog = xml.feed.title
    for e in xml.entries[:10]:
        if not e.has_key('modified_parsed'):
            continue
        postList.append(Entry(e, blog))

postList.sort()
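
The bozo check mentioned above is not part of the 40-line program. A rough sketch of how it might be combined with the date test follows, written for Python 3. The function name dated_entries and its parameters are hypothetical. The sketch assumes only that the parsed result behaves like a dictionary with "bozo" and "entries" keys, and it looks for the date under updated_parsed as well as modified_parsed, on the assumption that later UFP releases prefer the former name.

```python
def dated_entries(parsed, limit=10,
                  date_keys=("updated_parsed", "modified_parsed")):
    """Yield (entry, time_tuple) pairs from a parsed feed.

    `parsed` is assumed to look like the result of feedparser.parse():
    a mapping with "bozo" and "entries" keys.
    """
    if parsed.get("bozo"):
        # The parser flagged the feed as not well-formed: refuse it outright.
        return
    for entry in parsed.get("entries", [])[:limit]:
        for key in date_keys:
            stamp = entry.get(key)
            if stamp:
                yield entry, stamp
                break  # one date per entry is enough


# Stand-in data shaped like a parsed feed; a real run would pass the
# result of feedparser.parse(url) instead.
sample = {
    "bozo": 0,
    "entries": [
        {"title": "Dated", "modified_parsed": (2005, 6, 3, 0, 0, 0, 4, 154, 0)},
        {"title": "Undated"},
    ],
}
titles = [entry["title"] for entry, _ in dated_entries(sample)]
```

Refusing every feed with the bozo bit set is the strict option. A more forgiving aggregator might log the problem and carry on with whatever entries were recovered.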

To finish, print everything out as an XHTML list:

print 'Content-type: text/html\n'
print '<ul style="font-family: monospace;">'

for post in postList[:20]: # the 20 most recent items
    date = time.gmtime(post.date)
    date = time.strftime('%Y-%m-%d %H:%M:%S', date)
    item = '\t<li>[%s] <a href=\"%s\">%s</a> (%s)</li>'
    print item % (date, post.link, post.title, post.blog)
 
print '</ul>'
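
One caution about this output step: titles and links come straight from other people's feeds, so a stray ampersand or angle bracket would produce invalid XHTML. A hedged Python 3 sketch of the same formatting with escaping added is shown below. The render_item() helper is an illustrative name, not part of the original program.

```python
import html
import time


def render_item(epoch_seconds, link, title, blog):
    """Format one entry as an XHTML list item, escaping feed-supplied text."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(epoch_seconds))
    return '\t<li>[%s] <a href="%s">%s</a> (%s)</li>' % (
        stamp,
        html.escape(link, quote=True),  # also escapes quotes inside the attribute
        html.escape(title),
        html.escape(blog),
    )
```

The list markup itself is unchanged; only the three values taken from the feed pass through html.escape().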

Example 8-4 shows the entire aggregator.

Example 8-4. A 40-line aggregator in Python using the Universal Feed Parser

#!/usr/bin/python2.2
"""
License: GPL 2; share and enjoy!
Requires: Universal Feed Parser <http://feedparser.org>
Author: Jonas Galvez <http://jonasgalvez.com>
"""

import time
import feedparser

sourceList = open('feeds.txt').readlines()
postList = []

class Entry:
    def __init__(self, data, blog):
        self.blog = blog
        self.title = data.title
        self.date = time.mktime(data.modified_parsed)
        self.link = data.link
    def __cmp__(self, other):
        return cmp(other.date, self.date)

for uri in sourceList:
    xml = feedparser.parse(uri.strip())
    blog = xml.feed.title
    for e in xml.entries[:10]:
        if not e.has_key('modified_parsed'):
            continue
        postList.append(Entry(e, blog))

postList.sort()

print 'Content-type: text/html\n'
print '<ul style="font-family: monospace;">'

for post in postList[:20]:
    date = time.gmtime(post.date)
    date = time.strftime('%Y-%m-%d %H:%M:%S', date)
    item = '\t<li>[%s] <a href=\"%s\">%s</a> (%s)</li>'
    print item % (date, post.link, post.title, post.blog)

print '</ul>'

Perl: XML::Simple

Because of the all-conquering success of Magpie and the UFP, Perl programmers haven't really moved on with the evolution of their feed-parsing tools. The UFP package can be called from Perl if need be, and many people have used the UFP as an excuse to try to learn Python anyway.

Certainly, there is no all-encompassing module for Perl that can parse all the flavors of RSS 1.0, RSS 2.0, and Atom with as much aplomb as the tools available for the other scripting languages.

XML::RSS provides basic RSS parsing, as does Timothy Appnel's XML::RAI module framework, but neither supports Atom. Ben Trott's XML::Atom is really designed for use with the Atom Publishing Protocol but can be used with the Syndication Format as well, once it is properly up to date. At the time of writing, it is lagging behind the specification somewhat; this situation should improve once both Atom standards are at version 1.0. Timothy Appnel has also written an Atom module, XML::Atom::Syndication, which is very promising indeed.

With this mishmash of options for parsing feeds, and the need to write code that identifies the feed's standard and passes it off to the correct functions, things can get too complicated too quickly with Perl. Let's take it back a notch, then, and return to first principles. The following hasn't changed from the first edition of this book. I omit Atom to wait for the specification to settle down, but you will be able to see quite plainly how it would work with this structure.

Parsing RSS as simply as possible

The disadvantage of RSS's split into two separate but similar specifications is that you can never be sure which of the standards your desired feeds will arrive in. If you restrict yourself to using only RSS 2.0, it is very likely that the universe will conspire to make the most interesting stuff available solely in RSS 1.0, or vice versa. So, no matter what you want to do with the feed, your approach must be able to handle both standards with equal aplomb. With that in mind, simple parsing of RSS can be done with a standard general XML parser.

XML parsers are useful tools to have around when dealing with either RSS 2.0 or 1.0. While RSS 2.0 is quite a simple format, and using a full-fledged XML parser on it does sometimes seem to be overkill, it does have a distinct advantage over the other methods: futureproofing. Either way, for the majority of purposes, the simplest XML parsers are perfectly useful. The Perl module XML::Simple is a good example. Example 8-5 is a simple script that uses XML::Simple to parse both RSS 2.0x and RSS 1.0 feeds into XHTML that is ready for server-side inclusion.
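
To make the point concrete in the language used earlier in this chapter, here is a rough Python 3 sketch of the same idea using the standard library's general-purpose ElementTree parser. It is not the Perl script of Example 8-5. The extract_items() function is a hypothetical name, and the sketch assumes RSS 2.0 items sit un-namespaced inside channel, while RSS 1.0 items are children of the rdf:RDF root in the http://purl.org/rss/1.0/ namespace.

```python
import xml.etree.ElementTree as ET

RSS10_NS = "http://purl.org/rss/1.0/"


def extract_items(xml_text):
    """Return (title, link) pairs from an RSS 1.0 or RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        # RSS 2.0: un-namespaced <item> elements nested inside <channel>.
        prefix = ""
        items = root.findall("./channel/item")
    else:
        # RSS 1.0: <item> elements are direct children of <rdf:RDF>
        # and live in the RSS 1.0 namespace.
        prefix = "{%s}" % RSS10_NS
        items = root.findall(prefix + "item")
    return [
        (item.findtext(prefix + "title", default=""),
         item.findtext(prefix + "link", default=""))
        for item in items
    ]
```

Because the only format-specific parts are the item location and the namespace prefix, everything downstream of extract_items() can treat both flavors identically.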

Created: March 27, 2003
Revised: June 3, 2005

URL: http://webreference.com/programing/rss_atom/1