WebReference.com - Chapter 3 from Perl & XML, from O'Reilly and Associates (4/12)
Perl & XML
Writing a parser requires a lot of work. You can't be sure if you've covered everything without a lot of testing. Unless you're a mutant who loves to write efficient, low-level parser code, your program will probably be slow and resource-intensive. The good news is that a wide variety of free, high-quality, and easy-to-use XML parser packages (written by friendly mutants) already exist to help you. People have bashed Perl and XML together for years, and you have a barnful of conveniently pre-invented wheels at your disposal.
Where do Perl programmers go to find ready-made modules to use in their programs? They go to the Comprehensive Perl Archive Network (CPAN), a many-mirrored public resource full of free, open-source Perl code. If you aren't familiar with using CPAN, you must change your isolationist ways and learn to become a programmer of the world. You'll find a multitude of modules authored by folks who have walked the path of Perl and XML before you, and who've chosen to share the tools they've made with the rest of the world.
TIP: Don't think of CPAN as a catalog of ready-made solutions for all specific XML problems. Rather, look at it as a toolbox or a source of building blocks you can assemble and configure to craft a solution. While some modules specialize in popular XML applications like RSS and SOAP, most are more general-purpose. Chances are, you won't find a module that specifically addresses your needs. You'll more likely take one of the general XML modules and adapt it somehow. We'll show that this process is painless and reveal several ways to configure general modules to your particular application.
XML parsers differ from one another in two major ways. First, they differ in their parsing style, which is how the parser works with XML. There are a few different strategies, such as building a data structure or creating an event stream. Another attribute of parsers, called standards-completeness, is a spectrum ranging from ad hoc on one extreme to an exhaustive, standards-based solution on the other. The balance on the latter axis is slowly moving from the eccentric, nonstandard side toward the other end as the Perl community agrees on how to implement major standards like SAX and DOM.
The XML::Parser module is the great-grandpappy of all Perl-based XML processors. It is a multifaceted parser, offering a handful of different parsing styles. On the standards axis, it's closer to ad hoc than standards-compliant; however, being the first efficient XML parser to appear on the Perl horizon, it has a dear place in our hearts and is still very useful. While XML::Parser uses a nonstandard API and has a reputation for getting a bit persnickety over some issues, it works. It parses documents with reasonable speed and flexibility, and as all Perl hackers know, people tend to glom onto the first usable solution that appears on the radar, no matter how ugly it is. Thus, nearly all of the first few years' worth of Perl and XML modules and programs based themselves on XML::Parser.
Since 2001 or so, however, other low-level parsing modules have emerged that base themselves on faster and more standards-compliant core libraries. We'll touch on these modules shortly. However, we'll start out with an examination of XML::Parser, giving a nod to its venerability and functionality.
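To make the two parsing strategies mentioned above concrete, here is a minimal sketch that runs one small document through XML::Parser twice: once as an event stream (via the Handlers option) and once as a data structure (via Style => 'Tree'). It assumes XML::Parser is installed from CPAN; the sample document and the variable names are invented for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# A made-up sample document.
my $xml = '<inventory><item id="1">widget</item>'
        . '<item id="2">gadget</item></inventory>';

# Event-stream strategy: the parser calls our handlers as it meets
# each start tag and each run of character data.
my %count;
my $text = '';
my $stream = XML::Parser->new(
    Handlers => {
        Start => sub { my ($expat, $name, %attr) = @_; $count{$name}++ },
        Char  => sub { my ($expat, $string) = @_; $text .= $string },
    },
);
$stream->parse($xml);

# Data-structure strategy: the Tree style makes parse() return nested
# array references; the first element is the root element's name.
my $tree      = XML::Parser->new( Style => 'Tree' )->parse($xml);
my $root_name = $tree->[0];

print "$_: $count{$_}\n" for sort keys %count;
print "text: $text\n";
print "root: $root_name\n";
```

The handler version never holds the whole document in memory, while the tree version gives you something you can walk after parsing is finished. Which one suits you depends on the size of the document and on how much random access you need.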
In the early days of XML, a skilled programmer named James Clark wrote an XML parser library in C and called it Expat. Fast, efficient, and very stable, it became the parser of choice among early adopters of XML. To bring XML into the Perl realm, Larry Wall wrote a low-level API for it and called the module XML::Parser::Expat. Then he built a layer on top of that, XML::Parser, to serve as a general-purpose parser for everybody. Now maintained by Clark Cooper, XML::Parser has served as the foundation of many XML modules.
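The layering described above can be seen from Perl itself. The following is a rough sketch, assuming XML::Parser (which ships with XML::Parser::Expat) is installed; the sample document, the outline format, and the variable names are made up for the example. It first drives the low-level XML::Parser::Expat object directly, then shows that XML::Parser passes that same kind of object to every handler.

```perl
use strict;
use warnings;
use XML::Parser;
use XML::Parser::Expat;

# A made-up sample document.
my $xml = '<doc><title>Hi</title><body><p>text</p></body></doc>';

# Low-level layer: create an Expat object and register handlers on it.
my @outline;
my $depth = 0;    # our own nesting counter, kept for illustration
my $low = XML::Parser::Expat->new;
$low->setHandlers(
    Start => sub {
        my ( $expat, $name ) = @_;
        push @outline, ( '  ' x $depth ) . $name;
        $depth++;
    },
    End => sub { $depth-- },
);
$low->parse($xml);
$low->release;    # lets the Expat object's memory be reclaimed

# High-level layer: XML::Parser sets up the Expat object for us and
# passes it to each handler as the first argument.
my $saw_expat = 0;
my $high = XML::Parser->new(
    Handlers => {
        Start => sub { $saw_expat = 1 if $_[0]->isa('XML::Parser::Expat') },
    },
);
$high->parse($xml);

print "$_\n" for @outline;
print "handlers receive an Expat object\n" if $saw_expat;
```

Most programs only need the second form. The first is there for the rare case when you want to manage the Expat object yourself.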
The C underpinnings are the secret to XML::Parser's success. We've seen how to write a basic parser in Perl. If you apply our previous example to a large XML document, you'll wait a long time before it finishes. Others have written complete XML parsers in Perl that are portable to any system, but you'll find much better performance in a compiled C parser like Expat. Fortunately, as with every other Perl module based on C code (and there are actually lots of these modules because they're not too hard to make, thanks to Perl's standard XS library), it's easy to forget you're driving Expat around when you use XML::Parser.
1. James Clark is a big name in the XML community. He tirelessly promotes the standard with his free tools and involvement with the W3C. You can see his work at http://www.jclark.com/. Clark is also editor of the XSLT and XPath recommendation documents at http://www.w3.org/.
2. See man perlxs or Chapter 25 of O'Reilly's Programming Perl, Third Edition for more information.
Created: May 8, 2002
Revised: May 8, 2002