
XML and Perl::Querying the file


xml-fetch.pl
This script retrieves a file from the Web, queries it for a specified list of comma delimited tags, and displays the tag name, attributes, and content.

Now it's time to get our feet wet. We're going to develop a script that will:

  1. Retrieve an HTML/XML document from a specified URL
  2. Query the file for a specified list of tags
  3. Print the tag names, attribute values, and content in an HTML table
  4. Optionally print the document below the element table
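Step 1 is handled by the LWP modules. As a rough preview of what the walkthrough further down describes, here is a minimal sketch of the retrieval step on its own. It is not the script's actual source: the sub names make_agent and fetch_document are made up for illustration, and the sketch simply dies on failure where the real script calls its own &printError routine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;
use LWP::UserAgent;

# Build a user agent configured the way the tutorial describes.
# (make_agent is a hypothetical helper name, not part of xml-fetch.pl.)
sub make_agent {
    my $ua = LWP::UserAgent->new;
    $ua->agent('xml-fetch/1.0');   # sent to the server as the User-Agent header
    $ua->max_size(1_000_000);      # cap the response body at 1,000,000 bytes
    return $ua;
}

# Fetch $url and return its body, or die with the status line on failure.
# (fetch_document is also a hypothetical name.)
sub fetch_document {
    my ($url) = @_;
    my $ua       = make_agent();
    my $request  = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request);

    # is_success is true for any 2xx code, including 206 Partial Content.
    die $response->status_line . ": Error retrieving URL $url\n"
        unless $response->is_success;

    return $response->content;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    print fetch_document($ARGV[0]);
}
```

Run it with a URL as its only argument to print the retrieved document. Without an argument it does nothing.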

 

CGI parameters:
The following is a list of the parameters that we will pass to our CGI script via an HTML form.

  1. url - the full URL of the XML document, e.g. http://www.webreference.com/xml/dmv.xm
  2. fields - a comma delimited list of elements to search for, e.g. document,abstract,toc
  3. display - if the value equals ON, the script also displays the XML document after the element table

 

Note: You can either submit the data to the CGI script via a form like this one, or you can embed the query in a URL like this. Below, I will be referring to line numbers in this numbered source file. You may want to print it out now for reference. When you are ready to install the script on your system, you can retrieve the full source.
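To make the "query in a URL" option concrete, here is a small sketch that assembles such a URL from the three parameters. The script location and the document URL are hypothetical placeholders, and uri_encode and build_query_url are helper names invented for this example. The encoder assumes plain ASCII input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode everything except the RFC 3986 "unreserved" characters.
# Assumes ASCII input; uri_encode is a hypothetical helper.
sub uri_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Build a GET URL for the script from its url, fields, and display parameters.
# Parameters that are not supplied are simply left out.
sub build_query_url {
    my ($script, %param) = @_;
    my @pairs = map  { "$_=" . uri_encode($param{$_}) }
                grep { defined $param{$_} } qw(url fields display);
    return $script . '?' . join('&', @pairs);
}

print build_query_url(
    'http://example.com/cgi-bin/xml-fetch.pl',    # hypothetical install location
    url     => 'http://example.com/sample.xml',   # hypothetical document to query
    fields  => 'document,abstract,toc',
    display => 'ON',
), "\n";
```

The commas and slashes are escaped on the way out. The CGI module decodes them again before the script sees the values.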

 

The Code:

  • Line 1 is only applicable if you are using Unix. It tells Unix to execute the script using the specified command, /usr/local/bin/perl in this case. If you are using NT, the equivalent is to associate the .pl extension with the Perl binary.
  • Line 13 loads the strict module, which disallows symbolic references and barewords that aren't subroutine names, and forces you to declare variables before use, typically with my().
  • Lines 14-16 load the CGI, HTTP::Request, and LWP::UserAgent modules we will be using to retrieve files from the internet and process CGI variables.
  • Lines 18-19 initialize the global variables we will be using later.
  • Line 21 creates a new instance of the CGI class.
  • Line 22 prints a standard HTTP header.
  • Lines 24-25 return an error message unless CGI parameters url and fields contain values. The url parameter contains the HTTP URL of the file we want to retrieve. The fields parameter contains a comma delimited list of tags we want to grab from the file.
  • Line 27 creates a new instance of the LWP::UserAgent object. The LWP::UserAgent class implements the HTTP protocol and allows us to retrieve files from the Internet.
  • Line 28 sets the HTTP User-Agent to xml-fetch/1.0. This information will be passed to the Web server in the HTTP header.
  • Line 29 sets the maximum response size to 1,000,000 bytes. This limits the size of the file that can be retrieved from a remote Web server.
  • Line 31 creates a new instance of the HTTP::Request object, sets the HTTP method to GET, and passes it the URL from the $query->param('url') variable. This class takes care of building a properly formed HTTP request for the remote Web server.
  • Line 32 performs the HTTP request and assigns an instance of the HTTP::Response class to $response.
  • Lines 34-35 return an error message unless the HTTP request was successful. [Editor's note: As originally written in February of 1999, Jonathan's code tested for a successful file retrieval like this:
    &printError($response->code.": Error retrieving URL ".$query->param('url'))
        unless ($response->code == 200);
    At the time this was adequate; however, recently released servers may legally send a "Partial Content" code (206) in response to this request even if the entire file was indeed delivered. The 206 response is the result of the max_size setting on line 29 of the code; LWP::UserAgent will automatically add a Range header to the request, and some servers will always respond with a 206 code (assuming any portion of the file was actually sent) in response to the existence of a Range header in the original request.

    To counteract this, we've changed the original code to this:

    &printError($response->code.": Error retrieving URL ".$query->param('url'))
        unless ($response->is_success);
    which will allow the script to continue upon receipt of all "successful" response codes (including 206). This does mean there is still the possibility that the information received could be less than the total size of the file (if the file is greater than the max_size setting) but should be adequate in the vast majority of implementations.]
  • Line 37 splits the comma delimited list of tags that we want to query and returns the list to @entities.
  • Line 38 assigns the contents of the file from the Web server to the $content variable.
  • Line 40 calls the &Print_Header subroutine which prints the HTML headers and the value of the url variable.
  • Lines 41-55 contain the program's main loop, which prints the tag name, attribute names and values, and contents of the tag in an HTML table.
    • Line 41 loops through the returned file for each tag that we specified in the fields CGI variable.
    • Line 42 calls the &Print_Entity_Head($entity) subroutine which prints the field title.
    • Line 43 is the heart of the program. It is the regular expression that searches the file for instances of the tags specified in the fields CGI parameter. It's fairly complex, so let's break it into bite sized pieces: /<$entity\s*(.*?)(\/>|>(.*?)<\/$entity>)/gsi
      • There are two major parts in the regular expression:
        • <$entity\s*(.*?) - looks for a < character followed by the name of the tag ($entity), followed by zero or more whitespace characters (\s*), followed by as few characters as possible (.*? is non-greedy, and with the /s modifier . also matches \n) up to the next expression below.
        • (\/>|>(.*?)<\/$entity>) - matches either the /> character sequence, or the > character followed by as few characters as possible up to the </ character sequence, the tag name, and another > character.
      • Pulling it all together, the regular expression basically matches two tag variations:
        1. <tag name1="value1" name2="value2"/>
        2. <tag name1="value1" name2="value2">some text</tag>
        XML requires every element to be closed: either with a matching closing tag (the second variation) or, for an empty element (the first variation), by terminating the tag itself with a / character.
    • Line 45 splits the tag attributes into an array. In variation 1 above, the array @attribs would be equal to (name1="value1",name2="value2").
    • Lines 47-52 split the @attribs array into a hash. In variation 1 above where name1="value1", name1 would be the hash key, value1 would be the value of $h_attr{name1}.
    • Line 53 calls the &Print_Element routine, passing it the %h_attr hash and the tag content, which it prints in an HTML table.
  • Line 58 prints the HTML page that was retrieved from the specified URL if the CGI variable display was set to ON.
  • Lines 61-121 contain the functions used to format the HTML tables. Nothing complicated since they contain mostly HTML.
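To see the regular expression from line 43 at work outside the CGI script, here is a small stand-alone sketch. The find_elements sub, the sample document, and the attribute-matching pattern are assumptions made for illustration; the script's own lines 45-52 split the attributes in their own way. The \Q...\E quoting is also an addition of this sketch, so that a tag name is never interpreted as regex syntax.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one hash reference per match of $entity in $content:
#   { attrs => { name => value, ... }, content => text (or '' for empty tags) }
# find_elements is a hypothetical helper, not a routine from xml-fetch.pl.
sub find_elements {
    my ($content, $entity) = @_;
    my @found;

    # Same pattern as line 43: $1 is the attribute text, $3 the tag content.
    while ($content =~ /<\Q$entity\E\s*(.*?)(\/>|>(.*?)<\/\Q$entity\E>)/gsi) {
        my ($attr_text, $body) = ($1, $3);

        # Assumption: attributes are written as name="value" in double quotes.
        my %h_attr;
        while ($attr_text =~ /([\w:.-]+)\s*=\s*"([^"]*)"/g) {
            $h_attr{$1} = $2;
        }

        push @found, {
            attrs   => \%h_attr,
            content => defined $body ? $body : '',   # <tag .../> has no content
        };
    }
    return @found;
}

# Made-up sample document covering both tag variations.
my $sample = <<'XML';
<document>
  <title lang="en">Querying the file</title>
  <item id="1" type="empty"/>
  <item id="2">spans
two lines</item>
</document>
XML

# The fields parameter is a comma delimited list, so split it first.
for my $entity (split /,/, 'title,item') {
    for my $el (find_elements($sample, $entity)) {
        my $attrs = join ' ',
            map { "$_=$el->{attrs}{$_}" } sort keys %{ $el->{attrs} };
        printf "%s (%s) => '%s'\n", $entity, $attrs, $el->{content};
    }
}
```

Two limitations of the pattern are worth keeping in mind. Because \s* allows zero whitespace characters, searching for item may also match a longer tag name that merely starts with item. An element nested inside another element of the same name will be cut off at the first closing tag.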


Produced by Jonathan Eisenzopf and


Created: Feb. 14, 1999
Revised: Sep. 19, 2003

URL: http://www.webreference.com/perl/tutorial1/