Web Automation, from The PHP Cookbook -WebReference-
Chapter 11: Web Automation
Most of the time, PHP is part of a web server, sending content to browsers. Even when you run it from the command line, it usually performs a task and then prints some output. PHP can also be useful, however, playing the role of a web browser--retrieving URLs and then operating on the content. Most recipes in this chapter cover retrieving URLs and processing the results, although there are a few other tasks in here as well, such as using templates and processing server logs.
There are four ways to retrieve a remote URL in PHP. Choosing one method over another depends on your needs for simplicity, control, and portability. The four methods are to use fopen( ), fsockopen( ), the cURL extension, or the HTTP_Request class from PEAR.
fopen( ) is simple and convenient. We discuss it in Fetching a URL with the GET Method. The fopen( ) function automatically follows redirects, so if you use this function to retrieve the directory http://www.example.com/people and the server redirects you to http://www.example.com/people/, you'll get the contents of the directory index page, not a message telling you that the URL has moved. The fopen( ) function also works with both HTTP and FTP. The downsides to fopen( ) include: it can handle only HTTP GET requests (not HEAD or POST), you can't send additional headers or any cookies with the request, and you can retrieve only the response body, not the response headers.
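A GET request with fopen( ) can be sketched in a few lines; the URL here is illustrative, and this requires allow_url_fopen to be enabled in php.ini:

```php
<?php
// Open the URL like a file; fopen( ) follows any redirects for us
$fh = fopen('http://www.example.com/people', 'r')
    or die('Cannot open URL');

// Read the response body in chunks until end-of-stream
$page = '';
while (! feof($fh)) {
    $page .= fgets($fh, 1048576);
}
fclose($fh);

print $page;
?>
```

Note that only the body comes back this way; there's no access to the status line or response headers.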
fsockopen( ) requires more work but gives you more flexibility. We use fsockopen( ) in Fetching a URL with the POST Method. After opening a socket with fsockopen( ), you need to print the appropriate HTTP request to that socket and then read and parse the response. This lets you add headers to the request and gives you access to all the response headers. However, you need additional code to properly parse the response and take any appropriate action, such as following a redirect.
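The raw exchange over a socket might look like this sketch, with an illustrative host and path; you construct the HTTP request by hand and split the response yourself:

```php
<?php
// Open a TCP connection to the web server on port 80
$fh = fsockopen('www.example.com', 80, $errno, $errstr, 30)
    or die("Cannot connect: $errstr ($errno)");

// Write an HTTP/1.0 GET request; a blank line ends the headers
fputs($fh, "GET /people/ HTTP/1.0\r\n");
fputs($fh, "Host: www.example.com\r\n");
fputs($fh, "\r\n");

// Read the entire response: status line, headers, and body
$response = '';
while (! feof($fh)) {
    $response .= fgets($fh, 1048576);
}
fclose($fh);

// Headers and body are separated by the first blank line
list($headers, $body) = explode("\r\n\r\n", $response, 2);
?>
```

Because you see the full response, you can inspect the status line for a 301 or 302 and follow the Location: header yourself, which the chapter's POST recipe builds on.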
If you have access to the cURL extension or PEAR's HTTP_Request class, you should use those rather than fopen( ) or fsockopen( ). cURL supports a number of different protocols (including HTTPS, discussed in Fetching an HTTPS URL) and gives you access to response headers. We use cURL in most of the recipes in this chapter. To use cURL, you must have the cURL library installed, available at http://curl.haxx.se. Also, PHP must be built with cURL support (the --with-curl configuration option).
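A minimal cURL fetch, again with an illustrative URL, looks like this:

```php
<?php
// Initialize a cURL session for the URL
$c = curl_init('http://www.example.com/people/');

// Return the body as a string instead of printing it directly
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);

$page = curl_exec($c);
if ($page === false) {
    die('cURL error: ' . curl_error($c));
}
curl_close($c);

print $page;
?>
```

Setting CURLOPT_HEADER to true as well would prepend the response headers to the returned string, which is how several recipes in this chapter get at them.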
PEAR's HTTP_Request class, which we use in Fetching a URL with the POST Method, Fetching a URL with Cookies, and Fetching a URL with Headers, doesn't support HTTPS, but it does give you access to headers and can use any HTTP method. If this PEAR module isn't installed on your system, you can download it from http://pear.php.net/get/HTTP_Request. As long as the module's files are in your include_path, you can use it, making it a very portable solution.
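With the module in your include_path, a basic GET with HTTP_Request is a short sketch (the URL is illustrative):

```php
<?php
// PEAR's HTTP_Request class; must be in your include_path
require 'HTTP/Request.php';

$req = new HTTP_Request('http://www.example.com/people/');
$req->setMethod(HTTP_REQUEST_METHOD_GET);
$req->sendRequest();

// The response body and headers are both available
print $req->getResponseBody();
?>
```

Swapping in HTTP_REQUEST_METHOD_POST (or HEAD) and adding headers or cookies is equally direct, which is why the POST, cookie, and header recipes use this class.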
Debugging the Raw HTTP Exchange helps you go behind the scenes of an HTTP request to examine the headers in a request and response. If a request you're making from a program isn't giving you the results you're looking for, examining the headers often provides clues as to what's wrong.
Once you've retrieved the contents of a web page into a program, use Marking Up a Web Page through Removing HTML and PHP Tags to help you manipulate those page contents. Marking Up a Web Page demonstrates how to mark up certain words in a page with blocks of color. This technique is useful for highlighting search terms, for example. Extracting Links from an HTML File provides a function to find all the links in a page. This is an essential building block for a web spider or a link checker. Converting between plain ASCII and HTML is covered in Converting ASCII to HTML and Converting HTML to ASCII. Removing HTML and PHP Tags shows how to remove all HTML and PHP tags from a web page.
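The last of those tasks is built into PHP itself; a quick sketch of the idea behind Removing HTML and PHP Tags, using illustrative input:

```php
<?php
// strip_tags( ) removes HTML tags and PHP tags alike
$html = '<p>Hello, <b>World</b>!</p><?php echo "hidden"; ?>';
print strip_tags($html);   // Hello, World!
?>
```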
Another kind of page manipulation is using a templating system. Discussed in Using Smarty Templates, templates give you freedom to change the look and feel of your web pages without changing the PHP plumbing that populates the pages with dynamic data. Similarly, you can make changes to the code that drives the pages without affecting the look and feel. Parsing a Web Server Log File discusses a common server administration task--parsing your web server's access log files.
Two sample programs use the link extractor from Extracting Links from an HTML File. The program in Program: Finding Stale Links scans the links in a page and reports which are still valid, which have been moved, and which no longer work. The program in Program: Finding Fresh Links reports on the freshness of links. It tells you when a linked-to page was last modified and if it's been moved.
Created: March 27, 2003
Revised: March 27, 2003