Drag and Drop CGI | 5
Introduction to the ICE Scripts
Because of the free-form nature of the Web and the Internet in general, search engines are probably the most-used Internet utility. The basic idea behind search engines is simple enough. Each HTML document is broken down into individual words (HTML tags are ignored). All the words are stored in a database with pointers back to the original document. The search engine searches the database and returns a set of hyperlinks to the documents matching the user's query. It will also cross-reference a thesaurus, if one exists, for any predefined synonyms. Other features, such as ``close'' matching or phonetic (sounds-like) matching, require much more complicated search algorithms than ICE uses. Presumably, your Web site search needs are a little more modest than those of Internet-wide search engines. According to its author, ICE should perform well when indexing and searching Web sites that have up to a few thousand documents.
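The basic idea described above can be sketched in a few lines. The following is a minimal Python illustration of a word index with pointers back to the documents; it is not ICE's own Perl code, and the page names, page contents, and function names are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects the text between tags; the tags themselves are dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words_in(html):
    """Return the lowercased words of an HTML document, ignoring tags."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return re.findall(r"[a-z0-9]+", " ".join(extractor.chunks).lower())


def build_index(documents):
    """Map each word to {document path: occurrence count}."""
    index = defaultdict(dict)
    for path, html in documents.items():
        for word in words_in(html):
            index[word][path] = index[word].get(path, 0) + 1
    return index


def search(index, word):
    """Return matching document paths, most occurrences first."""
    hits = index.get(word.lower(), {})
    return sorted(hits, key=lambda path: (-hits[path], path))


# Hypothetical two-page site used only for this illustration.
pages = {
    "a.html": "<html><title>Cats</title><body><p>Seven cats, seven sacks.</p></body></html>",
    "b.html": "<html><body><h1>Prize</h1><p>The prize has seven winners.</p></body></html>",
}
index = build_index(pages)
print(search(index, "seven"))
```

Because only the text between tags is indexed, a query for a tag name such as ``html'' finds nothing here, while a query for ``seven'' lists the page that uses the word most often first.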
How the ICE Scripts Work
ICE is actually a system that consists of two scripts: ice-idx.pl and ice-form.cgi. ice-idx.pl is a command-line indexing script that searches through a directory tree of HTML documents and creates the Web site word index file. It is run either manually each time the Web pages change or by means of a scheduled ``batch'' utility such as the UNIX cron facility. The index produced by ice-idx.pl is a text file listing of each word that occurs in each document. Also included in the index is the title of the document, taken from the document's <TITLE> section. ICE also can cross-reference a thesaurus, thereby enabling users to use acronyms, abbreviations, and synonyms in the search. The thesaurus file must be created and maintained by hand. We show you an example of a thesaurus file later in the chapter.
The second part of the ICE system is ice-form.cgi. This script does the actual search of the word index and returns the list of matching documents based on the user's query. ice-form.cgi both generates the query form and formats the output into a set of HTML hyperlinks. It is possible to customize both the query form and the output by editing this script. A custom input form can be used in place of the form generated by the script if desired, as long as you take care to use the same NAME= parameters for the form fields. We show examples of both methods. Unfortunately, customizing the output page is not as easy. We point out where in the script the output page is generated and then set you loose to customize it the best you can. It's not hard, but it does involve modifying the Perl code. Be sure to keep a (working) backup copy of the script before you have at it.
Figure 12.1 shows what the pages look like in the browser, and Listing 12.1 shows the index generated by ice-idx.pl.
Figure 12.1 Browser output of example HTML pages
Listing 12.1 Index file generated by ice-idx.pl
@f /home/hypertising/www/DnDCGI/ice/examp1.html
@t Ice Test Document 1
@m 850407821
2 CGI
2 and
1 baron
1 bob
2 book
2 category
2 cgi
2 chose
1 chris
2 committee
2 computer
2 drag
2 drop
2 fiction
2 first
2 for
2 has
2 move
2 non
2 prestigious
6 prize
2 pulitzer
2 received
2 surprise
10 the
1 this
2 time
2 today
1 weil
2 winner
1 wins
@f /home/hypertising/www/DnDCGI/ice/examp2.html
@t Ice Test Document 2
@m 850407823
2 cat
4 cats
3 each
4 going
4 had
2 held
2 how
2 ives
3 kits
2 man
2 many
2 met
1 puzzle
2 sack
4 sacks
8 seven
2 was
2 were
2 wife
2 with
4 wives
Each entry in the index begins with the full path to the indexed file (the @f line), the title (@t) taken from the HTML <TITLE> tag, and the time the file was last modified (@m) in seconds since the epoch. Following the file information is the list of the words that occur within the document. A configuration option allows you to ignore words that have fewer than a certain number of letters. The number alongside each word is the number of times the word occurs in that particular document. This number is used to sort the list of matches so that the page with the highest number of matched words appears at the top. The rest of the index file follows this convention for every file included in the index. Unfortunately, no additional inclusion/exclusion rules are available in this version of the script.
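To make the layout of the index concrete, here is a minimal Python sketch of a reader for it. It assumes that each @f, @t, and @m line and each count/word pair sits on its own line; the function name and the shortened sample are illustrative only, and this is not how ice-form.cgi itself reads the file.

```python
def parse_index(text):
    """Parse index text into a list of per-document records."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@f "):
            # A new @f line starts the record for the next document.
            current = {"path": line[3:], "title": "", "mtime": None, "words": {}}
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first @f line
        elif line.startswith("@t "):
            current["title"] = line[3:]
        elif line.startswith("@m "):
            current["mtime"] = int(line[3:])
        else:
            # Every other line is "<count> <word>".
            count, word = line.split(None, 1)
            current["words"][word] = int(count)
    return records


# Shortened excerpt of Listing 12.1, assuming one entry per line.
SAMPLE = """\
@f /home/hypertising/www/DnDCGI/ice/examp2.html
@t Ice Test Document 2
@m 850407823
2 cat
4 cats
8 seven
4 wives
"""

records = parse_index(SAMPLE)
doc = records[0]
print(doc["title"], doc["mtime"], doc["words"]["seven"])
```

Keeping the words in a dictionary keyed by word makes the occurrence count available directly when the matches are later sorted.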
Amazing Factoid
An epoch is a specially defined instant in time. For UNIX systems, it is defined as 00:00:00 (midnight) on January 1, 1970, Coordinated Universal Time (UTC). UNIX system time is traditionally counted as the number of seconds that have elapsed since this instant. This gives a continuous uniform time standard for all UNIX-based systems.
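As a quick illustration, the following Python sketch converts the @m value of the first example document into a calendar date. It assumes the value is a count of seconds since the epoch, which is what its magnitude suggests; the function name is hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert a count of seconds since the UNIX epoch to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The @m value of examp1.html in Listing 12.1, treated as seconds.
stamp = epoch_to_utc(850407821)
print(stamp.isoformat())            # 1996-12-12T16:23:41+00:00
print(epoch_to_utc(0).isoformat())  # 1970-01-01T00:00:00+00:00
```

Read as seconds, the value falls in December 1996, which is a plausible modification date for the example pages.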
The user will interact with the ice-form.cgi script and the predefined index at run-time. As already mentioned, this script can present the search form, accept the query, and return the search results to the user. Figure 12.2 shows the search form as it appears in the ice-form.cgi script before any customizing has been done.
Figure 12.2 The default ice-form.cgi search form
As you can see, ICE includes several interesting options. For example, the ``Don't Show documents older than X days'' field allows users to search only for documents that have changed within the specified number of days. A simple Boolean search capability allows the use of ``and'' and ``or'' to qualify the search terms. The thesaurus option allows alternative words to match the search terms. Substring matching allows the user to match search terms on parts of words or on whole words only. The ``Choose the Area to search'' field allows the user to restrict the search to certain sections of your Web site. For example, the user might want to search only the product information directory, not the whole Web site. This function is based on file directories, so it is useful only if you have multiple directories of HTML documents within your Web site. The options that appear in this <SELECT> list are set manually during the configuration of the script.
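The way these options narrow a search can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the logic of ice-form.cgi: the function names, the mode strings, and the sample word counts are all hypothetical, and the thesaurus option is left out.

```python
SECONDS_PER_DAY = 86400


def term_matches(term, word, substring=False):
    """Compare one search term with one indexed word, ignoring case."""
    term, word = term.lower(), word.lower()
    return term in word if substring else term == word


def document_matches(words, terms, mode="and", substring=False):
    """words: {word: count} for one document; terms: list of search terms."""
    found = [any(term_matches(t, w, substring) for w in words) for t in terms]
    # "and" requires every term to be found; "or" requires at least one.
    return all(found) if mode == "and" else any(found)


def recent_enough(mtime, now, max_age_days):
    """True if the document changed within the last max_age_days days."""
    return now - mtime <= max_age_days * SECONDS_PER_DAY


def in_area(path, area):
    """True if the document lives under the chosen directory."""
    return path.startswith(area.rstrip("/") + "/")


# Hypothetical word counts for a single document.
words = {"cats": 4, "sacks": 4, "seven": 8}
print(document_matches(words, ["cat", "seven"], mode="and"))
print(document_matches(words, ["cat", "seven"], mode="and", substring=True))
print(document_matches(words, ["cat", "seven"], mode="or"))
```

With whole-word matching, ``cat'' does not match ``cats'', so the ``and'' query fails until substring matching is switched on, while the ``or'' query succeeds on ``seven'' alone.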
Hair Saver
The index must be regenerated after each change to a file on your Web site in order for ICE to detect the change. If you don't rerun the ice-idx.pl script, ICE won't know about the modifications and may even return invalid links if you've deleted or renamed files. Bottom line: Remember to run ice-idx.pl after each change or set it to run periodically using cron or another scheduled execution utility.
Figure 12.3 shows the results of a search query. The results page that comes with the script is quite generic. You'll probably want to customize it to match the look of the rest of your Web site.
Figure 12.3 Search results from ice-form.cgi
Each match from the script contains the title of the document and the last modification date, presented as a hyperlink to the document. Also presented is the URL of the document relative to the server root, the search terms that matched, and the number of matching occurrences within the document. Thesaurus matches are shown in the same way as substring matches--the matched word being followed by the search term in parentheses.
Now that you have an idea of how ICE works, let's move on to configuring the scripts for use on your Web site.
Copyright © 1997 Addison-Wesley Pub Co. and
Created: Oct. 24, 1997
Revised: Oct. 27, 1997