HTML Unleashed PRE. Strategies for Indexing and Search Engines: Search Interface | WebReference


HTML Unleashed PRE: Strategies for Indexing and Search Engines


Search Interface


The search interface is the visible part of a search engine's iceberg.  Every day millions of people enter myriads of keywords into search forms and get innumerable URLs in response.  This is already one of the biggest and most intensively used information resources on Earth.

I'm not going to teach you how to use search engines, as that's beyond the scope of this book.  However, to create search-friendly HTML documents, you must be aware of the range of features offered to the users of modern search engines.


Basic Options


Besides the simplest form of query with one or several keywords, all major search engines offer some additional search options.  However, the scope of these features varies significantly, and no standard syntax for invoking them has yet been established.  Among the most common search options are:

  • Boolean operators: AND (find all), OR (find any), AND NOT (exclude) to combine keywords in queries;

  • phrase search: looking for the keywords only if they're positioned in the document next to each other, in this particular order;

  • proximity: looking for the keywords only if they're close enough to each other (the notion of "close enough" ranges from 2 intervening words for WebCrawler to 25 words for Lycos);

  • media search: looking for pages containing Java applets, Shockwave objects, and so on;

  • special searches: looking for keywords or URLs within links, image names, document titles;

  • various search constraints: limiting the search to a time span of document creation, specifying a document language (Alta Vista), and so on.
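To make the first three options more concrete, here is a minimal Python sketch of how Boolean, phrase, and proximity matching could work on a tokenized document.  The function names and the tokenizing rule are my own assumptions for illustration; no search engine mentioned here documents its matching this way.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all(tokens, words):
    """Boolean AND: every word must occur in the document."""
    return set(words) <= set(tokens)

def matches_any(tokens, words):
    """Boolean OR: at least one word must occur in the document."""
    return bool(set(words) & set(tokens))

def matches_and_not(tokens, include, exclude):
    """Boolean AND NOT: all of `include`, none of `exclude`."""
    return matches_all(tokens, include) and not matches_any(tokens, exclude)

def matches_phrase(tokens, phrase):
    """Phrase search: the words must be adjacent and in this order."""
    words = tokenize(phrase)
    if not words:
        return False
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def matches_proximity(tokens, first, second, max_between):
    """Proximity: at most `max_between` words lie between the two keywords."""
    pos_a = [i for i, t in enumerate(tokens) if t == first]
    pos_b = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(a - b) - 1 <= max_between for a in pos_a for b in pos_b)

doc = tokenize("Search engines index the World Wide Web every day")
print(matches_all(doc, ["search", "web"]))            # True
print(matches_phrase(doc, "world wide web"))          # True
print(matches_proximity(doc, "search", "web", 2))     # False: five words lie between
print(matches_proximity(doc, "search", "web", 25))    # True
```

The last two calls use the two limits quoted above (2 and 25 words) to show how the same document can pass one proximity test and fail another.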

You should be aware that even with a full inventory of these bells and whistles, you cannot expect a search engine to offer capabilities comparable to, say, the Search dialog in Microsoft Word.

For example, Alta Vista suggests using its database as a spelling dictionary: search for CDROM and CD-ROM and see which will "win" by yielding more results.  A bright idea, but you can't settle the controversy of World Wide Web vs. World-Wide Web in the same way, simply because the system treats both hyphens and spaces as "separators" and cannot differentiate between them.  Those accustomed to regular expressions such as those used in Perl or awk can't even dream of using something similar with search engines.
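The effect of treating hyphens and spaces alike is easy to see in a small Python sketch.  The function index_terms is a hypothetical stand-in for an indexer's word splitter, not Alta Vista's actual code.

```python
import re

def index_terms(text):
    """Split text into lowercase terms, treating spaces and hyphens as separators."""
    return [term for term in re.split(r"[\s\-]+", text.lower()) if term]

# The two spellings become indistinguishable once they are split into terms.
print(index_terms("World Wide Web"))   # ['world', 'wide', 'web']
print(index_terms("World-Wide Web"))   # ['world', 'wide', 'web']

# CDROM and CD-ROM, by contrast, still produce different terms.
print(index_terms("CDROM"))            # ['cdrom']
print(index_terms("CD-ROM"))           # ['cd', 'rom']
```

Under this assumption, the CDROM vs. CD-ROM comparison works only because one spelling has no separator at all.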

In the future, search engines may offer more sophisticated options, although for now, their search interfaces seem to be developing in another direction, described in the following subsection.


Interactive Refining


Recently, several search engines have developed schemes to categorize the results of a search by combining them into groups with a similar "keyword spectrum."  By selecting the Refine button in Alta Vista, you get a list of several categories that your results fall into, and you can specify which categories to include in or exclude from the next search iteration.

Similarly, Excite invites you to "Select words to add to your search," with these additional keywords extracted from the results just obtained.  This selection allows you to narrow the search in a much more efficient fashion than you could do by blindly trying different keywords.
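One simple way to produce such suggestions is to count which words, other than the query itself, turn up in the most results.  The Python sketch below does just that; it is only a guess at the general idea, and the stopword list and the suggest_terms function are assumptions of mine, not a description of Excite's algorithm.

```python
import re

# A tiny, assumed stopword list; a real engine would use a much longer one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}

def words_of(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def suggest_terms(query, results, limit=3):
    """Suggest the words that occur in the most results but not in the query."""
    query_words = words_of(query)
    counts = {}
    for text in results:
        for word in words_of(text) - query_words - STOPWORDS:
            counts[word] = counts.get(word, 0) + 1
    # Most frequent first; ties are broken alphabetically for a stable order.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:limit]

results = ["jaguar car dealer", "jaguar car prices", "jaguar cat habitat"]
print(suggest_terms("jaguar", results))   # ['car', 'cat', 'dealer']
```

Adding "car" or "cat" to the query then narrows the search along a trend that is already present in the data.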

Northern Light Search also sorts its search results into "folders" based on their content and the domain URL.  All these features make really powerful searching possible by interactively detecting trends in the data.




All search engines rank their results so that more relevant documents are at the top of the list.  This sorting is based on, first, the frequency of keywords within a document, and second, the distance of keyword occurrences from the beginning of the document.

In other words, if one document contains two matches for a keyword and another is identical but contains only one, the first document will be closer to the top of the list.  If two documents are identical except that one has a keyword positioned closer to the top (especially in the document title), it will come first.
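These two principles can be sketched as a scoring function in Python.  The formula is an assumption chosen for illustration (frequency counts first, position only breaks ties); real engines do not publish their ranking formulas.

```python
import re

def score(document, keyword):
    """Score a document: keyword frequency plus a bonus for an early first match."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    positions = [i for i, t in enumerate(tokens) if t == keyword.lower()]
    if not positions:
        return 0.0
    frequency = len(positions)
    # The bonus lies in (0, 1], so it can never outweigh one extra occurrence.
    earliness = 1.0 / (1 + positions[0])
    return frequency + earliness

def rank(documents, keyword):
    """Return the documents sorted from most to least relevant."""
    return sorted(documents, key=lambda d: score(d, keyword), reverse=True)

docs = [
    "spiders crawl the web",
    "the web is a web of links",
    "web pages link to other pages",
]
for doc in rank(docs, "web"):
    print(score(doc, "web"), doc)
# 2.5 the web is a web of links
# 2.0 web pages link to other pages
# 1.25 spiders crawl the web
```

The document with two matches comes first, and of the two documents with a single match, the one whose keyword appears earlier wins.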

In addition to these principles, some search engines use extra factors to determine the ranking order, called relevancy boosters.  For instance, HotBot and Infoseek favor those documents that make use of META tags over their METAless peers.

WebCrawler relies on link popularity: if a page is linked frequently from other pages and sites, it is considered "more authoritative" and gets some priority on the list of results.  Excite, being a combination of a search engine and a directory, quite naturally gives preference to those pages that are reviewed in its directory.

Finally, all search engines try to fight unfair practices of some webmasters who attempt to fool the ranking algorithm by repeating keywords to improve their effective frequency in the documents.  You might have noticed pages with a tail of hundreds of repeated keywords (usually made invisible in browsers by changing font color, but still visible to search engines) or pages with multiple TITLE elements (again, only the first one is visible in browsers, but all are indexed by a spider).  Now, not only do such "keyword spammers" not receive high rankings, but many search engines also automatically exclude them from the database.  (For more on spamming, see "The Meta Controversy" later in this chapter.)




Usually, lists of search results contain document titles, URLs, summaries, document sizes, and sometimes dates (with some search engines, the date of the document's creation; with others, the date of its inclusion in the database).  For compiling document summaries, several approaches have been developed.

Many search engines use META descriptions provided by page authors, but when META data is unavailable, they usually take the first 100 or 200 characters of page text.  Excite stands apart by ignoring META tags altogether and employing a sophisticated (but not particularly well-performing) algorithm that extracts sentences appearing to be the "theme" of the page and presents them as the page's summary.
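The first approach (a META description if present, otherwise the beginning of the page text) might look roughly like this in Python, using the standard html.parser module.  The SummaryParser class, the summarize function, and the decision to skip TITLE, SCRIPT, and STYLE content are my assumptions, not any particular engine's behavior.

```python
from html.parser import HTMLParser

SKIPPED = ("title", "script", "style")  # assumed: not counted as page text

class SummaryParser(HTMLParser):
    """Collect the META description and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)

def summarize(html, limit=200):
    """Return the META description, or the first `limit` characters of text."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    if parser.description:
        return parser.description.strip()
    text = " ".join(" ".join(parser.text_parts).split())
    return text[:limit]

with_meta = ('<html><head><title>T</title>'
             '<meta name="description" content="A page about spiders.">'
             '</head><body><p>Body text</p></body></html>')
without_meta = ('<html><head><title>T</title></head>'
                '<body><p>Hello   world</p></body></html>')
print(summarize(with_meta))      # A page about spiders.
print(summarize(without_meta))   # Hello world
```

Either way, a page author who supplies a META description controls the summary, while one who doesn't is at the mercy of whatever text happens to come first.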

However, the solution that seems optimal to me is that used by Aport, a Russian search engine.  Instead of generating summaries, Aport just lists, for each document found, the sentences from the document that matched the query.  Indeed, to decide whether a document is worth browsing, we're often more interested in the context of the keyword match than in what sort of document it is.
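A rough Python sketch of this keyword-in-context idea follows.  The sentence-splitting rule and the matching_sentences function are simplifying assumptions; how Aport actually selects sentences is not described here.

```python
import re

def matching_sentences(text, query):
    """Return the sentences of `text` that contain at least one query word."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Naive assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        sentence for sentence in sentences
        if query_words & set(re.findall(r"[a-z0-9]+", sentence.lower()))
    ]

page = "Spiders crawl the Web. They follow links. A spider stores pages in an index."
for sentence in matching_sentences(page, "links index"):
    print(sentence)
# They follow links.
# A spider stores pages in an index.
```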

Aport has a number of other features unique among search engines.  For example, it allows you to retrieve a text-only reconstruction of the document directly from the search engine's database, in case the original document (or the server it's stored on) is inaccessible.


Created: Sept. 29, 1997
Revised: Sept. 29, 1997