HTML Unleashed PRE: Strategies for Indexing and Search Engines

How Search Engines Work


First off, I have some bad news for you.  When you investigate this field, you're going to discover that search engines are, above all, proprietary technology in a very competitive market.  Simply put, search engine companies keep their secrets well and reveal to the public only what they consider safe to reveal.

Given the importance, and indeed necessity, of special care and feeding of your site with regard to search engines, this is really discouraging.  Webmasters interested in the matter have to rely mostly on their own research, which is often biased and incomplete.  Rumors, gossip, and controversy blossom.

Of course, there are sites whose maintainers are busy collecting information and doing research in this field (such as Search Engine Watch), and this chapter couldn't but draw much of its material from these useful sources.  However, when it comes to the details of search engine technology, you should take the conclusions of these third-party investigators with a grain of salt.

On the other hand, answers to certain questions simply cannot be obtained other than by research, because systems of this level of complexity tend to be governed by multi-factor statistics rather than by simple logic.  In fact, some peculiarities in the behavior of these searching beasts may be as much of a surprise to their creators as to the general public.  Huge information-processing systems sometimes behave like living beings rather than soulless machines.

With these restrictions in mind, let's consider the principal gears that rotate inside a typical search engine.




Spiders


Indexing spiders (sometimes called robots, bots, or crawlers) are the secret agents doing the work whose results you enjoy when performing searches.  Spider programs, just like browsers, request and retrieve documents from web servers; but unlike browsers, they do it not for viewing by humans but for automatic indexing and inclusion in a database.  They do this tirelessly, in hardly imaginable quantities (millions of pages per day), around the clock and without days off.

Spiders are what set search engines apart from directories (one of the most prominent directories is Yahoo).  Directories keep no pet spiders, because all links in a directory are discovered (or submitted), examined, and annotated by humans.  This makes the hand-picked resources of directories, on average, much more valuable but much less voluminous than the homogeneous heap of links in a search engine.

Each new document encountered by the spider is scanned for links, and these links are either traversed immediately or scheduled for later retrieval.  Theoretically, by following all links starting from a representative initial set of documents, a spider will end up having indexed the whole Web.
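
To make the traversal concrete, here is a minimal sketch in Python of what such a crawl amounts to.  The function names, the breadth-first queue, and the page limit are illustrative assumptions of the sketch, not a description of any particular engine's spider.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href targets of A tags found on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=100):
    """Breadth-first traversal: fetch a page, queue its links, repeat."""
    queue = deque(seed_urls)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                           # unreachable or unfetchable: skip it
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            queue.append(urljoin(url, link))   # resolve relative links
    return visited

A real spider would, at this point, hand each retrieved page over for indexing instead of merely remembering its URL.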

In practice, however, this goal is unachievable.  To begin with, lots of documents on the Web are generated dynamically, most often in response to input from a form.  Naturally, while spiders can follow links, they have no idea what to put into the fields of a form, so any data retrieved upon request is inherently inaccessible to search spiders (unless an alternative access mechanism is provided).  Various web-accessible databases, including search engines themselves, belong to this category.

Also, spiders can never reach pages that are customized via cookies or pages whose content is affected by various JavaScript or Java tricks.  Some spiders cannot even understand frames (see "Frames," later in the chapter).  As you might have guessed, search engines cannot yet make heads or tails of images, audio, or video clips, so these bits of information are simply lost (in fact, spiders don't even request them).  What remains is the pure HTML source, from which spiders additionally strip all markup to get to the bare-bones plain text.
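
As a rough illustration of that last step, the following Python sketch boils an HTML source down to plain text; the class name and the decision to discard SCRIPT and STYLE contents are assumptions of the example rather than a description of any real spider.

from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps only character data, discarding tags and SCRIPT/STYLE contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0          # depth inside SCRIPT or STYLE elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

    def text(self):
        return " ".join(" ".join(self.chunks).split())


extractor = TextExtractor()
extractor.feed("<title>Demo</title><p>Plain <b>text</b> only.</p>")
print(extractor.text())         # prints: Demo Plain text only.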

Even with these economizing assumptions, boxing up the entire Web into a single database turns out to be a practically unfeasible task.  It might have been possible just a year ago, but not now that the Web has grown so large.  That's why search engines are moving from the strategy of swallowing everything they see to various selection techniques.

Ideally, this selection should aim at improving the quality of the database by discarding junk and scanning only the premier web content.  In reality, of course, this is impossible, because no automatic program is smart enough to separate the wheat from the tares.  The only way to sort out anything is to impose some rather arbitrary restrictions.

One search engine that admits to "sampled spidering" is Alta Vista.  It's been claimed that the quota for Alta Vista's spider is no more than 600 documents per single domain.  If true, this means that large domains are severely underrepresented in Alta Vista's database.  It remains open to speculation whether other search engines employ similar sampling techniques or whether the size of their databases is limited only by their technical capacity.

All search engines allow users to add their URLs to the database for spidering.  Some of them retrieve submitted documents immediately, others schedule them for future scanning, but in any case this at least lets you make sure that your domain isn't missed.  You're supposed to submit only the root URL of your site; using this mechanism to register each and every page has been criticized as a sort of "spamming."  On the other hand, given the selective nature of spidering, it's not a bad idea to register at least all the key pages of your site.  (Be careful, however: some search engines limit the number of submissions per domain.)

Another important question is how often spiders update their databases by revisiting sites they've already indexed.  This interval varies significantly between engines, with figures quoted from one week to several months.  This aspect of search engine performance allows some independent estimation: you can analyze your server's access logs to see when you were visited by spiders and which documents they requested.  A helpful Perl script for this purpose, called BotWatch, is available at
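
If you'd rather make a quick check by hand, the idea looks roughly like the Python sketch below.  It assumes your server writes the combined log format with a User-Agent field, and the spider signatures listed are merely examples; replace them with whatever names actually appear in your own logs.

import re
from collections import Counter

# Combined log format: host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[([^\]]+)\] "(?:GET|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

# Substrings that identify spiders in the User-Agent field (illustrative only).
SPIDER_SIGNATURES = ("Scooter", "Slurp", "ArchitextSpider", "crawler", "spider", "bot")


def spider_visits(logfile):
    """Count which documents each spider requested and note its latest visit."""
    hits = Counter()
    last_seen = {}
    with open(logfile) as fh:
        for line in fh:
            match = LOG_LINE.match(line)
            if not match:
                continue
            timestamp, path, agent = match.groups()
            if any(sig.lower() in agent.lower() for sig in SPIDER_SIGNATURES):
                hits[(agent, path)] += 1
                last_seen[agent] = timestamp
    return hits, last_seen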

Many search engines have problems with sites in languages other than English, especially if those languages use character sets other than ISO 8859-1 (see Chapter 41, "Internationalizing Your HTML").  For example, HotBot returns nothing when queried with keywords in Russian, nor can it properly display summaries for documents in Russian.  This makes it useless for Russian surfers, despite the fact that HotBot's spider routinely scans a good share of all web sites in Russia.


Search Interface


This is the visible part of the search engine iceberg.  Every day, millions of people enter myriads of keywords into search forms and get innumerable URLs in response.  This is already one of the biggest and most intensively used information resources on Earth.

I'm not going to teach you how to use search engines, as this is beyond the scope of this book.  However, to create search-friendly HTML documents you must be aware of the range of features offered to the users of modern search engines.


Basic Options


All major search engines offer, besides the simplest form of query with one or several keywords, some additional search options.  However, the scope of these features varies significantly, and no standard syntax for invoking them has yet been established.  Among the most common search options (a small sketch of how an engine might evaluate them follows the list) are:

  • Boolean operators: AND (find all), OR (find any), AND NOT (exclude) to combine keywords in queries;

  • phrase search: looking for the keywords only if they're positioned in the document next to each other, in this particular order;

  • proximity: looking for the keywords only if they're close enough to each other (the notion of "close enough" ranges from 2 in-between words for WebCrawler to 25 words for Lycos);

  • media search: looking for pages containing Java applets, Shockwave objects, and so on;

  • special searches: looking for keywords or URLs within links, image names, document titles;

  • various search constraints: limiting search to a time span of document creation, specifying document language (Alta Vista), and so on.
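
To give a feel for what the first of these options amounts to on the engine's side, here is a minimal Python sketch of Boolean queries evaluated against an inverted index.  The index structure, the function names, and the tiny document collection are illustrative assumptions, not the internals of any actual engine.

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def boolean_query(index, all_of=(), any_of=(), none_of=()):
    """Combine keywords with AND (all_of), OR (any_of), and AND NOT (none_of)."""
    if all_of:
        results = set.intersection(*(index.get(w, set()) for w in all_of))
    else:
        results = set().union(*index.values())
    if any_of:
        results = results & set().union(*(index.get(w, set()) for w in any_of))
    for word in none_of:
        results = results - index.get(word, set())
    return results


docs = {
    1: "search engines index the web",
    2: "directories are compiled by humans",
    3: "spiders crawl the web for search engines",
}
index = build_index(docs)
print(boolean_query(index, all_of=["search", "web"], none_of=["directories"]))   # {1, 3}

Phrase and proximity searches additionally require storing word positions in the index, not just document ids.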

You should be aware that even with a full inventory of these bells and whistles, you cannot expect from a search engine capabilities comparable to, say, the Search dialog in Microsoft Word.

For example, Alta Vista suggests using its database as a spelling dictionary: search for CDROM and CD-ROM and see which "wins" by yielding more results.  A bright idea, but you can't resolve the controversy of World Wide Web vs. World-Wide Web in a similar fashion, simply because the system treats both hyphens and spaces as "separators" and cannot differentiate between them.  Those accustomed to grep-style regular expressions can't even dream of using anything similar with search engines.

This may change in the future, although for now the search interface seems to be developing in another direction, described in the following subsection.


Interactive Refining


Recently, several engines have developed schemes to categorize the results of a search by combining them into groups with a similar "keyword spectrum."  By selecting the Refine button in Alta Vista, you get a list of several categories that your results break into, which lets you require the inclusion or exclusion of any category in the next search iteration.

Similarly, Excite invites you to "Select words to add to your search," with these additional keywords extracted from the results just obtained.  This lets you narrow the search much more efficiently than you could by blindly trying different keywords.

Northern Light Search also sorts its search results into "folders" based on their content and domain URL.  All these features make really powerful searching possible by interactively detecting trends in the data.
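
The mechanics behind such refinement suggestions can be imagined along the lines of the following Python sketch, which simply proposes frequent words from the result set that were not part of the query; the word-length cutoff and the one-vote-per-document counting are arbitrary choices of the example.

from collections import Counter


def refinement_terms(result_texts, query_terms, how_many=5):
    """Suggest frequent words from the results that the query did not contain."""
    query = {t.lower() for t in query_terms}
    counts = Counter()
    for text in result_texts:
        for word in set(text.lower().split()):     # count each word once per document
            if word not in query and len(word) > 3:
                counts[word] += 1
    return [word for word, _ in counts.most_common(how_many)]


results = [
    "java applets embedded in web pages",
    "the java programming language tutorial",
    "java programming for web applets",
]
print(refinement_terms(results, ["java"]))   # top suggestions include 'applets' and 'programming'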




Ranking of Results


All search engines rank their results so that more relevant documents are listed first.  This sorting is based, first, on the frequency of keywords within a document and, second, on the distance of the keyword occurrences from the beginning of the document.

In other words, if one document contains two matches for a keyword and another is identical but contains only one, the first document will be closer to the top of the list.  If two documents are identical except that one has a keyword positioned closer to the top (especially in the document title), that one will come first.
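
A toy scoring function along these lines might look like the following Python sketch; the weights, the title bonus, and the position formula are arbitrary assumptions chosen only to illustrate the frequency-plus-position idea.

def rank(documents, keyword):
    """Order documents by keyword frequency and by how early the keyword appears."""
    keyword = keyword.lower()
    scored = []
    for doc_id, (title, body) in documents.items():
        words = (title + " " + body).lower().split()
        frequency = words.count(keyword)
        if frequency == 0:
            continue
        first = words.index(keyword)             # earlier occurrences score higher
        score = frequency + 1.0 / (1 + first)
        if keyword in title.lower().split():
            score += 2                           # extra weight for a match in the title
        scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]


docs = {
    "a": ("Robots and spiders", "spiders index the web"),
    "b": ("Web directories", "links are reviewed by humans not by spiders"),
}
print(rank(docs, "spiders"))   # ['a', 'b']: more matches, an earlier match, and a title match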

In addition to this, some search engines use extra factors, called relevancy boosters, to determine the ranking order.  For instance, HotBot and Infoseek favor documents that make use of META tags over their META-less peers.

WebCrawler relies on link popularity: if a page is heavily linked to from other pages and sites, it is considered "more authoritative" and receives some impetus up the list of results.  Excite, being a combination of a search engine and a directory, quite naturally gives preference to pages that are reviewed in its directory.
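
Link popularity of this simple counting kind can be sketched as follows; the graph representation and the idea of adding the inbound-link count to a relevancy score are assumptions of the example, not WebCrawler's actual algorithm.

from collections import Counter


def link_popularity(link_graph):
    """Count how many distinct pages link to each URL (a crude measure of authority)."""
    inbound = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):          # one vote per linking page
            if target != source:             # ignore links to oneself
                inbound[target] += 1
    return inbound


graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "d.html": ["c.html", "b.html"],
}
print(link_popularity(graph))   # c.html has 3 inbound links, b.html has 2

A search engine could then add some function of this count to the relevancy score computed from keyword frequency and position.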

Finally, all search engines try to fight the unfair practices of some webmasters who attempt to fool the ranking algorithm by repeating keywords to inflate their effective frequency in a document.  You might have noticed pages with a tail of hundreds of repeated keywords (usually made invisible in browsers by changing the font color, but still visible to search engines) or pages with multiple TITLE elements (again, only the first is visible in browsers, but all of them are indexed by a spider).  Now not only do such "keyword spammers" fail to receive high rankings, but many search engines automatically exclude them from the database.  (For more on spamming, see "The Meta Controversy" later in the chapter.)
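
Catching the crudest forms of this abuse doesn't take much.  The Python sketch below flags pages with more than one TITLE element or with a single word making up an implausibly large share of the text; the thresholds are pure assumptions, and real engines certainly apply subtler heuristics.

import re
from collections import Counter


def looks_like_keyword_spam(html, share_threshold=0.2, min_words=50):
    """Flag pages with multiple TITLE elements or with one dominating word."""
    if len(re.findall(r"<title\b", html, re.IGNORECASE)) > 1:
        return True
    words = re.sub(r"<[^>]+>", " ", html).lower().split()
    if len(words) < min_words:
        return False                         # too little text to judge
    top_word, top_count = Counter(words).most_common(1)[0]
    return top_count / len(words) > share_threshold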




Summaries


Usually, lists of search results contain document titles, URLs, summaries, sometimes the dates of document creation (or, with other search engines, the dates of their inclusion in the database), and document sizes.  Several approaches have been developed for compiling document summaries.

Many search engines use the META descriptions provided by page authors, but when META data is unavailable they usually take the first 100 or 200 characters of the page text.  Excite stands apart by ignoring META tags altogether and employing a sophisticated, but not particularly well-performing, algorithm that extracts sentences that seem to be the "theme" of the page and presents them as the page's summary.
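
The META-or-first-characters approach can be sketched in a few lines of Python.  The regular expressions below are deliberately crude, they assume the description's content attribute follows the name attribute, and they are not meant to reflect any engine's actual parser.

import re


def summarize(html, length=200):
    """Prefer the author-supplied META description; otherwise take the first
    `length` characters of the page text."""
    meta = re.search(
        r'<meta\s+name=["\']?description["\']?\s+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if meta:
        return meta.group(1)
    text = " ".join(re.sub(r"<[^>]+>", " ", html).split())   # crude tag stripping
    return text[:length]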

However, the solution that seems optimal to me is the one used by Aport, a Russian search engine.  Instead of generating summaries, Aport simply lists, for each document found, the sentences from the document that matched the query.  Indeed, in order to decide whether a document is worth browsing, we're often more interested in the context of the keyword match than in what sort of document it is.
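
A keyword-in-context listing of this kind could be approximated as in the following sketch; the sentence-splitting expression and the limit of three sentences per document are assumptions of the example.

import re


def matching_sentences(text, query_terms, limit=3):
    """Return up to `limit` sentences containing at least one query term,
    in the order in which they appear in the document."""
    terms = [t.lower() for t in query_terms]
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = [s for s in sentences if any(t in s.lower() for t in terms)]
    return hits[:limit]


page = ("Spiders crawl the web day and night. Directories are compiled by humans. "
        "A spider strips markup before indexing a page.")
print(matching_sentences(page, ["spider"]))
# ['Spiders crawl the web day and night.', 'A spider strips markup before indexing a page.']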

Aport has a number of other features unique among search engines.  For example, it lets you retrieve a text-only reconstruction of a document directly from the search engine's database in case the original document (or the server it's stored on) is inaccessible.

