HTML Unleashed PRE. Strategies for Indexing and Search Engines: How Search Engines Work | 2 | WebReference


 

HTML Unleashed PRE: Strategies for Indexing and Search Engines

How Search Engines Work

 
 

First off, I have some bad news for you.  When you investigate the field, you will discover that search engines are, above all, a proprietary technology in a very competitive market.  Simply put, search engine companies keep their secrets well and reveal to the public only what they consider safe to reveal.

Given the importance, and indeed the necessity, of the special care and feeding of your site with regard to search engines, this is really discouraging.  Webmasters who are interested in the matter have to rely mostly on their own research, which is often biased and incomplete.  Rumors, gossip, and controversy are rampant.

Of course, there are sites whose maintainers are busy collecting information and doing research in this field (such as Search Engine Watch), and this chapter could not help but draw much of its material from these useful sources.  However, when it comes to the details of search engines' technology, you should take the conclusions of these third-party investigators with a grain of salt.

On the other hand, only research can provide the answers to certain questions, because systems of this level of complexity tend to be ruled by multi-factor statistics rather than by simple logic.  In fact, some peculiarities in the behavior of the searching beasts may be as much of a surprise to their creators as to the general public.  Huge information processing systems sometimes behave like living beings rather than soulless machines.

With these restrictions in mind, let's consider the principal gears that rotate inside a typical search engine.

 
 
 

Spiders

 
 

Indexing spiders (sometimes called robots, bots, or crawlers) are the secret agents whose work you enjoy whenever you perform a search.  Spider programs, just like browsers, request and retrieve documents from web servers; but unlike browsers, they do so not for viewing by humans but for automatic indexing and inclusion in their database.  They do it tirelessly, in hardly imaginable quantities (millions of pages per day), around the clock and without days off.

Spiders are what set search engines apart from directories (one of the most prominent directories is Yahoo).  Directories don't keep pet spiders because all links in a directory are discovered (or submitted), examined, and annotated by humans.  This difference makes the hand-picked resources of directories, on average, much more valuable but much less voluminous than the homogeneous heap of links in a search engine.

Each new document encountered by the spider is scanned for links, and these links are either traversed immediately or scheduled for later retrieval.  Theoretically, by following all links starting from a representative initial set of documents, a spider will end up having indexed the whole Web.
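The link-following loop described above can be sketched in a few lines of Python.  This is only an illustration, not the code of any actual search engine: the names `LinkCollector`, `crawl`, and `PAGES` are invented for the example, and the "web" is a hypothetical three-page dictionary held in memory so that nothing is fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first traversal; fetch(url) returns HTML text or None."""
    frontier = deque(seed_urls)  # links scheduled for later retrieval
    seen = set(seed_urls)        # never schedule the same URL twice
    indexed = []
    while frontier and len(indexed) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # document could not be retrieved
        indexed.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return indexed


# Hypothetical in-memory "web" used instead of real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": '<a href="a.html">A</a>',
}

order = crawl(["http://example.com/"], PAGES.get)
print(order)
```

The queue plays the role of the "scheduled for later retrieval" list; a spider that traverses links immediately would use a stack (depth-first) instead.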

In practice, however, this goal is unachievable.  To begin with, lots of documents on the web are generated dynamically, most often in response to input from a form.  Naturally, although spiders can follow links, they have no idea what to put into the fields of a form, so any data retrieved upon request is inherently inaccessible to search spiders (unless an alternative access mechanism is provided).  This category includes various web-accessible databases, including search engines themselves.

Also, spiders can never reach pages that are customized via cookies or pages using any JavaScript or Java tricks that affect their content.  Some spiders cannot even understand frames (see "Frames," later in this chapter).  As you might have guessed, search engines cannot yet make heads or tails of images, audio, or video clips, so these bits of information are wasted (in fact, they aren't even requested by spiders).  What remains is pure HTML source, from which spiders additionally strip all markup and tags to get to the bare-bones plain text.
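To make the last step concrete, here is a rough Python sketch of reducing HTML source to plain text.  The class and function names (`TextExtractor`, `plain_text`) are made up for this example, and the decision to discard the contents of script and style elements is an assumption of the sketch rather than documented behavior of any particular spider.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps character data and drops every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def plain_text(html):
    """Return the text of an HTML document with whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())


print(plain_text("<p>Spiders <b>strip</b> tags</p>"))
```

Note that an image tag contributes nothing to the result: its attributes are never passed to `handle_data`, which mirrors the observation that images are simply ignored.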

Even with these economizing assumptions, boxing up the entire web into a single database turns out to be a practically infeasible task.  It might have been possible just a year ago, but not now that the Web has grown so large.  That's why search engines are now moving from the strategy of swallowing everything they see to various selection techniques.

Ideally, this selection should aim at improving the quality of the database by discarding junk and scanning only the premier web content.  In reality, of course, this kind of discernment is impossible because no automatic program is smart enough to separate the wheat from the chaff.  The only way to sort out anything is by imposing some rather arbitrary restrictions.

One search engine that admits to "sampled spidering" is Alta Vista.  It's been claimed that the quota for Alta Vista's spider is no more than 600 documents per domain.  If true, this means that large domains such as geocities.com or even microsoft.com are severely underrepresented in Alta Vista's database.  It remains open to speculation whether other search engines employ similar sampling techniques or whether the size of their databases is limited only by their technical capacity.
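A per-domain quota of this kind is easy to picture in code.  The following Python sketch is purely hypothetical: nothing is known here about how Alta Vista actually implements its limit, the function name `apply_domain_quota` is invented, and the sketch treats each host name as a "domain" for simplicity.  The default of 600 is the claimed figure quoted above.

```python
from collections import Counter
from urllib.parse import urlparse


def apply_domain_quota(urls, quota=600):
    """Keep at most `quota` URLs per host, in the order encountered."""
    per_host = Counter()
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if per_host[host] < quota:
            per_host[host] += 1
            kept.append(url)
        # URLs beyond the quota are silently dropped, which is why a
        # large site ends up underrepresented.
    return kept


# Hypothetical URLs: five pages on one host, one page on another.
candidates = [f"http://big.example/{i}.html" for i in range(5)]
candidates.append("http://small.example/")
sampled = apply_domain_quota(candidates, quota=3)
print(len(sampled))
```

With a quota of 3, the small site is indexed in full while the larger one loses two of its five pages, regardless of which pages are the important ones.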

All search engines allow users to add their URLs to the database for spidering.  Some of them retrieve submitted documents immediately, others schedule them for future scanning, but in any case this lets you at least make sure that your domain isn't missed.  You're supposed to submit only the root URL of your site; using this mechanism to register each and every page has been condemned as a sort of "spamming."  On the other hand, given the selective nature of spidering, it's not a bad idea to register at least all key pages of your site.  (Be careful, however: some search engines limit the number of submissions per domain.)

Another important question is how often spiders update their databases by revisiting sites they've already indexed.  This parameter varies significantly among engines, with quoted update periods ranging from one week to several months.  This aspect of search engines' performance can be estimated independently: You can analyze your server's access logs to see when you were visited by spiders and what documents they requested.  A helpful Perl script for this purpose, called BotWatch, is available at http://www.tardis.ed.ac.uk/~sxw/robots/botwatch.html.
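As a rough idea of what such log analysis involves, here is a small Python sketch.  It is not BotWatch and makes several assumptions: that the server writes the "combined" log format (which includes the user-agent string), that spiders can be recognized by the substrings listed in `SPIDER_HINTS`, and that the sample lines, host names, and the agent "ExampleSpider/1.0" are all invented for illustration.

```python
import re
from collections import defaultdict

# Assumed layout: host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Heuristic substrings only; real spiders may use entirely different names.
SPIDER_HINTS = ("spider", "crawler", "robot", "bot")


def spider_visits(lines):
    """Map each spider-like user agent to its (time, path) requests."""
    visits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent")
        if any(hint in agent.lower() for hint in SPIDER_HINTS):
            visits[agent].append((match.group("time"), match.group("path")))
    return dict(visits)


# Invented sample lines, not taken from a real server.
SAMPLE_LOG = [
    'spider.example.net - - [29/Sep/1997:10:15:32 +0000] '
    '"GET /robots.txt HTTP/1.0" 404 210 "-" "ExampleSpider/1.0"',
    'dialup7.example.org - - [29/Sep/1997:10:16:01 +0000] '
    '"GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/3.0 (Win95; I)"',
    'spider.example.net - - [29/Sep/1997:10:16:40 +0000] '
    '"GET /index.html HTTP/1.0" 200 5120 "-" "ExampleSpider/1.0"',
]

for agent, requests in spider_visits(SAMPLE_LOG).items():
    print(agent, requests)
```

Comparing the timestamps of successive visits by the same agent over several weeks gives an estimate of that engine's revisit period for your site.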

Many search engines have problems with sites in languages other than English, especially if these languages use character sets different from ISO 8859-1 (see Chapter 41, "Internationalizing Your HTML").  For example, HotBot returns nothing when queried with keywords in Russian, nor can it properly display summaries for documents in Russian.  This makes it useless for Russian surfers, despite the fact that HotBot's spider routinely scans a good share of all web sites in Russia.
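One way such trouble can arise is shown in the short Python sketch below.  It assumes, purely for illustration, that a Russian page is encoded in KOI8-R and that the software reading it treats every byte as ISO 8859-1; whether this is what actually happens inside HotBot or any other engine is not known.

```python
word = "поиск"                       # Russian for "search"
raw = word.encode("koi8_r")          # bytes as they might be stored in the page
misread = raw.decode("iso8859_1")    # the same bytes read as ISO 8859-1

print(raw)       # five bytes, all above 0x7F
print(misread)   # accented Latin letters instead of Cyrillic
print(raw.decode("koi8_r") == word)  # correct decoding restores the word
```

The bytes survive intact, but every character is interpreted as an accented Latin letter, so a query typed in Cyrillic never matches the indexed text and the summaries come out as gibberish.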

 

Created: Sept. 29, 1997
Revised: Sept. 29, 1997

URL: http://www.webreference.com/dlab/books/html-pre/43-1-1.html