
HTML Unleashed PRE: Strategies for Indexing and Search Engines

How Search Engines Work

 
 

First off, I have some bad news for you.  When you look into this field, you will discover that search engines are, above all, a proprietary technology in a very competitive market.  Simply put, search engine companies keep their secrets well and reveal to the public only what they consider safe to reveal.

Given the importance, and indeed the necessity, of the special care and feeding of your site with regard to search engines, this is really discouraging.  Webmasters who are interested in the matter have to rely mostly on their own research, which is often biased and incomplete.  Rumors, gossip, and controversy are rampant.

Of course, there are sites whose maintainers are busy collecting information and doing research in this field (such as Search Engine Watch), and this chapter could not help drawing much of its material from these useful sources.  However, when it comes to the details of search engines' technology, you should take the conclusions of these third-party investigators with a grain of salt.

On the other hand, only research can provide the answers to certain questions, because systems of this level of complexity tend to be ruled by multi-factor statistics rather than by simple logic.  In fact, some peculiarities in the behavior of the searching beasts may be as much of a surprise to their creators as to the general public.  Huge information processing systems sometimes behave like living beings rather than soulless machines.

With these restrictions in mind, let's consider the principal gears that rotate inside a typical search engine.

 
 
 

Spiders

 
 

Indexing spiders (sometimes called robots, bots, or crawlers) are the secret agents whose work you benefit from every time you run a search.  Spider programs, just like browsers, request and retrieve documents from web servers; unlike browsers, however, they do so not for viewing by humans but for automatic indexing and inclusion in their database.  They do it tirelessly, in barely imaginable volumes (millions of pages per day), around the clock and without days off.

Spiders are what sets search engines apart from directories (one of the most prominent directories is Yahoo).  Directories keep no pet spiders because all links in a directory are discovered (or submitted), examined, and annotated by humans.  This difference makes the hand-picked resources of directories, on average, much more valuable but much less voluminous than the homogeneous heap of links in a search engine.

Each new document encountered by the spider is scanned for links, and these links are either traversed immediately or scheduled for later retrieval.  Theoretically, by following all links starting from a representative initial set of documents, a spider will end up having indexed the whole Web.
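The traversal described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not the code of any real search engine: the `crawl` function, the `LinkExtractor` class, and the three-page `TOY_WEB` dictionary are all hypothetical, and documents are read from memory rather than requested from web servers.  Links found in each document are scheduled in a queue for later retrieval, and a set of already seen URLs keeps the spider from looping.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch):
    """Breadth-first traversal: index each page once, queue its links."""
    queue = deque(seed_urls)   # links scheduled for later retrieval
    seen = set(seed_urls)      # URLs already scheduled, to avoid loops
    indexed = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:       # missing or unreachable document
            continue
        indexed.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


# Hypothetical three-page web held in memory instead of on real servers.
TOY_WEB = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/index.html">home</a>',
    "/b.html": '<a href="/a.html">A</a> <a href="/missing.html">gone</a>',
}

print(crawl(["/index.html"], TOY_WEB.get))
```

Starting from the single seed `/index.html`, the sketch reaches every page that is linked from somewhere, and nothing else.  A page that no indexed document links to would never be found, which is why the choice of the initial set matters.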

In practice, however, this goal is unachievable.  To begin with, lots of documents on the web are generated dynamically, most often in response to input from a form.  Naturally, although spiders can follow links, they have no idea what to put into the fields of a form, so any data retrieved upon request is inherently inaccessible to search spiders (if no alternative access mechanism is provided).  This category includes various web-accessible databases, among them search engines themselves.

Also, spiders can never reach pages that are customized via cookies or pages using any JavaScript or Java tricks that affect their content.  Some spiders cannot even understand frames (see "Frames," later in this chapter).  As you might have guessed, search engines cannot yet make heads or tails of any images, audio, or video clips, so these bits of information are wasted (in fact, they aren't even requested by spiders).  What remains is pure HTML source, from which spiders additionally strip all markup and tags to get to the bare-bones plain text.
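To make the last step concrete, here is a minimal sketch of reducing HTML source to plain text with Python's standard `html.parser` module.  The `TextOnly` class and the `plain_text` function are hypothetical names chosen for this example, and real spiders surely differ in the details.  Tags are dropped, the contents of `script` and `style` elements are skipped, and whatever character data remains is joined with single spaces.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps character data and ignores tags, scripts, and style sheets."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def plain_text(html):
    """Return the text of a document with markup removed and spaces collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words in adjacent elements apart; it would
    # also split a word whose letters are wrapped in separate tags.
    return " ".join(" ".join(parser.chunks).split())


page = (
    "<html><head><title>Spiders</title><script>var x = 1;</script></head>"
    '<body><h1>How <em>spiders</em> work</h1><img src="web.gif">'
    "<p>They read text.</p></body></html>"
)
print(plain_text(page))
```

Note that the image leaves no trace in the result: the sketch never looks at the file named in `src`, just as a spider never requests it.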

Even with these economizing assumptions, boxing up the entire web into a single database turns out to be a practically infeasible task.  It might have been possible just a year ago, but not now that the Web has grown so large.  That's why search engines are now moving from the strategy of swallowing everything they see to various selection techniques.

Ideally, this selection should aim at improving the quality of the database by discarding junk and scanning only the premier web content.  In reality, of course, this kind of discernment is impossible because there are no automatic programs smart enough to separate wheat from chaff.  The only way to sort out anything is by placing some rather arbitrary restrictions.

One search engine that admits to "sampled spidering" is Alta Vista.  It's been claimed that the quota for Alta Vista's spider is no more than 600 documents per single domain.  If true, this means that large domains such as geocities.com or even microsoft.com are severely underrepresented in Alta Vista's database.  It remains open to speculation whether other search engines employ similar sampling techniques or whether the size of their databases is limited only by their technical capacity.
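How such a quota might work is easy to sketch, although nothing is known here about how Alta Vista actually implements it.  The example below is purely hypothetical: the `sample_by_domain` function is invented for illustration, it treats the host name of a URL as its "domain," and it uses a quota of 2 rather than 600 to keep the sample short.

```python
from collections import Counter
from urllib.parse import urlsplit


def sample_by_domain(urls, quota):
    """Keep at most `quota` URLs per host name, in order of discovery."""
    taken = Counter()
    kept = []
    for url in urls:
        # Assumption: the host name stands in for the "domain".
        host = urlsplit(url).hostname or ""
        if taken[host] < quota:
            taken[host] += 1
            kept.append(url)
    return kept


discovered = [
    "http://www.geocities.com/page1.html",   # hypothetical paths
    "http://www.geocities.com/page2.html",
    "http://www.geocities.com/page3.html",
    "http://example.org/",
]
print(sample_by_domain(discovered, quota=2))
```

With a rule like this, a small site is indexed in full while a huge one contributes only its first few discovered pages, which is exactly the underrepresentation described above.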

All search engines allow users to add their URLs to the database for spidering.  Some of them retrieve submitted documents immediately, others schedule them for future scanning, but in any case this at least lets you make sure that your domain isn't missed.  You're supposed to submit only the root URL of your site; using this mechanism to register each and every single page has been condemned as a sort of "spamming."  On the other hand, given the selective nature of spidering, it's not a bad idea to register at least all key pages of your site.  (Be careful, however: some search engines limit the number of submissions per domain.)

Another important question is how often spiders update their databases by revisiting sites they've already indexed.  This parameter varies significantly between engines, with quoted update periods ranging from one week to several months.  This aspect of search engines' performance allows some independent estimation: You can analyze your server's access logs to see when you were visited by spiders and what documents they requested.  A helpful Perl script for this purpose, called BotWatch, is available at http://www.tardis.ed.ac.uk/~sxw/robots/botwatch.html.
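The idea behind this kind of log analysis can be sketched as follows.  This is not BotWatch and makes no claim about how that script works; it is a rough Python illustration that assumes your server writes the "combined" log format (with the user agent as the last quoted field) and that spiders identify themselves with words such as "spider" or "bot" in that field.  The `spider_visits` function, the `SPIDER_HINTS` list, and the `ExampleSpider` agent in the sample log are all made up for the example.

```python
import re
from collections import defaultdict

# One line of the combined log format:
# host ident user [time] "METHOD path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Assumed substrings that mark a user agent as a spider.
SPIDER_HINTS = ("spider", "crawler", "robot", "bot")


def spider_visits(lines):
    """Map each spider-like user agent to its (time, path) requests."""
    visits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not in the assumed format
        agent = match.group("agent")
        if any(hint in agent.lower() for hint in SPIDER_HINTS):
            visits[agent].append((match.group("time"), match.group("path")))
    return dict(visits)


sample_log = [
    '10.0.0.1 - - [29/Sep/1997:10:00:00 +0000] "GET /robots.txt HTTP/1.0" '
    '200 68 "-" "ExampleSpider/1.0"',
    '10.0.0.2 - - [29/Sep/1997:10:05:00 +0000] "GET /index.html HTTP/1.0" '
    '200 5120 "-" "Mozilla/4.0"',
    '10.0.0.1 - - [29/Sep/1997:10:00:02 +0000] "GET /index.html HTTP/1.0" '
    '200 5120 "-" "ExampleSpider/1.0"',
]

for agent, requests in spider_visits(sample_log).items():
    print(agent, requests)
```

Comparing the timestamps of successive visits by the same agent gives a rough estimate of that engine's update period for your site.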

Many search engines have problems with sites in languages other than English, especially if these languages use character sets different from ISO 8859-1 (see Chapter 41, "Internationalizing Your HTML").  For example, HotBot returns nothing when queried with keywords in Russian, nor can it properly display summaries for documents in Russian.  This makes it useless for Russian surfers, despite the fact that HotBot's spider routinely scans a good share of all web sites in Russia.
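The character set problem itself is easy to demonstrate, although the sketch below says nothing about what actually goes wrong inside HotBot.  It assumes a Russian page stored in KOI8-R, one of several encodings used for Russian, and shows what a program sees if it reads the same bytes as ISO 8859-1.  The `decode_page` helper is a hypothetical name used only here.

```python
def decode_page(raw_bytes, charset):
    """Interpret the bytes of a document using the given character set."""
    return raw_bytes.decode(charset)


word = "поиск"                  # Russian for "search"
raw = word.encode("koi8_r")     # the bytes as stored on the server

# Read with the right character set, the word survives intact.
print(decode_page(raw, "koi8_r") == word)

# Read as ISO 8859-1, the same bytes turn into unrelated Latin characters,
# so a query for the Russian word can no longer match the indexed text.
misread = decode_page(raw, "iso8859_1")
print(misread == word)
```

Both the index and the query interface have to agree on the character set; if either one assumes ISO 8859-1, keywords and summaries come out garbled.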

 

Produced by Dmitry Kirsanov
Copyright Sams.net Publishing and



Created: Sept. 29, 1997
Revised: Sept. 29, 1997

URL: http://www.webreference.com/dlab/books/html-pre/43-1-1.html