Search Engines / How | WebReference

II. How Software Agents and Search Engines Work

There are at least three elements to search engines that I think are important: information discovery & the database, the user search, and the presentation and ranking of results.

Discovery and Database

A search engine finds information for its database by accepting listings sent in by authors wanting exposure, or by gathering the information with its "Web crawlers," "spiders," or "robots": programs that roam the Internet storing links to and information about each page they visit. Web crawler programs are a subset of "software agents," programs with an unusual degree of autonomy that perform tasks for the user. How do these really work? Do they go across the Net by IP number, one by one? Do they store all or most of everything on the Web?

According to The WWW Robot Page, these agents normally start with a historical list of links, such as server lists and lists of the most popular or best sites, and follow the links on those pages to find more links to add to the database. This makes most engines, without a doubt, biased toward more popular sites. A Web crawler could send back just the title and URL of each page it visits, parse only certain HTML tags, or send back the entire text of each page. Alta Vista is clearly hell-bent on indexing anything and everything, with over 30 million pages indexed (7/96). Excite actually claims more pages. OpenText, on the other hand, indexes the full text of fewer than a million pages (5/96), but stores many more URLs. Inktomi has implemented HotBot as a distributed computing solution, which they claim can grow with the Web and index it in its entirety no matter how many users or pages there are. By the way, in case you are worried about software agents taking over the world, or your Web site, look over the Robot Attack Page. Normally, "good" robots can be excluded with a bit of Exclusion Standard code on your site.
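The Exclusion Standard code mentioned above is a plain-text file named robots.txt at the root of your site; the directory names below are invented examples:

```
# robots.txt -- served from http://www.yoursite.com/robots.txt
User-agent: *          # these rules apply to all compliant robots
Disallow: /private/    # keep crawlers out of this directory
Disallow: /drafts/     # and this one
```

A well-behaved robot fetches this file before crawling and skips any URL whose path begins with a Disallow entry; badly behaved robots are free to ignore it.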

It seems unfair, but developers aren't rewarded much by search services for sending in the URLs of their pages for indexing. The typical time from submitting your URL to its appearance in the database seems to be 6-8 weeks. Not only that, but a submission for one of my sites expired very rapidly, no longer appearing in searches after a month or two, apparently because I didn't update it often enough. Most search engines check their databases to see whether URLs still exist and whether they have been updated recently.

User Search

What can the user do besides typing a few relevant words into the search form? Can they specify that words must be in the title of a page? What about specifying that words must be in a URL, or perhaps in a particular HTML tag? Can they use logical operators such as AND, OR, and NOT between words?

Query Syntax Checklist

How does your engine handle:

Truncation, Pluralization & Capitalization:
Macintosh, Mac, Macintoshes, Macs, macintosh, macintoshes, mac, and macs could all yield different results. Most engines interpret lower case as unspecified, while upper case matches only upper case, though there are exceptions. There is no standard at all for truncation; worse yet, the behavior probably differs between the basic and advanced search modes of every engine.
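The lower-case-as-unspecified rule can be sketched as a small matching function; this is a simplified model of the common convention, not any particular engine's actual behavior:

```python
def term_matches(query_term: str, word: str) -> bool:
    """Model the common rule: an all-lowercase query term matches any
    capitalization, while a term containing upper case matches exactly."""
    if query_term.islower():
        return query_term == word.lower()
    return query_term == word

# "mac" is unspecified, so it matches Mac, MAC, and mac alike...
print(term_matches("mac", "Mac"))   # True
# ...but "Mac" matches only "Mac", never "mac" or "MAC".
print(term_matches("Mac", "mac"))   # False
```

An engine with exceptions to this rule would simply swap in a different predicate here, which is exactly why checking each engine's help file matters.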

Multiple Words
Does the engine logically AND them or OR them?

Typically one puts quotes around a phrase so that each word in the phrase is not searched for separately.

. . . Check with your engine's help file before starting a search.

Most engines allow you to type in a few words, then search for occurrences of those words in their database. Each engine has its own way of deciding what to do about approximate spellings, plural variations, and truncation. If you just type words into the "basic search" interface on the search engine's main page, you can also get different logical expressions binding the words together. Excite actually uses a kind of "fuzzy" logic, searching for the AND of multiple words as well as the OR of the words. Most engines have separate advanced search forms where you can be more specific and form complex Boolean searches (every one mentioned in this article except HotBot). Some search tools parse HTML tags, allowing you to look for words specifically in links, titles, or URLs without considering the rest of the text on the page.
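The AND/OR and phrase behaviors above can be modeled in a few lines; this is an illustration of the logic, not any engine's real query parser:

```python
def matches(document: str, terms: list[str], mode: str = "AND") -> bool:
    """Check whether a document contains the query terms.
    mode="AND" requires every term; mode="OR" requires at least one.
    A quoted phrase is kept as one term and matched verbatim."""
    text = document.lower()
    found = [term.strip('"').lower() in text for term in terms]
    return all(found) if mode == "AND" else any(found)

doc = "Reviews of mountain bikes and road bicycles"
print(matches(doc, ["mountain", "road"], mode="AND"))  # True: both appear
print(matches(doc, ['"mountain bikes"']))              # True: phrase intact
print(matches(doc, ["unicycle", "road"], mode="OR"))   # True: one suffices
```

Excite's "fuzzy" behavior amounts to scoring the AND matches above the OR-only matches rather than choosing one mode outright.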

By searching only in titles, one can eliminate pages with only brief mentions of a concept, and only retrieve pages that really focus on your concept.

By searching links, one can determine how many and which pages point at your site. Understanding what each engine does with non-standard pluralization, truncation, etc. can be quite important to how successful your searches will be. For example, if you search for "bikes" you won't get "bicycle," "bicycles," or "bike." In this case, I would use a search engine that allowed "truncation," that is, one that allowed the search word "bike" to match "bikes" as well, and I would search for "bicycle OR bike OR cycle" ("bicycle* OR bike* OR cycle*" in Alta Vista).
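Truncation of the kind Alta Vista's trailing * provides amounts to a prefix match; this sketch models that behavior and is not Alta Vista's actual implementation:

```python
def matches_truncated(pattern: str, word: str) -> bool:
    """Match a query term against a word, treating a trailing '*'
    as truncation: 'bike*' matches 'bike', 'bikes', 'biker', etc."""
    if pattern.endswith("*"):
        return word.lower().startswith(pattern[:-1].lower())
    return word.lower() == pattern.lower()

for word in ["bike", "bikes", "bicycle"]:
    print(word, matches_truncated("bike*", word))
# bike and bikes match; bicycle does not -- hence OR-ing in bicycle* as well
```

This is why the example query above still needs "bicycle*" as a separate alternative: no amount of truncation on "bike" will reach it.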

Presentation & Ranking

With databases that can keep the entire Web at the fingertips of the search engines, there will always be relevant pages, but how do you get rid of the less relevant and emphasize the more relevant?

Most engines find more sites from a typical search query than you could ever wade through. Search engines give each document they find a measure of how well it matches your query: a relevance score. Relevance scores reflect the number of times a search term appears, whether it appears in the title, whether it appears at the beginning of the document, and whether all the search terms are near each other; some details are given in engine help pages. Some engines let the user influence the relevance score by giving different weights to each search word.

One thing that all engines do, however, is use alphabetical order at some point in their display algorithm. If relevance scores are not very different for various matches, you end up with this sorry default. Zeb's [Whatever] page will never fare very well in this case, regardless of the quality of its content.

For most uses, a good summary is more useful than a ranking. The summary is usually composed of the title of a document and some text from its beginning, but it can include an author-specified summary given in a meta tag. Scanning summaries really saves you time if your search returns more than a few items.
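The ranking factors above can be combined into a toy relevance score; the weights and documents here are invented for illustration and bear no relation to any real engine's formula:

```python
def relevance(doc_title: str, doc_text: str, terms: list[str]) -> float:
    """Toy relevance score: count term occurrences, with bonuses for
    terms appearing in the title or near the start of the document."""
    text = doc_text.lower()
    score = 0.0
    for term in terms:
        t = term.lower()
        score += text.count(t)        # raw frequency
        if t in doc_title.lower():
            score += 5                # title bonus
        if t in text[:200]:
            score += 2                # early-appearance bonus
    return score

docs = [("Bicycle Repair Guide", "How to repair a bicycle at home..."),
        ("Zeb's Page", "I once rode a bicycle.")]
ranked = sorted(docs, key=lambda d: relevance(d[0], d[1], ["bicycle"]),
                reverse=True)
print([title for title, _ in ranked])  # the focused page ranks first
```

When two documents tie under a formula like this, something arbitrary such as alphabetical order has to break the tie, which is where Zeb loses out.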

Get More Hits By Understanding Search Engines

Knowing just the little bit above can give you ideas of how to give your page more exposure.

Hustle for Links
Most software agents find your site through links from other pages. Even if you have sent in your URL, your site can stay indexed longer and rank higher in search results if many links lead to it. One of my sites, which wouldn't show up in even the most casual search, got most of its hits from links on other sites. Links can be crucial in achieving good exposure.

Use Titles Early In the Alphabet
All engines that I used displayed results with equal scores in alphabetical order.

Submit Your URL to Multi-Database Pages
It is best to use a multiple-database submission service such as SubmitIt! to save you the time of contacting each search service separately. Remember, it takes 6-8 weeks to become indexed.

Control Your Page's Summary
You can use the meta tag name="description" to stand out in search results. Appear in search summaries as "Experienced Web service, competitive prices" not "Hello and welcome. This page is about."
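In HTML, the description meta tag (and the related keywords tag discussed later in this article) goes in the head of your page; the content strings below are invented examples:

```html
<head>
<title>Acme Web Design</title>
<!-- Shown in search summaries instead of the page's opening text -->
<meta name="description"
      content="Experienced Web service, competitive prices.">
<!-- An invisible keyword list for the engines that index it -->
<meta name="keywords"
      content="web design, web service, HTML, competitive prices">
</head>
```

Engines that honor the description tag will display its content in the results list; those that don't fall back to the first lines of body text, which is why those lines should also pull their weight.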

Search Reverse Engineering
Simulate your audience's search for your page (have all your friends list all the searches they might try), then see what you need to do to come up first on their search engine's results list.
  1. Use the meta-tag name="keywords" to put an invisible keyword list at the beginning of your document that would match keywords your audience would use. Most search engines rate your page higher if keywords appear near the beginning.

  2. How many times do the keywords appear in the text? Good writing usually means not repeating the same words over and over; search engines, however, reward exactly that, usually rating your page higher for repetitions of keywords, inane or not. Some authors exploit this by putting yet more keywords at the bottom of their pages in invisible text. Look at the source code for this article and you'll see what I mean; the words are simply set in the same color as the background.
"Spamming" is net-lingo for spreading a lot of junk everywhere; keyword spamming is putting hidden keywords a huge number of times in your document just so yours will be rated higher by search engines.

  1. Search engines typically limit you to 25 keywords or fewer, and one I know of truncates your list when it sees an unreasonable number of repetitions.
  2. Invisible text at the end of your pages leaves blank space there, which looks bad and slows loading. Services that rate pages will enjoy marking you down for this.

    Responsible Keyword Use: If an important keyword doesn't appear at least four times in your document, I hereby give you the right to add invisible text until it appears a maximum of five times.

Comments are welcome

Revised: May 20, 1998