Search Engines / How
II. How Software Agents and Search Engines Work
There are at least three elements to search engines that I think are important: information discovery & the database, the user search, and the presentation and ranking of results.
Discovery and Database
A search engine finds information for its database by accepting listings sent in by authors wanting exposure, or by getting the information from their "Web crawlers," "spiders," or "robots," programs that roam the Internet storing links to and information about each page they visit. Web crawler programs are a subset of "software agents," programs with an unusual degree of autonomy which perform tasks for the user. How do these really work? Do they go across the net by IP number one by one? Do they store all or most of everything on the Web?
According to The WWW Robot Page, these agents normally start with a historical list of links, such as server lists, and lists of the most popular or best sites, and follow the links on these pages to find more links to add to the database. This makes most engines, without a doubt, biased toward more popular sites. A Web crawler could send back just the title and URL of each page it visits, or just parse some HTML tags, or it could send back the entire text of each page. Alta Vista is clearly hell-bent on indexing anything and everything, with over 30 million pages indexed (7/96). Excite actually claims more pages. OpenText, on the other hand, indexes the full text of less than a million pages (5/96), but stores many more URLs. Inktomi has implemented HotBot as a distributed computing solution, which they claim can grow with the Web and index it in entirety no matter how many users or how many pages are on the Web. By the way, in case you are worrying about software agents taking over the world, or your Web site, look over the Robot Attack Page. Normally, "good" robots can be excluded by a bit of Exclusion Standard code on your site.
It seems unfair, but developers aren't rewarded much by location services for sending in the URLs of their pages for indexing. The typical time from sending your URL in to getting it into the database seems to be 6-8 weeks. Not only that, but a submission for one of my sites expired very rapidly, no longer appearing in searches after a month or two, apparently because I didn't update it often enough. Most search engines check their databases to see if URLs still exist and to see if they are recently updated.
What can the user do besides typing a few relevant words into the search form? Can they specify that words must be in the title of a page? What about specifying that words must be in an URL, or perhaps in a special HTML tag? Can they use all logical operators between words like AND, OR, and NOT?
Query Syntax ChecklistHow does your engine handle:
Most engines allow you to type in a few words, and then search for occurrences of these words in their data base. Each one has their own way of deciding what to do about approximate spellings, plural variations, and truncation. If you just type words into the "basic search" interface you get from the search engine's main page, you also can get different logical expressions binding the different words together. Excite! actually uses a kind of "fuzzy" logic, searching for the AND of multiple words as well as the OR of the words. Most engines have separate advanced search forms where you can be more specific, and form complex Boolean searches (every one mentioned in this article except Hotbot). Some search tools parse HTML tags, allowing you to look for things specifically as links, or as a title or URL without consideration of the text on the page.
By searching only in titles, one can eliminate pages with only brief mentions of a concept, and only retrieve pages that really focus on your concept.
By searching links, one can determine how many and which pages point at your site. Understanding what each page does with the non-standard pluralization, truncation, etc. can be quite important in how successful your searches will be. For example, if you search for "bikes" you won't get "bicycle," "bicycles," or "bike." In this case, I would use a search engine that allowed "truncation," that is, one that allowed the search word "bike" to match "bikes" as well, and I would search for "bicycyle OR bike OR cycle" ("bicycle* OR bike* OR cycle*" in Alta Vista).
Presentation & Ranking
With databases that can keep the entire Web at the fingertips of the search engines, there will always be relevant pages, but how do you get rid of the less relevant and emphasize the more relevant?
Most engines find more sites from a typical search query than you could ever wade through. Search engines give each document they find some measure of the quality of the match to your search query, a relevance score. Relevance scores reflect the number of times a search term appears, if it appears in the title, if it appears at the beginning of the document, and if all the search terms are near each other; some details are given in engine help pages. Some engines allow the user to control the relevance score by giving different weights to each search word. One thing that all engines do, however, is to use alphabetical order at some point in their display algorithm. If relevance scores are not very different for various matches, then you end up with this sorry default. Zeb's [Whatever] page will never fare very well in this case, regardless of the quality of its content. For most uses, a good summary is more useful than a ranking. The summary is usually composed of the title of a document and some text from the beginning of the document, but can include an author-specified summary given in a meta-tag. Scanning summaries really saves you time if your search returns more than a few items.
Knowing just the little bit above can give you ideas of how to give your page more exposure. Responsible Keyword Use: If an important keyword doesn't appear at least four times in your document, I hereby give you the right to add invisible text until it appears a maximum of five times. Comments are welcome
Get More Hits By Understanding Search Engines
Knowing just the little bit above can give you ideas of how to give your page more exposure.
Responsible Keyword Use: If an important keyword doesn't appear at least four times in your document, I hereby give you the right to add invisible text until it appears a maximum of five times.
Comments are welcome
Revised: May 20, 1998