Search Engine Basics v.1.0
As a web developer, it's important to know how search engines work. While the details are rather complex, we'll look into the basics of crawler-based search engines (this involves a certain amount of speculation as the exact calculations are a closely guarded secret).
The index of a crawler-based search engine is built by robots (also called spiders or web crawlers) that operate on a fixed set of instructions. The robot selects a page to visit from a list of links (a "queue") gathered from previously visited web pages. It fetches the page, collects certain information (such as visible text, meta tags, and links) and sends it to an indexing program. The information is entered into a database, ready for search queries; the newly gathered links are then added to the queue for a future visit, and the process begins again.
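The loop described above can be sketched in a few lines. This is a simulation only: the `web` dictionary stands in for the real internet, and the "indexer" is just a dictionary, but the queue-driven visit/collect/enqueue cycle is the same.

```python
from collections import deque

def crawl(seed_urls, web):
    """Breadth-first crawl over a simulated web.

    `web` maps each URL to (visible_text, outgoing_links); a real
    crawler would fetch pages over HTTP instead.
    """
    queue = deque(seed_urls)   # links waiting to be visited
    index = {}                 # URL -> collected text (the "database")
    seen = set(seed_urls)
    while queue:
        url = queue.popleft()  # pick the next page from the queue
        if url not in web:     # unreachable page: skip it
            continue
        text, links = web[url]
        index[url] = text      # hand the content to the indexing program
        for link in links:     # newly gathered links join the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# A tiny three-page "web" (all names are made up for illustration).
web = {
    "a.html": ("delta blues history", ["b.html"]),
    "b.html": ("blues records", ["a.html", "c.html"]),
    "c.html": ("song lyrics", []),
}
index = crawl(["a.html"], web)
```

Starting from only `a.html`, the crawler discovers and indexes all three pages by following the links it gathers along the way.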
Every major search engine uses link analysis as a part of its ranking algorithm, according to Danny Sullivan, editor of Search Engine Watch. It differs from link popularity in that links are given a "weight" (rank of importance) determined by a preset calculation, whereas in link popularity a web page's importance is ranked according to how many hyperlinks are pointing to that page, regardless of where they came from. According to Google: "In essence, Google interprets a link from page A to page B as a vote, by page A, for page B, but Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves 'important' weigh more heavily and help to make other pages 'important.' Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search.... Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to decide if it's a good match for your query" (Google Technology).
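The "votes weighted by the voter's own importance" idea from the Google quote can be made concrete with the classic power-iteration method from the original PageRank paper. This is a simplified sketch on a three-page graph, not Google's actual production calculation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a small link graph.

    `links` maps each page to the pages it links to.  Each page splits
    its current score evenly among its outgoing links, so a "vote" from
    an important page carries more weight than one from an obscure page.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if not outgoing:          # dangling page: share score evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outgoing:
                    new[q] += damping * rank[p] / len(outgoing)
        rank = new
    return rank

# b receives a full vote from a plus half a vote from c, so it ends up
# with the highest rank of the three.
graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
ranks = pagerank(graph)
```

Note that the scores always sum to 1: PageRank redistributes a fixed amount of "importance" around the graph rather than creating more of it.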
Note: An interesting discussion on the anatomy of search engines can be found in the original PageRank paper by Google's founders Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine." Although somewhat long, it makes for good reading.
Teoma, owned by Ask Jeeves, makes use of what it calls "Subject-Specific Popularity." This technology, according to Teoma, "ranks a site based on the number of same-subject pages that reference it, not just general popularity." Teoma's process allows for a fine-tuned search using the authority of the link as a part of its relevance. Web sites are grouped into "communities" that have the same topic. Searches are then further refined within the communities, using Subject-Specific Popularity.
Outgoing links are not used in the algorithm, for good reason: the web developer controls them. If outgoing links were counted, a developer would only need to link to the most popular sites on the web to boost their own site's position in the listings.
Topic Sensitive PageRank
Another method involved in ranking web sites is called "Topic Sensitive PageRank" (TSPR). It's an enhancement to Google's PageRank. Instead of ranking the pages solely based on all incoming links, TSPR gives weight to links that relate to the page's main subject area, like Teoma. Links from sites not directly related to the page's subject matter (topic), are assigned a lesser degree of weight in the calculation. A similar process is called "Hilltop."
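The core intuition behind both Teoma's Subject-Specific Popularity and TSPR — that a link from a same-topic page should count for more than a link from an unrelated one — can be illustrated with a toy scoring function. To be clear, this is not the actual TSPR algorithm (which biases PageRank's random-jump vector toward topic pages); the page names and weight values are invented for illustration.

```python
def topic_weighted_score(page, incoming, topics,
                         same_topic_weight=1.0, off_topic_weight=0.25):
    """Score a page by its incoming links, discounting off-topic ones.

    `incoming` lists the pages linking to `page`; `topics` maps each
    page to its subject.  The weights are illustrative only.
    """
    topic = topics[page]
    return sum(same_topic_weight if topics[src] == topic else off_topic_weight
               for src in incoming)

# Two same-topic links and one off-topic link pointing at "target".
topics = {"blues-fan": "blues", "jazz-blog": "jazz",
          "delta-site": "blues", "target": "blues"}
score = topic_weighted_score("target",
                             ["blues-fan", "jazz-blog", "delta-site"],
                             topics)
```

Here the two blues-related links contribute a full point each while the jazz link contributes only a quarter point, mirroring the "lesser degree of weight" assigned to off-topic links.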
There's much discussion about what method Google uses for site ranking. Topics include the use of TSPR and something called "block ranking," which basically groups internal links and uses that as a starting point for the original PageRank algorithm (Google PageRank Calculations to Get Faster?).
Keywords on a web site are just that: key words. Nothing magical. They are the words that potential visitors might enter into a search engine that could lead them to your site.
For example, to find a particular music site, I might enter the search terms, "blues music." On Google, that returns about 796,000 links. But let's say that I'm interested in blues music from the Mississippi Delta. Then I would add the word "delta" to my search: "blues music", delta (the quotes keep the words together). That narrows it down to 30,500 links. That's still a lot of links but many of them probably aren't relevant to what I want and it's a little more manageable than 796,000. Of course, the search could be further refined.
The search engine returns these particular pages because they have the words blues music and delta on them (or in the anchor text of links pointing to them, see below); those are the keywords, in this case.
If you have a web site about blues music then you need to add those words somewhere on your web pages so they can be found by people who want to know more about blues music in the delta. You might also be able to have them included in the anchor links pointing to your site. But do you just add the keywords at random on the page or maybe within a comment tag?
It doesn't quite work that way. Search engine algorithms have become very sophisticated and would pick that up right away. Instead, you should have content that naturally contains the keywords: articles about the delta blues, albums and songs with the words in their titles, song lyrics containing the words. However, you don't want the page to consist of little more than those keywords or be too heavily weighted toward them. One often-quoted guideline holds that no more than 2% of the words on a page should be targeted keywords; a more general rule of thumb is 2%-8%. This measure is called "keyword density" and refers to the percentage of targeted keywords within the total number of indexable words on a web page.
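Keyword density, as defined above, is simple to compute. A minimal sketch (the tokenization here is deliberately crude; real tools differ in what they count as an "indexable word"):

```python
import re

def keyword_density(text, keywords):
    """Percentage of indexable words that are targeted keywords."""
    words = re.findall(r"[a-z']+", text.lower())   # crude word splitter
    targeted = {kw.lower() for kw in keywords}
    hits = sum(1 for w in words if w in targeted)
    return 100.0 * hits / len(words) if words else 0.0

# 15 words total, 4 of them targeted keywords -> about 26.7% density,
# which would be far too high under the rule of thumb above.
text = ("Delta blues grew up along the Mississippi River. "
        "Early delta blues records shaped modern music.")
density = keyword_density(text, ["delta", "blues"])
```

In practice you would run this over a page's full visible text, where a couple of keyword mentions per paragraph lands comfortably inside the 2%-8% range.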
Generally, keywords are put in the title and meta tags. They should also have prominence within the page, i.e., appear near the beginning of the web page and at or near the beginning of a paragraph or sentence.
There are places on the web to help determine good keywords. Some internet marketing specialists provide special copywriting services just for keyword placement, but that may be going a bit overboard.
The visible text used in a hyperlink is called "anchor text". For instance, "delta blues music" is the anchor text in the following statement: "Get all your <a href="http://www.bluesmusic.com/">delta blues music</a> here." It's used to highlight the underlying link. Search engines use this text to enhance the relevance of the link in a related search request. The relevancy of the link to your overall site increases its weight, or value, with the search engine.
The page targeted by the link is enhanced, not the page the link is on. The anchor text will only help the current page if any keywords appear in it, as in our example above.
Anchor text works best when used within the context of the web page. It's important to make sure the links actually say something, as in our example above. That's much better than using, "Get all your delta blues music <a href="http://www.bluesmusic.com/">here</a>."
Don't try to get too clever with anchor text. Search engines compare the relevancy of your content to the links within it, so the links need to make sense both to the search engine and to your visitors. There's not much point in having a site at the top of the search engines if there's nothing for your visitors when they arrive and no reason for them to return. In fact, that type of site won't last long at the top of the rankings, if it ever makes it there in the first place.
Anchor text can be very important. Your search engine rankings can be increased even if the anchor text used by another page does not appear on your page. Doing a Google search for "miserable failure" will list three web pages at the top (not counting the paid ad) that don't have those words anywhere on their page. In this case, it's known as Google bombing.
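The (href, anchor text) pairing that a search engine extracts from a page can be sketched with Python's standard-library HTML parser. This is just an illustration of the extraction step, not how any particular engine actually parses pages.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs, roughly as an indexer might."""

    def __init__(self):
        super().__init__()
        self.anchors = []     # finished (href, text) pairs
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text fragments seen inside it

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:   # only collect text inside a link
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorTextParser()
parser.feed('Get all your <a href="http://www.bluesmusic.com/">delta blues '
            'music</a> here.')
```

Fed the example sentence from this section, the parser pairs the target URL with the anchor text "delta blues music" — exactly the association a search engine can use when ranking the target page.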
Content, Content, Content
As I said, the exact calculations used by search engines, and the manner in which they're applied, are closely guarded secrets. They are the very foundation of what makes a particular search engine different from all the rest. Opinions as to which methods the search engine companies use vary even among the leading experts in the field.
Providing high caliber content is one of the most important things you can do to increase your site's search engine ranking. Doing so will keep your visitors coming back and they will recommend your site to others. Eventually, the more important, heavier-weighted links will point to your site, creating important incoming links. Google puts it very simply, "make pages for users, not for search engines." According to Danny Sullivan: "do the basic, simple things that have historically helped with search engines. Have good titles. Have good content. Build good links. Don't try to highly-engineer pages that you think will please a search engine's algorithm. Focus instead on building the best site you can for your visitors, offering content that goes beyond just selling but which also offers information, and I feel you should succeed."
- PageRank Explained Correctly with Examples
- Analysis and Implications of Hilltop Algorithm
- PageRank Explained: Google's PageRank and how to make the most of it
- Topic Sensitive PageRank
- Hilltop: A Search Engine based on Expert Documents
- Yahoo Keyword Density Analysis Comparison to Google
- Live Keyword Analysis
- Keyword Counter - Keyword Frequency Analyzer
- Keyword Suggestion Tool
- How to Use Anchor Text in Backlinks
- Designing for Search Engines and Stars
- When Search Engines Become Answer Engines
- Those Dark Hiding Places: The Invisible Web Revealed
- Search Engine Relationship Chart
- Google Information for Webmasters
- Google Zeitgeist - Search patterns, trends, and surprises according to Google
- "Invisible Web" Revealed
Created: September 7, 2004
Revised: September 7, 2004