HTML Unleashed PRE. Strategies for Indexing and Search Engines : How to Design for Search Engines | WebReference


HTML Unleashed PRE: Strategies for Indexing and Search Engines

How to Design for Search Engines


If you've read the previous chapter, you might have noticed the similarity between search spiders and people with disabilities: both can access only the text content of web pages.  Therefore, most of the HTML authoring recommendations from the chapter on disabilities apply to search-friendly design as well.

Providing text-only alternatives for every piece of information on your page is an obvious requirement because spiders scan only plain text (although, unfortunately, not all of them index the alt text of images).  Making your content fully comprehensible in a text-only form may be difficult (it's like trying to persuade somebody by letter rather than in person, without the powerful "multimedia" of gestures and facial expressions), but it is rewarding in the long run.
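
As a minimal sketch of such a text alternative, an image can state its message in an alt attribute, so that text-only browsers (and those spiders that do index alt text) still receive it.  The file name and wording below are hypothetical:

```html
<!-- Hypothetical image; the alt text states what the picture conveys -->
<IMG src="site-map.gif"
     alt="Site map: Home, Tutorials, Reference, Contact">
```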

Preserving the logical flow of the text, rather than sacrificing it for the sake of layout tricks, is also very important.  This improves the chances that spiders will extract a good summary of your document, and makes the text more suitable for automatic processing or categorizing.

Similarly, logical markup is an important requirement if you care about someone being able to use your document in any way, not only to read it in a graphic browser.  Besides the spiders of the major search engines, a great number of various robots and indexers wander along the roads of the Web, and many of them rely on the logical tags such as H1 for figuring out the structure of your data.
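
The difference can be illustrated with a short sketch.  Both fragments below may look similar in a graphical browser, but only the second tells a robot that the text is a heading; the heading text is simply borrowed from this chapter for illustration:

```html
<!-- Presentational markup: looks like a heading, but carries no structure -->
<P><FONT size="6"><B>Keyword Strategies</B></FONT></P>

<!-- Logical markup: a robot can recognize this as a top-level heading -->
<H1>Keyword Strategies</H1>
```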


Keyword Strategies


All searches on the Web are done via keywords, so probably the most important requirement is to make sure that your documents contain all the keywords that are likely to be used to find them.  Two distinct strategies can be outlined in this respect.

  1. The first idea that comes to mind is simple: The more keywords you cram into a page, the better.  Indeed, you can never predict what particular keywords will come to users' minds, so it's always a good idea to think about all possible synonyms, variants, generic inclusive terms, subterms, and related concepts for all the main subjects of your discourse.

    Besides, remember that keywords can be entered in a different grammatical form, such as plural instead of singular for nouns.  Among the major search engines, only Alta Vista provides the "wildcard" notation that lets you look for "table" or "tables" by specifying "table*".  So, you'd better see to it yourself by including both forms in your document.  (This problem is especially serious for languages other than English; for example, a verb in Russian may have up to 235 distinct forms.  Therefore, most Russian search engines, such as Aport mentioned earlier, by default employ word inflection algorithms that automatically match all word forms.)

    Finally, if your main keyword is a relatively common word (such as "search"), it is likely that experienced search users will employ the phrase search feature to query for word combinations (such as "search engines") rather than single words.  Therefore, make sure that your document contains the most common collocations of the main keyword with closely related nouns, adjectives, verbs, and so on.

  2. However, the opposite of the "maximum keyword coverage" strategy described above is also worth considering.  Remember that one of the factors in results ranking, as implemented by major search engines, is frequency, which is computed as the number of keyword occurrences divided by the document size.

    One consequence of this is that if two documents contain the same keyword (located at the same distance from the top of the document), the smaller one will be ranked higher.  This gives a clue: Select one of the root (introductory) pages on your site and try to make it as compact and concise as possible, so that it presents just the essence of your content with only the most common keywords.  This page will receive a boost with respect to searches for these keywords, thereby attracting more hits to the entire site.
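
To make the frequency argument concrete, here is a worked example under the simplified formula stated above.  The page sizes and counts are hypothetical, measuring document size in words is an assumption, and real engines combine frequency with other factors, so treat this only as an illustration:

```latex
\[
  \text{frequency} =
  \frac{\text{number of keyword occurrences}}{\text{document size (in words, assumed)}}
\]
\[
  \text{Page A: } \frac{5}{500} = 0.01
  \qquad
  \text{Page B: } \frac{5}{2000} = 0.0025
\]
```

Both hypothetical pages mention the keyword five times, but the compact Page A scores four times higher on this one factor.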

Thus, the best you can do is to combine these two approaches by setting up both sorts of pages on your site: those with maximum keyword coverage and those with maximum relevance with respect to the main keywords.

By the way, these two keyword strategies correspond to the two types of search queries, specific and general.  Some search engine users are after very specific information; they use rare keywords, phrase searches, and various advanced features such as Boolean operators.  It's these "power users" that your keyword-rich pages should appeal to.

Other users, however, just need to find a good resource covering some fairly general topic; they enter a couple of simple keywords, get an avalanche of results, and browse the first several links found.  For such general searches, web directories (such as Yahoo) usually perform better than search engines; however, a lot of users still employ search engines for the task.  The relevance boosting technique described above may be useful in attracting such users to your site.

You might be interested to see which keywords are entered most frequently by search engine users, in order to better align your keyword spectrum with public preferences.  Unfortunately, this information (which would be immensely interesting from other viewpoints as well) is considered top secret by major search engines: they never reveal their "top ten search words" lists, out of a (rather well-grounded) fear of spamming.

WebCrawler goes only as far as letting you peek at the flow of search queries in real time, as they're entered on the search page.  However, minor search engines are usually less obsessed with confidentiality, and some of them show their search statistics (for example, a Russian search engine called Rambler presents its list of top 100 search words).

The final piece of advice concerning keywords is rather obvious: Always check your spelling.  Spiders, in contrast to human readers, cannot "overlook" spelling errors, and you risk missing a good share of your potential audience by misspelling some important keyword.  This is especially relevant because in most cases you add your keywords to a META tag after the document itself is written, edited, and probably spell-checked.


The META tag


Getting back to HTML, you may wonder what the syntax is for adding keywords to a document.  Of course, the text of a page is the primary source of searchable material, but you may also need to add certain keywords without altering the page content.  (Changing text color to make keywords invisible in the body of a document is a really ugly trick; please never resort to it!)

The META tag serves this purpose (as well as several others).  "Meta" is a Greek prefix usually rendered as "after" or "beyond," and the META tag was intended to carry all sorts of meta-information, that is, information about (or "beyond") information.  You should understand that using META for specifying keywords is not an HTML convention, but only one of the widely accepted uses of the tag.

A META tag usually takes the following form:

  <META name="..." content="...">

As you see, the names of the META tag attributes are rather generic, which allows the tag to express virtually any information that may be represented as a name-value pair.  For example, you could use META tags to supply information about yourself (name="author"), the program you used to create the HTML file (name="generator"), and so on.

Here's how the META tag is used for introducing your document to search engines:

  <META name="keywords"
     content="searching, search engines, keywords, HTML">
  <META name="description"
     content="A description of major web search engines, spiders,
              and search-friendly HTML authoring">

These tags should be placed within the HEAD element.  Keywords and phrases in the content of the keywords tag can be separated by commas for better readability, although spiders usually ignore the separators.  The maximum number of keywords depends on the search engine in question; for some of them, 25 has been quoted as the upper limit.

Hopefully, the keywords thus specified will be added to the searchable representation of the document in the engine's database, and the description will be stored as the summary to be displayed for the document in a list of results (in the absence of a description, most search engines will take the first lines of text on the page).
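
For context, here is a sketch of where these tags sit in a complete document.  The keywords and description are the ones used above; the TITLE, heading, and body text are hypothetical placeholders:

```html
<HTML>
<HEAD>
  <!-- Hypothetical title -->
  <TITLE>Designing Web Pages for Search Engines</TITLE>
  <!-- Keywords and description as in the example above -->
  <META name="keywords"
        content="searching, search engines, keywords, HTML">
  <META name="description"
        content="A description of major web search engines, spiders, and search-friendly HTML authoring">
</HEAD>
<BODY>
  <H1>Designing Web Pages for Search Engines</H1>
  <P>The visible text of the page goes here.</P>
</BODY>
</HTML>
```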

Another use of the META tag is for excluding a page from spiders' attention.  By adding the following tag,

  <META name="robots" content="noindex">

you instruct any spiders that run into your page to bypass it without indexing.

However, not all spiders support this convention.  A more reliable solution is to add a robots.txt file to the root directory of your web server, with a list of files that must be excluded from indexing.  For example, if your robots.txt contains these lines:

  User-agent: *
  Disallow: /dont_index_me.html
  Disallow: /hidden_dir/

then no robot that honors the file will scan the dont_index_me.html document or any document in /hidden_dir/.  For more information on robots exclusion, refer to




One feature of HTML 4.0, frames, deserves special attention with respect to search engine accessibility.  About half of the major search engines cannot penetrate framed sites.  For them, the root page of a frameset is all that can be viewed and indexed on the site, and all the framed pages below the root are missed.

The best solution for this problem (as well as for the problem of frames accessibility to people with disabilities; see Chapter 42, "Creating Widely Accessible Web Pages") is the NOFRAMES element.  It should be placed within the FRAMESET element, usually before the first FRAME tag, and may contain any text, links, or other material.  This is what search engines will see on the page and reflect in the database, while frame-capable browsers will ignore anything within a NOFRAMES element.

To make the rest of your content accessible, you should provide links to the framed pages from within the NOFRAMES element.  Remember that you're doing this not only for spiders but for the users of non-graphic browsers as well, so accompany the links with proper descriptions.  Usually, one of the frames contains a navigation bar with links to all other pages, so in the NOFRAMES element it may be sufficient to link to this document only.

For framed pages to be usable in the absence of a frame context, remember to provide them with their own TITLEs (this will improve their ranking in search engines as well).
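
The following is a minimal sketch of a frameset root page built along these lines.  The file names (nav.html, main.html), the title, and the column widths are hypothetical; the point is only the placement of NOFRAMES and the descriptive link to the navigation page:

```html
<HTML>
<HEAD>
  <TITLE>Search Engine Guide</TITLE>
</HEAD>
<FRAMESET cols="25%,75%">
  <!-- Seen by spiders and by browsers that do not support frames -->
  <NOFRAMES>
    <BODY>
      <H1>Search Engine Guide</H1>
      <P>This site uses frames.  All pages can be reached from the
         <A href="nav.html">navigation page</A>, which lists every
         document on the site with a short description.</P>
    </BODY>
  </NOFRAMES>
  <!-- Seen by frame-capable browsers -->
  <FRAME src="nav.html" name="nav">
  <FRAME src="main.html" name="main">
</FRAMESET>
</HTML>
```

Each framed page (here, nav.html and main.html) would carry its own TITLE in its own HEAD.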


The Meta Controversy


One of the major search engines, Excite, ignores any information in META tags, and does so on purpose.  What may be the rationale for such a decision?

The reason stated on Excite's "Getting Listed" page is that META tags can be used by spammers to improve their rankings in an unfair way.  For Excite, attempting to make a page appear any different to search spiders than to human users is an unfair practice.  Indeed, nobody can guarantee that the keywords you enter are those describing your content, and in principle, you can easily use popular keywords to inflate your hits without any improvement of the page content.

At first glance, this position may seem logical.  But is it?  Remember that I can easily put any number of "hot" keywords onto the page itself, and if I don't want to distract readers with this promotion machinery, I can make them invisible by painting them with the background color (as many spammers do already, simply because META tags don't allow them to enter too many keywords).  After all, spiders will always index what I want them to, and banning one of the weapons can only fuel the arms race.

Excite's policy is based on the assumption that each page has its intrinsic "value," and that this value is evident from reading the text on the page.  If this is true, then it's natural to require that spiders, to be able to assign a fair "relevance" value, get exactly the same text as human readers.  But here, it is also silently assumed that a spider can read, understand, and evaluate the text just as humans do.  This is where the main fallacy of this approach lies.

The main purpose of a META tag is to provide some information about the document, and the tag does it mostly for computers that cannot deduce this information from the document itself.  Keywords and description, for example, are supposed to present the main concepts and subjects of the text, and no computer program can yet compile a meaningful summary or list of keywords for a given document.  (In this connection, it's interesting to note that Excite is the only search engine to employ an artificial intelligence algorithm for compiling summaries based on the document text.)

It is true that the META mechanism is open to abuse, but so far it's the only technique capable of helping computers better understand human-produced documents.  We will have no choice but to rely on some sort of META information until computers achieve a level of intelligence comparable to that of human beings.

In view of this, it is interesting to discuss the latest development in the field of meta-information, the Meta Content Framework (MCF).  This is a language for describing meta-properties, connections, and interrelations of documents, sites, channels, subject categories, and other information objects.  MCF is being developed by Netscape and has been submitted as a draft standard to the W3 Consortium.

MCF may be useful for maintainers of closed information systems, such as intranets, corporate and scientific databases, and so on.  Its main promise, however, is the capability to build a meta-information network for the entire Web.  Unfortunately, given the controversial position of the rather primitive META tags of today, it is not very likely that the sophisticated tools of MCF, even if approved by W3C, will gain any widespread recognition.


Produced by Dmitry Kirsanov
and Publishing
Created: 09/19/97  /  Revised: 09/19/97