The Inner Workings of Robots, Spiders, and Web Crawlers | 2 | WebReference

Gaining Control Over Robots

As robots visit your Web site, they follow every link and visit every directory unless you tell them otherwise. There are a few methods you can use to gain some control over what and where robots search. As I said earlier, most robots will obey the robots.txt file and the robots meta tag. Let's take a look at what these are and how they work.

Robot Meta Tags

Meta tags are used for different things: listing the date the page was created, the author, keywords, a description of the page, and instructions for robots. The robots meta tag tells a robot whether it may index the current page and follow the links on it. It's useful when you don't have access to the robots.txt file. (Remember though, every Web page is potentially accessible.)

The meta tag is placed between the <head></head> tags. Its content attribute accepts several values: all, none, index, noindex, follow, nofollow. (Multiple values must be separated by commas.) The default, without the meta tag, is all, meaning the robot can index the current page and follow all the links on it. The none value means the robot is not to index the current page or follow any links on it. The index and noindex values tell the robot whether it can index the current page. The follow and nofollow values tell the robot whether it can follow the links on the current page. To keep the current page from being indexed but still allow the links to be followed, you would use:

<meta name="robots" content="noindex">

To stop the robot from indexing the current page and following the links, you would use:

<meta name="robots" content="noindex, nofollow">

It's important to remember that not all robots support this tag. Most search engines do, but if you are able to edit the robots.txt file, it is the better choice because it is the more widely supported mechanism.
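
To make the rules above concrete, here is a minimal Python sketch of how a well-behaved robot might read the robots meta tag. It is an illustration only, not how any particular search engine does it; the names RobotsMetaParser and robots_meta_policy are hypothetical, and the sketch assumes the tag is written as in the examples above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalize them.
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # Values are comma-separated; case and surrounding spaces are ignored here.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html):
    """Return (may_index, may_follow); with no tag the default is 'all'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ("noindex" in found or "none" in found)
    may_follow = not ("nofollow" in found or "none" in found)
    return may_index, may_follow
```

Calling robots_meta_policy() on a page containing the first example tag would give (False, True): the page is not to be indexed, but its links may still be followed.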

The Robots.txt File

The Robots Exclusion Protocol was created to limit robot access to Web sites. However, it's not a mandatory protocol. When a robot visits a Web site, it first looks for a file called "robots.txt" in the root directory, for example, http://www.yoursite.com/robots.txt. The file will not work in any other directory. It must also be a plain text (ASCII) file.
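
Because the file only works in the root directory, a robot can work out where to look from any page address on the site. The following Python sketch shows one way to do that with the standard urllib.parse module; the function name robots_txt_url is hypothetical and chosen only for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For a page such as http://www.yoursite.com/articles/page2.html, this returns http://www.yoursite.com/robots.txt, the same location given above.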

The format of the file is not too difficult to understand. Each entry or "record" in the file is separated by one or more blank lines. The first line of a record contains the field User-agent: followed by the name of the robot to be excluded or an asterisk ("*"), meaning all robots, for example:

User-agent: EmailSiphon

User-agent: *

Following that, on the next lines, come one or more Disallow: lines, each naming a directory that you don't want the robot to visit, for example:

Disallow: /cgi-bin/
Disallow: /javascript/

To block a robot from your entire site, list the root directory by itself:

Disallow: /

A robots.txt file would look something like the following:

User-agent: EmailSiphon
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /javascript/
Disallow: /img/
Disallow: /style/css/

In the above example, the robots EmailSiphon and CherryPicker are banned from the whole site (if they obey the rules). All other robots are banned from the "/cgi-bin/", "/javascript/", "/img/", and "/style/css/" directories. Note that each record is separated by a blank line and that every path begins with a slash, because paths are matched from the root of the site.
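
If you want to check how a rule-following robot would read a file like this, Python's standard urllib.robotparser module can evaluate it. The sketch below is only an illustration: it assumes the records are separated by blank lines and every path starts with a slash, it reuses the placeholder host www.yoursite.com, and the robot name ExampleBot and the helper allowed() are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents: blank lines between records, leading slash on paths.
ROBOTS_TXT = """\
User-agent: EmailSiphon
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /javascript/
Disallow: /img/
Disallow: /style/css/
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the named robot may fetch the given path on the site."""
    return parser.can_fetch(agent, "http://www.yoursite.com" + path)


if __name__ == "__main__":
    for agent, path in [
        ("EmailSiphon", "/index.html"),
        ("ExampleBot", "/index.html"),
        ("ExampleBot", "/cgi-bin/search.pl"),
    ]:
        print(agent, path, allowed(agent, path))
```

Here EmailSiphon is refused everywhere, while a robot without its own record (the hypothetical ExampleBot) falls under the "*" record and is only kept out of the four listed directories. As noted earlier, this only tells you what an obedient robot would do; nothing forces a robot to run such a check.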

That about covers it for the robots. For more information, check out the links below. (For the sake of interest, here is Microsoft's robots.txt file.)

Major Search Engine Robots

Miscellaneous Links

Spambots

Software


Created: August 18, 2004
Revised: August 25, 2004

URL: http://webreference.com/authoring/robots/1