The Inner Workings of Robots, Spiders, and Web Crawlers | 2
Gaining Control Over Robots
As robots visit your Web site, they follow every link and visit every directory. Unless, that is, you tell them otherwise. There are a few methods you can use to gain some control over what and where robots search. As I said earlier, most of the robots will obey the robots.txt file and the robots meta tag. Let's take a look and see what these are and how they work.
Robot Meta Tags
Meta tags are used for different things: listing the date the page was created, the author, keywords, a description of the page, and instructions for robots. The robot meta tag tells the robot whether to index the current page and follow the links on it. It's useful when you don't have access to the robots.txt file. (Remember though, every Web page is potentially accessible.)
The meta tag is placed between the <head></head> tags. There are several parameters that can be used with the tag: all, none, index, noindex, follow, and nofollow. (Each parameter must be separated by a comma.) The default, without the meta tag, is all, meaning the robot can index the current page and follow all the links on it. The none parameter means the robot is not to index the current page or follow any links on it. The two parameters, index and noindex, tell the robot whether it can index the current page. The two parameters, follow and nofollow, tell the robot whether it can follow the links on the current page. To keep the current page from being indexed but still allow the links to be followed, you would use:
<meta name="robots" content="noindex,follow">
To stop the robot from indexing the current page and following the links, you would use:
<meta name="robots" content="noindex,nofollow">
It's important to remember that not all robots support this tag. Most search engine robots do, but the robots.txt file is more widely honored, so it is the better choice when you have access to it.
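To make the meta tag rules concrete, here is a minimal sketch of how a well-behaved robot might read the tag. It uses Python's standard html.parser module; the class name RobotsMetaParser and the function robots_meta_policy are made up for this illustration, and real crawlers may handle edge cases differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the parameters of every <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so fall back to "".
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() != "robots":
            return
        # Parameters are comma-separated; compare them case-insensitively.
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html_text):
    """Return (may_index, may_follow); with no robots meta tag the default is 'all'."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    found = parser.directives
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex,follow"></head><body></body></html>'
print(robots_meta_policy(page))  # (False, True): do not index, but follow links
```

A page with no robots meta tag yields (True, True), which matches the default of all described above.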
The Robots.txt File
The Robots Exclusion Protocol was created to limit robot access to Web sites. However, it's not a mandatory protocol; robots comply with it voluntarily. When a robot visits a Web site, it first looks for a file called "robots.txt" in the root directory, e.g., http://www.yoursite.com/robots.txt. The file will not work in any other directory. It must also be in plain text (ASCII) format.
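Because the file is only honored in the root directory, a robot can work out where to look from any page address on the site. The following is a small sketch of that step using Python's standard urllib.parse module; the function name robots_txt_url and the sample page address are assumptions for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path, query, and fragment are dropped
    # because robots.txt always lives in the root directory.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yoursite.com/articles/2004/robots.html?page=2"))
# prints http://www.yoursite.com/robots.txt
```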
The format of the file is not too difficult to understand. Each entry, or "record," in the file is separated by one or more blank lines. The first line of a record contains the command User-agent: followed by the name of the robot to be excluded or an asterisk ("*"), meaning all robots, e.g.,
User-agent: *
Following that, on the next lines, are the directories that you don't want the robot to visit, one Disallow: line per directory, e.g.,
Disallow: /cgi-bin/
A robots.txt file would look something like the following (the directory names are only placeholders):
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
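To see how a robot might interpret such a file, here is a minimal sketch using Python's standard urllib.robotparser module. The sample records, the robot names ExampleBot and OtherBot, and the directory names are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: two records separated by a blank line.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network access is needed.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask whether the named robot may fetch the given path on the sample site."""
    return parser.can_fetch(agent, "http://www.yoursite.com" + path)


for agent, path in [
    ("ExampleBot", "/index.html"),      # excluded from the whole site
    ("OtherBot", "/index.html"),        # falls under the "*" record
    ("OtherBot", "/cgi-bin/search"),    # inside a disallowed directory
]:
    print(agent, path, allowed(agent, path))
```

The first record shuts one named robot out of the entire site, while the second applies to every other robot and blocks only the two listed directories.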
That about covers it for the robots. For more information, check out the links below. (For the sake of interest, Microsoft's robots.txt file is a real-world example worth a look.)
Major Search Engine Robots
- Googlebot: Google's Web Crawler
- HTML Author's Guide to the Robots META tag
- Robots.txt Validator
- Database of Web Robots, Overview
- Crawlers, Robots, and Spiders
- Spambot Beware!
- Save Your Site from Spambots
- Protect Your Web Server From Spam Harvesters
- Stopping Spambots: A Spambot Trap
Created: August 18, 2004
Revised: August 25, 2004