Controlling search engines and web crawlers using the robots.txt file
This guide explains how to use the robots.txt file to tell search engines and web crawlers which parts of your site to visit, and walks through example directives.
You can specify which sections of your site search engines and web crawlers should crawl, and which sections they should ignore. To do this, add directives to a robots.txt file and place the file in your site's document root directory, so that it is served at /robots.txt.
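Crawlers look for the file at the root of the host, which is why it belongs in the document root. The following Python sketch shows how a crawler might derive that location from any page URL. The function name robots_txt_url and the example.com URL are illustrative assumptions, not part of any particular crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the URL where a crawler would look for robots.txt for this page.

    Hypothetical helper: it keeps the scheme and host of the page URL and
    replaces the path, query, and fragment with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# example.com is a placeholder domain used only for illustration.
location = robots_txt_url("https://www.example.com/blog/2024/post.html?ref=home")
print(location)  # https://www.example.com/robots.txt
```

A robots.txt file placed in a subdirectory, such as /blog/robots.txt, would not be found by this lookup.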
Note
The directives you specify in a robots.txt file are only requests. Although most search engines and many web crawlers respect these directives, they are not obligated to do so. Therefore, you should never rely on the robots.txt file to hide content you do not want indexed.
Using robots.txt directives
The directives used in a robots.txt file are straightforward. The most commonly used directives are User-agent, Disallow, and Crawl-delay. Here are some examples:
Example 1: Instruct all crawlers to access all files
User-agent: *
Disallow:
In this example, any crawler (specified by the User-agent directive and the asterisk wildcard) can access any file on the site, because the empty Disallow directive excludes nothing.
Example 2: Instruct all crawlers to ignore all files
User-agent: *
Disallow: /
In this example, all crawlers are instructed to ignore all files on the site.
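Examples 1 and 2 differ by a single character, so it can be worth checking which behavior a file actually produces. One way to do this, sketched here under the assumption that Python is available, is the standard library's urllib.robotparser module. The agent name ExampleBot, the helper build_parser, and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt content supplied as a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Example 1: an empty Disallow value excludes nothing.
allow_all = build_parser("User-agent: *\nDisallow:\n")
# Example 2: a single slash excludes every path on the site.
block_all = build_parser("User-agent: *\nDisallow: /\n")

url = "https://www.example.com/any/page.html"  # placeholder URL
allow_all_result = allow_all.can_fetch("ExampleBot", url)
block_all_result = block_all.can_fetch("ExampleBot", url)

print(allow_all_result)  # True
print(block_all_result)  # False
```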
Example 3: Instruct all crawlers to ignore a particular directory
User-agent: *
Disallow: /scripts/
In this example, all crawlers are instructed to ignore the scripts directory.
Example 4: Instruct all crawlers to ignore a particular file
User-agent: *
Disallow: /documents/index.html
In this example, all crawlers are instructed to ignore the documents/index.html file.
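Disallow values are matched against the start of the requested path, so a directory rule covers everything beneath it while a file rule covers only that file. The following Python sketch illustrates this with urllib.robotparser. It assumes, for illustration only, that the rules from Examples 3 and 4 are combined in one file; the agent name ExampleBot, the sample paths, and the example.com domain are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the rules from Examples 3 and 4 combined into a single file.
rules = """\
User-agent: *
Disallow: /scripts/
Disallow: /documents/index.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical paths chosen to show which requests each rule covers.
paths = [
    "/scripts/app.js",
    "/scripts/lib/util.js",
    "/documents/index.html",
    "/documents/report.html",
    "/index.html",
]

# Map each path to True (allowed) or False (blocked) for a generic crawler.
results = {
    path: parser.can_fetch("ExampleBot", "https://www.example.com" + path)
    for path in paths
}

for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

In this sketch, both files under /scripts/ are blocked, /documents/index.html is blocked, and the other two paths remain allowed.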
Example 5: Control the crawl interval
User-agent: *
Crawl-delay: 30
In this example, all crawlers are instructed to wait at least 30 seconds between successive requests to the web server. Crawl-delay is a nonstandard directive, and some crawlers, including Googlebot, ignore it.
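As a rough illustration of how a cooperative crawler might honor this directive, the following Python sketch reads the delay with urllib.robotparser and pauses between requests. The crawl function, its fetch and pause parameters, and the agent name ExampleBot are hypothetical; real crawlers schedule requests in their own ways.

```python
import time
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Delay in seconds requested for this agent, or None if no delay is set.
delay = parser.crawl_delay("ExampleBot")


def crawl(urls, fetch, pause=time.sleep):
    """Fetch each URL in turn, pausing for the requested delay between requests.

    Hypothetical helper: `fetch` is any callable that retrieves one URL, and
    `pause` defaults to time.sleep but can be replaced when testing.
    """
    for index, url in enumerate(urls):
        if index > 0 and delay:
            pause(delay)  # wait before every request except the first
        fetch(url)
```

Passing a stand-in for pause, such as a list's append method, makes it possible to confirm the waiting behavior without actually sleeping for 30 seconds.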
More Information
For more information about the robots.txt file, please visit http://www.robotstxt.org.
Updated 3 days ago