Robots.txt 101

Web robots – often referred to as crawlers, bots, or spiders – are software programs that constantly travel the web, indexing the information found on millions of websites every single day. Some sites, however, don’t wish to be indexed in search engines or accessed by these web robots at all. Now that you know what a web robot is and what it does, it’s important to know what can be done to limit its access to your site if you so desire. There are a number of reasons for wanting to prevent bot access to a page or a specific directory; the most common are related to security, privacy and duplicate content.

The Robots Exclusion Protocol, more commonly referred to as a /robots.txt file, gives webmasters a way to issue indexing instructions to bots. The file, which must reside in the domain’s root directory, serves to limit the bots’ access to files within that domain. A site is often made up of a large number of pages, but many of those pages – like registration, login, 404 error, privacy policy and order confirmation pages – should not be indexed by search engines. The /robots.txt file also comes in particularly handy for webmasters who run a wide network of sites with identical privacy policies or terms and conditions, or e-commerce sites that have checkout pages, shopping carts, etc.

Addressing Duplicate Content with /robots.txt

The /robots.txt file can also help to eliminate duplicate content issues that arise with blogging software such as WordPress. With WordPress – and most blogging software, for that matter – the content of a blog post is published at the post URL itself, but copies of that content also appear on category pages and in tag and author archives. This inadvertently creates several pages of duplicate content. Since duplicate content can have a negative impact on a site’s ranking in the organic search results, blocking those archive pages in the /robots.txt file can help protect the site’s search marketing strategy.
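
As a rough sketch, a WordPress site might keep those duplicate archive pages out of the index with entries like the following. The paths are assumptions based on WordPress’s default permalink structure, so verify them against your own site’s URLs before using them:

User-agent: *
Disallow: /category/
Disallow: /tag/
Disallow: /author/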

Understanding How To Use /robots.txt

In order to function properly, the /robots.txt file must be accessible at http://www.domain.com/robots.txt, which means it must reside in the domain’s root directory. The file itself should be created as a plain text document. Do NOT use Microsoft Word or another word processing program – the standard Notepad program that comes with Windows, or SimpleText/TextEdit on the Mac OS, works best. The file must be named robots.txt and uploaded directly to the domain’s root directory. The commands within the file itself can be as simple or complex as your needs demand.

The standard, generic /robots.txt file – one that does not limit access to any of the information in your domain’s root directory – would be formatted like this:

User-agent: *
Disallow:

Blocking bot access to the entire domain requires adding only one character to the standard, generic /robots.txt file, like this:

User-agent: *
Disallow: /

What if you want to limit bot access only to certain subdirectories or specific pages of the site? Not a problem. You would simply add each individual subdirectory or URL to the /robots.txt file as follows:

User-agent: *
Disallow: /checkout.asp
Disallow: /add_cart.asp
Disallow: /view_cart.asp
Disallow: /error.asp
Disallow: /shipquote.asp
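
To block an entire subdirectory rather than individual pages, disallow the directory path with a trailing slash. The directory names here are purely illustrative:

User-agent: *
Disallow: /images/
Disallow: /admin/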

The Robots.txt File Is Not Foolproof

While the /robots.txt file does a good job of steering bots away from parts of your site, it isn’t foolproof – the protocol is purely advisory, and not every bot honors it. Each individual page you do not want bots to index should also incorporate a properly formatted robots META tag. The standard robots META tag is configured like this:

<meta name="robots" content="index,follow" />

To help prevent the bots from indexing individual URLs, the robots META tag in the header of the page should look like this:

<meta name="robots" content="noindex,nofollow" />

or

<meta name="robots" content="noindex,follow" />
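
For context, the tag belongs inside the page’s <head> section. A minimal sketch, with an illustrative page title:

<head>
<title>Order Confirmation</title>
<meta name="robots" content="noindex,follow" />
</head>

Keep in mind that a bot can only read the META tag if it is allowed to crawl the page in the first place, so don’t rely on a META tag alone for a URL you have already blocked in /robots.txt.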

The Bottom Line

The /robots.txt file is a very useful tool and, unfortunately, an often overlooked and neglected aspect of web development. Now that you have a better understanding of what it is, what it does and how to use it, take some time to consider how your site may benefit from a properly configured /robots.txt file. In the meantime, start checking out the /robots.txt files of the sites you visit to familiarize yourself with different configurations and uses.

Written by
Alysson Fergison

Comments
  • I’m about to install WordPress into the blog on my site and often wondered about the duplicate content issues with the categories, etc. Mainly it seems only the titles are used, not duplications of content.

  • A few very important factors missed here, one of which is that you can set different rules for different user agents – Googlebot, Googlebot Image, Yahoo Slurp, MSN, etc.

    Is there an unsavory bot that’s chewing up your resources that has no business being on your site? Block it completely.

    Also, Googlebot in particular likes to hit CSS files, so it’s a good idea to make sure they’re allowed. You can also block a directory but allow specific files within it – for example, block all of /includes but allow the external CSS file that sits within that directory (see the sketch at the end of this comment).

    Also, keep in mind the robots.txt file is public, so if you are using it to block accessible data that you don’t want viewable at all, don’t point people right to it in the robots.txt. If you’re using robots.txt to block yoursite.com/secretstuff, you’re doing it wrong.

    Finally for now, it’s a good idea to test that the pages you want indexed aren’t getting accidentally blocked, using Google Webmaster Tools or Bing Webmaster Center.
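
    To make those points concrete, here is a minimal sketch. The bot name and stylesheet path are illustrative, and Allow is an extension honored by major crawlers such as Googlebot and Bingbot rather than part of the original protocol:

    # Block an unwanted bot entirely (the name is hypothetical)
    User-agent: BadBot
    Disallow: /

    # Block the /includes directory for everyone else,
    # but let crawlers fetch the stylesheet inside it
    User-agent: *
    Allow: /includes/style.css
    Disallow: /includes/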

  • Thanks for the comments, everyone. Good tips from all. I wrote this as a basic starting point for understanding what the /robots.txt file does and how to do a basic setup – hence the “Robots.txt 101” post title. Like all aspects of SEO, there are exceptions to every “rule” and always more advanced techniques to employ.

    This obviously isn’t a security measure to stop people who know what they’re looking for and where to find it. It’s simply a way to try to keep the pages you don’t want indexed by bots from being easily accessible. If you have sensitive information you never want anyone to see, a /robots.txt file won’t do anything to help you reach that goal.

    @SEO-Doctor – keep in mind that WP creates a /robots.txt file by default in whatever directory you install it. So if you have an existing site and are installing WP into a subdirectory (like http://www.domain.com/blog/) there will be a /robots.txt file created at http://www.domain.com/blog/robots.txt.

    And Google is clearly better than they used to be at assessing what is a malicious duplicate content “problem” and what is simply inadvertent repetition of the same content within several pages of a site. On the other hand, it’s always best to keep Google’s need to “figure it out for themselves” to a minimum. Addressing the duplicate content issues you know to exist by using the /robots.txt file and appropriate robots META tags is always a smart move.
