Web robots, often referred to as crawlers, bots, or spiders, are software programs that constantly travel the web, indexing the information found on millions of websites every day. Some site owners, however, don’t want their sites indexed by search engines or accessed by these web robots. Now that you know what a web robot is and what it does, it’s important to know what can be done to limit their access to your site if you so desire. There may be a number of reasons for wanting to prevent bot access to a page or a specific directory. The most common reasons are related to security, privacy and duplicate content.
Addressing Duplicate Content with /robots.txt
The /robots.txt file can also help to eliminate duplicate content issues that arise with blogging software such as WordPress. With WordPress, and most blogging software for that matter, content from a blog post is published at the post URL itself, but copies of that content are also published on category pages, as well as on tag and author archives. This inadvertently creates several pages of duplicate content. Since duplicate content can have a negative impact on a site’s ranking in the organic search results, using /robots.txt to keep bots away from those archive pages reduces the risk to the site’s search marketing strategy.
Understanding How To Use /robots.txt
In order to function properly, the /robots.txt file must reside in the domain’s root directory, so that it is accessible at http://www.domain.com/robots.txt. The file name must be exactly robots.txt, in lowercase. Create the file as a plain text document. Do NOT use Microsoft Word or another word processing program; the standard Notepad program that is installed with Windows, or SimpleText/TextEdit on the Mac, works best. The commands within the file itself can be as simple or as complex as your needs demand.
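Because bots look for the file only at the root of the domain, it can help to see how that location is derived from any page address. The following is a minimal Python sketch, not part of the original article; the function name robots_url and the sample page address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address on the article's example domain.
print(robots_url("http://www.domain.com/blog/archive/post.html?page=2"))
```

Whatever page you start from, the result points at the same single file in the root directory, which is why a robots.txt placed in a subdirectory has no effect.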
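The standard, generic /robots.txt file – one that does not limit access to any of the information in your domain’s root directory – is formatted like this:
User-agent: *
Disallow:
The User-agent line names the bots the rules apply to, with * meaning all of them, and an empty Disallow line blocks nothing.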
Blocking bot access to the domain’s root directory completely requires adding only one character, a forward slash, to the generic /robots.txt file:
User-agent: *
Disallow: /
What if you want to limit bot access only to certain subdirectories or specific pages of the site? Not a problem. You simply add one Disallow line for each subdirectory or URL. Using placeholder paths, the file would look like this:
User-agent: *
Disallow: /private/
Disallow: /drafts/old-post.html
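One way to check what a given set of rules permits before uploading the file is Python’s standard urllib.robotparser module. The sketch below is an illustration rather than a definitive tool: the helper name allowed, the bot name ExampleBot, and the paths /private/ and /drafts/old-post.html are all assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Report whether the rules in robots_txt let agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Generic file: an empty Disallow blocks nothing.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
# One extra character, the slash, blocks the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# Hypothetical paths: one subdirectory and one specific page.
BLOCK_SOME = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Disallow: /drafts/old-post.html\n"
)

site = "http://www.domain.com"
print(allowed(ALLOW_ALL, site + "/private/notes.html"))      # True
print(allowed(BLOCK_ALL, site + "/index.html"))              # False
print(allowed(BLOCK_SOME, site + "/private/notes.html"))     # False
print(allowed(BLOCK_SOME, site + "/drafts/old-post.html"))   # False
print(allowed(BLOCK_SOME, site + "/index.html"))             # True
```

Rules are matched as path prefixes, so a Disallow line for a directory covers everything beneath it, while a line for a single page affects only URLs that start with that page’s path.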
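The /robots.txt File Is Not Foolproof
While the /robots.txt file does a good job of keeping well-behaved bots out of the areas you block, it isn’t foolproof. Compliance is voluntary, so a bot can simply ignore the file, and a blocked URL can still turn up in search results when other sites link to it. Each individual page you do not want bots to index should also incorporate a properly formatted robots META tag, bearing in mind that a bot can only read the tag on a page it is allowed to fetch. The standard robots META tag is configured like this: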
<meta name="robots" content="index,follow" />
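To keep bots from indexing an individual URL, place one of the following robots META tags in the head of the page. The first tells bots not to index the page and not to follow its links:
<meta name="robots" content="noindex,nofollow" />
The second keeps the page out of the index but still lets bots follow the links on it:
<meta name="robots" content="noindex,follow" />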
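To confirm which directives a page actually carries, the robots META tag can be read back out of the HTML. Here is a minimal sketch using Python’s standard html.parser module; the class name RobotsMetaParser, the function name robots_directives and the sample page are hypothetical, and the code assumes the tag is written with plain straight quotes.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page that should stay out of the index.
page = '<html><head><meta name="robots" content="noindex,follow" /></head></html>'
found = robots_directives(page)
print("noindex" in found)   # True
print("nofollow" in found)  # False
```

A check like this can be run against each page you want excluded to make sure the noindex directive is really present in the published markup.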
The Bottom Line
The /robots.txt file is a very useful tool and, unfortunately, an often overlooked and neglected aspect of web development. Now that you have a better understanding of what it is, what it does and how to use it, take some time to consider how your site may benefit from having a properly configured /robots.txt file. In the meantime, start checking out the /robots.txt files of the sites you visit to familiarize yourself with the different configurations and uses for it.