The Robots Exclusion Standard (also known as the robots.txt protocol) is the agreement whereby search engines will not read or index certain content on your site, even though it is freely available for the public at large to view. The way it works is that a robots.txt file tells search engine spiders which pages you don’t want them to read, and assuming the search engine is acting in good faith, it won’t crawl those pages. Obviously, this is not a reliable way of hiding data; you must have the cooperation of the search engine for it to work, and even pages that aren’t indexed are still available for viewing by anyone with a web browser. Yet it has its uses.
Before I go into using a robots.txt file, I should mention that to maximize your SEO potential, you should consider allowing everything on your site to be read and indexed. By being fully open, you give yourself more chances to do well in search engine queries. Technically, you can accomplish this by not having a robots.txt file at all, but a “blank” robots.txt file that disallows nothing accomplishes the same thing and has the added benefit of not creating errors in your logs. Remember that search engines look for robots.txt whenever they crawl your site; if it is not there, the request is logged as a 404 error. Once you include a blank robots.txt file, these 404 errors disappear, and you can still rest assured that your entire site is open to crawling.
A “blank” robots.txt (one that blocks nothing) must be placed in the root directory of each subdomain, with the following text:
User-agent: *
Disallow:
The asterisk tells the spider that the following rule should be followed by all user-agents (i.e., search engine spiders), and the blank Disallow means nothing is disallowed.
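If you want to convince yourself that these two lines really do leave everything open, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name ExampleBot and the example.com URLs are placeholders I made up for illustration, and a real search engine's parser may differ in its details.

```python
from urllib.robotparser import RobotFileParser

# The allow-all rules from the text, one directive per line.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    for url in ("https://example.com/", "https://example.com/images/logo.png"):
        print(url, is_allowed(ALLOW_ALL, "ExampleBot", url))
```

Since the Disallow value is empty, no path is blocked, so the check should pass for any URL you try.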
What Should You Disallow?
If you have copyrighted images that you made yourself, and you don’t want people grabbing them off Google, you might want to disallow your /images/ folder. If you use CGI, disallowing /cgi-bin/ might be useful, as the scripts there don’t tend to do well for SEO anyway. If you have a support page, you may not want that info to show up in web searches. If you are mirroring content on multiple pages, you definitely want to disallow crawling for all but one of those pages. All of these are good examples, and you may come up with many more.
How to Disallow Portions of Your Site
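The code is actually quite simple. Under a User-agent line, add one Disallow line for each path you want kept out of the crawl. Each value is treated as a path prefix, so Disallow: /images/ covers everything inside that folder.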
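As a rough illustration, the sketch below puts a set of Disallow rules into a Python string and asks the standard urllib.robotparser module how a well-behaved crawler would treat a few paths. The paths (/images/, /cgi-bin/, /support.html), the bot name ExampleBot, and the example.com host are all assumptions chosen to match the examples above, not rules from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: one Disallow line per blocked path.
RULES = """\
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /support.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def can_crawl(path: str) -> bool:
    """Return True if a rule-abiding crawler may fetch this path."""
    # ExampleBot and example.com are placeholders for illustration.
    return parser.can_fetch("ExampleBot", "https://example.com" + path)


if __name__ == "__main__":
    for path in ("/index.html", "/images/logo.png",
                 "/cgi-bin/form.cgi", "/support.html"):
        print(path, "allowed" if can_crawl(path) else "disallowed")
```

The text inside RULES is what you would save as robots.txt. Anything not matched by a Disallow line stays open to crawling.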
An unsolved question in SEO circles now is whether or not including a robots.txt file will increase your ranking in the search engines. It’s possible, and some people think they have evidence for it. But the answer is not really clear, as it isn’t the kind of thing that’s important to test intensively enough to tell for sure. After all, it’s not difficult to create a blank robots.txt file, and you certainly can’t lose anything by including it. So even though I don’t know if it makes a difference, I recommend putting one up anyway.
Hopefully, this quick overview of robots.txt will help you to properly take care of your site. If you have any further questions, feel free to leave a comment.