This entry was originally posted on the omnistaretools.com blog. It is reposted here for reference only.

The Robots Exclusion Standard (also known as the robots.txt protocol) is a convention under which search engines agree not to crawl certain content on your site, even though it is freely available for the public at large to view. A robots.txt file tells search engine spiders which pages you don’t want them to read, and assuming the search engine is acting in good faith, it won’t crawl those pages. Obviously, this is not a reliable way of hiding data: it only works with the cooperation of the search engine, and pages that aren’t crawled are still available to anyone with a web browser. Yet it has its uses.

Blank robots.txt

Before I go into using a robots.txt file, I should mention that to maximize your SEO potential, you should consider allowing everything on your site to be read and indexed. By being fully open, you give yourself more chances to do well in search engine queries. Technically, you can accomplish this by not having a robots.txt file at all, but putting up a blank robots.txt file accomplishes the same thing and has the added benefit of not creating any errors in your logs. Remember that search engines look for robots.txt whenever they crawl your site; if it is not there, the request is logged as a 404 error. If you include a blank robots.txt file, these 404 errors will disappear, and you can still rest assured that your entire site will be crawled.

A blank robots.txt, meaning one that allows everything, must be placed in the root directory of each subdomain, with the following text:


User-agent: *
Disallow:

The asterisk tells the spider that the following rule applies to all user-agents (i.e. search engine spiders), and the blank Disallow means nothing is disallowed.
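
If you want to confirm that a file like this really allows everything, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed, the ExampleBot user-agent, and the example.com URLs are placeholders I made up for illustration; real search engine spiders may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The allow-everything robots.txt shown above.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Hypothetical helper: True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URLs; substitute pages from your own site.
    for url in ("https://example.com/", "https://example.com/images/logo.png"):
        print(url, is_allowed(ALLOW_ALL, url, "ExampleBot"))
```

Every URL you pass in should come back as allowed, since the blank Disallow line blocks nothing.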

What should you disallow?

If you have copyrighted images that you made yourself, and you don’t want people grabbing them off Google, you might want to disallow your /images/ folder. If you use CGI, disallowing /cgi-bin/ might be useful, as CGI scripts don’t tend to do well for SEO anyway. If you have a support page, you may not want that info to show up in web searches. If you are mirroring content on multiple pages, you definitely want to disallow crawling for all but one of those pages. All of these are good examples, and you may come up with many more.

How to Disallow Portions of Your Site

The syntax is actually quite simple. Just list each path you want disallowed on its own Disallow line; a path ending in a slash covers everything inside that folder. For example:


User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /dontcrawlme.html
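
To see how a well-behaved spider would read these rules, here is a rough sketch that feeds them to Python's standard urllib.robotparser module and checks a few paths. The helper name check_paths and the example.com addresses are my own placeholders, and this only shows how Python's parser reads the file, which may not match every search engine exactly.

```python
from urllib.robotparser import RobotFileParser

# The disallow rules shown above.
RULES = """\
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /dontcrawlme.html
"""


def check_paths(robots_txt: str, urls: list[str], user_agent: str = "*") -> dict[str, bool]:
    """Hypothetical helper: map each URL to True (crawlable) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    # Placeholder URLs; substitute pages from your own site.
    results = check_paths(RULES, [
        "https://example.com/index.html",
        "https://example.com/images/photo.jpg",
        "https://example.com/cgi-bin/form.pl",
        "https://example.com/dontcrawlme.html",
    ])
    for url, allowed in results.items():
        print("crawl" if allowed else "skip ", url)
```

Only index.html should be reported as crawlable here; the other three fall under one of the Disallow lines.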

Will adding robots.txt help my SEO?

An open question in SEO circles right now is whether including a robots.txt file will increase your ranking in the search engines. It’s possible, and some people think they have evidence for it. But the answer is not really clear, as it isn’t important enough for anyone to have tested intensively. After all, it’s not difficult to create a blank robots.txt file, and you certainly can’t lose anything by including it. So even though I don’t know if it makes a difference, I recommend putting one up anyway.

Hopefully, this quick overview of robots.txt will help you to properly take care of your site. If you have any further questions, feel free to leave a comment.

Posted by Eric Herboso.