BACK TO BASICS: Your Friend, the Robots.txt File

by: Jayme Westervelt, July 2005

You've just built a new website and it's absolutely perfect. Visitors are going to love it and the search engines are going to love it even more. You know you have everything in order and you're ready to see those Top 10 Rankings show up any day now. However, when they do show up, you notice that the pages getting ranked are ones that you never intended the search engine spiders to find at all. Now what?

The answer is a simple little text file known as robots.txt. It is one of the more important files you can have on a website. Most search engine robots check for this file before spidering your site, and representatives from the major search engines have stated that even if you have nothing to exclude, it is still important to have a robots.txt file.

With this file, you can tell the search engine robots which directories or files you do not want them to visit; you can even block them from visiting your entire site. A robots.txt file is a plain text file that resides in the root directory of a website. It can be as simple as an empty file, or as complex as a long list of directories and files you wish to exclude the robots from. Don't have one? Then it's high time you created one. Here are some simple instructions:

Compile a list of directories or files on your site that you do not want spidered or indexed. Some examples would be the cgi-bin of the site, or a directory containing pages that are only available through a subscription.

You'll want to use a plain-text editor such as Notepad or WordPad to create your file. The syntax is as follows:

User-Agent: bot name
Disallow: directory or file name

Example 1:

User-Agent: Googlebot
Disallow: /cgi-bin/
Disallow: /misc/personal_information.html

Example 2:

User-Agent: *
Disallow: /cgi-bin/

The User-Agent field is where you specifically call out a certain robot (see example 1). If you want your rules to apply to all robots, simply use the wildcard character, * (see example 2).

The Disallow field is where you specify the directory or file name to exclude. Each new listing must be prefaced with Disallow:. Note that Disallow values are matched as prefixes: leaving off the trailing slash tells the robots not to spider any file or directory whose path begins with that name. For example, Disallow: /misc/personal instructs the robots to skip the /misc/personal_information.html file as well as the /misc/personal/ directory.
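If you want to sanity-check a rule set before publishing it, Python's standard-library urllib.robotparser applies this same prefix-matching logic. Here is a minimal sketch (the file and directory paths are illustrative, not from a real site):

```python
from urllib.robotparser import RobotFileParser

# Rules in the style of the examples above; paths are hypothetical.
rules = [
    "User-Agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /misc/personal",  # no trailing slash: matched as a prefix
]

rp = RobotFileParser()
rp.parse(rules)

# The prefix rule blocks both the file and the directory:
print(rp.can_fetch("Googlebot", "/misc/personal_information.html"))  # False
print(rp.can_fetch("Googlebot", "/misc/personal/notes.html"))        # False
# Unrelated paths are still allowed:
print(rp.can_fetch("Googlebot", "/products/index.html"))             # True
```

Testing your directives this way is much cheaper than discovering a typo after a spider has already indexed a private page.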

If you are in the process of developing a site or if you have a domain that contains the same content as another one of your domains, you might wish to prevent the robots from spidering your site at all. If this is the case you'll want to use the following in your robots.txt file:

User-Agent: *
Disallow: /

This will tell all the search engine robots to not spider or index any files or directories contained within your root directory. Once you are ready for your site to get spidered, you'll want to make sure that you edit your robots.txt file to allow the spiders to visit.
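You can confirm the lock-out with the same standard-library parser; a short sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /"])

# With "Disallow: /", every path on the site is off-limits to every robot:
print(rp.can_fetch("Googlebot", "/index.html"))  # False
print(rp.can_fetch("*", "/anything/at/all"))     # False
```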

Concerned because you don't have any files or directories that need to be excluded? Don't worry. Simply save an empty text file as robots.txt and upload it to your root directory.

A robots.txt file can also include comments. Any line that begins with # is treated as a comment and ignored by the robots.

Now that you know how to keep the robots away from files that were never intended for the whole world, you can focus on increasing the rankings of the pages that really deserve to be found. Any time you make major changes to your site, update your robots.txt file as needed. The last thing you want is to give the search engine robots false instructions when they visit your site.