Battle of the Bot
Okay, lunch is done (or so I hear. Bloggers don’t eat) and it’s time for the first of the afternoon sessions. First up: The Bot Obedience Course. Excited?
It should come as no surprise that you want the search engine bots to crawl your site. After all, they are the bringers of traffic. However, not all bots play nice. Some bots want nothing more than to give you bandwidth headaches, steal your content and crash your site. So how does a webmaster know which is which?
Tuesday’s Bot Obedience Course, moderated by Danny Sullivan and led by Jon Glick, Bill Atchison, and Dan Thies, gave audience members excellent tips on how to train the good bots and how to dump the bad ones.
First, we’ll cover the good bots.
Good bots are the robots that crawl your site and send you traffic. They scour your site and report back to their parent engine on what they find. Because you want them on your site, you want to make it as easy as possible for them to make their way through. A great way to aid them in their pursuits is to create a site map for them to follow. This will give them a direct link to every page on your site and will increase your chances of having your pages properly indexed. You want to present the bots with a well-ordered, hierarchical site so they can crawl all the way through without getting bogged down.
By creating a robots.txt file you can guide good bots through your site. The file can specify what sections of your site you want the bots to crawl, create rules for indexing and tell crawlers how fast to go through your site.
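As a sketch, a minimal robots.txt might look like the following (the directory names are hypothetical, and Crawl-delay is a nonstandard directive that only some engines honor):

```
# Keep all crawlers out of scripts and private areas
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

# Ask Yahoo's crawler to wait 10 seconds between requests
User-agent: Slurp
Crawl-delay: 10
```

The file lives at the root of your site (e.g. example.com/robots.txt), and well-behaved bots fetch it before crawling anything else.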
Another way to guide good bots is to use your Meta tags. For example, each of the tags below prevents the search bots from performing a specific unwanted action:
NoIndex – tells the spider not to list this page.
NoArchive – tells the spider not to keep a cached copy of this page.
NoFollow – tells the spider to ignore the links on this page.
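These directives go in a robots Meta tag in the head of the page, and can be combined in one content attribute. A page you want kept out of the index entirely might carry something like:

```html
<head>
  <!-- Don't list this page, don't cache it, don't follow its links -->
  <meta name="robots" content="noindex, noarchive, nofollow">
</head>
```

Unlike robots.txt, which works site-wide, the Meta tag is a per-page control, so it's handy for things like printer-friendly duplicates or internal search results.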
Unfortunately, not all bots are created equal. Bad bots are rude, misbehave and often misrepresent who they are. These are the bots responsible for crashing your site and stealing your content.
Bad bots ignore your Meta tags, avoid blacklisting filters, and will use as many IP addresses as possible to get your content to drive clicks to their sites and make money. All they care about is taking your information and blending it into theirs for profit.
Examples of bad bots are intelligence gathering spybots, content scrapers (pure theft), data aggregators, link checkers, web copiers, and many more.
Many times you can recognize bad bots by analyzing your daily logs. If you see a high query volume from an IP range or overly regular page requests (one every two seconds for an hour), it’s probably not legitimate.
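The "overly regular requests" test is easy to automate. Here is a minimal sketch in Python, assuming you have already parsed your access log into (IP, timestamp) pairs; the function name and thresholds are hypothetical, so tune them to your own traffic:

```python
from collections import defaultdict

def flag_suspicious_ips(entries, min_hits=30, max_jitter=1.0):
    """Flag IPs whose request timing is suspiciously regular:
    many hits whose inter-request gaps barely vary (e.g. one
    request every two seconds, on the dot, for an hour)."""
    by_ip = defaultdict(list)
    for ip, ts in entries:
        by_ip[ip].append(ts)

    flagged = []
    for ip, times in by_ip.items():
        if len(times) < min_hits:      # too few hits to judge
            continue
        times.sort()
        gaps = [b - a for a, b in zip(times, times[1:])]
        if max(gaps) - min(gaps) <= max_jitter:  # clockwork timing
            flagged.append(ip)
    return flagged

# A scraper hitting the site exactly every 2 seconds...
scraper = [("203.0.113.7", 2.0 * i) for i in range(60)]
# ...versus a human with irregular gaps between pages.
human = [("198.51.100.9", t) for t in (0, 3, 11, 14, 40, 41, 90)]
print(flag_suspicious_ips(scraper + human))  # → ['203.0.113.7']
```

A real human (or a polite crawler with randomized delays) produces jittery gaps and slips through; only machine-regular traffic gets flagged for review.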
When dealing with bad bots, it’s important to be proactive. If you see a bad bot, block it immediately, preferably at the firewall level. Put up a "challenge" (usually a form or password) that bad bots will be unable to overcome. This will stop them dead in their tracks and ensure they are unable to access the content.
Your robots.txt file is useless against bad bots, because they’re rude and will just ignore it. Instead, create an opt-in system where you specify who can come to your site and who cannot. This will allow you to block entire IP ranges for web hosts that facilitate access for scraper sites. If you’re not paying attention, however, this can get you into trouble. You should review your traffic to see what kind of user agents you’re getting.
When you get a request from a "search engine spider" user agent, check the requesting IP.
- If the IP address is "owned" by the search engine, deliver the page.
- If the IP address is not owned by the search engine, deliver a different page, an empty page or a 403 Forbidden.
- Store lists of good vs. bad IPs to speed processing.
- If it’s a search engine bot coming from a proxy URL, block the request to avoid duplicate content issues.
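The "check the requesting IP" step above can be sketched in Python using the reverse-then-forward DNS check the major engines recommend for verifying their crawlers. The function name is hypothetical, and the suffix list is a partial example you should confirm against each engine's documentation:

```python
import socket

# rDNS suffixes for legitimate crawlers (partial example list --
# verify against each engine's own documentation).
GOOD_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net", ".search.msn.com")

def is_legit_spider(ip, reverse_lookup=None, forward_lookup=None):
    """Verify a claimed search-engine spider in two steps:
    1) reverse DNS of the IP must end in a known crawler domain;
    2) forward DNS of that hostname must map back to the same IP.
    Lookup functions can be injected for testing; by default the
    real resolver is used."""
    reverse_lookup = reverse_lookup or (lambda a: socket.gethostbyaddr(a)[0])
    forward_lookup = forward_lookup or socket.gethostbyname
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOD_SUFFIXES):
            return False  # rDNS doesn't belong to a known engine
        return forward_lookup(host) == ip  # forward-confirm the hostname
    except OSError:
        return False  # no rDNS record at all: treat as suspect
```

If this returns False for a request claiming a spider user agent, serve the empty page or 403 as described above, and cache the verdict per IP so you aren't doing DNS lookups on every hit.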
Of course, be careful who you block. Slurp, Googlebot, MSNbot, Becomebot and Fatbot are all legitimate bots that will give you valuable traffic. If you don’t let these bots in, you risk severely hindering your rankings. Blocking the Googlebot is NEVER a good idea.
Being proactive about blocking rude bots will get you better inclusion in the search engines, give your site better performance and yield more accurate reporting, because you don’t have bad bots "watering down" the numbers. It will help maintain the integrity of your site.
It’s imperative to block bad bots in order to keep tighter controls on copyrighted content and to prevent your site from crashing due to bandwidth overload. By following the tips provided by the panel, webmasters can help guide the bots who mean no harm safely to their content and kick out the ill-intentioned.