Are you a good bot or a bad bot?
One of the lessons I learned during the Bot Obedience Course in San Jose was that there are two kinds of bots – the good spidering ones and the bad bandwidth-stealing ones. (I know, I’m slow.)
Good bots are the robots sent out by the various search engines to crawl your site and bring you yummy traffic. They enter your site, soak up all the data and then report back to their parent engine on their findings. We like these bots. They are our friends.
Bad bots are the annoying content scrapers, web copiers, data aggregators and other nefarious beings that ignore your meta tags and will use as many IP addresses as possible to get into your site with the intent of making money off your hard work. You want to keep these guys out of your site so they can’t crash it or steal copyrighted content.
In order to do this, you need to be able to verify who is knocking on your site’s door. Typically this means verifying that the bot is who it says it is. However, because spammers often name their bots to mimic authentic ones, this isn’t always an easy task.
Back in September, the Google Webmaster blog released information to help users verify Googlebot, and now MSN has followed suit, giving users the information they need to verify that it’s a known MSN bot trying to index their site.
To verify that the user-agent visiting your site is the real MSNBot, follow these steps from MSN:
- When you get a page view request, it specifies a user-agent and an IP address. As I described above, all requests from Live Search use a user agent starting with the word ‘MSNBot’.
- If you see the MSNBot user-agent, it’s time to check the identity of the bot. Starting with the IP address (e.g. 207.46.98.149), you can use reverse DNS lookup to find out the registered name of the machine.
- Once you have the host name (in this case, livebot-207-46-98-149.search.live.com), you can check that it really is coming from Live Search. The name of all live search crawlers will end with ‘search.live.com’. If the name doesn’t end with ‘search.live.com’, you know it’s not really our crawler.
- Finally, you need to verify that the name is accurate. In order to do this, you can use Forward DNS to see the IP address associated with the host name. This should match the IP address you used in Step 2 – if it doesn’t, it means the name was fake.
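The four steps above can be sketched in Python. This is only a rough illustration, not MSN's own code: the function names (`reverse_lookup`, `forward_lookup`, `is_real_msnbot`) are made up for this example, the user-agent check is case-insensitive by assumption, and the suffix is written with a leading dot so that only names under search.live.com pass. The lookups use the standard library's `socket.gethostbyaddr` and `socket.gethostbyname_ex`.

```python
import socket

# Suffix taken from the steps above; the leading dot is an assumption added
# here so that only hosts *under* search.live.com are accepted.
TRUSTED_SUFFIX = ".search.live.com"


def reverse_lookup(ip):
    """Return the host name registered for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the IPv4 addresses hostname resolves to (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_real_msnbot(user_agent, ip, reverse=reverse_lookup, forward=forward_lookup):
    """Hypothetical helper: True only if every verification step passes."""
    # Step 1: only requests that claim to be MSNBot are candidates.
    if not user_agent.lower().startswith("msnbot"):
        return False
    # Step 2: reverse DNS, from IP address to registered host name.
    hostname = reverse(ip)
    if hostname is None:
        return False
    # Step 3: the host name must end with the Live Search suffix.
    if not hostname.lower().rstrip(".").endswith(TRUSTED_SUFFIX):
        return False
    # Step 4: forward DNS must lead back to the IP address we started with.
    return ip in forward(hostname)
```

In a request handler you would call something like `is_real_msnbot(user_agent, remote_ip)` with values taken from the incoming request. The `reverse` and `forward` parameters exist only so the DNS lookups can be swapped out, for example when testing without network access.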
Verifying the spider’s true identity will allow you to uncover whether you’ve just been visited by a real MSNBot or a convincingly camouflaged bad bot. If you find that the bot isn’t who it claimed to be, don’t hand over your content. Instead, block it, deliver it a different or empty page, or return a 403 Forbidden. It’s also recommended that you keep a list of good bots vs. bad bots to speed up processing in the future.
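Here is one way the good-bot/bad-bot list might look in Python. It is a minimal sketch under several assumptions: the names `BotVerdictCache` and `status_for` are invented for this example, the one-day expiry is an arbitrary choice (IP addresses can change hands), and `verify` stands for whatever check you run on a visitor that claims to be a search engine bot.

```python
import time


class BotVerdictCache:
    """Remember per-IP verification results for a limited time."""

    def __init__(self, ttl_seconds=86400, clock=time.monotonic):
        self.ttl = ttl_seconds          # how long a verdict stays valid
        self.clock = clock              # injectable so expiry can be tested
        self._verdicts = {}             # ip -> (is_good, expiry time)

    def get(self, ip):
        """Return True/False for a cached verdict, or None if unknown or expired."""
        entry = self._verdicts.get(ip)
        if entry is None:
            return None
        is_good, expires = entry
        if self.clock() >= expires:
            del self._verdicts[ip]      # stale entry: force a fresh check
            return None
        return is_good

    def put(self, ip, is_good):
        """Record a verdict for ip, valid for ttl seconds from now."""
        self._verdicts[ip] = (is_good, self.clock() + self.ttl)


def status_for(ip, cache, verify):
    """Return the HTTP status to send: 200 for verified bots, 403 otherwise."""
    verdict = cache.get(ip)
    if verdict is None:
        # Not seen recently: run the (slow) DNS-based check once and remember it.
        verdict = verify(ip)
        cache.put(ip, verdict)
    return 200 if verdict else 403
```

Only call `status_for` for requests whose user-agent claims to be a bot; ordinary visitors shouldn't go through it at all. Keeping the verdicts in memory means repeat visits from the same address skip the DNS round trips.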