Is Slurp Behaving Badly?
There are reports at the Search Engine Watch forums, WebmasterWorld and LED Digest that Yahoo’s Webcrawler Slurp may be sticking its nose where it’s not welcome and indexing pages Webmasters have marked off limits in their robots.txt files.
The offending bots all appear to come from the 74.6 IP block, and a WHOIS lookup shows that the IPs all resolve back to Inktomi Corporation and the Yahoo.com name.
So, what are we dealing with? Based on the WHOIS information, we’re not talking about renegade bots posing as Yahoo’s crawler. These are actual Yahoo crawlers ignoring robots.txt files and indexing pages in quantities that could cause serious server issues for site owners. That’s hot.
Will Bontrager is just one site owner who was recently greeted by Yahoo’s overeager bot, noticing Slurp trying to get into his directory files. He commented that after failing to get into the directories blocked by his robots.txt file, Slurp seemed to begin appending random query strings to its requests to try to fight its way in.
"When I started seeing requests with URI /directoryname/?N=D
I felt a little uneasy. The query string varies — ?D=D, ?M=A, ?S=A
– sometimes the same directory queried with different query strings
spaced over several days."
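For site owners who want to check their own logs for the pattern Will describes, here is a minimal Python sketch. It assumes you have already pulled the request URIs out of your access log; the function name flag_requests and the sample paths are made up for illustration and are not from Will's setup.

```python
from urllib.parse import urlsplit

def flag_requests(request_uris, disallowed_prefixes):
    """Return requests that hit a disallowed directory AND carry a query string."""
    flagged = []
    for uri in request_uris:
        parts = urlsplit(uri)
        # Only the path is compared with the blocked prefixes; the query
        # string is what makes the request look like the ones Will saw.
        in_blocked_dir = any(parts.path.startswith(p) for p in disallowed_prefixes)
        if in_blocked_dir and parts.query:
            flagged.append(uri)
    return flagged

# Hypothetical sample data standing in for URIs taken from an access log.
sample = ["/directoryname/?N=D", "/directoryname/", "/blog/?p=1", "/directoryname/?M=A"]
print(flag_requests(sample, ["/directoryname/"]))
```

Filtering the log to Slurp's user agent first would narrow the results to the crawler in question.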
So if Yahoo can’t get into part of your site you’ve locked, they’ll just start throwing rocks until they can break a window? That’s special.
I actually wasn’t surprised to read that Yahoo was doing that. There are always rumors in the forums that Yahoo will often ignore a site’s robots.txt file if you don’t disallow it using a user-agent string. I just didn’t think they were true.
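If you want to see how your own rules read for a given crawler, Python's standard urllib.robotparser can evaluate a robots.txt file per user agent. The sketch below is only an illustration: the robots.txt contents and the is_allowed helper are assumptions, and it shows what a well-behaved parser concludes, not what Slurp actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the directory both for Slurp by name
# and for every other crawler via the wildcard group.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /directoryname/

User-agent: *
Disallow: /directoryname/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for agent in ("Slurp", "OtherBot"):
    for url in ("/directoryname/", "/directoryname/?N=D", "/public/page.html"):
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Under these rules, tacking a query string onto a blocked directory does not change the answer, since the rule matches on the start of the path.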
Derrick Wheeler offered up his two cents, saying perhaps it wasn’t a case of a Slurp rampage but merely that site owners were inadvertently linking to these indexed pages via another page on their site. That’s actually fairly common. Sites get so large that it can be difficult to remember what page has links where. You block one page from the engines, not realizing that deep within your site is another page that links directly to it. Derrick’s point is valid, but it’s still curious to me that there would be so many reports of Slurp abuse all at the same time. Or maybe everyone’s just jumping on the Slurp Is Evil bandwagon.
But I don’t think so. Why would Slurp be adding query strings to try and get around a site’s robots.txt files? It seems dirty. It’d be nice to get a response from Yahoo about what’s going on.