Is Slurp Behaving Badly?

There are reports at the Search Engine Watch forums, WebmasterWorld and LED Digest that Yahoo’s web crawler Slurp may be sticking its nose where it’s not welcome, indexing pages webmasters have marked off-limits in their robots.txt files.

The offending bots all appear to come from the 74.6 IP block, and a WHOIS lookup on those IPs shows that they all resolve back to Inktomi Corporation by name.

So, what are we dealing with? Based on the WHOIS information, we’re not talking about bad renegade bots posing as Yahoo’s crawler. These are actual Yahoo crawlers ignoring robots.txt files and indexing pages in quantities that could cause serious server issues for site owners. That’s hot.

Will Bontrager is just one site owner who was recently greeted by Yahoo’s overeager bot, noticing Slurp trying to get into his directory files. He commented that after Slurp was turned away from those directories, as per Will’s robots.txt file, it began appending random query strings to its requests to try and fight its way in.

Will notes:

"When I started seeing requests with URI /directoryname/?N=D, I felt a little uneasy. The query string varies — ?D=D, ?M=A, ?S=A — sometimes the same directory queried with different query strings, spaced over several days."
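For what it’s worth, appending a query string shouldn’t get a well-behaved crawler past a Disallow rule — the rule matches on the path prefix, which the query string doesn’t change. A minimal sketch using Python’s standard urllib.robotparser (the directory and domain names are hypothetical, echoing Will’s example):

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt equivalent to what Will describes
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /directoryname/",
])

# The query string doesn't change the path prefix, so the rule still applies
print(rp.can_fetch("Slurp", "http://example.com/directoryname/?N=D"))  # False
print(rp.can_fetch("Slurp", "http://example.com/public/page.html"))    # True
```

In other words, a crawler honoring robots.txt gains nothing by varying the query string — which is what makes the behavior Will describes so odd.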

So if Yahoo can’t get into part of your site you’ve locked, they’ll just start throwing rocks until they can break a window? That’s special.

I actually wasn’t surprised to read that Yahoo was doing that. There’s a persistent rumor in the forums that Yahoo will ignore a site’s robots.txt file unless you disallow Slurp explicitly by its user-agent string. I just didn’t think it was true.
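For reference, the distinction that rumor hinges on is whether the crawler is named explicitly or only covered by the catch-all rule. A robots.txt that does both might look like this (the directory name is just an example):

```
# Applies only to Yahoo's crawler, named explicitly
User-agent: Slurp
Disallow: /private/

# Applies to every other well-behaved crawler
User-agent: *
Disallow: /private/
```

Per the robots exclusion convention, a crawler is supposed to obey the most specific User-agent group that matches it — so the wildcard group alone should already be enough.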

Derrick Wheeler offered up his two cents, suggesting that perhaps it wasn’t a case of a Slurp rampage but merely that site owners were inadvertently linking to these indexed pages from another page on their site. That’s actually fairly common. Sites get so large that it can be difficult to remember what page has links where. You block one page from the engines, not realizing that deep within your site is another page that links directly to it. Derrick’s point is valid, but it’s still curious to me that there would be so many reports of Slurp abuse all at the same time. Or maybe everyone’s just jumping on the Slurp Is Evil bandwagon.

But I don’t think so. Why would Slurp be appending query strings to try and get around a site’s robots.txt file? It seems dirty. It’d be nice to get a response from Yahoo about what’s going on.

Lisa Barone is a writer, content marketer & VP of strategy at Overit Media. She's also a very active Twitterer, much to the dismay of the rest of the world.

5 Replies to “Is Slurp Behaving Badly?”


We’ve also noticed Inktomi behaving badly (ignoring robots.txt and MORE.)

As of today, we’ve banned Inktomi, Yahoo, Slurp, and all their IPs permanently from hundreds of servers and thousands of domains.

I don’t know what they’re doing, or who is in control — but they are acting like the WORST of what we see operating on the Internet.
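For anyone wanting to shut the door the way this commenter describes, an IP-level ban at the web server is one way to do it. A sketch for Apache 2.2 (the CIDR range is an assumption based on the 74.6 block mentioned above; adjust to your own logs):

```
# .htaccess sketch: deny requests from the 74.6.0.0/16 range
Order Allow,Deny
Allow from all
Deny from 74.6.0.0/16

# Belt and braces: also turn away anything identifying itself as Slurp
SetEnvIfNoCase User-Agent "Slurp" bad_bot
Deny from env=bad_bot
```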

Don’t know much about this, but in trying to figure out who/what Inktomi is I found your post.
They pop up (from Sunnyvale, CA) as a regular visitor to my blog and those of many other bloggers I’ve spoken to. We guessed everything from gov’t spying to simple marketing (which is what it’s beginning to look like).
Also noticed with Yahoo these past couple months that my sign-in page always flashes & duplicates itself as I’m signing in… an odd event that never happened before.

I have seen similar strange behaviour with Yahoo!, and I have also noticed that Slurp is spidering a couple of my URLs a large number of times.

Anyone else noticing this?

One more thing…
I’ve seen instances of Google also “partially indexing” pages that are excluded using robots.txt files, using the text from an inbound link as the “Title.”

It could also be a case of other websites linking to Will’s URLs with extra query strings in an attempt to get duplicate content indexed. Yahoo! might only be crawling the URLs with the extra query strings because someone else is linking to them: Yahoo discovers the URLs through links and then makes a request for each URL. A question for Yahoo is: at what point does it cross-reference the URLs it is requesting with the exclusions listed in the robots.txt file?

On a related note, I have seen recent examples of URLs that are excluded via robots.txt being “partially indexed,” meaning Yahoo discovered the URL but didn’t actually index any content from it. In cases like this, I’ve noticed that Yahoo uses the anchor text of an inbound link as the “Title” of the link in the search results. Danny Sullivan discusses this on his Search Engine Land website.



Serving North America based in the Los Angeles Metropolitan Area
Bruce Clay, Inc. | 2245 First St., Suite 101 | Simi Valley, CA 93065
Voice: 1-805-517-1900 | Toll Free: 1-866-517-1900 | Fax: 1-805-517-1919