Danny Sullivan (my new male BFF, you know) is moderating today’s Robots.txt Summit with speakers Dan Crow (Google), Keith Hogan (Ask.com), Eytan Seidman (Live.com), and Sean Suchter (Yahoo). Good, we’re all here!
With all the copyright buzz and content producers crying that the engines are "stealing" content, I’m actually pretty excited about this session. I know, Bruce looked at me like I was crazy too, but I think this is going to be a good session.
For those in the dark, robots.txt is a standard file that allows site owners to block content from being spidered. It was created a decade ago and hasn’t evolved much over the years, which, naturally, has caused problems — site owners still don’t know how to use it and the engines don’t know how to improve it. That’s where this session comes in.
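For reference, a bare-bones robots.txt looks something like this (the paths and bot name here are made up for illustration):

```text
# Keep every crawler out of the admin area
User-agent: *
Disallow: /admin/

# Block one particular bot from the whole site
User-agent: BadBot
Disallow: /
```

Each `User-agent` block names a crawler (or `*` for all of them), and the `Disallow` lines list the paths that crawler should stay out of.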
Representatives from the engines and angry site owners are gathering for a meeting of the minds. Watch out, it’s very likely chairs are going to be thrown.
Up first is Keith Hogan from Ask.com. Be brave, Keith.
Interestingly, Keith notes that less than 35 percent of servers have a robots.txt file. See, this explains why so much off-limits content is getting spidered and appearing in the index. It’s not the engine’s fault; it’s the site owners who didn’t read their robots.txt manual.
Some more fun facts from Keith: The majority of robots.txt files are copied from others found online or are provided by a hosting site. This is a clear sign that site owners don’t know how to use them. The files typically vary in size from 1 character to well over 256,000 characters, though the average robots.txt file is 23 characters.
Not that we needed him to tell us this, but Keith notes that the robots.txt format is not well understood. Yeah, I got that impression too.
Keith spoke briefly about the new SiteMaps directive that was announced this morning. (Coverage from Ask, Google, Microsoft, Yahoo!) The new directive will expand robots.txt to include a pointer to the Sitemap file, thereby removing the requirement to submit it to the search engines. Essentially, it allows auto discovery. Very cool.
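Under the new directive, pointing the engines at your Sitemap is just one extra line in robots.txt (example.com stands in for your own domain):

```text
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
```

The engines pick up the Sitemap location on their next crawl of robots.txt, so there’s nothing to submit by hand.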
In case you’re confused about the difference between a robots.txt file and your sitemap, your robots.txt tells the search engines what NOT to index and your sitemap is intended to control/optimize the interaction with the crawler on your site.
The similarity is that both robots.txt and SiteMaps are intended to "allow" the Webmaster page-level control: to identify every page that is on the site, identify the age of each page, etc.
He also outlined some questions and possible changes to robots.txt:
- Would changing the format to XML improve the accuracy, control and understanding?
- Should webmasters be able to stop/slow crawling when it will interrupt the operation of the site?
- Would a more fine grain approach to crawling time and crawling speed be helpful?
- Some sites have hosts/IPs that are dedicated to crawlers. Should robots.txt include a directive pointing crawlers to that host?
- Meta directives in HTML provide page level control for Archiving/ Caching/Link Following.
- Some sites inadvertently create spider traps or duplicate content for crawlers even though there are plenty of heuristics to identify these problems (session IDs, affiliate IDs).
- Should robots add hints for this so that sites don’t end up with duplicate pages and smaller link credits?
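The page-level meta directives Keith refers to live in a page’s `<head>`; for example:

```html
<!-- Don't index this page, don't follow its links, don't show a cached copy -->
<meta name="robots" content="noindex, nofollow, noarchive">
```

Unlike robots.txt, which works at the site/path level, these tags control what the engines do with one specific page.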
Up next is Eytan Seidman, who has opted not to use a PowerPoint presentation. I’m not sure if this makes me like him a lot or hate him immensely.
Okay, I love him. He just pulled up the robots.txt file for Hilton which reads: "Do not visit Hilton.com during the day!" Hee! That is so awesome.
Eytan wants a way to streamline the protocol to improve its effectiveness, saying that the vast majority of site owners don’t use robots.txt because it is too complex. The engines don’t support comments that include controls. There needs to be more commonality so a user can just identify the user agents and tell them what to do. A common protocol would make it simpler and therefore easier to use.
Dan Crow is up next and defines the Robots Exclusion Protocol as robots.txt + robots Meta tags. The exclusion protocol was originally created in June 1994 and tells the search engines what not to index. Since then it has become a de facto standard — everyone uses it but everyone uses a different version of it.
He asks if we should standardize the protocol so that there are common core features that exist.
Lastly, he outlines longer-term goals like creating consistent syntax and semantics, and building out a common feature set.
Up next is Sean Suchter from Yahoo who reminds us that Yahoo’s crawler is named Slurp and it supports all the standard robots.txt commands. There are also some custom extensions like crawl-delay, sitemap and wildcards.
Something to keep in mind is that different Yahoo search properties use different user-agents, so if you are trying to affect one robot, make sure you’re only addressing that robot. Otherwise you could run into an ‘oops’.
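So, for instance, to slow down Slurp without touching anyone else’s crawler, you’d address it by name (the paths and delay value here are arbitrary):

```text
# Rules for Yahoo's main crawler only
User-agent: Slurp
Crawl-delay: 5
Disallow: /cgi-bin/

# Everyone else
User-agent: *
Disallow:
```

A crawler uses the most specific `User-agent` block that matches it, so the Slurp rules here don’t spill over onto other bots.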
Next he has some questions for audience members.
- Would you like to replace "crawl-delay" with a different implementation that accomplishes the same goals for webmasters?
- What are your goals for this setting? Bandwidth reduction? GET reduction? Database load reduction? How do you use this?
- Do you pick a setting and move it up and down?
- Do you want support for a robots.txt-noindex html tag which you can use to help the engine not use certain areas of your page for matching? This would be used to demarcate useless template text, ad text, etc. that causes irrelevant traffic.
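If you’re curious how these directives look from a crawler’s side, here’s a quick sketch using Python’s standard-library robots.txt parser (the Slurp rules below are invented for illustration; note the stdlib module even understands the non-standard Crawl-delay extension Sean mentions):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: Slurp gets a crawl delay and one blocked
# directory; every other bot is shut out completely.
robots_txt = """\
User-agent: Slurp
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Slurp may fetch public pages, but not anything under /private/
print(parser.can_fetch("Slurp", "http://example.com/index.html"))      # True
print(parser.can_fetch("Slurp", "http://example.com/private/x.html"))  # False

# Slurp is asked to wait 5 seconds between requests
print(parser.crawl_delay("Slurp"))  # 5

# Any other bot falls through to the catch-all block and is refused
print(parser.can_fetch("OtherBot", "http://example.com/index.html"))   # False
```

This is exactly why the panel keeps stressing consistency: a parser like this can only honor directives the file actually spells out in the agreed format.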
Next is the question and answer period, only this time it’s the engines asking the questions and the audience answering them.
One panelist asks whether it would be useful to have robots.txt in an XML format.
An audience member says maybe, but if you’re going to do that, why can’t you just make it part of the SiteMaps? He says the frustrating thing is trying to control the robots. You may want to "try out" robots before you allow them to spider your site. You want to trust them first. Site owners want to know which spiders they can trust.
Another audience member said she liked the idea of doing it in XML because it would allow her to generate it dynamically. She says it would be nice to generate some kind of process that could go out and tell her instances of duplicate content.
What about combining SiteMaps and robots.txt? The room seemed mixed on this one.
Danny asks if it would be helpful to have a fast/medium/slow crawl delay setting. The audience basically shoots down Danny’s idea. Sorry, Danny.
One of the panelists asks if the audience would instead be interested in defining crawl frequency like megabytes per day or megabytes per month. The reaction from the audience? They growled. No, seriously. There was blatant growling and it was hilarious. Search marketers are crazy people.
Dave Naylor’s in the audience and asks if XML wouldn’t just confuse site owners more. What would happen if they put different directives in the SiteMap than they did in their robots.txt file? Dan from Google agrees it would and says that’s why the engines are wary.
He also mentions that over 75,000 robots.txt files have pictures in them. Apparently they missed the "text" part of .txt.
Danny asks how many people think some sort of standardization process would be useful?
Not surprisingly, a slew of hands were raised.
Personally, I like the idea of being able to identify the text on a page you don’t want indexed. For example, I think it’d be useful to be able to tell the engines that this here is the content of the article, index that, but ignore the navigation links. Follow them, but don’t index the content. Doing things this way would save a lot of people from duplicate content problems and wouldn’t require all those fancy table tricks site owners have to do. Just my two cents.