Duplicate Content Summit
Matt McGee, Tamar Weinberg and I have formed our own blogger central (we miss you, Kim) as we all wait for this morning’s Duplicate Content Summit panel to start. Danny Sullivan is moderating with panelists Vanessa Fox (Google), Amit Kumar (Yahoo! Search), Peter Linsley (Ask.com), and Eytan Seidman (Microsoft).
Listed last but first up is Eytan.
Eytan starts off by talking about the different kinds of duplicate content. There’s the duplicate content you create yourself by accident, and the content that other people take from you (i.e., scrapers).
When your content is duplicated, you risk fragmentation of your rankings, anchor text dilution, and lots of other nasty things. Site owners can avoid creating duplicate content by avoiding tracking parameters in URLs wherever possible, not creating multiple sites with identical content, being cautious about filling regional sites with identical content (the same site serving Denver, Seattle, etc.), using client-side redirects instead of server-side, and using absolute links instead of relative links.
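One common way to consolidate identical content living on multiple hostnames is to redirect everything to a single canonical host (Eytan confirms later in Q&A that 301s count here). A minimal sketch of the decision logic, using Python’s standard library; the hostname is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical canonical hostname -- substitute your own.
CANONICAL_HOST = "www.example.com"

def canonical_redirect(url):
    """Return the 301 redirect target if `url` is on a non-canonical
    host, or None if it is already canonical."""
    parts = urlsplit(url)
    if parts.netloc == CANONICAL_HOST:
        return None
    # Rebuild the URL with only the hostname swapped out.
    return urlunsplit((parts.scheme, CANONICAL_HOST,
                       parts.path, parts.query, parts.fragment))
```

Your web server or framework would call this on each request and issue a 301 to the returned URL, so all link votes accumulate on one copy of each page.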
Site owners must also worry about others who steal their content. The simplest way to deter people from copying your content is to ask them not to take it without permission. Fine, we said simple, not "effective". Eytan says a more advanced method would be to verify user agents and block unknown IP addresses from crawling your site. You want to minimize instances of blocking legitimate users, however.
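The user-agent verification Eytan mentions is usually done with a reverse-DNS lookup on the requesting IP, followed by a forward lookup to confirm the hostname maps back to the same IP. A sketch under those assumptions; the suffix list is illustrative, not an official registry:

```python
import socket

# Hostname suffixes used by major crawlers (illustrative list -- verify
# against each engine's published documentation).
CRAWLER_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def hostname_is_known_crawler(hostname):
    """Check whether a reverse-DNS hostname belongs to a known crawler domain."""
    return hostname.rstrip(".").endswith(CRAWLER_SUFFIXES)

def verify_crawler_ip(ip):
    """Reverse-resolve the IP, check the domain suffix, then forward-resolve
    the hostname to confirm it maps back to the same IP (this guards
    against spoofed PTR records)."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname_is_known_crawler(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

A bot claiming to be Googlebot in its user-agent string but resolving to `fake-googlebot.com.attacker.net` fails the suffix check, so you block the bad actors without blocking legitimate crawlers or users.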
If you think you might have duplicate content, ask yourself:
- Is there additional value to this content? Don’t just reproduce content for no reason. Make sure you are adding unique value.
- Is there attribution? If you are going to use someone else’s content, make sure you attribute it. Otherwise you’re a stupid jerk.
From there, Eytan goes into how Live.com handles duplicate content. According to him, there are no site-wide penalties. Microsoft looks very aggressively at session parameters, while also ensuring that they’re not eliminating any content in the process.
Up next is Peter from Ask.com.
Peter says the main difference between his presentation and Eytan’s is that with Peter’s you get it in a British accent. Heh, sweet!
Peter gives the standard definition of duplicate content, defining it as having the same information available on multiple URLs. Content duplication is rarely a good idea, says Peter.
It’s an issue for several reasons. It’s an issue for search engines because it impairs the user experience and consumes resources. It’s an issue for webmasters because it puts them at risk for missing votes, having the wrong content candidate selected, etc.
Peter says that there is no duplicate content penalty at Ask.com; being filtered out is simply similar to not being crawled.
Ask.com looks for duplicate content only on indexable content. They’re just looking at what the user can see and they only filter content when they are absolutely sure the content has been duplicated. They don’t want any false positives. The best candidate is identified from numerous signals. Similar to ranking, the most popular is identified. (Sadly, we don’t get any specifics on the "signals" used.)
What can you do to prevent duplicate content?
- Act on the areas you are in control of. Put content on a single URL, copyright your content, and "uniquify" it so that it’s hard to use the content out of context.
- Make it hard for scrapers to steal. Mark your territory and threaten legal action.
- If you think you’re being filtered out, contact Ask.com and submit a reinclusion request.
Amit is up next.
Yahoo eliminates duplicate content at every point in the pipeline, but as much as possible at query-time. They are less likely to crawl links from known duplicate pages and less likely to crawl new docs from duplicative sites. This is worrisome to me on some level, but I’ll go with it.
Amit stresses that there are legitimate reasons to duplicate content, including:
- Alternate document formats – present the same content in HTML, Word, PDF, etc
- Legitimate syndication
- Multiple language markets
- Partial dup pages from boilerplate
Types of accidental duplication include:
- Session IDs in URLs
- Remember, to engines a URL is a URL is a URL
- Two URLs referring to the same doc look like dupes (Yahoo can sort this out, but it may inhibit crawling)
- Embedding session IDs in non-dynamic URLs doesn’t change the fundamental problem
- Dodgy Duplication — Replicating content across multiple domains and "aggregation" of content found elsewhere on the Web
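The session-ID problem above comes down to stripping content-irrelevant query parameters so every visitor (and every crawler) sees one URL per document. A minimal sketch with Python’s standard library; the parameter names are assumptions you would tune to your own site:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that don't change the page content
# (assumption: adjust this set for your site).
NOISE_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def strip_session_params(url):
    """Return `url` with session-style query parameters removed,
    leaving one canonical URL per underlying document."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in NOISE_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))
```

Run on `http://example.com/p?item=5&sessionid=abc123`, this yields `http://example.com/p?item=5`, so to the engines a URL really is one URL.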
You can help Yahoo by avoiding bulk duplication of underlying documents. If only small variations exist in your content, do the search engines need all versions? If duplication in some areas of your site is necessary, can you use robots.txt to tell the engines which areas to avoid?
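Amit’s robots.txt suggestion might look like this sketch, where the paths are hypothetical stand-ins for whatever duplicate-heavy areas your site actually has:

```
User-agent: *
# Printer-friendly duplicates of article pages
Disallow: /print/
# Internal search results that mirror category pages
Disallow: /search/
```

Crawlers that honor the Robots Exclusion Protocol will skip those paths entirely, so the duplicates never enter the index in the first place.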
Site owners should also be careful to avoid accidentally proliferating many URLs for the same document: session IDs, soft 404s, etc. Consider providing a session-ID-free path for crawlers.
Vanessa is up next.
Vanessa starts her presentation by discussing the different kinds of duplicate content. Because she is both adorable and awesome she uses various Buffy the Vampire Slayer references to make her point. (Matt McGee could not be more horrified).
She illustrates similar-content duplication with the episode where there were two Xanders who were nearly identical and just needed to be combined to fix the problem. Then there is "same but different" duplicate content, like the episode that features Good Willow and Bad/Half-Naked Willow. In this case, it was obvious that these two Willows were not the same. One needed to stay and the other needed to leave.
If you are like Matt McGee and have never watched Buffy the Vampire Slayer, well then, neither I nor Vanessa can help you. Go rent it. It’ll teach you wonders about SEO or something.
Vanessa says SEOs are becoming increasingly concerned about duplicate content issues. Does syndicating your content in feeds mean you give up being seen as the original source? Is content scraping that’s out of your control going to knock you down in the rankings? In this session, search engines outline how they currently handle duplicate content detection.
Noteworthy Q&A Soundbites:
Eytan says that he does include 301 redirects as client-side redirects. He was clarifying a statement he made earlier.
Vanessa says not to use nofollow to get rid of duplicate pages because other people can still link to them.
Lots of people seem to be in favor of digital signatures to prevent content scraping. The panelists all argued that it wouldn’t solve the problem, causing Michael Gray to shout out from the audience that if you don’t offer it, nobody can adopt it. So wise, Graywolf. So wise.
On the topic of eBay subdomains, Danny says not to worry because they won’t be in Google’s index anymore. Why? Google hates eBay now and Yahoo likes them. Hee!
Eytan asks if the session was "advanced" enough and about 20 people yell out an angry "no". Ouch.