December 4, 2007
Duplicate Content Issues Duplicate Content Issues
I know someone thinks the title of this panel is clever and...okay, it is, but it is also a very mean thing to do to someone as compulsive as I am. I keep wanting to correct it and it's painful to just leave it like that. Duplicate content hurts more than your rankings, people; please, think of the obsessive compulsives!
Tackling those tricky issues today are representatives from the four major engines: Rahul Lahiri, Ask.com; Derrick Wheeler, Microsoft; Evan Roseman, Google; and Priyank Garg, Yahoo. Aaron Shear moderates.
Aaron introduces Rahul Lahiri first as being from "Ask Jeeves". Rahul is quick to correct him, "Ask.com". Heh. Everyone still misses the butler.
Rahul says he's going to go quickly. Oh dear, I haven't had any coffee yet. This could be a problem.
He begins with the standard definition of duplicate content: Same content on multiple URLs. Why don't search engines like it? It impairs the user experience and consumes resources that could be better served crawling content that's unique. Duplicate content carries a risk of losing valuable votes because your links are spread out over multiple URLs instead of on the expert page. At Ask.com, duplicates are eliminated at all stages: crawling, indexing and ranking.
Contrary to popular belief, it's not a penalty; it's more like not being crawled. Filtering is performed on indexable content; templates aren't included. It's not a concern for supported ccTLDs, for example a site used in UK-only search. (The same web site on different ccTLDs is okay.)
They filter when the confidence factor is high. They have a low tolerance for false positives.
Some sources of duplicate content:
- Multiple URLs with the same content
- Printer Friendly pages
- Dynamic pages with session IDs/URL variant
- Content syndication
- Localization
- Mirrors
Scraping is the big concern. He asks for a show of hands from people who have had their content scraped--practically the whole room raises their hands.
Duplicate URLs might be necessary for Branding but avoid meaningless parameters and sub-domains if you can.
Rahul urges webmasters to act on the areas that they're in control of, particularly printer friendly pages. Block robots from printer friendly pages. Even though many printer friendly pages are quite usable, they're not going to hold visitors because they present no path into the rest of the site. So block them and make sure that the in-site page is the one presented. Aaron asks how you actually block printer friendly pages. Rahul says to put them in their own folder and robots.txt it out. Otherwise, put a Meta Robots tag in the head section of each printer friendly page.
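Rahul's folder-plus-robots.txt advice is easy to sanity check before you ship it. Here is a minimal sketch using Python's standard `urllib.robotparser`; the `/print/` folder name and the example.com URLs are assumptions for illustration, not anything shown on the panel.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all printer friendly pages are assumed
# to live under a /print/ folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(url, user_agent="*"):
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The printer friendly copy should be blocked, the in-site page should not.
print(is_crawlable("http://www.example.com/print/widgets.html"))
print(is_crawlable("http://www.example.com/widgets.html"))
```

If the pages can't be moved into their own folder, the per-page alternative is a tag such as `<meta name="robots" content="noindex">` in the head section.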
How do you make content unique? He puts up two nearly identical pages with one image name different and one word different in the title. It's nearly impossible for a spider to tell which is the original. Add unique Titles and Meta Descriptions. Add value to syndicated content to make it unique.
All JavaScript pages are a challenge for Ask.
He says you need to make it hard for Scrapers. Mark your territory--use your brand name, use absolute links, host images locally and take legal action when necessary.
If your content gets tossed for duplication, you need to contact them with a re-inclusion request.
Evan Roseman is up next. He's going quickly too. Woe.
Why is Google so down on duplicate content? Users don't like it, it uses resources, it uses resources on your server and they're concerned with original authorship --they want it from the person who created it instead of secondhand.
He says URL like a name. Earl. Go ahead and imagine that every time I type it, okay?
Much of what he covers is similar to Rahul. He does point out that www vs non-www is not as much of an issue for Google as before and mentions that you can specify your preference in the webmaster tools. Session IDs and URL parameters can split the PageRank between the duplicate URLs.
Google's goal is to serve one version of the content in search results.
Hmm, interestingly his slide says that dupe content is generally just a filter and it won't destroy your site. I guess that means occasionally it isn't and it will? Now he says it's 'definitely not a penalty'. So which is it? Definitely or generally? Inquiring minds, Evan.
For exact dupes, use a 301, like in the case of tracking URLs and www vs non-www.
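To make the www vs non-www case concrete, here is a rough sketch of the redirect decision in Python. The host names are made up, and a real site would normally configure this in the web server rather than in application code; the function only shows which requests should get a 301 and where they should land.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed for illustration: the site prefers the www form.
CANONICAL_HOST = "www.example.com"
ALTERNATE_HOSTS = {"example.com"}


def canonical_redirect(url):
    """Return (301, location) if url is an exact dupe on an alternate host,
    or None if the URL is already on the canonical host (or not ours)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in ALTERNATE_HOSTS:
        location = urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
        )
        return 301, location
    return None


print(canonical_redirect("http://example.com/widgets?id=3"))
print(canonical_redirect("http://www.example.com/widgets?id=3"))
```

Path and query string are carried over unchanged, so every non-www URL maps to exactly one www URL.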
For near duplicates, such as clones of other sites, use noindex or robots.txt. If you syndicate content, he repeats, make sure you're adding value.
Domains by country:
- Different languages is not duplicate content
- Use unique content specific to the country
- Use different TLDs (also specify in Webmaster Tools) for geo-targeting
Put data which does not affect the substance of the page in a cookie instead of in the URL so that they don't have to try to figure it out. URL parameters are problematic and can cause duplicate content.
What can you do if another site takes your content? Include an absolute URL. If you're syndicating, send out different content than what you keep.
Don’t worry about scrapers or proxies too much, they don't generally (there's that word again) affect your rankings. [Please tell blog search that, they seem to trust everyone else more than the original author.] If you're concerned, file a DMCA complaint against the other site.
You can let them know about any issues at the Webmaster Help Discussion Group.
If you're having trouble with your RSS replacing your rankings, let the discussion group know and they'll help.
And then the mics go out, so I get to catch up, yay!
And we're back. Priyank Garg is up next. He skips back about five slides because they're repeats of the other two.
He mentions a few reasons why search engines WOULD want duplicate pages:
- Site restricted queries
- Back ups
- Alternate document formats
- Multiple languages
Some other kinds of duplicate content:
Accidental duplication like session IDs in URLs (a URL is a URL is a URL to a search engine) and soft 404 errors -- make sure your 404 error pages return a 404 status, not a 200 OK. [See also our tedious explanation of the same.]
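For anyone who hasn't met a soft 404 in the wild: it's a friendly "page not found" page served with a 200 status, so every bad URL looks like a real (and identical) page. A minimal sketch of the correct behavior, written as a tiny WSGI app with Python's standard library; the page table and the `status_for` helper are hypothetical and exist only for this example.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical site content, keyed by path.
PAGES = {"/": b"Home", "/about": b"About us"}


def app(environ, start_response):
    """Serve known pages with 200 and everything else with a real 404."""
    body = PAGES.get(environ.get("PATH_INFO", "/"))
    if body is None:
        # The error page can be as friendly as you like,
        # as long as the status line says 404.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Sorry, that page does not exist."]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


def status_for(path):
    """Call the app directly (no network) and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    app(environ, start_response)
    return captured["status"]


print(status_for("/about"))
print(status_for("/no-such-page"))
```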
"Dodgy" forms of duplication:
Replicating content across domains unnecessarily
Aggregation of content found elsewhere
Identical content on the same site.
Approximate dupes may be filtered (real estate sites that just change out the city/state.)
Weaving and stitching (mixing and matching phrases, sentences, paragraphs, and sections from different sources to create "new" content) is also duplicate content.
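None of the engines said how they measure "approximate," so treat the following purely as an illustration of the idea, not as anyone's actual algorithm. One common textbook approach is to break each page into overlapping word sequences (shingles) and compare the sets with Jaccard similarity; the shingle size and the sample sentences are assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b, size=3):
    """Shared shingles divided by total distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two texts too short to shingle are treated as identical
    return len(sa & sb) / len(sa | sb)


# A city-swap template, like the real estate example above.
austin = "find homes for sale in austin texas with our local experts today"
dallas = "find homes for sale in dallas texas with our local experts today"
unrelated = "our bakery sells fresh sourdough bread every single morning at dawn"

print(jaccard(austin, dallas))     # high overlap: only the city changed
print(jaccard(austin, unrelated))  # no shared shingles
```

Swapping one word only disturbs the few shingles that contain it, which is why city/state templates still score as mostly the same page.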
Basically the same tactics work for Yahoo as for the other engines in keeping duplicate content out (robots.txt, meta, 301 dupe pages.) They support wildcards in robots.txt. Site Explorer allows you to Delete URLs or paths from authenticated sites.
Use the robots-nocontent class on a <div> tag (or other element) to mark non-relevant parts of the page. It can be used to mark templates or syndicated content that's useful in context for the user but not for search engines. (More information on the tag can be found at Ysearchblog.)
Dynamic URL rewriting available: ability to indicate parameters to remove from URLs across the site. Leads to more efficient crawling, better site coverage, more unique content discovered, fewer crawler traps and cleaner URLs that are better for users to read.
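The same idea can be applied on your own side, whatever the engine. Below is a minimal sketch of parameter removal with Python's `urllib.parse`; the parameter names in `IGNORED_PARAMS` are hypothetical, and this is not a description of how Yahoo's tool works internally.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that never change the content of the page.
IGNORED_PARAMS = {"sessionid", "sid", "ref"}


def strip_params(url):
    """Drop ignorable parameters and sort the rest so that URL variants
    for the same content collapse to a single form."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()  # parameter order should not create a new URL either
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_params("http://www.example.com/item?id=42&sessionid=abc123"))
print(strip_params("http://www.example.com/item?sid=zz9&id=42"))
```

Both calls return the same URL, which is the whole point: one piece of content, one address.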
The trouble with all these engine specific solutions is that they are engine specific. I like it better when they get together and come up with standards. Sure you can rewrite your URLs just for Yahoo but then where are you in Google?
Last up is Derrick Wheeler. He's adorably brought his own mouse and mouse pad. His job is in-house SEO for Microsoft.com but he says that he expects to get questions about Live Search, Office, and why things don't work. He doesn't know the answers though, so asking won't help.
Aw, today is his one month anniversary with Microsoft. Happy anniversary!
Major accomplishments:
- Signed up for benefits
- Found the cafeterias
- Can return to his office without getting lost
- Can finally remember a couple people's names
There are over 27 million pages on Microsoft.com--it took three weeks to discover them all. It's just a little site, really. They've indexed about 7 million of them. In Derrick's view, every duplicate content page is keeping one good page out of the index.
Review your site and make sure that you know what's there. Find duplicate content there before the spiders get there. Know your parameters and which you can drop for search engines. Do a regular crawl report that includes referring URL, fetched URL, redirect path with type, landing URL with status code, Title, Meta Description, Meta Keywords. Sort by Title then landing URL and review them for dupes.
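Derrick's "sort by Title then landing URL" review can be roughed out in a few lines. This is a sketch under assumptions: the report rows, field names and URLs below are invented, and a real crawl report would carry the other columns he lists (referring URL, redirect path, Meta Description and so on).

```python
from collections import defaultdict

# Invented sample rows from a crawl report.
crawl_report = [
    {"title": "Blue Widgets", "landing_url": "http://www.example.com/widgets?sid=abc", "status": 200},
    {"title": "About Us", "landing_url": "http://www.example.com/about", "status": 200},
    {"title": "Blue Widgets", "landing_url": "http://www.example.com/widgets", "status": 200},
]


def find_duplicate_titles(rows):
    """Group landing URLs by title and keep titles seen on more than one URL."""
    ordered = sorted(rows, key=lambda row: (row["title"], row["landing_url"]))
    urls_by_title = defaultdict(set)
    for row in ordered:
        urls_by_title[row["title"]].add(row["landing_url"])
    return {
        title: sorted(urls)
        for title, urls in urls_by_title.items()
        if len(urls) > 1
    }


for title, urls in find_duplicate_titles(crawl_report).items():
    print(title, urls)
```

A shared title is only a hint that two URLs serve the same page, so the flagged groups still need a human look.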
Ew! He's got a picture of spiders in a trap. EW.
Detect engines and strip out parameters that you don't need. He doesn't consider that "bad" cloaking. Remove session IDs. Smartpages.com stripped out session IDs and went from 1,000,000 pages indexed to 10,000,000 pages indexed. (A few months later, someone turned them back on and their page count fell. Whoops.)
Look for things that might be causing problems, like dynamic breadcrumbs, related products, etc. They might be helpful for users but you're probably going to get into trouble.
Q&A
Aaron: When you make changes in your rewriting can you fix it easily?
Yahoo: We validate and let you know if there's a failure. The rewrite starts taking effect in the system over a period of time. In the first few months, it's reversible; after that it gets hard.
If I did a 301 to clean this up, how soon do I expect results?
Yahoo: as soon as we start seeing it--a few weeks but it can take a while to percolate
Google: Same thing, as we recrawl, we'll incorporate. Up to a couple months.
Ask: Same.
Microsoft: Derrick's experience is 6 months to a year for full effect if you're 301ing to a new site.
If you do a site: command in Google to find www vs. non-www and you come up with different counts, should you 301 the smaller number to the bigger?
Evan: First he wants to emphasize that the site colon estimates are just that, very rough estimates. He wouldn't take them as the golden number. Very very rough. Aside from that, pick whichever form you like better and they'll take it.
Is Google planning on following Yahoo in how their tools are developing?
At Google, we try to do the best we can detecting these things (that Yahoo allows webmasters to correct) automatically. Can't say when or if they'll be following on the allowing webmasters to specify.
Do breadcrumb navigation with a cookie instead of URL parameters. Aaron says that you can detect search engines and strip out parameters. Evan jumps back in to say that Google requests please don't do something special for us. Let us figure it out and if there's a problem contact us.
On what scale do you think Proxy sites (sites entirely duplicated with just a phone number different so they can track PPC calls) will affect your organic results?
Evan: In most cases, they're not outranking the original sites. They're not that popular. We do see them. If they're causing a problem for you, contact Webmaster Help.
What is the line of near-duplicate/duplicate?
Evan: I think you're looking at it from the wrong direction. Create unique, useful content and you'll be fine. [The room laughs. That requires work!]
Derrick: It's like the Supreme Court decision on obscenity; you know it when you see it.
Can I report copyright issues through the spam channels?
Priyank: Spam is more subjective. The DMCA is the right channel.
Posted at December 4, 2007 1:02 PM
Comments
Posted by: Ill Styling at June 27, 2008 11:15 AM
I've been trying to figure out what penalty I've incurred. I have over 450 uniquely written posts, but they're all about products which go to the affiliate. My question is: can /tag pages in WordPress being indexed cause dupe content issues? If it's not that, I'm thinking it's because of my affiliate links, which is a shame because it's a shopping website.

Virginia Nussey
Susan Esparza




