Duplicate Content & Multiple Site Issues
[Again, I feel I must mention the jazzy Michael Bolton music playing in the background. You really need to be here. It’s just too incredible to miss.]
Jon Glink has stepped up to take over Anne Kennedy’s role as moderator for today’s Duplicate Content & Multiple Site Issues panel, with speakers Shari Thurow, the always fashionable Mikkel deMib Svendsen, Adam Lasnik (aka MiniMatt) and Tim I’m-Not-Matt-Cutts Converse.
Duplicate content continues to be a major issue for webmasters, often resulting in search penalties, if not complete banning. But what really qualifies as duplicate content, and how similar does content have to be in order to be classified "duplicate"?
What is duplicate content?
Shari Thurow says it’s very difficult to get a definitive answer as to what qualifies as duplicate content because most people can’t agree on how much is too much. Instead, she says the engines are looking for resemblance: if the page contains roughly the same information as another page, then it’s duplicate and there’s a problem.
Examples of duplicate content: Multiple domains where you have the identical home page on different URLs (often seen on affiliate sites), different links to several different URLs for one site, dynamic content with unique URLs.
Why is duplicate content a problem?
Simple. The search engines want your content, but they only want one copy of it. Too many versions of the same material don’t offer up a good user experience. As a result, the engines will decide which version of the content to use and filter out the others. You don’t want to leave this decision up to the search engines.
Allowing the same content to be delivered time and time again would slow down the entire information retrieval process. Not to mention the angry revolt that would form if the engines started presenting users with the same results over and over again. Seriously, people would get hurt. Hide the MiniMatt.
How do the engines determine duplicate content?
Shari Thurow highlighted several factors the engines use to determine duplicate content.
Content properties – By stripping away the "boiler plate" aspects of a site (images, header, footer, etc), the engines can see if what you’re offering is unique or whether it ‘resembles’ something they’ve seen elsewhere.
Linkage properties – The engines take into consideration the number of inbound links and outbound links related to the content.
Content Evolution – Most content does not change on a weekly basis. Now, we’re not talking about blogs or news sites updating themselves. We’re talking about article mutation. The engines know which kinds of sites should be ‘mutating’ (aka updating) and on what time frame. If you’re an insurance site updating several times a day, it may raise red flags with the engines.
Host name resolution – A host name is the unique name of a machine, such as a Web server. Changing servers too frequently is a sign you may be spamming.
Shingle comparison – Every Web document is as unique as a Chicago snowflake. The search engines can break down content into sets of word patterns. Groups of adjacent words, called shingles, are compared for similarity. More shared shingles equals more similarities equals big problems.
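For the curious, here is a rough Python sketch of what a shingle comparison could look like. The panel did not say how the engines actually score resemblance, so the shingle size of four words and the Jaccard ratio (shared shingles divided by all distinct shingles) are assumptions for illustration, and the function names are made up.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (runs of k adjacent words) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Share of distinct shingles that the two texts have in common (0.0 to 1.0).

    Assumption: Jaccard similarity stands in for whatever the engines really use.
    """
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)
```

Two pages that differ only in a word or two will still share most of their shingles and score close to 1.0, while unrelated pages score close to 0.0.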
What are some common technical duplicate content issues?
- URLs With or without WWW: Less of a problem today than in the past. Most engines are learning how to handle common use of www and non-www. The problem is it dilutes your links.
- Dynamic URLs/ Session IDs: Every time a session ID is used, the engines see a new page. Dump all session information into a cookie and don’t use it in the URL.
- URL rewriting: When you rewrite URLs to make them more user-friendly, the rewrite doesn’t block the original URL. Both the original and the rewritten URL still resolve, creating two identical pages.
- Many-to-one problems in forums: You can get to the same forum page by using different URLs – the page ID and the thread ID.
- Sort order parameters: the URL changes depending on how the information is sorted (time, visitors, etc), but the information is still the same.
- Breadcrumb navigation: If your breadcrumb navigation is reflected in your URL you may have a problem. The URL path may change (shoes/running/adidas vs. sports/equipment/shoes/adidas), but the content is the same – a page on adidas running shoes.
- Mirror sites: The same site running on two domains.
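Several of the issues above (www vs. non-www, session IDs, sort order parameters) come down to many URLs pointing at one page. A minimal Python sketch of collapsing them to a single form follows. The host www.example.com and the parameter names in IGNORED_PARAMS are hypothetical; swap in whatever your own site uses.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that change the URL without changing the content.
IGNORED_PARAMS = {"sid", "sessionid", "phpsessid", "sort", "order"}


def canonical_url(url, preferred_host="www.example.com"):
    """Collapse URL variants of the same page into one canonical form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()

    # Treat the www and non-www forms of the preferred host as the same site.
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host

    # Drop session and sort parameters, then sort the rest so order never matters.
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in IGNORED_PARAMS]
    kept.sort()

    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, host, parts.path or "/", urlencode(kept), ""))
```

Running your crawl logs or sitemap through something like this is a quick way to see how many distinct URLs collapse into the same page.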
So if you are suffering from duplicate content, what do you do?
First, if you find your site ranks for both its www and non-www versions, you need to decide which one you want to use and 301 redirect the other to it. A 301 is a permanent redirect, and it will transfer all your related PageRank and link information. You’ll know you have performed the redirect correctly if your home page only displays under the one URL. Only use a 302 (temporary) redirect for content that is going to change very frequently.
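Most people will set this redirect up in their Web server configuration. Purely as an illustration, here is how the same idea might look as a small Python WSGI wrapper. The host name www.example.com and the function name are hypothetical, and this sketch redirects every request whose Host header is not the preferred one.

```python
PREFERRED_HOST = "www.example.com"  # hypothetical; use the version you chose


def redirect_to_preferred_host(app):
    """Wrap a WSGI app so requests for any other host get a 301 to PREFERRED_HOST."""
    def middleware(environ, start_response):
        host = environ.get("HTTP_HOST", "")
        if host and host.lower() != PREFERRED_HOST:
            scheme = environ.get("wsgi.url_scheme", "http")
            location = f"{scheme}://{PREFERRED_HOST}{environ.get('PATH_INFO', '/')}"
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            # 301 marks the move as permanent; a 302 would mark it as temporary.
            start_response("301 Moved Permanently",
                           [("Location", location), ("Content-Length", "0")])
            return [b""]
        # Already on the preferred host: hand the request to the real app.
        return app(environ, start_response)
    return middleware
```

Note that the path and query string are carried over, so a deep link to the non-preferred host lands on the matching page rather than the home page.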
Other things you can do: Use static URLs instead of dynamic ones, don’t list products on more than one page, and if you’re going to rewrite a URL, make sure you’re redirecting the original URL, not just adding an alternative.
If you know that your content management system is delivering duplicate content, use the robots exclusion protocol to prevent the duplicate site from being crawled and indexed.
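As a rough example, suppose your content management system also serves printer-friendly copies of every page under /print/ (a made-up path for illustration). The Python sketch below holds a two-line robots.txt that blocks that directory and uses the standard library's robots parser to check that the rules do what you expect before you publish them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep all crawlers out of the duplicate /print/ copies.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(url, user_agent="ExampleBot"):
    """Return True if the rules in ROBOTS_TXT allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here is_crawlable("http://www.example.com/print/shoes") comes back False while the regular /shoes page stays open to crawlers. ExampleBot is a placeholder user agent, not a real crawler.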
It’s very important that you do this because otherwise you leave it up to the engines to determine which version of your site they’ll index. You want to make sure you are able to rank for the one you want. If you don’t know which site is converting better, take a look at your Web analytics data.