Duplicate Content & Multiple Site Issues
Danny Goodwin, Associate Editor, Search Engine Watch
Katy Collins, Senior Product Manager, AOL
Eric Enge, President, Stone Temple Consulting
Chris Keating, Director, SEO, Performics
David Naylor, SEO, Bronco
Ahhh… one of my all-time favorite topics to hash out with clients. Let’s see a show of hands – how many of you have had duplicate content issues at one time or another? The most common questions I hear are: Does it cause blacklisting? How are the pages filtered? What should I do with the affiliates who all show my content? Yep, it never fails; almost every client has a dupe content problem in one form or another, so this session should be interesting.
They’ve decided to go without slides and make this more free-form with a lot of Q & A.
Eric will start with a short presentation. He’s starting with the absolute basics, talking about a typical web page: top nav, left nav, body, etc. The stuff search engines care about is the body content – the stuff that usually lives in the middle of the page. Eric says the rest is ignored when it comes to duplicate content. This means all the different navs, headers, etc. are basically ignored in the duplicate content arena – because, let’s face it, they’re duplicated across every page of your site, and it would be a nightmare to remedy that, now wouldn’t it?
Dave Naylor jumps in with a question – which earns him the label of “Troll” from Eric – asking if you could shuffle elements of the page around in order to manipulate the duplicate content algorithm. Oh boy, how many of you have seen that happen? Luckily, Eric gets into this exact topic in a bit…
Eric goes on to say that retailers are a common source of duplicate content, because product descriptions are usually duplicated across hundreds or even thousands of sites. This leads into a discussion about “shingles”. No, not the skin disease… shingling is how search engines break a page’s text into overlapping blocks of words, which lets them recognize the same content even when someone reorganizes those blocks to make a page appear unique.
The example is two pages with the same blocks of text in different orders:

Page 1: The brown fox runs… The orange monkey jumps… The white cat crawls… The black dog sleeps…

Page 2: The white cat crawls… The black dog sleeps… The orange monkey jumps… The brown fox runs…
These are recognized by the search engines as the same blocks of text and labeled as duplicate content.
This type of thing is typical on product pages, where products are sorted one way at one URL and a different way at another URL. Search engines can tell it’s the same information in a different order.
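To make the idea concrete, here’s a rough sketch of w-shingling, the kind of technique the panel is alluding to: break each page’s text into overlapping word windows (“shingles”) and compare the sets. The sentences and the 3-word window size are illustrative choices, not anything the panel specified.

```python
def shingles(text, w=3):
    """Return the set of all w-word windows ("shingles") in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

page_a = ("The brown fox runs. The orange monkey jumps. "
          "The white cat crawls. The black dog sleeps.")
page_b = ("The white cat crawls. The black dog sleeps. "
          "The orange monkey jumps. The brown fox runs.")

# Reordering the blocks barely changes the shingle sets, so the two
# "pages" still score as near-duplicates.
print(round(jaccard(shingles(page_a), shingles(page_b)), 2))  # prints 0.65
```

Only the shingles that straddle a block boundary differ between the two pages; everything inside each block matches exactly, which is why reshuffling doesn’t fool this kind of comparison.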
Faceted navigation can produce this same shingle-type content: the content isn’t new just because you navigated to it differently – it’s the same information in a differently organized list. A solution for faceted navigation is to use the canonical tag to identify the true source of the content, rather than having dozens of pages serving the same info in the index.
Another common type of duplicate content is database substitution. I’ve always referred to it as “keyword find and replace”. This is like: Boston is a great place to live vs. New York is a great place to live vs. Los Angeles is a great place to live. It’s the same sentence for the most part; only the city name has been substituted. This is usually found on sites that are trying to geo-target their content, or widget sites that swap widgetA for widgetB to easily create “new” pages. This is a bad practice, and still many companies resort to this type of junk content generation.
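The pattern is easy to reproduce in miniature – one template, many near-identical pages. The template text and city list below are made up for illustration:

```python
# "Database substitution" content generation in miniature.
TEMPLATE = "{city} is a great place to live. Browse {city} homes and {city} jobs."

cities = ["Boston", "New York", "Los Angeles"]
pages = {city: TEMPLATE.format(city=city) for city in cities}

# Swap the city names back out and every "unique" page collapses
# into the exact same text -- which is roughly what the engines see.
masked = {pages[c].replace(c, "{city}") for c in cities}
print(len(masked))  # prints 1
```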
Nowadays, when it comes to search engines picking the winner, it seems that the following are what factor in:
- Trust authority
- Recency – how current is it
- Original author
So it no longer only depends on who the original author was, but also who has the most authority overall.
Now, here’s the useful part you’ve all been waiting for – Solutions to duplicate content:
No matter the type of duplicate content, there are only a handful of ways to fix the problem. If it’s internal duplication (content duplicating on your own site) such as URLs with parameters then you can use any of these solutions:
- Delete the duplicate page (the one with the parameter) and 301 redirect it to the master page
- Use the canonical tag
- Use Google Webmaster Tools or Bing’s equivalent to tell the engines to ignore the parameter
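For the canonical route, the tag goes in the head of each duplicate, parameterized page and points at the master URL. The URLs here are made up for illustration:

```html
<!-- In the <head> of the parameterized duplicate, e.g.
     http://www.example.com/widgets?sort=price (illustrative URL) -->
<link rel="canonical" href="http://www.example.com/widgets" />
```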
A couple of other, non-encouraged solutions:
- Use of the meta noindex tag – this is frowned upon, especially if used extensively.
- Blocking crawling via robots.txt – not highly recommended, because people often block more than they meant to, causing more harm than good to the site.
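The robots.txt over-blocking trap is worth seeing spelled out: Disallow rules match by URL prefix, which is exactly how people block more than they intended. The paths below are illustrative:

```
# robots.txt -- Disallow matches by URL prefix
User-agent: *
Disallow: /print    # blocks /print/, but ALSO /printers/ and /print-guide.html
```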
All great points by Eric. To go a little further, on my own tangent: if it’s external duplicate content (content that lives on other domains, out of your control), then other measures will be needed to clean up the mess. Common solutions are rewriting your content or, in extreme cases, going after the offending sites to get the duplicated text removed.
Dave is slated to talk about multi-lingual duplicate content, but he gets off topic discussing some of Eric’s points. Some of the takeaways he mentions:
- Remember that the canonical tag is a hint, not a directive – the search engines don’t have to follow it. For example, when affiliates use the tag on your content, if the affiliate site is found to be more relevant it will often win the rankings even though you’re the original author.
- To test for shingles, he recommends regularly doing long-tail queries in the search engines to see if other sites are showing up for your content. When they do, you are going to have a problem and will need to implement a solution.
- Noindex is a dangerous solution and not at all optimal in order to fix duplicate content. It can cause too many issues and the robots.txt might be a better solution – but it truly depends.
Once Eric steers Dave back to his original topic, they actually share some great information on multi-lingual duplicate content – that is, various international sites carrying the same content. You have to decide where you want it to rank, and then give the right ‘signals’ to the search engines to help it rank appropriately. What search engines don’t want is for you to try to get all your domains ranked in Google.com, for example. Signals include hosting location, TLDs, links from that country, and having Google Webmaster Tools set up properly. Dave suggests creating country sub-domains on your main site that 301 redirect over to the country-specific TLD sites. He says this can get messy… but it can be beneficial for search engines too.
Chris is given the floor to talk about some useful stats on duplicate content problems. He said that he had some team members do some research across 5 different verticals, on 100 different sites. Across all those sites, they saw that 93% of them had a duplicate content problem of one form or another. He said he was surprised by that stat, but honestly, I was not. I see it ALL the time!
He went on to say that they then looked to see how many sites were taking action to defend against duplicate content, and only 77% of the sites were. I’m actually surprised that it was that high. Most site owners I talk to are completely oblivious to the problem.
Content scraping is another source of duplicated content on the web. The panel agrees that the author tag isn’t necessarily the most optimal solution in preventing stolen content. Dave said to look at how you syndicate your content and take action accordingly. Again, they reiterate that the original author isn’t the one who wins anymore, but the more influential site. Dave said to limit how much of your content you send out on RSS. Send out a snippet instead, like 1000 words of your article and keep the supporting content on your site. The part that was sent out will lose value for you, but the fact that you have the supporting content will help your cause.
The panel also mentions that there are times when you syndicate that you’ll actually have the opportunity to use the canonical tag and anchor text to link back to your site, but it depends on the situation. Their advice for you: if the partner you are thinking of syndicating to won’t give you the canonical tag then steer clear of them. It’s your content and you can go elsewhere with it.
An audience member asks about an article that pertains to 2 different areas of their site, and what to do so it’s not duplicated. Katy answered that you don’t need a different URL for each version – instead, have one version of the article living on the site and link to that same article from both places. You’ll need to decide where you want the article to live primarily, and then link to it across the site wherever you feel it’s beneficial. If you are making slight changes to the article depending on where it appears, like changing the heading, then possibly use the canonical tag (the URLs would differ either completely or by parameters).
Good bit of info: Google sees URLs that differ only in upper and lower case as different URLs – hence, duplicate content. So set rules for URL structure and make sure the IT department knows them, so you can avoid the issue in the future.
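One defensive habit on the application side is to normalize case before a URL is ever emitted in a link or sitemap. Here’s a minimal sketch in Python; lowercasing the whole path is an assumed site policy, and it’s only safe if your server actually serves the same content (or redirects) regardless of path case:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Lowercase the scheme, host, and path so /Widgets and /widgets
    resolve to a single canonical URL string. The query string is left
    alone, since parameter values can be case-sensitive."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.lower(), parts.query, parts.fragment))

print(normalize_url("HTTP://Example.com/Widgets/Blue-Widget"))
# prints http://example.com/widgets/blue-widget
```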
Another audience member asks about mobile sites and the potential for duplicate content that could arise from that. There are two different ways to serve content for a mobile device:
1) Have the mobile content on the main site, with a browser detection for the specific device in order to serve the mobile content for mobile devices.
2) Create a mobile version of the site on a separate domain
The problem you want to avoid when running a main site and a mobile site is Google crawling and indexing the mobile site over your main site. Be careful with linking across both sites, and do not cross-canonical tag the sites – that will only confuse the search engines. Dave says to use an m. subdomain for your mobile site; he says these are the most readily recognized by Google. To avoid problems between the sites, use your robots.txt smartly on the mobile side. He also says the Google blog has a useful post about regular vs. mobile sites and how they are handled. [I tried to hunt down the link for this post and include it but had no luck, so if anyone knows the link please pass it along.]
Overall, duplicate content can be a complex problem. It’s important to identify where your duplicate content is appearing and then take action to get rid of it without damaging any of your SEO efforts. Handling internal duplicate content is sometimes easier because you have control over your own site. Use the canonical tag and 301 redirects where necessary, and then work with IT to ensure the back-end of your site is optimized to avoid creating further duplicate content in the future.