Avoiding Duplicate Content: A Chat with Eric Enge
Duplicate content is a sometimes-confusing issue for site owners, many of whom don’t even know why their site has duplicate content and how it got there. Eric Enge of Stone Temple Consulting speaks frequently on duplicate content, how to identify it and how to avoid it.
His upcoming session at SES Chicago next week is all about duplicate content issues, so I recently caught up with Eric to chat about how to identify duplicate content, how it happens by accident, what the consequences are, and what to do about mirror sites.
Jessica Lee: What’s the fastest and easiest way people can identify duplicate content issues?
Eric Enge: If I were short on time and wanted to check for duplicate content, here are a few things I would do:
- A great way to start is to see how many pages Google reports it has indexed for your site in Google Webmaster Tools. Then compare this to how many pages you think your site has. If you think you have 500 and Google reports 2,000, that is a good sign you have a potential problem.
- Use Google Webmaster Tools to identify pages with duplicate Title tags and duplicate Meta Descriptions. This won’t catch everything, but it gives you a great start on the process. Note that you can get false positives here, where pages have the same Title tag but are not actually duplicates of one another. However, such pages are problems too, so you should address them anyway. You can also do this with site crawling tools such as the Screaming Frog SEO Spider: export the data to a spreadsheet, then sort the column showing the page titles.
- The other easy thing to do is to ask yourself a series of questions that are sure indicators of a potential duplicate content problem. Some example questions are:
- Do I have a canonical redirect from http://yourdomain.com to http://www.yourdomain.com? Failing to implement a canonical redirect remains a common mistake.
- Do I have print pages on my site? If so, you should implement a rel=canonical tag from your print page back to the normal HTML version of the page.
- Do I offer alternate sort orders for the content on some of my pages? These are also seen as duplicate content. If so, you should implement a rel=canonical tag from each alternate sort-order page back to the default version of the page.
- Do I offer filters for my pages? An example of this is if I have a page showing a list of men’s running sneakers, and I offer users options such as seeing only the pages for men’s running sneakers under $60. These are also seen as duplicate content and should be resolved with a rel=canonical tag.
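As a sketch of the rel=canonical approach described above, the print, sort-order, and filtered variants of a page can each point back to the main version with a single link element in the head. The URLs here are illustrative, following the yourdomain.com convention used elsewhere in this interview:

```html
<!-- On http://www.yourdomain.com/mens-running-sneakers?print=1
     (and likewise on sort-order and filtered variants), tell
     search engines which version is the main one: -->
<head>
  <link rel="canonical" href="http://www.yourdomain.com/mens-running-sneakers" />
</head>
```

The search engines then consolidate the variants into the canonical URL rather than treating each one as a separate duplicate page.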
What are some ways duplicate content happens by accident?
There are a few different ways that this can happen. The most common results from problems with the content management system (CMS) setup. Content management systems usually use a non-friendly URL such as http://www.yourdomain.com/index.php?page=1234 instead of a more friendly URL such as http://www.yourdomain.com/mens-running-sneakers.
The CMS may allow you to configure the more friendly URL, but in some situations the site may still link to the parameter-based version of the URL. A simpler version of this problem is the URL the CMS uses to reference the home page, or other folder-level pages on the site. What I mean by this is that Web servers have a concept of a default file.
For example, when you refer to the home page of your site as http://www.yourdomain.com, the web server needs to know what file on the server contains the content for that page, and this file is called the default file.
That file may be something like “index.html”, so a reference to http://www.yourdomain.com/index.html will also bring up the home page.
It is common practice to refer to your home page by just specifying the domain name, i.e. http://www.yourdomain.com. However, in many cases the CMS will refer to the page as http://www.yourdomain.com/index.html, and the two different URLs for the home page are seen by the search engines as duplicates of one another, even though they are the same page.
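On an Apache server, both problems above — the missing canonical redirect from the non-www hostname and the duplicate index.html version of the home page — are typically fixed with 301 redirects. A minimal .htaccess sketch, assuming mod_rewrite is enabled and yourdomain.com stands in for your real domain, might look like:

```apacheconf
RewriteEngine On

# 301 the non-www hostname to the www version (canonical redirect)
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]

# 301 any request for index.html back to the folder URL,
# so /index.html and / are not indexed as separate duplicates
RewriteRule ^(.*)index\.html$ /$1 [R=301,L]
```

A 301 is the right tool here because, unlike rel=canonical, it also consolidates the link value of the duplicate URL and keeps visitors on a single address.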
Another way that duplicate content is accidentally created is via the implementation of affiliate programs.
Many affiliate programs use a simple scheme where the affiliate sites link using an affiliate ID parameter such as: http://www.yourdomain.com?affid=12.
This reference to the home page is also potentially seen as a duplicate page by the search engines.
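One way to handle this — a sketch of the rel=canonical approach, not the only option — is to have the home page always declare its clean URL as canonical, so the parameter-tagged variants collapse into it:

```html
<!-- Served on http://www.yourdomain.com?affid=12 and on the clean
     home page alike; the affiliate-parameter variants are then
     consolidated into the one canonical URL: -->
<head>
  <link rel="canonical" href="http://www.yourdomain.com/" />
</head>
```

Because the tag is self-referential on the clean URL and points variants back to it, the affiliate tracking parameter keeps working for visitors while the search engines see a single page.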
One last example is that of e-commerce sites that use manufacturer-supplied descriptions for the products on their pages. The problem with those descriptions is that the manufacturer gives the same content to every e-commerce site that carries its products. This is common practice, but a major source of duplicate content. When you have a site with thousands of products, avoiding this problem can be very challenging.
Have you seen any variances in consequences for different types of duplicate content?
For the most part, duplicate content operates as a filter, not a penalty. When a search engine sees multiple copies of the same piece of content, it will choose only one of them to show in the search results. So if there are six copies of the same article, only one of them will normally show up in the search results for a related search phrase.
In many cases, this filtering occurs prior to indexing the content, so the duplicates don’t even get into the index. This can happen when there are one or more obvious duplicates within the same site.
When you have content duplicated across different sites, the treatment might be a bit different. This is easiest to illustrate with an example. Let’s say you have a piece of content that you publish on your site, but you also allowed The New York Times to syndicate it.
In this scenario, you are clearly the author of the content, so in theory you would hope that the search engines would show your page rather than The New York Times version in the search results.
The issue is that Google may see the article on The New York Times first and think that they are the original publisher. In addition, The New York Times site is likely to be a lot more authoritative than yours. This may not seem right or fair, but there is a counter-argument to this way of thinking.
Imagine a searcher who is known to read articles on NYTimes.com a lot and has never been to your site. They might feel more comfortable reading the article on NYTimes.com, and this might cause The New York Times version of the article to outrank yours.
The situation gets a bit more complex when the content is not an exact duplicate. Perhaps one block of content is duplicated between two pages, but there is other content on those two pages which is not duplicated.
If the duplicated portion is a large part of the total content on the page, it may be treated the same as an exact duplicate. An example of this is city pages with only minimal changes in the content from one page to the next.
There is yet one other type of duplicate. This is where some content is duplicated across pages, and the percentage is not that large, but the duplicated content is the only content on the page that relates to the search query. This is the type of filter that the search engines might implement only at query time. The content is not 100 percent duplicated, so it makes sense to have both pages in the index, but for a query that matches only the duplicated content, one of the pages may be filtered out of the results.
This is a longer answer than you expected, right?
I still have one more situation to talk about. I have encountered an actual penalty situation that resulted from duplicate content. This was a website that collected and categorized third-party articles by industry. The value to the end user was that the site allowed users to scan all articles by industry in a way that was much simpler than would otherwise be possible.
However, the total percentage of pages on the site that were duplicates was large, about 70 percent. The site ranked fine for a long period of time, but experienced a large drop in traffic at one point. We helped them recover from the penalty by de-indexing a large percentage of the duplicated articles.
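De-indexing duplicated pages, as in the recovery described above, is typically done with a robots meta tag on each duplicate. This is a generic sketch — the interview does not name the actual pages or method used on that site:

```html
<!-- Placed in the head of each duplicated article page; search
     engines drop the page from their index but can still follow
     the links on it: -->
<head>
  <meta name="robots" content="noindex, follow" />
</head>
```

The noindex pages remain available to users browsing the site; they simply stop competing in the search index, which brings the site's percentage of indexed duplicates back down.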
Some verticals have businesses with multiple locations across the United States, which can produce mirror sites. What then?
To expand on the example I mentioned previously, imagine you have a pizza shop with locations in 10 different cities. This is something that I discussed in my recent interview with Matt Cutts. I asked him about this exact scenario, and here is what he said:
“Where people get into trouble here is that they fill these pages with the exact same content on each page. ‘Our handcrafted pizza is lovingly made with the same methods we have been using for more than 50 years …’, and they’ll repeat the same information for 6 or 7 paragraphs, and it’s not necessary. That information would be great on a top-level page somewhere on the site, but repeating it on all those pages does not look good. If users see this on multiple pages on the site they aren’t likely to like it either”.
But this then leads you to the next question, which is, well, what do I do then? That is a great question. The reality is that Google would prefer that you find a way to put differentiated content on those pages. They would also prefer that you avoid replicating large blocks of text across those pages. This can often be quite challenging, but to avoid issues with duplicate content, you have to find a way.
What can you put on those pages that is unique? Directions to the local facility are one example. Specials specific to your local shop are another. You can also consider writing some content about the history of pizza for your city pages, but show a different piece of that history on each page. A lot of creativity may be required, but if you want to avoid duplicate content concerns (or low-quality content concerns), you have to find a way to solve it.
You can connect with Eric on Twitter and Google+. Follow his blog for lots of goodies, including insightful interviews with industry leaders. You can learn more about Eric and his company at Stone Temple Consulting. If you’re headed to SES Chicago, don’t miss his session on duplicate content and multiple site issues.