Getting Rid of Duplicate Content Issues Once and For All
No fancy intro here, just right to the content. The moderator for this panel is Rand Fishkin. Speakers are super funny Derrick Wheeler, Senior Search Engine Optimization Architect, Microsoft; Ben D’Angelo, Software Engineer, Google; and Priyank Garg, Director Product Management, Yahoo! Search. Rahul Lahiri, VP of Search Product Management, Ask, is a maybe. Hmm.
Ben D’Angelo is up first. He’s been with Google a little more than three years. I think that means he went to Google straight out of grade school.
What are duplicate content issues? There are actually multiple disjoint problems.
- Duplicate content within your site or sites:
  - Multiple URLs point to the same page or similar pages
  - Different countries (same language)
- Duplicate content across other sites:
  - Syndicated content
  - Scraped content
The guiding principle behind the search engines’ indexing is ONE URL for one piece of content. Why? Because users don’t like duplicates in results. It saves resources in Google’s index, leaving more room for other pages from your site. And it saves resources on their server. [So Ben is telling us to keep duplicate content low to save Google money? Man, that stock price must really be suffering.]
Sources of duplicate content:
- Multiple URLs pointing to the same page
  - www vs. non-www
  - Session IDs, URL parameters
  - Printable versions of pages
- Similar content on different pages
  - Manufacturer’s databases
  - Different countries
How does Google handle this? They cluster like content and pick the best representative. There are variations on this depending on where it is in the pipeline. Different filters are used for different types of duplicate content. In general, it’s just a filter and it’s not going to destroy your site.
The problem comes in when Google doesn’t choose the page you want or makes a mistake in clustering. You need to take back control.
Use 301 redirects for exact duplicates, like tracking URLs, and to solve www vs. non-www issue. You can also address exact duplicates in Google Webmaster Tools, but that only solves the problem for Google. He demos briefly.
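To make the www vs. non-www fix concrete, here is a minimal Python sketch of the decision a server would make before sending a 301. The host names and the `canonical_redirect` function are hypothetical, and the choice of the www form as the preferred one is an assumption for illustration, not something Ben prescribed.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration: www is the preferred host, and the bare
# domain serves exactly the same pages.
CANONICAL_HOST = "www.example.com"
ALIAS_HOSTS = {"example.com"}


def canonical_redirect(url):
    """Return (301, location) if url is on an alias host, else None."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in ALIAS_HOSTS:
        return None  # already canonical (or not our host): serve normally
    # Keep path, query and fragment; swap only the host.
    location = urlunsplit(parts._replace(netloc=CANONICAL_HOST))
    return 301, location
```

In practice this logic usually lives in the web server or framework configuration rather than in application code, but the rule is the same: one permanent redirect from every alias to the single preferred URL.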
For near duplicates, use noindex or block them with robots.txt. Things like printer pages and site clones should get this treatment.
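As a rough illustration of the robots.txt route, the sketch below blocks a hypothetical `/print/` directory and then checks the rule with Python’s standard `urllib.robotparser`. The path, the host name and the `is_crawlable` helper are all assumptions; your printer-friendly pages may live somewhere else entirely.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every crawler out of the printer-friendly copies.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, agent="*"):
    """True if the rules above allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)
```

Checking a handful of real URLs this way before deploying a new rule is a cheap guard against accidentally blocking the pages you do want indexed.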
Domains by country are a little different. Different languages are not duplicate content. Same language, different country? Don’t worry about it — the right one will usually be okay. You can geo-target in GWT or use different TLDs to help Google recognize where the content belongs. Best of all is creating unique content for that country.
Leave out URL parameters if you can. Put that data into a cookie instead.
In Webmaster Tools you can check for all sorts of other problems too, like duplicate Title and Meta data. Fix those things.
If another site has content that duplicates yours, there’s less that you can do.
Duplicate content from syndication should include a link back to your site to make the canonical origin clear. Another option is to syndicate different content than what you publish on your site. If you’re publishing content you have syndicated, manage your expectations.
Don’t worry about scrapers or proxies too much. They generally don’t affect your rankings. If you’re concerned, file a DMCA request or a spam report with Google.
Duplicate content best practices:
- Avoid duplicate content in the first place.
- Generate unique, compelling content for users.
- Don’t be overly concerned with duplicate content.
- Let us know about any issues at the Webmaster Help Forum.
You can always check out the Webmaster Central Blog and the Webmaster discussion group.
Priyank Garg is next up. He’s got a sore throat so he’ll be brief. His voice is all scratchy. Aw.
Much of this will be similar to Ben’s presentation — I’ll pull out the Yahoo-specific stuff. Like Google, Yahoo filters at several places in the pipeline. Session IDs and other “content neutral” parameters can really hurt your crawl queue. They might never get to the rest of your content because they’re crawling the same page over and over with a session ID. “Soft” 404 pages can also cause duplicate content problems. Repeated elements (perhaps with just a keyword replace) lead to problems.
Abusive dupes include scrapers/spammers, weaving and stitching, etc.
- Slurp supports wildcards in robots.txt.
- Yahoo Site Explorer allows you to delete URLs or an entire path from the index for authenticated sites.
- Use the robots-nocontent tag on non-relevant parts of a page.
  - Robots-nocontent can be used to mark out boilerplate content
  - Robots-nocontent can be used for syndicated content that may be useful to the user in context but not for search engines.
You can do dynamic URL rewriting in Site Explorer. Tell them which parameters are content neutral for your sites:
- Ability to indicate which parameters to remove from your site’s URLs
- More efficient crawl with fewer duplicates
- Better site coverage as fewer resources are wasted on duplicates
- Fewer risks of crawler traps
- Cleaner URL, easier for user to read and more likely to be clicked
- Better ranking due to reduced link juice fragmentation: it’s equivalent to 301ing all the duplicates back to one URL, and it saves time because they don’t have to crawl the duplicates
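The idea behind dynamic URL rewriting, dropping parameters that don’t change the content, can be sketched in a few lines of Python. This is not how Site Explorer implements it; the parameter names and the `strip_neutral_params` function are hypothetical, chosen only to show the effect.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names for parameters that never change what the page shows.
CONTENT_NEUTRAL = {"sessionid", "sid", "ref"}


def strip_neutral_params(url):
    """Return url with the content-neutral query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in CONTENT_NEUTRAL
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Two visits with different session IDs collapse to the same URL, which is exactly the deduplication the crawler is after.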
Derrick Wheeler is up. Here’s a bit of vintage Derrick for you all: “This crowd is a perfect Web site. You’re all unique. I would crawl, index and rank all of you.” Rand interjects “That’s dirty.” Derrick: “But I wouldn’t click or take action.” Hee.
Final points (he likes to get these done first):
- Consider search engine crawler detection
- Know your parameters
- Link to URLs with parameters always in the same order
- Dig deep into search results for your domain
- Exclude duplicates by robots.txt first, Meta Robot exclusion second, and nofollow link attribute last
- Get a regular crawl report of your Web site
  - Request a tab file that includes: referring URL, fetched URL, redirect path with type, landing URL with status code, Title, Meta Description, Meta Keywords
  - Open file using Excel 2007, sort by Title then landing URL
  - Review suspect URLs to look for dupes
- Focus on your strengths
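If Excel isn’t your thing, the crawl-report review from the list above can be roughed out in Python. This sketch assumes a tab-separated report with columns named `landing_url` and `title`; those column names and the `suspect_duplicates` function are made up for the example, so adjust them to whatever your crawl tool actually exports.

```python
import csv
import io
from collections import defaultdict


def suspect_duplicates(tsv_text):
    """Group landing URLs by Title and keep titles shared by 2+ URLs."""
    by_title = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        by_title[row["title"]].add(row["landing_url"])
    # Sorted by Title, then landing URL, mirroring the spreadsheet sort.
    return {
        title: sorted(urls)
        for title, urls in sorted(by_title.items())
        if len(urls) > 1
    }


# Tiny made-up report, just to show the shape of the input.
SAMPLE = (
    "landing_url\ttitle\n"
    "http://www.example.com/shoes\tRed Shoes\n"
    "http://www.example.com/shoes?sort=price\tRed Shoes\n"
    "http://www.example.com/hats\tBlue Hats\n"
)
```

A shared Title is only a hint, not proof of duplication, so the output is a list of URLs to review by hand, just as Derrick describes.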
Look for spider traps: links that add a parameter and create a new page every time you click back and forth.
Make sure that when you’re creating sites for users, you still avoid spider traps. Even if you don’t think the search engines need to index that part of the site, you have other pages they may never get to because they’re busy with your trap.
Document why you’re doing things. One site removed session IDs for search engines and got 10 million pages indexed. Down the line, someone forgot why it had been done, started giving session IDs to the engines again, and their indexed pages plummeted.
Look for things that might be causing problems, like dynamic breadcrumbs, based on how someone clicked through the site (Brookstone does this), related products, etc. They might be helpful for users but you’re probably going to get into trouble. Make your internal linking consistent and useful. Some products might be able to live in multiple categories, but you need to make a decision.
Anytime you see related, sort or compare, think “possible duplicate content”. When you see “select region” or “sign in”, think duplicate content. Disallow those pages in your robots.txt. “Email an article”, “send to a friend” — think duplicate content.
Once you screw up the parameter order, it’s hard to fix. Keep it consistent.
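One way to keep the order consistent is to never build query strings by hand. A minimal sketch, assuming a hypothetical `build_link` helper that every template goes through and alphabetical order as the house rule:

```python
from urllib.parse import urlencode


def build_link(path, **params):
    """Build an internal link with query parameters in one fixed order."""
    # Sorting by parameter name means the same parameters always
    # produce the same URL, whatever order the caller passed them in.
    query = urlencode(sorted(params.items()))
    return f"{path}?{query}" if query else path
```

Alphabetical order is an arbitrary choice here. Any fixed order works, as long as every link on the site uses the same one.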
Use absolute links, not relative links, especially when switching between http:// and https://. Other people could link to you with https:// as well and you can’t really do anything about that.
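To see why relative links bite here, consider a relative link on a secure page: it resolves to an https:// copy of a page that also lives on http://. The sketch below resolves an href against the current page and then pins internal links to one scheme and host. The host, the preferred scheme and the `absolute_link` helper are assumptions for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumptions for illustration: regular pages live on http://www.example.com.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"


def absolute_link(page_url, href):
    """Resolve href against page_url; pin internal links to one scheme and host."""
    resolved = urlsplit(urljoin(page_url, href))
    if resolved.netloc.lower() != CANONICAL_HOST:
        return resolved.geturl()  # external link: leave it alone
    return urlunsplit(resolved._replace(scheme=CANONICAL_SCHEME))
```

Pages that really must be secure, such as checkout, would need an exception to this rule. The point is only that the scheme is chosen deliberately instead of inherited from whatever page the visitor happens to be on.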
Priyank suggests going after the low-hanging fruit. Try the dynamic URLs first so that you can see the benefit right away.
Brent Payne asks: How do you credit a story properly when you’re the Chicago Tribune? Can I get a link attribute or something? Just linking back doesn’t work. Google tells me it’s not a big deal but it is.
There’s not so much that the reps can say to that. They’re trying and he’s already doing the right thing. Poor Brent.
Derrick doesn’t think there is a solution right now. (He also reminded everyone that he’s an in-house SEM, not a search engine representative.)
How detrimental are different link IDs?
Priyank: Every different URL linking to the same content is duplicate content. That’s why you should use dynamic URL rewriting.
Ben: We try to handle that automatically. We might have to crawl the page once but we try to learn which parameters don’t affect the page content.
[Most of these questions are site specific, so I'm skipping them.]