How Do I Get Rid of Extra Pages in the Google Index?
Let’s say you have an ecommerce website with thousands of products, each with variations in sizes and colors. You use the Google Search Console Index Coverage report to see a list of Indexed pages in the Google search results for your website.
To your surprise, you see way more pages than the website should have. Why does that happen, and how do you get rid of them?
I answer this question in our “Ask Us Anything” series on YouTube. Here’s the video, and then you can read more about this common problem and its solution below.
- Why do these “extra” webpages show up in Google’s index?
- How do I get rid of “extra” webpages in Google’s index?
- FAQ: How can I eliminate extra pages from my website’s Google index?
Why do these “extra” webpages show up in Google’s index?
This issue is common for ecommerce websites. “Extra” webpages show up in Google’s index when your ecommerce website generates extra URLs.
Here’s how: When shoppers use filters or URL parameters to select a certain size or color of a product, the site typically generates a new URL for that choice automatically.
Each new URL is treated as a separate webpage. Even though it’s not a “separate” product, that webpage can be indexed just like the main product page if Google discovers it via a link.
When this happens, and you have a lot of size and color combinations, you may end up with many different URLs for one product. If Google discovers those URLs, you may end up with multiple webpages in the Google index for one product.
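To see how quickly this adds up, here is a minimal Python sketch. The product URL, the parameter names (size and color), and the option values are all made up for illustration; your site's URL scheme will likely differ.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical product page and option values, for illustration only.
BASE_URL = "https://example.com/dresses/1234"
SIZES = ["s", "m", "l", "xl"]
COLORS = ["red", "blue", "black"]


def variation_urls(base_url, sizes, colors):
    """Return one URL per size/color combination of a single product."""
    return [
        f"{base_url}?{urlencode({'size': size, 'color': color})}"
        for size, color in product(sizes, colors)
    ]


urls = variation_urls(BASE_URL, SIZES, COLORS)
print(len(urls))  # 4 sizes x 3 colors = 12 URLs for one product
print(urls[0])    # https://example.com/dresses/1234?size=s&color=red
```

One product with four sizes and three colors already yields 12 crawlable URLs. Multiply that by thousands of products and the index count can run far past the number of real product pages.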
How do I get rid of “extra” webpages in Google’s index?
Using the canonical tag, you can point all of those product variation URLs to the same original product page. That is the right way to handle near-duplicate content, such as color variations.
Here’s what Google has to say about using the canonical tag to resolve this issue:
A canonical URL is the URL of the page that Google thinks is most representative from a set of duplicate pages on your site. For example, if you have URLs for the same page (example.com?dress=1234 and example.com/dresses/1234), Google chooses one as canonical. The pages don’t need to be absolutely identical; minor changes in sorting or filtering of list pages don’t make the page unique (for example, sorting by price or filtering by item color).
Google goes on to say that:
If you have a single page that’s accessible by multiple URLs, or different pages with similar content … Google sees these as duplicate versions of the same page. Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled less often.
If you don’t explicitly tell Google which URL is canonical, Google will make the choice for you or might consider them both of equal weight, which might lead to unwanted behavior …
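As a rough illustration of the canonical approach, the Python sketch below builds the tag that each variation page would carry in its <head>. It assumes the variations differ from the main product page only by their query string, which may not hold on your site. The names canonical_url and canonical_link_tag are hypothetical helpers, not part of any library.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Drop the query string and fragment so every variation maps to one URL.

    Assumption: variations differ from the main product page only by their
    query string. Adjust this if your site encodes variations differently.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url):
    """Build the <link rel="canonical"> element for a page's <head>."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


# Hypothetical variation URL, for illustration only.
print(canonical_link_tag("https://example.com/dresses/1234?size=m&color=red"))
# <link rel="canonical" href="https://example.com/dresses/1234">
```

Every size and color variation of the product produces the same tag, so they all tell Google that the main product page is the preferred version.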
But what if you don’t want those “extra” pages indexed at all? In my opinion, the canonical solution is the way to go in this situation.
But there are two other solutions that people have used in the past to get the pages out of the index:
- Block pages with robots.txt (not recommended, and I’ll explain why in a moment)
- Use a robots meta tag to block individual pages
The problem with using robots.txt to block webpages is that blocking a page does not mean Google will drop it from the index.
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google.
Also, a disallow directive in robots.txt does not guarantee that a bot will not crawl the page, because robots.txt is a voluntary system. However, it would be rare for the major search engine bots to ignore your directives.
Either way, this is not an optimal first choice, and Google recommends against it.
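To make the crawl-versus-index distinction concrete, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt rules and the URLs are hypothetical, and the sketch uses a simple path prefix rule rather than anything specific to a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dresses/1234/variations/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs: one under the disallowed path, one outside it.
blocked = "https://example.com/dresses/1234/variations/red-m"
allowed = "https://example.com/dresses/1234"

print(parser.can_fetch("Googlebot", blocked))  # False
print(parser.can_fetch("Googlebot", allowed))  # True
```

The only question robots.txt answers is whether a well-behaved crawler may request a URL. Nothing in it says whether that URL is, or stays, in the index, which is why it is the wrong tool for removal.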
Robots Meta Tag Option
Here’s what Google says about the robots meta tag:
The robots meta tag lets you utilize a granular, page-specific approach to controlling how an individual page should be indexed and served to users in Google Search results.
Place the robots meta tag in the <head> section of any given webpage. Then, either prompt the bots to recrawl that page by submitting an XML sitemap, or wait for them to recrawl it naturally (which could take up to 90 days).
When the bots come back to crawl the page, they will encounter the robots meta tag and understand the directive to not show the page in the search results. For that to work, the page must not be blocked in robots.txt; a bot that cannot crawl the page never sees the tag.
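A common form of the tag is <meta name="robots" content="noindex">. If you want to confirm that your variation pages actually carry it, a check along these lines may help. This is only a sketch using Python's standard html.parser module; the sample page is made up, and RobotsMetaParser and has_noindex are hypothetical names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in robots (or googlebot) meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in ("robots", "googlebot"):
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower()
                for part in content.split(",")
                if part.strip()
            )


def has_noindex(html):
    """Return True if the page's meta tags include a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical variation page, for illustration only.
PAGE = """<html><head>
<title>Red dress, size M</title>
<meta name="robots" content="noindex, follow">
</head><body></body></html>"""

print(has_noindex(PAGE))  # True
```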
So, to recap:
- Using the canonical tag is the best and most common solution to the problem of “extra” pages being indexed in Google — a common issue for ecommerce websites.
- If you don’t want pages to be indexed at all, consider using the robots meta tag to tell the search engine bots how you want those pages handled.
Still confused or want someone to take care of this problem for you? We can help you with your extra pages and remove them from the Google index for you. Schedule a free consultation here.
FAQ: How can I eliminate extra pages from my website’s Google index?
The issue of extra pages in your website’s Google index can be a significant roadblock. These surplus pages often stem from dynamic content generation, such as product variations on ecommerce sites, creating a cluttered index that affects your site’s performance.
Understanding the root cause is crucial. Ecommerce websites, in particular, face challenges when various product attributes trigger the generation of multiple URLs for a single product. This can lead to many indexed pages, impacting your site’s SEO and user experience.
Employing the canonical tag is the most reliable solution to tackle this. The canonical tag signals to Google the preferred version of a page, consolidating the indexing power onto a single, representative URL. Google itself recommends this method, emphasizing its effectiveness in handling near-duplicate content.
While some may consider using robots.txt to block webpages, it’s not optimal. Google interprets robots.txt as a directive to control crawler access, not as a tool for removal from the index. In contrast, the robots meta tag offers a more targeted approach, allowing precise control over individual page indexing.
The canonical tag remains the go-to solution. However, if there’s a strong preference for total removal from the index, the robots meta tag can be a strategic ally. Balancing the desire for a streamlined index with SEO best practices is the key to optimizing your online presence effectively.
Mastering the elimination of extra pages from your website’s Google index involves a strategic combination of understanding the issue, implementing best practices like the canonical tag and considering alternatives for specific scenarios. By adopting these strategies, webmasters can enhance their site’s SEO, improve user experience and maintain a clean and efficient online presence.
- Identify Extra Pages: Conduct a thorough audit to pinpoint all surplus pages in your website’s Google index.
- Determine Root Cause: Understand why these pages are generated, focusing on dynamic content elements.
- Prioritize Canonical Tag: Emphasize the use of the canonical tag as the primary solution for near-duplicate content.
- Implement Canonical Tags: Apply canonical tags to all relevant pages, specifying the preferred version for consolidation.
- Check Google Recommendations: Align strategies with Google’s guidelines, ensuring compatibility and adherence.
- Evaluate Robots.txt Option: Understand the limitations and potential drawbacks before considering robots.txt.
- Deploy Robots Meta Tag: Use robots meta tags strategically to control indexing on specific pages if necessary.
- Balance SEO Impact: Consider the impact of each solution on SEO and user experience for informed decision-making.
- Regular Monitoring: Establish a routine to monitor index changes and assess the effectiveness of implemented strategies.
- Iterative Optimization: Continuously refine and optimize strategies based on evolving site dynamics and Google algorithms.
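The first two steps in this list (finding the extra pages and seeing where they come from) can be roughed out in a few lines of Python. The sketch below groups a list of indexed URLs by their path and flags any page indexed under more than one URL. The sample URLs are invented, and it assumes that variations show up as query strings on the same path, which may not match how your site builds its URLs.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical sample of indexed URLs, for illustration only.
INDEXED_URLS = [
    "https://example.com/dresses/1234",
    "https://example.com/dresses/1234?color=red",
    "https://example.com/dresses/1234?color=red&size=m",
    "https://example.com/shoes/42",
]


def group_by_page(urls):
    """Group URLs by scheme, host, and path, ignoring query strings."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[f"{parts.scheme}://{parts.netloc}{parts.path}"].append(url)
    return groups


def extra_pages(urls):
    """Return {page: [indexed URLs]} for pages indexed under more than one URL."""
    return {
        page: variants
        for page, variants in group_by_page(urls).items()
        if len(variants) > 1
    }


for page, variants in extra_pages(INDEXED_URLS).items():
    print(page, len(variants))
# https://example.com/dresses/1234 3
```

Pages that appear here are candidates for a canonical tag pointing at the main product URL.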
Continue refining and adapting these steps based on your website’s unique characteristics and changing SEO landscapes.