How to improve your site indexation – XML Sitemaps Case Study
Google Webmaster Tools (GWTs) is always a good place to start when optimising a site for search engines (SEs). It gives you an indication of site health by highlighting any issues or problems your site might have. One important indicator is the number of pages you have indexed in Google and Bing search results. Why is this important? More pages in the search engine index can mean more ranking opportunities, more authoritative, on-topic content and more supporting content for key landing pages, all of which can result in higher rankings and more SEO traffic.
How can you improve your indexation? XML Sitemaps are a good starting point
An XML Sitemap provides a list of your pages that supplements the search engines’ crawl-based mechanisms for discovering new or updated content. Many sites have a poor information architecture (IA), which does not always allow SEs to discover pages when crawling the site, so XML Sitemaps can help SEs find your pages. Please note that a crawlable IA is always recommended, as having XML Sitemaps doesn’t guarantee that all of your site’s URLs will be indexed. Google uses the data in your Sitemaps to learn about your site’s structure and may use them as a factor in determining the canonical version of your URLs. XML Sitemaps are very important and, depending on the nature of your site, specialised Sitemaps for News, Image, Video and Mobile content may really improve your indexation results.
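As a rough illustration of what a standard XML Sitemap contains, here is a minimal Python sketch that builds one with the standard library. The function name build_sitemap and the example.com URLs are placeholders of our own, not part of any tool mentioned in this article. Only the urlset, url, loc and lastmod elements come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build a sitemap from (loc, lastmod) pairs; lastmod may be None.

    Hypothetical helper for illustration only. Returns UTF-8 encoded bytes.
    """
    ET.register_namespace("", SITEMAP_NS)  # serialise without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:  # lastmod is optional in the protocol
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Calling build_sitemap([("https://www.example.com/", "2024-01-15")]) returns bytes that you could write to a sitemap.xml file and submit through your Webmaster Tools account.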
News XML Sitemaps
If you regularly publish unique news-related content and don’t have a News Sitemap, you may have only a small number of articles indexed in Google News (only the ones that Google discovers by itself). If you are a publisher, this could be a huge problem (please note also that your site needs to be reviewed before it is accepted into Google News). News Sitemaps use additional News-specific tags and can only contain URLs for articles that have been published in the last two days. News Sitemaps need to be submitted in addition to your generic XML Sitemaps.
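The two-day rule and the News-specific tags can be sketched as follows. This is an illustrative Python example, not an official generator. The function name build_news_sitemap, the article dictionary keys (loc, title, published) and the example.com URLs are assumptions of ours. The tag names follow Google’s News Sitemap schema.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_news_sitemap(articles, publication, language="en", now=None):
    """Build a News Sitemap, skipping articles older than two days.

    articles: dicts with hypothetical keys "loc", "title" and "published"
    (a timezone-aware datetime). Returns UTF-8 encoded bytes.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=2)
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for article in articles:
        if article["published"] < cutoff:
            continue  # too old for a News Sitemap
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = article["loc"]
        news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
        pub = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
        ET.SubElement(pub, f"{{{NEWS_NS}}}name").text = publication
        ET.SubElement(pub, f"{{{NEWS_NS}}}language").text = language
        date = ET.SubElement(news, f"{{{NEWS_NS}}}publication_date")
        date.text = article["published"].isoformat()
        ET.SubElement(news, f"{{{NEWS_NS}}}title").text = article["title"]
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Because older articles are filtered out on every run, a file generated this way would need to be rebuilt regularly so that it always reflects the most recent two days of publishing.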
Video XML Sitemaps
Similarly, video XML Sitemaps tell Google exactly where the video content is on your site and what it is. Video content includes web pages that embed videos, URLs of video players and URLs of raw video content. If Googlebot cannot discover the video content at the URLs provided, it will ignore them. Video XML Sitemaps can help Googlebot to find your video content. Your video Sitemap must include at least a link to a landing page for each video, as well as some essential information required for indexing the video. The information provided in the metadata can improve Google’s ability to include videos in search results. If the text on your video’s page differs from the text you supply in the Sitemap, Google might use the text on the page instead. Please note that Google also supports an mRSS feed instead of (or in addition to) a Sitemap. Bing recommends the use of mRSS feeds to get video content indexed. Please note that Bing doesn’t pick up mRSS feeds automatically: you need to contact the Bing support team to request consideration of your content feed. Bing also supports Google’s Sitemap video extensions, as long as the Sitemap contains the basic requirements such as the title, the video landing page and a link to the actual content. Please make sure you include the <video:content_loc> tag, which points to the video itself (this is currently an optional tag in Google).
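To make the structure of a video entry more concrete, here is a small Python sketch that writes the landing page, title, description, thumbnail and the <video:content_loc> tag for each video. The function name build_video_sitemap and the dictionary keys (page, title, description, thumbnail, content) are hypothetical names chosen for this example. Please check the current Google and Bing documentation for the full list of required and optional tags.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"


def build_video_sitemap(videos):
    """Build a video Sitemap from a list of dicts.

    Each dict uses hypothetical keys: "page" (landing page URL), "title",
    "description", "thumbnail" and "content" (URL of the video file).
    Returns UTF-8 encoded bytes.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("video", VIDEO_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for item in videos:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # The landing page that hosts the video.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = item["page"]
        video = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
        ET.SubElement(video, f"{{{VIDEO_NS}}}thumbnail_loc").text = item["thumbnail"]
        ET.SubElement(video, f"{{{VIDEO_NS}}}title").text = item["title"]
        ET.SubElement(video, f"{{{VIDEO_NS}}}description").text = item["description"]
        # Always written here, so the file also meets Bing's basic requirements.
        ET.SubElement(video, f"{{{VIDEO_NS}}}content_loc").text = item["content"]
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```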
Image XML Sitemaps
To get your images indexed, you can use Google’s image extensions for Sitemaps to give Google information about your images. This allows Google to discover images that might otherwise only be reachable via JavaScript forms, and lets you indicate the most important images on each page that you want included in Google.
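An image entry simply nests one or more image locations under the page URL. The Python sketch below illustrates this. The function name build_image_sitemap and the input format (a dictionary mapping each page URL to a list of image URLs) are our own assumptions for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages):
    """Build an image Sitemap from {page_url: [image_url, ...]}.

    Hypothetical helper for illustration. Returns UTF-8 encoded bytes.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            # One image:image element per image on the page.
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Listing only the images you care most about for each page is one way to signal which ones matter.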
Mobile XML Sitemaps
Mobile XML Sitemaps are specific to mobile web content only. This means that any URLs listed in your mobile Sitemaps that serve non-mobile content will be ignored by Google. Recently John Mueller commented on a thread in the Google Webmaster Forum that mobile XML Sitemaps are only for older mobile sites, not the new ones. By older mobile sites, Google means sites built for traditional mobile phone browsers, not smartphones (which Google generally treats the same as desktop browsers, given their advanced capabilities). Using special CSS/HTML templates for smartphones is fine and does not require submitting those pages via a mobile XML Sitemap. So if a mobile site needs to be accessible from old internet-enabled phones, you need to make a special version for Googlebot-Mobile and submit a mobile Sitemap.
Other specialised XML Sitemaps
More recent specialised Sitemaps include Geo Sitemaps (which enable webmasters to publish geospatial content to Google in order to make it searchable in Google Earth) and Google Code Search Sitemaps (which tell Google about the source code on your site so that it can appear in Google’s code search).
Please note that Google supports XML Sitemaps that include multiple content types in the same file.
The structure of a Sitemap with multiple content types is similar to a standard XML Sitemap, with the additional ability to contain URLs referencing different content types (videos, images, mobile URLs, code or geo information). Please note that this does not apply to Google News Sitemaps.
Can XML Sitemaps affect your site rankings?
Not really. While having a higher number of pages indexed in the search engines may increase your chances of ranking for a larger number of keywords, SEs do not rank sites more favourably just because they use XML Sitemaps. You still need to apply SEO best practices in order to make your pages rank higher in the SERPs.
XML Sitemaps tips
A little trick is to break down your XML Sitemaps into smaller, more manageable Sitemaps, for example by site area or site section. This allows you to monitor your indexation performance for each section of your site and to identify where indexation issues exist.
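One possible way to do this split is sketched below in Python. URLs are grouped by their first path segment, a Sitemap index file lists the per-section Sitemaps, and a small helper turns submitted and indexed counts into a percentage. All function names, the "root" bucket and the example.com URLs are assumptions made for this example. The sitemapindex and sitemap elements come from the sitemaps.org protocol.

```python
from collections import defaultdict
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def group_by_section(urls):
    """Group URLs by first path segment; top-level pages go to "root"."""
    sections = defaultdict(list)
    for url in urls:
        parts = [p for p in urlparse(url).path.split("/") if p]
        section = parts[0] if len(parts) > 1 else "root"
        sections[section].append(url)
    return dict(sections)


def build_sitemap_index(sitemap_urls):
    """Build a Sitemap index that points at the per-section Sitemaps."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


def indexation_rate(submitted, indexed):
    """Percentage of submitted URLs that are reported as indexed."""
    return 100.0 * indexed / submitted if submitted else 0.0
```

With one Sitemap per section, you can compare the submitted and indexed counts reported for each file and quickly see which section is lagging behind.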
Can duplicate content affect indexation?
Yes. Duplicate content can affect the crawl budget and therefore the number of pages a search engine will crawl each time it visits your site. As Matt Cutts said during his interview with Eric Enge:
“Imagine we crawl three pages from a site, and then we discover that the two other pages were duplicates of the third page. We’ll drop two out of the three pages and keep only one, and that’s why it looks like it has less good content. So we might tend to not crawl quite as much from that site”
“the fact that you had duplicate content and we discarded those pages meant you missed an opportunity to have other pages with good, unique quality content show up in the index”
Duplicate content is wasted crawl budget. Google is willing to crawl a certain number of pages for a site (according to PageRank and other factors), and if some of those pages are discarded as duplicates, that budget is wasted.
The canonical tag was introduced to eliminate self-created duplicate content in the index, and it allows you to specify the preferred version of a URL. This tag can have a huge impact on your site indexation and site performance. The canonical tag is supported by major SEs such as Google, Yahoo and Bing.
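As an illustration of how a preferred URL might be chosen for self-created duplicates, the Python sketch below strips tracking and session parameters, sorts the remaining ones and produces the canonical link element for the page head. The list of ignored parameters and both function names are assumptions for this example only. Which URL variants count as duplicates depends entirely on your own site.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change the page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def canonical_url(url):
    """Return a preferred version of a URL (illustrative rules only)."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    query.sort()  # same parameters in any order give the same URL
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # drop the fragment
    ))


def canonical_link_tag(url):
    """Return the link element to place in the head of the page."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'
```

Every duplicate variant of a page would then carry the same link element, pointing search engines to the single version you want indexed.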
Can crawl errors such as 404 errors affect indexation?
Yes. A large number of errors may affect your indexation results or may slow down the Google crawl rate of a site, which may increase the time needed for new content to appear in the index.
XML Sitemap Case study
We have been monitoring the indexation progress of a large site after implementing XML Sitemaps and canonical tags. Initially the site had just over 20% of its pages indexed. We re-submitted standard XML Sitemaps and fixed a large number of errors reported in our Webmaster Tools account. We also submitted specialised XML Sitemaps and implemented the canonical tag throughout the entire site, as the site had a large amount of duplicate content.
Graph 1 below shows indexation results in Google:
Indexation results jumped from 24% to 68%, and this percentage keeps growing, resulting in significant improvements in SEO traffic.
It’s fundamental to monitor your GWTs account and XML Sitemaps. Small changes can sometimes have a huge impact on your site, especially for large sites.