SEO Cloaking Techniques to Avoid in 2011
Head of Google Web Spam, Matt Cutts, took time away from Ozzie and Emmy (The Matt Cutts "Catts") at the end of 2010 to post a little titbit for webmasters and SEOs via Twitter, which I'm sure added to the hangovers for a few Black Hats over the holiday season.
"Google will [look] more at cloaking in Q1 2011. Not just page content matters; avoid different headers/redirects to Googlebot instead of users."
Cloaking is the technique used to present different content, layout, functionality or headers (a completely different page or partial components of the page, known as Mosaic cloaking) to a search engine spider than to a user's Web browser.
Ethical cloaking is not "black hat". However, in the past spammers have manipulated cloaking techniques (for clarity, let's refer to this as cloaking-spam) to game the (Google) algorithm. This is not a new phenomenon. In the beginning, the meta keywords tag was abused by spammers and as a result is no longer a ranking factor, and the <noscript> tag may also be treated with some suspicion as it too has been abused in the past (perhaps we should open a refuge for abused HTML elements...).
First off, let me say, that if at all possible, AVOID CLOAKING. Cloaking is a high-risk exercise that, if it must be implemented, should be done so in the appropriate ethical manner, adhering to Google's Webmaster Guidelines, to ensure that your website is not penalised or dropped from the index.
Unfortunately, some webmasters may not understand the repercussions, and inadvertently cloak content, links or entire websites without even realising. This article outlines some of the common on-site functionality that may be (mis)interpreted as cloaking-spam.
Keep in mind that Google is actively investigating instances of cloaking-spam and banning offending websites from its index. It is also following up detection of cloaking and unnatural links with notifications to webmasters via Webmaster Tools. Google is getting better and better at detecting cloaking-spam algorithmically; even IP delivery is not infallible, and of course Google always encourages your competitors to use the spam report if they detect something fishy about your page.
How could Google detect cloaking-spam?
Naturally, this leads to the question: how could a search engine gather and compare the two versions of a web page? Some methods may include:
Of course, the data gathering could be outsourced to a separate company, so that even IP-delivery cloaking would not recognise the crawler.
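One such method can be sketched in a few lines of Python: fetch the same URL once as a "browser" and once as "Googlebot" and compare the visible text of the two responses. This is purely illustrative; the user-agent strings and the similarity threshold are my own assumptions, not anything Google has published.

```python
import difflib
import re
import urllib.request

# Illustrative user-agent strings for the two fetches
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0"

def fetch(url, user_agent):
    """Fetch a page while presenting the given user-agent string."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def visible_text(html):
    """Crudely strip scripts, styles and tags so we compare roughly
    what a user would actually read."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(text.split())

def similarity(html_a, html_b):
    """0.0 = completely different visible text, 1.0 = identical."""
    return difflib.SequenceMatcher(
        None, visible_text(html_a), visible_text(html_b)).ratio()

# A large gap between the two versions is a cloaking signal, e.g.:
# if similarity(fetch(url, GOOGLEBOT_UA), fetch(url, BROWSER_UA)) < 0.9:
#     flag_for_review(url)
```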
Ethical uses of cloaking - Ensure you implement in an appropriate fashion for SEO
There are instances where a company may wish to provide different or additional information to its users. For example:
Ensure that you consider the SEO implications when using any of the methods or functionality mentioned above, as mis-configuration may result in cloaking-spam or may not be optimal for SEO.
Cloaking - Don't try this at home
Okay, so this is not a tutorial on how to cloak; it is a "2011 cloaking-spam no-no list" or at the very least, a heads up of techniques to avoid or issues to fix early on in 2011.
Some forms of cloaking are deliberate (such as IP delivery or user-agent cloaking); however, many forms of cloaking-spam may be accidental. The accidental types of cloaking-spam that inadvertently get you banned from Google are of utmost concern, as the webmaster may not be aware of the issue. Even large companies get it wrong sometimes.
We will investigate some of the most common cloaking-spam techniques below, so that webmasters and SEOs can make sure they are not present on their websites.
There are typically three ways that webmasters cloak content from either users or search engines:
Delivering different content based on the IP address of the requesting web browser or search engine spider. [IP Delivery is covered in more detail here.]
Reverse DNS & forward DNS
Reverse DNS and forward DNS lookups are not a form of cloaking but may be used to query the DNS records of a requesting IP address. Google provides details on how to verify Googlebot is who it claims to be.
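The round trip Google describes can be sketched with the Python standard library: reverse DNS the claimed Googlebot IP, check the host is on a Google domain, then forward DNS the host name and confirm it resolves back to the same IP. This is a sketch of the documented check, not production code.

```python
import socket

def is_real_googlebot(ip):
    """Verify a claimed Googlebot IP the way Google recommends:
    reverse lookup, domain check, then forward lookup back to the IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse DNS
    except OSError:
        return False
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward DNS
    except OSError:
        return False
    return ip in forward_ips                            # must round-trip
```

Anything that fails any step of the round trip is treated as an impostor, which is exactly the behaviour you want before trusting a "Googlebot" visitor.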
Delivering different content based on the User-agent of the requesting web browser or search engine spider. For example, Googlebot/2.1 (+http://www.google.com/bot.html) or Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)
Look out for the following code:
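Server-side user-agent sniffing takes many forms (historically often in PHP); purely as an illustration of the pattern to look out for, a Python WSGI version might read as follows. Serving different content like this on a real site risks a penalty.

```python
def application(environ, start_response):
    """WSGI sketch of user-agent cloaking: one page for Googlebot,
    another for everyone else. Illustrative only -- do not do this."""
    ua = environ.get("HTTP_USER_AGENT", "")
    if "Googlebot" in ua:
        body = b"<html>keyword-stuffed page served only to the crawler</html>"
    else:
        body = b"<html>the page real visitors actually see</html>"
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body]
```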
Look out for the following code:
<meta http-equiv="refresh" content="0;url=http://www.yoursite.com/second-page.html">
Double/multiple meta refreshes or referrer cloaking
Multiple meta refreshes may be used to hide the referrer from affiliate websites. Avoid chaining multiple redirects of any kind, as it may have negative impacts on SEO and may even be against the terms of service (TOS) of your affiliate partners.
This is easy for a search engine to detect. Don't do it.
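Auditing your own pages for this is just as easy. A minimal Python check (a crude regex sketch, not a full HTML parser) might look like:

```python
import re

# Matches <meta http-equiv="refresh" ...> in its common spellings
META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv\s*=\s*["']?refresh["']?""", re.I)

def has_meta_refresh(html):
    """True if the page contains a meta refresh tag -- the first thing
    to grep for when auditing a site for refresh chains."""
    return bool(META_REFRESH.search(html))
```

Run it over every template and static page; any hit in a chain of more than one hop deserves a closer look.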
Back to back multiple redirects
Search engines may not follow multiple chained redirects (earlier versions of the HTTP specification recommended a maximum of five redirections). Google may follow around five chained redirects. Web browsers may follow more.
I could not find any data about how many redirects a web browser will follow so I created a quick chained-redirect script to test some of the browsers installed on my machine and provide some stats on the approximate number of redirects followed (by redirect type). I limited the script to a maximum of 5000 chained redirects.
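My script was a simple server-side loop; a minimal Python sketch of the same idea looks like the following (the /hop/ path and the port are arbitrary choices for illustration):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_HOPS = 5000  # cap on chain length, matching the test described above

class RedirectChain(BaseHTTPRequestHandler):
    """Each request to /hop/N answers with a 301 to /hop/N+1 until
    MAX_HOPS, so the access log shows how far each client followed."""
    def do_GET(self):
        try:
            hop = int(self.path.rsplit("/", 1)[-1])
        except ValueError:
            hop = 0
        if hop >= MAX_HOPS:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"end of chain")
        else:
            self.send_response(301)  # swap for 302 etc. to test other types
            self.send_header("Location", "/hop/%d" % (hop + 1))
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silenced here; the real test logged every hop

# To run it: HTTPServer(("", 8000), RedirectChain).serve_forever()
```

Point a browser (or bot) at /hop/1 and count how far down the chain the requests get before the client gives up.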
Since the script was already written, we thought we would run an additional test and submit the redirect URL to Google. We also linked to the script from Twitter. The results are in the table below.
Although Googlebot only crawled five of the permanent redirects in this instance, it seems fair to assume that Google may implement a crawl-based verification to test redirects beyond the five-redirect bot limit, in a similar vein to Microsoft above, which followed approximately 25 chained redirects. Note: we assumed this is a Microsoft-owned IP based on the IP Whois information from Domain Tools.
Frames
Frames allow a webmaster to embed another document within an HTML page. Search engines have not traditionally been good at attributing framed content to the parent page, enabling a webmaster to prevent search engines from seeing some or all of the content on a page.
Frames and iFrames are legitimate HTML elements (even though they are often not best practice from an SEO point of view); however, they can also be combined with other techniques to deceive users.
I can't think of a legitimate "white hat" reason why you would choose to use this. It may result in a penalty or a ban. Check the source code of your framed documents, remove this code or implement an appropriate SEO friendly redirect.
If the offending page or website has indexation issues, consider revising the <noscript> code as part of a thorough website SEO audit.
Content Delivery Networks - CDN
Content Delivery Networks (CDNs) allow companies to distribute their static content across multiple geographic locations to improve performance for end users. Depending upon the CDN configuration there are multiple ways to route the client request to the best available source to serve the content. CDNs are a complex area, usually implemented by global companies who need to serve users content in the quickest possible time.
If you are using a CDN, ensure that it allows a search engine to access the same content and information that users see, and ensure that there is nothing that a search engine could misinterpret as deceptive.
Hackers have used exploits in common CMSs to drive traffic to less-than-ethical third-party websites. One example is the WordPress Pharma Hack, which used cloaking to present pharmaceutical-related content to the search engines while hiding that content from the webmaster.
Ensure that your CMS, web server and operating system software are running the latest versions and have been secured. Some of the most common attack vectors are poor passwords, insecure software or scripts, disgruntled employees and social engineering tricks.
Cloaking HTTP headers
HTTP headers send additional information about the requested page to the search engine spider or web browser: for example, the status of the page, caching/expiry information, redirect information, etc.
Sending different headers to a search engine in order to deceive may result in a penalty. For example, replacing good content on a high-ranking page with a sign-up form, and altering the Expires and/or Cache-Control headers in an attempt to fool search engines into keeping the high-ranking version with the good content, will not work.
Googlebot may periodically download the content regardless of the Expires and Cache-Control headers to verify that the content has indeed not changed.
You can check the status of your server response headers using one of our free SEO tools.
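If you prefer a DIY check, a short Python sketch can pull back the headers and flag suspiciously long cache lifetimes. The 30-day threshold below is my own arbitrary choice, not a Google rule.

```python
import re
import urllib.request

def response_headers(url):
    """Fetch just the headers of a URL so you can eyeball the status,
    redirects, Expires and Cache-Control values."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status, dict(resp.headers)

def long_lived(cache_control, threshold=86400 * 30):
    """True if Cache-Control asks caches to keep the page longer than
    `threshold` seconds (default 30 days) -- worth a second look."""
    m = re.search(r"max-age=(\d+)", cache_control or "")
    return bool(m) and int(m.group(1)) > threshold
```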
To quote Google:
"Doorway pages are typically large sets of poor-quality pages where each page is optimized for a specific keyword or phrase. In many cases, doorway pages are written to rank for a particular phrase and then funnel users to a single destination"
Matt Cutts has a rant about Doorway pages here.
Multi-variate testing and Google Website Optimizer
Multi-variate testing tools such as Google Website Optimizer allow you to improve the effectiveness of your website by testing changes to your content and design to improve conversion rates (or other important metrics being measured).
Multi-variate testing is an ethical use of cloaking; however, Google states:
"if we find a site running a single non-original combination at 100% for a number of months, or if a site's original page is loaded with keywords that don't relate to the combinations being shown to visitors, we may remove that site from our index".
301 redirecting old domains to unrelated websites
Not necessarily cloaking-spam per se, but a bait-and-switch technique which 301-redirects unrelated domains (usually domains that are for sale or have expired but still have PageRank or significant external links) to a malicious or unrelated domain about a completely different topic.
This misleads users, who may be expecting a different website, and passes unrelated anchor text to your domain.
Also, don't expect credit for registering expired domains with external links in the hope of a PR or link boost.
Historically, search engines have struggled to interpret and index Flash content effectively, but they are getting better all of the time.
Building an entire website in Flash is still not a good idea from an SEO perspective; however, if you do have some Flash content, consider implementing SWFObject or a similar technique to ensure that Flash degrades gracefully for both users and search engines.
Interstitial adverts and popover divs
Popover divs and adverts alone are not cloaking. However, when interstitial ads or popover divs cannot be closed (for example, until the user registers), you may be presenting content to the search engines and a sign-up form to your users.
Ensure that users can close or skip interstitial adverts, pop-ups, popovers, overlaid divs, light boxes, etc. and view the content available.
AJAX can be used in a deceptive way to present different content to a user and a search engine - Don't.
Many of the techniques outlined in this article may be combined, chopped about or manipulated in a futile attempt to cheat the search engines.
Link cloaking (vanity URLs combined with redirects)
Link cloaking refers to sending a user to a different URL than the one clicked on using a redirect of some form. Redirects can be used for good and bad as we have seen above. Link cloaking is often used for analytical or maintenance purposes. There are a number of practical reasons to do this, for example:
Of course, this may be used to mislead and deceive, such as disguising an affiliate link (e.g. replacing the link with http://mysite.com/vanity-url and redirecting that to http://affiliate.com/offer.html?=my-affiliate-code).
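For the legitimate cases, a transparent vanity-URL setup can be as simple as a redirect map consulted by the web server before issuing a 301. The paths and targets here are entirely hypothetical:

```python
# Transparent vanity URLs: short, trackable paths that 301 to the real
# destination. Nothing is hidden from users or search engines.
VANITY = {
    "/go/newsletter": "https://www.example.com/email/signup",
    "/go/whitepaper": "https://www.example.com/docs/whitepaper.pdf",
}

def resolve(path):
    """Return the 301 target for a vanity path, or None if unknown."""
    return VANITY.get(path)
```

The key difference from the deceptive version above is that the destination is on your own site and the redirect's purpose is maintenance or analytics, not disguise.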
Avoid using link cloaking to deceive users, as it may result in search engine penalties or get your website banned.
Hiding text is against Google's TOS and Webmaster Guidelines. It is a form of cloaking as a search engine can see the textual content but a user cannot. Avoid the following types of hidden text:
If search engine traffic is important to you, make sure you consider the following with respect to cloaking: