SEO Cloaking Techniques to Avoid in 2011
Matt Cutts, Head of Google’s Web Spam team, took time away from Ozzie and Emmy (the Matt Cutts “Catts”) at the end of 2010 to post a little titbit for webmasters and SEOs via Twitter, which I’m sure added to the hangovers of a few black hats over the holiday season.
“Google will [look] more at cloaking in Q1 2011. Not just page content matters; avoid different headers/redirects to Googlebot instead of users.”
Cloaking is the technique used to present different content, layout, functionality or headers (a completely different page or partial components of the page, known as Mosaic cloaking) to a search engine spider than to a user’s Web browser.
Ethical cloaking is not “black hat”. However, spammers have in the past manipulated cloaking techniques (for clarity, let’s refer to this as cloaking-spam) to game the (Google) algorithm. This is not a new phenomenon: in the beginning, the meta keywords tag was abused by spammers and, as a result, is no longer a ranking factor, and the <noscript> tag may also be treated with some suspicion because it too has been abused in the past (perhaps we should open a refuge for abused HTML elements…).
First off, let me say that, if at all possible, you should AVOID CLOAKING. Cloaking is a high-risk exercise that, if it must be implemented, should be done in the appropriate ethical manner, adhering to Google’s Webmaster Guidelines, to ensure that your website is not penalised or dropped from the index.
Unfortunately, some webmasters may not understand the repercussions and may inadvertently cloak content, links or entire websites without even realising it. This article outlines some of the common on-site functionality that may be (mis)interpreted as cloaking-spam.
Keep in mind that Google is actively investigating instances of cloaking-spam and banning offending websites from its index. It is also following up detections of cloaking and unnatural links with notifications to webmasters via Webmaster Tools. Google is getting better and better at detecting cloaking-spam algorithmically, even IP delivery is not infallible, and of course Google encourages your competitors to file a spam report if they spot something fishy about your pages.
Identifying cloaking-spam algorithmically requires a search engine to compare a single web page obtained via two or more mechanisms (for example, two or more IP ranges, different User-agent identifiers or differing levels of HTML/JavaScript functionality). Microsoft filed a patent in late 2006 claiming a system that facilitates the detection of cloaked web pages.
Naturally, this leads to the question: how could a search engine gather and analyse two copies of a web page for comparison? Some methods may include:
- Partial content differentiation, using content topic analysis, page segmentation, Latent Semantic Analysis (LSA), keyword usage, links on page, and other on-page factors
- Different IP addresses, separate IP ranges or proxies used to analyse web spam
- Different user-agents (for example, using a browser user-agent to check for cloaked content; a simple sketch of this approach follows this list)
- Spam reports from the webmaster community
- User testing
- Analysis of more than 5 chained redirects to check for cloaking (perhaps limiting indexation and the flow of PageRank, authority, trust, etc. to 5 chained redirects)
- Improved interpretation of JavaScript code (specifically evaluating complex and/or encoded JavaScript functions that contain links or redirects)
- Mechanism to accept cookies (potentially in conjunction with the JavaScript and redirect analysis above)
Of course, the data gathering could be outsourced to a separate company (using IP ranges not associated with the search engine) to sidestep IP-delivery-based cloaking.
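As a simplified illustration of the user-agent comparison idea mentioned in the list above, the hypothetical sketch below (my own illustration, not anything a search engine has published) fetches the same URL with two different User-agent headers and flags a large difference between the responses; a real search engine would use far more sophisticated comparisons such as page segmentation and topic analysis.

```javascript
// Hypothetical sketch: fetch the same page with two User-agent headers and
// compare the responses. A large difference is a crude cloaking signal.
var http = require('http');

function fetchWithUserAgent(host, path, userAgent, callback) {
  var options = { host: host, path: path, headers: { 'User-Agent': userAgent } };
  http.get(options, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { callback(body); });
  });
}

var host = 'www.example.com'; // placeholder site to check
var path = '/';

fetchWithUserAgent(host, path, 'Mozilla/5.0 (Windows NT 6.1)', function (browserVersion) {
  fetchWithUserAgent(host, path, 'Googlebot/2.1 (+http://www.google.com/bot.html)', function (botVersion) {
    // Crude signal: a big difference in length suggests the page may be cloaked.
    console.log('Difference: ' + Math.abs(browserVersion.length - botVersion.length) + ' characters');
  });
});
```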
There are instances where a company may wish to provide different or additional information to its users. For example:
- Geo-targeting
- Logged-in users (customised homepage experience etc)
- Referral tracking – for example, provide feedback to the user based on their search engine query such as highlighting the words on a page that match the query
- Device cloaking for mobile phones and touch devices
- Optimisation for specific browsers or for backward compatibility
- Display optimisation (although this may usually be controlled through CSS)
- First Click Free (or first five clicks free)
- A/B or multivariate testing
- Vanity URLs (link cloaking)
- Displaying age verification (www.bacardi.com uses a combination of user-agent detection and cookies to display an age-verification welcome page to users but allow search engines to access the website, even though Google is only 14 years old)
- Load balancing
- Font replacement (via technology like sIFR or Cufon) – note: this may not be optimal for Google Preview (as of December 2010)
- SWFObject
Ensure that you consider the SEO implications when using any of the methods or functionality mentioned above, as misconfiguration may result in cloaking-spam or may simply not be optimal for SEO.
Okay, so this is not a tutorial on how to cloak; it is a “2011 cloaking-spam no-no list” or, at the very least, a heads-up on techniques to avoid or issues to fix early in 2011.
Some forms of cloaking are deliberate (such as IP delivery or user-agent cloaking); however, many forms of cloaking-spam may be accidental. The accidental types of cloaking-spam that inadvertently get you banned from Google are of utmost concern, as webmasters may not even be aware of the issue. Even large companies get it wrong sometimes.
Below, we investigate some of the most common cloaking-spam techniques so that webmasters and SEOs can make sure they are not present on their websites.
There are typically three ways that webmasters cloak content from either users or search engines:
- IP-delivery
- User-agent analysis (you can check for user-agent cloaking using Bruce Clay’s free SEO Cloaking checker)
- Exploiting known search engine behaviours such as the execution of JavaScript or redirects, and the indexation or spider-ability of various HTML elements
IP delivery: delivering different content based on the IP address of the requesting web browser or search engine spider. [IP Delivery is covered in more detail here.]
Reverse DNS & forward DNS
Reverse DNS and forward DNS lookups are not a form of cloaking but may be used to query the DNS records of a requesting IP address. Google provides details on how to verify Googlebot is who it claims to be.
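Following the published approach (reverse DNS lookup, check the domain, then a forward lookup to confirm), a minimal sketch in Node.js might look like the following; the IP address shown is purely illustrative.

```javascript
// Minimal sketch: verify that an IP claiming to be Googlebot really belongs to Google.
var dns = require('dns');

function verifyGooglebot(ip, callback) {
  dns.reverse(ip, function (err, hostnames) {
    if (err || hostnames.length === 0) return callback(false);
    var host = hostnames[0];
    // Genuine Googlebot hosts resolve to googlebot.com or google.com.
    if (!/\.googlebot\.com$|\.google\.com$/.test(host)) return callback(false);
    // Forward lookup: the hostname must resolve back to the original IP.
    dns.resolve4(host, function (err, addresses) {
      callback(!err && addresses.indexOf(ip) !== -1);
    });
  });
}

verifyGooglebot('66.249.66.1', function (isGooglebot) { // illustrative IP only
  console.log(isGooglebot ? 'Verified Googlebot' : 'Not Googlebot');
});
```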
User-agent cloaking: delivering different content based on the User-agent string of the requesting web browser or search engine spider, for example Googlebot/2.1 (+http://www.google.com/bot.html) or Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US).
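To make the pattern concrete, the sketch below (hypothetical Node.js code, not taken from any real site) shows the kind of server-side user-agent switch that becomes cloaking-spam as soon as the two versions of the page differ meaningfully:

```javascript
// Illustrative only: the kind of user-agent switch to look out for on your own servers.
// Serving meaningfully different content in the two branches is cloaking-spam.
var http = require('http');

http.createServer(function (req, res) {
  var userAgent = req.headers['user-agent'] || '';
  res.writeHead(200, { 'Content-Type': 'text/html' });
  if (/Googlebot/i.test(userAgent)) {
    res.end('<p>Version of the page served to search engine spiders</p>');
  } else {
    res.end('<p>Version of the page served to ordinary visitors</p>');
  }
}).listen(8080);
```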
JavaScript redirects: Google may index a page containing JavaScript but may not follow a JavaScript redirect. However, we are seeing significant improvements in Google’s interpretation of JavaScript code (for example, the Google Preview generator renders JavaScript, AJAX, CSS3, frames and iframes).
Webmasters sometimes use JavaScript redirects when they cannot implement a server-side redirect, inadvertently leaving Googlebot on the first page while sending the web browser (which follows the JavaScript redirect) to a second page containing different content; this may be flagged as cloaking-spam.
Look out for the following code:
<script type="text/javascript"> window.location="http://www.yoursite.com/second-page.html" </script>
A meta refresh is a tag added to the <head> section of an HTML page to redirect users to another page after a set period. The meta refresh tag is not considered cloaking when used on its own; however, it may be combined with JavaScript, frames or other techniques to send users to a different page from the one served to search engine spiders.
Look out for the following code:
<meta http-equiv="refresh" content="0;url=http://www.yoursite.com/second-page.html">
Double/multiple meta refreshes or referrer cloaking
Multiple meta refreshes may be used to hide the referrer from affiliate websites. Avoid chaining multiple redirects of any kind, as it may have a negative impact on SEO and may even be against the terms of service (TOS) of your affiliate partners.
Meta refresh in JavaScript or the <noscript> tag
OK, now we are getting into the realms of “black hat”. It is unlikely that a webmaster would combine a meta refresh with JavaScript unless they were up to no good.
This is easy for a search engine to detect. Don’t do it.
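If you are auditing an inherited site, this is roughly what the pattern looks like (a hypothetical snippet; the URLs are placeholders): JavaScript-capable browsers are redirected to one page, while non-JavaScript clients get a meta refresh to another.

```html
<!-- Look out for patterns like this: browsers and non-JavaScript clients
     (including some spiders) end up on different pages. -->
<script type="text/javascript">
  window.location = "http://www.yoursite.com/user-page.html";
</script>
<noscript>
  <meta http-equiv="refresh" content="0;url=http://www.yoursite.com/other-page.html">
</noscript>
```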
Search engines may not follow multiple chained redirects (the HTTP specification historically recommended a limit of 5 redirections). Google may follow around 5 chained redirects. Web browsers may follow more.
Multiple back-to-back redirects (especially combinations of different redirect types: 301, 302, meta refresh, JavaScript, etc.) impact page load times, may impact the flow of PageRank (even 301 redirects may see some PageRank decay) and could be considered cloaking-spam.
I could not find any data about how many redirects a web browser will follow so I created a quick chained-redirect script to test some of the browsers installed on my machine and provide some stats on the approximate number of redirects followed (by redirect type). I limited the script to a maximum of 5000 chained redirects.
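For reference, here is a minimal sketch of the kind of chained-redirect script used; this is an illustrative reconstruction in Node.js rather than the original script, and the meta refresh and JavaScript variants worked in the same self-referencing way.

```javascript
// Minimal sketch of a chained-redirect test: each request to /hop redirects to the
// next hop number until the limit, so the last "n" reached shows how many redirects
// the client followed.
var http = require('http');
var url = require('url');
var MAX_HOPS = 5000;

http.createServer(function (req, res) {
  var query = url.parse(req.url, true).query;
  var n = parseInt(query.n || '0', 10);
  var type = query.type === '302' ? 302 : 301; // redirect type under test
  if (n >= MAX_HOPS) {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    return res.end('Reached hop ' + n);
  }
  res.writeHead(type, { 'Location': '/hop?type=' + type + '&n=' + (n + 1) });
  res.end();
}).listen(8080);
```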
| Web Browser | Version | Approx # of 301 Redirects | Approx # of 302 Redirects | Approx # of Meta Refresh Redirects | Approx # of JavaScript Redirects |
|---|---|---|---|---|---|
| Google Chrome | 8.0.552.224 | 21 | 21 | 21 | Greater than 5000 (limit unknown) |
| Internet Explorer | 8.0.6001.18702IC | 11 | 11 | Greater than 5000 (limit unknown) | Greater than 5000 (limit unknown) |
| Mozilla Firefox | 3.5.16 | 20 | 20 | 20 | Greater than 3000 (limit unknown; the browser ground to a halt after 3000 JS redirects) |
| Safari | 3.1.2 (525.21) | 16 | 16 | Greater than 5000 (limit unknown) | Greater than 5000 (limit unknown) |
Since the script was already written, we thought we would run an additional test and submit the redirect URL to Google. We also linked to the script from Twitter. The results are in the table below.
| Search Engine | User Agent | Host IP | Approx # of 301 Redirects Followed |
|---|---|---|---|
| Microsoft* | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0) | 65.52.17.79 | 25 |
| Google | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 66.249.68.249 | 5 |
| Yahoo | Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) | 67.195.111.225 | 4 |
| Twitter | Twitterbot/0.1 | 128.242.241.94 | 3 |
| LinkedIn | LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com) | 216.52.242.14 | 1 |
| PostRank | PostRank/2.0 (postrank.com) | 204.236.206.79 | 0 |
Although Googlebot only crawled five of the permanent redirects in this instance, it may be fair to assume that Google could implement a crawl-based verification to test redirects beyond the five-redirection bot limit, in a similar vein to Microsoft above, whose crawler followed approximately 25 chained redirects. (* We assumed this is a Microsoft-owned IP based on the IP Whois information from Domain Tools.)
Frames allow a webmaster to embed another document within an HTML page. Search engines have traditionally not been good at attributing framed content to the parent page, which enables a webmaster to prevent search engines from seeing some or all of the content on a page.
Frames and iFrames are legitimate HTML elements (even though they are often not best practice from an SEO point of view); however, they can also be combined with other techniques to deceive users.
Frames with a JavaScript Redirect
Embedding a frame with a JavaScript redirect may leave search engine spiders at the first page and sneakily redirect users with JavaScript enabled to the second “hidden” page.
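A hypothetical example of the pattern to look out for: the parent page frames an inner document whose only job is to redirect JavaScript-enabled visitors elsewhere.

```html
<!-- Parent page: spiders index this URL and its framed content. -->
<iframe src="framed-content.html" width="600" height="400"></iframe>

<!-- framed-content.html: JavaScript-enabled browsers are silently sent to a different page. -->
<script type="text/javascript">
  window.top.location = "http://www.yoursite.com/hidden-page.html";
</script>
```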
I can’t think of a legitimate “white hat” reason why you would choose to use this. It may result in a penalty or a ban. Check the source code of your framed documents, remove this code or implement an appropriate SEO friendly redirect.
The <noscript> tag was designed to provide a non-JavaScript equivalent for JavaScript content so that text-only browsers and search engines could interpret more advanced forms of content. The <noscript> tag may be treated with some suspicion because it has been abused by spammers in the past.
Build JavaScript/AJAX functionality with progressive enhancement in mind so that the content is suitable for all users and doesn’t require the <noscript> tag at all. If your website does use the <noscript> tag and you cannot update the code, check that any text, links and images within it describe the JavaScript, AJAX or Flash content they represent in an accurate, clear and concise manner.
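As a quick illustration, an acceptable <noscript> block simply mirrors what the script renders; the script file and page below are hypothetical.

```html
<!-- The scripted widget and its <noscript> fallback describe the same content. -->
<script type="text/javascript" src="price-ticker.js"></script>
<noscript>
  <p>Live price ticker: current prices for our products are listed on the
     <a href="/prices.html">prices page</a>.</p>
</noscript>
```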
If the offending page or website has indexation issues, consider revising the <noscript> code as part of a thorough website SEO audit.
Content Delivery Networks (CDNs) allow companies to distribute their static content across multiple geographic locations to improve performance for end users. Depending upon the CDN configuration there are multiple ways to route the client request to the best available source to serve the content. CDNs are a complex area, usually implemented by global companies who need to serve users content in the quickest possible time.
If you are using a CDN, ensure that it allows a search engine to access the same content and information that users see, and ensure that there is nothing a search engine could misinterpret as deceptive.
Hackers have used exploits in common CMSs to drive traffic to less-than-ethical third-party websites. One example is the WordPress pharma hack, which used cloaking to present pharmaceutical-related content to the search engines while hiding that content from the webmaster.
Ensure that your CMS, web server and operating system software are running the latest versions and have been secured. Some of the most common exploits involve poor passwords, insecure software or scripts, disgruntled employees and social engineering tricks.
HTTP headers send additional information about the requested page to the search engine spider or web browser: for example, the status of the page, caching/expiry information, redirect information and so on.
Sending different headers to a search engine in order to deceive may result in a penalty. For example, replacing the good content on a high-ranking page with a sign-up form and altering the Expires and/or Cache-Control headers in an attempt to fool search engines into retaining the high-ranking version with the good content will not work.
Googlebot may periodically download the content regardless of the Expires and Cache-Control headers to verify that the content has indeed not changed.
You can check the status of your server response headers using one of our free SEO tools.
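If you would rather check from a script, a simple HEAD request will show the status code and caching headers; the sketch below assumes Node.js and a placeholder URL.

```javascript
// Minimal sketch: print the HTTP status code and response headers for a page.
var http = require('http');

var req = http.request({ method: 'HEAD', host: 'www.example.com', path: '/' }, function (res) {
  console.log('Status: ' + res.statusCode);
  console.log(res.headers); // includes Expires, Cache-Control, Location, etc. when present
});
req.end();
```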
Doorway pages are another technique to avoid. To quote Google:
“Doorway pages are typically large sets of poor-quality pages where each page is optimized for a specific keyword or phrase. In many cases, doorway pages are written to rank for a particular phrase and then funnel users to a single destination”
Source: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66355
Matt Cutts has a rant about Doorway pages here.
Multivariate testing tools such as Google Website Optimizer allow you to improve the effectiveness of your website by testing changes to your content and design against conversion rates (or other important metrics).
Multivariate testing is an ethical use of cloaking; however, Google states:
“if we find a site running a single non-original combination at 100% for a number of months, or if a site’s original page is loaded with keywords that don’t relate to the combinations being shown to visitors, we may remove that site from our index”.
This is not necessarily cloaking-spam per se, but rather a bait-and-switch technique that 301-redirects unrelated domains (usually domains that are for sale or have expired but still have PageRank or significant external links) to a malicious or unrelated domain about a completely different topic. For more on this, see https://www.youtube.com/watch?v=70LR8H8pn1M and https://searchengineland.com/do-links-from-expired-domains-count-with-google-17811.
This is misleading to users, who may be expecting a different website, and it may pass unrelated anchor text to your domain.
Also, don’t expect credit for registering expired domains with external links in the hope of a PageRank or link boost.
Historically, search engines have struggled to interpret and index Flash content effectively, but they are getting better all of the time.
Webmasters have had to consider users and search engines without Flash-enabled browsers, either building a standard HTML website “behind the scenes” for search engines or using the <noscript> tag, JavaScript or a similar method to get their textual content indexed. Unfortunately, this may be inadvertently identified as cloaking by search engines if the indexed textual content does not match the Flash content.
Building an entire website in Flash is still not a good idea from an SEO perspective; however, if you do have some Flash content, consider implementing SWFObject or a similar technique to ensure that it degrades gracefully for both users and search engines.
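As a sketch of the graceful-degradation approach (the file names are placeholders), SWFObject replaces a plain HTML fallback with the Flash movie only when a suitable Flash player is available, so users without Flash and search engines see the same textual content:

```html
<div id="flash-content">
  <p>Plain HTML fallback describing the same content as the Flash movie,
     visible to users without Flash and to search engines.</p>
</div>
<script type="text/javascript" src="swfobject.js"></script>
<script type="text/javascript">
  // Replace the fallback div with the Flash movie only when Flash 9+ is available.
  swfobject.embedSWF("movie.swf", "flash-content", "600", "400", "9.0.0");
</script>
```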
Popover divs and adverts alone are not cloaking. However, when interstitial ads or popover divs cannot be closed (unless, for example, the user registers), you may effectively be presenting content to the search engines and a sign-up form to your users.
Ensure that users can close or skip interstitial adverts, pop-ups, popovers, overlaid divs, lightboxes and the like, and can view the content available.
AJAX (Asynchronous JavaScript and XML) is a form of JavaScript that enables a web page to retrieve dynamic content from a server without reloading the page. It has become very popular over the last couple of years and is often (over)used in Web 2.0 applications.
AJAX can be used deceptively to present different content to users and search engines. Don’t do it.
On the other side of the coin, in a “negative cloaking” scenario, users may see the content but a search engine will not, because it cannot execute the JavaScript calls that retrieve the dynamic content from the server. Something to check.
Many of the techniques outlined in this article may be combined, chopped about or manipulated in a futile attempt to cheat the search engines.
One such example is combining JavaScript and cookies to cloak content: if the JavaScript cannot write and read back a cookie (as is the case for most search engine spiders), different content is displayed than a standard user with cookies enabled would see. There are also a few jQuery script examples floating around that would allow an unscrupulous person to do this.
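A hypothetical illustration of the pattern (something to look out for and remove, not something to use); the "content" element is assumed for the example:

```javascript
// Illustrative only: cookie-test cloaking. If the client cannot set and read back a
// cookie, different content is swapped into the page.
document.cookie = "cloaktest=1";
if (document.cookie.indexOf("cloaktest=1") === -1) {
  // Cookie-less client (e.g. some automated agents): show a different version.
  document.getElementById("content").innerHTML = "Copy intended only for non-cookie clients";
}
```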
Link cloaking refers to sending a user to a different URL than the one clicked on using a redirect of some form. Redirects can be used for good and bad as we have seen above. Link cloaking is often used for analytical or maintenance purposes. There are a number of practical reasons to do this, for example:
- To maintain a link to an affiliate within a syndicated PDF or application: using a vanity URL and redirect ensures that, if the affiliate updates their URL structure, you can update the redirect on the vanity URL so that the links in eBooks and syndicated content still work
- Vanity URLs used on marketing and advertising material which are easier to remember than the standard version of the URL
Of course, this may be used to mislead and deceive, such as disguising an affiliate link (e.g. replacing the link with http://mysite.com/vanity-url and redirecting that to http://affiliate.com/offer.html?=my-affiliate-code).
Link hijacking: modifying the anchor text or link attributes with JavaScript or a similar mechanism to trick or deceive users. This is a form of cloaking that modifies only a small component of the page. Examples include:
- Hijacking the onClick event to send a user to a different URL from the one presented to search engines
- Adding a rel=”nofollow” attribute to links displayed to search engines and removing it from the code displayed to users
- Modifying the anchor text of links to include keywords in the anchor text sent to search engines and displaying something different to users
Avoid link hijacking to deceive users as it may result in search engine penalties or get your website banned.
There are ethical forms of this technique, such as HiJAX (recommended on the Google blog), which ensure that both users and search engines can see your AJAX content.
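A minimal HiJAX-style sketch (loadProducts is a hypothetical function): the plain href works without JavaScript and for search engine spiders, while JavaScript-enabled browsers fetch the same content via AJAX.

```html
<!-- The href is crawlable; capable browsers call loadProducts() to pull in the
     same content via AJAX and cancel the normal navigation. -->
<a href="/products.html" onclick="loadProducts('/products.html'); return false;">Products</a>
```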
Hiding text is against Google’s TOS and Webmaster Guidelines. It is a form of cloaking as a search engine can see the textual content but a user cannot. Avoid the following types of hidden text:
- Text that is indiscernible from the background (e.g. dark grey text on a black background)
- Setting the font size to 0
- Styling keyword rich anchor text like standard body text so that users don’t realise it is a link
- Cascading Style Sheets (CSS) display:none
- Text behind images. Always a tricky subject and often open to debate amongst SEOs. If the text behind the image is an accurate and fair representation of the image (e.g. a header with a custom font), you “should be fine”, to quote Matt Cutts. The ultimate solution will depend upon your particular circumstances, but check these resources for some guidance: W3C: Using CSS to replace text with images, Fahrner Image Replacement (FIR), Scalable Inman Flash Replacement (sIFR). (Please note that sIFR-replaced text may not appear in Google Preview as of December 2010.) A short CSS image-replacement sketch follows this list.
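For example, a common off-screen text pattern where the hidden heading text matches the wording in the (hypothetical) logo image:

```html
<!-- The off-screen heading text says the same thing as the background image. -->
<h1 style="background: url(company-logo.png) no-repeat; width: 300px; height: 80px;
           text-indent: -9999px; overflow: hidden;">Acme Widgets</h1>
```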
If search engine traffic is important to you, make sure you consider the following with respect to cloaking:
- Make sure you are familiar with the obvious and not-so-obvious forms of cloaking above and are aware of how they are used on your site, to avoid any potential penalties.
- If you are implementing some form of cloaking, make sure this is properly reviewed from an SEO perspective to avoid potential penalties.