Nowhere Left to Hide: Blocking Content from Search Engine Spiders
- If you’re considering excluding content from search engines, first make sure you’re doing it for the right reasons.
- Don’t make the mistake of assuming you can hide content in a language or format the bots won’t comprehend; that’s a short-sighted strategy. Be up front with them by using the robots.txt file or Meta Robots tag.
- Don’t assume that just because you’re using the recommended methods to block content, you’re safe. Understand how blocking content will make your site appear to the bots.
When and How to Exclude Content from a Search Engine Index
A major facet of SEO is convincing search engines that your website is reputable and provides real value to searchers. And for search engines to determine the value and relevance of your content, they have to put themselves in the shoes of a user.
Now, the software that looks at your site has certain limitations, which SEOs have traditionally exploited to keep certain resources hidden from the search engines. The bots continue to develop, however, and are getting more sophisticated in their efforts to see your web page the way a human user would in a browser. It’s time to re-examine the content on your site that’s unavailable to search engine bots, as well as the reasons why it’s unavailable. The bots still have limitations, and webmasters have legitimate reasons for blocking or externalizing certain pieces of content. Since the search engines are looking for sites that give quality content to users, let the user experience guide your projects and the rest will fall into place.
Why Block Content at All?
- Duplicated content. Whether it’s snippets of text (trademark information, slogans or descriptions) or entire pages (e.g., custom search results within your site), if you have content that shows up on several URLs on your site, search engine spiders might see that as a sign of low quality. You can use one of the available options to block those pages (or individual resources on a page) from being indexed, keeping them visible to users but out of search results, which won’t hurt your rankings for the content you do want showing up in search.
- Content from other sources. Content such as ads, which is generated by third-party sources and duplicated in many places across the web, isn’t part of a page’s primary content. In that case, a webmaster may want to keep the ads from being viewed as part of the page.
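As a sketch of the robots.txt approach, the file below asks all crawlers to stay out of two sections of a site. The `/search/` and `/ads/` paths here are hypothetical examples, not a recommendation for any particular site; you’d substitute the URL patterns where your duplicated or third-party content actually lives:

```text
# robots.txt — must live at the root of the domain (e.g., example.com/robots.txt)
# Hypothetical example: keep crawlers out of internal search results and ad content

User-agent: *
Disallow: /search/
Disallow: /ads/
```

Note that robots.txt is a polite request to crawlers rather than an enforcement mechanism, which is why password-protected directories remain the only airtight option.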
That Takes Care of Why, How About How?
There are plenty of other methods for externalizing content that people discuss: iframes, AJAX, jQuery. But as far back as 2012, experiments were showing that Google could crawl links placed in iframes; so there goes that technique. In fact, the days of speaking a language that bots couldn’t understand are nearing an end.
But what if you politely ask the bots to avoid looking at certain things? Blocking or disallowing elements in your robots.txt file or with a Meta Robots tag is the only reliable way (short of password-protecting server directories) of keeping elements or pages from being indexed.
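For page-level control, the Meta Robots tag goes in the `<head>` of the page you want excluded. A minimal sketch of the two most common configurations:

```html
<!-- Keep this page out of the index, but let bots follow its links -->
<meta name="robots" content="noindex, follow">

<!-- Keep this page out of the index AND don't follow its links -->
<meta name="robots" content="noindex, nofollow">
```

One practical caveat: a bot has to be able to crawl the page to see the tag, so don’t disallow the same URL in robots.txt and expect the noindex to be honored.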
One more risk that you run when blocking content: search engine spiders may not be able to see what is being blocked, but they know that something is being blocked, so they may be forced to make assumptions about what that content is. They know that ads, for instance, are often hidden in iframes or even CSS; so if you have too much blocked content near the top of a page, you run the risk of getting hit by the “Top Heavy” Page Layout Algorithm. Any webmasters reading this who are considering using iframes should strongly consider consulting with a reputable SEO first. (Insert shameless BCI promo here.)