Google Granted New Similarity Patent
Google’s new similarity patent means your search engine optimization campaign may get a little harder. Don’t you just love it?
Google was granted a new patent this morning that describes a method for determining the similarity between Web pages. The patent describes a system where a similarity engine can identify duplicate content by spidering a site and generating a sketch of each page it finds. The sketches are weighted and can then be compared for similarity. If a page is deemed "too similar," Google can opt not to index it.
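If you want a feel for what that might look like under the hood (and I'll stress *might* -- the patent doesn't spell out Google's actual math), here's a rough sketch in Python using word shingles and Jaccard similarity, a classic way of comparing documents. The shingle size and the "too similar" cutoff below are made up purely for illustration.

```python
import re

def shingles(text, size=4):
    """Break a page's text into overlapping runs of `size` words ("shingles")."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(page_a, page_b, size=4):
    """Jaccard similarity of the two pages' shingle sets: 0.0 (nothing shared) to 1.0 (identical)."""
    a, b = shingles(page_a, size), shingles(page_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical cutoff -- the patent doesn't say where Google would draw the line.
TOO_SIMILAR = 0.9

def worth_indexing(new_page, indexed_pages):
    """Skip the new page if it's a near-duplicate of something already in the index."""
    return all(similarity(new_page, page) < TOO_SIMILAR for page in indexed_pages)
```

A real engine would boil those shingle sets down into small, fixed-size sketches so it never has to compare full pages against each other, but the basic idea is the same.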
Doing this means Google will be able to streamline its indexing process and help reduce the amount of duplicate content on the Web. It also means that if you're not careful with breadcrumb navigation, dynamic URLs or any of the other techniques commonly associated with duplicate content issues, you may find all your search engine optimization dollars officially wasted when Google decides not to index your site. Awesome, right?
Though the system may be imperfect at first, it would give Google an easy way to quickly compare pages and determine their similarity. This of course means that if you don't want your pages booted out of a user's search, you'd best be offering a high level of unique content to differentiate yourself from everyone else out there. Otherwise, Google gets to decide not to crawl your site.
I absolutely love it, especially if it gets rid of scraper sites. Scraper sites infuriate me.
The entire similarity concept is also too math geeky for me to comprehend, but the idea of creating a virtual geek map of your page to determine its likeness to others is pretty cool. So cool that Google's not the only one who has filed a patent for this technology.
Anything that limits the amount of duplicate content I’m seeing in my search results is okay by me. I think it would be very interesting to see something like this in place, mainly because it would mean Google would have to form a definitive answer as to how much duplication or resemblance is too much. No one has been able to do that to date and it’d be telling to see how Google views it.
Looking down the road a bit, I think it would also be fun to watch similarity become the new PageRank. Everyone will be speculating as to what factors are given the most weight and how similar is too similar according to Google. You have to think Google will leave room for some degree of similarity. Just because every site uses Meta tags or search engine friendly design doesn't make them similar or even related. And what about forums? Won't they look relatively the same to a search engine?
Of course, who knows if Google will ever implement this in their algorithm? They may have just applied for the patent to block all the other companies toying with similar technology. Either way, it's worth speculating about, especially if it means we can talk about the dangers of duplicate content. (Duplicate content is bad. Very bad.)
Search Engine Land has more info on the patent, as well as the other recently approved Google patent – Digital Mapping. And if you just want to read up on duplicate content, check out our Duplicate Content & Multiple Site Issues Chicago session recap.