Google Library--Scanning Back the Index

by: Susan Esparza, October 2005

At first, it sounded like a wonderful idea. Google was going to take the libraries of five major universities and scan them, making them fully indexable and searchable. The project would allow anyone to find a book on the topic they were looking for, trace a quotation to the exact title and author where it originated, and generally advance Google's goal of organizing the world's information. Works in the public domain would be available in their entirety, while copyrighted works would show only a sentence or two of context as well as the title page. Additionally, copyrighted works would have links to booksellers where the title could be purchased. The idea seemed bulletproof, and students everywhere rejoiced.

Unfortunately, the publishers disagreed. Project Gutenberg, which has over 16,000 works available on the web, limits itself to the public domain. Google's venture into copyrighted content is different, and publishers feel threatened by it, claiming that the scanning itself is infringement. In response to the outcry, Google put the scanning of copyrighted materials on hold and offered publishers the chance to opt out of having their works scanned. Google asserts that the scanning is fair use because it will display only a snippet of the copyrighted material. Publishers, including the more than 300 members of the Association of American Publishers (AAP), disagree and are suing Google to prevent the search company from continuing.

While publishers yell and scream about Google, the Open Content Alliance has been garnering much better publicity. Initially launched by a consortium of companies including Yahoo and recently joined by MSN, the OCA is also scanning public domain and Creative Commons-licensed works, but it handles copyrighted works differently. The OCA only scans books that have been opted into the program, sparing it the legal battles currently plaguing Google.

Though the publishers are outraged, it's really not that hard to understand why a search engine would believe that it is entirely acceptable to require the copyright holder to opt out. After all, search engines do it every day. Google, Yahoo, MSN, in fact all the search engines, hold the view that unless you tell them they can't have your copyrighted work, they will happily index it and even cache it, making it available whether you want them to or not. That's the entire purpose of the robots.txt file, which itself is often ignored by the search engine spiders. This overall approach makes Google's attitude quite understandable.
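For readers who have not seen the web's opt-out mechanism in practice, here is a minimal sketch of how a well-behaved spider might consult a robots.txt file before fetching a page. It uses Python's standard `urllib.robotparser` module. The robots.txt contents, the `can_crawl` helper, the "OtherBot" user agent and the example.com URLs are all invented for illustration; this is not how any particular search engine implements its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, made up for this example: it shuts Googlebot
# out of the whole site and keeps every other spider out of /private/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    can_crawl is an illustrative helper, not part of any search engine's code.
    """
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Googlebot", "http://example.com/index.html"),
        ("OtherBot", "http://example.com/index.html"),
        ("OtherBot", "http://example.com/private/notes.html"),
    ]:
        verdict = "allowed" if can_crawl(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent} -> {url}: {verdict}")
```

Note that nothing here stops a spider from skipping the check entirely. Compliance is voluntary, which is why the file works as an opt-out request rather than a lock.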

As Danny Sullivan has pointed out before, the fact that these two situations parallel each other is also what makes this case so important to the web as a whole. Danny's case is simply this--if the library digitization project is copyright infringement, then every page ever indexed by Google is also infringement, and the Wayback Machine is most certainly infringing. Some may claim, and indeed the AAP is doing so, that web pages are different, but that simply isn't true. Copyright laws apply equally to web and book content. Google CEO Eric Schmidt equated the two in a recent article in the Washington Post, reprinted on the Google Blog:

"Even those critics who understand that copyright law is not absolute argue that making a full copy of a given work, even just to index it, can never constitute fair use. If this were so, you wouldn't be able to record a TV show to watch it later or use a search engine that indexes billions of Web pages."

Remember that the AAP is concerned not with the republishing of the works to the web--indeed Google has made it very clear that it won't be doing that--but with the scanning in the first place, the making of the digital copy. That act is comparable, at the very least, to caching, and the same argument could extend to search engine indexes as well.

So what does it all mean? At the moment, it's easy to opt out of being indexed (though it's hard for an SEO to imagine why anyone would want to). The trouble is that if Google's efforts are ruled infringing, then we are looking at a whole new web. Online publishers will demand the same rights as their traditional counterparts, and search engines would have to wait for permission before indexing anything. The result could be much smaller, less relevant indexes.

On the bright side, it will be much easier to be ranked number one in an index made up only of those who want to be in.