BACK TO (not so) BASICS: Programming for SEOs: How to Easily Write Custom Data Extraction Scripts
Or, How to Use Python to Spider a Web Site
In the title of this article, the word "easily" might overstate things a little. However, as an SEO, there's nothing more frustrating than the mounds of semi-structured data we have to wade through to get to the bits useful to our job. Having access to a site's development systems makes an SEO's job somewhat simpler, but an SEO rarely has that kind of access. Usually our interface to a site is the same as anyone else's: a Web browser. Third-party site analysis tools are useful as far as they go, but they're limiting too; you're hemmed in by the vision of the tool maker.
There is a better way, but it requires biting the bullet and learning some programming. It's not for the faint of heart, but for those who put in the time and effort to acquire a few basic skills, it pays off in spades. I'll show you what I've found to be the easiest approach to mining client sites for SEO-relevant data. If you borrow the approach outlined in this brief tutorial, you'll spend less time on the parts of automated site processing that are always identical and more time on the parts that vary: your particular site and the data you want out of it.
When you think about pulling data out of a site, it's often useful to pretend it's a flat list of pages. Thinking about the site as a graph of nodes with a complex link structure usually just confuses things; it's less intimidating to think of it as a list. Of course, thinking about site data this way is only useful if you have the right tools to let you abstract away the complicated part. That's what a good Web spider framework does for you. It scans through a site, keeps track of links, tosses out duplicates, gives you a few obvious configuration options, and notifies you of interesting events so you can respond to them with custom code.
If you can think of a site as a list of pages, then you need only be concerned with how to filter that list. My favorite spider framework for this is a third-party module for Python called Ruya. I looked at spider frameworks in several languages: PHP, Perl, Python, Ruby, and Erlang. In my opinion, Ruya had the easiest, best-suited interface for an SEO's needs. It's not the fastest or most powerful crawling library, but that's not what you're looking for when doing ad hoc analysis; you want to optimize for development time.
You'll need a development environment, the setup of which is beyond the scope of this article. On Windows, I prefer getting Python as part of a Cygwin install and using Emacs as my code editor. Python comes with an interactive shell that lets you play around and learn organically, and Emacs can host that shell, making for a truly blissful incremental development setup.
Once you've installed Python, the next step is to install the necessary Python modules. To find out where you should install the modules on your system, type the following into an interactive shell:
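One way to do this is with the distutils helper that ships in the standard library; it prints the default directory for third-party packages:

```python
>>> from distutils.sysconfig import get_python_lib
>>> print get_python_lib()
```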
On my system, this is the result:
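```
/usr/lib/python2.5/site-packages
```

Your output will vary with your platform and Python version, but whatever directory it names plays the same role.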
For these instructions, I'm going to use /usr/lib/python2.5/site-packages to house my non-standard modules. We need to install four third-party modules into this directory. Modules provide libraries of code that extend the core functionality of the language. Download each of the following into the module directory found above:
To test that you've got the modules installed correctly, type the following into the Python shell:
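At a minimum, the two modules used in the examples below should import cleanly; the other downloads can be checked the same way. (The lines below assume Ruya installs under the module name ruya and that you're on the BeautifulSoup 3.x series.)

```python
>>> import ruya
>>> from BeautifulSoup import BeautifulSoup
```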
If all is well, you'll get your prompt back silently. Otherwise, an error will be printed to the screen.
The following is the simplest example possible.
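This sketch follows the event-driven pattern from Ruya's bundled sample code. The class names (ruya.Config, ruya.Document, ruya.Uri, ruya.SingleDomainDelayCrawler), the bind call, and the attributes read inside the callback are assumptions drawn from that sample and may differ slightly in the version you download, so check them against the module's own docstrings. The start URL is likewise just a placeholder for whatever directory category you want to crawl.

```python
import ruya

def aftercrawl(caller, eventargs):
    # Fires once for every page the crawler finishes processing.
    # eventargs.document is assumed to carry the parsed page.
    doc = eventargs.document
    print doc.uri.url, doc.title

def main():
    # Placeholder start URL: point this at the category you actually care about.
    start = ruya.Document(ruya.Uri('http://directory.google.com/'))
    # Crawl two levels deep, waiting a few seconds between requests.
    config = ruya.Config(ruya.Config.CrawlConfig(levels=2, crawldelay=5),
                         ruya.Config.RedirectConfig(),
                         ruya.Config.CrawlScopeConfig())
    crawler = ruya.SingleDomainDelayCrawler(config)
    crawler.bind('aftercrawl', aftercrawl, None)
    crawler.crawl(start)

if __name__ == '__main__':
    main()
```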
To execute it on a Unix-like system (Linux, Cygwin, Mac), save the script to a file and do the following from a shell or terminal:
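Assuming you saved the example as simple_spider.py (the filename is whatever you chose):

```
$ python simple_spider.py
```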
If all goes well, the URL and page title of every page below the Promotion directory in the Google Directory will be printed to the screen.
You can actually do a lot of common tasks with Ruya's built-in data structures and methods. Inside the <code>aftercrawl</code> function, you have access to all of the following variables:
In addition to those, you can get the HTML of the processed page. In the next example, we'll combine that fact with the powerful HTML parser BeautifulSoup to show how you can extract absolutely any interesting item from a set of pages.
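If you haven't met BeautifulSoup before, here's a quick standalone taste of what it does, independent of any crawler. The snippet assumes the 3.x series and its import path, and the HTML string is obviously made up:

```python
from BeautifulSoup import BeautifulSoup

html = '''<html><head><title>Hello</title></head>
<body><a href="/a" rel="nofollow">A</a> <a href="/b">B</a></body></html>'''

soup = BeautifulSoup(html)
print soup.title.string                    # the page title: Hello
for link in soup.findAll('a', href=True):  # every anchor that has an href
    print link['href'], link.get('rel')    # e.g. /a nofollow
```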
Extending the First Example with BeautifulSoup and Comments
This script descends through every category below Promotions in the Google Directory, writing out a tab-delimited text file of each directory link and the page it was found on:
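As with the first example, the Ruya-specific pieces here (the config and crawler classes, the bind call, and the document attributes, including the raw-HTML attribute read inside the callback) are assumptions drawn from Ruya's sample code, and the output filename and the '/Top/' filter are purely illustrative; the BeautifulSoup and file-handling parts are standard.

```python
import codecs

import ruya
from BeautifulSoup import BeautifulSoup

# One line per directory link found: the link target, a tab, then the page it appeared on.
outfile = codecs.open('promotion_links.txt', 'w', 'utf-8')

def aftercrawl(caller, eventargs):
    doc = eventargs.document
    page_url = doc.uri.url
    # The attribute holding the raw page HTML may be named differently in your
    # Ruya version; adjust to whatever the document object actually exposes.
    soup = BeautifulSoup(doc.html)
    for link in soup.findAll('a', href=True):
        href = link['href']
        # Illustrative filter: keep only links that point back into the directory tree.
        if '/Top/' in href:
            outfile.write(u'%s\t%s\n' % (href, page_url))

def main():
    # Placeholder start URL: point this at the Promotions category you want to walk.
    start = ruya.Document(ruya.Uri('http://directory.google.com/'))
    config = ruya.Config(ruya.Config.CrawlConfig(levels=2, crawldelay=5),
                         ruya.Config.RedirectConfig(),
                         ruya.Config.CrawlScopeConfig())
    crawler = ruya.SingleDomainDelayCrawler(config)
    crawler.bind('aftercrawl', aftercrawl, None)
    crawler.crawl(start)
    outfile.close()

if __name__ == '__main__':
    main()
```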
The examples given here are pretty simple, but that's part of the point. For an SEO's needs, programmatically crawling a site can be easy enough to save a lot of time in many circumstances. Another thing to note is that the examples here are necessarily drained of blood. In a real scenario, you'd be filtering for, say, every link in a section with the rel attribute set to "nofollow", or building a list of pages whose meta descriptions contain one of the phrases from a predetermined list.
Unfortunately, I can't turn you loose on some poor, random webmaster's site to experiment with automated robots, and generic high-volume sites like the Google Directory aren't quite as interesting SEO-wise. Nevertheless, a little imagination makes the potential clear. With a few sprinkles of custom logic, you can extract precisely what you need from a site to get any job done.