
BACK TO (not so) BASICS: Programming for SEOs: How to Easily Write Custom Data Extraction Scripts


Or, How to Use Python to Spider a Web Site

By Mike Terry, November 2008


Introduction


In the title of this article, the word "easily" might overstate things a little. However, as an SEO, there's nothing more frustrating than the mounds of semi-structured data we have to wade through to get to the bits useful to our job. Having access to the development systems for a site makes the job of an SEO somewhat simpler, but it is not often the case that an SEO will have such access. Usually our interface to a site is the same as anyone else's — a Web browser. Third-party site analysis tools are useful, as far as they go, but they're limited too — you're hemmed in by the vision of the tool maker.

There is a better way, but it requires biting the bullet and learning some programming. It's not for the faint of heart, but for those who put in the time and effort to acquire a few basic skills, it pays off in spades. I'll show you what I've found to be the easiest approach to mining client sites for SEO-relevant data. If you borrow the approach outlined in this brief tutorial, you'll spend less time on the parts of automated site processing that are always the same and more time on the parts that vary: your particular site and the data you want to get out of it.

When you think about pulling data out of a site, it's often useful to pretend it's a flat list of pages. Thinking about the site as a graph of nodes with a complex link structure usually just confuses things; it's less intimidating to think of it as a list. Of course, thinking about site data this way is only useful if you have the right tools to let you abstract away the complicated part. That's what a good Web spider framework does for you. It scans through a site, keeps track of links, tosses out duplicates, gives you a few obvious configuration options, and notifies you of interesting events so you can respond to them with custom code.

If you can think of a site as a list of pages, then you only need be concerned with how to filter that list. My favorite spider framework for this is a third-party module for Python called Ruya. I looked at spider frameworks in several languages: PHP, Perl, Python, Ruby, and Erlang. In my opinion, Ruya had the easiest, best interface for an SEO's needs. It's not the fastest or most powerful crawling library, but that's not what you're looking for when doing ad hoc analysis. Instead you want to optimize for development time.
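
To make the "list of pages" idea concrete, here is a tiny illustrative sketch. The page records and the is_interesting test are made up for this example; a real spider framework hands you this data one page at a time as it crawls:

# Pretend the whole site has already been flattened into a list of simple
# page records (purely for illustration).
site = [
  {'url': 'http://www.example.com/', 'title': 'Home'},
  {'url': 'http://www.example.com/widgets/', 'title': 'Widgets'},
  {'url': 'http://www.example.com/widgets/blue.html', 'title': ''},
]

def is_interesting(page):
  # Your custom filter goes here; this one flags pages with an empty title.
  return page['title'] == ''

for page in site:
  if is_interesting(page):
    print page['url']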


Set-up


You'll need a development environment; setting one up in detail is beyond the scope of this article. On Windows, I prefer getting Python as part of a Cygwin install and using Emacs as my code editor. Python comes with an interactive shell that lets you play around with the language and learn organically, and Emacs can wrap that shell to provide a truly blissful incremental development environment.
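
For example, typing python at a Cygwin or terminal prompt drops you into the interactive shell, where you can try things out line by line. The short session below is just an illustration:

>>> 2 + 2
4
>>> url = 'http://www.example.com/widgets/'
>>> url.split('/')
['http:', '', 'www.example.com', 'widgets', '']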

Once you've installed Python, the next step is to install the necessary Python modules. To find out where you should install the modules on your system, type the following into an interactive shell:

import sys
sys.path


On my system, this is the result:

['', '/usr/share/emacs/22.1/etc', '/usr/lib/python25.zip', '/usr/lib/python2.5', '/usr/lib/python2.5/plat-cygwin', '/usr/lib/python2.5/lib-tk', '/usr/lib/python2.5/lib-dynload', '/usr/lib/python2.5/site-packages']


For these instructions, I'm going to use /usr/lib/python2.5/site-packages to house my non-standard modules. We need to install four third-party modules into this directory. Modules provide libraries of code that extend the core functionality of the language. Download each of the following into the module directory found above:

  1. BeautifulSoup - This is a module for parsing "dirty" HTML. Being able to parse malformed HTML is critical, because that's what most of the Web consists of (see the short example after this list).
  2. Ruya - This is a module for spidering Web sites.
  3. Kconv - A module required by Ruya that converts the source of downloaded pages to UTF-8. After uncompressing the download, you'll be left with a folder called kconvp; move the kconv folder inside it into the module directory.
  4. htmldata - A module required by Ruya that, like BeautifulSoup, does "tag soup" parsing. After downloading, rename the file so its extension is '.py'.
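
As a quick taste of what BeautifulSoup does for you, the snippet below feeds it a made-up scrap of malformed HTML and shows that it still builds a usable parse tree:

from BeautifulSoup import BeautifulSoup

# A deliberately sloppy scrap of HTML: unclosed <p> tags and no closing </html>.
dirty = '<html><body><p>First paragraph<p>Second paragraph with <a href="/promo.html">a link</a>'

soup = BeautifulSoup(dirty)

# BeautifulSoup copes with the mess and lets us query the tree anyway.
for link in soup.findAll('a'):
  print link['href']            # prints: /promo.html
print len(soup.findAll('p'))    # prints: 2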

To test that you've got the modules installed correctly, type the following into the Python shell:

import BeautifulSoup
import ruya
import kconv
import htmldata


If all is well, you'll get your prompt back silently. Otherwise, an error will be printed to the screen.
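
For reference, a failed import ends with a traceback whose last line names the missing module, something like this (shown here for ruya):

>>> import ruya
Traceback (most recent call last):
  ...
ImportError: No module named ruya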


Simplest Example


The following is the simplest example possible.

#!/usr/bin/env python

import ruya, logging

def aftercrawl(caller, eventargs):
  page = eventargs.document

  print 'Url: ' + page.uri.url
  print 'Title: ' + page.title

if __name__ == '__main__':
  url = 'http://directory.google.com/Top/Computers/Internet/Web_Design_and_Development/Promotion/'
  page = ruya.Document(ruya.Uri(url))
  c = ruya.Config(ruya.Config.CrawlConfig(crawlscope=ruya.CrawlScope.SCOPE_PATH), ruya.Config.RedirectConfig(), logging.getLogger())
  spider = ruya.SingleDomainDelayCrawler(c)
  spider.bind('aftercrawl', aftercrawl, None)

  spider.crawl(page)


To execute the script on a Unix-like system (Linux, Cygwin, Mac), do the following from a shell or terminal:

  1. Save the file in /path/to/folder/file.py.
  2. Enter cd /path/to/folder into the shell and hit enter.
  3. Enter chmod 755 file.py into the shell and hit enter.
  4. Enter ./file.py into the shell and hit enter.

If all goes well, the URL and page title of every page below the Promotion directory in the Google Directory will be printed to the screen.

You can actually do a lot of common tasks with Ruya's built-in data structures and methods. Inside the aftercrawl function, you have access to all of the following variables (a short sketch using a few of them follows the list):

  1. title
  2. description
  3. keywords
  4. lastmodified
  5. etag
  6. httpstatus
  7. contenttype
  8. contentencoding
  9. redirecturi
  10. redirects
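
For example, a handler along these lines could print a quick per-page report. This is just a sketch: it assumes, as the list above implies, that those fields are exposed as plain attributes on eventargs.document:

def aftercrawl(caller, eventargs):
  page = eventargs.document

  # Print a quick per-page report using a few of Ruya's built-in fields.
  print 'Url:         ' + page.uri.url
  print 'Status:      ' + str(page.httpstatus)
  print 'Title:       ' + str(page.title)
  print 'Description: ' + str(page.description)
  print 'Keywords:    ' + str(page.keywords)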

In addition to those, you can get the HTML of the processed page. In the next example, we'll combine that fact with the powerful HTML parser BeautifulSoup to show how you can extract absolutely any interesting item from a set of pages.


Extending the First Example with BeautifulSoup and Commenting


This script descends through every category below Promotion in the Google Directory, printing a tab-delimited list of each directory link and the page it was found on (redirect the output to a text file if you want to open it in a spreadsheet):

# The "shebang" line: when running from the command line, tells the system to use Python to
# interpret the code
#!/usr/bin/env python

# Loads the built-in modules and 3rd party modules we installed earlier
import ruya, logging, re
from BeautifulSoup import BeautifulSoup

# This is an event handler. It's called automatically by Ruya whenever a new page
# has been crawled and gives us an opportunity to run custom code.
def aftercrawl(caller, eventargs):
  # Ruya has parsed the current page into a data structure. Here we get a reference
  # to it.
  page = eventargs.document
  # Get the URL of the current page.
  uri = page.uri
  # Use the fantastic BeautifulSoup module to parse the current page's source code into
  # a data structure we can use to easily gain access to all its elements.
  soup = BeautifulSoup(page.getPlainContent())
  # The next two lines are the hard part. Use BeautifulSoup's API to get each directory link on the page.
  eachTag = soup.findAll('table')[-5].findAll(lambda tag: tag.name == "a" and (str(tag['href'])).find('google') == -1)
  eachTag = eachTag + soup.findAll('table')[-4].findAll(lambda tag: tag.name == "a" and (str(tag['href'])).find('google') == -1)
  # Output each link.
  for thisTag in eachTag:
   print(str(uri) + "\t" + thisTag['href'])

# This is the first routine that's called when the script is run.
if __name__ == '__main__':
  # The starting URL.
  url = 'http://directory.google.com/Top/Computers/Internet/Web_Design_and_Development/Promotion/'
  # The necessary initialization stuff.
  page = ruya.Document(ruya.Uri(url))
  # This line configures Ruya and initializes it. 'levels' is how deep to spider
  # (the default is 2). 'crawlscope' controls which pages in the "directory structure"
  # to follow; set to SCOPE_PATH, the crawler only accesses links at and below the
  # starting folder.
  cfg = ruya.Config(ruya.Config.CrawlConfig(levels=10, crawlscope=ruya.CrawlScope.SCOPE_PATH, crawldelay=0), ruya.Config.RedirectConfig(), logging.getLogger())
  spider = ruya.SingleDomainDelayCrawler(cfg)
  # This is how you associate your custom code to Ruya's built-in events.
  spider.bind('aftercrawl', aftercrawl, None)

  print "Source Page" + '\t' + "Found URL"
  # And here we start the spider.
  spider.crawl(page)
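
If you'd rather have the script write the file itself instead of redirecting its output in the shell (./file.py > links.txt), one possibility is to open a file near the top of the script and write to it inside the event handler. This is only a sketch of a variation on the example above; the links.txt filename is arbitrary, and this handler records every link rather than just the directory links:

# Opened once at module level so the handler can write to it.
outfile = open('links.txt', 'w')

def aftercrawl(caller, eventargs):
  page = eventargs.document
  soup = BeautifulSoup(page.getPlainContent())
  for link in soup.findAll('a'):
    # One tab-delimited row per link found on the page.
    outfile.write(str(page.uri) + '\t' + str(link.get('href', '')) + '\n')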



Summing Up


The examples given here are pretty simple, but that's part of the point: for an SEO's needs, programmatically crawling a site can be easy enough to save a lot of time in many circumstances. Another thing to note is that the examples are necessarily bloodless. In a real scenario, you'd be filtering for, say, every link in a section with the rel attribute set to "nofollow", or building a list of pages whose Meta Descriptions contain one of the phrases from a predetermined list.
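
For a flavor of what that looks like, here is a hedged sketch of an aftercrawl body that handles both of those tasks with BeautifulSoup. The phrase list is invented, and the attribute handling may need adjusting for real pages:

from BeautifulSoup import BeautifulSoup

PHRASES = ['free shipping', 'limited time']  # a made-up watch list

def aftercrawl(caller, eventargs):
  page = eventargs.document
  soup = BeautifulSoup(page.getPlainContent())

  # Report every link on the page whose rel attribute is "nofollow".
  for link in soup.findAll('a', rel='nofollow'):
    print str(page.uri) + '\t' + 'nofollow' + '\t' + str(link.get('href', ''))

  # Report the page if its Meta Description contains one of the watched phrases.
  meta = soup.find('meta', attrs={'name': 'description'})
  if meta:
    description = str(meta.get('content', '')).lower()
    for phrase in PHRASES:
      if phrase in description:
        print str(page.uri) + '\t' + 'description contains' + '\t' + phrase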

Unfortunately, I can't turn you loose on some poor, random webmaster's site to experiment with automated robots, and generic high volume sites like Google Directory aren't quite as interesting SEO-wise. Nevertheless, a little imagination makes the potential clear. With a few sprinkles of custom logic, you can extract just precisely what you need from a site to get any job done.

