The 2002 Perl Advent Calendar

On the 16th day of Advent my True Language brought to me... WWW::Mechanize

Do you ever wish that you could program your web browser to do things automatically? You know, things like clicking on a link, then filling in a form, then clicking on the first link of that page and then showing you that?

Well, you know you can do that kind of thing in Perl with LWP. LWP abstracts away all the difficult bits of dealing with HTTP. You don't have to worry about setting up and tearing down sockets. You don't have to remember which headers to send to the server or how to decode the data. Using the correct modules, it's as simple as asking for a page and getting the page data back again.

However, that's as far as it goes. Though LWP ships with many modules for parsing HTML, it's up to you to tie them together to get the desired effect. Alternatively, you can use WWW::Mechanize to make your life easier. It's essentially a browser-level interface built on top of LWP that lets you interact with it the same way you conceptually interact with a browser.

This is great for little web scripts that you want to write quickly, or for integrating with other sites. Another area where I find it really useful is testing CGI applications: you can set up programs that test your CGIs live on the server. It's a simple and quick technique, and therefore quite powerful.

Okay, let's write a script that deals with search CPAN. First of all we need to go to the front page. This can be written like so:

      # turn on perl's safety features
      use strict;
      use warnings;
      # work out the name of the module we're looking for
      my $module_name = $ARGV[0]
        or die "Must specify module name on command line";
      # create a new browser
      use WWW::Mechanize;
      my $browser = WWW::Mechanize->new();
      # tell it to get the main page
      # (URL assumed here: the search.cpan.org front page)
      $browser->get("http://search.cpan.org/");

    The above code just downloads the main page. Now we want to fill in the form and click on the submit button. Looking at the source of the web page, we see that the input field on the main page is named query.

      # okay, fill in the box with the name of the
      # module we want to look up
      $browser->field("query", $module_name);
      # and click on the submit button
      $browser->click();

    So the above source fills in the query field in the first form and clicks on the submit button, taking us to the page that lists all the found modules. From this point on, all calls to $browser will be from the point of view of the new page. We want to click on the link on this new page whose text contains the name of the module we were looking for.

      # click on the link that matches the module name
      $browser->follow($module_name);

    Now we can start pulling various things out of the web page that we're interested in. For example, we can simply get the URL of the page we're currently at like so:

      my $url = $browser->{uri};

    Getting the current URL is very useful when we're writing simple CGIs. The above code could be rewritten so that it takes its arguments from CGI and issues an HTTP redirect to the page we ended up on.
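As a rough sketch of the redirect half of that idea, the snippet below builds the CGI response by hand rather than through a CGI module. The helper redirect_response() and the example URL are made up for illustration; in the real script the URL would be the one read from the browser object.

```perl
use strict;
use warnings;

# Build the raw CGI response for a temporary redirect.
# redirect_response() is a hypothetical helper, not part of
# WWW::Mechanize or any CGI module.
sub redirect_response {
    my ($url) = @_;
    # a CGI response header block ends with a blank line
    return "Status: 302 Found\r\nLocation: $url\r\n\r\n";
}

# placeholder URL; the real script would use the browser's current URL
my $url = "http://www.example.com/some/page";
print redirect_response($url);
```

Keeping the header construction in one small function makes it easy to check on the command line before the script is ever installed as a CGI.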

    Looking through the contents

    There's a plethora of modules on CPAN that can be used to work your way through a web page. I already mentioned HTML::TokeParser::Simple when I was discussing Image::Size. I also recommend having a look at modules such as HTML::TreeBuilder.

    Another module that I often use is XML::LibXML. Even though it's designed to work with XML, it has an HTML mode that allows it to work with input that isn't well formed. It will allow you to apply XPath expressions to a document to pick out the contents that you want.

    For example, let's try extracting a quote from an IMDB page. First, just as before, we fill in the search form.

      # turn on perl's safety features
      use strict;
      use warnings;
      # check arguments
      my $film = join " ", @ARGV
        or die "Must specify film name on command line";
      # create a new browser
      use WWW::Mechanize;
      my $browser = WWW::Mechanize->new();
      # tell it to get the main page
      # (URL assumed here: the IMDB front page)
      $browser->get("http://www.imdb.com/");
      # okay, fill in the box with the name of the
      # film we want to look up, then submit the form
      $browser->field("for", $film);
      $browser->click();

    Now IMDB normally goes straight through to the matching film page, but sometimes it can't work out what we mean and displays a collection of links to possible pages. Since we don't really know what to do in this situation, we click through on the first link we find on the page that links to a film page.

    To do this we need to inspect the links on the page. These are stored in the $browser->{links} slot in the browser's hash as an array of arrayrefs. Each of these arrayrefs has two elements - the link and the text in the link. As we're interested in the link, we look in the first entry in each of the arrayrefs.

      # check the url to see if we got the title back (we might
      # have got a page with a list of possible matches instead)
      unless ($browser->{uri} =~ /Title/)
      {
        # get all the links on the current page that link
        # to title pages
        my @links = grep { $_->[0] =~ /Title/ } @{ $browser->{links} };
        # go to the first one
        @links or die "No title links found for '$film'";
        $browser->get($links[0]->[0]);
      }
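To make the shape of that link data concrete, here is a small standalone sketch that runs the same grep over a hand-written list. The paths and titles are invented sample data, not real IMDB links; only the layout (an array of two-element arrayrefs, link first, text second) follows the description above.

```perl
use strict;
use warnings;

# invented sample data in the same shape: [ link, text ] pairs
my @all_links = (
    [ "/Help/",         "Help" ],
    [ "/Title?0000001", "Example Film (1985)" ],
    [ "/Name?Someone",  "Someone" ],
    [ "/Title?0000002", "Example Film II (1987)" ],
);

# keep only the pairs whose link (element 0) looks like a title page
my @title_links = grep { $_->[0] =~ /Title/ } @all_links;

# element 0 is the link, element 1 is the link text
printf "%s (%s)\n", $_->[0], $_->[1] for @title_links;
```

Because grep preserves order, the first element of @title_links is the first title link on the page, which is the one the script follows.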

    Once we've reached the page for the film, we simply click through to the quotes page:

      # go to the quotes section
      $browser->follow("memorable quotes");

    Right, now we need to create a parser and parse the HTML into an internal XML tree. This is fairly simple:

      # create a new parser
      use XML::LibXML;
      my $parser = XML::LibXML->new();
      # parse the data
      my $doc = $parser->parse_html_string($browser->{res}->content);
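The scripts above read the browser's state straight out of hash slots ({uri}, {links}, {res}). Later releases of WWW::Mechanize also provide accessor methods for the same information (uri, links and content), which are probably the safer choice where available, since the internal layout of the hash can change between releases. A minimal sketch, assuming a reasonably recent WWW::Mechanize; the temporary file and its one-link page are invented so the example runs without a network connection:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# write a tiny page to disk so the sketch needs no network access
my ($fh, $path) = tempfile(SUFFIX => ".html", UNLINK => 1);
print {$fh} '<html><body><a href="next.html">Next page</a></body></html>';
close $fh or die "Cannot close $path: $!";

my $browser = WWW::Mechanize->new();
$browser->get( URI::file->new_abs($path)->as_string );

# accessor methods instead of hash slots
my $url   = $browser->uri;      # where we are now
my $html  = $browser->content;  # the page source
my @links = $browser->links;    # WWW::Mechanize::Link objects

print "at $url\n";
print $_->url, " => ", $_->text, "\n" for @links;
```

With these accessors each link is an object with url and text methods rather than a two-element arrayref, so a filter on the link would read $_->url instead of $_->[0].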

    Now we can use XPath expressions to pull out various parts of the tree that we're interested in. In our case we're looking for the internal link targets that delimit the start and end of the quotes. The chunk of HTML we're hoping to extract looks something like this:

      <a name="qt0011526"></a>
      <b><a href="/Name?Zuniga,%20Daphne">Alison Bradbury</a></b>:
      Eight o'clock? 
      <b><a href="/Name?Cusack,%20John">Gib</a></b>:
      Mmm... sorry, that's when I rearrange my sock drawer.
      <hr width="30%">

    And we can find the starting <a name="..."> tags with an XPath expression like so:

      # all the quotes start with '<a name="....">' tags
      my @nodes = $doc->findnodes('//a[@name]');

    So we now have a list of nodes where each of the quotes starts. We want to process each of these nodes in turn and get the quote that follows it. Each quote ends when we find an <hr> or <p> tag, so we simply start at our starting node and keep turning the following nodes into text until we find one of those stop tags:

      # now for each one of those
      my @quotes;
      foreach my $start (@nodes)
      {
        my $string = "";
        my $node   = $start;
        # process each node until we find a hr or a p
        # (or run out of siblings)
        do
        {
          # make the node into text
          $string .= $node->textContent();
          # get the next node
          $node = $node->nextSibling();
        }
        while ($node &&
               $node->nodeName ne "hr" &&
               $node->nodeName ne "p");
        # remove excess whitespace
        $string =~ s/\s+/ /g;  # multiple spaces to one space
        $string =~ s/^\s+//;   # at start of line
        $string =~ s/\s+$//;   # at end of line
        # remember it
        push @quotes, $string;
      }
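To try the sibling walk without fetching anything, the sketch below runs it over the sample markup shown earlier. The wrapping html, body and div tags are added here only to make a complete page, and the loop is written as a for loop rather than a do/while, which is a stylistic variation rather than the form used above. It assumes XML::LibXML is installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# the sample markup from above, wrapped in a minimal page
my $html = <<'HTML';
<html><body><div>
<a name="qt0011526"></a>
<b><a href="/Name?Zuniga,%20Daphne">Alison Bradbury</a></b>:
Eight o'clock?
<b><a href="/Name?Cusack,%20John">Gib</a></b>:
Mmm... sorry, that's when I rearrange my sock drawer.
<hr width="30%">
</div></body></html>
HTML

my $doc = XML::LibXML->new()->parse_html_string($html);

my @quotes;
foreach my $start ($doc->findnodes('//a[@name]')) {
    my $string = "";
    # walk the siblings until a stop tag or the end of the parent
    for (my $node = $start; $node; $node = $node->nextSibling()) {
        last if $node->nodeName eq "hr" || $node->nodeName eq "p";
        $string .= $node->textContent();
    }
    # collapse runs of whitespace and trim both ends
    $string =~ s/\s+/ /g;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    push @quotes, $string;
}

print "$_\n" for @quotes;
```

Only the anchor with a name attribute matches the XPath expression, so the links inside the bold tags are picked up as text within the quote rather than as new starting points.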

    And that's it. @quotes now contains all the quotes for the mentioned movie.

  • WWW::Automate - the inspiration for WWW::Mechanize
  • WWW::Chat - Module that can generate simple scripts for processing data
  • Inline::Webchat - inline WWW::Chat scripts
  • LWP - basic web processing in Perl
  • XML::LibXML