Quickie: Data Mining on the Web with Perl

An extremely basic example of grabbing some data

Tags: perl, quickie, scripting, linux

This is a simple web-mining script written in Perl.


#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

# Number of pages to crawl, taken from the command line.
my $numPages = $ARGV[0];

open my $output, '>', '/home/user/out.html'
    or die "Cannot open output file: $!";

for my $i (1 .. $numPages) {
    print "$i\n";    # progress indicator

    # get() returns undef if the fetch fails; skip that page.
    my $content = get("http://coderswasteland.com/node/$i");
    next unless defined $content;

    print $output $content;
    print $output "******************\n";
}

close $output;

This code is useful for pulling information from any Drupal website that uses the default /node/N URLs. It takes the number of pages to crawl as a command-line argument and uses that to increment through the site's node IDs, grabbing articles. All the content is saved to a single output file, each page separated by a row of asterisks, from which you can do whatever regex work or parsing is necessary to achieve your desired result.
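For example, here is a minimal sketch of reading the saved file back, splitting it on that asterisk separator, and pulling something out of each page. The <title> regex is purely illustrative; for real work you'd want a proper HTML parser.

#!/usr/bin/perl

use strict;
use warnings;

# Slurp the whole file written by the crawler above.
open my $in, '<', '/home/user/out.html'
    or die "Cannot open out.html: $!";
my $all = do { local $/; <$in> };
close $in;

# Split on the run of asterisks printed between pages.
my @pages = split /\*{10,}\n?/, $all;

for my $page (@pages) {
    # Illustrative only: print each page's <title>.
    if ($page =~ m{<title>(.*?)</title>}si) {
        print "$1\n";
    }
}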

I have a much longer version of this which does parsing and builds feeds. You may also wish to forgo writing to an output file and later reading it back, and instead process the information directly from $content, splitting it into logical pieces as you fetch each page. This script is merely to get you started; if you're looking into web mining, I assume you already know about parsing and related topics, and this is just a quick way to grab web content.
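If you take that in-memory route, the loop might look something like the following. The <h2> regex is a placeholder for whatever parsing you actually need.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

my $numPages = $ARGV[0];

for my $i (1 .. $numPages) {
    my $content = get("http://coderswasteland.com/node/$i");
    next unless defined $content;

    # Work on $content directly instead of writing it to a file.
    # Placeholder regex: print every <h2> heading on the page.
    while ($content =~ m{<h2[^>]*>(.*?)</h2>}gs) {
        print "node $i: $1\n";
    }
}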

A spiffier version would follow the links on each page and pop the info onto a tree, rather than auto-incrementing a URL. This is left as an exercise for the reader :)
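As a starting point for that exercise, here is a minimal queue-based sketch of the link-following idea. The starting URL, the $maxPages cap, and the naive href regex are all illustrative assumptions; a real crawler would use a proper HTML parser such as HTML::LinkExtor and check robots.txt.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

my $base     = 'http://coderswasteland.com';
my @queue    = ("$base/");
my %seen;
my $maxPages = 100;    # safety cap so the crawl terminates

while (my $url = shift @queue) {
    last if keys(%seen) >= $maxPages;
    next if $seen{$url}++;    # skip pages we've already fetched

    my $content = get($url);
    next unless defined $content;

    # Hang whatever parsing you need off $content here.

    # Naive link extraction: only follow same-site, root-relative links.
    while ($content =~ m{href="(/[^"]*)"}g) {
        push @queue, $base . $1;
    }
}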