cURL Page Scraping Script

Using cURL to scrape pages for specific data is one of the most important things I do when building databases. And I’m not just talking about scraping pages and reposting them, either.

You can use cURL to grab the HTML of any publicly viewable page on the web and then, most importantly, take that data and pick out the bits you need. This is the basis for link analysis scripts, training scripts and compiling databases from sources around the web; there are almost limitless things you can do.

I’m providing a simple PHP class here, which uses cURL to grab a page and then pulls out any information between user-specified tags into an array. For instance, in our example you can grab all of the links from any web page.

The class is quite simple. I had to get rid of the lovely indentation to make it fit nicely onto the blog, but it’s fairly well commented.

In a nutshell, it does this:

1) Goes to specified URL

2) Uses cURL to grab the HTML of the URL

3) Takes the HTML and scans for every instance of the start and end tags you provide (e.g. <a> and </a>)

4) Returns these in an array for you.
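Step 3 is the heart of it, so here is a minimal sketch of the matching on its own, with no network involved. The function name grabBetween and the sample markup are made up for illustration; they are not part of the class.

```php
<?php
// Hypothetical helper (not part of the class): return every chunk of $html
// that starts with $begTag and ends with the nearest following $closeTag.
function grabBetween($html, $begTag, $closeTag)
{
    // preg_quote() makes the tags literal. Flags: s = dot also matches
    // newlines, i = case-insensitive, U = take the shortest possible match.
    $pattern = '/' . preg_quote($begTag, '/') . '.*' . preg_quote($closeTag, '/') . '/siU';
    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the full text of each match
    return $matches[0];
}

// Made-up sample markup standing in for a downloaded page
$sample = '<p><a href="/one">One</a> and <A HREF="/two">Two</A></p>';
print_r(grabBetween($sample, '<a href=', '</a>'));
```

Because of the U flag each match stops at the first closing tag it finds, so the two links come back as two separate array entries rather than one long string.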

Download taggrab.class.zip

<?php

class tagSpider
{

// set variable to hold curl instance
var $ch;

// this is where we dump the html we get
var $html; 

// set for binary type transfer
var $binary; 

// this is the url we are going to do a pass on
var $url;

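// constructor: automatically executed when the class is instantiated, to clear variables
// (PHP 8 no longer treats a method named after the class as the constructor)
function __construct()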
{
$this->html = "";
$this->binary = 0;
$this->url = "";
}

// takes url passed to it and.. can you guess?
function fetchPage($url)
{

// set the URL to scrape
$this->url = $url;

if (!empty($this->url)) {

// start cURL instance
$this->ch = curl_init ();

// this tells cUrl to return the data
curl_setopt ($this->ch, CURLOPT_RETURNTRANSFER, 1);

// set the url to download
curl_setopt ($this->ch, CURLOPT_URL, $this->url); 

// follow redirects if any
curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true); 

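// tell cURL if the data is binary data or not
// (from PHP 5.1.3 this option has no effect; raw output is always returned with CURLOPT_RETURNTRANSFER)
curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary); 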

// grabs the webpage from the internets
$this->html = curl_exec($this->ch); 

// closes the connection
curl_close ($this->ch);
}

}

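// function takes html, puts the data requested into an array
function parse_array($beg_tag, $close_tag)
{
// escape the tags so characters like ? or . are matched literally
$beg = preg_quote($beg_tag, '/');
$end = preg_quote($close_tag, '/');

// match data between specified tags
// (s: dot matches newlines, i: ignore case, U: shortest match)
preg_match_all("/$beg.*$end/siU", $this->html, $matching_data); 

// return data in array
return $matching_data[0];
}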

}
?>

So that is your basic class, which should be fairly easy to follow (you can ask questions in comments if needed).
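One thing the class does not do is check whether the download actually worked. curl_exec() returns false on failure, so a more defensive fetch might look like the sketch below. The function name fetchHtmlOrNull and the 10 second timeout are my own assumptions for illustration, not something the class defines.

```php
<?php
// Hypothetical standalone function, not a method of tagSpider.
// Returns the page HTML as a string, or null if anything goes wrong.
function fetchHtmlOrNull($url, $timeoutSeconds = 10)
{
    if (!is_string($url) || $url === '') {
        return null; // nothing to fetch
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the data instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects if any
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds); // assumed limit, tune to taste

    $html = curl_exec($ch);
    if ($html === false) {
        // curl_error() describes what went wrong with the last transfer
        error_log('cURL failed for ' . $url . ': ' . curl_error($ch));
        return null;
    }

    // the handle is released when $ch goes out of scope
    return $html;
}
```

Returning null keeps the calling code simple: check for null before parsing, and skip or retry that URL.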

To use this, we call it from another PHP file and pass in the variables it needs.

Below is tag-example.php, which demonstrates how to pass the URL and the start/end tag variables to the class and pump out a set of results.

Download tag-example.zip

<?php

// Include our tag grab class
require("taggrab.class.php"); // class for spider

// Enter the URL you want to run
$urlrun="http://www.techcrunch.com/";

// Specify the start and end tags you want to grab data between
$stag="<a href=";
$etag="</a>";

// Make a title spider
$tspider = new tagSpider();

// Pass URL to the fetch page function
$tspider->fetchPage($urlrun);

// Enter the tags into the parse array function
$linkarray = $tspider->parse_array($stag, $etag); 

echo "<h2>Links present on page: ".$urlrun."</h2><br />";
// Loop to pump out the results
foreach ($linkarray as $result) {

echo $result;

echo "<br/>";
}

?>

So this code will pass the TechCrunch website to the class, looking for any standard a href links. It will then simply echo these out. You could use this in conjunction with the SearchStatus Firefox plugin to quickly see which links TechCrunch is showing bots, and which of them are followed and which are nofollowed.
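If you would rather sort followed from nofollowed links in code, here is a rough sketch. It swaps the regex approach for PHP's DOMDocument parser, which is a different technique from the class above, and the function name splitLinksByNofollow and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: split the links found in $html into followed and nofollowed.
function splitLinksByNofollow($html)
{
    $links = array('follow' => array(), 'nofollow' => array());
    if (!is_string($html) || trim($html) === '') {
        return $links; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings quietly
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href === '') {
            continue; // named anchors and the like
        }
        // rel can hold several space-separated values, e.g. "external nofollow"
        $rel = preg_split('/\s+/', strtolower(trim($anchor->getAttribute('rel'))));
        $key = in_array('nofollow', $rel, true) ? 'nofollow' : 'follow';
        $links[$key][] = $href;
    }
    return $links;
}

// Made-up sample markup
$sample = '<a href="/about">About</a> <a rel="external nofollow" href="http://example.com/">Ad</a>';
print_r(splitLinksByNofollow($sample));
```

You could feed it the class output with something like splitLinksByNofollow(implode('', $linkarray)), since each array entry is a complete link.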

You can view a working example of the code here.

As I said, there’s so much you can do from a base like this, so have a think. I might post some proper tutorials on extracting data methodically, saving it to a database, then manipulating it to get some interesting results.

Enjoy.

Edit: You’ll of course need the cURL library installed on your server, with PHP’s cURL extension enabled, for this to work!
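A quick way to check, as a minimal sketch (curlIsAvailable is a made-up name):

```php
<?php
// Returns true when PHP's cURL extension is loaded and usable
function curlIsAvailable()
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (!curlIsAvailable()) {
    echo "The cURL extension is not loaded, so the class will not work on this server.\n";
}
```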
Source: http://www.digeratimarketing.co.uk/2008/12/16/curl-page-scraping-script/