
Simple PHP/cURL Web Scraper

Added 2013-01-05


The following is a guide for making a simple web scraper using PHP's cURL extension. It scrapes a page for all of its links and optionally downloads files of a certain type. This script is a starting point for more complex and powerful scraper/crawler scripts.

First, let's make a function that uses curl to retrieve the HTML for a webpage, given the page's URL:
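A minimal version looks something like this (the function name and timeout values here are just illustrative choices, not necessarily what the finished script uses):

function get_page($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);             // the page we want to fetch
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);    // give up connecting after 10 seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // give up entirely after 30 seconds
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // hand back the content instead of printing it
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}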

 
 

The above is pretty simple. The curl_setopt call is used to configure curl for what we want to do; there are a large number of settings you can read about on the curl_setopt manual page. We're just setting the URL of our page, a couple of timeouts, and the final option to return the page content as a variable instead of outputting it.

So, now that we have a function that can retrieve the HTML content of a page from the web, we're going to wrap that in a function that parses the links out of the HTML:
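Here's a sketch of what that wrapper could look like (the function name, the regex, and the relative-link handling are my stand-ins; the completed script linked below may differ in the details):

function get_links($url) {
    $html  = get_page($url);
    $links = array();

    // Grab every href, whether it's wrapped in double quotes, single quotes, or nothing at all.
    preg_match_all('/href\s*=\s*["\']?([^\s>]+)/i', $html, $matches);

    // Pieces of the source URL, used to turn relative links into full links.
    $parts = parse_url($url);
    $base  = $parts['scheme'] . '://' . $parts['host'];
    $path  = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';

    foreach ($matches[1] as $link) {
        // Clean off any trailing quotation mark the regex may have grabbed.
        $link = trim($link, "\"'");

        if (!preg_match('/^https?:\/\//i', $link)) {
            if (strpos($link, '/') === 0) {
                $link = $base . $link;               // root-relative link
            } else {
                $link = $base . $path . '/' . $link; // document-relative link
            }
        }
        $links[] = $link;
    }

    return array_unique($links);
}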

 
 

So the first thing I'm going to talk about with this function is the regular expression. It may look a little funky, but we need to remember that not all links are as clean as <a href="http://www.google.ca/"> ... there can be other attributes (a class or style, say) between the <a and the href, so we're just going to look for the hrefs lying around in the HTML. Also, the hrefs can be wrapped in double, single, or no quotation marks; this regex will catch them all. As a result, there are a couple of lines later on to clean up any trailing quotation marks we may have grabbed.

After that, we have to deal with relative links. It would be ideal for our purposes if all links were full links, but typically that isn't the case. Our curl functions need full links, so the bulk of the code in the middle of this function converts relative links to full links using the URL the function was given. Now we have a function that, in tandem with the previous one, will return all the links contained within the webpage at a given URL.

Now, to make our script a little more versatile, we're going to give it the capability to also download files from the web and save them to the local hard disk:
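Something along these lines will do it (again a sketch; the function name and the longer timeout are placeholder choices):

function download_file($url, $save_path) {
    $fp = fopen($save_path, 'w');                    // local file to write to
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 300);          // allow plenty of time for large files
    curl_setopt($ch, CURLOPT_FILE, $fp);             // write the response straight to disk
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);
}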

 
 

This is another pretty simple function that mostly just sets the curl options required to download a file; it should be self-explanatory. Now that we have all the functions for our scraper, we're going to set up some variables to serve as a basic config:
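A basic config could look like this (the variable names and defaults are illustrative; adjust them to whatever you're scraping):

$target_url     = 'http://www.example.com/'; // the page to scrape for links
$download       = false;                     // true = download matching files, false = just list links
$file_types     = array('pdf', 'zip');       // file extensions to download when $download is true
$save_dir       = './downloads/';            // where downloaded files get saved
$browser_output = true;                      // append <br /> to links for browser-friendly output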

 
 

The comments should explain each item. It would be easy to modify the script to take parameters from CLI input or GET/POST variables, depending on your requirements. This script is meant to run either via browser or CLI when aggregating links (we'll optionally output links in browser-friendly format), but when downloading files it should be run via CLI to make status updates easier and to make the script more reliable. The code later on will reflect this intention.

The next step is some control code to tie everything together:
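Roughly speaking, something like this (this sketch assumes the function and config names used above):

$links = get_links($target_url);

foreach ($links as $link) {
    // Work out the file extension, if any, so we know whether this link is a download target.
    $path = parse_url($link, PHP_URL_PATH);
    $ext  = $path ? strtolower(pathinfo($path, PATHINFO_EXTENSION)) : '';

    if ($download && in_array($ext, $file_types)) {
        // CLI mode: grab the file and report progress as we go.
        echo "Downloading $link ...\n";
        download_file($link, $save_dir . basename($path));
    } else {
        // Just list the link, in browser- or CLI-friendly form.
        echo $link . ($browser_output ? "<br />\n" : "\n");
    }
}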

 
 

Once again, fairly straightforward code. It just uses our earlier functions to get all the links from the target webpage into an array, and iterates through that array according to the behaviour we set up in the config. And there you have it: a simple web scraping script. You can download a completed copy of the code from this link.

From here you can do a lot; to extend it, all you really need to do is add some more control code. The script gets an array of all the links within a given page, so you could write some recursive code to query those pages in turn, and so forth. If you do that, you should probably set a maximum depth limit so you don't accidentally crawl the whole internet. Alternatively, or additionally, you could add a rule to stop the script from crossing into a different domain, and simply scrape an entire site. You should also populate a global hash of visited URLs so links aren't visited more than once. The only tricky part of recursive crawling is making sure the links you follow lead to actual webpages and aren't file download links. You could use an extension whitelist, HEAD requests, and/or some MIME-type checking to handle this.
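As a very rough illustration of the recursive idea (purely a sketch built on the functions above; it skips the file-type check just mentioned, and the crawl name, $visited hash, and depth limit of 3 are arbitrary choices):

$visited = array(); // hash of URLs we've already crawled

function crawl($url, $depth, $max_depth) {
    global $visited;

    if ($depth > $max_depth || isset($visited[$url])) {
        return; // too deep, or we've been here before
    }
    $visited[$url] = true;

    foreach (get_links($url) as $link) {
        // Stay on the same domain so we don't wander off across the internet.
        if (parse_url($link, PHP_URL_HOST) !== parse_url($url, PHP_URL_HOST)) {
            continue;
        }
        echo $link . "\n";
        crawl($link, $depth + 1, $max_depth);
    }
}

crawl($target_url, 0, 3);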

Combined with the file retrieval code, you could make something quite powerful. It would be simple to modify this script to hunt for image tags and turn it into a gallery scraper. If you have any specific needs and would like some help, feel free to get in touch or leave a comment below!

 

Comments


 

Phil Sturgeon - Website

2013-01-06 20:23:36

This is all correct, useful information, but would it not be easier to do all of this with two lines of code?

http://symfony.com/doc/2.0/components/css_selector.html

The components exist to save you writing all this mess up yourself. CSS selectors are nice. We use them for JavaScript, why not use them for PHP too?

 


 
Copyright © 2012 - 2013 Kevin Dawe