JS_Extractor! And the death of Table Extractor
- Posted: 10 February 2008
- Tagged: JS_Extractor, PHP
So, it’s been a long time since I wrote (or even looked at) Table Extractor, and almost as soon as I wrote it I knew there were a lot of problems. For a start:
- It only worked with tables
- It didn’t really do that properly, or at least reliably
- It was a horrible mess of hacky code designed to work around hacky HTML
- It was written for PHP 4, pah! Seriously, no one can still be using that, can they?
Despite all of these problems, it was surprisingly popular, and I still regularly get emails asking how to use it, suggesting new features or reporting bugs. I appreciate the time people have taken, but I just couldn’t do anything about them.
Anyway, that’s in the past, and today I’m releasing the first (beta) version of JS_Extractor, a brand new, completely reworked in every conceivable way, class library, designed for extracting data from HTML documents. And when I say data, I mean any data, not just tables.
Before I get into the examples I want to explain the new approach I’ve taken, and the various aspects of the new extractor. If you don’t care about that and just want to get your hands dirty then head down to the examples, but don’t complain to me if you don’t get it.
DOM Extension and XPath
JS_Extractor is actually an extension of the PHP DOM extension, and you can therefore use all of the DOM methods with any JS_Extractor or JS_Extractor_Element object. If you don’t know or have never used the DOM extension I seriously suggest you take a quick look over the documentation, just so you’re aware of what’s possible.
The second important aspect of JS_Extractor is that it uses XPath, a lot, and adds one very useful method to the vanilla DOM classes, query(). This method is really nothing more than a wrapper for a new DOMXPath object, but makes it much more convenient to run XPath queries on an element. For example, rather than this:
$xpath = new DOMXPath($doc);
$nodes = $xpath->query($expression, $element);
You can do this:
$nodes = $element->query($expression);
Which also allows you to easily chain queries together. (Yes, I stole this idea from SimpleXML.)
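For the curious, here is a rough sketch of how a query() method like this can be bolted onto the DOM classes using DOMDocument::registerNodeClass(), which is part of the standard DOM extension. Sketch_Element is a made-up name for illustration, and this is not the JS_Extractor source, just one plausible way to get the same convenience:

```php
<?php
// Sketch only: Sketch_Element is a hypothetical name, not the JS_Extractor source.
class Sketch_Element extends DOMElement
{
    // Run an XPath expression with this element as the context node.
    public function query($expression)
    {
        $xpath = new DOMXPath($this->ownerDocument);
        return $xpath->query($expression, $this);
    }
}

$doc = new DOMDocument();
// Ask the DOM extension to hand back Sketch_Element wherever it would create a DOMElement.
$doc->registerNodeClass('DOMElement', 'Sketch_Element');
$doc->loadHTML('<html><body><ul id="list"><li>One</li><li>Two</li></ul></body></html>');

// Chained queries: find the list first, then its items relative to it.
$items = $doc->documentElement->query("//ul[@id='list']")->item(0)->query('li');
foreach ($items as $li) {
    echo $li->textContent, "\n";
}
```

Because every element in the document comes back as the subclass, the result of item() can be queried again straight away, which is what makes the chaining work.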
I’m not going to cover the details of XPath here, so if you’re not familiar with it already take a look at the syntax guide from W3Schools (good for beginners) or the full XPath spec. You should find the examples below fairly self-explanatory though.
The DOM extension’s ability to parse (even dirty) HTML, its XPath support and the new query() method are the heart of JS_Extractor, and give you a lot of power even without the specific “extractor” methods. For example, you could get every link on the page and then echo the href attribute values as simply as:
$extractor = new JS_Extractor(file_get_contents('sample.html'));
foreach ($extractor->query("//a") as $link) {
    echo $link->getAttribute('href');
}
Easy! Of course, the additional extractor methods make this even easier, and make other, more complicated problems easy as well.
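To see the “dirty HTML” part in isolation, here is a small sketch using only the standard DOM extension, with no JS_Extractor involved. The markup is invented for the example (unclosed paragraphs, an unquoted attribute, a stray closing tag), and libxml_use_internal_errors() is used so the parser’s complaints are collected quietly rather than printed:

```php
<?php
// Deliberately sloppy markup: no <html>/<body>, unclosed <p>, unquoted href, stray </div>.
$html = '<p>First<p>Second <a href="/one">One</a> and <a href=/two>Two</a></div>';

// Collect parser warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser still builds a usable tree, so XPath works as normal.
$xpath = new DOMXPath($doc);
$hrefs = array();
foreach ($xpath->query('//a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```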
Utility Methods
There are also methods available for “tidying up” the data before you extract it. There’s actually only one of these right now, and that’s splitCells(), which applies to table, thead, tfoot, and tbody elements. This method will scan through the table cells and split any with a colspan or rowspan attribute, duplicating the content for each. This is essential for retrieving tabular data in a simple two-dimensional structure.
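To make the idea of splitting and duplicating concrete, here is a rough sketch using the plain DOM extension. It is not how splitCells() is implemented: sketch_table_to_grid() is a hypothetical helper that builds a two-dimensional array instead of rewriting the DOM, and it assumes a single table with no nested tables. The sample markup is made up for the example:

```php
<?php
// Hypothetical helper, not part of JS_Extractor: it builds an array
// rather than rewriting the DOM the way splitCells() does.
function sketch_table_to_grid(DOMElement $table)
{
    $grid = array();
    $r = 0;
    foreach ($table->getElementsByTagName('tr') as $row) {
        $c = 0;
        foreach ($row->childNodes as $cell) {
            if ($cell->nodeName !== 'td' && $cell->nodeName !== 'th') {
                continue; // skip whitespace text nodes and anything else
            }
            // Skip slots already filled by a rowspan from an earlier row.
            while (isset($grid[$r][$c])) {
                $c++;
            }
            $colspan = max(1, (int) $cell->getAttribute('colspan'));
            $rowspan = max(1, (int) $cell->getAttribute('rowspan'));
            $text = trim($cell->textContent);
            // Duplicate the text into every slot the cell covers.
            for ($dr = 0; $dr < $rowspan; $dr++) {
                for ($dc = 0; $dc < $colspan; $dc++) {
                    $grid[$r + $dr][$c + $dc] = $text;
                }
            }
            $c += $colspan;
        }
        $r++;
    }
    // Slots can be filled out of order, so sort each row by column index.
    foreach ($grid as $i => $line) {
        ksort($line);
        $grid[$i] = $line;
    }
    ksort($grid);
    return $grid;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><table>
<tr><td>A</td><td colspan="2" rowspan="2">A</td><td>A</td></tr>
<tr><td>B</td><td>B</td></tr>
<tr><td colspan="4">C</td></tr>
</table></body></html>');

$grid = sketch_table_to_grid($doc->getElementsByTagName('table')->item(0));
foreach ($grid as $line) {
    echo implode(' ', $line), "\n";
}
// A A A A
// B A A B
// C C C C
```

The second row shows why the duplication matters: without it that row would only have two cells, and the columns would no longer line up.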
extract()
This does all the magic. Actually, it’s really nothing more than a convenience: you can do everything this does with the standard DOM methods, but why make things more complicated than they need to be? It is primarily aimed at extracting either the text within elements, in a hierarchical structure you define, or a specific attribute from a number of elements. This covers the most common uses, and anything more complicated can be achieved with the DOM and query() methods.
Examples
All of these examples are run on this sample data.
Before you start you’ll need to add the library path to your PHP include_path and then include the Extractor.php class file, by doing something like this:
set_include_path(get_include_path() . PATH_SEPARATOR . './library/');
require_once 'JS/Extractor.php';
Right, first we need to create the extractor object. The constructor requires a string of HTML, so use file_get_contents(), or another function to retrieve the contents of a local or remote file if you need to:
$extractor = new JS_Extractor($html); // or
$extractor = new JS_Extractor(file_get_contents('sample.html')); // or
$extractor = new JS_Extractor(file_get_contents('http://example.com/'));
Next I’m going to retrieve the body element of the document. This is necessary because of the way the extension of the DOM classes works: the utility and extract() methods are not available on the JS_Extractor object, only on JS_Extractor_Element objects.
$body = $extractor->query("body")->item(0);
As you can see, the query() method returns a DOMNodeList rather than a single element, so if you only need one element you have to call the item() method.
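One thing worth knowing is that DOMNodeList::item() returns null when there is no node at that index, so calling a method on the result of a query that matched nothing will fail. A minimal sketch of a guard, using plain DOMXPath; sketch_first() is a hypothetical helper for illustration, not something JS_Extractor provides:

```php
<?php
// Hypothetical helper: return the first match for an XPath expression, or null.
function sketch_first(DOMXPath $xpath, $expression, ?DOMNode $context = null)
{
    $nodes = $xpath->query($expression, $context);
    // query() returns false for a malformed expression; item() returns null past the end.
    return ($nodes === false) ? null : $nodes->item(0);
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Hello</p></body></html>');
$xpath = new DOMXPath($doc);

$body = sketch_first($xpath, '/html/body');
$table = sketch_first($xpath, './/table', $body);

echo $body->nodeName, "\n";
echo ($table === null) ? "no table found\n" : "table found\n";
```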
Tables
Now we grab the first table in the body, like this:
$table = $body->query("//table")->item(0);
As far as selecting the right table goes, you’re limited only by what you can do with XPath, which is pretty much anything. I’ll cover some more examples on selecting elements later on.
Before we start extracting data from this table we need to clean it up, because it has colspans and rowspans, which need splitting and duplicating in order to create a simple two-dimensional structure. This is as easy as:
$table->splitCells();
Right, let’s extract all the cell data from the rows in the tbody, grouped by row:
$data = $table->extract(array("tbody/tr", "td"));
As you can see, the first argument of the extract() method is an array of XPath expressions. These define the hierarchical structure of the array that is returned. What you’re saying here is: get all the tr elements from the tbody element, and then get the text from all the td elements within those. This will return something like:
array
  0 =>
    array
      0 => string 'A' (length=1)
      1 => string 'A' (length=1)
      2 => string 'A' (length=1)
      3 => string 'A' (length=1)
  1 =>
    array
      0 => string 'B' (length=1)
      1 => string 'A' (length=1)
      2 => string 'A' (length=1)
      3 => string 'B' (length=1)
  2 =>
    array
      0 => string 'C' (length=1)
      1 => string 'C' (length=1)
      2 => string 'C' (length=1)
      3 => string 'C' (length=1)
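If it helps to picture how a list of expressions turns into nested arrays, here is a minimal sketch of the idea using plain DOMXPath and recursion. sketch_extract() is a hypothetical stand-in written for this example; the real extract() may well work differently inside, and this version only handles a flat list of string expressions:

```php
<?php
// Hypothetical stand-in for extract(); the real method may behave differently.
function sketch_extract(DOMXPath $xpath, DOMNode $context, array $expressions)
{
    // Take the expression for this level; what remains describes the deeper levels.
    $expression = array_shift($expressions);
    $result = array();
    foreach ($xpath->query($expression, $context) as $node) {
        $result[] = $expressions
            ? sketch_extract($xpath, $node, $expressions) // descend one level
            : trim($node->textContent);                   // last level: take the text
    }
    return $result;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><table><tbody>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
</tbody></table></body></html>');
$xpath = new DOMXPath($doc);
$table = $doc->getElementsByTagName('table')->item(0);

$data = sketch_extract($xpath, $table, array('tbody/tr', 'td'));
print_r($data);
```

Each expression is evaluated relative to the nodes matched by the one before it, which is why "td" on its own is enough at the second level.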
Great! Now how about all trs rather than just the ones in the tbody? You could do:
$data = $table->extract(array(".//tr", "td"));
But you’ll run into a problem, because the thead contains th elements rather than tds, so instead we need to do:
$data = $table->extract(array(".//tr", "th|td"));
But now we have another problem. We have all the rows, but no idea which ones came from the thead, tbody and tfoot, which is kinda important. The first thing we need to do is separate the tr expression into individual parts for the thead, tbody and tfoot, like so:
$data = $table->extract(array(
    array("thead/tr", "tbody/tr", "tfoot/tr"),
    "th|td",
));
Then we need to name these parts so that when the array comes back the rows are grouped into the right section:
$data = $table->extract(array(
    array('head' => "thead/tr", 'foot' => "tfoot/tr", 'body' => "tbody/tr"),
    "th|td",
));
This will give you something like:
array
  'head' =>
    array
      0 =>
        array
          0 => string 'H' (length=1)
          1 => string 'H' (length=1)
          2 => string 'H' (length=1)
          3 => string 'H' (length=1)
  'foot' =>
    array
      0 =>
        array
          0 => string 'F' (length=1)
          1 => string 'F' (length=1)
          2 => string 'F' (length=1)
          3 => string 'F' (length=1)
  'body' =>
    array
      0 =>
        array
          0 => string 'A' (length=1)
          1 => string 'A' (length=1)
          2 => string 'A' (length=1)
          3 => string 'A' (length=1)
      1 =>
        array
          0 => string 'B' (length=1)
          1 => string 'A' (length=1)
          2 => string 'A' (length=1)
          3 => string 'B' (length=1)
      2 =>
        array
          0 => string 'C' (length=1)
          1 => string 'C' (length=1)
          2 => string 'C' (length=1)
          3 => string 'C' (length=1)
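For comparison, this is roughly what the same named grouping looks like when spelled out by hand with the standard DOM extension. It is a sketch on invented markup, not the library’s code, but it shows how much looping the named array saves you:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><table>
<thead><tr><th>H1</th><th>H2</th></tr></thead>
<tfoot><tr><td>F1</td><td>F2</td></tr></tfoot>
<tbody><tr><td>A1</td><td>A2</td></tr><tr><td>B1</td><td>B2</td></tr></tbody>
</table></body></html>');
$xpath = new DOMXPath($doc);
$table = $doc->getElementsByTagName('table')->item(0);

// One named XPath expression per section of the table.
$sections = array('head' => 'thead/tr', 'foot' => 'tfoot/tr', 'body' => 'tbody/tr');

$data = array();
foreach ($sections as $name => $rowExpression) {
    $data[$name] = array();
    foreach ($xpath->query($rowExpression, $table) as $row) {
        $cells = array();
        // "th|td" matches header and data cells alike, in document order.
        foreach ($xpath->query('th|td', $row) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        $data[$name][] = $cells;
    }
}
print_r($data);
```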
And that’s it! Of course this is in no way limited to tables, and below are some further examples of extracting data from other elements:
Lists (ul, ol)
Here we get the ul element with the id “list”, and then extract the text from each li:
$list = $body->query("//ul[@id='list']")->item(0);
$data = $list->extract("li");
Custom Markup
You can even extract data from custom markup based on divs and spans (or any other element type):
$data = $body->extract(array(
    "div[@class='article']",
    array('title' => "h2", 'date' => "span[@class='date']", 'body' => "p"),
));
What you’re doing here is getting all the div elements with a class of “article”, and then extracting the title, date and body text from the relevant elements.
Attribute Data
The attribute extraction method returns the value of a specified attribute, rather than the element’s text content. Although not demonstrated here, the hierarchical structure feature works in exactly the same way with the attribute extraction. Here we’re going to get all the href values of all a elements in the body:
$urls = $body->extract(".//a", JS_Extractor::EXTRACT_ATTRIBUTE, 'href');
Or alternatively, all href values of any element in the body that has an href attribute:
$urls = $body->extract(".//*[@href]", JS_Extractor::EXTRACT_ATTRIBUTE, 'href');
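As a point of reference, the second query can be reproduced with nothing but the standard DOM extension. The sketch below uses made-up markup and plain DOMXPath; note the leading dot in the expression, which keeps the search inside the context node, so the link element in the head is not picked up:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><head><link rel="stylesheet" href="/style.css"></head>
<body><a href="/one">One</a><a name="anchor">No href</a><a href="/two">Two</a></body></html>');
$xpath = new DOMXPath($doc);
$body = $doc->getElementsByTagName('body')->item(0);

$urls = array();
// ".//*[@href]": any descendant of the context node that has an href attribute.
foreach ($xpath->query('.//*[@href]', $body) as $element) {
    $urls[] = $element->getAttribute('href');
}
print_r($urls);
```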
That covers everything I wanted to demonstrate today. I plan to post further, more specific examples in the future, so if you have any requests let me know.
Download
Please bear in mind that this is a beta version, and therefore the API may change in future releases.
I will also be posting the full API documentation in the near future.
What about Table Extractor?
Well, this post marks the death of Table Extractor; anyone using it should switch to JS_Extractor. If you’re on PHP 4 then upgrade. Seriously, there’s no good reason not to, and many reasons to do so. If you don’t have the DOM extension enabled then enable it. I mean come on, it comes with PHP and is enabled by default anyway.
