A PHP HTML parser that lets me do class select and get parent nodes

Question

I'm scraping a website with PHP and need to select a node by its CSS class. Specifically, I need a ul tag that has no id attribute but does have a CSS class. From that ul I then need only the li tags that contain specific anchor tags, not all of the li tags.

I've looked at DOMDocument and Zend_Dom, and neither meets both requirements: selecting by class and DOM traversal (specifically, ascending to parent nodes).

Answer 1:

You could use QueryPath, and then something like this might work:

htmlqp($html)->find("ul.class")->not("#id")
             ->find('li a[href*="specific"]')->parent();
// then foreach over the result, or use ->writeHTML() for extraction

See http://api.querypath.org/docs/class_query_path.html for the API.

(Traversing is much easier if you don't use the fiddly DOMDocument directly.)

Answer 2:

You can do this with DOMDocument and DOMXPath. Selecting by class in XPath is a pain, but it can be done.

Here is some sample (and totally valid!) HTML:

$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<title>Document Title</title>
<ul id="myid"><li>myid-listitem1</ul>
<ul class="foo 
theclass
"><li>list2-item1<li>list2-item2</ul>
<ul id="myid2" class="foo&#xD;theclass bar"><li>list3-item1<li>list3-item2</ul>
EOT
;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$nodes = $xp->query("/html/body//ul[not(@id) and contains(concat(' ',normalize-space(@class),' '), ' theclass ')]");

var_dump($nodes->length);
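
The same approach can be pushed further to cover the rest of the question: keeping only the li elements that contain particular anchors, and walking back up to the parent. The following is a minimal sketch, not a drop-in solution. The markup, the class name, and the href substring "specific" are assumptions made up for illustration; only DOMDocument, DOMXPath, and the standard parentNode property are used.

```php
<?php
// Hypothetical markup: a ul with no id but with a class, containing
// some li elements whose anchors point at "specific" URLs.
$html = <<<EOT
<!DOCTYPE html>
<html><body>
<ul id="nav"><li><a href="/specific/0">skip me</a></li></ul>
<ul class="foo theclass">
  <li><a href="/specific/1">one</a></li>
  <li><a href="/other/2">two</a></li>
  <li><a href="/specific/3">three</a></li>
</ul>
</body></html>
EOT;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Class test as a whitespace-separated token, same idea as the query above.
$classTest = "contains(concat(' ', normalize-space(@class), ' '), ' theclass ')";

// Select the anchors first, then ascend to their li and ul parents.
$anchors = $xp->query("//ul[not(@id) and $classTest]/li/a[contains(@href, 'specific')]");

$texts = array();
$ul = null;
foreach ($anchors as $a) {
    $li = $a->parentNode;        // ascend: a -> li
    $ul = $li->parentNode;       // ascend: li -> ul
    $texts[] = trim($li->textContent);
}

print_r($texts);
echo $ul->getAttribute('class'), "\n";
```

An equivalent query could select the li elements directly with a nested predicate such as li[a[contains(@href, 'specific')]], or ascend inside XPath itself with the parent:: or ancestor:: axes.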

If you are using PHP 5.3 or later, you can simplify this a bit by registering a PHP function for use in XPath expressions. (XSLTProcessor has allowed registered PHP functions in XPath expressions since PHP 5.1, but DOMXPath did not support this directly before 5.3.)

function hasToken($nodearray, $token) {
    // An empty node-set (no class attribute at all) must not count as a match.
    if (!$nodearray) {
        return false;
    }
    foreach ($nodearray as $node) {
        if ($node->nodeValue === null || !hasTokenS($node->nodeValue, $token)) {
            return false;
        }
    }
    return true;
    // I could even return nodes or document fragments if I wanted!
}
function hasTokenS($str, $token) {
    $str = trim($str, "\r\n\t ");
    $tokens = preg_split('/[\r\n\t ]+/', $str);
    return in_array($token, $tokens);
}

$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions(array('hasToken', 'hasTokenS'));

// These two are equivalent:
$nodes1 = $xp->query("/html/body//ul[not(@id) and php:function('hasToken', @class, 'theclass')]");
$nodes2 = $xp->query("/html/body//ul[not(@id) and php:functionString('hasTokenS', @class, 'theclass')]");

var_dump($nodes1->length);
var_dump($nodes1->item(0));
var_dump($nodes2->length);
var_dump($nodes2->item(0));
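
To see why the class test is done on whitespace-separated tokens rather than with a plain substring search, here is a small standalone check. It is only a sketch: the helper name classHasToken is hypothetical, and the sample strings imitate the awkward class attributes in the HTML above (a newline inside the attribute, a carriage return from the &#xD; reference).

```php
<?php
// Hypothetical helper: true if $token is one of the whitespace-separated
// tokens in $classAttr. Uses the same whitespace set as hasTokenS.
function classHasToken($classAttr, $token) {
    $tokens = preg_split('/[\r\n\t ]+/', $classAttr, -1, PREG_SPLIT_NO_EMPTY);
    return in_array($token, $tokens, true);
}

$samples = array(
    "foo \ntheclass\n",   // class attribute split across lines
    "foo\rtheclass bar",  // carriage return as the separator
    "theclassic",         // contains the substring, but is a different token
    "",                   // empty class attribute
);

foreach ($samples as $s) {
    var_export(classHasToken($s, 'theclass'));
    echo "\n";
}
```

A naive strpos($classAttr, 'theclass') check would wrongly accept "theclassic", which is the same trap the concat/normalize-space construction avoids in pure XPath.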

If DOMDocument is just not parsing your HTML very well, you can use the html5lib parser, which will return a DOMDocument:

require_once('lib/HTML5/Parser.php'); // or wherever you put it
$dom = HTML5_Parser::parse($html);
// $dom is a plain DOMDocument object, created according to html5 parsing rules

Answer 3:

I've had good luck with: http://simplehtmldom.sourceforge.net/

Source: https://stackoverflow.com/questions/8584554/a-php-html-parser-that-lets-me-do-class-select-and-get-parent-nodes

Tags

php

html

screen-scraping