Question
I'm scraping a website with PHP and need to select a node based on its CSS class. Specifically, I need a ul tag that has no id attribute but does have a CSS class. I then need only the li tags inside it that contain specific anchor tags, not all of the li tags.
I've looked at DOMDocument and Zend_Dom, but neither seems to meet both requirements: class selection and DOM traversal (specifically, ascending to parents).
Answer 1:
You could use QueryPath; something like this might work:
htmlqp($html)->find("ul.class")->not("#id")
->find('li a[href*="specific"]')->parent()
// then foreach over it or use ->writeHTML() for extraction
See http://api.querypath.org/docs/class_query_path.html for the API.
(Traversing is much easier if you don't use the fiddly DOMDocument directly.)
Answer 2:
You can do this with DOMDocument and DOMXPath. Selecting by class in XPath is a pain, but it can be done.
Here is some sample (and totally valid!) HTML:
$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<title>Document Title</title>
<ul id="myid"><li>myid-listitem1</ul>
<ul class="foo
theclass
"><li>list2-item1<li>list2-item2</ul>
<ul id="myid2" class="foo
theclass bar"><li>list3-item1<li>list3-item2</ul>
EOT
;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$nodes = $xp->query("/html/body//ul[not(@id) and contains(concat(' ',normalize-space(@class),' '), ' theclass ')]");
var_dump($nodes->length);
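The concat/normalize-space construction is needed because @class is a whitespace-separated list of tokens, and a plain contains(@class, 'theclass') would also match class="nottheclass". Here is a minimal sketch of the difference. classPredicate is a hypothetical helper name (not part of DOMXPath), the markup is invented for illustration, and the class name is assumed to contain no quotes:

```php
<?php
// Hypothetical helper: builds an XPath predicate that matches one class token.
// Assumes $class contains no single quotes.
function classPredicate($class) {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

$doc = new DOMDocument();
$doc->loadHTML('<ul class="nottheclass"><li>a</li></ul><ul class="foo theclass"><li>b</li></ul>');
$xp = new DOMXPath($doc);

// Naive substring test: also matches the unrelated class "nottheclass".
$naive = $xp->query("//ul[contains(@class, 'theclass')]");
// Token test: matches only the list whose class list contains the token "theclass".
$exact = $xp->query("//ul[" . classPredicate('theclass') . "]");

var_dump($naive->length); // int(2)
var_dump($exact->length); // int(1)
```

Padding both the attribute value and the search token with spaces is what turns a substring test into a whole-token test.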
If you are using PHP 5.3 or later, you can simplify this a bit by registering a PHP function for use in XPath. (Note that XSLTProcessor has let you register functions for use in XPath expressions since PHP 5.1, but DOMXPath did not support this directly before 5.3.)
// Called via php:function(); receives the matched attribute nodes as an array.
function hasToken($nodearray, $token) {
    foreach ($nodearray as $node) {
        if ($node->nodeValue === null || !hasTokenS($node->nodeValue, $token)) {
            return false;
        }
    }
    return true;
    // I could even return nodes or document fragments if I wanted!
}
// Called via php:functionString(); receives the attribute value as a string.
function hasTokenS($str, $token) {
    $str = trim($str, "\r\n\t ");
    $tokens = preg_split('/[\r\n\t ]+/', $str);
    return in_array($token, $tokens);
}
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions(array('hasToken', 'hasTokenS'));
// These two are equivalent:
$nodes1 = $xp->query("/html/body//ul[not(@id) and php:function('hasToken', @class, 'theclass')]");
$nodes2 = $xp->query("/html/body//ul[not(@id) and php:functionString('hasTokenS', @class, 'theclass')]");
var_dump($nodes1->length);
var_dump($nodes1->item(0));
var_dump($nodes2->length);
var_dump($nodes2->item(0));
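The question also asked for only those li elements that contain particular anchors, and for a way to get back up to a parent. Both seem doable with the same tools: a nested predicate restricts the li elements, and every DOMNode exposes parentNode. A minimal sketch, assuming the anchors of interest can be recognised by a substring of their href (the markup and the 'specific' substring are invented for illustration):

```php
<?php
// Sample markup invented for illustration.
$html = <<<EOT
<ul class="foo theclass">
  <li><a href="/specific/1">wanted 1</a></li>
  <li><a href="/other/2">unwanted</a></li>
  <li>no link</li>
  <li><a href="/specific/3">wanted 2</a></li>
</ul>
EOT;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Same class test as above: a ul without an id whose class list contains "theclass".
$ul = "//ul[not(@id) and contains(concat(' ', normalize-space(@class), ' '), ' theclass ')]";

// Descending: only the li children that contain a matching anchor.
$items = $xp->query($ul . "/li[a[contains(@href, 'specific')]]");
foreach ($items as $li) {
    echo trim($li->textContent), "\n";
}

// Ascending: start from the anchors and walk up with parentNode.
$anchors = $xp->query($ul . "//a[contains(@href, 'specific')]");
foreach ($anchors as $a) {
    $li = $a->parentNode;    // the enclosing li
    $list = $li->parentNode; // the enclosing ul
    echo $li->nodeName, ' inside ', $list->nodeName, "\n";
}
```

Inside the XPath expression itself, the parent:: axis (or the .. shorthand) does the same ascent, so appending /parent::li to the anchor query would return the li elements directly.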
If DOMDocument is just not parsing your HTML very well, you can use the html5lib parser, which will return a DOMDocument:
require_once('lib/HTML5/Parser.php'); // or wherever you put it
$dom = HTML5_Parser::parse($html);
// $dom is a plain DOMDocument object, created according to HTML5 parsing rules
Answer 3:
I've had good luck with: http://simplehtmldom.sourceforge.net/
Source: https://stackoverflow.com/questions/8584554/a-php-html-parser-that-lets-me-do-class-select-and-get-parent-nodes