A PHP HTML parser that lets me do class select and get parent nodes

[亡魂溺海] submitted on 2019-12-12 03:03:56

Question


I'm scraping a website with PHP and need to select a node by its CSS class. Specifically, I need a ul tag that has no id attribute but does have a CSS class. I then need only the li tags inside it that contain specific anchor tags, not all of the li tags.

I've looked at DOMDocument and Zend_Dom, but neither seems to meet both requirements: selection by class and DOM traversal (specifically, ascending to parent nodes).


Answer 1:


You could use querypath and then something like this might work:

htmlqp($html)->find("ul.class")->not("#id")
             ->find('li a[href*="specific"]')->parent()
// then foreach over it or use ->writeHTML() for extraction

See http://api.querypath.org/docs/class_query_path.html for the API.

(Traversing is much easier if you don't use the fiddly DOMDocument.)




Answer 2:


You can do this with DOMDocument and DOMXPath. Selecting by class in XPath is a pain, but it can be done.

Here is some sample (and totally valid!) HTML:

$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<title>Document Title</title>
<ul id="myid"><li>myid-listitem1</ul>
<ul class="foo 
theclass
"><li>list2-item1<li>list2-item2</ul>
<ul id="myid2" class="foo&#xD;theclass bar"><li>list3-item1<li>list3-item2</ul>
EOT
;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$nodes = $xp->query("/html/body//ul[not(@id) and contains(concat(' ',normalize-space(@class),' '), ' theclass ')]");

var_dump($nodes->length);
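
To cover the rest of the question (only the li elements that contain particular anchors, plus ascending to a parent), here is a minimal sketch along the same lines. The sample markup, the class name theclass, and the href fragment 'specific' are assumptions for illustration, not values from the real site.

```php
<?php
// Hypothetical markup: "theclass" and the "specific" href fragment are placeholders.
$html = <<<EOT
<!DOCTYPE html>
<html><head><title>Demo</title></head><body>
<ul id="nav"><li><a href="/specific/0">skip-has-id</a></li></ul>
<ul class="foo theclass">
<li><a href="/specific/1">keep-1</a></li>
<li><a href="/other/2">skip-no-match</a></li>
<li><span><a href="/x/specific/3">keep-2</a></span></li>
</ul>
</body></html>
EOT;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Build the class-token test once so the query stays readable.
$hasClass = "contains(concat(' ', normalize-space(@class), ' '), ' theclass ')";

// Descend: matching anchors under any li of the id-less, class-matched ul.
$anchors = $xp->query("//ul[not(@id) and $hasClass]/li//a[contains(@href, 'specific')]");

$items = array();
foreach ($anchors as $a) {
    // Ascend: follow parentNode until the enclosing li is reached.
    $node = $a->parentNode;
    while ($node !== null && $node->nodeName !== 'li') {
        $node = $node->parentNode;
    }
    if ($node !== null) {
        $items[] = trim($node->textContent);
    }
}

print_r($items);
```

If you only need the li elements themselves, a predicate such as li[.//a[contains(@href, 'specific')]] should select them directly without the upward walk.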

If you are using PHP 5.3 or later, you can simplify this a bit by registering a PHP function for use in XPath. (Note that you can register functions for use in XPath expressions with XSLTProcessor starting at PHP 5.1, but not directly for DOMXPath.)

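function hasToken($nodearray, $token) {
    // An empty node-set (e.g. no class attribute at all) must not match;
    // without this check the loop below is skipped and true is returned.
    if (empty($nodearray)) {
        return false;
    }
    foreach ($nodearray as $node) {
        if ($node->nodeValue === null || !hasTokenS($node->nodeValue, $token)) {
            return false;
        }
    }
    return true;
    // I could even return nodes or document fragments if I wanted!
}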
function hasTokenS($str, $token) {
    $str = trim($str, "\r\n\t ");
    $tokens = preg_split('/[\r\n\t ]+/', $str);
    return in_array($token, $tokens);
}

$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions(array('hasToken', 'hasTokenS'));

// These two are equivalent:
$nodes1 = $xp->query("/html/body//ul[not(@id) and php:function('hasToken', @class, 'theclass')]");
$nodes2 = $xp->query("/html/body//ul[not(@id) and php:functionString('hasTokenS', @class, 'theclass')]");

var_dump($nodes1->length);
var_dump($nodes1->item(0));
var_dump($nodes2->length);
var_dump($nodes2->item(0));

If DOMDocument is just not parsing your HTML very well, you can use the html5lib parser, which will return a DOMDocument:

require_once('lib/HTML5/Parser.php'); // or wherever you put it
$dom = HTML5_Parser::parse($html);
// $dom is a plain DOMDocument object, created according to HTML5 parsing rules



Answer 3:


I've had good luck with: http://simplehtmldom.sourceforge.net/



Source: https://stackoverflow.com/questions/8584554/a-php-html-parser-that-lets-me-do-class-select-and-get-parent-nodes
