I am trying to learn to use DOMDocument for parsing HTML code.
I am just doing some simple work; I already liked Gordon's answer on scraping data using regex and simple…
Here is how you could do it with DOM and XPath:
$dom = new DOMDocument;
libxml_use_internal_errors(true);   // suppress warnings from real-world, non-valid HTML
$dom->loadHTMLFile('http://www.nu.nl/…');
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// text content of the headline
echo $xpath->evaluate('string(id("leadarticle")/div/h1)');

// outer HTML of the content div
echo $dom->saveHtml(
    $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0)
);
The XPath string(id("leadarticle")/div/h1)
will return the textContent of the h1 that is a child of a div that is the child of the element with the id leadarticle.
The XPath id("leadarticle")/div[@class="content"]
will return the div with the class attribute content that is a child of the element with the id leadarticle.
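If id() does not resolve for a particular document (it relies on the parser knowing that id is an ID-type attribute), the same two queries can be written with a plain attribute test instead; a sketch, not part of the original answer:

// equivalent queries using an explicit attribute test instead of id()
echo $xpath->evaluate('string(//*[@id="leadarticle"]/div/h1)');
$node = $xpath->evaluate('//*[@id="leadarticle"]/div[@class="content"]')->item(0);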
Because you want the outerHTML of the content div, you have to fetch the entire node and not just its text content, hence no string() function in the XPath. Passing a node to the DOMDocument::saveHTML() method (which is only possible as of PHP 5.3.6) will then serialize that node back to HTML.
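On PHP versions before 5.3.6 you can get the same outerHTML by importing the node into a throwaway document and serializing that instead; a minimal sketch, reusing the $xpath object from above:

$node = $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0);
$tmp = new DOMDocument;                            // throwaway document
$tmp->appendChild($tmp->importNode($node, true));  // deep-copy the node into it
echo $tmp->saveHTML();                             // serializes just the copied subtree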
You shouldn't bother with the raw DOMDocument interface. Rather, use one of the jQuery-style classes for extraction; see How to parse HTML with PHP?
QueryPath seems to work fine if you use more specific selectors:
include "qp.phar";
$qp = htmlqp("http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html");
print $qp->find(".header h1")->text();
print $qp->top()->find(".article .content")->xhtml();
You might need to strip the intermingled JavaScript first, however (->find("script")->remove()).
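A minimal sketch of that, assuming QueryPath's remove() detaches the matched nodes from the underlying document before the content is queried again:

$qp = htmlqp("http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html");
$qp->find("script")->remove();                           // detach all <script> elements first
print $qp->top()->find(".article .content")->xhtml();    // now serialize the cleaned content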