Question
Is there a way to use XPath to parse the text between two sets of tags? For example:
<div class="par">
<p class="pp">
<span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
</p>
</div>
<div class="par">
<p class="pp">
<span class="dv">4 </span>Hmm hmm hmm hmm.
</p>
</div>
I want to parse this into an array like the one below, containing the text between the sets of SPAN tags:
array[0] = "Blah blah blah blah.";
array[1] = "Yada yada yada yada.";
array[2] = "Foo foo foo foo.";
array[3] = "Hmm hmm hmm hmm.";
Can I do this simply with DOMDocument? If not, what is the best way to achieve it? Please note that there may be other tags (such as <a>) in the middle of the sentences. For example:
...<span class="dv">5 </span>Uhh uhh <a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>...
Answer 1:
UPDATE
It seems you did want a flat list, so I'm adding this specific example to avoid any confusion:
$html = '<div class="par">
<p class="pp">
<span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
</p>
</div>
<div class="par">
<p class="pp">
<span class="dv">4 </span>Hmm hmm hmm hmm.
</p>
</div>';
// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// Select THE TEXT NODES of all p elements with the class pp.
// Note that this matches class="pp" exactly, not "pp" appearing anywhere
// in the class list; you may need to change this depending on your markup.
// Post additional questions for specific XPath help.
$found = $finder->query('//p[@class="pp"]/text()');
$nodes = array();
// simply transform the resulting DOMNodeList into an array
// for easier consumption/manipulation
foreach ($found as $textNode) {
    $nodes[] = $textNode->nodeValue;
}
print_r($nodes);
Produces:
Array
(
[0] =>
[1] => Blah blah blah blah.
[2] => Yada
yada yada yada.
[3] => Foo foo foo foo.
[4] =>
[5] => Hmm hmm hmm hmm.
)
If the case is always this simple, I think you can just use XPath to get the content of the child DOMText nodes within each p.pp:
$html = '<div class="par">
<p class="pp">
<span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
</p>
</div>
<div class="par">
<p class="pp">
<span class="dv">4 </span>Hmm hmm hmm hmm.
</p>
</div>';
// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// Select all p elements with the class pp.
// Note that this matches class="pp" exactly, not "pp" appearing anywhere
// in the class list; you may need to change this depending on your markup.
// Post additional questions for specific XPath help.
$found = $finder->query('//p[@class="pp"]');
$nodes = array();
foreach ($found as $p) {
    // for each p element, pull its text nodes
    $textNodes = $finder->query('text()', $p);
    $textStr = '';
    // loop over the text nodes and concatenate them into a single string
    foreach ($textNodes as $n) {
        $textStr .= $n->nodeValue;
    }
    // push the compiled string onto the array
    $nodes[] = $textStr;
}
print_r($nodes);
This will produce a result like:
Array
(
[0] =>
Blah blah blah blah. Yada
yada yada yada. Foo foo foo foo.
[1] =>
Hmm hmm hmm hmm.
)
If you really do want each text node separately, you just need to change the loop:
foreach ($found as $p) {
    // for each p element, pull its text nodes
    $textNodes = $finder->query('text()', $p);
    $textArr = array();
    // loop over the text nodes and collect each one as a separate entry
    foreach ($textNodes as $n) {
        $textArr[] = $n->nodeValue;
    }
    // push the array of text values onto the result array
    $nodes[] = $textArr;
}
Which will give you:
Array
(
[0] => Array
(
[0] =>
[1] => Blah blah blah blah.
[2] => Yada
yada yada yada.
[3] => Foo foo foo foo.
)
[1] => Array
(
[0] =>
[1] => Hmm hmm hmm hmm.
)
)
As you can see, this also grabs the line breaks. If they are undesirable, you can filter them out with your array filtering method of choice. Alternatively, you can look into the XPath and DOMDocument settings: IIRC there are some settings that control how whitespace is interpreted (or not), which would probably let you avoid the problem, but they could have other consequences if you are doing other processing on the same DOMDocument instance.
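One possible way to drop the whitespace-only entries, sketched here as an illustration rather than as part of the original answer: add a normalize-space() predicate to the XPath expression so that empty text nodes are never selected, then collapse and trim the whitespace in what remains. The shortened $html sample is an assumption made to keep the example small.

```php
<?php
// Shortened sample input (an assumption for illustration).
$html = '<div class="par"><p class="pp">
<span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

$nodes = array();
// text()[normalize-space()] keeps only text nodes that contain
// something other than whitespace.
foreach ($finder->query('//p[@class="pp"]/text()[normalize-space()]') as $textNode) {
    // Collapse internal runs of whitespace (including line breaks), then trim.
    $nodes[] = trim(preg_replace('~\s+~u', ' ', $textNode->nodeValue));
}

print_r($nodes);
```

If you would rather keep the original query untouched, applying the same trim() call with array_map() and then array_filter() to the resulting array should have a similar effect.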
Answer 2:
You want the first text node that directly follows each span element as a sibling:
//span/following-sibling::text()[1]
This translates 1:1 into PHP:
$doc = new DOMDocument();
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$expr = '//span/following-sibling::text()[1]';
$result = $xpath->evaluate($expr);
You then want the resulting text nodes turned into an array of strings. Since you are processing them anyway, I'd suggest running some whitespace normalization on them as well:
$array = array_map(function(DOMText $text) {
    return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue);
}, iterator_to_array($result));
The result then is:
[
"Blah blah blah blah.",
"Yada yada yada yada.",
"Foo foo foo foo.",
"Hmm hmm hmm hmm."
]
The full code example:
<?php
/**
* http://stackoverflow.com/questions/27674012/php-domdocument-get-text-between-two-sets-of-tags
*/
$buffer = <<<HTML
<div class="par">
<p class="pp">
<span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
</p>
</div>
<div class="par">
<p class="pp">
<span class="dv">4 </span>Hmm hmm hmm hmm.
</p>
</div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$expr = '//span/following-sibling::text()[1]';
$result = $xpath->evaluate($expr);
$array = array_map(function(DOMText $text) {
    return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue);
}, iterator_to_array($result));
echo json_encode($array, JSON_PRETTY_PRINT);
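Note that following-sibling::text()[1] returns only the text up to the next element, so a sentence containing an inline tag such as <a> (as mentioned in the question) would be cut short. A rough sketch of one way to handle that case, not taken from the answer above: walk the children of each p, start a new sentence at every span.dv marker, and append the text content of everything in between. The sample markup and the $sentences variable name are assumptions for illustration.

```php
<?php
// Hypothetical input based on the inline-link example in the question.
$html = '<p class="pp"><span class="dv">5 </span>Uhh uhh '
      . '<a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>Foo foo.</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$sentences = array();
foreach ($xpath->query('//p[@class="pp"]') as $p) {
    $current = null; // stays null until the first marker span is seen
    foreach ($p->childNodes as $child) {
        $isMarker = $child instanceof DOMElement
            && $child->tagName === 'span'
            && $child->getAttribute('class') === 'dv';
        if ($isMarker) {
            // A marker closes the previous sentence and opens a new one.
            if ($current !== null) {
                $sentences[] = $current;
            }
            $current = '';
        } elseif ($current !== null) {
            // Text nodes and inline elements both contribute their text.
            $current .= $child->textContent;
        }
    }
    if ($current !== null) {
        $sentences[] = $current;
    }
}

// Same kind of whitespace normalization as above.
$sentences = array_map(function ($s) {
    return trim(preg_replace('~\s+~u', ' ', $s));
}, $sentences);

print_r($sentences);
```

This trades the single XPath expression for an explicit loop, which may be easier to adapt if the markers are not always span.dv elements.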
Source: https://stackoverflow.com/questions/27674012/php-domdocument-get-text-between-two-sets-of-tags