Question
I am trying to parse HTML data into arrays using the classes of the a tags, but I was not able to get the desired format. Below is my data:
$text ='<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">Text1</a>
</h2>
<a class="result__snippet" href="">Text1</a>
<a class="result__url" href="">
example.com
</a>
</div>
</div>
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">text3</a>
</h2>
<a class="result__snippet" href="">text23</a>
<a class="result__url" href="">
text.com
</a>
</div>
</div>';
I am trying to get the result using the code below:
$lines = explode("\n", $text);
$out = array();
foreach ($lines as $line) {
    $parts = explode(" > ", $line);
    $ref = &$out;
    while (count($parts) > 0) {
        if (isset($ref[$parts[0]]) === false) {
            $ref[$parts[0]] = array();
        }
        $ref = &$ref[$parts[0]];
        array_shift($parts);
    }
}
print_r($out);
But I need the result exactly like this:
array:2 [
  0 => array:3 [
    0 => "Text1"
    1 => "Text1"
    2 => "example.com"
  ]
  1 => array:3 [
    0 => "text3"
    1 => "text23"
    2 => "text.com"
  ]
]
Demo : https://eval.in/746170
I also tried DOMDocument in Laravel, as below:
$dom = new DOMDocument;
$dom->loadHTML($text);
foreach($dom->getElementsByTagName('a') as $node)
{
    $array[] = $dom->saveHTML($node);
}
print_r($array);
So how can I use the classes to separate the data the way I want? Any suggestions, please. Thank you.
Answer 1:
I would do it using DOMDocument and DOMXPath to target the interesting parts more easily. To be more precise, I register a function that checks whether a class attribute contains a set of classes:
// Returns true only if every class in $requiredClasses is present as a whole
// token in the class attribute value.
function hasClasses($attrValue, $requiredClasses) {
    $requiredClasses = explode(' ', $requiredClasses);
    $classes = preg_split('~\s+~', $attrValue, -1, PREG_SPLIT_NO_EMPTY);
    return array_diff($requiredClasses, $classes) ? false : true;
}

$dom = new DOMDocument;
// Collect libxml parse warnings internally instead of printing them.
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html); // $html holds the markup ($text in the question)
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('hasClasses');

$mainDivClasses = 'result results_links results_links_deep web-result';
$childDivClasses = 'links_main links_deep result__body';
$divNodeList = $xp->query('//div[php:functionString("hasClasses", @class, "' . $mainDivClasses . '")]
    /div[php:functionString("hasClasses", @class, "' . $childDivClasses . '")]');

$results = [];
foreach ($divNodeList as $divNode) {
    // Evaluate each expression relative to the current inner div.
    $results[] = [
        trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
        trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
        trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
    ];
}
print_r($results);
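To see what the helper buys you over a plain substring test, here is a minimal sketch that calls it directly, outside XPath. The attribute value is copied from the question; the truncated name 'results_links_dee' is a made-up input chosen only to show the difference.

```php
<?php
// Same helper as in the answer: true only when every required class is a
// whole token of the class attribute value.
function hasClasses($attrValue, $requiredClasses) {
    $requiredClasses = explode(' ', $requiredClasses);
    $classes = preg_split('~\s+~', $attrValue, -1, PREG_SPLIT_NO_EMPTY);
    return array_diff($requiredClasses, $classes) ? false : true;
}

// Class attribute copied from the question (note the trailing space).
$attr = 'result results_links results_links_deep web-result ';

$exact   = hasClasses($attr, 'result web-result');        // both tokens present
$partial = hasClasses($attr, 'results_links_dee');        // hypothetical: only a prefix of a token
$naive   = strpos($attr, 'results_links_dee') !== false;  // plain substring test

var_dump($exact, $partial, $naive); // bool(true), bool(false), bool(true)
```

The order of the required classes does not matter, and extra whitespace in the attribute is ignored because of the preg_split with PREG_SPLIT_NO_EMPTY.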
Without registering a function, you can also use the XPath function contains in your predicates. It's less precise, since it only checks whether a substring occurs in a larger string (and not whether the class attribute has a specific class, as the hasClasses function does), but it should be enough:
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html); // $html holds the markup ($text in the question)
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);
// Chained predicates: each contains() must match for the div to be selected.
$divNodeList = $xp->query('//div[contains(@class, "results_links_deep")]
    [contains(@class, "web-result")]
    /div[contains(@class, "links_main")]
    [contains(@class, "links_deep")]
    [contains(@class, "result__body")]');

$results = [];
foreach ($divNodeList as $divNode) {
    $results[] = [
        trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
        trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
        trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
    ];
}
print_r($results);
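If the substring behaviour of contains is a concern, a common XPath 1.0 idiom gives a whole-token match without registering a PHP function: pad the normalized attribute with spaces and search for the class surrounded by spaces. This is only a sketch, not part of the code above; the markup is shortened, and the second div with class "web-result-ad" is a made-up decoy.

```php
<?php
// Shortened sample markup. The second div is a hypothetical decoy whose class
// merely contains "web-result" as a substring.
$html = '<div class="result web-result "><a class="result__a" href="">Text1</a></div>'
      . '<div class="web-result-ad"><a class="result__a" href="">Ad</a></div>';

$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);

// Pad the normalized attribute with spaces so only whole tokens can match.
$token = '//div[contains(concat(" ", normalize-space(@class), " "), " web-result ")]';
// Plain substring test, as in the answer above.
$loose = '//div[contains(@class, "web-result")]';

$tokenCount = $xp->query($token)->length;
$looseCount = $xp->query($loose)->length;

printf("token match: %d, substring match: %d\n", $tokenCount, $looseCount);
// prints "token match: 1, substring match: 2"
```

For the markup in the question both forms select the same divs, so the simpler contains version is probably fine there.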
Answer 2:
Here you go. Try this and tell me if you need any more help:
<?php
$test = <<<EOS
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">Text1</a>
</h2>
<a class="result__snippet" href="">Text1</a>
<a class="result__url" href="">
example.com
</a>
</div>
</div>
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">text3</a>
</h2>
<a class="result__snippet" href="">text23</a>
<a class="result__url" href="">
text.com
</a>
</div>
</div>
EOS;
$document = new DOMDocument();
$document->loadHTML($test);

// first extract all the divs with the links_main class
$divs = [];
foreach ($document->getElementsByTagName('div') as $div) {
    // getAttribute() returns an empty string when the div has no class attribute
    $classes = $div->getAttribute('class');
    if (!$classes) continue;
    $classes = explode(' ', $classes);
    if (in_array('links_main', $classes)) {
        $divs[] = $div;
    }
}

// now iterate through them and retrieve all the links in order
$results = [];
foreach ($divs as $div) {
    $temp = [];
    foreach ($div->getElementsByTagName('a') as $link) {
        // trim() drops the newlines around multi-line link text such as the URL
        $temp[] = trim($link->nodeValue);
    }
    $results[] = $temp;
}
var_dump($results);
Working version - http://sandbox.onlinephpfunctions.com/code/e7ed2615ea32c5b9f0a89e3460da28a2702343f1
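The loop above relies on the three links always appearing in the same order. As a rough sketch of a variation (not the answerer's code), each link can instead be placed by its class, so a reordered or missing link does not shift the positions. The $wanted list and the reordered sample markup are assumptions made for this illustration.

```php
<?php
// Hypothetical sample: same structure as the question, but with the URL link first.
$html = <<<EOS
<div class="links_main links_deep result__body">
<a class="result__url" href="">
example.com
</a>
<h2 class="result__title"><a rel="nofollow" class="result__a" href="">Text1</a></h2>
<a class="result__snippet" href="">Snippet1</a>
</div>
EOS;

$document = new DOMDocument();
$state = libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_use_internal_errors($state);

// Output position of each link class (assumed order: title, snippet, url).
$wanted = ['result__a', 'result__snippet', 'result__url'];

$results = [];
foreach ($document->getElementsByTagName('div') as $div) {
    $classes = preg_split('~\s+~', $div->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
    if (!in_array('links_main', $classes)) continue;

    // Start with empty slots so a missing link leaves '' instead of shifting the rest.
    $row = array_fill(0, count($wanted), '');
    foreach ($div->getElementsByTagName('a') as $link) {
        $pos = array_search($link->getAttribute('class'), $wanted, true);
        if ($pos !== false) {
            $row[$pos] = trim($link->nodeValue);
        }
    }
    $results[] = $row;
}
print_r($results);
```

Note that this compares the whole class attribute of each link, which works here because every link carries exactly one class.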
Source: https://stackoverflow.com/questions/42555739/parse-the-html-data-to-array-data-in-php