Preventing DOMDocument::loadHTML() from converting entities

Submitted by て烟熏妆下的殇ゞ on 2019-12-05 11:49:24

A solution for PHP versions earlier than 5.3.6:

$html =<<<HTML
<ul><li>text</li>
<li>&frac12; of this is <strong>strong</strong></li></ul>
HTML;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('li') as $node)
{
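  // nodeValue is UTF-8. Convert it to ISO-8859-1 and pass that charset explicitly,
  // because the default charset of htmlentities() changed to UTF-8 in PHP 5.4.
  echo htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue), ENT_COMPAT, 'ISO-8859-1'), "\n";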
}
Reuben

Based on the answer provided by ajreal, I've expanded the example variable to handle more cases, and changed _get_inner_html() to make recursive calls and handle the entity conversion for text nodes.

It's probably not the best answer, since it makes some assumptions about the elements (for example, that they have no attributes). My particular needs don't require attributes to be carried across yet, although I'm sure my sample data will throw that one at me later on, so this solution works for me.

$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li>'.
'<li>Entity <strong attr="3">in &frac12; tag</strong></li>'.
'<li>Nested nodes <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li>'.
'</ul>';

echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = true;
$doc->substituteEntities = false;

$doc->loadHTML($example);

$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;

for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));

    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;

}

function _get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        echo 'Node type is '.$child->nodeType.PHP_EOL;
        switch ($child->nodeType) {
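        case XML_TEXT_NODE: // nodeType 3
            // Pass the charset explicitly; htmlentities() defaults to UTF-8 since PHP 5.4.
            $innerHTML .= htmlentities(iconv('UTF-8', 'ISO-8859-1', $child->nodeValue), ENT_COMPAT, 'ISO-8859-1');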
            break;
        default:
            echo 'Non text node has '.$child->childNodes->length.' children'.PHP_EOL;
            echo 'Node name '.$child->nodeName.PHP_EOL;
            $innerHTML .= '<'.$child->nodeName.'>';
            $innerHTML .= _get_inner_html( $child );
            $innerHTML .= '</'.$child->nodeName.'>';
            break;
        }
    }

    return $innerHTML;
}
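
The limitation mentioned above (attributes are dropped) could be lifted along the following lines. This is only a sketch, not part of the original answer: the function name inner_html_with_attributes() is hypothetical, it feeds the UTF-8 node value straight to htmlentities() instead of going through iconv(), and it silently skips comments and other node types.

```php
<?php
// Hypothetical helper (not from the original answer): works like
// _get_inner_html(), but copies each element's attributes into the rebuilt tag.
function inner_html_with_attributes(DOMNode $node)
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // DOM text is UTF-8; turn characters such as ½ back into named entities.
            $out .= htmlentities($child->nodeValue, ENT_QUOTES, 'UTF-8');
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            $attrs = '';
            foreach ($child->attributes as $attr) {
                // Escape attribute values so quotes cannot break the markup.
                $attrs .= ' ' . $attr->nodeName . '="'
                    . htmlspecialchars($attr->nodeValue, ENT_QUOTES, 'UTF-8') . '"';
            }
            $out .= '<' . $child->nodeName . $attrs . '>'
                . inner_html_with_attributes($child)
                . '</' . $child->nodeName . '>';
        }
        // Assumption: comments and other node types are ignored in this sketch.
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>Entity <strong attr="3">in &frac12; tag</strong></li></ul>');
$li = $doc->getElementsByTagName('li')->item(0);
echo inner_html_with_attributes($li), PHP_EOL;
```

Like the original function, this writes an explicit closing tag for every element, so void elements such as br would need special handling.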

A version that does not need to iterate over the child nodes:

function innerHTML($node)
{
    // Serialize the node itself, then strip its own opening and closing tags.
    $html = $node->ownerDocument->saveXML($node);
    return preg_replace("%^<{$node->nodeName}[^>]*>|</{$node->nodeName}>$%", '', $html);
}
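
For illustration, the saveXML() approach might be called as shown below. The helper is repeated under a hypothetical name, inner_html_via_savexml(), so that the snippet runs by itself. The preg_quote() call is my addition rather than part of the original answer. How the ½ character is serialized (raw character or character reference) may depend on the document's encoding, so it is worth inspecting the output before storing it.

```php
<?php
// Hypothetical stand-alone copy of the innerHTML() idea above.
function inner_html_via_savexml(DOMNode $node)
{
    // Serialize the node, including its own tags.
    $html = $node->ownerDocument->saveXML($node);
    // Assumption: quoting the tag name is an extra safety step, not in the original.
    $name = preg_quote($node->nodeName, '%');
    // Strip the node's own opening tag (with any attributes) and closing tag.
    return preg_replace("%^<{$name}[^>]*>|</{$name}>$%", '', $html);
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>&frac12; of this is <strong>strong</strong></li></ul>');
$li = $doc->getElementsByTagName('li')->item(0);

$inner = inner_html_via_savexml($li);
echo $inner, PHP_EOL;
```

Because the node is serialized as a whole, attributes on nested elements are kept without any extra code.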