Why do these two DOMDocument functions behave differently?

Posted by 99封情书 on 2021-01-28 06:18:33

Question


There are two approaches to getting the outer HTML of a DOMDocument node suggested here: How to return outer html of DOMDocument?

I'm interested in why they seem to treat HTML entities differently.

EXAMPLE:

function outerHTML($node) {
    $doc = new DOMDocument();
    $doc->appendChild($doc->importNode($node, true));
    return $doc->saveHTML();
}

$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$el = $dom->getElementsByTagName('p')->item(0);
echo $el->ownerDocument->saveHTML($el) . PHP_EOL;
echo outerHTML($el) . PHP_EOL;

OUTPUT:

<p>ACME’s 27” Monitor is $200.</p>
<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>

Both methods use saveHTML(), but for some reason the function preserves HTML entities in the final output, while calling saveHTML() directly with a node argument does not. Can anyone explain why, preferably with some kind of authoritative reference?


Answer 1:


What this comes down to is even simpler than your test case above:

<?php
$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>';
$dom = new DOMDocument();
@$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

echo $dom->saveHTML($dom->documentElement) . PHP_EOL;
echo $dom->saveHTML() . PHP_EOL;

So the question becomes: why does DOMDocument::saveHTML behave differently when saving an entire document instead of just a specific node?

Taking a peek at the PHP source, we find a check for whether it's working with a single node or a whole document. For a single node, the htmlNodeDumpFormatOutput function is called with the encoding explicitly set to null. For a whole document, the htmlDocDumpMemoryFormat function is used instead, and that function does not take an encoding argument at all.

Both of these functions are from the libxml2 library. Looking at that source, we can see that htmlDocDumpMemoryFormat tries to detect the document encoding, and explicitly sets it to ASCII/HTML if it can't find one.

Both functions end up calling htmlNodeListDumpOutput, passing it the encoding that's been determined: either null, which results in no encoding, or ASCII/HTML, which encodes using HTML entities.

My guess is that, for a document fragment or single node, encoding is considered less important than for a full document.
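
If you want the node-level call to give ASCII-only output as well, one option is to post-process its result. The following is only a minimal sketch, not something the DOM extension provides: toAsciiEntities is a hypothetical helper name, and it assumes the mbstring extension is loaded and that the serialized string is UTF-8.

```php
<?php
// Hypothetical helper (not part of the DOM extension): rewrites every
// non-ASCII character in a UTF-8 string as a decimal numeric entity.
function toAsciiEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

$dom = new DOMDocument();
$dom->loadHTML('<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>');
$el = $dom->getElementsByTagName('p')->item(0);

// Node-level save, post-processed so the result is ASCII-only.
$asciiSafe = toAsciiEntities($dom->saveHTML($el));
echo $asciiSafe . PHP_EOL;
```

Note that this helper writes numeric references (for example &#8217;) rather than named ones such as &rsquo;, so the result is equivalent to, but not byte-for-byte identical with, what the whole-document save prints.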



Source: https://stackoverflow.com/questions/59938856/why-do-these-two-domdocument-functions-behave-differently
