I am using DOMDocument to manipulate and modify HTML before it is output to the page. This is only an HTML fragment, not a complete page. My initial problem was that all fren
loadHTML() doesn't always recognize the correct encoding as specified in the Content-type HTTP-EQUIV meta tag.
If the DOMDocument('1.0', 'UTF-8') and loadHTML('<?xml encoding="UTF-8">' . $html) hacks don't work, as they didn't for me (PHP 5.3.13), try this:
Add another <head> section immediately after the opening <html> tag, containing the correct Content-type HTTP-EQUIV meta tag. Then call loadHTML(), and afterwards remove the extra <head> tag.
// Ensure entire page is encoded in UTF-8
$encoding = mb_detect_encoding($body);
$body = $encoding ? @iconv($encoding, 'UTF-8', $body) : $body;
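// Insert a <head> with a meta tag immediately after the opening <html> tag to force UTF-8 encoding
$insertPoint = false;
if (preg_match("/<html[^>]*>/i", $body, $matches, PREG_OFFSET_CAPTURE)) {
    // PREG_OFFSET_CAPTURE reports byte offsets, so use the byte-based
    // strlen()/substr() here rather than the mb_* functions
    $insertPoint = $matches[0][1] + strlen($matches[0][0]);
}
if ($insertPoint !== false) {
    $body = substr($body, 0, $insertPoint)
        . "<head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>"
        . substr($body, $insertPoint);
}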
$dom = new DOMDocument();
// Suppress warnings for loading non-standard html pages
libxml_use_internal_errors(true);
$dom->loadHTML($body);
libxml_use_internal_errors(false);
// Now remove the extra <head>
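The removal step itself is not shown above. Here is a minimal sketch of one way to do it, assuming the injected meta tag is the first Content-type meta inside the first head element of the parsed document. The sample $body and the variable names ($extraHead, $meta, $text) are made up for illustration.

```php
<?php
// Hypothetical input: a page whose extra <head> was already inserted after <html>.
$body = "<html><head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>"
      . "<head><title>Test</title></head><body><p>Fran\xC3\xA7ais</p></body></html>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($body);
libxml_clear_errors();
libxml_use_internal_errors(false);

// Assumption: the injected <head> is the first <head> in document order.
$extraHead = $dom->getElementsByTagName('head')->item(0);
if ($extraHead !== null) {
    // Remove the first Content-type meta tag found inside it (the injected one).
    foreach ($extraHead->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('http-equiv')) === 'content-type') {
            $meta->parentNode->removeChild($meta);
            break;
        }
    }
    // Drop the <head> itself only if nothing else ended up inside it.
    if (!$extraHead->hasChildNodes()) {
        $extraHead->parentNode->removeChild($extraHead);
    }
}

// Read some text back from the DOM after the cleanup.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

The encoding is decided while the document is parsed, so removing the tag afterwards should not change text that is already in the DOM. Checking for leftover children before deleting the head guards against the parser having merged the page's own head content into the injected one.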
See this article: http://devzone.zend.com/1538/php-dom-xml-extension-encoding-processing/