DOMDocument encoding problems / characters transformed

粉色の甜心 2020-12-16 23:56

I am using DOMDocument to manipulate / modify HTML before it gets output to the page. This is only an HTML fragment, not a complete page. My initial problem was that all fren

4 answers

挽巷 (original poster)

2020-12-17 00:31

loadHTML() doesn't always recognize the correct encoding as specified in the Content-Type http-equiv meta tag.

If the new DOMDocument('1.0', 'UTF-8') and loadHTML('<?xml encoding="UTF-8">' . $html) hacks don't work, as they didn't for me (PHP 5.3.13), try this:

Add another <head> section immediately after the opening <html> tag, containing the correct Content-Type http-equiv meta tag. Then call loadHTML(), and afterwards remove the extra <head> element.

// Ensure the entire page is encoded in UTF-8
$encoding = mb_detect_encoding($body);
$body = $encoding ? @iconv($encoding, 'UTF-8', $body) : $body;

// Insert a <head> with a <meta> tag immediately after the opening <html> tag to force UTF-8 encoding
$insertPoint = false;
if (preg_match("/<html[^>]*>/is", $body, $matches, PREG_OFFSET_CAPTURE)) {
    // PREG_OFFSET_CAPTURE reports byte offsets, so use the byte-based strlen()/substr()
    $insertPoint = strlen($matches[0][0]) + $matches[0][1];
}
if ($insertPoint) {
    $body = substr($body, 0, $insertPoint)
        . "<head><meta http-equiv='Content-Type' content='text/html; charset=UTF-8' /></head>"
        . substr($body, $insertPoint);
}
$dom = new DOMDocument();

// Suppress warnings for loading non-standard html pages
libxml_use_internal_errors(true);
$dom->loadHTML($body);
libxml_use_internal_errors(false);

// Now remove the extra <head>
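
The listing above ends at the removal step. As a rough illustration of the whole approach applied to a fragment (which is what the question describes), here is a minimal sketch. It assumes the input is already UTF-8 and has no <html> tag of its own, so it wraps the fragment instead of searching for an insert point. The function name loadUtf8Fragment() and the wrapping markup are hypothetical and not part of the original answer.

```php
<?php
// Sketch only: loadUtf8Fragment() is a hypothetical helper, not from the original answer.
// Assumes $html is a UTF-8 encoded fragment without its own <html> element.
function loadUtf8Fragment(string $html): DOMDocument
{
    // Temporary <head> that declares the charset to the HTML parser.
    $meta = '<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>';

    // Wrap the fragment so the charset declaration comes before any content.
    $wrapped = '<html>' . $meta . '<body>' . $html . '</body></html>';

    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove the temporary <head> again so it does not end up in the output.
    $xpath = new DOMXPath($dom);
    $head = $xpath->query('//head')->item(0);
    if ($head !== null) {
        $head->parentNode->removeChild($head);
    }

    return $dom;
}

// Usage: load a fragment with accented characters and read the text back.
$dom = loadUtf8Fragment('<p>café</p>');
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Passing the previous return value of libxml_use_internal_errors() back in, rather than a hard-coded false, leaves the caller's error-handling setting as it was.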

See this article: http://devzone.zend.com/1538/php-dom-xml-extension-encoding-processing/
