PHP DOMDocument loadHTML not encoding UTF-8 correctly

梦如初夏 2020-11-22 15:11

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "

        
13 answers
  •  Happy的楠姐
    2020-11-22 15:45

    This took me a while to figure out, but here's my answer.

    Before using DOMDocument, I would fetch URLs with file_get_contents and then process the result with string functions. Perhaps not the best way, but quick. Once I was convinced that DOM was just as quick, I first tried the following:

    $dom = new DOMDocument('1.0', 'UTF-8');
    // loadHTMLFile() returns false on failure, so compare strictly
    if ($dom->loadHTMLFile($url) === false) { // read the url
        // error message
    } else {
        // process
    }
    

    This failed spectacularly at preserving UTF-8 encoding, despite the proper meta tags, PHP settings, and all the other remedies offered here and elsewhere. Here's what works:

    $dom = new DOMDocument('1.0', 'UTF-8');
    $str = file_get_contents($url);
    // Convert non-ASCII characters to HTML entities before parsing
    if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) === false) {
        // error message
    }
    

    etc. Now everything's right with the world. Hope this helps.
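
As a rough sketch of why the conversion step helps: when loadHTML finds no charset declaration, it tends to read the bytes as a single-byte encoding, so turning every non-ASCII character into a numeric entity leaves only ASCII for the parser to handle. The example below is an illustration under assumptions, not the answerer's exact code. It uses a hypothetical inline string ($html) instead of a URL, and it swaps in mb_encode_numericentity, because the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2. It assumes the dom and mbstring extensions are loaded.

```php
<?php
// Hypothetical sample input: a UTF-8 fragment with no charset declaration.
$html = '<p>Café 日本語</p>';

// Naive load: without a declared charset the multibyte sequences are
// typically misread, so this text usually comes back garbled.
$naive = new DOMDocument();
$naive->loadHTML($html);
$naiveText = $naive->getElementsByTagName('p')->item(0)->textContent;

// Convert every code point from U+0080 upward into a numeric entity.
// Map format: [first code point, last code point, offset, mask].
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($html, $map, 'UTF-8');

// The parser now sees pure ASCII and decodes the entities itself.
$fixed = new DOMDocument();
$fixed->loadHTML($encoded);
$fixedText = $fixed->getElementsByTagName('p')->item(0)->textContent;

var_dump($naiveText === 'Café 日本語'); // compare against the original text
echo $fixedText, PHP_EOL;               // Café 日本語
```

The same idea should carry over to the file_get_contents($url) case in the answer, provided the fetched page really is UTF-8; a page in another encoding would need that encoding passed as the third argument instead.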
