DomDocument and special characters

轮回少年 2020-12-14 16:47

This is my code:

$oDom = new DOMDocument();
$oDom->loadHTML("èàéìòù");
echo $oDom->saveHTML();

This is the output:

         


        
8 Answers
  • 2020-12-14 16:50

    This way:

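    /**
     * @param string $text
     * @return DOMDocument
     */
    private function buildDocument($text)
    {
        $dom = new DOMDocument();

        // Collect libxml parse warnings internally while loading
        $previous = libxml_use_internal_errors(true);
        // The meta tag tells the parser to treat the input as UTF-8
        $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);
        // Discard the collected warnings and restore the previous setting
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        return $dom;
    }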
    
  • 2020-12-14 17:00

    Solution:

    $oDom = new DOMDocument();
    $oDom->encoding = 'utf-8';
    $oDom->loadHTML( utf8_decode( $sString ) ); // important!
    
    $sHtml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
    $sHtml .= $oDom->saveHTML( $oDom->documentElement ); // important!
    

    The saveHTML() method behaves differently when you pass it a node. You can pass the root node ($oDom->documentElement) and prepend the desired !DOCTYPE manually. The other important part is utf8_decode() (note that it is deprecated as of PHP 8.2). In my case, none of the other properties or methods of the DOMDocument class produced the desired result.

  • 2020-12-14 17:05

    Looks like you just need to set substituteEntities when you create the DOMDocument object.
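
    A minimal sketch of what setting that property looks like. The sample input (the characters written as HTML entities) is an assumption for illustration, and the thread does not confirm that this setting alone fixes the output in the question:

    ```php
    <?php
    // Sketch only: set substituteEntities right after creating the document.
    $oDom = new DOMDocument();
    $oDom->substituteEntities = true; // libxml-specific public property

    // Assumed sample input, not taken from the question.
    $oDom->loadHTML('<p>&egrave;&agrave;&eacute;</p>');

    // Read the text back from the parsed paragraph.
    $text = $oDom->getElementsByTagName('p')->item(0)->textContent;
    echo $text, "\n";
    ```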

  • 2020-12-14 17:06

    The issue appears to be known, according to the user comments on the manual page at php.net. Solutions suggested there include putting

    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    

    in the document before you put any strings with non-ASCII chars in.

    Another hack suggests putting

    <?xml encoding="UTF-8">
    

    as the first text in the document and then removing it at the end.

    Nasty stuff. Smells like a bug to me.
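
    The second hack above could look roughly like this. It is a sketch, not a definitive fix: the sample string in $html is an assumption, and how the marker ends up in the tree may depend on the libxml2 version, so the cleanup loop only removes top-level processing-instruction nodes if any exist.

    ```php
    <?php
    // Assumed sample input, UTF-8 encoded.
    $html = '<p>èàéìòù</p>';

    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Prepend the XML encoding marker so the parser reads the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove the marker again; copy the node list first because it is live.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $dom->removeChild($node);
        }
    }

    // Declare the output encoding for saveHTML().
    $dom->encoding = 'UTF-8';

    $text = $dom->getElementsByTagName('p')->item(0)->textContent;
    echo $dom->saveHTML();
    ```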

  • 2020-12-14 17:10

    I don't know why the accepted answer didn't work for my problem, but this one did.

    ref: https://www.php.net/manual/en/class.domdocument.php

    <?php
    
    // Hypothetical wrapper so the early return is valid; the function name is illustrative
    function loadHtmlFragment($content)
    {
        // Check that the content we're receiving isn't empty, to avoid the warning
        if (empty($content)) {
            return false;
        }

        // Convert non-ASCII characters to HTML entities so the parser cannot misread them
        // (the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
        $content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');

        // Create the new document
        $doc = new DOMDocument('1.0', 'utf-8');

        // Collect libxml parse errors internally instead of emitting warnings
        libxml_use_internal_errors(true);

        // Load the content without adding enclosing html/body tags or a doctype declaration
        $doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

        // Do whatever you want with the document from here
        return $doc;
    }
    
    ?>
    
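  • 2020-12-14 17:15

    $dom = new DOMDocument();
    // Encode special characters as HTML entities before parsing
    $str = htmlentities($str);
    $dom->loadHTML(utf8_decode($str));
    $dom->encoding = 'utf-8';
    // ...
    $str = $dom->saveHTML();
    // Turn the entities back into characters
    $str = html_entity_decode($str);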
    

    The above code worked for me.
