Convert Unicode to HTML entities (hex)

日久生厌 2020-12-03 19:27

How to convert a Unicode string to HTML entities? (HEX not decimal)

For example, convert Français to Fran&#xE7;ais.

4 answers
  • 2020-12-03 19:49

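    See "How to get the character from unicode code point in PHP?" for some code that allows you to do the following. Note that mb_htmlentities(), mb_html_entity_decode(), codepoint_encode() and codepoint_decode() are not PHP built-ins; they are helpers defined in that answer.

    Example use: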

    echo "Get string from numeric DEC value\n";
    var_dump(mb_chr(50319, 'UCS-4BE'));
    var_dump(mb_chr(271));
    
    echo "\nGet string from numeric HEX value\n";
    var_dump(mb_chr(0xC48F, 'UCS-4BE'));
    var_dump(mb_chr(0x010F));
    
    echo "\nGet numeric value of character as DEC int\n";
    var_dump(mb_ord('ď', 'UCS-4BE'));
    var_dump(mb_ord('ď'));
    
    echo "\nGet numeric value of character as HEX string\n";
    var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
    var_dump(dechex(mb_ord('ď')));
    
    echo "\nEncode / decode to DEC based HTML entities\n";
    var_dump(mb_htmlentities('tchüß', false));
    var_dump(mb_html_entity_decode('tch&#252;&#223;'));
    
    echo "\nEncode / decode to HEX based HTML entities\n";
    var_dump(mb_htmlentities('tchüß'));
    var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));
    
    echo "\nUse JSON encoding / decoding\n";
    var_dump(codepoint_encode("tchüß"));
    var_dump(codepoint_decode('tch\u00fc\u00df'));
    

    Output:

    Get string from numeric DEC value
    string(4) "ď"
    string(2) "ď"
    
    Get string from numeric HEX value
    string(4) "ď"
    string(2) "ď"
    
    Get numeric value of character as DEC int
    int(50319)
    int(271)
    
    Get numeric value of character as HEX string
    string(4) "c48f"
    string(3) "10f"
    
    Encode / decode to DEC based HTML entities
    string(15) "tch&#252;&#223;"
    string(7) "tchüß"
    
    Encode / decode to HEX based HTML entities
    string(15) "tch&#xFC;&#xDF;"
    string(7) "tchüß"
    
    Use JSON encoding / decoding
    string(15) "tch\u00fc\u00df"
    string(7) "tchüß"
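
For anyone who does not want to copy the helpers from the linked answer, here is a minimal sketch of the hex encode / decode step using only built-in functions. It assumes the mbstring extension is loaded; the names encodeHexEntities() and decodeEntities() are made up for this example and are not the helpers used above.

```php
<?php
// Hypothetical helper: encode every non-ASCII character as a hex entity.
function encodeHexEntities(string $text): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    // The fourth argument asks for hexadecimal instead of decimal entities.
    return mb_encode_numericentity($text, $map, 'UTF-8', true);
}

// Hypothetical helper: decode named, decimal and hex entities back to UTF-8.
function decodeEntities(string $html): string
{
    return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo encodeHexEntities('tchüß'), "\n";
echo decodeEntities('tch&#xFC;&#xDF;'), "\n"; // tchüß
```

ASCII characters, including < and &, pass through the encoder untouched, so this is not a replacement for htmlspecialchars().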
    
  • 2020-12-03 19:54

    For the missing hex-encoding in the related question:

    $output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
        list($utf8) = $match;
        $binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
        $entity = vsprintf('&#x%X;', unpack('N', $binary));
        return $entity;
    }, $input);
    

    This is similar to @Baba's answer, but it converts to UTF-32BE and then uses unpack and vsprintf for the formatting.

    If you prefer iconv over mb_convert_encoding, it's similar:

    $output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
        list($utf8) = $match;
        $binary = iconv('UTF-8', 'UTF-32BE', $utf8);
        $entity = vsprintf('&#x%X;', unpack('N', $binary));
        return $entity;
    }, $input);
    

    I find this string manipulation a bit clearer than the one in "Get hexcode of html entities".
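
On PHP 7.2 or later, the encoding conversion and unpack step can probably be replaced by the built-in mb_ord(). The following is a sketch under that assumption (mbstring must be loaded); the wrapper name toHexEntities() is made up for this example.

```php
<?php
// Hypothetical wrapper around the same regex as above.
function toHexEntities(string $input): string
{
    // Match every character outside the ASCII range (requires valid UTF-8 input).
    return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function (array $match): string {
        // mb_ord() returns the Unicode code point of the matched character.
        return sprintf('&#x%X;', mb_ord($match[0], 'UTF-8'));
    }, $input);
}

echo toHexEntities('Français'), "\n"; // Fran&#xE7;ais
```

Note that preg_replace_callback() returns null if $input is not valid UTF-8, so callers may want to check for that case.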

  • 2020-12-03 19:56

    Firstly, when I faced this problem recently, I solved it by making sure my code files, DB connection, and DB tables were all UTF-8. Then simply echoing the text works. If you must escape the output from the DB, use htmlspecialchars() rather than htmlentities(), so that the UTF-8 characters are left alone instead of being converted to entities.
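
To illustrate the difference between the two functions, here is a small sketch. It assumes the source file and the string are UTF-8; the variable names are only for this example.

```php
<?php
$text = 'Français <b>';

// htmlspecialchars() escapes only the HTML-significant characters.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() also replaces every character that has a named entity.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // Français &lt;b&gt;
echo $entities, "\n"; // Fran&ccedil;ais &lt;b&gt;
```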

    I would like to document an alternative solution, because it solved a similar problem for me. I was using PHP's utf8_encode() to escape 'special' characters.

    I wanted to convert them into HTML entities for display. I wrote this code because I wanted to avoid iconv and similar functions as far as possible, since not all environments necessarily have them (do correct me if that is not so!).

    $foo = 'This is my test string \u03b50';
    echo unicode2html($foo);
    
    function unicode2html($string) {
        // Match a literal \u followed by exactly four hex digits (either case).
        return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $string);
    }
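
One limitation of the four-digit replacement: characters outside the Basic Multilingual Plane are written as two \uXXXX escapes (a surrogate pair), and each half would become its own, invalid entity. Below is a sketch of a variant that combines such pairs. The function name unicodeEscapesToHexEntities() is made up for this example, and a lone surrogate is still passed through unchanged as an entity.

```php
<?php
// Hypothetical variant that understands surrogate pairs.
function unicodeEscapesToHexEntities(string $text): string
{
    // First alternative: high surrogate + low surrogate. Second: any single \uXXXX.
    $pattern = '/\\\\u([dD][89abAB][0-9a-fA-F]{2})\\\\u([dD][c-fC-F][0-9a-fA-F]{2})'
             . '|\\\\u([0-9a-fA-F]{4})/';

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[3])) {
            // Single escape: the hex digits are the code point itself.
            $codePoint = hexdec($m[3]);
        } else {
            // Surrogate pair: combine both halves into one code point.
            $high = hexdec($m[1]);
            $low = hexdec($m[2]);
            $codePoint = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
        }
        return sprintf('&#x%X;', $codePoint);
    }, $text);
}

echo unicodeEscapesToHexEntities('epsilon: \u03b5'), "\n";     // epsilon: &#x3B5;
echo unicodeEscapesToHexEntities('smile: \ud83d\ude00'), "\n"; // smile: &#x1F600;
```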
    

    Hope this helps somebody in need :-)

  • 2020-12-03 20:03

    Your string looks like it is in UCS-4 encoding; you can try:

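    $first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
        $char = current($m);
        // Convert to 4 bytes per character; UCS-4BE makes the byte order explicit.
        $utf = iconv('UTF-8', 'UCS-4BE', $char);
        // Hex-dump the bytes and strip the leading zeros to get the code point.
        return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
    }, $string);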
    

    Output

    string 'Fran&#xE7;ais' (length=13)
    