How to decode Unicode escape sequences like “\u00ed” to proper UTF-8 encoded characters?

傲寒 2020-11-22 01:01

Is there a function in PHP that can decode Unicode escape sequences like "\u00ed" to "í", along with all other similar occurrences?

I found si

7 answers
  •  庸人自扰
    2020-11-22 01:30

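    // "cat" followed by U+1F638 written as a UTF-16 surrogate pair
    $str = '\u0063\u0061\u0074'.'\ud83d\ude38';
    // "cat" followed by a lone lead surrogate, expected to decode to U+FFFD
    $str2 = '\u0063\u0061\u0074'.'\ud83d';
    
    // Both comparisons should dump bool(true)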
    var_dump(
        "cat\xF0\x9F\x98\xB8" === escape_sequence_decode($str),
        "cat\xEF\xBF\xBD" === escape_sequence_decode($str2)
    );
    
    function escape_sequence_decode($str) {
    
        // [U+D800 - U+DBFF][U+DC00 - U+DFFF]|[U+0000 - U+FFFF]
        $regex = '/\\\u([dD][89abAB][\da-fA-F]{2})\\\u([dD][c-fC-F][\da-fA-F]{2})
                  |\\\u([\da-fA-F]{4})/sx';
    
        return preg_replace_callback($regex, function($matches) {
    
            if (isset($matches[3])) {
                $cp = hexdec($matches[3]);
            } else {
                $lead = hexdec($matches[1]);
                $trail = hexdec($matches[2]);
    
                // http://unicode.org/faq/utf_bom.html#utf16-4
                $cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
            }
    
            // https://tools.ietf.org/html/rfc3629#section-3
            // Characters between U+D800 and U+DFFF are not allowed in UTF-8
            if ($cp > 0xD7FF && 0xE000 > $cp) {
                $cp = 0xFFFD;
            }
    
            // https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
            // php_utf32_utf8(unsigned char *buf, unsigned k)
    
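            if ($cp < 0x80) {
                // ASCII: a single byte
                return chr($cp);
            } elseif ($cp < 0xA0) {
                // U+0080..U+009F: html_entity_decode() leaves these numeric
                // entities undecoded under the default HTML 4.01 document type,
                // so build the two-byte UTF-8 sequence by hand
                return chr(0xC0 | $cp >> 6).chr(0x80 | $cp & 0x3F);
            }
    
            // Pass the charset explicitly instead of relying on default_charset
            return html_entity_decode('&#'.$cp.';', ENT_QUOTES | ENT_HTML401, 'UTF-8');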
        }, $str);
    }
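
As a rough illustration of the surrogate-pair arithmetic used in the callback above, the sketch below works through the same pair as the test string (`\ud83d\ude38`). The helper name `surrogate_pair_to_code_point` is hypothetical and not part of the answer. The formula is an algebraically equivalent rearrangement of the one in the code, and the final line assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper, named for illustration only.
// Combines a UTF-16 lead/trail surrogate pair into a single code point.
function surrogate_pair_to_code_point(int $lead, int $trail): int
{
    // Each surrogate carries 10 payload bits; together they encode cp - 0x10000.
    return 0x10000 + (($lead - 0xD800) << 10) + ($trail - 0xDC00);
}

$cp = surrogate_pair_to_code_point(0xD83D, 0xDE38);
printf("U+%X\n", $cp); // prints "U+1F638"

// mb_chr() needs the mbstring extension (PHP 7.2+), so guard the call.
if (function_exists('mb_chr')) {
    echo bin2hex(mb_chr($cp, 'UTF-8')), "\n"; // prints "f09f98b8"
}
```

The bytes `f0 9f 98 b8` match the `"\xF0\x9F\x98\xB8"` expected by the `var_dump` check in the answer.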
    
