Convert Unicode from JSON string with PHP

挽巷 2020-12-18 03:26

I've been reading up on a few solutions but have not managed to get anything to work yet.

I have a JSON string that I read in from an API call and it contains Un

3 Answers
  • 2020-12-18 04:08

    The output is correct.

    \u00c2 == Â
    \u00a3 == £
    

    So nothing is wrong here. And converting to HTML entities is easy:

    htmlentities($title);
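
    As a minimal sketch of what this answer describes, the snippet below decodes a made-up JSON fragment (the `title` key and the `100` suffix are assumptions, not taken from the question) and shows that each escape becomes exactly the code point it names before `htmlentities()` is applied:

```php
<?php
// Hypothetical JSON fragment; inside single quotes the \u escapes stay literal.
$json = '{"title": "\u00c2\u00a3100"}';

$data = json_decode($json, true);
$title = $data['title'];

// U+00C2 and U+00A3 are each stored as two UTF-8 bytes, followed by "100".
echo bin2hex($title), PHP_EOL; // c382c2a3313030

// Both characters have named HTML entities.
$html = htmlentities($title, ENT_QUOTES, 'UTF-8');
echo $html, PHP_EOL; // &Acirc;&pound;100
```

    The charset is passed explicitly so the result does not depend on the `default_charset` setting.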
    
  • 2020-12-18 04:09

    It is not UTF-16 encoding. It looks like bogus encoding, because the \uXXXX escape is independent of any UTF or UCS encoding of Unicode. \u00c2\u00a3 really maps to the string Â£.

    What you should have is \u00a3, which is the Unicode code point for £.

    {0xC2, 0xA3} is the 2-byte UTF-8 encoding of this code point.
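
    The difference between the code point and its UTF-8 bytes can be checked directly. This is only an illustrative sketch; the variable names are made up:

```php
<?php
// A correct escape: a single code point, U+00A3.
$good = json_decode('"\u00a3"');

// The escape pair from the question: two code points, U+00C2 and U+00A3.
$bad = json_decode('"\u00c2\u00a3"');

// json_decode() returns UTF-8, so the correct escape yields the bytes C2 A3,
// while the bogus pair yields two characters of two bytes each.
echo bin2hex($good), PHP_EOL; // c2a3
echo bin2hex($bad), PHP_EOL;  // c382c2a3
```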

    If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of Unicode code points to a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.

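    function fixBadUnicode($str) {
        // Note: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0,
        // so on current PHP versions this needs preg_replace_callback() instead.
        return utf8_decode(preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str));
    }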
    

    Example here: http://phpfiddle.org/main/code/6sq-rkn
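
    A different route to the same repair, sketched here as an alternative rather than as this answer's method, is to let `json_decode()` run first and then map each decoded character back to the byte it came from. It assumes the mbstring extension is loaded; `repairDoubleEncoded` is a hypothetical helper name:

```php
<?php
// Assumption: the mbstring extension is available.
function repairDoubleEncoded(string $decoded): string
{
    // Map each code point U+0000..U+00FF back to a single byte.
    $bytes = mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');

    // Accept the result only if nothing was lost on the way and the
    // recovered bytes are themselves valid UTF-8.
    $lossless = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1') === $decoded;
    if ($lossless && mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return $decoded;
}

$fixed = repairDoubleEncoded(json_decode('"\u00c2\u00a3100"'));
echo bin2hex($fixed), PHP_EOL; // c2a3313030, i.e. "£100" in UTF-8

// A genuine single character such as U+00E9 is left untouched.
$kept = repairDoubleEncoded(json_decode('"\u00e9"'));
echo bin2hex($kept), PHP_EOL; // c3a9
```

    The checks cannot tell a double-encoded £ from a string that was really meant to read Â£, so this should only be applied to data known to come from the faulty encoder.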

    Edit:

    If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:

    function fixBadUnicodeForJson($str) {
        $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
        $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
        $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
        $str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
        return $str;
    }
    

    Edit 2: fixed the previous function so that it transforms any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.

    Be aware that some of these characters, which probably come from an editor such as Word, cannot be represented in ISO-8859-1 and will therefore appear as '?' after utf8_decode.
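
    The '?' effect can be reproduced with a curly apostrophe of the kind Word inserts. The sketch below uses `mb_convert_encoding()` as a stand-in for `utf8_decode()` (which is deprecated as of PHP 8.2) and assumes the mbstring extension; the sample characters are chosen for illustration only:

```php
<?php
// Make the replacement character explicit instead of relying on the ini default.
mb_substitute_character(0x3F); // '?'

$pound = "\u{00A3}"; // POUND SIGN, exists in ISO-8859-1
$quote = "\u{2019}"; // RIGHT SINGLE QUOTATION MARK, does not exist in ISO-8859-1

$poundLatin1 = mb_convert_encoding($pound, 'ISO-8859-1', 'UTF-8');
$quoteLatin1 = mb_convert_encoding($quote, 'ISO-8859-1', 'UTF-8');

echo bin2hex($poundLatin1), PHP_EOL; // a3
echo $quoteLatin1, PHP_EOL;          // ?
```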

  • 2020-12-18 04:21

    Here is an updated version of the function using preg_replace_callback instead of preg_replace.

    function fixBadUnicodeForJson($str) {
        // Longest sequences first: 4-byte, then 3-byte, 2-byte and single-byte escapes.
        // The captured hex digits are read from $matches inside each callback.
        $str = preg_replace_callback(
            '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
            function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
            $str
        );
        $str = preg_replace_callback(
            '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
            function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
            $str
        );
        $str = preg_replace_callback(
            '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
            function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
            $str
        );
        $str = preg_replace_callback(
            '/\\\\u00([0-9a-f]{2})/',
            function($matches) { return chr(hexdec($matches[1])); },
            $str
        );
        return $str;
    }
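
    Since every pass replaces each \u00XX escape with the byte XX, the four passes can, as far as I can tell, be collapsed into one. The following is a sketch under that assumption; `unescapeByteEscapes` and the `price` key are made-up names, and the `i` flag (uppercase hex digits) is an addition not present in the function above:

```php
<?php
function unescapeByteEscapes(string $json): string
{
    // Replace every \u00XX escape with the raw byte 0xXX.
    return preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})/i',
        function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $json
    );
}

// Hypothetical API response with a double-encoded pound sign.
$raw = '{"price": "\u00c2\u00a3100"}';

$data = json_decode(unescapeByteEscapes($raw), true);
echo bin2hex($data['price']), PHP_EOL; // c2a3313030, i.e. "£100" in UTF-8
```

    Like the multi-pass version, this also rewrites escapes that were legitimate (for example \u0022 would become a bare quote), so it is only safe on input known to contain nothing but byte-wise escapes.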
    