PHP cp1252/windows-1252 conversion to UTF-8

北荒 2021-01-03 04:09

I'm in the process of trying to convert our database from latin1 to UTF-8. Unfortunately I can't do a massive single switchover as the application needs to stay online and

1 Answer
  • 2021-01-03 04:52

    Firstly, Windows-1252 is not a subset of UTF-8. Only the ASCII range (0x00 through 0x7F) is encoded identically in both; every CP1252 byte from 0x80 upward is encoded differently in UTF-8.
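
    To make the subset point concrete, here is a minimal sketch, assuming the mbstring and iconv extensions are loaded (the variable names are illustrative only). A pure-ASCII string is already valid UTF-8, whereas the single CP1252 byte 0x80 (the euro sign) is not valid UTF-8 and has to be converted into a three-byte sequence.

    ```php
    <?php
    // Pure ASCII bytes are valid UTF-8 as they stand.
    $ascii = "COD";
    var_dump(mb_check_encoding($ascii, 'UTF-8'));      // bool(true)

    // 0x80 is the euro sign in CP1252, but a lone 0x80 is not valid UTF-8.
    $cp1252Euro = "\x80";
    var_dump(mb_check_encoding($cp1252Euro, 'UTF-8')); // bool(false)

    // Converting it yields the three-byte UTF-8 encoding of U+20AC.
    $utf8Euro = iconv('Windows-1252', 'UTF-8', $cp1252Euro);
    var_dump(bin2hex($utf8Euro));                      // string(6) "e282ac"
    ```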

    Secondly, it is impossible to reliably handle a string that mixes CP1252 and UTF-8 "characters" (strictly speaking, CP1252 deals in single bytes and Unicode in code points). Either you read it as CP1252, in which case every multi-byte UTF-8 character shows up as several unrelated single-byte characters, or you read it as UTF-8, in which case invalid byte sequences are cut out (or, if the CP1252 bytes happen to form a valid UTF-8 sequence, turned into the wrong characters). With $c = mb_strcut($c, 1); you are not removing the test character; you are removing a question mark that mb_convert_encoding produced because it could not convert that Unicode character into a CP1252 character.
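
    Both failure modes can be reproduced in a few lines. This is only a sketch under assumptions: the byte strings are made-up test data, and the substitute character is set explicitly to "?" because that setting is configurable.

    ```php
    <?php
    // Make the replacement for unmappable characters explicit ('?' = 0x3F).
    mb_substitute_character(0x3F);

    // UTF-8 text starting with U+4E2D, which has no CP1252 equivalent.
    $utf8 = "\xE4\xB8\xADCOD";
    $asCp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
    var_dump($asCp1252); // string(4) "?COD"

    // The reverse mistake: the UTF-8 bytes for "é" (C3 A9), read as CP1252,
    // become two separate characters, "Ã" and "©".
    $misread = iconv('Windows-1252', 'UTF-8', "\xC3\xA9");
    var_dump(bin2hex($misread)); // string(8) "c383c2a9"
    ```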

    Thirdly, you should never convert a string and only afterwards try to determine its encoding. After you converted your second test string, it was ?COD?. There is no reason to check whether a Unicode character exists in it, because you had just converted it to CP1252, which cannot represent characters outside its own table. As the programmer, you have to know what encoding the output is in.

    The only solution is to check if the string is CP1252, convert the offending characters to placeholders, and then convert that string to Unicode:

    function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
        if ($input === null || $input === '') {
            return $default;
        }
    
        // https://en.wikipedia.org/wiki/UTF-8
        // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
        // https://en.wikipedia.org/wiki/Windows-1252
        // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
        $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
        if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
            /*
             * Use the search/replace arrays if a character needs to be replaced with
             * something other than its Unicode equivalent.
             */ 
    
            /*$replace = array(
                128 => "&#x20AC;",      // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
                129 => "",              // UNDEFINED
                130 => "&#x201A;",      // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
                131 => "&#x0192;",      // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
                132 => "&#x201E;",      // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
                133 => "&#x2026;",      // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
                134 => "&#x2020;",      // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
                135 => "&#x2021;",      // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
                136 => "&#x02C6;",      // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
                137 => "&#x2030;",      // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
                138 => "&#x0160;",      // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
                139 => "&#x2039;",      // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
                140 => "&#x0152;",      // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
                141 => "",              // UNDEFINED
                142 => "&#x017D;",      // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
                143 => "",              // UNDEFINED
                144 => "",              // UNDEFINED
                145 => "&#x2018;",      // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
                146 => "&#x2019;",      // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
                147 => "&#x201C;",      // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
                148 => "&#x201D;",      // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
                149 => "&#x2022;",      // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
                150 => "&#x2013;",      // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
                151 => "&#x2014;",      // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
                152 => "&#x02DC;",      // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
                153 => "&#x2122;",      // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
                154 => "&#x0161;",      // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
                155 => "&#x203A;",      // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
                156 => "&#x0153;",      // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
                157 => "",              // UNDEFINED
                158 => "&#x017E;",      // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
                159 => "&#x0178;",      // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
            );*/
    
            if (count($replace) != 0) {
                $find = array();
                foreach (array_keys($replace) as $key) {
                    $find[] = chr($key);
                }
                $input = str_replace($find, array_values($replace), $input);
            }
            /*
             * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
             * and control characters, always convert from Windows-1252 to UTF-8.
             */
            $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
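            if (count($replace) != 0) {
                // Decode the placeholder entities into real UTF-8 characters.
                // The target charset is passed explicitly so the result does
                // not depend on the PHP version's default.
                $input = html_entity_decode($input, ENT_COMPAT, 'UTF-8');
            }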
        }
        return $input;
    }
    

    The trick is that you have to check for both ISO-8859-1 and CP1252, because they are identical except for bytes 0x80 through 0x9F. I found this out the hard way after hours of playing around with this function, only to have this answer save me. If you found this function helpful, go +1 that answer.
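
    The difference between the two encodings is confined to bytes 0x80 through 0x9F. As a small illustration (the sample byte 0x93 is chosen arbitrarily, and iconv is assumed to be available), the same byte converts to a printable quotation mark when read as CP1252 but to an invisible C1 control code when read as ISO-8859-1:

    ```php
    <?php
    $byte = "\x93";

    // CP1252 maps 0x93 to U+201C LEFT DOUBLE QUOTATION MARK.
    $fromCp1252 = iconv('Windows-1252', 'UTF-8', $byte);
    var_dump(bin2hex($fromCp1252)); // string(6) "e2809c"

    // ISO-8859-1 maps 0x93 to the C1 control code U+0093.
    $fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);
    var_dump(bin2hex($fromLatin1)); // string(4) "c293"

    // Outside 0x80-0x9F the two encodings agree, e.g. 0xE9 is "é" in both.
    $sameAbove = iconv('Windows-1252', 'UTF-8', "\xE9") === iconv('ISO-8859-1', 'UTF-8', "\xE9");
    var_dump($sameAbove); // bool(true)
    ```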

    Basically, when a $replace map is passed, this function replaces the problematic CP1252 bytes (0x80 through 0x9F) with HTML entities representing the corresponding Unicode characters. We then convert the string from ISO-8859-1/CP1252 to UTF-8; the entities are not mangled because they consist only of ASCII characters. Finally, we decode the HTML entities and end up with a 100% UTF-8 string.
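
    The three steps can be traced on a single byte. This is a stripped-down sketch rather than the full function: it handles only byte 0x80, and the sample string is invented for illustration.

    ```php
    <?php
    $input = "Price: 5\x80"; // CP1252 bytes; 0x80 is the euro sign

    // Step 1: swap the CP1252 byte for an ASCII-only HTML entity.
    $step1 = str_replace(chr(128), '&#x20AC;', $input);

    // Step 2: convert to UTF-8; the entity is plain ASCII, so it passes through.
    $step2 = iconv('Windows-1252', 'UTF-8//IGNORE', $step1);

    // Step 3: decode the entity into the real UTF-8 character.
    $step3 = html_entity_decode($step2, ENT_COMPAT, 'UTF-8');

    var_dump($step1);          // string(16) "Price: 5&#x20AC;"
    var_dump(bin2hex($step3)); // ends in e282ac, the UTF-8 bytes of U+20AC
    ```

    One caveat with this design: html_entity_decode also decodes any entities that were already present in the stored text (for example &amp;), so it is worth checking whether the data contains literal entities before relying on this step.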
