How to correct double-encoded UTF-8 strings sitting in MySQL utf8_general_ci fields?

后端 未结 6 1541
感动是毒
感动是毒 2020-12-15 07:42

I have to redesign a class where (amongst other things) UTF-8 strings are double-encoded wrongly:

$string = iconv(\'ISO-8859-1\', \'UTF-8\', $string);
:
$str         


        
相关标签:
6条回答
  • 2020-12-15 08:31

    MySQL provides a regexp match but no regexp replace, so you're usually better off iterating through each row in php, converting as needed, and updating the row if it has been changed.

    0 讨论(0)
  • 2020-12-15 08:32

    MySql is charset aware, so you can convert in SQL. But for this case, I would probably prefer to just script it in PHP, as it's a one-off task anyway.

    Keep in mind that columns in MySql have a charset property. The collation is (in theory) orthogonal to the charset. While a utf8_general_ci collation would imply that the charset is utf8, it's not a given. You could in theory mix a utf8 collation with a latin1 encoding (And get garbage as a result).

    If you decide to do this in SQL, look here:

    http://dev.mysql.com/doc/refman/5.0/en/charset-convert.html

    0 讨论(0)
  • 2020-12-15 08:36

    I tried the posted solutions, but my DB kept spitting up errors. Eventually I stumbled upon the following solution (in a forum I believe, but I can't remember where):

    UPDATE table_name SET col_name = CONVERT(CONVERT(CONVERT(col_name USING latin1) USING binary) USING utf8);
    

    and it worked a treat. Hope this helps anyone who stumbled here from desperate google searching like me.

    NOTE: This is of course assuming your double encoded character issues originate from an overly helpful MySQL conversion from latin1 to utf8, but I believe that's where most of these "corrupted characters" happen. This basically does the same conversion as mentioned above back to latin1, then binary, then to utf8 (using the binary step as a way to prevent the re-encoding of the already encoded latin1 entities)

    0 讨论(0)
  • 2020-12-15 08:42

    I found the following approach simpler:

    mysqldump -h DB_HOST -u DB_USER -p --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql
    

    Then drop all tables and re-import with following command:

    mysql -h DB_HOST -u DB_USER -p --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql
    

    Tip was found at this URL: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

    0 讨论(0)
  • 2020-12-15 08:43

    Alter the table to change the column character set to Latin-1. You will now have singly-encoded UTF-8 strings, but sitting in a field whose collation is supposed to be Latin-1.

    What you do then is, change the column character set back to UTF-8 via the binary character set - that way MySQL doesn't convert the characters at any point.

    ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET latin1
    ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET binary
    ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET utf8
    

    (is the correct syntax iirc; put the appropriate column type in where ... is)

    0 讨论(0)
  • 2020-12-15 08:43

    Generate a dump using mysqldump, change the encoding declaration (it's in the first commands), and reload in another database.

    You can also use iconv on your dump to transcode it.

    You can SELECT INTO OUTFILE, massage the file using php or iconv, then LOAD DATA INFILE.

    0 讨论(0)
提交回复
热议问题