Detecting utf8 broken characters in MySQL

广开言路 2020-12-02 05:03

I've got a database with a bunch of broken utf8 characters scattered across several tables. The list of characters isn't very extensive AFAIK (áéíúóÁÉÍÓÚÑñ)

Fixing

18 Answers
  • 2020-12-02 05:37

    You might have a mix of rows: some with properly encoded UTF-8 and some with wrongly encoded characters. In this case "CONVERT(BINARY CONVERT(post_title USING latin1) USING utf8)" will truncate some fields.

    I ended up doing it this way:

    update `table` set `name` = replace(`name` ,CONVERT(BINARY "ä" USING latin1),'ä');
    update `table` set `name` = replace(`name` ,CONVERT(BINARY "ö" USING latin1),'ö');
    update `table` set `name` = replace(`name` ,CONVERT(BINARY "ü" USING latin1),'ü');
    update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ä" USING latin1),'Ä');
    update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ö" USING latin1),'Ö');
    update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ü" USING latin1),'Ü');
    update `table` set `name` = replace(`name` ,CONVERT(BINARY "ß" USING latin1),'ß');
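
    The pairs used in these statements can be derived rather than typed by hand. A minimal Python sketch, assuming the damage comes from UTF-8 bytes being read as latin1 (Python's cp1252 codec stands in for MySQL's latin1 here); the names mojibake and repair_known are made up for illustration:

```python
# Characters expected in the data (taken from the statements above).
CHARS = "äöüÄÖÜß"

def mojibake(ch: str) -> str:
    """Return what `ch` looks like after its UTF-8 bytes are read as cp1252."""
    return ch.encode("utf-8").decode("cp1252")

# Map each broken sequence to the character it should have been.
REPAIRS = {mojibake(ch): ch for ch in CHARS}

def repair_known(text: str) -> str:
    """Replace only the known broken sequences and leave everything else alone."""
    for broken, good in REPAIRS.items():
        text = text.replace(broken, good)
    return text

print(repair_known("MÃ¼ller"))  # broken input
print(repair_known("Müller"))   # already correct, stays unchanged
```

    Like the replace() statements, this only repairs the characters listed in CHARS, so anything not in the list stays broken.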
    
  • 2020-12-02 05:39

    I had this same problem but didn't like the replace() solution because there's always the possibility of missing some characters. I was working against a column with mixed data (some had been utf8_encode()d and some not) of 4 million or so rows, about 250k of them with mis-encoded data (Ã‰ and similar sequences), covering about 15 languages: mainly European ones, but also Russian, Japanese and Chinese.

    I started by duplicating the column, since I didn't want to lose any data:

    ALTER TABLE images ADD COLUMN reptitle TEXT;
    

    Copied all the data with multibyte characters (thanks Adam for the tip)

    UPDATE images SET reptitle = title WHERE LENGTH(title) != CHAR_LENGTH(title);
    

    Since reptitle was created with the table's default character set it was already utf8, but it contained the corrupted data, since the images table used to be a latin1 source. Column reptitle now holds only values with multibyte characters: some correctly encoded (they had been properly utf8_encode()d) and some corrupted. So then with David's tip...

    ALTER TABLE images MODIFY reptitle TEXT character set latin1;
    ALTER TABLE images MODIFY reptitle BLOB;
    ALTER TABLE images MODIFY reptitle TEXT character set utf8;
    

    The middle step matters: converting straight from latin1 TEXT to utf8 TEXT would transcode the characters again, whereas going through BLOB keeps the bytes and only changes how they are interpreted. This had the effect of correcting all incorrectly encoded data ('Ã©tudiantes' became 'étudiantes', etc.), but data which was previously correct was truncated at the first multibyte character ('Lapin de Pâques' became 'Lapin de P'). I don't know why the truncation happens, but it's in a disposable column so I didn't care. Because the truncated values have no multibyte characters left, CHAR_LENGTH and LENGTH return the same value for them, which makes the final query easy:
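
    What the three ALTER statements do to a single value can be modelled in a few lines of Python. This is only a sketch: it assumes cp1252 is a fair stand-in for MySQL's latin1, and it models the truncation by cutting at the first byte that is not valid UTF-8, which is a guess consistent with the behaviour described here. The function name alter_chain is hypothetical.

```python
def alter_chain(value: str) -> str:
    """Mimic utf8 TEXT -> latin1 TEXT -> BLOB -> utf8 TEXT for one value."""
    # Step 1 (utf8 -> latin1): the characters are transcoded to latin1 bytes.
    raw = value.encode("cp1252")
    # Step 2 (latin1 -> BLOB): the bytes are kept, the charset label is dropped.
    # Step 3 (BLOB -> utf8): the same bytes are now read as UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Assumption: keep only the bytes before the first invalid sequence.
        return raw[:err.start].decode("utf-8")

print(alter_chain("Ã©tudiantes"))      # étudiantes
print(alter_chain("Lapin de Pâques"))  # Lapin de P
```

    In this model the correctly encoded 'â' becomes the single byte 0xE2 in step 1, which cannot start a valid UTF-8 sequence when followed by 'q', so the value stops there. Skipping step 2 would correspond to decoding raw with cp1252 again, which just gives back the original value.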

    UPDATE images SET title = reptitle WHERE LENGTH(reptitle) != CHAR_LENGTH(reptitle);
    

    Then of course just drop the spare column

    ALTER TABLE images DROP COLUMN reptitle;
    

    Also make sure (since I use PHP and this had tripped me up a couple of times so I thought I'd mention it here) all your script files are UTF8 (without BOM) and you are using:

    mysql_set_charset('utf8', $connection);
    

    Et voilà... perfectly repaired data, all languages :)

  • 2020-12-02 05:41

    This saved my life

    UPDATE ohp_posts SET post_content = CONVERT(CAST(CONVERT(post_content USING latin1) AS BINARY) USING utf8)
    

    I've found it here http://stanis.net/2014/04/replacing-latin-1-with-utf-8-characters-in-mysql/
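
    On a column that mixes broken and already-correct values, the same round trip can be applied per value and skipped whenever it does not produce valid UTF-8. The Python sketch below illustrates that guard; it is not part of the statement above, and the names to_mysql_latin1 and repair are hypothetical. It relies on MySQL's latin1 being cp1252 plus five extra bytes mapped to the control characters with the same code point.

```python
# Bytes that cp1252 leaves undefined; MySQL's latin1 maps them to the
# control characters with the same code point.
MYSQL_LATIN1_EXTRA = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def to_mysql_latin1(text: str) -> bytes:
    """Encode text the way MySQL's latin1 would, or raise UnicodeEncodeError."""
    out = bytearray()
    for ch in text:
        if ord(ch) in MYSQL_LATIN1_EXTRA:
            out.append(ord(ch))
        else:
            out += ch.encode("cp1252")  # raises if ch has no latin1 byte
    return bytes(out)

def repair(text: str) -> str:
    """Undo one round of UTF-8-read-as-latin1, or return text unchanged."""
    try:
        return to_mysql_latin1(text).decode("utf-8")
    except UnicodeError:
        # Not representable in latin1, or the bytes are not valid UTF-8:
        # treat the value as already correct.
        return text

for sample in ["Ã©tudiantes", "Lapin de Pâques", "日本語"]:
    print(repair(sample))
```

    A correct value whose latin1 bytes happen to form valid UTF-8 would still be altered, so a dry run on a copy of the column is worth doing first.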

  • 2020-12-02 05:43

    This is an extension of @Thales Ceolin's answer in order to modify every table in the db:

    select concat(
        "update ", 
        a.TABLE_NAME, 
        " set ", b.COLUMN_NAME, 
        " = CONVERT(BINARY CONVERT(", 
        b.COLUMN_NAME, 
        " USING latin1) USING utf8) where ",
        b.COLUMN_NAME, 
        " is not null;") query
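    from INFORMATION_SCHEMA.TABLES a
    -- match on schema as well as table name, so same-named tables in other schemas are not picked up
    join INFORMATION_SCHEMA.COLUMNS b
      on a.TABLE_SCHEMA = b.TABLE_SCHEMA and a.TABLE_NAME = b.TABLE_NAME
    where a.table_schema = 'db_name'
    and a.TABLE_TYPE = 'BASE TABLE'
    and b.data_type in ('text', 'varchar')
    -- remove this last condition to cover every table in the schema
    and a.TABLE_NAME = 'table_name';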
    

    This will result in:

    update table_name set idn = CONVERT(BINARY CONVERT(idn USING latin1) USING utf8) where idn is not null;
    update table_name set name = CONVERT(BINARY CONVERT(name USING latin1) USING utf8) where name is not null;
    update table_name set primary_last_name = CONVERT(BINARY CONVERT(primary_last_name USING latin1) USING utf8) where primary_last_name is not null;
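
    The same statements could also be generated outside the database, for example to review them before running anything. A small Python sketch, assuming the (table, column) pairs have already been fetched from INFORMATION_SCHEMA; build_update and the sample pairs are illustrative only:

```python
def build_update(table: str, column: str) -> str:
    """Build one repair statement in the same shape as the output above."""
    return (
        f"update {table} set {column} = "
        f"CONVERT(BINARY CONVERT({column} USING latin1) USING utf8) "
        f"where {column} is not null;"
    )

# Hypothetical result of the INFORMATION_SCHEMA query.
columns = [("table_name", "idn"), ("table_name", "name")]

for table, column in columns:
    print(build_update(table, column))
```

    Table and column names are pasted in unquoted, as in the SQL version, so names that need backticks would have to be handled separately.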
    
  • 2020-12-02 05:45

    The SELECT statement you need is the following:

    SELECT * FROM `table` WHERE LENGTH(name) != CHAR_LENGTH(name);
    

    This returns all rows which contain multi-byte characters, whether or not they are correctly encoded.

    name is assumed to be the field where the weird characters would be found.
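
    The comparison works because LENGTH counts bytes while CHAR_LENGTH counts characters. A rough Python equivalent for a utf8 column, with has_multibyte as a made-up name:

```python
def has_multibyte(value: str) -> bool:
    """Counterpart of LENGTH(name) != CHAR_LENGTH(name) for a utf8 column."""
    byte_length = len(value.encode("utf-8"))  # what LENGTH() counts
    char_length = len(value)                  # what CHAR_LENGTH() counts
    return byte_length != char_length

for sample in ["Paques", "Pâques", "PÃ¢ques"]:
    print(sample, has_multibyte(sample))
```

    Both 'Pâques' and the broken 'PÃ¢ques' are flagged, so the query narrows down the candidates but does not by itself tell correct rows from broken ones.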

  • 2020-12-02 05:49

    How about a different approach, namely converting the column back and forth to get the correct character set? You can convert it to binary, then to utf-8 and then to iso-8859-1 or whatever else you're using. See the manual for the details.
