Fixing encodings


Question


I have ended up with messed-up character encodings in one of our MySQL columns.

Typically I have

√© instead of é
√∂ instead of ö
√≠ instead of í

and so on...

I'm fairly certain someone here will know what happened and how to fix it.

UPDATE: Based on bobince's answer, and since I had this data in a file, I did the following:

#!/usr/bin/env python
import codecs

# Read the mangled file as UTF-8, reverse the erroneous Mac Roman decode,
# and write the repaired text back out as UTF-8.
f = codecs.open('./file.csv', 'r', 'utf-8')
f2 = codecs.open('./file-fixed.csv', 'w', 'utf-8')
for line in f:
    f2.write(line.encode('macroman').decode('utf-8'))
f.close()
f2.close()

after which

load data infile 'file-fixed.csv' 
into table list1 
fields terminated by ',' 
optionally enclosed by '"' 
ignore 1 lines;

properly imported the data.

UPDATE 2: Hammerite, just for completeness, here are the requested details...

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

The SHOW CREATE TABLE output for the table I am importing into shows DEFAULT CHARSET=utf8.

UPDATE 3:

Actually, with the above settings the load did not do the right thing: I could not compare against existing utf8 fields, and my loaded data only looked as if it had been imported correctly (I assume because the client, connection and results charsets were wrong but matched each other). So I updated the settings to:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

loaded the data again, and this time it was imported correctly (it compares properly with the existing data).
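
For reference, here is a minimal sketch (not from the original post) of one way to get a session onto utf8 from a client program; it assumes the MySQLdb driver and placeholder connection parameters, and the same effect can also come from my.cnf or the mysql client's --default-character-set option:

#!/usr/bin/env python
import MySQLdb

# Placeholder credentials; charset='utf8' makes the driver set the session
# character set (roughly the equivalent of SET NAMES utf8), so the client,
# connection and results charsets are all utf8.
conn = MySQLdb.connect(host='localhost', user='me', passwd='secret',
                       db='mydb', charset='utf8', use_unicode=True)
cur = conn.cursor()
cur.execute("SHOW VARIABLES LIKE 'character_set%'")
for name, value in cur.fetchall():
    print name, value

The server and database character sets (character_set_server, character_set_database) come from the server and schema configuration rather than the session, which is presumably why those had to be changed separately.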


Answer 1:


Your text has been encoded to UTF-8 and then re-decoded, erroneously, as Mac Roman.

You won't be able to fix this inside the database, as MySQL doesn't know the Mac Roman encoding. You might write a script to walk each row of each affected table and fix up the text by reversing the encode/decode cycle. Python is a possibility, having a good range of encodings:

>>> print u'é'.encode('macroman').decode('utf-8')
é
>>> print u'ö'.encode('macroman').decode('utf-8')
ö
>>> print u'í'.encode('macroman').decode('utf-8')
í

Alternatively, if there is no non-ASCII content that is unaffected by this problem, you could try using mysqldump to export an SQL dump and then convert it all at once, either with a script like the one above or, if you have a text editor (presumably on the Mac) that can do it, by loading the dump as UTF-8 and saving it as Mac Roman. Finally, re-import the dump with mysql < dump.sql.



Source: https://stackoverflow.com/questions/3823278/fixing-encodings
