Fixing mojibakes in UTF-8 text

冷暖自知 提交于 2019-12-20 03:13:37

问题


I have a file with text in Portuguese in UTF-8. Somehow, who produced the file selected the wrong encoding, and the text is full of mojibake:

IDENTIFICAÌàÌÄO instead of identificação
André instead of André

Automated tools do not see anything wrong with the file. I tried to fix it with Python package ftfy to no avail. How can I fix this file, apart from replacing all incorrect characters manually?


回答1:


"André" instead of "André" is the Latin-1 interpretation of UTF-8 encoding. You can fix it by inverting the encoding/decoding:

>>> 'André'.encode('latin-1').decode('utf-8')
'André'

All cases following this pattern can be fixed like that.

However, I can't explain the other case (with "Ìà" for "ç" and "ÌÄ" for "ã"), and therefore can't provide a solution. If you can find a codec where "Ì", "à", and "Ä" have the codepoints C3, A7, and A3, respectively, then you can use this instead of Latin-1 for fixing the text.



来源:https://stackoverflow.com/questions/48430825/fixing-mojibakes-in-utf-8-text

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!