Decoding unknown encoded Traditional Chinese character strings using Python

Submitted by 牧云@^-^@ on 2019-12-02 11:42:09

Question


Hi, I have a website that is in Traditional Chinese, and when I check the site statistics it tells me that the search term for the website is å%8f°å%8d%97 親å­%90é¤%90廳, which obviously makes no sense to me. My question is: what is this encoding called? And is there a way to use Python to decode this character string? Thank you.


Answer 1:


It is called a mutt encoding; the underlying bytes have been mangled beyond their original meaning and they are no longer a real encoding.

It was once URL-quoted UTF-8, but it has since been decoded as latin-1 without first unquoting the URL escapes. I was able to un-mangle it by reversing those steps:

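>>> # Python 2 session
>>> from urllib2 import unquote
>>> # Every character must fit in latin-1, so 親 and 廳 appear in mangled form (è¦ª, å»³); \xad is the invisible soft hyphen.
>>> bytesquoted = u'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')
>>> unquoted = unquote(bytesquoted)
>>> print unquoted.decode('utf8')
台南 親子餐廳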
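The session above is Python 2. A minimal Python 3 sketch of the same three steps might look like the following; the function name `unmangle` is hypothetical, and the sample string writes every non-ASCII character as an escape so that the invisible soft hyphen (U+00AD) cannot get lost in copy and paste.

```python
from urllib.parse import unquote_to_bytes


def unmangle(text: str) -> str:
    """Recover UTF-8 text that was decoded as latin-1 with its %XX escapes left in place.

    Hypothetical helper name; it assumes exactly this kind of mangling.
    """
    # Step 1: undo the wrong latin-1 decode to get the raw bytes back.
    raw = text.encode("latin-1")
    # Step 2: turn the remaining %XX escapes into the bytes they stand for.
    raw = unquote_to_bytes(raw)
    # Step 3: decode the now-complete byte string as UTF-8.
    return raw.decode("utf-8")


# The mangled search term, spelled with escapes for every non-ASCII character.
mangled = (
    "\u00e5%8f\u00b0\u00e5%8d%97 "
    "\u00e8\u00a6\u00aa\u00e5\u00ad%90\u00e9\u00a4%90\u00e5\u00bb\u00b3"
)
recovered = unmangle(mangled)  # '台南 親子餐廳'
```

Passing bytes to `unquote_to_bytes` matters here: if it were given a `str`, it would first encode it as UTF-8, which would add a second layer of mangling.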



Answer 2:


You can use chardet. Install the library with:

pip install chardet
# or for python3
pip3 install chardet

The library includes a CLI utility, chardetect (or chardetect3 accordingly), that takes the path to a file.
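chardet can also be called from Python rather than through the CLI. The sketch below wraps `chardet.detect`, which takes bytes and returns a dict containing `encoding` and `confidence`; the wrapper name `detect_encoding` is hypothetical, and the result is only a statistical guess that may be `None` for the encoding when detection fails.

```python
def detect_encoding(raw: bytes):
    """Return chardet's best guess as an (encoding, confidence) tuple.

    Hypothetical wrapper; requires the third-party chardet package.
    """
    # Imported here so the third-party dependency is explicit at the call site.
    import chardet

    # detect() expects bytes, not str, so read files in binary mode.
    guess = chardet.detect(raw)
    return guess["encoding"], guess["confidence"]


# Example call (assumes myfile.txt exists):
# with open("myfile.txt", "rb") as f:
#     print(detect_encoding(f.read()))
```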

Once you know the encoding, you can use it in Python, for example like this:

import codecs
f = codecs.open('myfile.txt', 'r', 'GB2312')

or from the shell:

iconv -f GB2312 -t UTF-8 myfile.txt -o decoded.txt
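If iconv is not available, roughly the same conversion can be done with the Python standard library. This is a sketch under the assumption that the source encoding is already known; the function name `transcode_file` is hypothetical, and the file names in the comment are the ones used in the commands above.

```python
from pathlib import Path


def transcode_file(src, dst, src_encoding, dst_encoding="utf-8"):
    """Re-encode a text file, similar in spirit to `iconv -f SRC -t DST`.

    Hypothetical helper; returns the decoded text for convenience.
    """
    # Decode strictly so a wrong encoding raises UnicodeDecodeError
    # instead of silently writing garbage to the output file.
    text = Path(src).read_text(encoding=src_encoding, errors="strict")
    # Write the same text back out in the target encoding.
    Path(dst).write_text(text, encoding=dst_encoding)
    return text


# Example call, mirroring the iconv command:
# transcode_file("myfile.txt", "decoded.txt", "gb2312")
```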

If you need more performance, there is also cchardet, a faster C-optimized version of chardet.



Source: https://stackoverflow.com/questions/12316766/decoding-unknown-encoded-traditional-chinese-character-strings-using-python
