Decoding unknown encoded Traditional Chinese character strings using Python

情歌与酒 2021-01-25 14:25

Hi, I have a website in Traditional Chinese, and when I check the site statistics it tells me that the search term for the website is å%8f°å%8d%97 親å­%90é¤%90廳

2 answers
  • 2021-01-25 14:42

    You can use chardet. Install the library with:

    pip install chardet
    # or for python3
    pip3 install chardet
    

    The library includes a CLI utility, chardetect (or chardetect3 for the Python 3 install), that takes the path to a file.

    Once you know the encoding, you can use it in Python, for example like this:

    import codecs
    f = codecs.open('myfile.txt', 'r', 'GB2312')
    

    or from shell:

    iconv -f GB2312 -t UTF-8 myfile.txt -o decoded.txt
    

    If you need more performance then there is also cchardet — a faster C-optimized version of chardet.
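
    The detect-then-decode workflow above can also be done from within Python. The following is a minimal sketch, not a definitive implementation: decode_with_guess and its fallback parameter are hypothetical names chosen for illustration, and the only chardet call used is chardet.detect. Detection on short inputs is often unreliable, so treat the result as a guess.

```python
import codecs

try:
    import chardet  # third-party: pip install chardet
except ImportError:  # keep the sketch importable without chardet
    chardet = None


def decode_with_guess(raw, fallback="utf-8"):
    """Decode bytes with chardet's best guess, or `fallback` if there is none."""
    guess = chardet.detect(raw) if chardet else {"encoding": None}
    encoding = guess.get("encoding") or fallback
    try:
        codecs.lookup(encoding)  # make sure Python knows this codec name
    except LookupError:
        encoding = fallback
    # errors="replace" keeps a wrong guess from raising; it yields U+FFFD instead
    return raw.decode(encoding, errors="replace"), encoding


# Sample bytes in Big5, a common legacy encoding for Traditional Chinese
raw = "台南 親子餐廳".encode("big5")
text, encoding = decode_with_guess(raw)
print(encoding, text)
```

    Note that this only helps when the bytes really are in a single legacy encoding. It does not repair text that has been decoded with the wrong codec and then re-encoded.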

  • 2021-01-25 14:55

    It is called a mutt encoding; the underlying bytes have been mangled beyond their original meaning and they are no longer a real encoding.

    It was once URL-quoted UTF-8, but it has since been interpreted as Latin-1 without first unquoting the URL escapes. I was able to un-mangle it by reversing those steps:

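    >>> from urllib.parse import unquote_to_bytes
    >>> # Every character must fit in Latin-1, so 親 and 廳 appear here in mangled form
    >>> # (è¦ª, å»³); \xad is the soft hyphen, which is invisible in most renderings.
    >>> bytesquoted = 'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')
    >>> unquoted = unquote_to_bytes(bytesquoted)
    >>> print(unquoted.decode('utf8'))
    台南 親子餐廳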
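
    The same repair can be wrapped in a reusable Python 3 function. This is a sketch under stated assumptions rather than a general mojibake fixer: unmangle is a hypothetical name, and it assumes that any character outside Latin-1 was never mangled and can be kept as is, which lets it cope with strings where only some characters are damaged.

```python
from urllib.parse import unquote_to_bytes


def unmangle(text):
    """Undo 'URL-quoted UTF-8 read as Latin-1' damage in `text`."""
    raw = bytearray()
    for ch in text:
        if ord(ch) < 256:
            # Latin-1 maps code points 0-255 straight back to the original byte
            raw.append(ord(ch))
        else:
            # Assumption: characters outside Latin-1 were never mangled
            raw.extend(ch.encode("utf-8"))
    # Resolve the leftover %XX escapes at the byte level, then decode once
    return unquote_to_bytes(bytes(raw)).decode("utf-8", errors="replace")


# \xad is the soft hyphen, written as an escape because it is easy to lose
mangled = "å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³"
print(unmangle(mangled))
```

    Working on bytes matters here: unquoting the string as text would decode each escape on its own and lose the multi-byte UTF-8 sequences that span literal characters and %XX escapes.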
    