How to correct a misencoded string?

I used mutagen to read MP3 metadata. The ID3 tag comes back as a unicode string, but the underlying bytes are actually GBK-encoded. How can I correct this in Python?

from mutagen.easyid3 import EasyID3

audio = EasyID3(name)  # name is the path to the MP3 file
title = audio["title"][0]
print title
print repr(title)

produces

µ±Äã¹Âµ¥Äã»áÏëÆðË
u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'

But the bytes are actually GBK-encoded Chinese, so the title should read:

当你孤单你会想起谁

It looks like the string has been decoded to unicode using the wrong encoding (latin-1).

You need to encode it to a byte string and then decode it back to unicode using the correct encoding.

title = u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'
print title.encode('latin-1').decode('gbk')
当你孤单你会想起谁
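
To see why the round trip works, the corruption can be reproduced in the opposite direction. The following is only a sketch, written for Python 3 and assuming the tag bytes were GBK and were decoded as latin-1 (the variable names are illustrative, not part of mutagen). Because latin-1 maps every byte value 0-255 to the code point with the same number, decoding with it never fails and loses nothing, so encoding with latin-1 again gives back the original bytes exactly.

```python
# Start from the intended title and reproduce the mistake step by step.
original = '当你孤单你会想起谁'

# The bytes as they would be stored in a GBK-encoded tag.
raw_bytes = original.encode('gbk')

# The wrong step: interpreting those bytes as latin-1.
garbled = raw_bytes.decode('latin-1')
print(repr(garbled))

# The repair: undo the wrong decode, then apply the right one.
repaired = garbled.encode('latin-1').decode('gbk')
print(repaired)
```

The garbled value printed here has the same code points as the repr shown in the question, which is what suggests latin-1 as the wrongly applied encoding.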

Looks like it's auto-decoding using latin1. To fix:

>>> title = u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'
>>> print title.encode('latin1').decode('GBK')
当你孤单你会想起谁

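Tested in Python 2.x. The encode/decode round trip works the same way in Python 3, but there print is a function, so the call becomes print(title.encode('latin1').decode('GBK')).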
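
If some titles in a collection are already correct, applying the fix blindly would raise an exception on them. Here is a minimal Python 3 sketch of a guarded version; fix_mojibake is a hypothetical helper name, and the default encodings are assumptions taken from this question rather than anything mutagen provides.

```python
def fix_mojibake(text, wrong='latin-1', right='gbk'):
    """Re-decode text that was decoded with the wrong codec.

    Returns the input unchanged when the round trip is not possible,
    for example when the text already contains real Chinese characters.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin-1 (so it was not
        # produced by a latin-1 decode), or the bytes are not valid GBK.
        return text


title = '\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'
print(fix_mojibake(title))
```

Plain ASCII text passes through unchanged because ASCII bytes mean the same thing in latin-1 and GBK. The guard cannot tell genuine accented latin-1 text from misdecoded GBK when both are valid, so it is best applied only to tags known to come from GBK-encoded files.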

Source: https://stackoverflow.com/questions/2190904/how-to-correct-the-misencoded-string
