How to convert a string from CP-1251 to UTF-8?

旧城冷巷雨未停 posted on 2019-11-30 03:04:05
Johannes Charra

If you know for sure that you have cp1251 in your input, you can do

d.decode('cp1251').encode('utf8')
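For readers on Python 3, where bytes and text are separate types, the same conversion might look like the sketch below. It assumes the input arrives as a bytes object; the name raw is only illustrative, and its value is the byte sequence from the question.

```python
# Python 3 sketch: `raw` stands in for CP-1251 bytes read from a file or socket.
raw = b"\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3"

text = raw.decode("cp1251")        # bytes -> str (decoded text)
utf8_bytes = text.encode("utf-8")  # str -> bytes (UTF-8 encoded)

print(text)
```

Each Cyrillic letter occupies one byte in CP-1251 and two bytes in UTF-8, so utf8_bytes is longer than raw even though both represent the same text.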

Your string d is a Unicode string, not a UTF-8-encoded string! So you can't decode() it, you must encode() it to UTF-8 or whatever encoding you need.

>>> d = u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> d
u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> print d
Áåëàÿ ÿáëûíÿ ãðîìó
>>> d.encode("utf-8")
'\xc3\x81\xc3\xa5\xc3\xab\xc3\xa0\xc3\xbf \xc3\xbf\xc3\xa1\xc3\xab\xc3\xbb\xc3\xad\xc3\xbf \xc3\xa3\xc3\xb0\xc3\xae\xc3\xac\xc3\xb3'

(which is something you'd do at the very end of all processing when you need to save it as a UTF-8 encoded file, for example).

If your input is in a different encoding, it's the other way around:

>>> d = "Schoßhündchen"                 # native encoding: cp850
>>> d = "Schoßhündchen".decode("cp850") # decode from Windows codepage
>>> d                                   # into a Unicode string (now work with this!)
u'Scho\xdfh\xfcndchen'
>>> print d                             # it displays correctly if your shell knows the glyphs
Schoßhündchen
>>> d.encode("utf-8")                   # before output, convert to UTF-8
'Scho\xc3\x9fh\xc3\xbcndchen'

If d is a correct Unicode string, then d.encode('utf-8') yields a UTF-8-encoded bytestring. Don't test it by printing, though: the output may simply display incorrectly because your console uses a different codepage.

I provided some relevant info on encoding/decoding text in this response: https://stackoverflow.com/a/34662963/2957811

To add to that here, it's important to think of text as being in one of two possible states: 'encoded' and 'decoded'.

'decoded' means it is in an internal representation by your interpreter/libraries that can be used for character manipulation (e.g. searches, case conversion, substring slicing, character counts, ...) or display (looking up a code point in a font and drawing the glyph), but cannot be passed in or out of the running process.

'encoded' means it is a byte stream that can be passed around as can any other data, but is not useful for manipulation or display.

If you've worked with serialized objects before, consider 'decoded' to be the useful object in memory and 'encoded' to be the serialized version.

'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3' is your encoded (or serialized) version, presumably encoded with cp1251. This encoding needs to be right because that's the 'language' used to serialize the characters and is needed to recreate the characters in memory.

You need to decode this from its current encoding (cp1251) into Python Unicode characters, then re-encode it as a UTF-8 byte stream. The answerer who suggested d.decode('cp1251').encode('utf8') had this right; I am just hoping to help explain why that should work.
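The two states can be made concrete with a small Python 3 sketch. The variable names are illustrative only; the point is that the same decoded text serializes to different bytes under different encodings, and that decoding with the wrong codec produces the garbled text seen in the question.

```python
text = "Белая"  # decoded: a str of 5 characters

cp1251_bytes = text.encode("cp1251")  # encoded: 1 byte per Cyrillic letter
utf8_bytes = text.encode("utf-8")     # encoded: 2 bytes per Cyrillic letter

# Decoding with the wrong codec does not fail here, it just yields mojibake.
garbled = cp1251_bytes.decode("latin-1")

# Decoding with the codec that was used to encode recreates the characters.
restored = cp1251_bytes.decode("cp1251")

print(len(text), len(cp1251_bytes), len(utf8_bytes))
print(repr(garbled), restored)
```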

I lost half a day finding the correct answer. If you get a Unicode string from an external windows-1251-encoded source (a web site, in my case), you will see something like this in the Linux console:

u'\u043a\u043e\u043c\u043d\u0430\u0442\u043d\u0430\u044f \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430.....'

This is not a correct Unicode representation of your data. So Tim Pietzcker is right: you should encode() it first, then decode() it, and then encode() it again to the correct encoding.

In my case this strange string was stored in the variable text, and the line:

print text.encode("cp1251").decode('cp1251').encode('utf8')

gave me:

"Своя 2-х комнатная квартира с отличным ремонтом...."

Yes, it makes me crazy too. But it works!

P.S. Do the same when saving to a file:

some_file.write(text.encode("cp1251").decode('cp1251').encode('utf8'))
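A note for Python 3 users, as a hedged sketch rather than a drop-in replacement: encoding a string to cp1251 and immediately decoding it from cp1251 gives back the same string, so the step that changes the output is the final encode to UTF-8. In Python 3 that step can be delegated to open() through its encoding argument. The file name out.txt and the temporary directory below are assumptions made only for this example.

```python
import os
import tempfile

# The escaped string from the answer above (without the trailing dots).
text = "\u043a\u043e\u043c\u043d\u0430\u0442\u043d\u0430\u044f \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430"

# The cp1251 round trip leaves the string unchanged.
round_tripped = text.encode("cp1251").decode("cp1251")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.txt")  # hypothetical output file
    # open() encodes the text to UTF-8 while writing.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
    # Read the raw bytes back to inspect what was stored on disk.
    with open(path, "rb") as in_file:
        raw = in_file.read()

print(round_tripped == text, raw == text.encode("utf-8"))
```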

I would rather add a comment to Александр Степаненко's answer, but my reputation doesn't yet allow it. I had a similar problem converting MP3 tags from CP-1251 to UTF-8, and the encode/decode/encode solution worked for me, except that I had to replace the first encoding with "latin-1". That codec maps each code point below 256 to the byte with the same value, so it essentially converts the Unicode string into a byte sequence without any real encoding:

print text.encode("latin-1").decode('cp1251').encode('utf8')

When saving the tag back, for example with mutagen, the result does not need to be encoded:

audio["title"] = title.encode("latin-1").decode('cp1251')
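The latin-1 trick can be wrapped in a small helper. This is a Python 3 sketch under the assumption that the text really consists of CP-1251 bytes that were wrongly decoded as Latin-1; the function name fix_mojibake and its actual_encoding parameter are hypothetical, not part of any library.

```python
def fix_mojibake(text, actual_encoding="cp1251"):
    """Repair text whose bytes were wrongly decoded as Latin-1."""
    # latin-1 maps U+0000..U+00FF straight back to bytes 0x00..0xFF,
    # which recovers the original byte sequence.
    raw = text.encode("latin-1")
    # Decode those bytes with the encoding they were actually written in.
    return raw.decode(actual_encoding)


# The garbled Unicode string from the question.
garbled = "\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3"
fixed = fix_mojibake(garbled)
print(fixed)
```

If the text contains characters above U+00FF, the encode("latin-1") step raises UnicodeEncodeError. That is a useful signal that the string was probably not mis-decoded in this particular way.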