I've got a problem with strings that I get from one of my clients over XML-RPC. He sends me UTF-8 strings that are encoded twice :( so when I get them in Python I have a unicode object that has to be decoded one more time, but obviously Python doesn't allow that. I've notified my client, but I need a quick workaround before he fixes it.
Raw string from tcp dump:
<string>Rafa\xc3\x85\xc2\x82</string>
this is converted into:
u'Rafa\xc5\x82'
The best I've come up with is:
eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8")
This yields the correct string:
u'Rafa\u0142'
It works, but it's ugly as hell and can't go into production code. If anyone knows a more suitable way to fix this, please write. Thanks, Chris
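For context, the double encoding can be reproduced like this in Python 3; the latin-1 step is an assumption about what the client's stack does, but it matches the bytes in the tcpdump:

```python
correct = 'Rafa\u0142'                      # the string the client meant to send
once = correct.encode('utf-8')              # b'Rafa\xc5\x82' -- correct UTF-8
# Presumed bug: the UTF-8 bytes are reinterpreted as characters and
# encoded to UTF-8 a second time.
twice = once.decode('latin-1').encode('utf-8')
print(twice)                                # b'Rafa\xc3\x85\xc2\x82' -- matches the dump
```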
>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
>>>
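That session is Python 2; a Python 3 sketch of the same trick (assuming the mojibake string contains only code points below U+0100) would be:

```python
weird = 'Rafa\xc5\x82'  # str whose "characters" are really UTF-8 byte values
# raw_unicode_escape maps code points below U+0100 straight to bytes,
# recovering the original UTF-8 byte string, which we then decode properly.
fixed = weird.encode('raw_unicode_escape').decode('utf-8')
print(fixed)  # Rafał
```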
Yow, that was fun!
>>> original = "Rafa\xc3\x85\xc2\x82"
>>> first_decode = original.decode('utf-8')
>>> as_chars = ''.join([chr(ord(x)) for x in first_decode])
>>> result = as_chars.decode('utf-8')
>>> result
u'Rafa\u0142'
So you do the first decode, getting a Unicode string where each character is actually a UTF-8 byte value. Going via the integer value of each of those characters gets you back to a genuine UTF-8 byte string, which you then decode as normal.
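That session is Python 2; in Python 3, where iterating a `str` yields characters rather than bytes, the same nuts'n'bolts round trip might look like:

```python
original = b'Rafa\xc3\x85\xc2\x82'       # raw bytes off the wire
first_decode = original.decode('utf-8')  # mojibake: 'Rafa\xc5\x82'
# Each "character" is really a UTF-8 byte value; rebuild the byte string.
as_bytes = bytes(ord(ch) for ch in first_decode)
result = as_bytes.decode('utf-8')
print(result)  # Rafał
```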
>>> weird = u'Rafa\xc5\x82'
>>> weird.encode('latin1').decode('utf8')
u'Rafa\u0142'
>>>
latin1 is just an abbreviation for Richie's nuts'n'bolts method.
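In Python 3, where `str` is already Unicode, the same one-liner reads as follows; the helper name `fix_double_utf8` is just for illustration:

```python
def fix_double_utf8(mojibake: str) -> str:
    """Undo one spurious UTF-8 decode: map each code point (all assumed
    below U+0100) back to a byte via latin-1, then decode as UTF-8."""
    return mojibake.encode('latin-1').decode('utf-8')

print(fix_double_utf8('Rafa\xc5\x82'))  # Rafał
```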
It is very curious that the seriously under-described raw_unicode_escape codec gives the same result as latin1 in this case. Do they always give the same result? If so, why have such a codec? If not, it would be preferable to know exactly how the OP's client transformed 'Rafa\xc5\x82'
into u'Rafa\xc5\x82'
and then to reverse that process exactly -- otherwise we might come unstuck if different data crops up before the double encoding is fixed.
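To answer the rhetorical question: the two codecs agree only while every code point in the mojibake is below U+0100. A Python 3 demonstration of where they diverge:

```python
s = 'Rafa\xc5\x82'                      # all code points < U+0100
assert s.encode('latin-1') == s.encode('raw_unicode_escape')

t = 'Rafa\u0142'                        # contains a code point >= U+0100
print(t.encode('raw_unicode_escape'))   # a literal backslash-u escape, not a failure
try:
    t.encode('latin-1')
except UnicodeEncodeError as exc:
    print('latin-1 refuses:', exc)
```

So on data with code points above U+00FF, latin-1 fails loudly while raw_unicode_escape silently emits literal escape sequences, which is exactly the "come unstuck" scenario.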
Source: https://stackoverflow.com/questions/1177316/decoding-double-encoded-utf8-in-python