How to decode a string representation of UTF-8 bytes with Python?

时光取名叫无心 2020-12-20 00:18

I have a unicode string like this:

\\xE5\\xB1\\xB1\\xE4\\xB8\\x9C \\xE6\\x97\\xA5\\xE7\\x85\\xA7

And I know it is the string repr

1 answer
  • 2020-12-20 00:55

    If you printed the repr() output of your unicode string, then you appear to have Mojibake: byte data that was decoded using the wrong encoding.

    First encode back to bytes, then decode using the right codec. This may be as simple as encoding as Latin-1:

    unicode_string.encode('latin1').decode('utf8')
    

    Whether this works depends on how the incorrect decoding was applied, however. If a Windows codepage (like CP1252) was used, you can end up with Unicode data that cannot be encoded back to CP1252, because UTF-8 bytes outside the CP1252 range were force-decoded anyway.
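
To illustrate the CP1252 caveat, here is a small Python 3 sketch. The sample text is hypothetical (it is not from the question): it is what the UTF-8 bytes C3 81 look like after a forced decode, assuming the undefined byte 0x81 was mapped straight to U+0081. The helper name `can_encode` is made up for this example.

```python
# Hypothetical Mojibake sample: the UTF-8 bytes C3 81 force-decoded
# one byte per character. Byte 0x81 has no character assigned in
# Python's cp1252 codec, so it ended up as the control character U+0081.
mojibake = '\xc3\x81'


def can_encode(text, codec):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 cannot represent U+0081, so the round trip fails there.
print(can_encode(mojibake, 'cp1252'))
# Latin-1 maps every code point below U+0100 to a single byte.
print(can_encode(mojibake, 'latin1'))
# The Latin-1 round trip recovers the original character.
print(mojibake.encode('latin1').decode('utf8'))
```

This is the kind of text where a plain encode/decode round trip through CP1252 breaks down, even though the Latin-1 round trip still succeeds.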

    The best way to repair such mistakes is using the ftfy library, which knows how to deal with forced-decoded Mojibake texts for a variety of codecs.

    For your small sample, Latin-1 appears to work just fine:

    >>> unicode_string = u'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
    >>> print unicode_string.encode('latin1').decode('utf8')
    山东 日照
    >>> import ftfy
    >>> print ftfy.fix_text(unicode_string)
    山东 日照
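
The session above is Python 2. On Python 3 the same Latin-1 round trip can be sketched roughly as follows; `fix_latin1_mojibake` is a hypothetical helper name, and the sketch assumes every character in the input is below U+0100.

```python
def fix_latin1_mojibake(text):
    """Re-encode wrongly decoded text as Latin-1, then decode it as UTF-8.

    Assumes each character in text stands for exactly one original byte.
    """
    return text.encode('latin1').decode('utf8')


# Same sample as in the question, written as a Python 3 str literal.
broken = '\xe5\xb1\xb1\xe4\xb8\x9c \xe6\x97\xa5\xe7\x85\xa7'
print(fix_latin1_mojibake(broken))
```

If the input contains characters above U+00FF, the `encode('latin1')` step raises `UnicodeEncodeError`, which is a sign that a different codec was involved in the bad decode.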
    

    If you have the literal characters \ and x followed by two hexadecimal digits, you have another layer of encoding, where each byte was replaced by 4 characters. You'd have to 'decode' those to actual bytes first, by asking Python to interpret the escapes with the string_escape codec:

    >>> unicode_string = ur'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
    >>> unicode_string
    u'\\xE5\\xB1\\xB1\\xE4\\xB8\\x9C \\xE6\\x97\\xA5\\xE7\\x85\\xA7'
    >>> print unicode_string.decode('string_escape').decode('utf8')
    山东 日照
    

    'string_escape' is a Python 2 only codec that produces a bytestring, so it is safe to decode that as UTF-8 afterwards.
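
Since string_escape does not exist on Python 3, one possible substitute there is the `unicode_escape` codec followed by a Latin-1 round trip. This is a sketch rather than a drop-in replacement: `decode_escaped_utf8` is a hypothetical helper name, and it assumes the input is pure ASCII and contains only `\xNN`-style escapes.

```python
def decode_escaped_utf8(text):
    """Turn literal \\xNN escape sequences into the UTF-8 text they spell out.

    Assumes text is ASCII-only and uses only \\xNN escapes.
    """
    # Interpret the escapes: each \xNN becomes the character U+00NN.
    chars = text.encode('ascii').decode('unicode_escape')
    # Map those characters back to single bytes, then decode as UTF-8.
    return chars.encode('latin1').decode('utf8')


# Raw literal: the backslashes are real characters in the string.
escaped = r'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
print(decode_escaped_utf8(escaped))
```

Note that `unicode_escape` also interprets other escapes such as `\n` and `\uXXXX`, so input that mixes those in may not survive the Latin-1 step.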
