Double-decoding unicode in python

猫巷女王i 2020-12-09 18:28

I am working against an application that seems keen on returning what I believe to be double UTF-8 encoded strings.

I send the string u'XüYß' encode

4 answers
  • 2020-12-09 18:44

    What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points inside 0-255 you have this in the latin-1 encoding:

    def double_decode(bstr):
        return bstr.decode("utf-8").encode("latin-1").decode("utf-8")
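
As a minimal Python 3 sketch of the same idea: the check below confirms that latin-1 maps each code point 0-255 to the byte with the same value, and then simulates the server's behaviour, on the assumption that it misreads UTF-8 bytes as latin-1 before encoding them again. The names `identity_holds`, `wire` and `result` are illustrative only.

```python
def double_decode(data: bytes) -> str:
    """Undo one extra UTF-8 layer, assuming the extra layer went through latin-1."""
    once = data.decode("utf-8")    # first layer: yields mojibake text
    raw = once.encode("latin-1")   # code point N -> byte N, recovering the original bytes
    return raw.decode("utf-8")     # second layer: the intended text


# latin-1 maps every code point 0..255 to the byte with the same value.
identity_holds = all(chr(n).encode("latin-1") == bytes([n]) for n in range(256))

# Simulate the assumed server behaviour: encode, misread as latin-1, encode again.
wire = "XüYß".encode("utf-8").decode("latin-1").encode("utf-8")
result = double_decode(wire)
```

If the intermediate text contains a code point above 255, the `encode("latin-1")` step raises `UnicodeEncodeError`, which is a sign that the extra layer did not go through latin-1.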
    
  • 2020-12-09 18:44

    Don't use this! Use @hop's solution.

    My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)

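    def double_decode_unicode(s, encoding='utf-8'):
        # Python 2 only: s is a byte str; chr(ord(c)) turns each decoded
        # code point (0-255) back into the single byte with the same value.
        return ''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)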
    

    Then,

    >>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
    u'X\xfcY\xdf'
    >>> print _
    XüYß
    
  • ret.decode() implicitly tries to encode ret with the system encoding first, which in your case is ascii.

    If you explicitly encode the unicode string, you should be fine. There is a builtin encoding that does what you need:

    >>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
    'XüYß'
    

    Really, .encode('latin1') (or cp1252) would be OK, because that's what the server is almost certainly using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:

    >>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
    '\\u20ac€'
    
    >>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
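
To illustrate why cp1252 is mentioned as an alternative, here is a small Python 3 sketch. It assumes a server that misreads UTF-8 bytes as cp1252 rather than latin-1; the variable names are made up for the example.

```python
original = "€"

# Simulate the assumed server behaviour: UTF-8 bytes misread as cp1252.
garbled = original.encode("utf-8").decode("cp1252")

# latin-1 cannot reverse this: the middle character is U+201A, outside 0..255.
try:
    garbled.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Encoding with the codec the server actually used restores the original bytes.
repaired = garbled.encode("cp1252").decode("utf-8")
```

Note that Python's cp1252 codec leaves a few byte values undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so neither codec is guaranteed to round-trip every input; which one fits depends on what the server really does.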
    

    In case you run into this sort of mixed data, you can apply the codec a second time to normalize everything:

    >>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
    '\\u20ac€'
    
    >>> '\\u20ac€'.encode('raw_unicode_escape')
    b'\\u20ac\\u20ac'
    >>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
    '€€'
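
The two passes above could be wrapped in a single helper. This is only a sketch in Python 3; `normalize_mixed` is a hypothetical name, not something from the original answer or the standard library.

```python
def normalize_mixed(text: str) -> str:
    """Repair text that mixes correctly decoded characters with UTF-8 bytes read as latin-1."""
    # Pass 1: characters above U+00FF become literal \uXXXX escapes, the rest
    # become their byte values, so the UTF-8 decode can repair the mojibake.
    repaired = text.encode("raw_unicode_escape").decode("utf-8")
    # Pass 2: turn the literal \uXXXX escapes back into real characters.
    return repaired.encode("raw_unicode_escape").decode("raw_unicode_escape")


mixed = normalize_mixed("€\xe2\x82\xac")
plain = normalize_mixed("X\xc3\xbcY\xc3\x9f")
```

One caveat with this approach: any literal backslash-u sequence that was genuinely part of the data is unescaped as well, and input that is already correct but contains characters in the range U+0080 to U+00FF will usually fail the UTF-8 decode in the first pass.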
    
  • 2020-12-09 19:06

    Here's a little script that might help you, doubledecode.py -- https://gist.github.com/1282752
