UnicodeDecodeError, invalid continuation byte

忘掉有多难 2020-11-22 08:25

Why does the item below fail? And why does it succeed with the "latin-1" codec?

o = "a test of \xe9 char" #I want this to remain a string as thi
        
10 Answers
  •  广开言路
    2020-11-22 09:06

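    In binary, 0xE9 is 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that a byte of this form (1110 xxxx) starts a three-byte sequence and must be followed by two continuation bytes of the form 10xx xxxx. In your string the next byte is a space (0x20, or 0010 0000), which is not a continuation byte, hence the error. A valid sequence looks like this, for example: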

    >>> b'\xe9\x80\x80'.decode('utf-8')
    u'\u9000'
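
    To see the failure itself, here is a minimal Python 3 sketch that tries to decode the bytes from the question and reports the offending byte and the byte after it. The helper name `explain_failure` is made up for illustration, and the sketch assumes the input is a `bytes` object rather than a Python 2 `str`.

```python
def explain_failure(data, encoding="utf-8"):
    """Return (reason, position, bad byte bits, next byte bits), or None if data decodes."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not handle.
        bad_bits = format(data[exc.start], "08b")
        # Slice instead of indexing so a bad byte at the very end does not raise.
        nxt = data[exc.start + 1:exc.start + 2]
        nxt_bits = format(nxt[0], "08b") if nxt else None
        return exc.reason, exc.start, bad_bits, nxt_bits
    return None


# Hypothetical input mirroring the question: 0xE9 followed by a space (0x20).
print(explain_failure(b"a test of \xe9 char"))
# ('invalid continuation byte', 10, '11101001', '00100000')
```

    The second bit pattern starts with 00 rather than 10, so the decoder rejects it as a continuation byte.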
    

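    But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in Latin-1. Latin-1 maps every possible byte value to a character, which is why decoding with it never raises. You can see how the UTF-8 and Latin-1 encodings of the same character differ: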

    >>> u'\xe9'.encode('utf-8')
    b'\xc3\xa9'
    >>> u'\xe9'.encode('latin-1')
    b'\xe9'
    

    (Note: I'm using a mix of Python 2 and 3 representations here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
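
    If the bytes really are Latin-1, as assumed above, one option is to decode them as Latin-1 and re-encode as UTF-8 wherever UTF-8 is needed. A small Python 3 sketch follows; `decode_with_fallback` is a hypothetical helper, not a library function, and the fallback will silently produce wrong characters if the data is actually in some other encoding.

```python
def decode_with_fallback(data, primary="utf-8", fallback="latin-1"):
    """Try the primary codec first; fall back to a single-byte codec on failure."""
    try:
        return data.decode(primary)
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every byte value, so this cannot raise.
        return data.decode(fallback)


raw = b"a test of \xe9 char"        # bytes as received (assumed to be Latin-1)
text = decode_with_fallback(raw)    # now a proper text string containing U+00E9
as_utf8 = text.encode("utf-8")      # U+00E9 becomes the two bytes C3 A9

print(repr(text))
print(as_utf8)
```

    Trying UTF-8 first is a deliberate choice: valid UTF-8 would also decode as Latin-1 without error, but it would come out garbled.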
