Python 3 chokes on CP-1252/ANSI reading

野趣味 2020-12-04 02:17

I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:

  File "c:\Python31\lib\encodings\cp1252.py", line 23, i


        
3 Answers
  •  庸人自扰
    2020-12-04 02:42

    Position 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:

    >>> b'\x81'.decode('cp1252')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
    

    or with an actual file:

    >>> open('test.txt', 'wb').write(b'\x81\n')
    2
    >>> open('test.txt').read()
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
    

    Now, to treat this file as Latin-1, pass the encoding argument, as codeape suggested:

    >>> open('test.txt', encoding='latin-1').read()
    '\x81\n'
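
The read above cannot fail because Latin-1 assigns a character to every one of the 256 byte values. The following is a minimal sketch of that property, not part of the original answer; the scratch file and the variable names are assumptions made for illustration.

```python
import os
import tempfile

# Write the single problem byte (plus a newline) to a scratch file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\x81\n")

# An explicit encoding overrides the platform default (cp1252 on many
# Windows setups), so the read no longer depends on the machine.
with open(path, encoding="latin-1") as f:
    text = f.read()
os.remove(path)

# Latin-1 maps byte N to code point N, so decoding and re-encoding
# every possible byte value is lossless.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1")

print(repr(text))
print(round_trip == all_bytes)
```

Decoding as Latin-1 therefore always succeeds, but it only gives the right characters if the file really is Latin-1.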
    

    Beware that there are differences between the Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
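
To make the difference concrete, here is a small sketch that compares how the two codecs treat the bytes 0x80 through 0x9F, the range where they disagree. The helper `decode_or_none` is a hypothetical name introduced for this illustration, and the results reflect the codec tables shipped with CPython.

```python
# Bytes 0x80-0x9F are where cp1252 and Latin-1 disagree.
high = range(0x80, 0xA0)

def decode_or_none(byte_value, encoding):
    """Return the decoded character, or None if the codec rejects the byte."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None

# Bytes that the cp1252 codec refuses to decode.
undefined_in_cp1252 = [b for b in high if decode_or_none(b, "cp1252") is None]

# Byte 0x93 is a left "smart quote" in cp1252 but a C1 control
# character in Latin-1.
left_quote_cp1252 = decode_or_none(0x93, "cp1252")
left_quote_latin1 = decode_or_none(0x93, "latin-1")

print([hex(b) for b in undefined_in_cp1252])
print(repr(left_quote_cp1252), repr(left_quote_latin1))
```

On CPython this should report five undefined bytes (0x81, 0x8d, 0x8f, 0x90, 0x9d), and it shows 0x93 decoding to U+201C under cp1252 but to the control character U+0093 under Latin-1. If a file contains smart quotes, reading it as Latin-1 will not raise an error, but it will silently produce control characters instead.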
