Set the implicit default encoding/decoding error handling in Python

Submitted by 南楼画角 on 2019-12-11 06:27:52

Question


I am working with external data that's encoded in latin1, so I've added a sitecustomize.py and in it called

sys.setdefaultencoding('latin_1') 

Sure enough, working with latin1 strings now works fine.

But if I encounter something that can't be encoded in latin1:

s=str(u'abc\u2013')

I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)

What I would like is for characters that can't be converted to simply be replaced instead of raising an error, i.e. in the example above I would get s == 'abc?'. I want this without explicitly calling decode() or encode() each time, i.e. not s.decode(..., 'replace') on each call.

I tried different things with codecs.register_error, but to no avail.

Please help?


Answer 1:


There is a reason scripts can't call sys.setdefaultencoding, so don't do that. Some libraries (including standard libraries shipped with Python) expect the default to be 'ascii'.

Instead, explicitly decode strings to Unicode when read into your program (via file, stdin, socket, etc.) and explicitly encode strings when writing them out.

Explicit decoding takes an errors parameter (for example 'strict', 'ignore', or 'replace') that specifies how undecodable bytes are handled.
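A minimal sketch of that advice in Python 3 follows. The sample bytes and variable names are illustrative assumptions, not taken from the question. It shows how the errors argument of bytes.decode and str.encode selects what happens to input the codec can't handle:

```python
# Illustrative data: latin-1 bytes for "café" (assumed sample, not from the question).
raw = b"caf\xe9"

# Decode once at the input boundary; every byte value is valid in latin-1.
text = raw.decode("latin-1")

# Encoding a character latin-1 can't represent: choose the behaviour explicitly.
dashed = "abc\u2013"
replaced = dashed.encode("latin-1", errors="replace")  # unencodable char becomes b"?"
ignored = dashed.encode("latin-1", errors="ignore")    # unencodable char is dropped

# Decoding invalid UTF-8: "replace" substitutes U+FFFD for the bad byte.
lenient = b"abc\xff".decode("utf-8", errors="replace")

print(ascii(text), replaced, ignored, ascii(lenient))
```

With errors="replace" the question's example yields 'abc?' on encoding, which is the behaviour the asker wanted, only stated explicitly at the boundary instead of implicitly.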




Answer 2:


You can define your own custom error handler, register it with codecs.register_error, and pass its name wherever an errors argument is accepted. See this example:

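import codecs
from logging import getLogger

log = getLogger()

def custom_character_handler(exception):
    # Log what failed, then substitute '?' for the offending span.
    log.error("%s for %s on %s from position %s to %s. Using '?' in-place of it!",
              exception.reason,
              exception.object[exception.start:exception.end],
              exception.encoding,
              exception.start,
              exception.end)
    # Return the replacement text and the position at which to resume.
    return ("?", exception.end)

# Make the handler available under this name as an 'errors' value.
codecs.register_error("custom_character_handler", custom_character_handler)

# Decoding: the stray byte \xbb is not valid UTF-8.
print(b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf8', 'custom_character_handler'))
# Encoding: the character \u03c0 cannot be represented in ASCII.
print(codecs.encode(u"abc\u03c0de", "ascii", "custom_character_handler"))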

Running it, you will see:

invalid start byte for b'\xbb' on utf-8 from position 5 to 6. Using '?' in-place of it!
Føö?Bår
ordinal not in range(128) for π on ascii from position 3 to 4. Using '?' in-place of it!
b'abc?de'
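A registered handler name can also be passed as the errors argument when a text stream is created, so the policy is stated once at the I/O boundary rather than on each call. The following is a minimal sketch under assumptions: the handler name question_mark is hypothetical, and in-memory buffers stand in for a real file or socket.

```python
import codecs
import io

def question_mark(exc):
    # Hypothetical handler: substitute "?" and resume after the bad span.
    return ("?", exc.end)

codecs.register_error("question_mark", question_mark)

# Reading: in-memory stand-in for a file containing one invalid UTF-8 byte.
raw = io.BytesIO(b"ok \xbb done\n")
reader = io.TextIOWrapper(raw, encoding="utf-8", errors="question_mark")
line = reader.readline()

# Writing: characters latin-1 cannot represent are replaced on the way out.
sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="latin-1", errors="question_mark")
writer.write("abc\u2013")
writer.flush()
written = sink.getvalue()

print(repr(line), written)
```

The built-in open() accepts the same encoding and errors arguments, so the wrapper is only needed for streams that are already open in binary mode.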

References:

  1. https://docs.python.org/3/library/codecs.html#codecs.register_error
  2. https://docs.python.org/3/library/exceptions.html#UnicodeError
  3. How to ignore invalid lines in a file?
  4. 'str' object has no attribute 'decode'. Python 3 error?
  5. How to replace invalid unicode characters in a string in Python?
  6. UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?


Source: https://stackoverflow.com/questions/3363339/set-the-implicit-default-encoding-decoding-error-handling-in-python
