How should I decode bytes (using ASCII) without losing any “junk” bytes if xmlcharrefreplace and backslashreplace don't work?

Question

I have a network resource that returns data which should (according to the specs) be an ASCII-encoded string. But on rare occasions, I get junk data.

One resource, for example, returns b'\xd3PS-90AC', whereas another resource returns b'PS-90AC' for the same key.

The first value contains a non-ASCII byte. This is clearly a violation of the spec, but that's unfortunately out of my control. None of us is 100% certain whether this really is junk or data that should be kept.

The application calling the remote resources saves the data in a local database for daily use. I could simply do data.decode('ascii', 'replace') or data.decode('ascii', 'ignore'), but then I would lose data that could turn out to be useful later on.

My immediate reflex was to use 'xmlcharrefreplace' or 'backslashreplace' as the error handler, simply because it would result in a displayable string. But then I get the following error: TypeError: don't know how to handle UnicodeDecodeError in error callback

The only error handler that worked was surrogateescape, but this seems to be intended for filenames. On the other hand, for my purposes it would work.

Why are 'xmlcharrefreplace' and 'backslashreplace' not working? I don't understand the error.

For example, an expected execution would be:

>>> data = b'\xd3PS-90AC'
>>> new_data = data.decode('ascii', 'xmlcharrefreplace')
>>> print(repr(new_data))
'&#d3;PS-90AC'

This is a contrived example. My aim is to not lose any data. If I used the ignore or replace error handler, the byte in question would essentially disappear, and information would be lost.

Answer 1:

>>> data = b'\xd3PS-90AC'
>>> data.decode('ascii', 'surrogateescape')
'\udcd3PS-90AC'

This does not use HTML entities, but it is a decent starting point. If it is not sufficient, you will have to register your own error handler using codecs.register_error.

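For Python 3:

import codecs

def handler(err):
    # err is a UnicodeDecodeError: err.object holds the original bytes,
    # and err.start/err.end delimit the undecodable run.
    bad = err.object[err.start:err.end]
    # Iterating over bytes yields ints in Python 3.
    replacement = "".join("&#{0};".format(b) for b in bad)
    # Return the replacement text and the position at which to resume decoding.
    return (replacement, err.end)

codecs.register_error('xmlcharreffallback', handler)
data = b'\xd3PS-90AC'
data.decode('ascii', 'xmlcharreffallback')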

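For Python 2:

import codecs

def handler(err):
    # err.object is a byte string (str) here, so each item is a
    # one-character str and needs ord() to get its numeric value.
    bad = err.object[err.start:err.end]
    replacement = u"".join(u"&#{0};".format(ord(c)) for c in bad)
    # Return the replacement text and the position at which to resume decoding.
    return (replacement, err.end)

codecs.register_error('xmlcharreffallback', handler)
data = b'\xd3PS-90AC'
data.decode('ascii', 'xmlcharreffallback')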

Both produce the following (Python 2 displays it with a u prefix):

'&#211;PS-90AC'
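
Since the stated goal is to not lose any bytes, it may help to see that the surrogateescape handler from the top of this answer is reversible. The following is a minimal Python 3 sketch, not part of the original answer; the variable names text and restored are only illustrative.

```python
# Python 3 only: surrogateescape maps each undecodable byte 0xNN to the
# lone surrogate code point U+DCNN, so the original bytes can be recovered.
data = b'\xd3PS-90AC'

# Decode: the stray byte 0xd3 becomes the lone surrogate U+DCD3.
text = data.decode('ascii', 'surrogateescape')

# Encode with the same handler to get the original bytes back.
restored = text.encode('ascii', 'surrogateescape')

print(repr(text))
print(restored == data)
```

One caveat: a string containing lone surrogates cannot be encoded to UTF-8 with the default strict handler, so whether it can be stored as-is depends on how the database layer encodes text.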

Answer 2:

For completeness: as of Python 3.5, backslashreplace also works for decoding, so you no longer have to register a custom error handler for that case. xmlcharrefreplace remains an encoding-only handler.

Source: https://stackoverflow.com/questions/25442954/how-should-i-decode-bytes-using-ascii-without-losing-any-junk-bytes-if-xmlch

Tags

python

python-3.x

encoding

byte