Question
I have a network resource that returns data which, according to the spec, should be an ASCII-encoded string. On rare occasions, though, I get junk data.
One resource for example returns b'\xd3PS-90AC'
whereas another resource, for the same key returns b'PS-90AC'
The first value contains a non-ASCII byte. This clearly violates the spec, but that is unfortunately out of my control. None of us is 100% certain whether this really is junk or data that should be kept.
The application calling the remote resources saves the data in a local database for daily use. I could simply do data.decode('ascii', 'replace') or data.decode('ascii', 'ignore'), but then I would lose data that could turn out to be useful later on.
My immediate reflex was to use 'xmlcharrefreplace' or 'backslashreplace' as the error handler, simply because it would result in a displayable string. But then I get the following error:
TypeError: don't know how to handle UnicodeDecodeError in error callback
The only error handler that worked was 'surrogateescape', but this seems to be intended for filenames. On the other hand, for my purposes it would work.
Why are 'xmlcharrefreplace' and 'backslashreplace' not working? I don't understand the error.
For example, an expected execution would be:
>>> data = b'\xd3PS-90AC'
>>> new_data = data.decode('ascii', 'xmlcharrefreplace')
>>> print(repr(new_data))
'&#d3;PS-90AC'
This is a contrived example. My aim is to not lose any data. If I used the 'ignore' or 'replace' error handler, the byte in question would essentially disappear, and information would be lost.
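The TypeError described above can be reproduced in isolation. The sketch below assumes a Python 3 interpreter; the names encoded and decode_error are made up for illustration. As far as the codecs documentation describes it, 'xmlcharrefreplace' is an encoding-only handler, so it accepts the UnicodeEncodeError raised while encoding but rejects the UnicodeDecodeError it is handed while decoding.

```python
data = b'\xd3PS-90AC'

# Encoding direction: 'xmlcharrefreplace' is supported, so the
# non-ASCII character U+00D3 becomes a decimal character reference.
encoded = '\xd3PS-90AC'.encode('ascii', 'xmlcharrefreplace')
print(encoded)  # b'&#211;PS-90AC'

# Decoding direction: the same handler receives a UnicodeDecodeError,
# which it does not know how to process, so it raises TypeError.
try:
    data.decode('ascii', 'xmlcharrefreplace')
    decode_error = None
except TypeError as exc:
    decode_error = exc

print(type(decode_error).__name__)  # TypeError
```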
Answer 1:
>>> data = b'\xd3PS-90AC'
>>> data.decode('ascii', 'surrogateescape')
'\udcd3PS-90AC'
It does not use HTML entities, but it is a decent starting point. If that is not sufficient, you will have to register your own error handler using codecs.register_error, I assume.
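To illustrate why surrogateescape keeps the junk byte recoverable, here is a minimal sketch, assuming Python 3. The variable names text, restored and strict_ok are illustrative only. It also shows one caveat worth checking before storing such strings: a strict UTF-8 encode refuses the lone surrogate, so whether a given database driver accepts the string is an open question.

```python
data = b'\xd3PS-90AC'

# The undecodable byte 0xD3 is mapped to the lone surrogate U+DCD3.
text = data.decode('ascii', 'surrogateescape')
print(repr(text))  # '\udcd3PS-90AC'

# Encoding with the same handler turns the surrogate back into the
# original byte, so the round trip is lossless.
restored = text.encode('ascii', 'surrogateescape')
print(restored == data)  # True

# Without the handler, the lone surrogate cannot be encoded.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(strict_ok)  # False
```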
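For Python 3:
import codecs

def handler(err):
    # err is the UnicodeDecodeError; err.object holds the bytes being decoded.
    start = err.start
    end = err.end
    # Indexing bytes yields ints in Python 3, so each bad byte can be
    # formatted directly as a decimal character reference.
    return ("".join(["&#{0};".format(err.object[i]) for i in range(start, end)]), end)

codecs.register_error('xmlcharreffallback', handler)
data = b'\xd3PS-90AC'
data.decode('ascii', 'xmlcharreffallback')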
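For Python 2:
import codecs

def handler(err):
    # err is the UnicodeDecodeError; err.object holds the str being decoded.
    start = err.start
    end = err.end
    # Indexing a str yields one-character strings in Python 2, hence ord().
    return (u"".join([u"&#{0};".format(ord(err.object[i])) for i in range(start, end)]), end)

codecs.register_error('xmlcharreffallback', handler)
data = b'\xd3PS-90AC'
data.decode('ascii', 'xmlcharreffallback')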
Both produce:
'&#211;PS-90AC'
Answer 2:
For completeness, I wanted to add that as of Python 3.5, backslashreplace works for decoding, so you no longer have to add a custom error handler.
Source: https://stackoverflow.com/questions/25442954/how-should-i-decode-bytes-using-ascii-without-losing-any-junk-bytes-if-xmlch