Question
I'm trying to replace a substring in a Word file using the following command sequence in Python. The code on its own works perfectly fine, even with the exact same Word file, but when I embed it in a larger project structure it throws an error at exactly that spot. I'm clueless as to what causes it, as it seemingly has nothing to do with the code and seems unreproducible to me.
Side note: I know what's triggering the error: it's a German 'ü' in the Word file. But it's needed, and removing it doesn't seem like the right solution if the code works standalone.
# foo.py
from bar import make_wordm

def main(uuid):
    with open('foo.docm', 'w+') as f:
        f.write(make_wordm(uuid=uuid))

main('1cb02f34-b331-4616-8d20-aa1821ef0fbd')
foo.py imports bar.py to do the heavy lifting.
# bar.py
import tempfile
import shutil
from cStringIO import StringIO
from zipfile import ZipFile, ZipInfo

WORDM_TEMPLATE = './res/template.docm'
MODE_DIRECTORY = 0x10  # external_attr bit marking a directory entry

def zipinfo_contents_replace(zipfile=None, zipinfo=None,
                             search=None, replace=None):
    # Extract a single archive member to a temp dir, do the substring
    # replacement on its contents, clean up, and return the new contents.
    dirname = tempfile.mkdtemp()
    fname = zipfile.extract(zipinfo, dirname)
    with open(fname, 'r') as fd:
        contents = fd.read().replace(search, replace)
    shutil.rmtree(dirname)
    return contents

def make_wordm(uuid=None, template=WORDM_TEMPLATE):
    # Rebuild the .docm (a zip archive) in memory, replacing the placeholder
    # UUID in every non-directory entry.
    with open(template, 'r') as f:
        input_buf = StringIO(f.read())
    output_buf = StringIO()
    output_zip = ZipFile(output_buf, 'w')
    with ZipFile(input_buf, 'r') as doc:
        for entry in doc.filelist:
            if entry.external_attr & MODE_DIRECTORY:
                continue
            contents = zipinfo_contents_replace(zipfile=doc, zipinfo=entry,
                                                search="00000000-0000-0000-0000-000000000000",
                                                replace=uuid)
            output_zip.writestr(entry, contents)
    output_zip.close()
    return output_buf.getvalue()
The following error is thrown when embedding the same code in a larger scale context:
ERROR:root:message
Traceback (most recent call last):
  File "FooBar.py", line 402, in foo_bar
    bar = bar_constructor(bar_theme,bar_user,uuid)
  File "FooBar.py", line 187, in bar_constructor
    if(main(uuid)):
  File "FooBar.py", line 158, in main
    f.write(make_wordm(uuid=uuid))
  File "/home/foo/FooBarGen.py", line 57, in make_wordm
    search="00000000-0000-0000-0000-000000000000", replace=uuid)
  File "/home/foo/FooBarGen.py", line 24, in zipinfo_contents_replace
    contents = fd.read().replace(search, replace)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2722: ordinal not in range(128)
INFO:FooBar:None
edit: Upon further examination and debugging, it seems like the variable 'uuid' is causing the issue. When I pass the parameter as a literal string ('1cb02f34-b331-4616-8d20-aa1821ef0fbd') instead of the variable parsed from JSON, it works perfectly fine.
edit2: I had to add uuid = uuid.encode('utf-8', 'ignore') and it works perfectly fine now.
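For reference, a quick Python 2 interpreter check illustrates the difference: a string literal in the source is a byte string, while json.loads() returns Unicode strings for JSON text (the key name 'uuid' below is just for illustration):
>>> import json
>>> type('1cb02f34-b331-4616-8d20-aa1821ef0fbd')
<type 'str'>
>>> type(json.loads('{"uuid": "1cb02f34-b331-4616-8d20-aa1821ef0fbd"}')['uuid'])
<type 'unicode'>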
Answer 1:
The problem is mixing Unicode and byte strings. Python 2 "helpfully" tries to convert from one to the other, but defaults to using the ascii codec.
Here's an example:
>>> 'aeioü'.replace('a','b') # all byte strings
'beio\xfc'
>>> 'aeioü'.replace(u'a','b') # one Unicode string and it converts...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 4: ordinal not in range(128)
You mentioned reading the UUID from JSON. JSON returns Unicode strings. Ideally, read all text files by decoding them to Unicode, do all text processing in Unicode, and encode when writing text back to storage. In your "larger framework" this could be a big porting job, but essentially use io.open with an encoding to read a file and decode it to Unicode:
import io
with io.open(fname, 'r', encoding='utf8') as fd:
    contents = fd.read().replace(search, replace)
Note that encoding should match the actual encoding of the files you are reading; that's something you'll have to determine.
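Applied to the helper from the question, a minimal sketch of that approach could look like the following. It assumes the entries being rewritten are the UTF-8 encoded XML parts of the .docm; binary members (for example vbaProject.bin) would need to be passed through unchanged rather than decoded:
import io
import shutil
import tempfile

def zipinfo_contents_replace(zipfile=None, zipinfo=None,
                             search=None, replace=None):
    # Extract the entry, decode it to Unicode, do the replacement on Unicode
    # text (search/replace should be Unicode or plain ASCII), return Unicode.
    dirname = tempfile.mkdtemp()
    fname = zipfile.extract(zipinfo, dirname)
    with io.open(fname, 'r', encoding='utf8') as fd:
        contents = fd.read().replace(search, replace)
    shutil.rmtree(dirname)
    return contents

# ...and in make_wordm(), encode back to bytes when writing, because
# ZipFile.writestr() expects a byte string in Python 2:
# output_zip.writestr(entry, contents.encode('utf8'))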
A shortcut, as you've found in your edit, is to encode the UUID from JSON back to a byte string, but using Unicode to deal with text should be the goal.
Python 3 cleans up this process by making strings Unicode by default, and drops the implicit conversion to/from byte/Unicode strings.
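For comparison, a small Python 3 session illustrating that last point; text is Unicode by default, and mixing bytes with str raises an explicit error instead of silently decoding:
>>> 'aeioü'.replace('a', 'b')
'beioü'
>>> b'aeio'.replace('a', 'b')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'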
Answer 2:
Change this line:
with open(fname, 'r') as fd:
to this (using io.open, since the built-in open() in Python 2 does not take an encoding argument):
with io.open(fname, 'r', encoding='latin1') as fd:
The ascii codec can handle character codes between 0 and 127 inclusive. Your file contains the character code 0xc3, which is outside that range. You need to choose a different codec.
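For illustration, here is how that byte behaves under different codecs in a Python 2 shell (0xc3 0xbc is the UTF-8 encoding of 'ü', matching the byte from the traceback); latin1 accepts any byte value, but only utf8 reproduces the original character:
>>> '\xc3\xbc'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> '\xc3\xbc'.decode('latin1')   # never fails, but gives u'Ã¼'
u'\xc3\xbc'
>>> '\xc3\xbc'.decode('utf8')     # matches the file's actual encoding: u'ü'
u'\xfc'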
Answer 3:
Every time I've had a problem with special characters in the past, I've resolved it by decoding to Unicode when reading and then encoding to UTF-8 when writing back to a file. I hope this works for you too.
For my solution I've always used what I found in this presentation: http://farmdev.com/talks/unicode/
So I would use this:
def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj
Then, in your code (decode the file contents first, so the replacement happens on Unicode text):
contents = to_unicode_or_bust(fd.read()).replace(search, replace)
And then, when writing, encode back to UTF-8:
output_zip.writestr(entry, contents.encode('utf-8'))
I didn't reproduce your issue, so this is just a suggestion. I hope it works.
Source: https://stackoverflow.com/questions/49938517/python-unreproducible-unicodedecodeerror