Python3 - Cannot read docx, odt file - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte

Question

I am trying to split a large docx file into smaller files. To do that, I read the file in Python 3.6 with the following code:

with open('h.docx', 'r') as f:
    a = f.read()

It throws this error.

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte

h.docx was created using LibreOffice Calc and contains just 'hello world'. I can read it successfully in Python 2.7 without any errors.

I tried

with open('h.docx', 'r', encoding='latin-1') as f:
    a = f.read()

This way I can read the file without any errors, but when I write the result to another file, the original contents are lost.

I also tried errors='surrogateescape', but again the original contents are lost when the result is written to another file.

Answer 1:

Not really an answer, but too long for a comment. What you are doing is just nonsense: you are trying to read a ".docx" file as if it were a text file, which it is not. It is a complex format in which several XML files (and possibly others...) are packed into a single zip file. You should not even contemplate processing such a file by hand unless one of the following applies:

you only need a trivial change, such as replacing one word with another
you only need a read-only operation, such as searching for a particular string
you want to write a docx processing package (good luck with it)

and even those would not be simple operations.
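
To make the "zip file of XML files" point concrete, here is a minimal sketch of a read-only inspection using only the standard library. The function names list_docx_parts and docx_contains are made up for this illustration. It assumes the main text lives in the part named word/document.xml (the usual location in a docx; an odt file normally uses content.xml instead) and that this part is UTF-8 encoded. The search is deliberately naive: a phrase that the word processor has split across several XML elements will not be found.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of the files stored inside the zip container."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def docx_contains(path, needle, part="word/document.xml"):
    """Naive read-only search for `needle` in one XML part of the archive.

    `part` is an assumption: word/document.xml is the usual main part of
    a docx file. The raw XML is searched, markup included.
    """
    with zipfile.ZipFile(path) as archive:
        xml_text = archive.read(part).decode("utf-8")
    return needle in xml_text
```

Calling list_docx_parts('h.docx') should show several entries rather than a single text stream, which is why opening the file in text mode and decoding it as UTF-8 fails.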

What is possible:

process the file as a binary file when you only treat it as opaque content, for example to send it over a network connection
use a dedicated library like python-docx
under Windows, use the automation interface of Word to have Word itself process the file (comtypes could help here)
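
The first two options could look roughly like the sketch below. The helper names copy_binary and split_docx_by_paragraphs are hypothetical, not part of any library. copy_binary opens both files in binary mode ('rb' / 'wb'), so no decoding takes place and the copy is byte-for-byte identical; note that cutting those raw bytes into pieces would not give valid smaller docx files. split_docx_by_paragraphs assumes the third-party python-docx package is installed (pip install python-docx) and copies paragraph text only, so formatting, tables and images are dropped.

```python
def copy_binary(src_path, dst_path, chunk_size=64 * 1024):
    """Copy a file byte for byte; nothing is decoded, so nothing is lost."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)


def split_docx_by_paragraphs(src_path, paragraphs_per_file, out_prefix):
    """Split a docx into smaller docx files, keeping paragraph text only.

    Assumption: python-docx is installed. Formatting, tables and images
    are not carried over in this simplified version.
    """
    # Imported lazily so copy_binary works without python-docx installed.
    from docx import Document

    paragraphs = Document(src_path).paragraphs
    out_paths = []
    for start in range(0, len(paragraphs), paragraphs_per_file):
        part = Document()  # new, empty document
        for paragraph in paragraphs[start:start + paragraphs_per_file]:
            part.add_paragraph(paragraph.text)
        out_path = f"{out_prefix}_{len(out_paths) + 1}.docx"
        part.save(out_path)
        out_paths.append(out_path)
    return out_paths
```

For the original goal, something like split_docx_by_paragraphs('h.docx', 50, 'part') would write part_1.docx, part_2.docx and so on; the chunk size of 50 paragraphs is an arbitrary choice for the example.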

Source: https://stackoverflow.com/questions/48536461/python3-cannot-read-docx-odt-file-unicodedecodeerror-utf-8-codec-cant-d

Tags: python, file, encoding, utf-8, decode