Question
I am trying to split a large docx file into smaller files. To do that, I read the file in Python 3.6 with the following code:
with open('h.docx', 'r') as f:
    a = f.read()
It throws this error.
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte
h.docx was created using LibreOffice Calc with just 'hello world' in it as content. I can read this file successfully in Python 2.7 without any errors.
I tried
with open('h.docx', 'r', encoding='latin-1') as f:
    a = f.read()
With this I can read the file without any errors, but when I write the result to another file, the original contents are lost.
I also tried errors='surrogateescape', but again the original contents are lost when written to another file.
Answer 1:
Not really an answer, but too long for a comment. What you are doing makes no sense: you are trying to read a ".docx" file as if it were a text file, which it is not. It is a complex format in which several XML files (and possibly others...) are packed into a single zip archive. You should not even contemplate processing such a file by hand unless one of the following applies:
- you only need trivial changes, such as replacing one word with another
- you only need a read-only operation, such as searching for a particular string
- you want to write a docx processing package (good luck with it)
and even those would not be simple operations.
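To illustrate the read-only case, here is a minimal sketch that uses only the standard library. It assumes the main body lives in word/document.xml, which is the usual layout for a .docx package. The helper name docx_contains is hypothetical. The tag stripping is deliberately crude: it does not decode XML entities and it runs paragraphs together, so treat it as a demonstration rather than a real parser.

```python
import re
import zipfile


def docx_contains(path, needle):
    """Return True if `needle` occurs in the main document text of a .docx file.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main body is stored as word/document.xml.
        xml = archive.read("word/document.xml").decode("utf-8")
    # Crude tag stripping: enough for a simple search, not for real processing.
    text = re.sub(r"<[^>]+>", "", xml)
    return needle in text
```

Calling docx_contains('h.docx', 'hello world') would then report whether the phrase appears in the body, even when the editor has split it across several text runs.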
What is possible:
- process the file as a binary file when you treat it as opaque content, for example to send it over a network connection
- use a dedicated library like python-docx
- under Windows, use the automation interface of Word to have Word itself process the file (comtypes could help here)
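The first option can be sketched as follows. Opening the file in binary mode ('rb' / 'wb') bypasses text decoding entirely, so there is no UnicodeDecodeError and the copy is byte-for-byte identical. The helper name copy_binary is hypothetical, and the file names in the usage note are only examples.

```python
def copy_binary(src, dst):
    """Copy a file byte for byte, without any text decoding.

    Hypothetical helper for illustration only. Returns the number of bytes copied.
    """
    with open(src, "rb") as fin:    # 'rb': read raw bytes, no codec involved
        data = fin.read()
    with open(dst, "wb") as fout:   # 'wb': write the same bytes back unchanged
        fout.write(data)
    return len(data)
```

For example, copy_binary('h.docx', 'copy.docx') should give a file that opens exactly like the original. Note that this only preserves the file as a whole: cutting the bytes into pieces will not yield smaller valid documents, because each piece would be a fragment of a zip archive.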
Source: https://stackoverflow.com/questions/48536461/python3-cannot-read-docx-odt-file-unicodedecodeerror-utf-8-codec-cant-d