Python3 - Cannot read docx, odt file - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte

Submitted by 余生颓废 on 2019-12-23 16:26:05

Question


I am trying to split a large docx file into smaller files. To do that, I read the file in Python 3.6 with the following code:

with open('h.docx', 'r') as f:
    a = f.read()

It throws this error.

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte

h.docx is created using LibreOffice Calc with just 'hello world' in it as content. I can read this successfully in Python 2.7 without any errors.

I tried

with open('h.docx', 'r', encoding='latin-1') as f:
    a = f.read()

With this I can read the file without any errors, but when I write the data to another file, the original contents are lost.

I also tried errors='surrogateescape', but again the original contents are lost when the data is written to another file.


Answer 1:


Not really an answer, but too long for a comment. What you are doing makes no sense: you are trying to read a ".docx" file as if it were a text file, which it is not. It is a complex format in which several XML files (and possibly other files) are packaged into a single zip archive. Python 3 fails because text mode tries to decode the compressed bytes as UTF-8; Python 2 only appeared to work because its open() returns the raw bytes without decoding them. You should not even contemplate processing such a file by hand unless you need one of the following:

  • a trivial change, such as replacing one word with another
  • a read-only operation, such as searching for a particular string
  • to write a docx processing package of your own (good luck with it)

and even those would not be simple operations.
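As a rough illustration of the read-only case, here is a minimal sketch that uses only the standard library's zipfile module. The function name docx_contains is hypothetical, and the sketch assumes the body text lives in word/document.xml, which is the usual location in a .docx package.

```python
import zipfile


def docx_contains(path, needle):
    """Return True if `needle` occurs in the main document XML of a .docx file.

    Hypothetical helper: assumes the body text is stored in
    word/document.xml, the usual location in a .docx package.
    """
    with zipfile.ZipFile(path) as archive:
        # A .docx file is a zip archive; read one member and decode its XML.
        xml = archive.read("word/document.xml").decode("utf-8")
    # Naive substring search on the raw XML, markup included.
    return needle in xml
```

This also shows why even a simple search is not trivial: a word processor may split one phrase across several XML elements, so a plain substring test like this can miss text that is visibly present in the document.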

What is possible:

  • process the file as a binary file when you only treat it as opaque content, for example to send it over a network connection
  • use a dedicated library such as python-docx
  • under Windows, use the automation interface of Word to have Word itself process the file (comtypes could help here)
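The first option can be sketched as follows, assuming the goal is only to move the bytes around unchanged. The function name copy_binary is hypothetical; the point is the 'rb' and 'wb' modes, which skip decoding and newline translation entirely.

```python
def copy_binary(src, dst):
    """Copy a file byte for byte and return the number of bytes copied.

    Hypothetical helper: binary mode ('rb'/'wb') performs no decoding and
    no newline translation, so the content is preserved exactly.
    """
    with open(src, "rb") as fin:
        data = fin.read()  # bytes, not str
    with open(dst, "wb") as fout:
        fout.write(data)
    return len(data)
```

Opened this way, the byte 0xea that broke the UTF-8 decode in the question is passed through untouched, and the copy remains a valid docx. This does not help with splitting the document, which needs a library that understands the format.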


Source: https://stackoverflow.com/questions/48536461/python3-cannot-read-docx-odt-file-unicodedecodeerror-utf-8-codec-cant-d
