Extracting tables from a DOCX Word document in Python

Submitted by 余生长醉 on 2019-12-01 03:19:08

Question


I'm trying to extract the contents of the tables in a DOCX Word document, and boy, I'm new to XML/XPath.

from docx import *
document = opendocx('someFile.docx')
tableList = document.xpath('/w:tbl')

This triggers an "XPathEvalError: Undefined namespace prefix" error. I'm sure it's just the first of many I should expect while developing the script. Unfortunately, I couldn't find a tutorial for python-docx.

Could you kindly provide an example of table extraction?


Answer 1:


After some back and forth, we found that a namespace map was needed for this to work. The xpath method is the right tool; it just needs the document's namespaces passed in, because the w: prefix in the expression means nothing to the XPath engine until it is mapped to a namespace URI.

The lxml documentation for the xpath method covers the namespace details, including how to pass a namespaces dictionary.

As explained by mgierdal in his comment above:

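tblList = document.xpath('//w:tbl', namespaces=document.nsmap) works like a dream. So, as I understand it, w: is a shorthand that has to be expanded to the full namespace name, and the dictionary for that is provided by document.nsmap.

Note that the expression also changes from /w:tbl to //w:tbl. The tables are nested inside the document body rather than sitting at the root, so the descendant form is needed to reach them.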
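
The prefix-to-URI mapping described above can also be illustrated without lxml. The following is a minimal sketch using only the standard library, not the opendocx API from the question. It assumes the usual DOCX layout (a zip archive whose main part is word/document.xml) and uses ElementTree's limited XPath subset in place of lxml's xpath method. The function name tables_via_namespace_map is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# "w" is only a local shorthand; the engine needs the full namespace URI.
# This is the WordprocessingML main namespace used by DOCX files.
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def tables_via_namespace_map(docx_file):
    """Return each table as a list of rows, each row a list of cell strings.

    docx_file may be a path or a binary file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    # ".//w:tbl" finds tables at any depth, nested ones included.
    for tbl in root.findall(".//w:tbl", NS):
        rows = []
        for tr in tbl.findall("w:tr", NS):
            row = []
            for tc in tr.findall("w:tc", NS):
                # Cell text is split across w:t elements; join them as-is.
                row.append("".join(t.text or "" for t in tc.findall(".//w:t", NS)))
            rows.append(row)
        tables.append(rows)
    return tables
```

Dropping the NS argument from any of the findall calls leaves the w: prefix unresolved, which is the same kind of failure as the "Undefined namespace prefix" error in the question.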




Answer 2:


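You can extract the tables from a DOCX file using python-docx. Check the following code:

from docx import Document

# file_path is the path to the .docx file to read
document = Document(file_path)

# Top-level tables in the document body, in document order
tables = document.tables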
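
document.tables gives back table objects rather than text, so one more step is needed to get at the cell contents. A minimal sketch of that step, assuming python-docx is installed and relying on its rows, cells and text attributes; the helper names table_to_rows and read_tables are hypothetical, not part of the library.

```python
def table_to_rows(table):
    """Convert one python-docx table into a list of rows of cell strings."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def read_tables(file_path):
    """Load a .docx file and return every top-level table as nested lists."""
    # Imported here so table_to_rows stays usable without python-docx.
    from docx import Document

    document = Document(file_path)
    return [table_to_rows(table) for table in document.tables]
```

Calling read_tables('someFile.docx') would then return one nested list per table. Merged cells may show up more than once in a row, since python-docx reports a merged cell at each grid position it spans.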


Source: https://stackoverflow.com/questions/7097631/extracting-tables-from-a-docx-word-document-in-python
