I know there are similar questions out there, but I couldn't find something that would answer my prayers. What I need is a way to access certain data from MS-Word files and
A .docx file is a zip archive that stores the document body as XML in word/document.xml. You can open the archive, read that file, and parse the data with ElementTree.
The advantage of this technique is that it uses only the Python standard library, so you don't need to install any extra packages.
import zipfile
import xml.etree.ElementTree

# WordprocessingML namespace, written in ElementTree's {uri}tag notation
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

# The path is left empty here; put the path to your .docx file in the quotes
with zipfile.ZipFile('') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            # A cell's text can be split across several text nodes, so join them
            print(''.join(node.text or '' for node in cell.iter(TEXT)))
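If you want the data back as Python lists instead of printed output, the same traversal can be wrapped in a function. This is only a sketch: the name `read_tables` and the list-of-rows return shape are my own choices, not part of any library. It also assumes that rows, cells and paragraphs are direct children of their parent elements, which holds for simple documents but not necessarily when content controls wrap them.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's {uri}tag notation
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_tables(docx_file):
    """Return every table as a list of rows, each row a list of cell strings.

    docx_file may be a path or a binary file-like object.
    (Hypothetical helper name, for illustration only.)
    """
    with zipfile.ZipFile(docx_file) as docx:
        root = ET.fromstring(docx.read('word/document.xml'))

    tables = []
    for table in root.iter(W + 'tbl'):
        rows = []
        # findall() matches direct children only, so the rows of a nested
        # table are not mixed into the rows of the table that contains it.
        for row in table.findall(W + 'tr'):
            cells = []
            for cell in row.findall(W + 'tc'):
                # Join the text runs of each paragraph, then separate the
                # paragraphs of one cell with newlines.
                paragraphs = [
                    ''.join(t.text or '' for t in p.iter(W + 't'))
                    for p in cell.findall(W + 'p')
                ]
                cells.append('\n'.join(paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Calling `read_tables(path)` with the path to a .docx file gives one nested list per table, so `read_tables(path)[0][1][2]` would be the third cell of the second row of the first table. A nested table is returned as its own entry in the list as well.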
See my Stack Overflow answer to "How to read contents of an Table in MS-Word file Using Python?" for more details and references.