How do I extract data from a doc/docx file using Python?

[愿得一人] 2020-12-08 23:57

I know there are similar questions out there, but I couldn't find something that would answer my prayers. What I need is a way to access certain data from MS-Word files and

4 answers
  •  清歌不尽
    2020-12-09 00:45

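    A .docx file is a zip archive whose document body is stored as XML in word/document.xml. You can open the archive, read that file, and parse it with ElementTree. (This applies to .docx only; the older binary .doc format is not a zip archive.)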

    The advantage of this technique is that it uses only the Python standard library, so you don't need to install anything extra.

    import zipfile
    import xml.etree.ElementTree
    
    # Every WordprocessingML tag lives in this namespace.
    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'
    TABLE = WORD_NAMESPACE + 'tbl'
    ROW = WORD_NAMESPACE + 'tr'
    CELL = WORD_NAMESPACE + 'tc'
    
    # Put the path to your .docx file between the quotes.
    with zipfile.ZipFile('') as docx:
        tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))
    
    # Print the concatenated text of every cell in every table.
    for table in tree.iter(TABLE):
        for row in table.iter(ROW):
            for cell in row.iter(CELL):
                # node.text is None for an empty <w:t> element, hence "or ''".
                print(''.join(node.text or '' for node in cell.iter(TEXT)))
    

    See my Stack Overflow answer to "How to read contents of an Table in MS-Word file Using Python?" for more details and references.
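
    The same zip-plus-ElementTree approach should also work for ordinary paragraph text, not just tables. The following is a minimal sketch, not part of the original answer: read_paragraphs and make_sample_docx are hypothetical helper names, and the sample archive contains only word/document.xml so the example can run without a real file (Word itself would not open it).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def read_paragraphs(docx_file):
    """Return the text of each non-empty paragraph (hypothetical helper).

    docx_file may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as docx:
        tree = ET.fromstring(docx.read('word/document.xml'))
    paragraphs = []
    for para in tree.iter(PARA):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        text = ''.join(node.text or '' for node in para.iter(TEXT))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_sample_docx():
    """Build a tiny in-memory archive holding only word/document.xml.

    Demo data only (an assumption for this sketch), not a complete .docx.
    """
    ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    xml = ('<w:document xmlns:w="%s"><w:body>'
           '<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
           '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
           '</w:body></w:document>') % ns
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    buf.seek(0)
    return buf


if __name__ == '__main__':
    for line in read_paragraphs(make_sample_docx()):
        print(line)
```

    Note that tree.iter(PARA) walks the whole document, so paragraphs inside table cells are returned as well. For a real file, pass its path to read_paragraphs instead of the in-memory sample.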
