How do I extract data from a doc/docx file using Python

[愿得一人] 2020-12-08 23:57

I know there are similar questions out there, but I couldn't find something that would answer my prayers. What I need is a way to access certain data from MS-Word files and

4 Answers
  • 2020-12-09 00:37

    A simpler library, which can also extract images.

    pip install docx2txt
    


    Then use the code below to read the docx file.

    import docx2txt
    text = docx2txt.process("file.docx")
    
  • 2020-12-09 00:45

    A .docx file is a zip archive that contains the document body as XML (word/document.xml). You can open the archive, read that file, and parse the data with ElementTree.

    The advantage of this technique is that you don't need any extra Python libraries installed.

    import zipfile
    import xml.etree.ElementTree
    
    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'
    TABLE = WORD_NAMESPACE + 'tbl'
    ROW = WORD_NAMESPACE + 'tr'
    CELL = WORD_NAMESPACE + 'tc'
    
    with zipfile.ZipFile('<path to docx file>') as docx:
        tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))
    
    # Print the text of every table cell, one cell per line
    for table in tree.iter(TABLE):
        for row in table.iter(ROW):
            for cell in row.iter(CELL):
                print(''.join(node.text or '' for node in cell.iter(TEXT)))
    

    See my Stack Overflow answer to "How to read contents of an Table in MS-Word file Using Python?" for more details and references.
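The same zip-plus-ElementTree approach should also work for ordinary paragraph text, not just tables. The following is a minimal sketch, not part of the original answer: the helper name `read_paragraphs` is hypothetical, and it assumes the main body lives in word/document.xml as above. Note that paragraphs inside table cells are returned as well, since they are also `w:p` elements.

```python
import zipfile
import xml.etree.ElementTree as ET

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def read_paragraphs(path):
    """Return the text of each non-empty paragraph in a .docx file.

    `path` may be a file name or a binary file-like object.
    (Hypothetical helper, shown for illustration only.)
    """
    with zipfile.ZipFile(path) as docx:
        tree = ET.fromstring(docx.read('word/document.xml'))

    paragraphs = []
    for para in tree.iter(PARA):
        # A paragraph's text is split across one or more w:t elements.
        text = ''.join(node.text or '' for node in para.iter(TEXT))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Calling `read_paragraphs('file.docx')` would then give a plain list of strings that can be searched or filtered with ordinary Python.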

  • 2020-12-09 00:50

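    To search in a document with python-docx (this is the legacy, function-based module; the python-docx package currently installed by pip exposes a different, Document-based API):

    # Import the functions used below
    from docx import opendocx, search

    # Open the .docx file
    document = opendocx('A document.docx')

    # search() returns True if the string is found
    found = search(document, 'your search string')


    There is also a function to get the text of a document:

    https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910

    # Import the functions used below
    from docx import opendocx, getdocumenttext

    # Open the .docx file
    document = opendocx('A document.docx')

    # Returns the document text as a list of paragraph strings
    fullText = getdocumenttext(document)


    Using https://github.com/mikemaccana/python-docx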
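For readers on a current python-docx release, where `opendocx` and `getdocumenttext` are not available, a rough equivalent might look like the sketch below. It is not from the original answer: it swaps in the `Document`-based API of the python-docx package, and the helper name `read_docx` is hypothetical. It assumes python-docx is installed (`pip install python-docx`).

```python
def read_docx(path):
    """Return (paragraph texts, tables) from a .docx file.

    Each table is a list of rows; each row is a list of cell strings.
    (Hypothetical helper, shown for illustration only.)
    """
    # Imported lazily: requires the third-party python-docx package.
    from docx import Document

    document = Document(path)
    # Body-level paragraphs only; text inside tables is collected separately.
    paragraphs = [p.text for p in document.paragraphs]
    tables = [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in document.tables
    ]
    return paragraphs, tables
```

A search like the one above then reduces to something such as `any('your search string' in p for p in paragraphs)`.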

  • 2020-12-09 00:55

    It seems that pywin32 does the trick. It drives Word itself through COM, so it needs Windows with Word installed. You can iterate through all the tables in a document and through all the cells inside each table. Getting the data is a bit tricky (the last 2 characters of every entry have to be omitted), but otherwise it's about ten minutes of code. If anyone needs additional details, please say so in the comments.
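Since this answer gives no code, here is a minimal sketch of what the pywin32 approach could look like. It is an illustration under assumptions, not the author's actual code: the helper names `clean_cell_text` and `read_word_tables` are hypothetical, and it assumes the two trailing characters are the carriage return plus bell character (`'\r\x07'`) that Word appends to each cell's `Range.Text`. It only runs on Windows with Microsoft Word and pywin32 installed.

```python
import os


def clean_cell_text(raw):
    """Drop the two-character end-of-cell marker, if present (assumed '\\r\\x07')."""
    return raw[:-2] if raw.endswith('\r\x07') else raw


def read_word_tables(path):
    """Return every table in the document as a flat list of cell strings.

    (Hypothetical helper; requires Windows, Microsoft Word and pywin32.)
    """
    import win32com.client  # imported here because it only exists on Windows

    word = win32com.client.Dispatch('Word.Application')
    try:
        word.Visible = False
        # Word resolves relative paths against its own working directory.
        doc = word.Documents.Open(os.path.abspath(path))
        try:
            return [
                [clean_cell_text(cell.Range.Text) for cell in table.Range.Cells]
                for table in doc.Tables
            ]
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

Unlike the zip-based approaches, this route also opens legacy .doc files, because Word does the parsing.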
