Best way to extract text from a Word doc without using COM/automation?

遇见更好的自我 2020-12-07 21:29

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platfo

10 answers
  •  不思量自难忘°
    2020-12-07 21:37

    If all you want is to extract text from Word files (.docx), you can do it with Python alone. As Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I wrote a simple function to do this:

    """
    Module that extracts text from an MS XML Word document (.docx).
    (Inspired by python-docx.)
    """
    import zipfile
    # cElementTree was removed in Python 3.9; ElementTree has used the
    # C accelerator automatically since Python 3.3.
    from xml.etree.ElementTree import XML


    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'


    def get_docx_text(path):
        """
        Take the path of a docx file as argument, return the text as a str.
        """
        # A .docx file is a zip archive; the main body is word/document.xml.
        with zipfile.ZipFile(path) as document:
            xml_content = document.read('word/document.xml')
        tree = XML(xml_content)

        paragraphs = []
        # iter() replaces getiterator(), which was removed in Python 3.9.
        for paragraph in tree.iter(PARA):
            texts = [node.text
                     for node in paragraph.iter(TEXT)
                     if node.text]
            if texts:
                paragraphs.append(''.join(texts))

        return '\n\n'.join(paragraphs)
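
    Since the question is about a web app, uploads often arrive as file-like objects rather than paths on disk. Here is a minimal sketch of the same unzip-and-parse approach that works on either, and that also keeps tabs and line breaks found inside runs. The names get_docx_text_from_file and make_minimal_docx are hypothetical helpers made up for this illustration, not part of python-docx, and the in-memory archive is only a test fixture, not a file Word would open.

```python
import io
import zipfile
from xml.etree.ElementTree import XML
from xml.sax.saxutils import escape

NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
W = '{' + NS + '}'


def get_docx_text_from_file(source):
    """Hypothetical variant: accepts a path or a binary file-like object."""
    # zipfile.ZipFile takes either a filename or a seekable file object.
    with zipfile.ZipFile(source) as document:
        tree = XML(document.read('word/document.xml'))

    paragraphs = []
    for paragraph in tree.iter(W + 'p'):
        parts = []
        # Only look at direct children of runs, so that tab-stop
        # definitions in paragraph properties are not counted as tabs.
        for run in paragraph.iter(W + 'r'):
            for child in run:
                if child.tag == W + 't' and child.text:
                    parts.append(child.text)
                elif child.tag == W + 'tab':
                    parts.append('\t')
                elif child.tag == W + 'br':
                    parts.append('\n')
        if parts:
            paragraphs.append(''.join(parts))
    return '\n\n'.join(paragraphs)


def make_minimal_docx(paragraph_texts):
    """Hypothetical fixture: a zip holding only word/document.xml."""
    body = ''.join(
        '<w:p><w:r><w:t>{}</w:t></w:r></w:p>'.format(escape(text))
        for text in paragraph_texts
    )
    xml = '<w:document xmlns:w="{}"><w:body>{}</w:body></w:document>'.format(NS, body)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    buffer.seek(0)
    return buffer


if __name__ == '__main__':
    sample = make_minimal_docx(['Hello', 'World'])
    print(get_docx_text_from_file(sample))
```

    Like the function above, this sketch reads only the main document body; headers, footers, footnotes and comments live in other parts of the archive and are not extracted.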
    
