Best way to extract text from a Word doc without using COM/automation?

遇见更好的自我 2020-12-07 21:29

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platfo

10 Answers
  • 2020-12-07 21:30

    In case someone wants to do this in Java, there is the Apache POI API. extractor.getText() extracts plain text from a .docx file. Here is the link: https://www.tutorialspoint.com/apache_poi_word/apache_poi_word_text_extraction.htm

  • 2020-12-07 21:36

    This worked well for .doc and .odt.

    It calls OpenOffice on the command line to convert your file to text, which you can then simply load into Python.

    (It seems to have other format options, though they are apparently not documented.)
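
    The tool this answer refers to is not named in the text. As a rough sketch of the same idea, assuming a LibreOffice or OpenOffice `soffice` binary is on the PATH, the headless converter can be driven from Python roughly like this. The helper names `build_convert_command` and `doc_to_text_soffice` are made up for illustration, and the output encoding is an assumption.

```python
import os
import subprocess
import tempfile


def build_convert_command(path, outdir):
    """Return the argv for a headless conversion of `path` to plain text."""
    # Assumption: `soffice` is on PATH; "txt:Text" selects the plain-text filter.
    return ['soffice', '--headless', '--convert-to', 'txt:Text',
            '--outdir', outdir, path]


def doc_to_text_soffice(path):
    """Convert the document into a temporary directory and return its text."""
    with tempfile.TemporaryDirectory() as outdir:
        subprocess.run(build_convert_command(path, outdir), check=True)
        # The converted file keeps the input's base name with a .txt extension.
        base = os.path.splitext(os.path.basename(path))[0]
        txt_path = os.path.join(outdir, base + '.txt')
        # Assumption: the output is UTF-8; undecodable bytes are replaced.
        with open(txt_path, encoding='utf-8', errors='replace') as handle:
            return handle.read()
```

    Passing the command as a list (rather than a shell string) means file names with spaces need no quoting.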

  • 2020-12-07 21:37

    If all you want to do is extract text from Word files (.docx), you can do it with Python alone. As Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this:

    """
    Module that extracts text from an MS XML Word document (.docx).
    (Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
    """
    
    import zipfile
    from xml.etree.ElementTree import XML
    
    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'
    
    
    def get_docx_text(path):
        """
        Take the path of a docx file as argument, return the text as a str.
        """
        # The main body of a .docx archive lives in word/document.xml.
        with zipfile.ZipFile(path) as document:
            xml_content = document.read('word/document.xml')
        tree = XML(xml_content)
    
        paragraphs = []
        # Element.iter() replaces getiterator(), which was removed in Python 3.9.
        for paragraph in tree.iter(PARA):
            texts = [node.text
                     for node in paragraph.iter(TEXT)
                     if node.text]
            if texts:
                paragraphs.append(''.join(texts))
    
        return '\n\n'.join(paragraphs)
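
    To try the unzip-and-parse approach without a real Word file, one option is to build a tiny in-memory archive that contains only word/document.xml and run the same kind of parsing over it. This is a self-contained sketch: `make_minimal_docx` and `paragraphs_from_docx` are hypothetical helper names, and the archive it builds is not a complete .docx that Word would open.

```python
import io
import zipfile
from xml.etree.ElementTree import XML
from xml.sax.saxutils import escape

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def make_minimal_docx(paragraphs):
    """Build an in-memory zip holding only word/document.xml (test fixture)."""
    body = ''.join('<w:p><w:r><w:t>%s</w:t></w:r></w:p>' % escape(text)
                   for text in paragraphs)
    xml = ('<w:document xmlns:w="http://schemas.openxmlformats.org/'
           'wordprocessingml/2006/main"><w:body>%s</w:body></w:document>' % body)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    buffer.seek(0)
    return buffer


def paragraphs_from_docx(file_or_path):
    """Return the non-empty paragraphs of a .docx as a list of strings."""
    # zipfile.ZipFile accepts either a path or a file-like object.
    with zipfile.ZipFile(file_or_path) as archive:
        tree = XML(archive.read('word/document.xml'))
    result = []
    for paragraph in tree.iter(W + 'p'):
        text = ''.join(node.text for node in paragraph.iter(W + 't') if node.text)
        if text:
            result.append(text)
    return result


print(paragraphs_from_docx(make_minimal_docx(['Hello', 'Tom & Jerry'])))
```

    Because `zipfile.ZipFile` also accepts file-like objects, the same parsing can be applied to an uploaded file held in memory, which may suit a web app.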
    
  • 2020-12-07 21:40

    Honestly, don't use "pip install tika": it was developed for a single user (one developer working on their laptop), not for multi-user setups (several developers).

    The small class TikaWrapper.py below, which runs Tika from the command line, is more than enough to meet our needs.

    You just have to instantiate this class with the JAVA_HOME path and the Tika jar path, that's all! It works well for a lot of formats (e.g. PDF, DOCX, ODT, XLSX, PPT, etc.).

    #!/bin/python
    # -*- coding: utf-8 -*-
    
    # Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
    # Developed by Philippe ROSSIGNOL
    #####################
    # TikaWrapper class #
    #####################
    class TikaWrapper:
    
        java_home = None
        tika_lib_path = None
    
        # Constructor
        def __init__(self, java_home, tikalib_path):
            self.java_home = java_home
            self.tika_lib_path = tikalib_path
    
        def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
            '''
            - Description:
              Extract metadata from a document
            
            - Params:
              filePath: The document file path
              encoding: The encoding (default = "UTF-8")
              returnTuple: If True return a tuple which contains both the output and the error (default = False)
            
            - Examples:
              metadata = extractMetadata(filePath="MyDocument.docx")
              metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
            '''
            cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
            out, err = self._execute(cmd, encoding)
            if (returnTuple): return out, err
            return out
    
        def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
            '''
            - Description:
              Extract text from a document
            
            - Params:
              filePath: The document file path
              encoding: The encoding (default = "UTF-8")
              returnTuple: If True return a tuple which contains both the output and the error (default = False)
            
            - Examples:
              text = extractText(filePath="MyDocument.docx")
              text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
            '''
            cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
            out, err = self._execute(cmd, encoding)
            if returnTuple:
                return out, err
            return out
    
        # ===========
        # = PRIVATE =
        # ===========
    
        _cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
        _cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"
    
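        def _getCmd(self, cmdModel, filePath, encoding):
            # Quote each substituted value: the command runs with shell=True,
            # so spaces or shell metacharacters in a path would otherwise break it.
            from shlex import quote
            cmd = cmdModel.replace("${JAVA_HOME}", quote(self.java_home))
            cmd = cmd.replace("${TIKALIB_PATH}", quote(self.tika_lib_path))
            cmd = cmd.replace("${ENCODING}", quote(encoding))
            cmd = cmd.replace("${FILE_PATH}", quote(filePath))
            return cmd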
    
        def _execute(self, cmd, encoding):
            import subprocess
            process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            out, err = process.communicate()
            out = out.decode(encoding=encoding)
            err = err.decode(encoding=encoding)
            return out, err
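
    As a possible variation on the class above, the command can be built as an argument list and run without a shell, which sidesteps quoting issues for paths with spaces. This is only a sketch: `build_tika_command` and `run_tika` are hypothetical names, and the flags are the ones used in the class (`--metadata`, `--text`, `--encoding=`).

```python
import os
import subprocess


def build_tika_command(java_home, tika_jar, file_path,
                       encoding="UTF-8", metadata=False):
    """Build the java argv for the Tika app jar."""
    cmd = [os.path.join(java_home, "bin", "java"), "-jar", tika_jar]
    if metadata:
        cmd.append("--metadata")
    else:
        cmd += ["--encoding=" + encoding, "--text"]
    cmd.append(file_path)
    return cmd


def run_tika(cmd, encoding="UTF-8"):
    """Run the command without a shell and return (stdout, stderr) as text."""
    process = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return process.stdout.decode(encoding), process.stderr.decode(encoding)
```

    Usage would look like `text, error = run_tika(build_tika_command(java_home, tika_jar, "MyDocument.docx"))`, where `java_home` and `tika_jar` are the same two paths the class constructor takes.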
    
  • 2020-12-07 21:47

    I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped this in Python functions, so it is easy to use from the parsing system (which is written in Python).

    import subprocess
    
    def doc_to_text_catdoc(filename):
        # os.popen3() no longer exists in Python 3; subprocess.run() replaces it.
        # Passing an argument list avoids shell quoting problems with the filename.
        result = subprocess.run(['catdoc', '-w', filename],
                                capture_output=True, text=True)
        if not result.stderr:
            return result.stdout
        else:
            raise OSError("Executing the command caused an error: %s" % result.stderr)
    
    # similar doc_to_text_antiword()
    

    The -w switch to catdoc turns off line wrapping, BTW.
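
    The antiword variant mentioned in the comment is not shown in the answer. One way it might look, assuming `antiword` is installed and prints the document text to standard output when given a file name, is to share a small runner between both tools. `run_text_extractor` is a hypothetical helper name.

```python
import subprocess


def run_text_extractor(argv):
    """Run an external extractor and return what it wrote to stdout."""
    result = subprocess.run(argv, capture_output=True, text=True)
    # Treat a non-zero exit status or any stderr output as a failure.
    if result.returncode != 0 or result.stderr:
        raise OSError("Executing %r caused an error: %s" % (argv, result.stderr))
    return result.stdout


def doc_to_text_antiword(filename):
    # Assumption: antiword is on PATH and needs no extra switches here.
    return run_text_extractor(['antiword', filename])
```

    If the binary is missing, `subprocess.run` raises `FileNotFoundError`, which is itself a subclass of `OSError`, so callers can catch a single exception type.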

  • 2020-12-07 21:47

    tika-python

    A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

    Note: It also works nicely with PyInstaller.

    Install with pip:

    pip install tika
    

    Sample:

    #!/usr/bin/env python
    from tika import parser
    parsed = parser.from_file('/path/to/file')
    print(parsed["metadata"])  # the file's metadata
    print(parsed["content"])   # the file's text content
    

    Link to official GitHub
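
    One small thing that may help when using the parsed result: the text often carries leading and trailing blank lines, and, as far as I can tell, the "content" value can be None when no text was extracted. A minimal helper for that, working on the dict shown in the sample (`clean_content` is a hypothetical name):

```python
def clean_content(parsed):
    """Return the extracted text without surrounding whitespace, or '' if none."""
    # Assumption: `parsed` is the dict returned by parser.from_file(), and its
    # "content" entry may be missing or None when nothing could be extracted.
    content = parsed.get("content") or ""
    return content.strip()
```

    It would be called as `clean_content(parser.from_file('/path/to/file'))`.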
