Best way to extract text from a Word doc without using COM/automation?

遇见更好的自我 2020-12-07 21:29

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platfo

10 Answers
  •  孤城傲影
    2020-12-07 21:40

    Honestly, don't use "pip install tika": it was designed for a single user (one developer working on a laptop), not for multi-user setups (several developers).

    The small TikaWrapper.py class below, which calls Tika from the command line, is more than enough for our needs.

    You just instantiate the class with the JAVA_HOME path and the Tika jar path, that's all! It works well for many formats (e.g. PDF, DOCX, ODT, XLSX, PPT).

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    # Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
    # Developed by Philippe ROSSIGNOL
    #####################
    # TikaWrapper class #
    #####################
    class TikaWrapper:
    
        java_home = None
        tika_lib_path = None  # same name as the instance attribute set in __init__
    
        # Constructor
        def __init__(self, java_home, tikalib_path):
            self.java_home = java_home
            self.tika_lib_path = tikalib_path
    
        def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
            '''
            - Description:
              Extract metadata from a document
            
            - Params:
              filePath: The document file path
              encoding: The encoding (default = "UTF-8")
              returnTuple: If True return a tuple which contains both the output and the error (default = False)
            
            - Examples:
              metadata = extractMetadata(filePath="MyDocument.docx")
              metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
            '''
            cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
            out, err = self._execute(cmd, encoding)
            if (returnTuple): return out, err
            return out
    
        def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
            '''
            - Description:
              Extract text from a document
            
            - Params:
              filePath: The document file path
              encoding: The encoding (default = "UTF-8")
              returnTuple: If True return a tuple which contains both the output and the error (default = False)
            
            - Examples:
              text = extractText(filePath="MyDocument.docx")
              text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
            '''
            cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
            out, err = self._execute(cmd, encoding)
            if returnTuple: return out, err
            return out
    
        # ===========
        # = PRIVATE =
        # ===========
    
        _cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
        _cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"
    
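        def _getCmd(self, cmdModel, filePath, encoding):
            # Quote each substituted value so that paths containing spaces or
            # shell metacharacters survive the shell=True call in _execute
            from shlex import quote
            cmd = cmdModel.replace("${JAVA_HOME}", quote(self.java_home))
            cmd = cmd.replace("${TIKALIB_PATH}", quote(self.tika_lib_path))
            cmd = cmd.replace("${ENCODING}", quote(encoding))
            cmd = cmd.replace("${FILE_PATH}", quote(filePath))
            return cmd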
    
        def _execute(self, cmd, encoding):
            import subprocess
            process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            out, err = process.communicate()
            out = out.decode(encoding=encoding)
            err = err.decode(encoding=encoding)
            return out, err
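
    As a variation on the class above, here is a minimal sketch that builds the Tika command as an argument list instead of a string template, so no shell is involved and file names with spaces need no quoting. The function names build_tika_command and run_tika, and the paths in the usage lines, are hypothetical and only for illustration; the flags (--text, --metadata, --encoding=) are the ones already used in the class above.

```python
import os
import subprocess


def build_tika_command(java_home, tika_jar, file_path, mode="text", encoding="UTF-8"):
    """Return the argument list for a Tika command-line call (no shell needed)."""
    if mode not in ("text", "metadata"):
        raise ValueError("mode must be 'text' or 'metadata'")
    java = os.path.join(java_home, "bin", "java")
    cmd = [java, "-jar", tika_jar]
    if mode == "text":
        # Same flag as the _cmdExtractText template in the class above
        cmd.append("--encoding=" + encoding)
    cmd.append("--" + mode)
    cmd.append(file_path)
    return cmd


def run_tika(java_home, tika_jar, file_path, mode="text", encoding="UTF-8"):
    """Run Tika and return (stdout, stderr), both decoded with `encoding`."""
    cmd = build_tika_command(java_home, tika_jar, file_path, mode, encoding)
    # Passing a list (and no shell=True) hands each argument to Java untouched
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.stdout.decode(encoding), result.stderr.decode(encoding)


# Hypothetical paths: adjust them to your own Java and Tika installation.
example = build_tika_command("/opt/java", "/opt/tika/tika-app.jar", "My Document.docx")
print(example)
```

    Calling run_tika with the same arguments would then return the extracted text and whatever Tika wrote to stderr, provided Java and the Tika jar actually exist at those paths.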
    
