Best way to extract text from a Word doc without using COM/automation?


Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platfo


10 Answers

清酒与你

2020-12-07 21:30

In case someone wants to do this in Java, there is the Apache POI API. extractor.getText() extracts the plain text from a .docx file. Here is the link: https://www.tutorialspoint.com/apache_poi_word/apache_poi_word_text_extraction.htm

一个人的身影

2020-12-07 21:36

This worked well for .doc and .odt.

It calls OpenOffice on the command line to convert your file to text, which you can then simply load into Python.

(It seems to have other format options, though they are apparently not documented.)
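
The tool this answer refers to is not named here, so the following is only a sketch of the same idea under an assumption: it uses the `soffice` binary shipped with LibreOffice/OpenOffice in headless mode. The helper names `build_convert_command` and `convert_to_text` are hypothetical.

```python
import os
import subprocess
import tempfile


def build_convert_command(path, outdir, binary="soffice"):
    """Return the argument list for a headless conversion to plain text.

    Assumption: a LibreOffice/OpenOffice 'soffice' binary is on PATH.
    """
    return [binary, "--headless", "--convert-to", "txt:Text",
            "--outdir", outdir, path]


def convert_to_text(path, binary="soffice"):
    """Convert a document to text in a temporary directory and return the text."""
    with tempfile.TemporaryDirectory() as outdir:
        subprocess.run(build_convert_command(path, outdir, binary), check=True)
        # soffice names the output after the input file, with a .txt extension.
        stem = os.path.splitext(os.path.basename(path))[0]
        out_path = os.path.join(outdir, stem + ".txt")
        # Assumption: the output is UTF-8; undecodable bytes are replaced
        # rather than raising, since the encoding may depend on the platform.
        with open(out_path, encoding="utf-8", errors="replace") as fh:
            return fh.read()
```

Calling `convert_to_text("report.doc")` would then return the document text as a string, provided the binary is installed on the server.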


不思量自难忘°

2020-12-07 21:37

If all you want to do is extract text from Word files (.docx), it's possible to do it with Python alone. As Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this:

"""
Module that extracts text from an MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""

import zipfile
# cElementTree was removed in Python 3.9; ElementTree is the supported module.
from xml.etree.ElementTree import XML

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text as a str.
    """
    # A .docx file is a zip archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    # Element.getiterator() was removed in Python 3.9; iter() replaces it.
    for paragraph in tree.iter(PARA):
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
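
To try the unzip-then-parse approach without a real Word file at hand, here is a small self-contained sketch. It builds an in-memory zip that contains only word/document.xml and reads the paragraphs back. The helpers `make_minimal_docx` and `extract_paragraphs` are hypothetical names, and the archive is not a complete .docx that Word would open; it only has the one part this technique reads.

```python
import io
import zipfile
from xml.etree.ElementTree import XML
from xml.sax.saxutils import escape

W_URI = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
W = '{%s}' % W_URI


def make_minimal_docx(paragraphs):
    """Build an in-memory zip holding only word/document.xml (test helper)."""
    body = ''.join('<w:p><w:r><w:t>%s</w:t></w:r></w:p>' % escape(p)
                   for p in paragraphs)
    xml = ('<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>'
           % (W_URI, body))
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('word/document.xml', xml)
    buf.seek(0)
    return buf


def extract_paragraphs(docx_file):
    """Return the non-empty paragraphs of a .docx path or file-like object."""
    with zipfile.ZipFile(docx_file) as zf:
        tree = XML(zf.read('word/document.xml'))
    result = []
    for para in tree.iter(W + 'p'):
        # A paragraph's text is split across one or more <w:t> runs.
        text = ''.join(t.text for t in para.iter(W + 't') if t.text)
        if text:
            result.append(text)
    return result


print(extract_paragraphs(make_minimal_docx(['Hello', 'Word & XML'])))
```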


孤城傲影

2020-12-07 21:40

Honestly, don't use "pip install tika": it was developed for a single user (one developer working on their laptop), not for multi-user setups (several developers).

The small class TikaWrapper.py below, which runs Tika from the command line, is more than enough to meet our needs.

You just have to instantiate this class with the JAVA_HOME path and the Tika jar path, and that's all! It works well for a lot of formats (e.g. PDF, DOCX, ODT, XLSX, PPT, etc.).

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
# Developed by Philippe ROSSIGNOL
#####################
# TikaWrapper class #
#####################
class TikaWrapper:

    java_home = None
    tika_lib_path = None

    # Constructor
    def __init__(self, java_home, tikalib_path):
        self.java_home = java_home
        self.tika_lib_path = tikalib_path

    def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract metadata from a document
        
        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)
        
        - Examples:
          metadata = extractMetadata(filePath="MyDocument.docx")
          metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        if (returnTuple): return out, err
        return out

    def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract text from a document
        
        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)
        
        - Examples:
          text = extractText(filePath="MyDocument.docx")
          text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        if (returnTuple): return out, err
        return out

    # ===========
    # = PRIVATE =
    # ===========

    _cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
    _cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"

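    def _getCmd(self, cmdModel, filePath, encoding):
        import shlex
        # The command is run through a POSIX shell, so every substituted value
        # is quoted to survive spaces and shell metacharacters in paths.
        cmd = cmdModel.replace("${JAVA_HOME}", shlex.quote(self.java_home))
        cmd = cmd.replace("${TIKALIB_PATH}", shlex.quote(self.tika_lib_path))
        cmd = cmd.replace("${ENCODING}", shlex.quote(encoding))
        cmd = cmd.replace("${FILE_PATH}", shlex.quote(filePath))
        return cmd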

    def _execute(self, cmd, encoding):
        import subprocess
        process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = process.communicate()
        out = out.decode(encoding=encoding)
        err = err.decode(encoding=encoding)
        return out, err
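
As a variation on the wrapper above, the same Tika command line can be run without a shell by passing an argument list, so that each path reaches Java as a single argument. This is only a sketch: `build_tika_command` and `run_tika` are hypothetical names, and the flags are the ones the class above already uses.

```python
import os
import subprocess


def build_tika_command(java_home, tika_jar, file_path, mode="text",
                       encoding="UTF-8"):
    """Build the argument list for the Tika command line.

    mode is "text" or "metadata", matching the --text and --metadata
    flags used by the TikaWrapper class.
    """
    if mode not in ("text", "metadata"):
        raise ValueError("mode must be 'text' or 'metadata'")
    cmd = [os.path.join(java_home, "bin", "java"), "-jar", tika_jar]
    if mode == "text":
        # As in the wrapper, the encoding flag is only passed for text output.
        cmd.append("--encoding=" + encoding)
    cmd.extend(["--" + mode, file_path])
    return cmd


def run_tika(java_home, tika_jar, file_path, mode="text", encoding="UTF-8"):
    """Run Tika and return (stdout, stderr) decoded with the given encoding."""
    cmd = build_tika_command(java_home, tika_jar, file_path, mode, encoding)
    # No shell=True: a file name with spaces needs no quoting here.
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout.decode(encoding), proc.stderr.decode(encoding)
```

For example, `run_tika(java_home, tika_jar, "My Document.docx")` would return the extracted text and whatever Tika wrote to stderr.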


天命终不由人

2020-12-07 21:47

I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have embedded this in Python functions, so it is easy to use from the parsing system (which is written in Python).

import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 no longer exists in Python 3; subprocess.run replaces it.
    # Passing a list avoids shell quoting problems with unusual filenames.
    result = subprocess.run(['catdoc', '-w', filename],
                            capture_output=True, text=True)
    if not result.stderr:
        return result.stdout
    else:
        raise OSError("Executing the command caused an error: %s" % result.stderr)

# similar doc_to_text_antiword()

The -w switch to catdoc turns off line wrapping, BTW.


予麋鹿

2020-12-07 21:47
tika-python

A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: It also works nicely with PyInstaller.

Install with pip:
```
pip install tika
```
Sample:
```
#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # the metadata of the file
print(parsed["content"])  # the text content of the file
```
Link to official GitHub