Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platfo
tika-python
A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1,500 file formats.
Note: It also works well with PyInstaller.
Install with pip:
pip install tika
Sample:
#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # metadata of the file
print(parsed["content"])   # extracted text content of the file
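For a web app you will probably want to guard against files that yield no text, since parsed["content"] can be None and may be padded with blank lines. Here is a minimal sketch of that, assuming the result dict has the "content" key shown above; the helper names extract_text and extract_text_from_file are made up for illustration and are not part of tika-python.

```python
def extract_text(parsed):
    """Return the stripped text from a Tika result dict, or '' if there is none.

    `parsed` is assumed to be the dict returned by tika's parser.from_file().
    """
    content = (parsed or {}).get("content")
    return content.strip() if content else ""


def extract_text_from_file(path):
    """Parse `path` with tika-python and return its plain text (hypothetical helper)."""
    # Imported lazily so extract_text() stays usable without tika installed.
    from tika import parser
    return extract_text(parser.from_file(path))
```

Usage would be something like text = extract_text_from_file('/path/to/file'). Keep in mind that tika-python talks to a Tika server running on Java, so a Java runtime needs to be available on the host.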
Link to official GitHub