Best way to extract text from a Word doc without using COM/automation?


遇见更好的自我 2020-12-07 21:29

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform...

10 answers

予麋鹿 (original poster)

2020-12-07 21:47
tika-python

A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: it also works well with PyInstaller.

Install with pip:
```
pip install tika
```
Sample:
```
#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # metadata of the file
print(parsed["content"])   # text content of the file
```
Link to official GitHub
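
One thing that may help in a web app: as far as I can tell, `parsed["content"]` can be `None` when Tika finds no extractable text, and the returned text often carries a lot of blank lines. Below is a minimal sketch of a wrapper around the sample above. The helper names `clean_content` and `extract_text` are my own, not part of tika-python, and the whitespace handling is just one reasonable choice. Keep in mind that tika-python talks to a Tika server under the hood, so a Java runtime needs to be available on the host.

```python
import re


def clean_content(parsed):
    """Return tidy plain text from a Tika-style result dict, or "" if none.

    `parsed` is assumed to look like the dict returned by parser.from_file,
    i.e. it may have a "content" key whose value is a string or None.
    """
    content = parsed.get("content") or ""
    # Strip leading/trailing whitespace on every line.
    lines = [line.strip() for line in content.splitlines()]
    text = "\n".join(lines)
    # Collapse three or more consecutive newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_text(path):
    """Hypothetical convenience wrapper: parse a file with Tika and clean it."""
    # Imported lazily so clean_content can be used without tika installed.
    from tika import parser

    return clean_content(parser.from_file(path))
```

Usage would be something like `text = extract_text('/path/to/file')`, with an empty string meaning that nothing could be extracted.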