Best way to extract text from a Word doc without using COM/automation?

遇见更好的自我 2020-12-07 21:29

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platfo

10 answers
  •  予麋鹿 (OP)
    2020-12-07 21:47

    tika-python

    A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

    Note: It also works well with PyInstaller.

    Install with pip:

    pip install tika
    

    Sample:

    #!/usr/bin/env python
    from tika import parser

    parsed = parser.from_file('/path/to/file')
    print(parsed["metadata"])  # metadata of the file
    print(parsed["content"])   # extracted text content of the file
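
    For a web app it may help to wrap the call above in a small helper. The following is only a sketch: `clean_text` and `extract_text` are hypothetical names, not part of tika-python, and it assumes that `parsed["content"]` can be `None` when no text is extracted. Note that tika-python talks to a Tika server under the hood, so a Java runtime needs to be available on the host.

```python
def clean_text(content):
    """Normalize extracted text: None becomes '', blank lines are dropped."""
    if not content:
        return ""
    lines = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in lines if line)


def extract_text(path):
    """Return the plain text of the document at `path` (hypothetical helper)."""
    # Imported lazily so this module loads even where tika is not installed.
    # Requires `pip install tika` and a Java runtime.
    from tika import parser

    parsed = parser.from_file(path)
    return clean_text(parsed.get("content"))
```

    Usage would then be something like `text = extract_text('/path/to/file')`, with an empty string coming back for documents that contain no extractable text.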
    

    Link to official GitHub
