Solution to convert PDF, DOC, and DOCX files into a textual format with Python

独厮守ぢ 2021-01-16 17:29

I am developing a full-text search engine for indexing popular binary formats. I know that there are hundreds of such questions (and solutions) already, but I found it toug

4 answers
  •  猫巷女王i
    2021-01-16 18:10

    One possible solution is to use Google Docs to extract the text contents from binary .doc files. You upload the document to Google Docs and then download the text contents. It is a fairly slow process, but it is the only "pure Python" solution I know of, since it doesn't require any external tools, only network access. An external tool such as catdoc or antiword is a much better solution if you are allowed to install it on your host.
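
If installing an external tool is an option, calling it from Python could look roughly like the sketch below. It assumes that antiword or catdoc is on the PATH and that the tool prints the extracted text to stdout when given a file path; the function names `find_converter` and `extract_doc_text` are made up for this example, and the UTF-8 decoding is a guess that may need adjusting for your locale.

```python
import shutil
import subprocess

# Assumption: antiword or catdoc is installed and on PATH; both are expected
# to print the text of a .doc file to stdout when given the file path.
DEFAULT_CONVERTERS = ("antiword", "catdoc")


def find_converter(candidates=DEFAULT_CONVERTERS):
    """Return the full path of the first available converter, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def extract_doc_text(doc_path, candidates=DEFAULT_CONVERTERS):
    """Extract text from a binary .doc file via an external converter."""
    converter = find_converter(candidates)
    if converter is None:
        raise RuntimeError("no converter found; tried: " + ", ".join(candidates))
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run([converter, doc_path], capture_output=True, check=True)
    # Output encoding depends on the tool and locale; UTF-8 is assumed here.
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be something like `text = extract_doc_text("report.doc")`, where `report.doc` is a placeholder file name. The tool is looked up at call time, so the same code works on hosts that have only one of the two installed.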
