Python — Parsing files (docx, pdf and odt) and converting the content into my data model

Posted by 旧巷老猫 on 2019-12-06 04:31:56

Textract is the best tool that I have encountered so far for parsing different file formats.

It handles most common document formats, including .docx, .pdf and .odt.

The project is hosted on GitHub.

It also has official online documentation.
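As a rough illustration of how Textract is typically called, here is a minimal sketch. `textract.process` is the library's entry point and returns bytes; the wrapper name `extract_text` and its `process` parameter are my own hypothetical additions, so that a different extractor can be swapped in.

```python
def extract_text(path, process=None):
    """Return the text content of a document as a str.

    `process` is a hypothetical hook: any callable that takes a path and
    returns UTF-8 encoded bytes. By default textract.process is used.
    """
    if process is None:
        # Third-party dependency (assumed installed with: pip install textract)
        import textract
        process = textract.process
    raw = process(path)  # textract returns bytes, not str
    return raw.decode("utf-8")
```

Called as `extract_text("report.pdf")`, this picks the parser from the file extension, so the same call should work for .docx, .pdf and .odt files, provided the external tools Textract relies on for a given format are installed.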

(Python 3 answer)

When I was looking for a tool to read .docx files, I was able to find one here: http://etienned.github.io/posts/extract-text-from-word-docx-simply/

It simply extracts the text from a .docx file and returns it as a string. Paragraphs remain clearly separated by newlines, but all other formatting is lost. I think footnotes and endnotes are lost as well, but if you only want the body of the text, it works great.

I have tested it on both Windows 10 and on OS X, and it has worked successfully on both. Here is what it imports:

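import zipfile

# xml.etree.cElementTree was deprecated in Python 3.3 and removed in 3.9;
# on those versions the import fails and the fallback below is used.
try:
    from xml.etree.cElementTree import XML
    print("cElementTree")  # reports which implementation was loaded
except ImportError:
    from xml.etree.ElementTree import XML
    print("ElementTree")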
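The function itself is only linked, not reproduced here. The following is a minimal sketch of the approach as described above, not the linked author's exact code: a .docx file is a zip archive, the body lives in 'word/document.xml', and the text sits in `w:t` elements inside `w:p` paragraphs. The names `get_docx_text`, `WORD_NS`, `PARA` and `TEXT` are my own.

```python
import zipfile
from xml.etree.ElementTree import XML

# WordprocessingML namespace used by the elements in word/document.xml
WORD_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PARA = WORD_NS + "p"  # paragraph element
TEXT = WORD_NS + "t"  # text run content


def get_docx_text(path):
    """Return the body text of a .docx file, paragraphs separated by blank lines."""
    with zipfile.ZipFile(path) as docx:
        xml_content = docx.read("word/document.xml")
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.iter(PARA):
        # Join all text runs of one paragraph; skip runs with no text
        texts = [node.text for node in paragraph.iter(TEXT) if node.text]
        if texts:
            paragraphs.append("".join(texts))
    return "\n\n".join(paragraphs)
```

Only the raw text survives: bold, italics, tables and images are ignored, and empty paragraphs are dropped.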

EDIT:

If, in the body of the function, you replace

'word/document.xml'

with

'word/footnotes.xml'

or

'word/endnotes.xml'

you can get the footnotes and endnotes, respectively.

The markers for where they were in the text are lost, however.
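To make the part name a parameter instead of editing the function body, something like the sketch below could be used. It assumes the same zip-and-XML layout as the body text; `get_docx_part_text` and `get_docx_notes` are hypothetical names of mine, and the check for a missing part is my own addition, since a document may not contain a footnotes or endnotes part at all.

```python
import zipfile
from xml.etree.ElementTree import XML

# WordprocessingML namespace shared by document, footnote and endnote parts
WORD_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_docx_part_text(path, part):
    """Return the paragraph text of one XML part, or "" if the part is absent."""
    with zipfile.ZipFile(path) as docx:
        if part not in docx.namelist():
            return ""  # this document has no such part
        tree = XML(docx.read(part))

    paragraphs = []
    for paragraph in tree.iter(WORD_NS + "p"):
        text = "".join(node.text or "" for node in paragraph.iter(WORD_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)


def get_docx_notes(path):
    """Collect footnote and endnote text; positions in the body are not recovered."""
    return {
        "footnotes": get_docx_part_text(path, "word/footnotes.xml"),
        "endnotes": get_docx_part_text(path, "word/endnotes.xml"),
    }
```

The notes come back as plain text in file order, so they still cannot be matched to the places in the body where they were referenced.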
