Python — Parsing files (docx, pdf and odt) and converting the content into my data model

Question

I'm writing an import/export tool for importing docx, pdf, and odt files in which a book has been written.

We already have a tool for the .epub format, and we'd like to extend the functionality beyond that, so users of the site can have more flexibility.

So far I've looked at PDFMiner, and I've also found out that docx is based on the Office Open XML format: the file is a ZIP archive, and word/document.xml is essentially the part containing the whole text, which I can parse with lxml.

The question I have is: I'm hoping to parse the contents of these files, and from that content, extract things like chapter names, images (if any), and chapter text, so that I can fit the content into a data model of:

Book --> o2m --> Chapter --> o2m --> Image
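As a rough illustration of that one-to-many model, here is a minimal sketch using plain dataclasses. The field names (title, text, filename, data) are assumptions made for the example and are not taken from the actual site's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Image:
    filename: str      # hypothetical field: image name inside the source file
    data: bytes = b""  # hypothetical field: raw image bytes


@dataclass
class Chapter:
    title: str
    text: str = ""
    # Chapter --> o2m --> Image
    images: List[Image] = field(default_factory=list)


@dataclass
class Book:
    title: str
    # Book --> o2m --> Chapter
    chapters: List[Chapter] = field(default_factory=list)
```

Whatever parser is used for a given file type would then only need to produce a Book populated with Chapter and Image objects.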

Clearly, PDFMiner has a .get_outlines() function that will return the TOC for me. But it can't link any of the returned tuples (chapter numbers and titles) to the actual pages for that chapter.

Even more problematic is docx/odt: there the content is just a flat run of paragraph elements (<w:sdt>) with attributes and child elements.

I'm looking for idea(s) to extrapolate some sense of structure from these filetypes, and if need be, I can apply those ideas (2 or 3) as suggested formats for our users who wish to import a book via one of those file formats.
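One convention that could be suggested to users is to mark every chapter title with Word's built-in "Heading 1" style. The sketch below shows how chapters might then be recovered from word/document.xml. It uses only the standard library instead of lxml, and it assumes the heading style id is "Heading1", which is typical for English-language Word documents but not guaranteed. The function name split_docx_into_chapters is hypothetical.

```python
import zipfile
from xml.etree import ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_docx_into_chapters(source, heading_style="Heading1"):
    """Return a list of (title, text) tuples, one per chapter.

    source: path or binary file-like object holding a .docx file.
    heading_style: style id assumed to mark chapter titles.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    chapters = []  # each entry is [title, [paragraph text, ...]]
    for para in root.iter(W + "p"):
        # A paragraph's text is spread over one or more <w:t> nodes.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        style = para.find(W + "pPr/" + W + "pStyle")
        style_id = style.get(W + "val") if style is not None else None
        if style_id == heading_style:
            chapters.append([text, []])      # a heading starts a new chapter
        elif text and chapters:
            chapters[-1][1].append(text)     # body text joins the open chapter
        # Text before the first heading is ignored in this sketch.
    return [(title, "\n".join(body)) for title, body in chapters]
```

Each returned tuple could be turned into one Chapter record. Images are not handled here, and an odt file would need the same idea applied to its own content.xml and heading elements.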

Answer 1:

Textract is the best tool that I have encountered so far for parsing different file formats.

It can parse most common file formats.

You can find the project on GitHub

Here is the official documentation

Answer 2:

(Python 3 answer)

When I was looking for a tool to read .docx files, I was able to find one here: http://etienned.github.io/posts/extract-text-from-word-docx-simply/

What it does is simply get the text from a .docx file and return it as a string; separate paragraphs are still clearly separate, as there are the new lines between, but all other formatting is lost. I think this may include the loss of end- and foot-notes, but if you want the body of a text, it works great.

I have tested it on both Windows 10 and on OS X, and it has worked successfully on both. Here is what it imports:

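import zipfile

# xml.etree.cElementTree was removed in Python 3.9, so newer interpreters
# fall through to the plain ElementTree import below.
try:
    from xml.etree.cElementTree import XML
    print("cElementTree")
except ImportError:
    from xml.etree.ElementTree import XML
    print("ElementTree")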
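For readers who cannot open the link, here is a minimal sketch of a function that behaves as described above: it reads word/document.xml from the archive and joins the text of each paragraph. It is not a copy of the linked post's code, and the name docx_to_text is hypothetical.

```python
import zipfile
from xml.etree.ElementTree import XML

# WordprocessingML namespace and the two tags this sketch needs
WORD_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PARA = WORD_NS + "p"
TEXT = WORD_NS + "t"


def docx_to_text(source):
    """Return the body text of a .docx file as a single string.

    source: path or binary file-like object holding a .docx file.
    """
    with zipfile.ZipFile(source) as archive:
        tree = XML(archive.read("word/document.xml"))

    paragraphs = []
    for para in tree.iter(PARA):
        # Collect the text runs of this paragraph, skipping empty nodes.
        texts = [node.text for node in para.iter(TEXT) if node.text]
        if texts:
            paragraphs.append("".join(texts))
    # Paragraphs are separated by a blank line; all other formatting is lost.
    return "\n\n".join(paragraphs)
```

Footnotes and endnotes live in separate parts of the archive (word/footnotes.xml and word/endnotes.xml), so a function that reads only word/document.xml will not return them.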

EDIT:

If, in the body of the function, you replace