I've been using Python to implement a custom parser and use that parsed data to format a Word document to be distributed internally. All of the formatting has been straight
The key to these workaround functions is having an example of XML that works and comparing it against the XML you generate. If the XML you generate matches the working example, it will work every time. opc-diag is handy for inspecting the XML in a Word document. Working with really small documents (a single paragraph or a two-row table, for analysis purposes) makes it a lot easier to work out how Word is structuring the XML.
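If you just want a quick look without a separate tool, something like the following may do. It is only a sketch using the standard library, not a replacement for opc-diag; it assumes the main document part lives at word/document.xml (the usual location), and the function name document_xml is made up for illustration.

```python
import zipfile
from xml.dom import minidom


def document_xml(docx_file):
    """Return the main document part of a .docx as indented XML text.

    `docx_file` may be a path or a binary file-like object.
    Assumption: the main part is stored at word/document.xml.
    """
    # A .docx file is a zip archive, so zipfile can read its parts directly.
    with zipfile.ZipFile(docx_file) as package:
        raw = package.read("word/document.xml")
    # Re-indent the XML so a working example and your own output can be diffed.
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Usage (hypothetical file name):
#     print(document_xml("tiny.docx"))
```

Dumping both the known-good document and your generated one this way lets you compare them with an ordinary text diff.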
An important thing to note is that the XML elements in a Word document are sequence sensitive, meaning the child elements within any other element generally have a set order in which they must appear. If you get this swapped around, you get the "repair" error you mentioned.
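To make the ordering point concrete, here is a rough sketch of inserting a child "in sequence". It uses only the standard library; the helper name insert_in_sequence is hypothetical, and PPR_ORDER is an abbreviated subset of the w:pPr child order that I've written out for illustration. Take the full list from the schema before relying on it.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)

# Abbreviated, illustrative subset of the child order for w:pPr.
# Assumption: the real schema lists more elements between and after these.
PPR_ORDER = (
    "pStyle", "keepNext", "keepLines", "pageBreakBefore", "widowControl",
    "numPr", "pBdr", "shd", "tabs", "spacing", "ind", "jc",
)


def w(tag):
    """Return the namespace-qualified (Clark notation) name for a w: tag."""
    return "{%s}%s" % (W, tag)


def insert_in_sequence(parent, child, order=PPR_ORDER):
    """Insert `child` before the first sibling that must come after it."""
    local = child.tag.split("}")[1]
    successors = {w(name) for name in order[order.index(local) + 1:]}
    for idx, sibling in enumerate(parent):
        if sibling.tag in successors:
            parent.insert(idx, child)
            return child
    # No successor present, so the end of the parent is the right place.
    parent.append(child)
    return child


pPr = ET.Element(w("pPr"))
ET.SubElement(pPr, w("pStyle"))
ET.SubElement(pPr, w("jc"))
insert_in_sequence(pPr, ET.Element(w("keepNext")))
names = [el.tag.split("}")[1] for el in pPr]
print(names)  # ['pStyle', 'keepNext', 'jc']
```

Simply appending w:keepNext here would have put it after w:jc, which is the kind of swap that triggers the repair prompt.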
I find it much easier to manipulate the XML from within python-docx, as it takes care of all the unzipping and rezipping for you, along with a lot of the other details.
To get the sequencing right, you'll need to be familiar with the XML Schema specifications for the elements you're working with. There is an example here: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/text/paragraph-format.html
The full schema is in the code tree under ref/xsd/. Most of the elements for text are in the wml.xsd file (wml stands for WordProcessing Markup Language).
You can find examples of other so-called "workaround functions" by searching on "python-docx" workaround function. Pay particular attention to the parse_xml() and OxmlElement() functions, which allow you to create new XML subtrees and individual elements respectively. XML elements can be positioned using regular lxml._Element methods, since all XML elements in python-docx are based on lxml: http://lxml.de/api/lxml.etree._Element-class.html