Question
The python-docx library works with Word documents. The code below extracts all paragraphs and tables in document order and appends them to a list.
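from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    # Pick the XML element whose direct children are the block-level items.
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    # Wrap each w:p / w:tbl child in its python-docx proxy object.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


# `document` is assumed to be an already-loaded python-docx Document.
document_as_list = []
for block in iter_block_items(document):
    if isinstance(block, Paragraph):
        document_as_list.append(block.text)
    elif isinstance(block, Table):
        document_as_list.append(block)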
However, the code above does not extract images from the document; it only handles paragraphs and tables. Each image in the document has a unique 'rId'. I already have code that extracts all the images from the Word document as a whole.
But the requirement is to extract the images in document order. It would be enough to at least append the 'rId' of every image to the list 'document_as_list' as it appears in document order, along with the paragraphs and tables. I know this means working with the XML of the Word document, but I don't know how to turn that into code. Can somebody help me out?
I have already gone through the following Stack Overflow questions, but I have not been able to figure out a good way to do this.
Link1
Link2
Answer 1:
I have shared the answer to this at the following GitHub link:
Reading paragraphs, tables and images in document order from .docx
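For readers who cannot follow the link, here is a minimal sketch of one way to collect image relationship ids in document order. It is not necessarily the approach taken in the linked repository. It assumes inline images appear as a:blip elements carrying an r:embed attribute, which is the usual DrawingML layout; legacy VML pictures and linked (non-embedded) images are not covered. The helper names image_rids and body_items are hypothetical and not part of python-docx. Within a single paragraph the sketch lists the text first and then the images, which is a simplification.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML and DrawingML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def image_rids(element):
    """Return the r:embed ids of every a:blip under *element*, in order.

    Hypothetical helper. It relies only on .iter() and .get(), so it accepts
    both stdlib ElementTree elements and the lxml elements python-docx uses.
    """
    rids = []
    for blip in element.iter(f"{{{A_NS}}}blip"):
        rid = blip.get(f"{{{R_NS}}}embed")
        if rid is not None:
            rids.append(rid)
    return rids


def body_items(body):
    """Walk the direct children of a w:body element in document order.

    Returns a list of ("paragraph", text), ("image", rId) and
    ("table", element) tuples. Assumption: text of a paragraph is listed
    before the images found in that same paragraph.
    """
    items = []
    for child in body:
        if child.tag == f"{{{W_NS}}}p":
            text = "".join(t.text or "" for t in child.iter(f"{{{W_NS}}}t"))
            if text:
                items.append(("paragraph", text))
            items.extend(("image", rid) for rid in image_rids(child))
        elif child.tag == f"{{{W_NS}}}tbl":
            items.append(("table", child))
    return items


# Tiny hand-written body used only to illustrate the traversal.
sample = (
    f'<w:body xmlns:w="{W_NS}" xmlns:a="{A_NS}" xmlns:r="{R_NS}">'
    '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    '<w:p><w:r><w:drawing><a:blip r:embed="rId7"/></w:drawing></w:r></w:p>'
    '<w:tbl/>'
    '</w:body>'
)
items = body_items(ET.fromstring(sample))
```

With python-docx, the same helpers should be usable on the real document, for example body_items(document.element.body), or image_rids(block._p) inside the loop over iter_block_items for each Paragraph block, appending the returned ids to document_as_list.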
Source: https://stackoverflow.com/questions/56787961/how-to-retrieve-paragraphs-tables-and-imagesinline-shapes-by-document-order-i