How to retrieve paragraphs, tables and images (inline shapes) in document order in Python using the docx library

Question

The python-docx library works with Word documents. The code below extracts all paragraphs and tables in document order and appends them to a list.

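# `doctwo` is an alias for the Document class, so that the name `Document`
# stays free for the docx.Document() factory used to open files.
from docx.document import Document as doctwo
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table, _Cell
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, doctwo):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    # Walk the direct children of the body (or cell) in the order they appear.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)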

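# `document` is assumed to be an already opened python-docx document.
document_as_list = []
for block in iter_block_items(document):
    # Check the type directly instead of searching the object's repr string.
    if isinstance(block, Paragraph):
        document_as_list.append(block.text)
    elif isinstance(block, Table):
        document_as_list.append(block)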

But the above code doesn't extract images from the document. It works only for paragraphs and tables. Each image in the document has a unique 'rID'. I already have code that extracts all the images from the Word document as a whole.

But the requirement is to extract the images in document order. It is enough if I can at least append the 'rID' of every image to the list 'document_as_list' as the images appear in document order, along with the paragraphs and tables. I know this requires working with the XML of the Word document, but I don't know how to turn that into code. Can somebody help me out?

I have already gone through the following Stack Overflow questions, but I have not been able to figure out a good way to do this.

Link1

Link2

Answer 1:

I have shared the answer to this at the following GitHub link:

Reading paragraphs, tables and images in document order from .docx
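
As an illustration of the general idea (not necessarily what the linked repository does), here is a minimal sketch that reads word/document.xml directly with the Python standard library instead of python-docx. The helper names blocks_in_order and read_docx_blocks are made up for this example. The sketch walks the direct children of w:body and, for every paragraph, collects the r:embed attribute of each a:blip element, which holds the relationship ID of an embedded picture. It assumes that image IDs may be listed after the text of the paragraph that contains them, and it does not report pictures inside table cells.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def blocks_in_order(body):
    """Return (kind, value) tuples for the direct children of a w:body element.

    Hypothetical helper for illustration. kind is 'paragraph', 'image' or 'table'.
    """
    items = []
    for child in body:
        if child.tag == f"{{{W}}}p":
            # Join the text of all w:t descendants; skip paragraphs with no text.
            text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
            if text:
                items.append(("paragraph", text))
            # Each embedded picture has an a:blip whose r:embed is its rId.
            for blip in child.iter(f"{{{A}}}blip"):
                rid = blip.get(f"{{{R}}}embed")
                if rid:
                    items.append(("image", rid))
        elif child.tag == f"{{{W}}}tbl":
            # Represent a table as rows of cell texts (top-level rows only).
            rows = []
            for tr in child.findall(f"{{{W}}}tr"):
                row = []
                for tc in tr.findall(f"{{{W}}}tc"):
                    row.append("".join(t.text or "" for t in tc.iter(f"{{{W}}}t")))
                rows.append(row)
            items.append(("table", rows))
    return items


def read_docx_blocks(path):
    """Open a .docx file (a zip archive) and list its body blocks in order."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return blocks_in_order(root.find(f"{{{W}}}body"))
```

Calling read_docx_blocks on the path of a .docx file returns one list in which paragraph texts, image relationship IDs and tables appear in the order they occur in the document body. The IDs can then be matched against whatever code already extracts the image files.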

Source: https://stackoverflow.com/questions/56787961/how-to-retrieve-paragraphs-tables-and-imagesinline-shapes-by-document-order-i

Tags

python-3.x

python-docx