How to retrieve paragraphs, tables and images (inline shapes) in document order in Python using the docx library

Submitted by 偶尔善良 on 2020-01-06 05:59:07

Question


The python-docx library works with Word documents. The code below extracts all paragraphs and tables in document order and appends them to a list.

from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

# `document` is assumed to be an already-opened document, e.g. docx.Document(path).
document_as_list = []
for block in iter_block_items(document):
    if isinstance(block, Paragraph):
        document_as_list.append(block.text)  # keep the paragraph's text
    elif isinstance(block, Table):
        document_as_list.append(block)       # keep the Table object itself
However, the code above does not extract images from the document. It only handles paragraphs and tables. Each image in the document has a unique relationship ID ('rId'). I already have code that extracts all the images from the Word document at once.

The requirement, though, is to extract the images in document order. It would be enough to append the 'rId' of every image to the list 'document_as_list' as it appears in document order, alongside the paragraphs and tables. I know this means working with the XML of the Word document, but I don't know how to turn that into code. Can somebody help me out?

I have already gone through the following Stack Overflow questions, but I could not work out a good way to do this.

Link1

Link2


Answer 1:


I have shared the answer to this at the following GitHub link:

Reading paragraphs, tables and images in document order from .docx
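
For readers who cannot follow the link, here is a minimal sketch of one possible approach. It is not necessarily what the linked repository does. It assumes that inline pictures are stored as DrawingML `a:blip` elements whose `r:embed` attribute holds the relationship ID, which is the usual layout for pictures inserted by Word. The function names `image_rids` and `body_items_in_order` and the simplified `SAMPLE_BODY` fragment are made up for illustration; a real document nests `a:blip` much deeper inside `w:drawing`.

```python
import xml.etree.ElementTree as ET

# Namespaces in Clark notation, as understood by both ElementTree and lxml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"


def image_rids(element):
    """Return the rIds of pictures embedded anywhere under *element*, in order."""
    return [
        blip.get(R + "embed")
        for blip in element.iter(A + "blip")
        if blip.get(R + "embed") is not None
    ]


def body_items_in_order(body):
    """Walk the direct children of a <w:body> element in document order.

    Returns ("paragraph", text), ("image", rId) and ("table", element) tuples.
    Images are listed right after the text of the paragraph that holds them.
    """
    items = []
    for child in body:
        if child.tag == W + "p":
            text = "".join(t.text or "" for t in child.iter(W + "t"))
            items.append(("paragraph", text))
            items.extend(("image", rid) for rid in image_rids(child))
        elif child.tag == W + "tbl":
            items.append(("table", child))
    return items


# Simplified, hand-written fragment (an assumption, for demonstration only).
SAMPLE_BODY = """
<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
        xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:p><w:r><w:t>Intro text</w:t></w:r></w:p>
  <w:p><w:r><w:drawing><a:blip r:embed="rId7"/></w:drawing></w:r></w:p>
  <w:tbl/>
</w:body>
"""

items = body_items_in_order(ET.fromstring(SAMPLE_BODY.strip()))
for kind, value in items:
    print(kind, value)
```

With python-docx, `document.element.body` can be passed as *body*, since it is an lxml element that supports the same iteration calls. The bytes of an image should then be reachable through `document.part.related_parts[rId].blob`. Note that this sketch keeps each table as a single item, so pictures inside table cells are not listed separately, and legacy VML pictures are not detected.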



Source: https://stackoverflow.com/questions/56787961/how-to-retrieve-paragraphs-tables-and-imagesinline-shapes-by-document-order-i
