Is there any way to read a .docx file, including its auto-numbering, using python-docx?

天命终不由人 2021-02-01 06:49

Problem statement: Extract sections from a .docx file, including their auto-numbering.

I tried using python-docx to extract the text from a .docx file, but it leaves out the auto-numbering.

2 answers
  • 2021-02-01 07:05

    It appears that currently python-docx v0.8 does not fully support numbering. You need to do some hacking.

    First, to iterate over all of the document's paragraphs, you need to write your own iterator. Here is a working one:

    import docx.document
    import docx.oxml.table
    import docx.oxml.text.paragraph
    import docx.table
    import docx.text.paragraph
    
    
    def iter_paragraphs(parent, recursive=True):
        """
        Yield each paragraph within *parent*, in document order.
        Each yielded value is an instance of Paragraph. If *recursive* is true,
        paragraphs inside table cells (including nested tables) are yielded too.
        *parent* would most commonly be a reference to a main Document object, but
        also works for a _Cell object, which itself can contain paragraphs and tables.
        """
        if isinstance(parent, docx.document.Document):
            parent_elm = parent.element.body
        elif isinstance(parent, docx.table._Cell):
            parent_elm = parent._tc
        else:
            raise TypeError(repr(type(parent)))
    
        for child in parent_elm.iterchildren():
            if isinstance(child, docx.oxml.text.paragraph.CT_P):
                yield docx.text.paragraph.Paragraph(child, parent)
            elif isinstance(child, docx.oxml.table.CT_Tbl):
                if recursive:
                    table = docx.table.Table(child, parent)
                    for row in table.rows:
                        for cell in row.cells:
                            # Merged cells are reported once per grid column,
                            # so their paragraphs can be yielded more than once.
                            yield from iter_paragraphs(cell)
    

    You can use it to find all document paragraphs including paragraphs in table cells.

    For instance:

    import docx
    
    document = docx.Document("sample.docx")
    for paragraph in iter_paragraphs(document):
        print(paragraph.text)
    

    To access the numbering properties, you need to reach into the "protected" members: paragraph._p.pPr.numPr is a docx.oxml.numbering.CT_NumPr object. Both pPr and numPr can be None, so check each one:

    for paragraph in iter_paragraphs(document):
        p_pr = paragraph._p.pPr  # None when the paragraph has no <w:pPr> element
        num_pr = p_pr.numPr if p_pr is not None else None
        if num_pr is not None:
            print(num_pr)  # type: docx.oxml.numbering.CT_NumPr
    

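    Note that this element only references a numbering definition, through its numId and ilvl children. The definitions themselves are stored in the numbering.xml file (inside the docx), if it exists.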
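
    To make that reference concrete, here is roughly what a numbered paragraph looks like in word/document.xml, and how the two values could be read with only the standard library. This is a sketch, not python-docx API: the sample XML is hand-written and the helper name read_num_pr is made up for illustration. Only the element names and the namespace come from the WordprocessingML schema.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample of a numbered paragraph (not taken from a real file).
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:pPr>
    <w:numPr>
      <w:ilvl w:val="1"/>
      <w:numId w:val="3"/>
    </w:numPr>
  </w:pPr>
  <w:r><w:t>Scope</w:t></w:r>
</w:p>
"""


def read_num_pr(p):
    """Return (numId, ilvl) for a <w:p> element, or None if it is not numbered."""
    num_id = p.find("w:pPr/w:numPr/w:numId", NS)
    if num_id is None:
        return None
    ilvl = p.find("w:pPr/w:numPr/w:ilvl", NS)
    # Assumption: a missing <w:ilvl> is treated as the top level (0).
    level = int(ilvl.get(f"{{{W}}}val")) if ilvl is not None else 0
    return int(num_id.get(f"{{{W}}}val")), level


print(read_num_pr(ET.fromstring(SAMPLE.strip())))
```

    Here numId selects a list definition in numbering.xml and ilvl is the zero-based nesting level within that list.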

    To access numbering.xml, you need to open your docx file as a package. For instance:

    import docx.package
    import docx.parts.document
    import docx.parts.numbering
    
    package = docx.package.Package.open("sample.docx")
    
    main_document_part = package.main_document_part
    assert isinstance(main_document_part, docx.parts.document.DocumentPart)
    
    numbering_part = main_document_part.numbering_part
    assert isinstance(numbering_part, docx.parts.numbering.NumberingPart)
    
    ct_numbering = numbering_part._element
    print(ct_numbering)  # CT_Numbering
    for num in ct_numbering.num_lst:
        print(num)  # CT_Num
        print(num.abstractNumId)  # CT_DecimalNumber
    

    More information is available in the Office Open XML documentation.
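
    Since the visible numbers are not stored in the file, they have to be recomputed from these references. The following is a minimal sketch of that idea using only the standard library (zipfile and ElementTree) instead of python-docx. The function name numbered_paragraphs is hypothetical, and the sketch makes strong simplifying assumptions: every level is rendered as decimal ("1.2.3"), counting starts at 1, numbering.xml is not consulted for formats or start values, numbering inherited from paragraph styles is ignored, and paragraphs inside tables are skipped.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"  # qualified name of the w:val attribute


def numbered_paragraphs(docx_path):
    """Yield (label, text) per top-level body paragraph; label is '' if unnumbered."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    counters = {}  # numId -> list of counters, one per nesting level
    for p in root.iterfind("w:body/w:p", NS):
        text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        num_id = p.find("w:pPr/w:numPr/w:numId", NS)
        # numId 0 is used to switch numbering off for a paragraph.
        if num_id is None or num_id.get(VAL) == "0":
            yield "", text
            continue

        ilvl = p.find("w:pPr/w:numPr/w:ilvl", NS)
        level = int(ilvl.get(VAL)) if ilvl is not None else 0

        levels = counters.setdefault(num_id.get(VAL), [])
        del levels[level + 1:]                      # deeper levels restart
        levels.extend([1] * (level - len(levels)))  # unseen outer levels show as 1
        if len(levels) == level:
            levels.append(0)
        levels[level] += 1
        yield ".".join(map(str, levels)), text
```

    It could be used as `for label, text in numbered_paragraphs("sample.docx"): print(label, text)`. Matching Word's rendering exactly would also require reading the number format, level text and start value of each level from numbering.xml.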

  • 2021-02-01 07:12

    There is a package, docx2python, which does this much more simply: pypi.org/project/docx2python/

    The following code:

    from docx2python import docx2python
    document = docx2python("C:/input/MyDoc.docx")
    print(document.body)
    

    produces a nested list that contains the document contents, including bullet lists, in an easily parseable form.
