Extract DOCX Comments

太阳男子 2020-12-20 03:04

I'm a teacher. I want a list of all the students who commented on the essay I assigned, and what they said. The Drive API stuff was too challenging for me, but I figured I

3 Answers
  •  [愿得一人]
    2020-12-20 03:47

    Thank you @kjhughes for this amazing answer on extracting all the comments from the document file. Like others in this thread, I was facing the same issue: getting the text that each comment relates to. I took the code from @kjhughes as a base and tried to solve this using python-docx. So here is my take on it.

    Sample document.

    I will extract each comment together with the paragraph in which it is referenced in the document.

    import zipfile

    from docx import Document
    from lxml import etree

    ooXMLns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}


    # Extract all the comments of the document (same as the accepted answer).
    # Returns a dictionary with the comment id as key and the comment string as value.
    def get_document_comments(docxFileName):
        comments_dict = {}
        # Close the archive as soon as comments.xml has been read
        with zipfile.ZipFile(docxFileName) as docxZip:
            commentsXML = docxZip.read('word/comments.xml')
        et = etree.XML(commentsXML)
        comments = et.xpath('//w:comment', namespaces=ooXMLns)
        for c in comments:
            comment = c.xpath('string(.)', namespaces=ooXMLns)
            comment_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
            comments_dict[comment_id] = comment
        return comments_dict


    # Fetch all the comments referenced in a paragraph.
    def paragraph_comments(paragraph, comments_dict):
        comments = []
        for run in paragraph.runs:
            # A run that anchors a comment contains a w:commentReference element
            comment_reference = run._r.xpath("./w:commentReference")
            if comment_reference:
                comment_id = comment_reference[0].xpath('@w:id', namespaces=ooXMLns)[0]
                comment = comments_dict[comment_id]
                comments.append(comment)
        return comments


    # Fetch all comments together with the paragraph that references them.
    # Returns a list like this: [{'Paragraph text': [comment 1, comment 2]}]
    def comments_with_reference_paragraph(docxFileName):
        document = Document(docxFileName)
        comments_dict = get_document_comments(docxFileName)
        comments_with_their_reference_paragraph = []
        for paragraph in document.paragraphs:
            if comments_dict:
                comments = paragraph_comments(paragraph, comments_dict)
                if comments:
                    comments_with_their_reference_paragraph.append({paragraph.text: comments})
        return comments_with_their_reference_paragraph


    if __name__ == "__main__":
        document = "Hi this is a test.docx"
        print(comments_with_reference_paragraph(document))

    For the sample document, the output is a list of the form [{'Paragraph text': [comment 1, comment 2]}].

    I have done this at the paragraph level. It could be done at the python-docx run level as well. Hopefully this helps.
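
To go below the paragraph level, one option is to read the comment range markers directly. The following is a minimal sketch, not the answer's own method: it swaps python-docx and lxml for the standard library's `zipfile` and `xml.etree.ElementTree`, and it assumes the commented text is delimited by `w:commentRangeStart` and `w:commentRangeEnd` elements in `word/document.xml`. The function name `commented_text_by_id` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def commented_text_by_id(docx_file):
    """Map each comment id to the document text inside its comment range.

    Hypothetical helper: assumes ranges are marked with
    w:commentRangeStart / w:commentRangeEnd in word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read("word/document.xml"))

    open_ids = set()  # ids of comment ranges that are currently open
    spans = {}        # comment id -> list of text pieces inside the range

    # iter() visits elements in document order, so the start marker,
    # the text runs and the end marker are seen in reading order.
    for el in root.iter():
        if el.tag == f"{{{W}}}commentRangeStart":
            cid = el.get(f"{{{W}}}id")
            open_ids.add(cid)
            spans.setdefault(cid, [])
        elif el.tag == f"{{{W}}}commentRangeEnd":
            open_ids.discard(el.get(f"{{{W}}}id"))
        elif el.tag == f"{{{W}}}t" and el.text:
            # Text belongs to every range that is open at this point,
            # which also covers overlapping comments.
            for cid in open_ids:
                spans[cid].append(el.text)

    return {cid: "".join(parts) for cid, parts in spans.items()}
```

The ids returned here are strings, like the keys produced by `get_document_comments`, so the two dictionaries can be joined on the comment id to show each comment next to the exact text it annotates.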

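The original question asks who commented as well as what they said. In WordprocessingML each `w:comment` element carries `w:author` and `w:date` attributes, so a variation of the comment extraction can return them too. This is a minimal sketch using only the standard library rather than lxml; the function name `comments_with_authors` is hypothetical, and the comment text is assembled from `w:t` elements only.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def comments_with_authors(docx_file):
    """Return a list of (author, date, text) tuples, one per comment.

    Hypothetical helper: reads word/comments.xml from the .docx package.
    """
    with zipfile.ZipFile(docx_file) as z:
        if "word/comments.xml" not in z.namelist():
            return []  # the document has no comments part
        root = ET.fromstring(z.read("word/comments.xml"))

    rows = []
    for c in root.iter(f"{{{W}}}comment"):
        # Concatenate the text runs of the comment body
        text = "".join(t.text or "" for t in c.iter(f"{{{W}}}t"))
        rows.append((c.get(f"{{{W}}}author"), c.get(f"{{{W}}}date"), text))
    return rows
```

Calling `comments_with_authors("Hi this is a test.docx")` on the sample file would give one tuple per comment. Either attribute comes back as `None` if the file does not record it.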