I'm a teacher. I want a list of all the students who commented on the essay I assigned, and what they said. The Drive API stuff was too challenging for me, but I figured I
Thank you @kjhughes for this amazing answer on extracting all the comments from the document file. Like others in this thread, I was struggling to get the text that each comment relates to. I took the code from @kjhughes as a base and tried to solve this using python-docx. Here is my take on it.
Sample document.
I will extract each comment along with the paragraph in which it is referenced.
from docx import Document
from lxml import etree
import zipfile
ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
# Extract all the comments of the document (same approach as the accepted answer).
# Returns a dictionary with the comment id as key and the comment string as value.
def get_document_comments(docxFileName):
    comments_dict = {}
    with zipfile.ZipFile(docxFileName) as docxZip:
        # A document without comments has no word/comments.xml part
        if 'word/comments.xml' not in docxZip.namelist():
            return comments_dict
        commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment', namespaces=ooXMLns)
    for c in comments:
        comment = c.xpath('string(.)', namespaces=ooXMLns)
        comment_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
        comments_dict[comment_id] = comment
    return comments_dict
# Fetch all the comments referenced in a paragraph
def paragraph_comments(paragraph, comments_dict):
    comments = []
    for run in paragraph.runs:
        # python-docx elements already map the 'w' prefix in xpath()
        for comment_reference in run._r.xpath("./w:commentReference"):
            # Read the namespaced w:id attribute directly
            comment_id = comment_reference.get('{%s}id' % ooXMLns['w'])
            if comment_id in comments_dict:
                comments.append(comments_dict[comment_id])
    return comments
# Fetch all comments together with the paragraph they reference.
# Returns a list like this: [{'Paragraph text': ['comment 1', 'comment 2']}]
def comments_with_reference_paragraph(docxFileName):
    document = Document(docxFileName)
    comments_dict = get_document_comments(docxFileName)
    comments_with_their_reference_paragraph = []
    for paragraph in document.paragraphs:
        if comments_dict:
            comments = paragraph_comments(paragraph, comments_dict)
            if comments:
                comments_with_their_reference_paragraph.append({paragraph.text: comments})
    return comments_with_their_reference_paragraph
if __name__ == "__main__":
    document = "Hi this is a test.docx"
    print(comments_with_reference_paragraph(document))
The output for the sample document looks like this
I have done this at the paragraph level. It could be done at the python-docx run level as well. Hopefully this helps.
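For anyone who wants something finer than whole paragraphs, here is a minimal sketch of one way to get the exact text a comment is anchored to. It is not the python-docx approach above: it uses only the standard library and relies on the `w:commentRangeStart` / `w:commentRangeEnd` markers that Word writes into `word/document.xml`. The function name `anchored_text_by_comment` is made up for this example, and it assumes every comment has both markers in the main document body.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def anchored_text_by_comment(document_xml):
    """Map each comment id to the document text between its range markers.

    Hypothetical helper: document_xml is the content of word/document.xml.
    """
    root = ET.fromstring(document_xml)
    open_ids = []   # ids of comment ranges currently open
    anchored = {}   # comment id -> list of text fragments

    for el in root.iter():  # iter() walks elements in document order
        if el.tag == W + "commentRangeStart":
            comment_id = el.get(W + "id")
            open_ids.append(comment_id)
            anchored.setdefault(comment_id, [])
        elif el.tag == W + "commentRangeEnd":
            comment_id = el.get(W + "id")
            if comment_id in open_ids:
                open_ids.remove(comment_id)
        elif el.tag == W + "t" and el.text:
            # Text belongs to every comment whose range is open here
            for comment_id in open_ids:
                anchored[comment_id].append(el.text)

    return {cid: "".join(parts) for cid, parts in anchored.items()}
```

You would feed it `zipfile.ZipFile(docxFileName).read('word/document.xml')`. The keys are the same id strings that `get_document_comments` uses, so the two dictionaries can be joined on id.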
You got remarkably far considering that OOXML is such a complex format.
Here's some sample Python code showing how to access the comments of a DOCX file via XPath:
from lxml import etree
import zipfile
ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
def get_comments(docxFileName):
    docxZip = zipfile.ZipFile(docxFileName)
    commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment', namespaces=ooXMLns)
    for c in comments:
        # attributes:
        print(c.xpath('@w:author', namespaces=ooXMLns))
        print(c.xpath('@w:date', namespaces=ooXMLns))
        # string value of the comment:
        print(c.xpath('string(.)', namespaces=ooXMLns))
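Since the goal is a list of who commented and what they said, it may be handier to return the data than to print it. Below is a rough sketch of that idea using only the standard library (`xml.etree.ElementTree` in place of lxml). The function name `list_comments` and the dictionary keys are my own choices for illustration, and it assumes the comments live in `word/comments.xml` as in the code above.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def list_comments(docx_file):
    """Return one dict per comment: id, author, date and text.

    Hypothetical helper; docx_file is a path or a file-like object.
    """
    with zipfile.ZipFile(docx_file) as docx_zip:
        if "word/comments.xml" not in docx_zip.namelist():
            return []  # a document without comments has no comments part
        root = ET.fromstring(docx_zip.read("word/comments.xml"))

    return [
        {
            "id": c.get(W + "id"),
            "author": c.get(W + "author"),
            "date": c.get(W + "date"),
            # itertext() joins all text nodes, like XPath string(.)
            "text": "".join(c.itertext()),
        }
        for c in root.iter(W + "comment")
    ]
```

From there, something like `for row in list_comments("essay.docx"): print(row["author"], row["text"])` gives one line per comment (the filename is just a placeholder).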
I used the Word Object Model to extract comments with their replies from a Word document. Documentation on the Comments object can be found here. The documentation is written for Visual Basic for Applications (VBA), but I was able to use the same functions in Python with slight modifications. The only issue with the Word Object Model is that I had to use the win32com package from pywin32, which works fine on a Windows PC, but I'm not sure if it will work on macOS.
Here's the sample code I used to extract comments with associated replies:
import win32com.client as win32
from win32com.client import constants
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
filepath = r"path\to\file.docx"  # raw string, so \t and \f are not read as escape sequences
def get_comments(filepath):
    doc = word.Documents.Open(filepath)
    doc.Activate()
    activeDoc = word.ActiveDocument
    for c in activeDoc.Comments:
        if c.Ancestor is None:  # checking if this is a top-level comment
            print("Comment by: " + c.Author)
            print("Comment text: " + c.Range.Text)  # text of the comment
            print("Regarding: " + c.Scope.Text)  # text of the original document where the comment is anchored
            if len(c.Replies) > 0:  # if the comment has replies
                print("Number of replies: " + str(len(c.Replies)))
                for r in range(1, len(c.Replies) + 1):
                    print("Reply by: " + c.Replies(r).Author)
                    print("Reply text: " + c.Replies(r).Range.Text)  # text of the reply
    doc.Close()