Extracting Highlighted Words from Word Document (.docx) in Python

执念已碎 2021-01-03 10:57

I am working with a bunch of Word documents in which some of the text (words) is highlighted (using color codes, e.g. yellow, blue, gray). Now I want to extract the highlighte

1 answer
  • 2021-01-03 11:15

    I had never worked with python-docx before, but what helped was a snippet I found online showing what the XML structure of a highlighted piece of text looks like:

    <w:r>
      <w:rPr>
        <w:highlight w:val="yellow"/>
      </w:rPr>
      <w:t>text that is highlighted</w:t>
    </w:r>
    

    From there, it was relatively straightforward to come up with this:

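    from docx import *  # legacy python-docx (0.2.x) API, which provides opendocx()

    # opendocx() returns the lxml root element of word/document.xml
    document = opendocx(r'test.docx')
    # Collect every run (<w:r>) in the document
    words = document.xpath('//w:r', namespaces=document.nsmap)
    
    WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    tag_rPr = WPML_URI + 'rPr'
    tag_highlight = WPML_URI + 'highlight'
    tag_val = WPML_URI + 'val'
    tag_t = WPML_URI + 't'  # text element inside a run
    
    for word in words:
        for rPr in word.findall(tag_rPr):
            highlight = rPr.find(tag_highlight)
            # Runs without a <w:highlight> element are skipped
            if highlight is not None and highlight.attrib.get(tag_val) == 'yellow':
                t = word.find(tag_t)
                if t is not None:
                    print(t.text)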
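
If the legacy `opendocx` helper is not available, the same idea can be sketched with only the standard library, since a .docx file is a zip archive whose main body lives in `word/document.xml`. This is a minimal sketch, not the approach used above: the function name `highlighted_runs` and its `colors` parameter are made up for illustration, and it only looks at run-level `<w:highlight>` elements in the main document part (no headers, footnotes, or style-inherited highlighting).

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}          # prefix map for find()/findall()
W = "{%s}" % W_NS         # Clark-notation prefix for tags and attributes


def highlighted_runs(docx_file, colors=None):
    """Return (color, text) pairs for each highlighted run.

    docx_file: path or file-like object for a .docx file.
    colors:    optional collection of highlight names to keep,
               e.g. {"yellow", "blue"}; None keeps every color.
    (Hypothetical helper, named here for illustration only.)
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    results = []
    for run in root.iter(W + "r"):
        highlight = run.find("w:rPr/w:highlight", NS)
        if highlight is None:
            continue  # run has no highlight element
        color = highlight.get(W + "val")
        if color is None or color == "none":
            continue  # "none" explicitly switches highlighting off
        if colors is not None and color not in colors:
            continue
        # A run may hold several <w:t> elements; join their text
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if text:
            results.append((color, text))
    return results
```

Calling `highlighted_runs('test.docx', colors={'yellow'})` would give the yellow runs only. Note that Word often splits one visually highlighted phrase across several runs, so adjacent results may need to be joined afterwards.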
    