Parse annotations from a pdf

前端 未结 8 757
攒了一身酷
攒了一身酷 2020-11-28 21:15

I want a python function that takes a pdf and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net

8条回答
  •  粉色の甜心
    2020-11-28 21:58

    The author @JorjMcKie of PyMuPDF wrote a snippet for me and I modified a bit:

    import fitz  # to import the PyMuPDF library
    # from pprint import pprint
    
    
    def _parse_highlight(annot: fitz.Annot, wordlist: list) -> str:
        points = annot.vertices
        quad_count = int(len(points) / 4)
        sentences = ['' for i in range(quad_count)]
        for i in range(quad_count):
            r = fitz.Quad(points[i * 4: i * 4 + 4]).rect
            words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
            sentences[i] = ' '.join(w[4] for w in words)
        sentence = ' '.join(sentences)
        return sentence
    
    
    def main() -> dict:
        doc = fitz.open('path/to/your/file')
        page = doc[0]
    
        wordlist = page.getText("words")  # list of words on page
        wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    
        highlights = {}
        annot = page.firstAnnot
        i = 0
        while annot:
            if annot.type[0] == 8:
                highlights[i] = _parse_highlight(annot, wordlist)
                i += 1
                print('> ' + highlights[i] + '\n')
            annot = annot.next
    
        # pprint(highlights)
        return highlights
    
    
    if __name__ == "__main__":
        main()
    
    • https://github.com/pymupdf/PyMuPDF
    • Is it posible to extract highlighted text? #318 https://github.com/pymupdf/PyMuPDF/issues/318#issuecomment-657102559

    Though there are still some small typos in the results:

    > system upsets,
    
    > expansion of smart grid monitoring devices that generally provide nodal voltages and power injections at fine spatial resolution,
    
    > hurricanes to indi- vidual lightning strikes),
    

提交回复
热议问题