read, highlight, save PDF programmatically

Submitted by 孤街浪徒 on 2019-12-12 08:24:21

Question


I'd like to write a small script (which will run on a headless Linux server) that reads a PDF, highlights any text matching one of the strings in an array that I pass, then saves the modified PDF. I imagine I'll end up using something like the Python bindings to poppler, but unfortunately there's next to no documentation and I have next to no experience in Python.

If anyone could point me to a tutorial, example, or some helpful documentation to get me started it would be greatly appreciated!


Answer 1:


Have you tried looking at PDFMiner? It sounds like it does what you want.




Answer 2:


Yes, it is possible with a combination of pdfminer (pip install pdfminer.six) and PyPDF2.

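First, find the coordinates of the text you want to highlight (pdfminer reports a bounding box for each piece of text it extracts). Then add a highlight annotation at those coordinates: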

#!/usr/bin/env python

"""Create sample highlight in a PDF file."""

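# These class names belong to the PyPDF2 1.x/2.x API;
# PyPDF2 3.0 removed them in favour of PdfReader/PdfWriter.
from PyPDF2 import PdfFileWriter, PdfFileReader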

from PyPDF2.generic import (
    DictionaryObject,
    NumberObject,
    FloatObject,
    NameObject,
    TextStringObject,
    ArrayObject
)


def create_highlight(x1, y1, x2, y2, meta, color=(0, 1, 0)):
    """
    Create a highlight for a PDF.

    Parameters
    ----------
    x1, y1 : float
        bottom left corner
    x2, y2 : float
        top right corner
    meta : dict
        keys are "author" and "contents"
    color : iterable
        Three elements, (r,g,b)
    """
    new_highlight = DictionaryObject()

    new_highlight.update({
        NameObject("/F"): NumberObject(4),
        NameObject("/Type"): NameObject("/Annot"),
        NameObject("/Subtype"): NameObject("/Highlight"),

        NameObject("/T"): TextStringObject(meta["author"]),
        NameObject("/Contents"): TextStringObject(meta["contents"]),

        NameObject("/C"): ArrayObject([FloatObject(c) for c in color]),
        NameObject("/Rect"): ArrayObject([
            FloatObject(x1),
            FloatObject(y1),
            FloatObject(x2),
            FloatObject(y2)
        ]),
        NameObject("/QuadPoints"): ArrayObject([
            FloatObject(x1),
            FloatObject(y2),
            FloatObject(x2),
            FloatObject(y2),
            FloatObject(x1),
            FloatObject(y1),
            FloatObject(x2),
            FloatObject(y1)
        ]),
    })

    return new_highlight


def add_highlight_to_page(highlight, page, output):
    """
    Add a highlight to a PDF page.

    Parameters
    ----------
    highlight : Highlight object
    page : PDF page object
    output : PdfFileWriter object
    """
    highlight_ref = output._addObject(highlight)

    if "/Annots" in page:
        page[NameObject("/Annots")].append(highlight_ref)
    else:
        page[NameObject("/Annots")] = ArrayObject([highlight_ref])


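def main():
    # Keep the input file open until the output has been written,
    # because PyPDF2 reads page content lazily from the input stream.
    with open("samples/test3.pdf", "rb") as input_stream:
        pdf_input = PdfFileReader(input_stream)
        pdf_output = PdfFileWriter()

        page1 = pdf_input.getPage(0)

        # Coordinates are in PDF points with the origin at the bottom left.
        highlight = create_highlight(89.9206, 573.1283, 376.849, 591.3563, {
            "author": "John Doe",
            "contents": "Lorem ipsum"
        })

        add_highlight_to_page(highlight, page1, pdf_output)

        # Only the first page is copied to the output in this sample.
        pdf_output.addPage(page1)

        with open("output.pdf", "wb") as output_stream:
            pdf_output.write(output_stream)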


if __name__ == '__main__':
    main()
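
For the "find the coordinates" step, here is a minimal sketch of how it might be done with pdfminer.six. It assumes that line-level bounding boxes are precise enough, so the whole line containing a match gets highlighted rather than just the matched words. The helper names match_lines and iter_text_lines are made up for this illustration; extract_pages, LTTextContainer and LTTextLine come from pdfminer.six.

```python
"""Find the bounding boxes of text lines that contain any search term."""


def match_lines(lines, terms):
    """
    Yield (page_index, bbox, text) for each line containing a search term.

    Parameters
    ----------
    lines : iterable of (page_index, bbox, text)
        bbox is (x1, y1, x2, y2) in PDF points, origin at the bottom left
    terms : iterable of str
        matched case-insensitively as substrings
    """
    lowered = [term.lower() for term in terms]
    for page_index, bbox, text in lines:
        haystack = text.lower()
        if any(term in haystack for term in lowered):
            yield page_index, bbox, text.strip()


def iter_text_lines(path):
    """Yield (page_index, bbox, text) for every text line pdfminer.six finds."""
    # Imported here so match_lines() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_index, page_layout in enumerate(extract_pages(path)):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            # A text box is iterable and yields its individual lines.
            for line in element:
                if isinstance(line, LTTextLine):
                    yield page_index, line.bbox, line.get_text()
```

Each bbox that match_lines(iter_text_lines("samples/test3.pdf"), ["Lorem", "ipsum"]) yields is in the same bottom-left/top-right order that create_highlight expects, so it can be passed as create_highlight(*bbox, meta) for the page at page_index. Pages with a rotation or a shifted crop box may need the coordinates adjusted, which this sketch does not attempt.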



Answer 3:


PDFlib has Python bindings and supports these operations. You will want PDFlib+PDI if you need to open an existing PDF (http://www.pdflib.com/products/pdflib-family/pdflib-pdi/), along with TET, PDFlib's text extraction toolkit.

Unfortunately, it is a commercial product. I have used this library in production in the past and it works great. The bindings are function-based and not very Pythonic. I have seen some attempts to make them more Pythonic: https://github.com/alexhayes/pythonic-pdflib. You will want to use open_pdi_document().

It sounds like you will want to do search highlighting of some sort:

http://www.pdflib.com/tet-cookbook/tet-and-pdflib/highlight-search-terms/



Source: https://stackoverflow.com/questions/7605577/read-highlight-save-pdf-programmatically