Search and replace for text within a pdf, in Python

这一生的挚爱 提交于 2021-02-07 20:30:45

问题


I am writing mailmerge software as part of a Python web app.

I have a template called letter.pdf which was generated from a MS Word file and includes the text {name} where the resident's name will go. I also have a list of c. 100 residents' names.

What I want to do is to read in letter.pdf do a search for "{name}" and replace it with the resident's name (for each resident) then write the result to another pdf. I then want to gather all these pdfs togetherinot a big pdf (one page per letter) which my web app's users will print out to create their letters.

Are there any Python libraries that will do this? I've looked and pdfrw and pdfminer but I couldn't see where they would be able to do it.

(NB: I also have the MS Word file, so if there was another way of using that not going through a pdf, that would also do the job.)


回答1:


This can be done with PyPDF2 package. The implementation may depend on the original PDF template structure. But if the template is stable enough and isn't changed very often the replacement code shouldn't be generic but rather simple.

I did a small sketch on how you could replace the text inside a PDF file. It replaces all occurrences of PDF tokens to DOC.

import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter


def replace_text(content, replacements = dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result

def process_data(object, replacements):
    data = object.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if object.decodedSelf is not None:
        object.decodedSelf.setData(encoded_data)
    else:
        object.setData(encoded_data)

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = { 'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        if len(contents) > 0:
            for obj in contents:
                streamObj = obj.getObject()
                process_data(streamObj, replacements)
        else:
            process_data(contents, replacements)

        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)

The results are



来源:https://stackoverflow.com/questions/41769120/search-and-replace-for-text-within-a-pdf-in-python

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!