Search for specific text in a large PDF by format, after decrypting it

Submitted by 梦想的初衷 on 2020-01-06 05:50:07

Question


I've gotten to the point where I can locate, open, decrypt, and count the number of pages in my large PDF file.

Now I simply want to grab the text below (it is on each page, at line 6). I'm wondering whether I should keep trying to gather it by line number (which I tried, but it errored out with a message about not being indexed at 0), or instead use a regex that works backwards from the ZIP code format.


This is the text format I need from each page of the large PDF. I want to scrape it and put it into two variables (e.g. MemberName1 = ; MemberName1Address = ;).

PersonsFirstName LastName # want to grab this from pdf, store in variable
513 StreetName St. # then want to grab this from pdf, address, store in variable
Harrisonburg, PA 22801-1860 

I'd like to create variables that store all of the above for my large PDF, so there would be about 2000 instances of it: one variable for the gathered names and one for the gathered addresses. Any thoughts?
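To make the "regex backwards from the ZIP code" idea concrete, here is a minimal sketch. It assumes the extracted page text keeps its line breaks (PyPDF2's text extraction does not guarantee this) and that the name and street lines sit directly above the city/state/ZIP line, as in the sample. The names `CITY_STATE_ZIP` and `parse_member` are hypothetical, not part of any library.

```python
import re

# Matches a line such as "Harrisonburg, PA 22801-1860":
# a city, a two-letter state code, then a 5-digit ZIP or ZIP+4.
CITY_STATE_ZIP = re.compile(
    r"^(?P<city>[A-Za-z .'-]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)


def parse_member(page_text):
    """Return (name, address) for one page's text, or None if nothing matches."""
    # Drop blank lines and surrounding whitespace so the offsets are stable.
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    for i, line in enumerate(lines):
        # Anchor on the city/state/ZIP line, then look two lines back.
        if i >= 2 and CITY_STATE_ZIP.match(line):
            name = lines[i - 2]
            address = f"{lines[i - 1]}, {line}"
            return name, address
    return None
```

Calling `parse_member` on each page's text and appending the results to a list would give one (name, address) entry per page, which is usually easier to work with than 2000 separately numbered variables.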

Here's what I have so far, with the working base functionality described above.

import PyPDF2

ENCRYPTED_FILE_PATH = './pdfs/largePDFletters.pdf'

with open(ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('AppleSauce')
        print(f"Number of pages: {reader.getNumPages()}")  # everything works up to here

        # Page indices are zero-based, so the first page is index 0.
        first_page = reader.getPage(0)  # does not work

        print(first_page.extractText())  # does not work

So, ultimately, I'm just trying to figure out how to scrape that specific text at this point.
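One possible way to get at the text of every page is sketched below. It assumes the same older PyPDF2 API used in the question (`PdfFileReader`, `getNumPages`, `getPage`, `extractText`); newer PyPDF2 releases renamed these (`PdfReader`, `len(reader.pages)`, `extract_text`). The helper names `extract_page_texts` and `read_encrypted_pdf` are hypothetical, and the quality of the extracted text depends on how the PDF was produced.

```python
def extract_page_texts(reader):
    """Return a list holding the extracted text of every page, in page order."""
    texts = []
    # Page indices are zero-based: the first page is index 0.
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        texts.append(page.extractText())
    return texts


def read_encrypted_pdf(path, password):
    """Open a PDF, decrypt it if needed, and return the text of each page."""
    # Imported lazily so extract_page_texts can be used without PyPDF2 installed.
    import PyPDF2

    with open(path, mode="rb") as f:
        reader = PyPDF2.PdfFileReader(f)
        if reader.isEncrypted:
            reader.decrypt(password)
        # Pages are read while the file is still open, and outside the
        # isEncrypted branch so unencrypted files are handled too.
        return extract_page_texts(reader)
```

With the values from the question, the call would look like `read_encrypted_pdf('./pdfs/largePDFletters.pdf', 'AppleSauce')`, and each string in the returned list can then be searched for the name and address lines.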

Source: https://stackoverflow.com/questions/57149639/search-for-specific-text-in-large-pdf-by-format-after-decrypting-it
