Question
I've gotten to the point where I can locate, decrypt, open, and count the number of pages in my large PDF file.
Now I simply want to grab the text below (it is on each page, at line 6). I'm wondering whether I should keep trying to gather it by line number (which I tried, but it errored out with a message about not being indexed at 0), or try a regex that works backwards from the ZIP code format?
This is the text format needed from each page in the large PDF. I want to scrape it and put it into two variables (e.g. MemberName1 = ; MemberName1Address = ;).
PersonsFirstName LastName # want to grab this from pdf, store in variable
513 StreetName St. # then want to grab this from pdf, address, store in variable
Harrisonburg, PA 22801-1860
I'd like to create variables that store all of the above for my large PDF, so there would be about 2000 instances of it: one variable for the gathered names and one for the gathered addresses. Any thoughts?
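To make the regex idea concrete, here is a rough sketch of working backwards from the ZIP code line. It assumes the extracted page text keeps the name, street, and city/state/ZIP on three consecutive lines, which PDF text extraction does not always guarantee. `ZIP_LINE` and `parse_member` are hypothetical names made up for illustration, and the pattern is guessed from the single sample above.

```python
import re

# "City, ST 12345" or "City, ST 12345-6789"; the format is assumed from the sample above.
ZIP_LINE = re.compile(r"^[A-Za-z .'-]+,\s*[A-Z]{2}\s+\d{5}(?:-\d{4})?$")


def parse_member(page_text):
    """Return (name, address) for one page's text, or None if no ZIP line is found."""
    # Drop blank lines so the name and street sit directly above the ZIP line.
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        # Need two lines above the ZIP line: the name and the street address.
        if i >= 2 and ZIP_LINE.match(line):
            name = lines[i - 2]
            address = f"{lines[i - 1]}, {line}"
            return name, address
    return None


sample = (
    "Some letterhead\n"
    "\n"
    "PersonsFirstName LastName\n"
    "513 StreetName St.\n"
    "Harrisonburg, PA 22801-1860\n"
    "Dear member,\n"
)
print(parse_member(sample))
```

Anchoring on the ZIP line rather than a fixed line number would tolerate pages where the header is a line longer or shorter, at the cost of depending on every address following the same city/state/ZIP pattern.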
Here's what I have so far, with the working base functionality described above.
import PyPDF2

ENCRYPTED_FILE_PATH = './pdfs/largePDFletters.pdf'

with open(ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('AppleSauce')
    print(f"Number of page: {reader.getNumPages()}")  # all works til here
    first_page = reader.getPage(1)  # does not work
    print(first_page.extractText())  # does not work
So ultimately, I'm just trying to figure out how to scrape that specific text at this point.
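For comparison, the line-number approach might look roughly like the sketch below. It assumes the name really is on line 6 of every page, with the street and city/state/ZIP on the two lines after it. `collect_by_line_number` is a hypothetical helper, not part of PyPDF2. It takes plain strings, so with the legacy API used above the page texts could come from `reader.getPage(i).extractText()` for `i` in `range(reader.getNumPages())`. Page indices are 0-based, so the first page is `getPage(0)`.

```python
def collect_by_line_number(page_texts, first_line=6):
    """Collect names and addresses from a fixed line position on each page.

    page_texts is any iterable of strings, one per page.
    first_line is the 1-based line number of the name (assumed to be 6).
    """
    names, addresses = [], []
    start = first_line - 1  # convert the 1-based line number to a 0-based index
    for text in page_texts:
        lines = text.splitlines()
        if len(lines) < start + 3:
            continue  # page too short to hold name + street + city/ZIP
        names.append(lines[start].strip())
        # Join the street line and the city/state/ZIP line into one address.
        addresses.append(", ".join(ln.strip() for ln in lines[start + 1:start + 3]))
    return names, addresses


pages = [
    "a\nb\nc\nd\ne\nJane Doe\n1 Main St.\nTown, PA 12345\nrest of letter",
    "short page",
]
names, addresses = collect_by_line_number(pages)
print(names, addresses)
```

Two parallel lists keep each name at the same position as its address, which seems closer to what I want than 2000 separately numbered variables.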
Source: https://stackoverflow.com/questions/57149639/search-for-specific-text-in-large-pdf-by-format-after-decrypting-it