Extract hyperlinks from PDF in Python

情深已故 2020-12-30 09:33

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I have used the PDFMiner library and code from http://www.endlesslycurio

5 answers
  •  轻奢々 (original poster)
    2020-12-30 09:44

    A slightly modified version of Ashwin's answer:

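    import PyPDF2

    # Uses the camelCase API of PyPDF2 releases before 3.0.
    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    # Keep the file open while reading; the reader loads objects lazily.
    with open("file.pdf", 'rb') as PDFFile:
        PDF = PyPDF2.PdfFileReader(PDFFile)
        pages = PDF.getNumPages()

        for page in range(pages):
            print("Current Page: {}".format(page))
            pageObject = PDF.getPage(page)
            # Pages without annotations have no /Annots entry.
            if key not in pageObject:
                continue
            for a in pageObject[key]:
                # Annotations are usually indirect references; resolve them.
                u = a.getObject()
                # Not every annotation has an action (/A), so check before indexing.
                if ank in u and uri in u[ank]:
                    print(u[ank][uri])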
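
The same /Annots -> /A -> /URI walk can also be written as a small helper that returns the links instead of printing them. This is only a sketch: the names `resolve`, `collect_uris` and `extract_links_from_pdf` are made up for illustration, and the last function assumes the newer `pypdf` package (with `PdfReader` and `get_object`) is installed rather than the PyPDF2 version used above.

```python
from typing import Any, Iterable, List


def resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object exposes get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def collect_uris(pages: Iterable[Any]) -> List[str]:
    """Return every /A -> /URI target found in the pages' /Annots arrays."""
    found = []
    for page in pages:
        # A page without annotations simply has no /Annots entry.
        annots = resolve(page.get("/Annots"))
        for annot in annots or []:
            # Annotations without an action (/A) are skipped.
            action = resolve(resolve(annot).get("/A"))
            if action is not None and "/URI" in action:
                found.append(str(action["/URI"]))
    return found


def extract_links_from_pdf(path: str) -> List[str]:
    """Assumed usage with pypdf; imported lazily so the helpers stay stdlib-only."""
    from pypdf import PdfReader

    return collect_uris(PdfReader(path).pages)
```

Calling `extract_links_from_pdf("file.pdf")` should then give a flat list of link targets. Because `collect_uris` only relies on dict-style access, it can be tried on plain dictionaries without a PDF at hand.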
    
