Extract hyperlinks from PDF in Python

情深已故 2020-12-30 09:33

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I have used the PDFMiner library and code from http://www.endlesslycurio

5 answers
  •  天命终不由人
    2020-12-30 10:03

    I think you could do that with PyPDF, if what you want is to extract the links from the PDF. I am not sure where I got this from, but it sits in my code as part of something else. Hope this helps:

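    import pyPdf  # legacy library; the method names below are pyPdf's

    PDFFile = open('File Location', 'rb')

    PDF = pyPdf.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()
    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    for page in range(pages):
        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()

        # Links are stored as annotations, so skip pages without any
        if key in pageObject:
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                # Only annotations with a URI action carry a hyperlink
                if ank in u and uri in u[ank]:
                    print(u[ank][uri])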
    

    This should, I hope, print the links in your PDF. P.S.: I haven't tested this extensively.
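
    The same walk over '/Annots' can also be written as a small reusable function. What follows is only a sketch under assumptions, not part of the code above: collect_uris and _resolve are hypothetical helper names, and the function assumes each page behaves like a dictionary and that indirect references expose a get_object() method (the naming used by pypdf, the successor to pyPdf). The sample data is made up and stands in for a parsed PDF.

```python
def _resolve(obj):
    """Follow an indirect reference if the object exposes get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def collect_uris(pages):
    """Return (page_number, uri) pairs for every annotation with a URI action."""
    found = []
    for number, page in enumerate(pages, start=1):
        page = _resolve(page)
        if "/Annots" not in page:
            continue  # no annotations on this page
        for annot in _resolve(page["/Annots"]):
            annot = _resolve(annot)
            if "/A" not in annot:
                continue  # annotation has no action (e.g. a text note)
            action = _resolve(annot["/A"])
            if "/URI" in action:
                found.append((number, str(action["/URI"])))
    return found


# Made-up stand-in for parsed pages: one link, one note, one empty page.
sample_pages = [
    {"/Annots": [
        {"/Subtype": "/Link", "/A": {"/S": "/URI", "/URI": "https://example.com"}},
        {"/Subtype": "/Text"},
    ]},
    {},
]

print(collect_uris(sample_pages))  # prints [(1, 'https://example.com')]
```

    With pypdf installed, passing PdfReader('File Location').pages to collect_uris should be the equivalent call, though I have not run it against a real file. Returning page numbers alongside the URIs makes it easier to check the result against the document by eye.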
