Extract hyperlinks from PDF in Python

情深已故 2020-12-30 09:33

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I have used the PDFMiner library and code from http://www.endlesslycurio

5 answers
  •  挽巷 (original poster)
    2020-12-30 09:49

    Here's a version that creates a list of URLs in the simplest way I could find:

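    # PdfReader is the current name for the deprecated PdfFileReader.
    from PyPDF2 import PdfReader

    reader = PdfReader('filename.pdf')

    urls = []
    for page in reader.pages:
        # Pages without annotations have no '/Annots' entry.
        if '/Annots' not in page:
            continue
        for annot in page['/Annots']:
            # Annotations are usually stored as indirect references; resolve them.
            annot = annot.get_object()
            # Only URI actions carry a URL; internal links are skipped.
            if '/A' in annot and '/URI' in annot['/A']:
                urls.append(annot['/A']['/URI'])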
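
To make the filtering step easier to try out without a PDF at hand, here is a minimal sketch of the same idea as a standalone function. The name `collect_uris` and the sample data are hypothetical, and the sketch assumes each annotation has already been resolved into a dict-like object with the standard `/Subtype`, `/A` and `/URI` keys.

```python
from typing import Any, Iterable, List, Mapping


def collect_uris(annotations: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return the /URI target of every link annotation, skipping the rest.

    Hypothetical helper: assumes each annotation is already a resolved,
    dict-like object (not an indirect reference).
    """
    uris = []
    for annot in annotations:
        # Only /Link annotations can carry a hyperlink.
        if "/Subtype" not in annot or annot["/Subtype"] != "/Link":
            continue
        # A link without an action dictionary has nothing to extract.
        if "/A" not in annot:
            continue
        action = annot["/A"]
        # Internal jumps (for example /GoTo actions) have no /URI entry.
        if "/URI" in action:
            uris.append(str(action["/URI"]))
    return uris


# Made-up annotations: one external link, one internal link, one note.
sample = [
    {"/Subtype": "/Link", "/A": {"/S": "/URI", "/URI": "https://example.com/"}},
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": "chapter1"}},
    {"/Subtype": "/Text", "/Contents": "a sticky note"},
]
print(collect_uris(sample))  # ['https://example.com/']
```

Checking each key before reading it means one annotation without a URL does not stop the rest of the page from being processed, which a single try/except around the whole loop would do.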
    
