Extract hyperlinks from PDF in Python

情深已故 2020-12-30 09:33

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I have used the PDFMiner library and code from http://www.endlesslycurio

5 Answers

挽巷 (original poster)

2020-12-30 09:49

Here's a version that creates a list of URLs in the simplest way I could find:

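import PyPDF2

# Legacy PyPDF2 1.x API (PdfFileReader, numPages, getPage).
pdf = PyPDF2.PdfFileReader('filename.pdf')

urls = []
for page in range(pdf.numPages):
    pdfPage = pdf.getPage(page)
    if '/Annots' not in pdfPage:
        continue  # this page has no annotations
    for item in pdfPage['/Annots']:
        annot = item.getObject()  # resolve indirect references
        # Skip annotations without a URI action (highlights, notes, ...)
        # instead of abandoning the rest of the page on the first KeyError.
        if '/A' in annot and '/URI' in annot['/A']:
            urls.append(annot['/A']['/URI'])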
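
For newer installs, the same idea can be sketched against pypdf, the successor to PyPDF2 (the PdfFileReader-style names were removed in PyPDF2 3.0). This is only a sketch, not a definitive implementation: the function names extract_uris and extract_uris_from_file are my own, and it assumes every link you care about is stored as a /URI action on a page annotation. The helper accepts any iterable of dict-like pages, so it can be tried with plain dictionaries before pointing it at a real file.

```python
def extract_uris(pages):
    """Collect the /URI target of every annotation that has a URI action.

    `pages` is any iterable of mapping-like page objects, for example
    the `pages` attribute of a pypdf PdfReader (hypothetical helper).
    """
    urls = []
    for page in pages:
        if "/Annots" not in page:
            continue  # page has no annotations at all
        for item in page["/Annots"]:
            # pypdf may hand back indirect references; plain dicts do not.
            annot = item.get_object() if hasattr(item, "get_object") else item
            # Only annotations with a URI action carry an external link.
            if "/A" in annot and "/URI" in annot["/A"]:
                urls.append(str(annot["/A"]["/URI"]))
    return urls


def extract_uris_from_file(path):
    """Open a PDF with pypdf and return its link targets (assumes pypdf is installed)."""
    # Imported lazily so extract_uris stays usable without pypdf.
    from pypdf import PdfReader

    return extract_uris(PdfReader(path).pages)
```

Calling extract_uris_from_file('filename.pdf') should give the same kind of list as the loop above. Internal links (jumps to another page of the same document) use a different action type and are skipped by this check.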
