I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF.
I have used the PDFMiner library and code from http://www.endlesslycurio
Here's a version that creates a list of URLs in the simplest way I could find:
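import PyPDF2

pdf = PyPDF2.PdfFileReader('filename.pdf')
urls = []
for page in range(pdf.numPages):
    pdfPage = pdf.getPage(page)
    try:
        annotations = pdfPage['/Annots']
    except KeyError:
        # This page has no annotations, so it has no links.
        continue
    for item in annotations:
        try:
            # Resolve indirect references, then read the link target.
            urls.append(item.getObject()['/A']['/URI'])
        except KeyError:
            # Not a URI link (e.g. an internal jump or a comment); skip it.
            pass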
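
The same idea can be sketched against the newer API, where the library is published as pypdf and the reader class is PdfReader (the old PdfFileReader name was removed in PyPDF2 3.0). This is only a sketch, assuming pypdf is installed; the helper names resolve, uris_from_annotations and urls_from_pdf are my own and not part of the library. The annotation walk is kept separate from the file handling so it can be tried on plain dictionaries:

```python
def resolve(obj):
    """Follow an indirect reference if the object has get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def uris_from_annotations(annotations):
    """Return the /URI target of every URI link in an annotation list."""
    uris = []
    for item in resolve(annotations) or []:
        annotation = resolve(item)
        # Only link annotations with a URI action carry /A -> /URI.
        action = resolve(annotation.get("/A"))
        if action is not None and "/URI" in action:
            uris.append(str(action["/URI"]))
    return uris


def urls_from_pdf(path):
    """Collect link targets from every page of the PDF at `path`."""
    # Assumption: the pypdf package is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(path)
    urls = []
    for page in reader.pages:
        # Pages without links have no /Annots entry, so .get() returns None.
        urls.extend(uris_from_annotations(page.get("/Annots")))
    return urls
```

Calling urls_from_pdf('filename.pdf') should give the same kind of list as the loop above. Checking for the keys instead of catching KeyError means one annotation without a URI does not stop the rest of the page from being read.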