Extract hyperlinks from PDF in Python

情深已故 2020-12-30 09:33

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I have used the PDFMiner library and code from http://www.endlesslycurio

5 Answers
  • 2020-12-30 09:40

    This is an old question, but it seems a lot of people look at it (including me while trying to answer this question), so I am sharing the answer I came up with. As a side note, it helps a lot to learn how to use the Python debugger (pdb) so you can inspect these objects on the fly.

    It is possible to get the hyperlinks using PDFMiner. The complication (as with so much about PDFs) is that nothing ties a link annotation to the text of the link, except that both occupy the same region of the page.

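    Here is the code I used to get the links on a PDFPage:

    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import literal_name

    annotationList = []
    if page.annots:
        # resolve1() follows an indirect reference; direct objects pass through
        for annotation in resolve1(page.annots):
            annotationDict = resolve1(annotation)
            if literal_name(annotationDict.get("Subtype")) != "Link":
                # Skip over any annotations that are not links
                continue
            action = resolve1(annotationDict.get("A"))
            if not action or literal_name(action.get("S")) != "URI":
                # Links without a URI action (e.g. internal jumps) are skipped
                continue
            position = resolve1(annotationDict["Rect"])
            uri = resolve1(action["URI"])
            if isinstance(uri, bytes):
                # pdfminer.six returns PDF strings as bytes
                uri = uri.decode("utf-8", "replace")
            # Some of my URIs have spaces.
            uri = uri.replace(" ", "%20")
            annotationList.append((position, uri))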
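
The space replacement above handles one unsafe character. A slightly more general clean-up step might look like the sketch below. It uses only the standard library; `normalize_uri` is a hypothetical helper name, and it assumes the raw value may arrive as either bytes or text and that UTF-8 is an acceptable decoding.

```python
from urllib.parse import quote


def normalize_uri(raw):
    """Decode a raw URI value if needed and percent-encode unsafe characters.

    Hypothetical helper, not part of PDFMiner.
    """
    if isinstance(raw, bytes):
        # Assumption: the bytes are UTF-8 (or plain ASCII) text
        raw = raw.decode("utf-8", errors="replace")
    # Keep URL delimiters and existing %-escapes untouched;
    # spaces and other unsafe characters get percent-encoded.
    return quote(raw.strip(), safe=":/?#[]@!$&'()*+,;=%~")


print(normalize_uri(b"http://example.com/my file.pdf"))
```

Because `%` is listed as safe, a URI that is already escaped passes through unchanged rather than being escaped twice.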
    

    Then I defined a function like:

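    def getOverlappingLink(annotationList, element):
        """Return the URL of the first link whose rectangle overlaps element, else None."""
        for (x0, y0, x1, y1), url in annotationList:
            # No overlap if one box lies entirely to the left or right of the other
            if x0 > element.x1 or element.x0 > x1:
                continue
            # No overlap if one box lies entirely above or below the other
            if y0 > element.y1 or element.y0 > y1:
                continue
            return url
        return None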
    

    which I used to search the annotationList I had previously built for the page, to see whether any hyperlink occupies the same region as an LTTextBoxHorizontal that I was inspecting.

    In my case, since PDFMiner was consolidating too much text into each text box, I walked through the _objs attribute of each text box and checked every LTTextLineHorizontal instance to see whether it overlapped any of the annotation positions.
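
To make the per-line matching concrete, here is a minimal sketch that runs without PDFMiner. `Line` and `Box` are hypothetical stand-ins for the layout objects (only the bounding-box attributes and the text are modelled), and the coordinates are made-up values in PDF points.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Line:
    """Stand-in for a layout line: bounding box plus text (hypothetical)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


@dataclass
class Box:
    """Stand-in for a text box that groups several lines (hypothetical)."""
    lines: List[Line] = field(default_factory=list)


def overlapping_link(annotations, element) -> Optional[str]:
    """Return the first URL whose rectangle overlaps the element, else None."""
    for (x0, y0, x1, y1), url in annotations:
        if x0 > element.x1 or element.x0 > x1:
            continue  # separated horizontally
        if y0 > element.y1 or element.y0 > y1:
            continue  # separated vertically
        return url
    return None


def links_per_line(annotations, box):
    """Pair each line's text with the link that overlaps it, if any."""
    return [(line.text, overlapping_link(annotations, line)) for line in box.lines]


annotations = [((72.0, 700.0, 200.0, 712.0), "http://example.com/")]
box = Box([
    Line(72.0, 700.0, 300.0, 712.0, "Visit our site"),
    Line(72.0, 686.0, 300.0, 698.0, "for more details."),
])
print(links_per_line(annotations, box))
# [('Visit our site', 'http://example.com/'), ('for more details.', None)]
```

Checking lines rather than whole boxes keeps a link from being attributed to neighbouring text that merely shares a box with it. Note that the comparison uses `>`, so rectangles that only touch at an edge still count as overlapping.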

  • 2020-12-30 09:42

    The hyperlink will actually be an annotation, so you need to process the annotations rather than 'extract the text'. I suspect that you are going to need a library such as iTextSharp or MuPDF, or Ghostscript if you are really desperate (and comfortable programming in PostScript).

    I'd have thought it relatively easy to process the annotations, looking for those with subtype /Link, though.
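
As a rough illustration of what "processing the annotations" involves, the sketch below models a page's annotation array with plain Python dicts. This is a stand-in for whatever objects a real PDF library returns, the sample values are invented, and `uri_links` is a hypothetical name. A link annotation has subtype /Link; when it points to the web, its /A action dictionary has /S set to /URI and carries the target under /URI.

```python
# Invented sample data shaped like a page's /Annots array
annots = [
    {"/Subtype": "/Link", "/Rect": [72, 700, 200, 712],
     "/A": {"/S": "/URI", "/URI": "http://example.com/"}},
    # A link that jumps inside the document has no /URI entry
    {"/Subtype": "/Link", "/Rect": [72, 650, 200, 662],
     "/A": {"/S": "/GoTo", "/D": "chapter1"}},
    # Other annotation types are not links at all
    {"/Subtype": "/Highlight", "/Rect": [72, 600, 200, 612]},
]


def uri_links(annotations):
    """Return (rect, uri) pairs for link annotations that carry a URI action."""
    found = []
    for annot in annotations:
        if annot.get("/Subtype") != "/Link":
            continue
        action = annot.get("/A", {})
        if action.get("/S") == "/URI" and "/URI" in action:
            found.append((annot["/Rect"], action["/URI"]))
    return found


print(uri_links(annots))
# [([72, 700, 200, 712], 'http://example.com/')]
```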

  • 2020-12-30 09:44

    A slightly modified version of Ashwin's answer:

    import PyPDF2

    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    # PdfFileReader, getNumPages and getPage are the legacy PyPDF2 (pre-3.0) names
    with open("file.pdf", 'rb') as PDFFile:
        PDF = PyPDF2.PdfFileReader(PDFFile)
        pages = PDF.getNumPages()

        for page in range(pages):
            print("Current Page: {}".format(page))
            pageSliced = PDF.getPage(page)
            pageObject = pageSliced.getObject()
            if key in pageObject.keys():
                ann = pageObject[key]
                for a in ann:
                    u = a.getObject()
                    # Not every annotation has an action, so check for /A first
                    if ank in u and uri in u[ank].keys():
                        print(u[ank][uri])
    
  • 2020-12-30 09:49

    Here's a version that creates a list of URLs in the simplest way I could find:

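    import PyPDF2

    pdf = PyPDF2.PdfFileReader('filename.pdf')

    urls = []
    for page in range(pdf.numPages):
        pdfPage = pdf.getPage(page)
        if '/Annots' not in pdfPage:
            continue
        for item in pdfPage['/Annots']:
            try:
                # getObject() resolves annotations stored as indirect references
                urls.append(item.getObject()['/A']['/URI'])
            except KeyError:
                # Annotation without a URI action; keep scanning the rest
                pass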
    
  • 2020-12-30 10:03

    I think you could do that with pyPdf if you want to extract the links from a PDF. I am not sure where I got this from, but it sits in my code as part of something else. Hope this helps:

    import pyPdf

    PDFFile = open('File Location', 'rb')

    PDF = pyPdf.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()
    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    for page in range(pages):

        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()

        if key in pageObject:
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                # Check for /A first: not every annotation has an action
                if ank in u and uri in u[ank]:
                    print(u[ank][uri])

    PDFFile.close()
    

    This should, I hope, give you the links in your PDF. P.S.: I haven't tested it extensively.
