Reading hyperlinks from a PDF file

天命终不由人 2020-12-10 08:26

I'm trying to read a PDF file and get all the hyperlinks from it. I'm using iTextSharp for C# .NET.

PdfReader reader = new PdfReader("test.pdf");


        
2 Answers
  • 2020-12-10 08:55

    PdfReader.GetLinks() is only meant to be used with links internal to the document, not external hyperlinks. Why? I don't know.

    The code below is based on code I wrote earlier, but I've limited it to links stored in the PDF as a PdfName.URI action. It's possible to store a link as JavaScript that ultimately does the same thing, and there are probably other types, but you'll need to detect those yourself. I don't believe anything in the spec says that a link actually needs to be a valid URI; it's just implied. So the code below returns a string that you can (probably) convert to a URI on your own.

        //Requires: using System.Collections.Generic; using iTextSharp.text.pdf;
        private static List<string> GetPdfLinks(string file, int page)
        {
            List<string> Ret = new List<string>();

            //Open our reader
            PdfReader R = new PdfReader(file);
            try
            {
                //Get the current page
                PdfDictionary PageDictionary = R.GetPageN(page);

                //Get all of the annotations for the current page
                PdfArray Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);

                //No annotations on this page means no links, so return the empty list
                if (Annots == null)
                    return Ret;

                //Loop through each annotation
                foreach (PdfObject A in Annots.ArrayList)
                {
                    //Resolve a possible indirect reference to the actual annotation dictionary
                    PdfDictionary AnnotationDictionary = PdfReader.GetPdfObject(A) as PdfDictionary;
                    if (AnnotationDictionary == null)
                        continue;

                    //Make sure this annotation is a link (written this way so a missing SUBTYPE cannot throw)
                    if (!PdfName.LINK.Equals(AnnotationDictionary.Get(PdfName.SUBTYPE)))
                        continue;

                    //Get the ACTION for the current annotation and skip annotations without one
                    //(GetAsDict also resolves an action stored as an indirect reference)
                    PdfDictionary AnnotationAction = AnnotationDictionary.GetAsDict(PdfName.A);
                    if (AnnotationAction == null)
                        continue;

                    //Test if it is a URI action (there are tons of other types of actions, some of which might mimic URI, such as JavaScript, but those need to be handled separately)
                    if (PdfName.URI.Equals(AnnotationAction.Get(PdfName.S)))
                    {
                        PdfString Destination = AnnotationAction.GetAsString(PdfName.URI);
                        if (Destination != null)
                            Ret.Add(Destination.ToString());
                    }
                }
            }
            finally
            {
                //Release the file even if something above throws
                R.Close();
            }

            return Ret;
        }
    

    And call it:

            string myfile = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Output.pdf");
            List<string> Links = GetPdfLinks(myfile, 1);
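
    The question asks for every link in the file, while the method above handles one page and returns plain strings. As a rough sketch only, the two helpers below show one way to merge per-page results and keep just the strings that parse as absolute URIs. The names PdfLinkHelpers, CollectLinks and ToAbsoluteUris are made up for this illustration and are not part of iTextSharp; the per-page lookup is passed in as a delegate so the sketch does not depend on any particular reader code.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical helper class, not part of iTextSharp.
    public static class PdfLinkHelpers
    {
        // Calls getLinksForPage for pages 1..pageCount (PDF page numbers are 1-based)
        // and merges everything into a single list.
        public static List<string> CollectLinks(int pageCount, Func<int, List<string>> getLinksForPage)
        {
            var all = new List<string>();
            for (int page = 1; page <= pageCount; page++)
            {
                List<string> links = getLinksForPage(page);
                // Tolerate a per-page method that returns null when a page has no annotations.
                if (links != null)
                    all.AddRange(links);
            }
            return all;
        }

        // Keeps only the strings that parse as absolute URIs and silently drops the rest.
        public static List<Uri> ToAbsoluteUris(IEnumerable<string> links)
        {
            var uris = new List<Uri>();
            foreach (string link in links)
            {
                Uri uri;
                if (Uri.TryCreate(link, UriKind.Absolute, out uri))
                    uris.Add(uri);
            }
            return uris;
        }
    }
    ```

    With the method above, this could be called as PdfLinkHelpers.CollectLinks(pageCount, p => GetPdfLinks(myfile, p)), where pageCount is read from PdfReader.NumberOfPages. Note that this reopens the file once per page, which is simple but not efficient for large documents.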
    
  • 2020-12-10 08:56

    I have noticed that any text in a PDF that looks like a URL can be presented as an annotation link by the PDF viewer. In Adobe Acrobat there is a page display preference under the General tab called "Create links from URLs" that controls this. I was writing code to remove URL link annotations, only to find that there were none. Yet Acrobat was automatically turning text that looked like a URL into what appeared to be an annotation link.
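
    If you also want to catch these viewer-generated links, you have to scan the page text yourself, because there is no annotation to read. The sketch below is only an approximation: the class name UrlTextScanner and the pattern are made up for this example, and the rules Acrobat actually uses are not known here. It assumes you already have the page text as a string, for example from iTextSharp's PdfTextExtractor.GetTextFromPage(reader, page).

    ```csharp
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Hypothetical helper, not part of iTextSharp.
    public static class UrlTextScanner
    {
        // Deliberately loose, illustrative pattern: "http://", "https://" or "www."
        // followed by any run of characters that are not whitespace, angle brackets or quotes.
        private static readonly Regex UrlLike =
            new Regex(@"\b(?:https?://|www\.)[^\s<>""]+", RegexOptions.IgnoreCase);

        // Returns every URL-looking substring found in the given page text.
        public static List<string> FindUrlLikeText(string pageText)
        {
            var found = new List<string>();
            foreach (Match m in UrlLike.Matches(pageText))
            {
                // Trim punctuation that usually belongs to the sentence, not the URL.
                found.Add(m.Value.TrimEnd('.', ',', ';', ')'));
            }
            return found;
        }
    }
    ```

    Text extraction can split a long URL across lines, so results from a scan like this should be treated as candidates rather than confirmed links.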
