How to extract highlighted text from a PDF using iTextSharp?

As per the following post: iTextSharp PDF Reading highlighed text (highlight annotations) using C#

this code:

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

works for extracting PDF annotations. But why doesn't the following, nearly identical code work for highlights (specifically, PdfName.HIGHLIGHT does not work)?

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

Please take a look at table 30 in ISO-32000-1 (aka the PDF reference). It is entitled "Entries in a page object". Among these entries, you can find a key named Annots. Its value is:

(Optional) An array of annotation dictionaries that shall contain indirect references to all annotations associated with the page (see 12.5, "Annotations").

You will not find an entry with a key such as Highlight, hence it is only normal that the array that is returned is null when you have this line:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);

You need to get the annotations the way you already did:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

Now you need to loop over this array and look for annotations with Subtype equal to Highlight. This type of annotation is listed in table 169 of ISO-32000-1, entitled "Annotation types".

In other words, your assumption that a page dictionary contains an entry with key Highlight was wrong. If you read the whole specification, you will discover another false assumption you've been making: that the highlighted text is stored in the Contents entry of the annotation. The Contents entry holds the text displayed for the annotation itself (typically a comment attached to the highlight), not the page text that the highlight covers.

The text you are looking for is stored in the content stream of the page. The content stream of the page is independent of the page's annotations. Hence, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (stored in the QuadPoints array) and you need to use these coordinates to parse the text that is present in the page content at those coordinates.
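The QuadPoints step can be sketched without any PDF library. In ISO-32000-1, QuadPoints is a flat array of 8 × n numbers describing n quadrilaterals, usually one per highlighted line, so a multi-line highlight is described more tightly by its quads than by the single Rect bounding box. The helper below is only an illustration: the names QuadPointsHelper, Box and ToBoundingBoxes are hypothetical and not part of iTextSharp, and it assumes the numbers have already been read from the annotation into a float array.

```csharp
using System;
using System.Collections.Generic;

public static class QuadPointsHelper
{
    // One axis-aligned box in PDF user space (lower-left and upper-right corners).
    public struct Box
    {
        public float Llx, Lly, Urx, Ury;
    }

    // Splits a flat QuadPoints array (8 numbers per quadrilateral: x1 y1 ... x4 y4)
    // into one bounding box per quadrilateral. Taking min/max of the four corners
    // avoids relying on the vertex order, which differs between PDF producers.
    public static List<Box> ToBoundingBoxes(float[] quadPoints)
    {
        if (quadPoints == null)
            throw new ArgumentNullException("quadPoints");
        if (quadPoints.Length % 8 != 0)
            throw new ArgumentException("QuadPoints length must be a multiple of 8.");

        var boxes = new List<Box>();
        for (int i = 0; i < quadPoints.Length; i += 8)
        {
            float minX = float.MaxValue, minY = float.MaxValue;
            float maxX = float.MinValue, maxY = float.MinValue;
            for (int j = 0; j < 8; j += 2)
            {
                float x = quadPoints[i + j];
                float y = quadPoints[i + j + 1];
                minX = Math.Min(minX, x);
                maxX = Math.Max(maxX, x);
                minY = Math.Min(minY, y);
                maxY = Math.Max(maxY, y);
            }
            boxes.Add(new Box { Llx = minX, Lly = minY, Urx = maxX, Ury = maxY });
        }
        return boxes;
    }
}
```

Each resulting box could then be used as the region for a text extraction filter, one box per highlighted line, instead of filtering once on the annotation's overall Rect.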

Here is a complete example of extracting highlighted text using iTextSharp:

    // Requires iTextSharp 5.x. The using directives go at the top of the file;
    // the method belongs inside a class.
    using System;
    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    public void GetRectAnno()
    {
        string appRootDir = new DirectoryInfo(Environment.CurrentDirectory).Parent.Parent.FullName;
        string filePath = appRootDir + "/PDFs/" + "anot.pdf";

        try
        {
            using (PdfReader reader = new PdfReader(filePath))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    PdfDictionary page = reader.GetPageN(i);
                    PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                    if (annots == null)
                        continue; // this page has no annotations

                    foreach (PdfObject annot in annots.ArrayList)
                    {
                        // Resolve the indirect reference to the annotation dictionary
                        PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annot);
                        PdfName subType = annotationDic.GetAsName(PdfName.SUBTYPE);

                        // Only process highlight annotations
                        if (!PdfName.HIGHLIGHT.Equals(subType))
                            continue;

                        // Show the Rect and QuadPoints of the highlight
                        PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);
                        Console.WriteLine("Highlight at Rectangle {0} with QuadPoints {1}", coordinates, annotationDic.GetAsArray(PdfName.QUADPOINTS));

                        // Rect is [llx lly urx ury]: the bounding box of the whole highlight
                        Rectangle rect = new Rectangle(
                            coordinates.GetAsNumber(0).FloatValue,
                            coordinates.GetAsNumber(1).FloatValue,
                            coordinates.GetAsNumber(2).FloatValue,
                            coordinates.GetAsNumber(3).FloatValue);

                        // Extract only the text that falls inside the rectangle
                        RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
                        ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);

                        // Show the extracted text on the console
                        Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
                    }
                }
            }
        }
        catch (Exception ex)
        {
            // Report the failure instead of silently swallowing it
            Console.Error.WriteLine(ex);
        }
    }

Source: https://stackoverflow.com/questions/26652411/how-to-extract-highlighed-text-from-pdf-using-itextsharp
