How to extract highlighted text from PDF using iTextSharp?

风流意气都作罢 submitted on 2019-12-03 20:45:51

Please take a look at table 30 in ISO-32000-1 (aka the PDF reference). It is entitled "Entries in a page object". Among these entries, you can find a key named Annots. Its value is:

(Optional) An array of annotation dictionaries that shall contain indirect references to all annotations associated with the page (see 12.5, "Annotations").

You will not find an entry with a key such as Highlight, so it is expected that the returned array is null when you have this line:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);

You need to get the annotations the way you already did:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

Now you need to loop over this array and look for annotations with Subtype equal to Highlight. This type of annotation is listed in table 169 of ISO-32000-1, entitled "Annotation types".

In other words, your assumption that a page dictionary contains entries with the key Highlight was wrong. If you read the whole specification, you will also discover another false assumption you've been making: that the highlighted text is stored in the Contents entry of the annotations. This reveals a lack of understanding about the nature of annotations versus page content.

The text you are looking for is stored in the content stream of the page. The content stream of the page is independent of the page's annotations. Hence, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (stored in the QuadPoints array) and you need to use these coordinates to parse the text that is present in the page content at those coordinates.
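As a rough sketch of the coordinate step: QuadPoints holds eight numbers per quadrilateral (four x/y pairs), and a highlight that spans several lines has one quadrilateral per line. The helper below turns such a flat list into one axis-aligned box per quadrilateral. The names `QuadPointsHelper` and `ToBoundingBoxes` are hypothetical, not part of iTextSharp, and the sketch takes the min/max of the four corners so that it does not rely on any particular vertex order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an iTextSharp class): converts a flat QuadPoints
// list into one bounding box per quadrilateral.
public static class QuadPointsHelper
{
    // Each returned box is { llx, lly, urx, ury } in page coordinates.
    public static List<float[]> ToBoundingBoxes(IList<float> quadPoints)
    {
        if (quadPoints == null)
            throw new ArgumentNullException("quadPoints");
        if (quadPoints.Count % 8 != 0)
            throw new ArgumentException("QuadPoints must hold 8 numbers per quadrilateral.", "quadPoints");

        var boxes = new List<float[]>();
        for (int i = 0; i < quadPoints.Count; i += 8)
        {
            float minX = float.MaxValue, minY = float.MaxValue;
            float maxX = float.MinValue, maxY = float.MinValue;

            // Walk the four (x, y) corners of this quadrilateral.
            for (int j = 0; j < 8; j += 2)
            {
                float x = quadPoints[i + j];
                float y = quadPoints[i + j + 1];
                minX = Math.Min(minX, x);
                maxX = Math.Max(maxX, x);
                minY = Math.Min(minY, y);
                maxY = Math.Max(maxY, y);
            }
            boxes.Add(new[] { minX, minY, maxX, maxY });
        }
        return boxes;
    }
}
```

Each box could then be wrapped in an iTextSharp `Rectangle(llx, lly, urx, ury)` and handed to a `RegionTextRenderFilter`, giving one extraction region per highlighted line instead of a single region for the whole annotation.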

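Here is a complete example of extracting highlighted text using iTextSharp. It prints each Highlight annotation's Rect and QuadPoints, then uses the Rect as the region for text extraction: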

    // Requires: using System; using System.Globalization; using System.IO;
    //           using iTextSharp.text; using iTextSharp.text.pdf;
    //           using iTextSharp.text.pdf.parser;
    public void GetRectAnno()
    {
        string appRootDir = new DirectoryInfo(Environment.CurrentDirectory).Parent.Parent.FullName;
        string filePath = appRootDir + "/PDFs/" + "anot.pdf";

        try
        {
            using (PdfReader reader = new PdfReader(filePath))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    PdfDictionary page = reader.GetPageN(i);
                    PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                    if (annots == null)
                        continue; // this page has no annotations

                    foreach (PdfObject annot in annots.ArrayList)
                    {
                        // Resolve the (indirect) reference to the annotation dictionary
                        PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annot);
                        PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);

                        // Only handle highlight annotations; this form is also safe when Subtype is missing
                        if (!PdfName.HIGHLIGHT.Equals(subType))
                            continue;

                        // Show the Rect and QuadPoints of the highlight
                        Console.WriteLine("HighLight at Rectangle {0} with QuadPoints {1}",
                            annotationDic.GetAsArray(PdfName.RECT),
                            annotationDic.GetAsArray(PdfName.QUADPOINTS));

                        // Build the extraction region from the annotation's Rect entry
                        PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);
                        NumberFormatInfo fmt = CultureInfo.InvariantCulture.NumberFormat;
                        Rectangle rect = new Rectangle(
                            float.Parse(coordinates.ArrayList[0].ToString(), fmt),
                            float.Parse(coordinates.ArrayList[1].ToString(), fmt),
                            float.Parse(coordinates.ArrayList[2].ToString(), fmt),
                            float.Parse(coordinates.ArrayList[3].ToString(), fmt));

                        // Extract only the text that falls inside the rectangle
                        RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
                        ITextExtractionStrategy strategy =
                            new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);

                        // Show the extracted text on the console
                        Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
                    }
                }
            }
        }
        catch (Exception ex)
        {
            // Report the failure instead of silently swallowing it
            Console.Error.WriteLine("Failed to read highlights: " + ex.Message);
        }
    }