iTextSharp text extraction

Posted by 魔方 西西 on 2019-11-28 06:28:35

This complements Mark's answer, which helped me a lot. The namespaces and class names in iTextSharp are a bit different from the Java version:

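    // Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    public static string GetTextFromAllPages(string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            StringWriter output = new StringWriter();

            // PDF page numbers are 1-based
            for (int i = 1; i <= reader.NumberOfPages; i++)
                output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));

            return output.ToString();
        }
        finally
        {
            // Release the file even if extraction throws
            reader.Close();
        }
    }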

Check out PdfTextExtractor.

String pageText = 
  PdfTextExtractor.getTextFromPage(myReader, pageNum);

or

String pageText = 
  PdfTextExtractor.getTextFromPage(myReader, pageNum, new LocationTextExtractionStrategy());

Both require fairly recent versions of iText[Sharp]. Actually parsing the content stream yourself is just reinventing the wheel at this point. Spare yourself some pain and let iText do it for you.

PdfTextExtractor will handle all the different font/encoding issues for you, or at least all the ones that can be handled. If you can't copy/paste accurately from Adobe Reader, then there's not enough information present in the PDF to get character information from the content stream.
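To give a feel for why hand-parsing is painful, here is a toy sketch in plain Java (no iText involved) that pulls literal strings shown with the Tj operator out of a content stream fragment. The class name NaiveTjExtractor and the sample stream are made up for illustration. The sketch deliberately ignores TJ arrays, hex strings, escape sequences, font encodings and text positioning, which are the things PdfTextExtractor takes care of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy class, not part of iText.
public class NaiveTjExtractor {
    // Matches a literal string without escapes or nested parentheses,
    // immediately followed by the Tj (show text) operator.
    private static final Pattern TJ = Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    static List<String> extract(String contentStream) {
        List<String> out = new ArrayList<>();
        Matcher m = TJ.matcher(contentStream);
        while (m.find()) {
            // Raw string bytes as written in the stream; no font decoding is applied.
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up fragment: one Tj string, one TJ array, one hex string.
        String stream = "BT /F1 12 Tf 72 720 Td (Hello) Tj "
                      + "0 -14 Td [(Wor) -20 (ld)] TJ <0048> Tj ET";
        System.out.println(extract(stream)); // prints [Hello]
    }
}
```

Only "Hello" is recovered. The kerned TJ array and the hex string are missed entirely, and even the recovered string would be wrong for a font with a custom encoding.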

Here is a variant that reads annotation text using iTextSharp.text.pdf.PdfName.ANNOTS and iTextSharp.text.pdf.PdfName.CONTENTS, in case someone needs it.

        string strFile = @"C:\my\path\tothefile.pdf";
        iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(strFile);
        int pageFrom = 1;
        int pageTo = pdfReader.NumberOfPages;

        for (int i = pageFrom; i <= pageTo; i++)
        {
            iTextSharp.text.pdf.PdfDictionary page = pdfReader.GetPageN(i);
            iTextSharp.text.pdf.PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

            if (annots != null)
                foreach (iTextSharp.text.pdf.PdfObject annot in annots.ArrayList)
                {
                    // Resolve the entry whether it is an indirect reference or a direct dictionary
                    iTextSharp.text.pdf.PdfDictionary annotDict = iTextSharp.text.pdf.PdfReader.GetPdfObject(annot) as iTextSharp.text.pdf.PdfDictionary;
                    if (annotDict == null) continue;

                    // /Contents holds the annotation's text; null if absent or not a string
                    iTextSharp.text.pdf.PdfString contents = annotDict.GetAsString(iTextSharp.text.pdf.PdfName.CONTENTS);
                    if (contents != null)
                    {
                        string cStringVal = contents.ToUnicodeString();
                        if (cStringVal.ToUpper().Contains("LOS 8"))
                        { // DO SOMETHING FUN

                        }
                    }
                }
        }
        pdfReader.Close();