Itextsharp text extraction

I'm using itextsharp on vb.net to get the text content from a pdf file. The solution works fine for some files but not for other even quite simple ones. The problem is that the token stringvalue is set to null (a set of empty square boxes)

token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
    While token.NextToken()
        tknType = token.TokenType()
        tknValue = token.StringValue

I can meassure the length of the content but I cannot get the actual string content.

I realized that this happens depending on the font of the pdf. If I create a pdf using either Acrobat or PdfCreator with Courier (that by the way is the default font in my visual studio editor) I can get all the text content. If the same pdf is built using a different font I got the empty square boxes.

Now the question is, How can I extract text regardless of the font setting?

Thanks

complementary for Mark's answer that helps me a lot .iTextSharp implementation namespaces and classes are a bit different from java version

 public static string GetTextFromAllPages(String pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath); 

        StringWriter output = new StringWriter();  

        for (int i = 1; i <= reader.NumberOfPages; i++) 
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));

        return output.ToString();
    }

Check out PdfTextExtractor.

String pageText = 
  PdfTextExtractor.getTextFromPage(myReader, pageNum);

String pageText = 
  PdfTextExtractor.getTextFromPage(myReader, pageNum, new LocationTextExtractionStrategy());

Both require fairly recent versions of iText[Sharp]. Actually parsing the content stream yourself is just reinventing the wheel at this point. Spare yourself some pain and let iText do it for you.

PdfTextExtractor will handle all the different font/encoding issues for you... all the ones that can be handled anyway. If you can't copy/paste from Reader accurately, then there's not enough information present in the PDF to get character information from the content stream.

Here is a variant with iTextSharp.text.pdf.PdfName.ANNOTS and iTextSharp.text.pdf.PdfName.CONTENT if some one need it.

        string strFile = @"C:\my\path\tothefile.pdf";
        iTextSharp.text.pdf.PdfReader pdfRida = new iTextSharp.text.pdf.PdfReader(strFile);
        iTextSharp.text.pdf.PRTokeniser prtTokeneiser;
        int pageFrom = 1;
        int pageTo = pdfRida.NumberOfPages;
        iTextSharp.text.pdf.PRTokeniser.TokType tkntype ;
        string tknValue;

        for (int i = pageFrom; i <= pageTo; i++) 
        {
            iTextSharp.text.pdf.PdfDictionary cpage = pdfRida.GetPageN(i);
            iTextSharp.text.pdf.PdfArray cannots = cpage.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

            if(cannots!=null)
                foreach (iTextSharp.text.pdf.PdfObject oAnnot in cannots.ArrayList) 
                {
                    iTextSharp.text.pdf.PdfDictionary cAnnotationDictironary = (iTextSharp.text.pdf.PdfDictionary)pdfRida.GetPdfObject(((iTextSharp.text.pdf.PRIndirectReference)oAnnot).Number);

                    iTextSharp.text.pdf.PdfObject moreshit = cAnnotationDictironary.Get(iTextSharp.text.pdf.PdfName.CONTENTS);
                    if (moreshit != null && moreshit.GetType() == typeof(iTextSharp.text.pdf.PdfString)) 
                    {
                        string cStringVal = ((iTextSharp.text.pdf.PdfString)moreshit).ToString();
                        if (cStringVal.ToUpper().Contains("LOS 8"))
                        { // DO SOMETHING FUN

                        }
                    }
                }
        }
        pdfRida.Close();

来源：https://stackoverflow.com/questions/4711134/itextsharp-text-extraction

标签

itextsharp