How to extract text with iTextSharp 4.1.6?

别等时光非礼了梦想. 提交于 2021-02-18 21:55:12

问题


iTextSharp 4.1.6 is the last version licensed under LGPL and is free to use in commercial purpose without paying license fees.

It might be interesting for some and for me, how to extract text with this version.

Does anyone have an idea?


回答1:


I had to hack this together manually as I was in the same boat as you. Hopefully this well help. It's probably not perfect, but I was able to get the text I needed out of the document this way. fileName is a string variable/parameter to the PDF file.

var reader = new PdfReader(fileName);

StringBuilder sb = new StringBuilder();

try
{
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        var cpage = reader.GetPageN(page);
        var content = cpage.Get(PdfName.CONTENTS);

        var ir = (PRIndirectReference)content;

        var value = reader.GetPdfObject(ir.Number);

        if (value.IsStream())
        {
            PRStream stream = (PRStream)value;

            var streamBytes = PdfReader.GetStreamBytes(stream);

            var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));

            try
            {
                while (tokenizer.NextToken())
                {
                    if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                    {
                        string str = tokenizer.StringValue;
                        sb.Append(str);
                    }
                }
            }
            finally
            {
                tokenizer.Close();
            }
        }
    }
}
finally
{
    reader.Close();
}

return sb.ToString();


来源:https://stackoverflow.com/questions/10143098/how-to-extract-text-with-itextsharp-4-1-6

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!