C# Extract text from PDF using PdfSharp

爱一瞬间的悲伤 2020-12-05 06:12

Is it possible to extract plain text from a PDF file with PdfSharp? I don't want to use iTextSharp because of its license.

3 Answers
  • 2020-12-05 07:05

    PdfSharp provides all the tools needed to extract text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from the TJ/Tj operators.

    I've uploaded a simple implementation to GitHub.
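
    To make the TJ/Tj idea concrete, here is a minimal, self-contained sketch that scans an already-decoded content stream and collects the literal strings shown by each Tj or TJ operator. It does not use PdfSharp at all: ContentStreamDemo and ExtractShownText are hypothetical names chosen for illustration, and the scanner is deliberately simplified (it ignores hex strings, octal escapes, and font encodings, which a real page may rely on).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of PdfSharp.
public static class ContentStreamDemo
{
    private const string Delimiters = "()[]<>/";

    // Returns one string per Tj/TJ operator found in a decoded content stream.
    public static List<string> ExtractShownText(string stream)
    {
        var results = new List<string>();
        var operands = new List<string>(); // literal strings seen since the last operator
        int i = 0;
        while (i < stream.Length)
        {
            char c = stream[i];
            if (c == '(')
            {
                operands.Add(ReadLiteralString(stream, ref i));
            }
            else if (c == '/')
            {
                i++;                        // a name such as /F1 is an operand, not an operator
                SkipToken(stream, ref i);
            }
            else if (char.IsLetter(c) || c == '\'' || c == '"')
            {
                int start = i;
                SkipToken(stream, ref i);
                string op = stream.Substring(start, i - start);
                if (op == "Tj" || op == "TJ")
                    results.Add(string.Concat(operands)); // TJ fragments are joined, kerning numbers dropped
                operands.Clear();
            }
            else
            {
                i++;                        // whitespace, numbers, array brackets
            }
        }
        return results;
    }

    // Advances i to the next whitespace or delimiter character.
    private static void SkipToken(string s, ref int i)
    {
        while (i < s.Length && !char.IsWhiteSpace(s[i]) && Delimiters.IndexOf(s[i]) < 0)
            i++;
    }

    // Reads a literal string starting at '(' and handles balanced nested parentheses.
    private static string ReadLiteralString(string s, ref int i)
    {
        var sb = new StringBuilder();
        int depth = 0;
        while (i < s.Length)
        {
            char c = s[i++];
            if (c == '\\' && i < s.Length)
            {
                sb.Append(s[i++]);          // simplified: the escaped character is kept as-is
            }
            else if (c == '(')
            {
                if (depth++ > 0) sb.Append(c);
            }
            else if (c == ')')
            {
                if (--depth == 0) break;
                sb.Append(c);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

    For example, calling ContentStreamDemo.ExtractShownText on "BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET" yields the two strings "Hello" and "World". With PdfSharp, ContentReader does the parsing for you, so only the operator filtering remains.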

  • 2020-12-05 07:08

    I implemented it in a way somewhat similar to David's. Here is my code:

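        {
            // ....
            // Pages is zero-based, so index 1 is the second page of the document.
            var page = document.Pages[1];
            // Parse the page's content stream into a tree of content objects.
            CObject content = ContentReader.ReadContent(page);
            var extractedText = ExtractText(content);
            // ...
        }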
    
        // Requires: using System.Collections.Generic; using PdfSharp.Pdf.Content.Objects;
        // Recursively collects the string operands of the Tj and TJ text-showing operators.
        private IEnumerable<string> ExtractText(CObject cObject)
        {
            var textList = new List<string>();
            if (cObject is COperator cOperator)
            {
                // Only text-showing operators carry the strings we want.
                if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                    cOperator.OpCode.Name == OpCodeName.TJ.ToString())
                {
                    foreach (var cOperand in cOperator.Operands)
                    {
                        textList.AddRange(ExtractText(cOperand));
                    }
                }
            }
            else if (cObject is CSequence cSequence)
            {
                // Sequences (including TJ arrays) are walked element by element.
                foreach (var element in cSequence)
                {
                    textList.AddRange(ExtractText(element));
                }
            }
            else if (cObject is CString cString)
            {
                // Raw string value; no font encoding is applied here.
                textList.Add(cString.Value);
            }
            return textList;
        }
    
  • 2020-12-05 07:11

    Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator.

    // Requires: using System.Collections.Generic; using PdfSharp.Pdf;
    //           using PdfSharp.Pdf.Content; using PdfSharp.Pdf.Content.Objects;
    public static class PdfSharpExtensions
    {
        // Extracts the text fragments of a single page.
        public static IEnumerable<string> ExtractText(this PdfPage page)
        {
            var content = ContentReader.ReadContent(page);
            return content.ExtractText();
        }

        // Lazily yields the string operands of the Tj and TJ text-showing operators.
        public static IEnumerable<string> ExtractText(this CObject cObject)
        {
            if (cObject is COperator cOperator)
            {
                if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                    cOperator.OpCode.Name == OpCodeName.TJ.ToString())
                {
                    foreach (var cOperand in cOperator.Operands)
                        foreach (var txt in ExtractText(cOperand))
                            yield return txt;
                }
            }
            else if (cObject is CSequence cSequence)
            {
                foreach (var element in cSequence)
                    foreach (var txt in ExtractText(element))
                        yield return txt;
            }
            else if (cObject is CString cString)
            {
                // Raw string value; no font encoding is applied here.
                yield return cString.Value;
            }
        }
    }
    