C# Extract text from PDF using PdfSharp

爱一瞬间的悲伤 2020-12-05 06:12

Is it possible to extract plain text from a PDF file with PdfSharp? I don't want to use iTextSharp because of its license.

3 Answers
  •  失恋的感觉
    2020-12-05 07:11

    I took Sergio's answer and turned it into extension methods. I also changed the string accumulation into an iterator, so text fragments are yielded one at a time.

    using System.Collections.Generic;
    using PdfSharp.Pdf;
    using PdfSharp.Pdf.Content;
    using PdfSharp.Pdf.Content.Objects;

    public static class PdfSharpExtensions
    {
        // Yields the text fragments found in the page's content stream.
        public static IEnumerable<string> ExtractText(this PdfPage page)
        {
            var content = ContentReader.ReadContent(page);
            return content.ExtractText();
        }

        // Recursively walks a content object and yields the string operands
        // of the text-showing operators Tj and TJ.
        public static IEnumerable<string> ExtractText(this CObject cObject)
        {
            if (cObject is COperator cOperator)
            {
                if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                    cOperator.OpCode.Name == OpCodeName.TJ.ToString())
                {
                    foreach (var cOperand in cOperator.Operands)
                        foreach (var txt in ExtractText(cOperand))
                            yield return txt;
                }
            }
            else if (cObject is CSequence cSequence)
            {
                // Covers the page content itself and the arrays passed to TJ.
                foreach (var element in cSequence)
                    foreach (var txt in ExtractText(element))
                        yield return txt;
            }
            else if (cObject is CString cString)
            {
                yield return cString.Value;
            }
        }
    }
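
    For completeness, here is a minimal sketch of how the extension methods might be called. It assumes the PdfSharp package is referenced and that `PdfSharpExtensions` above is in scope; the `input.pdf` path and the `Program` class are placeholders for illustration. Each page's fragments are written without separators, so word spacing may not match the rendered page.

    ```csharp
    using System;
    using PdfSharp.Pdf;
    using PdfSharp.Pdf.IO;

    public static class Program
    {
        public static void Main(string[] args)
        {
            // "input.pdf" is a placeholder; pass a real path as the first argument.
            var path = args.Length > 0 ? args[0] : "input.pdf";

            // Open read-only, since the document is only inspected.
            using (PdfDocument document = PdfReader.Open(path, PdfDocumentOpenMode.ReadOnly))
            {
                foreach (PdfPage page in document.Pages)
                {
                    // Fragments are produced lazily by the iterator.
                    foreach (string fragment in page.ExtractText())
                        Console.Write(fragment);

                    Console.WriteLine();
                }
            }
        }
    }
    ```

    Note that the strings come straight from the content stream. For PDFs whose fonts use custom or multi-byte encodings, the output may not be readable without extra decoding.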
    
