C#: Extracting text with PdfSharp returns unreadable content

感情迁移 posted on 2020-01-04 09:40:05


I managed to extract text from a PDF (version 1.2) using PdfSharp, following this link.

My code to extract the text:

// Requires: using PdfSharp.Pdf.Content.Objects;
private string ExtractText(CObject cObject, ref string pdfcontentstr)
{
    if (cObject is COperator)
    {
        // Only the text-showing operators Tj and TJ carry string operands.
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
                ExtractText(cOperand, ref pdfcontentstr);
        }
    }
    else if (cObject is CSequence)
    {
        // Recurse into nested sequences (for example, the array operand of TJ).
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence)
            ExtractText(element, ref pdfcontentstr);
    }
    else if (cObject is CString)
    {
        // Append the raw string value, separated by ";".
        var cString = cObject as CString;
        pdfcontentstr = pdfcontentstr + ";" + cString.Value;
    }
    return pdfcontentstr;
}
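
For context, this is roughly how a method like ExtractText above might be driven. It is a minimal sketch, not necessarily the setup used in the question. It assumes the PdfSharp package is referenced. The class name ExtractTextDemo, the helper Collect and the default path "sample.pdf" are hypothetical. The traversal is the same as in ExtractText, but it collects into a StringBuilder instead of a ref string.

```csharp
using System;
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

public static class ExtractTextDemo
{
    // Same traversal as ExtractText, collected into a StringBuilder.
    private static void Collect(CObject cObject, StringBuilder sb)
    {
        if (cObject is COperator cOperator)
        {
            // Descend only into the operands of the text-showing operators.
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var operand in cOperator.Operands)
                    Collect(operand, sb);
            }
        }
        else if (cObject is CSequence cSequence)
        {
            foreach (var element in cSequence)
                Collect(element, sb);
        }
        else if (cObject is CString cString)
        {
            // Raw string value as stored in the content stream.
            sb.Append(';').Append(cString.Value);
        }
    }

    public static void Main(string[] args)
    {
        // Hypothetical input path; pass a real file as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.pdf";
        using (PdfDocument document = PdfReader.Open(path, PdfDocumentOpenMode.ReadOnly))
        {
            for (int i = 0; i < document.PageCount; i++)
            {
                // Parse the page's content stream into a sequence of operators.
                CSequence content = ContentReader.ReadContent(document.Pages[i]);
                var sb = new StringBuilder();
                Collect(content, sb);
                Console.WriteLine("Page {0}: {1}", i + 1, sb);
            }
        }
    }
}
```

Running this against both the version 1.2 and the version 1.3 file prints the collected strings page by page, which makes the two outputs easy to compare.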

But when I try to extract text from a version 1.3 PDF (with the same content), the program returns unreadable content, for example:


The actual content in the PDF file: Block B

Can anyone help? Thanks in advance.

