Issue when trying to remove inline images from PDF with iTextSharp

I recently discovered iTextSharp.

I was investigating a performance issue with the rendering of PDF documents, and Bruno Lowagie (author of iText) kindly explained the cause: the large number of inline images in my PDF documents. He also explained the basics of removing those inline images. (My goal is possibly to show a preview of the document, with a clear notice that it is not the actual document and that the actual one could be very slow to open. I clearly understand that what I am trying to do is far from robust or safe, and that the problem must be solved at another level, e.g. when generating the documents.)

Unfortunately, I have not managed to implement the clean-up on my own :/ Here is the code I currently have (inspired by various samples found on Stack Overflow):

PdfReader pdfReader = new PdfReader(filename);
try
{  
    //pdfReader.RemoveUnusedObjects();

    var cleanfilename = filename.Replace(".pdf", ".clean.pdf");
    if (File.Exists(cleanfilename))
        File.Delete(cleanfilename);

    using (var file = new FileStream(cleanfilename, FileMode.Create))
    {
        var pdfstamper = new PdfStamper(pdfReader, file);

        for (var page = 1; page <= pdfReader.NumberOfPages; page++)
        {    
            PdfDictionary pageDict = pdfReader.GetPageN(page);
            PdfObject pageObj = pageDict.GetDirectObject(PdfName.CONTENTS);
            if (pageObj.IsStream())
            {
                CleanStream(pageObj);
            }
            else if (pageObj.IsArray())
            {
                PdfArray pageArray = pageDict.GetAsArray(PdfName.CONTENTS);

                for (int j = 0; j < pageArray.Size; j++)
                {
                    PdfIndirectReference arrayElement = (PdfIndirectReference)pageArray[j];
                    pageObj = pdfReader.GetPdfObject(arrayElement.Number);
                    if (pageObj.IsStream())
                    {
                        CleanStream(pageObj);
                    }
                }
            }
        }

        pdfstamper.Close();
    }
}
catch (Exception ex)
{
    MessageBox.Show("Error: " + ex.Message, "Error");
}
finally
{
    pdfReader.Close();
}

and

Regex regEx = new Regex("\\nBI.*?\\nEI", RegexOptions.Compiled);

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    var currentContent = Encoding.ASCII.GetString(data);    
    var newContent = regEx.Replace(currentContent, "");
    var newData = Encoding.ASCII.GetBytes(newContent);

    stream.SetData(newData);
}

It works fine on PDFs without inline images, but text disappears from pages that contain inline images.

I thought the problem was with the replacement, but as far as I can tell that is not the case. With the following pass-through code, the output document is fine:

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    stream.SetData(data);
}

However, with the following code, which in theory should not change any byte (or does it?), the output document no longer displays correctly (some content is not rendered):

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    var currentContent = Encoding.ASCII.GetString(data);    
    var newData = Encoding.ASCII.GetBytes(currentContent);

    stream.SetData(newData);
}

It looks like converting the byte array into a string and back into a byte array is not a "transparent" operation.

I really don't get it! Then again, I know I am a real beginner regarding PDF. What am I missing?

This is not at all critical (I don't really care if I can't succeed in removing those inline images). But I am now really curious about understanding what's happening :D

Here is a sample PDF: https://drive.google.com/file/d/0Byqch0ZyIb5DWDdmSTJ3SDMxMW8/edit?usp=sharing

As you've found out and as mkl and I pointed out in the comments, it's not a good idea to manipulate a content stream without taking a look at every operator in the stream. You really need to parse the syntax and interpret every single operator and every single operand.
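To make the string round trip from the question concrete: a content stream is binary data, and bytes above 0x7F (common in inline image data and in some text strings) have no ASCII representation, so an ASCII decode followed by an ASCII encode replaces them. The sketch below is only an illustration, written in Java to match the iText snippets in this answer; the class and method names are made up for the example. A single-byte encoding such as ISO-8859-1 maps every byte value to a character and back, so that round trip leaves the data unchanged (in .NET, Encoding.GetEncoding("ISO-8859-1") should play the same role).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares two byte -> string -> byte round trips.
final class AsciiRoundTrip {

    // Decode and re-encode as US-ASCII: bytes above 0x7F cannot be represented
    // and come back as '?' (0x3F).
    static byte[] viaAscii(byte[] data) {
        return new String(data, StandardCharsets.US_ASCII)
                .getBytes(StandardCharsets.US_ASCII);
    }

    // Decode and re-encode as ISO-8859-1: every byte value 0x00..0xFF maps to
    // exactly one character, so nothing is lost.
    static byte[] viaLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "BI", two binary bytes, "EI"
        byte[] data = { 'B', 'I', (byte) 0x80, (byte) 0xFF, 'E', 'I' };
        System.out.println(Arrays.toString(viaAscii(data)));       // binary bytes replaced
        System.out.println(Arrays.equals(data, viaLatin1(data)));  // unchanged
    }
}
```

This would explain why the pass-through version in the question works while the ASCII version damages pages: the damage happens before any replacement is attempted.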

Please take a look at the OCG removal functionality in the extra jar that is provided with iText, in the com.itextpdf.text.pdf.ocg package.

In the OCGParser class, we define all possible operators:

protected void populateOperators() {
    if (operators != null)
        return;
    operators = new HashMap<String, PdfOperator>();
    operators.put(DEFAULTOPERATOR, new CopyContentOperator());
    PathConstructionOrPaintingOperator opConstructionPainting = new PathConstructionOrPaintingOperator();
    operators.put("m", opConstructionPainting);
    operators.put("l", opConstructionPainting);
    operators.put("c", opConstructionPainting);
    operators.put("v", opConstructionPainting);
    operators.put("y", opConstructionPainting);
    operators.put("h", opConstructionPainting);
    operators.put("re", opConstructionPainting);
    operators.put("S", opConstructionPainting);
    operators.put("s", opConstructionPainting);
    operators.put("f", opConstructionPainting);
    operators.put("F", opConstructionPainting);
    operators.put("f*", opConstructionPainting);
    operators.put("B", opConstructionPainting);
    operators.put("B*", opConstructionPainting);
    operators.put("b", opConstructionPainting);
    operators.put("b*", opConstructionPainting);
    operators.put("n", opConstructionPainting);
    operators.put("W", opConstructionPainting);
    operators.put("W*", opConstructionPainting);
    GraphicsOperator graphics = new GraphicsOperator();
    operators.put("q", graphics);
    operators.put("Q", graphics);
    operators.put("w", graphics);
    operators.put("J", graphics);
    operators.put("j", graphics);
    operators.put("M", graphics);
    operators.put("d", graphics);
    operators.put("ri", graphics);
    operators.put("i", graphics);
    operators.put("gs", graphics);
    operators.put("cm", graphics);
    operators.put("g", graphics);
    operators.put("G", graphics);
    operators.put("rg", graphics);
    operators.put("RG", graphics);
    operators.put("k", graphics);
    operators.put("K", graphics);
    operators.put("cs", graphics);
    operators.put("CS", graphics);
    operators.put("sc", graphics);
    operators.put("SC", graphics);
    operators.put("scn", graphics);
    operators.put("SCN", graphics);
    operators.put("sh", graphics);
    XObjectOperator xObject = new XObjectOperator();
    operators.put("Do", xObject);
    InlineImageOperator inlineImage = new InlineImageOperator();
    operators.put("BI", inlineImage);
    operators.put("EI", inlineImage);
    TextOperator text = new TextOperator();
    operators.put("BT", text);
    operators.put("ID", text);
    operators.put("ET", text);
    operators.put("Tc", text);
    operators.put("Tw", text);
    operators.put("Tz", text);
    operators.put("TL", text);
    operators.put("Tf", text);
    operators.put("Tr", text);
    operators.put("Ts", text);
    operators.put("Td", text);
    operators.put("TD", text);
    operators.put("Tm", text);
    operators.put("T*", text);
    operators.put("Tj", text);
    operators.put("'", text);
    operators.put("\"", text);
    operators.put("TJ", text);
    MarkedContentOperator markedContent = new MarkedContentOperator();
    operators.put("BMC", markedContent);
    operators.put("BDC", markedContent);
    operators.put("EMC", markedContent);
}

The parse() method will look at all the content streams, including the content streams of Form XObjects (which you are overlooking if I understand your code correctly).

In the process() method, we make copies of every operator and all its operands, unless some condition tells us that part of the syntax needs to be removed.

You should adapt this code so that all operators are copied, except those that involve inline images. Your approach was a brute-force approach that was bound to damage more PDFs than it would ever fix.
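As a rough illustration of what "copy every operator except the inline images" can look like, here is a minimal stand-alone sketch in plain Java. It is not the OCGParser code and uses no iText classes; the class InlineImageStripper and its methods are hypothetical names for this example. It walks the stream token by token, so a BI that sits inside a literal string, a comment, or a name is left alone, and it drops everything from a real BI operator through its EI. Locating EI is a heuristic here (the first EI surrounded by whitespace after ID), which can misfire if the image bytes happen to contain that pattern.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: copies a content stream token by token and omits
// inline images (BI ... ID ... EI). Works on raw bytes, never on strings.
final class InlineImageStripper {

    // PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhite(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // PDF delimiter characters, which end a regular token.
    static boolean isDelim(int b) {
        return "()<>[]{}/%".indexOf(b) >= 0;
    }

    static boolean isRegular(byte[] in, int i) {
        int b = in[i] & 0xFF;
        return !isWhite(b) && !isDelim(b);
    }

    static byte[] strip(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = in.length;
        int i = 0;
        while (i < n) {
            int start = i;
            int b = in[i] & 0xFF;
            if (b == '(') {
                // Literal string: honor backslash escapes and nested parentheses.
                int depth = 0;
                do {
                    int c = in[i] & 0xFF;
                    if (c == '\\') i++;            // skip the escaped byte as well
                    else if (c == '(') depth++;
                    else if (c == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
            } else if (b == '%') {
                // Comment: runs to the end of the line.
                while (i < n && in[i] != '\n' && in[i] != '\r') i++;
            } else if (b == '/') {
                // Name object such as /BI: never an operator.
                i++;
                while (i < n && isRegular(in, i)) i++;
            } else if (isWhite(b) || isDelim(b)) {
                i++;
            } else {
                // Regular token: a number or an operator.
                while (i < n && isRegular(in, i)) i++;
                if (i - start == 2 && in[start] == 'B' && in[start + 1] == 'I') {
                    i = skipInlineImage(in, i);    // drop the whole image
                    continue;
                }
            }
            int end = Math.min(i, n);
            out.write(in, start, end - start);
            i = end;
        }
        return out.toByteArray();
    }

    // Returns the index just past the EI that closes the image starting before i.
    static int skipInlineImage(byte[] in, int i) {
        int n = in.length;
        // Advance to the ID operator that ends the image dictionary.
        while (i + 2 < n && !(isWhite(in[i - 1] & 0xFF) && in[i] == 'I'
                && in[i + 1] == 'D' && isWhite(in[i + 2] & 0xFF))) {
            i++;
        }
        i += 3; // skip "ID" and the single whitespace byte after it
        // Heuristic: the image data ends at the first whitespace-delimited EI.
        for (; i + 1 < n; i++) {
            boolean endOk = i + 2 >= n || isWhite(in[i + 2] & 0xFF);
            if (isWhite(in[i - 1] & 0xFF) && in[i] == 'E' && in[i + 1] == 'I' && endOk) {
                return i + 2;
            }
        }
        return n; // no EI found: treat the rest of the stream as image data
    }

    public static void main(String[] args) {
        String src = "q\nBT (BI test) Tj ET\nBI /W 1 /H 1 /BPC 8 /CS /G ID \u00ff EI\nQ\n";
        byte[] cleaned = strip(src.getBytes(StandardCharsets.ISO_8859_1));
        System.out.print(new String(cleaned, StandardCharsets.ISO_8859_1));
    }
}
```

The same pass would also have to be applied to the content streams of Form XObjects, and the result written back as raw bytes, as the SetData call in the question does.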

Valery Letroye

Instead of working on strings, I now work directly on the bytes:

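private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);
    var workingData = new byte[data.Length];

    // Byte patterns that mark the start and the end of an inline image.
    var BI = Encoding.ASCII.GetBytes("\nBI");
    var EI = Encoding.ASCII.GetBytes("\nEI");

    // Offset from the start of an "\nEI" match to its last byte.
    var len = EI.Length - 1;
    // Start positions of every match (assumed to be in ascending order).
    var BIpos = data.Locate(BI);
    var EIpos = data.Locate(EI);
    var pos = BIpos.Length;
    if (pos != EIpos.Length)
        throw new Exception("BI and EI operators not matching ?!");

    var skip = 0; // index of the next BI/EI pair to remove
    var newI = 0; // number of bytes kept so far
    for (var i = 0; i < data.Length; i++)
    {
        // Copy bytes that lie outside the current BI..EI range.
        if (skip >= pos || i < BIpos[skip])
        {
            workingData[newI] = data[i];
            newI++;
        }
        // The last byte of "\nEI" was reached: move on to the next pair.
        else if (i >= EIpos[skip] + len)
            skip++;
    }

    // Trim the buffer to the bytes that were kept.
    var newData = new byte[newI];
    Array.Copy(workingData, newData, newI);

    stream.SetData(newData);
}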

"Locate" is the extension method suggested here : byte[] array pattern search

Any comment on this solution is welcome!

Source: https://stackoverflow.com/questions/24725730/issue-when-trying-to-remove-inline-images-from-pdf-with-itextsharp

Tags

pdf

itextsharp