iTextSharp: Convert PdfObject to PdfStream

Question

I am attempting to pull some font streams out of a pdf file (legality is not an issue, as my company has paid for the rights to display these documents in their original manner - and this requires a conversion which requires the extraction of the fonts).

I had been using muTool, but it also extracts every image in the PDF, with no way to skip them, and some of these documents contain tens of thousands of images. So I took to the web for answers and came up with the following approach:

I collect all of the fonts into a font dictionary and then attempt to convert them into PdfStreams (to apply FlateDecode and then write them to files) using the following code:

    PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject((PdfObject)cItem.pObj);
    PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
    try
    {
        int xrefIdx = ((PRIndirectReference)((PdfObject)cItem.pObj)).Number;
        PdfObject pdfObj = (PdfObject)reader.GetPdfObject(xrefIdx);
        PdfStream str = (PdfStream)(pdfObj);

        byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);
    }
    catch { }

But, when I get to PdfStream str = (PdfStream)(pdfObj); I get the error below:

    Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary' 
    to type 'iTextSharp.text.pdf.PdfStream'.

Now, I know that PdfDictionary derives from (extends) PdfObject so I am uncertain as to what I am doing incorrectly here. Someone please help - I either need advice on patching this code, or if entirely incorrect, either code to extract the stream properly or direction to a place with said code.

Thank you.

EDIT: My revised code is here:

    public static void GetStreams(PdfReader pdf)
    {
        int page_count = pdf.NumberOfPages;
        for (int i = 1; i <= page_count; i++)
        {
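            PdfDictionary pg = pdf.GetPageN(i);
            // Resolve the page's resource dictionary, which holds the /Font entry.
            PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
            PdfDictionary fObj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.FONT));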
            if (fObj != null)
            {
                foreach (PdfName name in fObj.Keys)
                {
                    PdfObject obj = fObj.Get(name);
                    if (obj.IsIndirect())
                    {
                        PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                        PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));

                        int xrefIdx = ((PRIndirectReference)obj).Number;
                        PdfObject pdfObj = pdf.GetPdfObject(xrefIdx);
                        if (pdfObj != null && pdfObj.IsStream())
                        {
                            PdfStream str = (PdfStream)(pdfObj);
                            byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);
                        }
                    }
                }
            }
        }
    }

However, I am still receiving the same error - so I am assuming that this is an incorrect method of retrieving font streams. The same document has had fonts extracted using muTool successfully - so I know the problem is me and not the pdf.

Answer 1:

There are at least two things wrong in your code:

1. You cast an object to a stream without first performing this check: if (pdfObj != null && pdfObj.IsStream()) { /* cast to stream */ }. Since the error message says you are trying to cast a dictionary to a stream, I'm 99% sure that the second part of the check returns false, whereas pdfObj.IsDictionary() probably returns true.
2. You extract a stream from PdfReader and then cast that object to a PdfStream instead of a PRStream. PdfStream is the object we use to create PDFs; PRStream is the object we use when we inspect PDFs using PdfReader.
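
The two points above can be combined into one guarded helper. This is only a minimal sketch, assuming iTextSharp 5.x; the class and method names (StreamHelper, TryGetRawStreamBytes) are made up for illustration and are not part of the library.

```csharp
using iTextSharp.text.pdf;

public static class StreamHelper
{
    // Hypothetical helper: returns the raw (still encoded) bytes if the
    // object is a stream, or null if it is anything else.
    public static byte[] TryGetRawStreamBytes(PdfObject candidate)
    {
        // Resolve an indirect reference to the object it points at.
        PdfObject resolved = PdfReader.GetPdfObject(candidate);

        // A font dictionary is a dictionary, not a stream, so it stops here.
        if (resolved == null || !resolved.IsStream())
        {
            return null;
        }

        // Objects read through PdfReader are PRStream instances.
        PRStream stream = (PRStream)resolved;
        return PdfReader.GetStreamBytesRaw(stream);
    }
}
```

Called with the font dictionary from the question, this helper returns null instead of throwing, which matches the observation that the object is a dictionary and not a stream.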

You should fix this problem first.

Now for your general question. If you read ISO-32000-1, you'll discover that a font is defined using a font dictionary. If the font is embedded (fully or partly), the font dictionary will refer, through its font descriptor, to a stream. This stream can contain the full font information, but most of the time you'll only get a subset of the glyphs (because that's best practice when creating a PDF).
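
To make that structure concrete, here is a rough sketch of how one might walk from a font dictionary to its embedded font stream. It assumes iTextSharp 5.x; FontStreamFinder and FindFontFile are hypothetical names chosen for this illustration, and the code ignores edge cases such as Type 3 fonts, which have no font file at all.

```csharp
using iTextSharp.text.pdf;

public static class FontStreamFinder
{
    // Hypothetical helper: returns the embedded font program stream for a
    // font dictionary, or null if the font is not embedded.
    public static PRStream FindFontFile(PdfDictionary font)
    {
        if (font == null)
        {
            return null;
        }

        // Composite (Type 0) fonts keep the descriptor on the descendant font.
        PdfArray descendants = font.GetAsArray(PdfName.DESCENDANTFONTS);
        if (descendants != null && descendants.Size > 0)
        {
            return FindFontFile(descendants.GetAsDict(0));
        }

        // No descriptor usually means the font is not embedded.
        PdfDictionary descriptor = font.GetAsDict(PdfName.FONTDESCRIPTOR);
        if (descriptor == null)
        {
            return null;
        }

        // The font program sits under one of these three keys.
        PdfName[] keys = { PdfName.FONTFILE, PdfName.FONTFILE2, PdfName.FONTFILE3 };
        foreach (PdfName key in keys)
        {
            PdfObject obj = descriptor.GetDirectObject(key);
            if (obj != null && obj.IsStream())
            {
                return (PRStream)obj;
            }
        }
        return null;
    }
}
```

If the result is not null, PdfReader.GetStreamBytes(stream) should return the bytes with filters such as FlateDecode already applied, whereas PdfReader.GetStreamBytesRaw(stream) returns them as stored in the file.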

Take a look at the example ListFontFiles from my book "iText in Action" to get a first impression of how fonts are organized inside a PDF. You'll need to combine this example with ISO-32000-1 to find more info about the difference between FONTFILE, FONTFILE2 and FONTFILE3.
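
As a quick orientation before reading the spec: in ISO-32000-1, FontFile holds a Type 1 font program, FontFile2 holds a TrueType font program, and FontFile3 holds a font program whose format is given by the Subtype entry of the stream dictionary. The small lookup below summarizes this. It is a plain C# sketch with no iText dependency; FontFileKinds and Describe are hypothetical names, and the keys are passed as plain strings without the leading slash.

```csharp
using System;

public static class FontFileKinds
{
    // key: "FontFile", "FontFile2" or "FontFile3" (no leading slash).
    // subtype: the /Subtype of the stream dictionary; only used for FontFile3.
    public static string Describe(string key, string subtype)
    {
        switch (key)
        {
            case "FontFile":
                return "Type 1 font program";
            case "FontFile2":
                return "TrueType font program";
            case "FontFile3":
                switch (subtype)
                {
                    case "Type1C":
                        return "Type 1 font in Compact Font Format (CFF)";
                    case "CIDFontType0C":
                        return "Type 0 CIDFont in Compact Font Format (CFF)";
                    case "OpenType":
                        return "OpenType font program";
                    default:
                        return "unknown FontFile3 subtype";
                }
            default:
                return "not a font file key";
        }
    }
}
```

Knowing which of these you have matters when you write the extracted bytes to disk, because each kind needs a different container format or file extension to be usable by other tools.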

I've also written an example that replaces an unembedded font with a font file: EmbedFontPostFacto. This example serves as an introduction to explain how difficult font replacement is.

Please go to http://tinyurl.com/iiacsCH16 if you need the C# version of the book samples.

Source: https://stackoverflow.com/questions/16343922/itextsharp-convert-pdfobject-to-pdfstream

Tags

pdf

fonts

stream

itextsharp

itext