iTextSharp: Split pages size equals file size

情歌与酒 2020-12-10 23:02

Here is how I split a large PDF (144 MB):

public int SplitAndSave(string inputPath, string outputPath)
{
    FileInfo file = new FileInfo(inputPath);
    str         


        
1 answer
  • 2020-12-10 23:21

    In case of your document Top_Gear_Magazine_2012_09.pdf the reason is indeed the one I mentioned: All pages refer to object 2 0 R as their /Resources, and the dictionary in 2 0 obj in turn references all images in the PDF.

    To split that document into partial documents containing only the images required, you should preprocess the document by first finding out which images belong to which pages and then creating individual /Resources dictionaries for all pages.

    As you already use iText in this context, you can also use it to find out which images are required for which pages. Use the iText parser package to initially parse the PDF page by page using a RenderListener implementation whose RenderImage method simply remembers which image objects are used on the current page. (As a special twist, iText hides the name of the image XObject in question; you get the indirect object, though, and can query its object and generation number which suffices for the next step.)

    In a second step, open the document in a PdfStamper and iterate over the pages. For each page, retrieve the /Resources dictionary and copy it, but copy only those XObject references that point to one of the image objects whose object and generation numbers you remembered for that page during the first step. Finally, set the reduced copy as the /Resources dictionary of the page in question.
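
    The filtering logic of this second step can be sketched without iText at all. The block below is only a model, not the iText API: the shared XObject dictionary is assumed to be a plain map from resource name to an indirect reference written as a string such as "12 0 R", and the class name ResourcePruner and its prune method are hypothetical. In real code the map would be the /XObject entry of the page's /Resources dictionary, and the set would hold the references collected in the first step.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, library-free model of the per-page resource pruning step.
class ResourcePruner {
    /**
     * Returns a reduced copy of the shared XObject map that keeps only the
     * entries whose indirect reference was seen on the page in step one.
     * The shared map itself is left untouched, since other pages still use it.
     */
    static Map<String, String> prune(Map<String, String> sharedXObjects,
                                     Set<String> refsUsedOnPage) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : sharedXObjects.entrySet()) {
            if (refsUsedOnPage.contains(entry.getValue())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed example data: one shared dictionary referencing three images.
        Map<String, String> shared = new LinkedHashMap<>();
        shared.put("Im1", "12 0 R");
        shared.put("Im2", "13 0 R");
        shared.put("Im3", "14 0 R");

        // Assumed result of step one for a single page: only object 13 is drawn.
        System.out.println(prune(shared, Set.of("13 0 R"))); // prints {Im2=13 0 R}
    }
}
```

    Returning a fresh copy matters here: every page initially points at the same dictionary, so editing it in place would strip images from the other pages as well.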

    The resulting PDF should split just fine.

    PS: A very similar issue recently came up on the iText mailing list. In that thread the solution recipe given here was improved. To get around the difficulties caused by iText hiding the XObject name, I would now propose intervening before the name is lost by using a different ContentOperator for "Do". Here is the Java version:

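    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import com.itextpdf.text.pdf.PdfLiteral;
    import com.itextpdf.text.pdf.PdfName;
    import com.itextpdf.text.pdf.PdfObject;
    import com.itextpdf.text.pdf.parser.ContentOperator;
    import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;

    // Replacement for the default "Do" operator: records each XObject name instead of rendering it.
    class Do implements ContentOperator
    {
        final List<PdfName> names = new ArrayList<PdfName>();
        public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList<PdfObject> operands) throws IOException
        {
            // The first operand of "Do" is the resource name of the XObject being painted.
            PdfName xobjectName = (PdfName)operands.get(0);
            names.add(xobjectName);
        }
    }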
    

    This content operator simply collects the names of the used xobjects, i.e. the xobject resources to keep for the given page.
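
    To illustrate what such an operator sees, here is a deliberately simplified, library-free sketch. It is not iText's parser: it splits an already decoded content stream on whitespace and treats the name token directly before each Do as the XObject name. It assumes no string literals, comments, or inline images that happen to contain the token Do, and it does not descend into form XObjects. The class name XObjectNameScan and its method are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical whitespace tokenizer; a real parser must handle strings,
// comments, inline images and nested form XObjects.
class XObjectNameScan {
    /** Collects the name operand that precedes every "Do" operator token. */
    static Set<String> usedXObjectNames(String contentStream) {
        Set<String> names = new LinkedHashSet<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // In PDF syntax operands come first, so "/Im2 Do" paints XObject Im2.
            if (tokens[i].equals("Do") && tokens[i - 1].startsWith("/")) {
                names.add(tokens[i - 1].substring(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumed example content stream that paints two images.
        String page = "q 200 0 0 300 36 400 cm /Im2 Do Q q 100 0 0 100 36 36 cm /Im5 Do Q";
        System.out.println(usedXObjectNames(page)); // prints [Im2, Im5]
    }
}
```

    The collected names are exactly the keys to keep when building the reduced /XObject dictionary for that page.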
