Error while retrieving images from PDF using iText

Question

I have an existing PDF from which I want to retrieve images

NOTE:

In the Documentation, this is the RESULT variable

public static final String RESULT = "results/part4/chapter15/Img%s.%s";

I don't understand why this image is needed. I just want to extract the images from my PDF file.

So now, when I use MyImageRenderListener listener = new MyImageRenderListener(RESULT);

I am getting the error:

results\part4\chapter15\Img16.jpg (The system cannot find the path specified)

This is the code that I am having.

    package part4.chapter15;

    import java.io.IOException;

    import com.itextpdf.text.DocumentException;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfReaderContentParser;

    /**
     * Extracts images from a PDF file.
     */
    public class ExtractImages {

        /** The PDF from which the images are extracted. */
        public static final String RESOURCE = "resources/pdfs/samplefile.pdf";
        /** Pattern for the image file names: object number, then file extension. */
        public static final String RESULT = "results/part4/chapter15/Img%s.%s";

        /**
         * Parses a PDF and extracts all the images.
         * @param filename the source PDF
         */
        public void extractImages(String filename)
            throws IOException, DocumentException {
            PdfReader reader = new PdfReader(filename);
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            MyImageRenderListener listener = new MyImageRenderListener(RESULT);
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                parser.processContent(i, listener);
            }
            reader.close();
        }

        /**
         * Main method.
         * @param    args    no arguments needed
         * @throws DocumentException
         * @throws IOException
         */
        public static void main(String[] args) throws IOException, DocumentException {
            new ExtractImages().extractImages(RESOURCE);
        }
    }

Answer 1:

You have two questions and the answer to the first question is the key to the answer of the second.

Question 1:

You refer to:

public static final String RESULT = "results/part4/chapter15/Img%s.%s";

And you ask: why is this image needed?

That question is wrong, because Img%s.%s is not the filename of an image; it is a pattern for the filenames of the images. While parsing, iText will detect images in the PDF. These images are stored in numbered objects (e.g. object 16) and they can be exported in different formats (e.g. jpg, png, ...).

Suppose that an image is stored in object 16 and that this image is a jpg, then the pattern will resolve to Img16.jpg.
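
As a minimal sketch of that resolution, using only String.format from the JDK (the class name ImagePathPattern and the helper resolve are hypothetical names chosen for this illustration; the object number 16 and the jpg type are the values from the example above):

```java
public class ImagePathPattern {

    // Same pattern as RESULT in the question.
    static final String RESULT = "results/part4/chapter15/Img%s.%s";

    // Hypothetical helper: the first %s receives the object number,
    // the second %s receives the file type.
    static String resolve(String pattern, int objectNumber, String fileType) {
        return String.format(pattern, objectNumber, fileType);
    }

    public static void main(String[] args) {
        // prints results/part4/chapter15/Img16.jpg
        System.out.println(resolve(RESULT, 16, "jpg"));
    }
}
```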

Question 2:

Why do I get an error:

results\part4\chapter15\Img16.jpg (The system cannot find the path specified)

In your PDF, there's a jpg stored in object 16. You are asking iText to store that image using this path: results\part4\chapter15\Img16.jpg (as explained in my answer to Question 1). However, your working directory doesn't have the subdirectories results\part4\chapter15\, hence a FileNotFoundException (a subclass of IOException) is thrown.
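
The failure can be reproduced without iText. The following is a minimal sketch using only the JDK; the class MissingDirectoryDemo and the method failsToOpen are hypothetical names, and a fresh temporary directory stands in for a working directory that lacks the results\part4\chapter15\ subdirectories:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MissingDirectoryDemo {

    // Returns true if the file could not be opened for writing.
    static boolean failsToOpen(File target) throws IOException {
        try (FileOutputStream os = new FileOutputStream(target)) {
            return false;
        } catch (FileNotFoundException e) {
            // Thrown by the constructor, for instance when a parent directory is missing.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // An empty directory playing the role of the working directory.
        Path base = Files.createTempDirectory("img-demo");
        File target = base.resolve("results/part4/chapter15/Img16.jpg").toFile();
        // prints true
        System.out.println(failsToOpen(target));
    }
}
```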

What is the general problem?

You have copy/pasted the ExtractImages example I wrote for my book "iText in Action - Second Edition", but:

You didn't read that book, so you have no idea what that code is supposed to do.
You aren't telling the readers on StackOverflow that this example depends on the MyImageRenderListener class, which is where all the magic happens.

How can you solve your problem?

Option 1:

Change RESULT like this:

public static final String RESULT = "Img%s.%s";

Now the images will be stored in your working directory.

Option 2:

Adapt the MyImageRenderListener class, more specifically this method:

public void renderImage(ImageRenderInfo renderInfo) {
    try {
        String filename;
        FileOutputStream os;
        PdfImageObject image = renderInfo.getImage();
        if (image == null) return;
        filename = String.format(path,
            renderInfo.getRef().getNumber(), image.getFileType());
        os = new FileOutputStream(filename);
        os.write(image.getImageAsBytes());
        os.flush();
        os.close();
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}

iText calls this method whenever an image is encountered. It passes an ImageRenderInfo object to this method that contains plenty of information about that image.

In this implementation, we store the image bytes as a file. This is how we create the path to that file:

String.format(path,
     renderInfo.getRef().getNumber(), image.getFileType())

As you can see, the pattern stored in RESULT is used in such a way that the first occurrence of %s is replaced with a number and the second occurrence with a file extension.

You could easily adapt this method so that it stores the images as byte[] in a List if that is what you want.
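
Another possible adaptation, which is not part of the original example, would be to keep the original RESULT pattern and create the missing directories before writing the file. A minimal sketch using only the JDK follows; the class ImageFileWriter and its write method are hypothetical names, and the byte array stands in for what image.getImageAsBytes() returns in renderImage:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ImageFileWriter {

    // Writes the bytes to filename, creating parent directories if needed.
    static Path write(String filename, byte[] bytes) throws IOException {
        Path target = Paths.get(filename);
        Path parent = target.getParent();
        if (parent != null) {
            // Does nothing if the directories already exist.
            Files.createDirectories(parent);
        }
        // try-with-resources closes the stream even if the write fails.
        try (OutputStream os = Files.newOutputStream(target)) {
            os.write(bytes);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the working directory.
        Path base = Files.createTempDirectory("img-demo");
        String relative = String.format("results/part4/chapter15/Img%s.%s", 16, "jpg");
        Path written = write(base.resolve(relative).toString(), new byte[] {1, 2, 3});
        System.out.println("Wrote " + Files.size(written) + " bytes to " + written);
    }
}
```

Inside renderImage, the same idea would amount to creating the parent directory of filename before the FileOutputStream is opened.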

Source: https://stackoverflow.com/questions/31962472/error-while-retrieving-images-from-pdf-using-itext

Tags

java

pdf

itext

pdf-parsing