How can I extract images and their metadata from PDFs?

Is it possible to use Java to extract images from a PDF file and export them to a specific folder without losing their original creation and modification dates? I tried to achieve this with iText and PDFBox but had no success. Any ideas or examples are welcome.

Images do not contain metadata and are stored as raw data which needs to be assembled into images. I wrote two blog posts explaining how image data is stored in a PDF file: https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-how-are-images-stored/ and https://blog.idrsolutions.com/2010/09/understanding-the-pdf-file-format-images/
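
To make "assembling raw data into an image" concrete, here is a minimal sketch using only the JDK. It assumes the simplest case: a Flate-compressed stream with no predictor, holding 8-bit DeviceRGB samples packed three bytes per pixel. The class and method names (`RawImageAssembler`, `inflate`, `fromRgb`) are made up for illustration; real PDFs also use other color spaces, bit depths and filters that this does not handle.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class RawImageAssembler {

    /** Inflates a Flate (zlib) compressed stream into raw sample bytes. */
    static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                raw.write(buffer, 0, n);
            }
        }
        return raw.toByteArray();
    }

    /** Builds an image from packed 8-bit RGB samples, stored row by row. */
    static BufferedImage fromRgb(byte[] samples, int width, int height) {
        if (samples.length != width * height * 3) {
            throw new IllegalArgumentException(
                    "expected " + (width * height * 3) + " bytes, got " + samples.length);
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        int i = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int r = samples[i++] & 0xFF;
                int g = samples[i++] & 0xFF;
                int b = samples[i++] & 0xFF;
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return image;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a stream read from a PDF: one red and one blue pixel, deflated.
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(packed)) {
            z.write(new byte[] { (byte) 255, 0, 0, 0, 0, (byte) 255 });
        }
        BufferedImage image = fromRgb(inflate(packed.toByteArray()), 2, 1);
        System.out.printf("%06X %06X%n",
                image.getRGB(0, 0) & 0xFFFFFF, image.getRGB(1, 0) & 0xFFFFFF);
    }
}
```

The width, height and color space come from the image's dictionary in the PDF, not from the stream itself, which is why they are passed in separately here.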

I disagree with the other answers and have a proof of concept for your question. You can extract the XMP metadata of images using PDFBox (the code below uses the PDFBox 1.x API) in the following way:

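// Imports needed by this method and the helper methods (PDFBox 1.x)
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public void getXMPInformation() {
    // Open PDF document; stop here if it cannot be loaded
    PDDocument document = null;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }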
    // Get all pages and loop through them
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    while( iter.hasNext() ) {
        PDPage page = (PDPage)iter.next();
        PDResources resources = page.getResources();            
        Map images = null;
        // Get all Images on page
        try {
            images = resources.getImages();
        } catch (IOException e) {
            e.printStackTrace();
        }
        if( images != null ) {
            // Check all images for metadata
            Iterator imageIter = images.keySet().iterator();
            while( imageIter.hasNext() ) {
                String key = (String)imageIter.next();
                PDXObjectImage image = (PDXObjectImage)images.get( key );
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found an image: Analyzing for Metadata");
                if (metadata == null) {
                    System.out.println("No Metadata found for this image.");
                } else {
                    InputStream xmlInputStream = null;
                    try {
                        xmlInputStream = metadata.createInputStream();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                    try {
                        System.out.println("--------------------------------------------------------------------------------");
                        String mystring = convertStreamToString(xmlInputStream);
                        System.out.println(mystring);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Export the image; write2file appends the suffix to the name itself
                String name = getUniqueFileName( key, image.getSuffix() );
                System.out.println( "Writing image:" + name );
                try {
                    image.write2file( name );
                } catch (IOException e) {
                    // Report the failure instead of silently skipping the image
                    e.printStackTrace();
                }
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
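    }
    // Release the file handle held by the document
    try {
        document.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}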

And the "Helper methods":

public String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to a String we use the BufferedReader.readLine()
     * method. We iterate until the BufferedReader returns null, which means
     * there's no more data to read. Each line is appended to a StringBuilder
     * and the result is returned as a String.
     */
    if (is != null) {
        StringBuilder sb = new StringBuilder();
        String line;

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } finally {
            is.close();
        }
        return sb.toString();
    } else {       
        return "";
    }
}

private String getUniqueFileName( String prefix, String suffix ) {
    /*
     * imageCounter is an instance field that counts the extracted images,
     * starting at 0. It is incremented on every attempt, so the loop also
     * terminates when a file with the candidate name already exists.
     */
    String uniqueName = null;
    File f = null;
    while( f == null || f.exists() ) {
        uniqueName = prefix + "-" + imageCounter;
        f = new File( uniqueName + "." + suffix );
        imageCounter++;
    }
    return uniqueName;
}

Note: This is a quick and dirty proof of concept, not well-styled code.

The images must already have XMP metadata when they are placed in InDesign, before the PDF document is built. The XMP metadata can be set with Photoshop, for example. Be aware that not all IPTC/Exif/... information is converted into XMP metadata; only a small number of fields are.

I'm using this method on JPG and PNG images placed in PDFs built with InDesign. It works well, and I can get all the image information from the finished PDFs after the production steps (picture coating).
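
Once the XMP packet is available as a string, the dates the question asks about can be pulled out of it. The following is a rough sketch using only the JDK's XML parser; the class name `XmpDates` and its `extract` method are invented for this example. It assumes the packet uses the standard XMP basic namespace and looks for `xmp:CreateDate`, `xmp:ModifyDate` and `xmp:MetadataDate`, either as child elements or as attributes of `rdf:Description`. Whether an image's packet actually contains these properties depends on the tool that wrote it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class XmpDates {
    private static final String XMP_NS = "http://ns.adobe.com/xap/1.0/";
    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    private static final String[] DATE_PROPERTIES = { "CreateDate", "ModifyDate", "MetadataDate" };

    /** Returns the date properties found in the packet, as the raw strings stored there. */
    static Map<String, String> extract(String xmpPacket) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xmpPacket.getBytes(StandardCharsets.UTF_8)));

        NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
        Map<String, String> dates = new LinkedHashMap<>();
        for (String name : DATE_PROPERTIES) {
            // Form 1: <xmp:CreateDate>2011-04-13T18:30:00+02:00</xmp:CreateDate>
            NodeList elements = doc.getElementsByTagNameNS(XMP_NS, name);
            if (elements.getLength() > 0) {
                dates.put(name, elements.item(0).getTextContent().trim());
                continue;
            }
            // Form 2: <rdf:Description xmp:CreateDate="2011-04-13T18:30:00+02:00" ...>
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element description = (Element) descriptions.item(i);
                if (description.hasAttributeNS(XMP_NS, name)) {
                    dates.put(name, description.getAttributeNS(XMP_NS, name));
                    break;
                }
            }
        }
        return dates;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample packet, not taken from a real PDF.
        String packet =
                "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
              + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
              + "<rdf:Description rdf:about=\"\" xmlns:xmp=\"" + XMP_NS + "\""
              + " xmp:ModifyDate=\"2011-04-14T09:00:00+02:00\">"
              + "<xmp:CreateDate>2011-04-13T18:30:00+02:00</xmp:CreateDate>"
              + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(extract(packet));
    }
}
```

The values are returned as strings on purpose: XMP allows truncated dates such as a bare year, so parsing them into a date type needs its own handling.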

The original creation and modification dates are generally not saved when the image is embedded into the PDF. Just the raw pixel data is compressed and saved. However, according to Wikipedia:

Raster images in PDF (called Image XObjects) are represented by dictionaries with an associated stream.

The dictionary contains metadata, among which you might find the dates.
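
If a date does turn up in the image's dictionary or metadata, it can be stamped onto the exported file so the file system shows it. A small sketch with `java.nio.file` follows; the class name `FileTimestamps` and the sample date are assumptions for illustration. Note that many file systems (typically on Linux) do not allow the creation time to be set, in which case that part of the call has no effect and only the modification time changes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.time.Instant;

final class FileTimestamps {

    /** Sets the creation and last-modified times of an exported file. */
    static void apply(Path file, Instant created, Instant modified) throws IOException {
        BasicFileAttributeView view =
                Files.getFileAttributeView(file, BasicFileAttributeView.class);
        // Argument order is (lastModified, lastAccess, create); null leaves a value unchanged.
        view.setTimes(FileTime.from(modified), null, FileTime.from(created));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an image written by the extraction code.
        Path exported = Files.createTempFile("extracted-image", ".jpg");
        Instant created = Instant.parse("2011-04-13T16:30:00Z");
        Instant modified = Instant.parse("2011-04-14T07:00:00Z");

        apply(exported, created, modified);

        BasicFileAttributes attrs = Files.readAttributes(exported, BasicFileAttributes.class);
        System.out.println("created:  " + attrs.creationTime());
        System.out.println("modified: " + attrs.lastModifiedTime());
        Files.delete(exported);
    }
}
```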

Short Answer

Maybe, but probably not.

Long Answer

PDF natively supports JPEG, JPEG2000 (which is growing more common), CCITT (fax) Group 3 and 4, and JBIG2 (really rare). Images in these formats can be copied byte-for-byte into the PDF, preserving any metadata WITHIN THE FILE. Creation/change dates are generally part of the file system, not the image.

JPEG: Yes. A JPEG file can carry Exif or XMP metadata in its APP1 marker segments, so a JPEG that was copied in byte-for-byte may still have it.

JPEG2000: Yep. Lots of stuff in there potentially

CCITT: doesn't look that way.

JBIG2: Err.. I think so, but it's none too clear from the specs I've just skimmed.
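
For the JPEG case, one way to check whether an extracted stream still carries metadata is to walk its marker segments and look at the APP1 payloads. This is only a sketch with invented names (`JpegMetadataProbe`, `metadataKinds`); it stops at the start-of-scan marker, ignores extended XMP, and assumes the bytes passed in are the complete JPEG stream as stored in the PDF.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class JpegMetadataProbe {
    private static final byte[] EXIF_ID =
            "Exif\0\0".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] XMP_ID =
            "http://ns.adobe.com/xap/1.0/\0".getBytes(StandardCharsets.US_ASCII);

    /** Lists the kinds of metadata ("Exif", "XMP") found in APP1 segments before the scan data. */
    static List<String> metadataKinds(byte[] jpeg) {
        if (jpeg.length < 4 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("not a JPEG stream (missing SOI marker)");
        }
        List<String> found = new ArrayList<>();
        int pos = 2;
        while (pos + 4 <= jpeg.length) {
            if ((jpeg[pos] & 0xFF) != 0xFF) {
                break; // lost marker sync, give up
            }
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xFF) {
                pos++; // fill byte before a marker
                continue;
            }
            if (marker == 0xDA || marker == 0xD9) {
                break; // start of scan or end of image: no more header segments
            }
            // The two-byte big-endian length counts itself but not the marker.
            int length = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            if (length < 2 || pos + 2 + length > jpeg.length) {
                break; // truncated or corrupt segment
            }
            if (marker == 0xE1) { // APP1
                if (startsWith(jpeg, pos + 4, length - 2, EXIF_ID)) {
                    found.add("Exif");
                } else if (startsWith(jpeg, pos + 4, length - 2, XMP_ID)) {
                    found.add("XMP");
                }
            }
            pos += 2 + length;
        }
        return found;
    }

    private static boolean startsWith(byte[] data, int offset, int available, byte[] prefix) {
        if (available < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[offset + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Demo helper only: wraps a payload in a marker segment. */
    static byte[] segment(int marker, byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int length = payload.length + 2;
        out.write(0xFF);
        out.write(marker);
        out.write(length >> 8);
        out.write(length & 0xFF);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Synthetic header: SOI, one Exif APP1 segment, then start of scan.
        ByteArrayOutputStream jpeg = new ByteArrayOutputStream();
        jpeg.write(0xFF);
        jpeg.write(0xD8);
        byte[] exif = segment(0xE1, "Exif\0\0demo".getBytes(StandardCharsets.US_ASCII));
        jpeg.write(exif, 0, exif.length);
        jpeg.write(0xFF);
        jpeg.write(0xDA);
        jpeg.write(0);
        jpeg.write(0);
        System.out.println(metadataKinds(jpeg.toByteArray()));
    }
}
```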

All other image formats must be turned into pixels and then compressed In Some Way (often with Flate/ZIP). These conversions could keep the metadata as part of the PDF's xml metadata or the image's dictionary, but I've never even heard of that happening. It just gets pitched.

Get the metadata from the PDF file using the Snowtide API. Add PDFTextStream.jar from the Snowtide web site to your build path; the method below then prints all the PDF document properties to the command line.

public static void getPDFMetaData(String pdfFilePath) throws IOException {
    // Open the input PDF (requires PDFTextStream.jar on the build path)
    PDFTextStream stream = new PDFTextStream(pdfFilePath);
    try {
        // Get the collection of all document attribute names
        Set attributeKeys = stream.getAttributeKeys();

        // Print the values of all document attributes to System.out
        Iterator iter = attributeKeys.iterator();
        while (iter.hasNext()) {
            String attrKey = (String) iter.next();
            System.out.println(attrKey + " = " + stream.getAttribute(attrKey));
        }
    } finally {
        // Release the underlying file
        stream.close();
    }
}

Source: https://stackoverflow.com/questions/5653826/how-can-i-extract-images-and-their-metadata-from-pdfs

Tags

java

pdf

itext

pdfbox