Extract images from PDF using PDFBox

刺人心 2020-11-28 09:22

I'm trying to extract images from a PDF using PDFBox. The example PDF is here.

But I'm getting only blank images.

The code I'm trying:

public st         


        
8 Answers
  • 2020-11-28 09:43

    Just add the .jpeg extension to the end of your path:

    image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i + ".jpeg");
    

    That works for me.

  • 2020-11-28 09:43

    You can use the PDPage.convertToImage() method, which renders a PDF page into a BufferedImage. You can then use that BufferedImage to create an Image.

    See the following references for further detail:

    • All PDF-related classes in PDFBox are listed in the Apache PDFBox 1.8.3 API
    • The PDPage class documentation

    And do not forget to look up the convertToImage() method in the PDPage class.

  • 2020-11-28 09:48

    For PDFBox 2.0.1, pudaykiran's answer must be slightly modified because some APIs have changed.

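    import java.io.File;
    import javax.imageio.ImageIO;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.graphics.PDXObject;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

    public static void testPDFBoxExtractImages() throws Exception {
        // try-with-resources closes the document when extraction finishes
        try (PDDocument document = PDDocument.load(new File("D:/Temp/Test.pdf"))) {
            for (PDPage page : document.getPages()) {
                PDResources pdResources = page.getResources();
                for (COSName name : pdResources.getXObjectNames()) {
                    PDXObject xObject = pdResources.getXObject(name);
                    if (xObject instanceof PDImageXObject) {
                        // Save every image XObject on the page as a PNG
                        File file = new File("D:/Temp/" + System.nanoTime() + ".png");
                        ImageIO.write(((PDImageXObject) xObject).getImage(), "png", file);
                    }
                }
            }
        }
    }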
    
  • 2020-11-28 09:55

    Here is code using PDFBox 2.0.1 that returns a list of all images in the PDF. It differs from the other answers in that it recurses into nested form XObjects instead of reading images only from each page's top-level resources.

    public List<RenderedImage> getImagesFromPDF(PDDocument document) throws IOException {
        List<RenderedImage> images = new ArrayList<>();
        for (PDPage page : document.getPages()) {
            images.addAll(getImagesFromResources(page.getResources()));
        }

        return images;
    }

    private List<RenderedImage> getImagesFromResources(PDResources resources) throws IOException {
        List<RenderedImage> images = new ArrayList<>();
        if (resources == null) {
            // A page or form XObject without a resource dictionary has no images
            return images;
        }

        for (COSName xObjectName : resources.getXObjectNames()) {
            PDXObject xObject = resources.getXObject(xObjectName);

            if (xObject instanceof PDFormXObject) {
                // Forms can contain further XObjects, so recurse into their resources
                images.addAll(getImagesFromResources(((PDFormXObject) xObject).getResources()));
            } else if (xObject instanceof PDImageXObject) {
                images.add(((PDImageXObject) xObject).getImage());
            }
        }

        return images;
    }
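
    The method above only returns the images in memory. A minimal sketch of writing such a list to disk could look like the following. ExtractedImageSaver, saveAll and the image_N.png naming are hypothetical names chosen for illustration, not part of PDFBox, and main uses blank stand-in images so the sketch runs without a PDF.

```java
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper (not part of PDFBox): writes each image as a numbered PNG.
class ExtractedImageSaver {

    // Saves images as image_1.png, image_2.png, ... and returns the written paths.
    static List<Path> saveAll(List<RenderedImage> images, Path directory) throws IOException {
        Files.createDirectories(directory);
        List<Path> written = new ArrayList<>();
        int index = 1;
        for (RenderedImage image : images) {
            Path target = directory.resolve("image_" + index + ".png");
            ImageIO.write(image, "png", target.toFile());
            written.add(target);
            index++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in images; in practice this list would come from getImagesFromPDF(document)
        List<RenderedImage> images = new ArrayList<>();
        images.add(new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB));
        images.add(new BufferedImage(2, 2, BufferedImage.TYPE_INT_ARGB));

        Path directory = Files.createTempDirectory("pdf-images");
        for (Path path : saveAll(images, directory)) {
            System.out.println(path);
        }
    }
}
```

    PNG is used here because it is lossless and keeps transparency; another format name can be passed to ImageIO.write() if a different output is needed.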
    
  • 2020-11-28 09:56

    Instead of calling

    image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i);
    

    you can use the static ImageIO.write() method to write the RGB image out in whatever format you need. Here I've used PNG:

    File outputFile = new File("C:\\Users\\Pradyut\\Documents\\image" + i + ".png");
    ImageIO.write(image.getRGBImage(), "png", outputFile);
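
    One detail worth knowing when picking a format: ImageIO.write() returns a boolean, and it returns false when no registered writer handles the requested format name. A small sketch of checking that result, using only the JDK, might look like this. ImageFormatCheck and tryWrite are hypothetical names for illustration, and the blank 8x8 image stands in for image.getRGBImage().

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.imageio.ImageIO;

// Hypothetical helper: reports whether ImageIO could write the requested format.
class ImageFormatCheck {

    // Returns true only if ImageIO found a writer for the format and wrote the file.
    static boolean tryWrite(BufferedImage image, String format, File target) throws IOException {
        return ImageIO.write(image, format, target);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the image extracted from the PDF
        BufferedImage image = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        File directory = Files.createTempDirectory("format-check").toFile();

        for (String format : new String[] {"png", "jpg", "no-such-format"}) {
            File target = new File(directory, "image." + format);
            System.out.println(format + " written: " + tryWrite(image, format, target));
        }
    }
}
```

    Checking the return value avoids silently ending up with a missing or empty file when the format name is misspelled or unsupported on the running JVM.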
    
  • 2020-11-28 10:01

    The GetImagesFromPDF Java class below extracts all images from the 04-Request-Headers.pdf file and saves them into the destination folder PDFCopy.

    import java.io.File;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

    // Uses the PDFBox 1.8.x API (PDXObjectImage no longer exists in 2.0).
    @SuppressWarnings({ "unchecked", "rawtypes", "deprecation" })
    public class GetImagesFromPDF {
        public static void main(String[] args) {
            try {
                String sourceDir = "C:/PDFCopy/04-Request-Headers.pdf"; // Put the PDF to read in the PDFCopy folder
                String destinationDir = "C:/PDFCopy/";
                File oldFile = new File(sourceDir);
                if (oldFile.exists()) {
                    PDDocument document = PDDocument.load(sourceDir);
                    try {
                        List<PDPage> list = document.getDocumentCatalog().getAllPages();

                        String fileName = oldFile.getName().replace(".pdf", "_cover");
                        int totalImages = 1;
                        for (PDPage page : list) {
                            PDResources pdResources = page.getResources();

                            Map pageImages = pdResources.getImages();
                            if (pageImages != null) {
                                Iterator imageIter = pageImages.keySet().iterator();
                                while (imageIter.hasNext()) {
                                    String key = (String) imageIter.next();
                                    PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
                                    pdxObjectImage.write2file(destinationDir + fileName + "_" + totalImages);
                                    totalImages++;
                                }
                            }
                        }
                    } finally {
                        document.close(); // release the file handle
                    }
                } else {
                    System.err.println("File does not exist");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
