Extract images from PDF using PDFBox

Anonymous (unverified), submitted 2019-12-03 02:11:02

Question:

I'm trying to extract images from a PDF using PDFBox. The example PDF here

But I'm getting only blank images.

The code I'm trying:

public static void main(String[] args) {
    PDFImageExtract obj = new PDFImageExtract();
    try {
        obj.read_pdf();
    } catch (IOException ex) {
        System.out.println("" + ex);
    }
}

void read_pdf() throws IOException {
    PDDocument document = null;
    try {
        document = PDDocument.load("C:\\Users\\Pradyut\\Documents\\MCS-034.pdf");
    } catch (IOException ex) {
        System.out.println("" + ex);
    }
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();

    int i = 1;
    String name = null;

    while (iter.hasNext()) {
        PDPage page = (PDPage) iter.next();
        PDResources resources = page.getResources();
        Map pageImages = resources.getImages();
        if (pageImages != null) {
            Iterator imageIter = pageImages.keySet().iterator();
            while (imageIter.hasNext()) {
                String key = (String) imageIter.next();
                PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
                image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i);
                i++;
            }
        }
    }
}

Thanks

Answer 1:

The GetImagesFromPDF Java class below gets all images in the 04-Request-Headers.pdf file and saves them to the destination folder PDFCopy.

import java.io.File;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

@SuppressWarnings({ "unchecked", "rawtypes", "deprecation" })
public class GetImagesFromPDF {
    public static void main(String[] args) {
        try {
            // Paste PDF files into the PDFCopy folder to read them
            String sourceDir = "C:/PDFCopy/04-Request-Headers.pdf";
            String destinationDir = "C:/PDFCopy/";
            File oldFile = new File(sourceDir);
            if (oldFile.exists()) {
                PDDocument document = PDDocument.load(sourceDir);

                // Typed as List<PDPage> so the for-each loop below compiles
                List<PDPage> list = document.getDocumentCatalog().getAllPages();

                String fileName = oldFile.getName().replace(".pdf", "_cover");
                int totalImages = 1;
                for (PDPage page : list) {
                    PDResources pdResources = page.getResources();

                    Map pageImages = pdResources.getImages();
                    if (pageImages != null) {
                        Iterator imageIter = pageImages.keySet().iterator();
                        while (imageIter.hasNext()) {
                            String key = (String) imageIter.next();
                            PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
                            pdxObjectImage.write2file(destinationDir + fileName + "_" + totalImages);
                            totalImages++;
                        }
                    }
                }
                // Release the document once all images have been written
                document.close();
            } else {
                System.err.println("File does not exist");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}



Answer 2:

Here is code using PDFBox 2.0.1 that gets a list of all images from the PDF. It differs from the other code in that it recurses through the document's form XObjects instead of only looking for images at the top level of each page.

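import java.awt.image.RenderedImage;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public List<RenderedImage> getImagesFromPDF(PDDocument document) throws IOException {
    List<RenderedImage> images = new ArrayList<>();
    for (PDPage page : document.getPages()) {
        images.addAll(getImagesFromResources(page.getResources()));
    }
    return images;
}

private List<RenderedImage> getImagesFromResources(PDResources resources) throws IOException {
    List<RenderedImage> images = new ArrayList<>();
    // A page or form XObject may have no resources at all
    if (resources == null) {
        return images;
    }

    for (COSName xObjectName : resources.getXObjectNames()) {
        PDXObject xObject = resources.getXObject(xObjectName);

        if (xObject instanceof PDFormXObject) {
            // Form XObjects carry their own resources, so recurse into them
            images.addAll(getImagesFromResources(((PDFormXObject) xObject).getResources()));
        } else if (xObject instanceof PDImageXObject) {
            images.add(((PDImageXObject) xObject).getImage());
        }
    }
    return images;
}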


Answer 3:

For PDFBox 2.0.1, pudaykiran's answer must be modified slightly because some APIs have changed.

public static void testPDFBoxExtractImages() throws Exception {
    PDDocument document = PDDocument.load(new File("D:/Temp/Test.pdf"));
    PDPageTree list = document.getPages();
    for (PDPage page : list) {
        PDResources pdResources = page.getResources();
        for (COSName c : pdResources.getXObjectNames()) {
            PDXObject o = pdResources.getXObject(c);
            if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
                File file = new File("D:/Temp/" + System.nanoTime() + ".png");
                ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) o).getImage(), "png", file);
            }
        }
    }
}


Answer 4:

You can use the PDPage.convertToImage() function, which converts a PDF page into a BufferedImage. You can then use the BufferedImage to create an Image.

Use the following reference for further detail:

Also look up the PDPage.convertToImage() function in the PDPage class.



Answer 5:

Just add .jpeg to the end of your path:

image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i + ".jpeg"); 

That works for me.



Answer 6:

The PDF consists of JBIG2-encoded images. I am not sure whether PDFBox supports these.
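
One way to narrow this down is to check which image decoders the running JVM can see. The sketch below is only a diagnostic and uses nothing but the standard javax.imageio API. It assumes (this is not stated in the thread) that the PDFBox version in use hands JBIG2 decoding to an ImageIO plugin registered under the format name "JBIG2"; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import javax.imageio.ImageIO;

public class ImageReaderCheck {

    /** Returns true if an ImageIO reader is registered for the given format name. */
    static boolean hasReader(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Assumption: a JBIG2 plugin, if present, registers itself under "JBIG2".
        System.out.println("JBIG2 reader available: " + hasReader("JBIG2"));
        // List everything the JVM can decode, to see what is actually on the classpath.
        System.out.println("Registered formats: " + Arrays.toString(ImageIO.getReaderFormatNames()));
    }
}
```

If no JBIG2 reader shows up, a missing decoder would be one plausible explanation for the blank output, though this check alone does not prove it.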



Answer 7:

Instead of calling

image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i); 

You can use the ImageIO.write() static method to write the RGB image out in whatever format you need. Here I've used PNG:

File outputFile = new File("C:\\Users\\Pradyut\\Documents\\image" + i + ".png");
ImageIO.write(image.getRGBImage(), "png", outputFile);
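
A detail worth knowing when going through ImageIO: ImageIO.write() returns false rather than throwing when no writer accepts the image in the requested format, so a failed write can go unnoticed. The sketch below wraps the call to make that case visible. It is a minimal illustration using only the standard library; the class name, the writeChecked helper, and the placeholder 4x3 image standing in for the image taken from the PDF are all assumptions of mine, not part of the answer.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageWriterSketch {

    /** Writes the image in the given format and fails loudly if no writer accepts it. */
    static void writeChecked(BufferedImage image, String format, File target) throws IOException {
        // ImageIO.write returns false (it does not throw) when no suitable writer is found.
        if (!ImageIO.write(image, format, target)) {
            throw new IOException("No ImageIO writer for format: " + format);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder image; in the answer above this would come from image.getRGBImage().
        BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        File out = File.createTempFile("image", ".png");
        writeChecked(image, "png", out);
        System.out.println("Wrote " + out.length() + " bytes to " + out);
    }
}
```

PNG is a reasonable default here because it is lossless and the stock JDK ships a writer for it.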

