Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

不羁的心 提交于 2020-01-06 19:52:26

问题


I am trying to extract image from the pdf using pdfbox. I have taken help from this post . It worked for some of the pdfs but for others/most it did not. For example, I am not able to extract the figures in this file

After doing some research I found that PDResources.getImages is deprecated. So, I am using PDResources.getXObjects(). With this, I am not able to extract any image from the PDF and instead get this message at the console:

org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

Now I am stuck and unable to find the solution. Please assist if anyone can.

//////UPDATE AS REPLY ON COMMENTS///

I am using pdfbox-1.8.10

Here is the code:

public void getimg ()throws Exception {

try {
        String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
        String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
        File oldFile = new File(sourceDir);
        if (oldFile.exists()){
              PDDocument document = PDDocument.load(sourceDir);
               List<PDPage> list =   document.getDocumentCatalog().getAllPages();
               String fileName = oldFile.getName().replace(".pdf", "_cover");
               int totalImages = 1;
               for (PDPage page : list) {
                   PDResources pdResources = page.getResources();
                   Map pageImages = pdResources.getXObjects();
                    if (pageImages != null){
                      Iterator imageIter = pageImages.keySet().iterator();
                      while (imageIter.hasNext()){
                      String key = (String) imageIter.next();
                      Object obj = pageImages.get(key);

                      if(obj instanceof PDXObjectImage) {
               PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;

                         pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);

                     totalImages++;
                      }
                      }
                    }
               }
        }  else {
                    System.err.println("File not exist");
                       }  
}
catch (Exception e){

    System.err.println(e.getMessage());
 }
 }

//// PARTIAL SOLUTION/////

I have solved the problem of the error message. I have updated the correct code in the post as well. However, the problem remains the same. I am still not able to extract the images from few of the files. Like the one, I have mentioned in this post. Any solution in that regards.


回答1:


The first problem with the original code is that XObjects can be PDXObjectImage or PDXObjectForm, so it is needed to check the instance. The second problem is that the code doesn't walk PDXObjectForm recursively, forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(), getResources() doesn't check higher levels.

Code for 1.8 can be found here: https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup

Code for 2.0 can be found here: https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date

(Even these are not always perfect, see this answer)

The fourth problem is that your file doesn't have any XObjects at all. All "graphics" were really vector drawings, these can't be "extracted" like embedded images. All you could do is to convert the PDF pages to images, and then mark and cut what you need.



来源:https://stackoverflow.com/questions/34989223/error-org-apache-pdfbox-pdmodel-graphics-xobject-pdxobjectform-cannot-be-cast-t

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!