Using PDFBox, why can text be extracted but not images?

Submitted by 偶尔善良 on 2019-12-11 15:13:10

Question


I am using PDFBox to extract the images and text from this PDF. I have the following code for extracting the text:

 // document is a PDDocument that has already been loaded
 PDFTextStripper p = new PDFTextStripper();
 String thistext = p.getText(document);

This extracts the text properly. However, when I try to extract images from the same PDF using the ExtractImages class, the images produced are whole pages of the PDF, not the individual images. Is that because the PDF might be a scanned copy? If so, how can the text be extracted?


Answer 1:


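I believe the fact that the PDF is scanned is your issue. Scanned PDFs can carry detected text (which is what makes it selectable and highlightable), typically as an OCR text layer placed over the scan, but each page is still a single image. That would explain why the text stripper finds text while ExtractImages returns whole pages. To test this hypothesis, I would try a known-good PDF such as this one.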
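One rough way to check the scanned-page hypothesis on your own file is to compare each extracted image's shape with the shape of the page it came from. The sketch below is only a heuristic of my own, not something PDFBox provides: the class `ScanCheck`, the method `looksLikeFullPageScan`, and the 2% tolerance are all hypothetical names and values chosen for illustration. It uses plain JDK arithmetic; in recent PDFBox versions the page size could be read from `PDPage.getMediaBox()` and the pixel size from `PDImageXObject.getWidth()` and `getHeight()`.

```java
/** Hypothetical helper, not part of PDFBox: guesses whether an image is a full-page scan. */
class ScanCheck {

    /**
     * Returns true when the image's aspect ratio matches the page's within
     * the given relative tolerance (an assumed heuristic, e.g. 0.02 for 2%).
     */
    static boolean looksLikeFullPageScan(double pageWidthPt, double pageHeightPt,
                                         int imageWidthPx, int imageHeightPx,
                                         double tolerance) {
        if (pageWidthPt <= 0 || pageHeightPt <= 0 || imageWidthPx <= 0 || imageHeightPx <= 0) {
            return false; // degenerate sizes cannot be compared
        }
        double pageRatio = pageWidthPt / pageHeightPt;
        double imageRatio = (double) imageWidthPx / imageHeightPx;
        return Math.abs(pageRatio - imageRatio) / pageRatio <= tolerance;
    }

    public static void main(String[] args) {
        // US Letter page (612 x 792 pt) and a 2550 x 3300 px image, i.e. a 300 dpi scan.
        System.out.println(looksLikeFullPageScan(612, 792, 2550, 3300, 0.02));
        // The same page with a small square image, such as a logo.
        System.out.println(looksLikeFullPageScan(612, 792, 400, 400, 0.02));
    }
}
```

If every page yields exactly one image and that image passes this check, the file is most likely a scan with a text layer on top. A matching aspect ratio alone is not proof, since an ordinary full-page illustration would pass as well.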



Source: https://stackoverflow.com/questions/14617728/using-pdfbox-why-text-can-be-extracted-but-not-image
