Question
I am using PDFBox to extract images and text from this PDF. I have the following code for extracting the text:
PDFTextStripper p = new PDFTextStripper();
String thistext = p.getText(document);
This extracts the text properly. However, when I try to extract images from the same PDF using the ExtractImages class, the images produced are whole pages of the PDF, not the actual images. Is that because the PDF might be a scanned copy? If so, how is the text being extracted?
Answer 1:
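I believe the fact that it is scanned is your issue. I have seen scanned PDFs with detectable (and highlightable) text, presumably from an OCR text layer, but each page is still a single image. That would explain why text extraction works while image extraction returns whole pages. To test this hypothesis, I would try a known-good PDF such as this one.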
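One rough way to check the scanned-copy hypothesis is to compare the size of each extracted image with the size of its page: if a single image covers nearly the whole page, the page is probably a scan. The sketch below is only an illustration of that comparison. The class name `ScanHeuristic`, its methods, and the 0.9 threshold are assumptions made for this example and are not part of PDFBox; the page and image sizes are passed in as plain numbers (in points), so reading them from the document is left to whatever PDFBox version is in use.

```java
// Hypothetical helper, not a PDFBox class: guesses whether a page is a
// full-page scan by comparing the drawn image size with the page size.
final class ScanHeuristic {

    // Assumed cut-off: an image covering at least 90% of the page area
    // is treated as a scan of that page.
    static final double FULL_PAGE_THRESHOLD = 0.9;

    private ScanHeuristic() {}

    // Fraction of the page area covered by the image, clamped to [0, 1].
    public static double coverage(double pageW, double pageH,
                                  double imgW, double imgH) {
        if (pageW <= 0 || pageH <= 0) {
            throw new IllegalArgumentException("page size must be positive");
        }
        // Clamp the image to the page so oversized images count as 100%.
        double w = Math.max(0, Math.min(imgW, pageW));
        double h = Math.max(0, Math.min(imgH, pageH));
        return (w * h) / (pageW * pageH);
    }

    // True when the image covers at least FULL_PAGE_THRESHOLD of the page.
    public static boolean looksLikeFullPageScan(double pageW, double pageH,
                                                double imgW, double imgH) {
        return coverage(pageW, pageH, imgW, imgH) >= FULL_PAGE_THRESHOLD;
    }

    public static void main(String[] args) {
        // US Letter page (612 x 792 points) with an image of the same size.
        System.out.println(looksLikeFullPageScan(612, 792, 612, 792));
        // Same page with a small embedded figure.
        System.out.println(looksLikeFullPageScan(612, 792, 200, 150));
    }
}
```

If every page of the PDF contains exactly one image that passes this check, the behaviour described in the question is what one would expect: the only images in the file are the page scans themselves.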
Source: https://stackoverflow.com/questions/14617728/using-pdfbox-why-text-can-be-extracted-but-not-image