Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I\'ve been struggling to get it to work.
My use case is that I want some code t
Try the code bellow, ContentHandler turned has your xml content.
public ContentHandler convertPdf(byte[] content, String path, String filename)throws IOException, SAXException, TikaException{
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
ContentHandler handler = new ToXMLContentHandler();
PDFParser parser = new PDFParser();
PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(true);
parser.setPDFParserConfig(config);
EmbeddedDocumentExtractor embeddedDocumentExtractor =
new EmbeddedDocumentExtractor() {
@Override
public boolean shouldParseEmbedded(Metadata metadata) {
return true;
}
@Override
public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
throws SAXException, IOException {
Path outputFile = new File(path+metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
Files.copy(stream, outputFile);
}
};
context.set(PDFParser.class, parser);
context.set(EmbeddedDocumentExtractor.class,embeddedDocumentExtractor );
try (InputStream stream = new ByteArrayInputStream(content)) {
parser.parse(stream, handler, metadata, context);
}
return handler;
}