Is it possible to extract text by page for Word/PDF files using Apache Tika?

陌清茗 2020-12-10 05:21

All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser f

3 answers

刺人心 (original poster)

2020-12-10 05:36

You can get the number of pages in a PDF from the metadata object's xmpTPg:NPages key, as in the following:

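// Needs java.io.*, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata,
// org.apache.tika.parser.* and org.apache.tika.sax.BodyContentHandler.
InputStream fis = new FileInputStream("document.pdf"); // placeholder path
ContentHandler handler = new BodyContentHandler(-1);   // -1 disables the write limit
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
parser.parse(fis, handler, metadata, parseContext);
String pageCount = metadata.get("xmpTPg:NPages");      // page count as a string, or null if absent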
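
That only gives you the page count, not the text of each page. For PDFs, Tika's XHTML output wraps each page in a <div class="page"> element, so one possible approach is a custom SAX handler that splits the text on those elements. Below is a minimal sketch under that assumption. PageTextHandler is a hypothetical helper name, not part of Tika, and it relies only on the SAX classes bundled with the JDK. For Word files this will probably not work, since as far as I know Tika does not emit page boundaries for them.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects the text inside each <div class="page">. */
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a page div
    private int nestedDivs;        // divs opened inside the current page div

    // Some parsers leave localName empty, so fall back to qName.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(nameOf(localName, qName))) {
            return;
        }
        if (current == null) {
            // Assumption: a page starts at <div class="page">.
            if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                nestedDivs = 0;
            }
        } else {
            nestedDivs++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        String name = nameOf(localName, qName);
        if ("div".equals(name)) {
            if (nestedDivs == 0) {
                // Closing the page div itself: store the page text.
                pages.add(current.toString().trim());
                current = null;
            } else {
                nestedDivs--;
            }
        } else if ("p".equals(name)) {
            current.append('\n'); // keep paragraph breaks
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    /** Page texts in document order; index 0 is the first page. */
    public List<String> getPages() {
        return pages;
    }
}
```

To use it with the snippet above, pass a PageTextHandler instead of the BodyContentHandler to parser.parse(fis, handler, metadata, parseContext) and read the result from getPages() afterwards. It is worth comparing getPages().size() against xmpTPg:NPages to check that the page-div assumption holds for your Tika version.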
