Is it possible to extract text by page for word/pdf files using Apache Tika?

陌清茗 2020-12-10 05:21

All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser f

3 answers
  •  刺人心 (original poster)
    2020-12-10 05:36

    You can get the number of pages in a PDF from the metadata object's xmpTPg:NPages key, as follows:

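    // Assumes fis is an open InputStream for the PDF; classes come from
    // org.apache.tika.parser, org.apache.tika.metadata, and org.apache.tika.sax.
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    parser.parse(fis, handler, metadata, parseContext);
    // The PDF parser stores the page count under this metadata key.
    String pageCount = metadata.get("xmpTPg:NPages");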
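
    The page count alone does not give per-page text. One possible approach, sketched below, relies on the assumption that Tika's PDF parser wraps each page in a <div class="page"> element in its XHTML output; a custom SAX handler can then collect the text of each such div separately. The class name PageTextHandler and its getPages() method are hypothetical, not part of Tika. The demo in main feeds the handler hand-written XHTML through the JDK's SAX parser so it runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects text per page, assuming <div class="page"> wrappers. */
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current = null; // null while outside any page div
    private int depth = 0;                // div nesting depth inside the current page

    // Use the local name when namespace processing is on, else the qualified name.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(nameOf(localName, qName))) return;
        if (current == null && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder(); // a new page starts here
            depth = 1;
        } else if (current != null) {
            depth++;                       // nested div inside the current page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null || !"div".equals(nameOf(localName, qName))) return;
        if (--depth == 0) {                // the page div itself just closed
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) current.append(ch, start, length);
    }

    List<String> getPages() {
        return pages;
    }

    // Demo on hand-written XHTML; no Tika required.
    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div class=\"page\"><p>First page</p></div>"
                + "<div class=\"page\"><p>Second page</p></div></body></html>";
        PageTextHandler handler = new PageTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        System.out.println(handler.getPages()); // prints [First page, Second page]
    }
}
```

    With Tika, the same handler could be passed in place of the content handler in parser.parse(fis, handler, metadata, parseContext), after which getPages() would hold one string per page. This likely works only for PDFs: Word files generally do not store fixed page boundaries, so the page divs should not be expected there.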
    
