Is it possible to extract text by page for word/pdf files using Apache Tika?

陌清茗 2020-12-10 05:21

All the documentation I can find seems to suggest that I can only extract the entire file's content, but I need to extract pages individually. Do I need to write my own parser f

3 answers
  •  夕颜 (original poster)
    2020-12-10 05:38

    You'll need to work with the underlying libraries - Tika doesn't do anything at the page level.

    For PDF files, PDFBox can work at the page level, for example by extracting the text of a given page range. For Word files, HWPF and XWPF from Apache POI don't really work at the page level: the page breaks aren't stored in the file, but instead need to be calculated on the fly based on the text + fonts + page size...
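
As a rough illustration of the PDF case, here is a minimal sketch that calls PDFBox directly rather than going through Tika. It assumes PDFBox 2.x is on the classpath. `PageTextExtractor` and `extractPages` are hypothetical names made up for this example. The approach is to restrict `PDFTextStripper` to a single page at a time and collect the text of each page separately.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class for illustration; not part of Tika or PDFBox.
class PageTextExtractor {

    /** Returns one string per page, in page order. */
    static List<String> extractPages(PDDocument document) {
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            List<String> pages = new ArrayList<>();
            // PDFTextStripper page numbers are 1-based and inclusive.
            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                pages.add(stripper.getText(document));
            }
            return pages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes PDFBox 2.x; PDFBox 3.x loads files with Loader.loadPDF(File) instead.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            List<String> pages = extractPages(document);
            for (int i = 0; i < pages.size(); i++) {
                System.out.println("=== Page " + (i + 1) + " ===");
                System.out.println(pages.get(i));
            }
        }
    }
}
```

Run it with the path to a PDF as the only argument. This does not carry over to Word files, since POI has no stored page boundaries to iterate over.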
