How can I index PDF, PPT, and Excel files in Lucene? (Java, Python, or PHP is fine)

北荒 2020-12-19 13:57

I would also like to know how to add metadata while indexing, so that I can boost certain fields.

4 answers

生来不讨喜 (original poster)

2020-12-19 14:39

See https://github.com/WolfgangFahl/pdfindexer for a Java solution. It uses PDFBox and Apache Lucene to split PDF files into text page by page, index those text pages, and generate an HTML index file whose links open the matching page in the source PDF via the corresponding open parameter.
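To make the page-by-page idea concrete, here is a rough sketch in plain Java. It is not pdfindexer's code and it does not call PDFBox or Lucene: the page text is assumed to be already extracted, a simple in-memory map stands in for the Lucene index, and the class name `PageIndexSketch` and its methods are made up for illustration. It also assumes the open parameter in question is the `#page=N` URL fragment, which most PDF viewers honor.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy page-level index (hypothetical, for illustration only).
// A TreeMap stands in for the real Lucene index.
class PageIndexSketch {
    // term -> link targets of the form "file.pdf#page=N", sorted by term
    private final Map<String, List<String>> postings = new TreeMap<>();

    // Index the already-extracted text of one page (pageNumber is 1-based).
    void addPage(String pdfFile, int pageNumber, String pageText) {
        String target = pdfFile + "#page=" + pageNumber;
        // Split on anything that is not a letter or a digit.
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with punctuation
            }
            List<String> targets = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (!targets.contains(target)) {
                targets.add(target); // record each page at most once per term
            }
        }
    }

    // Return the link targets of every page that contains the term.
    List<String> lookup(String term) {
        List<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptyList() : hits;
    }

    // Render the index as an HTML list; each link opens the PDF at the given page.
    // Note: file names are not HTML-escaped here, which real code should do.
    String toHtml() {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (Map.Entry<String, List<String>> entry : postings.entrySet()) {
            html.append("  <li>").append(entry.getKey()).append(":");
            for (String target : entry.getValue()) {
                String page = target.substring(target.lastIndexOf('=') + 1);
                html.append(" <a href=\"").append(target).append("\">")
                    .append(page).append("</a>");
            }
            html.append("</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        PageIndexSketch index = new PageIndexSketch();
        index.addPage("manual.pdf", 1, "Lucene indexing basics");
        index.addPage("manual.pdf", 3, "Boosting fields in Lucene");
        System.out.print(index.toHtml());
    }
}
```

In a real setup the text passed to `addPage` would come from a PDF text extractor, and the map would be replaced by a Lucene index holding one document per page, with the file name and page number stored as fields.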
