How can I index PDF, PPT, and Excel files in Lucene? (A Java, Python, or PHP solution is fine.)

北荒 2020-12-19 13:57

I also want to know how to add metadata while indexing so that I can boost certain fields.

4 answers
  •  生来不讨喜
    2020-12-19 14:39

    See https://github.com/WolfgangFahl/pdfindexer for a Java solution. It uses PDFBox and Apache Lucene to split PDF files into text page by page, index those text pages, and generate an HTML index file that links to the matching pages in the source PDFs via the corresponding open parameter.
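To make the page-by-page idea concrete, here is a minimal sketch of the flow described above. It is not pdfindexer's actual code: it assumes the page texts have already been extracted (for example with PDFBox's `PDFTextStripper`), and a plain in-memory map stands in for the Lucene index so the example runs without dependencies. The class name `PageIndexSketch`, its methods, and the file name `manual.pdf` are hypothetical. The links use the `#page=N` PDF open parameter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PageIndexSketch {

    /**
     * Builds a term -> page-number map from already extracted page texts.
     * Assumption: in a real setup the texts would come from PDFBox and the
     * terms would be stored in a Lucene index instead of this map.
     */
    static Map<String, Set<Integer>> buildIndex(List<String> pageTexts) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNumber = i + 1; // PDF page numbers are 1-based
            String text = pageTexts.get(i).toLowerCase(Locale.ROOT);
            // Split on anything that is not a letter or digit.
            for (String token : text.split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) {
                    continue; // a leading separator yields an empty first token
                }
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
        return index;
    }

    /** Renders the index as an HTML list whose links open the PDF at the matching page. */
    static String toHtml(String pdfName, Map<String, Set<Integer>> index) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (Map.Entry<String, Set<Integer>> entry : index.entrySet()) {
            html.append("  <li>").append(entry.getKey()).append(":");
            for (int page : entry.getValue()) {
                html.append(" <a href=\"").append(pdfName)
                    .append("#page=").append(page).append("\">")
                    .append(page).append("</a>");
            }
            html.append("</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical page texts standing in for PDFBox output.
        List<String> pages = Arrays.asList(
            "Lucene indexes text",
            "PDFBox extracts text from PDF pages");
        System.out.println(toHtml("manual.pdf", buildIndex(pages)));
    }
}
```

With Lucene itself, each page would typically become one document carrying the page text plus stored fields for the file name and page number, and those stored fields are what the HTML links would be built from.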
