How to index PDF, PPT, and Excel files in Lucene (Java, Python, or PHP; any of these is fine)?

Submitted by 余生颓废 on 2019-12-18 09:04:19

Question


I also want to know how to add metadata while indexing, so that I can boost some fields.


Answer 1:


Lucene indexes text, not files. You'll need some other process to extract the text from each file, and then run Lucene over that text.
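As a rough illustration of the "index text, not files" point, here is a minimal sketch that indexes already-extracted text together with two metadata fields and then boosts one of them at query time. It assumes Lucene 8.x or 9.x on the classpath (`searcher.doc` is deprecated in later 9.x releases and gone in 10); the class name, the field names `path`, `title`, and `body`, and the sample documents are all made up for the example. Index-time field boosts were removed in Lucene 7, so the sketch uses `BoostQuery` instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical helper class; assumes Lucene 8.x or 9.x.
public class ExtractedTextIndexer {

    /** Adds one document: the path is stored verbatim, title and body are analyzed. */
    static void addDocument(IndexWriter writer, String path, String title, String body)
            throws IOException {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES)); // metadata, not tokenized
        doc.add(new TextField("title", title, Field.Store.YES)); // metadata, tokenized
        doc.add(new TextField("body", body, Field.Store.NO));    // extracted text
        writer.addDocument(doc);
    }

    /** Indexes two made-up documents whose text was extracted elsewhere. */
    static void indexSamples(Directory dir) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            addDocument(writer, "a.pdf", "Meeting notes", "The budget was discussed briefly");
            addDocument(writer, "b.xlsx", "Budget 2024", "Quarterly figures and forecasts");
        }
    }

    /** Searches title and body for one lowercase term; title matches are weighted 3x. */
    static List<String> search(Directory dir, String term) throws IOException {
        Query query = new BooleanQuery.Builder()
                .add(new BoostQuery(new TermQuery(new Term("title", term)), 3.0f),
                        BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("body", term)), BooleanClause.Occur.SHOULD)
                .build();
        List<String> paths = new ArrayList<>();
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                paths.add(searcher.doc(hit.doc).get("path"));
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = FSDirectory.open(Files.createTempDirectory("lucene-demo"))) {
            indexSamples(dir);
            // Hits come back ordered by score, best match first.
            System.out.println(search(dir, "budget"));
        }
    }
}
```

The term passed to `search` must already be lowercase, because `TermQuery` bypasses the analyzer that lowercased the indexed text.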




Answer 2:


There are several frameworks for extracting text suitable for Lucene indexing from rich document files (PDF, PPT, etc.):

  • One of them is Apache Tika, which started as a sub-project of Lucene and is now a top-level Apache project.
  • Apache POI is an Apache project for reading and writing Microsoft Office formats, including text extraction from Word, Excel, and PowerPoint files.
  • There are also some commercial alternatives.



Answer 3:


You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Supported Document Formats

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

The code will look like this:

    Reader reader = new Tika().parse(stream);
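To expand on that one-liner, here is a minimal sketch that pulls both the plain text and the metadata out of a file with the `Tika` facade. It assumes the Tika parsers (for example the `tika-parsers-standard-package` artifact in Tika 2.x) are on the classpath; the class name `TikaExtractor` and the `Extracted` holder are invented for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

// Hypothetical helper class; assumes Tika core plus its parser modules.
public class TikaExtractor {

    /** Holds the plain text and the metadata Tika reported for one input. */
    static final class Extracted {
        final String text;
        final Map<String, String> metadata;

        Extracted(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Detects the format of the stream, then extracts its text and metadata. */
    static Extracted extract(InputStream stream) throws IOException, TikaException {
        Tika tika = new Tika();
        // parseToString caps its output length by default; -1 lifts the limit.
        tika.setMaxStringLength(-1);

        Metadata metadata = new Metadata();
        String text = tika.parseToString(stream, metadata);

        // Copy the metadata into a plain map (first value only for each name).
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : metadata.names()) {
            fields.put(name, metadata.get(name));
        }
        return new Extracted(text, fields);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TikaExtractor <path-to-pdf-ppt-or-xls>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Extracted result = extract(in);
            result.metadata.forEach((name, value) -> System.out.println(name + " = " + value));
            System.out.println(result.text);
        }
    }
}
```

Each metadata entry can then be added to the Lucene document as its own field, which is what makes per-field boosting possible at query time.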




Answer 4:


See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file that links to the pages in the PDF sources by using a corresponding open parameter.
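The page-by-page idea can be sketched roughly as follows. This is not code from pdfindexer; it is an independent illustration that assumes PDFBox 2.x (in PDFBox 3.x the document is opened with `Loader.loadPDF` instead of `PDDocument.load`). The class and method names are invented, and the `#page=N` link form is the common PDF open parameter, assumed here rather than taken from that project.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class; assumes PDFBox 2.x.
public class PdfPageSplitter {

    /** Text of a single page plus a link that opens the PDF at that page. */
    static final class PageText {
        final int pageNumber;
        final String text;
        final String link;

        PageText(int pageNumber, String text, String link) {
            this.pageNumber = pageNumber;
            this.text = text;
            this.link = link;
        }
    }

    /** Builds a link using the "#page=N" open parameter (pages are 1-based). */
    static String pageLink(String pdfName, int page) {
        return pdfName + "#page=" + page;
    }

    /** Extracts the text of every page separately. */
    static List<PageText> split(File pdf) throws IOException {
        List<PageText> pages = new ArrayList<>();
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                // Restrict the stripper to exactly one page per call.
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                String text = stripper.getText(document);
                pages.add(new PageText(page, text, pageLink(pdf.getName(), page)));
            }
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PdfPageSplitter <path-to-pdf>
        for (PageText page : split(new File(args[0]))) {
            System.out.println(page.link + " : " + page.text.length() + " characters");
        }
    }
}
```

Indexing each `PageText` as its own Lucene document, with the link kept in a stored field, lets a search hit point straight at the matching page.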



Source: https://stackoverflow.com/questions/2582951/how-to-index-pdf-ppt-xl-files-in-lucene-java-based-or-python-or-php-any-of-th
