How do I index PDF, PPT, and XLS files in Lucene? (A Java, Python, or PHP solution is fine.)


北荒

I also want to know how to add metadata while indexing, so that I can boost some parameters.


4 answers

小鲜肉

2020-12-19 14:31
You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Supported document formats:
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
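The code will look like this, where stream is an InputStream over the file and the returned Reader yields the extracted plain text:

    Reader reader = new Tika().parse(stream);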
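Since the question also asks about metadata, here is a minimal sketch that uses the Tika facade to get both the text and the metadata in one call. It assumes tika-core and the standard parser bundle (tika-parsers) are on the classpath; the class name TikaExtractSketch and the Extracted holder are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

public class TikaExtractSketch {

    /** Hypothetical holder for the plain text and metadata of one document. */
    public static final class Extracted {
        public final String text;
        public final Metadata metadata;

        Extracted(String text, Metadata metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    public static Extracted extract(InputStream stream) throws IOException, TikaException {
        Metadata metadata = new Metadata();
        // parseToString detects the format, fills `metadata` as a side effect,
        // and returns the extracted text. By default the facade truncates the
        // text at 100,000 characters; see Tika.setMaxStringLength.
        String text = new Tika().parseToString(stream, metadata);
        return new Extracted(text, metadata);
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to a PDF, PPT, XLS, ... file
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Extracted e = extract(in);
            // Which metadata names appear depends on the file format
            for (String name : e.metadata.names()) {
                System.out.println(name + " = " + e.metadata.get(name));
            }
            System.out.println(e.text);
        }
    }
}
```

The text and any metadata values you care about (title, author, and so on) can then go into separate Lucene fields, which is what makes per-field boosting possible later.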
深忆病人

2020-12-19 14:35

Lucene indexes text, not files. You'll need some other process to extract the text from each file, and then run Lucene over that text.
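As a rough sketch of the Lucene half, assuming the text and a title have already been extracted by some other tool: the example below indexes two hard-coded documents in memory and boosts title matches at query time. It assumes Lucene 8.x or 9.x; the class name, field names, and sample strings are invented for illustration. Index-time field boosts were removed in Lucene 7, so the boost is applied in the query instead.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneTextIndexSketch {

    static void addDoc(IndexWriter writer, String path, String title, String body)
            throws IOException {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES)); // stored as-is, not tokenized
        doc.add(new TextField("title", title, Field.Store.YES)); // metadata field
        doc.add(new TextField("body", body, Field.Store.NO));    // extracted text
        writer.addDocument(doc);
    }

    /** Indexes two sample documents and returns the path of the best hit for `word`. */
    static String topPathFor(String word) throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                addDoc(writer, "a.pdf", "quarterly report", "budget figures for the quarter");
                addDoc(writer, "b.ppt", "budget", "slides about the quarterly plan");
            } // closing the writer commits the documents

            // TermQuery bypasses the analyzer, so `word` must already be lowercase.
            Query query = new BooleanQuery.Builder()
                .add(new BoostQuery(new TermQuery(new Term("title", word)), 3.0f),
                     BooleanClause.Occur.SHOULD) // title matches count three times as much
                .add(new TermQuery(new Term("body", word)), BooleanClause.Occur.SHOULD)
                .build();

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(query, 10);
                if (hits.scoreDocs.length == 0) {
                    return null;
                }
                return searcher.doc(hits.scoreDocs[0].doc).get("path");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(topPathFor("budget"));
    }
}
```

For an on-disk index, FSDirectory.open(path) can replace ByteBuffersDirectory. If you parse user queries with the classic query parser, the same boost can be written as title:budget^3.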

日久生厌

2020-12-19 14:36
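There are several frameworks for extracting text suitable for Lucene indexing from rich document files (PDF, PPT, etc.):
- One of them is Apache Tika, which started as a Lucene sub-project and is now a top-level Apache project.
- Apache POI is a more general document-handling project at Apache: it reads and writes Microsoft Office formats (Word, Excel, PowerPoint) rather than only extracting text.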
- There are also some commercial alternatives.
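If you go the POI route for spreadsheets, something like the following might work as a starting point. It is only a sketch: it assumes the poi jar (plus poi-ooxml for .xlsx) is on the classpath, and the class and method names are hypothetical. It flattens every cell into plain text that can then be handed to Lucene.

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

public class PoiSheetTextSketch {

    /** Flattens all sheets of a workbook into one plain-text string. */
    public static String extractText(Workbook workbook) {
        // DataFormatter renders each cell as the string Excel would display.
        // Without a FormulaEvaluator, formula cells come out as the formula text.
        DataFormatter formatter = new DataFormatter();
        StringBuilder text = new StringBuilder();
        for (Sheet sheet : workbook) {
            text.append(sheet.getSheetName()).append('\n');
            for (Row row : sheet) {
                for (Cell cell : row) {
                    text.append(formatter.formatCellValue(cell)).append(' ');
                }
                text.append('\n');
            }
        }
        return text.toString();
    }
}
```

A workbook can be opened with WorkbookFactory.create(new File("report.xls")) inside a try-with-resources block; the file name here is just a placeholder.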
生来不讨喜

2020-12-19 14:39

See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file that links to the pages in the PDF sources by using the corresponding open parameter.
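The page-by-page idea can be sketched roughly as follows. This is not the pdfindexer code itself, just an illustration of the technique with PDFBox (2.x or 3.x on the classpath); the class and method names are made up.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfPageTextSketch {

    /** Returns one string per page; list index 0 holds the text of page 1. */
    public static List<String> extractPages(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        List<String> pages = new ArrayList<>();
        // PDFTextStripper page numbers are 1-based and inclusive
        for (int page = 1; page <= document.getNumberOfPages(); page++) {
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            pages.add(stripper.getText(document));
        }
        return pages;
    }
}
```

The document is loaded with PDDocument.load(file) in PDFBox 2.x or Loader.loadPDF(file) in 3.x. Each page's text can then become its own Lucene document with a stored page-number field, so a hit can link back to something like file.pdf#page=3 (assuming the viewer honors the page open parameter).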
