Question
I also want to know how to add metadata while indexing, so that I can boost some parameters.
Answer 1:
Lucene indexes text, not files. You'll need some other process to extract the text from each file, and then run Lucene over that text.
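To make the point above concrete, here is a minimal sketch of what Lucene actually receives: plain strings placed in named fields. It assumes Lucene 8 or later with lucene-core on the classpath. The class name, the field names "title" and "body", and the 3x weight are made up for illustration. Since index-time boosts were removed in Lucene 7, the sketch applies the boost at query time with BoostQuery, which is one way to favour a metadata field such as the title.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

import java.io.IOException;

// Hypothetical example class; not taken from the answers in this thread.
public class ExtractedTextIndexer {

    // Adds one document built from text that another tool has already extracted.
    static void addDocument(IndexWriter writer, String title, String body) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES)); // metadata field
        doc.add(new TextField("body", body, Field.Store.NO));    // extracted content
        writer.addDocument(doc);
    }

    // Matches the term in either field; a title match is weighted 3x (arbitrary value).
    // TermQuery is not analyzed, so the term must already be lowercase.
    static Query boostedQuery(String term) {
        return new BooleanQuery.Builder()
                .add(new BoostQuery(new TermQuery(new Term("title", term)), 3.0f),
                        BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("body", term)), BooleanClause.Occur.SHOULD)
                .build();
    }

    // Returns how many documents (up to 10) match the boosted query.
    static int countHits(Directory dir, String term) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.search(boostedQuery(term), 10).scoreDocs.length;
        }
    }

    public static void main(String[] args) throws IOException {
        Directory dir = new ByteBuffersDirectory(); // in-memory index, for the demo only
        try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            addDocument(writer, "Quarterly report", "Revenue grew in the third quarter.");
            addDocument(writer, "Meeting notes", "We discussed the quarterly report.");
        }
        System.out.println(countHits(dir, "report"));
    }
}
```

In a real setup the title and body strings would come from whatever extraction step runs before indexing, and the index would live on disk rather than in memory.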
Answer 2:
There are several frameworks for extracting text suitable for Lucene indexing from rich-text files (PDF, PPT, etc.):
- One of them is Apache Tika, which started as a sub-project of Lucene and is now a top-level Apache project.
- Apache POI is an Apache project for reading and writing Microsoft Office formats.
- There are also some commercial alternatives.
Answer 3:
You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Supported Document Formats
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
The code will look like this, where stream is an InputStream over the file: Reader reader = new Tika().parse(stream);
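A slightly fuller sketch of the same idea, for readers who also want the metadata the question asks about. It assumes Tika's facade class with the standard parsers on the classpath (tika-core alone detects types but does not extract content). The class and method names here are hypothetical. It uses parseToString with a Metadata object, which Tika fills in while parsing.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical example class; not taken from the answers in this thread.
public class TikaExtractExample {

    // Returns the extracted text; Tika populates the passed-in Metadata as a side effect.
    static String extract(Path file, Metadata metadata) throws IOException, TikaException {
        try (InputStream stream = Files.newInputStream(file)) {
            return new Tika().parseToString(stream, metadata);
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]); // path to a PDF, PPT, XLS, ... file
        Metadata metadata = new Metadata();
        String text = extract(file, metadata);

        // Which metadata keys appear depends on the file format and the parser.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
        System.out.println(text.length() + " characters extracted");
    }
}
```

The extracted text and any metadata values of interest can then be added to a Lucene document as separate fields. Note that parseToString truncates very long documents at a configurable limit, so the Reader variant above may suit large files better.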
Answer 4:
See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file that links to the pages in the PDF sources by using a corresponding open parameter.
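The page-by-page step can be sketched roughly as follows. This is not the pdfindexer code, only an illustration of the general approach with PDFTextStripper. It assumes PDFBox 2.x, where PDDocument.load(File) exists (in PDFBox 3.x loading moved to Loader.loadPDF). The class and method names are hypothetical.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; not the implementation used by pdfindexer.
public class PdfPageText {

    // Returns one string per page, in page order.
    static List<String> pagesToText(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= document.getNumberOfPages(); page++) {
            stripper.setStartPage(page); // page numbers are 1-based
            stripper.setEndPage(page);   // same start and end: extract a single page
            pages.add(stripper.getText(document));
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            List<String> pages = pagesToText(document);
            for (int i = 0; i < pages.size(); i++) {
                System.out.println("page " + (i + 1) + ": "
                        + pages.get(i).length() + " characters");
            }
        }
    }
}
```

Each page string could then be indexed as its own Lucene document with the page number stored alongside it, which is what makes it possible to link a search hit back to a specific page of the PDF.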
Source: https://stackoverflow.com/questions/2582951/how-to-index-pdf-ppt-xl-files-in-lucene-java-based-or-python-or-php-any-of-th