How to index PDF, PPT, and XLS files in Lucene (Java, Python, or PHP based; any of these is fine)?

北荒 2020-12-19 13:57

I also want to know how to add metadata while indexing so that I can boost some parameters.

4 Answers
  • 2020-12-19 14:31

    You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

    Supported Document Formats

    • HyperText Markup Language
    • XML and derived formats
    • Microsoft Office document formats
    • OpenDocument Format
    • Portable Document Format
    • Electronic Publication Format
    • Rich Text Format
    • Compression and packaging formats
    • Text formats
    • Audio formats
    • Image formats
    • Video formats
    • Java class files and archives
    • The mbox format

    The code will look like this:

        Reader reader = new Tika().parse(stream);
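
To connect this to Lucene and to the metadata part of the question, here is a minimal sketch that uses Tika to pull text and metadata out of a file and puts both into a Lucene `Document`. It assumes `tika-core`, a Tika parsers bundle (without one, the extracted text is likely to come back empty), and `lucene-core` are on the classpath. The class name `TikaToLucene` and the field names `path`, `title`, and `content` are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

/** Hypothetical helper: turns one file (PDF, PPT, XLS, ...) into a Lucene Document. */
final class TikaToLucene {

    static Document toDocument(Path file) throws IOException, TikaException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        String text;
        try (InputStream stream = Files.newInputStream(file)) {
            // Tika detects the format, extracts the text, and fills `metadata`.
            text = tika.parseToString(stream, metadata);
        }

        Document doc = new Document();
        // Exact-match identifier, stored so it can be shown in search results.
        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        // Full text: analyzed and searchable, not stored.
        doc.add(new TextField("content", text, Field.Store.NO));
        // Metadata goes into its own field so it can be weighted separately.
        String title = metadata.get(TikaCoreProperties.TITLE);
        if (title != null) {
            doc.add(new TextField("title", title, Field.Store.YES));
        }
        return doc;
    }
}
```

The returned document can be passed to `IndexWriter.addDocument`. Note that `parseToString` caps the extracted text at a default maximum length, which `Tika.setMaxStringLength` can change.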

  • 2020-12-19 14:35

    Lucene indexes text, not files. You'll need some other process to extract the text from each file, and then run Lucene over that text.
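
Once the text has been extracted, indexing and boosting are plain Lucene. The sketch below indexes two already-extracted documents with separate `title` and `content` fields and weights title matches more heavily at query time with `BoostQuery`. Index-time field boosts were removed in Lucene 7, so query-time boosting is used here instead. It assumes Lucene 9.5 or later with only `lucene-core` on the classpath; the class name, field names, sample documents, and the boost factor of 3 are all illustrative.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class BoostedSearch {

    /** Builds a document from text that some other tool has already extracted. */
    static Document doc(String path, String title, String content) {
        Document d = new Document();
        d.add(new StringField("path", path, Field.Store.YES));
        d.add(new TextField("title", title, Field.Store.YES));
        d.add(new TextField("content", content, Field.Store.NO));
        return d;
    }

    /** Indexes two sample documents into an in-memory directory. */
    static Directory buildDemoIndex() throws IOException {
        Directory dir = new ByteBuffersDirectory();
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            writer.addDocument(doc("a.pdf", "Lucene basics", "an introduction"));
            writer.addDocument(doc("b.ppt", "Misc notes", "lucene lucene tips"));
        }
        return dir;
    }

    /** Returns the paths of matching documents, best match first. */
    static List<String> search(Directory dir, String word) throws IOException {
        // TermQuery is not analyzed, so lower-case the word to match
        // what StandardAnalyzer wrote into the index.
        String term = word.toLowerCase(Locale.ROOT);
        Query query = new BooleanQuery.Builder()
                // A hit in the title counts three times as much as usual.
                .add(new BoostQuery(new TermQuery(new Term("title", term)), 3.0f),
                        BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("content", term)),
                        BooleanClause.Occur.SHOULD)
                .build();

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            List<String> paths = new ArrayList<>();
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                paths.add(searcher.storedFields().document(hit.doc).get("path"));
            }
            return paths;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(search(buildDemoIndex(), "lucene"));
    }
}
```

On older Lucene versions the stored-field lookup would be `searcher.doc(hit.doc)` instead of `searcher.storedFields().document(hit.doc)`.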

  • 2020-12-19 14:36

    There are several frameworks for extracting text suitable for Lucene indexing from rich document files (PDF, PPT, etc.):

    • One of them is Apache Tika, which started as a sub-project of Lucene and is now a top-level Apache project.
    • Apache POI is an Apache project for reading and writing Microsoft Office formats (Word, Excel, PowerPoint).
    • There are also some commercial alternatives.
  • 2020-12-19 14:39

    See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file that links to the pages in the PDF sources by using a corresponding open parameter.
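
The page-by-page idea can be sketched without any libraries. The code below is not taken from pdfindexer: it uses a plain `TreeMap` in place of Lucene, takes the page texts as strings in place of PDFBox output, and assumes the open parameter in question is the `#page=N` URL fragment. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class PageIndexHtml {

    /** Maps each lower-cased word to the sorted 1-based page numbers it occurs on. */
    static Map<String, TreeSet<Integer>> indexPages(List<String> pageTexts) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            // Crude tokenizer: splits on anything that is not an ASCII word character.
            String[] words = pageTexts.get(i).toLowerCase(Locale.ROOT).split("\\W+");
            for (String word : words) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, w -> new TreeSet<>()).add(i + 1);
                }
            }
        }
        return index;
    }

    /** Renders the index as an HTML list; pdfName is assumed to be URL-safe. */
    static String toHtml(String pdfName, Map<String, TreeSet<Integer>> index) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (Map.Entry<String, TreeSet<Integer>> entry : index.entrySet()) {
            html.append("<li>").append(entry.getKey()).append(":");
            for (int page : entry.getValue()) {
                // Assumed open parameter: "#page=N" asks the viewer to open page N.
                html.append(" <a href=\"").append(pdfName).append("#page=").append(page)
                    .append("\">").append(page).append("</a>");
            }
            html.append("</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        List<String> pages = List.of("Lucene indexes text", "PDF text");
        System.out.print(toHtml("doc.pdf", indexPages(pages)));
    }
}
```

A real implementation would get the page texts from a PDF library and would benefit from Lucene's analyzers and ranking, which this sketch leaves out.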
