Get page numbers of search results of a PDF in Solr

Posted by 左心房为你撑大大i on 2019-12-07 16:47:37

Question


I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display each search result with a short snippet of the paragraph where the search term was found and a link that opens the document at the right page.

So what I need is the page number and a short text snippet for every search result.

I'm using Solr 4.1 to index PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.

I found the question "Indexing PDF with page numbers with Solr", but it wasn't really helpful.


Answer 1:


I'm now splitting the PDF and sending each page separately to Solr. Every page is its own document with an id of the form <id_of_document>_<page_number> and an additional field doc_id, which contains only the <id_of_document> and is used for grouping the results.
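
A minimal sketch of the per-page scheme described above, not the answerer's actual code. It only builds the documents and the query string; it does not talk to a Solr server. The field names `text` and `page`, the document id `report42`, and the helper names are assumptions made for illustration. `group`, `group.field`, `group.limit`, `hl` and `hl.fl` are Solr's result grouping and highlighting parameters.

```python
import json
from urllib.parse import urlencode


def page_documents(doc_id, pages):
    """Build one Solr document per page; `pages` is a list of page texts."""
    return [
        {
            "id": f"{doc_id}_{number}",  # <id_of_document>_<page_number>
            "doc_id": doc_id,            # shared key used for grouping
            "page": number,              # assumed extra field, 1-based
            "text": text,                # assumed name of the searchable field
        }
        for number, text in enumerate(pages, start=1)
    ]


def grouped_query(term):
    """Query string that groups page hits by document and requests snippets.

    The term is inserted as-is; a real application would escape it first.
    """
    return urlencode({
        "q": f"text:{term}",
        "group": "true",
        "group.field": "doc_id",
        "group.limit": 5,   # up to five matching pages per document
        "hl": "true",
        "hl.fl": "text",
        "wt": "json",
    })


docs = page_documents("report42", ["Intro text", "Details about Solr"])
payload = json.dumps(docs)    # JSON body for an update request
query = grouped_query("solr")
```

With this layout, the page number of a hit can be read from the `page` field (or from the suffix of the `id`), and the highlighting section of the response should supply the snippet for that page.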




Answer 2:


There is a JIRA issue, SOLR-380, with a patch that you can take a look at.




Answer 3:


I also tried to get results with page numbers but could not do it. I used Apache PDFBox to split all the PDFs in a directory and sent the resulting files to the Solr server.




Answer 4:


I have not tried this myself, but here is an approach:

  1. Write a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs
  2. Create multiple attributes in Solr such as page1, page2, page3, …, pageN (alternatively, use dynamic attributes in Solr)
  3. In the custom connector, read the PDFs page by page and index each page into its respective page attribute/dynamic attribute
  4. Enable search on all the "page" attributes
  5. When a user searches, use the "highlighter/Summary/Teaser" component to retrieve only the "page" attributes that have hits
  6. The "page" attributes that have a hit (as reported by the highlighter/Summary/Teaser) for a given record are the pages that contain the searched phrase
  7. Link to the PDF with the "#PageNumber" of the matching page and open that page on click
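
Steps 2 to 7 can be sketched roughly as follows. This is an untested illustration under stated assumptions, not a confirmed implementation: it assumes a dynamic field pattern `page_*` is declared in the schema as stored and indexed text, that the query sets `hl=true` and `hl.fl=page_*`, and that the viewer accepts a `#page=N` fragment. The helper names, the document id `report42` and the path `/pdf/report42.pdf` are hypothetical. The sample dictionary mimics the shape of the `highlighting` section of a Solr JSON response, with made-up data.

```python
import re

PAGE_FIELD = re.compile(r"^page_(\d+)$")  # assumed dynamic-field naming


def single_document(doc_id, pages):
    """One Solr document holding every page in its own page_<n> field."""
    doc = {"id": doc_id}
    for number, text in enumerate(pages, start=1):
        doc[f"page_{number}"] = text
    return doc


def pages_with_hits(highlighting):
    """Map each document id to a sorted list of (page number, first snippet)."""
    result = {}
    for doc_id, fields in highlighting.items():
        hits = []
        for name, snippets in fields.items():
            match = PAGE_FIELD.match(name)
            if match and snippets:
                hits.append((int(match.group(1)), snippets[0]))
        result[doc_id] = sorted(hits)
    return result


def page_link(pdf_url, page):
    """Link that opens the PDF at a page (the fragment form is an assumption)."""
    return f"{pdf_url}#page={page}"


# Made-up highlighting data: only fields with hits appear in the response.
sample = {
    "report42": {
        "page_7": ["... the <em>solr</em> index ..."],
        "page_2": ["<em>Solr</em> basics ..."],
    }
}
hits = pages_with_hits(sample)
links = [page_link("/pdf/report42.pdf", page) for page, _ in hits["report42"]]
```

Searching across all page fields at once would still need something extra, for example a catch-all field filled via `copyField` or a query parser configured with a list of fields; that part is left out here.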

This is a far better approach than splitting the PDFs and indexing the pages as separate Solr documents.

If you find a flaw in this design, respond to my thread. I will attempt to resolve it.



Source: https://stackoverflow.com/questions/15116160/get-page-numbers-of-searchresult-of-a-pdf-in-solr
