Searching (extracting text) PDF files with Algolia

有些话、适合烂在心里 submitted on 2020-08-08 03:52:01

Question


This is just a speculative idea for a client who has a lot of PDF files.

Algolia say in their FAQs that to search PDF files you first need to extract the text from the file. How would you go about this?

The way I envisage the system working would be:

  • Client uploads PDF via CMS
  • CMS calls some service / program to extract the text
  • Algolia indexes the extracted text, which is somehow linked to the original PDF

It would need to be an automated system as the client shouldn't have to tell it to index. It would be built in PHP, probably Laravel running on Ubuntu.

What software / service could do the text extraction from the PDFs and is any magic needed to 'link' this with the PDF file?

I'm also happy to have suggestions on other search services which may handle this.


Answer 1:


Fortunately, text extraction from PDFs is a subject that has been covered multiple times. On the command line, you could use pdftotext (available on Linux or Mac), or in your code a library such as Apache Tika (for which you can find a PHP wrapper).
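As a rough illustration of the pdftotext route, here is a minimal sketch in Python (the same idea carries over to a PHP process call). It assumes the `pdftotext` binary from poppler-utils is installed and on the PATH; the function names `build_pdftotext_command` and `extract_text` are made up for this example.

```python
import subprocess


def build_pdftotext_command(pdf_path):
    """Build the pdftotext invocation; "-" as the output file means stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext on one PDF and return the extracted text as a string.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
    )
    # Decode leniently so one odd byte does not abort the whole upload.
    return result.stdout.decode("utf-8", errors="replace")
```

In the workflow from the question, the CMS would call something like `extract_text` right after the upload finishes, typically from a queued job so the upload request does not wait on the extraction.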

To avoid having too much noise in your records, I'd recommend splitting the text and creating one record per paragraph. You can then use Algolia's distinct feature to deduplicate the results, so that each PDF shows up only once.

You should already have the links to your files somewhere; just store them in your records. In your front end, you can then easily create links to them using, for instance, autocomplete.js or instantsearch.js.
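The two suggestions above (one record per paragraph, with the file link stored in each record) could look roughly like the sketch below. The field names `doc_id`, `pdf_url` and `content` and the helper names are assumptions for illustration; `objectID`, `searchableAttributes`, `attributeForDistinct` and `distinct` are Algolia's own record and setting names. Sending the records and settings to the index is left to whichever Algolia API client you use.

```python
import re


def split_paragraphs(text):
    """Split extracted text on blank lines and page breaks (form feeds)."""
    chunks = re.split(r"\f|\n\s*\n", text)
    # Collapse hard line wraps inside each paragraph into single spaces.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    return [p for p in paragraphs if p]


def build_records(doc_id, pdf_url, text):
    """Create one record per paragraph, each carrying the link to its PDF."""
    return [
        {
            "objectID": f"{doc_id}-{i}",  # unique per paragraph
            "doc_id": doc_id,             # shared by all paragraphs of one PDF
            "pdf_url": pdf_url,           # used by the front end to build the link
            "content": paragraph,
        }
        for i, paragraph in enumerate(split_paragraphs(text))
    ]


# Index settings that group hits by document, so each PDF appears once.
INDEX_SETTINGS = {
    "searchableAttributes": ["content"],
    "attributeForDistinct": "doc_id",
    "distinct": True,
}
```

Each hit then carries `pdf_url`, so no extra "linking" step is needed beyond rendering that field as a link in the result template.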




Answer 2:


For anyone still looking for a solution, I put together a GitHub repository that does exactly that: https://github.com/PDFTron/pdftron-document-search.

The text extraction happens client-side as the user uploads the document using React + Firebase + Algolia.

You can check out a quick video walking you through the sample app: https://youtu.be/IQATnzHTp7Q.

Let me know if you have any questions.



Source: https://stackoverflow.com/questions/38640877/searching-extracting-text-pdf-files-with-algolia
