Searching attachments from a Rails app (Word, PDF, Excel etc)

Submitted by 匆匆过客 on 2019-12-05 16:26:33
maecro

Well, I have not done binary file indexing before, but apparently Solr has support for it: see "Indexing files with SPHINX/ultrasphinx" and the ExtractingRequestHandler wiki page (http://wiki.apache.org/solr/ExtractingRequestHandler). There are quite a few gems available for Solr, and Sunspot seems to be a popular one (http://outoftime.github.com/sunspot/). Sunspot does not seem to have built-in support for Solr Cell, but some work is going into it (https://github.com/tomasc/sunspot_cell). There are probably better options out there, but this should give you a good starting point.

Just to update this. The approach I've decided to go with is:

Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read

Specifically, I'll be doing the following:

  • Using thinking-sphinx
  • Using the subexec gem to call ...
  • ... Tika from the command line

It looks as if it will be as simple as calling java -jar tika-app-0.10.jar -t [file], but I'll post my experiences if it turns out to be more complicated!
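
For anyone wanting to try the same thing, here is a rough sketch of the extraction step. It is not the code I ended up with: it uses Open3 from Ruby's standard library rather than the subexec gem, the helper name extract_text is made up for illustration, and it assumes that java is on the PATH and that tika-app-0.10.jar sits in the working directory.

```ruby
require "open3"

# Hypothetical helper (not part of any gem): runs a text-extraction command
# on a file and returns whatever the command prints to stdout.
# The default command is the Tika invocation mentioned above; the jar
# location is an assumption and will need adjusting for a real app.
def extract_text(path, command: ["java", "-jar", "tika-app-0.10.jar", "-t"])
  # Passing the command as separate arguments avoids the shell, so file
  # names containing spaces or quotes need no escaping.
  stdout, stderr, status = Open3.capture3(*command, path.to_s)

  unless status.success?
    raise "text extraction failed (exit #{status.exitstatus}): #{stderr.strip}"
  end

  stdout
end
```

The returned string could then be saved to a text column on the attachment's model (say, a hypothetical extracted_text column) so that thinking-sphinx indexes it like any other field. Starting a JVM for every file is slow, so this is probably best run from a background job rather than inside the request.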
