Configuring Tika With Solr

Submitted by 烂漫一生 on 2019-12-08 11:42:51

Question


I am looking to index rich document types (PDF, DOC, RTF, TXT) into Solr. I found Tika as a solution. I searched the web but didn't find any docs or links on making it work with ExtractingRequestHandler.

Can anyone please provide a step-by-step way to configure Tika with ExtractingRequestHandler?

Thanks in advance :)


Answer 1:


See the ExtractingRequestHandler documentation for integrating Solr with Tika.
Solr ships with a built-in Tika configuration, so you do not need to define tika.config unless you want to override it.
You can go with the default configuration as defined in solrconfig.xml:

<!-- Solr Cell Update Request Handler

   http://wiki.apache.org/solr/ExtractingRequestHandler 

-->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
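
If you do need to override the built-in Tika configuration, the handler accepts a tika.config entry. The fragment below is only a sketch: the file path is a placeholder, not a real location, and the defaults are trimmed for brevity.

```xml
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Optional: point the handler at a custom Tika configuration file.
       The path below is a hypothetical placeholder; omit this entry
       to keep the configuration that ships with Solr. -->
  <str name="tika.config">/path/to/tika-config.xml</str>
</requestHandler>
```

Note that tika.config sits next to the defaults list, not inside it, because it configures the handler itself and is not a per-request default.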

You can index files into Solr, along with additional metadata, using a command like this:

curl "http://localhost:8983/solr/update/extract?literal.id=2&literal.title=Test&commit=true&fmap.content=text" -F "myfile=@1.pdf"

By default, the content of the file is stored in the content field and copied over to the text field; you can override these settings with request parameters such as fmap.content.
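
The handler defaults and the curl command refer to several field names (id, title, text, links, and the ignored_ prefix), so the schema has to accept them. A minimal schema.xml sketch might look like the following; the field types, including text_general, are assumptions modeled on the Solr example schema and should be adapted to your own.

```xml
<!-- Hypothetical schema.xml excerpt; field types and attributes are assumptions -->
<field name="id"    type="string"       indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="text"  type="text_general" indexed="true" stored="false" multiValued="true"/>
<!-- Receives the href values mapped by fmap.a=links -->
<field name="links" type="string"       indexed="true" stored="true" multiValued="true"/>

<!-- Fields prefixed with ignored_ (see uprefix) are accepted but neither indexed nor stored -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```

With a catch-all like ignored_*, metadata that Tika extracts but the schema does not declare is silently dropped instead of causing an unknown-field error.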



Source: https://stackoverflow.com/questions/17622544/configuring-tika-with-solr
