Content extraction of PDF files in Solr using Apache Tika

主宰稳场 submitted on 2019-12-08 04:22:04

Question


I am trying to index a PDF file in Solr by following the tutorial at http://wiki.apache.org/solr/ExtractingRequestHandler, but every time I run the command

java -jar post.jar *.pdf

it fails with org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3. Please help me index the PDF into the Solr server. Is there any integration other than Tika that could help me?


Answer 1:


post.jar is just a utility for uploading files to Solr.
Solr extracts PDF content through the extract handler, so you need to pass that handler's URL, e.g.

java -Durl="http://localhost:8983/solr/update/extract?literal.id=1" -Dtype=application/pdf -jar post.jar 1.pdf

For encrypted files, check link
For password-protected files, check link




Answer 2:


There is obviously some encoding issue here.

I remember doing something like this a few months ago, and it is fairly easy if you can write your own piece of Java code. These are mostly simple to write, and they work like a charm!
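As a rough illustration of what such a piece of Java code might look like, here is a minimal sketch that posts the raw PDF bytes to the extract handler with the JDK's built-in HTTP client (Java 11+). It is not the code this answer refers to. The class name PdfPoster, the helper extractUri, the base URL http://localhost:8983/solr, and the added commit=true parameter are all assumptions for illustration; newer Solr versions typically also need a core or collection name in the path.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper class; the name and structure are illustrative only.
public class PdfPoster {

    // Builds the extract-handler URL. literal.id supplies the document id,
    // and commit=true (an assumption added here) asks Solr to commit after indexing.
    static URI extractUri(String solrBase, String docId) {
        String id = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return URI.create(solrBase + "/update/extract?literal.id=" + id + "&commit=true");
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PdfPoster.java <file.pdf> <document-id>
        Path pdf = Path.of(args[0]);
        // Assumed base URL; adjust host, port, and core name for your setup.
        URI uri = extractUri("http://localhost:8983/solr", args[1]);

        // Send the file as a binary body with the PDF content type,
        // so Solr does not try to parse it as XML text.
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Print the HTTP status and Solr's response body for inspection.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Sending the bytes with an explicit application/pdf content type is the same idea as the -Dtype=application/pdf flag in Answer 1: the file is handed to Tika as binary data rather than being read as text.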



Source: https://stackoverflow.com/questions/18767945/contentextraction-of-pdf-file-in-solr-using-apache-tika
