How to index text files using Apache Solr

Submitted by 假如想象 on 2019-12-04 14:18:22

Question


I want to index text files. After a lot of searching I came across Apache Tika. Some of the sites I read say that Tika converts the text into XML and then sends it to Solr, but that the conversion creates only a single tag, for example ....... The text file I want to index is a Tomcat localhost access log, and it is several gigabytes in size. I cannot store it as a single index entry. I want each line to have its own line id ....... so that I can easily retrieve the matching line.

Can this be done in Apache Tika?


Answer 1:


Solr with Tika supports extraction of data from multiple file formats.
The complete list of supported file formats can be found at link

You can provide any of the above file formats as input. Tika autodetects the format, extracts the text from the file, and passes it to Solr for indexing.

Edit:
Tika does not convert the text file to XML before sending it to Solr. Tika just extracts the metadata and the content of the file and populates fields in Solr according to the mapping you define.

You either have to feed the entire file to Solr as input, in which case it is indexed as a single document, or you have to read the file line by line and send each line to Solr as a separate document.
Solr and Tika will not do this splitting for you.
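
The line-by-line approach could be sketched roughly as follows. This is a minimal illustration, not a tested production indexer: it assumes a Solr core that accepts JSON documents on its update handler and has the dynamic fields `*_i` and `*_t` defined (as in Solr's example schemas). The field names `line_number_i` and `text_t`, the `source:line` id scheme, and all function names are hypothetical choices made for this example.

```python
import json
import urllib.request
from itertools import islice


def line_documents(lines, source_name):
    """Yield one Solr document (a dict) per non-empty line, keyed by line number."""
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue
        yield {
            "id": f"{source_name}:{line_number}",  # assumed id scheme
            "line_number_i": line_number,          # assumed dynamic int field
            "text_t": text,                        # assumed dynamic text field
        }


def batches(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def post_batch(update_url, docs):
    """POST a list of documents as a JSON array to the given update URL."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_file(path, update_url, batch_size=1000):
    """Stream a large file to Solr without loading it all into memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for batch in batches(line_documents(handle, path), batch_size):
            post_batch(update_url, batch)
```

Because the file is read lazily and sent in batches, a multi-gigabyte log never has to fit in memory. The update URL (including the core name and any commit parameters) depends on your Solr installation, so it is passed in rather than hard-coded.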




Answer 2:


You may want to look at DataImportHandler to parse the file into lines or entries. It is a better match than running Tika on something that already has internal structure.
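
DataImportHandler itself is configured in XML (its LineEntityProcessor reads a file one line at a time), so the sketch below is not DIH. It only illustrates, in Python, what "parsing the file into entries" can mean for an access log. It assumes the log uses Tomcat's `common` access log pattern; the regular expression, the field names, and the function name are hypothetical choices for this example.

```python
import re

# Assumed layout: host ident user [timestamp] "request" status bytes
ACCESS_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Split one access log line into named fields, or return None if it does not match."""
    match = ACCESS_LINE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the size column means no bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry
```

Each parsed entry could then be indexed as its own Solr document with separate fields for host, status, and so on, which makes the structure of the log searchable rather than only its raw text.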



Source: https://stackoverflow.com/questions/15496255/how-to-index-text-files-using-apache-solr
