Can Solr retain the formatting of the HTML documents which were fed to it in its results?

时光毁灭记忆、已成空白 submitted on 2019-12-25 09:42:11

Question


How do I preserve the original formatting of an HTML document in the results returned by Solr?

I am trying to provide search functionality on one of my company's websites, which has millions of documents. They do not all share the same formatting, so it is hard to format each document individually.

I am using the Solr 4.1 nightly builds from the Apache site, which have built-in support for Solr Cell and Tika, i.e. I do not need to configure them separately.

Do Solr Cell or Tika retain this formatting anywhere?

If they do not retain the formatting, I will need to fetch each document from its physical file location using Solr's resourcename field and then apply the highlighting and other ready-made Solr functionality myself, but that process is too tedious.

EDIT: Which request handler can I use if I have to use "HTMLStripCharFilterFactory", as suggested by Jayendra in the answer? Also, can I still extract metadata tags in that case?

Can anyone guide me on this?

Thank you for all your support!


Answer 1:


Solr Cell with Tika does not maintain the original formatting of the document.
You get only the extracted text of the documents fed to Solr through Tika.

Otherwise, you have to feed the HTML document to Solr as a normal field and apply HTMLStripCharFilterFactory in that field's analyzer, so that both the original markup and the searchable text are kept.

Solr will keep the original document, HTML markup included, when the field is stored=true.
Searching (indexed=true), however, only runs against the text content and not against the HTML elements.
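
The stored/indexed split described above can be illustrated outside Solr with a rough Python sketch. This is not Solr's implementation or API: the names TextOnly, strip_html and index_and_store are made up for illustration, and the tag stripping is a simplification (it puts a space at every tag boundary and does not special-case script or style content). It only shows the idea that the stored copy keeps the markup while matching happens on the stripped text.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and discards tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html):
    """Return the text content of raw_html with all tags removed."""
    parser = TextOnly()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    # Simplification: join text runs with a space, then collapse whitespace.
    return " ".join(" ".join(parser.parts).split())


def index_and_store(raw_html):
    """Return (stored_value, indexed_terms) for one hypothetical HTML field."""
    stored = raw_html  # analogous to what a stored=true field hands back
    terms = strip_html(raw_html).lower().split()  # analogous to what a search matches
    return stored, terms


stored, terms = index_and_store("<h1>Solr</h1><p>keeps <b>markup</b> intact</p>")
print(stored)  # the original HTML, unchanged
print(terms)   # plain words only, no tag names
```

A query for "markup" would match this document, a query for "h1" would not, and the value handed back for display is still the original HTML.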



Source: https://stackoverflow.com/questions/14770605/can-solr-retain-the-formatting-of-the-html-documents-whcih-was-fed-to-it-in-its
