Can Solr retain the formatting of the HTML documents which were fed to it in its results?

Question

How do I maintain the original formatting of an HTML document in the results returned by Solr?

I am trying to provide search functionality on one of my company's websites, which has millions of documents. They do not all share the same formatting, so it is hard to format each document individually.

I am using the Solr 4.1 nightly builds from the Apache site, which have built-in support for Solr Cell and Tika, i.e. I do not need to configure them separately.

Do Solr Cell or Tika retain this formatting anywhere?

If the formatting is not retained, I will need to fetch each document from its physical file location using Solr's resourcename field and then apply the highlighting and other ready-made Solr functionality myself, but this process is too tedious.

EDIT: What can I use as a request handler if I have to use "HTMLStripCharFilterFactory", as suggested by Jayendra in the answer? Also, can I extract metadata tags in that case?

Can anyone guide me regarding this?

Thank you for all your support!

Answer 1:

Solr Cell with Tika does not maintain the original formatting of the document.
You would get only the extracted text from the documents fed to Solr through Tika.

Otherwise, you have to feed the HTML document to Solr as a normal field and apply the HTMLStripCharFilterFactory char filter to that field, so that Solr maintains both copies: the original markup and the stripped text.

Solr keeps the original document, HTML markup included, as the field's stored value when stored=true.
However, searching (indexed=true) happens only on the text content and not on the HTML elements.
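
As a rough illustration of the setup described above, a schema.xml fragment might look like the sketch below. The field type name text_html and the field name content_html are hypothetical names chosen for this example, and the tokenizer and lowercase filter are just one plausible analysis chain; only HTMLStripCharFilterFactory and the stored/indexed attributes come from the answer itself.

```xml
<!-- Hypothetical field type: HTML is stripped only during analysis (indexing and querying) -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Removes HTML tags before tokenization, so only the text content is searchable -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field: stored="true" keeps the raw HTML exactly as it was sent,
     indexed="true" makes the stripped text searchable -->
<field name="content_html" type="text_html" indexed="true" stored="true"/>
```

With a field defined this way, the raw HTML would be sent as the field's value, and the value returned in search results would be that same unmodified HTML, since char filters affect only the indexed tokens and not the stored value.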

Source: https://stackoverflow.com/questions/14770605/can-solr-retain-the-formatting-of-the-html-documents-whcih-was-fed-to-it-in-its

Tags

solr

solrj

apache-tika

solr-cell