Does HTMLStripCharFilterFactory @ Solr 3.4 strip out html for returned fields?

我怕爱的太早我们不能终老 提交于 2019-12-25 16:27:52

问题


I'm using CF10 which should be using Solr 3.4 according to corporatezen.com/2013/11/updating-solr-engine-coldfusion. I added <charFilter class="solr.HTMLStripCharFilterFactory"/> to <fieldType name="text"> but the summary field in the search result still includes HTML. Any idea why?

<field name="summary" type="text" indexed="false" stored="true" required="false" />

http://localhost:8985/solr/test/admin/schema.jsp shows:

Field: summary Field Type: TEXT

Properties: Tokenized, Stored

Schema: Tokenized, Stored

Position Increment Gap: 100

Index Analyzer: org.apache.solr.analysis.TokenizerChain DETAILS

Char Filters:

org.apache.solr.analysis.HTMLStripCharFilterFactory args:{luceneMatchVersion: LUCENE_24 } Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_24 } org.apache.solr.analysis.WordDelimiterFilterFactory args:{splitOnCaseChange: 1 generateNumberParts: 1 catenateWords: 1 luceneMatchVersion: LUCENE_24 generateWordParts: 1 catenateAll: 0 catenateNumbers: 1 } org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_24 } org.apache.solr.analysis.EnglishPorterFilterFactory args:{protected: protwords.txt luceneMatchVersion: LUCENE_24 } org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_24 } Query Analyzer: org.apache.solr.analysis.TokenizerChain DETAILS

Char Filters:

org.apache.solr.analysis.HTMLStripCharFilterFactory args:{luceneMatchVersion: LUCENE_24 } Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

org.apache.solr.analysis.SynonymFilterFactory args:{synonyms: synonyms.txt expand: true ignoreCase: true luceneMatchVersion: LUCENE_24 } org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true luceneMatchVersion: LUCENE_24 } org.apache.solr.analysis.WordDelimiterFilterFactory args:{splitOnCaseChange: 1 generateNumberParts: 1 catenateWords: 0 luceneMatchVersion: LUCENE_24 generateWordParts: 1 catenateAll: 0 catenateNumbers: 0 } org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_24 } org.apache.solr.analysis.EnglishPorterFilterFactory args:{protected: protwords.txt luceneMatchVersion: LUCENE_24 } org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_24 }


回答1:


You need to differentiate between the stored and the indexed. The filter you have added to the field will alter the tokens that are stored in Solr's index, for searching, but not the stored values for display.

Solr keeps two versions of a field*. One is the stored one. This is the original portion of text, in your case with HTML included. The other one is the index version. There the whole magic of text analysis has been applied.

Then when you perform a search, the index is used to look up which documents have created a match. When displaying the result, the stored version is presented to you.


* Of course only in case that you turned on stored="true" and indexed="true".



来源:https://stackoverflow.com/questions/28687200/does-htmlstripcharfilterfactory-solr-3-4-strip-out-html-for-returned-fields

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!