Non-English Language support via SolrNet

旧街凉风 提交于 2019-12-23 02:50:27

问题


I am using SolrNet to search over Solr from an .NET application. Everything works fine when I search over English words. However if I use spanish words like español, I get no search result though I have indexed them. When I debugged over Solr, I found that the query was parsed as espaA+ol.

Do I have to do some UTF-8 encoding or does SolrNet supports search over only ASCII characters?


回答1:


This is not a SolrNet issue, it is related to how Solr handles characters that are not in the first 127 ASCII character set. The best recommendation is add the ASCIIFoldingFilterFactory to your Solr field where you are storing the Spanish words.

As an example, if you were using the text_general fieldType as defined in the Solr example which is setup as follows in the schema.xml file:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I would recommend modifying it as follows adding the ASCIIFoldingFilterFactory to the index and query analyzers.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

Also, please note that you will need to reindex your data after making this schema change for the changes to be reflected in the index.




回答2:


Not sure if you want to specifically keep those characters in the index? If you don't need to, it would be better to use something like

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

so 'español' would be indexed as 'espanol' and searching for any of them would find 'español' (same for á, ü etc).



来源:https://stackoverflow.com/questions/10493830/non-english-language-support-via-solrnet

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!