Remove diacritics at index time in Solr

Submitted by 核能气质少年 on 2020-01-25 05:55:17

Question


I am fine-tuning search in Solr. I'm using Solr 4.0.

Until now I have worked with language analyzers and tokenizers for English. This time I'm working with Portuguese, and the search does not return the results I expect.

For example: I search for the word 'proteses', but the indexed term is 'próteses', with a diacritic. The two do not match, so the search gives wrong results.

What I need is to remove all diacritics both before indexing and before searching, so that the two forms match. However, I can't find out how to do this.

Can anyone point me in the right direction?


Answer 1:


Use a char mapping filter on the fields that can contain diacritics. It replaces accented characters with their unaccented equivalents before the text reaches the tokenizer.

For example:

<fieldType name="text_with_diacritics" class="solr.TextField">     
    <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>     
</fieldType>

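The mapping-ISOLatin1Accent.txt file that ships with Solr contains mappings for many diacritics. Because the field type declares a single analyzer, the same mapping is applied at both index time and query time.

You will have to reindex your documents after configuring this filter, since terms that were already indexed still contain the diacritics.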




Answer 2:


Solr also has several ICU filters available, including a normalization filter and a folding filter, which remove accents and diacritics across the whole of Unicode.
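
A minimal sketch of a field type using the ICU folding filter. It assumes the ICU analysis jars (the analysis-extras contrib) are on Solr's classpath, and the field type name text_icu_folded is hypothetical:

<fieldType name="text_icu_folded" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Folds case and strips accents/diacritics, so no separate lowercase filter is needed -->
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>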

There is also an ASCIIFoldingFilter available, which attempts to convert any character outside the standard 7-bit ASCII range into an ASCII equivalent.
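
A minimal sketch of how this filter might be configured; the field type name text_ascii_folded is hypothetical. With a single analyzer, both the indexed 'próteses' and the query 'proteses' should end up as the same term:

<fieldType name="text_ascii_folded" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Replaces non-ASCII characters such as "ó" with their ASCII equivalents where one exists -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
</fieldType>

As with the char mapping filter, existing documents need to be reindexed after the field type changes.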



Source: https://stackoverflow.com/questions/25697009/remove-diacritics-at-index-time-into-solr
