Remove diacritics at index time in Solr

Question

I am fine-tuning a Solr search. I'm using Solr 4.0.

Normally I work with language analyzers and tokenizers for English. This time I'm working with Portuguese, and the search doesn't return the results I expect.

For example, I search for the word 'proteses', but what is indexed is 'próteses', with a diacritic. The two don't match, so the search gives wrong results.

What I need is to remove all diacritics both before indexing and before searching, so that the two forms match. However, I can't find out how to do this.

Can anyone point me in the right direction?

Answer 1:

You have to use a mapping char filter on the fields that can contain diacritics. This filter normalizes the characters before the text reaches the tokenizer.

For example:

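<fieldType name="text_with_diacritics" class="solr.TextField">
    <!-- An analyzer without a type attribute is applied at both index and query time -->
    <analyzer>
        <!-- Char filters run before the tokenizer and replace accented characters -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>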

The mapping-ISOLatin1Accent.txt file that ships with Solr has mappings for many diacritics.

Obviously, you'll have to reindex your documents after you have configured this filter.
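
To take effect, the field type has to be assigned to the fields that may hold accented text. As a minimal sketch, assuming a field called description (a made-up name for illustration; use whatever fields your schema actually defines), the declaration in schema.xml could look like this:

```xml
<!-- "description" is a hypothetical field name; the type refers to the fieldType defined above -->
<field name="description" type="text_with_diacritics" indexed="true" stored="true"/>
```

The stored value keeps its accents; only the indexed terms and the analyzed query terms are normalized.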

Answer 2:

Solr also has several ICU filters available, including both a normalization filter and a folding filter, which allow accents and diacritics to be removed across Unicode.
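
A rough sketch of the ICU route might look like the following. The field type name text_icu_folded is invented for illustration, and this assumes the ICU analysis jars (shipped in Solr's analysis-extras contrib) are on the classpath.

```xml
<!-- "text_icu_folded" is a hypothetical name; requires the ICU jars from contrib/analysis-extras -->
<fieldType name="text_icu_folded" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- ICU folding removes accents and also case-folds, so no separate lowercase filter is used -->
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>
```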

There is also an ASCIIFoldingFilter available, which attempts to convert any character above the standard 7-bit ASCII range into an equivalent within that range.
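
For Portuguese text this is probably the simplest option. A minimal sketch, with text_ascii_folded as a made-up field type name:

```xml
<!-- "text_ascii_folded" is a hypothetical name chosen for this example -->
<fieldType name="text_ascii_folded" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Folds e.g. "próteses" to "proteses" at both index and query time -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
</fieldType>
```

As with the char filter approach, existing documents need to be reindexed before the folded terms are searchable.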

Source: https://stackoverflow.com/questions/25697009/remove-diacritics-at-index-time-into-solr

Tags

solr

full-text-search

solr4