How to make the Solr spellchecker correct both Latin and Cyrillic words?

Posted by 我的梦境 on 2020-01-03 09:00:59

Question


I allow users to type Russian words in Latin letters. If a user misspells a Russian word typed in Latin letters, I want the Solr spellchecker to suggest the correct word in Cyrillic (Russian words in the index are stored in Cyrillic). However, if the user misspells a non-Russian word (for example, a brand name), it should be corrected in Latin letters (non-Russian words in the index are stored in Latin).

For example, tilevizor smasung should be corrected to телевизор samsung.

Now I'm using the following configuration:

<fieldType name="spell_ru" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove" />
    </analyzer>
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
    </analyzer>
</fieldType>

It converts the query to Cyrillic letters, so correction of Russian words works, but correction of Latin words does not (tilevizor to телевизор works, but smasung to samsung doesn't).
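
For context, a field type like this usually reaches the spellchecker through solrconfig.xml. The fragment below is only a sketch of how that wiring might look: the field name spell and the choice of solr.DirectSolrSpellChecker are illustrative assumptions, not taken from the real setup. The point it shows is that queryAnalyzerFieldType selects the query analyzer, so the Any-Cyrillic transform would run on every misspelled term, Latin brand names included.

```xml
<!-- Sketch only: "spell" is a hypothetical field of type spell_ru -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- The query analyzer of this field type is applied to the user's terms -->
  <str name="queryAnalyzerFieldType">spell_ru</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <!-- Assumed implementation; suggests terms straight from the index -->
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>
```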

Any ideas on how I can make the spellchecker correct both Cyrillic and Latin words?


Answer 1:


I think a solution that could help here is phonetic matching, such as Beider-Morse Phonetic Matching (BMPM):

Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool that lets you search using a new phonetic matching system.

So, for example, the words 'tilevizor' and 'телевизор' sound alike, and we will get a match. One thing that can be tuned is the phonetic matching algorithm. Solr supports several of them, and I'm not sure which one will perform best: DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone (v2.0), ColognePhonetic, or Nysiis.
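
Solr also ships a dedicated filter for BMPM itself, solr.BeiderMorseFilterFactory, which is separate from solr.PhoneticFilterFactory and its encoders. A sketch of what such a filter line could look like inside an analyzer follows; the attribute values are assumptions rather than tuned settings, and I have not tested this variant on the data in this question.

```xml
<!-- Hypothetical alternative to a PhoneticFilterFactory line: BMPM itself.
     languageSet="auto" lets the filter guess the language of each token. -->
<filter class="solr.BeiderMorseFilterFactory"
        nameType="GENERIC" ruleType="APPROX"
        concat="true" languageSet="auto"/>
```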

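Also, I would change the id of solr.ICUTransformFilterFactory to "Russian-Latin/BGN", which does a much better job of converting Russian characters to Latin ones. Applying the same filter chain at both index and query time means that Cyrillic and Latin input are reduced to the same Latin phonetic codes.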

    <fieldType name="spell_ru" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ICUTransformFilterFactory" id="Russian-Latin/BGN"/>
            <filter class="solr.PhoneticFilterFactory" encoder="Caverphone"/>
        </analyzer>
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ICUTransformFilterFactory" id="Russian-Latin/BGN"/>
            <filter class="solr.PhoneticFilterFactory" encoder="Caverphone"/>
        </analyzer>
    </fieldType>

The fieldType above does the trick in a lot of cases, e.g.:

q=title:tilevizor
SolrDocument{title=телевизор samsung, _version_=1583123812650582016}
SolrDocument{title=televizor самсунг, _version_=1583123812667359232}

q=title:тилевизор
SolrDocument{title=телевизор samsung, _version_=1583123812650582016}
SolrDocument{title=televizor самсунг, _version_=1583123812667359232}

q=title:smasung
SolrDocument{title=телевизор samsung, _version_=1583123812650582016}
SolrDocument{title=televizor самсунг, _version_=1583123812667359232}
SolrDocument{title=гэлакси samsung, _version_=1583123812684136448}
SolrDocument{title=galaxy самсунг, _version_=1583123812684136449}
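
The queries above imply that the title field is indexed with the phonetic field type. A minimal sketch of the schema entry this assumes is shown below; the attributes are my guess, and the commented-out variant with a separate title_phonetic field (an invented name) is a hypothetical alternative, not something from the original setup.

```xml
<!-- Assumed schema entry: the queried field uses spell_ru directly.
     stored="true" returns the original text, as in the results above. -->
<field name="title" type="spell_ru" indexed="true" stored="true"/>

<!-- Hypothetical alternative: keep title as ordinary text and copy it
     into a separate phonetic field that is queried instead.
<field name="title_phonetic" type="spell_ru" indexed="true" stored="false"/>
<copyField source="title" dest="title_phonetic"/>
-->
```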

I've created a test class here; feel free to play with it.



Source: https://stackoverflow.com/questions/20350714/how-to-make-solr-spellchecker-to-correct-both-latin-and-cyrillic-words
