Elasticsearch using NEST: How to configure analyzers to find partial words?


Question


I am trying to search by partial word, ignoring case and ignoring accents on some letters. Is it possible? I think an nGram filter with the default tokenizer should do the trick, but I don't understand how to do it with NEST.

Example: "musiic" should match records that have "music"

The Elasticsearch version I am using is 1.9.

I am doing it like this, but it doesn't work:

var ix = new IndexSettings();
ix.Add("analysis",
    @"{
        'index_analyzer' : {
            'my_index_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['lowercase', 'mynGram']
            }
        },
        'search_analyzer' : {
            'my_search_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['standard', 'lowercase', 'mynGram']
            }
        },
        'filter' : {
            'mynGram' : {
                'type' : 'nGram',
                'min_gram' : 2,
                'max_gram' : 50
            }
        }
    }");
client.CreateIndex("sample", ix);

Thanks,

David


Answer 1:


Short Answer

I think what you're looking for is a fuzzy query, which uses the Levenshtein distance algorithm to match similar words.
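For example, a fuzzy query for the misspelling from your question might look like this (a minimal sketch, assuming a NEST 1.x-style fluent API and a made-up Song type with a Title property; exact descriptor method names vary between NEST versions):

// Hypothetical document type, just for illustration.
public class Song
{
    public string Title { get; set; }
}

// "musiic" is within an edit distance of 1 from "music",
// so a fuzzy query with default settings should match it.
var result = client.Search<Song>(s => s
    .Query(q => q
        .Fuzzy(f => f
            .OnField(p => p.Title)
            .Value("musiic"))));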

Long Answer on nGrams

The nGram filter splits the text into many smaller tokens based on the configured min_gram/max_gram range.

For example, from your 'music' query the filter will generate: 'mu', 'us', 'si', 'ic', 'mus', 'usi', 'sic', 'musi', 'usic', and 'music'
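Here is a quick standalone C# sketch of how that token list is produced (illustrative only, not Elasticsearch code):

using System;
using System.Collections.Generic;

class NGramDemo
{
    // Emit every substring of term whose length falls within [minGram, maxGram].
    static IEnumerable<string> NGrams(string term, int minGram, int maxGram)
    {
        for (int length = minGram; length <= Math.Min(maxGram, term.Length); length++)
            for (int start = 0; start + length <= term.Length; start++)
                yield return term.Substring(start, length);
    }

    static void Main()
    {
        // Prints: mu us si ic mus usi sic musi usic music
        Console.WriteLine(string.Join(" ", NGrams("music", 2, 50)));
    }
}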

As you can see, 'musiic' does not match any of these nGram tokens.

Why nGrams

One benefit of nGrams is that they make wildcard-style queries significantly faster, because all potential substrings are pre-generated and indexed at insert time (I have seen queries go from multiple seconds to 15 milliseconds after switching to nGrams).

Without the nGrams, each string in the index must be scanned at query time for a substring match (roughly O(n · m) across n strings of average length m), instead of the query term being looked up directly in the inverted index (effectively O(1)). As pseudocode:

hits = []
foreach string in index:
    if string.contains(query):
        hits.add(string)
return hits

vs

return index[query]
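In NEST terms, the two paths correspond roughly to a wildcard query versus a plain match query against an nGram-analyzed field (again a sketch using the hypothetical Song type from above; descriptor names vary by NEST version):

// Slow path: a wildcard query walks candidate terms at query time.
var slow = client.Search<Song>(s => s
    .Query(q => q
        .Wildcard(w => w
            .OnField(p => p.Title)
            .Value("*usi*"))));

// Fast path: with an nGram-analyzed field, "usi" is already an indexed
// token, so a plain match query is a direct inverted-index lookup.
var fast = client.Search<Song>(s => s
    .Query(q => q
        .Match(m => m
            .OnField(p => p.Title)
            .Query("usi"))));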

Note that this comes at the cost of slower inserts, more storage, and heavier memory usage.
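Finally, if you do want to go the nGram route, note that Elasticsearch expects custom analyzers to be registered under the analysis.analyzer key, whereas the settings in your question put them under index_analyzer and search_analyzer. A corrected sketch, assuming the same legacy IndexSettings.Add API from your question (the search analyzer deliberately omits the nGram filter, so the full query term is matched against the indexed grams):

var ix = new IndexSettings();
ix.Add("analysis",
    @"{
        'analyzer' : {
            'my_index_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['lowercase', 'mynGram']
            },
            'my_search_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['lowercase']
            }
        },
        'filter' : {
            'mynGram' : {
                'type' : 'nGram',
                'min_gram' : 2,
                'max_gram' : 50
            }
        }
    }");
client.CreateIndex("sample", ix);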



Source: https://stackoverflow.com/questions/13788178/elasticsearch-using-nest-how-to-configure-analyzers-to-find-partial-words
