Tokenizing string for completion suggester

我寻月下人不归  2020-12-06 23:07

I want to build autocomplete functionality for an e-commerce website, using the Completion Suggester.

This is my Index:

PUT myIndex
{
    "mappings":

2 Answers
  •  既然无缘  2020-12-06 23:54

    Based on Russ Cam's answer above (option 2), this Elasticsearch guide and also this document, I ended up with the following solution:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "filter": {
            "edge_ngram_token_filter": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 10
            },
            "additional_stop_words": {
              "type":       "stop",
              "stopwords":  ["your"]
            },
            "english_stemmer": {
              "type":       "stemmer",
              "language":   "english"
            },
            "english_possessive_stemmer": {
              "type":       "stemmer",
              "language":   "possessive_english"
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": [
                "C# => csharp",
                "c# => csharp"
              ]
            }
          },
          "analyzer": {
            "result_suggester_analyzer": { 
              "type": "custom",
              "tokenizer": "standard",
              "char_filter":  [ "html_strip", "my_char_filter" ],
              "filter": [
                "english_possessive_stemmer",
                "lowercase",
                "asciifolding",
                "stop",
                "additional_stop_words",
                "english_stemmer",
                "edge_ngram_token_filter",
                "unique"
              ]
            }
          }
        }
      }
    }
    

    Query to test this solution:

    POST my_index/_analyze
    {
      "analyzer": "result_suggester_analyzer",
      "text": "C# & SQL are great languages. K2 is the mountaineer's mountain. Your house-décor is à la Mode"
    }
    

    I would get these tokens (edge n-grams):

    cs, csh, csha, cshar, csharp, sq, sql, gr, gre, grea, great, la, lan, lang,
    langu, langua, languag, k2, mo, mou, moun, mount, mounta, mountai, mountain, 
    ho, hou, hous, hous, de, dec, deco, decor, mod, mode
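
    For reference, the list above shows just the token values; the raw _analyze response wraps each one in an object with offsets and a position, along these lines (a trimmed, illustrative sample, not copied from a real run):

    {
      "tokens": [
        { "token": "cs",  "start_offset": 0, "end_offset": 2, "type": "<ALPHANUM>", "position": 0 },
        { "token": "csh", "start_offset": 0, "end_offset": 2, "type": "<ALPHANUM>", "position": 0 },
        ...
      ]
    }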
    

    Things to note here:

    1. I am using the stop filter with its default English-language stop word list, which blocks are, is, the - but not your.
    2. I have defined the additional_stop_words filter, which stops your (see the quick check after this list).
    3. I am using the built-in english and possessive_english stemmers, which reduce words to their stems: that's why we have the languag token but not language or languages... also note that we have mountain but not mountaineering.
    4. I have defined mapped_words_char_filter, which converts C# to csharp; without it, c# would not produce a valid token... (this mapping does not cover F#, which therefore yields no token).
    5. I am using the built-in html_strip char filter, which converts &amp; to &; the standalone & never becomes a token - the standard tokenizer drops it, and min_gram = 2 would exclude a single character anyway.
    6. We are using the built-in asciifolding token filter, and that's why décor is tokenized as decor.
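
    As a quick check of notes 2 and 4, the request below (a sketch, assuming the index above was created as-is) analyzes a string containing your and F#:

    POST my_index/_analyze
    {
      "analyzer": "result_suggester_analyzer",
      "text": "Your F#"
    }

    This should return no tokens: your is removed by additional_stop_words, and F# loses the # in the standard tokenizer, leaving a single character f that falls below min_gram = 2.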

    This is the NEST code for the above:

    var createIndexResponse = ElasticClient.CreateIndex(IndexName, c => c
        .Settings(st => st
            .Analysis(an => an
                .Analyzers(anz => anz
                    .Custom("result_suggester_analyzer", cc => cc
                        .Tokenizer("standard")
                        .CharFilters("html_strip", "mapped_words_char_filter")
                        .Filters(new string[] { "english_possessive_stemmer", "lowercase", "asciifolding", "stop", "additional_stop_words", "english_stemmer", "edge_ngram_token_filter", "unique" })
                    )
                )
                .CharFilters(cf => cf
                    .Mapping("mapped_words_char_filter", md => md
                        .Mappings(
                            "C# => csharp",
                            "c# => csharp"
                        )
                    )
                )
                .TokenFilters(tfd => tfd
                    .EdgeNGram("edge_ngram_token_filter", engd => engd
                        .MinGram(2)
                        .MaxGram(10)
                    )
                    .Stop("additional_stop_word", sfd => sfd.StopWords(new string[] { "your" }))
                    .Stemmer("english_stemmer", esd => esd.Language("english"))
                    .Stemmer("english_possessive_stemmer", epsd => epsd.Language("possessive_english"))
                )
            )
        )
        // Map<T> needs a document POCO; MyDocument here is a placeholder type
        .Mappings(m => m.Map<MyDocument>(d => d.AutoMap())));
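
    To actually serve autocomplete, the analyzer still has to be attached to a field. A minimal sketch (the field name suggest_text is hypothetical; using the standard analyzer at search time is the usual edge-n-gram pattern, so the user's input is matched as typed rather than n-grammed again):

    PUT my_index/_mapping
    {
      "properties": {
        "suggest_text": {
          "type": "text",
          "analyzer": "result_suggester_analyzer",
          "search_analyzer": "standard"
        }
      }
    }

    POST my_index/_search
    {
      "query": {
        "match": { "suggest_text": "csha" }
      }
    }

    With the sample text indexed in suggest_text, the prefix csha would match via the stored csha edge n-gram.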
    
