Tokenizing string for completion suggester

我寻月下人不归  2020-12-06 23:07

I want to build autocomplete functionality for an e-commerce website, using the Completion Suggester.

This is my Index:

PUT myIndex
{
    "mappings":

2 Answers
  •  既然无缘  2020-12-06 23:54

    Based on Russ Cam's answer above (option 2), this Elasticsearch guide and also this document, I ended up with the following solution:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "filter": {
            "edge_ngram_token_filter": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 10
            },
            "additional_stop_words": {
              "type":       "stop",
              "stopwords":  ["your"]
            },
            "english_stemmer": {
              "type":       "stemmer",
              "language":   "english"
            },
            "english_possessive_stemmer": {
              "type":       "stemmer",
              "language":   "possessive_english"
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": [
                "C# => csharp",
                "c# => csharp"
              ]
            }
          },
          "analyzer": {
            "result_suggester_analyzer": { 
              "type": "custom",
              "tokenizer": "standard",
              "char_filter":  [ "html_strip", "my_char_filter" ],
              "filter": [
                "english_possessive_stemmer",
                "lowercase",
                "asciifolding",
                "stop",
                "additional_stop_words",
                "english_stemmer",
                "edge_ngram_token_filter",
                "unique"
              ]
            }
          }
        }
      }
    }
    

    Query to test this solution:

    POST my_index/_analyze
    {
      "analyzer": "result_suggester_analyzer",
      "text": "C# & SQL are great languages. K2 is the mountaineer's mountain. Your house-décor is à la Mode"
    }
    

    I would get these tokens (edge n-grams):

    cs, csh, csha, cshar, csharp, sq, sql, gr, gre, grea, great, la, lan, lang,
    langu, langua, languag, k2, mo, mou, moun, mount, mounta, mountai, mountain, 
    ho, hou, hous, hous, de, dec, deco, decor, mod, mode
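
    For reference, the list above shows just the token values; the raw _analyze response wraps each one in an object with offsets and a position, along these lines (a trimmed, illustrative sample, not copied from a real run):

    {
      "tokens": [
        { "token": "cs",  "start_offset": 0, "end_offset": 2, "type": "<ALPHANUM>", "position": 0 },
        { "token": "csh", "start_offset": 0, "end_offset": 2, "type": "<ALPHANUM>", "position": 0 },
        ...
      ]
    }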
    

    Things to note here:

    1. I am using the stop filter with its default English-language stop word list, which blocks are, is, the - but not your.
    2. I have defined the additional_stop_words filter, which stops your (see the quick check after this list).
    3. I am using the built-in english and possessive_english stemmers, which reduce words to their stems: that's why we have the languag token but not language or languages... also note that we have mountain but not mountaineering.
    4. I have defined mapped_words_char_filter, which converts C# to csharp; without it, c# would not produce a valid token... (this mapping does not cover F#, which therefore yields no token).
    5. I am using the built-in html_strip char filter, which converts &amp; to &; the standalone & never becomes a token - the standard tokenizer drops it, and min_gram = 2 would exclude a single character anyway.
    6. We are using the built-in asciifolding token filter, and that's why décor is tokenized as decor.
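
    As a quick check of notes 2 and 4, the request below (a sketch, assuming the index above was created as-is) analyzes a string containing your and F#:

    POST my_index/_analyze
    {
      "analyzer": "result_suggester_analyzer",
      "text": "Your F#"
    }

    This should return no tokens: your is removed by additional_stop_words, and F# loses the # in the standard tokenizer, leaving a single character f that falls below min_gram = 2.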

    This is the NEST code for the above:

    var createIndexResponse = ElasticClient.CreateIndex(IndexName, c => c
        .Settings(st => st
            .Analysis(an => an
                .Analyzers(anz => anz
                    .Custom("result_suggester_analyzer", cc => cc
                        .Tokenizer("standard")
                        .CharFilters("html_strip", "mapped_words_char_filter")
                        .Filters(new string[] { "english_possessive_stemmer", "lowercase", "asciifolding", "stop", "additional_stop_words", "english_stemmer", "edge_ngram_token_filter", "unique" })
                    )
                )
                .CharFilters(cf => cf
                    .Mapping("mapped_words_char_filter", md => md
                        .Mappings(
                            "C# => csharp",
                            "c# => csharp"
                        )
                    )
                )
                .TokenFilters(tfd => tfd
                    .EdgeNGram("edge_ngram_token_filter", engd => engd
                        .MinGram(2)
                        .MaxGram(10)
                    )
                    .Stop("additional_stop_word", sfd => sfd.StopWords(new string[] { "your" }))
                    .Stemmer("english_stemmer", esd => esd.Language("english"))
                    .Stemmer("english_possessive_stemmer", epsd => epsd.Language("possessive_english"))
                )
            )
        )
        // Map<T> needs a document POCO; MyDocument here is a placeholder type
        .Mappings(m => m.Map<MyDocument>(d => d.AutoMap())));
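
    To actually serve autocomplete, the analyzer still has to be attached to a field. A minimal sketch (the field name suggest_text is hypothetical; using the standard analyzer at search time is the usual edge-n-gram pattern, so the user's input is matched as typed rather than n-grammed again):

    PUT my_index/_mapping
    {
      "properties": {
        "suggest_text": {
          "type": "text",
          "analyzer": "result_suggester_analyzer",
          "search_analyzer": "standard"
        }
      }
    }

    POST my_index/_search
    {
      "query": {
        "match": { "suggest_text": "csha" }
      }
    }

    With the sample text indexed in suggest_text, the prefix csha would match via the stored csha edge n-gram.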
    
