How to properly handle multi-word synonym expansion using Elasticsearch?

Submitted by 拜拜、爱过 on 2019-12-24 15:14:37

Question


I have the following synonym expansion:

suco => suco, refresco, bebida de soja

What I want is for the search to be tokenized this way:

A search for "suco de laranja" would be tokenized to ["suco", "laranja", "refresco", "bebida de soja"].

But I'm getting ["suco", "laranja", "refresco", "bebida", "soja"] instead.

Note that "de" is a stop word. I want it to be dropped from the query, so that "bebida de laranja" becomes ["bebida", "laranja"], but I don't want it to be dropped inside a synonym, so that "bebida de soja" stays as the single token "bebida de soja".

My settings:

{
    "settings":{
        "analysis":{
            "filter":{
                "synonym_br":{
                    "type":"synonym",
                    "synonyms":[
                        "suco => suco, refresco, bebida de soja"
                    ]
                },
                "brazilian_stop":{
                    "type":"stop",
                    "stopwords":"_brazilian_"
                }
            },
            "analyzer":{
                "synonyms":{
                    "filter":[
                        "synonym_br",
                        "lowercase",
                        "brazilian_stop",
                        "asciifolding"
                    ],
                    "type":"custom",
                    "tokenizer":"standard"
                }
            }
        }
    }
}

Answer 1:


I would suggest making the following two changes. The first relates directly to the question you asked; the second is a general suggestion.

  1. Instead of expanding a single word into multiple synonyms, do the opposite: make all the synonyms point to a single one-word synonym. If none of the synonyms is a single word, map them all to some arbitrary single token instead. So, change "suco => suco, refresco, bebida de soja" to "suco, refresco, bebida de soja => suco"

  2. Change the order of the filters in the synonyms analyzer: place lowercase before synonym_br. This ensures that letter case doesn't affect the synonym_br token filter.

So the final settings will be:

{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco, refresco, bebida de soja => suco"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "lowercase",
            "synonym_br",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
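One way to check what the analyzer actually produces is Elasticsearch's `_analyze` API, which accepts an analyzer name and a text and returns the resulting tokens. The following is a minimal sketch using only the Python standard library. The base URL `http://localhost:9200` and the index name `produtos` are hypothetical placeholders, the helper names are made up for illustration, and the sketch assumes a cluster without authentication and an index created with the settings above.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, text, analyzer="synonyms"):
    """Build (but do not send) a POST request for the index-level _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(base_url, index, text):
    """Send the request and return the token strings from the response."""
    request = build_analyze_request(base_url, index, text)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Each entry in "tokens" describes one emitted token; keep only its text.
    return [entry["token"] for entry in payload["tokens"]]


# Example call (needs a running cluster; URL and index name are assumptions):
# fetch_tokens("http://localhost:9200", "produtos", "bebida de soja")
```

Adding `"explain": true` to the request body makes the API report the output of each token filter in turn, which is useful for seeing where a token is replaced or dropped; the response then has a different shape than the one parsed here.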

How does this work?

For the input bebida de soja, the filters are applied in the following order:

Filter              Result tokens
====================================
lowercase           bebida, de, soja
synonym_br          suco             <------- the tokens above, taken together and in order, exactly match a synonym
brazilian_stop      suco
asciifolding        suco

Let's see brazilian_stop in action. For this we need an input that doesn't match any synonym but has de in it, e.g. de soja:

Filter              Result tokens
=================================
lowercase           de, soja
synonym_br          de, soja  <------- none of the tokens (alone or combined, taking position into account) matches any synonym
brazilian_stop      soja      <------- de is removed as it is a stopword
asciifolding        soja
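The filter order can also be traced with a small toy model. The sketch below is not Elasticsearch's implementation: the tokenizer is a plain whitespace split, the stop list is a tiny hypothetical subset of `_brazilian_`, and only the single synonym rule from the settings is modelled. It is only meant to show why running the synonym filter before the stop filter keeps "bebida de soja" together.

```python
import unicodedata

# Assumption: a tiny stand-in for the much longer _brazilian_ stop list.
STOPWORDS = {"de", "da", "do", "e", "a", "o"}

# The contraction rule "suco, refresco, bebida de soja => suco",
# keyed by the token sequence that must match.
SYNONYM_RULES = {
    ("suco",): "suco",
    ("refresco",): "suco",
    ("bebida", "de", "soja"): "suco",
}


def tokenize(text):
    # Crude stand-in for the standard tokenizer: split on whitespace.
    return text.split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


def synonym_br(tokens):
    # Replace the longest run of consecutive tokens that matches a rule.
    out, i = [], 0
    max_len = max(len(key) for key in SYNONYM_RULES)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYM_RULES:
                out.append(SYNONYM_RULES[key])
                i += n
                break
        else:
            # No rule starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def brazilian_stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    # Drop combining accents, e.g. "açúcar" becomes "acucar".
    return [
        "".join(c for c in unicodedata.normalize("NFKD", t)
                if not unicodedata.combining(c))
        for t in tokens
    ]


def analyze(text):
    tokens = tokenize(text)
    for token_filter in (lowercase, synonym_br, brazilian_stop, asciifolding):
        tokens = token_filter(tokens)
    return tokens


print(analyze("Bebida de soja"))     # ['suco']
print(analyze("de soja"))            # ['soja']
print(analyze("Bebida de laranja"))  # ['bebida', 'laranja']
```

Because synonym_br sees "de" before brazilian_stop removes it, the three-token phrase can still match its rule; swapping those two filters in `analyze` would leave "bebida" and "soja" as separate tokens.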


Source: https://stackoverflow.com/questions/55944061/how-to-properly-handle-multi-words-synonym-expansion-using-elasticsearch
