Elasticsearch: how to index terms which are stopwords only?

Submitted by 青春壹個敷衍的年華 on 2019-12-01 11:22:40

You can use the synonym filter to convert "The The" into a single token, e.g. "thethe", which won't be removed by the stopwords filter.

First, configure the analyzer, placing the synonym filter before the stop filter:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "syn" : {
               "synonyms" : [
                  "the the => thethe"
               ],
               "type" : "synonym"
            }
         },
         "analyzer" : {
            "syn" : {
               "filter" : [
                  "lowercase",
                  "syn",
                  "stop"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

Then test it with the string "The The The Who".

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+The+Who&analyzer=syn' 

{
   "tokens" : [
      {
         "end_offset" : 7,
         "position" : 1,
         "start_offset" : 0,
         "type" : "SYNONYM",
         "token" : "thethe"
      },
      {
         "end_offset" : 15,
         "position" : 3,
         "start_offset" : 12,
         "type" : "<ALPHANUM>",
         "token" : "who"
      }
   ]
}

"The The" has been tokenized as the single token "thethe", and "The Who" as "who", because the preceding "the" was removed by the stopwords filter.

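The order of the filters matters here: the synonym filter has to see both "the" tokens before the stop filter gets a chance to remove them. The Python sketch below is only a toy model of that chain, not Elasticsearch code. The function names and the short stopword list are assumptions made for illustration.

```python
# Toy model of the lowercase -> syn -> stop chain; not Elasticsearch code.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # illustrative subset (assumption)


def merge_the_the(tokens):
    """Mimic the rule "the the => thethe" by merging adjacent pairs left to right."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + 2] == ["the", "the"]:
            out.append("thethe")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def remove_stopwords(tokens):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, synonyms_first=True):
    """Lowercase, split on whitespace, then apply the two filters in the chosen order."""
    tokens = text.lower().split()
    if synonyms_first:
        return remove_stopwords(merge_the_the(tokens))
    return merge_the_the(remove_stopwords(tokens))


print(analyze("The The The Who"))                        # ['thethe', 'who']
print(analyze("The The The Who", synonyms_first=False))  # ['who']
```

With the stop filter first, every "the" is gone before the synonym rule can match, so the band name disappears from the index.
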
To stop or not to stop

Which brings us back to the question of whether we should remove stopwords at all. You said:

I know I can ignore the stop words list completely 
but this is not what I want since the results searching 
for other bands like "the who" would explode.

What do you mean by that? Explode how? Index size? Performance?

Stopwords were originally introduced to improve search engine performance by removing common words which are likely to have little effect on the relevance of a query. However, we've come a long way since then. Our servers are capable of much more than they were back in the 80s.

Indexing stopwords won't have a huge impact on index size. For instance, indexing the word "the" means adding a single term to the index. You already have thousands of terms, so indexing the stopwords as well won't make much difference to size or to performance.

Actually, the bigger problem is that "the" is very common and thus has a low impact on relevance, so a search for "The The concert Madrid" will favor "Madrid" over the other terms. This can be mitigated by using a shingle filter, which would produce these tokens:

['the the','the concert','concert madrid']

While "the" may be common, "the the" isn't, and so it will rank higher.
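
To make the shingling step concrete, here is a rough Python sketch of two-word shingles. It is not the Lucene implementation, and `shingles` is a hypothetical helper written for this example. Note that Elasticsearch's shingle filter also emits the single-word tokens by default (`output_unigrams`), which this sketch leaves out so that it matches the list above.

```python
def shingles(text, size=2):
    """Return word n-grams ("shingles") of the given size, lowercased."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(shingles("The The concert Madrid"))
# ['the the', 'the concert', 'concert madrid']
```
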

You wouldn't query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (without stopwords) with a query against the shingled field.

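We can use a multi-field to analyze the text field in two different ways. If the test index from the first example still exists, delete it first, because creating an index that already exists fails: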

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "mappings" : {
      "test" : {
         "properties" : {
            "text" : {
               "fields" : {
                  "shingle" : {
                     "type" : "string",
                     "analyzer" : "shingle"
                  },
                  "text" : {
                     "type" : "string",
                     "analyzer" : "no_stop"
                  }
               },
               "type" : "multi_field"
            }
         }
      }
   },
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "no_stop" : {
               "stopwords" : "",
               "type" : "standard"
            },
            "shingle" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "shingle"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

Then use a multi_match query to query both versions of the field, giving the shingled version a higher boost. In this example, text.shingle^2 means that matches on that field are boosted by a factor of 2:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "multi_match" : {
         "fields" : [
            "text",
            "text.shingle^2"
         ],
         "query" : "the the concert madrid"
      }
   }
}
'