Assign a higher score to matches containing the search query at an earlier position in Elasticsearch

Question

This question is similar to my other question, which Val answered.

I have an index containing 3 documents.

    {
        "firstname": "Anne",
        "lastname": "Borg"
    }

    {
        "firstname": "Leanne",
        "lastname": "Ray"
    }

    {
        "firstname": "Anne",
        "middlename": "M",
        "lastname": "Stone"
    }

When I search for "Ann", I would like Elasticsearch to return all 3 of these documents (because they all match the term "Ann" to a degree). However, I would like Leanne Ray to have a lower score (relevance ranking), because the search term "Ann" appears at a later position in this document than it does in the other two documents.

Here are my index settings...

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "filter": [
                        "lowercase"
                    ],
                    "type": "custom",
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "token_chars": [
                        "letter",
                        "digit",
                        "custom"
                    ],
                    "custom_token_chars": "'-",
                    "min_gram": "1",
                    "type": "ngram",
                    "max_gram": "2"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "firstname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "lastname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "middlename": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "full_name": {
                "type": "text",
                "analyzer": "my_analyzer",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}
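
To make the effect of this analyzer concrete, the sketch below approximates in Python what an ngram tokenizer with min_gram 1 and max_gram 2, followed by a lowercase filter, would emit for each name. It is a simplification and not Elasticsearch's implementation: the helper char_ngrams is hypothetical, it splits on whitespace only (enough for these names, which contain only letters), it assumes the query text goes through the same analyzer, and it does not compute any score.

```python
from collections import Counter


def char_ngrams(text, min_gram=1, max_gram=2):
    """Approximate an ngram tokenizer plus lowercase filter (illustrative only)."""
    tokens = []
    for word in text.lower().split():  # assumption: whitespace is the only separator
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    tokens.append(word[start:start + size])
    return tokens


# Distinct grams produced by the search text.
query_grams = sorted(set(char_ngrams("Ann")))

# For each name, count how often each query gram occurs among its grams.
for name in ("Anne Borg", "Leanne Ray", "Anne M Stone"):
    counts = Counter(char_ngrams(name))
    matched = {gram: counts[gram] for gram in query_grams}
    print(f"{name}: {sum(counts.values())} grams, query grams matched: {matched}")
```

Under these assumptions every gram of "ann" also occurs in "leanne", and "Leanne Ray" contains the gram "a" twice (once in "leanne", once in "ray"). A score based only on which grams occur and how often therefore has no notion of the match sitting at the start of the name. The actual score also depends on field length and index statistics, which this sketch leaves out.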

The following query brings back the expected documents, but attributes a higher score to Leanne Ray than to Anne Borg.

{
    "query": {
        "bool": {
            "must": {
                "query_string": {
                    "query": "Ann",
                    "fields": ["full_name"]
                }
            },
            "should": {
                "match": {
                    "full_name": "Ann"
                }
            }
        }
    }
}

Here are the results...

"hits": [
        {
            "_index": "contacts_4",
            "_type": "_doc",
            "_id": "2",
            "_score": 6.6333585,
            "_source": {
                "firstname": "Anne",
                "middlename": "M",
                "lastname": "Stone"
            }
        },
        {
            "_index": "contacts_4",
            "_type": "_doc",
            "_id": "1",
            "_score": 6.142234,
            "_source": {
                "firstname": "Leanne",
                "lastname": "Ray"
            }
        },
        {
            "_index": "contacts_4",
            "_type": "_doc",
            "_id": "3",
            "_score": 6.079495,
            "_source": {
                "firstname": "Anne",
                "lastname": "Borg"
            }
        }

Using an ngram token filter and an ngram tokenizer together seems to fix this problem...

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "filter": [
                        "ngram"
                    ],
                    "tokenizer": "ngram"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "firstname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "lastname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "middlename": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "full_name": {
                "type": "text",
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}

The same query brings back the expected results with the desired relative scoring. Why does this work? Note that the first configuration uses an ngram tokenizer with a lowercase filter; the only difference here is that an ngram filter replaces the lowercase filter.

Here are the results. Notice that Leanne Ray scored lower than both Anne Borg and Anne M Stone, as desired.

"hits": [
    {
        "_index": "contacts_4",
        "_type": "_doc",
        "_id": "3",
        "_score": 4.953257,
        "_source": {
            "firstname": "Anne",
            "lastname": "Borg"
        }
    },
    {
        "_index": "contacts_4",
        "_type": "_doc",
        "_id": "2",
        "_score": 4.87168,
        "_source": {
            "firstname": "Anne",
            "middlename": "M",
            "lastname": "Stone"
        }
    },
    {
        "_index": "contacts_4",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0364896,
        "_source": {
            "firstname": "Leanne",
            "lastname": "Ray"
        }
    }

By the way, this query also brings back a lot of false positives when the index contains other documents as well. It's not a big problem, because these false positives have very low scores relative to the scores of the desirable hits, but it is still not ideal. For example, if I add {firstname: Gideon, lastname: Grossma} to the index, the above query will bring back that document in the result set as well, albeit with a much lower score than the documents containing the string "Ann".
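
One possible explanation for both observations, offered as a sketch rather than a verified account: the second analyzer has no lowercase filter, so the grams "A" and "An" from the query can only match names that contain a capital "A". The Python below models that analyzer under stated assumptions: the built-in ngram tokenizer and ngram token filter both use gram sizes 1 to 2, no lowercasing happens anywhere, and each name is analyzed as a separate value. The helpers grams and analyze are hypothetical stand-ins, not Elasticsearch APIs.

```python
def grams(text, min_gram=1, max_gram=2):
    """Character n-grams of the raw text, with no lowercasing (illustrative only)."""
    return [
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    ]


def analyze(text):
    """Hypothetical stand-in for the second analyzer: ngram tokenizer, then ngram filter."""
    return [sub for token in grams(text) for sub in grams(token)]


# Distinct, case-sensitive terms produced by the search text.
query_terms = set(analyze("Ann"))

# Which of those terms does each indexed name share with the query?
for value in ("Anne", "Leanne", "Gideon", "Grossma"):
    shared = sorted(query_terms & set(analyze(value)))
    print(f"{value}: shares {shared} with the query")
```

If this model holds, "Anne" shares every query term while "Leanne" shares only the lowercase grams "n" and "nn", so the better ranking would come from case sensitivity rather than from the position of the match. It would also account for the false positive: the single-character gram "n" in "Gideon" is enough for that document to match.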

Answer 1:

The answer is the same as in the linked thread. Since you're n-gramming all the indexed data, it works the same way with Ann as with Anne. You'll get the exact same response (see below), though with different scores:

"hits" : [
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5Jr-DHIBhYuDqANwSeiw",
    "_score" : 4.8442974,
    "_source" : {
      "firstname" : "Anne",
      "lastname" : "Borg"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5pr-DHIBhYuDqANwSeiw",
    "_score" : 4.828779,
    "_source" : {
      "firstname" : "Anne",
      "middlename" : "M",
      "lastname" : "Stone"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5Zr-DHIBhYuDqANwSeiw",
    "_score" : 0.12874341,
    "_source" : {
      "firstname" : "Leanne",
      "lastname" : "Ray"
    }
  }
]

UPDATE

Here is a modified query that you can use to check for parts (i.e. ann vs anne). Again, the casing makes no difference here, since the analyzer lowercases everything before indexing.

{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "ann",
          "fields": [
            "full_name"
          ]
        }
      },
      "should": [
        {
          "match_phrase_prefix": {
            "firstname": {
              "query": "ann",
              "boost": "10"
            }
          }
        },
        {
          "match_phrase_prefix": {
            "lastname": {
              "query": "ann",
              "boost": "10"
            }
          }
        }
      ]
    }
  }
}
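
If the search term comes from user input, the same query body can be assembled in code. The following is a minimal Python sketch, not part of the original answer: the function name build_prefix_boost_query and its parameters are hypothetical, and the boost is passed as the number 10 where the JSON above uses the string "10". It only builds the request body and does not contact a cluster.

```python
import json


def build_prefix_boost_query(term, search_field="full_name",
                             boost_fields=("firstname", "lastname"), boost=10):
    """Build the bool query shown above for an arbitrary term (hypothetical helper)."""
    # One boosted match_phrase_prefix clause per field whose words should
    # rank higher when they start with the term.
    should = [
        {"match_phrase_prefix": {field: {"query": term, "boost": boost}}}
        for field in boost_fields
    ]
    return {
        "query": {
            "bool": {
                # The must clause decides which documents match at all.
                "must": {"query_string": {"query": term, "fields": [search_field]}},
                # The should clauses only add score; they never exclude a document.
                "should": should,
            }
        }
    }


print(json.dumps(build_prefix_boost_query("ann"), indent=2))
```

The idea behind the query is that firstname and lastname have no analyzer set in the mapping above, so they are indexed as whole words. A prefix clause on "ann" can therefore add score for "anne" but not for "leanne". One caveat is that query_string interprets operator characters in its input, so raw user text may need escaping before it is passed in.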

Source: https://stackoverflow.com/questions/61768534/assign-a-higher-score-to-matches-containing-the-search-query-at-an-earlier-posit

Tags

ElasticSearch

n-gram

relevance

booleanquery