Weighted random sampling in Elasticsearch

Question

I need to obtain a random sample from an Elasticsearch index, i.e. to issue a query that retrieves some documents from a given index with weighted probability Wj/ΣWi (where Wj is the weight of document j and ΣWi is the sum of the weights of all documents matching the query).

Currently, I have the following query:

GET products/_search?pretty=true

{"size":5,
  "query": {
    "function_score": {
      "query": {
        "bool":{
          "must": {
            "term":
              {"category_id": "5df3ab90-6e93-0133-7197-04383561729e"}
          }
        }
      },
      "functions":
        [{"random_score":{}}]
    }
  },
  "sort": [{"_score":{"order":"desc"}}]
}

It returns 5 random items from the selected category. Each item has a weight field, so I probably have to use

"script_score": {
  "script": "weight = data['weight'].value / SUM; if (_score.doubleValue() > weight) {return 1;} else {return 0;}"
}

as described here.

I have the following questions:

What is the correct way to do this?
Do I need to enable Dynamic Scripting?
How do I calculate the sum of the weights of all documents matching the query?

Thanks a lot for your help!

Answer 1:

In case it helps anyone, here is how I recently implemented a weighted shuffle.

In this example, we shuffle companies. Each company has a "company_score" between 0 and 100. With this simple weighted shuffle, a company with score 100 gets, on average, a combined score 5 times that of a company with score 20, so it is much more likely to appear on the first page.

json_body = {
    "sort": ["_score"],
    "query": {
        "function_score": {
            "query": main_query,  # put your main query here
            "functions": [
                {
                    "random_score": {},
                },
                {
                    "field_value_factor": {
                        "field": "company_score",
                        "modifier": "none",
                        "missing": 0,
                    }
                }
            ],
            # How to combine the result of the two functions 'random_score' and 'field_value_factor'.
            # This way, on average the combined _score of a company having score 100 will be 5 times as much
            # as the combined _score of a company having score 20, and thus it will be much more likely
            # to appear on the first page.
            "score_mode": "multiply",
            # How to combine the result of function_score with the original _score from the query.
            # We overwrite it as our combined _score (random x company_score) is all we need.
            "boost_mode": "replace",
        }
    }
}
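
For reuse, the body above could be wrapped in a small helper. This is only a sketch: the function name build_weighted_shuffle_body and its parameters are made up for illustration and are not part of any Elasticsearch client, and the default size is arbitrary. The example call assumes the category filter and the weight field from the question.

```python
def build_weighted_shuffle_body(main_query, weight_field="company_score", size=10):
    """Return a search body that ranks hits by random_score * weight_field.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": main_query,
                "functions": [
                    # Uniform random number per document.
                    {"random_score": {}},
                    # The document's own weight; documents without it score 0.
                    {
                        "field_value_factor": {
                            "field": weight_field,
                            "modifier": "none",
                            "missing": 0,
                        }
                    },
                ],
                "score_mode": "multiply",  # random * weight
                "boost_mode": "replace",   # ignore the original query _score
            }
        },
    }


# Example: the category filter from the question, weighted by its "weight" field.
body = build_weighted_shuffle_body(
    {"term": {"category_id": "5df3ab90-6e93-0133-7197-04383561729e"}},
    weight_field="weight",
    size=5,
)
```

The resulting dict can be sent as the JSON body of a _search request with whichever HTTP or Elasticsearch client you already use. No explicit sort is included because hits are ordered by descending _score by default.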

Answer 2:

I know this question is old, but I'm answering for any future searchers.

The comment before yours in the GitHub thread seems to have the answer. If each of your documents has a relative weight, then you can pick a random score for each document and multiply it by the weight to create your new weighted random score. This has the added bonus of not needing the sum of weights.

For example, if two documents have weights 1 and 2, you'd expect the second to be twice as likely to be selected as the first. Give each document a random score between 0 and 1 (which you're already doing with "random_score"). Multiply the random score by the weight and the first document gets a score between 0 and 1 and the second a score between 0 and 2, so the second scores twice as high on average and is more likely to be selected. Note that this favors higher weights rather than reproducing the exact Wj/ΣWi proportions.
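
To see what the multiplication does to the selection odds, here is a small simulation sketch in plain Python (no Elasticsearch involved; first_pick_rate and its parameters are names invented for this illustration). It estimates how often the weight-2 document outranks the weight-1 document. For comparison it also tries the key random ** (1 / weight), known as the Efraimidis-Spirakis scheme. That scheme is not part of the approach described above; it is included on the assumption that exact Wj/ΣWi proportions matter to you.

```python
import random


def first_pick_rate(score_fn, weights, target, trials=100_000, seed=0):
    """Estimate how often the document at index `target` gets the top score.

    score_fn(u, w) turns a uniform random number u in [0, 1) and a weight w
    into a score; the document with the highest score is "picked first".
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        scores = [score_fn(rng.random(), w) for w in weights]
        best = max(range(len(weights)), key=scores.__getitem__)
        if best == target:
            wins += 1
    return wins / trials


weights = [1, 2]

# Scheme from this answer: score = random * weight.
multiply_rate = first_pick_rate(lambda u, w: u * w, weights, target=1)

# Alternative key (Efraimidis-Spirakis): score = random ** (1 / weight).
power_rate = first_pick_rate(lambda u, w: u ** (1 / w), weights, target=1)

print(f"random * weight:      {multiply_rate:.3f}")  # analytically 3/4
print(f"random ** (1/weight): {power_rate:.3f}")     # analytically 2/3
```

Analytically, random * weight puts the weight-2 document first with probability 3/4, against 2/3 under exact proportional sampling, so the multiply approach favors heavy documents somewhat more than their share of the total weight. Whether that difference matters depends on the use case; for a "weighted shuffle" of search results it is often acceptable.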

Source: https://stackoverflow.com/questions/34128770/weighted-random-sampling-in-elasticsearch

Tags

ElasticSearch

random-sample

weighted