Question
I have a large list of items, and each item has a weight.
I'd like to select N items randomly without replacement, such that items with more weight are more likely to be selected.
I'm looking for the best-performing approach. Performance is paramount. Any ideas?
Answer 1:
If you want to sample items without replacement, you have lots of options.
Use a weighted-choice-with-replacement algorithm to choose random indices. There are many algorithms like this. One of them is WeightedChoice, described later in this answer, and another is rejection sampling, described as follows. Assume that the highest weight is max and there are n weights. To choose an index in [0, n) using rejection sampling:
1. Choose a uniform random integer i in [0, n).
2. With probability weights[i]/max, return i. Otherwise, go to step 1.
Each time the weighted choice algorithm chooses an index, set the weight for the chosen index to 0 to keep it from being chosen again.
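The first option (rejection sampling, then zeroing the chosen weight) might look like the following in Python. This is only a sketch: the function names and the `rng` parameter are made up for illustration, and it assumes non-negative weights with at least k of them positive.

```python
import random

def rejection_choice(weights, max_weight, rng=random):
    """Return an index with probability proportional to its weight."""
    n = len(weights)
    while True:
        i = rng.randrange(n)  # step 1: uniform index in [0, n)
        # step 2: accept i with probability weights[i] / max_weight
        if rng.random() * max_weight < weights[i]:
            return i

def sample_without_replacement(weights, k, rng=random):
    """Pick k distinct indices, zeroing each chosen weight (hypothetical helper)."""
    if k == 0:
        return []
    remaining = list(weights)  # work on a copy so the caller's list is untouched
    if k > sum(1 for w in remaining if w > 0):
        raise ValueError("fewer than k items have a positive weight")
    # The original maximum stays a valid upper bound after weights are zeroed.
    max_weight = max(remaining)
    chosen = []
    for _ in range(k):
        i = rejection_choice(remaining, max_weight, rng)
        chosen.append(i)
        remaining[i] = 0  # keeps i from being chosen again
    return chosen
```

Keeping the original maximum is simple but means that acceptance gets rarer as heavy items are removed, so this is probably best when k is small relative to the list.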
Alternatively, assign each index an exponentially distributed random number (with a rate equal to that index's weight), make a list of pairs assigning each number to an index, then sort that list by those numbers. Then take each item from first to last. This sorting can be done on-line using a priority queue data structure (a technique that leads to weighted reservoir sampling). Notice, however, that the naïve way to generate the random number, -ln(1-RNDU01())/weight, is not robust ("Index of Non-Uniform Distributions", under "Exponential distribution").
Tim Vieira gives additional options in his blog.
A paper by Bram van de Klundert compares various algorithms.
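For the exponential-key option described above, a minimal Python sketch could look like this. The function name and `rng` parameter are illustrative, not from the answer. Note that `random.expovariate` is used here only for brevity; it is essentially the plain -ln(1-u)/rate formula that the answer cautions about, so swap in a more robust generator if that matters for your use.

```python
import heapq
import random

def weighted_sample_exponential(weights, k, rng=random):
    """Return up to k distinct indices; the smallest exponential keys win."""
    keyed = (
        (rng.expovariate(w), i)  # exponential variate with rate = weight
        for i, w in enumerate(weights)
        if w > 0                 # zero-weight items are never eligible
    )
    # nsmallest keeps only k candidates at a time (the priority-queue idea),
    # so the whole list of pairs never has to be sorted.
    return [i for _, i in heapq.nsmallest(k, keyed)]
```

If fewer than k items have a positive weight, this sketch simply returns all of them rather than raising an error.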
EDIT (Aug. 19): Note that for these solutions, the weight expresses how likely a given item will appear first in the sample. This weight is not necessarily the chance that a given sample of n items will include that item (that is, an inclusion probability). The methods given above will not necessarily ensure that a given item will appear in a random sample with probability proportional to its weight; for that, see "Algorithms of sampling with equal or unequal probabilities".
Previous post:
Assuming you want to choose items at random with replacement, here is pseudocode implementing this kind of choice. Given a list of weights, it returns a random index (starting at 0), chosen with a probability proportional to its weight. See also "Weighted Choice".
METHOD WChoose(weights, value)
    // Choose the index according to the given value
    lastItem = size(weights) - 1
    runningValue = 0
    for i in 0...size(weights) - 1
        if weights[i] > 0
            newValue = runningValue + weights[i]
            lastItem = i
            // NOTE: Includes start, excludes end
            if value < newValue: break
            runningValue = newValue
        end
    end
    // If we didn't break above, this is a last
    // resort (might happen because rounding
    // error happened somehow)
    return lastItem
END METHOD

METHOD WeightedChoice(weights)
    return WChoose(weights, RNDINTEXC(Sum(weights)))
END METHOD
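For reference, here is one way the pseudocode above might be translated to Python. It assumes positive integer weights and reads RNDINTEXC(x) as "uniform random integer in [0, x)"; the snake_case names and the `rng` parameter are my own.

```python
import random

def w_choose(weights, value):
    """Return the index whose cumulative-weight interval contains value."""
    last_item = len(weights) - 1
    running_value = 0
    for i, w in enumerate(weights):
        if w > 0:
            new_value = running_value + w
            last_item = i
            # Interval includes its start and excludes its end.
            if value < new_value:
                break
            running_value = new_value
    # If the loop never broke, fall back to the last positive-weight index.
    return last_item

def weighted_choice(weights, rng=random):
    # Assumption: integer weights with a positive sum, so randrange
    # plays the role of RNDINTEXC here.
    return w_choose(weights, rng.randrange(sum(weights)))
```

For example, `w_choose([1, 0, 3], 1)` lands in the interval belonging to index 2, because index 0 only covers the value 0 and index 1 has no weight.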
This algorithm is a straightforward way to implement weighted choice, but if it's too slow for you, the following alternatives may be faster:
- Vose's alias method, a variant of the original Walker's alias method. See "Darts, Dice, and Coins: Sampling from a Discrete Distribution" by Keith Schwarz for more information.
- The Fast Loaded Dice Roller.
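To give an idea of what the alias method involves, here is a compact Python sketch of Vose's variant as I understand it: O(n) setup, then O(1) per draw. The names are illustrative, and it assumes non-negative weights with a positive sum. Like WeightedChoice, it samples with replacement.

```python
import random

def build_alias_table(weights):
    """Set up probability and alias tables for O(1) weighted choice."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # rescale so the mean is 1
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        sm = small.pop()
        lg = large.pop()
        prob[sm] = scaled[sm]  # chance of keeping sm when its slot is hit
        alias[sm] = lg         # otherwise the slot yields lg
        scaled[lg] = (scaled[lg] + scaled[sm]) - 1.0
        (small if scaled[lg] < 1.0 else large).append(lg)
    # Whatever is left equals 1 up to rounding error.
    for i in large + small:
        prob[i] = 1.0
    return prob, alias

def alias_choice(prob, alias, rng=random):
    """Draw one index: pick a slot uniformly, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

The tables have to be rebuilt whenever a weight changes, so for sampling without replacement this is probably most useful when combined with rejecting indices that were already chosen.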
Answer 2:
Let A be the item array with x items. The complexity of each method is defined as
< preprocessing_time, querying_time >
If sorting is possible: < O(x lg x), O(n) >
- sort A by the weight of the items.
- create an array B, for example: B = [ 0, 0, 0, x/2, x/2, x/2, x/2, x/2 ].
  - B clearly gives x/2 a higher probability of being chosen.
- if you haven't picked n elements yet, choose a random element e from B.
- pick a random element from A within the interval e : x-1.
If iterating through the items is possible: < O(x), O(tn) >
- iterate through A and find the average weight w of the elements.
- define the maximum number of tries t.
- try (at most t times) to pick a random element in A whose weight is bigger than w.
  - test for some t that gives you good/satisfactory results.
If nothing above is possible: < O(1), O(tn) >
- define the maximum number of tries t.
- if you haven't picked n elements yet, take t random elements in A.
- pick the element with biggest value.
  - test for some t that gives you good/satisfactory results.
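The last method above (take t random elements, keep the heaviest) could be sketched in Python roughly as follows. The function name is made up, and skipping candidates that were already picked is my own assumption, since the answer doesn't say how repeats are handled. Keep in mind that this is a heuristic: heavier items are favoured, but the selection probabilities are not proportional to the weights.

```python
import random

def tournament_sample(weights, n, t, rng=random):
    """Pick n distinct indices; each is the heaviest of up to t random candidates."""
    if n > len(weights):
        raise ValueError("cannot pick more items than the list contains")
    chosen = set()
    result = []
    while len(result) < n:
        # Take t random indices, ignoring any that were already picked (assumption).
        candidates = [rng.randrange(len(weights)) for _ in range(t)]
        candidates = [i for i in candidates if i not in chosen]
        if not candidates:
            continue  # every candidate was a repeat; draw a fresh batch
        best = max(candidates, key=lambda i: weights[i])
        chosen.add(best)
        result.append(best)
    return result
```

With t = 1 this degenerates to uniform sampling, and a very large t almost always returns the heaviest remaining items, so t is the knob to tune as the answer suggests.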
Source: https://stackoverflow.com/questions/62455064/what-would-be-the-fastest-algorithm-to-randomly-select-n-items-from-a-list-based