Question
Foreword
This may look like a duplicate of a few Stack Overflow questions, but my situation is (probably) slightly unique.
My situation
I have a dictionary. The key is a string and the value is an integer.
I want the Python script to randomly choose N keys.
Each value is a weight: the higher a key's value, the higher the chance that key will be randomly chosen.
My solution
So, using some other Stack Overflow posts and the power of the internet, I managed to solve it using weighted random selection.
import random

DICT_VAR = {'best': 308281009, 'good': 7066325, 'meh': 26884,
            'bad': 71, 'terrible': 16, 'never': 0}
list_var = []
for key in DICT_VAR.keys():
    list_var.extend([key] * DICT_VAR[key])  # repeat each key 'weight' times
print random.sample(list_var, 2)  # get 2 random choices, I suppose
The issue (the catch)
As you may notice, the values in the dictionary can be incredibly big (there is no upper limit) and can also be as small as 0 (zero is the smallest; there are no negative numbers).
Running this code (with slightly bigger numbers) caused my computer to freeze and become unresponsive until I hard-reset it.
My question
How should I deal with this situation? Is there another way of choosing randomly that suits my case, since building a weighted list like this is the worst possible approach here?
Answer 1:
I will assume here that a value of 0 means the key should never be chosen, that keys may be repeated in the sample (repetition within the dictionary is irrelevant), and that we may use a third-party module, numpy in this case. The code below was tested in Python 3.6.4, then modified so it should also run in Python 2.7, though I could not test it that way.
import numpy as np

DICT_VAR = {'best': 308281009, 'good': 7066325, 'meh': 26884,
            'bad': 71, 'terrible': 16, 'never': 0}

keys, weights = zip(*DICT_VAR.items())
# Normalize the weights into probabilities; a zero weight becomes
# a zero probability.
probs = np.array(weights, dtype=float) / float(sum(weights))
sample_np = np.random.choice(keys, 2, p=probs)
sample = [str(val) for val in sample_np]  # back to plain Python strings
Then sample holds your sample as a list of key strings. Note that your weight for key 'best' is so much larger than the other weights that your sample will almost always be ['best', 'best'].
To explain my code: first split the dictionary's keys (strings) and values (weights) into separate sequences. Then convert the weights to probabilities: larger weights give larger probabilities, and a zero weight gives a zero probability. Then use numpy's choice function to draw a sample of keys using those probabilities as weights. The result is a numpy array, but you seem to want a standard Python list, so the final line converts the sample of keys into one.
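To see the weight-to-probability conversion on the question's numbers (a quick check, not part of the answer's own code):

```python
import numpy as np

weights = [308281009, 7066325, 26884, 71, 16, 0]  # values from the question
probs = np.array(weights, dtype=float) / float(sum(weights))
print(probs[-1])  # 0.0 -- the zero weight can never be drawn
```

The probabilities sum to 1 (up to floating-point rounding), which is what np.random.choice requires of its p argument.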
There is, of course, a fairly short routine that could be written in standard Python to avoid numpy, but it would most probably be slower.
The reason your routine was slow is that it builds a large list, with each key repeated the number of times given by its value, and then draws a sample from it with uniform probability. With your sample data, that means building a huge list, far larger than your available RAM, which takes a long time. Numpy's choice routine can handle a non-uniform distribution directly, without building another list.
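The "fairly short routine in standard Python" mentioned above could be sketched as follows. This is a minimal sketch of the standard cumulative-weights-plus-binary-search technique using bisect and itertools.accumulate; the function name is mine, and it assumes at least one weight is positive:

```python
import bisect
import random
from itertools import accumulate

def weighted_sample(weights_by_key, k):
    """Draw k keys with replacement, probability proportional to weight.

    Zero-weight keys are never chosen. Assumes at least one positive weight.
    """
    keys = list(weights_by_key)
    # Cumulative weights, e.g. [3, 1, 0] -> [3, 4, 4].
    cum = list(accumulate(weights_by_key[key] for key in keys))
    total = cum[-1]
    # For each draw, pick a point in [0, total) and binary-search the
    # cumulative weights: O(log n) per draw, no giant expanded list.
    return [keys[bisect.bisect_right(cum, random.random() * total)]
            for _ in range(k)]

DICT_VAR = {'best': 308281009, 'good': 7066325, 'meh': 26884,
            'bad': 71, 'terrible': 16, 'never': 0}
print(weighted_sample(DICT_VAR, 2))
```

Because the drawn point is strictly less than the total, bisect_right can never land on a trailing zero-weight entry, so 'never' is indeed never returned.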
Answer 2:
In Python 3.6+ this is part of the standard library, with random.choices():
In []:
import random
random.choices(list(DICT_VAR.keys()), DICT_VAR.values(), k=2)
Out[]:
['best', 'best']
Or slightly more arcanely:
random.choices(*zip(*DICT_VAR.items()), k=2)
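For readers puzzled by the star-unpacking idiom: zip(*DICT_VAR.items()) transposes the dictionary's (key, value) pairs into one tuple of keys and one tuple of weights, which the outer * then passes as the two positional arguments of random.choices. A small demonstration (the printed order assumes Python 3.7+, where dicts preserve insertion order):

```python
DICT_VAR = {'best': 308281009, 'good': 7066325, 'meh': 26884,
            'bad': 71, 'terrible': 16, 'never': 0}

# dict.items() yields (key, value) pairs; zip(*...) transposes them
# into one tuple of keys and one tuple of weights.
keys, weights = zip(*DICT_VAR.items())
print(keys)     # ('best', 'good', 'meh', 'bad', 'terrible', 'never')
print(weights)  # (308281009, 7066325, 26884, 71, 16, 0)
```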
Source: https://stackoverflow.com/questions/51813308/python-random-choice-with-percentage