Fast way to obtain a random index from an array of weights in Python

故里飘歌 2020-12-21 14:41

I regularly find myself in the position of needing a random index to an array or a list, where the probabilities of indices are not uniformly distributed, but according to c

1 answer
  • 2020-12-21 15:42

    Cumulative summing and bisect

    When speed is a concern, a good general-purpose approach is to compute the cumulative sum of the weights and use bisect from the bisect module to locate a uniformly drawn random point in the resulting sorted array:

    import bisect, numpy

    def weighted_choice(weights):
        cs = numpy.cumsum(weights)  # running totals; sorted if no weight is negative
        return bisect.bisect(cs, numpy.random.random() * cs[-1])
    

    A more detailed analysis is given below.
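    As a quick sanity check of the cumulative-sum approach, the following sketch draws many indices and counts how often each one comes up. The example weights and the sample count of 10000 are arbitrary assumptions chosen for illustration, not values from the original answer.

```python
import bisect
from collections import Counter

import numpy


def weighted_choice(weights):
    # Cumulative sums form a sorted array; cs[-1] is the total weight.
    cs = numpy.cumsum(weights)
    # A uniform point in [0, total) lands in bucket i with probability
    # weights[i] / total.
    return bisect.bisect(cs, numpy.random.random() * cs[-1])


weights = numpy.array([1.0, 0.0, 3.0])  # hypothetical example weights
counts = Counter(weighted_choice(weights) for _ in range(10000))

# Index 1 has weight 0 and should never be drawn; index 2 should come up
# roughly three times as often as index 0.
print(sorted(counts.items()))
```

    Because bisect.bisect is bisect_right, a zero-weight entry shares its cumulative value with its predecessor and is skipped.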

    Note: If the array is not flat, numpy.unravel_index can be used to transform a flat index into a shaped index, as seen in https://stackoverflow.com/a/19760118/1274613
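    A minimal sketch of that note, assuming a two-dimensional weight array: numpy.cumsum flattens its input when no axis is given, so the flat index found by bisect can be passed to numpy.unravel_index together with the original shape. The helper name weighted_choice_nd and the example grid are hypothetical.

```python
import bisect

import numpy


def weighted_choice_nd(weights):
    # Hypothetical helper: draw one index tuple from an n-dimensional
    # array of non-negative weights.
    weights = numpy.asarray(weights)
    cs = numpy.cumsum(weights)  # flattens because no axis is given
    flat = bisect.bisect(cs, numpy.random.random() * cs[-1])
    # Convert the flat index back into one index per dimension.
    return numpy.unravel_index(flat, weights.shape)


# All of the weight sits at row 1, column 1, so that cell is always chosen.
grid = numpy.array([[0.0, 0.0, 0.0],
                    [0.0, 5.0, 0.0]])
row, col = weighted_choice_nd(grid)
print(row, col)
```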

    Experimental Analysis

    There are four more or less obvious solutions using numpy builtin functions. Comparing all of them using timeit gives the following result:

    import timeit
    
    weighted_choice_functions = [
    """import numpy
    wc = lambda weights: numpy.random.choice(
        range(len(weights)),
        p=weights/weights.sum())
    """,
    """import numpy
    # Adapted from https://stackoverflow.com/a/19760118/1274613
    def wc(weights):
        cs = numpy.cumsum(weights)
        return cs.searchsorted(numpy.random.random() * cs[-1], 'right')
    """,
    """import numpy, bisect
    # Using bisect mentioned in https://stackoverflow.com/a/13052108/1274613
    def wc(weights):
        cs = numpy.cumsum(weights)
        return bisect.bisect(cs, numpy.random.random() * cs[-1])
    """,
    """import numpy
    wc = lambda weights: numpy.random.multinomial(
        1,
        weights/weights.sum()).argmax()
    """]
    
    for setup in weighted_choice_functions:
        for ps in ["numpy.ones(40)",
                   "numpy.arange(10)",
                   "numpy.arange(200)",
                   "numpy.arange(199,-1,-1)",
                   "numpy.arange(4000)"]:
            print(timeit.timeit("wc(%s)"%ps, setup=setup))
        print()
    

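    The resulting output (seconds per 1,000,000 calls, the timeit default; one block of five timings per function, in the order listed above) is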

    178.45797914802097
    161.72161589498864
    223.53492237901082
    224.80936180002755
    1901.6298267539823
    
    15.197789980040397
    19.985687876993325
    20.795070077001583
    20.919113760988694
    41.6509403079981
    
    14.240949985047337
    17.335801470966544
    19.433710905024782
    19.52205040602712
    35.60536142199999
    
    26.6195822560112
    20.501282756973524
    31.271995796996634
    27.20013752405066
    243.09768892999273
    

    This means that numpy.random.choice is surprisingly slow, at least ten times slower than the cumulative-sum variants in every case tested, and even the dedicated numpy searchsorted method is slightly slower than the type-naive bisect variant. (These results were obtained using Python 3.3.5 with numpy 1.8.1, so things may be different for other versions.) The function based on numpy.random.multinomial is less efficient for long weight arrays than the methods based on cumulative summing. Presumably the fact that argmax has to iterate over the whole array and run a comparison at each step plays a significant role, as can also be seen from the roughly four-second difference between the increasing and the decreasing weight list of length 200 (31.3 s versus 27.2 s).
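    Since the timings depend on the Python and numpy versions, the comparison is worth repeating locally. The sketch below times only the two cumulative-sum variants and passes callables to timeit instead of setup strings. The function names wc_searchsorted and wc_bisect and the repeat count of 10000 are assumptions made for this sketch, so the absolute numbers are not comparable with the table above.

```python
import bisect
import timeit

import numpy


def wc_searchsorted(weights):
    cs = numpy.cumsum(weights)
    return cs.searchsorted(numpy.random.random() * cs[-1], 'right')


def wc_bisect(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])


weights = numpy.arange(200)
results = {}
for func in (wc_searchsorted, wc_bisect):
    # number=10000 keeps the run short; the timeit default is 1,000,000.
    results[func.__name__] = timeit.timeit(lambda: func(weights), number=10000)
    print(func.__name__, results[func.__name__])
```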
