How to improve np.random.choice() looping efficiency

Submitted by 我们两清 on 2019-12-11 06:19:47

Question


I am trying to apply np.random.choice to a big array with different weights, and I wonder whether there is any way to avoid the loop and improve performance. Here len(weights) could be in the millions.

weights = [[0.1, 0.5, 0.4],
           [0.2, 0.4, 0.4],
           ...
           [0.3, 0.3, 0.4]]

choice = [1, 2, 3]
ret = np.zeros((len(weights), 20))
for i in range(len(weights)):
    ret[i] = np.random.choice(choice, 20, p=weights[i])

Answer 1:


Here's a generalization of my answer in Fast random weighted selection across all rows of a stochastic matrix:

def vectorized_choice(p, n, items=None):
    # Cumulative distribution along each row.
    s = p.cumsum(axis=1)
    # n uniform draws per row, shaped to broadcast against the cumsums.
    r = np.random.rand(p.shape[0], n, 1)
    # First column whose cumulative sum reaches each draw.
    q = np.expand_dims(s, 1) >= r
    k = q.argmax(axis=-1)
    if items is not None:
        k = np.asarray(items)[k]
    return k

p is expected to be a two-dimensional array whose rows are probability vectors. n is the number of samples to draw from the distribution defined by each row. If items is None, the samples are integers in range(0, p.shape[1]). If items is not None, it is expected to be a sequence with length p.shape[1].
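The mechanism is inverse-CDF sampling: each uniform draw is compared against the row's running cumulative sums, and argmax picks the first column whose cumulative probability reaches the draw. A minimal single-row illustration (not part of the original answer):

```python
import numpy as np

p_row = np.array([0.1, 0.5, 0.4])
s = p_row.cumsum()            # array([0.1, 0.6, 1.0])
r = 0.55                      # a uniform draw in [0, 1)
k = int(np.argmax(s >= r))    # first index where the cumsum reaches r
# 0.55 falls in the interval (0.1, 0.6], so k == 1
```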

Example:

In [258]: p = np.array([[0.1, 0.5, 0.4], [0.75, 0, 0.25], [0, 0, 1], [1/3, 1/3, 1/3]])                                                   

In [259]: p                                                                                                                              
Out[259]: 
array([[0.1       , 0.5       , 0.4       ],
       [0.75      , 0.        , 0.25      ],
       [0.        , 0.        , 1.        ],
       [0.33333333, 0.33333333, 0.33333333]])

In [260]: vectorized_choice(p, 20)                                                                                                       
Out[260]: 
array([[1, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 0, 1, 2, 2, 2],
       [0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [1, 0, 2, 2, 0, 1, 2, 1, 0, 0, 0, 0, 2, 2, 0, 0, 2, 1, 1, 2]])

In [261]: vectorized_choice(p, 20, items=[1, 2, 3])                                                                                      
Out[261]: 
array([[2, 1, 2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3, 2, 3, 2, 2],
       [1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 1, 1, 3, 3, 1, 3, 1, 1, 1],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 1, 3, 2, 1, 2, 3, 1, 2, 2, 3, 2, 1, 2, 1, 2, 2, 2]])

Timing for p with shape (1000000, 3):

In [317]: p = np.random.rand(1000000, 3)

In [318]: p /= p.sum(axis=1, keepdims=True)

In [319]: %timeit vectorized_choice(p, 20, items=np.arange(1, p.shape[1]+1))
1.89 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here's the timing for Divakar's function:

In [320]: %timeit random_choice_prob_vectorized(p, 20, choice=np.arange(1, p.shape[1]+1))
7.33 s ± 43.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The difference will be less pronounced if you increase the number of columns in p, and if you make the number of columns big enough, Divakar's function will be faster. E.g.

In [321]: p = np.random.rand(1000, 120)

In [322]: p /= p.sum(axis=1, keepdims=True)

In [323]: %timeit vectorized_choice(p, 20, items=np.arange(1, p.shape[1]+1))
6.41 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [324]: %timeit random_choice_prob_vectorized(p, 20, choice=np.arange(1, p.shape[1]+1))
6.29 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



Answer 2:


Borrowing the idea from Vectorizing numpy.random.choice for given 2D array of probabilities along an axis, along with the idea from vectorized searchsorted, here's one vectorized way -

def random_choice_prob_vectorized(weights, num_items, choice=None):
    weights = np.asarray(weights)

    # Cumulative distribution along each row, plus one uniform
    # draw per output sample.
    w = weights.cumsum(1)
    r = np.random.rand(len(weights), num_items)

    # Offset each row by its index so one flat searchsorted
    # resolves all rows at once; % n recovers the column index.
    m, n = w.shape
    o = np.arange(m)[:, None]
    w_o = (w + o).ravel()
    r_o = (r + o).ravel()
    idx = np.searchsorted(w_o, r_o).reshape(m, -1) % n
    if choice is not None:
        return np.asarray(choice)[idx]
    else:
        return idx
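The key trick is adding the row index to both the cumulative sums and the random draws, so the flattened rows occupy disjoint value ranges and a single np.searchsorted call handles every row at once. A small illustration of the offset idea (values are my own, not from the answer):

```python
import numpy as np

# Cumulative sums for two rows; offsetting row i by i keeps the
# flattened rows in the disjoint ranges [i, i + 1].
w = np.array([[0.1, 0.6, 1.0],
              [0.75, 0.75, 1.0]])
r = np.array([[0.55], [0.8]])        # one uniform draw per row
m, n = w.shape
o = np.arange(m)[:, None]
idx = np.searchsorted((w + o).ravel(), (r + o).ravel()).reshape(m, -1) % n
# row 0: 0.55 lands in bucket 1; row 1: 0.8 lands in bucket 2
```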

Sample run to verify using 2D bincount -

In [28]: weights = [[0.1, 0.5, 0.4],
    ...:            [0.2, 0.4, 0.4],
    ...:            [0.3, 0.3, 0.4]]
    ...: 
    ...: choice = [1, 2, 3]
    ...: num_items = 20000

In [29]: out = random_choice_prob_vectorized(weights, num_items, choice)

# Use 2D bincount to get the average occurrences per row and verify against weights
In [75]: bincount2D_vectorized(out)/num_items
Out[75]: 
array([[0.     , 0.09715, 0.4988 , 0.40405],
       [0.     , 0.1983 , 0.40235, 0.39935],
       [0.     , 0.30025, 0.29485, 0.4049 ]])
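The bincount2D_vectorized helper used above is not defined in this answer (it comes from a related answer); a minimal stand-in that counts occurrences per row, using the same row-offset trick, could look like this:

```python
import numpy as np

def bincount2D_vectorized(a):
    # Per-row bincount: offset each row by row_index * N so a single
    # flat np.bincount tallies every row into disjoint bins.
    N = a.max() + 1
    a_offs = a + np.arange(a.shape[0])[:, None] * N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0] * N).reshape(-1, N)
```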



Answer 3:


It looks like each row of the resulting array is independent of the other rows. I am not sure how bad the performance is now. If it really is a concern, I would try to use Python's multiprocessing module to run the random number generation in several processes in parallel. It should help.
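A sketch of that idea (function names and chunking are my own, not from the answer): split the rows across worker processes and let each worker run the original per-row np.random.choice loop on its chunk.

```python
import numpy as np
from multiprocessing import Pool

def _sample_chunk(args):
    # Run the original per-row choice loop on one chunk of weight rows.
    weights_chunk, num_items = args
    rng = np.random.default_rng()
    out = np.empty((len(weights_chunk), num_items), dtype=np.int64)
    for i, p in enumerate(weights_chunk):
        out[i] = rng.choice(len(p), size=num_items, p=p)
    return out

def parallel_choice(weights, num_items, processes=4):
    weights = np.asarray(weights)
    chunks = np.array_split(weights, processes)
    with Pool(processes) as pool:
        parts = pool.map(_sample_chunk, [(c, num_items) for c in chunks])
    return np.vstack(parts)
```

Process startup and pickling add fixed overhead, so this only pays off when the number of rows is large.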



Source: https://stackoverflow.com/questions/57237596/how-to-improve-np-random-choice-looping-efficiency
