Speed up sub-array shuffling and storing

Submitted by 微笑、不失礼 on 2019-12-24 06:45:13

Question


I have an array of numbers (di) and a list (rang_indx) made up of numpy sub-arrays of integers (code below). For each of these sub-arrays, I need to store in a separate list (indx) a number of randomly chosen elements, where that number is given by the corresponding entry in di.

As far as I can see, calling np.random.shuffle() on rang_indx will not shuffle the elements within the sub-arrays but rather the sub-arrays themselves within rang_indx, which is not what I need. Hence, I use a for loop to first shuffle each sub-array (in place), and then a list comprehension (combined with a zip()) to generate the indx list.
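
A minimal sketch of the behaviour described above, using made-up toy arrays in place of rang_indx (the names nested, before and contents_unchanged are only for illustration):

```python
import numpy as np

# Toy stand-in for rang_indx (values are made up for illustration).
nested = [np.array([1, 2, 3, 4]), np.array([10, 20]), np.array([100, 200, 300])]
before = [a.copy() for a in nested]

# Shuffling the outer list only reorders the sub-arrays; every sub-array
# still holds its elements in the original order.
np.random.shuffle(nested)
contents_unchanged = all(
    any(np.array_equal(a, b) for b in before) for a in nested
)
print(contents_unchanged)  # True

# Shuffling each sub-array separately permutes its elements in place.
for sub in nested:
    np.random.shuffle(sub)
print([sub.tolist() for sub in nested])
```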

This function is called millions of times as part of a larger program. Is there a way to speed up the process?

import numpy as np


def func(di, rang_indx):
    # Shuffle each sub-array in place.
    for _ in rang_indx:
        np.random.shuffle(_)

    # For each shuffled sub-array, only keep as many elements as those
    # indicated by the 'di' array.
    indx = [_[:i] for (_, i) in zip(*[rang_indx, di.astype(int)])]

    return indx


# This data is not fixed, and will change with each call to func()
di = np.array([ 4.,  2.,   0.,   600.,  12.,  22.,  13.,  21.,  25.,  25.,  12.,  11.,
         7.,  12.,  10.,  13.,   5.,  10.])
rang_indx = [np.array([]), np.array([189, 195, 209, 214, 236, 237, 255, 286, 290, 296, 301, 304, 321,
       323, 327, 329]), np.array([164, 171, 207, 217, 225, 240, 250, 263, 272, 279, 284, 285, 289]), np.array([101, 162, 168, 177, 179, 185, 258, 261, 264, 269, 270, 278, 281,
       287, 293, 298]), np.array([111, 127, 143, 156, 159, 161, 181, 182, 183, 194, 196, 198, 204,
       205, 210, 212, 235, 239, 267, 268, 297]), np.array([107, 116, 120, 128, 130, 136, 137, 144, 152, 155, 157, 166, 169,
       170, 184, 186, 192, 218, 220, 226, 228, 241, 245, 246, 247, 251,
       252, 253]), np.array([ 99, 114, 118, 121, 131, 134, 158, 216, 219, 221, 224, 231, 233,
       234, 243, 244]), np.array([ 34,  37,  38,  48,  56,  78,  84, 100, 108, 117, 122, 123, 132,
       149, 151, 153, 163, 178, 180, 191, 199, 202, 208, 211]), np.array([ 31,  40,  41,  45,  51,  53,  57,  60,  61,  66,  67,  69,  71,
        75,  85,  90,  95,  96, 167, 173, 174, 176, 188, 190, 197, 206]), np.array([  0,   1,   2,   3,   6,  11,  12,  13,  17,  25,  33,  36,  47,
        58,  64,  76,  87,  94, 160, 165, 172, 175, 187, 193, 201, 203]), np.array([  4,  16,  18,  19, 109, 113, 115, 124, 138, 142, 145, 150]), np.array([103, 105, 106, 112, 125, 135, 139, 140, 141, 146, 147, 154]), np.array([102, 104, 110, 119, 126, 129, 133, 148]), np.array([29, 32, 42, 43, 55, 63, 72, 77, 79, 83, 91, 92]), np.array([35, 49, 59, 73, 74, 81, 86, 88, 89, 97, 98]), np.array([30, 39, 44, 46, 50, 52, 54, 62, 65, 68, 80, 82, 93]), np.array([ 8, 10, 15, 27, 70]), np.array([ 5,  7,  9, 14, 20, 21, 22, 23, 24, 26, 28])]

func(di, rang_indx)

Answer 1:


Approach #1: Here's one idea that keeps the work done inside the loop to a minimum and uses a single loop -

  1. Create a 2D random array with values in the interval [0,1), with one row per sub-array and as many columns as the longest sub-array.
  2. For each row, set the invalid places (the columns past the end of that sub-array) to 1.0, then get the argsort of each row. The 1s at the invalid places stay at the back because the original random array contains no 1s. This gives the array of indices. (The code uses np.argpartition with lens-1 in place of a full argsort.)
  3. Slice each row of that indices array down to the length listed in di, capped at the length of the sub-array.
  4. Loop over the sub-arrays in rang_indx and index each one with its sliced indices.

Hence, the implementation -

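# Length of every sub-array, and the number of elements to keep from each
# (never more than the sub-array actually holds).
lens = np.array([len(i) for i in rang_indx])
di0 = np.minimum(lens, di.astype(int))
# Mark the padding columns that lie past the end of each sub-array.
invalid_mask = lens[:,None] <= np.arange(lens.max())
# One row of random keys per sub-array; padding gets 1.0, which is larger
# than any value drawn from [0, 1).
rand_nums = np.random.rand(len(lens), lens.max())
rand_nums[invalid_mask] = 1
# Partition each row so that its valid column indices come first.
shuffled_indx = np.argpartition(rand_nums, lens-1, axis=1)

out = []
for i,all_idx in enumerate(shuffled_indx):
    if lens[i]==0:
        out.append(np.array([]))
    else:
        slice_idx = all_idx[:di0[i]]
        out.append(rang_indx[i][slice_idx])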
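
The numbered steps describe a full argsort, while the implementation above uses np.argpartition. As a point of comparison, here is a minimal sketch of the argsort variant wrapped in a function and run on made-up toy data. The function name sample_with_argsort and the toy inputs are assumptions for illustration, not part of the original answer, and a full sort is likely slower than the partition.

```python
import numpy as np


def sample_with_argsort(di, rang_indx):
    """Hypothetical helper: the four steps above, using a full argsort."""
    lens = np.array([len(a) for a in rang_indx])
    # Never request more elements than a sub-array holds.
    di0 = np.minimum(lens, np.asarray(di).astype(int))
    max_len = int(lens.max())
    # Step 1: one row of random keys in [0, 1) per sub-array.
    rand_nums = np.random.rand(len(lens), max_len)
    # Step 2: padding columns get 1.0, so argsort pushes them to the back.
    rand_nums[lens[:, None] <= np.arange(max_len)] = 1.0
    order = np.argsort(rand_nums, axis=1)
    # Steps 3 and 4: keep the first di0[i] positions of row i and use them
    # to index the matching sub-array.
    return [a[order[i, :k]] for i, (a, k) in enumerate(zip(rang_indx, di0))]


# Toy inputs (made up): the third request exceeds the sub-array length
# and the last sub-array is empty.
di_toy = np.array([2., 0., 5., 3.])
rang_indx_toy = [np.array([10, 11, 12, 13]), np.array([20, 21]),
                 np.array([30, 31, 32]), np.array([])]

out_toy = sample_with_argsort(di_toy, rang_indx_toy)
print([len(o) for o in out_toy])  # [2, 0, 3, 0]
```

Because the padding keys are strictly larger than every valid key, the first lens[i] entries of each sorted row always point at real elements, so no index can fall outside its sub-array.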

Approach #2: Another way, which does most of the setup work efficiently inside the loop -

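# Sub-array lengths, and how many elements to keep from each one.
lens = np.array([len(i) for i in rang_indx])
di0 = np.minimum(lens, di.astype(int))
out = []
for i in range(len(lens)):
    if lens[i]==0:
        out.append(np.array([]))
    else:
        k = di0[i]
        # The positions of the k smallest random keys form a random
        # selection of k positions in the sub-array.
        slice_idx = np.argpartition(np.random.rand(lens[i]), k-1)[:k]
        out.append(rang_indx[i][slice_idx])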
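
Since the question is about speed, it may help to time the candidates on data shaped like your own. The following is a rough sketch of such a comparison, not a benchmark result: the name func_argpartition, the made-up inputs and the repeat count are all assumptions, and the numbers will depend on the machine and on the sub-array sizes.

```python
import timeit

import numpy as np


def func(di, rang_indx):
    # The question's original approach: shuffle in place, then slice.
    for sub in rang_indx:
        np.random.shuffle(sub)
    return [sub[:k] for sub, k in zip(rang_indx, di.astype(int))]


def func_argpartition(di, rang_indx):
    # Approach #2 wrapped in a function (the name is hypothetical).
    lens = np.array([len(sub) for sub in rang_indx])
    di0 = np.minimum(lens, di.astype(int))
    out = []
    for i in range(len(lens)):
        if lens[i] == 0:
            out.append(np.array([]))
        else:
            k = di0[i]
            slice_idx = np.argpartition(np.random.rand(lens[i]), k - 1)[:k]
            out.append(rang_indx[i][slice_idx])
    return out


# Made-up inputs: 18 non-empty sub-arrays of unique integers.
rng = np.random.RandomState(0)
lens_demo = rng.randint(1, 30, size=18)
rang_indx_demo = [np.arange(n) + 100 * j for j, n in enumerate(lens_demo)]
di_demo = rng.randint(0, 30, size=18).astype(float)

n_calls = 2000
for f in (func, func_argpartition):
    seconds = timeit.timeit(lambda: f(di_demo, rang_indx_demo), number=n_calls)
    print(f"{f.__name__}: {seconds / n_calls * 1e6:.1f} us per call")
```

Note that func shuffles the sub-arrays in place, whereas the argpartition version leaves rang_indx untouched and returns copies.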


Source: https://stackoverflow.com/questions/46078995/speed-up-sub-array-shuffling-and-storing
