Numpy shuffle multidimensional array by row only, keep column order unchanged

不思量自难忘° 2020-12-05 12:52

How can I shuffle a multidimensional array by row only in Python (that is, without shuffling the columns)?

I am looking for the most efficient solution, because my ma

6 Answers
  • 2020-12-05 13:04

    You can also use np.random.permutation to generate a random permutation of the row indices and then index into the rows of X using np.take with axis=0. np.take also lets you write the result back into the input array X itself through its out= option, which can save memory. The implementation would look like this -

    np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
    

    Sample run -

    In [23]: X
    Out[23]: 
    array([[ 0.60511059,  0.75001599],
           [ 0.30968339,  0.09162172],
           [ 0.14673218,  0.09089028],
           [ 0.31663128,  0.10000309],
           [ 0.0957233 ,  0.96210485],
           [ 0.56843186,  0.36654023]])
    
    In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);
    
    In [25]: X
    Out[25]: 
    array([[ 0.14673218,  0.09089028],
           [ 0.31663128,  0.10000309],
           [ 0.30968339,  0.09162172],
           [ 0.56843186,  0.36654023],
           [ 0.0957233 ,  0.96210485],
           [ 0.60511059,  0.75001599]])
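A quick way to convince yourself that this only reorders rows is to compare the result against plain fancy indexing with the same permutation. The following is a minimal sketch on a small made-up array; the seed and the names `original` and `perm` are illustrative and not part of the answer above.

```python
import numpy as np

# Hypothetical small input; the seed only makes the input repeatable.
rng = np.random.default_rng(0)
X = rng.random((6, 2))
original = X.copy()

# Draw a permutation of the row indices and gather the rows in that order,
# writing the result back into X.
perm = np.random.permutation(X.shape[0])
np.take(X, perm, axis=0, out=X)

# Every row of the result is an intact row of the original array.
rows_intact = np.array_equal(X, original[perm])
# The set of rows is unchanged, only their order differs.
same_rows = sorted(map(tuple, X)) == sorted(map(tuple, original))

print(rows_intact, same_rows)
```

If either check printed False, columns would have been mixed or rows lost, which is not what this approach is expected to do.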
    

    Additional performance boost

    Here's a trick to speed up np.random.permutation(X.shape[0]) with np.argsort() -

    np.random.rand(X.shape[0]).argsort()
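The reason this works as a substitute: sorting independent uniform random numbers puts them in a random order, so the indices returned by argsort form a random permutation (ties between floats are vanishingly unlikely). A small sketch checking that the result really is a permutation; `n` and the variable names are arbitrary choices for illustration.

```python
import numpy as np

n = 10  # arbitrary length for the illustration

# Indices that would sort n uniform random numbers.
idx = np.random.rand(n).argsort()

# A valid permutation contains each of 0..n-1 exactly once.
is_permutation = np.array_equal(np.sort(idx), np.arange(n))

print(idx)
print(is_permutation)
```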
    

    Speedup results -

    In [32]: X = np.random.random((6000, 2000))
    
    In [33]: %timeit np.random.permutation(X.shape[0])
    1000 loops, best of 3: 510 µs per loop
    
    In [34]: %timeit np.random.rand(X.shape[0]).argsort()
    1000 loops, best of 3: 297 µs per loop
    

    Thus, the shuffling solution could be modified to -

    np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
    

    Runtime tests -

    These tests include the two approaches listed in this post and the np.random.shuffle based one from @Kasramvd's solution.

    In [40]: X = np.random.random((6000, 2000))
    
    In [41]: %timeit np.random.shuffle(X)
    10 loops, best of 3: 25.2 ms per loop
    
    In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
    10 loops, best of 3: 53.3 ms per loop
    
    In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
    10 loops, best of 3: 53.2 ms per loop
    

    So, the np.take based approaches seem worth using only if memory is a concern; otherwise, the np.random.shuffle based solution looks like the way to go.
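Outside IPython, a comparison like this can be repeated with the standard-library `timeit` module. This is only a sketch: the array is deliberately smaller than the (6000, 2000) one used above so that it finishes quickly, the `candidates` and `results` names are made up for the example, and the numbers will vary by machine and NumPy version.

```python
import timeit

import numpy as np

# Smaller than the array in the post so the script finishes quickly.
X = np.random.random((600, 200))

candidates = {
    "np.random.shuffle": lambda: np.random.shuffle(X),
    "np.take + permutation": lambda: np.take(
        X, np.random.permutation(X.shape[0]), axis=0, out=X
    ),
    "np.take + argsort": lambda: np.take(
        X, np.random.rand(X.shape[0]).argsort(), axis=0, out=X
    ),
}

results = {}
for name, func in candidates.items():
    # Best of 3 repeats of 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    results[name] = best * 1e3
    print(f"{name:>24}: {results[name]:.3f} ms per call")
```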

  • 2020-12-05 13:07

    That's what numpy.random.shuffle() is for:

    >>> X = np.random.random((6, 2))
    >>> X
    array([[ 0.9818058 ,  0.67513579],
           [ 0.82312674,  0.82768118],
           [ 0.29468324,  0.59305925],
           [ 0.25731731,  0.16676408],
           [ 0.27402974,  0.55215778],
           [ 0.44323485,  0.78779887]])
    
    >>> np.random.shuffle(X)
    >>> X
    array([[ 0.9818058 ,  0.67513579],
           [ 0.44323485,  0.78779887],
           [ 0.82312674,  0.82768118],
           [ 0.29468324,  0.59305925],
           [ 0.25731731,  0.16676408],
           [ 0.27402974,  0.55215778]])
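On NumPy 1.17 and later, the same row shuffle is also available through the newer Generator interface. A minimal sketch, assuming that interface is available in your installation; the seed and the variable names are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only so the run is repeatable
X = np.arange(12).reshape(6, 2)  # rows are [0 1], [2 3], ..., [10 11]

# permutation() returns a shuffled copy and leaves X untouched.
X_copy = rng.permutation(X)

# shuffle() reorders the rows of X in place (along the first axis by default).
rng.shuffle(X)

# In both results every row is still an intact (2*i, 2*i + 1) pair.
print(X)
print(X_copy)
```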
    
  • 2020-12-05 13:08

    After some experimenting, I found that the most memory- and time-efficient way to shuffle the rows of an nd-array is to shuffle an index array and then fetch the data using the shuffled index:

    rand_num2 = np.random.randint(5, size=(6000, 2000))
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
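One practical reason to shuffle through an index array is that the same permutation can be reused to keep a second array aligned with the first. A small sketch of that idea; the label array `y` is a hypothetical addition and does not appear in the answer above.

```python
import numpy as np

X = np.arange(12).reshape(6, 2)  # row i is [2*i, 2*i + 1]
y = np.arange(6)                 # hypothetical label for row i

perm = np.arange(X.shape[0])
np.random.shuffle(perm)

# Indexing with an integer array builds new, shuffled arrays.
X_shuffled = X[perm]
y_shuffled = y[perm]

# Rows and labels still match after shuffling.
print(np.array_equal(X_shuffled[:, 0] // 2, y_shuffled))
```

Note that `X[perm]` builds a new array, so the original and the shuffled copy exist side by side until the old one is released.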
    

    In more detail:
    Here, I am using memory_profiler to measure memory usage and Python's built-in "time" module to record the time, comparing all the previous answers.

    import time

    import numpy as np


    def main():
        # shuffle data itself
        rand_num = np.random.randint(5, size=(6000, 2000))
        start = time.time()
        np.random.shuffle(rand_num)
        print('Time for direct shuffle: {0}'.format((time.time() - start)))
    
        # Shuffle index and get data from shuffled index
        rand_num2 = np.random.randint(5, size=(6000, 2000))
        start = time.time()
        perm = np.arange(rand_num2.shape[0])
        np.random.shuffle(perm)
        rand_num2 = rand_num2[perm]
        print('Time for shuffling index: {0}'.format((time.time() - start)))
    
        # using np.take()
        rand_num3 = np.random.randint(5, size=(6000, 2000))
        start = time.time()
        np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
        print("Time taken by np.take, {0}".format((time.time() - start)))
    

    Timing results

    Time for direct shuffle: 0.03345608711242676   # 33.5 ms
    Time for shuffling index: 0.019818782806396484 # 19.8 ms
    Time taken by np.take, 0.06726956367492676     # 67.3 ms
    

    Memory profiler results

    Line #    Mem usage    Increment   Line Contents
    ================================================
        39  117.422 MiB    0.000 MiB   @profile
        40                             def main():
        41                                 # shuffle data itself
        42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
        43  208.977 MiB    0.000 MiB       start = time.time()
        44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
        45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
        46                             
        47                                 # Shuffle index and get data from shuffled index
        48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
        49  300.531 MiB    0.000 MiB       start = time.time()
        50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
        51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
        52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
        53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
        54                             
        55                                 # using np.take()
        56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
        57  392.094 MiB    0.000 MiB       start = time.time()
        58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
        59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))
    
  • 2020-12-05 13:18

    I tried many solutions, and in the end I used this simple one:

    import numpy as np
    from sklearn.utils import shuffle
    x = np.array([[1, 2],
                  [3, 4],
                  [5, 6]])
    print(shuffle(x, random_state=0))
    

    output:

    [[5 6]
     [3 4]
     [1 2]]
    

    If you have a 3D array, loop over the first axis (axis=0) and apply this function to each 2D slice, like:

    np.array([shuffle(item) for item in array_3d])
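Note what such a loop does: each 2-D slice along the first axis has its own rows shuffled independently, while the order of the slices stays the same. The sketch below illustrates this with NumPy only, using `Generator.permutation` as a stand-in for `sklearn.utils.shuffle` (an assumption made to keep the example dependency-free); `array_3d` is a made-up input.

```python
import numpy as np

rng = np.random.default_rng(0)
array_3d = np.arange(24).reshape(2, 4, 3)  # 2 slices, each with 4 rows of 3 values

# Shuffle the rows inside each slice separately; slice order is unchanged.
per_slice = np.stack([rng.permutation(item) for item in array_3d])

# Alternatively, shuffle the slices themselves; rows inside each slice are untouched.
whole = rng.permutation(array_3d)

print(per_slice)
print(whole)
```

Which of the two is wanted depends on whether the first axis indexes independent samples or groups whose internal order should be randomized.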
    
  • 2020-12-05 13:24

    I have a question on this (or maybe it is the answer). Let's say we have a NumPy array X with shape=(1000,60,11,1). Also suppose that X is an array of images of size 60x11 with 1 channel (60x11x1).

    What if I want to shuffle the order of all these images, and to do that I shuffle the indices of X?

    def shuffling(X):
        indx = np.arange(len(X))  # create an array of indices along the first axis of X
        np.random.shuffle(indx)
        X = X[indx]
        return X
    

    Will that work? As far as I know, len(X) returns the size of the first dimension (X.shape[0], which is 1000 here).
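For what it is worth, `len(X)` is the size of the first axis, the same as `X.shape[0]`, so a function like this reorders whole images. A small check of that, using a made-up array with 10 images instead of 1000 to keep it quick; the `source` variable is only there for verification.

```python
import numpy as np


def shuffling(X):
    indx = np.arange(len(X))  # indices along the first axis only
    np.random.shuffle(indx)
    return X[indx]


# Image i is filled with values starting at i * 660, so it can be identified later.
X = np.arange(10 * 60 * 11).reshape(10, 60, 11, 1)
shuffled = shuffling(X)

# Recover which original image ended up at each position.
source = shuffled[:, 0, 0, 0] // 660

print(len(X) == X.shape[0])
print(np.array_equal(shuffled, X[source]))
```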

  • 2020-12-05 13:26

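    You can shuffle the elements within each row of a two-dimensional array A independently using the np.vectorize() function (note that this permutes the entries inside every row; it does not reorder the rows themselves):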

    shuffle = np.vectorize(np.random.permutation, signature='(n)->(n)')
    
    A_shuffled = shuffle(A)
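To make the behaviour of the vectorized version concrete, the sketch below contrasts it with a plain row shuffle on a small made-up array. The names `shuffle_within_rows`, `within` and `reordered` are illustrative only.

```python
import numpy as np

A = np.arange(12).reshape(4, 3)  # rows are [0 1 2], [3 4 5], [6 7 8], [9 10 11]

# Applies np.random.permutation to every 1-D row separately.
shuffle_within_rows = np.vectorize(np.random.permutation, signature='(n)->(n)')
within = shuffle_within_rows(A)

# Reorders the rows as whole units and keeps the column order.
reordered = np.random.permutation(A)

# Rows stay in place but their entries may be mixed:
print(within)
# Rows move, entries inside each row keep their order:
print(reordered)
```

For the question as asked (shuffle rows, keep columns unchanged), the second form is the one that matches.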
    