How can I shuffle a multidimensional array by row only in Python (that is, without shuffling the columns)?
I am looking for the most efficient solution, because my ma
You can also use np.random.permutation to generate a random permutation of the row indices and then index into the rows of X using np.take with axis=0. Also, np.take can write its result back into the input array X itself through the out= option, so no second result array has to be kept around. Thus, the implementation would look like this -
np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
Sample run -
In [23]: X
Out[23]:
array([[ 0.60511059, 0.75001599],
[ 0.30968339, 0.09162172],
[ 0.14673218, 0.09089028],
[ 0.31663128, 0.10000309],
[ 0.0957233 , 0.96210485],
[ 0.56843186, 0.36654023]])
In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);
In [25]: X
Out[25]:
array([[ 0.14673218, 0.09089028],
[ 0.31663128, 0.10000309],
[ 0.30968339, 0.09162172],
[ 0.56843186, 0.36654023],
[ 0.0957233 , 0.96210485],
[ 0.60511059, 0.75001599]])
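As a minimal sketch of the np.take approach above, the snippet below wraps it in a helper and keeps the permutation so the result can be checked. The helper name shuffle_rows_take and the seeded np.random.default_rng generator are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np


def shuffle_rows_take(X, rng):
    """Shuffle the rows of X in place via np.take (hypothetical helper)."""
    perm = rng.permutation(X.shape[0])  # random order of the row indices
    # With the default mode='raise', NumPy buffers the output, so writing
    # the result back into X itself is safe.
    np.take(X, perm, axis=0, out=X)
    return perm


rng = np.random.default_rng(0)  # seed chosen only for reproducibility
X = np.arange(12, dtype=float).reshape(6, 2)
original = X.copy()

perm = shuffle_rows_take(X, rng)

# Every row of X is now a complete row of the original, in a new order.
print(np.array_equal(X, original[perm]))
```

Returning the permutation is a design choice for this sketch: it allows the same reordering to be applied to a second array of the same length if needed.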
Additional performance boost
Here's a trick to speed up np.random.permutation(X.shape[0]) with np.argsort() -
np.random.rand(X.shape[0]).argsort()
Speedup results -
In [32]: X = np.random.random((6000, 2000))
In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop
In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop
Thus, the shuffling solution could be modified to -
np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
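To see why the argsort trick works, here is a small sketch: sorting independent uniform draws and keeping only the sort order gives a random ordering of the indices 0..n-1. The variable names and the seeded generator are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n_rows = 8

# The argsort of n independent uniform draws is a random permutation
# of the indices 0..n-1.
perm = rng.random(n_rows).argsort()

X = np.arange(n_rows * 3).reshape(n_rows, 3)
reference = X.copy()

# Use the permutation exactly like the output of np.random.permutation.
np.take(X, perm, axis=0, out=X)
```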
Runtime tests -
These tests include the two approaches listed in this post and the np.random.shuffle based one in @Kasramvd's solution.
In [40]: X = np.random.random((6000, 2000))
In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop
In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop
In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop
So, it seems the np.take based approaches are worth using only if memory is a concern; otherwise the np.random.shuffle based solution looks like the way to go.
That's what numpy.random.shuffle() is for:
>>> X = np.random.random((6, 2))
>>> X
array([[ 0.9818058 , 0.67513579],
[ 0.82312674, 0.82768118],
[ 0.29468324, 0.59305925],
[ 0.25731731, 0.16676408],
[ 0.27402974, 0.55215778],
[ 0.44323485, 0.78779887]])
>>> np.random.shuffle(X)
>>> X
array([[ 0.9818058 , 0.67513579],
[ 0.44323485, 0.78779887],
[ 0.82312674, 0.82768118],
[ 0.29468324, 0.59305925],
[ 0.25731731, 0.16676408],
[ 0.27402974, 0.55215778]])
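The same in-place row shuffle can be sketched with NumPy's newer Generator API. The seeded generator and the small integer array are assumptions chosen so that the result is easy to inspect; the original answer uses the legacy np.random.shuffle function.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
X = np.arange(12).reshape(6, 2)  # row i is [2*i, 2*i + 1]
before = X.copy()

# Generator.shuffle reorders the sub-arrays along the first axis in place;
# the contents of each row are left untouched.
rng.shuffle(X)

print(X)
```

Because each row starts out as a pair of consecutive integers, it is easy to confirm by eye that rows moved as whole units.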
After a bit of experimenting, I found that the most memory- and time-efficient way to shuffle the rows of an nd-array is to shuffle the row indices and then fetch the data through the shuffled index:
rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])
np.random.shuffle(perm)
rand_num2 = rand_num2[perm]
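One property of this index-based approach is worth a short sketch: indexing with an integer array builds a new array, so the original data stays unchanged until the name is rebound. The variable names and the seeded generator below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
data = np.arange(10).reshape(5, 2)

perm = rng.permutation(data.shape[0])  # shuffled row indices
shuffled = data[perm]                  # integer-array indexing allocates a new array

# The original array is untouched and shares no memory with the result.
print(np.shares_memory(shuffled, data))
```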
In more detail:
Here, I use memory_profiler to measure memory usage and Python's built-in time module to record the time, comparing all the previous answers:
import time

import numpy as np
from memory_profiler import profile  # third-party package


@profile
def main():
    # Shuffle the data itself, in place
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format((time.time() - start)))

    # Shuffle the index and get the data from the shuffled index
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format((time.time() - start)))

    # Use np.take() with a random permutation, writing back into the input
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format((time.time() - start)))


if __name__ == '__main__':
    main()
Result for Time
Time for direct shuffle: 0.03345608711242676 # 33.4msec
Time for shuffling index: 0.019818782806396484 # 19.8msec
Time taken by np.take, 0.06726956367492676 # 67.2msec
Memory profiler result
Line #    Mem usage    Increment   Line Contents
================================================
    39  117.422 MiB    0.000 MiB   @profile
    40                             def main():
    41                                 # shuffle data itself
    42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43  208.977 MiB    0.000 MiB       start = time.time()
    44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46
    47                                 # Shuffle index and get data from shuffled index
    48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49  300.531 MiB    0.000 MiB       start = time.time()
    50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54
    55                                 # using np.take()
    56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57  392.094 MiB    0.000 MiB       start = time.time()
    58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))
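Single time.time() measurements can be noisy, so a comparison like the one above could also be sketched with the standard timeit module, which averages several calls. The helper name time_shuffles, the smaller array shape, and the repeat count are assumptions for illustration; absolute numbers will differ by machine and NumPy version.

```python
import timeit

import numpy as np


def time_shuffles(shape=(600, 200), number=5):
    """Return average seconds per call for each approach (hypothetical helper)."""
    X = np.random.random(shape)

    def direct():
        # Shuffle the rows of X in place.
        np.random.shuffle(X)

    def by_index():
        # Build a shuffled copy through a permuted row index.
        return X[np.random.permutation(X.shape[0])]

    def by_take():
        # Permute rows with np.take, writing back into X.
        np.take(X, np.random.permutation(X.shape[0]), axis=0, out=X)

    return {
        func.__name__: timeit.timeit(func, number=number) / number
        for func in (direct, by_index, by_take)
    }


timings = time_shuffles()
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```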
I tried many solutions, and in the end I used this simple one:
import numpy as np
from sklearn.utils import shuffle
x = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(shuffle(x, random_state=0))
Output:
[[5 6]
 [3 4]
 [1 2]]
If you have a 3D array, loop over the first axis (axis=0) and apply this function to each 2D slice, like:
np.array([shuffle(item) for item in array_3d])
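The per-slice loop for a 3D array can also be sketched with NumPy alone. Note the swap: this uses Generator.permutation in place of sklearn.utils.shuffle, and the names array_3d and shuffled_3d plus the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
array_3d = np.arange(24).reshape(2, 4, 3)  # 2 slices, each 4 rows x 3 columns

# Generator.permutation returns a copy shuffled along the first axis,
# so each 2D slice gets its own independent row order.
shuffled_3d = np.stack([rng.permutation(item) for item in array_3d])

print(shuffled_3d.shape)
```

Each slice keeps exactly the rows it started with; only their order inside the slice changes.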
I have a question on this (or maybe it is the answer). Let's say we have a numpy array X with shape=(1000,60,11,1). Also suppose that X is an array of images of size 60x11 with a single channel (60x11x1).
What if I want to shuffle the order of all these images? To do that, I'll shuffle the indices of X.
def shuffling(X):
    indx = np.arange(len(X))  # create an array with the indices of the X data
    np.random.shuffle(indx)
    X = X[indx]
    return X
Will that work? Note that len(X) returns the size of the first axis (X.shape[0], here 1000), which is not necessarily the largest dimension.
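A quick way to check this is a small sketch on a stand-in array with the same number of axes. The reduced shape (5, 3, 2, 1) and the seeded generator are assumptions for illustration; the idea carries over to (1000, 60, 11, 1).

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
# Small stand-in for the (1000, 60, 11, 1) image array from the question.
X = np.arange(5 * 3 * 2 * 1).reshape(5, 3, 2, 1)

indx = np.arange(len(X))  # len(X) equals X.shape[0], the number of images
rng.shuffle(indx)
X_shuffled = X[indx]      # reorders along the first axis only

# Each image is moved as a whole block; its pixels are not rearranged.
print(X_shuffled.shape)
```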
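You can shuffle the elements within each row of a two-dimensional array A using the np.vectorize() function with a signature. Note that this permutes the values inside every row independently; it does not reorder the rows themselves:
shuffle = np.vectorize(np.random.permutation, signature='(n)->(n)')
A_shuffled = shuffle(A)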
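The difference between permuting inside each row and reordering whole rows can be sketched on a small array. The names shuffle_within_rows, A_within and A_rows are assumptions for illustration.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)  # every row is sorted in ascending order

# Apply np.random.permutation to each 1-D row separately:
# values stay in their row, but their positions inside the row change.
shuffle_within_rows = np.vectorize(np.random.permutation, signature='(n)->(n)')
A_within = shuffle_within_rows(A)

# Reorder whole rows instead: each row keeps its internal order.
A_rows = A[np.random.permutation(A.shape[0])]

print(A_within)
print(A_rows)
```

If the goal is the one in the question (rows move as units, columns stay aligned), the second form is the one that matches.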