How to split/partition a dataset into training and test datasets for, e.g., cross validation?

醉话见心 2020-11-27 10:42

What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to MATLAB's cvpartition or crossvalind functions.

12 Answers
  •  没有蜡笔的小新
    2020-11-27 11:35

After doing some reading, and taking into account the (many!) different ways of splitting the data into train and test sets, I decided to time them!

I used 4 different methods (none of them uses the sklearn library, which I'm sure would give the best results, given that it is well-designed and well-tested code):

1. shuffle the whole matrix arr and then split the data into train and test
2. shuffle the indices and then index X and Y with them to split the data
3. same as method 2, but done in a more efficient, vectorized way
4. use a pandas DataFrame to split

Method 4 won by far with the shortest time; after that came method 1, while methods 2 and 3 turned out to be really inefficient.

    The code for the 4 different methods I timed:

    import numpy as np
    arr = np.random.rand(100, 3)
    X = arr[:,:2]
    Y = arr[:,2]
    spl = 0.7
    N = len(arr)
    sample = int(spl*N)
    
    #%% Method 1: shuffle the whole matrix arr and then split
    np.random.shuffle(arr)  # X and Y are views into arr, so their rows are shuffled along with it
    x_train, x_test, y_train, y_test = X[:sample, :], X[sample:, :], Y[:sample], Y[sample:]
    
    #%% Method 2: shuffle the indices and then index X and Y with them
    train_idx = np.random.choice(N, sample, replace=False)  # replace=False so no row is picked twice
    Xtrain = X[train_idx]
    Ytrain = Y[train_idx]
    
    test_idx = [idx for idx in range(N) if idx not in train_idx]  # slow O(N^2) membership test
    Xtest = X[test_idx]
    Ytest = Y[test_idx]
    
    #%% Method 3: shuffle the indices without a for loop
    idx = np.random.permutation(arr.shape[0])  # could also shuffle an arange in place with np.random.shuffle
    train_idx, test_idx = idx[:sample], idx[sample:]
    x_train, x_test, y_train, y_test = X[train_idx, :], X[test_idx, :], Y[train_idx], Y[test_idx]
    
    #%% Method 4: using a pandas DataFrame to split
    import pandas as pd
    df = pd.DataFrame(arr)  # I originally used pd.read_csv on some 3-column csv file; building df from arr keeps this runnable
    
    train = df.sample(frac=0.7, random_state=200)
    test = df.drop(train.index)
    

As for the times, the minimum time to execute out of 3 repetitions of 1000 loops each was:

    • Method 1: 0.35883826200006297 seconds
    • Method 2: 1.7157016959999964 seconds
    • Method 3: 1.7876616719995582 seconds
    • Method 4: 0.07562861499991413 seconds
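For reproducibility, the benchmark above can be approximated with `timeit.repeat`. This harness is my own sketch (the answer does not show its timing script) and times method 3 as an example:

```python
import timeit

import numpy as np

arr = np.random.rand(100, 3)
X, Y = arr[:, :2], arr[:, 2]
sample = int(0.7 * len(arr))

def method3():
    # shuffle the indices without a Python loop, then fancy-index X and Y
    idx = np.random.permutation(arr.shape[0])
    train_idx, test_idx = idx[:sample], idx[sample:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# minimum over 3 repetitions of 1000 loops, matching how the times above are reported
best = min(timeit.repeat(method3, repeat=3, number=1000))
print(f"Method 3: {best} seconds")
```

The exact numbers will of course vary by machine; what matters for the comparison is running every method through the same `repeat`/`number` settings.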

    I hope that's helpful!
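Since sklearn was mentioned at the top as the likely best option, here is a minimal sketch of the equivalent one-liner (assuming scikit-learn is installed; it was deliberately excluded from the timings above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

arr = np.random.rand(100, 3)
X, Y = arr[:, :2], arr[:, 2]

# train_size=0.7 matches the 70/30 split used by the four methods above
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, train_size=0.7, random_state=0
)
print(x_train.shape, x_test.shape)  # (70, 2) (30, 2)
```

It shuffles by default and keeps X and Y rows aligned, which is exactly what methods 1-3 do by hand.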
