Matlab split into train/valid/test set and keep proportion

Submitted by ♀尐吖头ヾ on 2021-02-19 08:17:47

Question


I have a dataset with 12 feature columns plus 1 binary target column and about 4000 rows. I need to split it into training (70%), validation (20%) and test (10%) sets.

The dataset is heavily imbalanced (95% class 0, 5% class 1), so I need to preserve the target ratio in each subset.

I am able to split the dataset, but I have no idea how to preserve the ratio.

I am working with a subset of the Wine Quality data.


Answer 1:


If you have access to MATLAB's Statistics and Machine Learning Toolbox, you can use the cvpartition function.

From the MATLAB help on cvpartition:

c = cvpartition(group,'HoldOut',p) randomly partitions observations into a training set and a test set with stratification, using the class information in group; that is, both training and test sets have roughly the same class proportions as in group.

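You can apply the function twice to get three partitions: first hold out the test set, then split the remaining observations into training and validation sets. Because each call is stratified, all three partitions keep roughly the original class distribution.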
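As a rough sketch of the two-step idea for a 70/20/10 split: the first call holds out 10% of all rows for testing, and the second call holds out 2/9 of the remaining 90%, which is 20% of the total. The synthetic `winedataset` matrix, the seed and the variable names below are assumptions made for illustration and are not the asker's actual data.

```matlab
% Stand-in data (assumption): 4000 rows, 12 features, binary target in column 13,
% with 95% class 0 and 5% class 1.
rng(42);                                   % fixed seed so the split is repeatable
winedataset = [randn(4000, 12), [zeros(3800, 1); ones(200, 1)]];
y = winedataset(:, 13);

% Step 1: stratified hold-out of 10% of all rows for the test set.
c1 = cvpartition(y, 'HoldOut', 0.10);
test_set = winedataset(test(c1), :);
rest     = winedataset(training(c1), :);

% Step 2: stratified hold-out of 2/9 of the remaining 90% (20% of the total)
% for the validation set; what is left is the training set.
c2 = cvpartition(rest(:, 13), 'HoldOut', 2/9);
valid_set = rest(test(c2), :);
train_set = rest(training(c2), :);

% Share of class 1 in each subset; each should be close to 0.05.
fprintf('class-1 share: train %.3f, valid %.3f, test %.3f\n', ...
    mean(train_set(:, 13)), mean(valid_set(:, 13)), mean(test_set(:, 13)));
```

Holding out the test set first keeps it untouched by the second partition; the subset sizes are approximate because cvpartition rounds within each class.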




Answer 2:


This is what I have come up with so far; if anyone knows a better solution, let me know. I split the dataset by the target column, split each of the two class subsets into the first 70%, the next 20% and the last 10% of its rows, and then merged the corresponding pieces. Finally, I separated the features from the targets.

%split in 0/1 samples
winedataset_0 = winedataset(winedataset(:, 13) == 0, :);
winedataset_1 = winedataset(winedataset(:, 13) == 1, :);

%train
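% size(...,1) counts rows; length() returns the largest dimension and would
% give 13 if a class had fewer than 13 rows
split_tr_0 = round(size(winedataset_0,1)*0.7);
split_tr_1 = round(size(winedataset_1,1)*0.7);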
train_0 = winedataset_0(1:split_tr_0,:);
train_1 = winedataset_1(1:split_tr_1,:);
train_set = vertcat(train_0, train_1);
train_set = train_set(randperm(size(train_set,1)),:);

%valid
split_valid_0 = split_tr_0 + round(size(winedataset_0,1)*0.2);
split_valid_1 = split_tr_1 + round(size(winedataset_1,1)*0.2);
valid_0 = winedataset_0(split_tr_0+1:split_valid_0,:);
valid_1 = winedataset_1(split_tr_1+1:split_valid_1,:);
valid_set = vertcat(valid_0, valid_1);
valid_set = valid_set(randperm(size(valid_set,1)),:);

%test
test_0 = winedataset_0(split_valid_0+1:end,:);
test_1 = winedataset_1(split_valid_1+1:end,:);
test_set = vertcat(test_0, test_1);
test_set = test_set(randperm(size(test_set,1)),:);


%Split into X and y
X_train = train_set(:,1:12);
y_train = train_set(:,13);

X_valid = valid_set(:,1:12);
y_valid = valid_set(:,13);

X_test = test_set(:,1:12);
y_test = test_set(:,13);
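One thing to be aware of with the approach above is that each class is cut in its stored order, so the first 70% of rows always end up in the training set. If the rows are sorted in some way, shuffling within each class before cutting may be safer. A minimal sketch of that variant follows; the synthetic `winedataset`, the seed and the index variable names are assumptions for illustration only.

```matlab
% Stand-in data (assumption): 4000 rows, 12 features, binary target in column 13.
rng(0);
winedataset = [randn(4000, 12), [zeros(3800, 1); ones(200, 1)]];

fractions = [0.7 0.2];          % train and validation shares; the rest is test
train_idx = []; valid_idx = []; test_idx = [];

for cls = [0 1]
    idx = find(winedataset(:, 13) == cls);   % row indices of this class
    idx = idx(randperm(numel(idx)));         % shuffle within the class
    n_tr = round(numel(idx) * fractions(1)); % rows of this class for training
    n_va = round(numel(idx) * fractions(2)); % rows of this class for validation
    train_idx = [train_idx; idx(1:n_tr)];
    valid_idx = [valid_idx; idx(n_tr+1:n_tr+n_va)];
    test_idx  = [test_idx;  idx(n_tr+n_va+1:end)];
end

train_set = winedataset(train_idx, :);
valid_set = winedataset(valid_idx, :);
test_set  = winedataset(test_idx, :);

% Share of class 1 in each subset, as a quick check of the stratification.
fprintf('class-1 share: train %.3f, valid %.3f, test %.3f\n', ...
    mean(train_set(:, 13)), mean(valid_set(:, 13)), mean(test_set(:, 13)));
```

Within each resulting set the rows are still grouped by class, so a final randperm over the rows, as in the code above, is still worthwhile if the row order matters to the training procedure.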


Source: https://stackoverflow.com/questions/36674651/matlab-split-into-train-valid-test-set-and-keep-proportion
