问题
I have a data set with two outcome variables, case1 and case2. Case1 has 4 levels, while case2 has 50 (levels in case2 could increase later). I would like to create data partition for train and test keeping the ratio in both cases. The real data is imbalanced for both case1 and case2. As an example,
library(caret)
set.seed(123)
matris=matrix(rnorm(10),1000,20)
case1 <- as.factor(ceiling(runif(1000, 0, 4)))
case2 <- as.factor(ceiling(runif(1000, 0, 50)))
df <- as.data.frame(matris)
df$case1 <- case1
df$case2 <- case2
split1 <- createDataPartition(df$case1, p=0.2)[[1]]
train1 <- df[-split1,]
test1 <- df[split1,]
length(split1)
201
split2 <- createDataPartition(df$case2, p=0.2)[[1]]
train2 <- df[-split2,]
test2 <- df[split2,]
length(split2)
220
If I do separate splitting, I get different length for the data frame. If I do one splitting based on case2 (one with more classes), I lose the ratio of classes for case1.
I will be predicting the two cases separately, but at the end my accuracy will be given by having the exact match for both cases (e.g., ix = which(pred1 == case1 & pred2 == case2), so I need the arrays to be the same size.
Is there a smart way to do this?
Thank you!
回答1:
If I understand correctly (which I do not guarantee) I can offer the following approach:
Group by case1 and case2 and get the group indices
library(tidyverse)
df %>%
select(case1, case2) %>%
group_by(case1, case2) %>%
group_indices() -> indeces
use these indeces as the outcome variable in create data partition:
split1 <- createDataPartition(as.factor(indeces), p=0.2)[[1]]
check if satisfactory:
table(df[split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
5 6 5 8 5 5 6 6 4 6 6 6 6 6 5 5 5 4 4 7 5 6 5 6 7 5 5 8 6 7 6 6 7
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
4 5 6 6 6 5 5 6 5 6 6 5 4 5 6 4 6
table(df[-split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
15 19 13 18 12 13 16 15 8 13 13 15 21 14 11 13 12 9 12 20 17 15 16 19 16 11 14 21 13 20 18 13 16
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
9 6 12 19 14 10 16 19 17 17 16 14 4 15 14 9 19
table(df[split1,21])
#output
1 2 3 4
71 70 71 67
table(df[-split1,21])
1 2 3 4
176 193 174 178
来源:https://stackoverflow.com/questions/48499271/create-partition-based-in-two-variables