Create partition based on two variables

旧街凉风 submitted on 2019-12-11 08:54:57

Question


I have a data set with two outcome variables, case1 and case2. case1 has 4 levels, while case2 has 50 (the number of levels in case2 could increase later). I would like to create a train/test data partition that preserves the class ratios of both variables. The real data is imbalanced for both case1 and case2. As an example:

library(caret)

set.seed(123)
matris=matrix(rnorm(10),1000,20)
case1 <- as.factor(ceiling(runif(1000, 0, 4)))
case2 <- as.factor(ceiling(runif(1000, 0, 50)))

df <- as.data.frame(matris)
df$case1 <- case1
df$case2 <- case2

split1 <- createDataPartition(df$case1, p=0.2)[[1]]
train1 <- df[-split1,]
test1 <- df[split1,]
length(split1)
# 201

split2 <- createDataPartition(df$case2, p=0.2)[[1]]
train2 <- df[-split2,]
test2 <- df[split2,]
length(split2)
# 220

If I split on each variable separately, I get test sets of different lengths. If I do a single split based on case2 (the one with more classes), I lose the class ratios for case1.

I will be predicting the two cases separately, but in the end my accuracy will be measured by an exact match on both cases (e.g., ix = which(pred1 == case1 & pred2 == case2)), so I need the arrays to be the same size.
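To make that exact-match criterion concrete, here is a minimal sketch in base R. The vectors truth1, truth2, pred1 and pred2 are made-up toy values standing in for the real labels and predictions; they are not taken from the data above.

```r
# Toy labels and predictions (hypothetical values, for illustration only)
truth1 <- c(1, 2, 3, 4, 1)
truth2 <- c(10, 20, 30, 40, 50)
pred1  <- c(1, 2, 3, 1, 1)
pred2  <- c(10, 20, 31, 40, 50)

# Rows where both predictions are correct
ix <- which(pred1 == truth1 & pred2 == truth2)

# Share of rows that match on both outcomes
joint_accuracy <- length(ix) / length(truth1)
```

The element-wise comparison only makes sense when all four vectors refer to the same rows in the same order, which is why a single shared test set is needed.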

Is there a smart way to do this?

Thank you!


Answer 1:


If I understand correctly (which I do not guarantee), I can offer the following approach:

Group by case1 and case2 and get the group indices:

library(tidyverse)

df %>%
  select(case1, case2) %>%
  group_by(case1, case2) %>%
  group_indices() -> indeces

Use these indices (the indeces vector) as the outcome variable in createDataPartition:

split1 <- createDataPartition(as.factor(indeces), p=0.2)[[1]]
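As an aside, the same combined outcome can be built in base R with interaction(), and the per-group sampling idea can be sketched without caret or tidyverse. The block below is only a rough stand-in, not what createDataPartition does internally: the toy data frame, the name test_idx, and the rule "about 20% of each group, at least one row" are all assumptions made for illustration.

```r
set.seed(123)

# Toy stand-in for df, holding only the two outcome columns
toy <- data.frame(
  case1 = factor(sample(1:4, 1000, replace = TRUE)),
  case2 = factor(sample(1:50, 1000, replace = TRUE))
)

# One stratum per observed (case1, case2) combination
strata <- interaction(toy$case1, toy$case2, drop = TRUE)

# Draw about 20% of the rows from every stratum (at least one row each)
rows_by_stratum <- split(seq_len(nrow(toy)), strata)
test_idx <- unlist(lapply(rows_by_stratum, function(rows) {
  k <- max(1, round(0.2 * length(rows)))
  rows[sample.int(length(rows), k)]
}), use.names = FALSE)

# A single index vector defines both sets, so they stay aligned
test  <- toy[test_idx, ]
train <- toy[-test_idx, ]
```

Since interaction() returns a factor, interaction(df$case1, df$case2, drop = TRUE) should also be usable as the outcome passed to createDataPartition in place of as.factor(indeces). With many small groups, the test set tends to come out larger than 20% of the rows, because every group contributes at least one row.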

Check whether the result is satisfactory:

table(df[split1,22])   # case2 in the test set
#output
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
 5  6  5  8  5  5  6  6  4  6  6  6  6  6  5  5  5  4  4  7  5  6  5  6  7  5  5  8  6  7  6  6  7 
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
 4  5  6  6  6  5  5  6  5  6  6  5  4  5  6  4  6

table(df[-split1,22])  # case2 in the training set
#output
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
15 19 13 18 12 13 16 15  8 13 13 15 21 14 11 13 12  9 12 20 17 15 16 19 16 11 14 21 13 20 18 13 16 
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
 9  6 12 19 14 10 16 19 17 17 16 14  4 15 14  9 19 

table(df[split1,21])   # case1 in the test set
#output
 1  2  3  4 
71 70 71 67 

table(df[-split1,21])  # case1 in the training set
#output
  1   2   3   4 
176 193 174 178 
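Raw counts are hard to compare between sets of different sizes, so it may help to look at class shares instead. The following is a small sketch in base R; compare_shares is a hypothetical helper written for this illustration, and the toy vector y and index vector idx are made-up values.

```r
# Hypothetical helper: class shares of one factor outcome in test vs. train
compare_shares <- function(y, test_idx) {
  shares <- function(v) as.vector(table(v)) / length(v)
  out <- rbind(test = shares(y[test_idx]), train = shares(y[-test_idx]))
  colnames(out) <- levels(y)
  out
}

# Toy usage: an imbalanced factor and a test index taking 20% of each class
y   <- factor(rep(1:4, times = c(40, 30, 20, 10)))
idx <- c(1:8, 41:46, 71:74, 91:92)
res <- compare_shares(y, idx)
```

With the objects from above, compare_shares(df$case1, split1) and compare_shares(df$case2, split1) would give the corresponding shares for each outcome. The helper assumes y is a factor, so that both rows cover the same set of levels.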


Source: https://stackoverflow.com/questions/48499271/create-partition-based-in-two-variables
