Grouping rows from an R dataframe together when randomly assigning to training/testing datasets

Submitted by 只谈情不闲聊 on 2019-12-11 10:34:07

Question


I have a dataframe that consists of blocks of X rows, each block corresponding to a single individual (X can differ between individuals). I'd like to randomly distribute these individuals into train, test and validation samples, but so far I haven't been able to get the syntax right to ensure that all of a user's X rows always end up in the same subsample.

For example, the data can be simplified to look like:

user    feature1     feature2
 1        "A"           "B"
 1        "L"           "L"
 1        "Q"           "B"
 1        "D"           "M"
 1        "D"           "M"
 1        "P"           "E"
 2        "A"           "B"
 2        "R"           "P"
 2        "A"           "F"
 3        "X"           "U"
...       ...           ...

If I then randomly assigned the users to a train, test or validation set, all of the rows for a given user (the user number is unique) would be in the same set and grouped together, so that if user 1 was in the training set, for example, the format would still be:

user    feature1     feature2
 1        "A"           "B"
 1        "L"           "L"
 1        "Q"           "B"
 1        "D"           "M"
 1        "D"           "M"
 1        "P"           "E"

As a bonus, I'd love to know whether the solution could be extended to do k-fold cross-validation, but so far I haven't even figured out this simpler first step.

Thanks in advance.


Answer 1:


We can first create an index that assigns each user to a set. I chose train: 60%, test: 30%, validation: 10%, but you can choose whatever ratio you need with the prob= argument of sample. Because each user is drawn independently, the realized proportions will only be approximately these. Then we split the data frame by user. Lastly, we rbind the users together according to the index we created. We can then call all_dfs[['train']] and so on:

# One label per unique user: 1 = train, 2 = test, 3 = validation
indx <- sample(1:3, length(unique(df$user)), replace=TRUE, prob=c(.6,.3,.1))
# List with one data frame per user
s <- split(df, df$user)
# Bind together the users that drew each label
all_dfs <- lapply(1:3, function(x) do.call(rbind, s[indx==x]))
names(all_dfs) <- c('train', 'test', 'validation')
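
For the k-fold part of the question, the same idea of assigning labels per user rather than per row should carry over. The following is only a sketch under assumptions: the toy df, the choice k = 3, and the names fold_of_user and fold are made up for illustration and are not from the question. Fold labels are shuffled across users and then looked up for every row, so a user's rows can never straddle two folds.

```r
# Toy data in the shape described in the question (values are hypothetical).
df <- data.frame(
  user     = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 6),
  feature1 = c("A", "L", "Q", "A", "R", "X", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

set.seed(42)                 # only so the sketch is reproducible
k <- 3                       # number of folds (an assumed value)
users <- unique(df$user)

# Repeat the labels 1..k until there is one per user, then shuffle them,
# so fold sizes differ by at most one user.
fold_of_user <- setNames(sample(rep_len(1:k, length(users))), users)

# Look up each row's fold through its user id.
df$fold <- unname(fold_of_user[as.character(df$user)])

for (i in 1:k) {
  train <- df[df$fold != i, ]   # all users outside fold i
  test  <- df[df$fold == i, ]   # all users in fold i
  # fit a model on train and evaluate it on test here
}
```

Note that the folds are balanced by number of users, not by number of rows, so if some users have many more rows than others the folds can still differ quite a bit in size.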


Source: https://stackoverflow.com/questions/33857248/grouping-rows-from-an-r-dataframe-together-when-randomly-assigning-to-training-t
