How does the createDataPartition function from the caret package split data?

Submitted by 拟墨画扇 on 2019-12-30 09:00:38

Question


From the documentation:

For bootstrap samples, simple random sampling is used.

For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits.

For numeric y, the sample is split into groups based on percentiles and sampling is done within these subgroups.

For createDataPartition, the number of percentiles is set via the groups argument.
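
The percentile grouping described above can be sketched in base R. This is only an illustration of the idea, not caret's actual source code; the outcome vector y, the value groups = 5, and the fraction p = 0.2 are made-up assumptions for the example.

```r
# Sketch only: stratified sampling on a numeric outcome using base R.
# It imitates the idea in the documentation; it is not caret's implementation.
set.seed(1)
y <- rnorm(1000)   # hypothetical numeric outcome
groups <- 5        # number of percentile-based groups (assumed value)
p <- 0.2           # fraction of rows to draw from each group (assumed value)

# Cut y at its quantiles so each bin holds roughly the same number of rows
breaks <- quantile(y, probs = seq(0, 1, length.out = groups + 1))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Draw a fraction p of the row indices inside every bin
idx <- unlist(lapply(split(seq_along(y), bins), function(i) {
  i[sample.int(length(i), size = ceiling(length(i) * p))]
}))

# Every bin contributes about the same number of rows to the sample
table(bins[idx])
```

Because each bin is sampled separately, the low, middle, and high ranges of y all appear in the sample in roughly their original proportions.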

I don't understand why this "balance" thing is needed. I think I understand it superficially, but any additional insight would be really helpful.


Answer 1:


It means that if you have a data set ds with 10,000 rows

library(caret)  # provides createDataPartition()
set.seed(42)
ds <- data.frame(values = runif(10000))

and two "classes" with unequal counts (9000 vs. 1000)

ds$class <- c(rep(1, 9000), rep(2, 1000))
ds$class <- as.factor(ds$class)
table(ds$class)
#    1    2 
# 9000 1000 

you can create a sample that tries to maintain the ratio ("balance") of the factor classes:

dpart <- createDataPartition(ds$class, p = 0.1, list = FALSE)
dsDP <- ds[dpart, ]
table(dsDP$class)
#   1   2 
# 900 100 
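
To see why this matters, it may help to compare with simple random sampling, where the number of rows from the rare class varies from draw to draw. With a small or very unbalanced data set, that variation can leave a split with too few rows of the rare class. The following is a rough base R sketch of the two approaches (it does not call caret, and the name cls is just a stand-in for ds$class):

```r
# Sketch only: plain random sample vs. per-class sample (base R, no caret).
set.seed(42)
cls <- factor(c(rep(1, 9000), rep(2, 1000)))

# Simple random sampling: the count of class 2 depends on the draw
srs <- sample(seq_along(cls), size = 1000)
table(cls[srs])

# Sampling 10% within each class: the 9:1 ratio is kept by construction
strat <- unlist(lapply(split(seq_along(cls), cls), function(i) {
  i[sample.int(length(i), size = round(length(i) * 0.1))]
}))
table(cls[strat])
```

The per-class version always returns 900 rows of class 1 and 100 rows of class 2, which is the behaviour createDataPartition aims for when y is a factor.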


Source: https://stackoverflow.com/questions/40709722/how-does-createdatapartition-function-from-caret-package-split-data
