preprocess within cross-validation in caret

Submitted by ╄→尐↘猪︶ㄣ on 2020-05-13 06:22:13

Question


I have a question about data preprocessing that I would like clarified. As I understand it, when we tune hyperparameters and estimate model performance via cross-validation, we should preprocess within each cross-validation iteration rather than preprocess the whole dataset up front. In other words, in each iteration we estimate the preprocessing parameters on the training folds, then use those same parameters to transform the test fold before making predictions.

In the example code below, when I specify preProcess within caret::train, does it do that automatically? I would really appreciate it if someone could clarify this.

In some online sources, people preprocess the whole dataset (the training set) first and then use the preprocessed data to tune hyperparameters via cross-validation. That does not seem right to me.

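library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

# 5-fold cross-validation; keep 4 principal components when PCA is applied
control <- trainControl(method = "cv",
                        number = 5,
                        preProcOptions = list(pcaComp = 4))
# Candidate values for the random forest's mtry parameter
grid <- expand.grid(mtry = c(1, 2, 3))

model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = "rf",
               preProcess = c("scale", "center", "pca"),
               trControl = control,
               tuneGrid = grid)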

Answer 1:


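Your concern is well founded. There are many ways to introduce positive (optimistic) bias into a performance estimate, and preprocessing the whole dataset before resampling is one of them.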

According to Max Kuhn, the creator of caret, there is no data leakage when preProcess is specified in train:

All pre-processing is applied on the resampled version of the data (e.g. 90% in 10-fold CV) and then those calculations are applied to the holdouts (the remaining 10%) with no re-calculation.

source: https://github.com/topepo/caret/issues/335
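
To make the quoted behaviour concrete, here is a minimal sketch of the same idea done by hand with caret's preProcess() and predict(). It is only an illustration under stated assumptions, not caret's internal code: the seed, the variable names (x, y, folds, fold_results) and the choice to keep the transformed data in a list are all my own.

```r
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

set.seed(42)  # arbitrary seed, only so the fold assignment is repeatable

# Separate the predictors from the outcome.
x <- PimaIndiansDiabetes[, names(PimaIndiansDiabetes) != "diabetes"]
y <- PimaIndiansDiabetes$diabetes

# Five folds; each list element holds the row indices of one holdout fold.
folds <- createFolds(y, k = 5)

fold_results <- lapply(folds, function(holdout_idx) {
  x_train <- x[-holdout_idx, ]
  x_hold  <- x[holdout_idx, ]

  # Estimate centering, scaling and PCA loadings from the training rows only.
  pp <- preProcess(x_train, method = c("center", "scale", "pca"), pcaComp = 4)

  # Apply those fixed parameters to both parts; nothing is re-estimated
  # on the holdout rows.
  list(pp      = pp,
       train   = predict(pp, x_train),
       holdout = predict(pp, x_hold))
})
```

Each element of fold_results holds its own preProcess object, so the centering and scaling values differ slightly from fold to fold. The leaky alternative would call preProcess() once on all of x before splitting, which lets the holdout rows influence the transformation.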



Source: https://stackoverflow.com/questions/50295233/preprocess-within-cross-validation-in-caret
