Question
I have carefully read the caret documentation at http://caret.r-forge.r-project.org/training.html and the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still confused about the relationship between two arguments to trainControl:
method
index
and the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds and createMultiFolds)
To better frame my questions, let me use the following example from the documentation:
data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB,p = .8, times = 100)
trControl = trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)
My questions are:
1) If I use createDataPartition (which I assume does stratified bootstrapping), as in the above example, and I pass the result as index to trainControl, do I need to use LGOCV as the method in my call to trainControl? If I use another one (e.g. cv), what difference would it make? In my head, once you fix index, you are essentially choosing the type of cross-validation, so I am not sure what role method plays if you use index.
2) What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't? (See the short inspection sketch after question 3 below.)
3) How can I do stratified k-fold (e.g. 10-fold) cross-validation using caret? Would the following do it?
tmp <- createFolds(logBBB, k=10, list=TRUE, times = 100)
trControl = trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)
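For reference on question 2, here is a small inspection sketch (my own, using the same logBBB outcome) of what the two splitting functions return: as far as I understand, createDataPartition draws a stratified fraction p of the rows without replacement, while createResample draws bootstrap samples of the full data size with replacement.
library(caret)
data(BloodBrain)
set.seed(1)
# createDataPartition: stratified splits of ~80% of the rows, no duplicates
part <- createDataPartition(logBBB, p = 0.8, times = 3)
# createResample: bootstrap samples, same size as the data, drawn with replacement
boot <- createResample(logBBB, times = 3)
sapply(part, length)                          # ~0.8 * length(logBBB) per split
sapply(boot, length)                          # exactly length(logBBB) per sample
sapply(boot, function(i) any(duplicated(i)))  # TRUE: sampled with replacement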
Answer 1:
If you are not sure what role method plays when you use index, why not apply all the methods and compare the results? It is a blind way of comparing, but it can give you some intuition.
methods <- c('boot', 'boot632', 'cv',
'repeatedcv', 'LOOCV', 'LGOCV')
I create my index:
n <- 100
tmp <- createDataPartition(logBBB,p = .8, times = n)
I apply trainControl to my list of methods, and I remove index from each result since it is common to all of them.
ll <- lapply(methods, function(x)
  trainControl(method = x, index = tmp))
# drop the (identical) index element from each trainControl object;
# sapply then simplifies the remaining components into a matrix
ll <- sapply(ll, '[<-', 'index', NULL)
Hence my ll is:
[,1] [,2] [,3] [,4] [,5] [,6]
method "boot" "boot632" "cv" "repeatedcv" "LOOCV" "LGOCV"
number 25 25 10 10 25 25
repeats 25 25 1 1 25 25
verboseIter FALSE FALSE FALSE FALSE FALSE FALSE
returnData TRUE TRUE TRUE TRUE TRUE TRUE
returnResamp "final" "final" "final" "final" "final" "final"
savePredictions FALSE FALSE FALSE FALSE FALSE FALSE
p 0.75 0.75 0.75 0.75 0.75 0.75
classProbs FALSE FALSE FALSE FALSE FALSE FALSE
summaryFunction ? ? ? ? ? ?
selectionFunction "best" "best" "best" "best" "best" "best"
preProcOptions List,3 List,3 List,3 List,3 List,3 List,3
custom NULL NULL NULL NULL NULL NULL
timingSamps 0 0 0 0 0 0
predictionBounds Logical,2 Logical,2 Logical,2 Logical,2 Logical,2 Logical,2
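The table shows that only method, number and repeats differ; as far as I can tell, those counts are ignored once index is supplied, because the resamples are taken directly from index. As a hedged sanity check on question 1 (assuming the tmp, bbbDescr and logBBB from the question), one can fit the model with two different methods but the same index and compare the resampling indices stored in the fitted objects:
# once index is supplied, the same resamples should be used whatever method says
fit_lgocv <- train(bbbDescr, logBBB, "ctree",
                   trControl = trainControl(method = "LGOCV", index = tmp))
fit_cv    <- train(bbbDescr, logBBB, "ctree",
                   trControl = trainControl(method = "cv", index = tmp))
identical(fit_lgocv$control$index, fit_cv$control$index)   # expected TRUE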
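As for question 3, a sketch of how stratified k-fold CV is usually set up: createFolds stratifies on the outcome, but index expects the training rows of each resample, so pass returnTrain = TRUE (the default returns the held-out rows), or use createMultiFolds for repeated CV; note that createFolds itself has no times argument.
library(caret)
data(BloodBrain)
set.seed(1)
# training indices for 10 stratified folds
folds <- createFolds(logBBB, k = 10, list = TRUE, returnTrain = TRUE)
trControl <- trainControl(method = "cv", index = folds)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
# repeated stratified CV: createMultiFolds already returns training indices
set.seed(1)
rfolds <- createMultiFolds(logBBB, k = 10, times = 5)
trControl2 <- trainControl(method = "repeatedcv", index = rfolds)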
Source: https://stackoverflow.com/questions/14968874/caret-relationship-between-data-splitting-and-traincontrol