I have a dataset of 550k items that I split into 500k for training and 50k for testing. During the training stage it is necessary to establish the 'best' combination of each algorithm…
This is possible by specifying the index, indexOut and indexFinal arguments to trainControl (a sketch mapping them onto your 500k/50k split is at the end of this answer).
Here is an example using the Sonar data set from mlbench library:
library(caret)
library(mlbench)
data(Sonar)
Let's say we want to draw half of the Sonar data set for training each time, and repeat that 10 times:
# 10 training index vectors, each containing half of the rows, sampled without replacement
train_inds <- replicate(10, sample(1:nrow(Sonar), size = nrow(Sonar)/2), simplify = FALSE)
If you are interested in a different sampling approach, please post the details; this is for illustration only.
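If you did want a stratified alternative, for instance, a minimal sketch using caret's createDataPartition (stratifying the half-splits on Class; the name train_inds_strat is just for illustration) would be:

# 10 index sets, each roughly half of the rows, preserving the M/R class proportions
train_inds_strat <- createDataPartition(Sonar$Class, times = 10, p = 0.5, list = TRUE)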
For testing we will use 10 random rows not in train_inds:
test_inds <- lapply(train_inds, function(x) {
  inds <- setdiff(1:nrow(Sonar), x)  # rows not used for training in this resample
  sample(inds, size = 10)            # take 10 of them as the hold-out set
})
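As a quick optional sanity check, you can confirm that no test row overlaps its corresponding training set:

# should return TRUE: every test set is disjoint from its training set
all(mapply(function(tr, te) length(intersect(tr, te)) == 0, train_inds, test_inds))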
Now just specify train_inds and test_inds in trainControl:
ctrl <- trainControl(
  method = "boot",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  index = train_inds,          # rows to fit on in each resample
  indexOut = test_inds,        # rows to evaluate on in each resample
  indexFinal = 1:nrow(Sonar),  # rows to fit the final model on
  summaryFunction = twoClassSummary
)
Here indexFinal is set to all rows, which matches caret's default behaviour; change it if you do not wish to fit the final model on all rows.
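For example, a sketch where the final model would be refit on only 150 randomly chosen rows (the 150 and the name ctrl_subset are arbitrary, purely for illustration):

ctrl_subset <- trainControl(
  method = "boot",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  index = train_inds,
  indexOut = test_inds,
  indexFinal = sample(1:nrow(Sonar), size = 150),  # final model fit on 150 rows only
  summaryFunction = twoClassSummary
)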
And fit:
model <- train(
  Class ~ .,
  data = Sonar,
  method = "rf",
  trControl = ctrl,
  metric = "ROC"
)
model
#output
Random Forest
208 samples, 208 used for final model
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Bootstrapped (10 reps)
Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.9104167 0.7750 0.8250000
31 0.9125000 0.7875 0.7916667
60 0.9083333 0.7875 0.8166667
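Since savePredictions = "final" was set, the held-out predictions made on the test_inds rows (for the selected mtry) are stored in the fitted object, along with the per-resample metrics:

head(model$pred)   # held-out predictions, 10 rows per resample here
model$resample     # ROC / Sens / Spec for each of the 10 resamples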
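To map this back onto your 550k data set: put the row numbers of your 500k training items into index and the row numbers of your 50k test items into indexOut, one list element per resample (here just one). A rough sketch, assuming your data sits in a data frame full_data with integer vectors train_rows (500k indices) and test_rows (50k indices) already defined (all three names are hypothetical):

ctrl_big <- trainControl(
  method = "cv",                       # the supplied indices define the resamples
  index = list(split1 = train_rows),   # each hyperparameter combination is fit on the 500k training rows
  indexOut = list(split1 = test_rows), # ...and scored on the 50k held-out rows
  indexFinal = train_rows,             # refit the final model on the training rows only
  classProbs = TRUE,
  savePredictions = "final",
  summaryFunction = twoClassSummary    # assumes a two-class outcome, as in the Sonar example
)

train() is then called exactly as in the Sonar example above, just with your own formula, data and method.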