I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomFore
Using formula interface in train converts factors to dummy. To compare results from caret with randomForest you should use the non-formula interface.
In your case, you should provide a seed inside trainControl to get the same result as in randomForest.
Section training in caret webpage, there are some notes on reproducibility where it explains how to use seeds.
library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry = 2,
importance = TRUE,
metric = "RMSE")
library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "oob", seed = 1),
allowParallel = FALSE)
If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.
In the following example, the last seed is for the final model and I set it to 1.
seeds <- as.vector(c(1:26), mode = "list")
# For the final model
seeds[[26]] <- 1
caret.boot.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "boot", seeds = seeds),
allowParallel = FALSE)
Definig correctly the non-formula interface with caret and seed in trainControl you will get the same results in all three models:
rf.model
caret.oob.model$final
caret.boot.model$final