I have a dataset of 550k items that I split into 500k for training and 50k for testing. During the training stage it is necessary to establish the 'best' combination of each algorithm…
This is possible by specifying the index, indexOut and indexFinal arguments to trainControl (a sketch mapping them onto your 500k/50k split is at the end of this answer).
Here is an example using the Sonar data set from mlbench library:
library(caret)
library(mlbench)
data(Sonar)
Let's say we want to draw half of the Sonar data set for training each time, and repeat that 10 times:
# 10 training index vectors, each containing half of the rows, sampled without replacement
train_inds <- replicate(10, sample(1:nrow(Sonar), size = nrow(Sonar)/2), simplify = FALSE)
If you are interested in a different sampling approach, please post the details; this is for illustration only.
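If you did want a stratified alternative, for instance, a minimal sketch using caret's createDataPartition (stratifying the half-splits on Class; the name train_inds_strat is just for illustration) would be:

# 10 index sets, each roughly half of the rows, preserving the M/R class proportions
train_inds_strat <- createDataPartition(Sonar$Class, times = 10, p = 0.5, list = TRUE)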
For testing we will use 10 random rows not in train_inds:
test_inds <- lapply(train_inds, function(x) {
  inds <- setdiff(1:nrow(Sonar), x)  # rows not used for training in this resample
  sample(inds, size = 10)            # take 10 of them as the hold-out set
})
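As a quick optional sanity check, you can confirm that no test row overlaps its corresponding training set:

# should return TRUE: every test set is disjoint from its training set
all(mapply(function(tr, te) length(intersect(tr, te)) == 0, train_inds, test_inds))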
Now just specify train_inds and test_inds in trainControl:
ctrl <- trainControl(
  method = "boot",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  index = train_inds,          # rows to fit on in each resample
  indexOut = test_inds,        # rows to evaluate on in each resample
  indexFinal = 1:nrow(Sonar),  # rows to fit the final model on
  summaryFunction = twoClassSummary
)
Here indexFinal is set to all rows, which matches caret's default behaviour; change it if you do not wish to fit the final model on all rows.
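For example, a sketch where the final model would be refit on only 150 randomly chosen rows (the 150 and the name ctrl_subset are arbitrary, purely for illustration):

ctrl_subset <- trainControl(
  method = "boot",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  index = train_inds,
  indexOut = test_inds,
  indexFinal = sample(1:nrow(Sonar), size = 150),  # final model fit on 150 rows only
  summaryFunction = twoClassSummary
)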
And fit:
model <- train(
  Class ~ .,
  data = Sonar,
  method = "rf",
  trControl = ctrl,
  metric = "ROC"
)
model
#output
Random Forest
208 samples, 208 used for final model
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Bootstrapped (10 reps)
Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.9104167 0.7750 0.8250000
31 0.9125000 0.7875 0.7916667
60 0.9083333 0.7875 0.8166667
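Since savePredictions = "final" was set, the held-out predictions made on the test_inds rows (for the selected mtry) are stored in the fitted object, along with the per-resample metrics:

head(model$pred)   # held-out predictions, 10 rows per resample here
model$resample     # ROC / Sens / Spec for each of the 10 resamples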
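To map this back onto your 550k data set: put the row numbers of your 500k training items into index and the row numbers of your 50k test items into indexOut, one list element per resample (here just one). A rough sketch, assuming your data sits in a data frame full_data with integer vectors train_rows (500k indices) and test_rows (50k indices) already defined (all three names are hypothetical):

ctrl_big <- trainControl(
  method = "cv",                       # the supplied indices define the resamples
  index = list(split1 = train_rows),   # each hyperparameter combination is fit on the 500k training rows
  indexOut = list(split1 = test_rows), # ...and scored on the 50k held-out rows
  indexFinal = train_rows,             # refit the final model on the training rows only
  classProbs = TRUE,
  savePredictions = "final",
  summaryFunction = twoClassSummary    # assumes a two-class outcome, as in the Sonar example
)

train() is then called exactly as in the Sonar example above, just with your own formula, data and method.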