R: can caret::train function for glmnet cross-validate AUC at fixed alpha and lambda?

Submitted by 霸气de小男生 on 2021-01-29 10:12:26

Question


I would like to calculate the 10-fold cross-validated AUC of an elastic net regression model with the optimal alpha and lambda using caret::train

https://stats.stackexchange.com/questions/69638/does-caret-train-function-for-glmnet-cross-validate-for-both-alpha-and-lambda/69651 explains how to cross-validate alpha and lambda with caret::train

My question on Cross Validated got closed, because it has been classified as a programming question: https://stats.stackexchange.com/questions/505865/r-calculate-the-10-fold-crossvalidated-auc-with-glmnet-and-given-alpha-and-lamb?noredirect=1#comment934491_505865

What I have

Dataset:

library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)

# example data
data(PimaIndiansDiabetes, package="mlbench")

# make a training set
set.seed(2323)
train.data <- PimaIndiansDiabetes

My model:

# build a model using the training set
set.seed(2323)
model <- train(
  diabetes ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE),
  tuneLength = 10,
  metric="ROC"
)

Here I get the warning:

Warning message:
In train.default(x, y, weights = w, ...) :
  The metric "ROC" was not in the result set. Accuracy will be used instead.

If I ignore the warning, the best alpha and lambda would be:

model$bestTune
   alpha      lambda
11   0.2 0.002926378

Now I would like to get a 10-fold cross-validated AUC using my model with the best alpha and lambda and the train data.

What I tried

My approach would be something like this; however, I get the error Something is wrong; all the Accuracy metric values are missing:

model <- train(
  diabetes ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE),
  alpha=model$bestTune$alpha,
  lambda=model$bestTune$lambda,
  tuneLength = 10,
  metric="ROC"
)

How could I calculate a cross-validated AUC using the optimal alpha and lambda and the train data?

I am still not sure how to cross-validate for AUC instead of Accuracy.

Thank you for your help.


Answer 1:


You intend to use "ROC" (area under the ROC curve) to pick the best tuning parameters, but you do not specify twoClassSummary(), the summary function that provides this metric. This is what the warning is telling you:

Warning message:
In train.default(x, y, weights = w, ...) :
  The metric "ROC" was not in the result set. Accuracy will be used instead.

Perform the tuning:

library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)

data(PimaIndiansDiabetes, package="mlbench")

set.seed(2323)
train.data <- PimaIndiansDiabetes

set.seed(2323)
model <- train(
  diabetes ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric="ROC" #ROC metric is in twoClassSummary
)
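
With summaryFunction = twoClassSummary in place, the mean 10-fold AUC for the selected alpha and lambda is already stored in the fitted object. A minimal sketch of how to read it out (getTrainPerf() is a caret helper; the ROC column is the resample-averaged AUC):

model$bestTune                        # best alpha and lambda
merge(model$bestTune, model$results)  # resampling results for the winner; ROC = mean cross-validated AUC
getTrainPerf(model)                   # TrainROC is the same mean 10-fold AUC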

Since you specified classProbs = TRUE and savePredictions = TRUE, you can calculate any metric from the saved out-of-fold predictions. To calculate accuracy:

model$pred %>%
  filter(alpha == model$bestTune$alpha,   #filter predictions for best tuning parameters
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>% #group by fold
  summarise(acc = sum(pred == obs)/n()) #calculate metric
#output
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 10 x 2
   Resample   acc
   <chr>    <dbl>
 1 Fold01   0.740
 2 Fold02   0.753
 3 Fold03   0.818
 4 Fold04   0.776
 5 Fold05   0.779
 6 Fold06   0.753
 7 Fold07   0.766
 8 Fold08   0.792
 9 Fold09   0.727
10 Fold10   0.789

This gives you the per-fold metric. To get the average performance:

model$pred %>%
  filter(alpha == model$bestTune$alpha,
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>%
  summarise(acc = sum(pred == obs)/n()) %>%
  pull(acc) %>%
  mean
#output
0.769566
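
To answer the original question about AUC rather than accuracy, the same per-fold approach works, here with pROC. This is a sketch; it assumes the pROC package is installed and that caret named the class-probability column pos after the corresponding factor level of diabetes:

library(pROC)

model$pred %>%
  filter(alpha == model$bestTune$alpha,    # predictions for the winning tuning parameters only
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>%                   # one AUC per fold
  summarise(auc = as.numeric(roc(obs, pos,
                                 levels = c("neg", "pos"),
                                 direction = "<",
                                 quiet = TRUE)$auc)) %>%
  pull(auc) %>%
  mean()

If you instead want caret itself to cross-validate at a fixed alpha and lambda, pass them through tuneGrid (for example tuneGrid = model$bestTune) rather than as alpha = / lambda = arguments, which is likely what triggered the "all the Accuracy metric values are missing" error; model$results of that run then holds a single ROC value, the cross-validated AUC at those fixed parameters.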

When ROC is used as a selection metric, the hyperparameters are optimized over all decision thresholds. In many cases the chosen model would perform suboptimally at the default decision threshold of 0.5.

Caret has a function thresholder(), which calculates many metrics on the resampled data over a specified range of decision thresholds:

thresholder(model, seq(0, 1, length.out = 10)) #in reality I would use length.out = 100

#output

alpha     lambda prob_threshold Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall        F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy  Accuracy
1    0.1 0.03607775      0.0000000       1.000  0.00000000      0.6510595            NaN 0.6510595  1.000 0.7886514  0.6510595      0.6510595            1.0000000         0.5000000 0.6510595
2    0.1 0.03607775      0.1111111       0.994  0.02621083      0.6557464      0.7380952 0.6557464  0.994 0.7901580  0.6510595      0.6471463            0.9869617         0.5101054 0.6562714
3    0.1 0.03607775      0.2222222       0.986  0.15270655      0.6850874      0.8711111 0.6850874  0.986 0.8082906  0.6510595      0.6419344            0.9375256         0.5693533 0.6952837
4    0.1 0.03607775      0.3333333       0.964  0.32421652      0.7278778      0.8406807 0.7278778  0.964 0.8290127  0.6510595      0.6276316            0.8633459         0.6441083 0.7408578
5    0.1 0.03607775      0.4444444       0.928  0.47364672      0.7674158      0.7903159 0.7674158  0.928 0.8395895  0.6510595      0.6041866            0.7877990         0.7008234 0.7695147
6    0.1 0.03607775      0.5555556       0.862  0.59002849      0.7970454      0.7053968 0.7970454  0.862 0.8274687  0.6510595      0.5611928            0.7043575         0.7260142 0.7669686
7    0.1 0.03607775      0.6666667       0.742  0.75740741      0.8521972      0.6114289 0.8521972  0.742 0.7926993  0.6510595      0.4830827            0.5677204         0.7497037 0.7473855
8    0.1 0.03607775      0.7777778       0.536  0.90284900      0.9156149      0.5113452 0.9156149  0.536 0.6739140  0.6510595      0.3489918            0.3828606         0.7194245 0.6640636
9    0.1 0.03607775      0.8888889       0.198  0.98119658      0.9573810      0.3967404 0.9573810  0.198 0.3231917  0.6510595      0.1289474            0.1354751         0.5895983 0.4713602
10   0.1 0.03607775      1.0000000       0.000  1.00000000            NaN      0.3489405       NaN  0.000       NaN  0.6510595      0.0000000            0.0000000         0.5000000 0.3489405
       Kappa          J      Dist
1  0.0000000 0.00000000 1.0000000
2  0.0258717 0.02021083 0.9738516
3  0.1699809 0.13870655 0.8475624
4  0.3337322 0.28821652 0.6774055
5  0.4417759 0.40164672 0.5329805
6  0.4692998 0.45202849 0.4363768
7  0.4727251 0.49940741 0.3580090
8  0.3726156 0.43884900 0.4785352
9  0.1342372 0.17919658 0.8026597
10 0.0000000 0.00000000 1.0000000

Now pick a threshold based on your desired metric and use that. The metrics usually used with imbalanced data are Cohen's Kappa, Youden's J, or the Matthews correlation coefficient (MCC). Here is a decent paper on the matter.
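
For example, here is a sketch that picks the threshold maximizing Youden's J from the thresholder() output (any other column, such as Kappa, could be substituted):

ths <- thresholder(model, seq(0, 1, length.out = 100))  # finer grid than above
ths$prob_threshold[which.max(ths$J)]                    # threshold with the largest Youden's J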

Please note that since this data was used to find the optimal threshold, the performance obtained this way will be optimistically biased. To evaluate the performance of the picked decision threshold, it is best to use one or more independent test sets. In other words, I recommend nested resampling, where you optimize the parameters and the threshold on the inner folds and evaluate on the outer folds.

Here is an explanation of how to use nested resampling with caret for regression; some modifications are needed to make it work for classification with an optimized threshold.

Please note that this is not the only way to pick the best decision threshold. Another way is to pick the desired metric a priori (MCC, for instance) and treat the decision threshold as a hyperparameter to be tuned jointly with all the other hyperparameters. As far as I can tell, this is not supported in caret without creating a custom model.



Source: https://stackoverflow.com/questions/65814703/r-can-carettrain-function-for-glmnet-cross-validate-auc-at-fixed-alpha-and-la
