Question
I am running caret's neural network on the Bike Sharing dataset and I get the following error message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
I am not sure what the problem is. Can anyone help please?
The dataset is from: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Here is the code:
library(caret)
library(bestNormalize)
data_hour = read.csv("hour.csv")
# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p = 0.80, list = FALSE)
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]  # drop instant, dteday and yr
# View structure of the data
str(dataset)
# 'data.frame': 13905 obs. of 14 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num 0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp : num 0.273 0.273 0.288 0.258 0.288 ...
# $ hum : num 0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num 0 0 0 0.0896 0 ...
# $ casual : int 8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int 32 27 10 1 7 24 30 55 71 70 ...
# $ cnt : int 40 32 13 1 8 36 56 84 106 110 ...
## Transform numeric data to Gaussian
dataset_selected = dataset[,c(-13,-14)]  # drop registered and cnt
for (i in 8:12) {  # temp, atemp, hum, windspeed, casual
  dataset_selected[,i] = predict(boxcox(dataset_selected[,i] + 0.1))
}
# View transformed dataset
str(dataset_selected)
# 'data.frame': 13905 obs. of 12 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num -1.47 -1.47 -1.35 -1.35 -1.35 ...
# $ atemp : num -1.18 -1.18 -1.09 -1.27 -1.09 ...
# $ hum : num 0.899 0.899 0.637 0.637 0.637 ...
# $ windspeed : num -1.8 -1.8 -1.8 -0.787 -1.8 ...
# $ casual : num -0.361 -0.588 -0.81 -1.867 -1.208 ...
# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats = 3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~ ., data = dataset_selected, method = 'nnet',
            metric = metric, trControl = control, trace = FALSE)
Thanks for your help!
Answer 1:
phiver's comment is spot on; however, I would still like to provide a more verbose answer on this concrete example.
In order to investigate what is going on in more detail, add the arguments savePredictions = "all" and returnResamp = "all" to trainControl:
control = trainControl(method = 'repeatedcv',
                       number = 10,
                       repeats = 3,
                       returnResamp = "all",
                       savePredictions = "all")
metric = 'RMSE'
set.seed(3)
fit = train(casual ~ .,
            data = dataset_selected,
            method = 'nnet',
            metric = metric,
            trControl = control,
            trace = FALSE,
            form = "traditional")
Now run:
fit$results
#output
size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
2 1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3 1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4 3 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
5 3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6 3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7 5 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
8 5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9 5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535
We notice that the problem occurs when decay = 0.
Let's filter the observations and predictions for decay = 0:
library(tidyverse)
fit$pred %>%
  filter(decay == 0) -> for_r2
var(for_r2$pred)
#output
0
We can observe that all of the predictions for decay == 0 are the same, i.e. they have zero variance. The model exclusively predicts 0:
unique(for_r2$pred)
#output
0
So when the summary function tries to compute R squared:
caret::R2(for_r2$obs, for_r2$pred)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
the standard deviation is zero
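To see the mechanism in isolation, here is a minimal standalone illustration with made-up numbers (not taken from the dataset): correlating anything with a constant vector is undefined, which is exactly why the Rsquared column turns into NA.
obs  <- c(1.2, 0.7, -0.3, 0.5)  # arbitrary observed values
pred <- rep(0, 4)               # constant predictions, zero variance
var(pred)                       # 0
cor(obs, pred)                  # NA, with "the standard deviation is zero" warning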
Answer 2:
The answer by @missuse is already very insightful for understanding why this error happens, so I just want to add some straightforward ways to get rid of it.
If the predictions have zero variance in some cross-validation folds, the model did not converge. In such cases, you can try the neuralnet package, which offers two parameters you can tune:
- threshold: default value 0.01. Set it to 0.3 and then try lower values: 0.2, 0.1, 0.05.
- stepmax: default value 1e+05. Set it to 1e+08 and then try lower values: 1e+07, 1e+06 (see the sketch after the example below).
In most cases, it is sufficient to change the threshold parameter like this:
model.nn <- caret::train(formula1,
                         method = "neuralnet",
                         data = training.set[,],
                         # apply preProcess within cross-validation folds
                         preProcess = c("center", "scale"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 3),
                         threshold = 0.3)
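If adjusting threshold alone does not help, stepmax can be passed through train in the same way. The following is only a sketch of that variant, reusing the placeholders formula1 and training.set from the call above:
model.nn <- caret::train(formula1,
                         method = "neuralnet",
                         data = training.set,
                         preProcess = c("center", "scale"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 3),
                         threshold = 0.3,
                         # allow many more training steps before neuralnet gives up
                         stepmax = 1e+08)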
Answer 3:
Answer by @topepo (caret package main developer). See the detailed GitHub thread here.
It looks like it happens when you have one hidden unit and almost no regularization. What is happening is that the model is predicting a value very close to a constant (so that the RMSE is a little worse than the basic standard deviation of the outcome):
> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
RMSE Rsquared MAE size decay Resample
1 8.414010 NA 6.704311 1 0e+00 Fold04.Rep01
2 8.421244 NA 6.844363 1 0e+00 Fold01.Rep03
3 7.855925 NA 6.372947 1 1e-04 Fold10.Rep07
4 7.963816 NA 6.428947 1 0e+00 Fold07.Rep09
5 8.492898 NA 6.901842 1 0e+00 Fold09.Rep09
6 7.892527 NA 6.479474 1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888
So it's nothing to really worry about; just some parameters that do very poorly.
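Building on that diagnosis, one simple way to avoid those degenerate candidates with the original nnet setup is to supply a tuning grid that omits decay = 0, so every model carries at least a little weight decay. The snippet below is only a sketch reusing dataset_selected, metric and control from the question; linout = TRUE (a linear output unit for a numeric outcome) is an extra assumption here, not part of the original call:
# Tune over the same sizes caret tried (1, 3, 5) but without decay = 0.
grid <- expand.grid(size  = c(1, 3, 5),
                    decay = c(1e-4, 1e-2, 1e-1))

set.seed(3)
fit_reg <- train(casual ~ .,
                 data      = dataset_selected,
                 method    = 'nnet',
                 metric    = metric,
                 trControl = control,
                 tuneGrid  = grid,
                 linout    = TRUE,   # linear output unit for regression
                 trace     = FALSE)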
Source: https://stackoverflow.com/questions/51548255/caret-there-were-missing-values-in-resampled-performance-measures