Is cv.glmnet overfitting the data by using the full lambda sequence?

Scortchi - Reinstate Monica

You're correct that using a cross-validated measure of fit to pick the "best" value of a tuning parameter introduces an optimistic bias into that measure when it is viewed as an estimate of the out-of-sample performance of the model with that "best" value: any statistic has a sampling variance. But to talk of over-fitting seems to imply that optimizing over the tuning parameter degrades out-of-sample performance compared with keeping it at a pre-specified value (say zero). That's unusual, in my experience: the optimization is very constrained (over a single parameter) compared with many other methods of feature selection.

In any case, it's a good idea to validate the whole procedure, including the choice of tuning parameter, on a hold-out set, with an outer cross-validation loop, or by bootstrapping. See Cross Validation (error generalization) after model selection. A minimal sketch of the hold-out approach follows.
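To make that last point concrete, here is a minimal sketch using simulated data (the sizes, coefficients, and variable names are placeholders, not anything from the question). The point is that lambda is chosen by cv.glmnet() on the training rows only, and the held-out rows are touched exactly once, at the end, to estimate out-of-sample error for the whole procedure:

```r
library(glmnet)

set.seed(42)
x <- matrix(rnorm(200 * 30), 200, 30)   # toy predictors, for illustration only
y <- x[, 1] - 2 * x[, 2] + rnorm(200)   # toy response with a weak signal

# Hold out a validation set that never touches the tuning step
train <- sample(nrow(x), 150)

# Inner cross-validation picks lambda on the training data only
cvfit <- cv.glmnet(x[train, ], y[train])

# Honest estimate of out-of-sample error for the WHOLE procedure,
# including the choice of lambda
pred <- predict(cvfit, newx = x[-train, ], s = "lambda.1se")
mean((y[-train] - pred)^2)
```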

No, this is not overfitting.

cv.glmnet() does build the entire solution path for the lambda sequence, but you never pick the last entry in that path. You typically pick lambda == lambda.1se (or lambda == lambda.min), as @Fabians said:

lambda == lambda.min: the value of lambda at which the mean cross-validated error (cvm) is minimized.

lambda == lambda.1se: the largest value of lambda at which cvm stays within one standard error (cvsd) of that minimum. It is the conventional "parsimonious" choice and is often recommended as the default (see the sketch after this list, which recomputes both values from the cv.glmnet() output).
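If it helps to see those definitions in code, here is a small sketch on simulated data (names and sizes are placeholders) that recomputes both values from the cvm and cvsd components of a cv.glmnet() fit and checks them against the stored lambda.min and lambda.1se:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)   # toy data, purely for illustration
y <- x[, 1] + rnorm(100)

cvfit <- cv.glmnet(x, y)

# lambda.min: the lambda with the smallest mean cross-validated error
i.min <- which.min(cvfit$cvm)
stopifnot(all.equal(cvfit$lambda.min, cvfit$lambda[i.min]))

# lambda.1se: the largest lambda whose cvm is still within one
# standard error (cvsd) of that minimum
threshold <- cvfit$cvm[i.min] + cvfit$cvsd[i.min]
stopifnot(all.equal(cvfit$lambda.1se,
                    max(cvfit$lambda[cvfit$cvm <= threshold])))
```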

See the documentation for cv.glmnet() and coef(..., s = "lambda.1se").
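Continuing the cvfit object from the sketch above, extracting coefficients or predictions at either choice of lambda looks like this:

```r
# Coefficients at the "one standard error" lambda; many are exactly zero
coef(cvfit, s = "lambda.1se")

# Coefficients at the minimum-error lambda (usually less sparse)
coef(cvfit, s = "lambda.min")

# predict() uses the same convention for selecting lambda
predict(cvfit, newx = x[1:5, ], s = "lambda.1se")
```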
