Cross validation for glm() models

浪子不回头ぞ 提交于 2019-12-02 17:19:05

I am always a little cautious about using various packages 10-fold cross validation methods. I have my own simple script to create the test and training partitions manually for any machine learning package:

#Randomly shuffle the data
yourData<-yourData[sample(nrow(yourData)),]

#Create 10 equally size folds
folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)

#Perform 10 fold cross validation
for(i in 1:10){
    #Segement your data by fold using the which() function 
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    #Use test and train data partitions however you desire...
}

@Roman provided some answers in his comments, however, the answer to your questions is provided by inspecting the code with cv.glm:

I believe this bit of code splits the data set up randomly into the K-folds, arranging rounding as necessary if K does not divide n:

if ((K > n) || (K <= 1)) 
    stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0)) 
    K <- kvals[temp == min(temp)][1L]
if (K != K.o) 
    warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)

This bit here shows that the delta value is NOT the root mean square error. It is, as the helpfile says The default is the average squared error function. What does this mean? We can see this by inspecting the function declaration:

function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2), 
    K = n) 

which shows that within each fold, we calculate the average of the error squared, where error is in the usual sense between predicted response vs actual response.

delta[1] is simply the weighted average of the SUM of all of these terms for each fold, see my inline comments in the code of cv.glm:

for (i in seq_len(ms)) {
    j.out <- seq_len(n)[(s == i)]
    j.in <- seq_len(n)[(s != i)]
    Call$data <- data[j.in, , drop = FALSE]
    d.glm <- eval.parent(Call)
    p.alpha <- n.s[i]/n #create weighted average for later
    cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out, 
        , drop = FALSE], type = "response"))
    CV <- CV + p.alpha * cost.i # add weighted average error to running total
    cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm, 
        data, type = "response"))
}
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!