R: Cross validation on a dataset with factors


There is surely a more elegant solution, but personally I would just retry the cv.glm call, wrapped in try(), inside a while loop until it succeeds.

m.cv <- try(cv.glm(d, m, K=2))  # first try
class(m.cv)                     # sometimes error, sometimes list
while (inherits(m.cv, "try-error")) {
    m.cv <- try(cv.glm(d, m, K=2))
}
class(m.cv)                     # always list

I've tried it with 100,000 rows per group in the data.frame and it only takes a few seconds:

library(boot)
n <- 100000
d <- data.frame(x=c(rep('A', n), rep('B', n), 'C', 'C'), y=1:(n*2 + 2))
m <- glm(y ~ x, data=d)

m.cv <- try(cv.glm(d, m, K=2))
class(m.cv)                     # sometimes error, sometimes list
while (inherits(m.cv, "try-error")) {
    m.cv <- try(cv.glm(d, m, K=2))
}
class(m.cv)                     # always list
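
If the bare retry loop worries you (in principle it could spin for a long time on an unlucky model), a simple variant caps the number of attempts; max_tries here is an arbitrary choice.

max_tries <- 20                                  # arbitrary cap on the number of attempts
tries <- 1
m.cv <- try(cv.glm(d, m, K=2), silent=TRUE)      # silent=TRUE suppresses the error printout
while (inherits(m.cv, "try-error") && tries < max_tries) {
    m.cv <- try(cv.glm(d, m, K=2), silent=TRUE)
    tries <- tries + 1
}
inherits(m.cv, "try-error")                      # FALSE if some attempt succeeded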

When I call traceback() I get this:

> traceback()
9: stop(sprintf(ngettext(length(m), "factor %s has new level %s", 
       "factor %s has new levels %s"), nm, paste(nxl[m], collapse = ", ")), 
       domain = NA)
8: model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)
7: model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
6: predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == 
       "link", "response", type), terms = terms, na.action = na.action)
5: predict.glm(d.glm, data[j.out, , drop = FALSE], type = "response")
4: predict(d.glm, data[j.out, , drop = FALSE], type = "response")
3: mean((y - yhat)^2)
2: cost(glm.y[j.out], predict(d.glm, data[j.out, , drop = FALSE], 
       type = "response"))
1: cv.glm(d, m, K = 2)

And looking at the cv.glm function gives:

> cv.glm
function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2), 
    K = n) 
{
    call <- match.call()
    if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) 
        runif(1)
    seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    n <- nrow(data)
    out <- NULL
    if ((K > n) || (K <= 1)) 
        stop("'K' outside allowable range")
    K.o <- K
    K <- round(K)
    kvals <- unique(round(n/(1L:floor(n/2))))
    temp <- abs(kvals - K)
    if (!any(temp == 0)) 
        K <- kvals[temp == min(temp)][1L]
    if (K != K.o) 
        warning(gettextf("'K' has been set to %f", K), domain = NA)
    f <- ceiling(n/K)
    s <- sample0(rep(1L:K, f), n)
    n.s <- table(s)
    glm.y <- glmfit$y
    cost.0 <- cost(glm.y, fitted(glmfit))
    ms <- max(s)
    CV <- 0
    Call <- glmfit$call
    for (i in seq_len(ms)) {
        j.out <- seq_len(n)[(s == i)]
        j.in <- seq_len(n)[(s != i)]
        Call$data <- data[j.in, , drop = FALSE]
        d.glm <- eval.parent(Call)
        p.alpha <- n.s[i]/n
        cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out, 
            , drop = FALSE], type = "response"))
        CV <- CV + p.alpha * cost.i
        cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm, 
            data, type = "response"))
    }
    list(call = call, K = K, delta = as.numeric(c(CV, CV + cost.0)), 
        seed = seed)
}

It seems the problem has to do with the extremely small cell count for one level of your categorical predictor (values "A", "B", and "C", with only two "C" rows). You are fitting a glm with two effects: the "B vs. A" and "C vs. A" contrasts. In each CV iteration, cv.glm refits the model on the training fold (the Call$data <- data[j.in, , drop = FALSE] line above) and then predicts the held-out fold. With K=2 and only two "C" observations, some random splits put both of them into the same fold; when that fold is held out, the model fit on the training fold never sees level "C", and the error comes from computing fitted values for validation rows whose x takes a level that was not observed in the training data.
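
You can reproduce the error outside cv.glm with the same d and formula as above: fit on the rows without "C" and then predict the rows that contain it.

train <- d[d$x != 'C', ]    # a "training fold" that never sees level C
test  <- d[d$x == 'C', ]    # held-out rows containing only level C
fit   <- glm(y ~ x, data=train)
try(predict(fit, newdata=test, type="response"))
# Error in model.frame.default(...) : factor x has new levels C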

Frank Harrell (often on stats.stackexchange.com) wrote in Regression Modeling Strategies that one ought to avoid split-sample validation when the sample size is small and/or some cell counts are small in a categorical data analysis. Singularity (as you are seeing here) is one of many reasons why I think this is true.

Given the small cell counts here, you should consider alternatives to split-sample cross validation, such as a permutation test or a parametric bootstrap. Another important consideration is exactly why you feel model-based inference isn't appropriate here. As Tukey said of the bootstrap, he'd like to call it a shotgun: it will blow the head off of any problem, as long as you're willing to reassemble the pieces.
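
As a rough sketch of the parametric bootstrap option, using the same d and m as in the example above (B, the number of replicates, is an arbitrary choice): simulate new responses from the fitted glm, refit on each simulated response, and look at the spread of the coefficient estimates.

set.seed(1)
B <- 200                              # number of bootstrap replicates (arbitrary choice)
boot.coefs <- replicate(B, {
    d.star   <- d
    d.star$y <- simulate(m)[[1]]      # draw a new response vector from the fitted model
    coef(glm(y ~ x, data=d.star))     # refit on the simulated data and keep the coefficients
})
apply(boot.coefs, 1, sd)              # bootstrap standard errors of the coefficients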
