Caret package - cross-validating GAM with both smooth and linear predictors


I would like to cross validate a GAM model using caret. My GAM model has a binary outcome variable, an isotropic smooth of latitude and longitude coordinate pairs, and then line


1 Answer

时光取名叫无心

2021-02-09 22:04

It is interesting to see someone using mgcv through another package. After a bit of research, I have disappointing news: using mgcv through caret is a bad idea, at least with caret's current support.

Let me ask you a few fundamental questions. If you are using caret:

How can you specify the number of knots, as well as the spline basis class, for a smooth function?
How can you specify a 2D smooth function?
How can you specify a tensor product spline with te or ti?
How can you tweak the smoothing parameters?
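
For contrast, here is a minimal sketch of how each of these is expressed when calling mgcv directly. The particular values of k, bs and sp are arbitrary illustrations, not recommendations, and the data come from mgcv's own gamSim simulator.

```r
library(mgcv)

set.seed(0)
# Simulated data with columns y, x, z (illustrative only)
dat <- gamSim(eg = 2, scale = 0.2, verbose = FALSE)$data[1:3]

# 1. Basis class and basis dimension of a univariate smooth
m1 <- gam(y ~ s(x, bs = "cr", k = 15) + s(z), data = dat)

# 2. Isotropic 2D smooth of x and z
m2 <- gam(y ~ s(x, z, k = 40), data = dat)

# 3. Tensor product smooth, with one basis dimension per margin
m3 <- gam(y ~ te(x, z, k = c(6, 6)), data = dat)

# 4. Smoothing parameter fixed by hand instead of being estimated
m4 <- gam(y ~ s(x, z, k = 40), data = dat, sp = 0.01)
```

None of these choices can be written in the formula that caret builds for method = "gam", which is the point of the questions above.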

If you want to know what caret::train is doing with method = "gam", check out its fitting routine:

getModelInfo(model = "gam", regex = FALSE)$gam$fit

function(x, y, wts, param, lev, last, classProbs, ...) { 
            dat <- if(is.data.frame(x)) x else as.data.frame(x)
            modForm <- caret:::smootherFormula(x)
            if(is.factor(y)) {
              dat$.outcome <- ifelse(y == lev[1], 0, 1)
              dist <- binomial()
            } else {
              dat$.outcome <- y
              dist <- gaussian()
            }
            modelArgs <- list(formula = modForm,
                              data = dat,
                              select = param$select, 
                              method = as.character(param$method))
            ## Intercept family if passed in
            theDots <- list(...)
            if(!any(names(theDots) == "family")) modelArgs$family <- dist
            modelArgs <- c(modelArgs, theDots)                 
            out <- do.call(getFromNamespace("gam", "mgcv"), modelArgs)
            out    
            }

See the modForm <- caret:::smootherFormula(x) line? That line is the key; the other lines are just routine construction of a model call. So let's check what GAM formula caret is constructing:

caret:::smootherFormula

function (data, smoother = "s", cut = 10, df = 0, span = 0.5, 
    degree = 1, y = ".outcome") 
{
    nzv <- nearZeroVar(data)
    if (length(nzv) > 0) 
        data <- data[, -nzv, drop = FALSE]
    numValues <- sort(apply(data, 2, function(x) length(unique(x))))
    prefix <- rep("", ncol(data))
    suffix <- rep("", ncol(data))
    prefix[numValues > cut] <- paste(smoother, "(", sep = "")
    if (smoother == "s") {
        suffix[numValues > cut] <- if (df == 0) 
            ")"
        else paste(", df=", df, ")", sep = "")
    }
    if (smoother == "lo") {
        suffix[numValues > cut] <- paste(", span=", span, ",degree=", 
            degree, ")", sep = "")
    }
    if (smoother == "rcs") {
        suffix[numValues > cut] <- ")"
    }
    rhs <- paste(prefix, names(numValues), suffix, sep = "")
    rhs <- paste(rhs, collapse = "+")
    form <- as.formula(paste(y, rhs, sep = "~"))
    form
}

In short, it creates additive, univariate smooths. This is the classic form in which GAMs were first proposed.

As a result, you lose a significant amount of control over mgcv, as listed previously.
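
A small sketch of the cut rule in the function above may help. Reading the source, a column is wrapped in s() only when it has more than cut = 10 distinct values; otherwise it enters linearly. The toy data frame below is made up for illustration, and since caret:::smootherFormula is an internal function, its behavior may differ between caret versions.

```r
library(caret)

set.seed(1)
toy <- data.frame(
  x = runif(100),                     # 100 distinct values: above the cut
  g = rep(c(0, 1, 2, 3), each = 25)   # 4 distinct values: below the cut
)

# Internal helper, called exactly as caret's fitting routine calls it
form <- caret:::smootherFormula(toy)
print(form)
```

If the reading above is right, x should appear as s(x) and g as a plain linear term. So the only way to keep a predictor linear through caret is for it to have few distinct values, which a continuous covariate will not.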

To verify this, let me construct a similar example to your case:

library(mgcv)  # provides gamSim

set.seed(0)
dat <- gamSim(eg = 2, scale = 0.2)$data[1:3]  # keep columns y, x, z
dat$a <- runif(400)
dat$b <- runif(400)
dat$y <- with(dat, y + 0.3 * a - 0.7 * b)     # add linear effects of a and b

head(dat)

#            y         x         z          a         b
#1 -0.30258559 0.8966972 0.1478457 0.07721866 0.3871130
#2 -0.59518832 0.2655087 0.6588776 0.13853856 0.8718050
#3 -0.06978648 0.3721239 0.1850700 0.04752457 0.9671970
#4 -0.17002059 0.5728534 0.9543781 0.03391887 0.8669163
#5  0.55452069 0.9082078 0.8978485 0.91608902 0.4377153
#6 -0.17763650 0.2016819 0.9436971 0.84020039 0.1919378

So we aim to fit a model: y ~ s(x, z) + a + b. The data y is Gaussian, but this does not matter; it does not affect how caret works with mgcv.

library(caret)

# number and repeats are not used by LOOCV, so they are omitted
cv <- train(y ~ x + z + a + b, data = dat, method = "gam", family = "gaussian",
            trControl = trainControl(method = "LOOCV"),
            tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))

You can extract the final model:

fit <- cv$finalModel

So what formula is it using?

fit$formula
#.outcome ~ s(x) + s(z) + s(a) + s(b)

See? Apart from being "additive, univariate", it also leaves everything of mgcv::s to its default: default bs = "tp", default k = 10, etc.
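
One possible workaround, which is not part of the answer above, is to run the cross-validation loop by hand and call mgcv directly, so the formula stays under your control. The sketch below assumes 10 folds, RMSE as the metric and REML for smoothness selection; all three are illustrative choices.

```r
library(mgcv)

set.seed(0)
# Rebuild the example data: y depends on a 2D surface in (x, z) plus linear a, b
dat <- gamSim(eg = 2, scale = 0.2, verbose = FALSE)$data[1:3]
n <- nrow(dat)
dat$a <- runif(n)
dat$b <- runif(n)
dat$y <- with(dat, y + 0.3 * a - 0.7 * b)

k_folds <- 10
# Random fold labels, as balanced as n allows
fold <- sample(rep(seq_len(k_folds), length.out = n))

rmse <- numeric(k_folds)
for (i in seq_len(k_folds)) {
  train_dat <- dat[fold != i, ]
  test_dat  <- dat[fold == i, ]
  # The formula we actually want: 2D smooth plus linear terms
  fit_i <- gam(y ~ s(x, z) + a + b, data = train_dat, method = "REML")
  pred  <- predict(fit_i, newdata = test_dat)
  rmse[i] <- sqrt(mean((test_dat$y - pred)^2))
}

mean(rmse)  # cross-validated RMSE
```

For a binary outcome, the same loop should work with family = binomial in gam, predict(..., type = "response"), and a classification metric in place of RMSE.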
