Caret package - cross-validating GAM with both smooth and linear predictors

庸人自扰 2021-02-09 21:21

I would like to cross validate a GAM model using caret. My GAM model has a binary outcome variable, an isotropic smooth of latitude and longitude coordinate pairs, and then line

1 answer
  • 2021-02-09 22:04

    It is interesting to see someone using mgcv through another package. After a bit of research, I have to disappoint you: using mgcv through caret is a bad idea, at least with the support caret currently offers.

    Let me just ask you a few fundamental questions. If you are using caret:

    1. How can you specify the number of knots, as well as spline basis class for a smooth function?
    2. How can you specify 2D smooth function?
    3. How can you specify tensor product spline with te or ti?
    4. How can you tweak the smoothing parameters?
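
    For contrast, here is a minimal sketch of how each of these four choices is usually written when calling mgcv directly. The toy data frame, the variable names and the particular values of k and sp are assumptions made up for illustration; they are not taken from the question.

```r
library(mgcv)

set.seed(1)
n <- 200
toy <- data.frame(x = runif(n), z = runif(n))   # hypothetical toy data
toy$y <- sin(2 * pi * toy$x) * toy$z + rnorm(n, sd = 0.2)

## 1. basis class (bs) and basis dimension (k) are set per smooth
m1 <- gam(y ~ s(x, bs = "cr", k = 15) + s(z, bs = "ps", k = 8), data = toy)

## 2. an isotropic 2D smooth is one s() term with two covariates
m2 <- gam(y ~ s(x, z, k = 30), data = toy)

## 3. a tensor product smooth uses te() (or ti() for the interaction only)
m3 <- gam(y ~ te(x, z, k = c(5, 5)), data = toy)

## 4. smoothing parameters can be fixed through sp, one value per smooth
m4 <- gam(y ~ s(x) + s(z), data = toy, sp = c(0.01, 1))
```

    All four are choices made in the model formula or in arguments to mgcv::gam, which is exactly the part that caret's "gam" method builds for you.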

    If you want to know what caret::train is doing with method = "gam", check out its fitting routine:

    getModelInfo(model = "gam", regex = FALSE)$gam$fit
    
    function(x, y, wts, param, lev, last, classProbs, ...) { 
                dat <- if(is.data.frame(x)) x else as.data.frame(x)
                modForm <- caret:::smootherFormula(x)
                if(is.factor(y)) {
                  dat$.outcome <- ifelse(y == lev[1], 0, 1)
                  dist <- binomial()
                } else {
                  dat$.outcome <- y
                  dist <- gaussian()
                }
                modelArgs <- list(formula = modForm,
                                  data = dat,
                                  select = param$select, 
                                  method = as.character(param$method))
                ## Intercept family if passed in
                theDots <- list(...)
                if(!any(names(theDots) == "family")) modelArgs$family <- dist
                modelArgs <- c(modelArgs, theDots)                 
                out <- do.call(getFromNamespace("gam", "mgcv"), modelArgs)
                out    
                }
    

    You see the modForm <- caret:::smootherFormula(x) line? That line is the key; the other lines are just routine construction of a model call. So let's check what GAM formula caret is constructing:

    caret:::smootherFormula
    
    function (data, smoother = "s", cut = 10, df = 0, span = 0.5, 
        degree = 1, y = ".outcome") 
    {
        nzv <- nearZeroVar(data)
        if (length(nzv) > 0) 
            data <- data[, -nzv, drop = FALSE]
        numValues <- sort(apply(data, 2, function(x) length(unique(x))))
        prefix <- rep("", ncol(data))
        suffix <- rep("", ncol(data))
        prefix[numValues > cut] <- paste(smoother, "(", sep = "")
        if (smoother == "s") {
            suffix[numValues > cut] <- if (df == 0) 
                ")"
            else paste(", df=", df, ")", sep = "")
        }
        if (smoother == "lo") {
            suffix[numValues > cut] <- paste(", span=", span, ",degree=", 
                degree, ")", sep = "")
        }
        if (smoother == "rcs") {
            suffix[numValues > cut] <- ")"
        }
        rhs <- paste(prefix, names(numValues), suffix, sep = "")
        rhs <- paste(rhs, collapse = "+")
        form <- as.formula(paste(y, rhs, sep = "~"))
        form
    }
    

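    In short, it creates additive, univariate smooths: every predictor with more than cut = 10 unique values is wrapped in its own s(), and the remaining predictors enter linearly. This is the classic form from when GAMs were first proposed.

    As a result, you lose a significant amount of control over mgcv, as listed previously.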
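
    The rule is easy to see on a small example. The function below is a simplified stand-in for the logic above, not caret's own code: it assumes the defaults smoother = "s" and df = 0, and it skips the near-zero-variance filter and the sorting by number of unique values. The name sketch_formula and the toy columns lon, lat and grp are hypothetical.

```r
## Simplified stand-in for the formula-building rule (base R only).
## Columns with more than `cut` unique values become s(<name>);
## all other columns are added as plain linear terms.
sketch_formula <- function(data, cut = 10, y = ".outcome") {
  n_unique <- sapply(data, function(col) length(unique(col)))
  terms <- ifelse(n_unique > cut,
                  paste0("s(", names(data), ")"),
                  names(data))
  as.formula(paste(y, paste(terms, collapse = "+"), sep = "~"))
}

set.seed(1)
toy <- data.frame(lon = runif(50), lat = runif(50),   # continuous coordinates
                  grp = rep(0:1, 25))                 # binary covariate
form_txt <- deparse(sketch_formula(toy))
form_txt
```

    Here form_txt is ".outcome ~ s(lon) + s(lat) + grp": the two coordinates get separate univariate smooths, and nothing in this rule can produce a joint s(lon, lat) term.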

    To verify this, let me construct an example similar to your case:

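    library(mgcv)   # provides gamSim()
    set.seed(0)
    ## bivariate smoothing example; keep only the columns y, x and z
    dat <- gamSim(eg = 2, scale = 0.2)$data[1:3]
    ## two extra covariates that enter the response linearly
    dat$a <- runif(400)
    dat$b <- runif(400)
    dat$y <- with(dat, y + 0.3 * a - 0.7 * b)
    head(dat)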
    
    #            y         x         z          a         b
    #1 -0.30258559 0.8966972 0.1478457 0.07721866 0.3871130
    #2 -0.59518832 0.2655087 0.6588776 0.13853856 0.8718050
    #3 -0.06978648 0.3721239 0.1850700 0.04752457 0.9671970
    #4 -0.17002059 0.5728534 0.9543781 0.03391887 0.8669163
    #5  0.55452069 0.9082078 0.8978485 0.91608902 0.4377153
    #6 -0.17763650 0.2016819 0.9436971 0.84020039 0.1919378
    

    So we aim to fit the model y ~ s(x, z) + a + b. The response y is Gaussian here rather than binary, but this does not matter; it does not affect how caret works with mgcv.

    library(caret)
    cv <- train(y ~ x + z + a + b, data = dat, method = "gam", family = "gaussian",
                trControl = trainControl(method = "LOOCV", number=1, repeats=1), 
                tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))
    

    You can extract the final model:

    fit <- cv$finalModel
    

    So what formula is it using?

    fit$formula
    #.outcome ~ s(x) + s(z) + s(a) + s(b)
    

    See? Apart from being "additive, univariate", it also leaves every argument of mgcv::s at its default: default bs = "tp", default k = 10, etc.
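
    If the goal is to cross-validate the intended model y ~ s(x, z) + a + b, one option is to write the loop by hand around mgcv::gam. The following is only a sketch under stated assumptions: the simulated data frame sim, the choice of 5 folds, method = "REML" and RMSE as the score are all illustrative and do not come from the question.

```r
library(mgcv)

set.seed(0)
n <- 400
## hypothetical data: a 2D surface in (x, z) plus linear effects of a and b
sim <- data.frame(x = runif(n), z = runif(n), a = runif(n), b = runif(n))
sim$y <- with(sim, sin(2 * pi * x) * cos(2 * pi * z) + 0.3 * a - 0.7 * b) +
  rnorm(n, sd = 0.2)

k_folds <- 5
## random fold label (1..k_folds) for every row
fold <- sample(rep(seq_len(k_folds), length.out = n))

sq_err <- numeric(n)
for (i in seq_len(k_folds)) {
  held_out <- fold == i
  ## fit on the other folds, with the formula written exactly as intended
  fit_i <- gam(y ~ s(x, z) + a + b, data = sim[!held_out, ], method = "REML")
  ## predict the held-out rows and store their squared errors
  pred <- as.numeric(predict(fit_i, newdata = sim[held_out, ]))
  sq_err[held_out] <- (sim$y[held_out] - pred)^2
}

cv_rmse <- sqrt(mean(sq_err))   # out-of-sample root mean squared error
cv_rmse
```

    For a binary outcome the same loop should carry over with family = binomial in the gam call, predict(..., type = "response") for probabilities, and a classification score in place of RMSE.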
