R gbm handling of missing values

前端未结

关注

 5  1772

刺人心 2020-12-29 14:24

Does anyone know how gbm in R handles missing values? I can\'t seem to find any explanation using google.

5条回答

庸人自扰 (楼主)

2020-12-29 14:33

Start with the source code then. Just typing gbm at the console shows you the source code:

function (formula = formula(data), distribution = "bernoulli", 
    data = list(), weights, var.monotone = NULL, n.trees = 100, 
    interaction.depth = 1, n.minobsinnode = 10, shrinkage = 0.001, 
    bag.fraction = 0.5, train.fraction = 1, cv.folds = 0, keep.data = TRUE, 
    verbose = TRUE) 
{
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "weights", "offset"), names(mf), 
        0)
    mf <- mf[c(1, m)]
    mf$drop.unused.levels <- TRUE
    mf$na.action <- na.pass
    mf[[1]] <- as.name("model.frame")
    mf <- eval(mf, parent.frame())
    Terms <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- model.weights(mf)
    offset <- model.offset(mf)
    var.names <- attributes(Terms)$term.labels
    x <- model.frame(terms(reformulate(var.names)), data, na.action = na.pass)
    response.name <- as.character(formula[[2]])
    if (is.character(distribution)) 
        distribution <- list(name = distribution)
    cv.error <- NULL
    if (cv.folds > 1) {
        if (distribution$name == "coxph") 
            i.train <- 1:floor(train.fraction * nrow(y))
        else i.train <- 1:floor(train.fraction * length(y))
        cv.group <- sample(rep(1:cv.folds, length = length(i.train)))
        cv.error <- rep(0, n.trees)
        for (i.cv in 1:cv.folds) {
            if (verbose) 
                cat("CV:", i.cv, "\n")
            i <- order(cv.group == i.cv)
            gbm.obj <- gbm.fit(x[i.train, , drop = FALSE][i, 
                , drop = FALSE], y[i.train][i], offset = offset[i.train][i], 
                distribution = distribution, w = ifelse(w == 
                  NULL, NULL, w[i.train][i]), var.monotone = var.monotone, 
                n.trees = n.trees, interaction.depth = interaction.depth, 
                n.minobsinnode = n.minobsinnode, shrinkage = shrinkage, 
                bag.fraction = bag.fraction, train.fraction = mean(cv.group != 
                  i.cv), keep.data = FALSE, verbose = verbose, 
                var.names = var.names, response.name = response.name)
            cv.error <- cv.error + gbm.obj$valid.error * sum(cv.group == 
                i.cv)
        }
        cv.error <- cv.error/length(i.train)
    }
    gbm.obj <- gbm.fit(x, y, offset = offset, distribution = distribution, 
        w = w, var.monotone = var.monotone, n.trees = n.trees, 
        interaction.depth = interaction.depth, n.minobsinnode = n.minobsinnode, 
        shrinkage = shrinkage, bag.fraction = bag.fraction, train.fraction = train.fraction, 
        keep.data = keep.data, verbose = verbose, var.names = var.names, 
        response.name = response.name)
    gbm.obj$Terms <- Terms
    gbm.obj$cv.error <- cv.error
    gbm.obj$cv.folds <- cv.folds
    return(gbm.obj)
}

A quick read suggests that the data is put into a model frame and that NA's are handled with na.pass so in turn, ?na.pass Reading that, it looks like it does nothing special with them, but you'd probably have to read up on the whole fitting process to see what that means in the long run. Looks like you might need to also look at the code of gbm.fit and so on.

0 讨论(0)

查看其它5个回答