Logistic regression: how to try every combination of predictors in R?

Asked by 既然无缘 on 2020-12-10 23:12

This is a duplicate of https://stats.stackexchange.com/questions/293988/logistic-regression-how-to-try-every-combination-of-predictors.

I want to perform a logistic regression in R, trying every possible combination of predictors.

2 Answers
  • 2020-12-10 23:33

    I am not sure about the value of this "educational exercise", but for the sake of the programming question, here is my approach:

    First, let's create some example predictor names. I use 5 predictors as in your example, but for 10, you would obviously need to replace 5 with 10.

    X = paste0("x",1:5)
    X
    [1] "x1" "x2" "x3" "x4" "x5"    
    

    Now, we can get the combinations with combn.

    For instance, for one variable at a time:

     t(combn(X,1))
         [,1]
    [1,] "x1"
    [2,] "x2"
    [3,] "x3"
    [4,] "x4"
    [5,] "x5"
    

    Two variables at a time:

    > t(combn(X,2))
          [,1] [,2]
     [1,] "x1" "x2"
     [2,] "x1" "x3"
     [3,] "x1" "x4"
     [4,] "x1" "x5"
     [5,] "x2" "x3"
     [6,] "x2" "x4"
     [7,] "x2" "x5"
     [8,] "x3" "x4"
     [9,] "x3" "x5"
    [10,] "x4" "x5"
    

    etc.
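
    As a sanity check: the total number of models is the number of non-empty subsets of the 5 predictors, i.e. 2^5 - 1 = 31 (which matches the 31 formulas we end up with below):

    sum(choose(5, 1:5))  # 5 + 10 + 10 + 5 + 1 = 31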

    We can use lapply to call combn successively with an increasing number of variables, and collect the results in a list. For instance, have a look at the output of lapply(1:5, function(n) t(combn(X,n))). To turn these combinations into formulas, we can use the following:

    out <- unlist(lapply(1:5, function(n) {
      # get combinations
      combinations <- t(combn(X,n))
      # collapse them into usable formulas:
      formulas <- apply(combinations, 1, 
                        function(row) paste0("y ~ ", paste0(row, collapse = "+")))}))
    

    Or equivalently using the FUN argument of combn (as pointed out by user20650):

    out <- unlist(lapply(1:5, function(n) combn(X, n, FUN=function(row) paste0("y ~ ", paste0(row, collapse = "+")))))
    

    This gives:

    out
     [1] "y ~ x1"             "y ~ x2"             "y ~ x3"             "y ~ x4"             "y ~ x5"            
     [6] "y ~ x1+x2"          "y ~ x1+x3"          "y ~ x1+x4"          "y ~ x1+x5"          "y ~ x2+x3"         
    [11] "y ~ x2+x4"          "y ~ x2+x5"          "y ~ x3+x4"          "y ~ x3+x5"          "y ~ x4+x5"         
    [16] "y ~ x1+x2+x3"       "y ~ x1+x2+x4"       "y ~ x1+x2+x5"       "y ~ x1+x3+x4"       "y ~ x1+x3+x5"      
    [21] "y ~ x1+x4+x5"       "y ~ x2+x3+x4"       "y ~ x2+x3+x5"       "y ~ x2+x4+x5"       "y ~ x3+x4+x5"      
    [26] "y ~ x1+x2+x3+x4"    "y ~ x1+x2+x3+x5"    "y ~ x1+x2+x4+x5"    "y ~ x1+x3+x4+x5"    "y ~ x2+x3+x4+x5"   
    [31] "y ~ x1+x2+x3+x4+x5"
    

    This can now be passed to your logistic regression function.
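
    For instance, here is a minimal sketch, assuming a data frame d with a 0/1 outcome y and the predictors x1, ..., x5 (d is a hypothetical placeholder, not an object from the question):

    # fit one logistic regression per formula string
    fits <- lapply(out, function(frml) glm(as.formula(frml), family = binomial, data = d))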


    Example:

    Let's use the mtcars dataset, with mpg as dependent variable.

    X = names(mtcars[,-1])
    X
     [1] "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
    

    Now, let's use the aforementioned function:

    out <- unlist(lapply(1:length(X), function(n) combn(X, n, FUN=function(row) paste0("mpg ~ ", paste0(row, collapse = "+")))))
    

    which gives us a vector of all combinations as formulas.
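
    Note that with 10 predictors this already amounts to 2^10 - 1 = 1023 models:

    length(out)  # 1023 non-empty subsets of 10 predictors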

    To run the corresponding models, we can do for instance

    mods = lapply(out, function(frml) lm(frml, data=mtcars))
    

    Since you want to capture specific statistics and order the models accordingly, I would use broom::glance. broom::tidy turns lm output into a data frame (useful if you want to compare coefficients, etc.), and broom::glance collects model-level statistics such as r-squared, sigma, the F-statistic, the log-likelihood, AIC, and BIC into a one-row data frame per model. For instance:

    library(broom)
    library(dplyr)
    tmp = bind_rows(lapply(out, function(frml) {
      a = glance(lm(frml, data=mtcars))
      a$frml = frml
      return(a)
    }))
    
    head(tmp)
      r.squared adj.r.squared    sigma statistic      p.value df    logLik      AIC      BIC deviance df.residual       frml
    1 0.7261800     0.7170527 3.205902 79.561028 6.112687e-10  2 -81.65321 169.3064 173.7036 308.3342          30  mpg ~ cyl
    2 0.7183433     0.7089548 3.251454 76.512660 9.380327e-10  2 -82.10469 170.2094 174.6066 317.1587          30 mpg ~ disp
    3 0.6024373     0.5891853 3.862962 45.459803 1.787835e-07  2 -87.61931 181.2386 185.6358 447.6743          30   mpg ~ hp
    4 0.4639952     0.4461283 4.485409 25.969645 1.776240e-05  2 -92.39996 190.7999 195.1971 603.5667          30 mpg ~ drat
    5 0.7528328     0.7445939 3.045882 91.375325 1.293959e-10  2 -80.01471 166.0294 170.4266 278.3219          30   mpg ~ wt
    6 0.1752963     0.1478062 5.563738  6.376702 1.708199e-02  2 -99.29406 204.5881 208.9853 928.6553          30 mpg ~ qsec
    

    which you can sort as you wish.
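
    For example, to rank the models by AIC (dplyr is already loaded above):

    tmp %>% arrange(AIC) %>% head()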

  • 2020-12-10 23:48

    There's a package that does this, MuMIn (multimodel inference), as part of a more principled multi-model approach (i.e. it doesn't just pick the best model(s) and ignore the fact that selection has been done):

    Set up data and full model:

    set.seed(101)
    d <- data.frame(replicate(5,rnorm(100)))
    d$y <- rbinom(100,size=1,prob=0.5)
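    # note: dredge() requires the global model to be fitted with na.action = na.fail,
    # and for an actual logistic regression you would add family = binomial here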
    full <- glm(y~.,data=d,na.action=na.fail)
    

    "dredge" the result:

    library(MuMIn)
    allfits <- dredge(full)
    

    results (the object also contains all fitted coefficients; with 5 predictors, dredge fits all 2^5 = 32 submodels, including the intercept-only model):

    head(allfits[,7:11])
    ##    df    logLik     AICc    delta     weight
    ## 3   3 -69.66403 145.5781 0.000000 0.15916685
    ## 11  4 -69.22909 146.8792 1.301191 0.08304293
    ## 19  4 -69.30856 147.0382 1.460123 0.07669921
    ## 7   4 -69.31233 147.0457 1.467655 0.07641093
    ## 4   4 -69.40589 147.2328 1.654775 0.06958615
    ## 1   2 -72.07662 148.2769 2.698896 0.04128523
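
    Since the point is multimodel inference rather than picking a single winner, you can also average over the plausible models; a short sketch (the delta < 4 cutoff is a common convention, not a requirement):

    avg <- model.avg(allfits, subset = delta < 4)  # models within 4 AICc units of the best
    summary(avg)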
    