Stepwise regression using p-values to drop variables with nonsignificant p-values

后端 未结 6 993
感动是毒
感动是毒 2020-11-29 16:46

I want to perform a stepwise linear Regression using p-values as a selection criterion, e.g.: at each step dropping variables that have the

6条回答
  •  情话喂你
    2020-11-29 17:35

    Show your boss the following :

    set.seed(100)
    x1 <- runif(100,0,1)
    x2 <- as.factor(sample(letters[1:3],100,replace=T))
    
    y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
    summary(lm(y~x1*x2))
    

    Which gives :

                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -0.1525     0.3066  -0.498  0.61995    
    x1            1.8693     0.6045   3.092  0.00261 ** 
    x2b           2.5149     0.4334   5.802 8.77e-08 ***
    x2c           0.3089     0.4475   0.690  0.49180    
    x1:x2b       -1.1239     0.8022  -1.401  0.16451    
    x1:x2c       -1.0497     0.7873  -1.333  0.18566    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    

    Now, based on the p-values you would exclude which one? x2 is most significant and most non-significant at the same time.


    Edit : To clarify : This exaxmple is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the T-test on the coefficients, but on the F-test after removal of one of the variables.

    I have a function that does exactly that. This is a selection on "the p-value", but not of the T-test on the coefficients or on the anova results. Well, feel free to use it if it looks useful to you.

    #####################################
    # Automated model selection
    # Author      : Joris Meys
    # version     : 0.2
    # date        : 12/01/09
    #####################################
    #CHANGE LOG
    # 0.2   : check for empty scopevar vector
    #####################################
    
    # Function has.interaction checks whether x is part of a term in terms
    # terms is a vector with names of terms from a model
    has.interaction <- function(x,terms){
        out <- sapply(terms,function(i){
            sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
        })
        return(sum(out)>0)
    }
    
    # Function Model.select
    # model is the lm object of the full model
    # keep is a list of model terms to keep in the model at all times
    # sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
    # verbose=T gives the F-tests, dropped var and resulting model after 
    model.select <- function(model,keep,sig=0.05,verbose=F){
          counter=1
          # check input
          if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm object\n"))
          # calculate scope for drop1 function
          terms <- attr(model$terms,"term.labels")
          if(missing(keep)){ # set scopevars to all terms
              scopevars <- terms
          } else{            # select the scopevars if keep is used
              index <- match(keep,terms)
              # check if all is specified correctly
              if(sum(is.na(index))>0){
                  novar <- keep[is.na(index)]
                  warning(paste(
                      c(novar,"cannot be found in the model",
                      "\nThese terms are ignored in the model selection."),
                      collapse=" "))
                  index <- as.vector(na.omit(index))
              }
              scopevars <- terms[-index]
          }
    
          # Backward model selection : 
    
          while(T){
              # extract the test statistics from drop.
              test <- drop1(model, scope=scopevars,test="F")
    
              if(verbose){
                  cat("-------------STEP ",counter,"-------------\n",
                  "The drop statistics : \n")
                  print(test)
              }
    
              pval <- test[,dim(test)[2]]
    
              names(pval) <- rownames(test)
              pval <- sort(pval,decreasing=T)
    
              if(sum(is.na(pval))>0) stop(paste("Model",
                  deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))
    
              # check if all significant
              if(pval[1]

提交回复
热议问题