How does glmnet's standardize argument handle dummy variables?

天命终不由人 2020-12-13 15:31

In my dataset I have a number of continuous and dummy variables. For analysis with glmnet, I want the continuous variables to be standardized but not the dummy variables.

2 Answers
  • 2020-12-13 16:13

    glmnet doesn't know anything about dummy variables, because it doesn't have a formula interface (and hence never touches model.frame or model.matrix). If you want the dummy columns to be treated specially, you'll have to do it yourself.
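
    For example, here is a minimal sketch of doing it yourself (the data and column names are made up for illustration): standardize the continuous columns before building the matrix, leave the 0/1 columns alone, and switch off glmnet's internal standardization.

        library(glmnet)

        # Toy data (hypothetical): two continuous predictors and one dummy
        set.seed(1)
        n  <- 100
        x1 <- rnorm(n, mean = 5,  sd = 2)
        x2 <- rnorm(n, mean = -1, sd = 10)
        d1 <- rbinom(n, 1, 0.4)                 # dummy variable, kept as 0/1
        y  <- 1 + 0.5 * x1 - 0.1 * x2 + 2 * d1 + rnorm(n)

        # Standardize only the continuous columns yourself...
        X <- cbind(x1 = as.numeric(scale(x1)),
                   x2 = as.numeric(scale(x2)),
                   d1 = d1)

        # ...and tell glmnet not to standardize anything internally
        fit <- glmnet(X, y, standardize = FALSE)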

  • 2020-12-13 16:20

    In short, yes: this will standardize the dummy variables too, but there's a reason for doing so. The glmnet function takes a matrix, not a data frame, as its x input, so it cannot make the distinction for the factor columns you would have had in a data.frame. If you take a look at the R source, glmnet codes the standardize argument internally as

        isd = as.integer(standardize)
    

    which converts the R logical to a 0 or 1 integer to feed to the internal FORTRAN routines (elnet, lognet, et al.).
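
    That coercion is ordinary base R behaviour:

        as.integer(TRUE)   # 1: FORTRAN standardizes every column
        as.integer(FALSE)  # 0: FORTRAN leaves the columns as supplied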

    If you go even further and examine the FORTRAN source (fixed-form - old school!), you'll see the following block:

              subroutine standard1 (no,ni,x,y,w,isd,intr,ju,xm,xs,ym,ys,xv,jerr)    989
              real x(no,ni),y(no),w(no),xm(ni),xs(ni),xv(ni)                        989
              integer ju(ni)                                                        990
              real, dimension (:), allocatable :: v                                     
              allocate(v(1:no),stat=jerr)                                           993
              if(jerr.ne.0) return                                                  994
              w=w/sum(w)                                                            994
              v=sqrt(w)                                                             995
              if(intr .ne. 0)goto 10651                                             995
              ym=0.0                                                                995
              y=v*y                                                                 996
              ys=sqrt(dot_product(y,y)-dot_product(v,y)**2)                         996
              y=y/ys                                                                997
        10660 do 10661 j=1,ni                                                       997
              if(ju(j).eq.0)goto 10661                                              997
              xm(j)=0.0                                                             997
              x(:,j)=v*x(:,j)                                                       998
              xv(j)=dot_product(x(:,j),x(:,j))                                      999
              if(isd .eq. 0)goto 10681                                              999
              xbq=dot_product(v,x(:,j))**2                                          999
              vc=xv(j)-xbq                                                         1000
              xs(j)=sqrt(vc)                                                       1000
              x(:,j)=x(:,j)/xs(j)                                                  1000
              xv(j)=1.0+xbq/vc                                                     1001
              goto 10691                                                           1002
    

    Take a look at the lines marked 1000: this is applying the standardization formula to each column of the x matrix, dividing x(:,j) by xs(j), the weighted standard deviation of that column.
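
    In R terms, a rough sketch of what that no-intercept branch does to one column, given observation weights w (the function name weighted_scale is made up for illustration):

        # Rough R translation of the FORTRAN block above: scale a column to
        # unit weighted variance without centering it (xm(j) = 0 here)
        weighted_scale <- function(x, w) {
          w  <- w / sum(w)        # w = w/sum(w)
          m2 <- sum(w * x^2)      # xv(j) = dot_product(x, x) after x = sqrt(w)*x
          mb <- sum(w * x)^2      # xbq = dot_product(v, x)**2
          xs <- sqrt(m2 - mb)     # xs(j) = sqrt(vc), the line marked 1000
          x / xs                  # x(:,j) = x(:,j)/xs(j), also line 1000
        }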

    Now, statistically speaking, one does not generally standardize categorical variables, in order to retain the interpretability of the estimated coefficients. However, as pointed out by Tibshirani here, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" - so while this introduces an arbitrary scaling between continuous and categorical variables, it is done so that all regressors receive equal penalization treatment.
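
    A quick way to see why the fairness argument matters (the data here are made up): without standardization, the L1 penalty on a coefficient depends on the units of its column, so a column on a large scale carries a small coefficient and is effectively penalized less.

        library(glmnet)
        set.seed(42)
        n <- 200
        x_cont  <- rnorm(n, sd = 100)      # continuous predictor, large scale
        x_dummy <- rbinom(n, 1, 0.5)       # 0/1 dummy
        y <- 0.01 * x_cont + x_dummy + rnorm(n)
        X <- cbind(x_cont, x_dummy)

        # With standardize = TRUE both columns are put on a common scale
        # before the penalty is applied; with FALSE, the penalty treats
        # one unit of x_cont and one unit of x_dummy as equivalent.
        coef(glmnet(X, y, standardize = TRUE),  s = 0.1)
        coef(glmnet(X, y, standardize = FALSE), s = 0.1)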
