R: lm() result differs when using `weights` argument and when using manually reweighted data

粉色の甜心 2020-11-30 06:34

To correct for heteroskedasticity in the error terms, I am running the following weighted least squares regression in R:

#Call:
#lm(formula = a ~ q + q2 +          


        
1 Answer
  • 2020-11-30 07:00

    Provided you do the manual weighting correctly, you won't see a discrepancy.
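
    As a short aside on why "correctly" means scaling by the square root of the weights: weighted least squares minimizes a weighted sum of squared residuals, and each weight can be moved inside the square. The notation here (x_i for the i-th row of the model matrix, non-negative weights w_i) is introduced for this sketch and is not from the original post:

    ```latex
    \sum_{i=1}^{n} w_i \left( y_i - x_i^\top \beta \right)^2
      = \sum_{i=1}^{n} \left( \sqrt{w_i}\, y_i - \sqrt{w_i}\, x_i^\top \beta \right)^2
    ```

    The right-hand side is an ordinary least squares criterion on the response and on every column of the model matrix, intercept included, each scaled by sqrt(w_i). Scaling by w_i itself, or leaving the intercept column unscaled, minimizes a different criterion.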

    So the correct way to go is:

    X <- model.matrix(~ q + q2 + b + c, mydata)  ## non-weighted model matrix (with intercept)
    w <- mydata$weighting  ## weights
    rw <- sqrt(w)    ## root weights
    y <- mydata$a    ## non-weighted response
    X_tilde <- rw * X    ## weighted model matrix (with intercept)
    y_tilde <- rw * y    ## weighted response
    
    ## remember to drop intercept when using formula
    fit_by_wls <- lm(y ~ X - 1, weights = w)
    fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
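
    The expression rw * X above relies on R recycling a vector down the columns of a matrix, so row i of X is multiplied by rw[i]. A minimal sketch with made-up numbers (the small matrix and weights are purely illustrative) shows this, and also why the automatic intercept has to be dropped: after scaling, the intercept column is no longer a column of ones.

    ```r
    # Made-up 3 x 2 model matrix; the first column plays the role of the intercept
    X  <- cbind(1, c(2, 3, 4))
    rw <- sqrt(c(1, 4, 9))        # root weights: 1, 2, 3

    # rw is recycled column by column, so each row i is scaled by rw[i]
    X_tilde <- rw * X
    X_tilde

    # The same scaling written as an explicit diagonal matrix product
    all.equal(X_tilde, diag(rw) %*% X)

    # The scaled "intercept" column now equals the root weights, not 1
    X_tilde[, 1]
    ```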
    

    When passing matrices directly, though, it is generally recommended to use lm.fit and lm.wfit:

    matfit_by_wls <- lm.wfit(X, y, w)
    matfit_by_ols <- lm.fit(X_tilde, y_tilde)
    

    However, these internal subroutines require all inputs to be complete cases with no NA; otherwise the underlying C routine stats:::C_Cdqrls will complain.
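
    One way to meet the complete-case requirement is to filter rows with complete.cases before building the matrices. The data frame d below is hypothetical and only serves to illustrate the idea; it is a sketch, not the only way to handle missing values.

    ```r
    # Hypothetical data with one missing response value
    d <- data.frame(y = c(1.2, NA, 3.1, 4.8, 5.1),
                    x = c(1, 2, 3, 4, 5),
                    w = c(1, 2, 1, 2, 1))

    # Keep only rows where y, x and w are all observed
    ok <- complete.cases(d)

    # Build the model matrix from the complete rows and fit with lm.wfit
    X   <- model.matrix(~ x, d[ok, ])
    fit <- lm.wfit(X, d$y[ok], d$w[ok])
    fit$coefficients

    # For comparison: lm() drops incomplete rows itself under the default na.action
    coef(lm(y ~ x, data = d, weights = w))
    ```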

    If you still want to use the formula interface rather than matrices, you can do the following:

    ## weight by square root of weights, not weights
    mydata$root.weighting <- sqrt(mydata$weighting)
    mydata$a.wls <- mydata$a * mydata$root.weighting
    mydata$q.wls <- mydata$q * mydata$root.weighting
    mydata$q2.wls <- mydata$q2 * mydata$root.weighting
    mydata$b.wls <- mydata$b * mydata$root.weighting
    mydata$c.wls <- mydata$c * mydata$root.weighting
    
    fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
    
    fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
                     data = mydata)
    

    Reproducible Example

    Let's use R's built-in data set trees. Use head(trees) to inspect this dataset. There is no NA in this dataset. We aim to fit a model:

    Height ~ Girth + Volume
    

    with some random weights between 1 and 2:

    set.seed(0); w <- runif(nrow(trees), 1, 2)
    

    We fit this model via weighted regression, either by passing weights to lm or by manually transforming the data and calling lm with no weights:

    X <- model.matrix(~ Girth + Volume, trees)  ## non-weighted model matrix (with intercept)
    rw <- sqrt(w)    ## root weights
    y <- trees$Height    ## non-weighted response
    X_tilde <- rw * X    ## weighted model matrix (with intercept)
    y_tilde <- rw * y    ## weighted response
    
    fit_by_wls <- lm(y ~ X - 1, weights = w)
    #Call:
    #lm(formula = y ~ X - 1, weights = w)
    
    #Coefficients:
    #X(Intercept)        XGirth       XVolume  
    #     83.2127       -1.8639        0.5843
    
    fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
    #Call:
    #lm(formula = y_tilde ~ X_tilde - 1)
    
    #Coefficients:
    #X_tilde(Intercept)        X_tildeGirth       X_tildeVolume  
    #           83.2127             -1.8639              0.5843
    

    So indeed, we see identical results.
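
    Rather than comparing printed output by eye, the agreement can be checked programmatically. The sketch below rebuilds the two fits from the example so that it runs on its own, then compares coefficients and standard errors with all.equal; unname is used because the two fits label their coefficients differently.

    ```r
    # Rebuild the two fits on the built-in trees data
    set.seed(0)
    w  <- runif(nrow(trees), 1, 2)
    rw <- sqrt(w)
    X  <- model.matrix(~ Girth + Volume, trees)
    y  <- trees$Height
    X_tilde <- rw * X
    y_tilde <- rw * y

    fit_by_wls <- lm(y ~ X - 1, weights = w)
    fit_by_ols <- lm(y_tilde ~ X_tilde - 1)

    # Coefficients should agree up to numerical tolerance
    all.equal(unname(coef(fit_by_wls)), unname(coef(fit_by_ols)))

    # Standard errors from summary() should agree as well
    se_wls <- coef(summary(fit_by_wls))[, "Std. Error"]
    se_ols <- coef(summary(fit_by_ols))[, "Std. Error"]
    all.equal(unname(se_wls), unname(se_ols))
    ```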

    Alternatively, we can use lm.fit and lm.wfit:

    matfit_by_wls <- lm.wfit(X, y, w)
    matfit_by_ols <- lm.fit(X_tilde, y_tilde)
    

    We can check coefficients by:

    matfit_by_wls$coefficients
    #(Intercept)       Girth      Volume 
    # 83.2127455  -1.8639351   0.5843191 
    
    matfit_by_ols$coefficients
    #(Intercept)       Girth      Volume 
    # 83.2127455  -1.8639351   0.5843191
    

    Again, results are the same.
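
    One thing that does differ between the two fits is the residuals: the weighted fit reports residuals on the original scale, while the manually transformed fit reports them on the scaled one. The sketch below, again self-contained on the trees data, suggests the two are related by the root weights.

    ```r
    # Rebuild the matrix-based fits on the built-in trees data
    set.seed(0)
    w  <- runif(nrow(trees), 1, 2)
    rw <- sqrt(w)
    X  <- model.matrix(~ Girth + Volume, trees)
    y  <- trees$Height

    matfit_by_wls <- lm.wfit(X, y, w)
    matfit_by_ols <- lm.fit(rw * X, rw * y)

    # Coefficients agree
    all.equal(matfit_by_wls$coefficients, matfit_by_ols$coefficients)

    # Residuals agree only after scaling the weighted fit's residuals by sqrt(w)
    all.equal(unname(rw * matfit_by_wls$residuals),
              unname(matfit_by_ols$residuals))
    ```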
