Using gather from tidyr changes my regression results

Posted by 泪湿孤枕 on 2019-12-04 18:37:07

The underlying reason for this unexpected change is that dplyr (not tidyr) changes the default method of the lag function. The gather function calls dplyr::select_vars, which loads the dplyr namespace, and loading that namespace overwrites lag.default.

The dynlm function calls lag internally whenever you use L in the formula, and S3 method dispatch then resolves that call to lag.default. Once the dplyr namespace is loaded (it does not even need to be attached), dispatch finds dplyr's lag.default instead of the one from stats.
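To see which lag.default your own session would dispatch to, you can ask R directly. This is a minimal sketch using only base R functions; the variable names (dplyr_loaded, m, method_home) are just illustrative. In a fresh session the method should come from stats; on the setup described here it would report dplyr once that namespace has been loaded.

```r
# Is the dplyr namespace loaded (attached or not)?
dplyr_loaded <- "dplyr" %in% loadedNamespaces()

# Look up the default method for lag, as seen from the current environment
m <- getS3method("lag", "default")

# Name of the namespace (or environment) that the method lives in
method_home <- environmentName(environment(m))

print(dplyr_loaded)
print(method_home)
```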

The two lag functions are fundamentally different. In a new R session, you will find the following difference:

lag(1:3, 1)
## [1] 1 2 3
## attr(,"tsp")
## [1] 0 2 1
invisible(dplyr::mutate) # side effect: loads dplyr via namespace...
lag(1:3, 1)
## [1] NA  1  2
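The same difference is easier to read on an actual ts object. The sketch below uses made-up values and a hypothetical helper, value_lag, that only imitates the "shift the values and pad with NA" behaviour seen in the second output; it is not dplyr's implementation. stats::lag leaves the values alone and moves the time index instead.

```r
x <- ts(c(10, 20, 30, 40), start = 2000)   # made-up annual series

# stats::lag changes the time base, not the values
shifted <- stats::lag(x, -1)
print(tsp(x))        # start, end, frequency of the original
print(tsp(shifted))  # same length, time index moved by one period
print(as.numeric(shifted))

# Hypothetical stand-in for a value-shifting lag (assumes 1 <= n < length(v))
value_lag <- function(v, n = 1) c(rep(NA, n), head(v, -n))
print(value_lag(c(10, 20, 30, 40)))
```

A function like dynlm that relies on the time index being shifted gets something quite different when the values are shifted and the index is left untouched.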

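So the solution is fairly simple: define lag.default yourself in the global environment and point it at the stats version. Method dispatch finds that copy before it falls back to the registered method.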

library(dynlm)
# Point dispatch back at the stats method
lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1952, End = 1993
## 
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
##     -0.05476       0.83870       0.01818       0.13928  

# Switch to the dplyr method to reproduce the problem
lag.default <- dplyr:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1951, End = 1993
## 
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
##     -0.05669       0.82128       0.17484            NA  

# Switch back to the stats method
lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1952, End = 1993
## 
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
##     -0.05476       0.83870       0.01818       0.13928  
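If you would rather not leave a lag.default sitting in your workspace, the override can be scoped to a single call. The following is only a sketch under that assumption: with_stats_lag is a hypothetical helper name, not part of dynlm or dplyr. It places the stats method in the global environment, evaluates the expression, and then restores whatever was there before.

```r
# Hypothetical helper: evaluate expr with stats' lag.default visible globally
with_stats_lag <- function(expr) {
  had_copy <- exists("lag.default", envir = globalenv(), inherits = FALSE)
  if (had_copy) {
    old <- get("lag.default", envir = globalenv(), inherits = FALSE)
  }
  assign("lag.default", stats:::lag.default, envir = globalenv())
  on.exit({
    if (had_copy) {
      assign("lag.default", old, envir = globalenv())   # put the old copy back
    } else {
      rm("lag.default", envir = globalenv())            # leave no trace
    }
  }, add = TRUE)
  expr  # the lazily evaluated argument is forced here, after the override
}

res <- with_stats_lag(lag(1:3, 1))
print(attr(res, "tsp"))
```

With the model from the question, the call would then look like with_stats_lag(dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)).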

When I ran your first block of code in R 3.1.3 (dynlm version 0.3-3), I got the results you show as your second set:

Time series regression with "ts" data:
Start = 1951, End = 1993

Call:
dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)) + 
    log(L(X, 3)) + log(L(X, 4)) + log(L(X, 5)), data = data_ts)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.030753 -0.006364  0.001321  0.007939  0.025982 

Coefficients: (4 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.05669    0.07546  -0.751    0.457    
log(X)        0.82128    0.13486   6.090 3.53e-07 ***
log(L(X))     0.17484    0.13365   1.308    0.198    
log(L(X, 2))       NA         NA      NA       NA    
log(L(X, 3))       NA         NA      NA       NA    
log(L(X, 4))       NA         NA      NA       NA    
log(L(X, 5))       NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01419 on 40 degrees of freedom
Multiple R-squared:  0.9974,    Adjusted R-squared:  0.9972 
F-statistic:  7578 on 2 and 40 DF,  p-value: < 2.2e-16

However, after I updated to R 3.2.0, I first got a repeat of your initial result, and then every later run settled back into giving the second result.

From your later comments, you are now also getting the second result consistently. So I think either there was a typo in the code at some point, or something is different the first time this runs in an empty environment.

To test that hypothesis, I closed RStudio completely, restarted it, and ran the first code block. In that case I got your initial result again.

So I think the answer to your question has to be that something odd is going on in the environment.

I read the documentation for dynlm, and there are a few places where the defaults, if they come into play, could cause differences. For example, it takes the variables from the environment if no data is specified, and it accepts either a time series object or a data frame. In your case you have both in the environment (data and data_ts).

Notice that the summary output above says Time series regression with "ts" data:, which means it is running on a ts object. When I get the other result (without the NAs, your first result), it says Time series regression with "numeric" data:, and in that case it is running on data, which is a data frame, or on X and Y directly. So I think that must be the source of the difference. I'm not sure exactly why that would have happened with data_ts named explicitly, though.
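A quick way to rule this out is to inspect the objects before fitting. The sketch below uses made-up numbers, since the real data and data_ts from the question are not shown here; df and df_ts are hypothetical stand-ins for them.

```r
# Made-up values standing in for the question's data
df <- data.frame(X = c(1.0, 1.2, 1.5, 1.9),
                 Y = c(2.0, 2.3, 2.9, 3.6))
df_ts <- ts(df, start = 1951)   # annual series starting in 1951

print(is.ts(df))      # a plain data frame is not a ts object
print(is.ts(df_ts))   # the converted object is
print(start(df_ts))   # first time point
print(end(df_ts))     # last time point
```

If the object passed as data is a ts object and the header still changes between runs, the data argument is unlikely to be the cause.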
