Double clustered standard errors for panel data


I have a panel data set in R (time and cross section) and would like to compute standard errors that are clustered by two dimensions, because my residuals are correlated bot


4 Answers

Happy的楠姐

2020-12-09 00:49

For panel regressions, the plm package can estimate clustered SEs along two dimensions.

Using M. Petersen’s benchmark results:

library(foreign)  # read.dta() for Stata files
library(plm)      # plm(), vcovHC() for panel models
library(lmtest)   # coeftest()
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
    vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - 
        vcovHC(x, method="white1", ...)
}

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
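For reference, the identity that the vcovDC helper above encodes can be written out as follows. This is a sketch of the formula as used in the code, with symbol names chosen here for illustration: the two one-way clustered matrices are added, and the heteroskedasticity-robust (White) matrix is subtracted once, because each observation's own contribution is counted in both clustered terms.

```latex
\widehat{V}_{\text{group,time}}
  \;=\; \widehat{V}_{\text{group}}
  \;+\; \widehat{V}_{\text{time}}
  \;-\; \widehat{V}_{\text{white}}
```

Here $\widehat{V}_{\text{group}}$ and $\widehat{V}_{\text{time}}$ correspond to `vcovHC(x, cluster="group")` and `vcovHC(x, cluster="time")`, and $\widehat{V}_{\text{white}}$ to `vcovHC(x, method="white1")`.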

You can now obtain clustered SEs:

##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.022189  1.3376   0.1811    
x           1.034833   0.031679 32.6666   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.064580  0.4596   0.6458    
x           1.034833   0.052465 19.7243   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

For more details see:

Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R.

However, the above works only if your data can be coerced to a pdata.frame. It will fail if you have "duplicate couples (time-id)". In this case you can still cluster, but only along one dimension.
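A quick way to see whether a data set has such duplicate pairs is to check the two index columns directly. The snippet below is a minimal sketch in base R on a made-up toy data frame; the column names firmid and year mirror the Petersen data, and the values are invented for illustration.

```r
# Toy panel with one duplicated (firmid, year) pair; all values are made up.
toy <- data.frame(
  firmid = c(1, 1, 2, 2, 2),
  year   = c(1, 2, 1, 2, 2),
  y      = c(0.5, 1.2, -0.3, 0.8, 0.9)
)

# duplicated() on a data frame flags rows that repeat an earlier row,
# here restricted to the two index columns.
dup <- duplicated(toy[, c("firmid", "year")])

any(dup)    # is there at least one repeated (firmid, year) pair?
toy[dup, ]  # show the offending rows
```

If any(dup) is TRUE, the two-index call to plm() is the one expected to fail, and the one-index workaround applies.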

Trick plm into thinking that you have a proper panel data set by specifying only one index:

fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))

You can now obtain clustered SEs:

##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You can also use this workaround to cluster by a higher dimension or at a higher level (e.g. industry or country). However, in that case you won't be able to use the group (or time) effects, which is the main limitation of the approach.

Another approach that works for both panel and other types of data is the multiwayvcov package. It allows double clustering, but also clustering at higher dimensions. As per the package's website, it is an improvement upon Arai's code:

Transparent handling of observations dropped due to missingness

Full multi-way (or n-way, or n-dimensional, or multi-dimensional) clustering

Using the Petersen data and cluster.vcov():

library("lmtest")
library("multiwayvcov")

data(petersen)
m1 <- lm(y ~ x, data = petersen)

coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.029680   0.065066  0.4561   0.6483    
## x           1.034833   0.053561 19.3206   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


[愿得一人]

2020-12-09 00:51

Arai's function can be used for clustering standard errors. He has another version for clustering in multiple dimensions:

mcl <- function(dat,fm, cluster1, cluster2){
          attach(dat, warn.conflicts = F)
          library(sandwich);library(lmtest)
          # Use a separator so that e.g. ("1","12") and ("11","2") stay distinct
          cluster12 = paste(cluster1,cluster2, sep="_")
          M1  <- length(unique(cluster1))
          M2  <- length(unique(cluster2))   
          M12 <- length(unique(cluster12))
          N   <- length(cluster1)          
          K   <- fm$rank             
          dfc1  <- (M1/(M1-1))*((N-1)/(N-K))  
          dfc2  <- (M2/(M2-1))*((N-1)/(N-K))  
          dfc12 <- (M12/(M12-1))*((N-1)/(N-K))  
          u1j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster1,  sum)) 
          u2j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster2,  sum)) 
          u12j  <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum)) 
          vc1   <-  dfc1*sandwich(fm, meat=crossprod(u1j)/N )
          vc2   <-  dfc2*sandwich(fm, meat=crossprod(u2j)/N )
          vc12  <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
          vcovMCL <- vc1 + vc2 - vc12
          coeftest(fm, vcovMCL)}

For references and usage example see:

Clustered Standard Errors in R


抹茶落季

2020-12-09 00:52

Frank Harrell's package rms (which used to be named Design) has a function that I use often when clustering: robcov.

See this part of ?robcov, for example.

cluster: a variable indicating groupings. ‘cluster’ may be any type of
      vector (factor, character, integer).  NAs are not allowed.
      Unique values of ‘cluster’ indicate possibly correlated
      groupings of observations. Note the data used in the fit and
      stored in ‘fit$x’ and ‘fit$y’ may have had observations
      containing missing values deleted. It is assumed that if any
      NAs were removed during the original model fitting, an
      ‘naresid’ function exists to restore NAs so that the rows of
      the score matrix coincide with ‘cluster’. If ‘cluster’ is
      omitted, it defaults to the integers 1,2,...,n to obtain the
      "sandwich" robust covariance matrix estimate.
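As a rough illustration of how robcov is typically called, here is a minimal sketch on simulated data. The data frame d, its columns, and the object names are made up for this example. The fit is created with x = TRUE and y = TRUE so that the design matrix and response are stored in the fit, as the excerpt above assumes. Note that the excerpt describes a single grouping variable, so this is one-way clustering.

```r
library(rms)

# Simulated panel: 50 firms with 10 observations each (illustrative only).
set.seed(1)
d <- data.frame(firmid = rep(1:50, each = 10))
d$x <- rnorm(nrow(d))
# A firm-level shock induces within-firm correlation in the errors.
d$y <- 1 + d$x + rnorm(50)[d$firmid] + rnorm(nrow(d))

# x = TRUE, y = TRUE store the design matrix and response in the fit.
fit <- ols(y ~ x, data = d, x = TRUE, y = TRUE)

# Replace the model-based covariance matrix with the cluster-robust one.
fit_cl <- robcov(fit, cluster = d$firmid)

fit_cl                    # coefficient table now uses clustered SEs
sqrt(diag(vcov(fit_cl)))  # the clustered standard errors on their own
```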


面向向阳花

2020-12-09 01:00

This is an old question. But seeing as people still appear to be landing on it, I thought I'd provide some modern approaches to multiway clustering in R:

Option 1 (fastest): `fixest::feols()`

library(fixest)

nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")

est_feols = feols(ln_wage ~ age | race + year, data = nlswork)

## SEs will automatically be clustered by the first FE (i.e. race) in the above model
est_feols

## But we can instantaneously compute other SEs on the fly with summary.fixest()
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'white') ## robust SEs
summary(est_feols, se = 'twoway') ## twoway clustering
summary(est_feols, cluster = c('race', 'year')) ## same as the above
summary(est_feols, cluster = c('race', 'year', 'idcode'))  ## add third cluster var (not in original model call)

Option 2 (fast): `lfe::felm()`

library(lfe)

## Unlike fixest::feols, here we specify the clusters in the actual model call.
## (Note the third "| 0 " slot means we're not using IV) 

est_felm = felm(ln_wage ~ age | race + year | 0 | race + year + idcode, data = nlswork)
summary(est_felm)

Option 3 (slower, but flexible): sandwich

library(sandwich)
library(lmtest)


est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork) 
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)

Benchmark

Aaaand, just to belabour the point about speed, here's a benchmark of the three approaches (each using two fixed effects and two-way clustering).

est_feols = function() {summary(feols(ln_wage ~ age | race + year, data = nlswork), 
                               cluster = c('race', 'year'))} 
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork), 
                                     vcov = vcovCL, cluster = ~ race + year)}

library(microbenchmark)

microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)

#> Unit: milliseconds
#>             expr       min        lq      mean    median        uq       max neval cld
#>      est_feols()  10.40799  10.54351  11.71474  10.67902  12.36811  14.05719     3  a 
#>       est_felm()  99.56899 108.89241 112.55856 118.21584 119.05334 119.89085     3  a 
#>  est_standwich() 190.30892 198.92584 245.12421 207.54276 272.53185 337.52095     3   b
