I have a panel data set in R (time and cross section) and would like to compute standard errors that are clustered by two dimensions, because my residuals are correlated bot
For panel regressions, the plm
package can estimate clustered SEs along two dimensions.
Using M. Petersen’s benchmark results:
require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) -
vcovHC(x, method="white1", ...)
}
fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
So that now you can obtain clustered SEs:
##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.022189 1.3376 0.1811
x 1.034833 0.031679 32.6666 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.064580 0.4596 0.6458
x 1.034833 0.052465 19.7243 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For more details see:
However the above works only if your data can be coerced to a pdata.frame
. It will fail if you have "duplicate couples (time-id)"
. In this case you can still cluster, but only along one dimension.
Trick plm
into thinking that you have a proper panel data set by specifying only one index:
fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))
So that now you can obtain clustered SEs:
##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You can also use this workaround to cluster by a higher dimension or at a higher level (e.g. industry
or country
). However in that case you won't be able to use the group
(or time
) effects
, which is the main limit of the approach.
Another approach that works for both panel and other types of data is the multiwayvcov
package. It allows double clustering, but also clustering at higher dimensions. As per the packages's website, it is an improvement upon Arai's code:
- Transparent handling of observations dropped due to missingness
- Full multi-way (or n-way, or n-dimensional, or multi-dimensional) clustering
Using the Petersen data and cluster.vcov()
:
library("lmtest")
library("multiwayvcov")
data(petersen)
m1 <- lm(y ~ x, data = petersen)
coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.029680 0.065066 0.4561 0.6483
## x 1.034833 0.053561 19.3206 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Arai's function can be used for clustering standard-errors. He has another version for clustering in multiple dimensions:
mcl <- function(dat,fm, cluster1, cluster2){
attach(dat, warn.conflicts = F)
library(sandwich);library(lmtest)
cluster12 = paste(cluster1,cluster2, sep="")
M1 <- length(unique(cluster1))
M2 <- length(unique(cluster2))
M12 <- length(unique(cluster12))
N <- length(cluster1)
K <- fm$rank
dfc1 <- (M1/(M1-1))*((N-1)/(N-K))
dfc2 <- (M2/(M2-1))*((N-1)/(N-K))
dfc12 <- (M12/(M12-1))*((N-1)/(N-K))
u1j <- apply(estfun(fm), 2, function(x) tapply(x, cluster1, sum))
u2j <- apply(estfun(fm), 2, function(x) tapply(x, cluster2, sum))
u12j <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum))
vc1 <- dfc1*sandwich(fm, meat=crossprod(u1j)/N )
vc2 <- dfc2*sandwich(fm, meat=crossprod(u2j)/N )
vc12 <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
vcovMCL <- vc1 + vc2 - vc12
coeftest(fm, vcovMCL)}
For references and usage example see:
Frank Harrell's package rms
(which used to be named Design
) has a function that I use often when clustering: robcov
.
See this part of ?robcov
, for example.
cluster: a variable indicating groupings. ‘cluster’ may be any type of
vector (factor, character, integer). NAs are not allowed.
Unique values of ‘cluster’ indicate possibly correlated
groupings of observations. Note the data used in the fit and
stored in ‘fit$x’ and ‘fit$y’ may have had observations
containing missing values deleted. It is assumed that if any
NAs were removed during the original model fitting, an
‘naresid’ function exists to restore NAs so that the rows of
the score matrix coincide with ‘cluster’. If ‘cluster’ is
omitted, it defaults to the integers 1,2,...,n to obtain the
"sandwich" robust covariance matrix estimate.
This is an old question. But seeing as people still appear to be landing on it, I thought I'd provide some modern approaches to multiway clustering in R:
fixest::feols()
library(fixest)
nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")
est_feols = feols(ln_wage ~ age | race + year, data = nlswork)
## SEs will automatically be clustered by the first FE (i.e. race) in the above model
est_feols
## But we can instantaneously compute other SEs on the fly with summary.fixest()
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'white') ## robust SEs
summary(est_feols, se = 'twoway') ## twoway clustering
summary(est_feols, cluster = c('race', 'year')) ## same as the above
summary(est_feols, cluster = c('race', 'year', 'idcode')) ## add third cluster var (not in original model call)
lfe::felm()
library(lfe)
## Unlike fixest::feols, here we specify the clusters in the actual model call.
## (Note the third "| 0 " slot means we're not using IV)
est_felm = felm(ln_wage ~ age | race + year | 0 | race + year + idcode, data = nlswork)
summary(est_felm)
library(sandwich)
library(lmtest)
est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork)
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)
Aaaand, just to belabour the point about speed. Here's a benchmark of the three different approaches (using two fixed FEs and twoway clustering).
est_feols = function() {summary(feols(ln_wage ~ age | race + year, data = nlswork),
cluster = c('race', 'year'))}
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork),
vcov = vcovCL, cluster = ~ race + year)}
microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> est_feols() 10.40799 10.54351 11.71474 10.67902 12.36811 14.05719 3 a
#> est_felm() 99.56899 108.89241 112.55856 118.21584 119.05334 119.89085 3 a
#> est_standwich() 190.30892 198.92584 245.12421 207.54276 272.53185 337.52095 3 b