Double clustered standard errors for panel data

小鲜肉 2020-12-09 00:33

I have a panel data set in R (time and cross section) and would like to compute standard errors that are clustered by two dimensions, because my residuals are correlated both…

4 answers
  • 2020-12-09 00:49

    For panel regressions, the plm package can estimate clustered SEs along two dimensions.

    Using M. Petersen’s benchmark results:

    require(foreign)
    require(plm)
    require(lmtest)
    test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
    
    ##Double-clustering formula (Thompson, 2011)
    vcovDC <- function(x, ...){
        vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - 
            vcovHC(x, method="white1", ...)
    }
    
    fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
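
    To make the formula concrete, here is a minimal base-R sketch of the same decomposition computed by hand on simulated data. The helper cluster_vcov(), the simulated data frame sim and all of its parameters are made up for illustration and are not part of plm. The sketch applies no finite-sample correction, so its numbers would not match type="HC1" output exactly.

    ```r
    # Hypothetical helper (not from plm): cluster-robust sandwich covariance
    # for an lm fit, with no finite-sample correction.
    cluster_vcov <- function(fit, cluster) {
      X <- model.matrix(fit)
      u <- residuals(fit)
      scores <- rowsum(X * u, group = cluster)  # sum of x_i * u_i within each cluster
      bread <- solve(crossprod(X))              # (X'X)^{-1}
      bread %*% crossprod(scores) %*% bread
    }

    # Simulated balanced panel (assumed sizes): 50 firms observed over 10 years
    set.seed(1)
    firms <- 50
    years <- 10
    sim <- expand.grid(firmid = seq_len(firms), year = seq_len(years))
    sim$x <- rnorm(firms)[sim$firmid] + rnorm(nrow(sim))
    err   <- rnorm(firms)[sim$firmid] + rnorm(years)[sim$year] + rnorm(nrow(sim))
    sim$y <- 1 + sim$x + err

    fit <- lm(y ~ x, data = sim)

    V_firm  <- cluster_vcov(fit, sim$firmid)
    V_year  <- cluster_vcov(fit, sim$year)
    V_white <- cluster_vcov(fit, seq_len(nrow(sim)))  # every observation is its own cluster

    # Double clustering: firm + time - White
    V_dc <- V_firm + V_year - V_white
    sqrt(diag(V_dc))
    ```

    In a balanced panel every firm-year pair is unique, so the subtracted term reduces to the plain heteroskedasticity-robust matrix, which corresponds to the method="white1" term in vcovDC.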
    

    So that now you can obtain clustered SEs:

    ##Clustered by *group*
    > coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
    
    t test of coefficients:
    
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 0.029680   0.066952  0.4433   0.6576    
    x           1.034833   0.050550 20.4714   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    ##Clustered by *time*
    > coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))
    
    t test of coefficients:
    
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 0.029680   0.022189  1.3376   0.1811    
    x           1.034833   0.031679 32.6666   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    ##Clustered by *group* and *time*
    > coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))
    
    t test of coefficients:
    
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 0.029680   0.064580  0.4596   0.6458    
    x           1.034833   0.052465 19.7243   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    

    For more details see:

    • Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R.

    However, the above works only if your data can be coerced to a pdata.frame. It will fail if you have "duplicate couples (time-id)", i.e. more than one observation for the same id and time period. In this case you can still cluster, but only along one dimension.

    Trick plm into thinking that you have a proper panel data set by specifying only one index:

    fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))
    

    So that now you can obtain clustered SEs:

    ##Clustered by *group*
    > coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
    
    t test of coefficients:
    
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 0.029680   0.066952  0.4433   0.6576    
    x           1.034833   0.050550 20.4714   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    

    You can also use this workaround to cluster by a higher dimension or at a higher level (e.g. industry or country). However, in that case you won't be able to use the group (or time) effects, which is the main limitation of the approach.


    Another approach that works for both panel and other types of data is the multiwayvcov package. It allows double clustering, but also clustering at higher dimensions. As per the package's website, it is an improvement upon Arai's code:

    • Transparent handling of observations dropped due to missingness
    • Full multi-way (or n-way, or n-dimensional, or multi-dimensional) clustering

    Using the Petersen data and cluster.vcov():

    library("lmtest")
    library("multiwayvcov")
    
    data(petersen)
    m1 <- lm(y ~ x, data = petersen)
    
    coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
    ## 
    ## t test of coefficients:
    ## 
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 0.029680   0.065066  0.4561   0.6483    
    ## x           1.034833   0.053561 19.3206   <2e-16 ***
    ## ---
    ## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
  • 2020-12-09 00:51

    Arai's function can be used for clustering standard-errors. He has another version for clustering in multiple dimensions:

    mcl <- function(dat, fm, cluster1, cluster2){
              ## cluster1 and cluster2 are unquoted column names of dat, found via attach()
              attach(dat, warn.conflicts = FALSE)
              on.exit(detach(dat))   # undo the attach when the function exits
              library(sandwich); library(lmtest)
              ## the separator avoids collisions such as (1, 11) vs (11, 1)
              cluster12 = paste(cluster1, cluster2, sep = "_")
              M1  <- length(unique(cluster1))
              M2  <- length(unique(cluster2))
              M12 <- length(unique(cluster12))
              N   <- length(cluster1)
              K   <- fm$rank
              ## finite-sample corrections for each clustering dimension
              dfc1  <- (M1/(M1-1))*((N-1)/(N-K))
              dfc2  <- (M2/(M2-1))*((N-1)/(N-K))
              dfc12 <- (M12/(M12-1))*((N-1)/(N-K))
              ## score contributions summed within each cluster
              u1j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster1,  sum))
              u2j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster2,  sum))
              u12j  <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum))
              vc1   <-  dfc1*sandwich(fm, meat=crossprod(u1j)/N )
              vc2   <-  dfc2*sandwich(fm, meat=crossprod(u2j)/N )
              vc12  <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
              vcovMCL <- vc1 + vc2 - vc12
              coeftest(fm, vcovMCL)}
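
    Written out, the function appears to compute the following. This is a reading of the code above, not a quotation from Arai's note, and the symbols S_j, B and c_g are notation introduced here for illustration. The index g runs over the first cluster variable, the second cluster variable, and their intersection (g = 1, 2, 12); M_g is the number of clusters in dimension g, N the number of observations and K the rank of the model.

    ```latex
    \hat{V}_g = c_g \, B \Big( \sum_{j=1}^{M_g} S_j S_j^{\top} \Big) B,
    \qquad S_j = \sum_{i \in j} x_i \hat{u}_i,
    \qquad B = (X^{\top} X)^{-1},
    \qquad c_g = \frac{M_g}{M_g - 1} \cdot \frac{N - 1}{N - K}

    \hat{V}_{\mathrm{MCL}} = \hat{V}_{1} + \hat{V}_{2} - \hat{V}_{12}
    ```

    Here x_i \hat{u}_i is the row of estfun(fm) for observation i, and the factors of N in the code cancel against the scaling that sandwich() applies to the bread.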
    

    For references and usage example see:

    • Clustered Standard Errors in R
  • 2020-12-09 00:52

    Frank Harrell's package rms (which used to be named Design) has a function that I use often when clustering: robcov.

    See this part of ?robcov, for example.

    cluster: a variable indicating groupings. ‘cluster’ may be any type of
          vector (factor, character, integer).  NAs are not allowed.
          Unique values of ‘cluster’ indicate possibly correlated
          groupings of observations. Note the data used in the fit and
          stored in ‘fit$x’ and ‘fit$y’ may have had observations
          containing missing values deleted. It is assumed that if any
          NAs were removed during the original model fitting, an
          ‘naresid’ function exists to restore NAs so that the rows of
          the score matrix coincide with ‘cluster’. If ‘cluster’ is
          omitted, it defaults to the integers 1,2,...,n to obtain the
          "sandwich" robust covariance matrix estimate.
    
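
    A minimal usage sketch, assuming the rms package is installed. The data frame d and its columns are simulated purely for illustration. Going by the help text above, robcov() takes a single cluster variable, so this clusters along one dimension only (here, by firm).

    ```r
    library(rms)

    # Simulated panel (made-up data): 30 firms observed over 5 years
    set.seed(42)
    d <- expand.grid(firmid = 1:30, year = 1:5)
    d$x <- rnorm(nrow(d))
    d$y <- 1 + 0.5 * d$x + rnorm(30)[d$firmid] + rnorm(nrow(d))

    # robcov() needs the design matrix and response stored in the fit,
    # hence x = TRUE and y = TRUE
    fit    <- ols(y ~ x, data = d, x = TRUE, y = TRUE)
    fit_cl <- robcov(fit, cluster = d$firmid)

    sqrt(diag(vcov(fit_cl)))   # standard errors clustered by firm
    ```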
  • 2020-12-09 01:00

    This is an old question. But seeing as people still appear to be landing on it, I thought I'd provide some modern approaches to multiway clustering in R:

    Option 1 (fastest): fixest::feols()

    library(fixest)
    
    nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")
    
    est_feols = feols(ln_wage ~ age | race + year, data = nlswork)
    
    ## SEs will automatically be clustered by the first FE (i.e. race) in the above model
    est_feols
    
    ## But we can instantaneously compute other SEs on the fly with summary.fixest()
    summary(est_feols, se = 'standard') ## vanilla SEs
    summary(est_feols, se = 'white') ## robust SEs
    summary(est_feols, se = 'twoway') ## twoway clustering
    summary(est_feols, cluster = c('race', 'year')) ## same as the above
    summary(est_feols, cluster = c('race', 'year', 'idcode'))  ## add third cluster var (not in original model call)
    

    Option 2 (fast): lfe::felm()

    library(lfe)
    
    ## Unlike fixest::feols, here we specify the clusters in the actual model call.
    ## (Note the third "| 0 " slot means we're not using IV) 
    
    est_felm = felm(ln_wage ~ age | race + year | 0 | race + year + idcode, data = nlswork)
    summary(est_felm)
    

    Option 3 (slower, but flexible): sandwich

    library(sandwich)
    library(lmtest)
    
    
    est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork) 
    coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)
    

    Benchmark

    Aaaand, just to belabour the point about speed, here's a benchmark of the three approaches (using two fixed effects and two-way clustering).

    library(microbenchmark)

    est_feols = function() {summary(feols(ln_wage ~ age | race + year, data = nlswork), 
                                   cluster = c('race', 'year'))} 
    est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
    est_sandwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork), 
                                        vcov = vcovCL, cluster = ~ race + year)}
    
    microbenchmark(est_feols(), est_felm(), est_sandwich(), times = 3)
    
    #> Unit: milliseconds
    #>            expr       min        lq      mean    median        uq       max neval cld
    #>     est_feols()  10.40799  10.54351  11.71474  10.67902  12.36811  14.05719     3  a 
    #>      est_felm()  99.56899 108.89241 112.55856 118.21584 119.05334 119.89085     3  a 
    #>  est_sandwich() 190.30892 198.92584 245.12421 207.54276 272.53185 337.52095     3   b
    