Congrats! You’ve managed to make a linear model in R, but the immediate problem we economists face is that we want robust standard errors. Base R doesn’t provide these out of the box, so we have to use a package that will compute robust standard errors for us. Luckily, this is quite easy to do, and we’ll run through it here in this article.
sandwich does it all
The sandwich R package is a Swiss army knife that can do a lot for you when it comes to robust standard errors. It’s called “sandwich” because most robust variance estimators have a form that looks like a sandwich: that’s the \(Q_{XX}^{-1}Q_{u^2XX}Q_{XX}^{-1}\) form you may be familiar with. Either way, I think it’s a cute name.
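Concretely, the classic heteroskedasticity-consistent (“HC0”) estimator sandwiches a residual-weighted “meat” term between two “bread” slices:
\[\widehat{V}(\hat{\beta}) = (X'X)^{-1}\left(\sum_{i=1}^{n}\hat{u}_i^2\, x_i x_i'\right)(X'X)^{-1},\]
where the \(\hat{u}_i\) are the OLS residuals. The various robust estimators mostly differ in how the meat is weighted; the bread stays the same.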
Let’s recreate the model that we used in the intro article.
library(tidyverse)
library(haven)

# HTV.DTA ships with the course package; regress wage on education,
# ability, and their interaction
htv <- read_dta(system.file("HTV.DTA", package = "metrics.in.r"))
model <- lm(wage ~ educ * abil, data = htv)
Now we load the sandwich package.
library(sandwich)
There are a few different functions in sandwich, which correspond to the different types of robust variance estimators: vcovHC for the classic “White’s” (heteroskedasticity-consistent) standard errors; vcovCL for clustered standard errors; and vcovBS for bootstrapped standard errors. Making a covariance matrix of our \(\beta\) estimates for a given regression model is as easy as calling vcovHC on the model. The type argument is optional: there are different types of “corrections” for heteroskedasticity-robust standard errors, and “HC1” is the correction that Stata uses by default.
vcovHC(model, type = "HC1")
## (Intercept) educ abil educ:abil
## (Intercept) 3.47283828 -0.305897247 -0.24542291 0.030789574
## educ -0.30589725 0.027454256 0.02829398 -0.003324115
## abil -0.24542291 0.028293984 0.43706077 -0.035738635
## educ:abil 0.03078957 -0.003324115 -0.03573864 0.003002688
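To see the sandwich form in action, here’s a minimal sketch that builds the HC1 matrix by hand and checks it against vcovHC. Nothing here is needed for day-to-day work; it just shows there’s no magic involved.
# Bread: (X'X)^(-1). Meat: X' diag(u^2) X, built from the OLS residuals.
X <- model.matrix(model)
u <- residuals(model)
bread <- solve(t(X) %*% X)
meat <- t(X) %*% (u^2 * X)

# HC1 rescales the basic (HC0) sandwich by n / (n - k)
n <- nrow(X); k <- ncol(X)
hc1 <- (bread %*% meat %*% bread) * n / (n - k)
all.equal(hc1, vcovHC(model, type = "HC1"))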
From now on, I’ll omit the type argument to keep my R code cleaner (without it, vcovHC defaults to the “HC3” correction, which is why the numbers below differ slightly from the HC1 matrix above). Once we have the robust covariance matrix, we can get the robust standard errors of our \(\beta\) estimates by taking the square root of the diagonal:
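sqrt(diag(vcovHC(model)))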
## (Intercept) educ abil educ:abil
## 1.87304779 0.16649606 0.66783405 0.05534231
Another way to do this is with the pipe operator:
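model %>% vcovHC %>% diag %>% sqrt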
I won’t explain the pipe operator now; I’d rather you have an “immersive” introduction for the moment, and I’ll cover it in more depth later. Just notice that it sort of “unwraps” the nested function calls and reverses their order.
But what about the pretty tables?
Recall that in the intro article we could do this with our model and get a nice summary table:
summary(model)
##
## Call:
## lm(formula = wage ~ educ * abil, data = htv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.943 -4.602 -1.184 2.420 69.450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.65484 1.97210 -0.332 0.740
## educ 0.98368 0.16900 5.821 7.48e-09 ***
## abil -0.40005 0.56506 -0.708 0.479
## educ:abil 0.06939 0.04565 1.520 0.129
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.438 on 1226 degrees of freedom
## Multiple R-squared: 0.1388, Adjusted R-squared: 0.1367
## F-statistic: 65.88 on 3 and 1226 DF, p-value: < 2.2e-16
This summary table was generated under a homoskedasticity assumption on the variance of the residuals, so we need to figure something else out for our robust standard errors. Luckily, there is a package called lmtest which provides this functionality. We call the coeftest function and give it both our model and a covariance matrix, and it will print a nice summary table like the one above:
library(lmtest)
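coeftest(model, vcov = vcovHC(model))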
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.654839 1.873048 -0.3496 0.7267
## educ 0.983682 0.166496 5.9081 4.477e-09 ***
## abil -0.400051 0.667834 -0.5990 0.5493
## educ:abil 0.069387 0.055342 1.2538 0.2102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, with the pipe this would look like:
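model %>% coeftest(vcov = vcovHC(model))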
So we have to expend a little more effort to get the covariance matrix of our \(\beta\) estimates, but once we have both the model and our covariance matrix, we can make a summary table. In fact, we really could do this ourselves. I’m going to give a demonstration below that all we need for these summary tables is a model and a covariance matrix. Let’s replicate the output of coeftest. You don’t need to understand what I’m doing; just notice that the only objects I’m using are the model and the covariance matrix.
# Point estimates and robust (HC3, the vcovHC default) standard errors
estimate <- coef(model)
std_error <- model %>% vcovHC %>% diag %>% sqrt

# t statistics and two-sided p values, using the residual degrees of freedom
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df.residual(model))

cbind(estimate, std_error, t_value, p_value)
## estimate std_error t_value p_value
## (Intercept) -0.65483870 1.87304779 -0.3496113 7.266905e-01
## educ 0.98368161 0.16649606 5.9081373 4.477086e-09
## abil -0.40005124 0.66783405 -0.5990279 5.492649e-01
## educ:abil 0.06938703 0.05534231 1.2537791 2.101613e-01
There are easier alternatives
Since this is R and there are a lot of people using it, it makes sense that someone has written something to make robust standard errors easier. Two people have, in fact.
The first is the lm_robust function from the estimatr package. You specify the covariance type when you create your model, and then the summary function will use the correct type of robust standard errors in its output:
library(estimatr)
model <- lm_robust(wage ~ educ * abil, data = htv, se_type = "HC1")
summary(model)
##
## Call:
## lm_robust(formula = wage ~ educ * abil, data = htv, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) -0.65484 1.8636 -0.3514 7.254e-01 -4.31095 3.0013 1226
## educ 0.98368 0.1657 5.9368 3.780e-09 0.65861 1.3088 1226
## abil -0.40005 0.6611 -0.6051 5.452e-01 -1.69708 0.8970 1226
## educ:abil 0.06939 0.0548 1.2663 2.057e-01 -0.03812 0.1769 1226
##
## Multiple R-squared: 0.1388 , Adjusted R-squared: 0.1367
## F-statistic: 61.57 on 3 and 1226 DF, p-value: < 2.2e-16
The second (and preferred) method is the feols function from the fixest package. Just like lm_robust, you specify the covariance type when you create your model:
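library(fixest)

# vcov = "hetero" requests heteroskedasticity-robust (HC1-style) standard errors
feols(wage ~ educ * abil, data = htv, vcov = "hetero")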
## OLS estimation, Dep. Var.: wage
## Observations: 1,230
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.654839 1.863555 -0.351392 7.2535e-01
## educ 0.983682 0.165693 5.936763 3.7799e-09 ***
## abil -0.400051 0.661106 -0.605124 5.4521e-01
## educ:abil 0.069387 0.054797 1.266261 2.0566e-01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 8.42455 Adj. R2: 0.136723
A word of warning: there are issues computing clustered standard errors with lm_robust. In general, I don’t trust the results that lm_robust gives as much as I trust sandwich or fixest. Keep that in mind and choose wisely. I show you lm_robust because there are still many people who use it. You should convince them to switch over to fixest 🙂
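For the record, here’s what clustering looks like with the tools I do trust. This is only a sketch: the HTV data doesn’t have a natural cluster variable, so region below is hypothetical.
# Hypothetical: `region` is NOT in the HTV data; substitute your own grouping variable
ols <- lm(wage ~ educ * abil, data = htv)
vcovCL(ols, cluster = ~ region)                            # sandwich: clustered covariance matrix
feols(wage ~ educ * abil, data = htv, cluster = ~ region)  # fixest: clustered SEs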
Conclusion: Why’d you show me the hard way?
Two reasons why I showed the sandwich and coeftest method. One is that I want to demystify a little bit of what is going on behind the scenes. You only need a model and a covariance matrix. Once you have those, you can compute standard errors, t statistics, and p values. I prefer using sandwich exactly for this reason: I don’t like “magic” and functions that do too much for me without me realizing what is going on.
The second reason is that the sandwich and coeftest method is enduringly popular. lm_robust had a moment of fame, and fixest is up and coming as a very popular library in econometrics, but sometimes you just want a simple, tried-and-true method. You can choose for yourself which method you’d like to use.
Happy Coding!