A data frame for everyone
R is fundamentally different from Stata in that you can work with more than one dataset at a time. In Stata, the use command loads a single dataset into memory. In R, you create a data frame and save it to a variable, and you can have as many data frames loaded as you want.
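For example, here is a minimal sketch with two small made-up data frames (the names people and regions are purely illustrative) held in memory at the same time:
# Two small, made-up data frames can live in memory side by side
people <- data.frame(id = 1:3, wage = c(12.0, 8.9, 15.5))
regions <- data.frame(id = 1:3, region = c("Northeast", "South", "West"))

# Each one is available under its own name
nrow(people)
#> [1] 3
nrow(regions)
#> [1] 3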
The haven package in R can be used to load Stata .dta files. An example is below. Note that the system.file() command locates a .dta file that I have included with this package. This code loads the HTV.DTA file included with this package and saves it as a variable named htv. You can see from the output that htv is a "tibble," which is another name for a data frame.
library(haven)
htv <- read_dta(system.file("HTV.DTA", package = "metrics.in.r"))
htv
#> # A tibble: 1,230 × 23
#> wage abil educ ne nc west south exper motheduc fatheduc brkhme14
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 12.0 5.03 15 0 0 1 0 9 12 12 0
#> 2 8.91 2.04 13 1 0 0 0 8 12 10 1
#> 3 15.5 2.48 15 1 0 0 0 11 12 16 0
#> 4 13.3 3.61 15 1 0 0 0 6 12 12 0
#> 5 11.1 2.64 13 1 0 0 0 15 12 15 1
#> 6 17.5 3.47 18 1 0 0 0 8 12 12 0
#> 7 21.1 -1.76 13 1 0 0 0 13 13 12 0
#> 8 11.7 -0.134 12 0 0 0 1 14 12 12 1
#> 9 17.0 2.07 13 1 0 0 0 9 10 12 1
#> 10 11.5 -0.338 12 1 0 0 0 9 14 12 0
#> # ℹ 1,220 more rows
#> # ℹ 12 more variables: sibs <dbl>, urban <dbl>, ne18 <dbl>, nc18 <dbl>,
#> # south18 <dbl>, west18 <dbl>, urban18 <dbl>, tuit17 <dbl>, tuit18 <dbl>,
#> # lwage <dbl>, expersq <dbl>, ctuit <dbl>
Reshape, reformat, restructure
See Chapter 5 of R for Data Science here for information on how to transform your data frame to create new variables, summarize your datasets, etc.
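As a small taste of what that chapter covers, here is a hedged sketch using dplyr (part of the tidyverse) to create a new column and compute group summaries on the HTV data; the new column name educ_sq is just illustrative:
library(dplyr)

htv |>
  mutate(educ_sq = educ^2) |>          # create a new variable
  group_by(urban) |>                   # split by urban/non-urban
  summarize(mean_wage = mean(wage),
            mean_educ = mean(educ))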
A formula for everyone
Before we get to running regressions, it’s worth talking a little bit about how we describe regressions in R. There is a special object called a formula which is a symbolic description of the equation we want to estimate. We simply put the outcome variable on the left-hand side and any explanatory variables on the right-hand side, like so:
wage ~ abil + educ + motheduc
#> wage ~ abil + educ + motheduc
This is saying we want to estimate wage as a function of ability, education, and mother's education. It's a symbolic description because the objects wage, abil, educ, and motheduc aren't actually defined! That is, we get an error if we do this:
wage
#> Error in eval(expr, envir, enclos): object 'wage' not found
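Even so, the formula itself is a perfectly valid R object that you can store in a variable and pass around (the name my_formula below is just illustrative):
my_formula <- wage ~ abil + educ + motheduc
class(my_formula)
#> [1] "formula"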
So without elaborating too much, I just want to show that formulas are “special” objects in R. This means that there are some “special” rules about formulas that you need to know:
- If you want to include the product of two variables in a formula, use : like so: wage ~ abil:educ. This regresses wage on the product of ability and education. You can use * to get main effects as well as interactions, so wage ~ abil * educ is the same as wage ~ abil + educ + abil:educ.
- You can estimate a model without an intercept by adding -1 to the end of your formula, so wage ~ educ + abil - 1 regresses wage on education and ability with no intercept.
- You can use functions inside of a formula, e.g. log(wage) ~ abil + educ regresses log wage on education and ability.
- HOWEVER, ^ has a special meaning in formulas, so wage ~ educ^2 is NOT regressing wage on education squared! There is a special function I() to escape any special meaning in formulas, so you would have to write wage ~ I(educ^2) to regress wage on squared education. Because of this weird gotcha, I generally prefer to generate a new variable, e.g. doing htv |> mutate(educ_sq = educ^2) and then using wage ~ educ_sq in my formula. See the "more on data frames" note above for some information on generating new variables. (Each of these rules is demonstrated in the sketch after this list.)
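To see these rules in action without fitting anything, here is a hedged sketch using model.matrix(), which applies a formula to a data frame and builds exactly the columns R would hand to a regression (output omitted; the column names are the point):
# Interaction shorthand: * expands to main effects plus the interaction,
# producing columns (Intercept), abil, educ, abil:educ
head(model.matrix(wage ~ abil * educ, data = htv))

# Squared term: without I(), the ^ would be treated as a formula operator
head(model.matrix(wage ~ educ + I(educ^2), data = htv))

# Dropping the intercept with - 1 removes the (Intercept) column
head(model.matrix(wage ~ educ + abil - 1, data = htv))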
There are some more rules you can read about by typing
?formula
into your R console, but these are the “need to
know” rules, in my opinion.
An lm for everyone
Now that we know how to load a data frame and write a formula, we can
create a regression model. The primary function for this is the
lm
function, which stands for “linear model.” Linear models
in R require a formula and a data frame as input. Let’s see an example
of what this looks like:
lm(wage ~ educ * abil, data = htv)
#>
#> Call:
#> lm(formula = wage ~ educ * abil, data = htv)
#>
#> Coefficients:
#> (Intercept) educ abil educ:abil
#> -0.65484 0.98368 -0.40005 0.06939
What we see are estimates of the \(\beta_i\)s in the regression model \[ E(\text{wage}\mid \text{educ}, \text{abil}) = \beta_1 + \beta_2\,\text{educ} + \beta_3\,\text{abil} + \beta_4\,\text{educ}\times\text{abil} \]
There’s a lot more information we can get from a linear model, however. Let’s store the model to a variable and see what we can do with it:
model <- lm(wage ~ educ * abil, data = htv)
names(model)
#> [1] "coefficients" "residuals" "effects" "rank"
#> [5] "fitted.values" "assign" "qr" "df.residual"
#> [9] "xlevels" "call" "terms" "model"
What this demonstrates is that the model variable stores information related to fitting the linear model; these names are the pieces of information it stores. For instance, we can see the coefficient estimates again by writing
model$coefficients
#> (Intercept) educ abil educ:abil
#> -0.65483870 0.98368161 -0.40005124 0.06938703
You can also get the coefficients of the model by using the coef() function like so:
coef(model)
#> (Intercept) educ abil educ:abil
#> -0.65483870 0.98368161 -0.40005124 0.06938703
Why are there so many ways to do the same thing?
This has to do with a programming best practice known roughly as "hiding implementation details." The model$coefficients call is NOT how we want to access the coefficients, in general. In fact, for linear models you should avoid using any of the properties in names(model) directly, if you can.
Why?
The properties in names(model) represent the internal structure of the linear model object. I run names(model) to give you an idea of what internal things you might want to store if you are estimating a linear model. However, there are external or public ways of interacting with linear models that are meant for us, the users of linear models. coef is one such example. In fact, type ?coef into your R console and read the first sentence:
coef is a generic function which extracts model coefficients from objects returned by modeling functions.
The idea of generics is a fundamental programming concept. The coef function is not just for linear models; it works for many, many types of models! Generalized linear models, local linear models, and LASSO or ridge models all implement this function. Each type of model can choose how to implement the coef function using its internal structure, but we, the users, don't have to know or care how each model chooses to do that. We just need to call the coef function. It simplifies our lives by reducing the number of functions and amount of detail that we have to learn.
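To make this concrete, here is a hedged sketch (the model specification below is purely illustrative) showing coef() working unchanged on a different class of model, a logistic regression fit with glm():
# A logistic regression for earning more than $10 per hour; glm() returns a
# different type of model object than lm(), but the same coef() generic
# extracts its coefficients
logit_model <- glm(I(wage > 10) ~ educ + abil, family = binomial, data = htv)
coef(logit_model)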
Within the lm
Here is a non-exhaustive list of other generic functions you can use on linear models.
A summary table of coefficients with standard errors (assuming homoskedasticity!), the F-statistic, R-squared, and the residual standard error:
summary(model)
#>
#> Call:
#> lm(formula = wage ~ educ * abil, data = htv)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17.943 -4.602 -1.184 2.420 69.450
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.65484 1.97210 -0.332 0.740
#> educ 0.98368 0.16900 5.821 7.48e-09 ***
#> abil -0.40005 0.56506 -0.708 0.479
#> educ:abil 0.06939 0.04565 1.520 0.129
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 8.438 on 1226 degrees of freedom
#> Multiple R-squared: 0.1388, Adjusted R-squared: 0.1367
#> F-statistic: 65.88 on 3 and 1226 DF, p-value: < 2.2e-16
A vector of fitted values \(\widehat{\text{wage}}\) and residuals \(\hat{u}\), respectively:
fitted(model)
#> 1 2 3 4 5 6 7 8
#> 17.32193 13.15564 15.68683 16.41302 13.45652 20.00085 11.24712 11.09155
#> 9 10 11 12 13 14 15 16
#> 13.17334 11.00293 11.74663 12.62569 19.47078 12.68040 20.02992 16.59600
#> 17 18 19 20
#> 17.54809 16.83575 15.64953 22.10532
#> [ reached getOption("max.print") -- omitted 1210 entries ]
#> attr(,"label")
#> [1] "hourly wage, 1991"
#> attr(,"format.stata")
#> [1] "%9.0g"
resid(model)
#> 1 2 3 4 5 6
#> -5.3026990 -4.2429855 -0.1724920 -3.0796881 -2.3864056 -2.5183286
#> 7 8 9 10 11 12
#> 9.8499222 0.5755521 3.7877886 0.5355364 3.0681884 8.0734791
#> 13 14 15 16 17 18
#> 9.2129169 -0.1804047 15.8412321 6.8072210 -0.9066400 -0.7906550
#> 19 20
#> -3.6303015 -13.5704727
#> [ reached getOption("max.print") -- omitted 1210 entries ]
#> attr(,"label")
#> [1] "hourly wage, 1991"
#> attr(,"format.stata")
#> [1] "%9.0g"
The covariance matrix of your estimated \(\beta_i\)s, assuming homoskedasticity:
vcov(model)
#> (Intercept) educ abil educ:abil
#> (Intercept) 3.88918247 -0.328879090 -0.57247356 0.056087334
#> educ -0.32887909 0.028561667 0.05072765 -0.005085001
#> abil -0.57247356 0.050727650 0.31928964 -0.025023818
#> educ:abil 0.05608733 -0.005085001 -0.02502382 0.002083647
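The square roots of the diagonal of this matrix are the (homoskedasticity-assuming) standard errors reported by summary(model); a minimal sketch:
# Standard errors of the coefficient estimates; these match the
# "Std. Error" column in the summary(model) output above
sqrt(diag(vcov(model)))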
Postscript: Factors and xpd
Two more points if you’re hungry for more.
First, handling non-numeric columns is much nicer in R than in Stata. In general, it just works. Typically, we think of estimating a linear model that includes a categorical variable as estimating it with a set of dummy variables, each equal to one or zero, one for each "level" of the categorical variable. An example of this in the HTV dataset is the region that each individual lives in:
library(tidyverse)
htv |>
  select(nc18, ne18, south18, west18) |>
  sample_n(10)
#> # A tibble: 10 × 4
#> nc18 ne18 south18 west18
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0 0 1
#> 2 1 0 0 0
#> 3 1 0 0 0
#> 4 0 0 0 1
#> 5 0 1 0 0
#> 6 1 0 0 0
#> 7 1 0 0 0
#> 8 1 0 0 0
#> 9 0 1 0 0
#> 10 0 1 0 0
The code above is selecting the four columns and then picking 10 random rows. It’s just to demonstrate that these four columns take on mutually exclusive values of one or zero.
We can convert this to one column with the categorical value of region. One way to do that is to use the case_when() function:
htv <- htv |>
  mutate(region = case_when(
    ne18 == 1 ~ "Northeast",
    nc18 == 1 ~ "North-Central",
    west18 == 1 ~ "West",
    south18 == 1 ~ "South"
  ))
htv |> select(region) |> sample_n(10)
#> # A tibble: 10 × 1
#> region
#> <chr>
#> 1 Northeast
#> 2 West
#> 3 North-Central
#> 4 South
#> 5 North-Central
#> 6 West
#> 7 South
#> 8 Northeast
#> 9 North-Central
#> 10 Northeast
Now when we create a linear model using the region variable, R will create dummies for us!
lm(wage ~ region, data = htv)
#>
#> Call:
#> lm(formula = wage ~ region, data = htv)
#>
#> Coefficients:
#> (Intercept) regionNortheast regionSouth regionWest
#> 12.7275 2.6015 -0.5605 0.5629
Hence, whereas in Stata we need to be concerned about whether we have one categorical column or a set of dummy variables, in R the language is smart enough to convert between the two relatively easily. For more information on categorical variables in R, see chapter 15 of R for Data Science here.
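One related trick worth knowing: if you want to control which category R omits (the reference level for the dummies), convert the column to a factor and set the reference level yourself. Here is a hedged sketch using base R's factor() and relevel(); the name htv2 is just illustrative:
# Make "South" the omitted reference category instead of the default,
# which is the alphabetically first level ("North-Central")
htv2 <- htv |>
  mutate(region = relevel(factor(region), ref = "South"))

lm(wage ~ region, data = htv2)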
Second, if you find yourself writing really big formulas, you might consider a tool in the fixest package called xpd(). It performs several different types of "formula expansion" to make your formulas a little more concise. If you have a bunch of columns that are numbered sequentially, you could do this:
library(fixest)
xpd(y ~ x.[1:20])
#> y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
#> x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20
If you have some other quantifiable pattern in your formula, you can use something called a regular expression. You can read all about regular expressions in Chapter 14 of R for Data Science here. Here’s an example that gets all columns from HTV that end in “educ”:
xpd(wage ~ ..("^.*educ$"), data = htv)
#> wage ~ educ + motheduc + fatheduc
Happy coding!