A data frame for everyone

R is fundamentally different from Stata in that you can work with more than one dataset at a time. In Stata, the use command loads a single dataset into memory. In R, you create a data frame and save it to a variable. You can have as many data frames loaded as you want.
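
For instance, base R ships with several small built-in data frames, so we can hold two data frames in memory at the same time. A minimal sketch using the built-in mtcars and iris:

cars_df <- mtcars   # one data frame stays loaded...
flowers_df <- iris  # ...while we create a second one
nrow(cars_df)
#> [1] 32
nrow(flowers_df)
#> [1] 150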

The haven package in R can be used to load Stata .dta files. An example is below. Note that the system.file() command locates a .dta file that I have included with this package. This code loads the HTV.DTA file and saves it as a variable named htv. You can see from the output that htv is a “tibble”, the tidyverse’s variant of a data frame; for our purposes the two are interchangeable.

library(haven)
htv <- read_dta(system.file("HTV.DTA", package = "metrics.in.r"))
htv
#> # A tibble: 1,230 × 23
#>     wage   abil  educ    ne    nc  west south exper motheduc fatheduc brkhme14
#>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>
#>  1 12.0   5.03     15     0     0     1     0     9       12       12        0
#>  2  8.91  2.04     13     1     0     0     0     8       12       10        1
#>  3 15.5   2.48     15     1     0     0     0    11       12       16        0
#>  4 13.3   3.61     15     1     0     0     0     6       12       12        0
#>  5 11.1   2.64     13     1     0     0     0    15       12       15        1
#>  6 17.5   3.47     18     1     0     0     0     8       12       12        0
#>  7 21.1  -1.76     13     1     0     0     0    13       13       12        0
#>  8 11.7  -0.134    12     0     0     0     1    14       12       12        1
#>  9 17.0   2.07     13     1     0     0     0     9       10       12        1
#> 10 11.5  -0.338    12     1     0     0     0     9       14       12        0
#> # ℹ 1,220 more rows
#> # ℹ 12 more variables: sibs <dbl>, urban <dbl>, ne18 <dbl>, nc18 <dbl>,
#> #   south18 <dbl>, west18 <dbl>, urban18 <dbl>, tuit17 <dbl>, tuit18 <dbl>,
#> #   lwage <dbl>, expersq <dbl>, ctuit <dbl>
More on data frames
Reshape, reformat, restructure

See Chapter 5 of R for Data Science here for information on how to transform your data frame to create new variables, summarize your datasets, etc.
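
For a flavor of what those tools look like, here is a minimal sketch using dplyr (which that chapter covers) to add a squared-education column to htv and compute a couple of summary statistics; output omitted:

library(dplyr)
htv |>
  mutate(educ_sq = educ^2) |>          # create a new column
  summarize(mean_wage = mean(wage),    # collapse to one summary row
            mean_educ = mean(educ))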

A formula for everyone

Before we get to running regressions, it’s worth talking a little bit about how we describe regressions in R. There is a special type of object called a formula, which is a symbolic description of the equation we want to estimate. We simply put the outcome variable on the left-hand side and any explanatory variables on the right-hand side, like so:

wage ~ abil + educ + motheduc
#> wage ~ abil + educ + motheduc

This is saying we want to estimate wage as a function of ability, education, and mother’s education. It’s a symbolic description because the objects wage, abil, educ, and motheduc aren’t actually defined! That is, we get an error if we do this:

wage
#> Error in eval(expr, envir, enclos): object 'wage' not found

So without elaborating too much, I just want to show that formulas are “special” objects in R. This means that there are some “special” rules about formulas that you need to know:

  • If you want to include the product of two variables in a formula, use : like so: wage ~ abil:educ. This regresses wage on the product of ability and education. You can use * to get main effects as well as interactions, so wage ~ abil * educ is the same as wage ~ abil + educ + abil:educ.
  • You can estimate a model without an intercept by adding -1 to the end of your formula, so wage ~ educ + abil - 1 regresses wage on education and ability with no intercept.
  • You can use functions inside of a formula, e.g. log(wage) ~ abil + educ regresses log wage on ability and education.
  • HOWEVER, ^ has a special meaning in formulas, so wage ~ educ^2 is NOT regressing wage on education squared! There is a special function I() that escapes any special meaning in formulas, so you would have to write wage ~ I(educ^2) to regress wage on squared education. Because of this weird gotcha, I generally prefer to generate a new variable, e.g. doing htv |> mutate(educ_sq = educ^2) and then using wage ~ educ_sq in my formula, as in the sketch after this list. See the “more on data frames” note above for some information on generating new variables.
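
To make that last gotcha concrete, here is a minimal sketch of the two equivalent approaches; both fit the same squared-education model, so their coefficient estimates should match (output omitted):

# Approach 1: escape the ^ inside the formula with I()
lm(wage ~ I(educ^2), data = htv)

# Approach 2: generate the column first, then refer to it by name
htv_sq <- htv |> dplyr::mutate(educ_sq = educ^2)
lm(wage ~ educ_sq, data = htv_sq)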

There are some more rules you can read about by typing ?formula into your R console, but these are the “need to know” rules, in my opinion.

An lm for everyone

Now that we know how to load a data frame and write a formula, we can create a regression model. The primary function for this is the lm function, which stands for “linear model.” Linear models in R require a formula and a data frame as input. Let’s see an example of what this looks like:

lm(wage ~ educ * abil, data = htv)
#> 
#> Call:
#> lm(formula = wage ~ educ * abil, data = htv)
#> 
#> Coefficients:
#> (Intercept)         educ         abil    educ:abil  
#>    -0.65484      0.98368     -0.40005      0.06939

What we see are estimates of the \(\beta_i\)s in the regression model \[ E(\text{wage}\mid \text{educ}, \text{abil}) = \beta_1 + \beta_2\,\text{educ} + \beta_3\,\text{abil} + \beta_4\,\text{educ}\times\text{abil} \]

There’s a lot more information we can get from a linear model, however. Let’s store the model to a variable and see what we can do with it:

model <- lm(wage ~ educ * abil, data = htv)
names(model)
#>  [1] "coefficients"  "residuals"     "effects"       "rank"         
#>  [5] "fitted.values" "assign"        "qr"            "df.residual"  
#>  [9] "xlevels"       "call"          "terms"         "model"

What this demonstrates is that the model variable stores a whole collection of information related to fitting the linear model, and names(model) lists the pieces of information it stores. For instance, we can see the coefficient estimates again by writing

model$coefficients
#> (Intercept)        educ        abil   educ:abil 
#> -0.65483870  0.98368161 -0.40005124  0.06938703

You can also get the coefficients of the model by using the coef() function like so:

coef(model)
#> (Intercept)        educ        abil   educ:abil 
#> -0.65483870  0.98368161 -0.40005124  0.06938703
Generics Demonstrated
Why are there so many ways to do the same thing?

This has to do with a programming best practice known roughly as “hiding implementation details.” The model$coefficients call is NOT how we want to access the coefficients, in general. In fact, for linear models you should avoid using any of the properties in names(model) directly, if you can.

Why?

The properties in names(model) represent the internal structure of the linear model object. I ran names(model) above to give you an idea of what R stores internally when it estimates a linear model. However, there are external or public ways of interacting with linear models that are meant for us, the users of linear models. coef is one such example. In fact, type ?coef into your R console and read the first sentence:

coef is a generic function which extracts model coefficients from objects returned by modeling functions.

The idea of generics is a fundamental programming concept. The coef function is not just for linear models; it works on many, many types of models! Generalized linear models, local linear models, and LASSO or ridge models all implement this function. Each type of model can choose how to implement coef using its own internal structure, but we, the users, don’t have to know or care how each model chooses to do that. We just call coef. This simplifies our lives by reducing the number of functions and the amount of detail we have to learn.
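
As a quick illustration, here is a sketch using base R’s glm() on the htv data. Assuming the urban column is a zero/one indicator (like the other dummy columns in this dataset), we can fit a logistic regression and extract its coefficients with the very same generic (output omitted):

logit_model <- glm(urban ~ educ + abil, family = binomial, data = htv)
coef(logit_model)  # same coef() call, entirely different model class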

Within the lm

Here is a non-exhaustive list of other generic functions you can use on linear models.

A summary table of coefficients with standard errors (assuming homoskedasticity!), the F-statistic, R-squared, and the residual standard error:

summary(model)
#> 
#> Call:
#> lm(formula = wage ~ educ * abil, data = htv)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -17.943  -4.602  -1.184   2.420  69.450 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.65484    1.97210  -0.332    0.740    
#> educ         0.98368    0.16900   5.821 7.48e-09 ***
#> abil        -0.40005    0.56506  -0.708    0.479    
#> educ:abil    0.06939    0.04565   1.520    0.129    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.438 on 1226 degrees of freedom
#> Multiple R-squared:  0.1388, Adjusted R-squared:  0.1367 
#> F-statistic: 65.88 on 3 and 1226 DF,  p-value: < 2.2e-16

A vector of fitted values \(\widehat{\text{wage}}\) and residuals \(\hat{u}\), respectively:

fitted(model)
#>        1        2        3        4        5        6        7        8 
#> 17.32193 13.15564 15.68683 16.41302 13.45652 20.00085 11.24712 11.09155 
#>        9       10       11       12       13       14       15       16 
#> 13.17334 11.00293 11.74663 12.62569 19.47078 12.68040 20.02992 16.59600 
#>       17       18       19       20 
#> 17.54809 16.83575 15.64953 22.10532 
#>  [ reached getOption("max.print") -- omitted 1210 entries ]
#> attr(,"label")
#> [1] "hourly wage, 1991"
#> attr(,"format.stata")
#> [1] "%9.0g"
resid(model)
#>           1           2           3           4           5           6 
#>  -5.3026990  -4.2429855  -0.1724920  -3.0796881  -2.3864056  -2.5183286 
#>           7           8           9          10          11          12 
#>   9.8499222   0.5755521   3.7877886   0.5355364   3.0681884   8.0734791 
#>          13          14          15          16          17          18 
#>   9.2129169  -0.1804047  15.8412321   6.8072210  -0.9066400  -0.7906550 
#>          19          20 
#>  -3.6303015 -13.5704727 
#>  [ reached getOption("max.print") -- omitted 1210 entries ]
#> attr(,"label")
#> [1] "hourly wage, 1991"
#> attr(,"format.stata")
#> [1] "%9.0g"

A covariance matrix of your \(\beta_i\) estimates, assuming homoskedasticity:

vcov(model)
#>             (Intercept)         educ        abil    educ:abil
#> (Intercept)  3.88918247 -0.328879090 -0.57247356  0.056087334
#> educ        -0.32887909  0.028561667  0.05072765 -0.005085001
#> abil        -0.57247356  0.050727650  0.31928964 -0.025023818
#> educ:abil    0.05608733 -0.005085001 -0.02502382  0.002083647
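
One common use of this matrix: the standard errors reported by summary(model) above are the square roots of its diagonal entries. A quick sketch, whose result should match the Std. Error column:

sqrt(diag(vcov(model)))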

Postscript: Factors and xpd

Two more points if you’re hungry for more.

First, handling non-numeric columns is much nicer in R than in Stata. In general, it just works. Typically, we think of estimating a linear model that includes a categorical variable as estimating it with a set of dummy variables, one per “level” of the categorical variable, each equal to one or zero. An example of this in the HTV dataset is the region that each individual lives in:

library(tidyverse)
htv |> 
  select(nc18, ne18, south18, west18) |> 
  sample_n(10)
#> # A tibble: 10 × 4
#>     nc18  ne18 south18 west18
#>    <dbl> <dbl>   <dbl>  <dbl>
#>  1     0     0       0      1
#>  2     1     0       0      0
#>  3     1     0       0      0
#>  4     0     0       0      1
#>  5     0     1       0      0
#>  6     1     0       0      0
#>  7     1     0       0      0
#>  8     1     0       0      0
#>  9     0     1       0      0
#> 10     0     1       0      0

The code above selects the four region columns and then picks 10 random rows, just to demonstrate that these columns take on mutually exclusive values of one or zero.

We can convert this to one column with the categorical value of region. One way to do that is to use the case_when() function:

htv <- htv |>
  mutate(region = case_when(
    ne18 == 1    ~ "Northeast",
    nc18 == 1    ~ "North-Central",
    west18 == 1  ~ "West",
    south18 == 1 ~ "South"
  ))

htv |> select(region) |> sample_n(10)
#> # A tibble: 10 × 1
#>    region       
#>    <chr>        
#>  1 Northeast    
#>  2 West         
#>  3 North-Central
#>  4 South        
#>  5 North-Central
#>  6 West         
#>  7 South        
#>  8 Northeast    
#>  9 North-Central
#> 10 Northeast

Now when we create a linear model using the region variable, R will create dummies for us!

lm(wage ~ region, data = htv)
#> 
#> Call:
#> lm(formula = wage ~ region, data = htv)
#> 
#> Coefficients:
#>     (Intercept)  regionNortheast      regionSouth       regionWest  
#>         12.7275           2.6015          -0.5605           0.5629
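
Notice that R dropped the North-Central dummy and used it as the reference category, since it comes first alphabetically. If you want a different baseline, one option is base R’s relevel(); here is a sketch that stores the result in a new variable, htv_releveled, so we don’t overwrite htv:

htv_releveled <- htv |>
  mutate(region = relevel(factor(region), ref = "South"))
lm(wage ~ region, data = htv_releveled)  # coefficients now relative to South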

Hence, whereas in Stata we need to be concerned about whether we have one categorical column or a set of dummy variables, R is smart enough to convert between the two relatively easily. For more information on categorical variables in R, see chapter 15 of R for Data Science here.


Second, if you find yourself writing really big formulas, you might consider a tool in the fixest package called xpd(). It performs several different types of “formula expansion” to make your formulas a little more concise. If you have a bunch of columns that are numbered sequentially, you could do this:

library(fixest)
xpd(y ~ x.[1:20])
#> y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
#>     x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20

If your formula follows some other recognizable pattern, you can describe that pattern with a regular expression. You can read all about regular expressions in Chapter 14 of R for Data Science here. Here’s an example that gets all columns from HTV that end in “educ”:

xpd(wage ~ ..("^.*educ$"), data = htv)
#> wage ~ educ + motheduc + fatheduc
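
Since xpd() simply returns a formula, you can pass its result straight to lm(). A minimal sketch:

lm(xpd(wage ~ ..("^.*educ$"), data = htv), data = htv)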

Happy coding!