5. Analyzing RCT Data with Precision by Adjusting for Baseline Covariates#

5.1. Jonathan Roth’s DGP#

Here we set up a DGP with heterogeneous treatment effects. In this example, which is due to Jonathan Roth, we have

\[ E[Y(0) \mid Z] = -Z, \quad E[Y(1) \mid Z] = Z, \quad Z \sim N(0,1). \]

The conditional average treatment effect (CATE) is

\[ E[Y(1) - Y(0) \mid Z] = 2Z, \]

and the average treatment effect (ATE) is

\[ E[Y(1) - Y(0)] = 2\,E[Z] = 0. \]

We would like to estimate the ATE as precisely as possible.

An economic motivation for this example could be provided as follows. Let \(D\) be the treatment of going to college and \(Z\) academic skills. Suppose that academic skills cause lower earnings \(Y(0)\) in jobs that don’t require a college degree, and higher earnings \(Y(1)\) in jobs that do. This type of scenario is reflected in the DGP set up above.

# install.packages("librarian", quiet = T)
librarian::shelf(sandwich, lmtest, quiet = T)
# generate the simulated dataset
set.seed(123)        # set MC seed
n = 1000             # sample size
Z = rnorm(n)         # generate Z
Y0 = -Z + rnorm(n)   # conditional average baseline response is -Z
Y1 = Z + rnorm(n)    # conditional average treated response is +Z (so the CATE is 2Z)
D = (runif(n) < .2)  # treatment indicator; only 20% get treated
Y = Y1*D + Y0*(1-D)  # observed Y
D = D - mean(D)      # demean D
Z = Z - mean(Z)      # demean Z
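
Before fitting any models, it is worth a quick sanity check that the simulated draws match the DGP. This snippet is not part of the original analysis; it simply reuses the objects created above.

# the sample analogue of the ATE, E[Y(1) - Y(0)], should be near 0,
# and regressing Y1 - Y0 on Z should give a slope near the CATE slope of 2
mean(Y1 - Y0)
lm(Y1 - Y0 ~ Z)$coef[2]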
5.2. Analyze the RCT data with Precision Adjustment#

Consider the following regression models:

  • classical 2-sample approach, no adjustment (CL)

  • classical linear regression adjustment (CRA)

  • interactive regression adjustment (IRA)

We carry out inference using heteroskedasticity-robust (Eicker-Huber-White) standard errors, computed with the sandwich formula for the variance.
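
For reference, the HC0 version of the sandwich estimator for the variance of the OLS coefficients \(\hat\beta\) is

\[ \widehat V_{\text{HC0}} = (X'X)^{-1} \Big( \sum_{i=1}^n \hat u_i^2 x_i x_i' \Big) (X'X)^{-1}, \]

where \(\hat u_i\) are the OLS residuals; the HC1 variant used below multiplies this by \(n/(n-k)\) as a degrees-of-freedom correction.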

We observe that the CRA delivers estimates that are less efficient than the CL (as pointed out by Freedman), whereas the IRA delivers a more efficient approach (as pointed out by Lin). For the CRA to be more efficient than the CL, the CRA would need to be a correct model of the conditional expectation function of \(Y\) given \(D\) and \(Z\), which is not the case here: the true CEF involves the interaction of \(D\) and \(Z\), which the CRA omits.

# implement each of the models on the simulated data
CL = lm(Y ~ D)             # classical 2-sample approach
CRA = lm(Y ~ D + Z)        # classical regression adjustment
IRA = lm(Y ~ D + Z + Z*D)  # interactive regression adjustment

# we are interested in the coefficients on variable "D".
# library(sandwich) # heteroskedasticity-robust standard errors
# library(lmtest)   # coefficient testing
coeftest(CL, vcov = vcovHC(CL, type="HC1"))
coeftest(CRA, vcov = vcovHC(CRA, type="HC1"))
coeftest(IRA, vcov = vcovHC(IRA, type="HC1"))
t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.038457   0.042730  0.9000   0.3683
D           0.010373   0.109479  0.0947   0.9245
t test of coefficients:

             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  0.038457   0.039038   0.9851   0.3248    
D            0.070199   0.136195   0.5154   0.6064    
Z           -0.555628   0.050840 -10.9289   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
t test of coefficients:

             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  0.007973   0.031466   0.2534   0.8000    
D           -0.054906   0.077519  -0.7083   0.4789    
Z           -0.568043   0.031620 -17.9647   <2e-16 ***
D:Z          1.869928   0.078191  23.9147   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
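
As a check on what the IRA is doing, note that Lin’s interacted regression with demeaned D and Z is numerically equivalent to an imputation estimator: fit separate regressions of Y on Z in the treated and control groups, impute both potential outcomes for every observation, and average the difference. The following sketch is not part of the original analysis and reuses the simulated objects from above.

d = (D > 0)                    # recover the binary treatment from demeaned D
fit1 = lm(Y ~ Z, subset = d)   # outcome regression in the treated group
fit0 = lm(Y ~ Z, subset = !d)  # outcome regression in the control group
mu1 = predict(fit1, newdata = data.frame(Z = Z))  # imputed Y(1) for all units
mu0 = predict(fit0, newdata = data.frame(Z = Z))  # imputed Y(0) for all units
mean(mu1 - mu0)                # coincides with the IRA coefficient on D above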

5.3. Using classical (non-robust) standard errors is misleading here#

We don’t teach non-robust standard errors in econometrics courses, but the default inference procedure for lm() in R, summary.lm(), still relies on these century-old concepts, perhaps in part due to historical legacy.

Here the non-robust standard errors suggest that there is little difference between the three approaches, contrary to the conclusions reached using the robust standard errors.

summary(CL)
summary(CRA)
summary(IRA)
Call:
lm(formula = Y ~ D)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0128 -0.8897  0.0232  0.8901  4.0058 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03846    0.04273   0.900    0.368
D            0.01037    0.10982   0.094    0.925

Residual standard error: 1.351 on 998 degrees of freedom
Multiple R-squared:  8.94e-06,	Adjusted R-squared:  -0.0009931 
F-statistic: 0.008922 on 1 and 998 DF,  p-value: 0.9248
Call:
lm(formula = Y ~ D + Z)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3167 -0.7649 -0.0166  0.7739  5.2087 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.03846    0.03904   0.985    0.325    
D            0.07020    0.10042   0.699    0.485    
Z           -0.55563    0.03942 -14.095   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.235 on 997 degrees of freedom
Multiple R-squared:  0.1662,	Adjusted R-squared:  0.1645 
F-statistic: 99.34 on 2 and 997 DF,  p-value: < 2.2e-16
Call:
lm(formula = Y ~ D + Z + Z * D)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0225 -0.6707  0.0003  0.6877  3.3018 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.007973   0.031492   0.253    0.800    
D           -0.054906   0.081116  -0.677    0.499    
Z           -0.568043   0.031777 -17.876   <2e-16 ***
D:Z          1.869928   0.080565  23.210   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.995 on 996 degrees of freedom
Multiple R-squared:  0.4589,	Adjusted R-squared:  0.4572 
F-statistic: 281.5 on 3 and 996 DF,  p-value: < 2.2e-16
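
To make the contrast explicit, the classical and HC1-robust standard errors on D can be tabulated side by side. This convenience snippet is not part of the original analysis; it reuses the fitted models from above.

models = list(CL = CL, CRA = CRA, IRA = IRA)
se_classical = sapply(models, function(m) summary(m)$coef["D", "Std. Error"])
se_robust = sapply(models, function(m) sqrt(vcovHC(m, type = "HC1")["D", "D"]))
rbind(classical = se_classical, robust = se_robust)  # one column per model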

5.4. Verify Asymptotic Approximations Hold in Finite-Sample Simulation Experiment#

set.seed(123)
n = 1000   # sample size
B = 1000   # number of simulation replications

CLs = rep(0, B)
CRAs = rep(0, B)
IRAs = rep(0, B)

for (i in 1:B){
  Z = rnorm(n)
  Y0 = -Z + rnorm(n)
  Y1 =  Z + rnorm(n)
  D = (runif(n) < .1)  # treatment indicator; only 10% get treated
  Y = Y1*D + Y0*(1-D)  # observed Y, constructed before demeaning D
  Z = Z - mean(Z)      # demean Z
  D = D - mean(D)      # demean D
  CLs[i] = lm(Y ~ D)$coef[2]
  CRAs[i] = lm(Y ~ D + Z)$coef[2]
  IRAs[i] = lm(Y ~ D + Z + Z*D)$coef[2]
}

print("Standard deviations for estimators")

sqrt(mean(CLs^2))
sqrt(mean(CRAs^2))
sqrt(mean(IRAs^2))
[1] "Standard deviations for estimators"
0.130864481384174
0.205865178405738
0.116085705957897
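
The simulation confirms the asymptotic comparisons in a finite sample: the IRA has the smallest standard deviation, while the CRA is markedly less precise than even the unadjusted CL estimator, just as the Freedman and Lin results predict.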