{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analyzing RCT reemployment experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyzing RCT with Precision by Adjusting for Baseline Covariates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Jonathan Roth's DGP\n",
    "\n",
    "Here we set up a DGP with heterogenous effects. In this example, with is due to Jonathan Roth, we have:\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "E [Y(0) | Z] = - Z, \\quad E [Y(1) |Z] = Z, \\quad Z \\sim N(0,1).\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "The CATE is\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "E [Y(1) - Y(0) | Z ]= 2 Z.\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "and the ATE is\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "2 E Z = 0.\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "We would like to estimate ATE as precisely as possible.\n",
    "\n",
    "An economic motivation for this example could be provided as follows: Let D be the treatment of going to college, and $Z$ academic skills.  Suppose that academic skills cause lower earnings Y(0) in jobs that don't require college degree, and cause higher earnings  Y(1) in jobs that require college degrees. This type of scenario is reflected in the DGP set-up above.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.181"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Import relevant packages for splitting data\n",
    "import numpy as np\n",
    "import random\n",
    "import math\n",
    "import pandas as pd\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "# Set Seed\n",
    "# to make the results replicable (generating random numbers)\n",
    "np.random.seed(12345676)     # set MC seed\n",
    "\n",
    "n = 1000                # sample size\n",
    "Z = np.random.normal(0, 1, 1000).reshape((1000, 1))  # generate Z\n",
    "Y0 = -Z + np.random.normal(0, 1, 1000).reshape((1000, 1))   # conditional average baseline response is -Z\n",
    "Y1 = Z + np.random.normal(0, 1, 1000).reshape((1000, 1))    # conditional average treatment effect is +Z\n",
    "D = (np.random.uniform(0, 1, n)<.2).reshape((1000, 1))      # treatment indicator; only 20% get treated\n",
    "np.mean(D)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y = Y1*D + Y0*(1-D)  # observed Y\n",
    "D = D - np.mean(D)      # demean D\n",
    "Z = Z - np.mean(Z)        # demean Z"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyze the RCT data with Precision Adjustment\n",
    "\n",
    "Consider \n",
    "\n",
    "*  classical 2-sample approach, no adjustment (CL)\n",
    "*  classical linear regression adjustment (CRA)\n",
    "*  interactive regression adjusment (IRA)\n",
    "\n",
    "Carry out inference using robust inference, using the sandwich formulas (Eicker-Huber-White).  \n",
    "\n",
    "Observe that CRA delivers estimates that are less efficient than CL (pointed out by Freedman), whereas IRA delivers more efficient approach (pointed out by Lin). In order for CRA to be more efficient than CL, we need the CRA to be a correct model of the conditional expectation function of Y given D and X, which is not the case here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>D</th>\n",
       "      <th>Z</th>\n",
       "      <th>Z_times_D</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>1.408649</td>\n",
       "      <td>-0.254966</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>0.466085</td>\n",
       "      <td>-0.084361</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>0.365742</td>\n",
       "      <td>-0.066199</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>-1.038993</td>\n",
       "      <td>0.188058</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>0.222988</td>\n",
       "      <td>-0.040361</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>995</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>0.161225</td>\n",
       "      <td>-0.029182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>996</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>-0.472047</td>\n",
       "      <td>0.085440</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>997</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>1.010122</td>\n",
       "      <td>-0.182832</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>998</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>-0.177596</td>\n",
       "      <td>0.032145</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999</th>\n",
       "      <td>-0.181</td>\n",
       "      <td>0.392989</td>\n",
       "      <td>-0.071131</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1000 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         D         Z  Z_times_D\n",
       "0   -0.181  1.408649  -0.254966\n",
       "1   -0.181  0.466085  -0.084361\n",
       "2   -0.181  0.365742  -0.066199\n",
       "3   -0.181 -1.038993   0.188058\n",
       "4   -0.181  0.222988  -0.040361\n",
       "..     ...       ...        ...\n",
       "995 -0.181  0.161225  -0.029182\n",
       "996 -0.181 -0.472047   0.085440\n",
       "997 -0.181  1.010122  -0.182832\n",
       "998 -0.181 -0.177596   0.032145\n",
       "999 -0.181  0.392989  -0.071131\n",
       "\n",
       "[1000 rows x 3 columns]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Z_times_D = Z*D\n",
    "X = np.hstack((D, Z, Z_times_D))\n",
    "data = pd.DataFrame(X, columns = [\"D\", \"Z\", \"Z_times_D\"])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import packages for OLS regression\n",
    "import statsmodels.api as sm\n",
    "import statsmodels.formula.api as smf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "CL_model = \"Y ~ D\"          \n",
    "CRA_model = \"Y ~ D + Z\"      #classical\n",
    "IRA_model = \"Y ~ D+ Z+ Z*D\" #interactive approach\n",
    "\n",
    "CL = smf.ols(CL_model , data=data).fit()\n",
    "CRA = smf.ols(CRA_model , data=data).fit()\n",
    "IRA = smf.ols(IRA_model , data=data).fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check t values of regressors \n",
    "print(CL.tvalues)\n",
    "print(CRA.tvalues)\n",
    "print(IRA.tvalues)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using classical standard errors (non-robust) is misleading here.\n",
    "\n",
    "We don't teach non-robust standard errors in econometrics courses, but the default statistical inference for lm() procedure in R, summary.lm(), still uses 100-year old concepts, perhaps in part due to historical legacy.  \n",
    "\n",
    "Here the non-robust standard errors suggest that there is not much difference between the different approaches, contrary to the conclusions reached using the robust standard errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                      Y   R-squared:                       0.001\n",
      "Model:                            OLS   Adj. R-squared:                  0.000\n",
      "Method:                 Least Squares   F-statistic:                     1.306\n",
      "Date:                Sat, 13 Mar 2021   Prob (F-statistic):              0.253\n",
      "Time:                        11:22:53   Log-Likelihood:                -1737.3\n",
      "No. Observations:                1000   AIC:                             3479.\n",
      "Df Residuals:                     998   BIC:                             3488.\n",
      "Df Model:                           1                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "Intercept     -0.0312      0.044     -0.717      0.473      -0.117       0.054\n",
      "D              0.1292      0.113      1.143      0.253      -0.093       0.351\n",
      "==============================================================================\n",
      "Omnibus:                        0.706   Durbin-Watson:                   1.972\n",
      "Prob(Omnibus):                  0.702   Jarque-Bera (JB):                0.578\n",
      "Skew:                          -0.006   Prob(JB):                        0.749\n",
      "Kurtosis:                       3.117   Cond. No.                         2.60\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                      Y   R-squared:                       0.229\n",
      "Model:                            OLS   Adj. R-squared:                  0.228\n",
      "Method:                 Least Squares   F-statistic:                     148.3\n",
      "Date:                Sat, 13 Mar 2021   Prob (F-statistic):           4.15e-57\n",
      "Time:                        11:22:53   Log-Likelihood:                -1607.7\n",
      "No. Observations:                1000   AIC:                             3221.\n",
      "Df Residuals:                     997   BIC:                             3236.\n",
      "Df Model:                           2                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "Intercept     -0.0312      0.038     -0.816      0.415      -0.106       0.044\n",
      "D              0.1546      0.099      1.556      0.120      -0.040       0.350\n",
      "Z             -0.6608      0.038    -17.173      0.000      -0.736      -0.585\n",
      "==============================================================================\n",
      "Omnibus:                       32.600   Durbin-Watson:                   1.952\n",
      "Prob(Omnibus):                  0.000   Jarque-Bera (JB):               79.232\n",
      "Skew:                          -0.067   Prob(JB):                     6.24e-18\n",
      "Kurtosis:                       4.372   Cond. No.                         2.60\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                      Y   R-squared:                       0.483\n",
      "Model:                            OLS   Adj. R-squared:                  0.481\n",
      "Method:                 Least Squares   F-statistic:                     310.0\n",
      "Date:                Sat, 13 Mar 2021   Prob (F-statistic):          4.01e-142\n",
      "Time:                        11:22:53   Log-Likelihood:                -1408.2\n",
      "No. Observations:                1000   AIC:                             2824.\n",
      "Df Residuals:                     996   BIC:                             2844.\n",
      "Df Model:                           3                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "Intercept     -0.0421      0.031     -1.343      0.180      -0.104       0.019\n",
      "D              0.1060      0.081      1.301      0.194      -0.054       0.266\n",
      "Z             -0.6167      0.032    -19.518      0.000      -0.679      -0.555\n",
      "Z:D            1.9111      0.086     22.102      0.000       1.741       2.081\n",
      "==============================================================================\n",
      "Omnibus:                        2.203   Durbin-Watson:                   2.015\n",
      "Prob(Omnibus):                  0.332   Jarque-Bera (JB):                2.274\n",
      "Skew:                           0.006   Prob(JB):                        0.321\n",
      "Kurtosis:                       3.233   Cond. No.                         2.77\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ],
   "source": [
    "# we are interested in the coefficients on variable \"D\".\n",
    "print(CL.summary())\n",
    "print(CRA.summary())\n",
    "print(IRA.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Intercept   -0.717115\n",
      "D            1.142621\n",
      "dtype: float64\n",
      "Intercept    -0.815904\n",
      "D             1.555656\n",
      "Z           -17.172891\n",
      "dtype: float64\n",
      "Intercept    -1.343013\n",
      "D             1.301074\n",
      "Z           -19.518353\n",
      "Z:D          22.102233\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Verify Asymptotic Approximations Hold in Finite-Sample Simulation Experiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Standard deviations for estimators\n",
      "0.09610043229856047\n",
      "0.13473909989882688\n",
      "0.09467905369858437\n"
     ]
    }
   ],
   "source": [
    "np.random.seed(12345676)     # set MC seed\n",
    "n = 1000\n",
    "B = 1000\n",
    "\n",
    "# numpy format of data = float32\n",
    "CLs = np.repeat(0., B)\n",
    "CRAs = np.repeat(0., B)\n",
    "IRAs = np.repeat(0., B)\n",
    "\n",
    "# models\n",
    "CL_model = \"Y ~ D\"          \n",
    "CRA_model = \"Y ~ D + Z\"      #classical\n",
    "IRA_model = \"Y ~ D+ Z+ Z*D\" #interactive approachIRAs = np.repeat(0, B)\n",
    "\n",
    "# simulation\n",
    "for i in range(0, B, 1):\n",
    "    Z = np.random.normal(0, 1, n).reshape((n, 1))\n",
    "    Y0 = -Z + np.random.normal(0, 1, n).reshape((n, 1))\n",
    "    Y1 = Z + np.random.normal(0, 1, n).reshape((n, 1))\n",
    "    D = (np.random.uniform(0, 1, n)<.2).reshape((n, 1))\n",
    "    \n",
    "    D = D - np.mean(D)\n",
    "    Z = Z - np.mean(Z)\n",
    "    \n",
    "    Y = Y1*D + Y0*(1-D)\n",
    "    \n",
    "    Z_times_D = Z*D\n",
    "    X = np.hstack((D, Z, Z_times_D))\n",
    "    data = pd.DataFrame(X, columns = [\"D\", \"Z\", \"Z_times_D\"])\n",
    "     \n",
    "    CLs[i,] = smf.ols(CL_model , data=data).fit().params[1]\n",
    "    CRAs[i,] = smf.ols(CRA_model , data=data).fit().params[1]\n",
    "    IRAs[i,] = smf.ols(IRA_model , data=data).fit().params[1]\n",
    "\n",
    "# check  standard deviations\n",
    "print(\"Standard deviations for estimators\")\n",
    "print(np.sqrt(np.mean(CLs**2)))\n",
    "print(np.sqrt(np.mean(CRAs**2)))\n",
    "print(np.sqrt(np.mean(IRAs**2)))"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}