10. Double Lasso for Testing the Convergence Hypothesis#
10.1. Double Lasso for the convergence hypothesis#
10.2. Introduction#
We provide an additional empirical example of partialling-out with Lasso to estimate the regression coefficient \(\beta_1\) in the high-dimensional linear regression model:

\[
Y = \beta_1 D + \beta_2' W + \epsilon.
\]
Specifically, we are interested in how the rate at which an economy grows (\(Y\)) is related to its initial wealth level (\(D\)), controlling for the country's institutional, educational, and other similar characteristics (\(W\)).
The relationship is captured by \(\beta_1\), the speed of convergence/divergence, which measures the speed at which poor countries catch up with \((\beta_1 < 0)\) or fall behind \((\beta_1 > 0)\) rich countries, after controlling for \(W\). Our inference question here is: do poor countries grow faster than rich countries, controlling for educational and other characteristics? In other words, is the speed of convergence negative, \(\beta_1 < 0\)? This is the Convergence Hypothesis predicted by the Solow Growth Model, a structural economic model. Under some strong assumptions, which we won't state here, the predictive exercise we carry out here can be given a causal interpretation.
The outcome \(Y\) is the realized annual growth rate of a country’s wealth (Gross Domestic Product per capita). The target regressor (\(D\)) is the initial level of the country’s wealth. The target parameter \(\beta_1\) is the speed of convergence, which measures the speed at which poor countries catch up with rich countries. The controls (\(W\)) include measures of education levels, quality of institutions, trade openness, and political stability in the country.
10.3. Data analysis#
We consider the data set GrowthData, which is included in the R package hdm. First, let us load the data set to get familiar with the data.
import hdmpy
import pandas as pd
import numpy as np
import pyreadr
import os
from urllib.request import urlopen
import math
import matplotlib.pyplot as plt
import random
import warnings
warnings.filterwarnings('ignore')
link="https://raw.githubusercontent.com/d2cml-ai/14.388_py/main/data/GrowthData.RData"
response = urlopen(link)
content = response.read()
fhandle = open( 'GrowthData.RData', 'wb')
fhandle.write(content)
fhandle.close()
result = pyreadr.read_r("GrowthData.RData")
os.remove("GrowthData.RData")
# Extract the data frame from the object returned by pyreadr
growth = result[ 'GrowthData' ]
We determine the dimension of our data set.
growth.shape
(90, 63)
The sample contains \(90\) countries and \(63\) variables, of which about \(60\) are controls (the remaining columns are the outcome, the intercept, and the initial wealth level). Thus \(p \approx 60\), \(n = 90\), and \(p/n\) is not small. We therefore expect the least squares method to provide a poor estimate of \(\beta_1\), and we expect the method based on partialling-out with Lasso to provide a high-quality estimate of \(\beta_1\).
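As a quick check (a minimal sketch using the growth data frame loaded above), we can compute the ratio \(p/n\) directly:

# Dimensions: n observations vs. number of right-hand-side columns
n, n_cols = growth.shape
p = n_cols - 2   # exclude the outcome and the intercept columns
print(f"n = {n}, p = {p}, p/n = {p / n:.2f}")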
To check this, we analyze the relation between the outcome variable \(Y\) and the country characteristics by running a linear regression in the first step.
import statsmodels.api as sm
import statsmodels.formula.api as smf
# We create the main variables
y = growth['Outcome']
X = growth.drop('Outcome', axis=1)
# OLS regression
reg_ols = sm.OLS(y, X).fit()
print(reg_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Outcome R-squared: 0.887
Model: OLS Adj. R-squared: 0.641
Method: Least Squares F-statistic: 3.607
Date: Fri, 22 Jul 2022 Prob (F-statistic): 0.000200
Time: 19:48:53 Log-Likelihood: 238.24
No. Observations: 90 AIC: -352.5
Df Residuals: 28 BIC: -197.5
Df Model: 61
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
intercept 0.2472 0.785 0.315 0.755 -1.360 1.854
gdpsh465 -0.0094 0.030 -0.314 0.756 -0.071 0.052
bmp1l -0.0689 0.033 -2.117 0.043 -0.135 -0.002
freeop 0.0801 0.208 0.385 0.703 -0.346 0.506
freetar -0.4890 0.418 -1.169 0.252 -1.346 0.368
h65 -2.3621 0.857 -2.755 0.010 -4.118 -0.606
hm65 0.7071 0.523 1.352 0.187 -0.364 1.779
hf65 1.6934 0.503 3.365 0.002 0.663 2.724
p65 0.2655 0.164 1.616 0.117 -0.071 0.602
pm65 0.1370 0.151 0.906 0.373 -0.173 0.447
pf65 -0.3313 0.165 -2.006 0.055 -0.670 0.007
s65 0.0391 0.186 0.211 0.835 -0.341 0.419
sm65 -0.0307 0.117 -0.263 0.795 -0.270 0.209
sf65 -0.1799 0.118 -1.523 0.139 -0.422 0.062
fert65 0.0069 0.027 0.254 0.801 -0.049 0.062
mort65 -0.2335 0.817 -0.286 0.777 -1.908 1.441
lifee065 -0.0149 0.193 -0.077 0.939 -0.411 0.381
gpop1 0.9702 1.812 0.535 0.597 -2.742 4.682
fert1 0.0088 0.035 0.252 0.803 -0.063 0.081
mort1 0.0666 0.685 0.097 0.923 -1.336 1.469
invsh41 0.0745 0.108 0.687 0.498 -0.148 0.297
geetot1 -0.7151 1.680 -0.426 0.674 -4.157 2.726
geerec1 0.6300 2.447 0.257 0.799 -4.383 5.643
gde1 -0.4436 1.671 -0.265 0.793 -3.867 2.980
govwb1 0.3375 0.438 0.770 0.447 -0.560 1.235
govsh41 0.4632 1.925 0.241 0.812 -3.481 4.407
gvxdxe41 -0.7934 2.059 -0.385 0.703 -5.012 3.425
high65 -0.7525 0.906 -0.831 0.413 -2.608 1.103
highm65 -0.3903 0.681 -0.573 0.571 -1.786 1.005
highf65 -0.4177 0.561 -0.744 0.463 -1.568 0.732
highc65 -2.2158 1.481 -1.496 0.146 -5.249 0.818
highcm65 0.2797 0.658 0.425 0.674 -1.069 1.628
highcf65 0.3921 0.766 0.512 0.613 -1.177 1.961
human65 2.3373 3.307 0.707 0.486 -4.437 9.112
humanm65 -1.2092 1.619 -0.747 0.461 -4.525 2.106
humanf65 -1.1039 1.685 -0.655 0.518 -4.555 2.347
hyr65 54.9139 23.887 2.299 0.029 5.983 103.845
hyrm65 12.9350 23.171 0.558 0.581 -34.529 60.400
hyrf65 9.0926 17.670 0.515 0.611 -27.102 45.287
no65 0.0372 0.132 0.282 0.780 -0.233 0.308
nom65 -0.0212 0.065 -0.326 0.747 -0.154 0.112
nof65 -0.0169 0.067 -0.252 0.803 -0.154 0.120
pinstab1 -0.0500 0.031 -1.616 0.117 -0.113 0.013
pop65 1.032e-07 1.32e-07 0.783 0.440 -1.67e-07 3.73e-07
worker65 0.0341 0.156 0.218 0.829 -0.286 0.354
pop1565 -0.4655 0.471 -0.988 0.332 -1.431 0.500
pop6565 -1.3575 0.635 -2.138 0.041 -2.658 -0.057
sec65 -0.0109 0.308 -0.035 0.972 -0.641 0.619
secm65 0.0033 0.151 0.022 0.983 -0.306 0.313
secf65 -0.0023 0.158 -0.015 0.988 -0.326 0.321
secc65 -0.4915 0.729 -0.674 0.506 -1.985 1.002
seccm65 0.2596 0.356 0.730 0.471 -0.469 0.988
seccf65 0.2207 0.373 0.591 0.559 -0.544 0.985
syr65 -0.7556 7.977 -0.095 0.925 -17.095 15.584
syrm65 0.3109 3.897 0.080 0.937 -7.671 8.293
syrf65 0.7593 4.111 0.185 0.855 -7.661 9.180
teapri65 3.955e-05 0.001 0.051 0.959 -0.002 0.002
teasec65 0.0002 0.001 0.213 0.833 -0.002 0.003
ex1 -0.5804 0.242 -2.400 0.023 -1.076 -0.085
im1 0.5914 0.250 2.363 0.025 0.079 1.104
xr65 -0.0001 5.42e-05 -1.916 0.066 -0.000 7.18e-06
tot1 -0.1279 0.113 -1.136 0.266 -0.359 0.103
==============================================================================
Omnibus: 0.439 Durbin-Watson: 1.982
Prob(Omnibus): 0.803 Jarque-Bera (JB): 0.417
Skew: 0.158 Prob(JB): 0.812
Kurtosis: 2.896 Cond. No. 7.52e+08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.52e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
# output: estimated regression coefficient corresponding to the target regressor
est_ols = reg_ols.summary2().tables[1]['Coef.']['gdpsh465']
# output: std. error
std_ols = reg_ols.summary2().tables[1]['Std.Err.']['gdpsh465']
# output: 95% confidence interval
lower_ci = reg_ols.summary2().tables[1]['[0.025']['gdpsh465']
upper_ci = reg_ols.summary2().tables[1]['0.975]']['gdpsh465']
10.4. Summarize OLS results#
from tabulate import tabulate
table_1 = np.zeros( (1, 4) )
table_1[0,0] = est_ols
table_1[0,1] = std_ols
table_1[0,2] = lower_ci
table_1[0,3] = upper_ci
table_1_pandas = pd.DataFrame( table_1, columns = [ "Estimator","Std. Error", "lower bound CI", "upper bound CI" ])
table_1_pandas.index = [ "OLS" ]
table_1_pandas
|       | Estimator | Std. Error | lower bound CI | upper bound CI |
|-------|-----------|------------|----------------|----------------|
| OLS   | -0.009378 | 0.029888   | -0.0706        | 0.051844       |
Least squares provides a rather noisy estimate (high standard error) of the speed of convergence, and does not allow us to answer the question about the convergence hypothesis since the confidence interval includes zero.
In contrast, we can use the partialling-out approach based on lasso regression (“Double Lasso”).
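Concretely, the partialling-out (Double Lasso) procedure that the following cells implement can be written as follows (one way to state the standard Frisch–Waugh–Lovell-style construction):

\[
\tilde{Y} = Y - \widehat{Y}^{\text{lasso}}(W), \qquad \tilde{D} = D - \widehat{D}^{\text{lasso}}(W),
\]

where \(\widehat{Y}^{\text{lasso}}(W)\) and \(\widehat{D}^{\text{lasso}}(W)\) are the Lasso (or post-Lasso) predictions from regressing \(Y\) on \(W\) and \(D\) on \(W\). The estimate \(\hat{\beta}_1\) is then the coefficient from the simple OLS regression of \(\tilde{Y}\) on \(\tilde{D}\), and standard OLS inference is applied to that final regression.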
# Create main variables
Y = growth['Outcome']
W = growth.drop(['Outcome', 'intercept', 'gdpsh465'], axis=1)
D = growth['gdpsh465']
10.5. Method 1 - Using Sklearn#
from sklearn import linear_model
# Set the penalty level for Lasso
lasso_model = linear_model.Lasso( alpha = 0.00077 )

# Partial out W from Y: residuals from the Lasso regression of Y on W
r_Y = Y - lasso_model.fit( W, Y ).predict( W )
r_Y = r_Y.rename('r_Y')

# Partial out W from D: residuals from the Lasso regression of D on W
r_D = D - lasso_model.fit( W, D ).predict( W )
r_D = r_D.rename('r_D')
# OLS regression of the residualized outcome on the residualized regressor
partial_lasso_fit = sm.OLS(r_Y, r_D).fit()

# output: estimated regression coefficient corresponding to the target regressor
est_lasso = partial_lasso_fit.summary2().tables[1]['Coef.']['r_D']
# output: std. error
std_lasso = partial_lasso_fit.summary2().tables[1]['Std.Err.']['r_D']
# output: 95% confidence interval
lower_ci_lasso = partial_lasso_fit.summary2().tables[1]['[0.025']['r_D']
upper_ci_lasso = partial_lasso_fit.summary2().tables[1]['0.975]']['r_D']

print( f"Coefficient for D via partialling-out using lasso {est_lasso}" )
Coefficient for D via partialling-out using lasso -0.04774655653302118
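The penalty level alpha = 0.00077 is fixed by hand above. As an illustrative alternative (a sketch only, not the procedure used in this chapter, and without the theoretical guarantees of the data-driven penalty used by rlasso below), the penalty could instead be chosen by cross-validation with scikit-learn's LassoCV:

from sklearn.linear_model import LassoCV

# Hypothetical alternative: choose the Lasso penalty by 10-fold cross-validation,
# separately for the Y-on-W and D-on-W regressions
cv_Y = LassoCV(cv=10, random_state=0).fit(W, Y)
cv_D = LassoCV(cv=10, random_state=0).fit(W, D)

# Residualize and re-run the final OLS step on the residuals
r_Y_cv = (Y - cv_Y.predict(W)).rename('r_Y_cv')
r_D_cv = (D - cv_D.predict(W)).rename('r_D_cv')
print(sm.OLS(r_Y_cv, r_D_cv).fit().summary2().tables[1])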
10.5.1. Summary LASSO results#
Finally, let us have a look at the results.
table_2 = np.zeros( (1, 4) )
table_2[0,0] = est_lasso
table_2[0,1] = std_lasso
table_2[0,2] = lower_ci_lasso
table_2[0,3] = upper_ci_lasso
table_2_pandas = pd.DataFrame( table_2, columns = [ "Estimator","Std. Error", "lower bound CI", "upper bound CI" ])
table_2_pandas.index = [ "LASSO" ]
table_2_pandas
|       | Estimator | Std. Error | lower bound CI | upper bound CI |
|-------|-----------|------------|----------------|----------------|
| LASSO | -0.047747 | 0.017705   | -0.082926      | -0.012567      |
table_3 = pd.concat([table_1_pandas, table_2_pandas])
table_3
|       | Estimator | Std. Error | lower bound CI | upper bound CI |
|-------|-----------|------------|----------------|----------------|
| OLS   | -0.009378 | 0.029888   | -0.070600      | 0.051844       |
| LASSO | -0.047747 | 0.017705   | -0.082926      | -0.012567      |
The least squares method provides a rather noisy estimate of the speed of convergence, so we cannot answer the question of whether poor countries grow faster than rich countries. Least squares does not work well when the ratio \(p/n\) is large.

In sharp contrast, partialling-out via Lasso provides a much more precise estimate. The Lasso-based point estimate is about \(-4.8\%\), and the \(95\%\) confidence interval for the (annual) rate of convergence, \([-8.3\%, -1.3\%]\), contains only negative numbers. This empirical evidence does support the convergence hypothesis.
10.6. Method 2 - HDMPY#
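# Residualize Y and D on W using rlasso with its data-driven penalty;
# post=True uses post-lasso, i.e. OLS refit on the controls selected by lasso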
res_Y = hdmpy.rlasso( W, Y, post=True ).est['residuals']
res_D = hdmpy.rlasso( W, D, post=True ).est['residuals']
r_Y = pd.DataFrame(res_Y, columns=['r_Y'])
r_D = pd.DataFrame(res_D, columns=['r_D'])
# OLS regression
reg_ols = sm.OLS(r_Y, r_D).fit()
print(reg_ols.summary())
OLS Regression Results
=======================================================================================
Dep. Variable: r_Y R-squared (uncentered): 0.127
Model: OLS Adj. R-squared (uncentered): 0.117
Method: Least Squares F-statistic: 12.92
Date: Fri, 22 Jul 2022 Prob (F-statistic): 0.000533
Time: 16:43:54 Log-Likelihood: 152.68
No. Observations: 90 AIC: -303.4
Df Residuals: 89 BIC: -300.9
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
r_D -0.0498 0.014 -3.594 0.001 -0.077 -0.022
==============================================================================
Omnibus: 4.622 Durbin-Watson: 1.510
Prob(Omnibus): 0.099 Jarque-Bera (JB): 5.807
Skew: 0.120 Prob(JB): 0.0548
Kurtosis: 4.221 Cond. No. 1.00
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# output: estimated regression coefficient corresponding to the target regressor
est_lasso = reg_ols.summary2().tables[1]['Coef.']['r_D']
# output: std. error
std_lasso = reg_ols.summary2().tables[1]['Std.Err.']['r_D']
# output: 95% confidence interval
lower_ci_lasso = reg_ols.summary2().tables[1]['[0.025']['r_D']
upper_ci_lasso = reg_ols.summary2().tables[1]['0.975]']['r_D']
table_3 = np.zeros( (1, 4) )
table_3[0,0] = est_lasso
table_3[0,1] = std_lasso
table_3[0,2] = lower_ci_lasso
table_3[0,3] = upper_ci_lasso
table_3_pandas = pd.DataFrame( table_3, columns = [ "Estimator","Std. Error", "lower bound CI", "upper bound CI" ])
table_3_pandas.index = [ "LASSO" ]
table_3_pandas
|       | Estimator | Std. Error | lower bound CI | upper bound CI |
|-------|-----------|------------|----------------|----------------|
| LASSO | -0.049811 | 0.013858   | -0.077347      | -0.022276      |
10.7. Method 3 - HDMPY Direct#
# rlassoEffect performs both lasso residualizations and the final OLS step in one call
lasso_direct = hdmpy.rlassoEffect(x=W, y=Y, d=D, method="partialling out")
est_lasso = lasso_direct["coefficients"]
std_lasso = lasso_direct["se"]
lower_ci_lasso = est_lasso - 1.96*std_lasso
upper_ci_lasso = est_lasso + 1.96*std_lasso
table_4 = np.zeros( (1, 4) )
table_4[0,0] = est_lasso
table_4[0,1] = std_lasso
table_4[0,2] = lower_ci_lasso
table_4[0,3] = upper_ci_lasso
table_4_pandas = pd.DataFrame( table_4, columns = [ "Estimator","Std. Error", "lower bound CI", "upper bound CI" ])
table_4_pandas.index = [ "LASSO_direct" ]
table_4_pandas
|              | Estimator | Std. Error | lower bound CI | upper bound CI |
|--------------|-----------|------------|----------------|----------------|
| LASSO_direct | -0.049811 | 0.015391   | -0.079978      | -0.019644      |
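Finally, we can stack the OLS and the three Lasso-based results into a single comparison table (a small sketch, assuming the tables created in the previous cells are still in memory):

# Collect all estimates from the cells above into one comparison table
summary_all = pd.concat([table_1_pandas, table_2_pandas, table_3_pandas, table_4_pandas])
summary_all.index = ["OLS", "LASSO (sklearn)", "LASSO (hdmpy residuals)", "LASSO (hdmpy direct)"]
print(summary_all)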