2. OLS and lasso for wage prediction#

This notebook contains an example for teaching.

2.1. Introduction#

In labor economics an important question is what determines the wage of workers. This is a causal question, but we can begin to investigate it from a predictive perspective.

In the following wage example, Y is the hourly wage of a worker and X is a vector of the worker's characteristics, e.g., education, experience, gender. Two main questions here are:

  • How can we use job-relevant characteristics, such as education and experience, to best predict wages?

  • What is the difference in predicted wages between men and women with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

2.2. Data#

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white, non-Hispanic individuals aged 25 to 64 who work more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural, or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with an hourly wage below $3.

The variable of interest \(Y\) is the hourly wage rate, constructed as the ratio of annual earnings to the total number of hours worked, which in turn is the product of the number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never-married) workers. The final sample is of size \(n = 5150\).
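The wage construction described above can be sketched as follows (the column names and values are hypothetical stand-ins for the CPS variables, not the actual survey field names):

```python
import pandas as pd

# Hypothetical stand-ins for the CPS fields used to build the wage variable
raw = pd.DataFrame({
    "annual_earnings": [52000.0, 30000.0],
    "weeks_worked":    [52, 50],
    "hours_per_week":  [40, 40],
})

# Total hours worked in the year = weeks worked * usual hours per week
total_hours = raw["weeks_worked"] * raw["hours_per_week"]

# Hourly wage = annual earnings / total hours worked in the year
raw["wage"] = raw["annual_earnings"] / total_hours
```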

2.3. Data analysis#

We start by loading the data set.

# Import relevant packages
import os
import warnings
from urllib.request import urlopen

import numpy as np
import pandas as pd
import pyreadr

warnings.filterwarnings('ignore')

# Download the .Rdata file, read it with pyreadr, then remove the local copy
link = "https://raw.githubusercontent.com/d2cml-ai/14.388_py/main/data/wage2015_subsample_inference.Rdata"
response = urlopen(link)
content = response.read()
with open('wage2015_subsample_inference.Rdata', 'wb') as fhandle:
    fhandle.write(content)
result = pyreadr.read_r("wage2015_subsample_inference.Rdata")
os.remove("wage2015_subsample_inference.Rdata")

# Extracting the data frame from rdata_read
data = result[ 'data' ]
data.shape
(5150, 20)

Let’s have a look at the structure of the data.

data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5150 entries, 10 to 32643
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   wage    5150 non-null   float64 
 1   lwage   5150 non-null   float64 
 2   sex     5150 non-null   float64 
 3   shs     5150 non-null   float64 
 4   hsg     5150 non-null   float64 
 5   scl     5150 non-null   float64 
 6   clg     5150 non-null   float64 
 7   ad      5150 non-null   float64 
 8   mw      5150 non-null   float64 
 9   so      5150 non-null   float64 
 10  we      5150 non-null   float64 
 11  ne      5150 non-null   float64 
 12  exp1    5150 non-null   float64 
 13  exp2    5150 non-null   float64 
 14  exp3    5150 non-null   float64 
 15  exp4    5150 non-null   float64 
 16  occ     5150 non-null   category
 17  occ2    5150 non-null   category
 18  ind     5150 non-null   category
 19  ind2    5150 non-null   category
dtypes: category(4), float64(16)
memory usage: 736.3+ KB
data.describe().T
       count       mean        std       min        25%        50%        75%         max
wage  5150.0  23.410410  21.003016  3.021978  13.461538  19.230769  27.777778  528.845673
lwage 5150.0   2.970787   0.570385  1.105912   2.599837   2.956512   3.324236    6.270697
sex   5150.0   0.444466   0.496955  0.000000   0.000000   0.000000   1.000000    1.000000
shs   5150.0   0.023301   0.150872  0.000000   0.000000   0.000000   0.000000    1.000000
hsg   5150.0   0.243883   0.429465  0.000000   0.000000   0.000000   0.000000    1.000000
scl   5150.0   0.278058   0.448086  0.000000   0.000000   0.000000   1.000000    1.000000
clg   5150.0   0.317670   0.465616  0.000000   0.000000   0.000000   1.000000    1.000000
ad    5150.0   0.137087   0.343973  0.000000   0.000000   0.000000   0.000000    1.000000
mw    5150.0   0.259612   0.438464  0.000000   0.000000   0.000000   1.000000    1.000000
so    5150.0   0.296505   0.456761  0.000000   0.000000   0.000000   1.000000    1.000000
we    5150.0   0.216117   0.411635  0.000000   0.000000   0.000000   0.000000    1.000000
ne    5150.0   0.227767   0.419432  0.000000   0.000000   0.000000   0.000000    1.000000
exp1  5150.0  13.760583  10.609465  0.000000   5.000000  10.000000  21.000000   47.000000
exp2  5150.0   3.018925   4.000904  0.000000   0.250000   1.000000   4.410000   22.090000
exp3  5150.0   8.235867  14.488962  0.000000   0.125000   1.000000   9.261000  103.823000
exp4  5150.0  25.118038  53.530225  0.000000   0.062500   1.000000  19.448100  487.968100

We construct the output variable \(Y\) (log hourly wage) and the matrix \(Z\), which includes the characteristics of workers given in the data.

Y = np.log(data['wage'])  # natural log of the hourly wage (matches 'lwage')
n = len(Y)
z = data.loc[:, ~data.columns.isin(['wage', 'lwage', 'Unnamed: 0'])]
p = z.shape[1]

print("Number of observations:", n, '\n')
print("Number of raw regressors:", p)
Number of observations: 5150 

Number of raw regressors: 18

For the outcome variable wage and a subset of the raw regressors, we calculate the empirical mean to get familiar with the data.

Z_subset = data.loc[:, data.columns.isin(["lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1"])]
table = Z_subset.mean(axis=0)
table
lwage     2.970787
sex       0.444466
shs       0.023301
hsg       0.243883
scl       0.278058
clg       0.317670
ad        0.137087
mw        0.259612
so        0.296505
we        0.216117
ne        0.227767
exp1     13.760583
dtype: float64
table = table.to_frame(name="Sample mean")  # Series -> one-column DataFrame

# Replace the short variable codes with descriptive labels
index1 = list(table.index)
index2 = ["Log Wage","Sex","Some High School","High School Graduate",\
          "Some College","College Graduate", "Advanced Degree","Midwest",\
          "South","West","Northeast","Experience"]
table = table.rename(index=dict(zip(index1,index2)))
table
                      Sample mean
Log Wage                 2.970787
Sex                      0.444466
Some High School         0.023301
High School Graduate     0.243883
Some College             0.278058
College Graduate         0.317670
Advanced Degree          0.137087
Midwest                  0.259612
South                    0.296505
West                     0.216117
Northeast                0.227767
Experience              13.760583

For example, the share of female workers in our sample is about 44% (\(sex=1\) if female).

Alternatively, we can also print the table as LaTeX.

2.4. Prediction Question#

Now, we will construct a prediction rule for hourly wage \(Y\), which depends linearly on job-relevant characteristics \(X\):

\[ \begin{align} Y = \beta'X+ \epsilon. \end{align} \]
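The least-squares estimator of \(\beta\) minimizes the sample average of squared errors; a minimal numpy sketch on simulated data (all names and values here are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated design: n observations, an intercept plus 3 regressors
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 0.5, -0.3, 0.2])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# OLS: solve the least-squares problem min_b ||y - X b||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With low noise and moderate n, `beta_hat` should be close to `beta_true`; statsmodels performs the same computation (plus inference) below.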

Our goals are

  • Predict wages using various characteristics of workers.

  • Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample \(R^2\) and the out-of-sample MSE and \(R^2\).
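The in-sample versions of these measures can be computed directly from the residuals; a minimal numpy sketch (the function name and the toy data are ours, not part of the lab):

```python
import numpy as np

def in_sample_performance(y, yhat, p):
    """Sample MSE, adjusted MSE, R^2 and adjusted R^2 for p regressors."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    n = len(y)
    mse = np.mean((y - yhat) ** 2)
    mse_adj = mse * n / (n - p)                # degrees-of-freedom correction
    tss = np.mean((y - y.mean()) ** 2)         # total variation of y
    r2 = 1 - mse / tss
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)  # adjusted R^2
    return mse, mse_adj, r2, r2_adj

# Toy illustration
y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.1, 1.9, 3.2, 3.8])
mse, mse_adj, r2, r2_adj = in_sample_performance(y, yhat, p=2)
```

The out-of-sample analogues use the same formulas, evaluated on data held out from estimation.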

We employ two different specifications for prediction:

  1. Basic Model: \(X\) consists of a set of raw regressors (e.g. gender, experience, education indicators, occupation and industry indicators, regional indicators).

  2. Flexible Model: \(X\) consists of all raw regressors from the basic model plus transformations of experience (e.g., \({exp}^2\), \({exp}^3\), \({exp}^4\)) and two-way interactions of these experience terms with the other regressors. An example of a regressor created through a two-way interaction is experience times the indicator of having a college degree.

Using the Flexible Model enables us to approximate the real relationship by a more complex regression function and therefore to reduce bias; it widens the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but are harder to interpret.
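Such interaction regressors can also be formed explicitly; a minimal pandas sketch (the toy values below are hypothetical):

```python
import pandas as pd

# Hypothetical toy values for experience and a college-degree indicator
df = pd.DataFrame({
    "exp1": [10.0, 5.0, 20.0],  # years of experience
    "clg":  [1.0, 0.0, 1.0],    # indicator of having a college degree
})

# Two-way interaction: experience times the college indicator.
# In a statsmodels formula, such terms (exp1:clg, exp2:clg, ...) are
# generated automatically by products like (exp1 + exp2)*(clg + ...).
df["exp1_x_clg"] = df["exp1"] * df["clg"]
```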

Now, let us fit both models to our data by running ordinary least squares (OLS):

# Import packages for OLS regression
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 1. basic model
basic = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data=data).fit()
print(basic_results.summary()) # estimated coefficients
print( "Number of regressors in the basic model:",len(basic_results.params), '\n')  # number of regressors in the Basic Model
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  lwage   R-squared:                       0.310
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     45.83
Date:                Wed, 03 Aug 2022   Prob (F-statistic):               0.00
Time:                        23:47:31   Log-Likelihood:                -3459.9
No. Observations:                5150   AIC:                             7022.
Df Residuals:                    5099   BIC:                             7356.
Df Model:                          50                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.5284      0.054     65.317      0.000       3.422       3.634
occ2[T.10]    -0.0106      0.040     -0.268      0.789      -0.088       0.067
occ2[T.11]    -0.4558      0.059     -7.669      0.000      -0.572      -0.339
occ2[T.12]    -0.3076      0.056     -5.541      0.000      -0.416      -0.199
occ2[T.13]    -0.3614      0.046     -7.937      0.000      -0.451      -0.272
occ2[T.14]    -0.4995      0.051     -9.867      0.000      -0.599      -0.400
occ2[T.15]    -0.4645      0.052     -8.973      0.000      -0.566      -0.363
occ2[T.16]    -0.2337      0.032     -7.206      0.000      -0.297      -0.170
occ2[T.17]    -0.4126      0.028    -14.784      0.000      -0.467      -0.358
occ2[T.18]    -0.3404      0.197     -1.731      0.083      -0.726       0.045
occ2[T.19]    -0.2415      0.049     -4.880      0.000      -0.338      -0.144
occ2[T.2]     -0.0765      0.034     -2.236      0.025      -0.144      -0.009
occ2[T.20]    -0.2126      0.041     -5.201      0.000      -0.293      -0.132
occ2[T.21]    -0.2884      0.038     -7.573      0.000      -0.363      -0.214
occ2[T.22]    -0.4224      0.041    -10.187      0.000      -0.504      -0.341
occ2[T.3]     -0.0347      0.039     -0.895      0.371      -0.111       0.041
occ2[T.4]     -0.0962      0.052     -1.853      0.064      -0.198       0.006
occ2[T.5]     -0.1879      0.060     -3.111      0.002      -0.306      -0.070
occ2[T.6]     -0.4149      0.050     -8.263      0.000      -0.513      -0.316
occ2[T.7]     -0.0460      0.057     -0.814      0.416      -0.157       0.065
occ2[T.8]     -0.3778      0.044     -8.601      0.000      -0.464      -0.292
occ2[T.9]     -0.2158      0.046     -4.678      0.000      -0.306      -0.125
ind2[T.11]     0.0248      0.058      0.427      0.669      -0.089       0.139
ind2[T.12]     0.1164      0.053      2.201      0.028       0.013       0.220
ind2[T.13]     0.0212      0.068      0.312      0.755      -0.112       0.155
ind2[T.14]     0.0068      0.050      0.136      0.892      -0.092       0.106
ind2[T.15]    -0.1315      0.137     -0.959      0.338      -0.400       0.137
ind2[T.16]    -0.1215      0.056     -2.177      0.029      -0.231      -0.012
ind2[T.17]    -0.1106      0.056     -1.987      0.047      -0.220      -0.001
ind2[T.18]    -0.1415      0.051     -2.774      0.006      -0.242      -0.042
ind2[T.19]    -0.1803      0.065     -2.761      0.006      -0.308      -0.052
ind2[T.2]      0.1939      0.084      2.301      0.021       0.029       0.359
ind2[T.20]    -0.3581      0.057     -6.332      0.000      -0.469      -0.247
ind2[T.21]    -0.1228      0.055     -2.239      0.025      -0.230      -0.015
ind2[T.22]     0.0749      0.053      1.407      0.160      -0.029       0.179
ind2[T.3]      0.0770      0.079      0.972      0.331      -0.078       0.232
ind2[T.4]     -0.0506      0.058     -0.874      0.382      -0.164       0.063
ind2[T.5]     -0.0797      0.056     -1.432      0.152      -0.189       0.029
ind2[T.6]     -0.0555      0.052     -1.072      0.284      -0.157       0.046
ind2[T.7]      0.0543      0.072      0.756      0.450      -0.086       0.195
ind2[T.8]     -0.0491      0.072     -0.679      0.497      -0.191       0.093
ind2[T.9]     -0.1936      0.048     -4.017      0.000      -0.288      -0.099
sex           -0.0729      0.015     -4.848      0.000      -0.102      -0.043
exp1           0.0086      0.001     13.106      0.000       0.007       0.010
shs           -0.5928      0.051    -11.726      0.000      -0.692      -0.494
hsg           -0.5043      0.027    -18.626      0.000      -0.557      -0.451
scl           -0.4120      0.025    -16.347      0.000      -0.461      -0.363
clg           -0.1822      0.023     -7.939      0.000      -0.227      -0.137
mw            -0.0275      0.019     -1.425      0.154      -0.065       0.010
so            -0.0345      0.019     -1.842      0.066      -0.071       0.002
we             0.0172      0.020      0.859      0.391      -0.022       0.057
==============================================================================
Omnibus:                      437.645   Durbin-Watson:                   1.885
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1862.313
Skew:                           0.322   Prob(JB):                         0.00
Kurtosis:                       5.875   Cond. No.                         541.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Number of regressors in the basic model: 51 

2.4.1. Note that the basic model consists of \(51\) regressors.#

# 2. flexible model
flex = 'lwage ~ sex + shs+hsg+scl+clg+occ2+ind2+mw+so+we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
flex_results_0 = smf.ols(flex, data=data)   # unfitted model object, kept for later use
flex_results = flex_results_0.fit()
print(flex_results.summary()) # estimated coefficients
print("Number of regressors in the flexible model:", len(flex_results.params), '\n') # number of regressors in the Flexible Model
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  lwage   R-squared:                       0.351
Model:                            OLS   Adj. R-squared:                  0.319
Method:                 Least Squares   F-statistic:                     10.83
Date:                Wed, 03 Aug 2022   Prob (F-statistic):          2.69e-305
Time:                        23:48:33   Log-Likelihood:                -3301.9
No. Observations:                5150   AIC:                             7096.
Df Residuals:                    4904   BIC:                             8706.
Df Model:                         245                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           3.2797      0.284     11.540      0.000       2.723       3.837
occ2[T.10]          0.0210      0.156      0.134      0.893      -0.286       0.328
occ2[T.11]         -0.6424      0.309     -2.078      0.038      -1.248      -0.036
occ2[T.12]         -0.0675      0.252     -0.268      0.789      -0.562       0.427
occ2[T.13]         -0.2330      0.232     -1.006      0.314      -0.687       0.221
occ2[T.14]          0.2562      0.323      0.794      0.427      -0.376       0.889
occ2[T.15]         -0.1939      0.260     -0.747      0.455      -0.703       0.315
occ2[T.16]         -0.0551      0.147     -0.375      0.708      -0.343       0.233
occ2[T.17]         -0.4156      0.136     -3.053      0.002      -0.682      -0.149
occ2[T.18]         -0.4822      1.044     -0.462      0.644      -2.530       1.565
occ2[T.19]         -0.2579      0.333     -0.776      0.438      -0.910       0.394
occ2[T.2]           0.1613      0.130      1.244      0.214      -0.093       0.416
occ2[T.20]         -0.3010      0.234     -1.286      0.199      -0.760       0.158
occ2[T.21]         -0.4272      0.221     -1.936      0.053      -0.860       0.005
occ2[T.22]         -0.8695      0.298     -2.922      0.003      -1.453      -0.286
occ2[T.3]           0.2102      0.169      1.246      0.213      -0.121       0.541
occ2[T.4]           0.0709      0.184      0.386      0.700      -0.289       0.431
occ2[T.5]          -0.3960      0.189     -2.100      0.036      -0.766      -0.026
occ2[T.6]          -0.2311      0.187     -1.236      0.217      -0.598       0.135
occ2[T.7]           0.3147      0.194      1.621      0.105      -0.066       0.695
occ2[T.8]          -0.1875      0.169     -1.108      0.268      -0.519       0.144
occ2[T.9]          -0.3390      0.167     -2.027      0.043      -0.667      -0.011
ind2[T.11]         -0.4047      0.314     -1.288      0.198      -1.021       0.211
ind2[T.12]         -0.1570      0.279     -0.562      0.574      -0.705       0.391
ind2[T.13]         -0.4377      0.362     -1.210      0.226      -1.147       0.271
ind2[T.14]         -0.0054      0.270     -0.020      0.984      -0.535       0.524
ind2[T.15]          0.2004      0.501      0.400      0.689      -0.781       1.182
ind2[T.16]          0.0102      0.300      0.034      0.973      -0.577       0.598
ind2[T.17]         -0.2396      0.285     -0.841      0.401      -0.798       0.319
ind2[T.18]         -0.1808      0.278     -0.649      0.516      -0.727       0.365
ind2[T.19]         -0.3007      0.327     -0.921      0.357      -0.941       0.339
ind2[T.2]           0.5806      0.481      1.207      0.227      -0.362       1.523
ind2[T.20]         -0.3293      0.313     -1.052      0.293      -0.943       0.284
ind2[T.21]         -0.1781      0.304     -0.586      0.558      -0.773       0.417
ind2[T.22]          0.1765      0.296      0.596      0.551      -0.404       0.757
ind2[T.3]          -0.6668      0.561     -1.189      0.234      -1.766       0.433
ind2[T.4]           0.4858      0.350      1.389      0.165      -0.200       1.171
ind2[T.5]           0.0512      0.298      0.172      0.863      -0.532       0.635
ind2[T.6]          -0.0416      0.299     -0.139      0.889      -0.628       0.545
ind2[T.7]           0.0758      0.387      0.196      0.845      -0.683       0.835
ind2[T.8]          -0.1490      0.337     -0.441      0.659      -0.811       0.513
ind2[T.9]          -0.2219      0.275     -0.808      0.419      -0.760       0.316
sex                -0.0696      0.015     -4.570      0.000      -0.099      -0.040
shs                -0.1233      0.907     -0.136      0.892      -1.901       1.654
hsg                -0.5289      0.198     -2.675      0.008      -0.917      -0.141
scl                -0.2921      0.126     -2.318      0.021      -0.539      -0.045
clg                -0.0412      0.070     -0.585      0.559      -0.179       0.097
mw                  0.1107      0.081      1.359      0.174      -0.049       0.270
so                  0.0224      0.074      0.301      0.763      -0.123       0.168
we                 -0.0216      0.084     -0.256      0.798      -0.187       0.143
exp1                0.0835      0.094      0.884      0.377      -0.102       0.269
exp1:occ2[T.10]     0.0076      0.058      0.130      0.897      -0.106       0.122
exp1:occ2[T.11]     0.1014      0.101      1.009      0.313      -0.096       0.298
exp1:occ2[T.12]    -0.0863      0.087     -0.986      0.324      -0.258       0.085
exp1:occ2[T.13]     0.0067      0.076      0.088      0.930      -0.143       0.156
exp1:occ2[T.14]    -0.1369      0.097     -1.405      0.160      -0.328       0.054
exp1:occ2[T.15]    -0.0400      0.090     -0.445      0.656      -0.216       0.136
exp1:occ2[T.16]    -0.0539      0.052     -1.035      0.301      -0.156       0.048
exp1:occ2[T.17]     0.0147      0.047      0.315      0.753      -0.077       0.106
exp1:occ2[T.18]     0.1074      0.472      0.228      0.820      -0.818       1.032
exp1:occ2[T.19]     0.0047      0.106      0.044      0.965      -0.203       0.213
exp1:occ2[T.2]     -0.0736      0.050     -1.469      0.142      -0.172       0.025
exp1:occ2[T.20]     0.0243      0.074      0.327      0.744      -0.121       0.170
exp1:occ2[T.21]     0.0792      0.070      1.136      0.256      -0.057       0.216
exp1:occ2[T.22]     0.1093      0.088      1.241      0.215      -0.063       0.282
exp1:occ2[T.3]     -0.0715      0.064     -1.121      0.262      -0.197       0.054
exp1:occ2[T.4]     -0.0724      0.075     -0.968      0.333      -0.219       0.074
exp1:occ2[T.5]      0.0947      0.079      1.192      0.233      -0.061       0.250
exp1:occ2[T.6]     -0.0349      0.071     -0.490      0.624      -0.175       0.105
exp1:occ2[T.7]     -0.2279      0.078     -2.904      0.004      -0.382      -0.074
exp1:occ2[T.8]     -0.0727      0.065     -1.126      0.260      -0.199       0.054
exp1:occ2[T.9]      0.0274      0.067      0.409      0.682      -0.104       0.159
exp1:ind2[T.11]     0.1662      0.106      1.570      0.116      -0.041       0.374
exp1:ind2[T.12]     0.1079      0.093      1.156      0.248      -0.075       0.291
exp1:ind2[T.13]     0.1884      0.117      1.605      0.109      -0.042       0.418
exp1:ind2[T.14]    -0.0071      0.089     -0.080      0.936      -0.182       0.168
exp1:ind2[T.15]    -0.2081      0.203     -1.024      0.306      -0.607       0.190
exp1:ind2[T.16]    -0.0665      0.099     -0.671      0.502      -0.261       0.128
exp1:ind2[T.17]     0.0216      0.095      0.229      0.819      -0.164       0.207
exp1:ind2[T.18]     0.0053      0.091      0.058      0.954      -0.172       0.183
exp1:ind2[T.19]     0.0004      0.111      0.003      0.997      -0.217       0.217
exp1:ind2[T.2]     -0.1513      0.164     -0.920      0.358      -0.474       0.171
exp1:ind2[T.20]    -0.0186      0.102     -0.182      0.855      -0.219       0.182
exp1:ind2[T.21]     0.0678      0.101      0.673      0.501      -0.130       0.265
exp1:ind2[T.22]    -0.0367      0.096     -0.380      0.704      -0.226       0.152
exp1:ind2[T.3]      0.3246      0.188      1.722      0.085      -0.045       0.694
exp1:ind2[T.4]     -0.1365      0.114     -1.194      0.233      -0.361       0.088
exp1:ind2[T.5]     -0.0256      0.096     -0.265      0.791      -0.215       0.163
exp1:ind2[T.6]      0.0028      0.096      0.029      0.977      -0.185       0.191
exp1:ind2[T.7]     -0.0483      0.133     -0.364      0.716      -0.309       0.212
exp1:ind2[T.8]      0.0845      0.118      0.715      0.475      -0.147       0.316
exp1:ind2[T.9]     -0.0153      0.087     -0.176      0.860      -0.186       0.156
exp2               -0.5610      0.949     -0.591      0.554      -2.421       1.299
exp2:occ2[T.10]    -0.2692      0.641     -0.420      0.674      -1.525       0.986
exp2:occ2[T.11]    -1.0817      1.006     -1.075      0.282      -3.053       0.890
exp2:occ2[T.12]     0.8324      0.934      0.891      0.373      -0.999       2.664
exp2:occ2[T.13]    -0.2210      0.773     -0.286      0.775      -1.736       1.294
exp2:occ2[T.14]     0.7511      0.927      0.810      0.418      -1.067       2.569
exp2:occ2[T.15]    -0.0327      0.941     -0.035      0.972      -1.877       1.812
exp2:occ2[T.16]     0.3636      0.551      0.660      0.509      -0.717       1.444
exp2:occ2[T.17]    -0.2659      0.486     -0.547      0.584      -1.219       0.687
exp2:occ2[T.18]    -2.5609      5.170     -0.495      0.620     -12.697       7.575
exp2:occ2[T.19]    -0.1292      1.062     -0.122      0.903      -2.211       1.952
exp2:occ2[T.2]      0.6632      0.552      1.201      0.230      -0.420       1.746
exp2:occ2[T.20]    -0.3323      0.723     -0.460      0.646      -1.750       1.085
exp2:occ2[T.21]    -0.9100      0.685     -1.328      0.184      -2.254       0.434
exp2:occ2[T.22]    -0.8551      0.828     -1.033      0.302      -2.478       0.768
exp2:occ2[T.3]      0.6415      0.710      0.903      0.366      -0.751       2.034
exp2:occ2[T.4]      0.9748      0.866      1.126      0.260      -0.722       2.672
exp2:occ2[T.5]     -0.9779      0.974     -1.004      0.315      -2.887       0.931
exp2:occ2[T.6]      0.1051      0.800      0.131      0.896      -1.464       1.674
exp2:occ2[T.7]      3.1407      0.939      3.345      0.001       1.300       4.981
exp2:occ2[T.8]      0.6711      0.719      0.933      0.351      -0.739       2.081
exp2:occ2[T.9]      0.0232      0.763      0.030      0.976      -1.472       1.519
exp2:ind2[T.11]    -1.6804      1.080     -1.555      0.120      -3.798       0.438
exp2:ind2[T.12]    -0.9717      0.935     -1.039      0.299      -2.804       0.861
exp2:ind2[T.13]    -1.7679      1.156     -1.529      0.126      -4.035       0.499
exp2:ind2[T.14]     0.1190      0.888      0.134      0.893      -1.622       1.860
exp2:ind2[T.15]     2.3885      2.202      1.085      0.278      -1.928       6.705
exp2:ind2[T.16]     0.8707      0.990      0.879      0.379      -1.070       2.812
exp2:ind2[T.17]    -0.0030      0.948     -0.003      0.998      -1.862       1.856
exp2:ind2[T.18]    -0.0033      0.889     -0.004      0.997      -1.746       1.740
exp2:ind2[T.19]     0.2665      1.119      0.238      0.812      -1.927       2.460
exp2:ind2[T.2]      2.1973      1.774      1.239      0.216      -1.280       5.675
exp2:ind2[T.20]     0.2506      1.010      0.248      0.804      -1.729       2.231
exp2:ind2[T.21]    -0.9154      1.015     -0.902      0.367      -2.904       1.074
exp2:ind2[T.22]     0.3395      0.949      0.358      0.721      -1.522       2.201
exp2:ind2[T.3]     -3.7396      1.959     -1.908      0.056      -7.581       0.102
exp2:ind2[T.4]      1.0920      1.143      0.955      0.339      -1.149       3.333
exp2:ind2[T.5]      0.1824      0.939      0.194      0.846      -1.658       2.023
exp2:ind2[T.6]     -0.0304      0.930     -0.033      0.974      -1.853       1.792
exp2:ind2[T.7]      0.7325      1.440      0.509      0.611      -2.090       3.555
exp2:ind2[T.8]     -0.7507      1.203     -0.624      0.533      -3.108       1.607
exp2:ind2[T.9]      0.4177      0.839      0.498      0.619      -1.227       2.062
exp3                0.1293      0.357      0.363      0.717      -0.570       0.828
exp3:occ2[T.10]     0.1855      0.258      0.720      0.471      -0.319       0.690
exp3:occ2[T.11]     0.3932      0.382      1.030      0.303      -0.355       1.142
exp3:occ2[T.12]    -0.2203      0.366     -0.602      0.547      -0.938       0.497
exp3:occ2[T.13]     0.0950      0.290      0.327      0.744      -0.474       0.664
exp3:occ2[T.14]    -0.1444      0.334     -0.432      0.666      -0.800       0.511
exp3:occ2[T.15]     0.1477      0.365      0.405      0.685      -0.567       0.862
exp3:occ2[T.16]    -0.0379      0.215     -0.176      0.860      -0.460       0.384
exp3:occ2[T.17]     0.1510      0.188      0.804      0.421      -0.217       0.519
exp3:occ2[T.18]     1.4084      1.885      0.747      0.455      -2.287       5.104
exp3:occ2[T.19]     0.0923      0.404      0.228      0.819      -0.700       0.885
exp3:occ2[T.2]     -0.2039      0.221     -0.922      0.356      -0.637       0.230
exp3:occ2[T.20]     0.1807      0.265      0.681      0.496      -0.339       0.701
exp3:occ2[T.21]     0.3779      0.255      1.480      0.139      -0.123       0.878
exp3:occ2[T.22]     0.2855      0.298      0.957      0.339      -0.300       0.871
exp3:occ2[T.3]     -0.2370      0.287     -0.826      0.409      -0.800       0.326
exp3:occ2[T.4]     -0.4367      0.352     -1.241      0.215      -1.127       0.253
exp3:occ2[T.5]      0.3885      0.412      0.943      0.346      -0.419       1.196
exp3:occ2[T.6]      0.0485      0.329      0.147      0.883      -0.597       0.694
exp3:occ2[T.7]     -1.3949      0.405     -3.444      0.001      -2.189      -0.601
exp3:occ2[T.8]     -0.2054      0.290     -0.709      0.478      -0.773       0.362
exp3:occ2[T.9]     -0.0910      0.314     -0.289      0.772      -0.707       0.525
exp3:ind2[T.11]     0.6429      0.411      1.564      0.118      -0.163       1.449
exp3:ind2[T.12]     0.3286      0.347      0.947      0.344      -0.351       1.009
exp3:ind2[T.13]     0.5929      0.426      1.392      0.164      -0.242       1.428
exp3:ind2[T.14]    -0.0285      0.328     -0.087      0.931      -0.672       0.615
exp3:ind2[T.15]    -0.8569      0.844     -1.016      0.310      -2.511       0.797
exp3:ind2[T.16]    -0.3558      0.368     -0.968      0.333      -1.077       0.365
exp3:ind2[T.17]    -0.0363      0.352     -0.103      0.918      -0.727       0.654
exp3:ind2[T.18]     0.0157      0.325      0.048      0.961      -0.621       0.653
exp3:ind2[T.19]    -0.1488      0.421     -0.354      0.724      -0.974       0.676
exp3:ind2[T.2]     -1.0448      0.707     -1.479      0.139      -2.430       0.341
exp3:ind2[T.20]    -0.0679      0.370     -0.183      0.855      -0.794       0.658
exp3:ind2[T.21]     0.3967      0.381      1.041      0.298      -0.351       1.144
exp3:ind2[T.22]    -0.0760      0.349     -0.218      0.827      -0.760       0.608
exp3:ind2[T.3]      1.6218      0.785      2.067      0.039       0.084       3.160
exp3:ind2[T.4]     -0.3150      0.429     -0.735      0.463      -1.155       0.526
exp3:ind2[T.5]     -0.0506      0.340     -0.149      0.882      -0.717       0.616
exp3:ind2[T.6]      0.0193      0.336      0.058      0.954      -0.639       0.678
exp3:ind2[T.7]     -0.3359      0.586     -0.573      0.567      -1.485       0.813
exp3:ind2[T.8]      0.1893      0.452      0.419      0.675      -0.696       1.075
exp3:ind2[T.9]     -0.2161      0.300     -0.721      0.471      -0.804       0.372
exp4               -0.0053      0.045     -0.120      0.904      -0.093       0.082
exp4:occ2[T.10]    -0.0333      0.034     -0.984      0.325      -0.100       0.033
exp4:occ2[T.11]    -0.0466      0.048     -0.973      0.331      -0.141       0.047
exp4:occ2[T.12]     0.0110      0.047      0.234      0.815      -0.081       0.103
exp4:occ2[T.13]    -0.0137      0.036     -0.381      0.703      -0.084       0.057
exp4:occ2[T.14]     0.0056      0.040      0.139      0.890      -0.073       0.084
exp4:occ2[T.15]    -0.0327      0.046     -0.708      0.479      -0.123       0.058
exp4:occ2[T.16]    -0.0090      0.028     -0.325      0.745      -0.063       0.045
exp4:occ2[T.17]    -0.0257      0.024     -1.073      0.283      -0.073       0.021
exp4:occ2[T.18]    -0.2121      0.220     -0.963      0.336      -0.644       0.220
exp4:occ2[T.19]    -0.0169      0.051     -0.330      0.741      -0.118       0.084
exp4:occ2[T.2]      0.0176      0.029      0.610      0.542      -0.039       0.074
exp4:occ2[T.20]    -0.0296      0.032     -0.916      0.360      -0.093       0.034
exp4:occ2[T.21]    -0.0525      0.032     -1.654      0.098      -0.115       0.010
exp4:occ2[T.22]    -0.0351      0.036     -0.972      0.331      -0.106       0.036
exp4:occ2[T.3]      0.0303      0.038      0.805      0.421      -0.044       0.104
exp4:occ2[T.4]      0.0584      0.046      1.276      0.202      -0.031       0.148
exp4:occ2[T.5]     -0.0515      0.055     -0.938      0.349      -0.159       0.056
exp4:occ2[T.6]     -0.0170      0.044     -0.386      0.699      -0.103       0.069
exp4:occ2[T.7]      0.1905      0.056      3.410      0.001       0.081       0.300
exp4:occ2[T.8]      0.0197      0.038      0.518      0.604      -0.055       0.094
exp4:occ2[T.9]      0.0190      0.042      0.451      0.652      -0.064       0.102
exp4:ind2[T.11]    -0.0840      0.052     -1.619      0.106      -0.186       0.018
exp4:ind2[T.12]    -0.0390      0.042     -0.918      0.359      -0.122       0.044
exp4:ind2[T.13]    -0.0673      0.052     -1.297      0.195      -0.169       0.034
exp4:ind2[T.14] -6.822e-05      0.040     -0.002      0.999      -0.079       0.078
exp4:ind2[T.15]     0.0951      0.104      0.911      0.362      -0.110       0.300
exp4:ind2[T.16]     0.0439      0.045      0.970      0.332      -0.045       0.132
exp4:ind2[T.17]     0.0055      0.043      0.129      0.898      -0.079       0.090
exp4:ind2[T.18]    -0.0063      0.039     -0.161      0.872      -0.084       0.071
exp4:ind2[T.19]     0.0213      0.052      0.407      0.684      -0.081       0.124
exp4:ind2[T.2]      0.1483      0.092      1.615      0.106      -0.032       0.328
exp4:ind2[T.20]     0.0014      0.045      0.032      0.975      -0.087       0.089
exp4:ind2[T.21]    -0.0550      0.047     -1.160      0.246      -0.148       0.038
exp4:ind2[T.22]     0.0002      0.042      0.004      0.996      -0.083       0.083
exp4:ind2[T.3]     -0.2369      0.107     -2.220      0.026      -0.446      -0.028
exp4:ind2[T.4]      0.0273      0.053      0.511      0.609      -0.077       0.132
exp4:ind2[T.5]      0.0042      0.041      0.103      0.918      -0.076       0.084
exp4:ind2[T.6]     -0.0043      0.040     -0.108      0.914      -0.083       0.075
exp4:ind2[T.7]      0.0481      0.078      0.613      0.540      -0.106       0.202
exp4:ind2[T.8]     -0.0127      0.056     -0.226      0.821      -0.123       0.097
exp4:ind2[T.9]      0.0305      0.035      0.861      0.389      -0.039       0.100
exp1:shs           -0.1920      0.196     -0.982      0.326      -0.575       0.191
exp1:hsg           -0.0173      0.057     -0.303      0.762      -0.130       0.095
exp1:scl           -0.0665      0.043     -1.532      0.126      -0.151       0.019
exp1:clg           -0.0550      0.031     -1.774      0.076      -0.116       0.006
exp1:mw            -0.0280      0.030     -0.944      0.345      -0.086       0.030
exp1:so            -0.0100      0.027     -0.374      0.709      -0.062       0.042
exp1:we             0.0063      0.030      0.209      0.834      -0.053       0.065
exp2:shs            1.9005      1.450      1.310      0.190      -0.943       4.744
exp2:hsg            0.1172      0.551      0.213      0.832      -0.963       1.197
exp2:scl            0.6218      0.463      1.343      0.179      -0.286       1.529
exp2:clg            0.4097      0.380      1.077      0.281      -0.336       1.155
exp2:mw             0.2006      0.317      0.632      0.527      -0.421       0.823
exp2:so             0.0544      0.282      0.193      0.847      -0.498       0.606
exp2:we             0.0013      0.321      0.004      0.997      -0.628       0.630
exp3:shs           -0.6721      0.443     -1.518      0.129      -1.540       0.196
exp3:hsg           -0.0180      0.208     -0.086      0.931      -0.426       0.390
exp3:scl           -0.1998      0.186     -1.077      0.282      -0.563       0.164
exp3:clg           -0.1025      0.164     -0.624      0.533      -0.425       0.220
exp3:mw            -0.0626      0.124     -0.504      0.614      -0.306       0.181
exp3:so            -0.0116      0.108     -0.107      0.915      -0.224       0.201
exp3:we            -0.0125      0.125     -0.100      0.921      -0.258       0.233
exp4:shs            0.0777      0.048      1.635      0.102      -0.015       0.171
exp4:hsg            0.0005      0.027      0.018      0.985      -0.052       0.053
exp4:scl            0.0211      0.025      0.859      0.390      -0.027       0.069
exp4:clg            0.0079      0.023      0.346      0.729      -0.037       0.052
exp4:mw             0.0062      0.016      0.393      0.694      -0.025       0.037
exp4:so             0.0003      0.014      0.023      0.982      -0.026       0.027
exp4:we             0.0018      0.016      0.111      0.912      -0.030       0.033
==============================================================================
Omnibus:                      395.012   Durbin-Watson:                   1.898
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1529.250
Skew:                           0.303   Prob(JB):                         0.00
Kurtosis:                       5.600   Cond. No.                     6.87e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.87e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Number of regressors in the flexible model: 246 

Note that the flexible model consists of \(246\) regressors.

2.5. Try Lasso next#

# Import relevant packages for lasso 
from sklearn.linear_model import LassoCV
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
# Get exogenous variables from flexible model
X = flex_results_0.exog
X.shape
(5150, 246)
# Set endogenous variable
lwage = data["lwage"]
lwage.shape
(5150,)
# Set penalty value
alpha = 0.1
#reg = linear_model.Lasso(alpha=0.1/np.log(len(lwage)))
reg = linear_model.Lasso(alpha=alpha)

# LASSO regression for flexible model
reg.fit(X, lwage)
lwage_lasso_fitted = reg.predict(X)

# coefficients 
reg.coef_
print('Lasso Regression: R^2 score', reg.score(X, lwage))
Lasso Regression: R^2 score 0.1604784962552065
# Check predicted values
lwage_lasso_fitted
array([3.03889357, 3.27908164, 2.82185985, ..., 3.07156184, 2.85967102,
       3.21586563])
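The penalty value \(\alpha = 0.1\) above is fixed by hand. As a sketch of a data-driven alternative, scikit-learn's `LassoCV` (imported above but not used) selects \(\alpha\) by cross-validation over a grid; the example below runs on synthetic data, so all variable names (`X_demo`, `y_demo`, `beta`) are illustrative, not the notebook's:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# synthetic design with a sparse true signal: only 3 of 10 coefficients matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[:3] = [1.0, -2.0, 0.5]
y_demo = X_demo @ beta + rng.normal(scale=0.5, size=200)

# LassoCV picks the penalty by 5-fold cross-validation over an alpha grid
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)
alpha_hat = lasso_cv.alpha_                 # data-driven penalty level
n_selected = np.sum(lasso_cv.coef_ != 0)    # number of nonzero coefficients
```

With a strong sparse signal, the cross-validated fit keeps the relevant regressors and zeroes out most irrelevant ones.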

Now, we can evaluate the performance of both models based on the (adjusted) \(R^2_{sample}\) and the (adjusted) \(MSE_{sample}\):
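Concretely, the adjusted measures computed in the code below penalize the number of regressors \(p\) given the sample size \(n\):

\[
R^2_{adjusted} = 1 - (1 - R^2_{sample})\frac{n-1}{n-p-1}, \qquad
MSE_{adjusted} = \frac{n}{n-p}\,MSE_{sample}.
\]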

# Basic Model
basic = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data=data).fit()

# Flexible model 
flex = 'lwage ~ sex + shs+hsg+scl+clg+occ2+ind2+mw+so+we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
flex_results = smf.ols(flex , data=data).fit()
# Assess the predictive performance
R2_1 = basic_results.rsquared
print("R-squared for the basic model: ", R2_1, "\n")
R2_adj1 = basic_results.rsquared_adj
print("adjusted R-squared for the basic model: ", R2_adj1, "\n")


R2_2 = flex_results.rsquared
print("R-squared for the flexible model: ", R2_2, "\n")
R2_adj2 = flex_results.rsquared_adj
print("adjusted R-squared for the flexible model: ", R2_adj2, "\n")

R2_L = reg.score(flex_results_0.exog, lwage)
print("R-squared for LASSO: ", R2_L, "\n")
R2_adjL = 1 - (1-R2_L)*(len(lwage)-1)/(len(lwage)-X.shape[1]-1)
print("adjusted R-squared for LASSO: ", R2_adjL, "\n")
R-squared for the basic model:  0.31004650692219504 

adjusted R-squared for the basic model:  0.3032809304064292 

R-squared for the flexible model:  0.3511098950617231 

adjusted R-squared for the flexible model:  0.31869185352218843 

R-squared for LASSO:  0.1604784962552065 

adjusted R-squared for LASSO:  0.11835687889415825 
# calculating the MSE
MSE1 =  np.mean(basic_results.resid**2)
print("MSE for the basic model: ", MSE1, "\n")
p1 = len(basic_results.params) # number of regressors
n = len(lwage)
MSE_adj1  = (n/(n-p1))*MSE1
print("adjusted MSE for the basic model: ", MSE_adj1, "\n")

MSE2 =  np.mean(flex_results.resid**2)
print("MSE for the flexible model: ", MSE2, "\n")
p2 = len(flex_results.params) # number of regressors
n = len(lwage)
MSE_adj2  = (n/(n-p2))*MSE2
print("adjusted MSE for the flexible model: ", MSE_adj2, "\n")


MSEL = mean_squared_error(lwage, lwage_lasso_fitted)
print("MSE for the LASSO model: ", MSEL, "\n")
pL = reg.coef_.shape[0] # number of regressors
n = len(lwage)
MSE_adjL  = (n/(n-pL))*MSEL
print("adjusted MSE for LASSO model: ", MSE_adjL, "\n")
MSE for the basic model:  0.22442505581164474 

adjusted MSE for the basic model:  0.22666974650519128 

MSE for the flexible model:  0.21106813644318256 

adjusted MSE for the flexible model:  0.22165597526149883 

MSE for the LASSO model:  0.273075884423059 

adjusted MSE for LASSO model:  0.2867742260968095 
# Package for latex table 
import array_to_latex as a2l

table = np.zeros((3, 5))
table[0,0:5] = [p1, R2_1, MSE1, R2_adj1, MSE_adj1]
table[1,0:5] = [p2, R2_2, MSE2, R2_adj2, MSE_adj2]
table[2,0:5] = [pL, R2_L, MSEL, R2_adjL, MSE_adjL]
table = pd.DataFrame(table, columns = ["p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$"], \
                      index = ["basic reg","flexible reg", "lasso flex"])
table
|              |     p | $R^2_{sample}$ | $MSE_{sample}$ | $R^2_{adjusted}$ | $MSE_{adjusted}$ |
|--------------|-------|----------------|----------------|------------------|------------------|
| basic reg    |  51.0 |       0.310047 |       0.224425 |         0.303281 |         0.226670 |
| flexible reg | 246.0 |       0.351110 |       0.211068 |         0.318692 |         0.221656 |
| lasso flex   | 246.0 |       0.160478 |       0.273076 |         0.118357 |         0.286774 |

Considering all measures above, the flexible model performs slightly better than the basic model. However, these in-sample measures are computed on the same data used for estimation, so they can overstate the true predictive performance of models with many regressors.

One procedure to circumvent this issue is data splitting, which is described and applied in the following.

2.6. Data Splitting#

Measure the prediction quality of the two models via data splitting:

  • Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we can consider).

  • Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.

  • Use the testing sample for evaluation. Predict the \(\mathtt{wage}\) of every observation in the testing sample based on the estimated parameters in the training sample.

  • Calculate the Mean Squared Prediction Error \(MSE_{test}\) based on the testing sample for both prediction models.
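The split performed manually below can equivalently be done with scikit-learn's `train_test_split`; a minimal sketch on a toy DataFrame (`df_demo` is illustrative, not the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy DataFrame standing in for the wage data
df_demo = pd.DataFrame({"x": np.arange(100), "y": 2.0 * np.arange(100)})

# 80/20 split (the notebook's 4/5 ratio) with a fixed seed for replicability
train_demo, test_demo = train_test_split(df_demo, test_size=0.2, random_state=0)
```

`train_test_split` also accepts a `stratify` argument for the stratified splitting mentioned above.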

# Import relevant packages for splitting data
import math

# Set seed to make the results replicable (generating random numbers)
np.random.seed(0)
# assign each observation a random integer index (drawn with replacement)
random = np.random.randint(0, n, size=n)
data["random"] = random
random    # display the random index array
array([2732, 2607, 1653, ..., 4184, 2349, 3462])
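Note that sorting on integers drawn with replacement can produce ties (the array above contains repeated values). An exact shuffle instead permutes the row positions; a small illustrative sketch (`df_demo` is a toy DataFrame, not the notebook's data):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"x": range(10)})

# a permutation of the row positions reorders every row exactly once
rng = np.random.default_rng(0)
shuffled = df_demo.iloc[rng.permutation(len(df_demo))]
```

Every row appears exactly once in the shuffled frame, so no observation is lost or duplicated by the reordering.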
data_2 = data.sort_values(by=['random'])
data_2.head()
wage lwage sex shs hsg scl clg ad mw so ... ne exp1 exp2 exp3 exp4 occ occ2 ind ind2 random
rownames
2223 26.442308 3.274965 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 29.0 8.4100 24.389000 70.728100 340 1 8660 20 0
3467 19.230769 2.956512 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 1.0 33.5 11.2225 37.595375 125.944506 9620 22 1870 5 0
13501 48.076923 3.872802 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 ... 0.0 2.0 0.0400 0.008000 0.001600 3060 10 8190 18 0
15588 12.019231 2.486508 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 29.0 8.4100 24.389000 70.728100 6440 19 770 4 2
16049 39.903846 3.686473 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 12.0 1.4400 1.728000 2.073600 1820 5 7860 17 2

5 rows × 21 columns

# Create training and testing sample 
train = data_2[ : math.floor(n*4/5)]    # training sample
test =  data_2[ math.floor(n*4/5) : ]   # testing sample
print(train.shape)
print(test.shape)
(4120, 21)
(1030, 21)
# Basic model specification
basic = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'

# Flexible model specification
flex = 'lwage ~ sex + shs+hsg+scl+clg+occ2+ind2+mw+so+we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
# basic model
# estimating the parameters in the training sample
basic_results = smf.ols(basic , data=train).fit()
print(basic_results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  lwage   R-squared:                       0.316
Model:                            OLS   Adj. R-squared:                  0.308
Method:                 Least Squares   F-statistic:                     37.65
Date:                Wed, 03 Aug 2022   Prob (F-statistic):          4.85e-293
Time:                        23:52:25   Log-Likelihood:                -2784.1
No. Observations:                4120   AIC:                             5670.
Df Residuals:                    4069   BIC:                             5993.
Df Model:                          50                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.5365      0.061     58.134      0.000       3.417       3.656
occ2[T.10]    -0.0054      0.045     -0.120      0.904      -0.094       0.083
occ2[T.11]    -0.4594      0.067     -6.893      0.000      -0.590      -0.329
occ2[T.12]    -0.3300      0.061     -5.365      0.000      -0.451      -0.209
occ2[T.13]    -0.3767      0.050     -7.544      0.000      -0.475      -0.279
occ2[T.14]    -0.5026      0.056     -8.947      0.000      -0.613      -0.392
occ2[T.15]    -0.4511      0.059     -7.586      0.000      -0.568      -0.335
occ2[T.16]    -0.2482      0.036     -6.818      0.000      -0.320      -0.177
occ2[T.17]    -0.4286      0.031    -13.624      0.000      -0.490      -0.367
occ2[T.18]    -0.2957      0.216     -1.367      0.172      -0.720       0.128
occ2[T.19]    -0.2354      0.056     -4.191      0.000      -0.345      -0.125
occ2[T.2]     -0.0771      0.038     -2.029      0.043      -0.152      -0.003
occ2[T.20]    -0.2158      0.046     -4.669      0.000      -0.306      -0.125
occ2[T.21]    -0.3029      0.042     -7.171      0.000      -0.386      -0.220
occ2[T.22]    -0.4385      0.047     -9.385      0.000      -0.530      -0.347
occ2[T.3]     -0.0054      0.044     -0.121      0.904      -0.092       0.082
occ2[T.4]     -0.0867      0.061     -1.431      0.152      -0.206       0.032
occ2[T.5]     -0.2064      0.072     -2.866      0.004      -0.348      -0.065
occ2[T.6]     -0.4175      0.057     -7.317      0.000      -0.529      -0.306
occ2[T.7]     -0.0111      0.063     -0.177      0.860      -0.134       0.112
occ2[T.8]     -0.3633      0.049     -7.380      0.000      -0.460      -0.267
occ2[T.9]     -0.1928      0.052     -3.743      0.000      -0.294      -0.092
ind2[T.11]     0.0622      0.066      0.937      0.349      -0.068       0.192
ind2[T.12]     0.1328      0.060      2.220      0.026       0.016       0.250
ind2[T.13]     0.0492      0.078      0.629      0.529      -0.104       0.203
ind2[T.14]     0.0062      0.057      0.108      0.914      -0.106       0.118
ind2[T.15]    -0.1137      0.150     -0.759      0.448      -0.407       0.180
ind2[T.16]    -0.1072      0.063     -1.690      0.091      -0.232       0.017
ind2[T.17]    -0.1034      0.063     -1.640      0.101      -0.227       0.020
ind2[T.18]    -0.1331      0.058     -2.298      0.022      -0.247      -0.020
ind2[T.19]    -0.1591      0.073     -2.190      0.029      -0.301      -0.017
ind2[T.2]      0.2190      0.097      2.248      0.025       0.028       0.410
ind2[T.20]    -0.3512      0.064     -5.518      0.000      -0.476      -0.226
ind2[T.21]    -0.0824      0.062     -1.325      0.185      -0.204       0.040
ind2[T.22]     0.0795      0.060      1.321      0.186      -0.038       0.197
ind2[T.3]      0.0533      0.088      0.603      0.547      -0.120       0.227
ind2[T.4]     -0.0416      0.065     -0.636      0.525      -0.170       0.087
ind2[T.5]     -0.0628      0.063     -1.004      0.315      -0.185       0.060
ind2[T.6]     -0.0394      0.059     -0.673      0.501      -0.154       0.075
ind2[T.7]      0.0058      0.080      0.073      0.942      -0.152       0.163
ind2[T.8]     -0.0610      0.081     -0.754      0.451      -0.220       0.098
ind2[T.9]     -0.1683      0.055     -3.086      0.002      -0.275      -0.061
sex           -0.0763      0.017     -4.521      0.000      -0.109      -0.043
exp1           0.0087      0.001     11.758      0.000       0.007       0.010
shs           -0.5928      0.057    -10.436      0.000      -0.704      -0.481
hsg           -0.5213      0.030    -17.127      0.000      -0.581      -0.462
scl           -0.4215      0.028    -14.848      0.000      -0.477      -0.366
clg           -0.1974      0.026     -7.655      0.000      -0.248      -0.147
mw            -0.0233      0.022     -1.075      0.283      -0.066       0.019
so            -0.0428      0.021     -2.048      0.041      -0.084      -0.002
we         -5.145e-05      0.023     -0.002      0.998      -0.044       0.044
==============================================================================
Omnibus:                      358.629   Durbin-Watson:                   1.946
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1439.044
Skew:                           0.355   Prob(JB):                         0.00
Kurtosis:                       5.807   Cond. No.                         543.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
lwage_test = test["lwage"].values

# calculating the out-of-sample MSE
# (formula-based predict builds the design matrix, including the intercept,
# directly from the test DataFrame, so no constant needs to be added)
lwage_pred = basic_results.predict(test) # predict out of sample
print(lwage_pred)
rownames
29749    2.454760
32504    2.729422
4239     3.374858
985      3.451121
8477     2.883054
           ...   
27533    3.039693
7218     2.669400
7204     3.271324
1380     2.943550
10451    3.462293
Length: 1030, dtype: float64
MSE_test1 = np.sum((lwage_test-lwage_pred)**2)/len(lwage_test)
R2_test1  = 1 - MSE_test1/np.var(lwage_test)

print("Test MSE for the basic model: ", MSE_test1, " ")
print("Test R2 for the basic model: ", R2_test1)
Test MSE for the basic model:  0.21963534669163998  
Test R2 for the basic model:  0.2749843118453724
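For reference, the out-of-sample measures computed above are

\[
MSE_{test} = \frac{1}{n_{test}} \sum_{i \in \text{test}} \left(Y_i - \hat{Y}_i\right)^2, \qquad
R^2_{test} = 1 - \frac{MSE_{test}}{\widehat{\mathrm{Var}}(Y_{test})},
\]

where \(\hat{Y}_i\) is the prediction for observation \(i\) based on the parameters estimated in the training sample.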

In the basic model, the \(MSE_{test}\) is quite close to the \(MSE_{sample}\).

# Flexible model
# estimating the parameters in the training sample
flex_results = smf.ols(flex , data=train).fit()

# calculating the out-of-sample MSE
lwage_flex_pred =  flex_results.predict(test) # predict out of sample
lwage_test = test["lwage"].values

MSE_test2 = np.sum((lwage_test-lwage_flex_pred)**2)/len(lwage_test)
R2_test2  = 1 - MSE_test2/np.var(lwage_test)

print("Test MSE for the flexible model: ", MSE_test2, " ")
print("Test R2 for the flexible model: ", R2_test2)
Test MSE for the flexible model:  0.2332944574254996  
Test R2 for the flexible model:  0.2298956240842307

In the flexible model, the discrepancy between the \(MSE_{test}\) and the \(MSE_{sample}\) is not large.

It is worth noticing that the \(MSE_{test}\) varies across different data splits. Hence, it is a good idea to average the out-of-sample MSE over several data splits to get more reliable results.
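Averaging over splits can be automated with k-fold cross-validation; the sketch below uses synthetic data, so all names (`X_demo`, `y_demo`) are illustrative, not the notebook's:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression data: 5 regressors, unit-variance noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + rng.normal(size=500)

# scikit-learn reports negative MSE by convention; flip the sign and
# average the out-of-sample MSE over 5 folds
mse_folds = -cross_val_score(LinearRegression(), X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
mse_cv = mse_folds.mean()
```

Each fold plays the role of the testing sample once, so `mse_cv` is the averaged out-of-sample MSE described above.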

Nevertheless, we observe that, based on the out-of-sample \(MSE\), the basic model using OLS regression performs about as well as (or slightly better than) the flexible model.

Next, let us use lasso regression in the flexible model instead of OLS regression. Lasso (least absolute shrinkage and selection operator) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors \(p\) is large relative to the sample size \(n\).
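Specifically, the lasso estimator used here (scikit-learn's `Lasso`) solves the penalized least-squares problem

\[
\hat{\beta}_{lasso} = \arg\min_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left(Y_i - X_i'\beta\right)^2 + \alpha \|\beta\|_1,
\]

where the penalty level \(\alpha\) shrinks the coefficients toward zero and sets some of them exactly to zero, effectively performing variable selection.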

Note that the out-of-sample \(MSE\) on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to OLS regression.

# flexible model using lasso
# get exogenous variables from training data used in flex model
flex_results_0 = smf.ols(flex , data=train)
X_train = flex_results_0.exog
print(X_train.shape)

# Get endogenous variable 
lwage_train = train["lwage"]
print(lwage_train.shape)
(4120, 246)
(4120,)
# flexible model using lasso
# get exogenous variables from testing data used in flex model
flex_results_1 = smf.ols(flex , data=test)
X_test = flex_results_1.exog
print(X_test.shape)

# Get endogenous variable 
lwage_test = test["lwage"]
print(lwage_test.shape)
(1030, 246)
(1030,)
# calculating the out-of-sample MSE
reg = linear_model.Lasso(alpha=0.1)
lwage_lasso_fitted = reg.fit(X_train, lwage_train).predict( X_test )

MSE_lasso = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso  = 1 - MSE_lasso/np.var(lwage_test)

print("Test MSE for the lasso on the flexible model: ", MSE_lasso, " ")
print("Test R2 for the lasso on the flexible model: ", R2_lasso)
Test MSE for the lasso on the flexible model:  0.2540862168655473  
Test R2 for the lasso on the flexible model:  0.1612620821455767

Finally, let us summarize the results:

# Package for latex table 
import array_to_latex as a2l

table2 = np.zeros((3, 2))
table2[0,0] = MSE_test1
table2[1,0] = MSE_test2
table2[2,0] = MSE_lasso
table2[0,1] = R2_test1
table2[1,1] = R2_test2
table2[2,1] = R2_lasso

table2 = pd.DataFrame(table2, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table2
|                  | $MSE_{test}$ | $R^2_{test}$ |
|------------------|--------------|--------------|
| basic reg        |     0.219635 |     0.274984 |
| flexible reg     |     0.233294 |     0.229896 |
| lasso regression |     0.254086 |     0.161262 |