# One-time Colab setup (uncomment to run): installs CUDA, GCC 6, and Julia 1.7.3.
# !wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
# !dpkg -i cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
# !apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
# !apt update -q
# !apt install cuda gcc-6 g++-6 -y -q
# !ln -s /usr/bin/gcc-6 /usr/local/cuda/bin/gcc
# !ln -s /usr/bin/g++-6 /usr/local/cuda/bin/g++
# !curl -sSL "https://julialang-s3.julialang.org/bin/linux/x64/1.7/julia-1.7.3-linux-x86_64.tar.gz" -o julia.tar.gz
# !tar -xzf julia.tar.gz -C /usr --strip-components 1
# !rm -rf julia.tar.gz*
# !julia -e 'using Pkg; pkg"add IJulia; precompile"'

# 2. OLS and lasso for wage prediction

## 2.1. Introduction

In labor economics an important question is what determines the wage of workers. This is a causal question, but we can begin to investigate it from a predictive perspective.

In the following wage example, \(Y\) is the hourly wage of a worker and \(X\) is a vector of the worker’s characteristics, e.g., education, experience, gender. The two main questions here are:

  • How to use job-relevant characteristics, such as education and experience, to best predict wages?

  • What is the difference in predicted wages between men and women with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

## 2.2. Data

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-Hispanic individuals, aged 25 to 64 years, who work more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural, or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with an hourly wage below \(\$3\).

The variable of interest \(Y\) is the hourly wage rate, constructed as the ratio of annual earnings to the total number of hours worked, which in turn is the product of the number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size \(n = 5150\).
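For intuition, the construction just described can be sketched in a few lines of Julia. This is a hypothetical illustration only: the posted subsample already contains the constructed variables, and the raw column names used here are placeholders.

# Hypothetical sketch of the variable construction described above;
# `raw` and its columns (incwage, wkswork, uhrswork) are placeholders.
total_hours = raw.wkswork .* raw.uhrswork    # weeks worked × usual weekly hours
raw.wage    = raw.incwage ./ total_hours     # hourly wage = annual earnings / total hours
raw.lwage   = log.(raw.wage)                 # log hourly wage, the outcome used below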

## 2.3. Data Analysis

# to_install = ["CSV", "DataFrames", "Dates", "Plots", "Lathe", "GLM", "MLBase", "HTTP", "Lasso"]
# using Pkg
# Pkg.add(to_install)   # Statistics is a standard library and needs no installation
using CSV, DataFrames, Dates, Plots, Lathe, GLM, Statistics, MLBase, HTTP

We start by loading the data set.

# Read the CSV file into a DataFrame.
# The occupation and industry codes must be read as strings, not numbers.
url = "https://github.com/d2cml-ai/14.388_jl/raw/main/data/wage2015_subsample_inference.csv"
data = CSV.File(download(url); types = Dict("occ" => String, "occ2" => String, "ind" => String, "ind2" => String)) |> DataFrame
size(data)
(5150, 21)

Let’s look at the structure.

# A quick describe of the data
describe(data)

21 rows × 7 columns

| | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | rownames | 15636.3 | 10 | 15260.0 | 32643 | 0 | Int64 |
| 2 | wage | 23.4104 | 3.02198 | 19.2308 | 528.846 | 0 | Float64 |
| 3 | lwage | 2.97079 | 1.10591 | 2.95651 | 6.2707 | 0 | Float64 |
| 4 | sex | 0.444466 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 5 | shs | 0.023301 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 6 | hsg | 0.243883 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 7 | scl | 0.278058 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 8 | clg | 0.31767 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 9 | ad | 0.137087 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 10 | mw | 0.259612 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 11 | so | 0.296505 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 12 | we | 0.216117 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 13 | ne | 0.227767 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 14 | exp1 | 13.7606 | 0.0 | 10.0 | 47.0 | 0 | Float64 |
| 15 | exp2 | 3.01893 | 0.0 | 1.0 | 22.09 | 0 | Float64 |
| 16 | exp3 | 8.23587 | 0.0 | 1.0 | 103.823 | 0 | Float64 |
| 17 | exp4 | 25.118 | 0.0 | 1.0 | 487.968 | 0 | Float64 |
| 18 | occ | | 10 | | 9750 | 0 | String |
| 19 | occ2 | | 1 | | 9 | 0 | String |
| 20 | ind | | 1070 | | 9590 | 0 | String |
| 21 | ind2 | | 10 | | 9 | 0 | String |

We construct the output variable \(Y\) and the matrix \(Z\), which includes the characteristics of workers that are given in the data.

n = size(data)[1]
z = select(data, Not([:rownames, :lwage, :wage]))
p = size(z)[2] 

println("Number of observations : ", n, "\n","Number of raw regressors: ", p )
Number of observations : 5150
Number of raw regressors: 18

For the outcome variable, log wage, and a subset of the raw regressors, we calculate the empirical mean to get familiar with the data.

z_subset = select(data, ["lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1"])
rename!(z_subset, ["Log Wage", "Sex", "Some High School", "High School Graduate", "Some College", "College Graduate", "Advanced Degree", "Midwest", "South", "West", "Northeast", "Experience"])

describe(z_subset, :mean)

12 rows × 2 columns

| | variable | mean |
|---|---|---|
| | Symbol | Float64 |
| 1 | Log Wage | 2.97079 |
| 2 | Sex | 0.444466 |
| 3 | Some High School | 0.023301 |
| 4 | High School Graduate | 0.243883 |
| 5 | Some College | 0.278058 |
| 6 | College Graduate | 0.31767 |
| 7 | Advanced Degree | 0.137087 |
| 8 | Midwest | 0.259612 |
| 9 | South | 0.296505 |
| 10 | West | 0.216117 |
| 11 | Northeast | 0.227767 |
| 12 | Experience | 13.7606 |

For example, the share of female workers in our sample is about 44% (\(sex = 1\) if female).

## 2.4. Prediction Question

Now, we will construct a prediction rule for hourly wage \(Y\), which depends linearly on job-relevant characteristics \(X\):

\[Y = \beta' X + \epsilon \]

Our goals are

  • Predict wages using various characteristics of workers.

  • Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample \(R^2\) and the out-of-sample \(MSE\) and \(R^2\).

We employ two different specifications for prediction:

  • Basic Model: \(X\) consists of a set of raw regressors (e.g., gender, experience, education indicators, occupation and industry indicators, regional indicators).

  • Flexible Model: \(X\) consists of all raw regressors from the basic model plus transformations of experience (e.g., \(exp2\) and \(exp3\)) and additional two-way interactions of a polynomial in experience with the other regressors. An example of a regressor created through a two-way interaction is experience times the indicator of having a college degree.

Using the Flexible Model enables us to approximate the real relationship with a more complex regression function and therefore to reduce bias. The Flexible Model increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but are harder to interpret.
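To see how such an interaction regressor is generated, here is a minimal toy illustration of the `*` operator in `@formula` (the three-row `toy` data frame is made up for this example; `@formula`, `ModelFrame`, and `modelmatrix` come from StatsModels, which GLM re-exports):

# Toy illustration: `*` in a formula expands into main effects plus the `&` interaction
toy = DataFrame(lwage = [2.0, 3.0, 2.5], exp1 = [5.0, 10.0, 2.0], clg = [1.0, 0.0, 1.0])
f = @formula(lwage ~ exp1 * clg)    # equivalent to exp1 + clg + exp1 & clg
mf = ModelFrame(f, toy)
modelmatrix(mf)                     # columns: intercept, exp1, clg, exp1 .* clg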

Now, let us fit both models to our data by running ordinary least squares (ols):

# Basic model
basic = @formula(lwage ~ sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2)
basic_results  = lm(basic, data)
println(basic_results)
println("Number of regressors in the basic model: ", size(coef(basic_results), 1))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

lwage ~ 1 + sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                   Coef.   Std. Error       t  Pr(>|t|)    Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)   3.52838     0.0540195     65.32    <1e-99   3.42248      3.63429
sex          -0.0728575   0.0150269     -4.85    <1e-05  -0.102317    -0.0433983
exp1          0.0085677   0.000653744   13.11    <1e-37   0.00728608   0.00984932
shs          -0.592798    0.0505549    -11.73    <1e-30  -0.691908    -0.493689
hsg          -0.504337    0.0270767    -18.63    <1e-74  -0.557419    -0.451255
scl          -0.411994    0.0252036    -16.35    <1e-57  -0.461404    -0.362584
clg          -0.182216    0.0229524     -7.94    <1e-14  -0.227213    -0.137219
mw           -0.0275413   0.0193301     -1.42    0.1543  -0.0654367    0.010354
so           -0.0344538   0.0187063     -1.84    0.0656  -0.0711262    0.00221853
we            0.0172492   0.020086       0.86    0.3905  -0.022128     0.0566264
occ2: 10     -0.0106235   0.0396274     -0.27    0.7886  -0.0883102    0.0670633
occ2: 11     -0.455834    0.0594409     -7.67    <1e-13  -0.572364    -0.339305
occ2: 12     -0.307589    0.0555146     -5.54    <1e-07  -0.416421    -0.198756
occ2: 13     -0.36144     0.0455401     -7.94    <1e-14  -0.450718    -0.272162
occ2: 14     -0.499495    0.0506204     -9.87    <1e-22  -0.598733    -0.400258
occ2: 15     -0.464482    0.0517634     -8.97    <1e-18  -0.56596     -0.363003
occ2: 16     -0.233715    0.0324348     -7.21    <1e-12  -0.297301    -0.170129
occ2: 17     -0.412588    0.0279079    -14.78    <1e-47  -0.4673      -0.357877
occ2: 18     -0.340418    0.196628      -1.73    0.0835  -0.725893     0.0450565
occ2: 19     -0.24148     0.0494794     -4.88    <1e-05  -0.33848     -0.144479
occ2: 2      -0.0764717   0.0342039     -2.24    0.0254  -0.143526    -0.0094174
occ2: 20     -0.212628    0.0408854     -5.20    <1e-06  -0.292781    -0.132475
occ2: 21     -0.288413    0.0380839     -7.57    <1e-13  -0.363074    -0.213752
occ2: 22     -0.422394    0.0414626    -10.19    <1e-23  -0.503678    -0.341109
occ2: 3      -0.0346777   0.0387595     -0.89    0.3710  -0.110663     0.0413075
occ2: 4      -0.0962017   0.0519073     -1.85    0.0639  -0.197962     0.00555892
occ2: 5      -0.187915    0.0603999     -3.11    0.0019  -0.306325    -0.0695053
occ2: 6      -0.414933    0.0502176     -8.26    <1e-15  -0.513381    -0.316485
occ2: 7      -0.0459867   0.0565054     -0.81    0.4158  -0.156762     0.0647881
occ2: 8      -0.377847    0.043929      -8.60    <1e-16  -0.463967    -0.291727
occ2: 9      -0.215752    0.0461229     -4.68    <1e-05  -0.306173    -0.125331
ind2: 11      0.0247881   0.0580424      0.43    0.6693  -0.0889999    0.138576
ind2: 12      0.116415    0.0528911      2.20    0.0278   0.0127259    0.220104
ind2: 13      0.0212468   0.0681013      0.31    0.7551  -0.112261     0.154755
ind2: 14      0.00684568  0.050356       0.14    0.8919  -0.0918738    0.105565
ind2: 15     -0.131513    0.137131      -0.96    0.3376  -0.400348     0.137322
ind2: 16     -0.121548    0.0558235     -2.18    0.0295  -0.230986    -0.0121101
ind2: 17     -0.110554    0.0556398     -1.99    0.0470  -0.219632    -0.00147636
ind2: 18     -0.141535    0.0510202     -2.77    0.0056  -0.241557    -0.041514
ind2: 19     -0.18027     0.0652937     -2.76    0.0058  -0.308274    -0.052266
ind2: 2       0.193851    0.0842585      2.30    0.0215   0.028668     0.359034
ind2: 20     -0.358081    0.0565485     -6.33    <1e-09  -0.468941    -0.247222
ind2: 21     -0.122828    0.0548657     -2.24    0.0252  -0.230388    -0.0152675
ind2: 22      0.0748796   0.0532236      1.41    0.1595  -0.0294615    0.179221
ind2: 3       0.0770144   0.0792451      0.97    0.3312  -0.0783399    0.232369
ind2: 4      -0.0506417   0.0579651     -0.87    0.3823  -0.164278     0.0629948
ind2: 5      -0.0796816   0.0556328     -1.43    0.1521  -0.188746     0.0293825
ind2: 6      -0.0555174   0.0517977     -1.07    0.2839  -0.157063     0.0460284
ind2: 7       0.0542625   0.0717473      0.76    0.4495  -0.086393     0.194918
ind2: 8      -0.0490971   0.0723613     -0.68    0.4975  -0.190956     0.092762
ind2: 9      -0.193634    0.0482064     -4.02    <1e-04  -0.288139    -0.0991287
─────────────────────────────────────────────────────────────────────────────────
Number of regressors in the basic model: 51
# Flexible model
flex = @formula(lwage ~ sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we + (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + occ2 + ind2 + mw + so + we))
flex_results = lm(flex, data)
println(flex_results)
println("Number of regressors in the flexible model: ", size(coef(flex_results), 1))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

lwage ~ 1 + sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we + exp1 + exp2 + exp3 + exp4 + exp1 & shs + exp1 & hsg + exp1 & scl + exp1 & clg + exp1 & occ2 + exp1 & ind2 + exp1 & mw + exp1 & so + exp1 & we + exp2 & shs + exp2 & hsg + exp2 & scl + exp2 & clg + exp2 & occ2 + exp2 & ind2 + exp2 & mw + exp2 & so + exp2 & we + exp3 & shs + exp3 & hsg + exp3 & scl + exp3 & clg + exp3 & occ2 + exp3 & ind2 + exp3 & mw + exp3 & so + exp3 & we + exp4 & shs + exp4 & hsg + exp4 & scl + exp4 & clg + exp4 & occ2 + exp4 & ind2 + exp4 & mw + exp4 & so + exp4 & we

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                        Coef.  Std. Error      t  Pr(>|t|)    Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)       3.27968       0.284196   11.54    <1e-29    2.72253     3.83683
sex              -0.0695532     0.015218   -4.57    <1e-05   -0.0993874  -0.039719
shs              -0.123309      0.906832   -0.14    0.8918   -1.90111     1.65449
hsg              -0.528902      0.197756   -2.67    0.0075   -0.916593   -0.141212
scl              -0.292058      0.126016   -2.32    0.0205   -0.539105   -0.0450112
clg              -0.0411641     0.0703862  -0.58    0.5587   -0.179153    0.0968245
occ2: 10          0.0209545     0.156498    0.13    0.8935   -0.285852    0.327761
occ2: 11         -0.642418      0.30909    -2.08    0.0377   -1.24837    -0.036463
occ2: 12         -0.0674774     0.252049   -0.27    0.7889   -0.561605    0.426651
occ2: 13         -0.232978      0.231538   -1.01    0.3144   -0.686896    0.22094
occ2: 14          0.256201      0.322673    0.79    0.4272   -0.376382    0.888784
occ2: 15         -0.193858      0.259508   -0.75    0.4551   -0.702611    0.314894
occ2: 16         -0.0551256     0.147066   -0.37    0.7078   -0.34344     0.233189
occ2: 17         -0.415609      0.136114   -3.05    0.0023   -0.682454   -0.148764
occ2: 18         -0.482217      1.04435    -0.46    0.6443   -2.52962     1.56518
occ2: 19         -0.257941      0.332522   -0.78    0.4380   -0.909832    0.39395
occ2: 2           0.16134       0.129724    1.24    0.2137   -0.0929781   0.415657
occ2: 20         -0.30102       0.234102   -1.29    0.1986   -0.759965    0.157925
occ2: 21         -0.427181      0.220649   -1.94    0.0529   -0.859751    0.00538904
occ2: 22         -0.869453      0.297522   -2.92    0.0035   -1.45273    -0.286176
occ2: 3           0.210151      0.168677    1.25    0.2129   -0.120532    0.540835
occ2: 4           0.070857      0.183717    0.39    0.6997   -0.28931     0.431024
occ2: 5          -0.396008      0.18854    -2.10    0.0357   -0.76563    -0.026385
occ2: 6          -0.231061      0.186966   -1.24    0.2166   -0.597599    0.135476
occ2: 7           0.314725      0.194152    1.62    0.1051   -0.0658997   0.69535
occ2: 8          -0.187542      0.169299   -1.11    0.2680   -0.519443    0.14436
occ2: 9          -0.339027      0.16723    -2.03    0.0427   -0.666873   -0.0111811
ind2: 11         -0.404651      0.314232   -1.29    0.1979   -1.02069     0.211384
ind2: 12         -0.156994      0.279436   -0.56    0.5743   -0.704813    0.390825
ind2: 13         -0.437744      0.361684   -1.21    0.2262   -1.14681     0.271318
ind2: 14         -0.00543335    0.270026   -0.02    0.9839   -0.534804    0.523938
ind2: 15          0.200448      0.500542    0.40    0.6888   -0.780838    1.18173
ind2: 16          0.0101935     0.299579    0.03    0.9729   -0.577116    0.597503
ind2: 17         -0.2396        0.284972   -0.84    0.4005   -0.798273    0.319073
ind2: 18         -0.180776      0.278446   -0.65    0.5162   -0.726655    0.365103
ind2: 19         -0.300698      0.326555   -0.92    0.3572   -0.940892    0.339497
ind2: 2           0.580584      0.480878    1.21    0.2274   -0.362151    1.52332
ind2: 20         -0.329318      0.312909   -1.05    0.2927   -0.942761    0.284125
ind2: 21         -0.178069      0.303635   -0.59    0.5576   -0.773331    0.417192
ind2: 22          0.176507      0.296357    0.60    0.5515   -0.404487    0.7575
ind2: 3          -0.666781      0.560741   -1.19    0.2345   -1.76608     0.432522
ind2: 4           0.485756      0.349761    1.39    0.1650   -0.199933    1.17144
ind2: 5           0.051198      0.297545    0.17    0.8634   -0.532124    0.63452
ind2: 6          -0.0415848     0.299194   -0.14    0.8895   -0.628138    0.544969
ind2: 7           0.0758343     0.387069    0.20    0.8447   -0.682995    0.834664
ind2: 8          -0.14896       0.337453   -0.44    0.6589   -0.810518    0.512598
ind2: 9          -0.221949      0.274596   -0.81    0.4190   -0.76028     0.316382
mw                0.110683      0.0814463   1.36    0.1742   -0.0489879   0.270355
so                0.0224244     0.0743855   0.30    0.7631   -0.123404    0.168253
we               -0.0215659     0.0841591  -0.26    0.7978   -0.186556    0.143424
exp1              0.0835331     0.0944741   0.88    0.3766   -0.101678    0.268745
exp2             -0.561036      0.948991   -0.59    0.5544   -2.42148     1.29941
exp3              0.129343      0.356603    0.36    0.7168   -0.569759    0.828445
exp4             -0.00534853    0.0445154  -0.12    0.9044   -0.0926186   0.0819216
exp1 & shs       -0.191998      0.195541   -0.98    0.3262   -0.575346    0.191349
exp1 & hsg       -0.0173433     0.0572279  -0.30    0.7619   -0.129536    0.0948491
exp1 & scl       -0.0664505     0.043373   -1.53    0.1256   -0.151481    0.01858
exp1 & clg       -0.0550346     0.0310279  -1.77    0.0762   -0.115863    0.00579393
exp1 & occ2: 10   0.00756285    0.0581715   0.13    0.8966   -0.106479    0.121605
exp1 & occ2: 11   0.101422      0.100509    1.01    0.3130   -0.0956214   0.298466
exp1 & occ2: 12  -0.0862744     0.0874768  -0.99    0.3241   -0.257768    0.0852193
exp1 & occ2: 13   0.00671485    0.0761825   0.09    0.9298   -0.142637    0.156067
exp1 & occ2: 14  -0.136915      0.0974458  -1.41    0.1601   -0.327953    0.0541221
exp1 & occ2: 15  -0.0400425     0.0898931  -0.45    0.6560   -0.216273    0.136188
exp1 & occ2: 16  -0.0539314     0.0520926  -1.04    0.3006   -0.156056    0.0481934
exp1 & occ2: 17   0.0147277     0.0467903   0.31    0.7530   -0.0770023   0.106458
exp1 & occ2: 18   0.10741       0.471844    0.23    0.8199   -0.817616    1.03244
exp1 & occ2: 19   0.0047165     0.106074    0.04    0.9645   -0.203237    0.21267
exp1 & occ2: 2   -0.0736239     0.0501108  -1.47    0.1418   -0.171863    0.0246157
exp1 & occ2: 20   0.0243156     0.0743274   0.33    0.7436   -0.121399    0.170031
exp1 & occ2: 21   0.0791776     0.0696947   1.14    0.2560   -0.0574551   0.21581
exp1 & occ2: 22   0.109325      0.0880828   1.24    0.2146   -0.063357    0.282006
exp1 & occ2: 3   -0.0714859     0.0637688  -1.12    0.2623   -0.196501    0.0535296
exp1 & occ2: 4   -0.0723997     0.0747715  -0.97    0.3330   -0.218985    0.0741859
exp1 & occ2: 5    0.0946732     0.0794005   1.19    0.2332   -0.0609873   0.250334
exp1 & occ2: 6   -0.0348928     0.0712136  -0.49    0.6242   -0.174503    0.104718
exp1 & occ2: 7   -0.227934      0.078486   -2.90    0.0037   -0.381802   -0.074066
exp1 & occ2: 8   -0.0727459     0.0645883  -1.13    0.2601   -0.199368    0.0538762
exp1 & occ2: 9    0.0274143     0.0669517   0.41    0.6822   -0.103841    0.15867
exp1 & ind2: 11   0.166231      0.105875    1.57    0.1165   -0.0413313   0.373793
exp1 & ind2: 12   0.107851      0.0933357   1.16    0.2479   -0.0751287   0.290831
exp1 & ind2: 13   0.188352      0.11737     1.60    0.1086   -0.0417468   0.41845
exp1 & ind2: 14  -0.00711671    0.0891863  -0.08    0.9364   -0.181962    0.167728
exp1 & ind2: 15  -0.208076      0.203289   -1.02    0.3061   -0.606614    0.190462
exp1 & ind2: 16  -0.0665283     0.0991814  -0.67    0.5024   -0.260968    0.127912
exp1 & ind2: 17   0.0216289     0.094615    0.23    0.8192   -0.163859    0.207117
exp1 & ind2: 18   0.00528206    0.0906315   0.06    0.9535   -0.172396    0.18296
exp1 & ind2: 19   0.000352497   0.110676    0.00    0.9975   -0.216623    0.217328
exp1 & ind2: 2   -0.151258      0.164434   -0.92    0.3577   -0.473622    0.171107
exp1 & ind2: 20  -0.0185949     0.102096   -0.18    0.8555   -0.218749    0.181559
exp1 & ind2: 21   0.0678327     0.100782    0.67    0.5009   -0.129745    0.26541
exp1 & ind2: 22  -0.0366764     0.0964226  -0.38    0.7037   -0.225708    0.152355
exp1 & ind2: 3    0.324631      0.188468    1.72    0.0850   -0.0448505   0.694113
exp1 & ind2: 4   -0.136527      0.114344   -1.19    0.2325   -0.360692    0.0876375
exp1 & ind2: 5   -0.0255591     0.0964004  -0.27    0.7909   -0.214547    0.163429
exp1 & ind2: 6    0.00276967    0.0959674   0.03    0.9770   -0.185369    0.190909
exp1 & ind2: 7   -0.0483333     0.132829   -0.36    0.7160   -0.308738    0.212071
exp1 & ind2: 8    0.0845092     0.118195    0.72    0.4746   -0.147205    0.316224
exp1 & ind2: 9   -0.0153499     0.0872762  -0.18    0.8604   -0.18645     0.155751
exp1 & mw        -0.0279931     0.0296572  -0.94    0.3453   -0.0861345   0.0301484
exp1 & so        -0.00996775    0.0266868  -0.37    0.7088   -0.0622858   0.0423503
exp1 & we         0.00630768    0.0301417   0.21    0.8342   -0.0527835   0.0653989
exp2 & shs        1.90051       1.45025     1.31    0.1901   -0.94263     4.74364
exp2 & hsg        0.117164      0.550973    0.21    0.8316   -0.962989    1.19732
exp2 & scl        0.621792      0.462999    1.34    0.1793   -0.285892    1.52948
exp2 & clg        0.409675      0.380217    1.08    0.2813   -0.335721    1.15507
exp2 & occ2: 10  -0.269229      0.640527   -0.42    0.6743   -1.52495     0.986491
exp2 & occ2: 11  -1.08165       1.00576    -1.08    0.2822   -3.05339     0.890081
exp2 & occ2: 12   0.832374      0.934125    0.89    0.3729   -0.998929    2.66368
exp2 & occ2: 13  -0.220981      0.772846   -0.29    0.7749   -1.73611     1.29414
exp2 & occ2: 14   0.751116      0.927255    0.81    0.4180   -1.06672     2.56895
exp2 & occ2: 15  -0.0326858     0.940912   -0.03    0.9723   -1.87729     1.81192
exp2 & occ2: 16   0.363581      0.550955    0.66    0.5093   -0.716537    1.4437
exp2 & occ2: 17  -0.265929      0.486113   -0.55    0.5844   -1.21893     0.687071
exp2 & occ2: 18  -2.56088       5.17009    -0.50    0.6204  -12.6966      7.57482
exp2 & occ2: 19  -0.129176      1.06169    -0.12    0.9032   -2.21056     1.95221
exp2 & occ2: 2    0.663217      0.552322    1.20    0.2299   -0.419581    1.74602
exp2 & occ2: 20  -0.33233       0.722907   -0.46    0.6457   -1.74955     1.08489
exp2 & occ2: 21  -0.91          0.685411   -1.33    0.1843   -2.25371     0.433714
exp2 & occ2: 22  -0.855054      0.827941   -1.03    0.3018   -2.47819     0.768082
exp2 & occ2: 3    0.641546      0.710278    0.90    0.3664   -0.750918    2.03401
exp2 & occ2: 4    0.974842      0.865535    1.13    0.2601   -0.721994    2.67168
exp2 & occ2: 5   -0.977882      0.973799   -1.00    0.3153   -2.88696     0.9312
exp2 & occ2: 6    0.105086      0.800227    0.13    0.8955   -1.46372     1.67389
exp2 & occ2: 7    3.14071       0.938942    3.34    0.0008    1.29996     4.98146
exp2 & occ2: 8    0.671088      0.719208    0.93    0.3508   -0.738881    2.08106
exp2 & occ2: 9    0.0231977     0.762914    0.03    0.9757   -1.47246     1.51885
exp2 & ind2: 11  -1.68035       1.08035    -1.56    0.1199   -3.79831     0.437612
exp2 & ind2: 12  -0.971713      0.934825   -1.04    0.2986   -2.80439     0.860963
exp2 & ind2: 13  -1.76787       1.1562     -1.53    0.1263   -4.03453     0.498797
exp2 & ind2: 14   0.119001      0.888083    0.13    0.8934   -1.62204     1.86004
exp2 & ind2: 15   2.3885        2.20174     1.08    0.2781   -1.92789     6.70489
exp2 & ind2: 16   0.870745      0.990124    0.88    0.3792   -1.07034     2.81183
exp2 & ind2: 17  -0.00295735    0.948274   -0.00    0.9975   -1.862       1.85609
exp2 & ind2: 18  -0.00329326    0.889074   -0.00    0.9970   -1.74628     1.73969
exp2 & ind2: 19   0.266476      1.11891     0.24    0.8118   -1.92709     2.46004
exp2 & ind2: 2    2.19733       1.77386     1.24    0.2155   -1.28024     5.6749
exp2 & ind2: 20   0.250603      1.01001     0.25    0.8041   -1.72947     2.23068
exp2 & ind2: 21  -0.915406      1.01458    -0.90    0.3670   -2.90443     1.07362
exp2 & ind2: 22   0.339496      0.949368    0.36    0.7207   -1.52169     2.20068
exp2 & ind2: 3   -3.73956       1.95946    -1.91    0.0564   -7.58098     0.101848
exp2 & ind2: 4    1.09199       1.143       0.96    0.3394   -1.1488      3.33278
exp2 & ind2: 5    0.182412      0.938629    0.19    0.8459   -1.65772     2.02254
exp2 & ind2: 6   -0.0304448     0.929643   -0.03    0.9739   -1.85296     1.79207
exp2 & ind2: 7    0.73252       1.43981     0.51    0.6109   -2.09016     3.5552
exp2 & ind2: 8   -0.750665      1.20255    -0.62    0.5325   -3.1082      1.60687
exp2 & ind2: 9    0.417708      0.838948    0.50    0.6186   -1.22701     2.06242
exp2 & mw         0.200561      0.317291    0.63    0.5273   -0.421472    0.822594
exp2 & so         0.0544354     0.281566    0.19    0.8467   -0.49756     0.606431
exp2 & we         0.00127174    0.320787    0.00    0.9968   -0.627615    0.630159
exp3 & shs       -0.672124      0.442663   -1.52    0.1290   -1.53994     0.195693
exp3 & hsg       -0.0179937     0.208318   -0.09    0.9312   -0.426389    0.390402
exp3 & scl       -0.199788      0.185519   -1.08    0.2816   -0.563488    0.163912
exp3 & clg       -0.102523      0.164365   -0.62    0.5328   -0.424752    0.219706
exp3 & occ2: 10   0.185475      0.257556    0.72    0.4715   -0.319451    0.690401
exp3 & occ2: 11   0.393155      0.381776    1.03    0.3032   -0.355296    1.14161
exp3 & occ2: 12  -0.220256      0.366021   -0.60    0.5474   -0.93782     0.497308
exp3 & occ2: 13   0.0950356     0.290437    0.33    0.7435   -0.474351    0.664422
exp3 & occ2: 14  -0.144393      0.334162   -0.43    0.6657   -0.799501    0.510714
exp3 & occ2: 15   0.147708      0.364519    0.41    0.6853   -0.566913    0.862328
exp3 & occ2: 16  -0.0378548     0.215129   -0.18    0.8603   -0.459604    0.383894
exp3 & occ2: 17   0.15105       0.187808    0.80    0.4213   -0.217138    0.519238
exp3 & occ2: 18   1.40844       1.88525     0.75    0.4550   -2.28748     5.10437
exp3 & occ2: 19   0.0923425     0.404231    0.23    0.8193   -0.700131    0.884816
exp3 & occ2: 2   -0.20394       0.221139   -0.92    0.3565   -0.637471    0.22959
exp3 & occ2: 20   0.180699      0.265208    0.68    0.4957   -0.339227    0.700626
exp3 & occ2: 21   0.377908      0.255303    1.48    0.1389   -0.1226      0.878417
exp3 & occ2: 22   0.285506      0.298421    0.96    0.3388   -0.299532    0.870544
exp3 & occ2: 3   -0.236962      0.287037   -0.83    0.4091   -0.799683    0.325759
exp3 & occ2: 4   -0.436696      0.352017   -1.24    0.2148   -1.12681     0.253415
exp3 & occ2: 5    0.38853       0.411886    0.94    0.3456   -0.418951    1.19601
exp3 & occ2: 6    0.0484737     0.329353    0.15    0.8830   -0.597205    0.694152
exp3 & occ2: 7   -1.39493       0.405011   -3.44    0.0006   -2.18893    -0.600926
exp3 & occ2: 8   -0.20539       0.289573   -0.71    0.4782   -0.773082    0.362302
exp3 & occ2: 9   -0.090966      0.314335   -0.29    0.7723   -0.707203    0.525271
exp3 & ind2: 11   0.642942      0.41101     1.56    0.1178   -0.162821    1.4487
exp3 & ind2: 12   0.328629      0.346913    0.95    0.3435   -0.351475    1.00873
exp3 & ind2: 13   0.592851      0.425896    1.39    0.1640   -0.242095    1.4278
exp3 & ind2: 14  -0.0285251     0.328414   -0.09    0.9308   -0.672364    0.615314
exp3 & ind2: 15  -0.856868      0.843546   -1.02    0.3098   -2.5106      0.796861
exp3 & ind2: 16  -0.355848      0.367758   -0.97    0.3333   -1.07682     0.365122
exp3 & ind2: 17  -0.0362622     0.352283   -0.10    0.9180   -0.726895    0.654371
exp3 & ind2: 18   0.0157436     0.325024    0.05    0.9614   -0.621448    0.652936
exp3 & ind2: 19  -0.14883       0.420781   -0.35    0.7236   -0.973749    0.67609
exp3 & ind2: 2   -1.04482       0.706672   -1.48    0.1393   -2.43021     0.340577
exp3 & ind2: 20  -0.0679218     0.37047    -0.18    0.8545   -0.794209    0.658365
exp3 & ind2: 21   0.396705      0.38116     1.04    0.2980   -0.350539    1.14395
exp3 & ind2: 22  -0.0760277     0.34879    -0.22    0.8275   -0.759812    0.607756
exp3 & ind2: 3    1.62176       0.784599    2.07    0.0388    0.0835992   3.15993
exp3 & ind2: 4   -0.314973      0.428727   -0.73    0.4626   -1.15547     0.525524
exp3 & ind2: 5   -0.0505912     0.339866   -0.15    0.8817   -0.716881    0.615699
exp3 & ind2: 6    0.0193266     0.335922    0.06    0.9541   -0.63923     0.677883
exp3 & ind2: 7   -0.335907      0.58607    -0.57    0.5666   -1.48487     0.813053
exp3 & ind2: 8    0.189279      0.451639    0.42    0.6752   -0.696137    1.07469
exp3 & ind2: 9   -0.216085      0.299794   -0.72    0.4711   -0.803816    0.371646
exp3 & mw        -0.0625771     0.124129   -0.50    0.6142   -0.305926    0.180772
exp3 & so        -0.0115842     0.108422   -0.11    0.9149   -0.224139    0.200971
exp3 & we        -0.0124875     0.125138   -0.10    0.9205   -0.257813    0.232838
exp4 & shs        0.0777418     0.0475427   1.64    0.1021   -0.0154632   0.170947
exp4 & hsg        0.000491255   0.0265964   0.02    0.9853   -0.0516497   0.0526322
exp4 & scl        0.021076      0.0245289   0.86    0.3903   -0.0270117   0.0691637
exp4 & clg        0.00786949    0.0227528   0.35    0.7295   -0.0367363   0.0524753
exp4 & occ2: 10  -0.0333347     0.0338825  -0.98    0.3252   -0.0997595   0.0330901
exp4 & occ2: 11  -0.0465914     0.0479018  -0.97    0.3308   -0.1405      0.0473175
exp4 & occ2: 12   0.0110212     0.0470536   0.23    0.8148   -0.0812249   0.103267
exp4 & occ2: 13  -0.0136895     0.0358988  -0.38    0.7030   -0.0840673   0.0566883
exp4 & occ2: 14   0.00555824    0.0400331   0.14    0.8896   -0.0729245   0.084041
exp4 & occ2: 15  -0.0327444     0.0462379  -0.71    0.4789   -0.123391    0.0579026
exp4 & occ2: 16  -0.00897062    0.0275729  -0.33    0.7449   -0.0630259   0.0450847
exp4 & occ2: 17  -0.0256735     0.0239306  -1.07    0.2834   -0.0725881   0.0212412
exp4 & occ2: 18  -0.212137      0.2204     -0.96    0.3358   -0.64422     0.219946
exp4 & occ2: 19  -0.0169398     0.0513428  -0.33    0.7415   -0.117595    0.083715
exp4 & occ2: 2    0.0176389     0.0289257   0.61    0.5420   -0.0390683   0.0743462
exp4 & occ2: 20  -0.0296125     0.0323353  -0.92    0.3598   -0.0930042   0.0337791
exp4 & occ2: 21  -0.0524577     0.0317251  -1.65    0.0983   -0.114653    0.00973765
exp4 & occ2: 22  -0.0350646     0.0360687  -0.97    0.3310   -0.105775    0.0356463
exp4 & occ2: 3    0.0303057     0.0376552   0.80    0.4210   -0.0435153   0.104127
exp4 & occ2: 4    0.0584146     0.0457704   1.28    0.2019   -0.0313159   0.148145
exp4 & occ2: 5   -0.0515181     0.0549489  -0.94    0.3485   -0.159243    0.0562063
exp4 & occ2: 6   -0.0170182     0.0440847  -0.39    0.6995   -0.103444    0.0694076
exp4 & occ2: 7    0.190535      0.0558757   3.41    0.0007    0.0809939   0.300077
exp4 & occ2: 8    0.0196522     0.0379084   0.52    0.6042   -0.0546653   0.0939697
exp4 & occ2: 9    0.0190014     0.0421099   0.45    0.6518   -0.0635528   0.101556
exp4 & ind2: 11  -0.084012      0.0518917  -1.62    0.1055   -0.185743    0.0177191
exp4 & ind2: 12  -0.0390069     0.0424964  -0.92    0.3587   -0.122319    0.044305
exp4 & ind2: 13  -0.0672775     0.0518686  -1.30    0.1947   -0.168963    0.0344082
exp4 & ind2: 14  -6.82236e-5    0.0400746  -0.00    0.9986   -0.0786325   0.078496
exp4 & ind2: 15   0.0950646     0.104359    0.91    0.3624   -0.109525    0.299654
exp4 & ind2: 16   0.0438506     0.0451919   0.97    0.3319   -0.0447458   0.132447
exp4 & ind2: 17   0.00554933    0.043163    0.13    0.8977   -0.0790695   0.0901682
exp4 & ind2: 18  -0.00634059    0.0394237  -0.16    0.8722   -0.0836287   0.0709475
exp4 & ind2: 19   0.0213249     0.0523585   0.41    0.6838   -0.0813212   0.123971
exp4 & ind2: 2    0.148284      0.0918416   1.61    0.1065   -0.0317665   0.328335
exp4 & ind2: 20   0.00142884    0.0448582   0.03    0.9746   -0.0865134   0.0893711
exp4 & ind2: 21  -0.0549777     0.0473826  -1.16    0.2460   -0.147869    0.0379135
exp4 & ind2: 22   0.000189061   0.0424445   0.00    0.9964   -0.0830211   0.0833992
exp4 & ind2: 3   -0.236895      0.106698   -2.22    0.0264   -0.446071   -0.0277192
exp4 & ind2: 4    0.0273364     0.0534731   0.51    0.6092   -0.0774948   0.132168
exp4 & ind2: 5    0.00417968    0.0407488   0.10    0.9183   -0.0757062   0.0840656
exp4 & ind2: 6   -0.00432682    0.0402397  -0.11    0.9144   -0.0832147   0.0745611
exp4 & ind2: 7    0.0480848     0.0784057   0.61    0.5397   -0.105625    0.201795
exp4 & ind2: 8   -0.0126822     0.0561032  -0.23    0.8212   -0.12267     0.0973051
exp4 & ind2: 9    0.0304762     0.0353913   0.86    0.3892   -0.0389067   0.099859
exp4 & mw         0.00624394    0.0158699   0.39    0.6940   -0.0248681   0.037356
exp4 & so         0.000314457   0.0136275   0.02    0.9816   -0.0264016   0.0270305
exp4 & we         0.00176845    0.0159602   0.11    0.9118   -0.0295206   0.0330575
────────────────────────────────────────────────────────────────────────────────────
Number of regressors in the flexible model: 246

### 2.4.1. Re-estimating the flexible model using lasso

We re-estimate the flexible model using Lasso (the least absolute shrinkage and selection operator) rather than ols. Lasso is a penalized regression method that can be used to reduce the complexity of a regression model when the ratio \(p/n\) is not small. We will introduce this approach formally later in the course, but for now, we try it out here as a black-box method.
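For reference, lasso minimizes the least-squares criterion plus an \(\ell_1\) penalty on the coefficients, which shrinks the estimates toward zero and sets some of them exactly to zero. The penalty level \(\lambda \geq 0\) is chosen by the software along a regularization path (here via an AICc criterion, as the output below shows):

\[\hat{\beta}^{lasso} = \arg\min_{\beta} \frac{1}{n}\sum_{i=1}^{n} \left(Y_i - \beta'X_i\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]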

using Lasso

lasso_model = fit(LassoModel, flex, data)
StatsModels.TableRegressionModel{LassoModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, MinAICc}, Matrix{Float64}}

lwage ~ sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we + exp1 + exp2 + exp3 + exp4 + exp1 & shs + exp1 & hsg + exp1 & scl + exp1 & clg + exp1 & occ2 + exp1 & ind2 + exp1 & mw + exp1 & so + exp1 & we + exp2 & shs + exp2 & hsg + exp2 & scl + exp2 & clg + exp2 & occ2 + exp2 & ind2 + exp2 & mw + exp2 & so + exp2 & we + exp3 & shs + exp3 & hsg + exp3 & scl + exp3 & clg + exp3 & occ2 + exp3 & ind2 + exp3 & mw + exp3 & so + exp3 & we + exp4 & shs + exp4 & hsg + exp4 & scl + exp4 & clg + exp4 & occ2 + exp4 & ind2 + exp4 & mw + exp4 & so + exp4 & we

Coefficients:
LassoModel using MinAICc(2) segment of the regularization path.

Coefficients:
──────────────────
          Estimate
──────────────────
x1     3.34129
x2    -0.0631407
x3    -0.565639
x4    -0.50156
x5    -0.400049
x6    -0.14691
x7     0.0
x8    -0.360004
x9    -0.209659
x10   -0.188597
x11   -0.249669
x12   -0.363077
x13   -0.16079
x14   -0.353041
x15   -0.324731
x16   -0.178614
x17    0.0
x18   -0.165913
x19   -0.147868
x20   -0.369714
x21    0.00735858
x22    0.0
x23   -0.11278
x24   -0.330056
x25    0.00179747
x26   -0.327129
x27   -0.175565
x28    0.0646178
x29    0.172216
x30    0.0774118
x31    0.0506054
x32    0.0
x33   -0.0489911
x34   -0.0632706
x35   -0.0881081
x36   -0.137242
x37    0.282243
x38   -0.27567
x39    0.0
x40    0.103844
x41    0.124828
x42    0.0
x43    0.0
x44    0.0
x45    0.0266645
x46    0.041781
x47   -0.148758
x48    0.0
x49   -0.018235
x50    0.0
x51    0.0142924
x52    0.0
x53    0.0
x54    0.0
x55    0.0
x56    0.0
x57    0.0
x58   -0.000195042
x59    0.00361625
x60   -0.00321403
x61    0.0
x62   -0.00919592
x63   -0.0109497
x64   -0.00207628
x65    0.0
x66    0.0
x67    0.0
x68    0.0
x69    0.0
x70    0.0
x71   -0.00436299
x72    0.0
x73   -0.000609112
x74   -0.00228671
x75    0.0
x76    0.0
x77    0.0
x78    0.0
x79    0.00130242
x80    0.0
x81    0.0
x82    0.0
x83    0.0
x84    0.0
x85    0.0
x86    0.000821768
x87    0.0
x88    0.0
x89    0.0
x90    0.0
x91   -0.00541602
x92    0.00312622
x93    0.0
x94    0.0
x95   -0.00189558
x96   -0.000570929
x97    0.00105877
x98    0.0
x99    0.0
x100  -0.000714472
x101   0.0
x102   0.00278992
x103   0.0
x104   0.0
x105   0.0
x106   0.0
x107   0.0
x108   0.0
x109   0.0
x110  -0.00651112
x111   0.0
x112   0.0
x113   0.0
x114   0.0
x115   0.0
x116   0.0
x117  -0.00142291
x118   0.0
x119   0.0
x120   0.0
x121   0.0
x122  -0.00754277
x123   0.0
x124   0.0
x125   0.0
x126   0.0
x127   0.0
x128   0.0
x129   0.0
x130  -0.00780449
x131   0.0
x132   0.0
x133   0.0
x134   0.0
x135   0.0
x136   0.0
x137  -0.0240004
x138   0.0
x139   0.0
x140   0.0
x141   0.0
x142   0.0
x143   0.0
x144  -0.000657946
x145   0.00747602
x146  -0.0168673
x147   0.0
x148  -0.0023378
x149  -0.000272732
x150   0.0
x151   0.0
x152   0.0
x153   0.0
x154   0.0
x155   0.0
x156   0.0
x157   0.0
x158   0.0
x159   0.0
x160   0.0
x161   0.0
x162   0.0
x163   0.00459082
x164   0.0
x165  -0.00107904
x166   2.35047e-5
x167   0.0
x168   0.0
x169   0.0
x170   0.0
x171   0.0
x172   0.0
x173   0.0
x174   0.0
x175   0.0
x176   0.0
x177  -0.00168826
x178   0.0
x179   0.0
x180   0.0
x181   0.0
x182   0.0
x183   0.0
x184   0.000737299
x185   0.0
x186  -0.000841314
x187   0.0
x188   0.0
x189   0.0
x190   0.0
x191   0.0
x192   0.0
x193   0.0
x194   0.0
x195   0.0
x196   0.0
x197  -0.00102585
x198   0.0
x199   0.0
x200   0.000244024
x201   0.0
x202  -0.000476642
x203  -0.000894288
x204   0.0
x205  -0.00130759
x206   0.0
x207   0.0
x208  -0.000999333
x209  -0.000961788
x210  -0.000514903
x211   0.0
x212   0.0
x213  -0.00131441
x214   0.000250381
x215  -0.000652594
x216   0.0
x217   0.000192503
x218   0.0
x219   0.0
x220  -0.00166666
x221   0.0
x222   0.0
x223   0.0
x224  -0.000478704
x225   0.0
x226   0.0
x227   0.000417421
x228   0.0
x229  -0.000827938
x230   0.0
x231  -9.91966e-5
x232   0.000237123
x233   0.0
x234  -0.00042345
x235   0.0
x236  -0.00143342
x237  -0.00146757
x238  -0.000852059
x239   0.0
x240   0.0
x241   0.000435754
x242   0.0
x243   0.0
x244  -0.000537361
x245  -6.33045e-5
x246  -0.000877637
──────────────────

### 2.4.2. Evaluating the predictive performance of the basic and flexible models

Now, we can evaluate the performance of both models based on the (adjusted) \(R^2_{sample}\) and the (adjusted) \(MSE_{sample}\):
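Concretely, for a model with \(p\) regressors fit on \(n\) observations, writing \(\hat{\epsilon}_i\) for the in-sample residuals, these measures are (the helper function below computes them directly for the lasso fit, and via GLM's `r2`/`adjr2`, which use a slightly different degrees-of-freedom convention, for the ols fits):

\[MSE_{sample} = \frac{1}{n}\sum_{i=1}^{n}\hat{\epsilon}_i^2, \qquad MSE_{adj} = \frac{n}{n-p}\,MSE_{sample},\]

\[R^2_{sample} = 1 - \frac{\sum_{i=1}^{n}\hat{\epsilon}_i^2}{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}, \qquad R^2_{adj} = 1 - (1-R^2_{sample})\,\frac{n-1}{n-p-1}.\]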

n_data = size(data)[1]

function ref_bsc(model, lasso = false, n = n_data)
    if lasso
        # Lasso models do not support r2()/adjr2(), so compute them by hand
        p = length(coef(model))
        y_hat = predict(model)
        y_r = data.lwage
        r_2 = 1 - sum((y_r .- y_hat).^2) / sum((y_r .- mean(y_r)).^2)
        adj_r2 = 1 - (1 - r_2) * ((n - 1) / (n - p - 1))
    else
        p = length(coef(model))
        r_2 = r2(model)
        adj_r2 = adjr2(model)
    end

    # In-sample MSE and its degrees-of-freedom adjusted version
    mse = mean(residuals(model).^2)
    mse_adj = (n / (n - p)) * mse

    return [p, r_2, adj_r2, mse, mse_adj]
end

p1, r2_1, r2_adj1, mse1, mse_adj1 = ref_bsc(basic_results);

p2, r2_2, r2_adj2, mse2, mse_adj2 = ref_bsc(flex_results);

pL, r2_L, r2_adjL, mseL, mse_adjL = ref_bsc(lasso_model, true);

println("R-squared for the basic model: ", r2_1)
println("Adjusted R-squared for the basic model: ", r2_adj1)
println("R-squared for the flexible model: ", r2_2)
println("Adjusted R-squared for the flexible model: ", r2_adj2)
println("R-squared for the lasso with flexible model: ", r2_2)
println("Adjusted R-squared for the lasso with flexible model: ", r2_adj2, "\n")

println("MSE for the basic model: ", mse1)
println("MSE for the basic model: ", mse_adj1)
println("MSE for the flexible model: ", mse2)
println("MSE for the flexible model: ", mse_adj2)
println("MSE for the lasso with flexible model: ", mseL)
println("MSE for the lasso with flexible model: ", mse_adjL)
R-squared for the basic model: 0.31004650692219493
Adjusted R-squared for the basic model: 0.303280930406429
R-squared for the flexible model: 0.35110989506172274
Adjusted R-squared for the flexible model: 0.3186918535221881
R-squared for the lasso with flexible model: 0.32635
Adjusted R-squared for the lasso with flexible model: 0.292551

MSE for the basic model: 0.22442505581164396
Adjusted MSE for the basic model: 0.2266697465051905
MSE for the flexible model: 0.2110681364431821
Adjusted MSE for the flexible model: 0.22165597526149833
MSE for the lasso with flexible model: 0.21912180704256773
Adjusted MSE for the lasso with flexible model: 0.23011364320334907
We summarize the in-sample measures in a table:

DataFrame(
    Model = ["p", "R^2", "R^2 adjusted", "MSE", "MSE adjusted"],
    Basic_reg = ref_bsc(basic_results),
    Flexible_reg = ref_bsc(flex_results),
    lasso_flex = ref_bsc(lasso_model, true)
)

5 rows × 4 columns

| | Model | Basic_reg | Flexible_reg | lasso_flex |
|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 |
| 1 | p | 51.0 | 246.0 | 246.0 |
| 2 | R^2 | 0.310047 | 0.35111 | 0.32635 |
| 3 | R^2 adjusted | 0.303281 | 0.318692 | 0.292551 |
| 4 | MSE | 0.224425 | 0.211068 | 0.219122 |
| 5 | MSE adjusted | 0.22667 | 0.221656 | 0.230114 |

Considering the measures above, the flexible model performs slightly better than the basic model.

As \(p/n\) is not large, the discrepancy between the adjusted and unadjusted measures is not large. However, if it were, we might still like to apply data splitting as a more general procedure to deal with potential overfitting. We illustrate the approach in the following.

## 2.5. Data Splitting

Measure the prediction quality of the two models via data splitting:

  • Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we might consider).

  • Use the training sample to estimate the parameters of the Basic Model and the Flexible Model.

  • Use the testing sample for evaluation. Predict the \(\mathtt{wage}\) of every observation in the testing sample based on the estimated parameters in the training sample.

  • Calculate the Mean Squared Prediction Error \(MSE_{test}\) based on the testing sample for both prediction models.

# Split the sample into an 80% training set and a 20% testing set
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(data, 4/5)

# Estimate the basic model on the training sample
reg_basic = lm(basic, train)

train_reg_basic = predict(reg_basic, test)
y_test = test.lwage

mse_test1 = sum((y_test .- train_reg_basic).^2) / length(y_test)
r2_test1 = 1 - mse_test1 / var(y_test)

print("Test MSE for the basic model: $mse_test1\nTest R2 for the basic model: $r2_test1")
Test MSE for the basic model: 0.2185049608897876
Test R2 for the basic model: 0.32969404519304046

In the basic model, the \(MSE_{test}\) is quite close to the \(MSE_{sample}\).

reg_flex = lm(flex, train)
train_reg_flex = predict(reg_flex, test)
mse_test2 = sum((y_test .- train_reg_flex).^2) / length(y_test)
r2_test2 = 1 - mse_test2 / var(y_test)

print("Test MSE for the flexible model: $mse_test2\nTest R2 for the flexible model: $r2_test2")
Test MSE for the flexible model: 0.2549752677075409
Test R2 for the flexible model: 0.20895231863861619

In the flexible model too, the discrepancy between the \(MSE_{test}\) and the \(MSE_{sample}\) is not large.

It is worth noticing that the \(MSE_{test}\) varies across different data splits. Hence, it is a good idea to average the out-of-sample MSE over different data splits to get valid results.
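A minimal sketch of such averaging, reusing `TrainTestSplit`, `lm`, and `predict` from above (the number of repetitions and the seed are arbitrary illustrative choices, not part of the original lab):

# Average a model's test MSE over several random 80/20 splits.
using Random
Random.seed!(1234)                            # arbitrary seed, for reproducibility only
function avg_test_mse(formula, data; reps = 10)
    mses = Float64[]
    for _ in 1:reps
        tr, te = TrainTestSplit(data, 4/5)    # fresh random split each repetition
        m = lm(formula, tr)
        push!(mses, mean((te.lwage .- predict(m, te)).^2))
    end
    return mean(mses)
end
# avg_test_mse(basic, data)   # e.g., averaged test MSE for the basic model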

Nevertheless, we observe that, based on the out-of-sample \(MSE\), the basic model using ols regression performs about as well as (or slightly better than) the flexible model.

Next, let us use lasso regression in the flexible model instead of ols regression. The out-of-sample \(MSE\) on the test sample can be computed for any black-box prediction method, so we also compare the performance of lasso regression in the flexible model to ols regression.

reg_lasso = fit(LassoModel, flex, train)
train_reg_lasso = predict(reg_lasso, test)
mse_lasso = sum((y_test .- train_reg_lasso).^2) / length(y_test)
r2_lasso = 1 - mse_lasso / var(y_test)
print("Test MSE for the lasso on the flexible model: $mse_lasso\nTest R2 for the lasso on the flexible model: $r2_lasso")
Test MSE for the lasso on the flexible model: 0.22909929950928218
Test R2 for the lasso on the flexible model: 0.28923118187993957

Finally, let us summarize the results:

MSE = [mse_test1, mse_test2, mse_lasso]
R2 = [r2_test1, r2_test2, r2_lasso]
Model = ["Basic reg", "Flexible reg", "Lasso Regression"]
DataFrame( Model = Model, MSE_test = MSE, R2_test = R2)

3 rows × 3 columns

| | Model | MSE_test | R2_test |
|---|---|---|---|
| | String | Float64 | Float64 |
| 1 | Basic reg | 0.229097 | 0.289239 |
| 2 | Flexible reg | 0.254975 | 0.208952 |
| 3 | Lasso Regression | 0.229099 | 0.289231 |