# !wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
# !dpkg -i cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
# !apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
# !apt update -q
# !apt install cuda gcc-6 g++-6 -y -q
# !ln -s /usr/bin/gcc-6 /usr/local/cuda/bin/gcc
# !ln -s /usr/bin/g++-6 /usr/local/cuda/bin/g++
# !curl -sSL "https://julialang-s3.julialang.org/bin/linux/x64/1.7/julia-1.7.3-linux-x86_64.tar.gz" -o julia.tar.gz
# !tar -xzf julia.tar.gz -C /usr --strip-components 1
# !rm -rf julia.tar.gz*
# !julia -e 'using Pkg; pkg"add IJulia; precompile"'

3. OLS and lasso for gender wage gap inference

In the previous lab, we analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and addressed the question of how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question:

What is the difference in predicted wages between men and women with the same job-relevant characteristics?

Thus, we analyze whether there is a difference in the pay of men and women (the gender wage gap). The gender wage gap may partly reflect discrimination against women in the labor market, or it may partly reflect a selection effect, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).

To investigate the gender wage gap, we consider the following log-linear regression model

\[\begin{split} \begin{align} \log(Y) &= \beta'X + \epsilon \\ &= \beta_1 D + \beta_2' W + \epsilon, \end{align} \end{split}\]

where \(D\) is the indicator of being female (\(1\) if female and \(0\) otherwise) and the \(W\)'s are controls explaining variation in wages. Because wages enter in logarithms, we are analyzing the relative difference in the pay of men and women.
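Concretely, holding \(W\) fixed, the model implies a relative difference in predicted wages of

\[ \frac{\exp(\beta_1 + \beta_2'W) - \exp(\beta_2'W)}{\exp(\beta_2'W)} = e^{\beta_1} - 1 \approx \beta_1, \]

where the approximation is accurate when \(\beta_1\) is close to zero.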

3.1. Data Analysis

We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Let us load the data set.

# using Pkg
# Pkg.add("CSV")
# Pkg.add("DataFrames")
# Pkg.add("Dates")
# Pkg.add("Plots")
# Pkg.add("GLMNet")
# Pkg.add("RData")
using GLMNet
using CSV
using DataFrames
using Dates
using Plots
using Statistics
using RData
url = "https://github.com/d2cml-ai/14.388_jl/raw/github_data/data/wage2015_subsample_inference.RData"
download(url, "data.RData")
rdata_read = load("data.RData")
rm("data.RData")
data = rdata_read["data"]
names(data)
println("Number of Rows : ", size(data)[1], "\n", "Number of Columns : ", size(data)[2]) # rows and columns
Number of Rows : 5150
Number of Columns : 20

To start our (causal) analysis, we compare the sample means given gender:

Z = select(data, ["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"])

data_female = filter(row -> row.sex == 1, data)
Z_female = select(data_female,["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"] )

data_male = filter(row -> row.sex == 0, data)
Z_male = select(data_male,["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"] )

means = DataFrame( variables = names(Z), All = describe(Z, :mean)[!,2], Men = describe(Z_male,:mean)[!,2], Female = describe(Z_female,:mean)[!,2])

12 rows × 4 columns

    variables   All        Men        Female
    String      Float64    Float64    Float64
 1  lwage        2.97079    2.98783    2.94948
 2  sex          0.444466   0.0        1.0
 3  shs          0.023301   0.0318071  0.0126693
 4  hsg          0.243883   0.294303   0.180865
 5  scl          0.278058   0.273331   0.283967
 6  clg          0.31767    0.293953   0.347313
 7  ad           0.137087   0.106606   0.175186
 8  ne           0.227767   0.22195    0.235037
 9  mw           0.259612   0.259      0.260376
10  so           0.296505   0.298148   0.294452
11  we           0.216117   0.220902   0.210135
12  exp1        13.7606    13.784     13.7313

In particular, the table above shows that the difference in average log wage between women and men is equal to \(-0.038\):

mean(Z_female[:,:lwage]) - mean(Z_male[:,:lwage])
-0.03834473367441493

Thus, the unconditional gender wage gap is about \(3.8\)% for the group of never-married workers (women get paid less on average in our sample). We also observe that never-married working women are relatively more educated than working men and have less work experience.
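Since lwage is measured in logs, the \(-0.038\) figure is a log-point difference. As a quick check, we can convert it to the exact percentage difference it implies, reusing Z_female and Z_male from above:

gap = mean(Z_female[:, :lwage]) - mean(Z_male[:, :lwage])
# exact relative difference implied by the log-point gap
println("Exact relative gap: ", round((exp(gap) - 1) * 100, digits = 2), " %")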

This unconditional (predictive) effect of gender equals the coefficient \(\beta\) in the univariate OLS regression of \(Y\) on \(D\):

\[ \log(Y) =\beta D + \epsilon. \]

We verify this by running an OLS regression in Julia.

# Install all the packages we may need
# Pkg.add("Plots")
# Pkg.add("Lathe")
# Pkg.add("GLM")
# Pkg.add("StatsPlots")
# Pkg.add("MLBase")
# Pkg.add("Tables")

# Load the installed packages
using DataFrames
using CSV
using Tables
using Plots
using Lathe
using GLM
nocontrol_model = lm(@formula(lwage ~ sex),data)
nocontrol_est = GLM.coef(nocontrol_model)[2]
nocontrol_se = GLM.coeftable(nocontrol_model).cols[2][2]

println("The estimated gender coefficient is ", nocontrol_est ," and the corresponding robust standard error is " ,nocontrol_se )
The estimated gender coefficient is -0.03834473367440746 and the corresponding robust standard error is 0.015987825519430444

Next, we run an OLS regression of \(Y\) on \((D,W)\) to control for the effect of covariates summarized in \(W\):

\[ \log(Y) = \beta_1 D + \beta_2' W + \epsilon. \]

Here, we are considering the flexible model from the previous lab. Hence, \(W\) controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions.

Let us run the OLS regression with controls.

3.2. OLS regression with controls

flex = @formula(lwage ~ sex + (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))
control_model = lm(flex , data)
control_est = GLM.coef(control_model)[2]
control_se = GLM.coeftable(control_model).cols[2][2]
println(control_model)
println("Coefficient for OLS with controls " , control_est)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

lwage ~ 1 + sex + exp1 + exp2 + exp3 + exp4 + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we + exp1 & shs + exp1 & hsg + exp1 & scl + exp1 & clg + exp1 & occ2 + exp1 & ind2 + exp1 & mw + exp1 & so + exp1 & we + exp2 & shs + exp2 & hsg + exp2 & scl + exp2 & clg + exp2 & occ2 + exp2 & ind2 + exp2 & mw + exp2 & so + exp2 & we + exp3 & shs + exp3 & hsg + exp3 & scl + exp3 & clg + exp3 & occ2 + exp3 & ind2 + exp3 & mw + exp3 & so + exp3 & we + exp4 & shs + exp4 & hsg + exp4 & scl + exp4 & clg + exp4 & occ2 + exp4 & ind2 + exp4 & mw + exp4 & so + exp4 & we

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                        Coef.  Std. Error      t  Pr(>|t|)    Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)       3.27968       0.284196   11.54    <1e-29    2.72253     3.83683
sex              -0.0695532     0.015218   -4.57    <1e-05   -0.0993874  -0.039719
exp1              0.0835331     0.0944741   0.88    0.3766   -0.101678    0.268745
exp2             -0.561036      0.948991   -0.59    0.5544   -2.42148     1.29941
exp3              0.129343      0.356603    0.36    0.7168   -0.569759    0.828445
exp4             -0.00534853    0.0445154  -0.12    0.9044   -0.0926186   0.0819216
shs              -0.123309      0.906832   -0.14    0.8918   -1.90111     1.65449
hsg              -0.528902      0.197756   -2.67    0.0075   -0.916593   -0.141212
scl              -0.292058      0.126016   -2.32    0.0205   -0.539105   -0.0450112
clg              -0.0411641     0.0703862  -0.58    0.5587   -0.179153    0.0968245
occ2: 10          0.0209545     0.156498    0.13    0.8935   -0.285852    0.327761
occ2: 11         -0.642418      0.30909    -2.08    0.0377   -1.24837    -0.036463
occ2: 12         -0.0674774     0.252049   -0.27    0.7889   -0.561605    0.426651
occ2: 13         -0.232978      0.231538   -1.01    0.3144   -0.686896    0.22094
occ2: 14          0.256201      0.322673    0.79    0.4272   -0.376382    0.888784
occ2: 15         -0.193858      0.259508   -0.75    0.4551   -0.702611    0.314894
occ2: 16         -0.0551256     0.147066   -0.37    0.7078   -0.34344     0.233189
occ2: 17         -0.415609      0.136114   -3.05    0.0023   -0.682454   -0.148764
occ2: 18         -0.482217      1.04435    -0.46    0.6443   -2.52962     1.56518
occ2: 19         -0.257941      0.332522   -0.78    0.4380   -0.909832    0.39395
occ2: 2           0.16134       0.129724    1.24    0.2137   -0.0929781   0.415657
occ2: 20         -0.30102       0.234102   -1.29    0.1986   -0.759965    0.157925
occ2: 21         -0.427181      0.220649   -1.94    0.0529   -0.859751    0.00538904
occ2: 22         -0.869453      0.297522   -2.92    0.0035   -1.45273    -0.286176
occ2: 3           0.210151      0.168677    1.25    0.2129   -0.120532    0.540835
occ2: 4           0.070857      0.183717    0.39    0.6997   -0.28931     0.431024
occ2: 5          -0.396008      0.18854    -2.10    0.0357   -0.76563    -0.026385
occ2: 6          -0.231061      0.186966   -1.24    0.2166   -0.597599    0.135476
occ2: 7           0.314725      0.194152    1.62    0.1051   -0.0658997   0.69535
occ2: 8          -0.187542      0.169299   -1.11    0.2680   -0.519443    0.14436
occ2: 9          -0.339027      0.16723    -2.03    0.0427   -0.666873   -0.0111811
ind2: 11         -0.404651      0.314232   -1.29    0.1979   -1.02069     0.211384
ind2: 12         -0.156994      0.279436   -0.56    0.5743   -0.704813    0.390825
ind2: 13         -0.437744      0.361684   -1.21    0.2262   -1.14681     0.271318
ind2: 14         -0.00543335    0.270026   -0.02    0.9839   -0.534804    0.523938
ind2: 15          0.200448      0.500542    0.40    0.6888   -0.780838    1.18173
ind2: 16          0.0101935     0.299579    0.03    0.9729   -0.577116    0.597503
ind2: 17         -0.2396        0.284972   -0.84    0.4005   -0.798273    0.319073
ind2: 18         -0.180776      0.278446   -0.65    0.5162   -0.726655    0.365103
ind2: 19         -0.300698      0.326555   -0.92    0.3572   -0.940892    0.339497
ind2: 2           0.580584      0.480878    1.21    0.2274   -0.362151    1.52332
ind2: 20         -0.329318      0.312909   -1.05    0.2927   -0.942761    0.284125
ind2: 21         -0.178069      0.303635   -0.59    0.5576   -0.773331    0.417192
ind2: 22          0.176507      0.296357    0.60    0.5515   -0.404487    0.7575
ind2: 3          -0.666781      0.560741   -1.19    0.2345   -1.76608     0.432522
ind2: 4           0.485756      0.349761    1.39    0.1650   -0.199933    1.17144
ind2: 5           0.051198      0.297545    0.17    0.8634   -0.532124    0.63452
ind2: 6          -0.0415848     0.299194   -0.14    0.8895   -0.628138    0.544969
ind2: 7           0.0758343     0.387069    0.20    0.8447   -0.682995    0.834664
ind2: 8          -0.14896       0.337453   -0.44    0.6589   -0.810518    0.512598
ind2: 9          -0.221949      0.274596   -0.81    0.4190   -0.76028     0.316382
mw                0.110683      0.0814463   1.36    0.1742   -0.0489879   0.270355
so                0.0224244     0.0743855   0.30    0.7631   -0.123404    0.168253
we               -0.0215659     0.0841591  -0.26    0.7978   -0.186556    0.143424
exp1 & shs       -0.191998      0.195541   -0.98    0.3262   -0.575346    0.191349
exp1 & hsg       -0.0173433     0.0572279  -0.30    0.7619   -0.129536    0.0948491
exp1 & scl       -0.0664505     0.043373   -1.53    0.1256   -0.151481    0.01858
exp1 & clg       -0.0550346     0.0310279  -1.77    0.0762   -0.115863    0.00579393
exp1 & occ2: 10   0.00756285    0.0581715   0.13    0.8966   -0.106479    0.121605
exp1 & occ2: 11   0.101422      0.100509    1.01    0.3130   -0.0956214   0.298466
exp1 & occ2: 12  -0.0862744     0.0874768  -0.99    0.3241   -0.257768    0.0852193
exp1 & occ2: 13   0.00671485    0.0761825   0.09    0.9298   -0.142637    0.156067
exp1 & occ2: 14  -0.136915      0.0974458  -1.41    0.1601   -0.327953    0.0541221
exp1 & occ2: 15  -0.0400425     0.0898931  -0.45    0.6560   -0.216273    0.136188
exp1 & occ2: 16  -0.0539314     0.0520926  -1.04    0.3006   -0.156056    0.0481934
exp1 & occ2: 17   0.0147277     0.0467903   0.31    0.7530   -0.0770023   0.106458
exp1 & occ2: 18   0.10741       0.471844    0.23    0.8199   -0.817616    1.03244
exp1 & occ2: 19   0.0047165     0.106074    0.04    0.9645   -0.203237    0.21267
exp1 & occ2: 2   -0.0736239     0.0501108  -1.47    0.1418   -0.171863    0.0246157
exp1 & occ2: 20   0.0243156     0.0743274   0.33    0.7436   -0.121399    0.170031
exp1 & occ2: 21   0.0791776     0.0696947   1.14    0.2560   -0.0574551   0.21581
exp1 & occ2: 22   0.109325      0.0880828   1.24    0.2146   -0.063357    0.282006
exp1 & occ2: 3   -0.0714859     0.0637688  -1.12    0.2623   -0.196501    0.0535296
exp1 & occ2: 4   -0.0723997     0.0747715  -0.97    0.3330   -0.218985    0.0741859
exp1 & occ2: 5    0.0946732     0.0794005   1.19    0.2332   -0.0609873   0.250334
exp1 & occ2: 6   -0.0348928     0.0712136  -0.49    0.6242   -0.174503    0.104718
exp1 & occ2: 7   -0.227934      0.078486   -2.90    0.0037   -0.381802   -0.074066
exp1 & occ2: 8   -0.0727459     0.0645883  -1.13    0.2601   -0.199368    0.0538762
exp1 & occ2: 9    0.0274143     0.0669517   0.41    0.6822   -0.103841    0.15867
exp1 & ind2: 11   0.166231      0.105875    1.57    0.1165   -0.0413313   0.373793
exp1 & ind2: 12   0.107851      0.0933357   1.16    0.2479   -0.0751287   0.290831
exp1 & ind2: 13   0.188352      0.11737     1.60    0.1086   -0.0417468   0.41845
exp1 & ind2: 14  -0.00711671    0.0891863  -0.08    0.9364   -0.181962    0.167728
exp1 & ind2: 15  -0.208076      0.203289   -1.02    0.3061   -0.606614    0.190462
exp1 & ind2: 16  -0.0665283     0.0991814  -0.67    0.5024   -0.260968    0.127912
exp1 & ind2: 17   0.0216289     0.094615    0.23    0.8192   -0.163859    0.207117
exp1 & ind2: 18   0.00528206    0.0906315   0.06    0.9535   -0.172396    0.18296
exp1 & ind2: 19   0.000352497   0.110676    0.00    0.9975   -0.216623    0.217328
exp1 & ind2: 2   -0.151258      0.164434   -0.92    0.3577   -0.473622    0.171107
exp1 & ind2: 20  -0.0185949     0.102096   -0.18    0.8555   -0.218749    0.181559
exp1 & ind2: 21   0.0678327     0.100782    0.67    0.5009   -0.129745    0.26541
exp1 & ind2: 22  -0.0366764     0.0964226  -0.38    0.7037   -0.225708    0.152355
exp1 & ind2: 3    0.324631      0.188468    1.72    0.0850   -0.0448505   0.694113
exp1 & ind2: 4   -0.136527      0.114344   -1.19    0.2325   -0.360692    0.0876375
exp1 & ind2: 5   -0.0255591     0.0964004  -0.27    0.7909   -0.214547    0.163429
exp1 & ind2: 6    0.00276967    0.0959674   0.03    0.9770   -0.185369    0.190909
exp1 & ind2: 7   -0.0483333     0.132829   -0.36    0.7160   -0.308738    0.212071
exp1 & ind2: 8    0.0845092     0.118195    0.72    0.4746   -0.147205    0.316224
exp1 & ind2: 9   -0.0153499     0.0872762  -0.18    0.8604   -0.18645     0.155751
exp1 & mw        -0.0279931     0.0296572  -0.94    0.3453   -0.0861345   0.0301484
exp1 & so        -0.00996775    0.0266868  -0.37    0.7088   -0.0622858   0.0423503
exp1 & we         0.00630768    0.0301417   0.21    0.8342   -0.0527835   0.0653989
exp2 & shs        1.90051       1.45025     1.31    0.1901   -0.94263     4.74364
exp2 & hsg        0.117164      0.550973    0.21    0.8316   -0.962989    1.19732
exp2 & scl        0.621792      0.462999    1.34    0.1793   -0.285892    1.52948
exp2 & clg        0.409675      0.380217    1.08    0.2813   -0.335721    1.15507
exp2 & occ2: 10  -0.269229      0.640527   -0.42    0.6743   -1.52495     0.986491
exp2 & occ2: 11  -1.08165       1.00576    -1.08    0.2822   -3.05339     0.890081
exp2 & occ2: 12   0.832374      0.934125    0.89    0.3729   -0.998929    2.66368
exp2 & occ2: 13  -0.220981      0.772846   -0.29    0.7749   -1.73611     1.29414
exp2 & occ2: 14   0.751116      0.927255    0.81    0.4180   -1.06672     2.56895
exp2 & occ2: 15  -0.0326858     0.940912   -0.03    0.9723   -1.87729     1.81192
exp2 & occ2: 16   0.363581      0.550955    0.66    0.5093   -0.716537    1.4437
exp2 & occ2: 17  -0.265929      0.486113   -0.55    0.5844   -1.21893     0.687071
exp2 & occ2: 18  -2.56088       5.17009    -0.50    0.6204  -12.6966      7.57482
exp2 & occ2: 19  -0.129176      1.06169    -0.12    0.9032   -2.21056     1.95221
exp2 & occ2: 2    0.663217      0.552322    1.20    0.2299   -0.419581    1.74602
exp2 & occ2: 20  -0.33233       0.722907   -0.46    0.6457   -1.74955     1.08489
exp2 & occ2: 21  -0.91          0.685411   -1.33    0.1843   -2.25371     0.433714
exp2 & occ2: 22  -0.855054      0.827941   -1.03    0.3018   -2.47819     0.768082
exp2 & occ2: 3    0.641546      0.710278    0.90    0.3664   -0.750918    2.03401
exp2 & occ2: 4    0.974842      0.865535    1.13    0.2601   -0.721994    2.67168
exp2 & occ2: 5   -0.977882      0.973799   -1.00    0.3153   -2.88696     0.9312
exp2 & occ2: 6    0.105086      0.800227    0.13    0.8955   -1.46372     1.67389
exp2 & occ2: 7    3.14071       0.938942    3.34    0.0008    1.29996     4.98146
exp2 & occ2: 8    0.671088      0.719208    0.93    0.3508   -0.738881    2.08106
exp2 & occ2: 9    0.0231977     0.762914    0.03    0.9757   -1.47246     1.51885
exp2 & ind2: 11  -1.68035       1.08035    -1.56    0.1199   -3.79831     0.437612
exp2 & ind2: 12  -0.971713      0.934825   -1.04    0.2986   -2.80439     0.860963
exp2 & ind2: 13  -1.76787       1.1562     -1.53    0.1263   -4.03453     0.498797
exp2 & ind2: 14   0.119001      0.888083    0.13    0.8934   -1.62204     1.86004
exp2 & ind2: 15   2.3885        2.20174     1.08    0.2781   -1.92789     6.70489
exp2 & ind2: 16   0.870745      0.990124    0.88    0.3792   -1.07034     2.81183
exp2 & ind2: 17  -0.00295735    0.948274   -0.00    0.9975   -1.862       1.85609
exp2 & ind2: 18  -0.00329326    0.889074   -0.00    0.9970   -1.74628     1.73969
exp2 & ind2: 19   0.266476      1.11891     0.24    0.8118   -1.92709     2.46004
exp2 & ind2: 2    2.19733       1.77386     1.24    0.2155   -1.28024     5.6749
exp2 & ind2: 20   0.250603      1.01001     0.25    0.8041   -1.72947     2.23068
exp2 & ind2: 21  -0.915406      1.01458    -0.90    0.3670   -2.90443     1.07362
exp2 & ind2: 22   0.339496      0.949368    0.36    0.7207   -1.52169     2.20068
exp2 & ind2: 3   -3.73956       1.95946    -1.91    0.0564   -7.58098     0.101848
exp2 & ind2: 4    1.09199       1.143       0.96    0.3394   -1.1488      3.33278
exp2 & ind2: 5    0.182412      0.938629    0.19    0.8459   -1.65772     2.02254
exp2 & ind2: 6   -0.0304448     0.929643   -0.03    0.9739   -1.85296     1.79207
exp2 & ind2: 7    0.73252       1.43981     0.51    0.6109   -2.09016     3.5552
exp2 & ind2: 8   -0.750665      1.20255    -0.62    0.5325   -3.1082      1.60687
exp2 & ind2: 9    0.417708      0.838948    0.50    0.6186   -1.22701     2.06242
exp2 & mw         0.200561      0.317291    0.63    0.5273   -0.421472    0.822594
exp2 & so         0.0544354     0.281566    0.19    0.8467   -0.49756     0.606431
exp2 & we         0.00127174    0.320787    0.00    0.9968   -0.627615    0.630159
exp3 & shs       -0.672124      0.442663   -1.52    0.1290   -1.53994     0.195693
exp3 & hsg       -0.0179937     0.208318   -0.09    0.9312   -0.426389    0.390402
exp3 & scl       -0.199788      0.185519   -1.08    0.2816   -0.563488    0.163912
exp3 & clg       -0.102523      0.164365   -0.62    0.5328   -0.424752    0.219706
exp3 & occ2: 10   0.185475      0.257556    0.72    0.4715   -0.319451    0.690401
exp3 & occ2: 11   0.393155      0.381776    1.03    0.3032   -0.355296    1.14161
exp3 & occ2: 12  -0.220256      0.366021   -0.60    0.5474   -0.93782     0.497308
exp3 & occ2: 13   0.0950356     0.290437    0.33    0.7435   -0.474351    0.664422
exp3 & occ2: 14  -0.144393      0.334162   -0.43    0.6657   -0.799501    0.510714
exp3 & occ2: 15   0.147708      0.364519    0.41    0.6853   -0.566913    0.862328
exp3 & occ2: 16  -0.0378548     0.215129   -0.18    0.8603   -0.459604    0.383894
exp3 & occ2: 17   0.15105       0.187808    0.80    0.4213   -0.217138    0.519238
exp3 & occ2: 18   1.40844       1.88525     0.75    0.4550   -2.28748     5.10437
exp3 & occ2: 19   0.0923425     0.404231    0.23    0.8193   -0.700131    0.884816
exp3 & occ2: 2   -0.20394       0.221139   -0.92    0.3565   -0.637471    0.22959
exp3 & occ2: 20   0.180699      0.265208    0.68    0.4957   -0.339227    0.700626
exp3 & occ2: 21   0.377908      0.255303    1.48    0.1389   -0.1226      0.878417
exp3 & occ2: 22   0.285506      0.298421    0.96    0.3388   -0.299532    0.870544
exp3 & occ2: 3   -0.236962      0.287037   -0.83    0.4091   -0.799683    0.325759
exp3 & occ2: 4   -0.436696      0.352017   -1.24    0.2148   -1.12681     0.253415
exp3 & occ2: 5    0.38853       0.411886    0.94    0.3456   -0.418951    1.19601
exp3 & occ2: 6    0.0484737     0.329353    0.15    0.8830   -0.597205    0.694152
exp3 & occ2: 7   -1.39493       0.405011   -3.44    0.0006   -2.18893    -0.600926
exp3 & occ2: 8   -0.20539       0.289573   -0.71    0.4782   -0.773082    0.362302
exp3 & occ2: 9   -0.090966      0.314335   -0.29    0.7723   -0.707203    0.525271
exp3 & ind2: 11   0.642942      0.41101     1.56    0.1178   -0.162821    1.4487
exp3 & ind2: 12   0.328629      0.346913    0.95    0.3435   -0.351475    1.00873
exp3 & ind2: 13   0.592851      0.425896    1.39    0.1640   -0.242095    1.4278
exp3 & ind2: 14  -0.0285251     0.328414   -0.09    0.9308   -0.672364    0.615314
exp3 & ind2: 15  -0.856868      0.843546   -1.02    0.3098   -2.5106      0.796861
exp3 & ind2: 16  -0.355848      0.367758   -0.97    0.3333   -1.07682     0.365122
exp3 & ind2: 17  -0.0362622     0.352283   -0.10    0.9180   -0.726895    0.654371
exp3 & ind2: 18   0.0157436     0.325024    0.05    0.9614   -0.621448    0.652936
exp3 & ind2: 19  -0.14883       0.420781   -0.35    0.7236   -0.973749    0.67609
exp3 & ind2: 2   -1.04482       0.706672   -1.48    0.1393   -2.43021     0.340577
exp3 & ind2: 20  -0.0679218     0.37047    -0.18    0.8545   -0.794209    0.658365
exp3 & ind2: 21   0.396705      0.38116     1.04    0.2980   -0.350539    1.14395
exp3 & ind2: 22  -0.0760277     0.34879    -0.22    0.8275   -0.759812    0.607756
exp3 & ind2: 3    1.62176       0.784599    2.07    0.0388    0.0835992   3.15993
exp3 & ind2: 4   -0.314973      0.428727   -0.73    0.4626   -1.15547     0.525524
exp3 & ind2: 5   -0.0505912     0.339866   -0.15    0.8817   -0.716881    0.615699
exp3 & ind2: 6    0.0193266     0.335922    0.06    0.9541   -0.63923     0.677883
exp3 & ind2: 7   -0.335907      0.58607    -0.57    0.5666   -1.48487     0.813053
exp3 & ind2: 8    0.189279      0.451639    0.42    0.6752   -0.696137    1.07469
exp3 & ind2: 9   -0.216085      0.299794   -0.72    0.4711   -0.803816    0.371646
exp3 & mw        -0.0625771     0.124129   -0.50    0.6142   -0.305926    0.180772
exp3 & so        -0.0115842     0.108422   -0.11    0.9149   -0.224139    0.200971
exp3 & we        -0.0124875     0.125138   -0.10    0.9205   -0.257813    0.232838
exp4 & shs        0.0777418     0.0475427   1.64    0.1021   -0.0154632   0.170947
exp4 & hsg        0.000491255   0.0265964   0.02    0.9853   -0.0516497   0.0526322
exp4 & scl        0.021076      0.0245289   0.86    0.3903   -0.0270117   0.0691637
exp4 & clg        0.00786949    0.0227528   0.35    0.7295   -0.0367363   0.0524753
exp4 & occ2: 10  -0.0333347     0.0338825  -0.98    0.3252   -0.0997595   0.0330901
exp4 & occ2: 11  -0.0465914     0.0479018  -0.97    0.3308   -0.1405      0.0473175
exp4 & occ2: 12   0.0110212     0.0470536   0.23    0.8148   -0.0812249   0.103267
exp4 & occ2: 13  -0.0136895     0.0358988  -0.38    0.7030   -0.0840673   0.0566883
exp4 & occ2: 14   0.00555824    0.0400331   0.14    0.8896   -0.0729245   0.084041
exp4 & occ2: 15  -0.0327444     0.0462379  -0.71    0.4789   -0.123391    0.0579026
exp4 & occ2: 16  -0.00897062    0.0275729  -0.33    0.7449   -0.0630259   0.0450847
exp4 & occ2: 17  -0.0256735     0.0239306  -1.07    0.2834   -0.0725881   0.0212412
exp4 & occ2: 18  -0.212137      0.2204     -0.96    0.3358   -0.64422     0.219946
exp4 & occ2: 19  -0.0169398     0.0513428  -0.33    0.7415   -0.117595    0.083715
exp4 & occ2: 2    0.0176389     0.0289257   0.61    0.5420   -0.0390683   0.0743462
exp4 & occ2: 20  -0.0296125     0.0323353  -0.92    0.3598   -0.0930042   0.0337791
exp4 & occ2: 21  -0.0524577     0.0317251  -1.65    0.0983   -0.114653    0.00973765
exp4 & occ2: 22  -0.0350646     0.0360687  -0.97    0.3310   -0.105775    0.0356463
exp4 & occ2: 3    0.0303057     0.0376552   0.80    0.4210   -0.0435153   0.104127
exp4 & occ2: 4    0.0584146     0.0457704   1.28    0.2019   -0.0313159   0.148145
exp4 & occ2: 5   -0.0515181     0.0549489  -0.94    0.3485   -0.159243    0.0562063
exp4 & occ2: 6   -0.0170182     0.0440847  -0.39    0.6995   -0.103444    0.0694076
exp4 & occ2: 7    0.190535      0.0558757   3.41    0.0007    0.0809939   0.300077
exp4 & occ2: 8    0.0196522     0.0379084   0.52    0.6042   -0.0546653   0.0939697
exp4 & occ2: 9    0.0190014     0.0421099   0.45    0.6518   -0.0635528   0.101556
exp4 & ind2: 11  -0.084012      0.0518917  -1.62    0.1055   -0.185743    0.0177191
exp4 & ind2: 12  -0.0390069     0.0424964  -0.92    0.3587   -0.122319    0.044305
exp4 & ind2: 13  -0.0672775     0.0518686  -1.30    0.1947   -0.168963    0.0344082
exp4 & ind2: 14  -6.82236e-5    0.0400746  -0.00    0.9986   -0.0786325   0.078496
exp4 & ind2: 15   0.0950646     0.104359    0.91    0.3624   -0.109525    0.299654
exp4 & ind2: 16   0.0438506     0.0451919   0.97    0.3319   -0.0447458   0.132447
exp4 & ind2: 17   0.00554933    0.043163    0.13    0.8977   -0.0790695   0.0901682
exp4 & ind2: 18  -0.00634059    0.0394237  -0.16    0.8722   -0.0836287   0.0709475
exp4 & ind2: 19   0.0213249     0.0523585   0.41    0.6838   -0.0813212   0.123971
exp4 & ind2: 2    0.148284      0.0918416   1.61    0.1065   -0.0317665   0.328335
exp4 & ind2: 20   0.00142884    0.0448582   0.03    0.9746   -0.0865134   0.0893711
exp4 & ind2: 21  -0.0549777     0.0473826  -1.16    0.2460   -0.147869    0.0379135
exp4 & ind2: 22   0.000189061   0.0424445   0.00    0.9964   -0.0830211   0.0833992
exp4 & ind2: 3   -0.236895      0.106698   -2.22    0.0264   -0.446071   -0.0277192
exp4 & ind2: 4    0.0273364     0.0534731   0.51    0.6092   -0.0774948   0.132168
exp4 & ind2: 5    0.00417968    0.0407488   0.10    0.9183   -0.0757062   0.0840656
exp4 & ind2: 6   -0.00432682    0.0402397  -0.11    0.9144   -0.0832147   0.0745611
exp4 & ind2: 7    0.0480848     0.0784057   0.61    0.5397   -0.105625    0.201795
exp4 & ind2: 8   -0.0126822     0.0561032  -0.23    0.8212   -0.12267     0.0973051
exp4 & ind2: 9    0.0304762     0.0353913   0.86    0.3892   -0.0389067   0.099859
exp4 & mw         0.00624394    0.0158699   0.39    0.6940   -0.0248681   0.037356
exp4 & so         0.000314457   0.0136275   0.02    0.9816   -0.0264016   0.0270305
exp4 & we         0.00176845    0.0159602   0.11    0.9118   -0.0295206   0.0330575
────────────────────────────────────────────────────────────────────────────────────
Coefficient for OLS with controls -0.06955320329671351

The estimated regression coefficient \(\beta_1\approx-0.0696\) measures how our linear prediction of log wage changes if we set the gender variable \(D\) from 0 to 1, holding the controls \(W\) fixed. We can call this the predictive effect (PE), as it measures the impact of a variable on the prediction we make. Overall, we see that the unconditional wage gap of about \(4\)% for women increases to about \(7\)% after controlling for worker characteristics.
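For reference, we can also report a 95% confidence interval for the gender coefficient in the flexible model, using GLM.confint as in the partialling-out section below:

# 95% confidence interval for the sex coefficient (second row of the coefficient table)
println("95% CI for sex: ", GLM.confint(control_model)[2, :])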

Next, we use the Frisch-Waugh-Lovell theorem from the lecture, partialling out the linear effect of the controls via OLS.

3.3. Partialling-out using OLS

# models
# model for Y
flex_y = @formula(lwage ~ (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))
# model for D
flex_d = @formula(sex ~ (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))

# partialling-out the linear effect of W from Y
t_Y = residuals(lm(flex_y, data))

# partialling-out the linear effect of W from D
t_D = residuals(lm(flex_d, data))

data_res = DataFrame(t_Y = t_Y, t_D = t_D)
# regression of Y on D after partialling-out the effect of W
partial_fit = lm(@formula(t_Y ~ t_D), data_res)
partial_est = GLM.coef(partial_fit)[2]

println("Coefficient for D via partialling-out ", partial_est)

# standard error
partial_se = GLM.coeftable(partial_fit).cols[2][2]

# confidence interval
GLM.confint(partial_fit)[2,:]
Coefficient for D via partialling-out -0.06955320329684614
2-element Vector{Float64}:
 -0.0986714235748635
 -0.040434983018828786

Again, the estimated coefficient measures the linear predictive effect (PE) of \(D\) on \(Y\) after taking out the linear effect of \(W\) on both of these variables. This coefficient equals the estimated coefficient from the OLS regression with controls.
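We can verify this equality numerically. With control_est from Section 3.2 and partial_est from the cell above, the two estimates agree up to floating-point error:

# Frisch-Waugh-Lovell check: the difference should be numerical noise only
println("Difference between estimates: ", abs(control_est - partial_est))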

We know that the partialling-out approach works well when the dimension of \(W\) is low in relation to the sample size \(n\). When the dimension of \(W\) is relatively high, we need to use variable selection or penalization for regularization purposes.
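To see the dimensions involved in our flexible specification, we can count the regressors, reusing control_model and data from above:

p = size(modelmatrix(control_model), 2) - 1 # number of controls, excluding the intercept
n = nrow(data)
println("p = ", p, ", n = ", n, ", p/n = ", round(p / n, digits = 3))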

In the following, we illustrate the partialling-out approach using lasso instead of OLS.

3.4. Partialling-out using lasso

# models
# model for Y
flex_y = @formula(lwage ~  (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we));

# model for D
flex_d = @formula(sex ~ (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we));

3.4.1. With the Lasso package

# Pkg.add("Lasso")
using Lasso
# α is the elastic-net mixing parameter of Lasso.jl (α = 1 is the pure lasso)
lasso_y = fit(LassoModel, flex_y, data, α = 0.1)
t_y = residuals(lasso_y)

lasso_d = fit(LassoModel, flex_d, data, α = 0.1)
t_d = residuals(lasso_d)

data_res = DataFrame(t_Y = t_y, t_D = t_d )

partial_lasso_fit = lm(@formula(t_Y ~ t_D), data_res)
partial_lasso_est = GLM.coef(partial_lasso_fit)[2]
partial_lasso_se = GLM.coeftable(partial_lasso_fit).cols[2][2]

println("Coefficient for D via partialling-out using lasso ", partial_lasso_est)
Coefficient for D via partialling-out using lasso -0.0682195329952077

Using lasso for partialling-out here provides results similar to those from OLS.
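As a rough gauge of precision, an approximate normal-based 95% confidence interval built from the lasso partialling-out estimate and its standard error from above is:

# normal-approximation 95% confidence interval for the partialling-out estimate
ci = (partial_lasso_est - 1.96 * partial_lasso_se, partial_lasso_est + 1.96 * partial_lasso_se)
println("Approximate 95% CI: ", ci)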

Next, we summarize the results.

3.5. Summarize the results

DataFrame(modelos = [ "Without controls", "full reg", "partial reg", "partial reg via lasso" ], 
Estimate = [nocontrol_est,control_est,partial_est, partial_lasso_est], 
StdError = [nocontrol_se,control_se, partial_se, partial_lasso_se])

4 rows × 3 columns

   modelos                 Estimate    StdError
   String                  Float64     Float64
1  Without controls       -0.0383447   0.0159878
2  full reg               -0.0695532   0.015218
3  partial reg            -0.0695532   0.014853
4  partial reg via lasso  -0.0682195   0.0148044

It is worth noticing that controlling for worker characteristics increases the estimated gender wage gap from less than 4% to about 7%. The controls we used in our analysis include 5 educational attainment indicators (less than high school graduate, high school graduate, some college, college graduate, and advanced degree), 4 region indicators (midwest, south, west, and northeast), a quartic term (first, second, third, and fourth powers) in experience, and 22 occupation and 23 industry indicators.

Keep in mind that the predictive effect (PE) does not only measure discrimination (the causal effect of being female); it may also reflect selection effects of unobserved differences in covariates between men and women in our sample.

Next, we try an “extra” flexible model, where we take two-way interactions of all controls, giving us about 1,000 controls.

3.6. “Extra” flexible model

# import Pkg
# Pkg.add("StatsModels")
# Pkg.add("Combinatorics")
# Pkg.add("IterTools")
# We have to extend StatsModels internally with the help of the IterTools package,
# because Julia does not understand (a formula)^2: it takes it as a single term,
# not as the two-way interactions between the variables.
# The code below fixes the problem mentioned above.
using StatsModels, Combinatorics, IterTools

# all non-empty combinations of the elements of x, up to size n
combinations_upto(x, n) = Iterators.flatten(combinations(x, i) for i in 1:n)

# expand (a + b + ...)^deg into the corresponding interaction terms
expand_exp(args, deg::ConstantTerm) =
    tuple(((&)(terms...) for terms in combinations_upto(args, deg.n))...)

# hook the expansion into StatsModels' schema application for ^
StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.Schema, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.FullRank, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)
extra_flex = @formula(lwage ~  sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2)

control_fit = lm(extra_flex, data)
control_est = GLM.coef(control_fit)[2]

println("Number of Extra-Flex Controls: ", size(modelmatrix(control_fit))[2] -1) #minus the intercept
println("Coefficient for OLS with extra flex controls ", control_est)

#std error
control_se = GLM.stderror(control_fit)[2];
Number of Extra-Flex Controls: 979
Coefficient for OLS with extra flex controls -0.06127046379432059

3.7. Lasso “Extra” Flexible model

extraflex_y = @formula(lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2)# model for Y
extraflex_d = @formula(sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2) # model for D

# partialling-out the linear effect of W from Y
t_y = residuals(fit(LassoModel, extraflex_y, data,standardize = false))
# partialling-out the linear effect of W from D
t_d = residuals(fit(LassoModel, extraflex_d, data,standardize = false))

data_partial = DataFrame(t_y = t_y, t_d = t_d )

# regression of Y on D after partialling-out the effect of W
partial_lasso_fit = lm(@formula(t_y~t_d), data_partial)

partial_lasso_est = GLM.coef(partial_lasso_fit)[2]

println("Coefficient for D via partialling-out using lasso :", partial_lasso_est)

#standard error

partial_lasso_se = GLM.stderror(partial_lasso_fit)[2];
Coefficient for D via partialling-out using lasso :-0.05876465629317397

3.8. Summarize the results

tabla3 = DataFrame(modelos = [ "Full reg", "partial reg via lasso" ], 
Estimate = [control_est,partial_lasso_est], 
StdError = [control_se,partial_lasso_se])

2 rows × 3 columns

   modelos                 Estimate    StdError
   String                  Float64     Float64
1  Full reg               -0.0612705   0.0159811
2  partial reg via lasso  -0.0587647   0.0133701

In this case \(p/n \approx 0.19\) (979 controls for 5,150 observations), that is, \(p/n\) is no longer small, and we start seeing differences between unregularized partialling-out and regularized partialling-out with lasso (double lasso). The results based on double lasso have rigorous guarantees in this non-small \(p/n\) regime under approximate sparsity. The results based on OLS still have guarantees in the \(p/n < 1\) regime under assumptions laid out in Cattaneo, Jansson, and Newey (2018), without approximate sparsity, although other regularity conditions are needed.