14. AutoML for wage prediction#

14.1. Automatic Machine Learning with H2O AutoML using Wage Data from 2015#

We illustrate how to predict an outcome variable Y in a high-dimensional setting using H2O's AutoML framework, which covers the complete pipeline from the raw dataset to a deployable machine learning model. In the last few years, AutoML, or automated machine learning, has become widely popular in the data science community.

We can use AutoML as a benchmark and compare it to the methods from the previous notebook, where we applied one machine learning method after another.

# Import relevant packages
import os
import warnings
from urllib.request import urlopen

import numpy as np
import pandas as pd
import pyreadr

# pip install h2o
# load the H2O package
import h2o
from h2o.automl import H2OAutoML

warnings.filterwarnings('ignore')

# start h2o cluster
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: 4 hours 11 mins
H2O_cluster_timezone: America/Bogota
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.36.1.3
H2O_cluster_version_age: 26 days
H2O_cluster_name: H2O_from_python_User_fi8ht0
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 2.467 Gb
H2O_cluster_total_cores: 4
H2O_cluster_allowed_cores: 4
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.9.12 final
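
As an aside, h2o.init() connects to a cluster that is already running at the default address; otherwise it starts a new local one. The resource limits reflected in the cluster information above can also be set explicitly. A minimal sketch, assuming a fresh cluster is being launched (nthreads and max_mem_size are standard h2o.init arguments; the values below are only illustrative):

# Start a fresh local cluster with explicit resource limits (illustrative values);
# these settings only take effect when a new cluster is launched, not on reconnect
h2o.init(nthreads = 4, max_mem_size = "4G")
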
link="https://raw.githubusercontent.com/d2cml-ai/14.388_py/main/data/wage2015_subsample_inference.Rdata"
response = urlopen(link)
content = response.read()
fhandle = open( 'wage2015_subsample_inference.Rdata', 'wb')
fhandle.write(content)
fhandle.close()
result = pyreadr.read_r("wage2015_subsample_inference.Rdata")
os.remove("wage2015_subsample_inference.Rdata")

# Extract the data frame from the pyreadr result dictionary
data = result[ 'data' ]
n = data.shape[0]
type(data)
pandas.core.frame.DataFrame
# Import relevant packages for splitting data
import math

# Set the seed to make the results replicable (generating random numbers)
np.random.seed(0)

# Assign a random integer to each observation, then sort by it to shuffle the rows
data["random"] = np.random.randint(0, n, size=n)
data_2 = data.sort_values(by=['random'])
# Create training and testing sample 
train = data_2[ : math.floor(n*3/4)]    # training sample
test =  data_2[ math.floor(n*3/4) : ]   # testing sample
print(train.shape)
print(test.shape)
(3862, 21)
(1288, 21)
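
For reference, a minimal sketch of the same 75/25 split using scikit-learn (train_test_split shuffles internally, so the exact rows assigned to each sample will differ from the manual shuffle above):

# Alternative split with scikit-learn; rows differ from the manual shuffle above
from sklearn.model_selection import train_test_split

train_alt, test_alt = train_test_split(data, test_size=0.25, random_state=0, shuffle=True)
print(train_alt.shape, test_alt.shape)
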
# convert data as h2o type
train_h = h2o.H2OFrame(train)
test_h = h2o.H2OFrame(test)

# have a look at the data
train_h.describe()
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Rows:3862
Cols:21
(per-column summary statistics: type, min, mean, max, sigma, zeros, and missing counts for all 21 variables, followed by the first ten rows of the training frame)
# Define the outcome and the regressors
y = 'lwage'

data_columns = list(data)
# Columns that should not enter the regressor set
no_relev_col = ['wage', 'occ2', 'ind2', 'random', 'lwage']

# Keep every column except the excluded ones
x = [col for col in data_columns if col not in no_relev_col]
# Run AutoML for at most 10 base models and a maximal runtime of 100 seconds
aml = H2OAutoML(max_runtime_secs = 100, max_models = 10, seed = 1)
aml.train(x = x, y = y, training_frame = train_h, leaderboard_frame = test_h)
AutoML progress: |████████████████████████████████████████████████████████████████| (done) 100%
05:14:21.267: AutoML: XGBoost is not available; skipping it.
Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_1_AutoML_2_20220804_51421

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 0.14963411645818037
RMSE: 0.38682569260350375
MAE: 0.29493414661545647
RMSLE: 0.09962964024727533
R^2: 0.5474439172253015
Mean Residual Deviance: 0.14963411645818037
Null degrees of freedom: 3861
Residual degrees of freedom: 3855
Null deviance: 1276.9399854672324
Residual deviance: 577.8869577614926
AIC: 3639.772059449712

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.21847845950742845
RMSE: 0.4674167942077268
MAE: 0.35554195972164104
RMSLE: 0.11935352477830528
R^2: 0.3392298618411601
Mean Residual Deviance: 0.21847845950742845
Null degrees of freedom: 3861
Residual degrees of freedom: 3854
Null deviance: 1277.1043650451534
Residual deviance: 843.7638106176887
AIC: 5103.517182973992

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.355537 0.011149 0.358856 0.365665 0.336852 0.354934 0.361377
1 mean_residual_deviance 0.218658 0.018164 0.232228 0.226281 0.191559 0.208852 0.234372
2 mse 0.218658 0.018164 0.232228 0.226281 0.191559 0.208852 0.234372
3 null_deviance 255.420870 18.944805 270.143950 279.337650 235.192600 252.053120 240.377060
4 r2 0.339152 0.024681 0.332046 0.342915 0.367003 0.352354 0.301443
5 residual_deviance 168.744060 13.997013 180.441440 183.513900 148.841520 163.113280 167.810170
6 rmse 0.467278 0.019686 0.481901 0.475690 0.437675 0.457003 0.484120
7 rmsle 0.119307 0.004619 0.122457 0.121297 0.111312 0.119422 0.122048
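
Beyond max_models, max_runtime_secs, and seed, a few other options are commonly tuned. A sketch of a more customized run (nfolds, sort_metric, and exclude_algos are standard H2OAutoML arguments; the values below are only illustrative):

# A more customized AutoML run (illustrative settings)
aml_custom = H2OAutoML(
    max_models = 10,
    max_runtime_secs = 100,
    seed = 1,
    nfolds = 5,                        # folds for internal cross-validation
    sort_metric = "RMSE",              # metric used to rank the leaderboard
    exclude_algos = ["DeepLearning"],  # skip selected algorithms
)
# aml_custom.train(x = x, y = y, training_frame = train_h, leaderboard_frame = test_h)
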

# AutoML Leaderboard
lb = aml.leaderboard
print(lb)
model_id                                                  rmse      mse       mae       rmsle     mean_residual_deviance
StackedEnsemble_AllModels_1_AutoML_1_20220722_170939      0.470640  0.221502  0.353943  0.120368  0.221502
GBM_2_AutoML_1_20220722_170939                            0.471849  0.222642  0.355941  0.120740  0.222642
GBM_5_AutoML_1_20220722_170939                            0.471910  0.222699  0.355820  0.120501  0.222699
StackedEnsemble_BestOfFamily_1_AutoML_1_20220722_170939   0.473798  0.224484  0.357598  0.121491  0.224484
GBM_3_AutoML_1_20220722_170939                            0.474585  0.225230  0.359309  0.121264  0.225230
GBM_1_AutoML_1_20220722_170939                            0.478149  0.228626  0.362211  0.122278  0.228626
GBM_4_AutoML_1_20220722_170939                            0.479072  0.229510  0.362916  0.122456  0.229510
GBM_grid_1_AutoML_1_20220722_170939_model_1               0.480010  0.230410  0.363810  0.122671  0.230410
XRT_1_AutoML_1_20220722_170939                            0.491218  0.241295  0.375089  0.125318  0.241295
DRF_1_AutoML_1_20220722_170939                            0.502240  0.252245  0.382606  0.128470  0.252245

We see that the two Stacked Ensembles rank at or near the top of the leaderboard; Stacked Ensembles often outperform a single model. The out-of-sample (test) MSE of the leading model is given by

aml.leaderboard['mse'][0,0]
0.22150159010610537

The in-sample performance can be evaluated by

aml.leader
Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_1_AutoML_1_20220722_170939

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 0.15027004489538795
RMSE: 0.38764680431468534
MAE: 0.2955371675604367
RMSLE: 0.09990016533775842
R^2: 0.5455206039510314
Mean Residual Deviance: 0.15027004489538795
Null degrees of freedom: 3861
Residual degrees of freedom: 3854
Null deviance: 1276.9399854672324
Residual deviance: 580.3429133859883
AIC: 3658.1503537307485

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.2185070080606413
RMSE: 0.46744733185744175
MAE: 0.35554501482828577
RMSLE: 0.11935774800872184
R^2: 0.3391435190891413
Mean Residual Deviance: 0.2185070080606413
Null degrees of freedom: 3861
Residual degrees of freedom: 3853
Null deviance: 1277.1043650451534
Residual deviance: 843.8740651301968
AIC: 5106.021797066896

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.355575 0.011150 0.358694 0.365597 0.336775 0.355430 0.361377
1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
2 mse 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
3 null_deviance 255.420870 18.944805 270.143950 279.337650 235.192600 252.053120 240.377060
4 r2 0.339071 0.024654 0.332108 0.343113 0.367360 0.351331 0.301443
5 residual_deviance 168.764340 13.983558 180.424550 183.458680 148.757460 163.370830 167.810170
6 rmse 0.467306 0.019674 0.481878 0.475618 0.437551 0.457364 0.484120
7 rmsle 0.119310 0.004632 0.122437 0.121279 0.111271 0.119514 0.122048

This is in line with our previous results. To understand how the ensemble works, let’s take a peek inside the Stacked Ensemble “All Models” model. The “All Models” ensemble is an ensemble of all of the individual models in the AutoML run. This is often the top performing model on the leaderboard.

# Extract the model id of the "All Models" Stacked Ensemble from the leaderboard
model_ids = h2o.as_list(aml.leaderboard['model_id'][0], use_pandas=True)
model = model_ids[model_ids['model_id'].str.contains("StackedEnsemble_AllModels")].values.tolist()
model_id = model[0][0]
model_id
'StackedEnsemble_AllModels_1_AutoML_1_20220722_170939'
se = h2o.get_model(model_id)
se
Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_1_AutoML_1_20220722_170939

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 0.15027004489538795
RMSE: 0.38764680431468534
MAE: 0.2955371675604367
RMSLE: 0.09990016533775842
R^2: 0.5455206039510314
Mean Residual Deviance: 0.15027004489538795
Null degrees of freedom: 3861
Residual degrees of freedom: 3854
Null deviance: 1276.9399854672324
Residual deviance: 580.3429133859883
AIC: 3658.1503537307485

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.2185070080606413
RMSE: 0.46744733185744175
MAE: 0.35554501482828577
RMSLE: 0.11935774800872184
R^2: 0.3391435190891413
Mean Residual Deviance: 0.2185070080606413
Null degrees of freedom: 3861
Residual degrees of freedom: 3853
Null deviance: 1277.1043650451534
Residual deviance: 843.8740651301968
AIC: 5106.021797066896

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.355575 0.011150 0.358694 0.365597 0.336775 0.355430 0.361377
1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
2 mse 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
3 null_deviance 255.420870 18.944805 270.143950 279.337650 235.192600 252.053120 240.377060
4 r2 0.339071 0.024654 0.332108 0.343113 0.367360 0.351331 0.301443
5 residual_deviance 168.764340 13.983558 180.424550 183.458680 148.757460 163.370830 167.810170
6 rmse 0.467306 0.019674 0.481878 0.475618 0.437551 0.457364 0.484120
7 rmsle 0.119310 0.004632 0.122437 0.121279 0.111271 0.119514 0.122048

# Get the Stacked Ensemble metalearner model
metalearner = se.metalearner()
metalearner
Model Details
=============
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  metalearner_AUTO_StackedEnsemble_AllModels_1_AutoML_1_20220722_170939


GLM Model: summary
    family:                       gaussian
    link:                         identity
    regularization:               Elastic Net (alpha = 0.5, lambda = 0.002551)
    lambda_search:                nlambda = 100, lambda.max = 0.2219, lambda.min = 0.002551, lambda....
    number_of_predictors_total:   10
    number_of_active_predictors:  7
    number_of_iterations:         49
    training_frame:               levelone_training_StackedEnsemble_AllModels_1_AutoML_1_20220722_17...
ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 0.21727419974117612
RMSE: 0.4661268065035266
MAE: 0.35448759076481845
RMSLE: 0.11902010738109643
R^2: 0.3428720464937892
Mean Residual Deviance: 0.21727419974117612
Null degrees of freedom: 3861
Residual degrees of freedom: 3854
Null deviance: 1276.9399854672324
Residual deviance: 839.1129594004221
AIC: 5082.170839083037

ModelMetricsRegressionGLM: glm
** Reported on cross-validation data. **

MSE: 0.2185070080606413
RMSE: 0.46744733185744175
MAE: 0.35554501482828577
RMSLE: 0.11935774800872184
R^2: 0.3391435190891413
Mean Residual Deviance: 0.2185070080606413
Null degrees of freedom: 3861
Residual degrees of freedom: 3853
Null deviance: 1277.1043650451534
Residual deviance: 843.8740651301968
AIC: 5106.021797066896

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.355575 0.011150 0.358694 0.365597 0.336775 0.355430 0.361377
1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
2 mse 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
3 null_deviance 255.420870 18.944805 270.143950 279.337650 235.192600 252.053120 240.377060
4 r2 0.339071 0.024654 0.332108 0.343113 0.367360 0.351331 0.301443
5 residual_deviance 168.764340 13.983558 180.424550 183.458680 148.757460 163.370830 167.810170
6 rmse 0.467306 0.019674 0.481878 0.475618 0.437551 0.457364 0.484120
7 rmsle 0.119310 0.004632 0.122437 0.121279 0.111271 0.119514 0.122048
Scoring History: 
    timestamp            duration   iteration  lambda  predictors  deviance_train  deviance_xval  deviance_se  alpha  iterations  training_rmse  training_deviance  training_mae  training_r2
0   2022-07-22 17:10:18  0.000 sec  1          0.22    1           0.330642        0.330652       0.008216     0.5    NaN
1   2022-07-22 17:10:18  0.000 sec  2          0.20    5           0.316066        0.330609       0.008220     0.5    NaN
2   2022-07-22 17:10:18  0.000 sec  3          0.18    6           0.302049        0.330627       0.008216     0.5    NaN
3   2022-07-22 17:10:18  0.000 sec  4          0.17    6           0.289848        0.322157       0.008512     0.5    NaN
4   2022-07-22 17:10:18  0.016 sec  5          0.15    7           0.279287        0.307694       0.008407     0.5    5.0         0.528476       0.279287           0.407732      0.155318
5   2022-07-22 17:10:18  0.016 sec  6          0.14    7           0.270007        0.294801       0.008323     0.5    NaN
6   2022-07-22 17:10:18  0.016 sec  7          0.13    7           0.262086        0.283621       0.008253     0.5    NaN
7   2022-07-22 17:10:18  0.016 sec  8          0.12    7           0.255335        0.273875       0.008226     0.5    NaN
8   2022-07-22 17:10:18  0.016 sec  9          0.11    8           0.249586        0.265471       0.008206     0.5    NaN
9   2022-07-22 17:10:18  0.016 sec  10         0.096   8           0.244598        0.258304       0.008188     0.5    10.0        0.494568       0.244598           0.377588      0.260234
10  2022-07-22 17:10:18  0.032 sec  11         0.088   8           0.240460        0.252204       0.008175     0.5    NaN
11  2022-07-22 17:10:18  0.032 sec  12         0.08    8           0.236926        0.246950       0.008189     0.5    NaN
12  2022-07-22 17:10:18  0.032 sec  13         0.073   8           0.233943        0.242530       0.008194     0.5    NaN
13  2022-07-22 17:10:18  0.032 sec  14         0.066   8           0.231423        0.238774       0.008196     0.5    NaN
14  2022-07-22 17:10:18  0.032 sec  15         0.06    8           0.229293        0.235608       0.008204     0.5    15.0        0.478846       0.229293           0.364087      0.306521
15  2022-07-22 17:10:18  0.032 sec  16         0.055   8           0.227495        0.232935       0.008210     0.5    NaN
16  2022-07-22 17:10:18  0.032 sec  17         0.05    8           0.226154        0.230682       0.008216     0.5    NaN
17  2022-07-22 17:10:18  0.047 sec  18         0.046   8           0.224642        0.228783       0.008220     0.5    NaN
18  2022-07-22 17:10:18  0.047 sec  19         0.042   8           0.223737        0.227337       0.008211     0.5    NaN
19  2022-07-22 17:10:18  0.047 sec  20         0.038   8           0.222642        0.225825       0.008238     0.5    20.0        0.471849       0.222642           0.358377      0.326639
See the whole table with table.as_data_frame()

Variable Importances: 
variable relative_importance scaled_importance percentage
0 GBM_5_AutoML_1_20220722_170939 0.156444 1.000000 0.459637
1 GBM_3_AutoML_1_20220722_170939 0.064177 0.410220 0.188552
2 GBM_1_AutoML_1_20220722_170939 0.045077 0.288132 0.132436
3 GBM_grid_1_AutoML_1_20220722_170939_model_1 0.042987 0.274777 0.126298
4 GBM_4_AutoML_1_20220722_170939 0.022251 0.142227 0.065373
5 DeepLearning_1_AutoML_1_20220722_170939 0.009429 0.060268 0.027701
6 DRF_1_AutoML_1_20220722_170939 0.000001 0.000007 0.000003
7 GBM_2_AutoML_1_20220722_170939 0.000000 0.000000 0.000000
8 XRT_1_AutoML_1_20220722_170939 0.000000 0.000000 0.000000
9 GLM_1_AutoML_1_20220722_170939 0.000000 0.000000 0.000000

Let's examine the variable importance of the metalearner (combiner) algorithm in the ensemble, which shows how much each base learner contributes to the ensemble. The AutoML Stacked Ensembles use the default metalearner algorithm, a GLM with non-negative weights, so the variable importance of the metalearner is simply the standardized coefficient magnitudes of the GLM.

metalearner.coef_norm()
{'Intercept': 2.969427791239395,
 'GBM_2_AutoML_1_20220722_170939': 0.0,
 'GBM_5_AutoML_1_20220722_170939': 0.1564442116943833,
 'GBM_3_AutoML_1_20220722_170939': 0.06417658529927933,
 'GBM_1_AutoML_1_20220722_170939': 0.04507655475235933,
 'GBM_4_AutoML_1_20220722_170939': 0.02225052417703666,
 'GBM_grid_1_AutoML_1_20220722_170939_model_1': 0.0429872925804153,
 'XRT_1_AutoML_1_20220722_170939': 0.0,
 'DRF_1_AutoML_1_20220722_170939': 1.1143189727895769e-06,
 'GLM_1_AutoML_1_20220722_170939': 0.0,
 'DeepLearning_1_AutoML_1_20220722_170939': 0.009428554928469175}
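
These standardized coefficients reproduce the Variable Importances table above: each base learner's scaled importance is its coefficient divided by the largest one, and its percentage is its share of the coefficient sum. A small sketch:

# Recompute scaled importances and percentages from the metalearner coefficients
coefs = metalearner.coef_norm()
base = {k: abs(v) for k, v in coefs.items() if k != 'Intercept'}
top, total = max(base.values()), sum(base.values())
for name, c in sorted(base.items(), key = lambda kv: -kv[1]):
    print(f"{name:45s} scaled = {c/top:.6f}  percentage = {c/total:.6f}")
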
metalearner.std_coef_plot()
(standardized coefficient magnitude plot for the metalearner GLM)

14.2. Generating Predictions Using Leader Model#

We can also generate predictions on a test sample using the leader model object.

pred = aml.predict(test_h)
pred.head()
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
predict
2.88073
3.32158
3.08133
2.73111
2.51044
3.13355
3.18375
3.74316
2.58496
3.3082

This allows us to estimate the out-of-sample (test) MSE and the standard error as well.

pred_2 = pred.as_data_frame()
pred_aml = pred_2.to_numpy()
Y_test = test_h['lwage'].as_data_frame().to_numpy()
import statsmodels.api as sm

# Squared prediction errors on the test sample
resid_basic = (Y_test - pred_aml)**2

# Regressing the squared errors on a constant returns their mean (the MSE)
# together with a standard error for that mean
MSE_aml_basic = sm.OLS( resid_basic , np.ones( resid_basic.shape[0] ) ).fit().summary2().tables[1].iloc[0, 0:2]
MSE_aml_basic
Coef.       0.221502
Std.Err.    0.012942
Name: const, dtype: float64

We observe both a lower MSE and a lower standard error compared to our previous results from the earlier notebook.
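
As a sanity check, the same two numbers can be computed directly: the coefficient from the constant-only regression is just the sample mean of the squared errors, and its standard error is the usual standard error of a mean. A quick sketch:

# MSE and its standard error computed directly with numpy
sq_err = (Y_test - pred_aml)**2
mse = sq_err.mean()
se = sq_err.std(ddof = 1) / np.sqrt(sq_err.shape[0])
print(mse, se)   # should match the OLS-on-a-constant output above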

14.2.1. By using model_performance()#

If needed, the standard model_performance() method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

perf = aml.leader.model_performance(test_h)
perf
ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 0.22150159010610537
RMSE: 0.4706395543365489
MAE: 0.353942656628514
RMSLE: 0.12036767274774818
R^2: 0.2835426043371053
Mean Residual Deviance: 0.22150159010610537
Null degrees of freedom: 1287
Residual degrees of freedom: 1280
Null deviance: 398.23902107893576
Residual deviance: 285.2940480566637
AIC: 1731.7504037354604
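
Since AutoML is meant to end in a deployable model, the leader can also be persisted and reloaded for later scoring. A minimal sketch using h2o.save_model and h2o.load_model (the path is only an example):

# Persist the leader model to disk and reload it for scoring (example path)
model_path = h2o.save_model(model = aml.leader, path = "./aml_leader", force = True)
reloaded = h2o.load_model(model_path)
# reloaded.predict(test_h) reproduces the predictions above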