14. AutoML for wage prediction#

14.1. Automatic Machine Learning with H2O AutoML using Wage Data from 2015#

We illustrate how to predict an outcome variable Y in a high-dimensional setting using H2O's AutoML framework, which covers the complete pipeline from the raw dataset to a deployable machine learning model. In the last few years, AutoML, or automated machine learning, has become widely popular in the data science community.

We can use AutoML as a benchmark and compare it to the methods from the previous notebook, where we applied one machine learning method after another.

# Import relevant packages
import os
import warnings
from urllib.request import urlopen

import numpy as np
import pandas as pd
import pyreadr

# pip install h2o
# load the H2O package
import h2o
from h2o.automl import H2OAutoML

warnings.filterwarnings('ignore')

# start h2o cluster
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: 4 hours 11 mins
H2O_cluster_timezone: America/Bogota
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.36.1.3
H2O_cluster_version_age: 26 days
H2O_cluster_name: H2O_from_python_User_fi8ht0
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 2.467 Gb
H2O_cluster_total_cores: 4
H2O_cluster_allowed_cores: 4
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.9.12 final
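
As an aside, h2o.init() connects to a cluster that is already running at the default address; otherwise it starts a new local one. The resource limits reflected in the cluster information above can also be set explicitly. A minimal sketch, assuming a fresh cluster is being launched (nthreads and max_mem_size are standard h2o.init arguments; the values below are only illustrative):

# Start a fresh local cluster with explicit resource limits (illustrative values);
# these settings only take effect when a new cluster is launched, not on reconnect
h2o.init(nthreads = 4, max_mem_size = "4G")
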
link="https://raw.githubusercontent.com/d2cml-ai/14.388_py/main/data/wage2015_subsample_inference.Rdata"
response = urlopen(link)
content = response.read()
fhandle = open( 'wage2015_subsample_inference.Rdata', 'wb')
fhandle.write(content)
fhandle.close()
result = pyreadr.read_r("wage2015_subsample_inference.Rdata")
os.remove("wage2015_subsample_inference.Rdata")

# Extract the data frame from the pyreadr result dictionary
data = result[ 'data' ]
n = data.shape[0]
type(data)
pandas.core.frame.DataFrame
# Import relevant packages for splitting data
import math

# Set the seed to make the results replicable (generating random numbers)
np.random.seed(0)

# Assign a random integer to each observation, then sort by it to shuffle the rows
data["random"] = np.random.randint(0, n, size=n)
data_2 = data.sort_values(by=['random'])
# Create training and testing sample 
train = data_2[ : math.floor(n*3/4)]    # training sample
test =  data_2[ math.floor(n*3/4) : ]   # testing sample
print(train.shape)
print(test.shape)
(3862, 21)
(1288, 21)
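
For reference, a minimal sketch of the same 75/25 split using scikit-learn (train_test_split shuffles internally, so the exact rows assigned to each sample will differ from the manual shuffle above):

# Alternative split with scikit-learn; rows differ from the manual shuffle above
from sklearn.model_selection import train_test_split

train_alt, test_alt = train_test_split(data, test_size=0.25, random_state=0, shuffle=True)
print(train_alt.shape, test_alt.shape)
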
# convert data as h2o type
train_h = h2o.H2OFrame(train)
test_h = h2o.H2OFrame(test)

# have a look at the data
train_h.describe()
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Rows:3862
Cols:21
(per-column summary statistics: type, min, mean, max, sigma, zeros, and missing counts for all 21 variables, followed by the first ten rows of the training frame)
# Define the outcome and the regressors
y = 'lwage'

data_columns = list(data)
# Columns that should not enter the regressor set
no_relev_col = ['wage', 'occ2', 'ind2', 'random', 'lwage']

# Keep every column except the excluded ones
x = [col for col in data_columns if col not in no_relev_col]
# Run AutoML for at most 10 base models and a maximal runtime of 100 seconds
aml = H2OAutoML(max_runtime_secs = 100, max_models = 10, seed = 1)
aml.train(x = x, y = y, training_frame = train_h, leaderboard_frame = test_h)
AutoML progress: |████████████████████████████████████████████████████████████████| (done) 100%
05:14:21.267: AutoML: XGBoost is not available; skipping it.
Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_1_AutoML_2_20220804_51421

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 0.14963411645818037
RMSE: 0.38682569260350375
MAE: 0.29493414661545647
RMSLE: 0.09962964024727533
R^2: 0.5474439172253015
Mean Residual Deviance: 0.14963411645818037
Null degrees of freedom: 3861
Residual degrees of freedom: 3855
Null deviance: 1276.9399854672324
Residual deviance: 577.8869577614926
AIC: 3639.772059449712

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.21847845950742845
RMSE: 0.4674167942077268
MAE: 0.35554195972164104
RMSLE: 0.11935352477830528
R^2: 0.3392298618411601
Mean Residual Deviance: 0.21847845950742845
Null degrees of freedom: 3861
Residual degrees of freedom: 3854
Null deviance: 1277.1043650451534
Residual deviance: 843.7638106176887
AIC: 5103.517182973992

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.355537 0.011149 0.358856 0.365665 0.336852 0.354934 0.361377
1 mean_residual_deviance 0.218658 0.018164 0.232228 0.226281 0.191559 0.208852 0.234372
2 mse 0.218658 0.018164 0.232228 0.226281 0.191559 0.208852 0.234372
3 null_deviance 255.420870 18.944805 270.143950 279.337650 235.192600 252.053120 240.377060
4 r2 0.339152 0.024681 0.332046 0.342915 0.367003 0.352354 0.301443
5 residual_deviance 168.744060 13.997013 180.441440 183.513900 148.841520 163.113280 167.810170
6 rmse 0.467278 0.019686 0.481901 0.475690 0.437675 0.457003 0.484120
7 rmsle 0.119307 0.004619 0.122457 0.121297 0.111312 0.119422 0.122048
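
Beyond max_models, max_runtime_secs, and seed, a few other options are commonly tuned. A sketch of a more customized run (nfolds, sort_metric, and exclude_algos are standard H2OAutoML arguments; the values below are only illustrative):

# A more customized AutoML run (illustrative settings)
aml_custom = H2OAutoML(
    max_models = 10,
    max_runtime_secs = 100,
    seed = 1,
    nfolds = 5,                        # folds for internal cross-validation
    sort_metric = "RMSE",              # metric used to rank the leaderboard
    exclude_algos = ["DeepLearning"],  # skip selected algorithms
)
# aml_custom.train(x = x, y = y, training_frame = train_h, leaderboard_frame = test_h)
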

# AutoML Leaderboard
lb = aml.leaderboard
print(lb)
model_id                                                  rmse      mse       mae       rmsle     mean_residual_deviance
StackedEnsemble_AllModels_1_AutoML_1_20220722_170939      0.470640  0.221502  0.353943  0.120368  0.221502
GBM_2_AutoML_1_20220722_170939                            0.471849  0.222642  0.355941  0.120740  0.222642
GBM_5_AutoML_1_20220722_170939                            0.471910  0.222699  0.355820  0.120501  0.222699
StackedEnsemble_BestOfFamily_1_AutoML_1_20220722_170939   0.473798  0.224484  0.357598  0.121491  0.224484
GBM_3_AutoML_1_20220722_170939                            0.474585  0.225230  0.359309  0.121264  0.225230
GBM_1_AutoML_1_20220722_170939                            0.478149  0.228626  0.362211  0.122278  0.228626
GBM_4_AutoML_1_20220722_170939                            0.479072  0.229510  0.362916  0.122456  0.229510
GBM_grid_1_AutoML_1_20220722_170939_model_1               0.480010  0.230410  0.363810  0.122671  0.230410
XRT_1_AutoML_1_20220722_170939                            0.491218  0.241295  0.375089  0.125318  0.241295
DRF_1_AutoML_1_20220722_170939                            0.502240  0.252245  0.382606  0.128470  0.252245

We see that the two Stacked Ensembles rank at or near the top of the leaderboard; Stacked Ensembles often outperform a single model. The out-of-sample (test) MSE of the leading model is given by

aml.leaderboard['mse'][0,0]
0.22150159010610537

The in-sample performance can be evaluated by

aml.leader
Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_1_AutoML_1_20220722_170939

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 0.15027004489538795
RMSE: 0.38764680431468534
MAE: 0.2955371675604367
RMSLE: 0.09990016533775842
R^2: 0.5455206039510314
Mean Residual Deviance: 0.15027004489538795
Null degrees of freedom: 3861
Residual degrees of freedom: 3854
Null deviance: 1276.9399854672324
Residual deviance: 580.3429133859883
AIC: 3658.1503537307485

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.2185070080606413
RMSE: 0.46744733185744175
MAE: 0.35554501482828577
RMSLE: 0.11935774800872184
R^2: 0.3391435190891413
Mean Residual Deviance: 0.2185070080606413
Null degrees of freedom: 3861
Residual degrees of freedom: 3853
Null deviance: 1277.1043650451534
Residual deviance: 843.8740651301968
AIC: 5106.021797066896

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.355575 0.011150 0.358694 0.365597 0.336775 0.355430 0.361377
1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
2 mse 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
3 null_deviance 255.420870 18.944805 270.143950 279.337650 235.192600 252.053120 240.377060
4 r2 0.339071 0.024654 0.332108 0.343113 0.367360 0.351331 0.301443
5 residual_deviance 168.764340 13.983558 180.424550 183.458680 148.757460 163.370830 167.810170
6 rmse 0.467306 0.019674 0.481878 0.475618 0.437551 0.457364 0.484120
7 rmsle 0.119310 0.004632 0.122437 0.121279 0.111271 0.119514 0.122048

This is in line with our previous results. To understand how the ensemble works, let’s take a peek inside the Stacked Ensemble “All Models” model. The “All Models” ensemble is an ensemble of all of the individual models in the AutoML run. This is often the top performing model on the leaderboard.

# Extract the model id of the "All Models" Stacked Ensemble from the leaderboard
model_ids = h2o.as_list(aml.leaderboard['model_id'][0], use_pandas=True)
model = model_ids[model_ids['model_id'].str.contains("StackedEnsemble_AllModels")].values.tolist()
model_id = model[0][0]
model_id
'StackedEnsemble_AllModels_1_AutoML_1_20220722_170939'
se = h2o.get_model(model_id)
se
Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_1_AutoML_1_20220722_170939

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 0.15027004489538795
RMSE: 0.38764680431468534
MAE: 0.2955371675604367
RMSLE: 0.09990016533775842
R^2: 0.5455206039510314
Mean Residual Deviance: 0.15027004489538795
Null degrees of freedom: 3861
Residual degrees of freedom: 3854
Null deviance: 1276.9399854672324
Residual deviance: 580.3429133859883
AIC: 3658.1503537307485

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.2185070080606413
RMSE: 0.46744733185744175
MAE: 0.35554501482828577
RMSLE: 0.11935774800872184
R^2: 0.3391435190891413
Mean Residual Deviance: 0.2185070080606413
Null degrees of freedom: 3861
Residual degrees of freedom: 3853
Null deviance: 1277.1043650451534
Residual deviance: 843.8740651301968
AIC: 5106.021797066896

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.355575 0.011150 0.358694 0.365597 0.336775 0.355430 0.361377
1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
2 mse 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
3 null_deviance 255.420870 18.944805 270.143950 279.337650 235.192600 252.053120 240.377060
4 r2 0.339071 0.024654 0.332108 0.343113 0.367360 0.351331 0.301443
5 residual_deviance 168.764340 13.983558 180.424550 183.458680 148.757460 163.370830 167.810170
6 rmse 0.467306 0.019674 0.481878 0.475618 0.437551 0.457364 0.484120
7 rmsle 0.119310 0.004632 0.122437 0.121279 0.111271 0.119514 0.122048

# Get the Stacked Ensemble metalearner model
metalearner = se.metalearner()
metalearner
Model Details
=============
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  metalearner_AUTO_StackedEnsemble_AllModels_1_AutoML_1_20220722_170939


GLM Model: summary
    family:                       gaussian
    link:                         identity
    regularization:               Elastic Net (alpha = 0.5, lambda = 0.002551)
    lambda_search:                nlambda = 100, lambda.max = 0.2219, lambda.min = 0.002551, lambda....
    number_of_predictors_total:   10
    number_of_active_predictors:  7
    number_of_iterations:         49
    training_frame:               levelone_training_StackedEnsemble_AllModels_1_AutoML_1_20220722_17...
ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 0.21727419974117612
RMSE: 0.4661268065035266
MAE: 0.35448759076481845
RMSLE: 0.11902010738109643
R^2: 0.3428720464937892
Mean Residual Deviance: 0.21727419974117612
Null degrees of freedom: 3861
Residual degrees of freedom: 3854
Null deviance: 1276.9399854672324
Residual deviance: 839.1129594004221
AIC: 5082.170839083037

ModelMetricsRegressionGLM: glm
** Reported on cross-validation data. **

MSE: 0.2185070080606413
RMSE: 0.46744733185744175
MAE: 0.35554501482828577
RMSLE: 0.11935774800872184
R^2: 0.3391435190891413
Mean Residual Deviance: 0.2185070080606413
Null degrees of freedom: 3861
Residual degrees of freedom: 3853
Null deviance: 1277.1043650451534
Residual deviance: 843.8740651301968
AIC: 5106.021797066896

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.355575 0.011150 0.358694 0.365597 0.336775 0.355430 0.361377
1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
2 mse 0.218685 0.018149 0.232207 0.226213 0.191451 0.209182 0.234372
3 null_deviance 255.420870 18.944805 270.143950 279.337650 235.192600 252.053120 240.377060
4 r2 0.339071 0.024654 0.332108 0.343113 0.367360 0.351331 0.301443
5 residual_deviance 168.764340 13.983558 180.424550 183.458680 148.757460 163.370830 167.810170
6 rmse 0.467306 0.019674 0.481878 0.475618 0.437551 0.457364 0.484120
7 rmsle 0.119310 0.004632 0.122437 0.121279 0.111271 0.119514 0.122048
Scoring History: 
    timestamp            duration   iteration  lambda  predictors  deviance_train  deviance_xval  deviance_se  alpha  iterations  training_rmse  training_deviance  training_mae  training_r2
0   2022-07-22 17:10:18  0.000 sec  1          0.22    1           0.330642        0.330652       0.008216     0.5    NaN
1   2022-07-22 17:10:18  0.000 sec  2          0.20    5           0.316066        0.330609       0.008220     0.5    NaN
2   2022-07-22 17:10:18  0.000 sec  3          0.18    6           0.302049        0.330627       0.008216     0.5    NaN
3   2022-07-22 17:10:18  0.000 sec  4          0.17    6           0.289848        0.322157       0.008512     0.5    NaN
4   2022-07-22 17:10:18  0.016 sec  5          0.15    7           0.279287        0.307694       0.008407     0.5    5.0         0.528476       0.279287           0.407732      0.155318
5   2022-07-22 17:10:18  0.016 sec  6          0.14    7           0.270007        0.294801       0.008323     0.5    NaN
6   2022-07-22 17:10:18  0.016 sec  7          0.13    7           0.262086        0.283621       0.008253     0.5    NaN
7   2022-07-22 17:10:18  0.016 sec  8          0.12    7           0.255335        0.273875       0.008226     0.5    NaN
8   2022-07-22 17:10:18  0.016 sec  9          0.11    8           0.249586        0.265471       0.008206     0.5    NaN
9   2022-07-22 17:10:18  0.016 sec  10         0.096   8           0.244598        0.258304       0.008188     0.5    10.0        0.494568       0.244598           0.377588      0.260234
10  2022-07-22 17:10:18  0.032 sec  11         0.088   8           0.240460        0.252204       0.008175     0.5    NaN
11  2022-07-22 17:10:18  0.032 sec  12         0.08    8           0.236926        0.246950       0.008189     0.5    NaN
12  2022-07-22 17:10:18  0.032 sec  13         0.073   8           0.233943        0.242530       0.008194     0.5    NaN
13  2022-07-22 17:10:18  0.032 sec  14         0.066   8           0.231423        0.238774       0.008196     0.5    NaN
14  2022-07-22 17:10:18  0.032 sec  15         0.06    8           0.229293        0.235608       0.008204     0.5    15.0        0.478846       0.229293           0.364087      0.306521
15  2022-07-22 17:10:18  0.032 sec  16         0.055   8           0.227495        0.232935       0.008210     0.5    NaN
16  2022-07-22 17:10:18  0.032 sec  17         0.05    8           0.226154        0.230682       0.008216     0.5    NaN
17  2022-07-22 17:10:18  0.047 sec  18         0.046   8           0.224642        0.228783       0.008220     0.5    NaN
18  2022-07-22 17:10:18  0.047 sec  19         0.042   8           0.223737        0.227337       0.008211     0.5    NaN
19  2022-07-22 17:10:18  0.047 sec  20         0.038   8           0.222642        0.225825       0.008238     0.5    20.0        0.471849       0.222642           0.358377      0.326639
See the whole table with table.as_data_frame()

Variable Importances: 
variable relative_importance scaled_importance percentage
0 GBM_5_AutoML_1_20220722_170939 0.156444 1.000000 0.459637
1 GBM_3_AutoML_1_20220722_170939 0.064177 0.410220 0.188552
2 GBM_1_AutoML_1_20220722_170939 0.045077 0.288132 0.132436
3 GBM_grid_1_AutoML_1_20220722_170939_model_1 0.042987 0.274777 0.126298
4 GBM_4_AutoML_1_20220722_170939 0.022251 0.142227 0.065373
5 DeepLearning_1_AutoML_1_20220722_170939 0.009429 0.060268 0.027701
6 DRF_1_AutoML_1_20220722_170939 0.000001 0.000007 0.000003
7 GBM_2_AutoML_1_20220722_170939 0.000000 0.000000 0.000000
8 XRT_1_AutoML_1_20220722_170939 0.000000 0.000000 0.000000
9 GLM_1_AutoML_1_20220722_170939 0.000000 0.000000 0.000000

Let's examine the variable importance of the metalearner (combiner) algorithm in the ensemble, which shows how much each base learner contributes to the ensemble. The AutoML Stacked Ensembles use the default metalearner algorithm, a GLM with non-negative weights, so the variable importance of the metalearner is simply the standardized coefficient magnitudes of the GLM.

metalearner.coef_norm()
{'Intercept': 2.969427791239395,
 'GBM_2_AutoML_1_20220722_170939': 0.0,
 'GBM_5_AutoML_1_20220722_170939': 0.1564442116943833,
 'GBM_3_AutoML_1_20220722_170939': 0.06417658529927933,
 'GBM_1_AutoML_1_20220722_170939': 0.04507655475235933,
 'GBM_4_AutoML_1_20220722_170939': 0.02225052417703666,
 'GBM_grid_1_AutoML_1_20220722_170939_model_1': 0.0429872925804153,
 'XRT_1_AutoML_1_20220722_170939': 0.0,
 'DRF_1_AutoML_1_20220722_170939': 1.1143189727895769e-06,
 'GLM_1_AutoML_1_20220722_170939': 0.0,
 'DeepLearning_1_AutoML_1_20220722_170939': 0.009428554928469175}
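
These standardized coefficients reproduce the Variable Importances table above: each base learner's scaled importance is its coefficient divided by the largest one, and its percentage is its share of the coefficient sum. A small sketch:

# Recompute scaled importances and percentages from the metalearner coefficients
coefs = metalearner.coef_norm()
base = {k: abs(v) for k, v in coefs.items() if k != 'Intercept'}
top, total = max(base.values()), sum(base.values())
for name, c in sorted(base.items(), key = lambda kv: -kv[1]):
    print(f"{name:45s} scaled = {c/top:.6f}  percentage = {c/total:.6f}")
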
metalearner.std_coef_plot()
(standardized coefficient magnitude plot for the metalearner GLM)

14.2. Generating Predictions Using Leader Model#

We can also generate predictions on a test sample using the leader model object.

pred = aml.predict(test_h)
pred.head()
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
predict
2.88073
3.32158
3.08133
2.73111
2.51044
3.13355
3.18375
3.74316
2.58496
3.3082

This allows us to estimate the out-of-sample (test) MSE and the standard error as well.

pred_2 = pred.as_data_frame()
pred_aml = pred_2.to_numpy()
Y_test = test_h['lwage'].as_data_frame().to_numpy()
import statsmodels.api as sm

# Squared prediction errors on the test sample
resid_basic = (Y_test - pred_aml)**2

# Regressing the squared errors on a constant returns their mean (the MSE)
# together with a standard error for that mean
MSE_aml_basic = sm.OLS( resid_basic , np.ones( resid_basic.shape[0] ) ).fit().summary2().tables[1].iloc[0, 0:2]
MSE_aml_basic
Coef.       0.221502
Std.Err.    0.012942
Name: const, dtype: float64

We observe both a lower MSE and a lower standard error compared to our previous results from the earlier notebook.
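
As a sanity check, the same two numbers can be computed directly: the coefficient from the constant-only regression is just the sample mean of the squared errors, and its standard error is the usual standard error of a mean. A quick sketch:

# MSE and its standard error computed directly with numpy
sq_err = (Y_test - pred_aml)**2
mse = sq_err.mean()
se = sq_err.std(ddof = 1) / np.sqrt(sq_err.shape[0])
print(mse, se)   # should match the OLS-on-a-constant output above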

14.2.1. By using model_performance()#

If needed, the standard model_performance() method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

perf = aml.leader.model_performance(test_h)
perf
ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 0.22150159010610537
RMSE: 0.4706395543365489
MAE: 0.353942656628514
RMSLE: 0.12036767274774818
R^2: 0.2835426043371053
Mean Residual Deviance: 0.22150159010610537
Null degrees of freedom: 1287
Residual degrees of freedom: 1280
Null deviance: 398.23902107893576
Residual deviance: 285.2940480566637
AIC: 1731.7504037354604
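
Since AutoML is meant to end in a deployable model, the leader can also be persisted and reloaded for later scoring. A minimal sketch using h2o.save_model and h2o.load_model (the path is only an example):

# Persist the leader model to disk and reload it for scoring (example path)
model_path = h2o.save_model(model = aml.leader, path = "./aml_leader", force = True)
reloaded = h2o.load_model(model_path)
# reloaded.predict(test_h) reproduces the predictions above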