{
"cells": [
{
"cell_type": "markdown",
"id": "76dc69f3-2413-4686-bfe9-cafc08f20c27",
"metadata": {},
"source": [
"# AutoML for wage prediction"
]
},
{
"cell_type": "markdown",
"id": "e8342cb7",
"metadata": {
"papermill": {
"duration": 0.013777,
"end_time": "2021-03-24T11:24:18.450894",
"exception": false,
"start_time": "2021-03-24T11:24:18.437117",
"status": "completed"
},
"tags": []
},
"source": [
"## Automatic Machine Learning with H2O AutoML using Wage Data from 2015"
]
},
{
"cell_type": "markdown",
"id": "1fbd05b7",
"metadata": {
"papermill": {
"duration": 0.014076,
"end_time": "2021-03-24T11:24:18.478815",
"exception": false,
"start_time": "2021-03-24T11:24:18.464739",
"status": "completed"
},
"tags": []
},
"source": [
"We illustrate how to predict an outcome variable Y in a high-dimensional setting using the AutoML package *H2O*, which covers the complete pipeline from the raw dataset to a deployable machine learning model. In the last few years, AutoML (automated machine learning) has become widely popular in the data science community."
]
},
{
"cell_type": "markdown",
"id": "5333433f",
"metadata": {
"papermill": {
"duration": 0.013915,
"end_time": "2021-03-24T11:24:18.508556",
"exception": false,
"start_time": "2021-03-24T11:24:18.494641",
"status": "completed"
},
"tags": []
},
"source": [
"We can use AutoML as a benchmark and compare it with the methods from the previous notebook, where we applied one machine learning method after another."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "dd30c6a6",
"metadata": {},
"outputs": [],
"source": [
"# Import relevant packages\n",
"import pandas as pd\n",
"import numpy as np\n",
"import pyreadr\n",
"import os\n",
"from urllib.request import urlopen\n",
"from sklearn import preprocessing\n",
"import patsy\n",
"from h2o.automl import H2OAutoML\n",
"\n",
"from numpy import loadtxt\n",
"from keras.models import Sequential\n",
"from keras.layers import Dense\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "fb1130e2-716b-49c7-b6f3-e7061fc1dd9e",
"metadata": {},
"outputs": [],
"source": [
"#pip install h2o"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "0e6d7d98",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking whether there is an H2O instance running at http://localhost:54321 . connected.\n"
]
},
{
"data": {
"text/plain": [
"-------------------------- -----------------------------\n",
"H2O_cluster_uptime: 4 hours 11 mins\n",
"H2O_cluster_timezone: America/Bogota\n",
"H2O_data_parsing_timezone: UTC\n",
"H2O_cluster_version: 3.36.1.3\n",
"H2O_cluster_version_age: 26 days\n",
"H2O_cluster_name: H2O_from_python_User_fi8ht0\n",
"H2O_cluster_total_nodes: 1\n",
"H2O_cluster_free_memory: 2.467 Gb\n",
"H2O_cluster_total_cores: 4\n",
"H2O_cluster_allowed_cores: 4\n",
"H2O_cluster_status: locked, healthy\n",
"H2O_connection_url: http://localhost:54321\n",
"H2O_connection_proxy: {\"http\": null, \"https\": null}\n",
"H2O_internal_security: False\n",
"Python_version: 3.9.12 final\n",
"-------------------------- -----------------------------"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# load the H2O package\n",
"import h2o\n",
"\n",
"# start h2o cluster\n",
"h2o.init()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "6de4a8bc-161e-4e0a-ae67-45bab3517933",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"link=\"https://raw.githubusercontent.com/d2cml-ai/14.388_py/main/data/wage2015_subsample_inference.Rdata\"\n",
"response = urlopen(link)\n",
"content = response.read()\n",
"fhandle = open( 'wage2015_subsample_inference.Rdata', 'wb')\n",
"fhandle.write(content)\n",
"fhandle.close()\n",
"result = pyreadr.read_r(\"wage2015_subsample_inference.Rdata\")\n",
"os.remove(\"wage2015_subsample_inference.Rdata\")\n",
"\n",
"# Extracting the data frame from rdata_read\n",
"data = result[ 'data' ]\n",
"n = data.shape[0]\n",
"type(data)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "305bb8e2",
"metadata": {},
"outputs": [],
"source": [
"# Import relevant packages for splitting data\n",
"import math\n",
"\n",
"# Set the seed to make the results replicable (generating random numbers)\n",
"np.random.seed(0)\n",
"\n",
"# Attach a random integer key to every row and sort by it to shuffle the data\n",
"random = np.random.randint(0, data.shape[0], size=math.floor(data.shape[0]))\n",
"data[\"random\"] = random\n",
"data_2 = data.sort_values(by=['random'])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "52dd607c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(3862, 21)\n",
"(1288, 21)\n"
]
}
],
"source": [
"# Create training and testing sample \n",
"train = data_2[ : math.floor(n*3/4)] # training sample\n",
"test = data_2[ math.floor(n*3/4) : ] # testing sample\n",
"print(train.shape)\n",
"print(test.shape)"
]
},
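{
"cell_type": "markdown",
"id": "b7e1a0c4",
"metadata": {},
"source": [
"The split above shuffles the rows by sorting on a random integer key that is drawn with replacement, so ties are possible. A minimal alternative sketch, assuming only NumPy and pandas, shuffles with a random permutation instead. The helper `train_test_split_frame` and the toy data frame are hypothetical and not part of the original analysis.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def train_test_split_frame(df, train_share=0.75, seed=0):\n",
"    # Shuffle row positions without replacement, then cut at the training share\n",
"    rng = np.random.default_rng(seed)\n",
"    shuffled = df.iloc[rng.permutation(len(df))]\n",
"    n_train = int(np.floor(len(df) * train_share))\n",
"    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]\n",
"\n",
"# Toy data frame standing in for the wage data (hypothetical values)\n",
"toy = pd.DataFrame({'lwage': np.arange(8, dtype=float), 'sex': [0, 1] * 4})\n",
"toy_train, toy_test = train_test_split_frame(toy)\n",
"print(toy_train.shape, toy_test.shape)\n",
"```\n",
"\n",
"Applied to the wage data, `train_test_split_frame(data)` would produce the same 3862 and 1288 row counts, although the rows assigned to each sample would differ from those used in this notebook."
]
},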
{
"cell_type": "code",
"execution_count": 15,
"id": "bed9f791",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking whether there is an H2O instance running at http://localhost:54321 . connected.\n"
]
},
{
"data": {
"text/plain": [
"-------------------------- -----------------------------\n",
"H2O_cluster_uptime: 4 hours 12 mins\n",
"H2O_cluster_timezone: America/Bogota\n",
"H2O_data_parsing_timezone: UTC\n",
"H2O_cluster_version: 3.36.1.3\n",
"H2O_cluster_version_age: 26 days\n",
"H2O_cluster_name: H2O_from_python_User_fi8ht0\n",
"H2O_cluster_total_nodes: 1\n",
"H2O_cluster_free_memory: 2.467 Gb\n",
"H2O_cluster_total_cores: 4\n",
"H2O_cluster_allowed_cores: 4\n",
"H2O_cluster_status: locked, healthy\n",
"H2O_connection_url: http://localhost:54321\n",
"H2O_connection_proxy: {\"http\": null, \"https\": null}\n",
"H2O_internal_security: False\n",
"Python_version: 3.9.12 final\n",
"-------------------------- -----------------------------"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# start h2o cluster\n",
"h2o.init()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "bb269c76",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%\n",
"Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%\n",
"Rows:3862\n",
"Cols:21\n",
"\n",
"\n"
]
},
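{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th></th><th>wage</th><th>lwage</th><th>sex</th><th>shs</th><th>hsg</th><th>scl</th><th>clg</th><th>ad</th><th>mw</th><th>so</th><th>we</th><th>ne</th><th>exp1</th><th>exp2</th><th>exp3</th><th>exp4</th><th>occ</th><th>occ2</th><th>ind</th><th>ind2</th><th>random</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>type</td><td>real</td><td>real</td><td>int</td><td>int</td><td>int</td><td>int</td><td>int</td><td>int</td><td>int</td><td>int</td><td>int</td><td>int</td><td>real</td><td>real</td><td>real</td><td>real</td><td>int</td><td>int</td><td>int</td><td>int</td><td>int</td></tr>\n",
"<tr><td>mins</td><td>3.021978021978022</td><td>1.1059115911497213</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>10.0</td><td>1.0</td><td>370.0</td><td>2.0</td><td>0.0</td></tr>\n",
"<tr><td>mean</td><td>23.465417731467202</td><td>2.969427791239379</td><td>0.446918694976696</td><td>0.023562920766442258</td><td>0.24702226825479026</td><td>0.2780942516830658</td><td>0.3125323666494045</td><td>0.13878819264629724</td><td>0.2553081305023304</td><td>0.29829104091144487</td><td>0.21569135163127914</td><td>0.23070947695494562</td><td>13.672190574831728</td><td>2.9923032107716137</td><td>8.15417316157433</td><td>24.849760334671146</td><td>5243.418436043504</td><td>11.69135163127914</td><td>6667.996116002073</td><td>13.333764888658706</td><td>1914.2263076126374</td></tr>\n",
"<tr><td>maxs</td><td>528.845673076923</td><td>6.270696655981913</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>47.0</td><td>22.09</td><td>103.823</td><td>487.9681</td><td>100000.0</td><td>22.0</td><td>100000.0</td><td>22.0</td><td>3825.0</td></tr>\n",
"<tr><td>sigma</td><td>21.430085743506766</td><td>0.5750893708933925</td><td>0.4972387709704293</td><td>0.15170256601034887</td><td>0.4313356487428773</td><td>0.44811810407097313</td><td>0.46358551980855045</td><td>0.3457701367976389</td><td>0.436090737870262</td><td>0.45756716236312467</td><td>0.4113543571812672</td><td>0.4213414081729523</td><td>10.598613687032655</td><td>3.987480265404191</td><td>14.424744487903597</td><td>53.27996156215637</td><td>11579.91114662104</td><td>6.9783416585657125</td><td>5588.264282354911</td><td>5.691380293915183</td><td>1104.7025392074318</td></tr>\n",
"<tr><td>zeros</td><td>0</td><td>0</td><td>2136</td><td>3771</td><td>2908</td><td>2788</td><td>2655</td><td>3326</td><td>2876</td><td>2710</td><td>3029</td><td>2971</td><td>48</td><td>48</td><td>48</td><td>48</td><td>0</td><td>0</td><td>0</td><td>0</td><td>3</td></tr>\n",
"<tr><td>missing</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><td>0</td><td>26.442307692307693</td><td>3.274965291519244</td><td>1.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>29.0</td><td>8.41</td><td>24.389</td><td>70.7281</td><td>340.0</td><td>1.0</td><td>8660.0</td><td>20.0</td><td>0.0</td></tr>\n",
"<tr><td>1</td><td>19.23076923076923</td><td>2.9565115604007097</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>33.5</td><td>11.2225</td><td>37.595375</td><td>125.94450625</td><td>9620.0</td><td>22.0</td><td>1870.0</td><td>5.0</td><td>0.0</td></tr>\n",
"<tr><td>2</td><td>48.07692307692308</td><td>3.872802292274865</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>2.0</td><td>0.04</td><td>0.008</td><td>0.0016</td><td>3060.0</td><td>10.0</td><td>8190.0</td><td>18.0</td><td>0.0</td></tr>\n",
"<tr><td>3</td><td>12.01923076923077</td><td>2.486507931154974</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>29.0</td><td>8.41</td><td>24.389</td><td>70.7281</td><td>6440.0</td><td>19.0</td><td>770.0</td><td>4.0</td><td>2.0</td></tr>\n",
"<tr><td>4</td><td>39.90384615384615</td><td>3.6864727140833713</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>12.0</td><td>1.44</td><td>1.728</td><td>2.0736</td><td>1820.0</td><td>5.0</td><td>7860.0</td><td>17.0</td><td>2.0</td></tr>\n",
"<tr><td>5</td><td>13.157894736842104</td><td>2.5770219386958058</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>11.0</td><td>1.21</td><td>1.331</td><td>1.4641</td><td>8810.0</td><td>21.0</td><td>3895.0</td><td>6.0</td><td>3.0</td></tr>\n",
"<tr><td>6</td><td>20.192307692307693</td><td>3.005301724570142</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>17.0</td><td>2.89</td><td>4.913</td><td>8.3521</td><td>7200.0</td><td>20.0</td><td>8770.0</td><td>21.0</td><td>4.0</td></tr>\n",
"<tr><td>7</td><td>12.01923076923077</td><td>2.486507931154974</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>7.0</td><td>0.49</td><td>0.343</td><td>0.2401</td><td>5610.0</td><td>17.0</td><td>4265.0</td><td>7.0</td><td>5.0</td></tr>\n",
"<tr><td>8</td><td>28.846153846153847</td><td>3.361976668508874</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>8.0</td><td>0.64</td><td>0.512</td><td>0.4096</td><td>5240.0</td><td>17.0</td><td>6970.0</td><td>12.0</td><td>7.0</td></tr>\n",
"<tr><td>9</td><td>34.13461538461539</td><td>3.530311983328089</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>38.0</td><td>14.44</td><td>54.872</td><td>208.5136</td><td>5550.0</td><td>17.0</td><td>6370.0</td><td>10.0</td><td>7.0</td></tr>\n",
"</tbody>\n",
"</table>"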
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# convert the pandas data frames to H2OFrames\n",
"train_h = h2o.H2OFrame(train)\n",
"test_h = h2o.H2OFrame(test)\n",
"\n",
"# have a look at the data\n",
"train_h.describe()\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "1f0bfa3f",
"metadata": {},
"outputs": [],
"source": [
"# define the variables\n",
"y = 'lwage'\n",
"\n",
"data_columns = list(data)\n",
"no_relev_col = ['wage','occ2', 'ind2', 'random', 'lwage']\n",
"\n",
"# keep as predictors all columns that are not listed in no_relev_col\n",
"x = [col for col in data_columns if col not in no_relev_col]\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "57c48dce",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AutoML progress: |█\n",
"05:14:21.267: AutoML: XGBoost is not available; skipping it.\n",
"\n",
"██████████████████████████████████████████████████████████████| (done) 100%\n",
"Model Details\n",
"=============\n",
"H2OStackedEnsembleEstimator : Stacked Ensemble\n",
"Model Key: StackedEnsemble_AllModels_1_AutoML_2_20220804_51421\n",
"\n",
"No model summary for this model\n",
"\n",
"ModelMetricsRegressionGLM: stackedensemble\n",
"** Reported on train data. **\n",
"\n",
"MSE: 0.14963411645818037\n",
"RMSE: 0.38682569260350375\n",
"MAE: 0.29493414661545647\n",
"RMSLE: 0.09962964024727533\n",
"R^2: 0.5474439172253015\n",
"Mean Residual Deviance: 0.14963411645818037\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3855\n",
"Null deviance: 1276.9399854672324\n",
"Residual deviance: 577.8869577614926\n",
"AIC: 3639.772059449712\n",
"\n",
"ModelMetricsRegressionGLM: stackedensemble\n",
"** Reported on cross-validation data. **\n",
"\n",
"MSE: 0.21847845950742845\n",
"RMSE: 0.4674167942077268\n",
"MAE: 0.35554195972164104\n",
"RMSLE: 0.11935352477830528\n",
"R^2: 0.3392298618411601\n",
"Mean Residual Deviance: 0.21847845950742845\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3854\n",
"Null deviance: 1277.1043650451534\n",
"Residual deviance: 843.7638106176887\n",
"AIC: 5103.517182973992\n",
"\n",
"Cross-Validation Metrics Summary: \n"
]
},
{
"data": {
"text/plain": [
" mean sd cv_1_valid cv_2_valid \\\n",
"0 mae 0.355537 0.011149 0.358856 0.365665 \n",
"1 mean_residual_deviance 0.218658 0.018164 0.232228 0.226281 \n",
"2 mse 0.218658 0.018164 0.232228 0.226281 \n",
"3 null_deviance 255.420870 18.944805 270.143950 279.337650 \n",
"4 r2 0.339152 0.024681 0.332046 0.342915 \n",
"5 residual_deviance 168.744060 13.997013 180.441440 183.513900 \n",
"6 rmse 0.467278 0.019686 0.481901 0.475690 \n",
"7 rmsle 0.119307 0.004619 0.122457 0.121297 \n",
"\n",
" cv_3_valid cv_4_valid cv_5_valid \n",
"0 0.336852 0.354934 0.361377 \n",
"1 0.191559 0.208852 0.234372 \n",
"2 0.191559 0.208852 0.234372 \n",
"3 235.192600 252.053120 240.377060 \n",
"4 0.367003 0.352354 0.301443 \n",
"5 148.841520 163.113280 167.810170 \n",
"6 0.437675 0.457003 0.484120 \n",
"7 0.111312 0.119422 0.122048 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# run AutoML for at most 10 base models and a maximal runtime of 100 seconds\n",
"aml = H2OAutoML(max_runtime_secs = 100, max_models = 10, seed = 1)\n",
"aml.train(x = x, y = y, training_frame = train_h, leaderboard_frame = test_h)\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "88df97d7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>model_id</th><th>rmse</th><th>mse</th><th>mae</th><th>rmsle</th><th>mean_residual_deviance</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>StackedEnsemble_AllModels_1_AutoML_1_20220722_170939</td><td>0.47064</td><td>0.221502</td><td>0.353943</td><td>0.120368</td><td>0.221502</td></tr>\n",
"<tr><td>GBM_2_AutoML_1_20220722_170939</td><td>0.471849</td><td>0.222642</td><td>0.355941</td><td>0.12074</td><td>0.222642</td></tr>\n",
"<tr><td>GBM_5_AutoML_1_20220722_170939</td><td>0.47191</td><td>0.222699</td><td>0.35582</td><td>0.120501</td><td>0.222699</td></tr>\n",
"<tr><td>StackedEnsemble_BestOfFamily_1_AutoML_1_20220722_170939</td><td>0.473798</td><td>0.224484</td><td>0.357598</td><td>0.121491</td><td>0.224484</td></tr>\n",
"<tr><td>GBM_3_AutoML_1_20220722_170939</td><td>0.474585</td><td>0.22523</td><td>0.359309</td><td>0.121264</td><td>0.22523</td></tr>\n",
"<tr><td>GBM_1_AutoML_1_20220722_170939</td><td>0.478149</td><td>0.228626</td><td>0.362211</td><td>0.122278</td><td>0.228626</td></tr>\n",
"<tr><td>GBM_4_AutoML_1_20220722_170939</td><td>0.479072</td><td>0.22951</td><td>0.362916</td><td>0.122456</td><td>0.22951</td></tr>\n",
"<tr><td>GBM_grid_1_AutoML_1_20220722_170939_model_1</td><td>0.48001</td><td>0.23041</td><td>0.36381</td><td>0.122671</td><td>0.23041</td></tr>\n",
"<tr><td>XRT_1_AutoML_1_20220722_170939</td><td>0.491218</td><td>0.241295</td><td>0.375089</td><td>0.125318</td><td>0.241295</td></tr>\n",
"<tr><td>DRF_1_AutoML_1_20220722_170939</td><td>0.50224</td><td>0.252245</td><td>0.382606</td><td>0.12847</td><td>0.252245</td></tr>\n",
"</tbody>\n",
"</table>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"# AutoML Leaderboard\n",
"lb = aml.leaderboard\n",
"print(lb)"
]
},
{
"cell_type": "markdown",
"id": "d295cdc1",
"metadata": {},
"source": [
"The Stacked Ensemble \"All Models\" is at the top of the leaderboard, and the \"Best of Family\" ensemble ranks fourth, just behind two GBMs. Stacked Ensembles often outperform a single model. The out-of-sample (test) MSE of the leading model is given by"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "800f6aba",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.22150159010610537"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aml.leaderboard['mse'][0,0]"
]
},
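{
"cell_type": "markdown",
"id": "c3d9f2a6",
"metadata": {},
"source": [
"As a rough sketch of what this number measures, the MSE and the corresponding $R^2$ can be computed directly from observed and predicted values. The function `mse_and_r2` and the toy numbers below are illustrative assumptions, not output from the H2O run.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def mse_and_r2(y_true, y_pred):\n",
"    y_true = np.asarray(y_true, dtype=float)\n",
"    y_pred = np.asarray(y_pred, dtype=float)\n",
"    # Mean squared prediction error\n",
"    mse = np.mean((y_true - y_pred) ** 2)\n",
"    # R^2 compares the MSE with the variance of the outcome,\n",
"    # i.e. with the MSE of always predicting the sample mean\n",
"    r2 = 1.0 - mse / np.var(y_true)\n",
"    return mse, r2\n",
"\n",
"# Toy values (hypothetical)\n",
"y_true = [2.0, 3.0, 4.0, 5.0]\n",
"y_pred = [2.5, 2.5, 4.5, 4.5]\n",
"mse, r2 = mse_and_r2(y_true, y_pred)\n",
"print(mse, r2)\n",
"```\n",
"\n",
"In this notebook, `y_true` would correspond to the `lwage` column of the test sample and `y_pred` to the leader's predictions on that sample."
]
},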
{
"cell_type": "markdown",
"id": "6c9c7ec9",
"metadata": {},
"source": [
"The in-sample performance can be evaluated by"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "7e47ac17",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model Details\n",
"=============\n",
"H2OStackedEnsembleEstimator : Stacked Ensemble\n",
"Model Key: StackedEnsemble_AllModels_1_AutoML_1_20220722_170939\n",
"\n",
"No model summary for this model\n",
"\n",
"ModelMetricsRegressionGLM: stackedensemble\n",
"** Reported on train data. **\n",
"\n",
"MSE: 0.15027004489538795\n",
"RMSE: 0.38764680431468534\n",
"MAE: 0.2955371675604367\n",
"RMSLE: 0.09990016533775842\n",
"R^2: 0.5455206039510314\n",
"Mean Residual Deviance: 0.15027004489538795\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3854\n",
"Null deviance: 1276.9399854672324\n",
"Residual deviance: 580.3429133859883\n",
"AIC: 3658.1503537307485\n",
"\n",
"ModelMetricsRegressionGLM: stackedensemble\n",
"** Reported on cross-validation data. **\n",
"\n",
"MSE: 0.2185070080606413\n",
"RMSE: 0.46744733185744175\n",
"MAE: 0.35554501482828577\n",
"RMSLE: 0.11935774800872184\n",
"R^2: 0.3391435190891413\n",
"Mean Residual Deviance: 0.2185070080606413\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3853\n",
"Null deviance: 1277.1043650451534\n",
"Residual deviance: 843.8740651301968\n",
"AIC: 5106.021797066896\n",
"\n",
"Cross-Validation Metrics Summary: \n"
]
},
{
"data": {
"text/plain": [
" mean sd cv_1_valid cv_2_valid \\\n",
"0 mae 0.355575 0.011150 0.358694 0.365597 \n",
"1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 \n",
"2 mse 0.218685 0.018149 0.232207 0.226213 \n",
"3 null_deviance 255.420870 18.944805 270.143950 279.337650 \n",
"4 r2 0.339071 0.024654 0.332108 0.343113 \n",
"5 residual_deviance 168.764340 13.983558 180.424550 183.458680 \n",
"6 rmse 0.467306 0.019674 0.481878 0.475618 \n",
"7 rmsle 0.119310 0.004632 0.122437 0.121279 \n",
"\n",
" cv_3_valid cv_4_valid cv_5_valid \n",
"0 0.336775 0.355430 0.361377 \n",
"1 0.191451 0.209182 0.234372 \n",
"2 0.191451 0.209182 0.234372 \n",
"3 235.192600 252.053120 240.377060 \n",
"4 0.367360 0.351331 0.301443 \n",
"5 148.757460 163.370830 167.810170 \n",
"6 0.437551 0.457364 0.484120 \n",
"7 0.111271 0.119514 0.122048 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aml.leader"
]
},
{
"cell_type": "markdown",
"id": "ca0b36af",
"metadata": {
"papermill": {
"duration": 0.027663,
"end_time": "2021-03-24T11:25:13.491063",
"exception": false,
"start_time": "2021-03-24T11:25:13.463400",
"status": "completed"
},
"tags": []
},
"source": [
"This is in line with our previous results. To understand how the ensemble works, let's take a peek inside the Stacked Ensemble \"All Models\" model. The \"All Models\" ensemble is an ensemble of all of the individual models in the AutoML run. This is often the top performing model on the leaderboard."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "95549783",
"metadata": {},
"outputs": [],
"source": [
"model_ids = h2o.as_list(aml.leaderboard['model_id'][0], use_pandas=True)\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "c2236931",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'StackedEnsemble_AllModels_1_AutoML_1_20220722_170939'"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = model_ids[model_ids['model_id'].str.contains(\"StackedEnsemble_AllModels\")].values.tolist()\n",
"model_id = model[0][0]\n",
"model_id"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "615b33c4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model Details\n",
"=============\n",
"H2OStackedEnsembleEstimator : Stacked Ensemble\n",
"Model Key: StackedEnsemble_AllModels_1_AutoML_1_20220722_170939\n",
"\n",
"No model summary for this model\n",
"\n",
"ModelMetricsRegressionGLM: stackedensemble\n",
"** Reported on train data. **\n",
"\n",
"MSE: 0.15027004489538795\n",
"RMSE: 0.38764680431468534\n",
"MAE: 0.2955371675604367\n",
"RMSLE: 0.09990016533775842\n",
"R^2: 0.5455206039510314\n",
"Mean Residual Deviance: 0.15027004489538795\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3854\n",
"Null deviance: 1276.9399854672324\n",
"Residual deviance: 580.3429133859883\n",
"AIC: 3658.1503537307485\n",
"\n",
"ModelMetricsRegressionGLM: stackedensemble\n",
"** Reported on cross-validation data. **\n",
"\n",
"MSE: 0.2185070080606413\n",
"RMSE: 0.46744733185744175\n",
"MAE: 0.35554501482828577\n",
"RMSLE: 0.11935774800872184\n",
"R^2: 0.3391435190891413\n",
"Mean Residual Deviance: 0.2185070080606413\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3853\n",
"Null deviance: 1277.1043650451534\n",
"Residual deviance: 843.8740651301968\n",
"AIC: 5106.021797066896\n",
"\n",
"Cross-Validation Metrics Summary: \n"
]
},
{
"data": {
"text/plain": [
" mean sd cv_1_valid cv_2_valid \\\n",
"0 mae 0.355575 0.011150 0.358694 0.365597 \n",
"1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 \n",
"2 mse 0.218685 0.018149 0.232207 0.226213 \n",
"3 null_deviance 255.420870 18.944805 270.143950 279.337650 \n",
"4 r2 0.339071 0.024654 0.332108 0.343113 \n",
"5 residual_deviance 168.764340 13.983558 180.424550 183.458680 \n",
"6 rmse 0.467306 0.019674 0.481878 0.475618 \n",
"7 rmsle 0.119310 0.004632 0.122437 0.121279 \n",
"\n",
" cv_3_valid cv_4_valid cv_5_valid \n",
"0 0.336775 0.355430 0.361377 \n",
"1 0.191451 0.209182 0.234372 \n",
"2 0.191451 0.209182 0.234372 \n",
"3 235.192600 252.053120 240.377060 \n",
"4 0.367360 0.351331 0.301443 \n",
"5 148.757460 163.370830 167.810170 \n",
"6 0.437551 0.457364 0.484120 \n",
"7 0.111271 0.119514 0.122048 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"se = h2o.get_model(model_id)\n",
"se"
]
},
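{
"cell_type": "markdown",
"id": "e5a8b1d7",
"metadata": {},
"source": [
"To make the stacking idea concrete, here is a minimal sketch under simplifying assumptions: synthetic data, two least-squares base learners that each see a different subset of columns, and an unregularized linear metalearner fitted on 5-fold out-of-fold predictions. All names and data are hypothetical. The metalearner in the H2O run is a regularized GLM, so this illustrates the principle rather than reproducing H2O's procedure.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# Synthetic regression data (hypothetical, not the wage data)\n",
"n = 400\n",
"X = rng.normal(size=(n, 3))\n",
"y = 1.0 + X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)\n",
"\n",
"def fit_ols(X, y):\n",
"    # Least-squares coefficients, with an intercept column prepended\n",
"    Z = np.column_stack([np.ones(len(X)), X])\n",
"    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)\n",
"    return beta\n",
"\n",
"def predict_ols(beta, X):\n",
"    Z = np.column_stack([np.ones(len(X)), X])\n",
"    return Z @ beta\n",
"\n",
"# Two deliberately different base learners: each uses only some columns\n",
"base_columns = [[0], [1, 2]]\n",
"\n",
"# Level-one data: out-of-fold predictions from 5-fold cross-validation\n",
"folds = np.array_split(rng.permutation(n), 5)\n",
"level_one = np.zeros((n, len(base_columns)))\n",
"for valid_idx in folds:\n",
"    train_idx = np.setdiff1d(np.arange(n), valid_idx)\n",
"    for j, cols in enumerate(base_columns):\n",
"        beta = fit_ols(X[train_idx][:, cols], y[train_idx])\n",
"        level_one[valid_idx, j] = predict_ols(beta, X[valid_idx][:, cols])\n",
"\n",
"# Metalearner: a linear model on the base-model predictions\n",
"meta_beta = fit_ols(level_one, y)\n",
"stacked_pred = predict_ols(meta_beta, level_one)\n",
"\n",
"mse_base = [np.mean((y - level_one[:, j]) ** 2) for j in range(len(base_columns))]\n",
"mse_stacked = np.mean((y - stacked_pred) ** 2)\n",
"print(mse_base, mse_stacked)\n",
"```\n",
"\n",
"Because the metalearner could always put all weight on a single base model, its fitted MSE on the level-one data is never worse than that of the best base learner. Out of sample this is not guaranteed, which is why the leaderboard is scored on a separate test frame."
]
},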
{
"cell_type": "code",
"execution_count": 31,
"id": "439e8999",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model Details\n",
"=============\n",
"H2OGeneralizedLinearEstimator : Generalized Linear Modeling\n",
"Model Key: metalearner_AUTO_StackedEnsemble_AllModels_1_AutoML_1_20220722_170939\n",
"\n",
"\n",
"GLM Model: summary\n"
]
},
{
"data": {
"text/plain": [
" family link regularization \\\n",
"0 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.002551 ) \n",
"\n",
" lambda_search \\\n",
"0 nlambda = 100, lambda.max = 0.2219, lambda.min = 0.002551, lambda.... \n",
"\n",
" number_of_predictors_total number_of_active_predictors \\\n",
"0 10 7 \n",
"\n",
" number_of_iterations \\\n",
"0 49 \n",
"\n",
" training_frame \n",
"0 levelone_training_StackedEnsemble_AllModels_1_AutoML_1_20220722_17... "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"ModelMetricsRegressionGLM: glm\n",
"** Reported on train data. **\n",
"\n",
"MSE: 0.21727419974117612\n",
"RMSE: 0.4661268065035266\n",
"MAE: 0.35448759076481845\n",
"RMSLE: 0.11902010738109643\n",
"R^2: 0.3428720464937892\n",
"Mean Residual Deviance: 0.21727419974117612\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3854\n",
"Null deviance: 1276.9399854672324\n",
"Residual deviance: 839.1129594004221\n",
"AIC: 5082.170839083037\n",
"\n",
"ModelMetricsRegressionGLM: glm\n",
"** Reported on cross-validation data. **\n",
"\n",
"MSE: 0.2185070080606413\n",
"RMSE: 0.46744733185744175\n",
"MAE: 0.35554501482828577\n",
"RMSLE: 0.11935774800872184\n",
"R^2: 0.3391435190891413\n",
"Mean Residual Deviance: 0.2185070080606413\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3853\n",
"Null deviance: 1277.1043650451534\n",
"Residual deviance: 843.8740651301968\n",
"AIC: 5106.021797066896\n",
"\n",
"Cross-Validation Metrics Summary: \n"
]
},
{
"data": {
"text/plain": [
" mean sd cv_1_valid cv_2_valid \\\n",
"0 mae 0.355575 0.011150 0.358694 0.365597 \n",
"1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 \n",
"2 mse 0.218685 0.018149 0.232207 0.226213 \n",
"3 null_deviance 255.420870 18.944805 270.143950 279.337650 \n",
"4 r2 0.339071 0.024654 0.332108 0.343113 \n",
"5 residual_deviance 168.764340 13.983558 180.424550 183.458680 \n",
"6 rmse 0.467306 0.019674 0.481878 0.475618 \n",
"7 rmsle 0.119310 0.004632 0.122437 0.121279 \n",
"\n",
" cv_3_valid cv_4_valid cv_5_valid \n",
"0 0.336775 0.355430 0.361377 \n",
"1 0.191451 0.209182 0.234372 \n",
"2 0.191451 0.209182 0.234372 \n",
"3 235.192600 252.053120 240.377060 \n",
"4 0.367360 0.351331 0.301443 \n",
"5 148.757460 163.370830 167.810170 \n",
"6 0.437551 0.457364 0.484120 \n",
"7 0.111271 0.119514 0.122048 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Scoring History: \n"
]
},
{
"data": {
"text/plain": [
" timestamp duration iteration lambda predictors \\\n",
"0 2022-07-22 17:10:18 0.000 sec 1 ,22E0 1 \n",
"1 2022-07-22 17:10:18 0.000 sec 2 ,2E0 5 \n",
"2 2022-07-22 17:10:18 0.000 sec 3 ,18E0 6 \n",
"3 2022-07-22 17:10:18 0.000 sec 4 ,17E0 6 \n",
"4 2022-07-22 17:10:18 0.016 sec 5 ,15E0 7 \n",
"5 2022-07-22 17:10:18 0.016 sec 6 ,14E0 7 \n",
"6 2022-07-22 17:10:18 0.016 sec 7 ,13E0 7 \n",
"7 2022-07-22 17:10:18 0.016 sec 8 ,12E0 7 \n",
"8 2022-07-22 17:10:18 0.016 sec 9 ,11E0 8 \n",
"9 2022-07-22 17:10:18 0.016 sec 10 ,96E-1 8 \n",
"10 2022-07-22 17:10:18 0.032 sec 11 ,88E-1 8 \n",
"11 2022-07-22 17:10:18 0.032 sec 12 ,8E-1 8 \n",
"12 2022-07-22 17:10:18 0.032 sec 13 ,73E-1 8 \n",
"13 2022-07-22 17:10:18 0.032 sec 14 ,66E-1 8 \n",
"14 2022-07-22 17:10:18 0.032 sec 15 ,6E-1 8 \n",
"15 2022-07-22 17:10:18 0.032 sec 16 ,55E-1 8 \n",
"16 2022-07-22 17:10:18 0.032 sec 17 ,5E-1 8 \n",
"17 2022-07-22 17:10:18 0.047 sec 18 ,46E-1 8 \n",
"18 2022-07-22 17:10:18 0.047 sec 19 ,42E-1 8 \n",
"19 2022-07-22 17:10:18 0.047 sec 20 ,38E-1 8 \n",
"\n",
" deviance_train deviance_xval deviance_se alpha iterations \\\n",
"0 0.330642 0.330652 0.008216 0.5 NaN \n",
"1 0.316066 0.330609 0.008220 0.5 NaN \n",
"2 0.302049 0.330627 0.008216 0.5 NaN \n",
"3 0.289848 0.322157 0.008512 0.5 NaN \n",
"4 0.279287 0.307694 0.008407 0.5 5.0 \n",
"5 0.270007 0.294801 0.008323 0.5 NaN \n",
"6 0.262086 0.283621 0.008253 0.5 NaN \n",
"7 0.255335 0.273875 0.008226 0.5 NaN \n",
"8 0.249586 0.265471 0.008206 0.5 NaN \n",
"9 0.244598 0.258304 0.008188 0.5 10.0 \n",
"10 0.240460 0.252204 0.008175 0.5 NaN \n",
"11 0.236926 0.246950 0.008189 0.5 NaN \n",
"12 0.233943 0.242530 0.008194 0.5 NaN \n",
"13 0.231423 0.238774 0.008196 0.5 NaN \n",
"14 0.229293 0.235608 0.008204 0.5 15.0 \n",
"15 0.227495 0.232935 0.008210 0.5 NaN \n",
"16 0.226154 0.230682 0.008216 0.5 NaN \n",
"17 0.224642 0.228783 0.008220 0.5 NaN \n",
"18 0.223737 0.227337 0.008211 0.5 NaN \n",
"19 0.222642 0.225825 0.008238 0.5 20.0 \n",
"\n",
" training_rmse training_deviance training_mae training_r2 \n",
"0 \n",
"1 \n",
"2 \n",
"3 \n",
"4 0.528476 0.279287 0.407732 0.155318 \n",
"5 \n",
"6 \n",
"7 \n",
"8 \n",
"9 0.494568 0.244598 0.377588 0.260234 \n",
"10 \n",
"11 \n",
"12 \n",
"13 \n",
"14 0.478846 0.229293 0.364087 0.306521 \n",
"15 \n",
"16 \n",
"17 \n",
"18 \n",
"19 0.471849 0.222642 0.358377 0.326639 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"See the whole table with table.as_data_frame()\n",
"\n",
"Variable Importances: \n"
]
},
{
"data": {
"text/plain": [
" variable relative_importance \\\n",
"0 GBM_5_AutoML_1_20220722_170939 0.156444 \n",
"1 GBM_3_AutoML_1_20220722_170939 0.064177 \n",
"2 GBM_1_AutoML_1_20220722_170939 0.045077 \n",
"3 GBM_grid_1_AutoML_1_20220722_170939_model_1 0.042987 \n",
"4 GBM_4_AutoML_1_20220722_170939 0.022251 \n",
"5 DeepLearning_1_AutoML_1_20220722_170939 0.009429 \n",
"6 DRF_1_AutoML_1_20220722_170939 0.000001 \n",
"7 GBM_2_AutoML_1_20220722_170939 0.000000 \n",
"8 XRT_1_AutoML_1_20220722_170939 0.000000 \n",
"9 GLM_1_AutoML_1_20220722_170939 0.000000 \n",
"\n",
" scaled_importance percentage \n",
"0 1.000000 0.459637 \n",
"1 0.410220 0.188552 \n",
"2 0.288132 0.132436 \n",
"3 0.274777 0.126298 \n",
"4 0.142227 0.065373 \n",
"5 0.060268 0.027701 \n",
"6 0.000007 0.000003 \n",
"7 0.000000 0.000000 \n",
"8 0.000000 0.000000 \n",
"9 0.000000 0.000000 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the Stacked Ensemble metalearner model\n",
"metalearner = se.metalearner()\n",
"metalearner"
]
},
{
"cell_type": "markdown",
"id": "06402906",
"metadata": {},
"source": [
"Examine the variable importance of the metalearner (combiner) algorithm in the ensemble. This shows us how much each base learner is contributing to the ensemble. The AutoML Stacked Ensembles use the default metalearner algorithm (GLM with non-negative weights), so the variable importance of the metalearner is actually the standardized coefficient magnitudes of the GLM."
]
},
{
"cell_type": "markdown",
"id": "7425b332",
"metadata": {},
"source": [
"The table above reports the variable importance of the metalearner for each base learner in the ensemble. The standardized coefficients behind it can be retrieved directly with `coef_norm()`: base learners with a coefficient of zero do not contribute to the ensemble's predictions.\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "4d86b390",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Intercept': 2.969427791239395,\n",
" 'GBM_2_AutoML_1_20220722_170939': 0.0,\n",
" 'GBM_5_AutoML_1_20220722_170939': 0.1564442116943833,\n",
" 'GBM_3_AutoML_1_20220722_170939': 0.06417658529927933,\n",
" 'GBM_1_AutoML_1_20220722_170939': 0.04507655475235933,\n",
" 'GBM_4_AutoML_1_20220722_170939': 0.02225052417703666,\n",
" 'GBM_grid_1_AutoML_1_20220722_170939_model_1': 0.0429872925804153,\n",
" 'XRT_1_AutoML_1_20220722_170939': 0.0,\n",
" 'DRF_1_AutoML_1_20220722_170939': 1.1143189727895769e-06,\n",
" 'GLM_1_AutoML_1_20220722_170939': 0.0,\n",
" 'DeepLearning_1_AutoML_1_20220722_170939': 0.009428554928469175}"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"metalearner.coef_norm()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "b4160a06",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"metalearner.std_coef_plot()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "d08c1586",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model Details\n",
"=============\n",
"H2OGeneralizedLinearEstimator : Generalized Linear Modeling\n",
"Model Key: metalearner_AUTO_StackedEnsemble_AllModels_1_AutoML_1_20220722_170939\n",
"\n",
"\n",
"GLM Model: summary\n"
]
},
{
"data": {
"text/plain": [
" family link regularization \\\n",
"0 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.002551 ) \n",
"\n",
" lambda_search \\\n",
"0 nlambda = 100, lambda.max = 0.2219, lambda.min = 0.002551, lambda.... \n",
"\n",
" number_of_predictors_total number_of_active_predictors \\\n",
"0 10 7 \n",
"\n",
" number_of_iterations \\\n",
"0 49 \n",
"\n",
" training_frame \n",
"0 levelone_training_StackedEnsemble_AllModels_1_AutoML_1_20220722_17... "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"ModelMetricsRegressionGLM: glm\n",
"** Reported on train data. **\n",
"\n",
"MSE: 0.21727419974117612\n",
"RMSE: 0.4661268065035266\n",
"MAE: 0.35448759076481845\n",
"RMSLE: 0.11902010738109643\n",
"R^2: 0.3428720464937892\n",
"Mean Residual Deviance: 0.21727419974117612\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3854\n",
"Null deviance: 1276.9399854672324\n",
"Residual deviance: 839.1129594004221\n",
"AIC: 5082.170839083037\n",
"\n",
"ModelMetricsRegressionGLM: glm\n",
"** Reported on cross-validation data. **\n",
"\n",
"MSE: 0.2185070080606413\n",
"RMSE: 0.46744733185744175\n",
"MAE: 0.35554501482828577\n",
"RMSLE: 0.11935774800872184\n",
"R^2: 0.3391435190891413\n",
"Mean Residual Deviance: 0.2185070080606413\n",
"Null degrees of freedom: 3861\n",
"Residual degrees of freedom: 3853\n",
"Null deviance: 1277.1043650451534\n",
"Residual deviance: 843.8740651301968\n",
"AIC: 5106.021797066896\n",
"\n",
"Cross-Validation Metrics Summary: \n"
]
},
{
"data": {
"text/plain": [
" mean sd cv_1_valid cv_2_valid \\\n",
"0 mae 0.355575 0.011150 0.358694 0.365597 \n",
"1 mean_residual_deviance 0.218685 0.018149 0.232207 0.226213 \n",
"2 mse 0.218685 0.018149 0.232207 0.226213 \n",
"3 null_deviance 255.420870 18.944805 270.143950 279.337650 \n",
"4 r2 0.339071 0.024654 0.332108 0.343113 \n",
"5 residual_deviance 168.764340 13.983558 180.424550 183.458680 \n",
"6 rmse 0.467306 0.019674 0.481878 0.475618 \n",
"7 rmsle 0.119310 0.004632 0.122437 0.121279 \n",
"\n",
" cv_3_valid cv_4_valid cv_5_valid \n",
"0 0.336775 0.355430 0.361377 \n",
"1 0.191451 0.209182 0.234372 \n",
"2 0.191451 0.209182 0.234372 \n",
"3 235.192600 252.053120 240.377060 \n",
"4 0.367360 0.351331 0.301443 \n",
"5 148.757460 163.370830 167.810170 \n",
"6 0.437551 0.457364 0.484120 \n",
"7 0.111271 0.119514 0.122048 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Scoring History: \n"
]
},
{
"data": {
"text/plain": [
" timestamp duration iteration lambda predictors \\\n",
"0 2022-07-22 17:10:18 0.000 sec 1 ,22E0 1 \n",
"1 2022-07-22 17:10:18 0.000 sec 2 ,2E0 5 \n",
"2 2022-07-22 17:10:18 0.000 sec 3 ,18E0 6 \n",
"3 2022-07-22 17:10:18 0.000 sec 4 ,17E0 6 \n",
"4 2022-07-22 17:10:18 0.016 sec 5 ,15E0 7 \n",
"5 2022-07-22 17:10:18 0.016 sec 6 ,14E0 7 \n",
"6 2022-07-22 17:10:18 0.016 sec 7 ,13E0 7 \n",
"7 2022-07-22 17:10:18 0.016 sec 8 ,12E0 7 \n",
"8 2022-07-22 17:10:18 0.016 sec 9 ,11E0 8 \n",
"9 2022-07-22 17:10:18 0.016 sec 10 ,96E-1 8 \n",
"10 2022-07-22 17:10:18 0.032 sec 11 ,88E-1 8 \n",
"11 2022-07-22 17:10:18 0.032 sec 12 ,8E-1 8 \n",
"12 2022-07-22 17:10:18 0.032 sec 13 ,73E-1 8 \n",
"13 2022-07-22 17:10:18 0.032 sec 14 ,66E-1 8 \n",
"14 2022-07-22 17:10:18 0.032 sec 15 ,6E-1 8 \n",
"15 2022-07-22 17:10:18 0.032 sec 16 ,55E-1 8 \n",
"16 2022-07-22 17:10:18 0.032 sec 17 ,5E-1 8 \n",
"17 2022-07-22 17:10:18 0.047 sec 18 ,46E-1 8 \n",
"18 2022-07-22 17:10:18 0.047 sec 19 ,42E-1 8 \n",
"19 2022-07-22 17:10:18 0.047 sec 20 ,38E-1 8 \n",
"\n",
" deviance_train deviance_xval deviance_se alpha iterations \\\n",
"0 0.330642 0.330652 0.008216 0.5 NaN \n",
"1 0.316066 0.330609 0.008220 0.5 NaN \n",
"2 0.302049 0.330627 0.008216 0.5 NaN \n",
"3 0.289848 0.322157 0.008512 0.5 NaN \n",
"4 0.279287 0.307694 0.008407 0.5 5.0 \n",
"5 0.270007 0.294801 0.008323 0.5 NaN \n",
"6 0.262086 0.283621 0.008253 0.5 NaN \n",
"7 0.255335 0.273875 0.008226 0.5 NaN \n",
"8 0.249586 0.265471 0.008206 0.5 NaN \n",
"9 0.244598 0.258304 0.008188 0.5 10.0 \n",
"10 0.240460 0.252204 0.008175 0.5 NaN \n",
"11 0.236926 0.246950 0.008189 0.5 NaN \n",
"12 0.233943 0.242530 0.008194 0.5 NaN \n",
"13 0.231423 0.238774 0.008196 0.5 NaN \n",
"14 0.229293 0.235608 0.008204 0.5 15.0 \n",
"15 0.227495 0.232935 0.008210 0.5 NaN \n",
"16 0.226154 0.230682 0.008216 0.5 NaN \n",
"17 0.224642 0.228783 0.008220 0.5 NaN \n",
"18 0.223737 0.227337 0.008211 0.5 NaN \n",
"19 0.222642 0.225825 0.008238 0.5 20.0 \n",
"\n",
" training_rmse training_deviance training_mae training_r2 \n",
"0 \n",
"1 \n",
"2 \n",
"3 \n",
"4 0.528476 0.279287 0.407732 0.155318 \n",
"5 \n",
"6 \n",
"7 \n",
"8 \n",
"9 0.494568 0.244598 0.377588 0.260234 \n",
"10 \n",
"11 \n",
"12 \n",
"13 \n",
"14 0.478846 0.229293 0.364087 0.306521 \n",
"15 \n",
"16 \n",
"17 \n",
"18 \n",
"19 0.471849 0.222642 0.358377 0.326639 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"See the whole table with table.as_data_frame()\n",
"\n",
"Variable Importances: \n"
]
},
{
"data": {
"text/plain": [
" variable relative_importance \\\n",
"0 GBM_5_AutoML_1_20220722_170939 0.156444 \n",
"1 GBM_3_AutoML_1_20220722_170939 0.064177 \n",
"2 GBM_1_AutoML_1_20220722_170939 0.045077 \n",
"3 GBM_grid_1_AutoML_1_20220722_170939_model_1 0.042987 \n",
"4 GBM_4_AutoML_1_20220722_170939 0.022251 \n",
"5 DeepLearning_1_AutoML_1_20220722_170939 0.009429 \n",
"6 DRF_1_AutoML_1_20220722_170939 0.000001 \n",
"7 GBM_2_AutoML_1_20220722_170939 0.000000 \n",
"8 XRT_1_AutoML_1_20220722_170939 0.000000 \n",
"9 GLM_1_AutoML_1_20220722_170939 0.000000 \n",
"\n",
" scaled_importance percentage \n",
"0 1.000000 0.459637 \n",
"1 0.410220 0.188552 \n",
"2 0.288132 0.132436 \n",
"3 0.274777 0.126298 \n",
"4 0.142227 0.065373 \n",
"5 0.060268 0.027701 \n",
"6 0.000007 0.000003 \n",
"7 0.000000 0.000000 \n",
"8 0.000000 0.000000 \n",
"9 0.000000 0.000000 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"h2o.get_model(model_id).metalearner()"
]
},
{
"cell_type": "markdown",
"id": "a8f423ad",
"metadata": {
"papermill": {
"duration": 0.030956,
"end_time": "2021-03-24T11:25:14.345344",
"exception": false,
"start_time": "2021-03-24T11:25:14.314388",
"status": "completed"
},
"tags": []
},
"source": [
"## Generating Predictions Using Leader Model\n",
"\n",
"We can also generate predictions on a test sample using the leader model object."
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "90dd2625",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%\n"
]
},
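{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>predict</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2.88073</td></tr>\n",
"<tr><td>3.32158</td></tr>\n",
"<tr><td>3.08133</td></tr>\n",
"<tr><td>2.73111</td></tr>\n",
"<tr><td>2.51044</td></tr>\n",
"<tr><td>3.13355</td></tr>\n",
"<tr><td>3.18375</td></tr>\n",
"<tr><td>3.74316</td></tr>\n",
"<tr><td>2.58496</td></tr>\n",
"<tr><td>3.3082</td></tr>\n",
"</tbody>\n",
"</table>"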
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# aml.predict() scores the test frame with the leader model\n",
"pred = aml.predict(test_h)\n",
"pred.head()"
]
},
{
"cell_type": "markdown",
"id": "cbb85276",
"metadata": {},
"source": [
"With these predictions we can estimate the out-of-sample (test) MSE and its standard error."
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "d38d7686",
"metadata": {},
"outputs": [],
"source": [
"pred_2 = pred.as_data_frame()\n",
"pred_aml = pred_2.to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "3b5cb41a",
"metadata": {},
"outputs": [],
"source": [
"Y_test = test_h['lwage'].as_data_frame().to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "3b070b47",
"metadata": {},
"outputs": [],
"source": [
"import statsmodels.api as sm\n",
"import statsmodels.formula.api as smf"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "7602a9e8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Coef. 0.221502\n",
"Std.Err. 0.012942\n",
"Name: const, dtype: float64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
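],
"source": [
"# Squared prediction errors on the test sample\n",
"resid_basic = (Y_test-pred_aml)**2\n",
"\n",
"# Regressing the squared errors on a constant returns their mean (the test MSE)\n",
"# as the intercept, together with the standard error of that mean\n",
"MSE_aml_basic = sm.OLS( resid_basic , np.ones( resid_basic.shape[0] ) ).fit().summary2().tables[1].iloc[0, 0:2]\n",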
"MSE_aml_basic"
]
},
{
"cell_type": "markdown",
"id": "6725164f",
"metadata": {},
"source": [
"We observe both a lower MSE and a lower standard error compared to our previous results (see [here](https://www.kaggle.com/janniskueck/pm3-notebook-newdata))."
]
},
{
"cell_type": "markdown",
"id": "e03b4c5b",
"metadata": {
"tags": []
},
"source": [
"### Using model_performance()\n",
"Alternatively, the standard `model_performance()` method can be applied to the AutoML leader model and a test set to generate an H2O model performance object, which reports the test MSE alongside other metrics such as RMSE, MAE and R^2.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "caaedabd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"ModelMetricsRegressionGLM: stackedensemble\n",
"** Reported on test data. **\n",
"\n",
"MSE: 0.22150159010610537\n",
"RMSE: 0.4706395543365489\n",
"MAE: 0.353942656628514\n",
"RMSLE: 0.12036767274774818\n",
"R^2: 0.2835426043371053\n",
"Mean Residual Deviance: 0.22150159010610537\n",
"Null degrees of freedom: 1287\n",
"Residual degrees of freedom: 1280\n",
"Null deviance: 398.23902107893576\n",
"Residual deviance: 285.2940480566637\n",
"AIC: 1731.7504037354604\n"
]
},
{
"data": {
"text/plain": []
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"perf = aml.leader.model_performance(test_h)\n",
"perf"
]
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}