{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"_execution_state": "idle",
"_uuid": "051d70d956493feee0c6d64651c6a088724dca2a",
"id": "DfBfWzYzWitP",
"papermill": {
"duration": 0.037179,
"end_time": "2021-07-22T21:33:20.414432",
"exception": false,
"start_time": "2021-07-22T21:33:20.377253",
"status": "completed"
},
"tags": []
},
"source": [
"# ML for wage prediction"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Kk13PYstWitS",
"papermill": {
"duration": 0.035632,
"end_time": "2021-07-22T21:33:20.486115",
"exception": false,
"start_time": "2021-07-22T21:33:20.450483",
"status": "completed"
},
"tags": []
},
"source": [
"We illustrate how to predict an outcome variable Y in a high-dimensional setting, where the number of covariates $p$ is large in relation to the sample size $n$. So far we have used linear prediction rules, e.g. Lasso regression, for estimation.\n",
"Now, we also consider nonlinear prediction rules including tree-based methods."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fQLyPTHhWitT",
"papermill": {
"duration": 0.035698,
"end_time": "2021-07-22T21:33:20.558154",
"exception": false,
"start_time": "2021-07-22T21:33:20.522456",
"status": "completed"
},
"tags": []
},
"source": [
"## Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QAJp34YkWitU",
"papermill": {
"duration": 0.035998,
"end_time": "2021-07-22T21:33:20.630093",
"exception": false,
"start_time": "2021-07-22T21:33:20.594095",
"status": "completed"
},
"tags": []
},
"source": [
"Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015.\n",
"The preproccessed sample consists of $5150$ never-married individuals."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 69
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:20.735227Z",
"iopub.status.busy": "2021-07-22T21:33:20.732866Z",
"iopub.status.idle": "2021-07-22T21:33:20.898796Z",
"shell.execute_reply": "2021-07-22T21:33:20.897685Z"
},
"executionInfo": {
"elapsed": 6809,
"status": "ok",
"timestamp": 1658250110912,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "BGctl_l4WitW",
"outputId": "9b78359d-a132-4e38-92c9-7c5f83859541",
"papermill": {
"duration": 0.232703,
"end_time": "2021-07-22T21:33:20.899004",
"exception": false,
"start_time": "2021-07-22T21:33:20.666301",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Warning message in system(\"timedatectl\", intern = TRUE):\n",
"“running command 'timedatectl' had status 1”\n"
]
},
{
"data": {
"text/html": [
"\n",
"
- 5150
- 21
\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 5150\n",
"\\item 21\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 5150\n",
"2. 21\n",
"\n",
"\n"
],
"text/plain": [
"[1] 5150 21"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"install.packages(\"librarian\", quiet = T)\n",
"librarian::shelf(\n",
" tidyverse\n",
" , randomForest\n",
" , rpart\n",
" , glmnnet\n",
" , gbm\n",
" , rpart.plot\n",
" , keras\n",
" , hdm\n",
" , quiet = T\n",
")\n",
"data = read_csv(\"https://raw.githubusercontent.com/d2cml-ai/14.388_R/main/Data/wage2015_subsample_inference.csv\", show_col_types = F)\n",
"dim(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SijRfuBnWitY",
"papermill": {
"duration": 0.036385,
"end_time": "2021-07-22T21:33:20.972103",
"exception": false,
"start_time": "2021-07-22T21:33:20.935718",
"status": "completed"
},
"tags": []
},
"source": [
"The outcomes $Y_i$'s are hourly (log) wages of never-married workers living in the U.S. The raw regressors $Z_i$'s consist of a variety of characteristics, including experience, education and industry and occupation indicators."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:21.078896Z",
"iopub.status.busy": "2021-07-22T21:33:21.048765Z",
"iopub.status.idle": "2021-07-22T21:33:21.095997Z",
"shell.execute_reply": "2021-07-22T21:33:21.094629Z"
},
"executionInfo": {
"elapsed": 38,
"status": "ok",
"timestamp": 1658250110915,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "iqAhALdZWitZ",
"outputId": "a0c2a5b7-7f0d-49f3-9445-8213247eb3a7",
"papermill": {
"duration": 0.087405,
"end_time": "2021-07-22T21:33:21.096144",
"exception": false,
"start_time": "2021-07-22T21:33:21.008739",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"- 'rownames'
- 'sex'
- 'shs'
- 'hsg'
- 'scl'
- 'clg'
- 'ad'
- 'mw'
- 'so'
- 'we'
- 'ne'
- 'exp1'
- 'exp2'
- 'exp3'
- 'exp4'
- 'occ'
- 'occ2'
- 'ind'
- 'ind2'
\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'rownames'\n",
"\\item 'sex'\n",
"\\item 'shs'\n",
"\\item 'hsg'\n",
"\\item 'scl'\n",
"\\item 'clg'\n",
"\\item 'ad'\n",
"\\item 'mw'\n",
"\\item 'so'\n",
"\\item 'we'\n",
"\\item 'ne'\n",
"\\item 'exp1'\n",
"\\item 'exp2'\n",
"\\item 'exp3'\n",
"\\item 'exp4'\n",
"\\item 'occ'\n",
"\\item 'occ2'\n",
"\\item 'ind'\n",
"\\item 'ind2'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'rownames'\n",
"2. 'sex'\n",
"3. 'shs'\n",
"4. 'hsg'\n",
"5. 'scl'\n",
"6. 'clg'\n",
"7. 'ad'\n",
"8. 'mw'\n",
"9. 'so'\n",
"10. 'we'\n",
"11. 'ne'\n",
"12. 'exp1'\n",
"13. 'exp2'\n",
"14. 'exp3'\n",
"15. 'exp4'\n",
"16. 'occ'\n",
"17. 'occ2'\n",
"18. 'ind'\n",
"19. 'ind2'\n",
"\n",
"\n"
],
"text/plain": [
" [1] \"rownames\" \"sex\" \"shs\" \"hsg\" \"scl\" \"clg\" \n",
" [7] \"ad\" \"mw\" \"so\" \"we\" \"ne\" \"exp1\" \n",
"[13] \"exp2\" \"exp3\" \"exp4\" \"occ\" \"occ2\" \"ind\" \n",
"[19] \"ind2\" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"Z <- data |> select(-c(lwage, wage)) # regressors\n",
"colnames(Z)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Oo2wY8oIWitc",
"papermill": {
"duration": 0.036941,
"end_time": "2021-07-22T21:33:21.171426",
"exception": false,
"start_time": "2021-07-22T21:33:21.134485",
"status": "completed"
},
"tags": []
},
"source": [
"The following figure shows the weekly wage distribution from the US survey data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 437
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:21.250222Z",
"iopub.status.busy": "2021-07-22T21:33:21.249037Z",
"iopub.status.idle": "2021-07-22T21:33:21.551533Z",
"shell.execute_reply": "2021-07-22T21:33:21.550113Z"
},
"executionInfo": {
"elapsed": 39,
"status": "ok",
"timestamp": 1658250110918,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "5Xsm3cUyWite",
"outputId": "33fd519e-3ee5-4c10-d693-2f9805aeb98a",
"papermill": {
"duration": 0.343165,
"end_time": "2021-07-22T21:33:21.551709",
"exception": false,
"start_time": "2021-07-22T21:33:21.208544",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"Plot with title “Empirical wage distribution from the US survey data”"
]
},
"metadata": {
"image/png": {
"height": 420,
"width": 420
}
},
"output_type": "display_data"
}
],
"source": [
"hist(data$wage, xlab= \"hourly wage\", main=\"Empirical wage distribution from the US survey data\", breaks= 35)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XCwz2ZNoWitg",
"papermill": {
"duration": 0.038762,
"end_time": "2021-07-22T21:33:21.629360",
"exception": false,
"start_time": "2021-07-22T21:33:21.590598",
"status": "completed"
},
"tags": []
},
"source": [
"Wages show a high degree of skewness. Hence, wages are transformed in almost all studies by\n",
"the logarithm."
]
},
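{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of why the log transform helps, the sketch below computes a moment-based skewness coefficient for a hypothetical log-normal wage grid and for its logarithm. The helper `sketch_skew`, the grid `sketch_wage`, and the parameters `meanlog = 3` and `sdlog = 0.6` are assumptions made for illustration; they are not taken from the CPS data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Hypothetical helper: third standardized central moment (skewness)\n",
"sketch_skew <- function(x) {\n",
"  centered <- x - mean(x)\n",
"  mean(centered^3) / mean(centered^2)^1.5\n",
"}\n",
"\n",
"# Deterministic stand-in for wages: quantiles of a log-normal distribution\n",
"sketch_wage <- qlnorm(ppoints(1000), meanlog = 3, sdlog = 0.6)\n",
"\n",
"# Skewness before and after taking logs\n",
"c(raw = sketch_skew(sketch_wage), log = sketch_skew(log(sketch_wage)))"
]
},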
{
"cell_type": "markdown",
"metadata": {
"id": "cp2Tf15pWiti",
"papermill": {
"duration": 0.038159,
"end_time": "2021-07-22T21:33:21.706383",
"exception": false,
"start_time": "2021-07-22T21:33:21.668224",
"status": "completed"
},
"tags": []
},
"source": [
"## Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jjN31ZYbWitj",
"papermill": {
"duration": 0.038644,
"end_time": "2021-07-22T21:33:21.783461",
"exception": false,
"start_time": "2021-07-22T21:33:21.744817",
"status": "completed"
},
"tags": []
},
"source": [
"Due to the skewness of the data, we are considering log wages which leads to the following regression model\n",
"\n",
"$$log(wage) = g(Z) + \\epsilon.$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jNsHcgDTWitl",
"papermill": {
"duration": 0.038767,
"end_time": "2021-07-22T21:33:21.861283",
"exception": false,
"start_time": "2021-07-22T21:33:21.822516",
"status": "completed"
},
"tags": []
},
"source": [
"We will estimate the two sets of prediction rules: Linear and Nonlinear Models.\n",
"In linear models, we estimate the prediction rule of the form\n",
"\n",
"$$\\hat g(Z) = \\hat \\beta'X.$$\n",
"Again, we generate $X$ in two ways:\n",
" \n",
"1. Basic Model: $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, regional indicators).\n",
"\n",
"\n",
"2. Flexible Model: $X$ consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DpTuFFQ-Witn",
"papermill": {
"duration": 0.038121,
"end_time": "2021-07-22T21:33:21.938272",
"exception": false,
"start_time": "2021-07-22T21:33:21.900151",
"status": "completed"
},
"tags": []
},
"source": [
"To evaluate the out-of-sample performance, we split the data first."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2021-07-22T21:33:22.019774Z",
"iopub.status.busy": "2021-07-22T21:33:22.018129Z",
"iopub.status.idle": "2021-07-22T21:33:22.037820Z",
"shell.execute_reply": "2021-07-22T21:33:22.036538Z"
},
"executionInfo": {
"elapsed": 38,
"status": "ok",
"timestamp": 1658250110920,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "alXZbuZSWitp",
"papermill": {
"duration": 0.061626,
"end_time": "2021-07-22T21:33:22.037969",
"exception": false,
"start_time": "2021-07-22T21:33:21.976343",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"set.seed(1234)\n",
"training <- sample(nrow(data), nrow(data)*(3/4), replace=FALSE)\n",
"\n",
"data_train <- data[training,]\n",
"data_test <- data[-training,]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "s0AlDE-nWitp",
"papermill": {
"duration": 0.039021,
"end_time": "2021-07-22T21:33:22.115014",
"exception": false,
"start_time": "2021-07-22T21:33:22.075993",
"status": "completed"
},
"tags": []
},
"source": [
"We construct the two different model matrices $X_{basic}$ and $X_{flex}$ for both the training and the test sample:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2021-07-22T21:33:22.198405Z",
"iopub.status.busy": "2021-07-22T21:33:22.197691Z",
"iopub.status.idle": "2021-07-22T21:33:22.248902Z",
"shell.execute_reply": "2021-07-22T21:33:22.246885Z"
},
"executionInfo": {
"elapsed": 39,
"status": "ok",
"timestamp": 1658250110921,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "TVNFi2fuWits",
"papermill": {
"duration": 0.09478,
"end_time": "2021-07-22T21:33:22.249085",
"exception": false,
"start_time": "2021-07-22T21:33:22.154305",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"X_basic <- \"sex + exp1 + exp2+ shs + hsg+ scl + clg + mw + so + we + occ2+ ind2\"\n",
"X_flex <- \"sex + exp1 + exp2 + shs+hsg+scl+clg+occ2+ind2+mw+so+we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)\"\n",
"formula_basic <- as.formula(paste(\"lwage\", \"~\", X_basic))\n",
"formula_flex <- as.formula(paste(\"lwage\", \"~\", X_flex))\n",
"\n",
"model_X_basic_train <- model.matrix(formula_basic,data_train)\n",
"model_X_basic_test <- model.matrix(formula_basic,data_test)\n",
"p_basic <- dim(model_X_basic_train)[2]\n",
"model_X_flex_train <- model.matrix(formula_flex,data_train)\n",
"model_X_flex_test <- model.matrix(formula_flex,data_test)\n",
"p_flex <- dim(model_X_flex_train)[2]"
]
},
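{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how a formula with an interaction such as `(exp1 + exp2) * shs` expands into columns of a model matrix, here is a minimal sketch on a made-up data frame. The object names `toy`, `toy_formula` and `toy_X`, the values in `toy`, and the scaling `exp2 = exp1^2 / 100` are assumptions for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Made-up data: six workers with log wage, experience and one education dummy\n",
"toy <- data.frame(\n",
"  lwage = c(2.1, 2.8, 3.0, 2.5, 3.3, 2.9),\n",
"  exp1  = c(1, 5, 10, 3, 20, 8),\n",
"  shs   = c(1, 0, 0, 1, 0, 0)\n",
")\n",
"toy$exp2 <- toy$exp1^2 / 100  # assumed scaling of the squared term\n",
"\n",
"# a * b expands to a + b + a:b; duplicated main effects are dropped\n",
"toy_formula <- lwage ~ exp1 + shs + (exp1 + exp2) * shs\n",
"toy_X <- model.matrix(toy_formula, toy)\n",
"\n",
"colnames(toy_X)  # intercept, main effects, then interaction columns\n",
"ncol(toy_X)"
]
},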
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2021-07-22T21:33:22.331274Z",
"iopub.status.busy": "2021-07-22T21:33:22.329613Z",
"iopub.status.idle": "2021-07-22T21:33:22.347093Z",
"shell.execute_reply": "2021-07-22T21:33:22.345856Z"
},
"executionInfo": {
"elapsed": 39,
"status": "ok",
"timestamp": 1658250110922,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "7VvnCqX5Witt",
"papermill": {
"duration": 0.059764,
"end_time": "2021-07-22T21:33:22.347237",
"exception": false,
"start_time": "2021-07-22T21:33:22.287473",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"Y_train <- data_train$lwage\n",
"Y_test <- data_test$lwage"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:22.429312Z",
"iopub.status.busy": "2021-07-22T21:33:22.428202Z",
"iopub.status.idle": "2021-07-22T21:33:22.445258Z",
"shell.execute_reply": "2021-07-22T21:33:22.443989Z"
},
"executionInfo": {
"elapsed": 39,
"status": "ok",
"timestamp": 1658250110923,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "HpKaJlTgWitt",
"outputId": "df8c7f4f-e0bb-4c62-de65-b3ed914c9994",
"papermill": {
"duration": 0.059354,
"end_time": "2021-07-22T21:33:22.445413",
"exception": false,
"start_time": "2021-07-22T21:33:22.386059",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"data": {
"text/html": [
"13"
],
"text/latex": [
"13"
],
"text/markdown": [
"13"
],
"text/plain": [
"[1] 13"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"51"
],
"text/latex": [
"51"
],
"text/markdown": [
"51"
],
"text/plain": [
"[1] 51"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"p_basic\n",
"p_flex"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qUvQOieNWitu",
"papermill": {
"duration": 0.039589,
"end_time": "2021-07-22T21:33:22.525196",
"exception": false,
"start_time": "2021-07-22T21:33:22.485607",
"status": "completed"
},
"tags": []
},
"source": [
"As known from our first lab, the basic model consists of $10$ regressors and the flexible model of $246$ regressors. Let us fit our models to the training sample using the two different model specifications. We are starting by running a simple ols regression. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QpogkfgpWitu",
"papermill": {
"duration": 0.039781,
"end_time": "2021-07-22T21:33:22.604883",
"exception": false,
"start_time": "2021-07-22T21:33:22.565102",
"status": "completed"
},
"tags": []
},
"source": [
"### OLS"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9VZOD13IWitu",
"papermill": {
"duration": 0.039692,
"end_time": "2021-07-22T21:33:22.684509",
"exception": false,
"start_time": "2021-07-22T21:33:22.644817",
"status": "completed"
},
"tags": []
},
"source": [
"We fit the basic model to our training data by running an ols regression and compute the mean squared error on the test sample."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2021-07-22T21:33:22.770034Z",
"iopub.status.busy": "2021-07-22T21:33:22.769510Z",
"iopub.status.idle": "2021-07-22T21:33:22.791709Z",
"shell.execute_reply": "2021-07-22T21:33:22.790718Z"
},
"executionInfo": {
"elapsed": 37,
"status": "ok",
"timestamp": 1658250110924,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "YRjB3xgOWitv",
"papermill": {
"duration": 0.067482,
"end_time": "2021-07-22T21:33:22.791854",
"exception": false,
"start_time": "2021-07-22T21:33:22.724372",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# ols (basic model)\n",
"fit_lm_basic <- lm(formula_basic, data_train) "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:22.876209Z",
"iopub.status.busy": "2021-07-22T21:33:22.875015Z",
"iopub.status.idle": "2021-07-22T21:33:22.900006Z",
"shell.execute_reply": "2021-07-22T21:33:22.897996Z"
},
"executionInfo": {
"elapsed": 1203,
"status": "ok",
"timestamp": 1658250112090,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "7VpQneW6Witv",
"outputId": "c365c34c-f7bc-4161-f45a-949c9e659e5b",
"papermill": {
"duration": 0.068516,
"end_time": "2021-07-22T21:33:22.900213",
"exception": false,
"start_time": "2021-07-22T21:33:22.831697",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The mean squared error (MSE) using the basic model is equal to 0.2496282"
]
}
],
"source": [
"# compute out-of-sample performance\n",
"yhat_lm_basic <- predict(fit_lm_basic, newdata = data_test)\n",
"cat(\"The mean squared error (MSE) using the basic model is equal to\" , mean((Y_test - yhat_lm_basic)^2)) # MSE OLS (basic model) "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aUdqcpXAWitw",
"papermill": {
"duration": 0.056431,
"end_time": "2021-07-22T21:33:23.026875",
"exception": false,
"start_time": "2021-07-22T21:33:22.970444",
"status": "completed"
},
"tags": []
},
"source": [
"To determine the out-of-sample $MSE$ and the standard error in one step, we can use the function *lm*:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:23.116493Z",
"iopub.status.busy": "2021-07-22T21:33:23.115227Z",
"iopub.status.idle": "2021-07-22T21:33:23.142144Z",
"shell.execute_reply": "2021-07-22T21:33:23.140424Z"
},
"executionInfo": {
"elapsed": 23,
"status": "ok",
"timestamp": 1658250112093,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "-3VIzYKCWitw",
"outputId": "faefb36a-56df-4a67-c824-3ffd2476013b",
"papermill": {
"duration": 0.075002,
"end_time": "2021-07-22T21:33:23.142292",
"exception": false,
"start_time": "2021-07-22T21:33:23.067290",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"- 0.249628205480316
- 0.0155845190765748
\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 0.249628205480316\n",
"\\item 0.0155845190765748\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 0.249628205480316\n",
"2. 0.0155845190765748\n",
"\n",
"\n"
],
"text/plain": [
"[1] 0.24962821 0.01558452"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"MSE_lm_basic <- summary(lm((Y_test - yhat_lm_basic)^2~1))$coef[1:2]\n",
"MSE_lm_basic"
]
},
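{
"cell_type": "markdown",
"metadata": {},
"source": [
"The regression-on-a-constant trick can be checked by hand. The sketch below uses a small made-up vector of squared errors (`sketch_sq_err`, an assumption for illustration) and compares the *lm* output with the sample mean and with the standard deviation divided by $\\sqrt{n}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Made-up squared prediction errors\n",
"sketch_sq_err <- c(0.10, 0.40, 0.05, 0.90, 0.30, 0.20, 0.65, 0.15)\n",
"\n",
"# Intercept estimate and its standard error from an intercept-only regression\n",
"sketch_via_lm <- summary(lm(sketch_sq_err ~ 1))$coef[1:2]\n",
"\n",
"# The same two numbers computed directly\n",
"sketch_by_hand <- c(\n",
"  mean(sketch_sq_err),\n",
"  sd(sketch_sq_err) / sqrt(length(sketch_sq_err))\n",
")\n",
"\n",
"all.equal(unname(sketch_via_lm), sketch_by_hand)"
]
},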
{
"cell_type": "markdown",
"metadata": {
"id": "WSJe5AWiWitw",
"papermill": {
"duration": 0.040507,
"end_time": "2021-07-22T21:33:23.223781",
"exception": false,
"start_time": "2021-07-22T21:33:23.183274",
"status": "completed"
},
"tags": []
},
"source": [
"We also compute the out-of-sample $R^2$:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:23.310177Z",
"iopub.status.busy": "2021-07-22T21:33:23.308577Z",
"iopub.status.idle": "2021-07-22T21:33:23.323143Z",
"shell.execute_reply": "2021-07-22T21:33:23.321848Z"
},
"executionInfo": {
"elapsed": 24,
"status": "ok",
"timestamp": 1658250112095,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "6iCw0TS4Witw",
"outputId": "b7a5abc6-be2d-4852-9c68-dffdd5ce07ce",
"papermill": {
"duration": 0.058991,
"end_time": "2021-07-22T21:33:23.323276",
"exception": false,
"start_time": "2021-07-22T21:33:23.264285",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The R^2 using the basic model is equal to 0.2185201"
]
}
],
"source": [
"R2_lm_basic <- 1 - MSE_lm_basic[1] / var(Y_test)\n",
"cat(\"The R^2 using the basic model is equal to\", R2_lm_basic) # MSE OLS (basic model) "
]
},
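{
"cell_type": "markdown",
"metadata": {},
"source": [
"The out-of-sample $R^2$ used here is $1 - MSE / \\widehat{Var}(Y_{test})$. A small sketch of this formula as a reusable function follows; the name `oos_r2` and the toy vector `sketch_y_r2` are hypothetical. Note that *var* divides by $n-1$, so a constant prediction at the test-sample mean gives $1/n$ rather than exactly $0$, and poor predictions can give negative values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Hypothetical helper: out-of-sample R^2 from outcomes and predictions\n",
"oos_r2 <- function(y, yhat) {\n",
"  mse <- mean((y - yhat)^2)\n",
"  1 - mse / var(y)\n",
"}\n",
"\n",
"sketch_y_r2 <- c(2.0, 2.5, 3.0, 3.5, 4.0)\n",
"\n",
"oos_r2(sketch_y_r2, sketch_y_r2)                 # perfect predictions\n",
"oos_r2(sketch_y_r2, rep(mean(sketch_y_r2), 5))   # constant prediction at the mean\n",
"oos_r2(sketch_y_r2, rep(0, 5))                   # predictions far from the data"
]
},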
{
"cell_type": "markdown",
"metadata": {
"id": "OxxeDdauWity",
"papermill": {
"duration": 0.040697,
"end_time": "2021-07-22T21:33:23.404884",
"exception": false,
"start_time": "2021-07-22T21:33:23.364187",
"status": "completed"
},
"tags": []
},
"source": [
"We repeat the same procedure for the flexible model."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:23.491381Z",
"iopub.status.busy": "2021-07-22T21:33:23.490191Z",
"iopub.status.idle": "2021-07-22T21:33:23.637474Z",
"shell.execute_reply": "2021-07-22T21:33:23.635485Z"
},
"executionInfo": {
"elapsed": 21,
"status": "ok",
"timestamp": 1658250112096,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "4PUFKVx1Wity",
"outputId": "9b400580-29da-42ad-ddbb-5a6b9442279d",
"papermill": {
"duration": 0.192034,
"end_time": "2021-07-22T21:33:23.637702",
"exception": false,
"start_time": "2021-07-22T21:33:23.445668",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The R^2 using the flexible model is equal to 0.2165618"
]
}
],
"source": [
"# ols (flexible model)\n",
"fit_lm_flex <- lm(formula_flex, data_train) \n",
"# Compute the Out-Of-Sample Performance\n",
"options(warn=-1)\n",
"yhat_lm_flex <- predict(fit_lm_flex, newdata = data_test)\n",
"MSE_lm_flex <- summary(lm((Y_test - yhat_lm_flex)^2~1))$coef[1:2]\n",
"R2_lm_flex <- 1 - MSE_lm_flex[1] / var(Y_test)\n",
"cat(\"The R^2 using the flexible model is equal to\", R2_lm_flex) # MSE OLS (flexible model) "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "smBG74geWity",
"papermill": {
"duration": 0.055232,
"end_time": "2021-07-22T21:33:23.764568",
"exception": false,
"start_time": "2021-07-22T21:33:23.709336",
"status": "completed"
},
"tags": []
},
"source": [
"We observe that ols regression works better for the basic model with smaller $p/n$ ratio. We are proceeding by running lasso regressions and its versions."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4gRVP4umWitz",
"papermill": {
"duration": 0.041825,
"end_time": "2021-07-22T21:33:23.870407",
"exception": false,
"start_time": "2021-07-22T21:33:23.828582",
"status": "completed"
},
"tags": []
},
"source": [
"### Lasso, Ridge and Elastic Net\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8_l0nfHyWitz",
"papermill": {
"duration": 0.04152,
"end_time": "2021-07-22T21:33:23.953937",
"exception": false,
"start_time": "2021-07-22T21:33:23.912417",
"status": "completed"
},
"tags": []
},
"source": [
"Considering the basic model, we run a lasso/post-lasso regression first and then we compute the measures for the out-of-sample performance. Note that applying the package *hdm* and the function *rlasso* we rely on a theory-based choice of the penalty level $\\lambda$ in the lasso regression."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:24.047109Z",
"iopub.status.busy": "2021-07-22T21:33:24.041516Z",
"iopub.status.idle": "2021-07-22T21:33:24.535625Z",
"shell.execute_reply": "2021-07-22T21:33:24.532403Z"
},
"executionInfo": {
"elapsed": 19,
"status": "ok",
"timestamp": 1658250112097,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "Is9ir6zuWit0",
"outputId": "05db9dec-f46a-4f8b-ee43-fee29e3ea4af",
"papermill": {
"duration": 0.539956,
"end_time": "2021-07-22T21:33:24.535881",
"exception": false,
"start_time": "2021-07-22T21:33:23.995925",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The R^2 using the basic model is equal to 0.2184129 for lasso and 0.2220093 for post-lasso"
]
}
],
"source": [
"# lasso and versions\n",
"# library(hdm) \n",
"fit.rlasso <- rlasso(formula_basic, data_train, post=FALSE)\n",
"fit.rlasso.post <- rlasso(formula_basic, data_train, post=TRUE)\n",
"yhat.rlasso <- predict(fit.rlasso, newdata=data_test)\n",
"yhat.rlasso.post <- predict(fit.rlasso.post, newdata=data_test)\n",
"\n",
"MSE.lasso <- summary(lm((Y_test-yhat.rlasso)^2~1))$coef[1:2]\n",
"MSE.lasso.post <- summary(lm((Y_test-yhat.rlasso.post)^2~1))$coef[1:2]\n",
"\n",
"R2.lasso <- 1-MSE.lasso[1]/var(Y_test)\n",
"R2.lasso.post <- 1-MSE.lasso.post[1]/var(Y_test)\n",
"cat(\"The R^2 using the basic model is equal to\",R2.lasso,\"for lasso and\",R2.lasso.post,\"for post-lasso\") # R^2 lasso/post-lasso (basic model) "
]
},
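{
"cell_type": "markdown",
"metadata": {},
"source": [
"Post-lasso refits OLS on the variables that lasso selected, which removes the shrinkage from the retained coefficients. The sketch below spells out this idea on R's built-in `mtcars` data, not on the wage data. The names `sketch_fit`, `sketch_sel` and `sketch_refit` are hypothetical, and we assume that the fitted *rlasso* object stores the selection as a logical vector in `$index`. The coefficients of this manual refit need not coincide exactly with those from `rlasso(..., post=TRUE)`, which may also use the post-lasso step internally when computing the penalty."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"library(hdm)\n",
"\n",
"# Step 1: lasso with the theory-based penalty (no post-lasso refit)\n",
"sketch_fit <- rlasso(mpg ~ ., data = mtcars, post = FALSE)\n",
"\n",
"# Step 2: names of the selected regressors (assumes $index is a logical vector)\n",
"sketch_sel <- setdiff(names(mtcars), \"mpg\")[as.logical(sketch_fit$index)]\n",
"\n",
"# Step 3: plain OLS on the selected regressors (intercept only if none selected)\n",
"sketch_rhs <- if (length(sketch_sel) > 0) sketch_sel else \"1\"\n",
"sketch_refit <- lm(reformulate(sketch_rhs, response = \"mpg\"), data = mtcars)\n",
"\n",
"sketch_sel\n",
"coef(sketch_refit)"
]
},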
{
"cell_type": "markdown",
"metadata": {
"id": "4D3os7iXWit1",
"papermill": {
"duration": 0.051245,
"end_time": "2021-07-22T21:33:24.660529",
"exception": false,
"start_time": "2021-07-22T21:33:24.609284",
"status": "completed"
},
"tags": []
},
"source": [
"Now, we repeat the same procedure for the flexible model."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:24.749884Z",
"iopub.status.busy": "2021-07-22T21:33:24.748227Z",
"iopub.status.idle": "2021-07-22T21:33:27.945305Z",
"shell.execute_reply": "2021-07-22T21:33:27.942260Z"
},
"executionInfo": {
"elapsed": 315,
"status": "ok",
"timestamp": 1658250112397,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "GWGYhDYUWit1",
"outputId": "4f3718c4-a1ab-4126-8092-e921add22961",
"papermill": {
"duration": 3.242671,
"end_time": "2021-07-22T21:33:27.945527",
"exception": false,
"start_time": "2021-07-22T21:33:24.702856",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The R^2 using the flexible model is equal to 0.2167083 for lasso and 0.2220093 for post-lasso"
]
}
],
"source": [
"fit.rlasso.flex <- rlasso(formula_flex, data_train, post=FALSE)\n",
"fit.rlasso.post.flex <- rlasso(formula_flex, data_train, post=TRUE)\n",
"yhat.rlasso.flex <- predict(fit.rlasso.flex, newdata=data_test)\n",
"yhat.rlasso.post.flex <- predict(fit.rlasso.post.flex, newdata=data_test)\n",
"\n",
"MSE.lasso.flex <- summary(lm((Y_test-yhat.rlasso.flex)^2~1))$coef[1:2]\n",
"MSE.lasso.post.flex <- summary(lm((Y_test-yhat.rlasso.post.flex)^2~1))$coef[1:2]\n",
"\n",
"R2.lasso.flex <- 1-MSE.lasso.flex[1]/var(Y_test)\n",
"R2.lasso.post.flex <- 1-MSE.lasso.post.flex[1]/var(Y_test)\n",
"cat(\"The R^2 using the flexible model is equal to\",R2.lasso.flex,\"for lasso and\",R2.lasso.post.flex,\"for post-lasso\") # R^2 lasso/post-lasso (flexible model) "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oVYQoNUSWit2",
"papermill": {
"duration": 0.051694,
"end_time": "2021-07-22T21:33:28.071250",
"exception": false,
"start_time": "2021-07-22T21:33:28.019556",
"status": "completed"
},
"tags": []
},
"source": [
"The lasso regression works better for the more complex model."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p19rDtdYWit2",
"papermill": {
"duration": 0.043014,
"end_time": "2021-07-22T21:33:28.156922",
"exception": false,
"start_time": "2021-07-22T21:33:28.113908",
"status": "completed"
},
"tags": []
},
"source": [
"In contrast to a theory-based choice of the tuning parameter $\\lambda$ in the lasso regression, we can also use cross-validation to determine the penalty level by applying the package *glmnet* and the function cv.glmnet. In this context, we also run a ridge and a elastic net regression by adjusting the parameter *alpha*."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:28.245441Z",
"iopub.status.busy": "2021-07-22T21:33:28.244334Z",
"iopub.status.idle": "2021-07-22T21:33:30.309699Z",
"shell.execute_reply": "2021-07-22T21:33:30.308186Z"
},
"executionInfo": {
"elapsed": 676,
"status": "ok",
"timestamp": 1658250113060,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "VO4vXSlsWit3",
"outputId": "fb9c95e3-14a6-4153-996c-6ed658693581",
"papermill": {
"duration": 2.110744,
"end_time": "2021-07-22T21:33:30.309881",
"exception": false,
"start_time": "2021-07-22T21:33:28.199137",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Loading required package: Matrix\n",
"\n",
"\n",
"Attaching package: ‘Matrix’\n",
"\n",
"\n",
"The following objects are masked from ‘package:tidyr’:\n",
"\n",
" expand, pack, unpack\n",
"\n",
"\n",
"Loaded glmnet 4.1-4\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"R^2 using cross-validation for lasso, ridge and elastic net in the basic model: 0.2203463 0.2055758 0.2192991"
]
}
],
"source": [
"library(glmnet)\n",
"fit.lasso.cv <- cv.glmnet(model_X_basic_train, Y_train, family=\"gaussian\", alpha=1)\n",
"fit.ridge <- cv.glmnet(model_X_basic_train, Y_train, family=\"gaussian\", alpha=0)\n",
"fit.elnet <- cv.glmnet(model_X_basic_train, Y_train, family=\"gaussian\", alpha=.5)\n",
"\n",
"yhat.lasso.cv <- predict(fit.lasso.cv, newx = model_X_basic_test)\n",
"yhat.ridge <- predict(fit.ridge, newx = model_X_basic_test)\n",
"yhat.elnet <- predict(fit.elnet, newx = model_X_basic_test)\n",
"\n",
"MSE.lasso.cv <- summary(lm((Y_test-yhat.lasso.cv)^2~1))$coef[1:2]\n",
"MSE.ridge <- summary(lm((Y_test-yhat.ridge)^2~1))$coef[1:2]\n",
"MSE.elnet <- summary(lm((Y_test-yhat.elnet)^2~1))$coef[1:2]\n",
"\n",
"R2.lasso.cv <- 1-MSE.lasso.cv[1]/var(Y_test)\n",
"R2.ridge <- 1-MSE.ridge[1]/var(Y_test)\n",
"R2.elnet <- 1-MSE.elnet[1]/var(Y_test)\n",
"\n",
"# R^2 using cross-validation (basic model) \n",
"cat(\"R^2 using cross-validation for lasso, ridge and elastic net in the basic model:\",R2.lasso.cv,R2.ridge,R2.elnet)"
]
},
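{
"cell_type": "markdown",
"metadata": {},
"source": [
"When *predict* is called on a *cv.glmnet* object without the argument `s`, it uses the penalty `lambda.1se` (the largest $\\lambda$ whose cross-validated error is within one standard error of the minimum), not the error-minimizing `lambda.min`. The sketch below shows both choices on R's built-in `mtcars` data. The names prefixed with `sketch_cv` are hypothetical, and the fixed fold assignment via `foldid` is an assumption made so that the example does not depend on the random number generator."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"library(glmnet)\n",
"\n",
"sketch_cv_X <- as.matrix(mtcars[, -1])  # regressors: all columns except mpg\n",
"sketch_cv_y <- mtcars$mpg\n",
"\n",
"# Fixed 4-fold assignment instead of random folds\n",
"sketch_cv_folds <- rep(1:4, length.out = nrow(sketch_cv_X))\n",
"\n",
"sketch_cv_fit <- cv.glmnet(sketch_cv_X, sketch_cv_y, family = \"gaussian\",\n",
"                           alpha = 1, foldid = sketch_cv_folds)\n",
"\n",
"# The two candidate penalty levels\n",
"c(lambda.min = sketch_cv_fit$lambda.min, lambda.1se = sketch_cv_fit$lambda.1se)\n",
"\n",
"# Predictions under each choice\n",
"sketch_cv_pred_1se <- predict(sketch_cv_fit, newx = sketch_cv_X)  # default s = \"lambda.1se\"\n",
"sketch_cv_pred_min <- predict(sketch_cv_fit, newx = sketch_cv_X, s = \"lambda.min\")"
]
},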
{
"cell_type": "markdown",
"metadata": {
"id": "HXyV0kqUWit4",
"papermill": {
"duration": 0.04689,
"end_time": "2021-07-22T21:33:30.403846",
"exception": false,
"start_time": "2021-07-22T21:33:30.356956",
"status": "completed"
},
"tags": []
},
"source": [
"Note that the following calculations for the flexible model require significant computation time."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:30.499405Z",
"iopub.status.busy": "2021-07-22T21:33:30.498099Z",
"iopub.status.idle": "2021-07-22T21:33:44.141438Z",
"shell.execute_reply": "2021-07-22T21:33:44.139990Z"
},
"executionInfo": {
"elapsed": 883,
"status": "ok",
"timestamp": 1658250113936,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "-CtIV7FDWit4",
"outputId": "f1154812-104e-4a4b-946d-b98eca9e2097",
"papermill": {
"duration": 13.691447,
"end_time": "2021-07-22T21:33:44.141768",
"exception": false,
"start_time": "2021-07-22T21:33:30.450321",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R^2 using cross-validation for lasso, ridge and elastic net in the flexible model: 0.2201523 0.2062207 0.2211247"
]
}
],
"source": [
"fit.lasso.cv.flex <- cv.glmnet(model_X_flex_train, Y_train, family=\"gaussian\", alpha=1)\n",
"fit.ridge.flex <- cv.glmnet(model_X_flex_train, Y_train, family=\"gaussian\", alpha=0)\n",
"fit.elnet.flex <- cv.glmnet(model_X_flex_train, Y_train, family=\"gaussian\", alpha=.5)\n",
"\n",
"yhat.lasso.cv.flex <- predict(fit.lasso.cv.flex, newx = model_X_flex_test)\n",
"yhat.ridge.flex <- predict(fit.ridge.flex, newx = model_X_flex_test)\n",
"yhat.elnet.flex <- predict(fit.elnet.flex, newx = model_X_flex_test)\n",
"\n",
"MSE.lasso.cv.flex <- summary(lm((Y_test-yhat.lasso.cv.flex)^2~1))$coef[1:2]\n",
"MSE.ridge.flex <- summary(lm((Y_test-yhat.ridge.flex)^2~1))$coef[1:2]\n",
"MSE.elnet.flex <- summary(lm((Y_test-yhat.elnet.flex)^2~1))$coef[1:2]\n",
"\n",
"R2.lasso.cv.flex <- 1-MSE.lasso.cv.flex[1]/var(Y_test)\n",
"R2.ridge.flex <- 1-MSE.ridge.flex[1]/var(Y_test)\n",
"R2.elnet.flex <- 1-MSE.elnet.flex[1]/var(Y_test)\n",
"\n",
"# R^2 using cross-validation (flexible model) \n",
"cat(\"R^2 using cross-validation for lasso, ridge and elastic net in the flexible model:\",R2.lasso.cv.flex,R2.ridge.flex,R2.elnet.flex)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TjeCfaakWit6",
"papermill": {
"duration": 0.043216,
"end_time": "2021-07-22T21:33:44.229372",
"exception": false,
"start_time": "2021-07-22T21:33:44.186156",
"status": "completed"
},
"tags": []
},
"source": [
"The performance of the lasso regression with a cross-validated penalty is quite similar to the performance of lasso with a theoretically based choice of the tuning parameter."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ypNrjgnwWit6",
"papermill": {
"duration": 0.043279,
"end_time": "2021-07-22T21:33:44.316231",
"exception": false,
"start_time": "2021-07-22T21:33:44.272952",
"status": "completed"
},
"tags": []
},
"source": [
"## Non-linear models"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qkm_TFteWit6",
"papermill": {
"duration": 0.043761,
"end_time": "2021-07-22T21:33:44.403514",
"exception": false,
"start_time": "2021-07-22T21:33:44.359753",
"status": "completed"
},
"tags": []
},
"source": [
"Besides linear regression models, we consider nonlinear regression models to build a predictive model. We apply regression trees, random forests, and boosted trees to estimate the regression function $g(X)$. First, we load the relevant libraries."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"execution": {
"iopub.execute_input": "2021-07-22T21:33:44.494395Z",
"iopub.status.busy": "2021-07-22T21:33:44.493268Z",
"iopub.status.idle": "2021-07-22T21:33:44.752574Z",
"shell.execute_reply": "2021-07-22T21:33:44.751346Z"
},
"executionInfo": {
"elapsed": 5,
"status": "ok",
"timestamp": 1658250113937,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "LHITdYUFWit6",
"papermill": {
"duration": 0.30601,
"end_time": "2021-07-22T21:33:44.752736",
"exception": false,
"start_time": "2021-07-22T21:33:44.446726",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"library(randomForest)\n",
"library(rpart)\n",
"library(nnet)\n",
"library(gbm)\n",
"library(rpart.plot)\n",
"library(keras)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "s8QdDVokWit7",
"papermill": {
"duration": 0.043594,
"end_time": "2021-07-22T21:33:44.840683",
"exception": false,
"start_time": "2021-07-22T21:33:44.797089",
"status": "completed"
},
"tags": []
},
"source": [
"We begin by illustrating the application of regression trees."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "npHGkYKsWit7",
"papermill": {
"duration": 0.043219,
"end_time": "2021-07-22T21:33:44.927600",
"exception": false,
"start_time": "2021-07-22T21:33:44.884381",
"status": "completed"
},
"tags": []
},
"source": [
"### Regression Trees"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GBDE9z5yWit7",
"papermill": {
"duration": 0.043147,
"end_time": "2021-07-22T21:33:45.014486",
"exception": false,
"start_time": "2021-07-22T21:33:44.971339",
"status": "completed"
},
"tags": []
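},
"source": [
"We fit a regression tree to the training data using the basic model. The parameter `cp` controls the complexity of the regression tree: a split is only attempted if it improves the overall fit by at least a factor of `cp`, so smaller values of `cp` produce deeper trees."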
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 454
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:45.108375Z",
"iopub.status.busy": "2021-07-22T21:33:45.106930Z",
"iopub.status.idle": "2021-07-22T21:33:46.833179Z",
"shell.execute_reply": "2021-07-22T21:33:46.833557Z"
},
"executionInfo": {
"elapsed": 1544,
"status": "ok",
"timestamp": 1658250115476,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "z4-0ZKerWit7",
"outputId": "cdd56787-b884-4954-9572-c873e7829de7",
"papermill": {
"duration": 1.774435,
"end_time": "2021-07-22T21:33:46.833743",
"exception": false,
"start_time": "2021-07-22T21:33:45.059308",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cex 0.2 xlim c(0, 1) ylim c(0, 1)\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"plot without title"
]
},
"metadata": {
"image/png": {
"height": 420,
"width": 420
}
},
"output_type": "display_data"
}
],
"source": [
"# fit the tree\n",
"fit.trees <- rpart(formula_basic, data_train,cp = 0.001)\n",
"prp(fit.trees,leaf.round=1, space=2, yspace=2,split.space=2,shadow.col = \"gray\",trace = 1) # plotting the tree"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "G3QZT03wWit8",
"papermill": {
"duration": 0.047977,
"end_time": "2021-07-22T21:33:46.929812",
"exception": false,
"start_time": "2021-07-22T21:33:46.881835",
"status": "completed"
},
"tags": []
},
"source": [
"An important method for improving predictive performance is pruning the tree, i.e., cutting back branches that contribute little to predictive accuracy. We apply pruning to the complex tree above to reduce its depth. First, we determine the optimal complexity of the regression tree, taking the value of `cp` with the smallest cross-validated error."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:47.029943Z",
"iopub.status.busy": "2021-07-22T21:33:47.028067Z",
"iopub.status.idle": "2021-07-22T21:33:47.046676Z",
"shell.execute_reply": "2021-07-22T21:33:47.045274Z"
},
"executionInfo": {
"elapsed": 36,
"status": "ok",
"timestamp": 1658250115477,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "EumyiAhqWit8",
"outputId": "44570ce8-fcd1-4478-d787-289d8c8c77f2",
"papermill": {
"duration": 0.069407,
"end_time": "2021-07-22T21:33:47.046815",
"exception": false,
"start_time": "2021-07-22T21:33:46.977408",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"data": {
"text/html": [
"0.00188444410871555"
],
"text/latex": [
"0.00188444410871555"
],
"text/markdown": [
"0.00188444410871555"
],
"text/plain": [
"[1] 0.001884444"
]
},
"metadata": {},
"output_type": "display_data"
}
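],
"source": [
"# pick the complexity parameter with the smallest cross-validated error (column \"xerror\" of the cp table)\n",
"bestcp <- fit.trees$cptable[which.min(fit.trees$cptable[,\"xerror\"]),\"CP\"]\n",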
"bestcp"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IG7NtCahWit9",
"papermill": {
"duration": 0.04778,
"end_time": "2021-07-22T21:33:47.142808",
"exception": false,
"start_time": "2021-07-22T21:33:47.095028",
"status": "completed"
},
"tags": []
},
"source": [
"Now, we can prune the tree and visualize the prediction rule."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 454
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:47.245777Z",
"iopub.status.busy": "2021-07-22T21:33:47.243764Z",
"iopub.status.idle": "2021-07-22T21:33:47.717872Z",
"shell.execute_reply": "2021-07-22T21:33:47.717444Z"
},
"executionInfo": {
"elapsed": 666,
"status": "ok",
"timestamp": 1658250116115,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "-mIXdvQmWit9",
"outputId": "d66e227a-bdbb-472a-aa22-707070a59da8",
"papermill": {
"duration": 0.526641,
"end_time": "2021-07-22T21:33:47.718019",
"exception": false,
"start_time": "2021-07-22T21:33:47.191378",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cex 0.438 xlim c(0, 1) ylim c(0, 1)\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"plot without title"
]
},
"metadata": {
"image/png": {
"height": 420,
"width": 420
}
},
"output_type": "display_data"
}
],
"source": [
"fit.prunedtree <- prune(fit.trees,cp=bestcp)\n",
"prp(fit.prunedtree,leaf.round=1, space=3, yspace=3, split.space=7, shadow.col = \"gray\",trace = 1,yesno=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pYAym1xTWit9",
"papermill": {
"duration": 0.051023,
"end_time": "2021-07-22T21:33:47.820426",
"exception": false,
"start_time": "2021-07-22T21:33:47.769403",
"status": "completed"
},
"tags": []
},
"source": [
"For example, in the pruned tree the predicted log hourly wage for high-school graduates with more than $9.5$ years of experience is $2.8$, and $2.6$ otherwise."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Leso28bjWit9",
"papermill": {
"duration": 0.050261,
"end_time": "2021-07-22T21:33:47.921859",
"exception": false,
"start_time": "2021-07-22T21:33:47.871598",
"status": "completed"
},
"tags": []
},
"source": [
"Finally, we calculate the mean-squared error and the $R^2$ on the test sample to evaluate the out-of-sample performance of the pruned tree."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:48.028361Z",
"iopub.status.busy": "2021-07-22T21:33:48.027097Z",
"iopub.status.idle": "2021-07-22T21:33:48.049467Z",
"shell.execute_reply": "2021-07-22T21:33:48.048085Z"
},
"executionInfo": {
"elapsed": 15,
"status": "ok",
"timestamp": 1658250116117,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "-GdlHf6sWit9",
"outputId": "395779d8-1795-4775-b811-b457792799d9",
"papermill": {
"duration": 0.077314,
"end_time": "2021-07-22T21:33:48.049633",
"exception": false,
"start_time": "2021-07-22T21:33:47.972319",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R^2 of the pruned tree: 0.2250066"
]
}
],
"source": [
"yhat.pt <- predict(fit.prunedtree,newdata=data_test)\n",
"MSE.pt <- summary(lm((Y_test-yhat.pt)^2~1))$coef[1:2]\n",
"R2.pt <- 1-MSE.pt[1]/var(Y_test)\n",
"\n",
"# R^2 of the pruned tree\n",
"cat(\"R^2 of the pruned tree:\",R2.pt)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FRT62uP-Wit-",
"papermill": {
"duration": 0.050968,
"end_time": "2021-07-22T21:33:48.152253",
"exception": false,
"start_time": "2021-07-22T21:33:48.101285",
"status": "completed"
},
"tags": []
},
"source": [
"### Random Forest and Boosted Trees"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0u9KdQZrWit-",
"papermill": {
"duration": 0.050209,
"end_time": "2021-07-22T21:33:48.253970",
"exception": false,
"start_time": "2021-07-22T21:33:48.203761",
"status": "completed"
},
"tags": []
},
"source": [
"In the next step, we apply the more advanced tree-based methods: random forest and boosted trees."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2021-07-22T21:33:48.360982Z",
"iopub.status.busy": "2021-07-22T21:33:48.360463Z",
"iopub.status.idle": "2021-07-22T21:34:43.865598Z",
"shell.execute_reply": "2021-07-22T21:34:43.864364Z"
},
"executionInfo": {
"elapsed": 37648,
"status": "ok",
"timestamp": 1658250153754,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "hAvYz1zbWit-",
"outputId": "6d7ad332-f3f6-461c-8547-50d0d6947d38",
"papermill": {
"duration": 55.561432,
"end_time": "2021-07-22T21:34:43.865807",
"exception": false,
"start_time": "2021-07-22T21:33:48.304375",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"R^2 of the random forest and boosted trees: 0.26582 0.2675069"
]
}
],
"source": [
"## Applying the methods\n",
"# random forest\n",
"fit.rf <- randomForest(formula_basic, ntree=2000, nodesize=5, data=data_train)\n",
"# for tuning: adjust input \"mtry\" to change the number of variables randomly sampled as candidates at each split\n",
"\n",
"# boosting\n",
"fit.boost <- gbm(formula_basic, data=data_train, distribution= \"gaussian\", bag.fraction = .5, interaction.depth=2, n.trees=1000, shrinkage=.01)\n",
"best.boost <- gbm.perf(fit.boost, plot.it = FALSE) # out-of-bag (OOB) estimate of when to stop, since no cv.folds were set in gbm()\n",
"\n",
"## Evaluating the methods\n",
"yhat.rf <- predict(fit.rf, newdata=data_test) # prediction\n",
"yhat.boost <- predict(fit.boost, newdata=data_test, n.trees=best.boost)\n",
"\n",
"MSE.rf <- summary(lm((Y_test-yhat.rf)^2~1))$coef[1:2]\n",
"MSE.boost <- summary(lm((Y_test-yhat.boost)^2~1))$coef[1:2]\n",
"R2.rf <- 1-MSE.rf[1]/var(Y_test)\n",
"R2.boost <- 1-MSE.boost[1]/var(Y_test)\n",
"\n",
"# printing R^2\n",
"cat(\"R^2 of the random forest and boosted trees:\",R2.rf,R2.boost)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y3SItRnPWit_",
"papermill": {
"duration": 0.050943,
"end_time": "2021-07-22T21:34:43.969202",
"exception": false,
"start_time": "2021-07-22T21:34:43.918259",
"status": "completed"
},
"tags": []
},
"source": [
"To conclude, let us have a look at our results."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Sj1p6gTaWit_",
"papermill": {
"duration": 0.051543,
"end_time": "2021-07-22T21:34:44.072428",
"exception": false,
"start_time": "2021-07-22T21:34:44.020885",
"status": "completed"
},
"tags": []
},
"source": [
"## Results"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 537
},
"execution": {
"iopub.execute_input": "2021-07-22T21:34:44.181135Z",
"iopub.status.busy": "2021-07-22T21:34:44.179921Z",
"iopub.status.idle": "2021-07-22T21:34:44.286892Z",
"shell.execute_reply": "2021-07-22T21:34:44.285724Z"
},
"executionInfo": {
"elapsed": 17,
"status": "ok",
"timestamp": 1658250153756,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "P_ijvjVPWit_",
"outputId": "cf17f27c-ebbe-49de-d476-0be22b447326",
"papermill": {
"duration": 0.163242,
"end_time": "2021-07-22T21:34:44.287033",
"exception": false,
"start_time": "2021-07-22T21:34:44.123791",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
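{
"data": {
"text/html": [
"<table class=\"dataframe\">\n",
"<caption>A matrix: 15 × 3 of type dbl</caption>\n",
"<thead>\n",
"\t<tr><th></th><th>MSE</th><th>S.E. for MSE</th><th>R-squared</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th>Least Squares (basic)</th><td>0.2496282</td><td>0.01558452</td><td>0.2185201</td></tr>\n",
"\t<tr><th>Least Squares (flexible)</th><td>0.2502538</td><td>0.01557896</td><td>0.2165618</td></tr>\n",
"\t<tr><th>Lasso</th><td>0.2496625</td><td>0.01511149</td><td>0.2184129</td></tr>\n",
"\t<tr><th>Post-Lasso</th><td>0.2485137</td><td>0.01537409</td><td>0.2220093</td></tr>\n",
"\t<tr><th>Lasso (flexible)</th><td>0.2502069</td><td>0.01503037</td><td>0.2167083</td></tr>\n",
"\t<tr><th>Post-Lasso (flexible)</th><td>0.2485137</td><td>0.01537409</td><td>0.2220093</td></tr>\n",
"\t<tr><th>Cross-Validated lasso</th><td>0.2490449</td><td>0.01519297</td><td>0.2203463</td></tr>\n",
"\t<tr><th>Cross-Validated ridge</th><td>0.2537630</td><td>0.01536035</td><td>0.2055758</td></tr>\n",
"\t<tr><th>Cross-Validated elnet</th><td>0.2493794</td><td>0.01520680</td><td>0.2192991</td></tr>\n",
"\t<tr><th>Cross-Validated lasso (flexible)</th><td>0.2491068</td><td>0.01512894</td><td>0.2201523</td></tr>\n",
"\t<tr><th>Cross-Validated ridge (flexible)</th><td>0.2535570</td><td>0.01535679</td><td>0.2062207</td></tr>\n",
"\t<tr><th>Cross-Validated elnet (flexible)</th><td>0.2487962</td><td>0.01515842</td><td>0.2211247</td></tr>\n",
"\t<tr><th>Random Forest</th><td>0.2345192</td><td>0.01550091</td><td>0.2658200</td></tr>\n",
"\t<tr><th>Boosted Trees</th><td>0.2339804</td><td>0.01502881</td><td>0.2675069</td></tr>\n",
"\t<tr><th>Pruned Tree</th><td>0.2475562</td><td>0.01548539</td><td>0.2250066</td></tr>\n",
"</tbody>\n",
"</table>\n"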
],
"text/latex": [
"A matrix: 15 × 3 of type dbl\n",
"\\begin{tabular}{r|lll}\n",
" & MSE & S.E. for MSE & R-squared\\\\\n",
"\\hline\n",
"\tLeast Squares (basic) & 0.2496282 & 0.01558452 & 0.2185201\\\\\n",
"\tLeast Squares (flexible) & 0.2502538 & 0.01557896 & 0.2165618\\\\\n",
"\tLasso & 0.2496625 & 0.01511149 & 0.2184129\\\\\n",
"\tPost-Lasso & 0.2485137 & 0.01537409 & 0.2220093\\\\\n",
"\tLasso (flexible) & 0.2502069 & 0.01503037 & 0.2167083\\\\\n",
"\tPost-Lasso (flexible) & 0.2485137 & 0.01537409 & 0.2220093\\\\\n",
"\tCross-Validated lasso & 0.2490449 & 0.01519297 & 0.2203463\\\\\n",
"\tCross-Validated ridge & 0.2537630 & 0.01536035 & 0.2055758\\\\\n",
"\tCross-Validated elnet & 0.2493794 & 0.01520680 & 0.2192991\\\\\n",
"\tCross-Validated lasso (flexible) & 0.2491068 & 0.01512894 & 0.2201523\\\\\n",
"\tCross-Validated ridge (flexible) & 0.2535570 & 0.01535679 & 0.2062207\\\\\n",
"\tCross-Validated elnet (flexible) & 0.2487962 & 0.01515842 & 0.2211247\\\\\n",
"\tRandom Forest & 0.2345192 & 0.01550091 & 0.2658200\\\\\n",
"\tBoosted Trees & 0.2339804 & 0.01502881 & 0.2675069\\\\\n",
"\tPruned Tree & 0.2475562 & 0.01548539 & 0.2250066\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A matrix: 15 × 3 of type dbl\n",
"\n",
"| | MSE | S.E. for MSE | R-squared |\n",
"|---|---|---|---|\n",
"| Least Squares (basic) | 0.2496282 | 0.01558452 | 0.2185201 |\n",
"| Least Squares (flexible) | 0.2502538 | 0.01557896 | 0.2165618 |\n",
"| Lasso | 0.2496625 | 0.01511149 | 0.2184129 |\n",
"| Post-Lasso | 0.2485137 | 0.01537409 | 0.2220093 |\n",
"| Lasso (flexible) | 0.2502069 | 0.01503037 | 0.2167083 |\n",
"| Post-Lasso (flexible) | 0.2485137 | 0.01537409 | 0.2220093 |\n",
"| Cross-Validated lasso | 0.2490449 | 0.01519297 | 0.2203463 |\n",
"| Cross-Validated ridge | 0.2537630 | 0.01536035 | 0.2055758 |\n",
"| Cross-Validated elnet | 0.2493794 | 0.01520680 | 0.2192991 |\n",
"| Cross-Validated lasso (flexible) | 0.2491068 | 0.01512894 | 0.2201523 |\n",
"| Cross-Validated ridge (flexible) | 0.2535570 | 0.01535679 | 0.2062207 |\n",
"| Cross-Validated elnet (flexible) | 0.2487962 | 0.01515842 | 0.2211247 |\n",
"| Random Forest | 0.2345192 | 0.01550091 | 0.2658200 |\n",
"| Boosted Trees | 0.2339804 | 0.01502881 | 0.2675069 |\n",
"| Pruned Tree | 0.2475562 | 0.01548539 | 0.2250066 |\n",
"\n"
],
"text/plain": [
" MSE S.E. for MSE R-squared\n",
"Least Squares (basic) 0.2496282 0.01558452 0.2185201\n",
"Least Squares (flexible) 0.2502538 0.01557896 0.2165618\n",
"Lasso 0.2496625 0.01511149 0.2184129\n",
"Post-Lasso 0.2485137 0.01537409 0.2220093\n",
"Lasso (flexible) 0.2502069 0.01503037 0.2167083\n",
"Post-Lasso (flexible) 0.2485137 0.01537409 0.2220093\n",
"Cross-Validated lasso 0.2490449 0.01519297 0.2203463\n",
"Cross-Validated ridge 0.2537630 0.01536035 0.2055758\n",
"Cross-Validated elnet 0.2493794 0.01520680 0.2192991\n",
"Cross-Validated lasso (flexible) 0.2491068 0.01512894 0.2201523\n",
"Cross-Validated ridge (flexible) 0.2535570 0.01535679 0.2062207\n",
"Cross-Validated elnet (flexible) 0.2487962 0.01515842 0.2211247\n",
"Random Forest 0.2345192 0.01550091 0.2658200\n",
"Boosted Trees 0.2339804 0.01502881 0.2675069\n",
"Pruned Tree 0.2475562 0.01548539 0.2250066"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# library(xtable)\n",
"table<- matrix(0, 15, 3)\n",
"table[1,1:2] <- MSE_lm_basic\n",
"table[2,1:2] <- MSE_lm_flex\n",
"table[3,1:2] <- MSE.lasso\n",
"table[4,1:2] <- MSE.lasso.post\n",
"table[5,1:2] <- MSE.lasso.flex\n",
"table[6,1:2] <- MSE.lasso.post.flex\n",
"table[7,1:2] <- MSE.lasso.cv\n",
"table[8,1:2] <- MSE.ridge\n",
"table[9,1:2] <- MSE.elnet\n",
"table[10,1:2] <- MSE.lasso.cv.flex\n",
"table[11,1:2] <- MSE.ridge.flex\n",
"table[12,1:2] <- MSE.elnet.flex\n",
"table[13,1:2] <- MSE.rf\n",
"table[14,1:2] <- MSE.boost\n",
"table[15,1:2] <- MSE.pt\n",
"\n",
"table[1,3] <- R2_lm_basic\n",
"table[2,3] <- R2_lm_flex\n",
"table[3,3] <- R2.lasso\n",
"table[4,3] <- R2.lasso.post\n",
"table[5,3] <- R2.lasso.flex\n",
"table[6,3] <- R2.lasso.post.flex\n",
"table[7,3] <- R2.lasso.cv\n",
"table[8,3] <- R2.ridge\n",
"table[9,3] <- R2.elnet\n",
"table[10,3] <- R2.lasso.cv.flex\n",
"table[11,3] <- R2.ridge.flex\n",
"table[12,3] <- R2.elnet.flex\n",
"table[13,3] <- R2.rf\n",
"table[14,3] <- R2.boost\n",
"table[15,3] <- R2.pt\n",
"\n",
"colnames(table)<- c(\"MSE\", \"S.E. for MSE\", \"R-squared\")\n",
"rownames(table)<- c(\"Least Squares (basic)\",\"Least Squares (flexible)\", \"Lasso\", \"Post-Lasso\",\"Lasso (flexible)\",\"Post-Lasso (flexible)\", \n",
" \"Cross-Validated lasso\", \"Cross-Validated ridge\",\"Cross-Validated elnet\",\"Cross-Validated lasso (flexible)\",\"Cross-Validated ridge (flexible)\",\"Cross-Validated elnet (flexible)\", \n",
" \"Random Forest\",\"Boosted Trees\", \"Pruned Tree\")\n",
"table"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TpqFypSbWit_",
"papermill": {
"duration": 0.052757,
"end_time": "2021-07-22T21:34:44.392711",
"exception": false,
"start_time": "2021-07-22T21:34:44.339954",
"status": "completed"
},
"tags": []
},
"source": [
"Above, we have displayed the results for a single split of data into the training and testing part. The table shows the test MSE in column 1 as well as the standard error in column 2 and the test $R^2$ in column 3.\n",
"\n",
"We see that the tree-based ensembles perform best here: Boosted Trees give the lowest test MSE ($0.234$), closely followed by Random Forest ($0.235$). Among the linear methods, Post-Lasso and the cross-validated Elastic Net on the flexible model do best (test MSE of about $0.249$), and the remaining Lasso and OLS variants are nearly as good; the test MSEs of all linear methods lie within one standard error of each other. Remarkably, OLS on the basic model performs well, with a test MSE within about one standard error of that of Random Forest. OLS on the flexible model with many regressors does slightly worse than OLS on the basic model, and cross-validated Ridge gives the highest test MSE. Notice that the nonlinear models, e.g. Random Forest, are not tuned. Thus, there is a lot of potential to improve the performance of the nonlinear methods used in the analysis."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YFol0C-XWiuA",
"papermill": {
"duration": 0.052773,
"end_time": "2021-07-22T21:34:44.497668",
"exception": false,
"start_time": "2021-07-22T21:34:44.444895",
"status": "completed"
},
"tags": []
},
"source": [
"## Ensemble learning"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P40FBHxRWiuA",
"papermill": {
"duration": 0.052345,
"end_time": "2021-07-22T21:34:44.603135",
"exception": false,
"start_time": "2021-07-22T21:34:44.550790",
"status": "completed"
},
"tags": []
},
"source": [
"In the final step, we can build a prediction model by combining the strengths of the models we considered so far. This ensemble method is of the form\n",
"\n",
"$$ f(x) = \\sum_{k=1}^K \\alpha_k f_k(x) $$\n",
" \n",
"where the $f_k$'s denote our prediction rules from the table above and the $\\alpha_k$'s are the corresponding weights."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nYVckjqeWiuA",
"papermill": {
"duration": 0.052659,
"end_time": "2021-07-22T21:34:44.710467",
"exception": false,
"start_time": "2021-07-22T21:34:44.657808",
"status": "completed"
},
"tags": []
},
"source": [
"We focus on the prediction rules based on OLS, Post-Lasso, Elastic Net, Pruned Tree, Random Forest, and Boosted Trees and combine these methods into an ensemble method. The appropriate weights can be determined by a simple OLS regression of the test outcomes on the predictions:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 434
},
"execution": {
"iopub.execute_input": "2021-07-22T21:34:44.820377Z",
"iopub.status.busy": "2021-07-22T21:34:44.819096Z",
"iopub.status.idle": "2021-07-22T21:34:44.841481Z",
"shell.execute_reply": "2021-07-22T21:34:44.840197Z"
},
"executionInfo": {
"elapsed": 621,
"status": "ok",
"timestamp": 1658250154361,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "Xs9ndCggWiuA",
"outputId": "3c5c35fc-7a9d-4882-daa1-27b50ffda431",
"papermill": {
"duration": 0.078475,
"end_time": "2021-07-22T21:34:44.841619",
"exception": false,
"start_time": "2021-07-22T21:34:44.763144",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"Call:\n",
"lm(formula = Y_test ~ yhat_lm_basic + yhat.rlasso.post.flex + \n",
" yhat.elnet.flex + yhat.pt + yhat.rf + yhat.boost)\n",
"\n",
"Residuals:\n",
" Min 1Q Median 3Q Max \n",
"-1.7261 -0.2821 -0.0141 0.2717 3.6299 \n",
"\n",
"Coefficients:\n",
" Estimate Std. Error t value Pr(>|t|) \n",
"(Intercept) -0.13636 0.28142 -0.485 0.62809 \n",
"yhat_lm_basic -0.05963 0.18749 -0.318 0.75050 \n",
"yhat.rlasso.post.flex 0.24141 0.41866 0.577 0.56429 \n",
"yhat.elnet.flex -0.18802 0.57863 -0.325 0.74529 \n",
"yhat.pt 0.01050 0.10543 0.100 0.92070 \n",
"yhat.rf 0.48543 0.09208 5.272 1.58e-07 ***\n",
"yhat.boost 0.55427 0.17909 3.095 0.00201 ** \n",
"---\n",
"Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n",
"\n",
"Residual standard error: 0.4789 on 1281 degrees of freedom\n",
"Multiple R-squared: 0.2855,\tAdjusted R-squared: 0.2821 \n",
"F-statistic: 85.31 on 6 and 1281 DF, p-value: < 2.2e-16\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"ensemble.ols <- summary(lm(Y_test~ yhat_lm_basic + yhat.rlasso.post.flex + yhat.elnet.flex+ yhat.pt+ yhat.rf + yhat.boost))\n",
"ensemble.ols"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y07LSBYrWiuB",
"papermill": {
"duration": 0.053593,
"end_time": "2021-07-22T21:34:44.948758",
"exception": false,
"start_time": "2021-07-22T21:34:44.895165",
"status": "completed"
},
"tags": []
},
"source": [
"Alternatively, we can determine the weights via lasso regression. "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 712
},
"execution": {
"iopub.execute_input": "2021-07-22T21:34:45.060856Z",
"iopub.status.busy": "2021-07-22T21:34:45.059440Z",
"iopub.status.idle": "2021-07-22T21:34:45.174027Z",
"shell.execute_reply": "2021-07-22T21:34:45.172952Z"
},
"executionInfo": {
"elapsed": 37,
"status": "ok",
"timestamp": 1658250154363,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "hlm-aAX4WiuB",
"outputId": "a17029d8-fa3b-4b63-bdaf-b9f89037e36b",
"papermill": {
"duration": 0.171986,
"end_time": "2021-07-22T21:34:45.174182",
"exception": false,
"start_time": "2021-07-22T21:34:45.002196",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Call:\n",
"rlasso.formula(formula = Y_test ~ yhat_lm_basic + yhat.rlasso.post.flex + \n",
" yhat.elnet.flex + yhat.pt + yhat.rf + yhat.boost)\n",
"\n",
"Post-Lasso Estimation: TRUE \n",
"\n",
"Total number of variables: 6\n",
"Number of selected variables: 2 \n",
"\n",
"Residuals: \n",
" Min 1Q Median 3Q Max \n",
"-1.72393 -0.28094 -0.01085 0.27243 3.61820 \n",
"\n",
" Estimate\n",
"(Intercept) -0.196\n",
"yhat_lm_basic 0.000\n",
"yhat.rlasso.post.flex 0.000\n",
"yhat.elnet.flex 0.000\n",
"yhat.pt 0.000\n",
"yhat.rf 0.475\n",
"yhat.boost 0.589\n",
"\n",
"Residual standard error: 0.4779\n",
"Multiple R-squared: 0.2851\n",
"Adjusted R-squared: 0.284\n",
"Joint significance test:\n",
" the sup score statistic for joint significance test is 3.399 with a p-value of 0.058\n"
]
},
{
"data": {
"text/plain": [
"\n",
"Call:\n",
"rlasso.formula(formula = Y_test ~ yhat_lm_basic + yhat.rlasso.post.flex + \n",
" yhat.elnet.flex + yhat.pt + yhat.rf + yhat.boost)\n",
"\n",
"Coefficients:\n",
" (Intercept) yhat_lm_basic yhat.rlasso.post.flex \n",
" -0.1963 0.0000 0.0000 \n",
" yhat.elnet.flex yhat.pt yhat.rf \n",
" 0.0000 0.0000 0.4750 \n",
" yhat.boost \n",
" 0.5892 \n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"ensemble.lasso <- summary(rlasso(Y_test~ yhat_lm_basic + yhat.rlasso.post.flex + yhat.elnet.flex+ yhat.pt+ yhat.rf + yhat.boost))\n",
"ensemble.lasso"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sceeSDLpWiuB",
"papermill": {
"duration": 0.05475,
"end_time": "2021-07-22T21:34:45.284948",
"exception": false,
"start_time": "2021-07-22T21:34:45.230198",
"status": "completed"
},
"tags": []
},
"source": [
"The estimated weights are shown in the following table."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 286
},
"execution": {
"iopub.execute_input": "2021-07-22T21:34:45.398344Z",
"iopub.status.busy": "2021-07-22T21:34:45.397127Z",
"iopub.status.idle": "2021-07-22T21:34:45.432227Z",
"shell.execute_reply": "2021-07-22T21:34:45.430974Z"
},
"executionInfo": {
"elapsed": 34,
"status": "ok",
"timestamp": 1658250154365,
"user": {
"displayName": "Jhon Kevin Flores Rojas",
"userId": "10267608749788811245"
},
"user_tz": 300
},
"id": "KPAWuQiMWiuC",
"outputId": "30a8ea04-e39b-4ce4-e679-c76c56f92325",
"papermill": {
"duration": 0.09307,
"end_time": "2021-07-22T21:34:45.432382",
"exception": false,
"start_time": "2021-07-22T21:34:45.339312",
"status": "completed"
},
"tags": [],
"vscode": {
"languageId": "r"
}
},
"outputs": [
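{
"data": {
"text/html": [
"<table class=\"dataframe\">\n",
"<caption>A matrix: 7 × 2 of type dbl</caption>\n",
"<thead>\n",
"\t<tr><th></th><th>Weight OLS</th><th>Weight Lasso</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><th>Constant</th><td>-0.13635940</td><td>-0.1963149</td></tr>\n",
"\t<tr><th>Least Squares (basic)</th><td>-0.05963050</td><td> 0.0000000</td></tr>\n",
"\t<tr><th>Post-Lasso (flexible)</th><td> 0.24140935</td><td> 0.0000000</td></tr>\n",
"\t<tr><th>Cross-Validated elnet (flexible)</th><td>-0.18801716</td><td> 0.0000000</td></tr>\n",
"\t<tr><th>Pruned Tree</th><td> 0.01049777</td><td> 0.0000000</td></tr>\n",
"\t<tr><th>Random Forest</th><td> 0.48543389</td><td> 0.4749684</td></tr>\n",
"\t<tr><th>Boosted Trees</th><td> 0.55426895</td><td> 0.5891756</td></tr>\n",
"</tbody>\n",
"</table>\n"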
],
"text/latex": [
"A matrix: 7 × 2 of type dbl\n",
"\\begin{tabular}{r|ll}\n",
" & Weight OLS & Weight Lasso\\\\\n",
"\\hline\n",
"\tConstant & -0.13635940 & -0.1963149\\\\\n",
"\tLeast Squares (basic) & -0.05963050 & 0.0000000\\\\\n",
"\tPost-Lasso (flexible) & 0.24140935 & 0.0000000\\\\\n",
"\tCross-Validated elnet (flexible) & -0.18801716 & 0.0000000\\\\\n",
"\tPruned Tree & 0.01049777 & 0.0000000\\\\\n",
"\tRandom Forest & 0.48543389 & 0.4749684\\\\\n",
"\tBoosted Trees & 0.55426895 & 0.5891756\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A matrix: 7 × 2 of type dbl\n",
"\n",
"| | Weight OLS | Weight Lasso |\n",
"|---|---|---|\n",
"| Constant | -0.13635940 | -0.1963149 |\n",
"| Least Squares (basic) | -0.05963050 | 0.0000000 |\n",
"| Post-Lasso (flexible) | 0.24140935 | 0.0000000 |\n",
"| Cross-Validated elnet (flexible) | -0.18801716 | 0.0000000 |\n",
"| Pruned Tree | 0.01049777 | 0.0000000 |\n",
"| Random Forest | 0.48543389 | 0.4749684 |\n",
"| Boosted Trees | 0.55426895 | 0.5891756 |\n",
"\n"
],
"text/plain": [
" Weight OLS Weight Lasso\n",
"Constant -0.13635940 -0.1963149 \n",
"Least Squares (basic) -0.05963050 0.0000000 \n",
"Post-Lasso (flexible) 0.24140935 0.0000000 \n",
"Cross-Validated elnet (flexible) -0.18801716 0.0000000 \n",
"Pruned Tree 0.01049777 0.0000000 \n",
"Random Forest 0.48543389 0.4749684 \n",
"Boosted Trees 0.55426895 0.5891756 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"table<- matrix(0, 7, 2)\n",
"table[1:7,1] <- ensemble.ols$coef[1:7]\n",
"table[1:7,2] <- ensemble.lasso$coef[1:7]\n",
"\n",
"colnames(table)<- c(\"Weight OLS\", \"Weight Lasso\")\n",
"rownames(table)<- c(\"Constant\",\"Least Squares (basic)\",\"Post-Lasso (flexible)\", \"Cross-Validated elnet (flexible)\", \"Pruned Tree\",\n",
" \"Random Forest\",\"Boosted Trees\")\n",
"table"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ijgxlqKMWiuC",
"papermill": {
"duration": 0.055235,
"end_time": "2021-07-22T21:34:45.543519",
"exception": false,
"start_time": "2021-07-22T21:34:45.488284",
"status": "completed"
},
"tags": []
},
"source": [
"Further, the $R^2$ for the test sample improves from about $22\\%$ obtained by OLS on the basic model to about $29\\%$ obtained by the ensemble method. We see that aggregating prediction rules into an ensemble rule can be very powerful. Nevertheless, because the ensemble weights are estimated on the test sample itself, we should compare the ensemble method and the single rules on an additional validation set to ensure a fair comparison."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "08_ml-for-wage-prediction.ipynb",
"provenance": []
},
"jupytext": {
"formats": "ipynb,auto:light"
},
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.2.1"
},
"papermill": {
"default_parameters": {},
"duration": 88.350056,
"end_time": "2021-07-22T21:34:45.707112",
"environment_variables": {},
"exception": null,
"input_path": "__notebook__.ipynb",
"output_path": "__notebook__.ipynb",
"parameters": {},
"start_time": "2021-07-22T21:33:17.357056",
"version": "2.2.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}