{ "cells": [ { "cell_type": "markdown", "id": "48053402", "metadata": {}, "source": [ "(ridge)=\n", "\n", "## Ridge Regression\n", "\n", "Ridge regression is a form of linear regression that helps prevent overfitting \n", "when the predictor variables are highly correlated. It is a regularization \n", "method, that is, a method specifically designed to reduce overfitting.\n", "\n", "While OLS regression minimizes the residual sum of squares (RSS), ridge regression \n", "minimizes the RSS plus a shrinkage penalty: $\\text{RSS} + \\lambda \\sum_{j=1}^{p} \\beta_j^2$, \n", "where the sum runs over the coefficients of the $p$ predictors.\n", "\n", "A larger λ means a harsher penalty and smaller coefficients, but past a certain point \n", "the coefficients become underestimated, which greatly increases the bias of the model.\n", "Following the bias-variance tradeoff, the λ chosen should introduce a little \n", "bias in exchange for a larger reduction in variance, so that the MSE is minimized.\n", "\n", "Basic Steps (for models with known multicollinearity):\n", "1. Standardize each predictor variable\n", "2. Fit the model and choose λ (using either a ridge trace plot or the MSE for each λ)\n", "3. Test model accuracy\n", "\n", "Biggest Drawback: no variable selection, so the final model includes all predictors\n", "- coefficients of non-significant predictors are shrunk very close to 0, but not exactly to 0\n", "\n", "Better predictions, but harder to interpret\n", "- good for models where most or all variables are significant\n", "\n", "### In Python\n", "\n", "The `Ridge` and `RidgeCV` classes are available through `sklearn.linear_model`. \n", "In scikit-learn, the penalty λ is set through the `alpha` parameter.\n", "\n", "* `RidgeCV`: ridge regression with built-in cross-validation over candidate `alpha` values" ] }, { "cell_type": "code", "execution_count": 1, "id": "dfbeebaf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "\n", "flowers = sns.load_dataset('iris')\n", "flowers = flowers.drop(['species'], axis = 1)\n", "flowers.head()" ] }, { "cell_type": "code", "execution_count": 2, "id": "a19d906a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width\n", "sepal_length 1.000000 -0.117570 0.871754 0.817941\n", "sepal_width -0.117570 1.000000 -0.428440 -0.366126\n", "petal_length 0.871754 -0.428440 1.000000 0.962865\n", "petal_width 0.817941 -0.366126 0.962865 1.000000" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "filenames": { "image/png": "/home/runner/work/ids-s22/ids-s22/notes/_build/jupyter_execute/docs/ridge_2_1.png" }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "flowers_corr = flowers.corr()\n", "sns.heatmap(flowers_corr, annot = True)\n", "plt.show\n", "\n", "flowers_corr" ] }, { "cell_type": "code", "execution_count": 3, "id": "0f986b61", "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 4, "id": "a8f67dc3", "metadata": {}, "outputs": [], "source": [ "X = flowers[['sepal_length', 'petal_width', 'petal_length']]\n", "y = flowers['sepal_width']\n", "\n", "scaler = StandardScaler()\n", "Xs = scaler.fit_transform(X)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)\n", "\n", "model = Ridge()\n", "\n", "model.fit(X_train, y_train)\n", "\n", "y_pred = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 5, "id": "4caf26d7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.5310770203976505\n" ] } ], "source": [ "from sklearn.metrics import r2_score\n", "\n", "print(r2_score(y_test, y_pred))" ] }, { "cell_type": "markdown", "id": "bcc1d664", "metadata": {}, "source": [ "Then we run the model again using regular OLS regression." ] }, { "cell_type": "code", "execution_count": 6, "id": "06e8fbcb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: sepal_width R-squared: 0.524\n", "Model: OLS Adj. R-squared: 0.514\n", "Method: Least Squares F-statistic: 53.58\n", "Date: Fri, 29 Jul 2022 Prob (F-statistic): 2.06e-23\n", "Time: 22:52:36 Log-Likelihood: -32.100\n", "No. Observations: 150 AIC: 72.20\n", "Df Residuals: 146 BIC: 84.24\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "--------------------------------------------------------------------------------\n", "Intercept 1.0431 0.271 3.855 0.000 0.508 1.578\n", "sepal_length 0.6071 0.062 9.765 0.000 0.484 0.730\n", "petal_width 0.5580 0.123 4.553 0.000 0.316 0.800\n", "petal_length -0.5860 0.062 -9.431 0.000 -0.709 -0.463\n", "==============================================================================\n", "Omnibus: 0.738 Durbin-Watson: 1.889\n", "Prob(Omnibus): 0.691 Jarque-Bera (JB): 0.426\n", "Skew: -0.102 Prob(JB): 0.808\n", "Kurtosis: 3.163 Cond. No. 82.1\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from statsmodels.formula.api import ols\n", "\n", "# Same response and predictors as the ridge model, fit by OLS on all observations\n", "model2 = ols('sepal_width ~ sepal_length + petal_width + petal_length', data = flowers)\n", "\n", "fit2 = model2.fit()\n", "fit2.summary()" ] }, { "cell_type": "markdown", "id": "e0fd48cd", "metadata": {}, "source": [ "The OLS R-squared value is 0.524, close to the ridge score of 0.531, so for this particular model, \n", "multicollinearity does not appear to have a significant effect. Note that the two values are not \n", "strictly comparable: the ridge score was computed on a held-out test set, while the OLS R-squared \n", "is computed on the same 150 observations used to fit the model."
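, "

To compare the two fits on equal footing, one option is to score both on the same held-out split, with the scaler fit on the training rows only and λ (scikit-learn's `alpha`) chosen by `RidgeCV`. The following is a minimal sketch under several assumptions that are not part of the notebook above: it reads scikit-learn's bundled copy of iris through `load_iris` instead of seaborn's, and the `alphas` grid and the names `ridge`, `ols_fit`, `r2_ridge`, and `r2_ols` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled iris data; columns are sepal length, sepal width,
# petal length, petal width (all in cm).
data = load_iris().data
X = data[:, [0, 3, 2]]  # sepal_length, petal_width, petal_length
y = data[:, 1]          # sepal_width

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate penalties (an illustrative grid, not a recommendation).
alphas = np.logspace(-3, 3, 13)

# The scaler sits inside the pipeline, so it is fit on the training rows only.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge.fit(X_train, y_train)

# Plain least squares on the same training rows, for comparison.
ols_fit = LinearRegression().fit(X_train, y_train)

best_alpha = ridge.named_steps['ridgecv'].alpha_
r2_ridge = r2_score(y_test, ridge.predict(X_test))
r2_ols = r2_score(y_test, ols_fit.predict(X_test))

print('chosen alpha:', best_alpha)
print('ridge test R^2:', round(r2_ridge, 3))
print('OLS test R^2:', round(r2_ols, 3))
```

If the two test-set scores come out close, that would support the reading that multicollinearity has little effect on predictive accuracy for this model.
"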
] }, { "cell_type": "markdown", "id": "705c1c24", "metadata": {}, "source": [ "More on [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?highlight=ridge#sklearn.linear_model.Ridge) and [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html?highlight=ridgecv#sklearn.linear_model.RidgeCV)" ] } ], "metadata": { "jupytext": { "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.11.5" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "source_map": [ 12, 48, 57, 67, 73, 89, 93, 97, 103, 107 ] }, "nbformat": 4, "nbformat_minor": 5 }