{
"cells": [
{
"cell_type": "markdown",
"id": "48053402",
"metadata": {},
"source": [
"(ridge)=\n",
"\n",
"## Ridge Regression\n",
"\n",
"Ridge regression is a method of linear regression that helps prevent overfitting a \n",
"model in the case of high correlation between variables. It is a regularization \n",
"method - a method specifically designed to reduce overfitting a model.\n",
"\n",
"While OLS regression minimizes the Residual Sum of Squares, Ridge regression is the \n",
"Residual Sum Squares + Shrinkage Penalty: $λΣβj^2$\n",
"\n",
"A larger λ means a harsher penalty and smaller coefficients, but at a certain point, \n",
"the coefficients will become underestimated, greatly increasing bias in the model.\n",
"Following the variance-bias tradeoff, the λ chosen must be one that introduces some \n",
"bias while minimizing the variance and MSE.\n",
"\n",
"Basic Steps (for models with known multicollinearity):\n",
"1. Standardize each predictor variable\n",
"2. Fit model and choose λ (either through ridge trace plot or MSE of each λ)\n",
"3. Test model accuracy\n",
"\n",
"Biggest Drawback: no variable selection = final model includes all predictors\n",
"- makes coefficients very close to 0 if not significant\n",
"\n",
"Better predictions, but harder to interpret\n",
"- good for models where most/all variables significant\n",
"\n",
"### In Python\n",
"\n",
"The `Ridge` and `RidgeCV` () are availble through `sklearn.linear_model`\n",
"\n",
"* `RidgeCV`:\n",
"- Ridge with a cross-validation option"
]
},
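{
"cell_type": "markdown",
"id": "3f9a1c7e",
"metadata": {},
"source": [
"The effect of λ described above can be sketched numerically. The cell below is a minimal \n",
"illustration on synthetic data, not part of the original analysis: the data, the seed, the \n",
"λ grid, and the helper name `ridge_coefs` are all assumptions made for this example. For \n",
"centered data with no intercept, the ridge coefficients have the closed form \n",
"$\\hat{\\beta} = (X^\\top X + \\lambda I)^{-1} X^\\top y$, so the length of the coefficient \n",
"vector should shrink as λ grows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b2d4e90",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Synthetic data (assumed for illustration): two highly correlated predictors\n",
"rng = np.random.default_rng(0)\n",
"n = 200\n",
"x1 = rng.normal(size=n)\n",
"x2 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1\n",
"X = np.column_stack([x1, x2])\n",
"y = 3 * x1 + rng.normal(size=n)\n",
"\n",
"# Center X and y so that no intercept term is needed\n",
"X = X - X.mean(axis=0)\n",
"y = y - y.mean()\n",
"\n",
"def ridge_coefs(X, y, lam):\n",
"    # Hypothetical helper: closed-form ridge solution (X'X + lam * I)^(-1) X'y\n",
"    p = X.shape[1]\n",
"    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)\n",
"\n",
"# lam = 0 is OLS; larger values shrink the coefficients further\n",
"lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]\n",
"norms = []\n",
"for lam in lambdas:\n",
"    beta = ridge_coefs(X, y, lam)\n",
"    norms.append(np.linalg.norm(beta))\n",
"    print(f'lambda = {lam:7.1f}, coefficients = {np.round(beta, 3)}, norm = {norms[-1]:.3f}')"
]
},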
{
"cell_type": "code",
"execution_count": 1,
"id": "dfbeebaf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" sepal_length \n",
" sepal_width \n",
" petal_length \n",
" petal_width \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 5.1 \n",
" 3.5 \n",
" 1.4 \n",
" 0.2 \n",
" \n",
" \n",
" 1 \n",
" 4.9 \n",
" 3.0 \n",
" 1.4 \n",
" 0.2 \n",
" \n",
" \n",
" 2 \n",
" 4.7 \n",
" 3.2 \n",
" 1.3 \n",
" 0.2 \n",
" \n",
" \n",
" 3 \n",
" 4.6 \n",
" 3.1 \n",
" 1.5 \n",
" 0.2 \n",
" \n",
" \n",
" 4 \n",
" 5.0 \n",
" 3.6 \n",
" 1.4 \n",
" 0.2 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width\n",
"0 5.1 3.5 1.4 0.2\n",
"1 4.9 3.0 1.4 0.2\n",
"2 4.7 3.2 1.3 0.2\n",
"3 4.6 3.1 1.5 0.2\n",
"4 5.0 3.6 1.4 0.2"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"flowers = sns.load_dataset('iris')\n",
"flowers = flowers.drop(['species'], axis = 1)\n",
"flowers.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a19d906a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" sepal_length \n",
" sepal_width \n",
" petal_length \n",
" petal_width \n",
" \n",
" \n",
" \n",
" \n",
" sepal_length \n",
" 1.000000 \n",
" -0.117570 \n",
" 0.871754 \n",
" 0.817941 \n",
" \n",
" \n",
" sepal_width \n",
" -0.117570 \n",
" 1.000000 \n",
" -0.428440 \n",
" -0.366126 \n",
" \n",
" \n",
" petal_length \n",
" 0.871754 \n",
" -0.428440 \n",
" 1.000000 \n",
" 0.962865 \n",
" \n",
" \n",
" petal_width \n",
" 0.817941 \n",
" -0.366126 \n",
" 0.962865 \n",
" 1.000000 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width\n",
"sepal_length 1.000000 -0.117570 0.871754 0.817941\n",
"sepal_width -0.117570 1.000000 -0.428440 -0.366126\n",
"petal_length 0.871754 -0.428440 1.000000 0.962865\n",
"petal_width 0.817941 -0.366126 0.962865 1.000000"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"filenames": {
"image/png": "/home/runner/work/ids-s22/ids-s22/notes/_build/jupyter_execute/docs/ridge_2_1.png"
},
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"flowers_corr = flowers.corr()\n",
"sns.heatmap(flowers_corr, annot = True)\n",
"plt.show\n",
"\n",
"flowers_corr"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0f986b61",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import Ridge\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a8f67dc3",
"metadata": {},
"outputs": [],
"source": [
"X = flowers[['sepal_length', 'petal_width', 'petal_length']]\n",
"y = flowers['sepal_width']\n",
"\n",
"scaler = StandardScaler()\n",
"Xs = scaler.fit_transform(X)\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)\n",
"\n",
"model = Ridge()\n",
"\n",
"model.fit(X_train, y_train)\n",
"\n",
"y_pred = model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4caf26d7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5310770203976505\n"
]
}
],
"source": [
"from sklearn.metrics import r2_score\n",
"\n",
"print(r2_score(y_test, y_pred))"
]
},
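{
"cell_type": "markdown",
"id": "c48e0b15",
"metadata": {},
"source": [
"The model above uses `Ridge()` with its default penalty (`alpha = 1.0`). Step 2 of the basic \n",
"steps, choosing λ, can be sketched with `RidgeCV`, which scores each candidate `alpha` by \n",
"cross-validation. This is a minimal sketch rather than the analysis from these notes: it loads \n",
"iris from `sklearn.datasets` so that the cell runs on its own, and the `alphas` grid and the \n",
"5-fold setting are assumptions chosen for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d6a2f31",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.linear_model import RidgeCV\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Columns of iris.data: sepal length, sepal width, petal length, petal width (cm)\n",
"iris = load_iris()\n",
"X = iris.data[:, [0, 2, 3]]  # predictors: sepal length, petal length, petal width\n",
"y = iris.data[:, 1]          # response: sepal width\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)\n",
"\n",
"# Candidate penalties (an assumed grid, evenly spaced on a log scale)\n",
"alphas = np.logspace(-3, 3, 13)\n",
"\n",
"# The scaler is fit on the training data only; RidgeCV then runs 5-fold CV over alphas\n",
"cv_model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))\n",
"cv_model.fit(X_train, y_train)\n",
"\n",
"best_alpha = cv_model.named_steps['ridgecv'].alpha_\n",
"test_r2 = cv_model.score(X_test, y_test)\n",
"print('chosen alpha:', best_alpha)\n",
"print('test R-squared:', test_r2)"
]
},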
{
"cell_type": "markdown",
"id": "bcc1d664",
"metadata": {},
"source": [
"Then we run the model again using regular OLS regression."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "06e8fbcb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"OLS Regression Results \n",
"\n",
" Dep. Variable: sepal_width R-squared: 0.524 \n",
" \n",
"\n",
" Model: OLS Adj. R-squared: 0.514 \n",
" \n",
"\n",
" Method: Least Squares F-statistic: 53.58 \n",
" \n",
"\n",
" Date: Fri, 29 Jul 2022 Prob (F-statistic): 2.06e-23 \n",
" \n",
"\n",
" Time: 22:52:36 Log-Likelihood: -32.100 \n",
" \n",
"\n",
" No. Observations: 150 AIC: 72.20 \n",
" \n",
"\n",
" Df Residuals: 146 BIC: 84.24 \n",
" \n",
"\n",
" Df Model: 3 \n",
" \n",
"\n",
" Covariance Type: nonrobust \n",
" \n",
"
\n",
"\n",
"\n",
" coef std err t P>|t| [0.025 0.975] \n",
" \n",
"\n",
" Intercept 1.0431 0.271 3.855 0.000 0.508 1.578 \n",
" \n",
"\n",
" sepal_length 0.6071 0.062 9.765 0.000 0.484 0.730 \n",
" \n",
"\n",
" petal_width 0.5580 0.123 4.553 0.000 0.316 0.800 \n",
" \n",
"\n",
" petal_length -0.5860 0.062 -9.431 0.000 -0.709 -0.463 \n",
" \n",
"
\n",
"\n",
"\n",
" Omnibus: 0.738 Durbin-Watson: 1.889 \n",
" \n",
"\n",
" Prob(Omnibus): 0.691 Jarque-Bera (JB): 0.426 \n",
" \n",
"\n",
" Skew: -0.102 Prob(JB): 0.808 \n",
" \n",
"\n",
" Kurtosis: 3.163 Cond. No. 82.1 \n",
" \n",
"
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: sepal_width R-squared: 0.524\n",
"Model: OLS Adj. R-squared: 0.514\n",
"Method: Least Squares F-statistic: 53.58\n",
"Date: Fri, 29 Jul 2022 Prob (F-statistic): 2.06e-23\n",
"Time: 22:52:36 Log-Likelihood: -32.100\n",
"No. Observations: 150 AIC: 72.20\n",
"Df Residuals: 146 BIC: 84.24\n",
"Df Model: 3 \n",
"Covariance Type: nonrobust \n",
"================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"--------------------------------------------------------------------------------\n",
"Intercept 1.0431 0.271 3.855 0.000 0.508 1.578\n",
"sepal_length 0.6071 0.062 9.765 0.000 0.484 0.730\n",
"petal_width 0.5580 0.123 4.553 0.000 0.316 0.800\n",
"petal_length -0.5860 0.062 -9.431 0.000 -0.709 -0.463\n",
"==============================================================================\n",
"Omnibus: 0.738 Durbin-Watson: 1.889\n",
"Prob(Omnibus): 0.691 Jarque-Bera (JB): 0.426\n",
"Skew: -0.102 Prob(JB): 0.808\n",
"Kurtosis: 3.163 Cond. No. 82.1\n",
"==============================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from statsmodels.formula.api import ols\n",
"model2 = ols('sepal_width ~ sepal_length + petal_width + petal_length', data = flowers)\n",
"\n",
"fit2 = model2.fit()\n",
"fit2.summary()"
]
},
{
"cell_type": "markdown",
"id": "e0fd48cd",
"metadata": {},
"source": [
"The R-squared value is 0.524, so for this particular model, multicollinearity does not appear to have a significant effect."
]
},
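{
"cell_type": "markdown",
"id": "e21f5a8c",
"metadata": {},
"source": [
"Because the OLS summary reports an in-sample R-squared on all 150 observations, a like-for-like \n",
"comparison would score both models on the same held-out data. The cell below is one way to \n",
"sketch that, under assumptions that are not in the original notes: it uses scikit-learn's \n",
"`LinearRegression` in place of `statsmodels`, reloads iris from `sklearn.datasets` so that it \n",
"runs on its own, and uses a 25% test split with `random_state=0`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a0c73d2",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.linear_model import LinearRegression, Ridge\n",
"from sklearn.metrics import r2_score\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"iris = load_iris()\n",
"X = iris.data[:, [0, 2, 3]]  # sepal length, petal length, petal width\n",
"y = iris.data[:, 1]          # sepal width\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)\n",
"\n",
"# Standardize using statistics from the training data only\n",
"scaler = StandardScaler().fit(X_train)\n",
"X_train_s = scaler.transform(X_train)\n",
"X_test_s = scaler.transform(X_test)\n",
"\n",
"# Fit both models on the same training data and score them on the same test data\n",
"scores = {}\n",
"for name, est in [('OLS', LinearRegression()), ('Ridge (alpha = 1.0)', Ridge(alpha=1.0))]:\n",
"    est.fit(X_train_s, y_train)\n",
"    scores[name] = r2_score(y_test, est.predict(X_test_s))\n",
"    print(name, 'test R-squared:', scores[name])"
]
},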
{
"cell_type": "markdown",
"id": "705c1c24",
"metadata": {},
"source": [
"More on [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?highlight=ridge#sklearn.linear_model.Ridge) and [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html?highlight=ridgecv#sklearn.linear_model.RidgeCV)"
]
}
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.11.5"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
},
"source_map": [
12,
48,
57,
67,
73,
89,
93,
97,
103,
107
]
},
"nbformat": 4,
"nbformat_minor": 5
}