9.5. Generalized Additive Models
9.5.1. Background
Generalized Additive Models (GAMs) are a regression technique used to model nonlinear relationships between variables.
GAMs use piecewise spline functions (typically cubic splines) to create a regression line that is continuous up to the second derivative.
A simple linear regression equation is \(y = \beta_0 + \beta_1 x_1 + \epsilon\)
The same equation as a GAM would be \(y = \alpha + f_1(x_1) + \epsilon\)
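More generally, each predictor gets its own smooth function: \(y = \alpha + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p) + \epsilon\), where each \(f_j\) is a smooth (typically spline) function estimated from the data.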
Two Important Terms
lambda: a smoothing penalty applied to the regression line to prevent overfitting
knots: control the number of sub-functions (pieces) in the piecewise equation; also related to overfitting
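In pyGAM, both of these appear as arguments on a spline term: lam sets the smoothing penalty and n_splines controls the number of basis functions (which is tied to the number of knots). A minimal sketch with illustrative values:
from pygam import LinearGAM, s
# lam = smoothing penalty, n_splines = number of basis functions (related to knots)
gam_sketch = LinearGAM(s(0, n_splines = 10, lam = 0.6))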
Assumptions
Observations are independent
Correct smoothing
Correct link function
Minimal outliers
No multicollinearity between predictors
Source: http://www.medicine.mcgill.ca/epidemiology/goldberg/gam_class_part4.pdf
Pros
Linear model on steroids
More understandable than other methods such as XGBoost
Continuity of second derivative
Can be used not just for linear regression, but also for logistic, Poisson, etc. (see the sketch after these lists)
Cons
Not as understandable as a linear model
Not as powerful as other methods such as XGBoost
Requires assumptions to be met
The Python package (pyGAM) is a work in progress
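To illustrate the point above about other model families, pyGAM provides distribution-specific classes. A minimal sketch, where X_bin, y_bin, X_cnt, and y_cnt are hypothetical arrays:
from pygam import LogisticGAM, PoissonGAM, s
# binary outcome (logit link); X_bin / y_bin are placeholders
# clf = LogisticGAM(s(0)).fit(X_bin, y_bin)
# count outcome (log link); X_cnt / y_cnt are placeholders
# pois = PoissonGAM(s(0)).fit(X_cnt, y_cnt)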
9.5.2. Python
Resources
https://pygam.readthedocs.io/en/latest/notebooks/tour_of_pygam.html
https://www.youtube.com/watch?v=XQ1vk7wEI7c
To install the package, use one of the following commands:
pip install pygam
conda install -c conda-forge pyGAM
Dataset
Fantasy football draft data (2015 to 2021)
import pandas as pd
Load in average draft position (ADP) data
df_ADP = pd.read_html("https://www.fantasypros.com/nfl/adp/ppr-overall.php?year=2015")[0]
df_ADP["year"] = int(2015)
df_ADP.head()
for i in range(16, 22):
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df_ADP = pd.concat([df_ADP, pd.read_html("https://www.fantasypros.com/nfl/adp/ppr-overall.php?year=20" + str(i))[0]])
    df_ADP.loc[df_ADP["year"].isnull(), "year"] = int("20" + str(i))
df_ADP.reset_index(drop = True, inplace = True)
df_ADP.sample(10)
df_ADP.info()
Clean column names
df_ADP.rename({
"Rank": "rank_adp",
"Player Team (Bye)": "player",
"AVG": "avg_adp"
}, axis = 1, inplace = True)
df_ADP.columns = df_ADP.columns.str.lower()
df_ADP.head()
Clean player names (because player names include team and bye)
temp = df_ADP["player"].str.split(n = 2)
df_ADP["player"] = temp.str[0] + " " + temp.str[1]
df_ADP.head()
Clean position (because position includes ranking number)
df_ADP['pos'] = df_ADP['pos'].str.replace(r'\d+', '', regex = True)
df_ADP.head()
Load in points data (2015 to 2021)
df_points = pd.read_html("https://www.fantasypros.com/nfl/reports/leaders/ppr.php?year=2015&start=1&end=16")[0]
df_points["year"] = 2015
df_points.head()
for i in range(16, 22):
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df_points = pd.concat([
        df_points,
        pd.read_html("https://www.fantasypros.com/nfl/reports/leaders/ppr.php?year=20" + str(i) + "&start=1&end=16")[0]
    ])
    df_points.loc[df_points["year"].isnull(), "year"] = int("20" + str(i))
df_points.reset_index(drop = True, inplace = True)
df_points.sample(10)
df_points.info()
Clean column names
df_points.rename({
"Rank": "rank_scoring",
"Position": "pos",
"Avg": "avg_scoring"
}, axis = 1, inplace = True)
df_points.columns = df_points.columns.str.lower()
df_points.head()
Clean player names (to match with ADP)
temp = df_points["player"].str.split(n = 2)
df_points["player"] = temp.str[0] + " " + temp.str[1]
df_points.head()
Merge dataframes
df = pd.merge(df_ADP, df_points, how='left', on = ["player", "pos", "year"])
df.head()
Only include players who played more than 8 games and whose ADP is 250 or better.
Players who played fewer than 9 games may have misleading averages due to small sample sizes.
Players with an ADP above 250 would not get drafted in basically any fantasy league of reasonable size.
# keep a copy so later column assignments do not hit a view
df = df.loc[(df["games"] > 8) & (df["rank_adp"] <= 250)].copy()
df.head()
To start with, let’s only analyze running backs
df_rb = df.loc[df["pos"] == "RB"].copy()
df_rb.head()
from plotnine import *
(
ggplot(data = df_rb, mapping = aes(x = "rank_adp", y = "avg_scoring")) +
geom_point() +
labs(title = "Running Back Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
The relationship is clearly nonlinear.
Model Building
First specify x and y variables
X = df_rb["rank_adp"].to_numpy()
y = df_rb["avg_scoring"].to_numpy()
Using to_numpy() is advised because it enables grid searching over parameters, although a bug currently breaks grid search with a single predictor anyway (more on this below).
from pygam import LinearGAM, s, l, f
s: spline term (smoothing)
l: linear term
f: factor term
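Terms can be combined with +. A minimal sketch of a multi-term model (the column indices are illustrative):
from pygam import LinearGAM, s, l, f
# spline on column 0, linear on column 1, factor on column 2
model = LinearGAM(s(0) + l(1) + f(2))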
To start with, we can fit a straight line by using l
# 0 refers to the first column of X
lin = LinearGAM(l(0)).fit(X, y)
lin.summary()
df_rb["prediction_lin"] = lin.predict(X)
df_rb.head()
(
ggplot(data = df_rb, mapping = aes(x = "rank_adp", y = "avg_scoring")) +
geom_point() +
geom_line(aes(y = "prediction_lin"), color = "blue", size = 3) +
labs(title = "Running Back Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
Now let’s fit a real GAM using s
gam = LinearGAM(s(0)).fit(X, y)
gam.summary()
df_rb["prediction_gam"] = gam.predict(X)
df_rb.head()
(
ggplot(data = df_rb, mapping = aes(x = "rank_adp", y = "avg_scoring")) +
geom_point() +
geom_line(aes(y = "prediction_gam"), color = "red", size = 3) +
labs(title = "Running Back Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
It looks like the line is a little more squiggly than it needs to be, which means our model is likely overfitting.
To counteract this, we can adjust n_splines and/or lam.
gam_n = LinearGAM(s(0, n_splines = 6)).fit(X, y)
gam_n.summary()
df_rb["prediction_gam_n"] = gam_n.predict(X)
df_rb.head()
gam_lam = LinearGAM(s(0, lam = 100)).fit(X, y)
gam_lam.summary()
df_rb["prediction_gam_lam"] = gam_lam.predict(X)
df_rb.head()
(
ggplot(data = df_rb, mapping = aes(x = "rank_adp", y = "avg_scoring")) +
geom_point(alpha = 0.25) +
geom_line(aes(y = "prediction_lin"), color = "blue", size = 3, alpha = 0.75) +
geom_line(aes(y = "prediction_gam"), color = "red", size = 3, alpha = 0.75) +
geom_line(aes(y = "prediction_gam_n"), color = "orange", size = 3, alpha = 0.75) +
geom_line(aes(y = "prediction_gam_lam"), color = "green", size = 3, alpha = 0.75) +
labs(title = "Running Back Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
X.shape
Unfortunately, with a single predictor, to_numpy() returns a one-dimensional array, so shape shows no second dimension. This breaks gridsearch, so we cannot use it to tune this model.
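A possible workaround (not pursued here) is to keep X two-dimensional by selecting with a list of columns, so to_numpy() returns shape (n, 1):
# selecting with a list of columns keeps the array 2-D: shape (n, 1)
X_2d = df_rb[["rank_adp"]].to_numpy()
X_2d.shape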
We can also fit a regression model with multiple predictors
We need to label encode categorical variables for the model to behave correctly
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df["pos"])
df["le_pos"] = le.transform(df["pos"])
df.head()
X = df[["rank_adp", "le_pos"]].to_numpy()
y = df["avg_scoring"].to_numpy()
X
gam = LinearGAM(s(0) + f(1)).fit(X, y)
gam.summary()
df["prediction_gam"] = gam.predict(X)
(
ggplot(data = df, mapping = aes(x = "rank_adp", y = "avg_scoring", color = "pos")) +
geom_point(alpha = 0.25) +
geom_line(aes(y = "prediction_gam"), size = 3, alpha = 0.75) +
labs(title = "Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
Once again, it looks like the lines are overfitting, given their squiggliness.
To counteract this, we can create an array of random candidate values for lambda; gridsearch will select the best-performing lambda values.
import numpy as np
This formula will create a matrix of shape (1000, 2) with each value between \(10^{-3}\) and \(10^3\). We can use these as potential lambda values to optimize our GAM.
np.random.seed(123)
lams = 10 ** (np.random.rand(1000, 2) * 6 - 3)
lams
The gridsearch method uses the above matrix along with generalized cross validation (or another specified objective) to determine the best values of lambda.
gam_grid = LinearGAM(s(0) + f(1)).gridsearch(X, y, lam = lams)
gam_grid.summary()
df["prediction_gam_grid"] = gam_grid.predict(X)
(
ggplot(data = df, mapping = aes(x = "rank_adp", y = "avg_scoring", color = "pos")) +
geom_point(alpha = 0.25) +
geom_line(aes(y = "prediction_gam_grid"), size = 3, alpha = 0.75) +
labs(title = "Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
It is clear that the regression lines are much smoother and no longer suffer from overfitting.
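Because each term is additive, we can also inspect the fitted terms directly. A minimal sketch using pyGAM's generate_X_grid and partial_dependence methods (matplotlib is assumed to be available):
import matplotlib.pyplot as plt
for i, term in enumerate(gam_grid.terms):
    if term.isintercept:
        continue
    # evenly spaced grid over this term's feature
    XX = gam_grid.generate_X_grid(term = i)
    # partial dependence with a 95% confidence band
    pdep, confi = gam_grid.partial_dependence(term = i, X = XX, width = 0.95)
    plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, color = "red", linestyle = "--")
    plt.title(repr(term))
plt.show()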