9.5. Generalized Additive Models
9.5.1. Background
Generalized Additive Models (GAMs) are a regression technique used to model nonlinear relationships between variables.
GAMs use piecewise spline functions (typically cubic splines) to create a regression line that is continuous up to the second derivative.
A simple linear regression equation is \(y = \beta_0 + \beta_1 x_1 + \epsilon\)
The same equation as a GAM would be \(y = \alpha + f_1(x_1) + \epsilon\)
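More generally, each predictor gets its own smooth function: \(y = \alpha + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p) + \epsilon\), where each \(f_j\) is a smooth (typically spline) function estimated from the data.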
Two Important Terms
lambda: a smoothing penalty applied to the regression line to prevent overfitting
knots: control the number of sub-functions (pieces) in the piecewise equation; also related to overfitting
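In pyGAM, both of these appear as arguments on a spline term: lam sets the smoothing penalty and n_splines controls the number of basis functions (which is tied to the number of knots). A minimal sketch with illustrative values:
from pygam import LinearGAM, s
# lam = smoothing penalty, n_splines = number of basis functions (related to knots)
gam_sketch = LinearGAM(s(0, n_splines = 10, lam = 0.6))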
Assumptions
Observations are independent
Correct smoothing
Correct link function
Minimal outliers
No multicollinearity between predictors
Source: http://www.medicine.mcgill.ca/epidemiology/goldberg/gam_class_part4.pdf
Pros
Linear model on steroids
More understandable than other methods such as XGBoost
Continuity of second derivative
Can be used not just for linear regression, but also for logistic, Poisson, etc. (see the sketch after these lists)
Cons
Not as understandable as a linear model
Not as powerful as other methods such as XGBoost
Requires assumptions to be met
The Python package (pyGAM) is a work in progress
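To illustrate the point above about other model families, pyGAM provides distribution-specific classes. A minimal sketch, where X_bin, y_bin, X_cnt, and y_cnt are hypothetical arrays:
from pygam import LogisticGAM, PoissonGAM, s
# binary outcome (logit link); X_bin / y_bin are placeholders
# clf = LogisticGAM(s(0)).fit(X_bin, y_bin)
# count outcome (log link); X_cnt / y_cnt are placeholders
# pois = PoissonGAM(s(0)).fit(X_cnt, y_cnt)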
9.5.2. Python
Resources
https://pygam.readthedocs.io/en/latest/notebooks/tour_of_pygam.html
https://www.youtube.com/watch?v=XQ1vk7wEI7c
To install the package, use one of the following commands:
pip install pygam
conda install -c conda-forge pyGAM
Dataset
Fantasy football draft data (2015 to 2021)
import pandas as pd
Load in average draft position (ADP) data
df_ADP = pd.read_html("https://www.fantasypros.com/nfl/adp/ppr-overall.php?year=2015")[0]
df_ADP["year"] = int(2015)
df_ADP.head()
for i in range(16, 22):
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df_ADP = pd.concat([df_ADP, pd.read_html("https://www.fantasypros.com/nfl/adp/ppr-overall.php?year=20" + str(i))[0]])
    df_ADP.loc[df_ADP["year"].isnull(), "year"] = int("20" + str(i))
df_ADP.reset_index(drop = True, inplace = True)
df_ADP.sample(10)
df_ADP.info()
Clean column names
df_ADP.rename({
"Rank": "rank_adp",
"Player Team (Bye)": "player",
"AVG": "avg_adp"
}, axis = 1, inplace = True)
df_ADP.columns = df_ADP.columns.str.lower()
df_ADP.head()
Clean player names (because player names include team and bye)
temp = df_ADP["player"].str.split(n = 2)
df_ADP["player"] = temp.str[0] + " " + temp.str[1]
df_ADP.head()
Clean position (because position includes ranking number)
df_ADP['pos'] = df_ADP['pos'].str.replace(r'\d+', '', regex = True)
df_ADP.head()
Load in points data (2015 to 2021)
df_points = pd.read_html("https://www.fantasypros.com/nfl/reports/leaders/ppr.php?year=2015&start=1&end=16")[0]
df_points["year"] = 2015
df_points.head()
for i in range(16, 22):
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df_points = pd.concat([
        df_points,
        pd.read_html("https://www.fantasypros.com/nfl/reports/leaders/ppr.php?year=20" + str(i) + "&start=1&end=16")[0]
    ])
    df_points.loc[df_points["year"].isnull(), "year"] = int("20" + str(i))
df_points.reset_index(drop = True, inplace = True)
df_points.sample(10)
df_points.info()
Clean column names
df_points.rename({
"Rank": "rank_scoring",
"Position": "pos",
"Avg": "avg_scoring"
}, axis = 1, inplace = True)
df_points.columns = df_points.columns.str.lower()
df_points.head()
Clean player names (to match with ADP)
temp = df_points["player"].str.split(n = 2)
df_points["player"] = temp.str[0] + " " + temp.str[1]
df_points.head()
Merge dataframes
df = pd.merge(df_ADP, df_points, how='left', on = ["player", "pos", "year"])
df.head()
Only include players who played more than 8 games and whose ADP is 250 or better.
Players who played fewer than 9 games may have misleading averages due to small sample sizes.
Players with an ADP above 250 would not get drafted in basically any fantasy league of reasonable size.
# keep a copy so later column assignments do not hit a view
df = df.loc[(df["games"] > 8) & (df["rank_adp"] <= 250)].copy()
df.head()
To start with, let’s only analyze running backs
df_rb = df.loc[df["pos"] == "RB"].copy()
df_rb.head()
from plotnine import *
(
ggplot(data = df_rb, mapping = aes(x = "rank_adp", y = "avg_scoring")) +
geom_point() +
labs(title = "Running Back Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
The relationship is clearly nonlinear.
Model Building
First specify x and y variables
X = df_rb["rank_adp"].to_numpy()
y = df_rb["avg_scoring"].to_numpy()
Using to_numpy() is advised because it enables grid searching over parameters, although a bug currently breaks grid search with a single predictor anyway (more on this below).
from pygam import LinearGAM, s, l, f
s: spline term (smoothing)
l: linear term
f: factor term
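Terms can be combined with +. A minimal sketch of a multi-term model (the column indices are illustrative):
from pygam import LinearGAM, s, l, f
# spline on column 0, linear on column 1, factor on column 2
model = LinearGAM(s(0) + l(1) + f(2))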
To start with, we can fit a straight line by using l
# 0 refers to the first column of X
lin = LinearGAM(l(0)).fit(X, y)
lin.summary()
df_rb["prediction_lin"] = lin.predict(X)
df_rb.head()
(
ggplot(data = df_rb, mapping = aes(x = "rank_adp", y = "avg_scoring")) +
geom_point() +
geom_line(aes(y = "prediction_lin"), color = "blue", size = 3) +
labs(title = "Running Back Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
Now let’s fit a real GAM using s
gam = LinearGAM(s(0)).fit(X, y)
gam.summary()
df_rb["prediction_gam"] = gam.predict(X)
df_rb.head()
(
ggplot(data = df_rb, mapping = aes(x = "rank_adp", y = "avg_scoring")) +
geom_point() +
geom_line(aes(y = "prediction_gam"), color = "red", size = 3) +
labs(title = "Running Back Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
It looks like the line is a little more squiggly than it needs to be, which means our model is likely overfitting.
To counteract this, we can adjust n_splines and/or lam.
gam_n = LinearGAM(s(0, n_splines = 6)).fit(X, y)
gam_n.summary()
df_rb["prediction_gam_n"] = gam_n.predict(X)
df_rb.head()
gam_lam = LinearGAM(s(0, lam = 100)).fit(X, y)
gam_lam.summary()
df_rb["prediction_gam_lam"] = gam_lam.predict(X)
df_rb.head()
(
ggplot(data = df_rb, mapping = aes(x = "rank_adp", y = "avg_scoring")) +
geom_point(alpha = 0.25) +
geom_line(aes(y = "prediction_lin"), color = "blue", size = 3, alpha = 0.75) +
geom_line(aes(y = "prediction_gam"), color = "red", size = 3, alpha = 0.75) +
geom_line(aes(y = "prediction_gam_n"), color = "orange", size = 3, alpha = 0.75) +
geom_line(aes(y = "prediction_gam_lam"), color = "green", size = 3, alpha = 0.75) +
labs(title = "Running Back Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
X.shape
Unfortunately, with a single predictor, to_numpy() returns a one-dimensional array, so shape shows no second dimension. This breaks gridsearch, so we cannot use it to tune this model.
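A possible workaround (not pursued here) is to keep X two-dimensional by selecting with a list of columns, so to_numpy() returns shape (n, 1):
# selecting with a list of columns keeps the array 2-D: shape (n, 1)
X_2d = df_rb[["rank_adp"]].to_numpy()
X_2d.shape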
We can also fit a regression model with multiple predictors
We need to label encode categorical variables for the model to behave correctly
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df["pos"])
df["le_pos"] = le.transform(df["pos"])
df.head()
X = df[["rank_adp", "le_pos"]].to_numpy()
y = df["avg_scoring"].to_numpy()
X
gam = LinearGAM(s(0) + f(1)).fit(X, y)
gam.summary()
df["prediction_gam"] = gam.predict(X)
(
ggplot(data = df, mapping = aes(x = "rank_adp", y = "avg_scoring", color = "pos")) +
geom_point(alpha = 0.25) +
geom_line(aes(y = "prediction_gam"), size = 3, alpha = 0.75) +
labs(title = "Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
Once again, it looks like the lines are overfitting, given their squiggliness.
To counteract this, we can create an array of random candidate values for lambda; gridsearch will select the best-performing lambda values.
import numpy as np
This formula will create a matrix of shape (1000, 2) with each value between \(10^{-3}\) and \(10^3\). We can use these as potential lambda values to optimize our GAM.
np.random.seed(123)
lams = 10 ** (np.random.rand(1000, 2) * 6 - 3)
lams
The gridsearch method uses the above matrix along with generalized cross validation (or another specified objective) to determine the best values of lambda.
gam_grid = LinearGAM(s(0) + f(1)).gridsearch(X, y, lam = lams)
gam_grid.summary()
df["prediction_gam_grid"] = gam_grid.predict(X)
(
ggplot(data = df, mapping = aes(x = "rank_adp", y = "avg_scoring", color = "pos")) +
geom_point(alpha = 0.25) +
geom_line(aes(y = "prediction_gam_grid"), size = 3, alpha = 0.75) +
labs(title = "Scoring Average by ADP",
x = "Draft Ranking", y = "Scoring Average")
)
It is clear that the regression lines are much smoother and no longer suffer from overfitting.
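Because each term is additive, we can also inspect the fitted terms directly. A minimal sketch using pyGAM's generate_X_grid and partial_dependence methods (matplotlib is assumed to be available):
import matplotlib.pyplot as plt
for i, term in enumerate(gam_grid.terms):
    if term.isintercept:
        continue
    # evenly spaced grid over this term's feature
    XX = gam_grid.generate_X_grid(term = i)
    # partial dependence with a 95% confidence band
    pdep, confi = gam_grid.partial_dependence(term = i, X = XX, width = 0.95)
    plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, color = "red", linestyle = "--")
    plt.title(repr(term))
plt.show()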