8  Statistical Modeling & Machine Learning

8.1 K-Means Clustering

This presentation was prepared by Alysha Desai.

8.1.1 Introduction

8.1.1.1 What is K-Means and how does it work?

  • K-Means Clustering is an unsupervised learning algorithm that groups unlabeled data into clusters based on similarity.
  • Unsupervised: No pre-set labels or answers provided to the computer.
  • Goal: Find hidden patterns and structure in unorganized, raw data. The objective is to minimize the total distance between data points and the centroid of their assigned cluster.

8.1.1.2 Real-World Use Case

  • This type of cluster analysis is used in data science for market segmentation, document clustering, and much more.
  • I will be focusing on customer segmentation for targeted advertising.

Dataset:

Mall Customer Segmentation

  • Profiles of 200 shoppers from a supermarket mall.
  • We use Age, Income, and Spending Score to find groups.

8.1.2 The Logic

8.1.2.1 The “K” in K-Means and Centroids

8.1.3 What is K?

  • K represents the number of clusters we want the algorithm to create.
  • You must choose K before the algorithm begins.

8.1.4 Centroids

  • A centroid is the imaginary “center” of a cluster.
  • It is the average of all data points in that group.
  • Centroids move as the group of points changes.

8.1.4.1 K-Means in Motion

We can use Python to create an animation that shows the step-by-step flow of how clusters are formed by stopping the algorithm after each iteration.

  • Iterative Movement: Watch how the Red X centroids physically shift as they recalculate the mean of their assigned points.
  • Dynamic Assignment: Notice how individual shoppers change colors (clusters) as centroids move closer to them.
  • Convergence: The animation stops once the centroids find their stable “home,” resulting in our five final market segments.

8.1.5 The Elbow Method

8.1.5.1 How do we find the optimal K?

  • A visual technique used to decide exactly how many clusters K are needed for a specific dataset.
  • As we add more clusters, the data points naturally get closer to their centers, but adding too many makes the model too complex.
  • We calculate the Within-Cluster Sum of Squares (WCSS), which is the total variance or “spread” inside each cluster.
  • We look for the specific point on the graph where the drop in WCSS slows down significantly—this is the “sweet spot”.

8.1.5.2 Visualizing the Elbow Curve

We tested K from 1 to 10 using our actual mall dataset. The “bend” happens at K=5, which is our optimal number. For our 200 shoppers, 5 clusters provide the best fit. Beyond 5, adding clusters provides very little extra value.
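
A minimal sketch of how this elbow curve could be produced with scikit-learn is shown below. It assumes the mall dataset file data/Mall_Customers.csv used in the implementation section, and it uses the model’s inertia_ attribute, which stores the WCSS for a fitted model.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Income and Spending Score columns of the mall dataset
df = pd.read_csv('data/Mall_Customers.csv')
X = df.iloc[:, [3, 4]].values

# WCSS (inertia_) for K = 1 through 10
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.title('The Elbow Method')
plt.show()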

8.1.6 Step-by-step Breakdown

Step 1: Initialization

  • Randomly select K points from the dataset to be initial centroids.

Step 2: Assignment

  • Every data point is assigned to its nearest centroid using Euclidean distance.

Step 3: Update

  • Centroids move to the mathematical mean of all points in their new group.

Step 4: Convergence

  • The loop repeats until centroids stop moving or reach a set limit.
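
These four steps can also be written out directly in NumPy. The sketch below is illustrative only (the later sections use scikit-learn’s KMeans); it assumes X is a 2-D NumPy array of data points and does not handle the edge case of a cluster losing all of its points.

import numpy as np

def kmeans_from_scratch(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: Initialization - pick K random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: Assignment - each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: Update - each centroid moves to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: Convergence - stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids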

8.1.7 Python Implementation

8.1.7.1 Coding with Scikit-Learn and Matplotlib

First, we need to import the following libraries:

  • NumPy: Essential for numerical operations and distance calculations.
  • Pandas: Used to load the mall customer dataset into a data frame.
  • Matplotlib: The primary tool for plotting our raw data and final results.
  • Scikit-learn: Provides the KMeans model used for clustering.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the mall dataset
df = pd.read_csv('data/Mall_Customers.csv')

# Extract the features for clustering
# Using columns 3 (Income) and 4 (Spending Score)
X = df.iloc[:, [3, 4]].values

# Apply K-Means with K = 5 (chosen via the elbow method)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

8.1.8 Visualizing Results

8.1.8.1 Raw Data & Model Setup

8.1.9 The “Unorganized” Data

8.1.9.1 Implementation Code

# Select 5 clusters & Fit
kmeans = KMeans(n_clusters=5, 
                init='k-means++', 
                random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plotting with Matplotlib
plt.scatter(X[y_kmeans == 0, 0], 
            X[y_kmeans == 0, 1], 
            s=50, c='pink')
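
The snippet above plots only the first cluster. A sketch of how the remaining clusters and the centroids could be drawn is shown below; the color choices and axis labels are illustrative.

# Plot all five clusters (colors are arbitrary) and the centroids
colors = ['pink', 'blue', 'green', 'orange', 'purple']
for i, color in enumerate(colors):
    plt.scatter(X[y_kmeans == i, 0], X[y_kmeans == i, 1],
                s=50, c=color, label=f'Cluster {i + 1}')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X', label='Centroids')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()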

8.1.10 Visualizing Results

8.1.10.1 Final Segments and Analysis

8.1.10.2 Segmented Clusters

8.1.10.3 Cluster Analysis

Pink Group: High Income / Low Spenders

Blue Group: High Income / High Spenders

Centroids: The mathematical mean of each group

Insight: Target specific marketing strategies

8.1.11 Advantages & Limitations

  • Advantages: Simple to implement, fast, and highly scalable.

  • Limitations:

    • Outlier Sensitivity: Extreme points skew the mean and cluster shape.

    • Manual Choice of K: The model can’t find the best K without us.

    • Initialization Risk: Starting positions heavily impact final results.

    • Dimensionality: Performance drops as you add too many variables.

8.1.11.1 Further Readings

Scikit-learn Clustering Guide: Technical documentation for K-Means and other models. https://scikit-learn.org/stable/modules/clustering.html

Google Machine Learning Education: A great overview of K-Means advantages and disadvantages. https://developers.google.com/machine-learning/clustering/kmeans/

IBM: What is K-Means?: A high-level explanation of how the algorithm is used in business. https://www.ibm.com/think/topics/k-means-clustering

8.1.12 Thank you!

8.1.12.1 Questions?


8.3 Lasso Regression

This presentation was prepared by Emily Bennett.

What is LASSO?

LASSO stands for Least Absolute Shrinkage and Selection Operator. LASSO is a regression technique used in statistics and machine learning.

LASSO is used in regression models to:

  • Provide more accurate predictions

  • Select important variables

  • Shrink or eliminate irrelevant variables

LASSO gives you a better model that is more effective and efficient.

8.3.1 Refresh of Regression

Y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

  • Y is the dependent variable

  • the x’s are the independent variables

  • the β’s are the coefficients (parameters)

  • ε is the error term

This is a linear regression model (with a single predictor it is called simple linear regression). The equation assumes a linear relationship between the independent variables and the dependent variable.

A good example use of linear regression is from our midterm:

  • Estimating the number of bike rental sales from different variables like season, location, temperature, etc.
  • Each coefficient shows how much the variable will affect the output

8.3.2 Main Idea of LASSO

In linear regression, we have trouble understanding what variables are helpful and influential and what variables are just noise. The model could overfit, which means the model would not perform well on new data.

LASSO helps us deal with these unhelpful variables by shrinking less important coefficients. It can also remove unneeded variables entirely by setting their coefficients to zero. This makes the model simpler and more interpretable.

Using our real world example of bike rentals, I might think the gender of the customer is important to estimating the bike rentals, but using LASSO could show me gender doesn’t matter. LASSO could shrink gender’s coefficient or set it to zero if it wasn’t useful to the model.

8.3.3 How LASSO works

LASSO will penalize these unhelpful variables by either:

  • shrinking the coefficients -> the variable has less effect on the equation

  • setting the coefficient to zero -> the variable has no effect on the equation

8.3.4 LASSO Regularization

LASSO adds a penalty term to the model. If a variable doesn’t add much to the model, this penalty term helps shrink that variable’s coefficient.

L₁ = λ * (|β₁| + |β₂| + … + |βₚ|)

  • L₁ is the regularization term

  • λ is the regularization parameter; it controls how strong the penalty is

  • |β| are the absolute values of the coefficients

Lambda controls how strong the penalty is.

  • If lambda is small, the model is more complex and most of the variables are kept.

  • If lambda is big, the model is simpler and many of the variables are shrunk or removed.
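
To see this effect of lambda concretely, the sketch below fits LASSO at a few alpha values (scikit-learn calls lambda “alpha”) and counts how many coefficients are set exactly to zero. It assumes X_train_scaled and y_train are prepared as in the coding example that follows.

from sklearn.linear_model import Lasso

# Larger alpha (lambda) -> stronger penalty -> more coefficients forced to zero
for alpha in [0.001, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X_train_scaled, y_train)
    n_zero = (model.coef_ == 0).sum()
    print(f"alpha={alpha}: {n_zero} of {len(model.coef_)} coefficients are zero")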

8.3.5 Example through coding

I am using a real dataset to predict housing prices in California.

# Basic tools and dataset
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Modeling tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# Load dataset
data = fetch_california_housing()

# Convert to data frame
# "target" is house price
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Preview data
print(X.head())
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  
  • MedInc- median income (in tens of thousands of dollars) of households in block group

  • HouseAge- median house age in block group

  • AveRooms- average number of rooms per household in block group

  • AveBedrms- average number of bedrooms per household in block group

  • Population- population in block group

  • AveOccup- average number of household members in block group

  • Latitude- block group latitude

  • Longitude- block group longitude

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
  • Training set is used to build the model (80% of data)

  • Test set is used to evaluate the performance (20% of data)

  • random_state=42 makes the split reproducible, so the same rows end up in the training and test sets each time the code is run

# Standardize features (important for LASSO)
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
  • Feature scaling

  • This makes sure that all the variables are treated fairly

  • This will convert values to a similar range

  • LASSO can see which variables to keep or remove in a fair way by making the variables comparable

# Create LASSO model
lasso = Lasso(alpha=0.1)

# Train model
lasso.fit(X_train_scaled, y_train)
Lasso(alpha=0.1)
  • Alpha value is your lambda

  • Alpha is strength of penalty

  • Small alpha- keep most features

  • Big alpha- remove/shrink most features

To find a good alpha value, use a cross-validation technique as shown here:

from sklearn.linear_model import LassoCV

# Try many alpha values automatically
lasso_cv = LassoCV(cv=5)

lasso_cv.fit(X_train_scaled, y_train)

# Best alpha found
print("Best alpha:", lasso_cv.alpha_)
Best alpha: 0.0007985195644260347
# Convert coefficients into readable format
coef = pd.Series(lasso.coef_, index=X.columns)

print(coef.sort_values())
Latitude     -0.011469
AveRooms     -0.000000
AveBedrms     0.000000
Population   -0.000000
AveOccup     -0.000000
Longitude    -0.000000
HouseAge      0.106453
MedInc        0.710598
dtype: float64


  • Here we can see that LASSO has set some of these coefficients to zero, meaning those variables are not important for predicting the housing price.

  • Coefficients with smaller absolute values are less important; those with larger absolute values are more important.

import matplotlib.pyplot as plt

coef.sort_values().plot(kind='barh')
plt.title("LASSO Feature Importance (Housing Data)")
plt.show()

  • This visualizes the importance of each feature for predicting housing prices
from sklearn.metrics import mean_squared_error

# Predict on test data
y_pred = lasso.predict(X_test_scaled)

# Compute error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Mean Squared Error: 0.6796290284328825


  • MSE measures the average squared prediction error on the test data; lower values mean better predictions

8.3.6 Cons of LASSO

  • LASSO has trouble with predictors that are highly correlated. It might keep one of the variables and throw the other out without determining which one is truly important. In this case, use a related tool called Elastic Net.

  • If lambda is too high, LASSO might shrink the coefficients too much, making the model too simple (underfitting).

8.3.7 Conclusion

“LASSO is like a personal trainer for your model: it pushes you to do better while keeping things lean and focused.” (Ujang Riswanto, 2024)

LASSO is a tool that helps keep your model from overfitting. It penalizes coefficient size, shrinking or removing unimportant variables, which is especially useful in larger datasets with many predictors.

8.3.8 References

https://ujangriswanto08.medium.com/a-beginners-guide-to-lasso-regression-when-and-how-to-use-it-dc872e4ee489

https://www.mygreatlearning.com/blog/understanding-of-lasso-regression/

8.4 Statistical Analysis for Proportions and Rates

This presentation was prepared by Kayleigh Budnick.

8.4.1 Introduction

8.4.1.1 What are proportions and rates?

  • Proportions measure the fraction of units with a certain characteristic
    • Examples: % of emails marked spam, % of voters supporting a candidate
  • Rates measure how often events occur relative to time
    • Examples: # of new COVID-19 cases per day, # of website visits per hour
  • Both are important for summarizing data and making comparisons across groups or time periods. They are commonly used in fields like:
    • Public health
    • Marketing
    • Social science
    • Quality control

8.4.1.2 Why do they matter in data science?

  • Many real-world outcomes are binary (e.g., success/failure, yes/no) or count-based (e.g., number of clicks, number of defects).
  • Proportions and rates allow:
    • Estimation
    • Uncertainty quantification
    • Hypothesis testing
    • Comparison across groups
  • They connect directly to generalized linear models (GLMs) and machine learning algorithms that handle classification and count data.

8.4.2 Proportions

8.4.2.1 Modeling Proportions with the Binomial Distribution

  • Binary outcomes follow a Bernoulli distribution (success/failure).
  • Repeated trials → Binomial distribution
  • Sample proportion: \(\hat{p} = \frac{x}{n}\) where \(x\) is the number of successes and \(n\) is the total trials.
  • Sampling distribution of \(\hat{p}\) is approximately normal for large \(n\) (Central Limit Theorem):
    • Mean: \(p\)
    • Standard error: \(\sqrt{\frac{p(1-p)}{n}}\)

8.4.2.2 Confidence Intervals for Proportions

  • Wald interval (simple, but often inaccurate)
    • \(\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
  • Wilson score interval (more accurate, especially for small samples)
    • \(\frac{\hat{p} + \frac{z^2}{2n} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}\)
  • Interpretation: We are 95% confident that the true population proportion lies within this interval.

8.4.2.3 Hypothesis Testing for Proportions

  • One-sample z-test:
    • \(z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}\), where \(p_0\) is the hypothesized proportion
    • Compare \(z\) to critical values from the standard normal distribution
    • Calculate the p-value for the observed \(z\) and compare it to the significance level \(\alpha\)
    • Tests whether the observed proportion is significantly different from a hypothesized value.
  • Two-sample z-test for comparing two proportions:
    • \(z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}}\) where \(\hat{p}\) is the pooled proportion
    • Used in A/B testing, clinical trials, etc.
    • Tests whether \(p_1\) and \(p_2\) are significantly different.
    • Use paired-sample test for before/after of the same subject. Use two-sample test for independent samples.
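
As a quick illustration of the one-sample test, the sketch below uses statsmodels’ proportions_ztest with the spam-email numbers from the example that follows (56 spam out of 100) to test whether the true spam proportion differs from 0.5.

from statsmodels.stats.proportion import proportions_ztest

# H0: p = 0.5 versus H1: p != 0.5
# Note: by default statsmodels uses the sample proportion in the variance estimate
stat, pval = proportions_ztest(count=56, nobs=100, value=0.5)
print("Test statistic:", stat)
print("P-value:", pval)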

8.4.2.4 Logistic Regression for Proportions

  • Regression model for binary outcomes
  • Predicts probability of success
  • Part of GLM family
    • \(\log\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \dots\)

Why use logit link?

  • Keeps predicted probabilities between 0 and 1
  • Converts probabilities into a scale where linear models work

8.4.2.5 Python Example: Proportion Confidence Interval

Suppose we have a sample of 100 emails, and 56 were marked as spam. We want to calculate a 95% confidence interval for the true proportion of spam emails.

import statsmodels.stats.proportion as smp

count = 56
nobs = 100

ci_low, ci_high = smp.proportion_confint(count, nobs, method="wilson")
print("Confidence interval: (", ci_low, ci_high, ")")
Confidence interval: ( 0.4622810465167698 0.6532797336983921 )

Output: (0.462, 0.653) → We are 95% confident that the true proportion of spam emails is between 46.2% and 65.3%.

8.4.3 Rates

8.4.3.1 Modeling Rates with the Poisson Distribution

  • Count data often follows a Poisson distribution, especially for rare events.
  • Rate parameter \(\lambda\) represents the average number of events per unit time.
  • The variance of a Poisson distribution is equal to its mean (\(\lambda\)).

Examples:

  • Calls to a help desk per hour
  • Defects in a manufacturing process per day
  • Disease cases per 100,000 people

8.4.3.2 Confidence Intervals for Rates

  • For a Poisson count \(x\) observed over time \(t\), the rate is \(\hat{\lambda} = \frac{x}{t}\).
  • Confidence intervals can be calculated using the chi-squared distribution:
  • Lower bound: \(\frac{1}{2t} \chi^2_{\alpha/2, 2x}\)
  • Upper bound: \(\frac{1}{2t} \chi^2_{1-\alpha/2, 2(x+1)}\)

8.4.3.3 Comparing Rates

  • Rate ratio: \(\text{RR} = \frac{\hat{\lambda}_1}{\hat{\lambda}_2}\)
  • Confidence interval for the rate ratio can be calculated using the logarithm of the rate ratio and its standard error:
    • \(\text{SE}(\log(\text{RR})) = \sqrt{\frac{1}{x_1} + \frac{1}{x_2}}\)
  • Overdispersion check may be necessary if the variance exceeds the mean, suggesting a negative binomial model may be more appropriate

8.4.3.4 Poisson Regression for Rates

  • Poisson regression is the regression model for count data
  • Predicts the rate at which events occur
  • Part of the GLM family
    • \(\log(\lambda) = \beta_0 + \beta_1 X_1 + \dots\)

Why use log?

  • Ensures predicted rates are positive
  • Makes effect multiplicative
  • Coefficients exponentiate to rate ratios

Poisson regression is to counts what logistic regression is to binary outcomes.

8.4.3.5 Overdispersion in Count Data

  • Poisson assumes that the variance equals the mean
  • In real data, the variance is often actually larger than the mean

Why does this matter?

  • Our standard errors will be too small
  • Our CIs will be too narrow
  • P-values will appear more significant than they are
  • Detect overdispersion by comparing residual deviance to degrees of freedom
    • If deviance > df, then overdispersion is likely

How to fix it?

  • Negative binomial regression
    • Adds a dispersion parameter
    • Naturally handles extra variability
  • Check for missing predictors
    • Overdispersion can mean you are missing variables somewhere
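
A quick way to run this check in Python is sketched below; it assumes model is a fitted statsmodels Poisson GLM result, like the one in the accident-rate example later in this section.

# Compare residual deviance to residual degrees of freedom
dispersion = model.deviance / model.df_resid
print("Deviance / df:", dispersion)
# Values well above 1 suggest overdispersion; consider a negative binomial model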

8.4.3.6 Python Example: Rate Confidence Interval

Suppose we observed 20 new COVID-19 cases in a city over a week (7 days). We want to calculate a 95% confidence interval for the true rate of new cases per day.

The chi‑square distribution can be used to compute exact Poisson confidence intervals for a single rate. This works because a transformed Poisson count follows a chi‑square distribution and is a common method for calculating confidence intervals for rates, especially when the count is small.

import scipy.stats as stats
x = 20  # number of cases
t = 7   # time in days

alpha = 0.05
ci_low = (1/(2*t)) * stats.chi2.ppf(alpha/2, 2*x)
ci_high = (1/(2*t)) * stats.chi2.ppf(1 - alpha/2, 2*(x + 1))
print("Confidence interval: (", ci_low, ci_high, ")")
Confidence interval: ( 1.7452170836291347 4.412625414667798 )

Output: (1.74, 4.41) → We are 95% confident that the true rate of new cases per day is between 1.74 and 4.41.

8.4.4 GLM Perspective

  • Proportions can be modeled using a binomial family with a logit link function (logistic regression).
  • Rates can be modeled using a Poisson family with a log link function (Poisson regression).
  • Why this matters:
    • Allows us to model the relationship between predictors and the outcome (proportion or rate) while accounting for the distribution of the data.
  • Example: Logistic regression for predicting the probability of an email being spam based on features like word count, presence of certain keywords, etc. Poisson regression for modeling the number of website visits per day based on factors like advertising spend, day of the week, etc.

8.4.5 Key Python Packages

  • statsmodels:
    • proportion_confint for confidence intervals for proportions
    • proportions_ztest for hypothesis testing of proportions
    • the statsmodels.stats.rates module (e.g., test_poisson_2indep) for testing and comparing rates
    • GLM with the Poisson and NegativeBinomial families for modeling count data
  • scipy.stats:
    • chi2 for chi-squared distribution quantiles (used in exact Poisson rate intervals) and chi2_contingency to test independence in contingency tables
    • binom for binomial distribution functions
    • shapiro or normaltest to check the normality assumption
  • matplotlib/seaborn:
    • Visualization of proportions and rates with bar charts, error bars, etc.
  • pandas:
    • Data manipulation and summarization for calculating proportions and rates from raw data

8.4.6 Applied Example 1 - A/B Testing for Proportions

  • Scenario: A marketing company wants to test if a new website design (B) leads to higher engagement (number of clicks per visitor) than the current design (A).
  • Data:
    • Design A: 200 clicks out of 2,000 visitors
    • Design B: 260 clicks out of 2,100 visitors
    • Null hypothesis: The number of clicks per visitor for both designs are the same (\(p_A = p_B\))
    • Alternative hypothesis: Design B has higher clicks per visitor than Design A (\(p_B > p_A\))
8.4.6.0.1 Two-sample z-test for proportions:
import statsmodels.api as sm

count = [200, 260] # A # B
nobs = [2000, 2100] # A # B

# Alternative = smaller because we are testing if design A engagement is significantly less than design B engagement

stat, pval = sm.stats.proportions_ztest(count, nobs, alternative='smaller')
print("Test statistic:", stat)
print("P-value:", pval)
Test statistic: -2.414558186049353
P-value: 0.00787715322307177

Output: The p-value is much less than 0.05, indicating that Design A has significantly less engagement than Design B. Additionally, the absolute value of the test statistic is very large (> 1.96), which also supports that there is a significant difference between engagement in the two websites.

8.4.6.0.2 Confidence interval for the difference in proportions:
import statsmodels.stats.proportion as smp

ci_low, ci_high = smp.proportion_confint(
  count[1] - count[0], 
  nobs[1] + nobs[0], 
  method="wilson")

print("Confidence interval: (", ci_low, ci_high, ")")
Confidence interval: ( 0.011386527353967016 0.018790432468521346 )

Output: (0.011, 0.019) → The 95% confidence interval for the difference in engagement between design B and design A is between 1.1% and 1.9%, suggesting that design B’s engagement is likely higher than design A’s by this margin. Because the interval does not include 0, it supports the conclusion that there is a significant difference in engagement between the two designs.
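
The interval above is computed by applying proportion_confint to the pooled difference in counts. A more direct interval for the difference \(p_B - p_A\) can be obtained with statsmodels’ two-sample helper; this is a sketch assuming confint_proportions_2indep is available in your statsmodels version.

from statsmodels.stats.proportion import confint_proportions_2indep

# CI for the difference p_B - p_A (Design B minus Design A)
ci_low, ci_high = confint_proportions_2indep(
    count[1], nobs[1],   # Design B: 260 clicks out of 2,100 visitors
    count[0], nobs[0],   # Design A: 200 clicks out of 2,000 visitors
    compare='diff'
)
print("Confidence interval for p_B - p_A: (", ci_low, ci_high, ")")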

8.4.6.0.3 Logistic regression model for engagement:
import pandas as pd
import statsmodels.api as sm

# Data for logistic regression
# 1 row per design (A, B)
data = pd.DataFrame({
    "clicks": [200, 260],
    "visitors": [2000, 2100],
    "design_B": [0, 1]   # 0 = Design A, 1 = Design B
})

# Response: successes (clicks) and failures (non-clicks)
data["failures"] = data["visitors"] - data["clicks"]

# Logistic regression requires a 2-column array: [successes, failures]
y = data[["clicks", "failures"]]

# Predictor matrix (intercept + indicator for Design B)
X = sm.add_constant(data["design_B"])

# Fit logistic regression (binomial family)
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()

print(model.summary())
                   Generalized Linear Model Regression Results                    
==================================================================================
Dep. Variable:     ['clicks', 'failures']   No. Observations:                    2
Model:                                GLM   Df Residuals:                        0
Model Family:                    Binomial   Df Model:                            1
Link Function:                      Logit   Scale:                          1.0000
Method:                              IRLS   Log-Likelihood:                -7.1494
Date:                    Mon, 20 Apr 2026   Deviance:                   1.5099e-13
Time:                            11:47:03   Pearson chi2:                 1.85e-24
No. Iterations:                         5   Pseudo R-squ. (CS):             0.9463
Covariance Type:                nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.1972      0.075    -29.479      0.000      -2.343      -2.051
design_B       0.2404      0.100      2.410      0.016       0.045       0.436
==============================================================================

Output: Our key coefficient design_B = 0.2404, with p-value = 0.016. The coefficient is positive, meaning Design B has higher engagement. The p-value < 0.05 so this increase is statistically significant.

8.4.6.0.4 Visualization of engagement:
import matplotlib.pyplot as plt

props = [200/2000, 260/2100]
plt.bar(["A","B"], props)
plt.title("Number of clicks per visitor")
plt.ylim(0, max(props)*1.2)
plt.show()

8.4.7 Applied Example 2 - Comparing Accident Rates

  • Scenario: A city wants to compare the rate of traffic accidents before and after implementing a new traffic safety program. The city collects data on 10 road segments, each observed before and after the safety program.
  • Data:
import numpy as np
import pandas as pd

# Create data frame with 10 road segments, before and after the safety program
data = pd.DataFrame({
    "segment": np.repeat(np.arange(1, 11), 2),
    "after_program": [0, 1] * 10,
    "exposure": [
        5000, 5200,   # segment 1
        4800, 5100,   # segment 2
        5100, 5300,   # segment 3
        4950, 5050,   # segment 4
        5050, 5150,   # segment 5
        4900, 5000,   # segment 6
        5200, 5400,   # segment 7
        5000, 5200,   # segment 8
        5100, 5300,   # segment 9
        4950, 5150    # segment 10
    ],
    "accidents": [
        2, 1,   # segment 1
        3, 1,   # segment 2
        2, 1,   # segment 3
        2, 1,   # segment 4
        3, 2,   # segment 5
        2, 1,   # segment 6
        3, 1,   # segment 7
        2, 1,   # segment 8
        2, 1,   # segment 9
        3, 1    # segment 10
    ]
})

# Pivot data to get total accidents and exposure before and after for each segment
wide = data.pivot(index="segment", columns="after_program", values=["accidents", "exposure"])

# Rename columns for clarity
wide.columns = ["accidents_before", "accidents_after", "exposure_before", "exposure_after"]

# Reorder columns
wide = wide[["accidents_before", "exposure_before", "accidents_after", "exposure_after"]]

# Print 
print(wide.to_string())
         accidents_before  exposure_before  accidents_after  exposure_after
segment                                                                    
1                       2             5000                1            5200
2                       3             4800                1            5100
3                       2             5100                1            5300
4                       2             4950                1            5050
5                       3             5050                2            5150
6                       2             4900                1            5000
7                       3             5200                1            5400
8                       2             5000                1            5200
9                       2             5100                1            5300
10                      3             4950                1            5150
  • Null hypothesis: The accident rates before and after the program are the same (\(\lambda_{before} = \lambda_{after}\))
  • Alternative hypothesis: The accident rate after the program is lower than before (\(\lambda_{after} < \lambda_{before}\))
8.4.7.0.1 Poisson test for comparing two rates:
from statsmodels.stats.rates import test_poisson_2indep

# Aggregate totals
count_before = data.loc[data.after_program == 0, "accidents"].sum()
count_after  = data.loc[data.after_program == 1, "accidents"].sum()

exposure_before = data.loc[data.after_program == 0, "exposure"].sum()
exposure_after  = data.loc[data.after_program == 1, "exposure"].sum()

print("Before totals:", count_before, "accidents,", exposure_before, "hours")
print("After totals:", count_after, "accidents,", exposure_after, "hours")

# Poisson test
result = test_poisson_2indep(
    count_before, exposure_before,
    count_after, exposure_after,
    method="score"
)

print("\nTest statistic:", result.statistic)
print("P-value:", result.pvalue)
Before totals: 24 accidents, 50050 hours
After totals: 11 accidents, 51850 hours

Test statistic: 2.302264141003014
P-value: 0.021320280750128675

Output: (2.30, 0.02) → The test statistic is 2.30 with a p-value of 0.02 < 0.05, indicating a significant difference in accident rates with the after-program rate being lower.

8.4.7.0.2 Rate ratio and confidence interval:
import numpy as np
import statsmodels.api as sm

# Rate ratio (after / before)
rate_ratio = (count_after / exposure_after) / (count_before / exposure_before)

# Standard error of log(rate ratio)
se_log_rr = np.sqrt((1 / count_before) + (1 / count_after))

# 95% CI on log scale
z = 1.96
ci_low = np.exp(np.log(rate_ratio) - z * se_log_rr)
ci_high = np.exp(np.log(rate_ratio) + z * se_log_rr)

print("Rate ratio:", rate_ratio)
print("95% CI: (", ci_low, ",", ci_high, ")")
Rate ratio: 0.4424220507875281
95% CI: ( 0.21672073578994006 , 0.9031774015974341 )

Output: (0.217, 0.903) → The 95% confidence interval for the rate ratio ranges from 0.217 to 0.903. This interval does not include 1, supporting the above conclusion that there does exist a significant difference in accident rates before and after the program was implemented.

In other words, with 95% confidence the accident rate after the program is between 21.7% and 90.3% of the accident rate before the program.

According to the rate ratio of 0.442, the accident rate after the program is estimated to be about 44.2% of the rate before, which indicates a substantial decrease in accidents following the implementation of the safety program.

8.4.7.0.3 Poisson regression model for accident rates:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Design matrix: intercept + indicator for after_program
X = sm.add_constant(data["after_program"])

# Offset = log(exposure) so we model accident *rates*
offset = np.log(data["exposure"])

model = sm.GLM(
    data["accidents"],
    X,
    family=sm.families.Poisson(),
    offset=offset
).fit()

print(model.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:              accidents   No. Observations:                   20
Model:                            GLM   Df Residuals:                       18
Model Family:                 Poisson   Df Model:                            1
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -24.979
Date:                Mon, 20 Apr 2026   Deviance:                       1.6948
Time:                        11:47:03   Pearson chi2:                     1.86
No. Iterations:                     4   Pseudo R-squ. (CS):             0.2372
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -7.6427      0.204    -37.442      0.000      -8.043      -7.243
after_program    -0.8155      0.364     -2.240      0.025      -1.529      -0.102
=================================================================================

Output: Our key coefficient after_program = -0.8155 with p-value = 0.025. Because our coefficient is negative, the accident rate decreased after the program. Because the p-value < 0.05, this decrease is significant.

This model implies a rate ratio of exp(-0.8155) ≈ 0.442, which matches the value computed manually in the confidence interval problem. The accident rate after the program is estimated to be about 44% of the rate before.

8.4.7.0.4 Visualization of accident rates:
import matplotlib.pyplot as plt

segments = data["segment"].unique()

for seg in segments:
    seg_data = data[data.segment == seg]
    before_rate = seg_data.loc[seg_data.after_program == 0, "accidents"].values[0] / \
                  seg_data.loc[seg_data.after_program == 0, "exposure"].values[0]
    after_rate = seg_data.loc[seg_data.after_program == 1, "accidents"].values[0] / \
                 seg_data.loc[seg_data.after_program == 1, "exposure"].values[0]
    
    plt.plot(["Before", "After"], [before_rate, after_rate], marker="o", alpha=0.6)

plt.title("Accident Rates by Road Segment")
plt.ylabel("Accidents per Hour")
plt.show()

8.4.8 Common Mistakes

Things to avoid:

  • Using Wald intervals for small sample sizes
  • Ignoring different exposure times when modeling rates
  • Treating counts as proportions
  • Not checking assumptions (independence, overdispersion, normality)
  • Overinterpreting tiny differences with huge sample sizes

8.4.9 Conclusion

  • Proportions and rates are fundamental for summarizing and analyzing binary and count data
  • Proportions → Binomial distribution, confidence intervals, hypothesis testing
  • Rates → Poisson distribution, confidence intervals, rate ratios
  • Applications: A/B testing, public health, quality control, etc.

8.4.10 Further readings

  • Buis, M. L., (2020). Analysis of Proportions, In P. Atkinson, S. Delamont, A. Cernat, J.W. Sakshaug, & R.A. Williams (Eds.), SAGE Research Methods Foundations. https://doi.org/10.4135/9781526421036852590
  • Miettinen, O. and Nurminen, M. (1985), Comparative analysis of two rates. Statist. Med., 4: 213-226. https://doi.org/10.1002/sim.4780040211
  • Buddhi, K., (2025). Basics of probability and statistics using Python. Intelligent Computing Techniques in Biomedical Imaging, Academic Press. https://doi.org/10.1016/B978-0-443-15999-2.00017-7

8.5 Logistic Regression

This presentation was prepared by Ronnie Orsini.

8.5.1 Regression idea

Before talking about logistic regression, it helps to briefly understand what regression means in statistics.

Regression is a method used to model the relationship between a response variable and one or more predictor variables. The goal is usually to predict the value of the response variable based on the predictors.

For example, regression can be used to predict:

  • house prices based on square footage and location

  • sales revenue based on advertising spending

  • temperature based on time of day

One of the most common forms is linear regression, which assumes the relationship between variables follows a straight line.

A simple linear regression model looks like:

\[ y = \beta_0 + \beta_1 x \]

Here:

  • \(y\) is the response variable we want to predict
  • \(x\) is a predictor variable
  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the coefficient describing how \(x\) affects \(y\)

Linear regression works well when the response variable is a continuous number, such as price, temperature, or revenue.

8.5.2 Classification problem

In many real-world situations, we are not trying to predict a numerical value. Instead, we want to predict which category an observation belongs to.

This type of problem is called a classification problem.

Examples include:

  • determining whether an email is spam or not spam

  • diagnosing whether a patient has a disease

In many cases there are only two possible outcomes, which is known as a binary classification problem.

For convenience, these outcomes are usually coded as:

  • 1 → the event occurs (positive class)

  • 0 → the event does not occur (negative class)

Instead of predicting a number, the goal is to estimate the probability that an observation belongs to class 1.

This probability is written as:

\[ P(Y = 1 \mid X) \]

which means the probability that the outcome is 1 given the predictor variables (X).

8.5.2.1 Example: Insider trading project

As an example, in a data science project I worked on last semester, we looked at whether politicians were outperforming the stock market using their disclosed trades.

The question we wanted to answer was:

Are politicians consistently beating the market?

At first it might seem like this could be modeled with linear regression, because we could try to predict how much a trade outperformed the market.

However, the main question we cared about was simpler:

Did the trade beat the market or not?

So each trade could be classified into one of two outcomes:

  • 1 → the trade outperformed the market

  • 0 → the trade did not outperform the market

Because the response variable only had two possible outcomes, this became a binary classification problem.

Instead of predicting a numerical value, the model estimated the probability that a trade would outperform the market.

This is exactly the type of situation where logistic regression is more appropriate than linear regression.

8.5.3 Why linear regression fails for classification

At first, it might seem like we could still use linear regression for classification problems.

For example, in the insider trading project mentioned earlier, we might try to predict whether a trade beats the market using a linear model like

\[ y = \beta_0 + \beta_1 x \]

However, this creates several problems when the response variable only takes values of 0 or 1.

Problem 1: Predictions can fall outside the probability range

Linear regression can produce any real number as a prediction.

For example, the model might predict:

  • -0.3
  • 0.7
  • 1.4

But probabilities must always stay between:

\[ 0 \le p \le 1 \]

A predicted probability of -0.3 or 1.4 does not make sense.

So linear regression cannot guarantee valid probability predictions.

Problem 2: The relationship between predictors and probability is not linear

With binary outcomes, the relationship between predictors and the probability of an event usually follows an S-shaped pattern rather than a straight line.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10,10,200)

# linear relationship
linear = 0.1*x + 0.5

# logistic function
logistic = 1/(1+np.exp(-x))

plt.figure(figsize=(7,5))

plt.plot(x, linear, label="Linear regression", linestyle="--")
plt.plot(x, logistic, label="Logistic regression")

plt.axhline(0, color='gray', linewidth=0.5)
plt.axhline(1, color='gray', linewidth=0.5)

plt.xlabel("Predictor")
plt.ylabel("Predicted Value / Probability")
plt.legend()
plt.show()

Linear relationship vs logistic (S-shaped) probability curve

For example, as a predictor increases, the probability of success might increase quickly at first, then level off as it approaches 1.

A straight line cannot capture this behavior properly.

Because of these issues, we need a model that:

  • produces predictions between 0 and 1
  • can model nonlinear probability behavior

This is where logistic regression comes in.

8.5.4 Logistic regression solution

To solve the problems with linear regression, we need a model that predicts probabilities between 0 and 1.

Logistic regression does this by using a special function called the logistic function.

Instead of predicting the outcome directly, logistic regression first computes a linear combination of the predictors:

\[ z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]

This part is actually very similar to linear regression.

However, instead of using this value directly as the prediction, logistic regression passes it through the logistic function:

\[ P(Y = 1 | X) = \frac{1}{1 + e^{-z}} \]

where:

  • \(P(Y = 1 \mid X)\) = the probability that the outcome is 1 given the predictor variables \(X\)

  • \(Y\) = the response variable we are trying to predict

  • \(X\) = the predictor variables used to make the prediction

  • \(z\) = the input to the logistic function

  • \(e\) = Euler’s number, approximately 2.718

This transformation converts any real number into a value between 0 and 1, which allows the model to represent probabilities.

The logistic function produces an S-shaped curve, where:

  • probabilities close to 0 indicate the event is unlikely
  • probabilities close to 1 indicate the event is likely
  • probabilities near 0.5 represent uncertainty

Because of this transformation, logistic regression avoids the main problem of linear regression. Predictions are always valid probabilities.
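
A quick numeric sketch of this transformation (the example values of \(z\) are illustrative):

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

# Any real-valued z is squeezed into the interval (0, 1)
for z in [-4, -1, 0, 1, 4]:
    print(z, round(logistic(z), 3))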

8.5.5 Interpreting logistic regression predictions

Logistic regression produces a predicted probability that an observation belongs to class 1.

For example, the model might output:

\[ P(Y = 1 \mid X) = 0.82 \]

This means the model estimates an 82% probability that the outcome belongs to class 1 given the predictor variables.

However, in many applications we still need to convert this probability into a final class prediction.

A common rule is to use a threshold of 0.5:

  • if \(P(Y = 1 \mid X) \geq 0.5\), predict class 1

  • if \(P(Y = 1 \mid X) < 0.5\), predict class 0

The threshold does not always have to be 0.5. In some applications it may be adjusted depending on the cost of incorrect predictions.

8.5.6 Logistic regression in Python

Up to this point we have discussed how logistic regression works conceptually. Now we will look at a simple example of how it can be implemented in Python.

In this example, the goal is to predict whether a breast tumor is cancerous or not based on measured features from cell samples.

We will use the Breast Cancer Wisconsin dataset, which is included in the scikit-learn library. This dataset contains several numerical measurements that will be used as predictor variables.

The response variable is binary:

  • 1 → tumor is not cancerous
  • 0 → tumor is cancerous

This makes it a binary classification problem, which is exactly the type of situation where logistic regression is commonly used.

8.5.6.1 Step 1: Import libraries

First we import the libraries needed to run logistic regression.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

These modules allow us to:

  • load the dataset
  • train a logistic regression model
  • split the data into training and testing sets
  • evaluate model performance

8.5.6.2 Step 2: Load the dataset

Next we load the dataset and separate the predictor variables and the response variable.

data = load_breast_cancer()

X = data.data
y = data.target

print("Predictor variables used in the model:")
print(data.feature_names)
Predictor variables used in the model:
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

Here:

  • X contains the predictor variables
  • y contains the response variable (0 or 1)

8.5.6.3 Step 3: Split the data

Before training the model, we split the data into training and testing sets so we can evaluate how well the model performs on new data.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In this case:

  • 80% of the data is used for training
  • 20% of the data is used for testing

8.5.6.4 Step 4: Train the model

Now we create the logistic regression model and train it using the training data.

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
LogisticRegression(max_iter=10000)

The model estimates the probability that the outcome is 1 given the predictor variables.

\[ P(Y = 1 \mid X) = \frac{1}{1 + e^{-z}} \]

where

\[ z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \]

The .fit() function allows the model to learn the relationship between the predictor variables and the response variable.

8.5.6.5 Step 5: Make predictions

Once the model has been trained, we can use it to predict outcomes for the testing data.

y_pred = model.predict(X_test)

These predictions are class labels:

  • 0 → predicted cancerous
  • 1 → predicted not cancerous

8.5.6.6 Step 6: Predict probabilities

Logistic regression can also estimate the probability that an observation belongs to each class.

y_prob = model.predict_proba(X_test)

print(np.round(y_prob[:5], 4))
[[1.224e-01 8.776e-01]
 [1.000e+00 0.000e+00]
 [9.984e-01 1.600e-03]
 [1.200e-03 9.988e-01]
 [1.000e-04 9.999e-01]]

Each row contains two probabilities:

  • probability of class 0
  • probability of class 1

For example, the first row of the output:

[0.1224, 0.8776]

This means the model estimates:

• 12.24% probability the tumor is cancerous
• 87.76% probability the tumor is not cancerous
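
The 0.5 threshold discussed earlier can be applied (or adjusted) directly to these probabilities; a short sketch:

# Convert probabilities into class labels with an explicit threshold
threshold = 0.5  # could be changed depending on the cost of misclassification
y_pred_custom = (y_prob[:, 1] >= threshold).astype(int)
print(y_pred_custom[:5])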

8.5.6.7 Step 7: Evaluate model performance

Finally, we evaluate how well the model performs using several common metrics.

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[39  4]
 [ 1 70]]
0.956140350877193
              precision    recall  f1-score   support

           0       0.97      0.91      0.94        43
           1       0.95      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

From the confusion matrix we can see how the model performed on the test data.

• 39 tumors were correctly predicted as cancerous

• 70 tumors were correctly predicted as not cancerous

• 4 tumors were predicted as not cancerous when they were actually cancerous

• 1 tumor was predicted as cancerous when it was actually not cancerous

In total, the model made 109 correct predictions out of 114 cases, which gives an overall accuracy of about 95.6%.

Looking at the classification report, we can also see that the model performs very well for both classes:

• For cancerous tumors (class 0), the model correctly identifies most of them with a recall of 0.91.

• For non-cancerous tumors (class 1), the recall is 0.99, meaning the model almost always correctly identifies tumors that are not cancerous.

Overall, these results suggest that the logistic regression model performs very well on this classification task and is able to correctly classify most tumors in the dataset.

8.5.7 Advantages and limitations of logistic regression

Like any statistical model, logistic regression has both strengths and limitations. Understanding these helps us decide when it is a good model to use and when another method might be better.

8.5.7.1 Advantages

• Easy to interpret

One of the biggest advantages of logistic regression is that it is relatively easy to understand compared to more complex machine learning models.

Each coefficient tells us how a predictor variable affects the probability of the outcome.

Because of this, logistic regression is often preferred when we want a model that is both predictive and interpretable.

• Produces probabilities

Unlike some classification models that only give a final class prediction, logistic regression estimates the probability that an observation belongs to a particular class.

This is useful in many real world situations where we care about risk or likelihood, not just the final classification.

• Computationally efficient

Logistic regression is also relatively fast to train and does not require extremely large amounts of data.

Because of this, it is often used as a baseline model when starting a new classification problem.

8.5.7.2 Limitations

• Assumes a linear relationship in the log-odds

Logistic regression assumes that the predictor variables have a linear relationship with the log-odds of the outcome.

If the true relationship between the predictors and the response is highly nonlinear, the model may not perform as well.

• Limited ability to capture complex patterns

More flexible models such as decision trees, random forests, or neural networks are able to capture complex nonlinear relationships that logistic regression cannot.

Because of this, logistic regression may not perform well on very complicated datasets.

• Sensitive to correlated predictors

If predictor variables are highly correlated with each other, the estimated coefficients can become unstable and harder to interpret.

This can sometimes make it difficult to understand the true effect of each predictor on the outcome.

8.5.7.3 When logistic regression works well

In practice, logistic regression tends to work well when

• the response variable is binary
• the relationship between predictors and the outcome is relatively simple
• interpretability of the model is important

Because of these properties, logistic regression is widely used in areas such as medicine, finance, and risk analysis.

8.5.7.4 Works Cited

Scikit-learn Logistic Regression Documentation

Breast Cancer Wisconsin Dataset

Logistic Regression Explained

8.6 Decision Trees

This presentation was prepared by Riley Sawyer.

8.6.1 Introduction

Decision trees (DTs) are a non-parametric supervised learning method used for classification and regression. This presentation will cover:

  1. How decision trees work

  2. The advantages and disadvantages of decision trees

  3. Classification

  4. Regression

  5. How to use decision trees in Python

  6. Remedies

8.6.2 How decision trees work

Structure of a decision tree: A decision tree has a flowchart-like structure that allows individuals to clearly visualize the decision-making process. Decision trees have three main components:

  1. Root node: This is the first node of the tree. All following nodes can be traced back to the root node.

  2. Internal nodes: These nodes represent the features used for splitting the data. These features determine the path taken from this node onward. There are two types of internal nodes:

    • Decision nodes: nodes where a specific question, attribute, or choice is made to split the data.

    • Chance nodes: these nodes represent uncertainty in the decision-making process. Multiple outcomes are possible here, so a probability is assigned to each outcome and the path is determined from there.

  3. Leaf nodes: These are the terminal nodes of the tree that represent the final output or decision. In classification tasks, leaf nodes represent class labels, while in regression tasks, they represent continuous values.

Figure: “What should I eat?” (an example decision tree)

8.6.3 Advantages and disadvantages of decision trees

Advantages

  • Simple to understand and interpret

  • Requires little data preprocessing, e.g.:

    • feature scaling

    • normalization

  • Can handle both numerical and categorical data

  • Can capture non-linear relationships

  • Can be easily validated with statistical tests

Disadvantages

  • Prone to overfitting, especially with deep trees. Can be remedied by:

    • Pruning

    • Setting a maximum depth

  • Can be unstable, as small changes in the data can lead to different tree structures

    • Can be remedied by ensemble methods like random forests
  • Can be biased towards features with more levels. May not perform well with imbalanced datasets. Can be remedied by:

    • Class weighting

    • Resampling methods

  • May not capture complex relationships as well as other models, e.g.:

    • XOR problems: where classes cannot be separated by a single linear decision boundary

    • Parity problems: determining if something is even or odd. XOR is a type of parity problem.

    • Multiplexer problems: complex, non-linear Boolean functions.

8.6.4 Classification

Decision trees can be used to classify data into different categories, e.g.:

  • Image classification

  • Spam detection

  • Customer segmentation

This uses DecisionTreeClassifier from sklearn.tree module in Python.

8.6.5 Regression

Decision trees can also be used for regression tasks, where the goal is to predict a continuous value, e.g.:

  • Predicting house prices

  • Forecasting sales

  • Estimating average customer spending

This uses DecisionTreeRegressor from sklearn.tree module in Python.
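
As a minimal sketch of the regression case, the example below fits a DecisionTreeRegressor to the California housing data; the dataset choice and the max_depth value are illustrative, not part of the original example.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load data and split into training and testing sets
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Limit tree depth to reduce overfitting
reg = DecisionTreeRegressor(max_depth=4, random_state=42)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))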

8.6.6 How to use decision trees in Python

Decision trees can be implemented in Python using the scikit-learn library. Here are other libraries you may want to use to supplement your DT usage.

Step 1: Import necessary libraries

import matplotlib.pyplot as plt
import numpy as np

Step 2: Import scikit-learn modules

Before using any decision tree model, you will need scikit-learn installed. Use the following command to install it: pip install -U scikit-learn

Now, you can use the following code to import the indicated modules:

from sklearn import tree # this is where your DTs are located
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

confusion_matrix uses a table layout and is used to evaluate the performance of a classification model. It compares the predicted labels from your trained decision tree with the true labels from the data set.

  • It is a summary of the correct and incorrect predictions.

accuracy_score is a metric used to evaluate the performance of a classification model. It calculates the ratio of correctly predicted instances to the total number of instances in the dataset.

classification_report is a function that provides a detailed report of the precision, recall, F1-score, and support for each class in a classification model. It helps to evaluate the performance of the model for each class.

  • precision: the proportion of the model’s positive predictions that are actually positive (true positives vs. all predicted positives).

  • recall: the proportion of actual positive instances in the dataset that the model correctly identifies (true positives vs. all actual positives).

  • F1-score: the harmonic mean of precision and recall, combining both into a single number (a short example follows below).
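As a quick illustration (the precision and recall values here are arbitrary), the harmonic mean can be computed directly:

precision, recall = 0.80, 0.50
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.615 -- lower than the simple average of 0.65

The harmonic mean penalizes a large gap between precision and recall more than an ordinary average would.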

train_test_split is a function used to split a dataset into training and testing sets. It allows you to decide a random state for reproducibility.

plot_tree is a function used to visualize a decision tree model. It can help you understand how the model makes predictions and identify important features in the dataset.

Step 3: Name your decision tree model

Based on your statistical question, you will know which type of decision tree you want to call.

For classification:

clf = tree.DecisionTreeClassifier()

For regression:

reg = tree.DecisionTreeRegressor()

Step 4: Load data

For this example, we will be using the iris dataset. It contains 150 samples of iris flowers, with 4 features (sepal length, sepal width, petal length, petal width) and a target variable (species of iris).

from sklearn.datasets import load_iris
iris = load_iris()

Assign your features to X and your target to y. By convention, the feature data is always X and the target variable is always y.

X = iris.data
y = iris.target

Always split your data into training and testing sets. If you skip this step, you will not be able to evaluate the performance of your model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • train_test_split splits the data into training and testing sets.
  • test_size indicates how much of the dataset you want in your testing group (here, it is 20%).
  • random_state ensures reproducibility. This can be any integer.

Step 5: Fit your model

Also known as “training” your model. Fit your model only on the training set.

clf.fit(X_train, y_train)
DecisionTreeClassifier()
  • fit is used to train a model on training data. It takes the following format: your_model_type.fit(X_train, y_train)

Step 6: Make predictions

Now that your model is trained, you can use it to predict the classifications from the testing set. Save the predictions that your model makes in a variable you can call upon later. Here, we will call it y_pred.

y_pred = clf.predict(X_test)
  • predict is used to make predictions on new data. It takes a format as follows: your_model_type.predict(X_test)

Step 7: Evaluate your model

It is good practice to test the performance of your model.

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

You can also show the decision tree itself.

plt.figure(figsize=(12,8))
plot_tree(clf,
        feature_names=iris.feature_names, # refers to x-variables
        class_names=iris.target_names,    #refers to y-variables
        filled=True)    # colors the nodes to indicate the majority class
plt.show()

8.6.7 Remedies

Overfitting

When a decision tree is too complex, it may fit the training data too well and be unable to predict new data well.

You can check whether you have overfit your data by comparing the performance of your model on the training data vs. the testing data.

train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)

print(f"Training score: {train_score}")
print(f"Testing score: {test_score}")
Training score: 1.0
Testing score: 1.0
  • .score is used to evaluate the performance of a model on a given dataset. It takes a format as follows: your_model_type.score(X, y)

If your training score and testing score are similar, your model has not been overfit. If your training score is much higher than your testing score, your model has likely overfit the data.

Here are some remedies for overfitting:

  1. Pruning: This involves removing branches of the tree that do not significantly improve your model’s predictions. This can be done by setting a minimum number of samples required to split a node. DecisionTreeClassifier(min_samples_split=n)

  2. Setting a maximum depth: This limits the depth of the tree, preventing it from becoming too complex and overfitting the data. A good place to start is n=3. DecisionTreeClassifier(max_depth=n)

  3. Setting minimum samples per leaf: This sets a minimum number of samples required to be at a leaf node. This can prevent the tree from creating a branch that only applies to a small number of samples, which can lead to overfitting. DecisionTreeClassifier(min_samples_leaf=n)

To find out what number you should set for these parameters, you can use GridSearchCV from sklearn.

from sklearn.model_selection import GridSearchCV

GridSearchCV allows you to set a range of values for specified parameters and test each combination for the best performing model.

You must specify the parameters you want to test and the values you want to test for. Your parameters should be stored in a dictionary:

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 5, 10]
    }

Now, you must call GridSearchCV and specify the model, the parameter grid, as well as the number of cross folds and the scoring metric.

grid_search = GridSearchCV(
    estimator = clf, # the model you want to test
    param_grid = param_grid, # the parameters
    cv = 5, # the number of cross folds
    scoring = 'accuracy' # the scoring metric
    )

Grid search must now be fit to your data, like so:

grid_search.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [3, 5, 10, None],
                         'min_samples_leaf': [2, 5, 10],
                         'min_samples_split': [2, 5, 10]},
             scoring='accuracy')

Grid search actually records many different attributes of each candidate model; see the scikit-learn documentation for GridSearchCV for the full list. However, we only tested three parameters: max_depth, min_samples_split, and min_samples_leaf.

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
Best parameters: {'max_depth': 3, 'min_samples_leaf': 5, 'min_samples_split': 2}
Best score: 0.95

Now, you can create a new model including these updated parameters.
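For example, a minimal sketch that reuses the grid_search object fitted above (the variable name best_clf is just illustrative):

# Build and fit a new tree using the best parameters found by GridSearchCV
best_clf = DecisionTreeClassifier(**grid_search.best_params_, random_state=42)
best_clf.fit(X_train, y_train)
print("Test accuracy:", best_clf.score(X_test, y_test))

Equivalently, grid_search.best_estimator_ already holds a copy of the model refit with those parameters.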

  4. Changing the criterion: This involves changing the criterion used to split the data at each node. Changing the criterion can help to reduce overfitting by making the tree less sensitive to small changes in the data. DecisionTreeClassifier(criterion='gini')
    • gini impurity: the probability of misclassifying a randomly chosen element from the dataset
    • entropy: the average amount of information needed to classify an element from the dataset
    • log_loss: the negative log-likelihood of the true labels given the predicted probabilities from the model. Used when the output is a probability distribution over multiple classes.
    For regression, the analogous options are passed to DecisionTreeRegressor(criterion='squared_error'):
    • squared_error: the mean squared error between the predicted values and the true values
    • friedman_mse: the mean squared error between the predicted values and the true values, but with a bias towards reducing the variance of the predictions
    • absolute_error: the mean absolute error between the predicted values and the true values
    • poisson: the mean Poisson deviance between the predicted values and the true values, used for count data
  5. Using ensemble methods: Ensemble methods (e.g., random forests) combine multiple decision trees to make predictions, thus generalizing the model and reducing the risk of overfitting.

Unbalanced datasets

When one class is much more prevalent than the other, the decision tree may be biased towards the majority class.

You can tell if a dataset is unbalanced by counting the number of instances in each class. If one class has significantly more instances than the others, the dataset is likely unbalanced. Note that scikit-learn decision trees use np.float32 arrays internally.

unique, counts = np.unique(y, return_counts=True) # counts the number of instances in each class
num_in_class = {int(k): int(v) for k, v in zip(unique, counts)} # readable format
print(num_in_class)
{0: 50, 1: 50, 2: 50}

Here are some remedies for unbalanced datasets:

  1. Class weighting: This involves assigning higher weights to the minority class during the training process, which can help the model pay more attention to the minority class and improve its performance on that class. DecisionTreeClassifier(class_weight='balanced')

  2. Resampling methods: This involves either oversampling the minority class (creating synthetic samples) or undersampling the majority class (removing samples) to balance the dataset.

    • Oversampling: from imblearn.over_sampling import SMOTE
    • Undersampling: from imblearn.under_sampling import RandomUnderSampler
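Here is a minimal sketch of both remedies (the iris data used above is already balanced, so this is purely illustrative; SMOTE requires the separate imbalanced-learn package):

# Remedy 1: class weighting, inversely proportional to class frequencies
clf_weighted = DecisionTreeClassifier(class_weight='balanced', random_state=42)
clf_weighted.fit(X_train, y_train)

# Remedy 2: oversample the minority class with SMOTE, then fit on the balanced data
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
clf_resampled = DecisionTreeClassifier(random_state=42).fit(X_res, y_res)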

8.6.8 Further readings

scikit-learn modules

Decision tree implementation

What is a decision tree?

Work management topics: decision trees

8.7 Random Forests

This presentation was prepared by Sara Watanabe.

8.7.1 Overview

Random forest is a machine learning algorithm that uses a combination of bagging, feature importance, and complex decision trees to make predictions. The algorithm combines a collection of decision trees into a single result, and it can be used for both classification and regression on the input data.

8.7.2 Decision Trees

A Decision Tree is a method that uses a flowchart-like structure to answer a question through a series of simple decisions. Decision trees can output either a categorical or numerical result, and are constructed using nodes and branches.

8.7.2.1 Structure

Root Node: Asks a question about data, contains all training samples

Internal nodes: Where decisions are made about data features

Leaf Nodes: Tree end points; the final decision/prediction

Branches: Lines connecting the nodes

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4.5, 2))

# Turn off axes
ax.axis('off')

# Helper function to draw a labeled box
def draw_box(x, y, text, color):
    ax.text(
        x, y, text,
        ha='center', va='center',
        fontsize=8, fontweight='bold',
        bbox=dict(boxstyle="round,pad=0.4", facecolor=color, edgecolor='black')
    )

# Root node
draw_box(0.5, 0.9, "Root Node", "#2e7d32")

# Internal nodes
draw_box(0.25, 0.6, "Internal Node", "#c5e1a5")
draw_box(0.75, 0.6, "Internal Node", "#c5e1a5")

# Leaf nodes
draw_box(0.1, 0.3, "Leaf Node", "#fff9c4")
draw_box(0.4, 0.3, "Leaf Node", "#fff9c4")
draw_box(0.6, 0.3, "Leaf Node", "#fff9c4")
draw_box(0.9, 0.3, "Leaf Node", "#fff9c4")

# Helper function to draw a line
def connect(x1, y1, x2, y2):
    ax.plot([x1, x2], [y1, y2])

# Connections from root
connect(0.5, 0.85, 0.25, 0.65)
connect(0.5, 0.85, 0.75, 0.65)

# Connections from left internal
connect(0.25, 0.55, 0.1, 0.35)
connect(0.25, 0.55, 0.4, 0.35)

# Connections from right internal
connect(0.75, 0.55, 0.6, 0.35)
connect(0.75, 0.55, 0.9, 0.35)

plt.show()

8.7.2.2 Iris Dataset Example

8.7.2.2.1 Dataset Overview

The Iris dataset has 150 samples of iris flowers. This data measures four features: sepal length, sepal width, petal length, and petal width. These features are used by the Decision Tree algorithm to classify the samples into three classes: setosa, versicolor, and virginica.

from sklearn.datasets import load_iris
import pandas as pd

# Load data
iris = load_iris()
X = iris.data       # all features
y = iris.target     # all classes (setosa, versicolor, virginica)

df = pd.DataFrame(X, columns=iris.feature_names)
df["species"] = iris.target_names[y]

df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa

8.7.2.2.2 Modelling the Data

Here we initialize the Decision Tree Classifier and fit it to the data. The Decision Tree Classifier is imported from the sklearn package, and matplotlib is imported for later visualization.

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Initialize the model with Gini impurity (default)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
# Fitting the model calculates Gini values for all possible splits
model.fit(X, y)
DecisionTreeClassifier(max_depth=3, random_state=42)

8.7.2.3 Growth-Stop Conditions

Decision trees are prone to overfitting: as a tree grows, it can fragment the data into small categories that cannot be generalized to new data. To prevent this, growth-stop conditions can be passed as parameters to the model (see the sketch after this list). These include:

  • max_depth: Sets the maximum depth of the tree, which limits how many splits the tree can make. In this case, we set it to 3.
  • min_samples_split: Sets the minimum number of samples required to split a node. If a node has fewer samples than this threshold, it will not be split further.
  • min_impurity_decrease: Sets the minimum reduction in the Gini Coefficient required to split a node. If the reduction is less than this value, the node will not be split.
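A minimal sketch combining the three conditions (the values shown are arbitrary starting points, not tuned for this dataset):

# Growth-stop conditions are passed to the constructor
constrained_model = DecisionTreeClassifier(
    max_depth=3,                 # at most 3 levels of splits
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_impurity_decrease=0.01,  # a split must reduce impurity by at least 0.01
    random_state=42
)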

8.7.2.4 Gini Coefficient

In order for the Decision Tree Classifier to determine the best splits, it uses a metric to evaluate the quality of splits. The Gini Coefficient is used as the default metric, and is calculated as: \[ \mathrm{Gini}(p) = 1 - \sum_{i=1}^{n} p_i^2 \]

Where \(p_i\) is the proportion of samples belonging to class \(i\) at a given node.

Gini can range from 0 to 0.5 for binary classification, where 0 indicates a pure node and 0.5 indicates a completely impure node (ex. 50% True and 50% False).
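As a quick check of the formula, here is a small helper (not part of scikit-learn) that computes Gini impurity from class proportions:

def gini(proportions):
    # Gini(p) = 1 - sum(p_i^2)
    return 1 - sum(p ** 2 for p in proportions)

print(gini([1.0]))            # 0.0   -> pure node
print(gini([0.5, 0.5]))       # 0.5   -> completely impure binary node
print(gini([1/3, 1/3, 1/3]))  # 0.667 -> three equally mixed classes, like the iris root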

8.7.2.5 Training Process

Decision Trees are built by recursively splitting training samples in a series of steps:

Step 1. Start with the root node; this contains all training samples.

# Root node contains all 150 samples

print("Number of samples at root:", X.shape[0])
class_dist = {str(name): int(sum(y == i)) for i, name in enumerate(iris.target_names)}
print("Class distribution at root:", class_dist)
Number of samples at root: 150
Class distribution at root: {'setosa': 50, 'versicolor': 50, 'virginica': 50}

Step 2. The algorithm evaluates calculated metrics to determine the features that best split the data. In this case, the Gini Coefficient is used. The smaller the Gini value, the better the split.

# Step 2: Evaluate metrics to find the best split

# Show Gini at root and splits (access tree structure)

tree = model.tree_
print("Gini at root:", tree.impurity[0])
print("Gini at left child of root:", tree.impurity[1])
print("Gini at right child of root:", tree.impurity[2])
print("Number of samples at root:", tree.n_node_samples[0])
Gini at root: 0.6666666666666667
Gini at left child of root: 0.0
Gini at right child of root: 0.5
Number of samples at root: 150

Internally, all possible Gini values are calculated for each feature and threshold at each step, and the split uses the feature and threshold that produce the lowest Gini value.

Step 3. The data is then split based on the selected feature and threshold value.

# You can see which feature and threshold were chosen at root
root_feature = tree.feature[0]
root_threshold = tree.threshold[0]

print("Best feature at root:", iris.feature_names[root_feature])
print("Threshold at root:", root_threshold)
print("Samples go left if feature < threshold, else right")
print("\n")
Best feature at root: petal length (cm)
Threshold at root: 2.449999988079071
Samples go left if feature < threshold, else right

Step 4. Steps 2 and 3 are repeated until growth-stop conditions are met. These include a maximum depth of the tree, a minimum number of samples required to split a node, and a minimum reduction in the Gini Coefficient required to split a node. The lower the Gini value, the better.

# max_depth=3 and minimum samples per split control stopping

print("Max depth set:", model.max_depth)
print("Tree stops splitting when nodes reach max depth or minimum samples")
Max depth set: 3
Tree stops splitting when nodes reach max depth or minimum samples

The resulting tree is visualized below:

# 4. Plot the tree with smaller size
plt.figure(figsize=(6.5, 4.5))  # adjust width and height
plot_tree(
    model, 
    feature_names=iris.feature_names, 
    class_names=iris.target_names, 
    filled=True,
    fontsize=7  # smaller font
)
plt.show()

  • Root Node: Contains all 150 samples, 50 in each class, with a Gini Coefficient of 0.667.
  • Internal Nodes:
    • The first split uses petal length with a threshold of 2.45.
    • The left split uses petal width with a threshold of 0.8.
    • The right split uses petal width with a threshold of 1.75.
  • Leaf Nodes: The Gini Coefficients at the leaf nodes are considerably small, indicating that the samples at these nodes are mostly of one class.

Seen in each box is the calculated Gini Coefficient, the number of samples at the node, and the distribution of classes at the node.

8.7.2.6 Benefits and Limitations of Decision Trees

Benefits

  • Easy to understand and interpret
  • Computationally efficient
  • Time and space efficient

Limitations

  • Decision Trees are prone to bias and overfitting
    • As a tree grows in size, it can result in data fragmentation
    • Using the random forest algorithm reduces the risk of making extreme predictions due to overfitting
  • Decision Trees tend to be kept small to remain interpretable, which makes it hard for a single tree to handle many complex, interacting variables

8.7.3 Random Forest Classifier

Random Forest is an algorithm that uses a combination of bagging, feature importance, and complex decision trees to make predictions.

8.7.3.1 Characteristics:

  • While singular Decision Trees consider the use of all possible feature splits, Random Forest only considers a subset of features for each split.
  • Random Forest uses random features to make splits, which reduces correlation between trees and improves generalization
  • Random Forest can be used for classification or regression problems
    • classification uses the majority vote of all trees to come to a conclusion
    • regression uses the average of all trees to make a prediction

8.7.3.2 Iris Dataset Example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X = iris.data
y = iris.target

df = pd.DataFrame(X, columns=iris.feature_names)
df["species"] = iris.target_names[y]

df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa

8.7.3.3 Bagging

Random Forest uses a combination of bootstrap and aggregation, also known as bagging, during the creation process.

  • Bootstrap: A resampling technique that uses sampling with replacement to create training sets
    • A training set is selected with replacement for each Decision Tree made
    • The unselected values will be set aside as part of the out-of-bag samples
    • Individual Decision Trees are then trained independently using respective training sets, selecting their own features for each split
  • Aggregation: The process of combining the predictions of all the trees created by the random forest algorithm

Roughly a third of the training data is not drawn into any given bootstrap sample; these left-out rows form that tree’s out-of-bag sample and are used for validation later on (see the sketch below).
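A small numpy sketch of drawing one bootstrap sample and identifying its out-of-bag rows (illustrative only; scikit-learn handles this internally):

import numpy as np

rng = np.random.default_rng(42)
n = X.shape[0]                                        # 150 iris samples
bootstrap_idx = rng.integers(0, n, size=n)            # draw n rows with replacement
oob_idx = np.setdiff1d(np.arange(n), bootstrap_idx)   # rows never drawn are out-of-bag

print("Unique rows in bootstrap sample:", len(np.unique(bootstrap_idx)))
print("Out-of-bag rows:", len(oob_idx))               # typically about a third of the data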

8.7.3.4 Tree construction

Step 1. Create the training and testing sets

  • These will be used to create bootstrap samples and out-of-bag samples for each tree during the fitting process
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Step 2. Create Random forest

  • Assign a random set of features to each dataset
    • Set parameters; n_estimators is the number of trees that will be created
# Create Random Forest model
model = RandomForestClassifier(
    n_estimators=100,   # number of trees
    random_state=42
)

Step 3. Fit the model: Train each tree

  • Use the best feature to split the data
    • The feature that makes the clearest separation is used
  • Trees continue to grow until growth-stop conditions are met
# Train model (same idea as decision tree fit)
model.fit(X_train, y_train)
RandomForestClassifier(random_state=42)

Internal Process:

  1. Bootstrap samples are created for each tree from the training set created in step 1
  2. A random subset of features is selected to consider at each split
  3. The best feature is selected to split the data, and the best threshold value is also determined
  4. The data is split based on the selected feature and threshold value
  5. Repeat steps 2–4 until growth-stop conditions are met (e.g., max depth, minimum samples per split, minimum impurity decrease)

Step 4. Make Predictions

# Make predictions
y_pred = model.predict(X_test)

Step 5. Test all of the trees with the out-of-bag set to determine how good the model is

  • You can adjust how many random variables are considered, number of trees, etc. in an attempt to improve accuracy
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 1.0

Our accuracy score is 1.0, which means that all of our predictions were correct. This is a good score, but it is important to note that this is likely due to the simplicity of the dataset and the fact that we have a small number of features. Adjusting things like the number of trees, the number of features considered for each split, and the depth of the trees can help to improve accuracy on more complex datasets.

8.7.3.5 Feature Selection

Feature Randomization: Selection of a random subset of features for each split

  • Reduces correlation between trees and improves generalization
  • The Random Forest algorithm uses this during the training process
  • The number of features that are considered can be changed to improve model accuracy; the default value is the square root of the total number of features (illustrated below)
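For instance, a one-line sketch of changing this setting (the 'sqrt' value shown is the classifier’s default):

# Consider only sqrt(n_features) randomly chosen candidate features at each split
rf_sqrt = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)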

Feature Importance: The relative importance of each feature in the data

  • Tells what features are the most important in making accurate predictions
  • Random Forest has built-in feature importance
  • Feature importance is measured as the average decrease in impurity across all trees
    • if a feature splits the data into “purer” subsets, it is considered more important
    • Can also consider how much a feature is used and how much it contributes to the overall predictions of the algorithm
plt.figure()
plt.bar(iris.feature_names, model.feature_importances_)
plt.xticks(rotation=45)
plt.title("Feature Importance (Random Forest)")
plt.show()

The horizontal axis shows each feature, and the vertical axis shows the average reduction in the Gini Coefficient caused by splits using that feature across all trees.

As shown in the figure, petal length and petal width are the most important features for making accurate predictions in this dataset. This means that they created the most “pure” splits in the data, and were used more often in the Decision Trees to make accurate predictions.

8.7.3.6 Benefits and Limitations of Random Forest

Benefits

  • “Wisdom of crowds”: even if a few trees make a mistake, the overall collective decision will mitigate these mistakes
  • Reduces the risk of overfitting compared to a single Decision Tree through feature randomization and bagging
  • Can handle large datasets with higher dimensionality and complex interactions between features

Limitations

  • Decision Trees are simpler and can be easier to interpret
  • Random Forest can be computationally intensive, especially with a large number of trees and features
  • Training time can get long with large datasets, and it may require more memory to store the multiple trees

8.7.3.7 Regression example

Now we are going to use the Random Forest regression model to predict the median house value per neighborhood in California.

Step 1. Load the dataset and split into training and testing sets

# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Print out the first few rows of the dataset
df = pd.DataFrame(X, columns=data.feature_names)
df["target"] = y
print(df.head())

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  

Our x-values will be all of the features in this dataset (median income, house age, average number of rooms, average number of bedrooms, population, average occupancy, latitude, and longitude), and the target variable is the median house value per neighborhood. Based on my knowledge of housing prices, I expect median income to be the most important feature for making accurate predictions.

Step 2. Create the Random Forest regression model, train it, and make predictions

# Create Random Forest regression model
model = RandomForestRegressor(
    n_estimators=100,   # number of trees
    random_state=42
)

# Train model
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

Step 3. Evaluate the performance of the model using mean squared error

# Evaluate performance
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)
Mean Squared Error: 0.2557259876588585

Mean squared error tells us the average squared difference between the predicted values and the actual values. In this case, an MSE of about 0.256 indicates our model is doing a pretty good job at predicting the median house values, but if we wanted to improve our model’s accuracy we could adjust parameters such as the number of trees, the number of features considered for each split, and the depth of the trees.

Step 4. Evaluate feature importance to determine which features are most important in making accurate predictions

plt.figure()
plt.bar(data.feature_names, model.feature_importances_)
plt.xticks(rotation=45)
plt.title("Feature Importance (Random Forest Regression)")
plt.show()

As seen in the figure above, median income is the most important feature for making accurate predictions in this dataset, as predicted. This means that it created the most “pure” splits in the data, and was used more often in the Decision Trees to make accurate predictions.

8.7.4 Further Reading

Brief overview of Random Forest

In-depth Walkthrough and Example

Step-By-Step Guide with Code Examples

General Overview of Random Forest

Explanation of feature importance with examples

In-depth explanation of feature importance

Step-by-step guide to the random forest training process

Explanation of the gini metric


9 Gradient Boosting

This presentation was prepared by Violet LaCasse.

9.1 Introduction

This presentation will:

  • Define Gradient Boosting (and boosting)

  • Provide examples of Gradient Boosting in regression and Gradient Boosting in classification

  • Discuss the advantages and disadvantages of Gradient Boosting

9.2 What is Boosting?

Boosting: an iterative ensemble learning technique that takes multiple weaker predictive models to create a single stronger model

  • Ensemble learning: combining multiple models to create a new (better) one

  • Models are also referred to as “learners”

    • A “weak learner” is only slightly better than random guessing

Boosting improves predictive accuracy by having each new model correct the mistakes of the previous one through varying methods.

9.2.1 Major Steps in (Adaptive) Boosting:

  1. Selecting initial weights

Here, we’re looking at Adaptive Boosting, which uses weights as a corrective measure.

All data points in the training set are given equal initial weights.

  2. Sequential training

The first model is trained using the previously weighted data. The performance of this model is usually weak (a “weak learner”).

So, we take the data points that the model didn’t handle well and give them higher weights than points that were identified correctly, assigning them more importance in the next model.

  3. Iterating

Having addressed the mistakes from the previous model, this next model’s performance is evaluated. Misclassified data points have their weights increased.

This process is repeated multiple times until there is an ensemble of weak models with varying predictive accuracies.

  4. Aggregating

The predictions of all the weak models in the ensemble are aggregated to create a final predictive model.

Models that have higher predictive accuracy relative to the others in the ensemble have more influence on the final model.

Finally, the resulting model can be used with the test set to evaluate the model’s overall performance.

Boosting can be contrasted with Bagging (see Random Forests).
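To make these steps concrete, here is a minimal scikit-learn sketch of adaptive boosting (AdaBoostClassifier, whose default weak learner is a one-level decision tree; the dataset and parameter values are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=3255)

# Each new weak learner focuses on the points the previous ones misclassified (reweighting),
# and the final prediction is a weighted vote across all learners.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=3255)
ada.fit(train_X, train_y)
print("Test accuracy:", ada.score(test_X, test_y))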

9.3 What is Gradient Boosting?

Gradient Boosting: a type of boosting that relies on residuals to minimize the loss function through descent

  • Loss function: a measure of how good or bad a model is at predicting a value compared to the actual value

    • Commonly MSE (mean squared error) for regression

    • Commonly cross-entropy aka log loss (difference between predicted probability distribution and actual probability distribution) for classification

    • Smaller loss means more accurate predictions

  • Gradient descent: an optimization algorithm that minimizes the loss function (prediction error)

    • Typically minimizes loss by changing parameters (weight/bias), but this is not the case in Gradient Boosting

    • (Negative) Gradient: the slope of the loss function that tells the algorithm which direction to “step” to minimize the loss function

      • Determined by pseudo-residuals in Gradient Boosting
    • Learning rate: dictates the size of each iterative “step” the algorithm takes to minimize the loss function

      • Too small: waste of computational resources

      • Too big: might miss the minimum of the function entirely

      • Also used to prevent overfitting

9.3.1 Major Steps in Gradient Boosting:

  1. Creating an initial prediction

Using a training set, we create a model where the initial prediction of the target variable for all data points is a constant, so:

\[ F_{0}(x) = \bar{y} \]

Usually, the models used for Gradient Boosting are decision trees because they’re fairly versatile.

  2. Calculating (pseudo-) residuals

Then, we calculate the residuals (or pseudo-residuals as they’re referred to in gradient boosting) for each data point in the dataset. When using MSE as the loss function, the pseudo-residual is:

\[ r_{im} = y_{i} - F_{m-1}(x_{i}) \]

or \[ r_{im} = y_{i} - \hat{y}_{i} \] (actual minus predicted)

where:

  • i = 1, …, n (n is the number of observations in the sample)

  • m = 1, …, M (M is the total number of iterations/models)

So for this first iteration/model, because the predicted value is \(\bar{y}\):

\[ r_{i1} = y_{i} - \bar{y} \]

  3. Creating a new model (and shrinkage)

Gradient Boosting minimizes the loss function by sequentially having each new model fix the errors of the previous model. So, the pseudo-residuals of the previous model become the target variable. The features stay the same.

Over time, as the models get better at predicting the pseudo-residuals, error decreases, because we are “following” the pseudo-residuals (also referred to as the negative gradient for this reason) down the loss function.

This is how each new model learns to fix the errors of the previous model and how the loss function gets minimized.

Gradient Boosting is prone to overfitting. To help combat this, before adding a model to the ensemble, the model’s predictions are scaled according to the learning rate (a process called shrinkage). This process weights each model’s contribution to the final prediction.

  4. Iterating

This process is repeated and continues until we want it to stop (i.e. after a certain number of iterations, or until loss stops decreasing).

  5. Aggregating

Ultimately, the predictions of all models in the ensemble are combined. This is what the final prediction of the initial target variable looks like:

\[ y_{pred} = F_{0}(x) + \eta \times r_{1} + \eta \times r_{2} + ... + \eta \times r_{n} \]

where:

  • \(\eta\) is the learning rate

  • \(r_{n}\) is the predicted residuals from the nth model (tree, if you’re using decision trees).
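To tie these steps together, here is a minimal from-scratch sketch of gradient boosting for regression with MSE loss, using decision stumps as the weak learners on arbitrary toy data (it mirrors the formulas above rather than scikit-learn’s implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=200)

eta, M = 0.1, 100                      # learning rate and number of boosting rounds
F = np.full_like(y, y.mean())          # Step 1: initial prediction F0(x) = mean of y
trees = []
for m in range(M):
    residuals = y - F                            # Step 2: pseudo-residuals for MSE loss
    stump = DecisionTreeRegressor(max_depth=1)   # Step 3: fit a weak learner to the residuals
    stump.fit(x, residuals)
    F += eta * stump.predict(x)                  # shrinkage: scale each contribution by eta
    trees.append(stump)

# Step 5: aggregate -- predictions are F0 plus the scaled contributions of every tree
def boosted_predict(x_new):
    return y.mean() + eta * sum(t.predict(x_new) for t in trees)

print("Training MSE:", np.mean((y - F) ** 2))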

9.4 Gradient Boosting: Regression

Both functions for Gradient Boosting Regression and Classification can be found in the sklearn library.

For an example of Gradient Boosting in regression, we will be using the sklearn California housing dataset.

import sklearn
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame = True)

Basic info about the dataset:

housing.frame.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
housing.frame.shape
(20640, 9)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = housing.data
y = housing.target # MedHouseVal

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 3255)
gb_reg = GradientBoostingRegressor(loss = 'squared_error',
    learning_rate = 0.1, # "step size"
    n_estimators = 400, # number of iterations/models (trees in this case)
    max_depth = 5,
    random_state = 3255,
    n_iter_no_change = 50)

gb_reg.fit(train_X, train_y)
GradientBoostingRegressor(max_depth=5, n_estimators=400, n_iter_no_change=50,
                          random_state=3255)
pred_y = gb_reg.predict(test_X)

mse_reg = mean_squared_error(test_y, pred_y)
r2_reg = r2_score(test_y, pred_y)

print(f'R2: {r2_reg:.2f}')
print(f'RMSE: {(mse_reg ** (1/2)):.2f}')
R2: 0.84
RMSE: 0.47

R-squared (amount of variation in MedHouseVal explained by the features) is decent, and given the context of the problem and the values of the target variable, the RMSE isn’t so bad either.

To see improvement, you may want to try the GridSearchCV function in order to find which combinations of hyperparameters yield the best results (however this is very computationally expensive).
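A minimal sketch of that tuning (the parameter ranges are arbitrary, and the search refits the model many times, so it can be slow):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [200, 400],
    'max_depth': [3, 5]
}
search = GridSearchCV(GradientBoostingRegressor(random_state=3255),
                      param_grid, cv=3, scoring='neg_mean_squared_error')
search.fit(train_X, train_y)
print("Best parameters:", search.best_params_)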

9.5 Gradient Boosting in Classification

For an example of Gradient Boosting in classification, we will be using the sklearn Wisconsin breast cancer dataset.

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer(as_frame = True)
cancer.frame.head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

cancer.frame.shape
(569, 31)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X = cancer.data
y = cancer.target

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 3255)
gb_clf = GradientBoostingClassifier(loss = 'log_loss',
    learning_rate = 0.1, # "step size"
    n_estimators = 400, # number of iterations/models (trees in this case)
    max_depth = 5,
    random_state = 3255,
    n_iter_no_change = 50)

gb_clf.fit(train_X, train_y)
GradientBoostingClassifier(max_depth=5, n_estimators=400, n_iter_no_change=50,
                           random_state=3255)
pred_y = gb_clf.predict(test_X)

gb_acc = accuracy_score(test_y, pred_y)

print(f'Accuracy score: {gb_acc}')
Accuracy score: 0.9473684210526315

The accuracy score is fairly high, so we can say that this Gradient Boosting model is quite good at predicting whether a patient has breast cancer or not with the given features.

9.6 Advantages and Disadvantages

Advantages:
  • Good for both regression and classification

  • Good with large/imbalanced data

  • Typically very good predictive accuracy

Disadvantages:
  • Prone to overfitting with too many boosting rounds

    • This is why tuning hyperparameters is important
  • Can be very computationally expensive

9.7 Conclusion

Essentially, Gradient Boosting is an algorithm that uses residuals to minimize overall error by sequentially fixing previous errors. With the right tuning, it can be a very powerful predictive tool.

9.8 References

https://www.geeksforgeeks.org/machine-learning/boosting-in-machine-learning/

https://www.geeksforgeeks.org/machine-learning/ml-gradient-boosting/

https://www.datacamp.com/tutorial/guide-to-the-gradient-boosting-algorithm

https://apxml.com/courses/getting-started-with-gradient-boosting-algorithms/chapter-2-the-gradient-boosting-machine/gbm-algorithm-steps


9.10 Support Vector Machines

This presentation was prepared by Kyle Burns.

9.10.1 Introduction

Support Vector Machines (SVMs) are supervised machine learning models that take in classified data and mathematically divide the data into sections. Because SVMs are supervised models, they need a labeled dataset to learn from before being applied. Given a new set of data points, SVMs can quickly and easily classify those new points, which makes them very good at problems like image classification, face detection, and text categorization.

9.10.2 Definitions

9.10.2.1 Hyperplanes

In the general sense, a hyperplane is a flat, n-1 dimensional subspace that separates an n dimensional space into two half-spaces. In the case of SVMs, these spaces are often labeled 1 and -1.

  • In a 2D visualization, a hyperplane is simply a line.
  • In a 3D space, it is a plane.

9.10.2.2 Margins

Extending perpendicularly out from the hyperplane are its margins. They extend out until they hit a point.

9.10.2.3 Support Vectors

Support vectors are the data points that lie on the margins of the hyperplane. These points heavily influence the resulting optimal hyperplane.

9.10.3 How It Works

A hard-margin linear SVM works by finding the hyperplane with the largest possible margins. It does this by starting with a random seed where the hyperplane is in a random location and orientation. Next, this position is evaluated, and an SVM algorithm iteratively adjusts the hyperplane until it separates the two classified groups. Then, it maximizes the width of its margins, at which point it becomes an optimal margin classifier.

9.10.4 Python Implementation

9.10.4.1 Importing the Dataset

For demonstration, I’m using this simple penguin dataset. The dataset contains physical characteristics and basic information on 3 different species of penguins: Gentoo, Adelie, and Chinstrap.

import pandas as pd
import ssl

ssl._create_default_https_context = ssl._create_unverified_context
url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
df = pd.read_csv(url)

df.head()
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007

9.10.4.2 Filtering the Data

For the purposes of this demonstration, I will only be using the columns species, bill_depth_mm, and body_mass_g, and only on the species Gentoo and Adelie. We will filter our dataset accordingly.

# Select features and target
features = df[["bill_depth_mm", "body_mass_g"]].copy()
target = df["species"].copy()

# Filter to Adelie and Gentoo only
species_filter = target.isin(["Adelie", "Gentoo"])
features = features.loc[species_filter].copy()
target = target.loc[species_filter].copy()

# Drop missing values
mask = features.notna().all(axis=1) & target.notna()
features = features.loc[mask]
target = target.loc[mask]

9.10.4.3 Fitting the Model

Now, we can fit the SVM model to the data and output some basic facts about the support vectors of this model.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

# Train a linear SVM classifier
svm = SVC(kernel="linear", C=1e10, random_state=37)
svm.fit(X_scaled, target)

# Info on support vectors
print("Number of support vectors:", len(svm.support_))
print("Support vector indices:", svm.support_)
print("Support vector classes:")
for idx in svm.support_:
    print(f"Index {idx}: {target.iloc[idx]}")
Number of support vectors: 3
Support vector indices: [ 80 165 188]
Support vector classes:
Index 80: Adelie
Index 165: Gentoo
Index 188: Gentoo

9.10.4.4 Plotting the Model

# Scatter plot of the data and decision boundaries
plt.figure(figsize=(8, 6))

# Plot the points
species_order = ["Adelie", "Gentoo"]
colors = ["red", "blue"]
for species_name, color in zip(species_order, colors):
    species_mask = target == species_name
    plt.scatter(
        features.loc[species_mask, "bill_depth_mm"],
        features.loc[species_mask, "body_mass_g"],
        label=species_name,
        color=color,
        alpha=0.8,
        edgecolor="k",
        s=60,
    )

# Plot support vectors
support_vectors = scaler.inverse_transform(svm.support_vectors_)
plt.scatter(
    support_vectors[:, 0],
    support_vectors[:, 1],
    facecolors="green",
    s=60,
    alpha=0.4,
    linewidths=1.5,
    label="Support vectors",
)

# Plot optimal hyperplane
w = svm.coef_[0]
b = svm.intercept_[0]
w_orig = w / scaler.scale_
b_orig = b - (w * scaler.mean_ / scaler.scale_).sum()

x_vals = np.linspace(features["bill_depth_mm"].min(), features["bill_depth_mm"].max(), 100)
y_vals = -(w_orig[0] * x_vals + b_orig) / w_orig[1]
plt.plot(x_vals, y_vals, color="black", linestyle="--", linewidth=2, label="Optimal hyperplane")

# Plot the margins
y_vals_pos = -(w_orig[0] * x_vals + (b_orig - 1)) / w_orig[1]
plt.plot(x_vals, y_vals_pos, color="gray", linestyle="--", linewidth=1, label="Margins")

y_vals_neg = -(w_orig[0] * x_vals + (b_orig + 1)) / w_orig[1]
plt.plot(x_vals, y_vals_neg, color="gray", linestyle="--", linewidth=1)

plt.xlabel("Bill depth (mm)")
plt.ylabel("Body mass (g)")
plt.title("SVM classification of penguin species")
plt.legend()

plt.show()

9.10.5 How This is Useful

Now that we have this SVM model, any new data that we get on penguins of an unknown species can be used to guess the penguin’s species (given that it is one of the two we have a model for). For example, if we found a penguin that weighed 3500 grams that had a bill depth of 18 millimeters, we could pretty confidently guess that the penguin is an Adelie penguin.
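A quick sketch of that prediction, assuming the scaler and svm objects fitted above are still in scope (the new point must be scaled with the same scaler before predicting):

# Hypothetical new penguin: bill depth 18 mm, body mass 3500 g
new_penguin = pd.DataFrame([[18.0, 3500.0]], columns=["bill_depth_mm", "body_mass_g"])
print(svm.predict(scaler.transform(new_penguin)))  # expected to come out as 'Adelie'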

In the case of something like image classification, a group of images could be separated into Contains ___ and Does not Contain ___.

9.10.6 Real World Data and Remedies

This dataset is rather convenient for the application of an SVM. However, real-world examples can and will be more complicated. In such cases, remedies to the basic SVM can be implemented (a code sketch follows this list).

  • Soft Margins - Often, groups will not be linearly separable. In cases like this, soft margins allow some points to fall on the wrong side of the margin (or even the hyperplane), and the algorithm finds the hyperplane that best balances a wide margin against classification errors.
  • Non-linear SVMs - Sometimes, groups cannot be split by a flat plane. Instead, a kernel function implicitly maps the data into a higher-dimensional space, producing a curved boundary that can contain or exclude groups.
  • Non-binary SVMs - When dealing with more than two classification groups, multiple SVMs can be used to separate each group from each other group.
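Here is a brief scikit-learn sketch of the first two remedies, plus the multiclass case (the parameter values are arbitrary):

# Soft margin: a smaller C tolerates more margin violations than the C=1e10 used above
soft_svm = SVC(kernel="linear", C=1.0)

# Non-linear SVM: the RBF kernel lets the decision boundary curve around the classes
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# Non-binary problems: SVC handles more than two classes via a one-vs-one scheme internally
multi_svm = SVC(kernel="rbf")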

9.10.6.1 Advantages

  • Works just as well with many features / dimensions
  • Works well even with smaller datasets
  • New data can be classified quickly and easily

9.10.6.2 Limitations

  • It is often not possible to get a clean separating hyperplane, which leads to lower accuracy.
  • Can sometimes struggle with non-binary classification problems.

9.10.7 Further Readings


9.12 Multidimensional Scaling


Presentation by Mallory Whitlow

9.12.1 Introduction: What is Multidimensional Scaling?

Multidimensional Scaling (MDS) is an unsupervised ML dimensionality reduction technique that visualizes the similarity between objects in a lower-dimensional space.

  • Takes a distance or dissimilarity square matrix as input

  • Determines number of dimensions for visualization

  • Minimizes “stress” (the measure of how poorly the low-dim distances represent the original distances)

  • Produces a 2D/3D map where similar objects are placed closer together

  • Preserves pairwise relationships as faithfully as possible

Widely used in psychology, marketing, bioinformatics, and neuro-linguistic programming, but can be used for any sort of relational data.

Before diving into each type, a few terms used throughout:

  • Unsupervised ML: finds hidden patterns/relationships in unlabeled data without explicit guidance.
    • Examples include: clustering, association, and dimensionality reduction
  • Dimensionality Reduction: ML technique that visualizes a high-dimensional data set in a lower dimension.
    • Examples include: PCA, t-SNE, UMAP, and MDS
  • Dissimilarity matrix \(\Delta\): an \(n \times n\) matrix where \(\delta_{ij}\) is how different objects \(i\) and \(j\) are.
  • Configuration \(X\): the recovered \(n \times k\) coordinate matrix in the low-dimensional space.
  • Euclidean Distance: the shortest path between two points (Pythagorean distance/hypotenuse).
  • Stress: a scalar measuring how badly the configuration distorts the original dissimilarities.

\[\text{Stress} = \sqrt{\frac{\sum_{i<j}(\delta_{ij} - d_{ij})^2}{\sum_{i<j} d_{ij}^2}}\]

Stress         Quality
< 0.05         Excellent
0.05 – 0.10    Good
0.10 – 0.20    Fair
> 0.20         Poor fit - consider more dimensions
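A small helper (a direct transcription of the formula above, not part of scikit-learn) that computes stress from the original dissimilarities and the distances in the low-dimensional configuration:

import numpy as np

def stress(delta, d):
    # delta: original dissimilarity matrix; d: pairwise distances in the embedding
    i, j = np.triu_indices_from(delta, k=1)
    return np.sqrt(np.sum((delta[i, j] - d[i, j]) ** 2) / np.sum(d[i, j] ** 2))

In practice, d can be obtained from the MDS coordinates with sklearn.metrics.pairwise_distances.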

9.12.2 Core Idea

If cities A and B are 100 km apart, and B and C are 200 km apart, where is C relative to A?

The primary goal? Visualizing similarities in the data and identifying hidden structures.

import pandas as pd
import numpy as np
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
D = np.array([[0, 1, 2, 3],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [3, 2, 1, 0]])

df = pd.DataFrame(D, index=labels, columns=labels)
print("Example Distance Matrix:")
print(df.to_string())

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=49)

D_transformed = mds.fit_transform(D)

plt.scatter(D_transformed[:, 0], D_transformed[:, 1])
plt.show()
Example Distance Matrix:
   A  B  C  D
A  0  1  2  3
B  1  0  1  2
C  2  1  0  1
D  3  2  1  0
FutureWarning: The default value of `n_init` will change from 4 to 1 in 1.9.

From the scikit-learn manual:

n_components: number of dimensions in output

metric: If True, perform metric MDS; otherwise, perform non-metric MDS

dissimilarity: dissimilarity measure to use

  • ‘euclidean’: pairwise Euclidean distances between dataset points

  • ‘precomputed’: pre-computed dissimilarities passed directly to fit and fit_transform.

random_state: determines the random number generator used to initialize centers.


9.12.3 Type 1 - Classical

Classical MDS is also known as Principal Coordinates Analysis (PCoA) and is equivalent to PCA when the distances are Euclidean.

\[B = -\tfrac{1}{2} H \Delta^{(2)} H, \qquad H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\]

\[X = V_k \Lambda_k^{1/2}\]

When to use it

  • Your dissimilarities are Euclidean distances
  • You need results to be reproducible across runs
  • Dataset is small/medium (under 5,000 observations)

Real-world uses

Geographic reconstruction - given a table of road distances between cities, create an approximate map.

Genomics/phylogenetics - input species based on genetic distance matrices to visualize evolutionary relationships.

Chemistry - visualize molecular similarity from fingerprint-based distance matrices.
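Here is a minimal numpy sketch of the double-centering recipe above (for illustration; scikit-learn’s MDS instead minimizes stress iteratively):

import numpy as np

def classical_mds(D, k=2):
    # D: n x n matrix of (Euclidean) distances; returns an n x k configuration X
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = I - (1/n) 1 1^T
    B = -0.5 * H @ (D ** 2) @ H                # B = -1/2 H Delta^(2) H
    eigvals, eigvecs = np.linalg.eigh(B)       # eigendecomposition of B
    order = np.argsort(eigvals)[::-1][:k]      # keep the k largest eigenvalues
    L = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * L               # X = V_k Lambda_k^(1/2)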


9.12.4 Type 2 - Metric

A generalization of cMDS, it accepts any quantitative dissimilarity measure and finds a configuration that minimizes stress iteratively.

*Unlike cMDS, metric MDS uses numerical optimization (gradient descent), so results can vary slightly with initialization and may need multiple restarts.

When to use it

  • You have quantitative dissimilarities that are not strictly Euclidean (e.g. correlation-based, cosine distances)
  • You want to fit many dimensions and compare stress across them
  • Classical MDS gives negative eigenvalues (a sign the distances are non-Euclidean) and you need a robust alternative

Real-world uses

Market research - compute dissimilarity between brands from survey ratings and produce a map showing the competitive landscape.

Neuroscience - visualize brain region similarity from fMRI correlation matrices.

from sklearn.manifold import MDS
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


iris = load_iris()
X = iris.data
dist_matrix = pairwise_distances(X, metric="euclidean")


mds = MDS(n_components=2, metric=True, dissimilarity="precomputed",
          random_state=42, normalized_stress="auto")
coords = mds.fit_transform(dist_matrix)

df_coords = pd.DataFrame(coords, columns=["Dim1", "Dim2"])
df_coords["species"] = [iris.target_names[t] for t in iris.target]
print(df_coords.head(8))

#| fig-align: center

colors = {"setosa": "#E63946", "versicolor": "#457B9D", "virginica": "#2DC653"}

fig, ax = plt.subplots(figsize=(9, 5))


for species, grp in df_coords.groupby("species"):
    ax.scatter(grp["Dim1"], grp["Dim2"], label=species,
               color=colors[species], s=60, alpha=0.85)

ax.set_title("MDS of Iris Dataset", color="white", fontsize=14)
ax.set_xlabel("Dimension 1", color="#9EB3C2")
ax.set_ylabel("Dimension 2", color="#9EB3C2")
ax.tick_params(colors="#9EB3C2")
ax.legend(facecolor="#1a1a2e", labelcolor="white")
for spine in ax.spines.values():
    spine.set_edgecolor("#2a2a4a")
ax.grid(color="#2a2a4a", linestyle="--", linewidth=0.5)
plt.tight_layout()
plt.show()
       Dim1      Dim2 species
0 -2.651531 -0.559466  setosa
1 -2.729591 -0.047610  setosa
2 -2.901949 -0.140717  setosa
3 -2.771825  0.066302  setosa
4 -2.698422 -0.590300  setosa
5 -2.205573 -0.955087  setosa
6 -2.848783 -0.199389  setosa
7 -2.605070 -0.402099  setosa
FutureWarning: The default value of `n_init` will change from 4 to 1 in 1.9.


9.12.5 Type 3 - Non-Metric (Ordinal) MDS

Used for qualitative dissimilarities, non-metric MDS finds a non-parametric monotonic relationship between the dissimilarities and the distances between points in the low-dimensional embedding.

Whereas metric MDS assumes a linear relationship between dissimilarities and embedded distances, nMDS only preserves their rank order, so the fitted relationship is a monotonic curve.

When to use it

  • Your dissimilarities come from rankings, preference data, or subjective ratings where the scale is arbitrary
  • You cannot assume the data is Euclidean or metric
  • Classical or metric MDS gives high stress and you suspect non-linearity
  • Used almost universally in community ecology

Real-world uses

Ecology - Bray-Curtis dissimilarity between ecological communities based on species counts is ordinal; nMDS is the standard tool.

Psychology - participants rank how similar different stimuli are (smells, faces, colors).

Social science - survey respondents rank preferences.

from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from scipy.stats import rankdata


digits = load_digits()
X_digits = digits.data        # (1797, 64)
y_digits = digits.target      # 0–9

print(f"Full dataset shape : {X_digits.shape}")
print(f"Image grid size    : 8×8 pixels")
print(f"Classes            : {digits.target_names.tolist()}")

np.random.seed(42)
sample_idx = np.concatenate([
    np.random.choice(np.where(y_digits == d)[0], 30, replace=False)
    for d in range(10)
])
X_sample = X_digits[sample_idx]
y_sample = y_digits[sample_idx]

print(f"\nSample shape       : {X_sample.shape}  (30 per digit)")
Full dataset shape : (1797, 64)
Image grid size    : 8×8 pixels
Classes            : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Sample shape       : (300, 64)  (30 per digit)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)

dist_digits = pairwise_distances(X_scaled, metric="euclidean")


n = dist_digits.shape[0]
upper_idx = np.triu_indices(n, k=1)
ranks = rankdata(dist_digits[upper_idx])

ranked_dist = np.zeros_like(dist_digits)
ranked_dist[upper_idx] = ranks
ranked_dist += ranked_dist.T   

print(f"Distance range   : {dist_digits[upper_idx].min():.2f}{dist_digits[upper_idx].max():.2f}")
print(f"Rank range       : {int(ranks.min())}{int(ranks.max())}")
print(f"\nRanked dissimilarity matrix (first 5×5):")
print(pd.DataFrame(ranked_dist[:5, :5].astype(int)).to_string())
Distance range   : 1.45 – 33.41
Rank range       : 1 – 44850

Ranked dissimilarity matrix (first 5×5):
      0     1     2     3     4
0     0   192  2150  9630  2105
1   192     0   702  3335   943
2  2150   702     0   939   951
3  9630  3335   939     0   708
4  2105   943   951   708     0
nmds_digits = MDS(
    n_components=2,
    metric=False,                
    dissimilarity="precomputed",
    random_state=42,
    normalized_stress="auto",
    max_iter=1000,
    n_init=10
)
digit_coords = nmds_digits.fit_transform(ranked_dist)

stress = nmds_digits.stress_
quality = ("Excellent" if stress < 0.05 else
           "Good"      if stress < 0.10 else
           "Fair"      if stress < 0.20 else "Poor")
print(f"Final Stress: {stress:.4f}{quality}")


digit_palette = [
    "#E63946", "#F4A261", "#2DC653", "#457B9D", "#9B59B6",
    "#1ABC9C", "#E67E22", "#3498DB", "#E91E63", "#95A5A6"
]

fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor("#1a1a2e")
ax.set_facecolor("#1a1a2e")

for d in range(10):
    mask = y_sample == d
    ax.scatter(digit_coords[mask, 0], digit_coords[mask, 1],
               label=str(d), color=digit_palette[d], s=60, alpha=0.85)

ax.set_title("NMDS — Handwritten Digits (Ranked Pixel Distances)",
             color="white", fontsize=13)
ax.set_xlabel("NMDS Axis 1", color="#9EB3C2")
ax.set_ylabel("NMDS Axis 2", color="#9EB3C2")
ax.tick_params(colors="#9EB3C2")
ax.legend(title="Digit", title_fontsize=9, facecolor="#1a1a2e",
          labelcolor="white", fontsize=8, ncol=2)
ax.grid(color="#2a2a4a", linestyle="--", linewidth=0.5)
for spine in ax.spines.values():
    spine.set_edgecolor("#2a2a4a")
ax.text(0.02, 0.97, f"Stress = {stress:.3f} ({quality})",
        transform=ax.transAxes, color="#9EB3C2", fontsize=9, va="top")
plt.tight_layout()
plt.show()
Final Stress: 0.1022  → Fair


9.12.6 Why is MDS a good reduction method?

  1. It preserves distances directly.
    • Other methods, like PCA, do not explicitly preserve pairwise distances; they focus on variance instead.
  2. It works with any dissimilarity measure.
    • Euclidean, Manhattan, and cosine distances, as well as dissimilarities derived from genetic or survey data, can all be fed into MDS.
  3. It handles nonlinear and non-numeric structure well.
    • Non-metric MDS does not require the dissimilarities to have numerical meaning, so it can be applied to many ordinal and nominal data sets.

9.12.7 Limitations

  1. Sensitive to noise and outliers. These can distort the low-dimensional representation, especially in 2D.
  2. Computational complexity. MDS can demand substantial time and memory, especially for large data sets, because it works from the full pairwise dissimilarity matrix.
  3. Information is lost. Reducing dimensionality necessarily discards some structure, so the configuration may not faithfully represent the original data.
  4. Subjectivity. MDS axes have no inherent meaning, so interpreting the arrangement of points is partly subjective.

9.12.8 References

(borg2005?; cox2000?; kruskal1978?; sklearn_mds?; scipy_distance?)

9.14 Principal Component Analysis

This section was prepared by Alexander Wolven.

9.14.1 The Dimensionality Problem

Most real datasets have far more variables than you actually need. Many of those variables are correlated, meaning they are partially measuring the same underlying thing. Feeding all of them into a model means handing redundant information to the algorithm and letting it treat that redundancy as if it were real signal. The result is slower training, higher overfitting risk, and data that is impossible to visualize.

MNIST is a useful example. It contains 70,000 handwritten digit images, each stored as a 28-by-28 pixel grid, giving 784 features per image. Adjacent pixels almost always activate together because they belong to the same stroke. That redundancy compounds as datasets grow into thousands of features. This is the dimensionality problem. PCA is one of the most widely used techniques to address it.

What PCA Is and What It Is Not

9.14.1.1 A Transformation, Not a Model

PCA takes high-dimensional data and finds a lower-dimensional representation that preserves as much of the original information as possible. It does not predict anything. It does not classify digits or make decisions. It is a preprocessing step. The compressed output gets handed to an actual model, which does the predicting.

This changes how you evaluate PCA. You do not ask whether PCA made good predictions. You ask whether the downstream model performs better, faster, or more reliably after PCA than without it.

The standard pipeline:

raw data  -->  standardize  -->  PCA  -->  classifier or regressor

Fit the scaler and PCA on training data only, then apply both to the test set using parameters learned from training. Fitting on the full dataset before splitting leaks test information into the transformation and makes results look better than they are.

9.14.1.2 It Is Just a Rotation

Take two correlated variables and plot them against each other. You get an elongated cloud of points tilted diagonally. Most of the variation lives along that diagonal, not along the original axes. The original axes were chosen by whoever designed the measurement system, not because they describe the structure of the data well.

PCA rotates the axes to fit the data instead of forcing the data to fit the axes. PC1 points along the direction of maximum variance. PC2 is perpendicular and captures whatever variance is left. The data points do not move. Only the coordinate system changes.

In higher dimensions the same idea extends beyond what you can visualize. PCA finds a sequence of perpendicular directions, each capturing as much remaining variance as possible, ordered from most to least important.
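A small sketch with synthetic two-dimensional data (not the MNIST data used later) makes the rotation idea concrete: the rows of pca.components_ are the new perpendicular axes, and transforming the centered data is nothing more than re-expressing the same points in that rotated coordinate system.

import numpy as np
from sklearn.decomposition import PCA

# Two correlated variables: an elongated, tilted cloud (synthetic)
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=500)
X_toy = np.column_stack([x1, x2])
X_toy = X_toy - X_toy.mean(axis=0)          # center the cloud

pca_toy = PCA(n_components=2).fit(X_toy)
Z_toy = pca_toy.transform(X_toy)

# The component rows form an orthonormal basis, so the transform is a rotation
print(np.allclose(Z_toy, X_toy @ pca_toy.components_.T))
print(np.allclose(pca_toy.components_ @ pca_toy.components_.T, np.eye(2)))
print(pca_toy.explained_variance_ratio_)    # PC1 carries most of the variance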

9.14.1.3 What a Principal Component Actually Is

A single raw pixel feature looks like this:

\[x_{14} = 182\]

It is just one number. One pixel intensity. It tells you nothing on its own about what digit you are looking at.

A principal component instead takes every pixel and combines them with learned weights:

\[ PC_1 = 0.03 \cdot x_1 + 0.01 \cdot x_2 + \cdots + 0.31 \cdot x_{14} + 0.28 \cdot x_{15} + \cdots + (-0.18 \cdot x_{401}) + \cdots \]
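The weights in the expression above are illustrative, not values from a fitted model. In scikit-learn the learned weights live in pca.components_, and a component score is just the dot product of a standardized image with that weight vector. The sketch below uses the small 8-by-8 digits dataset for brevity rather than MNIST.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_small = StandardScaler().fit_transform(load_digits().data)   # (1797, 64)
pca_small = PCA(n_components=10).fit(X_small)

weights = pca_small.components_[0]      # 64 learned weights defining PC1
score_manual = X_small[0] @ weights     # weighted sum for the first image
score_sklearn = pca_small.transform(X_small[:1])[0, 0]
print(np.isclose(score_manual, score_sklearn))   # the two scores agree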

That single number summarizes how much of the dominant visual pattern is present in the entire image. Instead of asking what one pixel is doing, PCA asks what the whole image looks like along the most important direction of variation. That is why a handful of components can stand in for hundreds of individual pixels.

Standardization: The Step You Cannot Skip

PCA finds directions of maximum variance, so it is directly affected by the scale of your variables. If one variable is measured in dollars and another in percentages, the dollar variable will dominate the variance calculation simply because its numbers are larger, not because it carries more information. Standardizing every variable to mean zero and standard deviation one removes that bias and puts all variables on equal footing before PCA runs.

\[z = \frac{x - \mu}{\sigma}\]

In MNIST every pixel already shares the same scale, but standardizing still matters. Some pixels are almost always dark across every image and carry near-zero variance. Without centering, PCA gets pulled toward overall brightness rather than the structural differences that actually distinguish one digit from another.

Skipping standardization is the most common mistake when applying PCA. The algorithm still runs and produces output, but the output reflects the wrong sources of variation. Fit the scaler on training data only.

The Math Under the Hood

9.14.1.4 Covariance, Eigenvectors, Projection

The covariance matrix measures how much every pair of variables moves together across observations. For MNIST this is a 784-by-784 matrix. Each entry captures how much two pixels tend to activate together across all images. Diagonal entries are the variances of individual pixels. Off-diagonal entries are covariances between pairs. Many large off-diagonal values mean adjacent pixels are highly correlated and there is a lot of redundancy to compress.

\[\Sigma = \frac{1}{n-1} X^T X,\]

where \(X\) is the centered data matrix (each column has had its mean subtracted).

PCA finds the directions explaining the most variation via eigendecomposition:

\[\Sigma v = \lambda v\]

Here \(v\) is an eigenvector and \(\lambda\) is its eigenvalue. Eigenvectors are the directions the covariance matrix does not rotate, only stretch. Each eigenvector defines one principal component. Each eigenvalue tells you how much variance that direction captures. Sort by eigenvalue from largest to smallest and you have your principal components in order of importance.

Projection is the final step:

\[Z = X W\]

\(W\) holds the top \(k\) eigenvectors you chose to keep. The result \(Z\) is the new lower-dimensional dataset, with columns that are uncorrelated by construction and ordered from most to least informative.
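A minimal sketch of those three steps on a small random matrix (synthetic, not MNIST), checked against scikit-learn's PCA. Sign flips of individual components are expected, so the comparison uses absolute values.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
X_centered = X_demo - X_demo.mean(axis=0)                         # center

Sigma = (X_centered.T @ X_centered) / (X_centered.shape[0] - 1)   # covariance
eigvals, eigvecs = np.linalg.eigh(Sigma)                          # eigendecomposition
order = np.argsort(eigvals)[::-1]                                 # largest variance first
W = eigvecs[:, order[:2]]                                         # keep top k = 2 directions
Z = X_centered @ W                                                # project

Z_sklearn = PCA(n_components=2).fit_transform(X_demo)
print(np.allclose(np.abs(Z), np.abs(Z_sklearn)))                  # matches up to sign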

The plain English version: the covariance matrix is a table of how much every pixel moves with every other pixel. Eigendecomposition cracks that table open to find the actual directions all that co-movement is pointing. The eigenvalues are scores telling you how important each direction is. You keep the top ones, drop the rest, and projection is just rotating your data into that new smaller coordinate system.

Seeing It in Action

KNN classification on MNIST, run twice. Same model, same data, different input: all 784 pixel features versus PCA-reduced features retaining 95 percent of the variance. We use 10,000 samples for speed.

import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample
from sklearn.pipeline import Pipeline

mnist  = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
X_full = mnist.data.astype(float)
y_full = mnist.target.astype(int)

X, y = resample(X_full, y_full, n_samples=10000, random_state=42)

scaler   = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca   = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

pipe_raw = Pipeline([
    ('knn', KNeighborsClassifier(n_neighbors=5, algorithm='brute'))
])
pipe_pca = Pipeline([
    ('pca', PCA(n_components=0.95)),
    ('knn', KNeighborsClassifier(n_neighbors=5, algorithm='brute'))
])

acc_raw = cross_val_score(
    pipe_raw, X_scaled, y, cv=3, scoring='accuracy'
).mean()
acc_pca = cross_val_score(
    pipe_pca, X_scaled, y, cv=3, scoring='accuracy'
).mean()

n_runs  = 20
knn_raw = KNeighborsClassifier(
    n_neighbors=5, algorithm='brute'
).fit(X_scaled, y)
start    = time.perf_counter()
for _ in range(n_runs):
    knn_raw.predict(X_scaled)
time_raw = (time.perf_counter() - start) / n_runs

knn_pca = KNeighborsClassifier(
    n_neighbors=5, algorithm='brute'
).fit(X_pca, y)
start    = time.perf_counter()
for _ in range(n_runs):
    knn_pca.predict(X_pca)
time_pca = (time.perf_counter() - start) / n_runs

print(f"Without PCA | Features: {X_scaled.shape[1]} | "
      f"Acc: {acc_raw:.3f} | Time: {time_raw:.3f}s")
print(f"With PCA    | Features: {X_pca.shape[1]}   | "
      f"Acc: {acc_pca:.3f} | Time: {time_pca:.3f}s")
Without PCA | Features: 784 | Acc: 0.907 | Time: 0.365s
With PCA    | Features: 278   | Acc: 0.912 | Time: 0.230s

784 features down to 278 components. Accuracy holds and actually improves slightly. Prediction time drops by more than a third. That is the core argument for PCA.

How Many Components to Keep

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

scaler_full   = StandardScaler()
X_scaled_full = scaler_full.fit_transform(X)

pca_full = PCA()
pca_full.fit(X_scaled_full)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)

print(f"Samples:                     {X.shape[0]}")
print(f"Features:                    {X.shape[1]}")
print(f"Components for 80% variance: {np.argmax(cumvar >= 0.80) + 1}")
print(f"Components for 90% variance: {np.argmax(cumvar >= 0.90) + 1}")
print(f"Components for 95% variance: {np.argmax(cumvar >= 0.95) + 1}")
Samples:                     10000
Features:                    784
Components for 80% variance: 119
Components for 90% variance: 194
Components for 95% variance: 278

Each principal component comes with an explained variance ratio: the fraction of total variance it captures. Summing these in order gives the cumulative curve below. Common thresholds are 80 percent for maximum compression, 90 percent for a reasonable balance, and 95 percent when high fidelity for a downstream model matters. There is no universally correct answer.
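The curve described next can be drawn from the cumvar array computed above; a minimal plotting sketch, assuming the previous code block has been run:

import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.plot(np.arange(1, len(cumvar) + 1), cumvar)
plt.axhline(0.95, linestyle="--", label="95% variance")
plt.axvline(np.argmax(cumvar >= 0.95) + 1, linestyle=":",
            label=f"{np.argmax(cumvar >= 0.95) + 1} components")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("Cumulative Explained Variance")
plt.legend()
plt.show()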

The curve rises steeply then flattens. A small number of components does the heavy lifting. Everything after that is diminishing returns. This shape is what makes PCA practical, and the chart makes the compression tradeoff explicit rather than hiding it.

Visualizing 784 Dimensions in 2D

One of the most compelling things PCA enables is projecting data that cannot otherwise be visualized. Below, all 784 pixel features are compressed to two components for digits 0, 1, 2, and 3. PCA was given no digit labels during the transformation.
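A rough sketch of how that figure can be produced, assuming X_scaled and y from the earlier comparison block are still in memory:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

mask = np.isin(y, [0, 1, 2, 3])                 # keep digits 0-3 only
coords_2d = PCA(n_components=2).fit_transform(X_scaled[mask])

plt.figure(figsize=(6, 4))
for d in [0, 1, 2, 3]:
    sel = y[mask] == d
    plt.scatter(coords_2d[sel, 0], coords_2d[sel, 1], s=8, alpha=0.6, label=str(d))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("MNIST Digits 0-3 in Two Principal Components")
plt.legend(title="Digit")
plt.show()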

The 1s separate cleanly from everything else because a single vertical stroke is geometrically nothing like a loop or a curve. The 0s, 2s, and 3s cluster in their own regions. PCA found that separation purely from the variance in pixel intensities, with no knowledge of what any image actually was. That is the latent structure made visible.

Image Reconstruction: What Information Loss Actually Looks Like

The clearest way to understand what PCA discards is to compress an image to different numbers of components and reconstruct it back to pixel space.
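A sketch of that experiment using pca.inverse_transform, assuming X_scaled and the fitted scaler from the earlier block; the component counts shown are arbitrary choices for illustration:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

ks = [2, 10, 30, 100, 784]
fig, axes = plt.subplots(1, len(ks), figsize=(12, 3))
for ax, k in zip(axes, ks):
    pca_k = PCA(n_components=k).fit(X_scaled)
    recon = pca_k.inverse_transform(pca_k.transform(X_scaled[:1]))
    img = scaler.inverse_transform(recon)[0].reshape(28, 28)  # back to pixel space
    ax.imshow(img, cmap="gray")
    ax.set_title(f"k = {k}")
    ax.axis("off")
plt.suptitle("One digit reconstructed from k components")
plt.tight_layout()
plt.show()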

At 2 components you can already make out the shape of the digit, remarkable given that 2 numbers are standing in for 784 pixels. As components are added the image sharpens progressively. At 784 there is no loss at all. The fact that the digit is readable at such low numbers of components is the whole point: 87 percent of the pixel information was either redundant or noise that did not matter for recognizing the shape.

At 2 components the shape is very blurry, as a 6 is more geometrically complex than a simple digit like 0. However, the digit already starts coming through at just 10 to 30 components, and is clearly recognizable by 100. At 784 there is no loss at all. The more complex the shape, the more components it takes to reconstruct it faithfully, but the compression still works.

What PCA Actually Found: Eigendigits

Because MNIST features are pixels, each principal component can be reshaped back into a 28-by-28 image and viewed directly. These are called eigendigits.
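A sketch of how the eigendigits can be displayed, assuming pca_full fit on the scaled sample in the components-to-keep block above:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 8, figsize=(12, 2))
for i, ax in enumerate(axes):
    ax.imshow(pca_full.components_[i].reshape(28, 28), cmap="RdBu")
    ax.set_title(f"PC{i + 1}")
    ax.axis("off")
plt.suptitle("First eight eigendigits (component weight maps)")
plt.tight_layout()
plt.show()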

Each eigendigit is a question PCA is asking about every image it sees. PC1 looks like an oval because the single biggest source of variation across all digits is whether the pixels forming a circular or loop-like pattern are lit. Digits like 0, 6, 8, and 9 all have that structure. PC2 and PC3 start separating on stroke orientation, whether dominant lines are horizontal or vertical. Later components capture progressively finer distinctions: the specific curves that separate a 3 from an 8, or the gap that distinguishes a 7 from a 1. The model never sees individual pixels. It sees scores along these directions, which turns out to be enough.

The Full Pipeline

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

mnist  = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
X_all  = mnist.data.astype(float)
y_all  = mnist.target.astype(int)
X_all, y_all = resample(X_all, y_all, n_samples=10000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# Step 1: Standardize on training data only
scaler         = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# Step 2: Fit PCA on training data only
pca         = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca  = pca.transform(X_test_scaled)

print(f"Original shape:  {X_train.shape}")
print(f"Reduced shape:   {X_train_pca.shape}")
print(f"Components kept: {pca.n_components_}")

# Step 3: Train and evaluate
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train_pca, y_train)
acc = accuracy_score(y_test, clf.predict(X_test_pca))
print(f"Test accuracy:   {acc:.3f}")
Original shape:  (8000, 784)
Reduced shape:   (8000, 272)
Components kept: 272
Test accuracy:   0.904

n_components=0.95 tells sklearn to automatically select however many components explain 95 percent of variance. The test set is always transformed using already-fitted objects. Never re-fit the scaler or PCA on test data.

Real-World Applications

PCA shows up wherever the dimensionality problem lives. In finance, equity returns across hundreds of stocks collapse to a small number of underlying risk factors. The first component typically captures broad market movement. In genomics, expression measurements across tens of thousands of genes reduce to a manageable number of axes capturing the major sources of biological variation. In manufacturing, correlated sensor readings compress into a low-dimensional representation of a machine's operating state, and readings that fall outside the normal pattern get flagged before failures occur. In NLP, Latent Semantic Analysis applies a closely related truncated SVD to word-document frequency matrices to surface underlying topics, one of the earliest techniques for measuring semantic similarity between documents.

The common thread: too many correlated variables, a need to surface the underlying structure, and a requirement that the result be usable downstream.

When Not to Use It

PCA is the wrong choice in two situations. The first is interpretability: once you project into PCA space you can no longer point to a specific pixel or feature as the driver of a prediction. The second is nonlinearity: PCA assumes the important structure in your data lives along linear directions. When relationships between variables are nonlinear, PCA will miss that structure entirely.

In those cases, consider the alternatives:

  • UMAP or t-SNE: Better for visualization when you care about preserving local cluster structure. They handle nonlinear relationships well but cannot be plugged into a modeling pipeline the way PCA can.
  • Kernel PCA: Extends PCA to nonlinear relationships by implicitly mapping data into a higher-dimensional space before reducing it. More flexible than standard PCA but harder to interpret and more computationally expensive.
  • Autoencoders: Neural networks that learn a compressed nonlinear representation of the data. Most powerful option but require significantly more data and compute to train well.

The general rule: if your data is roughly linear and you need a reusable pipeline step, use PCA. If you are exploring nonlinear structure or only need visualization, use UMAP. If you have the data and compute budget, autoencoders.

Key Takeaways

PCA is a preprocessing step, not a model. Evaluate it by whether the downstream model improves, not by whether PCA itself predicts anything.

It finds latent structure by maximizing variance. The principal components surface the stroke patterns and shapes explaining why groups of pixels activate together. Components are uncorrelated with each other by construction.

Always standardize first. Features on different scales will dominate the variance calculation simply because of their magnitude, not because they carry more information. Fit the scaler on training data only.

Choose components by explained variance. MNIST goes from 784 features down to 278 at 95 percent variance retained. The cumulative variance curve makes that tradeoff explicit.

Avoid it when interpretability matters or relationships are nonlinear.

Further Readings

9.15 False Discovery Rate Control


This section was prepared by Scott McCabe.

9.15.1 Why False Discovery Rate Matters

Why Do We Need FDR Control?

  • In modern data science, we often test hundreds or thousands of hypotheses
  • Examples:
    • Genomics: testing gene expression differences
    • A/B testing: evaluating multiple features
    • Social science: multiple survey questions

The Multiple Testing Problem

  • Suppose we test \(m = 1000\) hypotheses
  • Each test uses significance level \(\alpha = 0.05\)
  • Even if ALL null hypotheses are true:

\[ \text{Expected false positives} = 0.05 \times 1000 = 50 \]

  • We may incorrectly reject dozens of hypotheses

False Positives Visualization - The histogram below shows 1000 p-values generated under the null hypothesis (uniformly distributed between 0 and 1). Even though all nulls are true, we expect about 5% of them (around 50) to be below the 0.05 threshold, leading to false discoveries if we use the standard cutoff without adjustment.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

m = 1000
pvals = np.random.uniform(0, 1, m)

plt.hist(pvals, bins=30)
plt.axvline(0.05)
plt.title("P-values Under All Null Hypotheses")
plt.xlabel("p-value")
plt.ylabel("Frequency")
plt.show()

9.15.2 Definitions

Measuring Errors in Multiple Testing

  • When testing many hypotheses, we track:
                        Null True              Null False
    Reject H₀           False Positive (V)     True Positive (S)
    Do Not Reject H₀    True Negative          False Negative
  • Total rejections: \[ R = V + S \]

9.15.3 False Discovery Proportion (FDP)

  • The proportion of false discoveries among all discoveries:

\[ \text{FDP} = \frac{V}{R} \]

  • If no rejections (R = 0), define FDP = 0

  • Interpretation:

    • FDP = 0 → all discoveries are correct
    • FDP = 0.2 → 20% of discoveries are false

9.15.4 False Discovery Rate (FDR)

  • The expected value of FDP:

\[ \text{FDR} = \mathbb{E}\left[\frac{V}{R}\right] \]

  • This is what we aim to control

  • Example:

    • If FDR = 0.05
    • Then on average, 5% of discoveries are false

Why Expectation?

  • FDP varies from dataset to dataset

  • We cannot control it exactly every time

  • Instead, we control it on average across experiments

This makes FDR practical for real-world data science


Example

  • Suppose we reject 100 hypotheses:
    • 90 are true discoveries
    • 10 are false discoveries

\[ \text{FDP} = \frac{10}{100} = 0.10 \]

  • Over many experiments:
    • The average FDP is roughly equal to FDR

  • FDR does NOT eliminate false positives
  • It controls their proportion

“Among all discoveries, most should be correct.”

9.15.5 Classical Approach vs. Modern Methods

9.15.6 Classical Error Control: FWER

When controlling errors in multiple testing, the traditional approach focuses on avoiding any false positives at all. This is known as controlling the Family-Wise Error Rate (FWER).

  • FWER is defined as the probability of making at least one false discovery: \[ \text{FWER} = P(V \geq 1) \]

  • A common method to control FWER is the Bonferroni correction,

\[ \alpha_{adj} = \frac{\alpha}{m} \]

  • A hypothesis is rejected if its p-value satisfies: \[ p_i \leq \frac{\alpha}{m} \]

For example, if we have 1000 tests and want to maintain an overall \(\alpha\) of 0.05, the Bonferroni threshold would be 0.00005, which is extremely strict and often leads to very few discoveries, even when there are true effects present.
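A quick sketch of that rule on simulated p-values (all null hypotheses true, values generated at random purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
m, alpha = 1000, 0.05
pvals = rng.uniform(size=m)          # every null hypothesis is true here

threshold = alpha / m                # Bonferroni-adjusted cutoff: 0.00005
print(f"Bonferroni threshold: {threshold}")
print(f"Rejections: {(pvals <= threshold).sum()}")   # almost always zero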


Why FWER is Too Conservative

  • Example:
    • \(m = 1000, \alpha = 0.05\)
    • Bonferroni threshold = 0.00005
  • Very hard to detect real effects
  • While it minimizes false positives, it also leads to many false negatives (missed discoveries)

9.15.7 Modern Approach: False Discovery Rate (FDR)

  • Instead of eliminating all errors, control their proportion

\[ \text{FDR} = \mathbb{E} \left[ \frac{V}{R} \right] \]

  • Goal:
    • Allow some false positives
    • But limit their proportion among discoveries

9.15.8 The Benjamini–Hochberg (BH) Procedure

  • The BH procedure is the most widely used method for controlling FDR.

  • How it works:

    1. Sort p-values from smallest to largest.

    2. Compare each p-value to an increasing threshold: \[ \frac{i}{m}\alpha \]

    3. Reject all hypotheses up to the largest p-value that satisfies this condition.

  • Key idea:

    • The threshold becomes less strict as we move through the ranked p-values.
    • This allows more discoveries while still controlling the overall error rate (a minimal sketch of the procedure follows this list).
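A minimal hand-rolled sketch of the procedure on simulated p-values; in practice, statsmodels' multipletests with method="fdr_bh", used in the simulation later in this section, does the same thing.

import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg: reject the k smallest p-values, where k is the
    largest rank i with p_(i) <= (i/m) * alpha."""
    m = len(pvals)
    order = np.argsort(pvals)                      # step 1: sort p-values
    thresholds = np.arange(1, m + 1) / m * alpha   # step 2: increasing cutoffs
    below = pvals[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                       # step 3: reject up to rank k
    return reject

rng = np.random.default_rng(1)
sim_pvals = np.concatenate([rng.uniform(size=950), rng.beta(0.3, 1, size=50)])
print(f"BH rejections at alpha = 0.05: {bh_reject(sim_pvals).sum()}")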

Mindset: FWER vs FDR

  • FWER mindset: “I want ZERO false positives.”

  • FDR mindset: “I can tolerate some false positives, as long as most discoveries are correct.”


Visual Comparison

This figure illustrates how the Benjamini–Hochberg (BH) procedure works step by step:

  1. Sort the p-values
    • The blue curve shows the p-values arranged from smallest to largest (by rank)
  2. Construct the BH threshold
    • The dashed line represents the BH threshold: \[ \frac{i}{m}\alpha \]
    • This line increases with rank, making the criterion less strict for larger p-values
  3. Compare p-values to the threshold
    • For each rank \(i\), compare the p-value to the corresponding point on the dashed line
  4. Find the cutoff
    • Identify the largest rank \(i\) where: \[ p_{(i)} \le \frac{i}{m}\alpha \]
  5. Reject hypotheses
    • All p-values to the left of this cutoff are rejected
    • These are considered statistically significant discoveries

Interpretation

  • The blue curve below the dashed line represents rejected hypotheses
  • Because the BH threshold increases, it allows more rejections than a fixed cutoff
  • In contrast, the Bonferroni line (horizontal) applies the same strict threshold to all tests:
    • Only very small p-values are rejected
    • This leads to fewer discoveries but stronger error control
  • The BH procedure adapts to the data: it relaxes the threshold as rank increases, allowing more discoveries while still controlling the expected proportion of false positives.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

m = 1000
pi1 = 0.05  # 5% true signals
m1 = int(m * pi1)
m0 = m - m1

# -----------------------------
# Generate p-values
# -----------------------------
p_null = np.random.uniform(0, 1, m0)
p_signal = np.random.beta(0.3, 1, m1)  # strong signals

pvals = np.sort(np.concatenate([p_null, p_signal]))

# -----------------------------
# Thresholds
# -----------------------------
alpha = 0.05
bonf = alpha / m
bh_line = [(i/m)*alpha for i in range(1, m+1)]

# -----------------------------
# Full Plot
# -----------------------------
plt.figure(figsize=(6,4))
plt.plot(pvals, label="Sorted p-values")
plt.plot(bh_line, '--', label="BH threshold")
plt.axhline(bonf, color='red', linestyle=':', label="Bonferroni threshold")

plt.xlabel("Rank")
plt.ylabel("p-value")
plt.title("FWER vs FDR Thresholds")
plt.legend()
plt.show()

# -----------------------------
# Zoomed Plot
# -----------------------------
plt.figure(figsize=(6,4))
plt.plot(pvals, label="Sorted p-values")
plt.plot(bh_line, '--', label="BH threshold")
plt.axhline(bonf, color='red', linestyle=':', label="Bonferroni threshold")

plt.xlim(0, 100)
plt.ylim(0, 0.01)

plt.xlabel("Rank")
plt.ylabel("p-value")
plt.title("Zoomed View: FWER vs FDR Thresholds")
plt.legend()
plt.show()

9.15.9 The 3 Methods

We compare three methods:

  • FWER (Bonferroni) – strict error control
  • BH (Benjamini–Hochberg) – standard FDR method
  • BY (Benjamini–Yekutieli) – FDR under dependence

9.15.10 1. FWER (Bonferroni)

  • Controls: \[ P(V \geq 1) \]

  • Rule: \[ p_i \leq \frac{\alpha}{m} \]

  • Pros:

    • Very safe (almost no false positives)
  • Cons:

    • Extremely conservative
    • Low power (misses real signals)

9.15.11 2. BH (Benjamini–Hochberg)

  • Controls: \[ \text{FDR} = \mathbb{E}[V/R] \]

  • Procedure:

    1. Sort p-values
    2. Find the largest \(k\) such that: \[ p_{(k)} \leq \frac{k}{m}\alpha \]
  • Pros:

    • Good balance of power and error control
    • Widely used
  • Cons:

    • Assumes independence (or weak dependence)

9.15.12 3. BY (Benjamini–Yekutieli)

  • Modification of BH for dependent tests

  • Threshold: \[ p_{(k)} \leq \frac{k}{m} \cdot \frac{\alpha}{\sum_{i=1}^m \frac{1}{i}} \]

  • Pros:

    • Valid under arbitrary dependence
  • Cons:

    • More conservative than BH
    • Can reduce power significantly

9.15.13 Visual Comparison

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
m = 50       # smaller number of hypotheses
alpha = 0.05

# 20% true signals
pi1 = 0.2
m1 = int(pi1*m)
m0 = m - m1

# null p-values
p_null = np.random.uniform(0,1,m0)
# strong signals
p_signal = np.random.beta(0.2,1,m1)  # skewed very close to 0

pvals = np.sort(np.concatenate([p_null, p_signal]))

# thresholds
bh = [(i/m)*alpha for i in range(1,m+1)]
by = [(i/m)*alpha / np.sum(1/np.arange(1,m+1)) for i in range(1,m+1)]
bonf = alpha / m

# plot
plt.figure(figsize=(8,5))
plt.plot(pvals, label="p-values", marker='o', linestyle='')
plt.plot(bh, '--', label="BH")
plt.plot(by, ':', label="BY")
plt.axhline(bonf, linestyle='-.', label="Bonferroni")
plt.xlabel("Rank")
plt.ylabel("p-value / Threshold")
plt.title("FWER vs FDR Thresholds")
plt.legend()
plt.show()

9.15.14 Simulation

9.15.15 Simulation: Comparing Methods

  • Compare FWER, BH, and BY
  • Evaluate:
    • Number of rejections
    • False Discovery Rate (FDR)
    • Power (true positives)

9.15.16 Simulation Setup

  • Total hypotheses: \(m = 1000\)
  • Proportion of true signals: \(\pi_1 = 0.1\)
  • Generate:
    • Null p-values ~ Uniform(0,1)
    • Signal p-values ~ Beta(0.5, 1)
  • Apply:
    • Bonferroni (FWER)
    • BH
    • BY

Data generation and method application are repeated across multiple simulations to assess average performance.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.multitest import multipletests

# -------------------------------
# Simulation parameters
# -------------------------------
np.random.seed(123)

m = 1000
pi1 = 0.3
m1 = int(pi1 * m)
m0 = m - m1

alpha = 0.05
n_sim = 100  # number of simulations

methods = {
    "Bonferroni": "bonferroni",
    "BH": "fdr_bh",
    "BY": "fdr_by"
}

# Store results across simulations
R_all = {name: [] for name in methods}
fdp_all = {name: [] for name in methods}
power_all = {name: [] for name in methods}

# -------------------------------
# Run simulations
# -------------------------------
for sim in range(n_sim):
    
    # Generate data each time
    p_null = np.random.uniform(0, 1, m0)
    p_signal = np.random.beta(0.3, 1, m1)
    
    pvals = np.concatenate([p_null, p_signal])
    truth = np.array([0]*m0 + [1]*m1)
    
    for name, method in methods.items():
        reject, _, _, _ = multipletests(pvals, alpha=alpha, method=method)
        
        R = reject.sum()
        V = ((reject == 1) & (truth == 0)).sum()
        S = ((reject == 1) & (truth == 1)).sum()
        
        fdp = V / R if R > 0 else 0
        power = S / m1
        
        R_all[name].append(R)
        fdp_all[name].append(fdp)
        power_all[name].append(power)

# -------------------------------
# Compute averages and std devs
# -------------------------------
names = list(methods.keys())
x = np.arange(len(names))

R_mean = [np.mean(R_all[n]) for n in names]
R_std  = [np.std(R_all[n]) for n in names]

fdp_mean = [np.mean(fdp_all[n]) for n in names]
fdp_std  = [np.std(fdp_all[n]) for n in names]

power_mean = [np.mean(power_all[n]) for n in names]
power_std  = [np.std(power_all[n]) for n in names]

colors = ['skyblue', 'orange', 'green']

# -------------------------------
# Number of Discoveries
# -------------------------------
plt.figure(figsize=(6,4))
bars = plt.bar(x, R_mean, yerr=R_std, capsize=5, color=colors)
plt.xticks(x, names)
plt.ylabel("Number of Discoveries")
plt.title("Average Discoveries (± SD)")

for bar, val in zip(bars, R_mean):
    plt.text(bar.get_x() + bar.get_width()/2, val + 10, f"{val:.0f}", ha='center')

plt.show()

print("Interpretation:")
print("BH discovers the most signals, while Bonferroni is very conservative.")
print("BY is in between but closer to Bonferroni.")

# -------------------------------
# FDP
# -------------------------------
plt.figure(figsize=(6,4))
bars = plt.bar(x, fdp_mean, yerr=fdp_std, capsize=5, color=colors)
plt.xticks(x, names)
plt.ylabel("False Discovery Proportion")
plt.title("Average FDP (± SD)")
plt.ylim(0, max(fdp_mean)*1.3)

for bar, val in zip(bars, fdp_mean):
    plt.text(bar.get_x() + bar.get_width()/2, val + 0.005, f"{val:.3f}", ha='center')

plt.show()

print("Interpretation:")
print("All methods control false discoveries well.")
print("Bonferroni and BY have very low FDP.")
print("BH has a slightly higher but still acceptable FDP.")

# -------------------------------
# Power
# -------------------------------
plt.figure(figsize=(6,4))
bars = plt.bar(x, power_mean, yerr=power_std, capsize=5, color=colors)
plt.xticks(x, names)
plt.ylabel("Power")
plt.title("Average Power (± SD)")
plt.ylim(0,1)

for bar, val in zip(bars, power_mean):
    plt.text(bar.get_x() + bar.get_width()/2, val + 0.02, f"{val:.2f}", ha='center')

plt.show()

print("Interpretation:")
print("BH has the highest power, meaning it detects the most true signals.")
print("BY has lower power than BH due to its conservativeness under dependence.")
print("Bonferroni has the lowest power, missing many true discoveries.")

Interpretation:
BH discovers the most signals, while Bonferroni is very conservative.
BY is in between but closer to Bonferroni.

Interpretation:
All methods control false discoveries well.
Bonferroni and BY have very low FDP.
BH has a slightly higher but still acceptable FDP.

Interpretation:
BH has the highest power, meaning it detects the most true signals.
BY has lower power than BH due to its conservativeness under dependence.
Bonferroni has the lowest power, missing many true discoveries.

9.15.17 Key Takeaways

  1. Multiple testing is tricky
    • Testing many hypotheses increases the chance of false positives.
    • Traditional FWER control (Bonferroni) is too strict for large-scale testing.
  2. FDR vs FWER
    • FWER: strict, very few false positives, low power.
    • FDR: controls the expected proportion of false discoveries, higher power.
  3. Choosing a method
    Method                       When to Use
    Bonferroni (FWER)            Very strict control, few tests
    BH (Benjamini–Hochberg)      Default FDR control, independent/weakly dependent tests
    BY (Benjamini–Yekutieli)     Dependent tests
  4. Simulation confirms intuition
    • BH discovers more true signals but has a slightly higher FDP.
    • BY is more conservative than BH, leading to fewer discoveries but lower FDP.
    • Bonferroni is the most conservative, with the fewest discoveries and lowest FDP.
  5. Practical Use in Data Science
    • No single “best” method — choose based on signal strength, test dependence, and desired strictness
    • FDR control methods like BH are often a good starting point for large-scale testing in data science.

9.15.18 References

Efron, B. (2010). False discovery rate control. In Large-scale inference: Empirical Bayes methods for estimation, testing, and prediction (pp. 46–69). Cambridge University Press. https://utstat.toronto.edu/reid/sta2212s/2021/EfronLSIChapter4.pdf

Genovese, C. R. (2004). A tutorial on false discovery control. Technical report, Carnegie Mellon University.

Columbia University Mailman School of Public Health. (n.d.). False discovery rate. https://www.publichealth.columbia.edu/research/population-health-methods/false-discovery-rate

9.16 Neural Networks in Sports Analytics