
8 Statistical Modeling & Machine Learning
8.1 K-Means Clustering
This presentation was prepared by Alysha Desai.
8.1.1 Introduction
8.1.1.1 What is K-Means and how does it work?
- K-Means Clustering is an unsupervised learning algorithm used for grouping unlabeled data by similarity.
- Unsupervised: No pre-set labels or answers provided to the computer.
- Goal: Find hidden patterns and structure in unorganized, raw data. The objective is to minimize the sum of distances between data points and their assigned cluster centroids.
8.1.1.2 Real-World Use Case
- This type of cluster analysis is used in data science for market segmentation, document clustering, and much more.
- I will be focusing on Customer Segmentation for targeted advertising.
Dataset:
Mall Customer Segmentation
- Profiles of 200 shoppers from a supermarket mall.
- We use Age, Income, and Spending Score to find groups.
8.1.2 The Logic
8.1.2.1 The “K” in K-Means and Centroids
8.1.3 What is K?
- K represents the number of clusters we want the algorithm to create.
- You must choose K before the algorithm begins.
8.1.4 Centroids
- A centroid is the imaginary “center” of a cluster.
- It is the average of all data points in that group.
- Centroids move as the group of points changes.
8.1.4.1 K-Means in Motion
We can use Python to create an animation that shows the step-by-step flow of how clusters are formed by stopping the algorithm after each iteration.
- Iterative Movement: Watch how the Red X centroids physically shift as they recalculate the mean of their assigned points.
- Dynamic Assignment: Notice how individual shoppers change colors (clusters) as centroids move closer to them.
- Convergence: The animation stops once the centroids find their stable “home,” resulting in our five final market segments.
8.1.5 The Elbow Method
8.1.5.1 How do we find the optimal K?
- A visual technique used to decide exactly how many clusters K are needed for a specific dataset.
- As we add more clusters, the data points naturally get closer to their centers, but adding too many makes the model too complex.
- We calculate the Within-Cluster Sum of Squares (WCSS), which is the total variance or “spread” inside each cluster.
- We look for the specific point on the graph where the drop in WCSS slows down significantly—this is the “sweet spot”.
8.1.5.2 Visualizing the Elbow Curve

We tested K from 1 to 10 using our actual mall dataset. The “bend” happens at K=5, which is our optimal number. For our 200 shoppers, 5 clusters provide the best fit. Beyond 5, adding clusters provides very little extra value.
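The elbow curve can be produced with a short loop. A minimal sketch, using synthetic blobs from make_blobs as a stand-in for the mall data (the CSV is not loaded until the implementation section):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the mall data: 200 points in 5 blobs
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# Fit K-Means for K = 1..10 and record the WCSS (inertia_) each time
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()
```

The `inertia_` attribute of a fitted KMeans model is exactly the WCSS, so no manual distance computation is needed.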
8.1.6 Step-by-step Breakdown
Step 1: Initialization
- Randomly select K points from the dataset to be initial centroids.
Step 2: Assignment
- Every data point is assigned to its nearest centroid using Euclidean distance.
Step 3: Update
- Centroids move to the mathematical mean of all points in their new group.
Step 4: Convergence
- The loop repeats until centroids stop moving or reach a set limit.
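The four steps above can be sketched from scratch in NumPy. This is an illustrative toy version on two made-up blobs, not the scikit-learn implementation used later; the start points are fixed (one per blob) so the sketch is deterministic, whereas Step 1 is normally random:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # blob around (0, 0)
               rng.normal(8, 1, (50, 2))])   # blob around (8, 8)
K = 2

# Step 1: Initialization (fixed here for determinism; normally random)
centroids = X[[0, 50]].copy()

for _ in range(100):
    # Step 2: Assignment - each point joins its nearest centroid (Euclidean)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: Update - each centroid moves to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 4: Convergence - stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```

On these well-separated blobs the loop converges in a handful of iterations, with each blob ending up in its own cluster.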
8.1.7 Python Implementation
8.1.7.1 Coding with Scikit-Learn and Matplotlib
First, we need to import the following libraries:
- Numpy: Essential for numerical operations and distance calculations.
- Matplotlib: The primary tool for plotting our raw data and final results.
- Scikit learn: Used to generate synthetic data and run the KMeans model.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
# Load the mall dataset
df = pd.read_csv('data/Mall_Customers.csv')
# Extract the features for clustering
# Using columns 3 (Income) and 4 (Spending Score)
X = df.iloc[:, [3, 4]].values
# Apply K-Means
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)
8.1.8 Visualizing Results
8.1.8.1 Raw Data & Model Setup
8.1.9 The “Unorganized” Data

8.1.9.1 Implementation Code
# Select 5 clusters & fit
kmeans = KMeans(n_clusters=5,
                init='k-means++',
                random_state=42)
y_kmeans = kmeans.fit_predict(X)
# Plotting with Matplotlib
plt.scatter(X[y_kmeans == 0, 0],
            X[y_kmeans == 0, 1],
            s=50, c='pink')
8.1.10 Visualizing Results
8.1.10.1 Final Segments and Analysis
8.1.10.2 Segmented Clusters

8.1.10.3 Cluster Analysis
Pink Group: High Income / Low Spenders
Blue Group: High Income / High Spenders
Centroids: The mathematical mean of each group
Insight: Target specific marketing strategies
8.1.11 Advantages & Limitations
Advantages: Simple to implement, fast, and highly scalable.
Limitations:
Outlier Sensitivity: Extreme points skew the mean and cluster shape.
Manual Choice of K: The model can’t find the best K without us.
Initialization Risk: Starting positions heavily impact final results.
Dimensionality: Performance drops as you add too many variables.
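The outlier sensitivity is easy to see numerically, since a centroid is just a mean. A tiny made-up example:

```python
import numpy as np

cluster = np.array([1.0, 2.0, 3.0, 2.0, 1.0])  # a tight group of values
centroid = cluster.mean()                       # 1.8

with_outlier = np.append(cluster, 100.0)        # add one extreme point
skewed = with_outlier.mean()                    # jumps to about 18.2
```

A single extreme point drags the centroid far away from the rest of the group, which is why outliers distort K-Means clusters.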
8.1.11.1 Further Readings
Scikit-learn Clustering Guide: Technical documentation for K-Means and other models. https://scikit-learn.org/stable/modules/clustering.html
Google Machine Learning Education: A great overview of K-Means advantages and disadvantages. https://developers.google.com/machine-learning/clustering/kmeans/
IBM: What is K-Means?: A high-level explanation of how the algorithm is used in business. https://www.ibm.com/think/topics/k-means-clustering
8.1.12 Thank you!
8.1.12.1 Questions?
8.2 Logistic Regression
This presentation was prepared by Ronnie Orsini.
8.2.1 Regression idea
Before talking about logistic regression, it helps to briefly understand what regression means in statistics.
Regression is a method used to model the relationship between a response variable and one or more predictor variables. The goal is usually to predict the value of the response variable based on the predictors.
For example, regression can be used to predict:
house prices based on square footage and location
sales revenue based on advertising spending
temperature based on time of day
One of the most common forms is linear regression, which assumes the relationship between variables follows a straight line.
A simple linear regression model looks like:
\[ y = \beta_0 + \beta_1 x \]
Here:
- \(y\) is the response variable we want to predict
- \(x\) is a predictor variable
- \(\beta_0\) is the intercept
- \(\beta_1\) is the coefficient describing how \(x\) affects \(y\)
Linear regression works well when the response variable is a continuous number, such as price, temperature, or revenue.
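As a quick illustration, here is a linear regression fit with scikit-learn on made-up, noiseless data following y = 2x + 1 (the variable names are just for this sketch):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly
x = np.arange(10).reshape(-1, 1)   # predictor as a column vector
y = 2 * x.ravel() + 1              # response

model = LinearRegression()
model.fit(x, y)

beta0 = model.intercept_   # estimated intercept, about 1
beta1 = model.coef_[0]     # estimated slope, about 2
```

Because the data are noiseless, the fitted coefficients recover the true intercept and slope almost exactly.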
8.2.2 Classification problem
In many real-world situations, we are not trying to predict a numerical value. Instead, we want to predict which category an observation belongs to.
This type of problem is called a classification problem.
Examples include:
determining whether an email is spam or not spam
diagnosing whether a patient has a disease
In many cases there are only two possible outcomes, which is known as a binary classification problem.
For convenience, these outcomes are usually coded as:
1 → the event occurs (positive class)
0 → the event does not occur (negative class)
Instead of predicting a number, the goal is to estimate the probability that an observation belongs to class 1.
This probability is written as:
\[ P(Y = 1 \mid X) \]
which means the probability that the outcome is 1 given the predictor variables (X).
8.2.2.1 Example: Insider trading project
As an example, in a data science project I worked on last semester, we looked at whether politicians were outperforming the stock market using their disclosed trades.
The question we wanted to answer was:
Are politicians consistently beating the market?
At first it might seem like this could be modeled with linear regression, because we could try to predict how much a trade outperformed the market.
However, the main question we cared about was simpler:
Did the trade beat the market or not?
So each trade could be classified into one of two outcomes:
1 → the trade outperformed the market
0 → the trade did not outperform the market
Because the response variable only had two possible outcomes, this became a binary classification problem.
Instead of predicting a numerical value, the model estimated the probability that a trade would outperform the market.
This is exactly the type of situation where logistic regression is more appropriate than linear regression.
8.2.3 Why linear regression fails for classification
At first, it might seem like we could still use linear regression for classification problems.
For example, in the insider trading project mentioned earlier, we might try to predict whether a trade beats the market using a linear model like
\[ y = \beta_0 + \beta_1 x \]
However, this creates several problems when the response variable only takes values of 0 or 1.
Problem 1: Predictions can fall outside the probability range
Linear regression can produce any real number as a prediction.
For example, the model might predict:
- -0.3
- 0.7
- 1.4
But probabilities must always stay between:
\[ 0 \le p \le 1 \]
A predicted probability of -0.3 or 1.4 does not make sense.
So linear regression cannot guarantee valid probability predictions.
Problem 2: The relationship between predictors and probability is not linear
With binary outcomes, the relationship between predictors and the probability of an event usually follows an S-shaped pattern rather than a straight line.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-10,10,200)
# linear relationship
linear = 0.1*x + 0.5
# logistic function
logistic = 1/(1+np.exp(-x))
plt.figure(figsize=(7,5))
plt.plot(x, linear, label="Linear regression", linestyle="--")
plt.plot(x, logistic, label="Logistic regression")
plt.axhline(0, color='gray', linewidth=0.5)
plt.axhline(1, color='gray', linewidth=0.5)
plt.xlabel("Predictor")
plt.ylabel("Predicted Value / Probability")
plt.legend()
plt.show()
For example, as a predictor increases, the probability of success might increase quickly at first, then level off as it approaches 1.
A straight line cannot capture this behavior properly.
Because of these issues, we need a model that:
- produces predictions between 0 and 1
- can model nonlinear probability behavior
This is where logistic regression comes in.
8.2.4 Logistic regression solution
To solve the problems with linear regression, we need a model that predicts probabilities between 0 and 1.
Logistic regression does this by using a special function called the logistic function.
Instead of predicting the outcome directly, logistic regression first computes a linear combination of the predictors:
\[ z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]
This part is actually very similar to linear regression.
However, instead of using this value directly as the prediction, logistic regression passes it through the logistic function:
\[ P(Y = 1 | X) = \frac{1}{1 + e^{-z}} \]
where:
\(P(Y = 1 \mid X)\) = the probability that the outcome is 1 given the predictor variables \(X\)
\(Y\) = the response variable we are trying to predict
\(X\) = the predictor variables used to make the prediction
\(z\) = the input to the logistic function
\(e\) = Euler’s number, approximately 2.718
This transformation converts any real number into a value between 0 and 1, which allows the model to represent probabilities.
The logistic function produces an S-shaped curve, where:
- probabilities close to 0 indicate the event is unlikely
- probabilities close to 1 indicate the event is likely
- probabilities near 0.5 represent uncertainty
Because of this transformation, logistic regression avoids the main problem of linear regression. Predictions are always valid probabilities.
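This behavior is easy to check numerically. A minimal sketch of the logistic function:

```python
import numpy as np

def logistic(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Large negative z -> near 0, z = 0 -> exactly 0.5, large positive z -> near 1
probs = logistic(np.array([-5.0, 0.0, 5.0]))
```

Whatever value z takes, the output always lies strictly between 0 and 1, so it can always be read as a probability.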
8.2.5 Interpreting logistic regression predictions
Logistic regression produces a predicted probability that an observation belongs to class 1.
For example, the model might output:
\[ P(Y = 1 \mid X) = 0.82 \]
This means the model estimates an 82% probability that the outcome belongs to class 1 given the predictor variables.
However, in many applications we still need to convert this probability into a final class prediction.
A common rule is to use a threshold of 0.5:
if \(P(Y = 1 \mid X) \geq 0.5\), predict class 1
if \(P(Y = 1 \mid X) < 0.5\), predict class 0
The threshold does not always have to be 0.5. In some applications it may be adjusted depending on the cost of incorrect predictions.
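The thresholding rule can be sketched in a couple of lines (the probability values here are made up):

```python
import numpy as np

probs = np.array([0.82, 0.31, 0.50, 0.07])  # hypothetical P(Y=1|X) values

# Default rule: predict class 1 when the probability is at least 0.5
pred_default = (probs >= 0.5).astype(int)   # [1, 0, 1, 0]

# A stricter threshold, e.g. when false positives are costly
pred_strict = (probs >= 0.8).astype(int)    # [1, 0, 0, 0]
```

Raising the threshold makes the model more conservative about predicting class 1, which is exactly the trade-off mentioned above.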
8.2.6 Logistic regression in Python
Up to this point we have discussed how logistic regression works conceptually. Now we will look at a simple example of how it can be implemented in Python.
In this example, the goal is to predict whether a breast tumor is cancerous or not based on measured features from cell samples.
We will use the Breast Cancer Wisconsin dataset, which is included in the scikit-learn library. This dataset contains several numerical measurements that will be used as predictor variables.
The response variable is binary:
- 1 → tumor is not cancerous
- 0 → tumor is cancerous
This makes it a binary classification problem, which is exactly the type of situation where logistic regression is commonly used.
8.2.6.1 Step 1: Import libraries
First we import the libraries needed to run logistic regression.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
These modules allow us to:
- load the dataset
- train a logistic regression model
- split the data into training and testing sets
- evaluate model performance
8.2.6.2 Step 2: Load the dataset
Next we load the dataset and separate the predictor variables and the response variable.
data = load_breast_cancer()
X = data.data
y = data.target
print("Predictor variables used in the model:")
print(data.feature_names)
Predictor variables used in the model:
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
Here:
- X contains the predictor variables
- y contains the response variable (0 or 1)
8.2.6.3 Step 3: Split the data
Before training the model, we split the data into training and testing sets so we can evaluate how well the model performs on new data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
In this case:
- 80% of the data is used for training
- 20% of the data is used for testing
8.2.6.4 Step 4: Train the model
Now we create the logistic regression model and train it using the training data.
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
The model estimates the probability that the outcome is 1 given the predictor variables.
\[ P(Y = 1 \mid X) = \frac{1}{1 + e^{-z}} \]
where
\[ z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \]
The .fit() function allows the model to learn the relationship between the predictor variables and the response variable.
8.2.6.5 Step 5: Make predictions
Once the model has been trained, we can use it to predict outcomes for the testing data.
y_pred = model.predict(X_test)
These predictions are class labels:
- 0 → predicted cancerous
- 1 → predicted not cancerous
8.2.6.6 Step 6: Predict probabilities
Logistic regression can also estimate the probability that an observation belongs to each class.
y_prob = model.predict_proba(X_test)
print(np.round(y_prob[:5], 4))
[[1.213e-01 8.787e-01]
[1.000e+00 0.000e+00]
[9.984e-01 1.600e-03]
[1.200e-03 9.988e-01]
[1.000e-04 9.999e-01]]
Each row contains two probabilities:
- probability of class 0
- probability of class 1
For example, for output:
[0.1213, 0.8787]
This means the model estimates:
• 12.13% probability the tumor is cancerous
• 87.87% probability the tumor is not cancerous
8.2.6.7 Step 7: Evaluate model performance
Finally, we evaluate how well the model performs using several common metrics.
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[39  4]
[ 1 70]]
0.956140350877193
precision recall f1-score support
0 0.97 0.91 0.94 43
1 0.95 0.99 0.97 71
accuracy 0.96 114
macro avg 0.96 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114
From the confusion matrix we can see how the model performed on the test data.
• 39 tumors were correctly predicted as cancerous
• 70 tumors were correctly predicted as not cancerous
• 4 tumors were predicted as not cancerous when they were actually cancerous
• 1 tumor was predicted as cancerous when it was actually not cancerous
In total, the model made 109 correct predictions out of 114 cases, which gives an overall accuracy of about 95.6%.
Looking at the classification report, we can also see that the model performs very well for both classes:
• For cancerous tumors (class 0), the model correctly identifies most of them with a recall of 0.91.
• For non-cancerous tumors (class 1), the recall is 0.99, meaning the model almost always correctly identifies tumors that are not cancerous.
Overall, these results suggest that the logistic regression model performs very well on this classification task and is able to correctly classify most tumors in the dataset.
8.2.7 Advantages and limitations of logistic regression
Like any statistical model, logistic regression has both strengths and limitations. Understanding these helps us decide when it is a good model to use and when another method might be better.
8.2.7.1 Advantages
• Easy to interpret
One of the biggest advantages of logistic regression is that it is relatively easy to understand compared to more complex machine learning models.
Each coefficient tells us how a predictor variable affects the probability of the outcome.
Because of this, logistic regression is often preferred when we want a model that is both predictive and interpretable.
• Produces probabilities
Unlike some classification models that only give a final class prediction, logistic regression estimates the probability that an observation belongs to a particular class.
This is useful in many real world situations where we care about risk or likelihood, not just the final classification.
• Computationally efficient
Logistic regression is also relatively fast to train and does not require extremely large amounts of data.
Because of this, it is often used as a baseline model when starting a new classification problem.
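To make the interpretability point concrete: exponentiating a coefficient gives an odds ratio, i.e. the factor by which the odds of Y = 1 are multiplied when that predictor increases by one unit. The coefficient value below is made up for illustration:

```python
import math

beta1 = 0.7  # hypothetical coefficient for one predictor

# A one-unit increase in the predictor multiplies the odds of Y = 1
# by exp(beta1); here the odds roughly double
odds_ratio = math.exp(beta1)
```

This direct reading of each coefficient is something more complex models such as random forests or neural networks do not offer.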
8.2.7.2 Limitations
• Assumes a linear relationship in the log-odds
Logistic regression assumes that the predictor variables have a linear relationship with the log-odds of the outcome.
If the true relationship between the predictors and the response is highly nonlinear, the model may not perform as well.
• Limited ability to capture complex patterns
More flexible models such as decision trees, random forests, or neural networks are able to capture complex nonlinear relationships that logistic regression cannot.
Because of this, logistic regression may not perform well on very complicated datasets.
• Sensitive to correlated predictors
If predictor variables are highly correlated with each other, the estimated coefficients can become unstable and harder to interpret.
This can sometimes make it difficult to understand the true effect of each predictor on the outcome.
8.2.7.3 When logistic regression works well
In practice, logistic regression tends to work well when
• the response variable is binary
• the relationship between predictors and the outcome is relatively simple
• interpretability of the model is important
Because of these properties, logistic regression is widely used in areas such as medicine, finance, and risk analysis.
8.2.7.4 Works Cited
Scikit-learn Logistic Regression Documentation
8.3 Decision Trees
This presentation was prepared by Riley Sawyer.
8.3.1 Introduction
Decision trees (DTs) are a non-parametric supervised learning method used for classification and regression. This presentation will cover:
How decision trees work
The advantages and disadvantages of decision trees
Classification
Regression
How to use decision trees in Python
Remedies
8.3.2 How decision trees work
Structure of a decision tree
A decision tree has a flowchart-like structure that allows individuals to clearly visualize the decision-making process. Decision trees have three main components:
Root node: This is the first node of the tree. All following nodes can be traced back to the root node.
Internal nodes: These nodes represent the features used for splitting the data. These features determine the path taken from this node onward. There are two types of internal nodes:
Decision nodes: nodes where a specific question, attribute, or choice is made to split the data.
Chance nodes: these nodes represent uncertainty in the decision-making process. Multiple outcomes are possible here, so a probability is assigned to each outcome and the path is determined from there.
Leaf nodes: These are the terminal nodes of the tree that represent the final output or decision. In classification tasks, leaf nodes represent class labels, while in regression tasks, they represent continuous values.
What should I eat: a decision tree 
8.3.3 Advantages and disadvantages of decision trees
Advantages
Simple to understand and interpret
Requires little data preprocessing, e.g.:
feature scaling
normalization
Can handle both numerical and categorical data
Can capture non-linear relationships
Can be easily validated with statistical tests
Disadvantages
Prone to overfitting, especially with deep trees. Can be remedied by:
Pruning
Setting a maximum depth
Can be unstable, as small changes in the data can lead to different tree structures
- Can be remedied by ensemble methods like random forests
Can be biased towards features with more levels. May not perform well with imbalanced datasets. Can be remedied by:
Class weighting
Resampling methods
May not capture complex relationships as well as other models, e.g.:
XOR problems: where classes cannot be separated by a single linear decision boundary
Parity problems: determining if something is even or odd. XOR is a type of parity problem.
Multiplexer problems: complex, non-linear Boolean functions.
8.3.4 Classification
Decision trees can be used to classify data into different categories, e.g.:
Image classification
Spam detection
Customer segmentation
This uses DecisionTreeClassifier from sklearn.tree module in Python.
8.3.5 Regression
Decision trees can also be used for regression tasks, where the goal is to predict a continuous value, e.g.:
Predicting house prices
Forecasting sales
Estimating average customer spending
This uses DecisionTreeRegressor from sklearn.tree module in Python.
8.3.6 How to use decision trees in Python
Decision trees can be implemented in Python using the scikit-learn library. However, here are other libraries you may want to use to supplement your DT usage.
Step 1: Import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
Step 2: Import scikit-learn modules
Before accessing any decision tree model, you will need to install the designated library. Use the following code to install: pip install -U scikit-learn
Now, you can use the following code to import the indicated modules:
from sklearn import tree # this is where your DTs are located
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
confusion_matrix uses a table layout and is used to evaluate the performance of a classification model. It compares the predicted labels from your trained decision tree with the true labels from the data set.
- It is a summary of the correct and incorrect predictions.
accuracy_score is a metric used to evaluate the performance of a classification model. It calculates the ratio of correctly predicted instances to the total number of instances in the dataset.
classification_report is a function that provides a detailed report of the precision, recall, F1-score, and support for each class in a classification model. It helps to evaluate the performance of the model for each class.
precision: number of true positive predictions vs. number of positive predictions made by the model.
recall: number of true positive predictions vs. number of actual positive instances in the dataset.
F1-score: the harmonic mean of precision and recall, combining the two into a single score.
train_test_split is a function used to split a dataset into training and testing sets. It allows you to decide a random state for reproducibility.
plot_tree is a function used to visualize a decision tree model. Can help to understand how the model makes predictions and to identify important features in the dataset.
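These metric definitions can be checked by hand on a tiny made-up set of labels; the numbers below match what classification_report would compute for class 1:

```python
# Made-up true and predicted labels for a binary problem
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1

precision = tp / (tp + fp)                           # 0.75
recall = tp / (tp + fn)                              # 0.75
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean, 0.75
```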
Step 3: Name your decision tree model
You’ll know based on your statistical questions what type of decision tree you will want to call.
For classification:
clf = tree.DecisionTreeClassifier()
For regression:
reg = tree.DecisionTreeRegressor()
Step 4: Load data
For this example, we will be using the iris dataset. It contains 150 samples of iris flowers, with 4 features (sepal length, sepal width, petal length, petal width) and a target variable (species of iris).
from sklearn.datasets import load_iris
iris = load_iris()
Assign your features to X and y. Your data will always be X and your target variable will always be y.
X = iris.data
y = iris.target
Always split your data into training and testing sets. If you skip this step, you will not be able to evaluate the performance of your model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train_test_split splits the data into training and testing sets. test_size indicates how much of the dataset you want in your testing group (here, it is 20%). random_state ensures reproducibility. This can be any integer.
Step 5: Fit your model
Also known as “training” your model. Fit your model only on the training set.
clf.fit(X_train, y_train)
fit is used to train a model on training data. It takes the following format: your_model_type.fit(X_train, y_train)
Step 6: Make predictions
Now that your model is trained, you can use it to predict the classifications from the testing set. Save the predictions that your model makes in a variable you can call upon later. Here, we will call it y_pred.
y_pred = clf.predict(X_test)
predict is used to make predictions on new data. It takes the following format: your_model_type.predict(X_test)
Step 7: Evaluate your model
It is good practice to test the performance of your model.
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[10  0  0]
[ 0 9 0]
[ 0 0 11]]
1.0
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
You can also show the decision tree itself.
plt.figure(figsize=(12,8))
plot_tree(clf,
feature_names=iris.feature_names, # refers to x-variables
class_names=iris.target_names, #refers to y-variables
filled=True) # colors the nodes to indicate the majority class
plt.show()
8.3.7 Remedies
Overfitting
When a decision tree is too complex, it may fit the training data too well and be unable to predict new data well.
You can check whether you have overfit your data by comparing the performance of your model on the training data vs. the testing data.
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print(f"Training score: {train_score}")
print(f"Testing score: {test_score}")
Training score: 1.0
Testing score: 1.0
.score is used to evaluate the performance of a model on a given dataset. It takes the following format: your_model_type.score(X, y)
If your training score and testing score are similar, your model has not been overfit. If your training score is much higher than your testing score, your model has likely overfit the data.
Here are some remedies for overfitting:
1. Pruning: This involves removing branches of the tree that do not significantly improve your model’s predictions. This can be done by:
Setting a minimum number of samples per split: DecisionTreeClassifier(min_samples_split=n)
Setting a maximum depth: This limits the depth of the tree, preventing it from becoming too complex and overfitting the data. A good place to start is n=3.
DecisionTreeClassifier(max_depth=n)
Setting minimum samples per leaf: This sets a minimum number of samples required to be at a leaf node. This can prevent the tree from creating a branch that only applies to a small number of samples, which can lead to overfitting.
DecisionTreeClassifier(min_samples_leaf=n)
To find out what number you should set for these parameters, you can use GridSearchCV from sklearn.
from sklearn.model_selection import GridSearchCV
GridSearchCV allows you to set a range of values for specified parameters and test each combination for the best performing model.
You must specify the parameters you want to test and the values you want to test for. Your parameters should be stored in a dictionary:
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 5, 10]
}
Now, you must call GridSearchCV and specify the model, the parameter grid, as well as the number of cross folds and the scoring metric.
grid_search = GridSearchCV(
    estimator = clf,          # the model you want to test
    param_grid = param_grid,  # the parameters
    cv = 5,                   # the number of cross folds
    scoring = 'accuracy'      # the scoring metric
)
Grid search must now be fit to your data, like so:
grid_search.fit(X_train, y_train)
Grid search actually reports many different parameters of your model. You can find all the parameters here. However, we only tuned three: max_depth, min_samples_split, and min_samples_leaf.
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
Best parameters: {'max_depth': 3, 'min_samples_leaf': 5, 'min_samples_split': 2}
Best score: 0.95
Now, you can create a new model including these updated parameters.
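As a sketch (re-fitting the search on a train/test split of the Iris data so the snippet is self-contained), the best parameters can be unpacked directly into a new model:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [2, 5]}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# Unpack the winning parameters into a fresh model and refit
best_clf = DecisionTreeClassifier(random_state=42, **grid_search.best_params_)
best_clf.fit(X_train, y_train)
print("Test accuracy:", best_clf.score(X_test, y_test))
```

The `**grid_search.best_params_` unpacking keeps the new model in sync with whatever the search found, so you never have to copy values by hand.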
- Changing the criterion: This involves changing the criterion used to split the data at each node. Changing the criterion can help to reduce overfitting by making the tree less sensitive to small changes in the data.
DecisionTreeClassifier(criterion='gini')
- gini impurity: the probability of misclassifying a randomly chosen element from the dataset
- entropy: the average amount of information needed to classify an element from the dataset
- log_loss: the negative log-likelihood of the true labels given the predicted probabilities from the model. Used when the output is a probability distribution over multiple classes.
DecisionTreeRegressor(criterion='squared_error')
- squared_error: the mean squared error between the predicted values and the true values
- friedman_mse: the mean squared error between the predicted values and the true values, but with a bias towards reducing the variance of the predictions
- absolute_error: the mean absolute error between the predicted values and the true values
- poisson: the mean Poisson deviance between the predicted values and the true values, used for count data
- Using ensemble methods: Ensemble methods (e.g., random forests) combine multiple decision trees to make predictions, thus generalizing the model and reducing the risk of overfitting.
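A quick sketch of the ensemble remedy (random forests are covered in detail in the next section; the 5-fold cross-validation here reuses the Iris data from this chapter):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# An ensemble of 50 trees; each tree sees a bootstrap sample and random features
forest = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(forest, X, y, cv=5)
print("Mean CV accuracy:", round(scores.mean(), 3))
```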
Unbalanced datasets
When one class is much more prevalent than the other, the decision tree may be biased towards the majority class.
You can tell if a dataset is unbalanced by counting the number of instances in each class. If one class has significantly more instances than the other, the dataset is likely unbalanced. Note that decision trees use np.float32 array internally.
unique, counts = np.unique(y, return_counts=True) # counts the number of instances in each class
num_in_class = {int(k): int(v) for k, v in zip(unique, counts)} # readable format
print(num_in_class)
{0: 50, 1: 50, 2: 50}
Here are some remedies for unbalanced datasets:
Class weighting: This involves assigning higher weights to the minority class during the training process, which can help the model pay more attention to the minority class and improve its performance on that class.
DecisionTreeClassifier(class_weight='balanced')
Resampling methods: This involves either oversampling the minority class (creating synthetic samples) or undersampling the majority class (removing samples) to balance the dataset.
- Oversampling:
from imblearn.over_sampling import SMOTE
- Undersampling:
from imblearn.under_sampling import RandomUnderSampler
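If imblearn is unavailable, the idea behind random undersampling can be sketched with NumPy alone (this toy `keep`-index approach is illustrative, not RandomUnderSampler's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced dataset: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 2))

# Undersample every class down to the minority-class count
minority_count = np.bincount(y).min()
keep = np.concatenate([
    rng.choice(np.where(y == c)[0], size=minority_count, replace=False)
    for c in np.unique(y)
])
X_bal, y_bal = X[keep], y[keep]
print(np.bincount(y_bal))  # [10 10]
```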
8.3.8 Further readings
8.4 Random Forests
This presentation was prepared by Sara Watanabe.
8.4.1 Overview
Random forest is a machine learning algorithm that uses a combination of bagging, feature randomization, and complex decision trees to make predictions. The algorithm uses a collection of decision trees to reach a single result, making classification or regression predictions from the input data.
8.4.2 Decision Trees
A Decision Tree is a method that uses a flow-chart-like structure to make decisions by asking a series of simple questions about the data. They can output either a categorical or numerical result, and are constructed from nodes and branches.
8.4.2.1 Structure
Root Node: Asks a question about data, contains all training samples
Internal nodes: Where decisions are made about data features
Leaf Nodes: Tree end points; the final decision/prediction
Branches: Lines connecting the nodes
Code
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4.5, 2))
# Turn off axes
ax.axis('off')
# Helper function to draw a labeled box
def draw_box(x, y, text, color):
ax.text(
x, y, text,
ha='center', va='center',
fontsize=8, fontweight='bold',
bbox=dict(boxstyle="round,pad=0.4", facecolor=color, edgecolor='black')
)
# Root node
draw_box(0.5, 0.9, "Root Node", "#2e7d32")
# Internal nodes
draw_box(0.25, 0.6, "Internal Node", "#c5e1a5")
draw_box(0.75, 0.6, "Internal Node", "#c5e1a5")
# Leaf nodes
draw_box(0.1, 0.3, "Leaf Node", "#fff9c4")
draw_box(0.4, 0.3, "Leaf Node", "#fff9c4")
draw_box(0.6, 0.3, "Leaf Node", "#fff9c4")
draw_box(0.9, 0.3, "Leaf Node", "#fff9c4")
# Helper function to draw a line
def connect(x1, y1, x2, y2):
ax.plot([x1, x2], [y1, y2])
# Connections from root
connect(0.5, 0.85, 0.25, 0.65)
connect(0.5, 0.85, 0.75, 0.65)
# Connections from left internal
connect(0.25, 0.55, 0.1, 0.35)
connect(0.25, 0.55, 0.4, 0.35)
# Connections from right internal
connect(0.75, 0.55, 0.6, 0.35)
connect(0.75, 0.55, 0.9, 0.35)
plt.show()
8.4.2.2 Iris Dataset Example
8.4.2.2.1 Dataset Overview
The Iris dataset has 150 samples of iris flowers. This data measures four features: sepal length, sepal width, petal length, and petal width. These features are used by the Decision Tree algorithm to classify the samples into three classes: setosa, versicolor, and virginica.
from sklearn.datasets import load_iris
import pandas as pd
# Load data
iris = load_iris()
X = iris.data # all features
y = iris.target # all classes (setosa, versicolor, virginica)
df = pd.DataFrame(X, columns=iris.feature_names)
df["species"] = iris.target_names[y]
df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
8.4.2.2.2 Modelling the Data
Here we initialize the Decision Tree Classifier and fit it to the data. The Decision Tree Classifier is imported from the sklearn package, and matplotlib is imported for future visualization.
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Initialize the model with Gini impurity (default)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
# Fitting the model calculates Gini values for all possible splits
model.fit(X, y)
DecisionTreeClassifier(max_depth=3, random_state=42)
8.4.2.3 Growth-Stop Conditions
Decision trees are prone to overfitting, which can cause the model to fragment data into small categories that do not generalize to new data. To prevent this, growth-stop conditions can be passed as parameters to the model. These include:
- max_depth: Sets the maximum depth of the tree, which limits how many splits the tree can make. In this case, we set it to 3.
- min_samples_split: Sets the minimum number of samples required to split a node. If a node has fewer samples than this threshold, it will not be split further.
- min_impurity_decrease: Sets the minimum reduction in the Gini Coefficient required to split a node. If the reduction is less than this value, the node will not be split.
8.4.2.4 Gini Coefficient
In order for the Decision Tree Classifier to determine the best splits, it uses a metric to evaluate the quality of splits. The Gini Coefficient is used as the default metric, and is calculated as: \[ \mathrm{Gini}(p) = 1 - \sum_{i=1}^{n} p_i^2 \]
Where \(p_i\) is the proportion of samples belonging to class \(i\) at a given node.
Gini can range from 0 to 0.5 for binary classification, where 0 indicates a pure node and 0.5 indicates a completely impure node (e.g., 50% True and 50% False).
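The formula can be checked numerically with a small helper (the `gini` function below is illustrative, not part of sklearn's public API):

```python
import numpy as np

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(proportions, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(gini([1.0, 0.0]))                 # pure node -> 0.0
print(gini([0.5, 0.5]))                 # maximally impure binary node -> 0.5
print(round(gini([1/3, 1/3, 1/3]), 4))  # three balanced classes -> 0.6667
```

The last value matches the Gini reported at the root of the Iris tree in the training walkthrough below, where the three classes are perfectly balanced.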
8.4.2.5 Training Process
Decision Trees are built by recursively splitting training samples in a series of steps:
Step 1. Start with the root node; this contains all training samples.
# Root node contains all 150 samples
print("Number of samples at root:", X.shape[0])
class_dist = {str(name): int(sum(y == i)) for i, name in enumerate(iris.target_names)}
print("Class distribution at root:", class_dist)
Number of samples at root: 150
Class distribution at root: {'setosa': 50, 'versicolor': 50, 'virginica': 50}
Step 2. The algorithm evaluates calculated metrics to determine the features that best split the data. In this case, the Gini Coefficient is used. The smaller the Gini value, the better the split.
# Step 2: Evaluate metrics to find the best split
# Show Gini at root and splits (access tree structure)
tree = model.tree_
print("Gini at root:", tree.impurity[0])
print("Gini at left child of root:", tree.impurity[1])
print("Gini at right child of root:", tree.impurity[2])
print("Number of samples at root:", tree.n_node_samples[0])
Gini at root: 0.6666666666666667
Gini at left child of root: 0.0
Gini at right child of root: 0.5
Number of samples at root: 150
Internally, all possible Gini values are calculated for each feature and threshold at each step, and the split uses the feature and threshold that produce the lowest Gini value.
Step 3. The data is then split based on the selected feature and threshold value.
# You can see which feature and threshold were chosen at root
root_feature = tree.feature[0]
root_threshold = tree.threshold[0]
print("Best feature at root:", iris.feature_names[root_feature])
print("Threshold at root:", root_threshold)
print("Samples go left if feature < threshold, else right")
print("\n")
Best feature at root: petal length (cm)
Threshold at root: 2.449999988079071
Samples go left if feature < threshold, else right
Step 4. Steps 2 and 3 are repeated until growth-stop conditions are met. These include a maximum tree depth, a minimum number of samples required to split a node, and a minimum reduction in the Gini Coefficient required to split a node.
# max_depth=3 and minimum samples per split control stopping
print("Max depth set:", model.max_depth)
print("Tree stops splitting when nodes reach max depth or minimum samples")
Max depth set: 3
Tree stops splitting when nodes reach max depth or minimum samples
The resulting tree is visualized below:
Code
# 4. Plot the tree with smaller size
plt.figure(figsize=(6.5, 4.5)) # adjust width and height
plot_tree(
model,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True,
fontsize=7 # smaller font
)
plt.show()
- Root Node: Contains all 150 samples, 50 in each class, with a Gini Coefficient of 0.667
- Internal Nodes:
- The first split uses petal length with a threshold of 2.45.
- The left split uses petal width with a threshold of 0.8.
- The right split uses petal width with a threshold of 1.75.
- Leaf Nodes: The Gini Coefficients at the leaf nodes are considerably small, indicating that the samples at these nodes are mostly of one class.
Seen in each box is the calculated Gini Coefficient, the number of samples at the node, and the distribution of classes at the node.
8.4.2.6 Benefits and Limitations of Decision Trees
Benefits
- Easy to understand and interpret
- Computationally efficient in both time and space
Limitations
- Decision Trees are prone to bias and overfitting
- As a tree grows in size, it can result in data fragmentation
- Using the random forest algorithm reduces the risk of making extreme predictions due to overfitting
- Decision Trees tend to stay on the smaller side, making it hard to deal with many complex variables
8.4.3 Random Forest Classifier
Random Forest is an algorithm that uses a combination of bagging, feature randomization, and complex decision trees to make predictions.
8.4.3.1 Characteristics:
- While singular Decision Trees consider the use of all possible feature splits, Random Forest only considers a subset of features for each split.
- Random Forest uses random features to make splits, which reduces correlation between trees and improves generalization
- Random Forest can be used for classification or regression problems
- classification uses the majority vote of all trees to come to a conclusion
- regression uses the average of all trees to make a prediction
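The two aggregation rules can be sketched directly (the per-tree predictions here are hypothetical, not from a fitted forest):

```python
import numpy as np

# Hypothetical predictions from five trees for a single sample
class_votes = np.array([0, 2, 2, 1, 2])          # classification output per tree
reg_preds = np.array([3.1, 2.9, 3.4, 3.0, 3.1])  # regression output per tree

# Classification: the class with the most votes wins
majority = np.bincount(class_votes).argmax()
print("Majority vote:", majority)

# Regression: the tree outputs are averaged
print("Average prediction:", reg_preds.mean())
```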
8.4.3.2 Iris Dataset Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Load data
iris = load_iris()
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
df["species"] = iris.target_names[y]
df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
8.4.3.3 Bagging
Random Forest uses a combination of bootstrap and aggregation, also known as bagging, during the creation process.
- Bootstrap: A resampling technique that uses sampling with replacement to create training sets
- A training set is selected with replacement for each Decision Tree made
- The unselected values will be set aside as part of the out-of-bag samples
- Individual Decision Trees are then trained independently using respective training sets, selecting their own features for each split
- Aggregation: The process of combining the predictions of all the trees created by the random forest algorithm
Roughly one third of the training data is never selected by the bootstrap sample; these out-of-bag samples are set aside and will be used for cross-validation later on
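The bootstrap/out-of-bag split can be illustrated with NumPy (a sketch of the resampling idea, not sklearn's internal implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 150  # e.g. the number of Iris samples

# Bootstrap: draw n indices with replacement
boot = rng.choice(n, size=n, replace=True)

# Out-of-bag: indices the bootstrap never drew
oob = np.setdiff1d(np.arange(n), boot)
print(f"OOB fraction: {len(oob) / n:.2f}")  # typically about 0.37
```

On average a bootstrap sample leaves out about (1 - 1/n)^n ≈ 37% of the data, which is where the roughly one-third out-of-bag fraction comes from.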
8.4.3.4 Tree construction
Step 1. Create the training and testing sets
- These will be used to create bootstrap samples and out-of-bag samples for each tree during the fitting process
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
Step 2. Create the Random Forest
- Assign a random set of features to each dataset
- Set parameters; n_estimators is the number of trees that will be created
# Create Random Forest model
model = RandomForestClassifier(
n_estimators=100, # number of trees
random_state=42
)
Step 3. Fit the model: Train each tree
- Use the best feature to split the data
- The feature that makes the clearest separation is used
- Trees continue to grow until growth-stop conditions are met
# Train model (same idea as decision tree fit)
model.fit(X_train, y_train)
RandomForestClassifier(random_state=42)
Internal Process:
- Bootstrap samples are created for each tree from the training set created in step 1
- A random subset of features is selected to consider at each split
- The best feature is selected to split the data, and the best threshold value is also determined
- The data is split based on the selected feature and threshold value
- Repeat steps 2-4 until growth-stop conditions are met (e.g., max depth, minimum samples per split, minimum impurity decrease)
Step 4. Make Predictions
# Make predictions
y_pred = model.predict(X_test)
Step 5. Test all of the trees with the out-of-bag set to determine how good the model is
- You can adjust how many random variables are considered, number of trees, etc. in an attempt to improve accuracy
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 1.0
Our accuracy score is 1.0, which means that all of our predictions were correct. This is a good score, but it is important to note that this is likely due to the simplicity of the dataset and the fact that we have a small number of features. Adjusting things like the number of trees, the number of features considered for each split, and the depth of the trees can help to improve accuracy on more complex datasets.
8.4.3.5 Feature Selection
Feature Randomization Selection of a random subset of features for each split
- Reduces correlation between trees and improves generalization
- The Random Forest algorithm uses this during the training process
- Number of features that are considered can be changed to improve model accuracy; the default value is the square root of the total number of features
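As a sketch on the Iris data, the default can be overridden by passing max_features explicitly (2 here happens to equal the square root of the 4 features, the classifier's default):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Consider only 2 of the 4 features at each split
model = RandomForestClassifier(n_estimators=50, max_features=2, random_state=42)
model.fit(X, y)
print("Training accuracy:", model.score(X, y))
```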
Feature Importance: The relative importance of each feature in the data
- Tells what features are the most important in making accurate predictions
- Random Forest has built-in feature importance
- Feature importance is measured as the average decrease in impurity across all trees
- if a feature splits the data into “purer” subsets, it is considered more important
- Can also consider how much a feature is used and how much it contributes to the overall predictions of the algorithm
plt.figure()
plt.bar(iris.feature_names, model.feature_importances_)
plt.xticks(rotation=45)
plt.title("Feature Importance (Random Forest)")
plt.show()
The horizontal axis shows the features, and the vertical axis shows the average reduction in Gini Coefficient caused by splits using that feature across all trees.
As shown in the figure, petal length and petal width are the most important features for making accurate predictions in this dataset. This means that they created the most “pure” splits in the data, and were used more often in the Decision Trees to make accurate predictions.
8.4.3.6 Benefits and Limitations of Random Forest
Benefits
- “Wisdom of crowds”: even if a few trees make a mistake, the overall collective decision will mitigate these mistakes
- Reduces the risk of overfitting compared to a single Decision Tree through feature randomization and bagging
- Can handle large datasets with higher dimensionality and complex interactions between features
Limitations
- Decision Trees are simpler and can be easier to interpret
- Random Forest can be computationally intensive, especially with a large number of trees and features
- Training time can get long with large datasets, and it may require more memory to store the multiple trees
8.4.3.7 Regression example
Now we are going to use the Random Forest regression model to predict the median house value per neighborhood in California.
Step 1. Load the dataset and split into training and testing sets
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd
# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target
# Print out the first few rows of the dataset
df = pd.DataFrame(X, columns=data.feature_names)
df["target"] = y
print(df.head())
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
 MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
Longitude target
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
Our x-values will be all features in this dataset (median income, average house age, average number of rooms, average number of bedrooms, population, average occupancy, latitude, and longitude); the target variable is the median house value per neighborhood. Based on my knowledge of housing prices, I am expecting median income to be the most important feature in making accurate predictions.
Step 2. Create the Random Forest regression model, train it, and make predictions
# Create Random Forest regression model
model = RandomForestRegressor(
n_estimators=100, # number of trees
random_state=42
)
# Train model
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
Step 3. Evaluate the performance of the model using mean squared error
# Evaluate performance
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)
Mean Squared Error: 0.2557259876588585
Mean squared error tells us the average squared difference between the predicted values and the actual values. In this case, an MSE of about 0.256 indicates our model is doing a pretty good job at predicting the median house values, but if we wanted to improve our model’s accuracy we could adjust parameters such as the number of trees, the number of features considered for each split, and the depth of the trees.
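Because the target is measured in units of $100,000, taking the square root of the MSE (a sketch using the rounded value reported above) puts the error back into the target's own units:

```python
import numpy as np

mse = 0.2557  # the MSE reported above (rounded)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.3f}")  # about 0.506, i.e. roughly $50,600 of median house value
```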
Step 4. Evaluate feature importance to determine which features are most important in making accurate predictions
plt.figure()
plt.bar(data.feature_names, model.feature_importances_)
plt.xticks(rotation=45)
plt.title("Feature Importance (Random Forest Regression)")
plt.show()
As seen in the figure above, median income is the most important feature for making accurate predictions in this dataset, as predicted. This means that it created the most “pure” splits in the data, and was used more often in the Decision Trees to make accurate predictions.
8.4.4 Further Reading
Decision Trees Explained | Towards Data Science
Random Forest, Explained | Towards Data Science
What Is Random Forest in Machine Learning?
Random Forest Feature Importance Computed with Python
Feature Importance & Random Forest - Python Example
StatQuest: Random Forests Part 1 - Building, Using and Evaluating
Brief overview of Random Forest
In-depth Walkthrough and Example
Step-By-Step Guide with Code Examples
General Overview of Random Forest
Explanation of feature importance with examples
In-depth explanation of feature importance
Step-by-step guide to the random forest training process
Explanation of the gini metric