10  Supervised Learning

10.1 Introduction

Supervised and unsupervised learning represent two core approaches in the field of machine learning, each with distinct methodologies, applications, and goals. Understanding the differences and applicabilities of these learning paradigms is fundamental for anyone venturing into data science and machine learning.

  • Supervised Learning Supervised learning is characterized by its use of labeled datasets to train algorithms. In this paradigm, the model is trained on a pre-defined set of training examples, which include an input and the corresponding output. The goal of supervised learning is to learn a mapping from inputs to outputs, enabling the model to make predictions or decisions based on new, unseen data. This approach is widely used in applications such as spam detection, image recognition, and predicting consumer behavior. Supervised learning is further divided into two main categories: regression, where the output is continuous, and classification, where the output is categorical.

  • Unsupervised Learning In contrast, unsupervised learning involves working with datasets without labeled responses. The aim here is to uncover hidden patterns, correlations, or structures from input data without the guidance of an explicit output variable. Unsupervised learning algorithms are adept at clustering, dimensionality reduction, and association tasks. They are invaluable in exploratory data analysis, customer segmentation, and anomaly detection, where the structure of the data is unknown, and the goal is to derive insights directly from the data itself.

The key difference between supervised and unsupervised learning lies in the presence or absence of labeled output data. Supervised learning depends on known outputs to train the model, making it suitable for predictive tasks where the relationship between the input and output is clear. Unsupervised learning, however, thrives on discovering the intrinsic structure of data, making it ideal for exploratory analysis and understanding complex data dynamics without predefined labels.

Both supervised and unsupervised learning have their place in the machine learning ecosystem, often complementing each other in comprehensive data analysis and modeling projects. While supervised learning allows for precise predictions and classifications, unsupervised learning offers deep insights and uncovers underlying patterns that might not be immediately apparent.

10.2 Classification versus Regression

The main tasks in supervised learning can broadly be categorized into two types: classification and regression. Each task utilizes algorithms to interpret the input data and make predictions or decisions based on that data.

  • Classification Classification tasks involve categorizing data into predefined classes or groups. In these tasks, the output variable is categorical, such as “spam” or “not spam” in email filtering, or “malignant” or “benign” in tumor diagnosis. The aim is to accurately assign new, unseen instances to one of the categories based on the learning from the training dataset. Classification can be binary, involving two classes, or multiclass, involving more than two classes. Common algorithms used for classification include Logistic Regression, Decision Trees, Support Vector Machines, and Neural Networks.

  • Regression Regression tasks predict a continuous quantity. Unlike classification, where the outcomes are discrete labels, regression models predict a numeric value. Examples of regression tasks include predicting the price of a house based on its features, forecasting stock prices, or estimating the age of a person from a photograph. The goal is to find the relationship or correlation between the input features and the continuous output variable. Linear regression is the most basic form of regression, but there are more complex models like Polynomial Regression, Ridge Regression, Lasso Regression, and Regression Trees.

Both classification and regression are foundational to supervised learning, addressing different types of predictive modeling problems. Classification is used when the output is a category, while regression is used when the output is a numeric value. The choice between classification and regression depends on the nature of the target variable you are trying to predict. Supervised learning algorithms learn from labeled data, refining their models to minimize error and improve prediction accuracy on new, unseen data.
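As a quick, minimal sketch of the distinction (assuming scikit-learn is available; the toy datasets and models below are arbitrary choices for illustration, not part of the text):

from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: the target is a category (here 0 or 1).
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_clf, y_clf)
print(clf.predict(X_clf[:3]))          # predicted class labels

# Regression: the target is a continuous number.
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))          # predicted numeric values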

10.2.1 Classification Metrics

10.2.1.1 Confusion matrix

https://en.wikipedia.org/wiki/Confusion_matrix

Four entries in the confusion matrix:

  • TP: number of true positives
  • FN: number of false negatives
  • FP: number of false positives
  • TN: number of true negatives

Four rates from the confusion matrix with actual (row) margins:

  • TPR: TP / (TP + FN). Also known as sensitivity.
  • FNR: FN / (TP + FN). Also known as miss rate.
  • FPR: FP / (FP + TN). Also known as false alarm, fall-out.
  • TNR: TN / (FP + TN). Also known as specificity.

Note that TPR and FNR add up to one, as do FPR and TNR, since each pair shares a denominator. However, TPR and FPR do not add up to one, nor do FNR and TNR.

Four rates from the confusion matrix with predicted (column) margins:

  • PPV: TP / (TP + FP). Also known as precision.
  • FDR: FP / (TP + FP).
  • FOR: FN / (FN + TN).
  • NPV: TN / (FN + TN).
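To make the eight rates above concrete, here is a minimal sketch that computes them from scikit-learn's confusion matrix (the labels below are made up purely for illustration):

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # hypothetical actual labels
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]   # hypothetical predictions

# sklearn orders the binary matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)       # sensitivity / recall
fnr = fn / (tp + fn)       # miss rate
fpr = fp / (fp + tn)       # fall-out
tnr = tn / (fp + tn)       # specificity

ppv = tp / (tp + fp)       # precision
fdr = fp / (tp + fp)       # false discovery rate
for_rate = fn / (fn + tn)  # false omission rate
npv = tn / (fn + tn)       # negative predictive value

print(tpr, fnr, fpr, tnr, ppv, fdr, for_rate, npv)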

10.2.1.2 Measure of classification performance

Measures for a given confusion matrix:

  • Accuracy: (TP + TN) / (P + N). The proportion of all predictions that are correct. Not a good measure for highly imbalanced data.
  • Recall (sensitivity/TPR): TP / (TP + FN). Intuitively, the ability of the classifier to find all the positive samples.
  • Precision: TP / (TP + FP). Intuitively, the ability of the classifier not to label as positive a sample that is negative.
  • F-beta score: Weighted harmonic mean of precision and recall, with \(\beta\) chosen so that recall is considered \(\beta\) times as important as precision, \[ (1 + \beta^2) \frac{\text{precision} \cdot \text{recall}} {\beta^2 \text{precision} + \text{recall}} \] See the Stack Exchange post for the motivation of \(\beta^2\); a short computation sketch follows.
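A minimal sketch of the F-beta computation, checking the formula against scikit-learn's fbeta_score (the labels are made up for illustration):

from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
beta = 2  # recall weighted as twice as important as precision

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f_beta_manual = (1 + beta**2) * p * r / (beta**2 * p + r)

print(f_beta_manual, fbeta_score(y_true, y_pred, beta=beta))  # should agree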

When classification is obtained by dichotomizing a continuous score, the receiver operating characteristic (ROC) curve gives a graphical summary of the classifier over all thresholds: it plots the TPR against the FPR as the threshold varies.

  • Increasing from \((0, 0)\) to \((1, 1)\).
  • Best classification passes \((0, 1)\).
  • Classification by random guess gives the 45-degree line.
  • Twice the area between the ROC curve and the 45-degree line is the Gini coefficient, \(2\,\text{AUC} - 1\), a measure of inequality.
  • Area under the curve (AUC) of ROC thus provides an important metric of classification results.
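A minimal sketch of the ROC/AUC computation with scikit-learn; the scores below are simulated for illustration:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.repeat([1, 0], 100)                      # 100 positives, 100 negatives
scores = np.concatenate([rng.normal(1, 1, 100),      # positives score higher on average
                         rng.normal(0, 1, 100)])

fpr, tpr, thresholds = roc_curve(y_true, scores)     # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_true, scores)
gini = 2 * auc - 1                                   # Gini coefficient from AUC

print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")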

10.2.2 Cross-validation

  • Goal: strike a bias-variance tradeoff.
  • K-fold: hold out each fold as testing data.
  • Scores: a performance measure (e.g., a loss) computed on each held-out fold and averaged across folds to compare models

Cross-validation is an important tool for guarding against over-fitting: good in-sample performance does not necessarily mean good out-of-sample performance. A general workflow for model selection with cross-validation is as follows.

  • Split the data into training and testing
  • For each candidate model \(m\) (with possibly multiple tuning parameters)
    • Fit the model to the training data
    • Obtain the performance measure \(f(m)\) on the testing data (e.g., CV score, MSE, loss, etc.)
  • Choose the model \(m^* = \arg\max_m f(m)\) (or \(\arg\min_m f(m)\) if \(f\) is an error or loss measure). A sketch of this workflow follows.
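A minimal sketch, assuming scikit-learn; the candidate models (decision trees of different depths) and the synthetic data are arbitrary choices for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Candidate models: trees of different maximum depth.
candidates = {f"depth={d}": DecisionTreeClassifier(max_depth=d, random_state=0)
              for d in (2, 4, 8, None)}

# 5-fold cross-validated accuracy for each candidate.
cv_scores = {name: cross_val_score(model, X, y, cv=5).mean()
             for name, model in candidates.items()}

best = max(cv_scores, key=cv_scores.get)   # argmax because the score is accuracy
print(cv_scores, "->", best)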

10.3 Tree Based Machine Learning Models

Presented by Emmanuel Yankson

10.3.0.0.1 About Me

My name is Emmanuel Yankson, although some call me Emmy for short (the first three letters of my first name and the initial of my last name). I am a junior studying statistical data science as a major and computer science as a minor. Hobbies include playing the piano, philosophy, building computers, and co-teaching STAT 4188.

10.3.1 What is Machine Learning?

Humans learn by acquiring knowledge and folding it into personal experience, which then informs the knowledge we gather and the decisions we make in the future. A machine learns in an analogous way: we give it data, set criteria for training and testing, and define metrics for the outcomes of those tests, which can then be used to improve the effectiveness of our model.

There are different types of machine learning, namely supervised learning, unsupervised learning, and reinforcement learning. Decision Trees and Random Forests fall under supervised learning: given some labels \(Y\), we train our model on correctly labeled examples, as we will see later.

10.3.2 Basic Terms

  • Training - The process of feeding a separated portion of your data to your model so that it can learn, from labeled examples, which parameter values (e.g., weights and biases) are optimal. The training portion is typically the majority of your dataset, although the exact percentage depends on the size of the data you are working with.

  • Testing - A completely independent portion of the data used to evaluate the model’s performance and generalization ability. The test portion is typically the minority of your dataset, although again the percentage depends on the size of the data you are working with.

  • Models - An algorithm meant to learn from the data it receives

  • Prediction - When the model receives new data and predicts our label of interest based on the data it has previously seen

  • Performance Metrics - Ways to evaluate a model’s effectiveness and performance in different respects. Different performance metrics are used depending on the type of machine learning task and the specific goals of the analysis. Here we will use accuracy (the proportion of correctly classified instances out of all instances in the dataset), precision (out of all instances predicted positive, how many are actually positive), recall (out of all actual positives, how many are identified), and the F1 score (the harmonic mean of precision and recall, which balances the two).

10.3.3 Training and Testing

So when it comes to training, testing, and evaluating our models, we of course cannot train the model on the entirety of our dataset. Why? Because by holding out a separate testing set, we can evaluate how well the trained model generalizes to new, unseen data. Additionally, if we were to train our model on the entirety of our dataset, it would encourage overfitting (deceptively high accuracy obtained by fitting the training noise). While such a model performs very well on the training set, it performs poorly on new data it hasn’t seen before.

10.3.4 Decision Trees

A Decision Tree is a hierarchical, tree-like structure, frequently a binary tree in our case, in which, starting from the root node, we recursively split the data until we reach an outcome. There are two types of decision trees, regression and classification, with classification being the focus of this discussion.

Examining the structure, imagine a Decision Tree as a series of interconnected questions posed to our dataset. The root node plays the role of asking the first question that results in the first split of the data. Subsequent internal nodes act as further questions, introducing additional criteria that lead to successive splits. These splits create branches and sub-branches, forming a branching structure throughout the tree.

At each decision node, the dataset is partitioned based on the answers to the posed questions, guiding the data along different paths within the tree. This recursive process continues until we reach the leaves of the tree, which represent the outcomes or decisions made. The leaves contain the final classifications based on the criteria applied throughout the tree. When moving along the tree, we will have our left branch denote our true or ‘yes’ condition at that node, while we will have our right branch denote our false condition at that particular node.

Because it is easily visualized as a flow chart, the decision tree algorithm appears simpler than it actually is. One might assume that the algorithm is just a series of if-else statements that parse through a dataset. But how do we decide on the root node, and on each subsequent split?

Simple if-else statements will not do, because the purpose of this machine learning algorithm is to find the optimal split at each decision node. The hope is to reach pure leaf nodes, that is, leaf nodes in which all data points share the same class or outcome.

How do we find the optimal split? While there are different forms of splitting criteria, the one we will be discussing today is known as GINI.

10.3.4.1 GINI Index

The GINI index, or GINI, takes values between 0 and 1 and is calculated by the following formula:

\[
\text{Gini}(p) = 1 - \sum_{i=1}^{J} p_i^2
\]

where \(p_i\) is the proportion of observations in the node that belong to class \(i\) and \(J\) is the number of classes. We calculate the Gini index for each side (child node) of a candidate split, and then combine the two into a weighted Gini index:

\[
\text{Gini}_{\text{split}} = \sum_{k} \frac{n_k}{n}\, \text{Gini}_k,
\]

where \(n_k\) is the number of samples sent to child node \(k\) and \(n\) is the total number of samples at the decision node.

This weighted Gini value is 0 for a perfectly pure split, where each child node contains only one class and the data in each leaf are entirely homogeneous. Larger values indicate higher impurity, that is, a greater mixture of outcomes at the child nodes (for two classes the maximum is 0.5).

The weighted Gini index serves as the splitting criterion for our decision nodes. The split that minimizes the weighted Gini index is chosen because it segregates the data most effectively. The weighting gives larger child nodes more influence and smaller ones less, allowing the decision tree algorithm to make informed and balanced decisions at each internal node.
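A small sketch of the calculation, using hypothetical class labels for the two sides of a split:

import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 for the class labels in one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def weighted_gini(left_labels, right_labels):
    """Sample-weighted average of the child-node impurities."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * gini(left_labels) + (n_right / n) * gini(right_labels)

# Hypothetical split: the left child is nearly pure, the right child is mixed.
left = [1, 1, 1, 1, 0]
right = [0, 0, 1, 0, 1]
print(gini(left), gini(right), weighted_gini(left, right))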

Besides choosing the splitting criteria for your decision tree, there are other ways to tune your model in order to improve your classification results, using the following hyperparameters:

  • splitter: The strategy used to choose the split at each node
  • criterion: “gini”, “entropy”, or “log_loss”
  • max_depth: Maximum depth of the tree
  • min_samples_split: Minimum number of samples required to split a decision node
  • min_samples_leaf: Minimum number of samples required to be at a leaf node
  • min_weight_fraction_leaf: The minimum weighted fraction of the sum total of weights required to be at a leaf node
  • max_features: The number of features to consider when looking for the best split
  • max_leaf_nodes: Grow trees with max_leaf_nodes in best-first fashion
  • min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value
  • class_weight: Weights associated with classes in the form {class_label: weight}
  • ccp_alpha: Complexity parameter used for Minimal Cost-Complexity Pruning
  • random_state: Controls the randomness of the estimator

Although we will be using the model in its most basic form, getting the best results from this kind of machine learning requires extensive testing and model tuning, as sketched below.
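For instance, one common way to tune several of the hyperparameters above is a cross-validated grid search. The sketch below uses scikit-learn's GridSearchCV on synthetic data; the grid values are arbitrary choices for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
}

# 5-fold cross-validated grid search, scored by F1.
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)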

The following example uses a synthetic dataset that I generated, which uses multiple features to predict whether or not a student will dine out while staying on campus.

10.3.5 Generating an Example Dataset

Do I eat out?


from datetime import datetime
import random
import pandas as pd
import numpy as np


# Target label: whether the student eats out (roughly 50/50).
eat_out = np.random.choice([True, False], size=100000)

# Feature: whether the student is in class, conditional on eating out.
in_class = []
for status in eat_out:
    if status:
        in_class.append(np.random.choice([True, False], p=[.10, .90]))
    else:
        in_class.append(np.random.choice([False, True], p=[.40, .60]))

# Feature: whether good restaurants are open, conditional on eating out.
good_places_open = []
for status in eat_out:
    if status:
        good_places_open.append(np.random.choice([True, False], p=[.90, .10]))
    else:
        good_places_open.append(np.random.choice([False, True], p=[.50, .50]))


# Feature: amount of money available (in dollars), conditional on eating out.
amount_of_money = []

for status in eat_out:
    # Two ranges: a larger balance (55-250) and a smaller one (0-55).
    choices = [round(random.uniform(55, 250), 2),
               round(random.uniform(0.0, 55), 2)]

    # Students who eat out are more likely to have the larger balance.
    probabilities = [.90, .10] if status else [.41, .59]

    dollar_amount = np.random.choice(choices, p=probabilities)
    amount_of_money.append(dollar_amount)


# Feature: time of day, conditional on the other features.
current_time = []

for i in range(len(eat_out)):
    early = random.randint(10, 19)                   # daytime hours
    late = random.choice([20, 21, 22, 23, 0, 1, 2])  # evening / night hours
    time_choices_hrs = [early, late]

    if eat_out[i] and not in_class[i] and good_places_open[i]:
        probabilities_t = [0.80, 0.20]
    elif not eat_out[i] and not in_class[i] and not good_places_open[i]:
        probabilities_t = [0.25, 0.75]
    elif not eat_out[i] and in_class[i] and good_places_open[i]:
        probabilities_t = [0.60, 0.40]
    elif eat_out[i] and in_class[i]:
        probabilities_t = [0.90, 0.10]
    else:
        # Fallback for feature combinations not covered above; without it the
        # loop could fail or silently reuse stale values. A 50/50 split is assumed.
        probabilities_t = [0.50, 0.50]

    hour = int(np.random.choice(time_choices_hrs, p=probabilities_t))  # plain int for datetime
    minute = random.randint(0, 59)

    now = datetime.now()
    random_time = datetime(now.year, now.month, now.day, hour, minute)
    given_time = random_time.strftime("%H:%M")
    current_time.append(given_time)


my_data = {'In Class?':in_class, 'Are there good places open?':good_places_open, 
           'Amount Of Money (In Dollars)':amount_of_money, 
           'Time of Day (24-Hour)':current_time, 
           'Do I eat out?':eat_out}

dine_out_df = pd.DataFrame(data=my_data)

The following code takes the example dataset and attempts to use a decision tree model to predict whether or not a student is likely to dine out based on the features of:

In Class?: Whether or not the student is in class (Boolean)

Are there good places open?: Asks if good restaurants are open (Boolean)

Amount of Money (In Dollars): Money in account (Continuous)

Time of Day (24-Hour): Time of the day when eating (Continuous)

Do I eat out?: Asks if the student eats out (Target Label, Boolean)

10.3.5.1 Decision Tree Code

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score 
from sklearn.metrics import precision_score
#displays the first five rows of the dataset
dine_out_df.head()
In Class? Are there good places open? Amount Of Money (In Dollars) Time of Day (24-Hour) Do I eat out?
0 False True 59.09 15:01 True
1 True False 90.13 11:13 True
2 False True 87.06 18:35 True
3 False False 18.29 22:59 False
4 False True 2.39 19:54 True
def encode_time(time_str):
    # Keep only the hour component and scale it to [0, 1); minutes are dropped.
    hours, _ = map(int, time_str.split(':'))
    normalized_hours = hours / 24.0
    return pd.Series({'encoded_hours': normalized_hours})

dine_out_df[['encoded_hours']] = dine_out_df['Time of Day (24-Hour)'].apply(encode_time)
X_dtc = dine_out_df.drop(['Do I eat out?', 'Time of Day (24-Hour)'], axis=1)
Y_dtc = dine_out_df['Do I eat out?']

X_train, X_test, Y_train, Y_test = train_test_split(X_dtc, Y_dtc, 
test_size=0.30, random_state=42, shuffle=True, stratify=Y_dtc)

The code directly above using the train_test_split function is part of the training and testing phase of machine learning discussed earlier. The dataset is divided into two parts:

  • Training Set: This portion of the data, typically the majority (e.g., 70% of the dataset in this example), is used to train the machine learning model. Here, X_train holds 70% of the rows of the feature columns (In Class?, Are there good places open?, Amount Of Money (In Dollars), and the encoded time), while Y_train holds the corresponding 70% of the target column Do I eat out?. The model learns the patterns and relationships between these features and the target variable from this training data.

  • Test Set: The remaining portion of the data, 30% of the dataset in this case is set aside and not used during the training phase.

# Creates and Trains a Decision Tree Classifier object
dtc = DecisionTreeClassifier()
dtc.fit(X_train, Y_train)
DecisionTreeClassifier()
y_pred_dtc = dtc.predict(X_test)

accuracy_dtc = accuracy_score(Y_test, y_pred_dtc)

precision_dtc = precision_score(Y_test, y_pred_dtc, average='binary')

recall_dtc = recall_score(Y_test, y_pred_dtc, average='binary')

f1_dtc = f1_score(Y_test, y_pred_dtc, average='binary')


print(f'Accuracy: {accuracy_dtc:.4f}')
print(f'Precision: {precision_dtc:.4f}')
print(f'Recall: {recall_dtc:.4f}')
print(f'F1 Score: {f1_dtc:.4f}')
Accuracy: 0.7690
Precision: 0.7688
Recall: 0.7695
F1 Score: 0.7692

However, a single decision tree is often not enough to make accurate predictions from the available data. Decision trees have some pitfalls, including:

  • Instability with large amounts of data
  • Overfitting
  • Sensitive to small variations in input data
  • High Variance

Some of these effects can be mitigated by using many, many, many decision trees. What if we had enough trees that the variance is averaged out across many different possible trees? What if we wanted the roots of our trees to differ in order to get different outcomes? What if we needed our trees to cope with large amounts of data? Then how about we use more than one tree? More than two? More than three? How about an entire forest of trees? Go big or go home. This is the idea behind the Random Forest algorithm.

10.3.6 Random Forest

In simple terms, you can envision the Random Forest Algorithm as a group of decision trees combined together. Each decision tree operates independently, making its own decisions similar to a single decision tree. After training, when a data point is fed into the random forest, it passes through all the decision trees created during training. The final prediction is determined by the majority choice among all the decision trees, hopefully providing an accurate outcome. More accurately, it is an ensemble learning method (making a decision based on multiple models), which combines the predictions of multiple decision trees.

But how is a random forest generated? As previously stated, there are many ways to grow a decision tree, and for vast amounts of data, building an individual decision tree on all of it every single time can be costly in terms of resources and time. The roots of our trees can also differ. So how are these trees actually built?

10.3.6.1 Bagging & Bootstrapping

Bootstrapping involves making multiple datasets based on our original dataset. These new datasets are sampled with replacement from the original dataset. Each bootstrap dataset is likely to be different, promoting diversity among the trees in the forest.

The process involves repeatedly selecting a row at random (with replacement) from the original dataset and adding it to a new dataset, called the bootstrapped dataset, until the bootstrapped dataset has the same number of rows as the original dataset.

Additionally, more randomness is introduced by restricting the features considered during the construction of each tree. When building a decision tree on a bootstrap dataset, at each split the algorithm considers only a random subset of the features, which introduces additional randomness. In the Scikit-Learn library you cannot specify which features to drop, but you can control how many features are considered at each split (the max_features parameter).

Bagging is the technique used to aggregate the predictions of the trees in a Random Forest. Each tree in the forest is trained on a bootstrap sample and makes predictions independently. The final prediction of the Random Forest is obtained by taking the majority vote (for classification) across all of the trees.
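A minimal sketch of this idea, bootstrapping rows and taking a majority vote over independently trained trees; this is illustrative only, since scikit-learn's RandomForestClassifier handles the bootstrapping, feature subsetting, and voting internally. The data here are synthetic:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: rows drawn with replacement, same size as the data.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote across the trees for a few test points.
votes = np.array([t.predict(X[:5]) for t in trees])   # shape (n_trees, n_points)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print(majority, y[:5])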

10.4 Random Forest Code

from sklearn.ensemble import RandomForestClassifier
from datetime import datetime
import random
X_rf = dine_out_df.drop(['Do I eat out?', 'Time of Day (24-Hour)'], axis=1)
Y_rf = dine_out_df['Do I eat out?']

X_train, X_test, Y_train, Y_test = train_test_split(X_rf, Y_rf, 
test_size=0.30, random_state=42, shuffle=True, stratify=Y_rf)

Although the random forest model shown below uses only the default hyperparameters, we can tune the following hyperparameters to have our model more closely follow the data presented:

  • n_estimators: Number of trees in the forest
  • criterion: “gini”, “entropy”, or “log_loss”
  • max_depth: Maximum depth of the tree
  • min_samples_split: Minimum number of samples required to split a decision node
  • min_samples_leaf: Minimum number of samples required to be at a leaf node
  • min_weight_fraction_leaf: The minimum weighted fraction of the sum total of weights required to be at a leaf node
  • max_features: The number of features to consider when looking for the best split
  • max_leaf_nodes: Grow trees with max_leaf_nodes in best-first fashion
  • min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value
  • bootstrap: Whether bootstrap samples are used when building trees
  • oob_score: Whether to use out-of-bag samples to estimate the generalization score
  • n_jobs: The number of jobs to run in parallel
  • random_state: Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features considered at each split
  • verbose: Controls the verbosity when fitting and predicting

What you’ll notice is that besides some new parameters, many of the parameters of the random forest classifier are the same as the parameters of the decision tree classifier. This is of course because the random forest is an ensemble of decision trees.

rf_classifier = RandomForestClassifier() 

rf_classifier.fit(X_train, Y_train)

Y_pred_rf = rf_classifier.predict(X_test)
accuracy_rf = accuracy_score(Y_test, Y_pred_rf)

precision_rf = precision_score(Y_test, Y_pred_rf, average='binary')

recall_rf = recall_score(Y_test, Y_pred_rf, average='binary')

f1_rf = f1_score(Y_test, Y_pred_rf, average='binary')


print(f'Accuracy: {accuracy_rf:.4f}')
print(f'Precision: {precision_rf:.4f}')
print(f'Recall: {recall_rf:.4f}')
print(f'F1 Score: {f1_rf:.4f}')
Accuracy: 0.7794
Precision: 0.7764
Recall: 0.7851
F1 Score: 0.7807

In comparison, we see that even without hyperparameter tuning the random forest model outperforms the decision tree model that we have previously trained.

10.5 References

  • Tree Based Models - UCONN Data Science

  • Decision Tree Classification Clearly Explained!

    • https://www.youtube.com/watch?v=ZVR2Way4nwQ
  • What is Random Forest? - IBM

    • https://www.ibm.com/topics/random-forest
  • sklearn.ensemble.RandomForestClassifier - Scikit Learn Documentation

    • https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
  • sklearn.tree.DecisionTreeClassifier - Scikit Learn Documentation

    • https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
  • Gini Index and Entropy|Gini Index and Information gain in Decision Tree| Decision tree splitting rule

    • https://www.youtube.com/watch?v=-W0DnxQK1Eo
  • Random Forest | ScienceDirect

    • https://www.sciencedirect.com/topics/engineering/random-forest
  • Bootstrapping and OOB samples in Random Forests | Medium

    • https://medium.com/analytics-vidhya/bootstrapping-and-oob-samples-in-random-forests-6e083b6bc341

10.6 Boosted Tree

Boosted trees are a powerful ensemble technique in machine learning that combines multiple weak learners, typically decision trees, to form a strong learner. The essence of the boosting approach is to fit sequential models, where each new model attempts to correct errors made by the previous ones. Gradient boosting, one of the most popular boosting methods, optimizes a loss function over weak learners by iteratively adding trees that correct the residuals of the combined ensemble.

10.6.1 Introduction

Boosted trees build on the concept of boosting, an ensemble technique that aims to create a strong classifier from a number of weak classifiers. In the context of boosted trees, the weak learners are decision trees, usually of a fixed size, which are added sequentially to the model. The key idea is to improve the model iteratively by focusing on examples that are hard to predict, thus enhancing the overall predictive accuracy of the model.

10.6.2 Boosting Process

In the context of gradient boosted trees, the ensemble model is built iteratively, where each new tree is fitted to the residual errors made by the previous ensemble of trees. The aim is to minimize a loss function, and each tree incrementally improves the model by reducing the loss. The ensemble model after \(M\) trees can be written as

\[
F_M(x) = F_0(x) + \sum_{m=1}^{M} \lambda \, h_m(x),
\]

where \(h_m\) is the tree added at stage \(m\) and \(\lambda\) is the learning rate. The final predictive model is \(F_M(x)\), which represents the accumulated predictions of all \(M\) trees, scaled by the learning rate.

The process can be delineated as follows:

  • Initialization: Begin with a base model \(F_0(x)\), often the mean of the target variable for regression or the log odds for classification.
  • Iterative Boosting:
    • At each stage \(m\), compute the pseudo-residuals, which are the gradients of the loss function with respect to the predictions of the current model, \(F_{m-1}(x)\).
    • Fit a new tree, \(h_m(x)\), to these pseudo-residuals.
    • Update the model by adding this new tree, weighted by a learning rate \(\lambda\), to minimize the loss, resulting in \(F_m(x) = F_{m-1}(x) + \lambda \cdot h_m(x)\).
  • Final Model: After \(M\) iterations, the ensemble model \(F_M(x)\) represents the sum of the initial model and the incremental improvements made by each of the \(M\) trees, where each tree is fitted to correct the residuals of the model up to that point.

Key Concepts

  • Loss Function (\(L\)): A measure of how far the ensemble’s predictions are from the actual values. Common choices include the squared error for regression and logistic loss for classification.
  • Learning Rate (\(\lambda\)): A small positive number that scales the contribution of each tree. It helps in controlling the speed of learning and prevents overfitting.

This iterative approach, focusing on correcting the errors of the ensemble up to the current step, distinguishes gradient boosting from other ensemble methods and allows for a nuanced adjustment of the model to the data.
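A minimal from-scratch sketch of this loop for regression with squared-error loss, where the pseudo-residuals are simply \(y - F_{m-1}(x)\). This is illustrative only and not how library implementations such as scikit-learn's GradientBoostingRegressor or XGBoost are engineered; the data and settings are arbitrary:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10, random_state=0)
M, lam = 100, 0.1                      # number of trees, learning rate

F = np.full(len(y), y.mean())          # F_0(x): the mean of the target
trees = []
for m in range(M):
    residuals = y - F                  # pseudo-residuals for squared-error loss
    h = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    F = F + lam * h.predict(X)         # F_m = F_{m-1} + lambda * h_m
    trees.append(h)

def predict(X_new):
    # Sum the initial model and the scaled contributions of all M trees.
    return y.mean() + lam * sum(t.predict(X_new) for t in trees)

print(np.mean((y - F) ** 2))           # training MSE after boosting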

10.6.3 Boosting vs Bagging in Ensemble Learning

Boosting and bagging are foundational ensemble learning techniques in machine learning, designed to improve the accuracy of predictive models by combining the strengths of multiple weaker models. Despite their common goal, they differ significantly in methodology and application.

Boosting builds models in a sequential manner, focusing each subsequent model on correcting the errors made by the previous ones. The process initiates with a base model, with each new model added aiming to correct its predecessor, culminating in a weighted sum of all models.

  • Sequential Model Building: Models are built sequentially, each correcting the error of the ensemble before it.
  • Focus on Misclassification: Boosting prioritizes instances misclassified by earlier models, adapting to the “harder” cases.
  • Bias Reduction: It mainly reduces model bias; because complexity grows with each added model, care is needed (e.g., a small learning rate) to avoid overfitting.

Examples include AdaBoost, Gradient Boosting, and XGBoost.

Bagging, or Bootstrap Aggregating, constructs multiple models independently and combines their predictions through averaging or majority voting. Each model is trained on a randomly drawn subset of the data, allowing for parallel model construction.

  • Parallel Model Building: Models are built independently, enabling parallelization.
  • Uniform Attention: All instances are equally considered, with the diversity of the ensemble reducing variance.
  • Bias and Variance Reduction: Effective at reducing both, but particularly good at cutting down variance in complex models.

Random Forest is a prominent example of bagging.

10.6.3.1 Differences

  • Model Dependency: Boosting’s models are interdependent, building upon the errors of their predecessors, unlike bagging’s independent models.
  • Error Correction Focus: Boosting directly targets previous errors, while bagging reduces error through diversity.
  • Computational Complexity: Boosting can be more computationally intensive due to its sequential nature.
  • Overfitting Risk: Boosting may overfit on noisy data, whereas bagging remains robust, especially as the number of models increases.

Understanding the distinctions between boosting and bagging is crucial for selecting the appropriate ensemble method for specific machine learning tasks. Both strategies offer unique advantages and can be applied based on the problem, data characteristics, and desired outcomes.

10.7 Handling Imbalanced Data

Handling imbalanced data is a critical task in machine learning, particularly in classification problems where the distribution of instances across the known classes is uneven. This imbalance can significantly bias the model’s training process, leading to poor predictive performance, especially for the minority class. Understanding and addressing this issue is crucial for building effective and fair models.

10.7.1 Imbalanced Data

Imbalanced data refers to a situation where the number of observations in one class significantly outweighs those in one or more other classes. This is a common scenario in various domains:

  • In fraud detection, legitimate transactions vastly outnumber fraudulent ones.
  • In medical diagnosis, the dataset might have more negative results (no disease) than positive results (disease presence).
  • In email filtering, spam emails are less common than non-spam.

Such disparities can lead to models that are overly biased towards the majority class, as they tend to optimize overall accuracy by simply predicting the majority class.

10.7.2 Approaches to Handling Imbalanced Data

  • Resampling Techniques:

    • Oversampling Minority Class: Enhancing the representation of the minority class by replicating instances or generating synthetic samples. SMOTE (Synthetic Minority Over-sampling Technique) is a popular method for creating synthetic samples (Chawla et al. 2002).
    • Undersampling Majority Class: Reducing the size of the majority class to balance the dataset. Careful selection or clustering methods are employed to retain information (Liu, Wu, and Zhou 2008).
  • Cost-sensitive Learning: Adjusting the classification cost to make errors on the minority class more impactful than errors on the majority class. This approach often involves modifying the algorithm to be more sensitive to the minority class (Elkan 2001).

  • Ensemble Methods: Leveraging multiple models to improve classification, particularly of the minority class. Techniques such as Balanced Random Forest (Chen, Liaw, and Breiman 2004) and AdaBoost (Sun, Wong, and Kamel 2007) have been adapted for imbalanced datasets.

  • Algorithmic Adjustments: Some algorithms have built-in mechanisms for dealing with imbalanced data, such as adjusting class weights. This is evident in algorithms like SVM and logistic regression, where class weights can be inversely proportional to class frequencies (King and Zeng 2001).
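As a small sketch of the class-weight adjustment, scikit-learn's LogisticRegression accepts class_weight="balanced", which reweights classes inversely to their frequencies. The dataset below is synthetic and the comparison is for illustration only:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# A roughly 95/5 imbalanced binary problem.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Class weighting typically trades some precision for better minority-class recall.
print(recall_score(y_te, plain.predict(X_te)),
      recall_score(y_te, weighted.predict(X_te)))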

10.7.3 SMOTE

The Synthetic Minority Over-sampling Technique (SMOTE) offers a nuanced approach to mitigating the challenges posed by imbalanced datasets. Unlike simple oversampling techniques that replicate minority class instances, SMOTE generates synthetic samples in a feature space, enhancing the diversity and representation of the minority class without directly copying existing observations (Chawla et al. 2002).

10.7.3.1 Algorithm

SMOTE’s algorithm can be succinctly described using the following steps for each minority class sample \(x\):

  1. Identify the \(k\) nearest neighbors in the minority class for \(x\).
  2. Randomly choose one of these \(k\) neighbors, call it \(x_{\text{nn}}\).
  3. Generate a synthetic sample \(x_{\text{new}}\) along the line connecting \(x\) and \(x_{\text{nn}}\). Mathematically, \(x_{\text{new}} = x + \lambda (x_{\text{nn}} - x)\), where \(\lambda\) is a random number between \(0\) and \(1\).

This process is repeated until the class distribution is deemed balanced.

Several software tools and libraries have made implementing SMOTE straightforward in various programming environments. In Python, the imbalanced-learn library provides an efficient and flexible implementation of SMOTE and its variants, allowing seamless integration with scikit-learn workflows. R users can utilize the DMwR package, which includes functions for applying SMOTE to datasets. These tools not only offer basic SMOTE functionality but also support advanced variants and allow for easy experimentation with different oversampling strategies.
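A minimal sketch with the imbalanced-learn library (assuming the imbalanced-learn package is installed; the dataset is synthetic):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A roughly 90/10 imbalanced binary problem.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

smote = SMOTE(k_neighbors=5, random_state=0)   # k nearest minority neighbors
X_res, y_res = smote.fit_resample(X, y)        # synthetic minority samples added
print("after:", Counter(y_res))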

10.7.3.2 Applications, Considerations, and Variants

Applications: SMOTE’s approach to enriching datasets by synthesizing new examples has proven beneficial in various fields. For instance, in fraud detection, where fraudulent transactions are rare (Dal Pozzolo et al. 2015), and in medical diagnostics, where certain conditions may be underrepresented in datasets (Johnson and Khoshgoftaar 2019).

Considerations: While SMOTE enhances the representation of the minority class, it may also lead to overfitting, especially if the minority class is dispersed or if there’s significant overlap with the majority class (Guo et al. 2017). Choosing the right variant of SMOTE and adjusting its parameters requires careful consideration of the dataset’s characteristics.

Advanced Variants: Specific challenges in datasets have led to the development of SMOTE variants like Borderline-SMOTE, which generates samples closer to the decision boundary to improve classifier sensitivity to the minority class (Han, Wang, and Mao 2005). Each variant addresses particular issues, such as noise or the risk of generating ambiguous synthetic samples.

10.7.3.3 Conclusion

SMOTE and its variants offer a sophisticated approach to mitigating the challenges posed by imbalanced datasets. Through synthetic sample generation, these methods enhance the diversity and representation of the minority class, facilitating improved model accuracy and generalizability. The choice of variant and parameters should be guided by the dataset’s characteristics and the problem context, underscoring the importance of careful experimentation and validation.