10  Supervised Learning

10.1 Decision Trees

Decision trees are widely used supervised learning models that predict the value of a target variable by iteratively splitting the dataset based on decision rules derived from input features. The model functions as a piecewise constant approximation of the target function, producing clear, interpretable rules that are easily visualized and analyzed (Breiman et al. 1984). Decision trees are fundamental in both classification and regression tasks, serving as the building blocks for more advanced ensemble models such as Random Forests and Gradient Boosting Machines.

10.1.1 Algorithm Formulation

The core mechanism of a decision tree algorithm is the identification of optimal splits that partition the data into subsets that are increasingly homogeneous with respect to the target variable. At any node \(m\), the data subset is denoted as \(Q_m\) with a sample size of \(n_m\). The objective is to find a candidate split \(\theta\), defined as a threshold for a given feature, that minimizes an impurity or loss measure \(H\).

When a split is made at node \(m\), the data is divided into two subsets: \(Q_{m,l}\) (left node) with sample size \(n_{m,l}\), and \(Q_{m,r}\) (right node) with sample size \(n_{m,r}\). The split quality, measured by \(G(Q_m, \theta)\), is given by:

\[ G(Q_m, \theta) = \frac{n_{m,l}}{n_m} H(Q_{m,l}(\theta)) + \frac{n_{m,r}}{n_m} H(Q_{m,r}(\theta)). \]

The algorithm aims to identify the split that minimizes the impurity:

\[ \theta^* = \arg\min_{\theta} G(Q_m, \theta). \]

This process is applied recursively at each child node until a stopping condition is met.

  • Stopping Criteria: The algorithm stops when the maximum tree depth is reached or when the node sample size falls below a preset threshold.
  • Pruning: Reduce the complexity of the final tree by removing branches that add little predictive value. This reduces overfitting and improves the generalization accuracy of the model.

10.1.2 Search Space for Possible Splits

At each node in the decision tree, the search space for possible splits comprises all features in the dataset and potential thresholds derived from the values of each feature. For a given feature, the algorithm considers each unique value in the current node’s subset as a possible split point. The potential thresholds are typically set as midpoints between consecutive unique values, ensuring the data is partitioned effectively.

Formally, let the feature set be \(\{X_1, X_2, \ldots, X_p\}\), where \(p\) is the total number of features, and let the unique values of feature $ X_j $ at node $ m $ be denoted by ${v_{j,1}, v_{j,2}, , v_{j,k_j}} \(. The search space at node\)m$ includes:

  • Feature candidates: \(\{X_1, X_2, \ldots, X_p\}\).
  • Threshold candidates for \(X_j\): \[ \left\{ \frac{v_{j,i} + v_{j,i+1}}{2} \mid 1 \leq i < k_j \right\}. \]

The search space therefore encompasses all combinations of features and their respective thresholds. While the complexity of this search can be substantial, particularly for high-dimensional data or features with numerous unique values, efficient algorithms use sorting and single-pass scanning techniques to mitigate the computational cost.

10.1.3 Metrics

10.1.3.1 Classification

In decision tree classification, several criteria can be used to measure the quality of a split at each node. These criteria are based on how “pure” the resulting nodes are after the split. A pure node contains samples that predominantly belong to a single class. The goal is to minimize impurity, leading to nodes that are as homogeneous as possible.

  • Gini Index: The Gini index measures the impurity of a node by calculating the probability of randomly choosing two different classes. A perfect split (all instances belong to one class) has a Gini index of 0. At node \(m\), the Gini index is \[ H(Q_m) = \sum_{k=1}^{K} p_{mk} (1 - p_{mk}), \] where \(p_{mk}\) is the proportion of samples of class \(k\) at node \(m\); and\(K\) is the total number of classes The Gini index is often preferred for its speed and simplicity, and it’s used by default in many implementations of decision trees, including sklearn.

  • Entropy (Information Gain): Entropy is another measure of impurity, derived from information theory. It quantifies the “disorder” of the data at a node. Lower entropy means higher purity. At node \(m\), it is defined as \[ H(Q_m) = - \sum_{k=1}^{K} p_{mk} \log p_{mk} \] Entropy is commonly used in decision tree algorithms like ID3 and C4.5. The choice between Gini and entropy often depends on specific use cases, but both perform similarly in practice.

  • Misclassification Error: Misclassification error focuses solely on the most frequent class in the node. It measures the proportion of samples that do not belong to the majority class. Although less sensitive than Gini and entropy, it can be useful for classification when simplicity is preferred. At node \(m\), it is defined as \[ H(Q_m) = 1 - \max_k p_{mk}, \] where \(\max_k p_{mk}\) is the largest proportion of samples belonging to any class \(k\).

10.1.3.2 Regression Criteria

In decision tree regression, different criteria are used to assess the quality of a split. The goal is to minimize the spread or variance of the target variable within each node.

  • Mean Squared Error (MSE): Mean squared error is the most common criterion used in regression trees. It measures the average squared difference between the actual values and the predicted values (mean of the target in the node). The smaller the MSE, the better the fit. At node \(m\), it is \[ H(Q_m) = \frac{1}{n_m} \sum_{i=1}^{n_m} (y_i - \bar{y}_m)^2, \] where

    • \(y_i\) is the actual value for sample \(i\);
    • \(\bar{y}_m\) is the mean value of the target at node \(m\);
    • \(n_m\) is the number of samples at node \(m\).

    MSE works well when the target is continuous and normally distributed.

  • Half Poisson Deviance (for count targets): When dealing with count data, the Poisson deviance is used to model the variance in the number of occurrences of an event. It is well-suited for target variables representing counts (e.g., number of occurrences of an event). At node \(m\), it is \[ H(Q_m) = \sum_{i=1}^{n_m} \left( y_i \log\left(\frac{y_i}{\hat{y}_i}\right) - (y_i - \hat{y}_i) \right), \] where \(\hat{y}_i\) is the predicted count. This criterion is especially useful when the target variable represents discrete counts, such as predicting the number of occurrences of an event.

  • Mean Absolute Error (MAE): Mean absolute error is another criterion that minimizes the absolute differences between actual and predicted values. While it is more robust to outliers than MSE, it is slower computationally due to the lack of a closed-form solution for minimization. At node \(m\), it is \[ H(Q_m) = \frac{1}{n_m} \sum_{i=1}^{n_m} |y_i - \bar{y}_m| \] MAE is useful when you want to minimize large deviations and can be more robust in cases where outliers are present in the data.

10.1.3.3 Summary

In decision trees, the choice of splitting criterion depends on the type of task (classification or regression) and the nature of the data. For classification tasks, the Gini index and entropy are the most commonly used, with Gini offering simplicity and speed, and entropy providing a more theoretically grounded approach. Misclassification error can be used for simpler cases. For regression tasks, MSE is the most popular choice, but Poisson deviance and MAE are useful for specific use cases such as count data and robust models, respectively.

10.2 Naive Bayes

Naive Bayes is a probabilistic classification algorithm based on Bayes’ Theorem, which is used for both binary and multiclass classification problems. It is particularly effective for high-dimensional datasets and is commonly applied in tasks like text classification, spam detection, and sentiment analysis. The algorithm is called “naive” because it assumes that all features are conditionally independent given the class label, an assumption that rarely holds in real-world data but still performs well in many cases.

10.2.1 Theoretical Foundations

The foundation of the Naive Bayes classifier is Bayes’ Theorem, which is used to update the probability estimate of a hypothesis given new evidence. Mathematically, Bayes’ Theorem is expressed as:

\[ P(y \mid X) = \frac{P(X \mid y) \, P(y)}{P(X)}, \]

where:

  • \(P(y \mid X)\): Posterior probability of class \(y\) given the input features \(X\).
  • \(P(X \mid y)\): Likelihood of observing \(X\) given that the class is \(y\).
  • \(P(y)\): Prior probability of the class \(y\).
  • \(P(X)\): Marginal probability of the feature vector \(X\).

10.2.1.1 Naive Assumption and Likelihood Decomposition

The algorithm makes the simplifying assumption that features in \(X\) are conditionally independent given the class \(y\). This assumption enables the likelihood \(P(X \mid y)\) to be decomposed as:

\[ P(X \mid y) = \prod_{i=1}^n P(x_i \mid y), \]

where \(X = \{x_1, x_2, \ldots, x_n\}\) represents the feature vector with \(n\) features, and \(P(x_i \mid y)\) is the conditional probability of feature \(x_i\) given the class \(y\).

The model parameters are the prior probabilities \(P(y)\) and the conditional probabilities \(P(x_i \mid y)\). These are estimated from the training data using the maximum likelihood estimation (MLE):

  1. Prior Estimation: The prior probability \(P(y)\) is estimated as the proportion of training samples in class \(y\):

    \[ \hat{P}(y) = \frac{\text{count}(y)}{N}, \]

    where \(\text{count}(y)\) is the number of instances belonging to class \(y\), and \(N\) is the total number of training samples.

  2. Conditional Probability Estimation:

    • Categorical Features: For discrete or categorical features, the conditional probability \(P(x_i \mid y)\) is estimated as:

      \[ \hat{P}(x_i \mid y) = \frac{\text{count}(x_i, y)}{\text{count}(y)}, \]

      where \(\text{count}(x_i, y)\) is the number of samples in class \(y\) that have feature \(x_i\).

    • Continuous Features: For continuous features, Naive Bayes commonly assumes a Gaussian distribution. In this case, \(P(x_i \mid y)\) is modeled using the Gaussian distribution with mean \(\mu_{y,i}\) and variance \(\sigma_{y,i}^2\):

      \[ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp \left( -\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2} \right). \]

      The parameters \(\mu_{y,i}\) and \(\sigma_{y,i}^2\) are estimated from the training data using the sample mean and variance for each feature in each class.

10.2.1.2 Class Prediction

The goal of the Naive Bayes classifier is to predict the class \(y\) that maximizes the posterior probability \(P(y \mid X)\). After applying Bayes’ Theorem and dropping the constant denominator \(P(X)\), the decision rule becomes:

\[ y^* = \arg\max_y \, P(y) \prod_{i=1}^n P(x_i \mid y). \]

In practice, the log of the posterior is used to prevent numerical underflow:

\[ \log P(y \mid X) = \log P(y) + \sum_{i=1}^n \log P(x_i \mid y). \]

The predicted class is the one that maximizes this expression.

10.2.1.3 Surprisingly Good Performance

Although the assumption of conditional independence among features is often unrealistic, Naive Bayes still performs well for several reasons:

  1. Robustness to Violations of Independence: Literature suggests that Naive Bayes can achieve good classification performance even when features are correlated, as long as the dependencies are consistent across classes (Domingos and Pazzani 1997). This is because the decision boundaries produced by Naive Bayes are often well-aligned with the true boundaries, despite the imprecise probability estimates.

  2. Decision Rule Effectiveness: Since Naive Bayes focuses on finding the class that maximizes the posterior probability, it is less sensitive to errors in individual probability estimates, as long as the relative ordering of probabilities remains correct (Rish 2001).

  3. Zero-One Loss Minimization: Naive Bayes aims to minimize the zero-one loss, i.e., the number of misclassifications. The method benefits from the fact that exact probability estimation is not essential for accurate classification, as the correct class can still be chosen even with approximate probabilities (Ng and Jordan 2001).

  4. High-Dimensional Settings: In high-dimensional settings, the conditional independence assumption can act as a form of implicit regularization, preventing overfitting by simplifying the probability model (Rish 2001). This makes Naive Bayes particularly well-suited for text classification and other sparse feature spaces.

10.2.1.4 Advantages and Limitations

Advantages:

  • Computationally efficient, with linear time complexity in terms of the number of features and data samples.
  • Performs well on large datasets, especially when features are conditionally independent.
  • Suitable for high-dimensional data, making it popular in text classification.

Limitations:

  • Relies on the assumption of conditional independence, which may not hold in real-world datasets, potentially affecting performance.
  • It is sensitive to zero probabilities; if a feature value never appears in the training set for a given class, its likelihood becomes zero. To address this, Laplace smoothing (or add-one smoothing) is often applied.

10.2.1.5 Laplace Smoothing

Laplace smoothing is used to handle zero probabilities in the likelihood estimation. It adds a small constant $ $ (usually 1) to the count of each feature value, preventing the probability from becoming zero:

\[ P(x_i \mid y) = \frac{\text{count}(x_i, y) + \alpha} {\sum_{x_i'} (\text{count}(x_i', y) + \alpha)}. \]

This adjustment ensures that even unseen features in the training data do not lead to zero probabilities, thus improving the model’s robustness.