import openml
import pandas as pd
import numpy as np
# Load Ames Housing dataset (OpenML ID 42165)
dataset = openml.datasets.get_dataset(42165)
df, *_ = dataset.get_data()
df["LogPrice"] = np.log(df["SalePrice"])10 Supervised Learning
10.1 Introduction
Machine Learning (ML) is a branch of artificial intelligence that enables systems to learn from data and improve their performance over time without being explicitly programmed. At its core, machine learning algorithms aim to identify patterns in data and use those patterns to make decisions or predictions.
Machine learning can be categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning. Each type differs in the data it uses and the learning tasks it performs, addressing addresses different tasks and problems. Supervised learning aims to predict outcomes based on labeled data, unsupervised learning focuses on discovering hidden patterns within the data, and reinforcement learning centers around learning optimal actions through interaction with an environment.
Let’s define some notations to introduce them:
\(X\): A set of feature vectors representing the input data. Each element \(X_i\) corresponds to a set of features or attributes that describe an instance of data.
\(Y\): A set of labels or rewards associated with outcomes. In supervised learning, \(Y\) is used to evaluate the correctness of the model’s predictions. In reinforcement learning, \(Y\) represents the rewards that guide the learning process.
\(A\): A set of possible actions in a given context. In reinforcement learning, actions \(A\) represent choices that can be made in response to a given situation, with the goal of maximizing a reward.
10.1.1 Supervised Learning
Supervised learning is the most widely used type of machine learning. In supervised learning, we have both feature vectors \(X\) and their corresponding labels \(Y\). The objective is to train a model that can predict \(Y\) based on \(X\). This model is trained on labeled examples, where the correct outcome is known, and it adjusts its internal parameters to minimize the error in its predictions, which occurs as part of the cross-validation process.
Key tasks in supervised learning include:
- Classification: Assigning data points to predefined categories or classes.
- Regression: Predicting a continuous value based on input data.
In supervised learning, the data consists of both feature vectors \(X\) and labels \(Y\), namely, \((X, Y)\).
10.1.2 Unsupervised Learning
Unsupervised learning involves learning patterns from data without any associated labels or outcomes. The objective is to explore and identify hidden structures in the feature vectors \(X\). Since there are no ground-truth labels \(Y\) to guide the learning process, the algorithm must discover patterns on its own. This is particularly useful when subject matter experts are unsure of common properties within a data set.
Common tasks in unsupervised learning include:
Clustering: Grouping similar data points together based on certain features.
Dimension Reduction: Simplifying the input data by reducing the number of features while preserving essential patterns.
In unsupervised learning, the data consists solely of feature vectors \(X\).
10.1.3 Reinforcement Learning
Reinforcement learning involves learning how to make a sequence of decisions to maximize a cumulative reward. Unlike supervised learning, where the model learns from a static dataset of labeled examples, reinforcement learning involves an agent that interacts with an environment by taking actions \(A\), receiving feedback in the form of rewards \(Y\), and learning over time which actions lead to the highest cumulative reward.
The process in reinforcement learning involves:
States: The context or environment the agent is in, represented by feature vectors \(X\).
Actions: The set of possible choices the agent can make in response to the current state, denoted as \(A\).
Rewards: Feedback the agent receives after taking an action, which guides the learning process.
In reinforcement learning, the data consists of feature vectors \(X\), actions \(A\), and rewards \(Y\), namely, \((X, A, Y)\).
10.2 Decision Trees
Decision trees are widely used supervised learning models that predict the value of a target variable by iteratively splitting the dataset based on decision rules derived from input features. The model functions as a piecewise constant approximation of the target function, producing clear, interpretable rules that are easily visualized and analyzed (Breiman et al., 1984). Decision trees are fundamental in both classification and regression tasks, serving as the building blocks for more advanced ensemble models such as Random Forests and Gradient Boosting Machines.
10.2.1 Recursive Partition Algorithm
The core mechanism of a decision tree algorithm is the identification of optimal splits that partition the data into subsets that are increasingly homogeneous with respect to the target variable. At any node \(m\), the data subset is denoted as \(Q_m\) with a sample size of \(n_m\). The objective is to find a candidate split \(\theta\), defined as a threshold for a given feature, that minimizes an impurity or loss measure \(H\).
When a split is made at node \(m\), the data is divided into two subsets: \(Q_{m,l}\) (left node) with sample size \(n_{m,l}\), and \(Q_{m,r}\) (right node) with sample size \(n_{m,r}\). The split quality, measured by \(G(Q_m, \theta)\), is given by:
\[ G(Q_m, \theta) = \frac{n_{m,l}}{n_m} H(Q_{m,l}(\theta)) + \frac{n_{m,r}}{n_m} H(Q_{m,r}(\theta)). \]
The algorithm aims to identify the split that minimizes the impurity:
\[ \theta^* = \arg\min_{\theta} G(Q_m, \theta). \]
This process is applied recursively at each child node until a stopping condition is met.
- Stopping Criteria: The algorithm stops when the maximum tree depth is reached or when the node sample size falls below a preset threshold.
- Pruning: Reduce the complexity of the final tree by removing branches that add little predictive value. This reduces overfitting and improves generalization.
10.2.2 Search Space for Possible Splits
At each node, the search space for possible splits comprises all features in the dataset and potential thresholds derived from the feature values. For a given feature, the algorithm considers each of its unique value in the current node as a possible split point. The potential thresholds are typically set as midpoints between consecutive unique values, ensuring effective partition.
Formally, let the feature set be \(\{X_1, X_2, \ldots, X_p\}\), where \(p\) is the total number of features, and let the unique values of feature \(X_j\) at node \(m\) be denoted by \(\{v_{m,j,1}, v_{m,j,2}, \ldots, v_{m,j,k_{mj}}\}\). The search space at node \(m\) includes:
- Feature candidates: \(\{X_1, X_2, \ldots, X_p\}\).
- Threshold candidates for \(X_j\): \[ \left\{ \frac{v_{m,j,i} + v_{m,j,i+1}}{2} \mid 1 \leq i < k_{mj} \right\}. \]
While the complexity of this search can be substantial, particularly for high-dimensional data or features with numerous unique values, efficient algorithms use sorting and single-pass scanning techniques to mitigate the computational cost.
10.2.3 Metrics
10.2.3.1 Classification
In classification, the split quality metric measures how pure the resulting nodes are after a split. A pure node contains observations that predominantly belong to a single class.
Gini Index: The Gini index measures node impurity by the probability that two observations randomly drawn from the node belong to different classes. A perfect split (all instances belong to one class) has a Gini index of 0. At node \(m\), the Gini index is \[ H(Q_m) = \sum_{k=1}^{K} p_{mk} (1 - p_{mk}) = 1 - \sum_{k=1}^n p_{mk}^2, \] where \(p_{mk}\) is the proportion of samples of class \(k\) at node \(m\); and \(K\) is the total number of classes The Gini index is often preferred for its speed and simplicity, and it’s used by default in many implementations of decision trees, including
sklearn.The Gini index originates from the Gini coefficient, introduced by Corrado Gini in 1912 to quantify inequality in income distributions. In that context, the Gini coefficient measures how unevenly a quantity (such as wealth) is distributed across a population. Decision tree algorithms adapt this concept of inequality to measure the impurity of a node: instead of wealth, the distribution concerns class membership. A perfectly pure node, where all observations belong to the same class, represents complete equality and yields a Gini index of zero. As class proportions become more mixed, inequality in class membership increases, leading to higher impurity values. Thus, the Gini index used in decision trees can be viewed as a statistical measure of diversity or heterogeneity derived from Gini’s original work on inequality.
Entropy (Information Gain): Derived from information theory, entropy quantifies the disorder of the data at a node. Lower entropy means higher purity. At node \(m\), it is defined as \[ H(Q_m) = - \sum_{k=1}^{K} p_{mk} \log p_{mk}. \] Entropy is commonly used in decision tree algorithms like ID3 and C4.5. The choice between Gini and entropy often depends on specific use cases, but both perform similarly in practice.
Misclassification Error: Misclassification error focuses on the most frequent class in the node. It measures the proportion of samples that do not belong to the majority class. Although less sensitive than Gini and entropy, it can be useful for classification when simplicity is preferred. At node \(m\), it is defined as \[ H(Q_m) = 1 - \max_k p_{mk}, \] where \(\max_k p_{mk}\) is the largest proportion of samples belonging to any class \(k\).
10.2.3.2 Regression Criteria
In regression, the goal is to minimize the spread or variance of the target variable within each node.
Mean Squared Error (MSE): MSE is the average squared difference between observed and predicted values (mean of the target in the node). The smaller the MSE, the better the fit. At node \(m\), it is \[ H(Q_m) = \frac{1}{n_m} \sum_{i=1}^{n_m} (y_i - \bar{y}_m)^2, \] where
- \(y_i\) is the actual value for sample \(i\);
- \(\bar{y}_m\) is the mean value of the target at node \(m\);
- \(n_m\) is the number of samples at node \(m\).
MSE works well when the target is continuous and normally distributed.
Half Poisson Deviance: Used for count target, the Poisson deviance measures the variance in the number of occurrences of an event. At node \(m\), it is \[ H(Q_m) = \sum_{i=1}^{n_m} \left( y_i \log\left(\frac{y_i}{\hat{y}_i}\right) - (y_i - \hat{y}_i) \right), \] where \(\hat{y}_i\) is the predicted count. This criterion is especially useful when the target variable represents discrete counts, such as predicting the number of occurrences of an event.
Mean Absolute Error (MAE): MAE aims to minimize the absolute differences between actual and predicted values. While it is more robust to outliers than MSE, it is slower computationally due to the lack of a closed-form solution for minimization. At node \(m\), it is \[ H(Q_m) = \frac{1}{n_m} \sum_{i=1}^{n_m} |y_i - \bar{y}_m|. \] MAE is useful when you want to minimize large deviations and can be more robust in cases where outliers are present in the data.
10.2.4 Ames Housing Example
The Ames Housing data are used to illustrate a regression tree model for predicting log house price.
As before, we retrive the data from OpenML.
A decision tree partitions the feature space into regions where the average log price is relatively constant.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import plotnine as gg
import pandas as pd
numeric_features = [
"OverallQual", "GrLivArea", "GarageCars",
"TotalBsmtSF", "YearBuilt", "FullBath"
]
categorical_features = ["KitchenQual"]
preprocessor = ColumnTransformer([
("num", "passthrough", numeric_features),
("cat", OneHotEncoder(drop="first"), categorical_features)
])
X = df[numeric_features + categorical_features]
y = df["LogPrice"]
depths = range(2, 11)
cv_scores = [
cross_val_score(
Pipeline([
("pre", preprocessor),
("model", DecisionTreeRegressor(max_depth=d, random_state=0))
]),
X, y, cv=5, scoring="r2"
).mean()
for d in depths
]
list(zip(depths, cv_scores))[(2, np.float64(0.594134995704976)),
(3, np.float64(0.6712972027857058)),
(4, np.float64(0.7141792234890973)),
(5, np.float64(0.748794485599919)),
(6, np.float64(0.7587964739851765)),
(7, np.float64(0.7343953839481492)),
(8, np.float64(0.7186324525934304)),
(9, np.float64(0.7112790937242873)),
(10, np.float64(0.6938966980572193))]
Cross-validation identifies an appropriate tree depth that balances fit and generalization. A too-deep tree overfits, while a shallow tree misses structure.
dt = Pipeline([
("pre", preprocessor),
("model", DecisionTreeRegressor(max_depth=4, random_state=0))
])
dt.fit(X, y)
y_pred = dt.predict(X)
df_pred = pd.DataFrame({"Observed": y, "Predicted": y_pred})
(gg.ggplot(df_pred, gg.aes(x="Observed", y="Predicted")) +
gg.geom_point(alpha=0.5) +
gg.geom_abline(slope=1, intercept=0, linetype="dashed") +
gg.labs(title="Decision Tree Regression on Ames Housing",
x="Observed Log Price", y="Predicted Log Price"))
The plot shows predicted versus observed log prices. A well-fitted model has points close to the diagonal. The decision tree naturally captures nonlinear effects and interactions, though its predictions are piecewise constant, producing visible step patterns.
10.3 Gradient-Boosted Models
Gradient boosting is a powerful ensemble technique in machine learning that combines multiple weak learners into a strong predictive model. Unlike bagging methods, which train models independently, gradient boosting fits models sequentially, with each new model correcting errors made by the previous ensemble (Friedman, 2001). While decision trees are commonly used as weak learners, gradient boosting can be generalized to other base models. This iterative method optimizes a specified loss function by repeatedly adding models designed to reduce residual errors.
10.3.1 Introduction
Gradient boosting builds on the general concept of boosting, aiming to construct a strong predictor from an ensemble of sequentially trained weak learners. The weak learners are often shallow decision trees (stumps), linear models, or generalized additive models (Hastie et al., 2009). Each iteration adds a new learner focusing primarily on the data points poorly predicted by the existing ensemble, thereby progressively enhancing predictive accuracy.
Gradient boosting’s effectiveness stems from:
- Error Correction: Each iteration specifically targets previous errors, refining predictive accuracy.
- Weighted Learning: Iteratively focuses more heavily on difficult-to-predict data points.
- Flexibility: Capable of handling diverse loss functions and various types of predictive tasks.
The effectiveness of gradient-boosted models has made them popular across diverse tasks, including classification, regression, and ranking. Gradient boosting forms the foundation for algorithms such as XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018), known for their high performance and scalability.
10.3.2 Gradient Boosting Process
Gradient boosting builds an ensemble by iteratively minimizing the residual errors from previous models. This iterative approach optimizes a loss function, \(L(y, F(x))\), where \(y\) represents the observed target variable and \(F(x)\) the model’s prediction for a given feature vector \(x\).
Key concepts:
- Loss Function: Guides model optimization, such as squared error for regression or logistic loss for classification.
- Learning Rate: Controls incremental updates, balancing training speed and generalization.
- Regularization: Reduces overfitting through tree depth limitation, subsampling, and L1/L2 penalties.
10.3.2.1 Model Iteration
The gradient boosting algorithm proceeds as follows:
Initialization: Define a base model \(F_0(x)\), typically the mean of the target variable for regression or the log-odds for classification.
Iterative Boosting: At each iteration \(m\):
Compute pseudo-residuals representing the negative gradient of the loss function at the current predictions. For each observation \(i\): \[ r_i^{(m)} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F(x)=F_{m-1}(x)}, \] where \(x_i\) and \(y_i\) denote the feature vector and observed value for the \(i\)-th observation, respectively. The residuals represent the direction of steepest descent in function space, so fitting a learner to them approximates a gradient descent step minimizing \(L(y, F(x))\).
Fit a new weak learner \(h_m(x)\) to these residuals.
Update the model: \[ F_m(x) = F_{m-1}(x) + \eta \, h_m(x), \] where \(\eta\) is a small positive learning rate (e.g., 0.01–0.1), controlling incremental improvement and reducing overfitting.
In some implementations, the update step includes an additional multiplier determined by a one-dimensional line search that minimizes the loss function at each iteration. Specifically, the optimal step length is defined as \[ \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\bigl(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr), \] leading to an updated model of the form \[ F_m(x) = F_{m-1}(x) + \eta\, \gamma_m\, h_m(x), \] where \(\eta\) remains a shrinkage factor controlling the overall rate of learning, while \(\gamma_m\) adjusts the step size adaptively at each iteration.
Final Model: After \(M\) iterations, the ensemble model is: \[ F_M(x) = F_0(x) + \sum_{m=1}^M \eta \, h_m(x). \]
Stochastic gradient boosting is a variant that enhances gradient boosting by introducing randomness through subsampling at each iteration, selecting a random fraction of data points (typically 50%–80%) to fit the model . This randomness helps reduce correlation among trees, improve model robustness, and lower the risk of overfitting.
10.3.3 Boosted Trees with Ames Housing
Boosted trees apply the gradient boosting framework to decision trees. They build an ensemble of shallow trees, each trained to correct the residual errors of the preceding ones. By sequentially emphasizing observations that are difficult to predict, the model progressively improves its overall predictive accuracy. We now apply gradient boosting using the same preprocessed features. Boosting combines many shallow trees, each correcting the residual errors of its predecessors, to improve predictive accuracy.
from sklearn.ensemble import GradientBoostingRegressor
# Define a range of tree counts for tuning
n_estimators_list = [50, 100, 200, 400]
cv_scores_gb = [
cross_val_score(
Pipeline([
("pre", preprocessor),
("model", GradientBoostingRegressor(
n_estimators=n,
learning_rate=0.05,
max_depth=3,
random_state=0))
]),
X, y, cv=5, scoring="r2"
).mean()
for n in n_estimators_list
]
list(zip(n_estimators_list, cv_scores_gb))[(50, np.float64(0.8199019428994425)),
(100, np.float64(0.8437789746745888)),
(200, np.float64(0.8451433195728157)),
(400, np.float64(0.8405078307788265))]
Cross-validation shows how increasing the number of boosting rounds initially improves performance but eventually risks overfitting when too many trees are added.
gb = Pipeline([
("pre", preprocessor),
("model", GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.05,
max_depth=3,
random_state=0))
])
gb.fit(X, y)
y_pred_gb = gb.predict(X)
df_pred_gb = pd.DataFrame({"Observed": y, "Predicted": y_pred_gb})
(gg.ggplot(df_pred_gb, gg.aes(x="Observed", y="Predicted")) +
gg.geom_point(alpha=0.5) +
gg.geom_abline(slope=1, intercept=0, linetype="dashed") +
gg.labs(title="Gradient-Boosted Regression on Ames Housing",
x="Observed Log Price", y="Predicted Log Price"))
The boosted model produces predictions that are generally closer to the 45-degree line than a single tree, reflecting improved accuracy and smoother response across the feature space.
Gradient-boosted trees introduce several parameters that govern model complexity, learning stability, and overfitting control:
n_estimators: the number of trees (boosting rounds). More trees can reduce bias but increase computation and risk of overfitting. learning_rate — the shrinkage parameter \(\eta\) controlling the contribution of each new tree. Smaller values (e.g., 0.05 or 0.01) require more trees but often yield better generalization.max_depth: the maximum depth of each individual tree, limiting the model’s ability to overfit local noise. Shallow trees (depth 2–4) are typical weak learners.subsample: the fraction of data used in each iteration. Values below 1.0 introduce randomness (stochastic boosting), improving robustness and reducing correlation among trees.min_samples_splitandmin_samples_leaf: minimum numbers of observations required for splitting or forming leaves. These control tree granularity and help regularize the model.
In practice, moderate learning rates with a sufficiently large number of estimators and shallow trees often perform best, balancing bias, variance, and computational cost.
10.3.4 XGBoost: Extreme Gradient Boosting
XGBoost is a scalable and efficient implementation of gradient-boosted decision trees (Chen & Guestrin, 2016). It has become one of the most widely used machine learning methods for structured data due to its high predictive performance, regularization capabilities, and speed. XGBoost builds an ensemble of decision trees in a stage-wise fashion, minimizing a regularized objective that balances training loss and model complexity.
The core idea of XGBoost is to fit each new tree to the gradient of the loss function with respect to the model’s predictions. Unlike traditional boosting algorithms like AdaBoost, which use only first-order gradients, XGBoost optionally uses second-order derivatives (Hessians), enabling better convergence and stability (Friedman, 2001).
XGBoost is widely used in data science competitions and real-world applications. It supports regularization (L1 and L2), handles missing values internally, and is designed for distributed computing.
XGBoost builds upon the same foundational idea as gradient boosted machines—sequentially adding trees to improve the predictive model— but introduces a number of enhancements:
| Aspect | Traditional GBM | XGBoost |
|---|---|---|
| Implementation | Basic gradient boosting | Optimized, regularized boosting |
| Regularization | Shrinkage only | L1 and L2 regularization |
| Loss Optimization | First-order gradients | First- and second-order |
| Missing Data | Requires manual imputation | Handled automatically |
| Tree Construction | Depth-wise | Level-wise (faster) |
| Parallelization | Limited | Built-in |
| Sparsity Handling | No | Yes |
| Objective Functions | Few options | Custom supported |
| Cross-validation | External via GridSearchCV |
Built-in xgb.cv |
XGBoost is therefore more suitable for large-scale problems and provides better generalization performance in many practical tasks.
10.4 Neural Networks
Neural networks learn flexible, nonlinear relationships by composing simple computational units. Each unit (or neuron) applies an affine transformation to its inputs followed by a nonlinearity; stacking layers of such units yields a feed-forward mapping from features to predictions. A helpful anchor is logistic regression: it can be viewed as a one-layer neural network with a sigmoid activation. Hidden layers generalize this idea by transforming inputs into progressively more useful internal representations, enabling the model to capture complex patterns beyond linear decision boundaries.
Historically, the field traces back to the early formal model of a neuron by McCulloch & Pitts (1943), followed by Rosenblatt’s perceptron (Rosenblatt, 1958), which was the first trainable linear classifier. The limitations of single-layer perceptrons, notably their inability to model nonlinearly separable functions, were rigorously analyzed by Minsky & Papert (1969), leading to a temporary decline in interest. The introduction of the backpropagation algorithm by Rumelhart et al. (1986) revived the study of neural networks in the 1980s. With the growth of data, advances in hardware, and improved optimization methods, neural networks became the foundation for modern deep learning. Multilayer networks are universal function approximators in principle; in practice, their success depends on careful architectural design, efficient optimization, and appropriate regularization.
For an input vector \(x \in \mathbb{R}^d\), a hidden layer with \(m\) neurons computes \(h = \sigma(Wx + b)\) through an activation function \(\sigma\) \(W \in \mathbb{R}^{m\times d}\) and \(b \in \mathbb{R}^m\). The output layer applies another affine map (and possibly a final activation) to \(h\). Keeping track of these shapes prevents many implementation bugs.
The figure below shows an input layer, one hidden layer, and an output layer with directed connections. (We will formalize the mathematics in the next subsection.)
10.4.1 Structure of a Neural Network
A neural network is composed of layers of interconnected processing units called neurons or nodes. Each neuron receives inputs, applies a linear transformation, and passes the result through a nonlinear activation function. The layers are arranged sequentially so that information flows from input features to intermediate representations and finally to the output.
10.4.2 Mathematical Formulation
Let the input vector be \(x \in \mathbb{R}^{d}\). A hidden layer with \(m\) neurons computes
\[ h = \sigma(Wx + b), \]
where \(W \in \mathbb{R}^{m \times d}\) is the weight matrix, \(b \in \mathbb{R}^{m}\) is the bias vector, and \(\sigma(\cdot)\) is an activation function applied element-wise.
Subsequent layers take \(h\) as input and repeat the same computation, producing successively transformed representations.
If the output layer contains \(k\) neurons with linear activations, the network computes
\[ \hat{y} = Vh + c, \]
where \(V \in \mathbb{R}^{k \times m}\) and \(c \in \mathbb{R}^{k}\) are the output weights and biases.
Collectively, these parameters define the model’s trainable structure.
During a forward pass, inputs propagate layer by layer: \(x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow \cdots
\rightarrow \hat{y}\).
Backward propagation of gradients (discussed later) updates all weights based on prediction error.
10.4.3 Network Diagram
Figure 10.2 illustrates a feed-forward neural network with two hidden layers. Arrows indicate the direction of information flow from one layer to the next. Each connection carries a learnable weight, and nonlinear activations transform the signals as they propagate forward.
Each connection (edge) has an associated weight adjusted during training. The depth of a network refers to the number of hidden layers, and the width refers to the number of neurons per layer.
10.4.4 Activation Functions
Activation functions introduce nonlinearity into neural networks. Without them, the entire network would collapse into a single linear transformation, regardless of the number of layers. Nonlinear activations allow networks to approximate arbitrary complex functions, enabling them to model curved decision boundaries and hierarchical representations.
Let \(z\) denote the input to a neuron before activation. The most common choices are:
| Function | Formula | Range | Key Property |
|---|---|---|---|
| Sigmoid | \(\displaystyle \sigma(z) = \frac{1}{1 + e^{-z}}\) | \((0,1)\) | Smooth, bounded; used in early networks and binary outputs |
| Tanh | \(\displaystyle \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\) | \((-1,1)\) | Zero-centered; stronger gradients than sigmoid |
| ReLU | \(\displaystyle \mathrm{ReLU}(z) = \max(0, z)\) | \([0, \infty)\) | Sparse activations; efficient to compute |
| Leaky ReLU | \(\displaystyle \mathrm{LReLU}(z) = \max(0.01z, z)\) | \((-\infty, \infty)\) | Avoids “dead” neurons of ReLU |
Figure 10.3 shows the shape of these activation functions. Notice how ReLU sharply truncates negative values while sigmoid and tanh saturate for large magnitudes.
10.4.5 Training Neural Networks
Training a neural network means adjusting its parameters so that its predictions \(\hat{y}\) align closely with the true outputs \(y\). This is done by minimizing a loss function that quantifies prediction error. Optimization proceeds iteratively through forward and backward passes.
10.4.5.1 Loss Functions
The choice of loss depends on the task:
| Task | Typical Loss Function | Expression |
|---|---|---|
| Regression | Mean squared error (MSE) | \(\displaystyle L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\) |
| Binary classification | Binary cross-entropy | \(\displaystyle L = -\frac{1}{n}\sum_{i=1}^{n} \left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]\) |
| Multiclass classification | Categorical cross-entropy | \(\displaystyle L = -\frac{1}{n}\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik}\log \hat{y}_{ik}\) |
The loss function \(L(\theta)\) depends on all network parameters \(\theta = \{W, b, V, c, \ldots\}\) and guides the optimization.
10.4.5.2 Gradient Descent
Neural networks are trained by gradient descent, which updates parameters in the opposite direction of the gradient of the loss:
\[ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta^{(t)}), \]
where \(\eta > 0\) is the learning rate controlling the step size. Choosing \(\eta\) too small leads to slow convergence; too large can cause divergence.
Variants such as stochastic and mini-batch gradient descent compute gradients on subsets of data to speed learning and improve generalization.
Figure 10.4 gives a two-dimensional analogy of gradient descent. In practice, a neural network may contain millions of parameters, meaning the loss function is defined over an extremely high-dimensional space. The contours shown here merely represent a projection of that space onto two dimensions for illustration.
The true optimization surface is much more complex—full of flat regions, sharp valleys, and numerous local minima—but the intuition of moving “downhill” along the loss gradient remains valid.
10.4.6 Regularization and Overfitting
Because neural networks contain large numbers of parameters, they can easily overfit training data—capturing noise or random fluctuations rather than generalizable patterns. A model that overfits performs well on the training set but poorly on new data. Regularization techniques introduce constraints or randomness that help the network generalize.
10.4.6.1 The Bias–Variance Trade-Off
Adding model complexity (more layers or neurons) reduces bias but increases variance. The goal of regularization is to find a balance between these forces. Figure 10.5 provides a schematic view: the unregularized model follows every data fluctuation, while a regularized model captures only the underlying trend.
10.4.6.2 Common Regularization Techniques
Among the simplest and most widely used techniques is L2 weight decay, which adds a penalty term to discourage large weights:
\[ L_{\text{total}} = L_{\text{data}} + \lambda \sum_{l}\sum_{i,j} (w_{ij}^{(l)})^2, \]
where \(\lambda\) controls the strength of the constraint. Other general approaches include dropout, which randomly deactivates neurons during training to prevent reliance on any single pathway, and early stopping, which halts training once the validation loss stops improving. These ideas apply to networks of all depths and form the foundation for more advanced regularization strategies discussed later in the deep learning section.
Regularization methods restrict how freely the model parameters can adapt to training data. This constraint encourages smoother mappings, reduces sensitivity to noise, and improves generalization to unseen inputs.
- Start with small \(\lambda\) for L2 weight decay.
- Use dropout (e.g., \(p = 0.2\)–\(0.5\)) between dense layers.
- Always track training vs. validation loss curves to detect overfitting.
10.4.7 Example: A Simple Feed-Forward Network
To illustrate how a neural network operates in practice, we train a simple multilayer perceptron (MLP) on a two-dimensional nonlinear dataset. The model learns a curved decision boundary that cannot be captured by linear classifiers.
The two-moons dataset consists of two interleaving half-circles, forming a pattern that resembles a pair of crescents. Each point represents an observation with two features \((x_1, x_2)\), and the color indicates its class. The two classes overlap slightly due to added random noise.
A linear classifier such as logistic regression would draw a straight line through the plane, misclassifying many points. In contrast, a neural network can learn the nonlinear boundary that curves along the interface between the moons. This problem, although simple, captures the essence of nonlinear learning in higher dimensions.
We generate the data using make_moons() from scikit-learn and train a feed-forward neural network with two hidden layers. The model’s task is to assign each observation to one of the two moon shapes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
# Generate synthetic data
X, y = make_moons(n_samples=500, noise=0.25, random_state=1023)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Define and train a small neural network
mlp = MLPClassifier(hidden_layer_sizes=(10, 5),
activation='relu',
solver='adam',
alpha=0.001,
max_iter=2000,
random_state=1023)
mlp.fit(X_train, y_train)MLPClassifier(alpha=0.001, hidden_layer_sizes=(10, 5), max_iter=2000,
random_state=1023)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
| hidden_layer_sizes | (10, ...) | |
| activation | 'relu' | |
| solver | 'adam' | |
| alpha | 0.001 | |
| batch_size | 'auto' | |
| learning_rate | 'constant' | |
| learning_rate_init | 0.001 | |
| power_t | 0.5 | |
| max_iter | 2000 | |
| shuffle | True | |
| random_state | 1023 | |
| tol | 0.0001 | |
| verbose | False | |
| warm_start | False | |
| momentum | 0.9 | |
| nesterovs_momentum | True | |
| early_stopping | False | |
| validation_fraction | 0.1 | |
| beta_1 | 0.9 | |
| beta_2 | 0.999 | |
| epsilon | 1e-08 | |
| n_iter_no_change | 10 | |
| max_fun | 15000 |
Afer the model is trained, we evaluate its accuracy.
# Evaluate accuracy
acc_train = mlp.score(X_train, y_train)
acc_test = mlp.score(X_test, y_test)
# Prepare grid for decision boundary
xx, yy = np.meshgrid(np.linspace(-2, 3, 300),
np.linspace(-1.5, 2, 300))
Z = mlp.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Plot
fig, ax = plt.subplots(figsize=(6, 5))
ax.contourf(xx, yy, Z, cmap="coolwarm", alpha=0.3)
scatter = ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="coolwarm", edgecolor="k", s=40)
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_title(f"Two-Moons Classification (Train Acc: {acc_train:.2f}, Test Acc: {acc_test:.2f})")
plt.show()
The trained network achieves high accuracy on both the training and test sets, demonstrating good generalization. The contour regions in Figure Figure 10.6 reveal the curved boundary the network has learned. Each hidden layer transforms the input coordinates into a new representation where the two classes become more separable. After a few layers of nonlinear transformations, the output layer can classify the points with a simple linear threshold.
Even this small network, with only a few dozen parameters, can learn a highly nonlinear decision surface. This illustrates the expressive power of multilayer neural networks, even in low-dimensional spaces.
Increase the noise level in the data to test robustness. Change the hidden layer sizes (e.g., (3,), (20,10)), or the activation function (tanh, logistic). Adjust the regularization parameter alpha to see how it affects smoothness of the boundary.