10  Supervised Learning

10.1 Introduction

Machine Learning (ML) is a branch of artificial intelligence that enables systems to learn from data and improve their performance over time without being explicitly programmed. At its core, machine learning algorithms aim to identify patterns in data and use those patterns to make decisions or predictions.

Machine learning can be categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning. Each type differs in the data it uses and the learning tasks it performs, addressing addresses different tasks and problems. Supervised learning aims to predict outcomes based on labeled data, unsupervised learning focuses on discovering hidden patterns within the data, and reinforcement learning centers around learning optimal actions through interaction with an environment.

Let’s define some notations to introduce them:

  • \(X\): A set of feature vectors representing the input data. Each element \(X_i\) corresponds to a set of features or attributes that describe an instance of data.

  • \(Y\): A set of labels or rewards associated with outcomes. In supervised learning, \(Y\) is used to evaluate the correctness of the model’s predictions. In reinforcement learning, \(Y\) represents the rewards that guide the learning process.

  • \(A\): A set of possible actions in a given context. In reinforcement learning, actions \(A\) represent choices that can be made in response to a given situation, with the goal of maximizing a reward.

10.1.1 Supervised Learning

Supervised learning is the most widely used type of machine learning. In supervised learning, we have both feature vectors \(X\) and their corresponding labels \(Y\). The objective is to train a model that can predict \(Y\) based on \(X\). This model is trained on labeled examples, where the correct outcome is known, and it adjusts its internal parameters to minimize the error in its predictions, which occurs as part of the cross-validation process.

Key tasks in supervised learning include:

  • Classification: Assigning data points to predefined categories or classes.
  • Regression: Predicting a continuous value based on input data.

In supervised learning, the data consists of both feature vectors \(X\) and labels \(Y\), namely, \((X, Y)\).

10.1.2 Unsupervised Learning

Unsupervised learning involves learning patterns from data without any associated labels or outcomes. The objective is to explore and identify hidden structures in the feature vectors \(X\). Since there are no ground-truth labels \(Y\) to guide the learning process, the algorithm must discover patterns on its own. This is particularly useful when subject matter experts are unsure of common properties within a data set.

Common tasks in unsupervised learning include:

  • Clustering: Grouping similar data points together based on certain features.

  • Dimension Reduction: Simplifying the input data by reducing the number of features while preserving essential patterns.

In unsupervised learning, the data consists solely of feature vectors \(X\).

10.1.3 Reinforcement Learning

Reinforcement learning involves learning how to make a sequence of decisions to maximize a cumulative reward. Unlike supervised learning, where the model learns from a static dataset of labeled examples, reinforcement learning involves an agent that interacts with an environment by taking actions \(A\), receiving feedback in the form of rewards \(Y\), and learning over time which actions lead to the highest cumulative reward.

The process in reinforcement learning involves:

  • States: The context or environment the agent is in, represented by feature vectors \(X\).

  • Actions: The set of possible choices the agent can make in response to the current state, denoted as \(A\).

  • Rewards: Feedback the agent receives after taking an action, which guides the learning process.

In reinforcement learning, the data consists of feature vectors \(X\), actions \(A\), and rewards \(Y\), namely, \((X, A, Y)\).

10.2 Decision Trees

Decision trees are widely used supervised learning models that predict the value of a target variable by iteratively splitting the dataset based on decision rules derived from input features. The model functions as a piecewise constant approximation of the target function, producing clear, interpretable rules that are easily visualized and analyzed (Breiman et al., 1984). Decision trees are fundamental in both classification and regression tasks, serving as the building blocks for more advanced ensemble models such as Random Forests and Gradient Boosting Machines.

10.2.1 Recursive Partition Algorithm

The core mechanism of a decision tree algorithm is the identification of optimal splits that partition the data into subsets that are increasingly homogeneous with respect to the target variable. At any node \(m\), the data subset is denoted as \(Q_m\) with a sample size of \(n_m\). The objective is to find a candidate split \(\theta\), defined as a threshold for a given feature, that minimizes an impurity or loss measure \(H\).

When a split is made at node \(m\), the data is divided into two subsets: \(Q_{m,l}\) (left node) with sample size \(n_{m,l}\), and \(Q_{m,r}\) (right node) with sample size \(n_{m,r}\). The split quality, measured by \(G(Q_m, \theta)\), is given by:

\[ G(Q_m, \theta) = \frac{n_{m,l}}{n_m} H(Q_{m,l}(\theta)) + \frac{n_{m,r}}{n_m} H(Q_{m,r}(\theta)). \]

The algorithm aims to identify the split that minimizes the impurity:

\[ \theta^* = \arg\min_{\theta} G(Q_m, \theta). \]

This process is applied recursively at each child node until a stopping condition is met.

  • Stopping Criteria: The algorithm stops when the maximum tree depth is reached or when the node sample size falls below a preset threshold.
  • Pruning: Reduce the complexity of the final tree by removing branches that add little predictive value. This reduces overfitting and improves generalization.

10.2.2 Search Space for Possible Splits

At each node, the search space for possible splits comprises all features in the dataset and potential thresholds derived from the feature values. For a given feature, the algorithm considers each of its unique value in the current node as a possible split point. The potential thresholds are typically set as midpoints between consecutive unique values, ensuring effective partition.

Formally, let the feature set be \(\{X_1, X_2, \ldots, X_p\}\), where \(p\) is the total number of features, and let the unique values of feature \(X_j\) at node \(m\) be denoted by \(\{v_{m,j,1}, v_{m,j,2}, \ldots, v_{m,j,k_{mj}}\}\). The search space at node \(m\) includes:

  • Feature candidates: \(\{X_1, X_2, \ldots, X_p\}\).
  • Threshold candidates for \(X_j\): \[ \left\{ \frac{v_{m,j,i} + v_{m,j,i+1}}{2} \mid 1 \leq i < k_{mj} \right\}. \]

While the complexity of this search can be substantial, particularly for high-dimensional data or features with numerous unique values, efficient algorithms use sorting and single-pass scanning techniques to mitigate the computational cost.

10.2.3 Metrics

10.2.3.1 Classification

In classification, the split quality metric measures how pure the resulting nodes are after a split. A pure node contains observations that predominantly belong to a single class.

  • Gini Index: The Gini index measures node impurity by the probability that two observations randomly drawn from the node belong to different classes. A perfect split (all instances belong to one class) has a Gini index of 0. At node \(m\), the Gini index is \[ H(Q_m) = \sum_{k=1}^{K} p_{mk} (1 - p_{mk}) = 1 - \sum_{k=1}^n p_{mk}^2, \] where \(p_{mk}\) is the proportion of samples of class \(k\) at node \(m\); and \(K\) is the total number of classes The Gini index is often preferred for its speed and simplicity, and it’s used by default in many implementations of decision trees, including sklearn.

    The Gini index originates from the Gini coefficient, introduced by Corrado Gini in 1912 to quantify inequality in income distributions. In that context, the Gini coefficient measures how unevenly a quantity (such as wealth) is distributed across a population. Decision tree algorithms adapt this concept of inequality to measure the impurity of a node: instead of wealth, the distribution concerns class membership. A perfectly pure node, where all observations belong to the same class, represents complete equality and yields a Gini index of zero. As class proportions become more mixed, inequality in class membership increases, leading to higher impurity values. Thus, the Gini index used in decision trees can be viewed as a statistical measure of diversity or heterogeneity derived from Gini’s original work on inequality.

  • Entropy (Information Gain): Derived from information theory, entropy quantifies the disorder of the data at a node. Lower entropy means higher purity. At node \(m\), it is defined as \[ H(Q_m) = - \sum_{k=1}^{K} p_{mk} \log p_{mk}. \] Entropy is commonly used in decision tree algorithms like ID3 and C4.5. The choice between Gini and entropy often depends on specific use cases, but both perform similarly in practice.

  • Misclassification Error: Misclassification error focuses on the most frequent class in the node. It measures the proportion of samples that do not belong to the majority class. Although less sensitive than Gini and entropy, it can be useful for classification when simplicity is preferred. At node \(m\), it is defined as \[ H(Q_m) = 1 - \max_k p_{mk}, \] where \(\max_k p_{mk}\) is the largest proportion of samples belonging to any class \(k\).

10.2.3.2 Regression Criteria

In regression, the goal is to minimize the spread or variance of the target variable within each node.

  • Mean Squared Error (MSE): MSE is the average squared difference between observed and predicted values (mean of the target in the node). The smaller the MSE, the better the fit. At node \(m\), it is \[ H(Q_m) = \frac{1}{n_m} \sum_{i=1}^{n_m} (y_i - \bar{y}_m)^2, \] where

    • \(y_i\) is the actual value for sample \(i\);
    • \(\bar{y}_m\) is the mean value of the target at node \(m\);
    • \(n_m\) is the number of samples at node \(m\).

    MSE works well when the target is continuous and normally distributed.

  • Half Poisson Deviance: Used for count target, the Poisson deviance measures the variance in the number of occurrences of an event. At node \(m\), it is \[ H(Q_m) = \sum_{i=1}^{n_m} \left( y_i \log\left(\frac{y_i}{\hat{y}_i}\right) - (y_i - \hat{y}_i) \right), \] where \(\hat{y}_i\) is the predicted count. This criterion is especially useful when the target variable represents discrete counts, such as predicting the number of occurrences of an event.

  • Mean Absolute Error (MAE): MAE aims to minimize the absolute differences between actual and predicted values. While it is more robust to outliers than MSE, it is slower computationally due to the lack of a closed-form solution for minimization. At node \(m\), it is \[ H(Q_m) = \frac{1}{n_m} \sum_{i=1}^{n_m} |y_i - \bar{y}_m|. \] MAE is useful when you want to minimize large deviations and can be more robust in cases where outliers are present in the data.

10.2.4 Ames Housing Example

The Ames Housing data are used to illustrate a regression tree model for predicting log house price.

As before, we retrive the data from OpenML.

import openml
import pandas as pd
import numpy as np

# Load Ames Housing dataset (OpenML ID 42165)
dataset = openml.datasets.get_dataset(42165)
df, *_ = dataset.get_data()
df["LogPrice"] = np.log(df["SalePrice"])

A decision tree partitions the feature space into regions where the average log price is relatively constant.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import plotnine as gg
import pandas as pd

numeric_features = [
    "OverallQual", "GrLivArea", "GarageCars",
    "TotalBsmtSF", "YearBuilt", "FullBath"
]
categorical_features = ["KitchenQual"]

preprocessor = ColumnTransformer([
    ("num", "passthrough", numeric_features),
    ("cat", OneHotEncoder(drop="first"), categorical_features)
])

X = df[numeric_features + categorical_features]
y = df["LogPrice"]

depths = range(2, 11)
cv_scores = [
    cross_val_score(
        Pipeline([
            ("pre", preprocessor),
            ("model", DecisionTreeRegressor(max_depth=d, random_state=0))
        ]),
        X, y, cv=5, scoring="r2"
    ).mean()
    for d in depths
]

list(zip(depths, cv_scores))
[(2, 0.594134995704976),
 (3, 0.6712972027857058),
 (4, 0.7141792234890973),
 (5, 0.748794485599919),
 (6, 0.7587964739851765),
 (7, 0.7343953839481492),
 (8, 0.7186324525934304),
 (9, 0.7112790937242873),
 (10, 0.6938966980572193)]

Cross-validation identifies an appropriate tree depth that balances fit and generalization. A too-deep tree overfits, while a shallow tree misses structure.

dt = Pipeline([
    ("pre", preprocessor),
    ("model", DecisionTreeRegressor(max_depth=4, random_state=0))
])

dt.fit(X, y)
y_pred = dt.predict(X)

df_pred = pd.DataFrame({"Observed": y, "Predicted": y_pred})

(gg.ggplot(df_pred, gg.aes(x="Observed", y="Predicted")) +
 gg.geom_point(alpha=0.5) +
 gg.geom_abline(slope=1, intercept=0, linetype="dashed") +
 gg.labs(title="Decision Tree Regression on Ames Housing",
         x="Observed Log Price", y="Predicted Log Price"))

The plot shows predicted versus observed log prices. A well-fitted model has points close to the diagonal. The decision tree naturally captures nonlinear effects and interactions, though its predictions are piecewise constant, producing visible step patterns.

10.3 Gradient-Boosted Models

Gradient boosting is a powerful ensemble technique in machine learning that combines multiple weak learners into a strong predictive model. Unlike bagging methods, which train models independently, gradient boosting fits models sequentially, with each new model correcting errors made by the previous ensemble (Friedman, 2001). While decision trees are commonly used as weak learners, gradient boosting can be generalized to other base models. This iterative method optimizes a specified loss function by repeatedly adding models designed to reduce residual errors.

10.3.1 Introduction

Gradient boosting builds on the general concept of boosting, aiming to construct a strong predictor from an ensemble of sequentially trained weak learners. The weak learners are often shallow decision trees (stumps), linear models, or generalized additive models (Hastie et al., 2009). Each iteration adds a new learner focusing primarily on the data points poorly predicted by the existing ensemble, thereby progressively enhancing predictive accuracy.

Gradient boosting’s effectiveness stems from:

  • Error Correction: Each iteration specifically targets previous errors, refining predictive accuracy.
  • Weighted Learning: Iteratively focuses more heavily on difficult-to-predict data points.
  • Flexibility: Capable of handling diverse loss functions and various types of predictive tasks.

The effectiveness of gradient-boosted models has made them popular across diverse tasks, including classification, regression, and ranking. Gradient boosting forms the foundation for algorithms such as XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018), known for their high performance and scalability.

10.3.2 Gradient Boosting Process

Gradient boosting builds an ensemble by iteratively minimizing the residual errors from previous models. This iterative approach optimizes a loss function, \(L(y, F(x))\), where \(y\) represents the observed target variable and \(F(x)\) the model’s prediction for a given feature vector \(x\).

Key concepts:

  • Loss Function: Guides model optimization, such as squared error for regression or logistic loss for classification.
  • Learning Rate: Controls incremental updates, balancing training speed and generalization.
  • Regularization: Reduces overfitting through tree depth limitation, subsampling, and L1/L2 penalties.

10.3.2.1 Model Iteration

The gradient boosting algorithm proceeds as follows:

  1. Initialization: Define a base model \(F_0(x)\), typically the mean of the target variable for regression or the log-odds for classification.

  2. Iterative Boosting: At each iteration \(m\):

    • Compute pseudo-residuals representing the negative gradient of the loss function at the current predictions. For each observation \(i\): \[ r_i^{(m)} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F(x)=F_{m-1}(x)}, \] where \(x_i\) and \(y_i\) denote the feature vector and observed value for the \(i\)-th observation, respectively. The residuals represent the direction of steepest descent in function space, so fitting a learner to them approximates a gradient descent step minimizing \(L(y, F(x))\).

    • Fit a new weak learner \(h_m(x)\) to these residuals.

    • Update the model: \[ F_m(x) = F_{m-1}(x) + \eta \, h_m(x), \] where \(\eta\) is a small positive learning rate (e.g., 0.01–0.1), controlling incremental improvement and reducing overfitting.

    In some implementations, the update step includes an additional multiplier determined by a one-dimensional line search that minimizes the loss function at each iteration. Specifically, the optimal step length is defined as \[ \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\bigl(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr), \] leading to an updated model of the form \[ F_m(x) = F_{m-1}(x) + \eta\, \gamma_m\, h_m(x), \] where \(\eta\) remains a shrinkage factor controlling the overall rate of learning, while \(\gamma_m\) adjusts the step size adaptively at each iteration.

  3. Final Model: After \(M\) iterations, the ensemble model is: \[ F_M(x) = F_0(x) + \sum_{m=1}^M \eta \, h_m(x). \]

Stochastic gradient boosting is a variant that enhances gradient boosting by introducing randomness through subsampling at each iteration, selecting a random fraction of data points (typically 50%–80%) to fit the model . This randomness helps reduce correlation among trees, improve model robustness, and lower the risk of overfitting.

10.3.3 Boosted Trees with Ames Housing

Boosted trees apply the gradient boosting framework to decision trees. They build an ensemble of shallow trees, each trained to correct the residual errors of the preceding ones. By sequentially emphasizing observations that are difficult to predict, the model progressively improves its overall predictive accuracy. We now apply gradient boosting using the same preprocessed features. Boosting combines many shallow trees, each correcting the residual errors of its predecessors, to improve predictive accuracy.

from sklearn.ensemble import GradientBoostingRegressor

# Define a range of tree counts for tuning
n_estimators_list = [50, 100, 200, 400]

cv_scores_gb = [
    cross_val_score(
        Pipeline([
            ("pre", preprocessor),
            ("model", GradientBoostingRegressor(
                n_estimators=n,
                learning_rate=0.05,
                max_depth=3,
                random_state=0))
        ]),
        X, y, cv=5, scoring="r2"
    ).mean()
    for n in n_estimators_list
]

list(zip(n_estimators_list, cv_scores_gb))
[(50, 0.8199019428994425),
 (100, 0.8437789746745888),
 (200, 0.8451433195728157),
 (400, 0.8405078307788265)]

Cross-validation shows how increasing the number of boosting rounds initially improves performance but eventually risks overfitting when too many trees are added.

gb = Pipeline([
    ("pre", preprocessor),
    ("model", GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=3,
        random_state=0))
])

gb.fit(X, y)
y_pred_gb = gb.predict(X)

df_pred_gb = pd.DataFrame({"Observed": y, "Predicted": y_pred_gb})

(gg.ggplot(df_pred_gb, gg.aes(x="Observed", y="Predicted")) +
 gg.geom_point(alpha=0.5) +
 gg.geom_abline(slope=1, intercept=0, linetype="dashed") +
 gg.labs(title="Gradient-Boosted Regression on Ames Housing",
         x="Observed Log Price", y="Predicted Log Price"))

The boosted model produces predictions that are generally closer to the 45-degree line than a single tree, reflecting improved accuracy and smoother response across the feature space.

Gradient-boosted trees introduce several parameters that govern model complexity, learning stability, and overfitting control:

  • n_estimators: the number of trees (boosting rounds). More trees can reduce bias but increase computation and risk of overfitting. learning_rate — the shrinkage parameter \(\eta\) controlling the contribution of each new tree. Smaller values (e.g., 0.05 or 0.01) require more trees but often yield better generalization.
  • max_depth: the maximum depth of each individual tree, limiting the model’s ability to overfit local noise. Shallow trees (depth 2–4) are typical weak learners.
  • subsample: the fraction of data used in each iteration. Values below 1.0 introduce randomness (stochastic boosting), improving robustness and reducing correlation among trees.
  • min_samples_split and min_samples_leaf: minimum numbers of observations required for splitting or forming leaves. These control tree granularity and help regularize the model.

In practice, moderate learning rates with a sufficiently large number of estimators and shallow trees often perform best, balancing bias, variance, and computational cost.

10.3.4 XGBoost: Extreme Gradient Boosting

XGBoost is a scalable and efficient implementation of gradient-boosted decision trees (Chen & Guestrin, 2016). It has become one of the most widely used machine learning methods for structured data due to its high predictive performance, regularization capabilities, and speed. XGBoost builds an ensemble of decision trees in a stage-wise fashion, minimizing a regularized objective that balances training loss and model complexity.

The core idea of XGBoost is to fit each new tree to the gradient of the loss function with respect to the model’s predictions. Unlike traditional boosting algorithms like AdaBoost, which use only first-order gradients, XGBoost optionally uses second-order derivatives (Hessians), enabling better convergence and stability (Friedman, 2001).

XGBoost is widely used in data science competitions and real-world applications. It supports regularization (L1 and L2), handles missing values internally, and is designed for distributed computing.

XGBoost builds upon the same foundational idea as gradient boosted machines—sequentially adding trees to improve the predictive model— but introduces a number of enhancements:

Aspect Traditional GBM XGBoost
Implementation Basic gradient boosting Optimized, regularized boosting
Regularization Shrinkage only L1 and L2 regularization
Loss Optimization First-order gradients First- and second-order
Missing Data Requires manual imputation Handled automatically
Tree Construction Depth-wise Level-wise (faster)
Parallelization Limited Built-in
Sparsity Handling No Yes
Objective Functions Few options Custom supported
Cross-validation External via GridSearchCV Built-in xgb.cv

XGBoost is therefore more suitable for large-scale problems and provides better generalization performance in many practical tasks.

10.4 Random Forest

This section was prepared by Alex Plotnikov, first-year master’s student in statistics.

10.4.1 Random Forest Algorithm

Random forests are an ensemble method that was developed by Leo Breiman in 2001 to improve the performance of tree-based models. This model creates multiple trees in parallel. There are a few key differences in how trees in a random forest are constructed compared to the decision tree algorithm. The main differences are:

  • Sample of data used for each tree.
  • Number of features selected at each node.
  • Trees are allowed to grow deep in random forests.
  • Prediction is made based on aggregating results from all trees.

10.4.1.1 Bagging (Bootstrap Aggregation)

Unlike decision trees which are trained on the entire dataset, random forests use bagging to create a sample for each tree. Afterwards, the tree is trained on that specific sample.

The term bagging comes from bootstrap aggregation. This process involves creating a sample set with the same number of records as the original by sampling from the original dataset with replacement. This means that observations may appear multiple times in the sample dataset.

For one draw:

\[ P(\text{not selected}) = 1 - \frac{1}{N}. \]

For N draws (since we sample N times with replacement):

\[ P(\text{not selected in entire sample}) = \left(1 - \frac{1}{N}\right)^N. \]

As N approaches infinity:

\[ \left(1 - \frac{1}{N}\right)^N \approx e^{-1} \approx 0.3679. \]

So, about 36.8% of the original observations are not included in a given bootstrap sample.

Therefore, the remaining portion is:

\[ 1 - e^{-1} \approx 0.632. \]

On average, 63.2% of the values in each sample are non-repeated. Bagging helps reduce overfitting to noise and reducing variance of the outcome for a more stable model.

10.4.1.2 Feature Sampling and Tree Depth

Random forests also calculate impurity on a subset of all features rather than all features as in a decision tree. At each node, the algorithm selects a random subset of features and calculates the impurity metric for these features. At the next node, the algorithm takes another sample of the entire feature set and calculates the metric on this new set of features.

For \(p\) features, the algorithm uses (\(\sqrt{p}\)) features for classification and (\(\frac{p}{3}\)) features for regression.

Trees in random forests are also usually allowed to grow to max depth. The goal of having multiple trees is reducing variance of output and overfitting, and this is achieved through averaging of trees.

10.4.1.3 Prediction

Random forests make predictions by aggregating results from all trees.

For classification, this is done by each tree predicting a label and then the forest combines predictions from all trees to output a label based on a majority vote.

For regression, the forest takes the numerical value predicted by each tree and then calculates the average for the output.

10.4.1.4 Algorithm Comparison

Decision Trees Random Forests
Data Set Entire dataset Bootstrap sample for each tree
Features All features Random subset at each node
Tree Size Based on input Can be modified with input but usually until tree fully grows
Prediction Leaf where new data lands Majority vote of all trees for classification, average for regression

10.4.2 RandomForestClassifier Parameters

These concepts can be adjusted as parameters in the RandomForestClassifier class in Python. Not all parameters are shown here as they are similar to decision tree parameters:

  • n_estimators: number of trees to train in forest.
  • max_depth: max depth of the tree in the forest, default value is set to None which allows the tree to grow to max depth.
  • max_features: number of features to use at each node. Default value is 'sqrt' and can also be set to 'log2' and None.
  • bootstrap: whether or not to use bootstrap for each tree, if False then each tree will use the entire dataset.
  • max_samples: the maximum number of samples for each tree can be adjusted rather than the same size as the dataset.

10.4.3 Random Forests in Python (Ames Data)

The below example shows a random forest regression on the Ames data. We construct a plot to show the observed log price against the predicted log price.

import openml
import os

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier # For classification
from sklearn.ensemble import RandomForestRegressor # For regression
from sklearn.metrics import accuracy_score, mean_squared_error # For evaluation
import plotnine as gg


# Load Ames Housing (OpenML ID 42165)
dataset = openml.datasets.get_dataset(42165)
df, *_ = dataset.get_data()

df["LogPrice"] = np.log(df["SalePrice"])

ind_columns = df.columns.drop(['Id','SalePrice','LogPrice']).tolist()
df_encoded = pd.get_dummies(df[ind_columns], drop_first=True)

X = df_encoded 
y = df['LogPrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                    test_size=0.2, random_state=42)

# # For Classification
# model = RandomForestClassifier(n_estimators=100, random_state=42)
# model.fit(X_train, y_train)

# For Regression (if 'target' was continuous)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


# For graph
dt = RandomForestRegressor(n_estimators=100, random_state=42)

dt.fit(X, y)
y_pred = dt.predict(X)

df_pred = pd.DataFrame({"Observed": y, "Predicted": y_pred})

(gg.ggplot(df_pred, gg.aes(x="Observed", y="Predicted")) +
 gg.geom_point(alpha=0.5) +
 gg.geom_abline(slope=1, intercept=0, linetype="dashed") +
 gg.labs(title="Random Forest Regression on Ames Housing",
         x="Observed Log Price", y="Predicted Log Price"))

10.4.4 Further Reading

10.5 Neural Networks

Neural networks learn flexible, nonlinear relationships by composing simple computational units. Each unit (or neuron) applies an affine transformation to its inputs followed by a nonlinearity; stacking layers of such units yields a feed-forward mapping from features to predictions. A helpful anchor is logistic regression: it can be viewed as a one-layer neural network with a sigmoid activation. Hidden layers generalize this idea by transforming inputs into progressively more useful internal representations, enabling the model to capture complex patterns beyond linear decision boundaries.

Historically, the field traces back to the early formal model of a neuron by McCulloch & Pitts (1943), followed by Rosenblatt’s perceptron (Rosenblatt, 1958), which was the first trainable linear classifier. The limitations of single-layer perceptrons, notably their inability to model nonlinearly separable functions, were rigorously analyzed by Minsky & Papert (1969), leading to a temporary decline in interest. The introduction of the backpropagation algorithm by Rumelhart et al. (1986) revived the study of neural networks in the 1980s. With the growth of data, advances in hardware, and improved optimization methods, neural networks became the foundation for modern deep learning. Multilayer networks are universal function approximators in principle; in practice, their success depends on careful architectural design, efficient optimization, and appropriate regularization.

Dimensions and Shapes

For an input vector \(x \in \mathbb{R}^d\), a hidden layer with \(m\) neurons computes \(h = \sigma(Wx + b)\) through an activation function \(\sigma\) \(W \in \mathbb{R}^{m\times d}\) and \(b \in \mathbb{R}^m\). The output layer applies another affine map (and possibly a final activation) to \(h\). Keeping track of these shapes prevents many implementation bugs.

The figure below shows an input layer, one hidden layer, and an output layer with directed connections. (We will formalize the mathematics in the next subsection.)

Figure 10.1: A minimal feed-forward neural network with an input layer (3 nodes), one hidden layer (4 nodes), and an output layer (1 node). Edges indicate the direction of information flow.

10.5.1 Structure of a Neural Network

A neural network is composed of layers of interconnected processing units called neurons or nodes. Each neuron receives inputs, applies a linear transformation, and passes the result through a nonlinear activation function. The layers are arranged sequentially so that information flows from input features to intermediate representations and finally to the output.

10.5.2 Mathematical Formulation

Let the input vector be \(x \in \mathbb{R}^{d}\). A hidden layer with \(m\) neurons computes

\[ h = \sigma(Wx + b), \]

where \(W \in \mathbb{R}^{m \times d}\) is the weight matrix, \(b \in \mathbb{R}^{m}\) is the bias vector, and \(\sigma(\cdot)\) is an activation function applied element-wise.
Subsequent layers take \(h\) as input and repeat the same computation, producing successively transformed representations.
If the output layer contains \(k\) neurons with linear activations, the network computes

\[ \hat{y} = Vh + c, \]

where \(V \in \mathbb{R}^{k \times m}\) and \(c \in \mathbb{R}^{k}\) are the output weights and biases.
Collectively, these parameters define the model’s trainable structure.

Feed-Forward Flow

During a forward pass, inputs propagate layer by layer: \(x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow \cdots \rightarrow \hat{y}\).
Backward propagation of gradients (discussed later) updates all weights based on prediction error.

10.5.3 Network Diagram

Figure 10.2 illustrates a feed-forward neural network with two hidden layers. Arrows indicate the direction of information flow from one layer to the next. Each connection carries a learnable weight, and nonlinear activations transform the signals as they propagate forward.

Figure 10.2: A feed-forward neural network with two hidden layers. Directed arrows represent weighted connections; one example weight \(w^{(1)}_{23}\) is annotated to indicate its origin and destination neurons.
Note

Each connection (edge) has an associated weight adjusted during training. The depth of a network refers to the number of hidden layers, and the width refers to the number of neurons per layer.

10.5.4 Activation Functions

Activation functions introduce nonlinearity into neural networks. Without them, the entire network would collapse into a single linear transformation, regardless of the number of layers. Nonlinear activations allow networks to approximate arbitrary complex functions, enabling them to model curved decision boundaries and hierarchical representations.

Let \(z\) denote the input to a neuron before activation. The most common choices are:

Function Formula Range Key Property
Sigmoid \(\displaystyle \sigma(z) = \frac{1}{1 + e^{-z}}\) \((0,1)\) Smooth, bounded; used in early networks and binary outputs
Tanh \(\displaystyle \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\) \((-1,1)\) Zero-centered; stronger gradients than sigmoid
ReLU \(\displaystyle \mathrm{ReLU}(z) = \max(0, z)\) \([0, \infty)\) Sparse activations; efficient to compute
Leaky ReLU \(\displaystyle \mathrm{LReLU}(z) = \max(0.01z, z)\) \((-\infty, \infty)\) Avoids “dead” neurons of ReLU

Figure 10.3 shows the shape of these activation functions. Notice how ReLU sharply truncates negative values while sigmoid and tanh saturate for large magnitudes.

Figure 10.3: Activation functions commonly used in neural networks. ReLU and its variants introduce sparsity, while sigmoid and tanh are smooth and saturating.

10.5.5 Training Neural Networks

Training a neural network means adjusting its parameters so that its predictions \(\hat{y}\) align closely with the true outputs \(y\). This is done by minimizing a loss function that quantifies prediction error. Optimization proceeds iteratively through forward and backward passes.

10.5.5.1 Loss Functions

The choice of loss depends on the task:

Task Typical Loss Function Expression
Regression Mean squared error (MSE) \(\displaystyle L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
Binary classification Binary cross-entropy \(\displaystyle L = -\frac{1}{n}\sum_{i=1}^{n} \left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]\)
Multiclass classification Categorical cross-entropy \(\displaystyle L = -\frac{1}{n}\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik}\log \hat{y}_{ik}\)

The loss function \(L(\theta)\) depends on all network parameters \(\theta = \{W, b, V, c, \ldots\}\) and guides the optimization.

10.5.5.2 Gradient Descent

Neural networks are trained by gradient descent, which updates parameters in the opposite direction of the gradient of the loss:

\[ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta^{(t)}), \]

where \(\eta > 0\) is the learning rate controlling the step size. Choosing \(\eta\) too small leads to slow convergence; too large can cause divergence.

Variants such as stochastic and mini-batch gradient descent compute gradients on subsets of data to speed learning and improve generalization.

Figure 10.4 gives a two-dimensional analogy of gradient descent. In practice, a neural network may contain millions of parameters, meaning the loss function is defined over an extremely high-dimensional space. The contours shown here merely represent a projection of that space onto two dimensions for illustration.
The true optimization surface is much more complex—full of flat regions, sharp valleys, and numerous local minima—but the intuition of moving “downhill” along the loss gradient remains valid.

Figure 10.4: Conceptual illustration of gradient descent. The optimizer iteratively moves parameters down the loss surface toward a local minimum.

10.5.6 Regularization and Overfitting

Because neural networks contain large numbers of parameters, they can easily overfit training data—capturing noise or random fluctuations rather than generalizable patterns. A model that overfits performs well on the training set but poorly on new data. Regularization techniques introduce constraints or randomness that help the network generalize.

10.5.6.1 The Bias–Variance Trade-Off

Adding model complexity (more layers or neurons) reduces bias but increases variance. The goal of regularization is to find a balance between these forces. Figure 10.5 provides a schematic view: the unregularized model follows every data fluctuation, while a regularized model captures only the underlying trend.

Figure 10.5: Illustration of overfitting and regularization. The unregularized model follows noise in the data, whereas the regularized model captures the broader pattern.

10.5.6.2 Common Regularization Techniques

Among the simplest and most widely used techniques is L2 weight decay, which adds a penalty term to discourage large weights:

\[ L_{\text{total}} = L_{\text{data}} + \lambda \sum_{l}\sum_{i,j} (w_{ij}^{(l)})^2, \]

where \(\lambda\) controls the strength of the constraint. Other general approaches include dropout, which randomly deactivates neurons during training to prevent reliance on any single pathway, and early stopping, which halts training once the validation loss stops improving. These ideas apply to networks of all depths and form the foundation for more advanced regularization strategies discussed later in the deep learning section.

Why Regularization Works

Regularization methods restrict how freely the model parameters can adapt to training data. This constraint encourages smoother mappings, reduces sensitivity to noise, and improves generalization to unseen inputs.

Practical Tips
  • Start with small \(\lambda\) for L2 weight decay.
  • Use dropout (e.g., \(p = 0.2\)\(0.5\)) between dense layers.
  • Always track training vs. validation loss curves to detect overfitting.

10.5.7 Example: A Simple Feed-Forward Network

To illustrate how a neural network operates in practice, we train a simple multilayer perceptron (MLP) on a two-dimensional nonlinear dataset. The model learns a curved decision boundary that cannot be captured by linear classifiers.

The two-moons dataset consists of two interleaving half-circles, forming a pattern that resembles a pair of crescents. Each point represents an observation with two features \((x_1, x_2)\), and the color indicates its class. The two classes overlap slightly due to added random noise.

A linear classifier such as logistic regression would draw a straight line through the plane, misclassifying many points. In contrast, a neural network can learn the nonlinear boundary that curves along the interface between the moons. This problem, although simple, captures the essence of nonlinear learning in higher dimensions.

We generate the data using make_moons() from scikit-learn and train a feed-forward neural network with two hidden layers. The model’s task is to assign each observation to one of the two moon shapes.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Generate synthetic data
X, y = make_moons(n_samples=500, noise=0.25, random_state=1023)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Define and train a small neural network
mlp = MLPClassifier(hidden_layer_sizes=(10, 5),
                    activation='relu',
                    solver='adam',
                    alpha=0.001,
                    max_iter=2000,
                    random_state=1023)
mlp.fit(X_train, y_train)
MLPClassifier(alpha=0.001, hidden_layer_sizes=(10, 5), max_iter=2000,
              random_state=1023)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Afer the model is trained, we evaluate its accuracy.

# Evaluate accuracy
acc_train = mlp.score(X_train, y_train)
acc_test = mlp.score(X_test, y_test)

# Prepare grid for decision boundary
xx, yy = np.meshgrid(np.linspace(-2, 3, 300),
                     np.linspace(-1.5, 2, 300))
Z = mlp.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Plot
fig, ax = plt.subplots(figsize=(6, 5))
ax.contourf(xx, yy, Z, cmap="coolwarm", alpha=0.3)
scatter = ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="coolwarm", edgecolor="k", s=40)
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_title(f"Two-Moons Classification (Train Acc: {acc_train:.2f}, Test Acc: {acc_test:.2f})")
plt.show()
Figure 10.6: Decision boundary learned by a simple multilayer perceptron on the two-moons dataset. The network captures the nonlinear separation between the classes.

The trained network achieves high accuracy on both the training and test sets, demonstrating good generalization. The contour regions in Figure Figure 10.6 reveal the curved boundary the network has learned. Each hidden layer transforms the input coordinates into a new representation where the two classes become more separable. After a few layers of nonlinear transformations, the output layer can classify the points with a simple linear threshold.

Note

Even this small network, with only a few dozen parameters, can learn a highly nonlinear decision surface. This illustrates the expressive power of multilayer neural networks, even in low-dimensional spaces.

Try It Yourself

Increase the noise level in the data to test robustness. Change the hidden layer sizes (e.g., (3,), (20,10)), or the activation function (tanh, logistic). Adjust the regularization parameter alpha to see how it affects smoothness of the boundary.

10.6 Neural Networks for Data with Temporal Dependence

Many real-world datasets are sequential, where earlier observations influence what happens later. Examples include electricity demand over hours, temperature across days, and stock prices through trading sessions. Such data exhibit temporal dependence, meaning that successive observations are not independent.

Traditional supervised learning models, such as linear regression and feedforward neural networks, treat each observation as if it were independent. When applied directly to time-ordered data, they fail to capture how information evolves through time. A prediction for one step does not reflect patterns that unfolded earlier.

To learn from sequential patterns, we need models that can remember what has already occurred and use that information to improve predictions. Neural networks designed for temporal dependence achieve this by introducing internal states that are updated as the sequence unfolds. The simplest such model is the recurrent neural network (RNN), which forms the basis for more advanced architectures such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks.

10.6.1 Recurrent Neural Networks (RNNs)

To model data with temporal dependence, a neural network must be able to retain information about what has happened previously. A recurrent neural network (RNN) accomplishes this by maintaining an internal hidden state that evolves over time. The hidden state acts as a summary of all past inputs and is updated as new data arrive.

At each time step \(t\), an RNN receives an input vector \(x_t\) and produces a hidden state \(h_t\) according to

\[ h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h), \]

where \(W_h\) and \(W_x\) are weight matrices and \(b_h\) is a bias term. The output at the same step can be expressed as

\[ \hat{y}_t = \sigma(W_y h_t + b_y), \]

with \(\sigma(\cdot)\) representing an activation or link function. Because \(h_t\) depends on \(h_{t-1}\), the network can in principle capture relationships across time.

The initial hidden state \(h_0\) must be specified before the sequence starts. In most applications, \(h_0\) is set to a vector of zeros with the same dimension as \(h_t\), allowing the network to begin without prior memory. This default works well because the recurrent updates quickly overwrite the initial state as new inputs arrive. In some advanced or stateful applications, \(h_0\) can instead be learned during training or carried over from the final state of a previous sequence, enabling the model to preserve continuity across batches.

Before training can begin, an objective function must be defined to measure how well the network predicts the target sequence. For a series of observations \(\{(x_t, y_t)\}_{t=1}^T\), the total loss is typically the sum of stepwise prediction errors, \[ \mathcal{L} = \sum_{t=1}^T \ell(y_t, \hat{y}_t), \] where \(\ell\) is a suitable loss such as mean squared error for regression or cross-entropy for classification. The gradients of \(\mathcal{L}\) with respect to the network parameters are then computed and used to update the weights through backpropagation through time.

Figure 10.7: An unrolled RNN showing how the hidden state connects across time steps.

Figure 10.7 illustrates how an RNN can be unrolled across time steps, showing that the same set of weights is reused at each step. The hidden state serves as a bridge between past and present inputs, allowing the network to accumulate information through time.

Training an RNN is done by backpropagation through time (BPTT), which unrolls the network over all time steps and applies gradient descent. However, when sequences are long, the repeated multiplication of gradients can lead to vanishing or exploding gradients. This makes it difficult for a standard RNN to learn long-term dependencies, limiting its ability to remember events far in the past.

In many applications, temporal dependence is only one part of the problem. Alongside the time-varying input \(x_t\), there may be additional covariates \(z\) that describe static or slowly changing characteristics, such as a station ID, region, or weather condition. These can be incorporated into an RNN by concatenating them with \(x_t\) at each time step or by feeding them into separate layers whose outputs are combined with the recurrent representation. In practice, the design depends on whether such covariates are constant across time or vary together with the sequence.

To address the limitations of standard RNNs, researchers developed architectures that explicitly control how information is remembered or forgotten. The most influential of these is the LSTM network, which introduces a structured memory cell and gating mechanisms to stabilize learning over longer sequences.

10.6.2 Long Short-Term Memory (LSTM)

The main limitation of a standard RNN is its inability to retain information over long sequences. During backpropagation through time, gradients tend to either vanish or explode, preventing effective learning of long-term dependencies. The Long Short-Term Memory (LSTM) network, proposed by Hochreiter and Schmidhuber (1997), was designed to overcome this problem.

An LSTM introduces a separate cell state \(C_t\) that acts as a highway for information to flow across time steps, along with gating mechanisms that regulate what to remember and what to forget. The gates use sigmoid activations to produce values between 0 and 1, allowing the network to scale information rather than overwrite it.

The key update equations of an LSTM are

\[ \begin{aligned} f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f), \\ i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i), \\ \tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C), \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \\ o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o), \\ h_t &= o_t \odot \tanh(C_t), \end{aligned} \]

where \(\odot\) denotes element-wise (Hadamard) multiplication and \(\sigma(\cdot)\) is the logistic sigmoid function. Each gate \(f_t\), \(i_t\), and \(o_t\) outputs values between 0 and 1 that determine how information flows through the cell.

The activation functions \(\tanh(\cdot)\) and \(\sigma(\cdot)\) play specific roles in the LSTM design. The sigmoid \(\sigma\) compresses values to the range \((0,1)\), making it suitable for gate control because it behaves like a smooth on–off switch. The hyperbolic tangent \(\tanh\) maps inputs to \((-1,1)\), allowing both positive and negative contributions to the cell state.

Other activation functions can in principle replace \(\tanh\), such as ReLU or Leaky ReLU, but this is uncommon in practice. ReLU may cause the cell state to grow without bound, and smooth symmetric activations like \(\tanh\) are generally more stable for recurrent updates. Some modern variants, such as the Peephole LSTM and GRU, adjust or simplify these activations, but the original combination of \(\sigma\) and \(\tanh\) remains the standard choice.

Figure 10.8: Structure of an LSTM cell showing the flow of information through the input, forget, and output gates.

Each of the three gates in an LSTM serves a distinct role.
The forget gate \(f_t\) determines how much of the previous cell state \(C_{t-1}\) should be retained, effectively deciding what information to discard. The input gate \(i_t\) controls how much new information \(\tilde{C}_t\) enters the cell state, allowing the network to incorporate relevant updates. The output gate \(o_t\) regulates how much of the cell state is exposed as the hidden state \(h_t\), influencing the network’s prediction at the current step. Together, these gates maintain a balance between remembering long-term patterns and adapting to new signals. Figure Figure 10.8 illustrates how the three gates interact with the cell state and hidden states to manage information flow through time.

10.6.3 Gated Recurrent Unit (GRU)

The Gated Recurrent Unit (GRU), introduced by Cho et al. (2014), is a simplified variant of the LSTM that retains its ability to capture long-term dependencies while using fewer parameters. The GRU combines the roles of the input and forget gates into a single update gate and omits the separate cell state \(C_t\), relying only on the hidden state \(h_t\) to store information.

The GRU update equations are

\[ \begin{aligned} z_t &= \sigma(W_z [h_{t-1}, x_t] + b_z), \\ r_t &= \sigma(W_r [h_{t-1}, x_t] + b_r), \\ \tilde{h}_t &= \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h), \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, \end{aligned} \]

where \(z_t\) is the update gate and \(r_t\) is the reset gate. The update gate controls how much of the previous hidden state to keep, while the reset gate determines how strongly past information should influence the new candidate state \(\tilde{h}_t\).

Figure 10.9: Structure of a GRU cell showing the update and reset gates.

The structure of a GRU cell is illustrated in Figure 10.9. Compared with an LSTM, the GRU is computationally simpler because it has no separate cell state and fewer matrix operations. Despite this simplification, GRUs often perform as well as LSTMs, especially when datasets are smaller or sequence lengths are moderate.

10.6.4 Example: Forecasting a Synthetic Sequential Signal (PyTorch)

To compare recurrent architectures in a reproducible way, we use a synthetic sine-wave signal with random noise. This allows us to train RNN, LSTM, and GRU models side-by-side without large datasets or external dependencies.

10.6.4.1 Step 1. Generate the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
time = np.arange(0, 200)
signal = np.sin(time / 6) + 0.3 * np.random.randn(200)
df = pd.DataFrame({"time": time, "signal": signal})

plt.figure(figsize=(6, 3))
plt.plot(df["time"], df["signal"])
plt.xlabel("Time")
plt.ylabel("Signal")
plt.tight_layout()
plt.show()

Synthetic sine-wave signal with random noise.

10.6.4.2 Step 2. Prepare input sequences

Each training example uses the previous 20 observations to predict the next value.

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import torch
from torch import nn

L = 20
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["signal"]])

X, y = [], []
for t in range(L, len(scaled)):
    X.append(scaled[t-L:t, 0])
    y.append(scaled[t, 0])
X, y = np.array(X), np.array(y)
X = X.reshape(X.shape[0], L, 1)

# split and convert to tensors
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test  = torch.tensor(X_test, dtype=torch.float32)
y_test  = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

10.6.4.3 Step 3. Define recurrent models

class RecurrentModel(nn.Module):
    def __init__(self, rnn_type="RNN", hidden_size=50):
        super().__init__()
        if rnn_type == "LSTM":
            self.rnn = nn.LSTM(1, hidden_size, batch_first=True)
        elif rnn_type == "GRU":
            self.rnn = nn.GRU(1, hidden_size, batch_first=True)
        else:
            self.rnn = nn.RNN(1, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
    def forward(self, x):
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])

10.6.4.4 Step 4. Train and evaluate

def train_model(model, X, y, epochs=50, lr=0.01):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
    return loss.item()

def predict(model, X):
    model.eval()
    with torch.no_grad():
        return model(X).numpy()

models = {
    "RNN":  RecurrentModel("RNN"),
    "LSTM": RecurrentModel("LSTM"),
    "GRU":  RecurrentModel("GRU")
}

for name, m in models.items():
    final_loss = train_model(m, X_train, y_train)
    print(f"{name} final training loss: {final_loss:.5f}")

y_preds = {name: predict(m, X_test) for name, m in models.items()}
RNN final training loss: 0.01107
LSTM final training loss: 0.02883
GRU final training loss: 0.01470

10.6.4.5 Step 5. Compare RMSE and MAE

from sklearn.metrics import mean_squared_error, mean_absolute_error

def metrics(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    return rmse, mae

for name, y_hat in y_preds.items():
    rmse, mae = metrics(y_test, y_hat)
    print(f"{name:5s} – RMSE: {rmse:.4f}, MAE: {mae:.4f}")
RNN   – RMSE: 0.0992, MAE: 0.0828
LSTM  – RMSE: 0.1803, MAE: 0.1533
GRU   – RMSE: 0.1121, MAE: 0.0928

10.6.4.6 Step 6. Visual comparison

plt.figure(figsize=(6.5, 4))
plt.plot(y_test[:100], label="Observed", color="black")
for name, y_hat in y_preds.items():
    plt.plot(y_hat[:100], label=name, alpha=0.7)
plt.xlabel("Time index")
plt.ylabel("Scaled signal")
plt.legend()
plt.tight_layout()
plt.show()

Observed and predicted values using RNN, LSTM, and GRU (PyTorch).

10.6.4.7 Discussion

All three networks capture the oscillatory pattern, but the vanilla RNN has difficulty preserving phase alignment when the sequence is long. Both the LSTM and GRU learn the dependency structure more reliably. The GRU reaches nearly the same accuracy as the LSTM while training faster, thanks to its simpler gating design.