import openml
import pandas as pd
import numpy as np
# Load Ames Housing dataset (OpenML ID 42165)
dataset = openml.datasets.get_dataset(42165)
df, *_ = dataset.get_data()
df["LogPrice"] = np.log(df["SalePrice"])10 Supervised Learning
10.1 Introduction
Machine Learning (ML) is a branch of artificial intelligence that enables systems to learn from data and improve their performance over time without being explicitly programmed. At its core, machine learning algorithms aim to identify patterns in data and use those patterns to make decisions or predictions.
Machine learning can be categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning. Each type differs in the data it uses and the learning tasks it performs, addressing addresses different tasks and problems. Supervised learning aims to predict outcomes based on labeled data, unsupervised learning focuses on discovering hidden patterns within the data, and reinforcement learning centers around learning optimal actions through interaction with an environment.
Let’s define some notations to introduce them:
\(X\): A set of feature vectors representing the input data. Each element \(X_i\) corresponds to a set of features or attributes that describe an instance of data.
\(Y\): A set of labels or rewards associated with outcomes. In supervised learning, \(Y\) is used to evaluate the correctness of the model’s predictions. In reinforcement learning, \(Y\) represents the rewards that guide the learning process.
\(A\): A set of possible actions in a given context. In reinforcement learning, actions \(A\) represent choices that can be made in response to a given situation, with the goal of maximizing a reward.
10.1.1 Supervised Learning
Supervised learning is the most widely used type of machine learning. In supervised learning, we have both feature vectors \(X\) and their corresponding labels \(Y\). The objective is to train a model that can predict \(Y\) based on \(X\). This model is trained on labeled examples, where the correct outcome is known, and it adjusts its internal parameters to minimize the error in its predictions, which occurs as part of the cross-validation process.
Key tasks in supervised learning include:
- Classification: Assigning data points to predefined categories or classes.
- Regression: Predicting a continuous value based on input data.
In supervised learning, the data consists of both feature vectors \(X\) and labels \(Y\), namely, \((X, Y)\).
10.1.2 Unsupervised Learning
Unsupervised learning involves learning patterns from data without any associated labels or outcomes. The objective is to explore and identify hidden structures in the feature vectors \(X\). Since there are no ground-truth labels \(Y\) to guide the learning process, the algorithm must discover patterns on its own. This is particularly useful when subject matter experts are unsure of common properties within a data set.
Common tasks in unsupervised learning include:
Clustering: Grouping similar data points together based on certain features.
Dimension Reduction: Simplifying the input data by reducing the number of features while preserving essential patterns.
In unsupervised learning, the data consists solely of feature vectors \(X\).
10.1.3 Reinforcement Learning
Reinforcement learning involves learning how to make a sequence of decisions to maximize a cumulative reward. Unlike supervised learning, where the model learns from a static dataset of labeled examples, reinforcement learning involves an agent that interacts with an environment by taking actions \(A\), receiving feedback in the form of rewards \(Y\), and learning over time which actions lead to the highest cumulative reward.
The process in reinforcement learning involves:
States: The context or environment the agent is in, represented by feature vectors \(X\).
Actions: The set of possible choices the agent can make in response to the current state, denoted as \(A\).
Rewards: Feedback the agent receives after taking an action, which guides the learning process.
In reinforcement learning, the data consists of feature vectors \(X\), actions \(A\), and rewards \(Y\), namely, \((X, A, Y)\).
10.2 Decision Trees
Decision trees are widely used supervised learning models that predict the value of a target variable by iteratively splitting the dataset based on decision rules derived from input features. The model functions as a piecewise constant approximation of the target function, producing clear, interpretable rules that are easily visualized and analyzed (Breiman et al., 1984). Decision trees are fundamental in both classification and regression tasks, serving as the building blocks for more advanced ensemble models such as Random Forests and Gradient Boosting Machines.
10.2.1 Recursive Partition Algorithm
The core mechanism of a decision tree algorithm is the identification of optimal splits that partition the data into subsets that are increasingly homogeneous with respect to the target variable. At any node \(m\), the data subset is denoted as \(Q_m\) with a sample size of \(n_m\). The objective is to find a candidate split \(\theta\), defined as a threshold for a given feature, that minimizes an impurity or loss measure \(H\).
When a split is made at node \(m\), the data is divided into two subsets: \(Q_{m,l}\) (left node) with sample size \(n_{m,l}\), and \(Q_{m,r}\) (right node) with sample size \(n_{m,r}\). The split quality, measured by \(G(Q_m, \theta)\), is given by:
\[ G(Q_m, \theta) = \frac{n_{m,l}}{n_m} H(Q_{m,l}(\theta)) + \frac{n_{m,r}}{n_m} H(Q_{m,r}(\theta)). \]
The algorithm aims to identify the split that minimizes the impurity:
\[ \theta^* = \arg\min_{\theta} G(Q_m, \theta). \]
This process is applied recursively at each child node until a stopping condition is met.
- Stopping Criteria: The algorithm stops when the maximum tree depth is reached or when the node sample size falls below a preset threshold.
- Pruning: Reduce the complexity of the final tree by removing branches that add little predictive value. This reduces overfitting and improves generalization.
10.2.2 Search Space for Possible Splits
At each node, the search space for possible splits comprises all features in the dataset and potential thresholds derived from the feature values. For a given feature, the algorithm considers each of its unique value in the current node as a possible split point. The potential thresholds are typically set as midpoints between consecutive unique values, ensuring effective partition.
Formally, let the feature set be \(\{X_1, X_2, \ldots, X_p\}\), where \(p\) is the total number of features, and let the unique values of feature \(X_j\) at node \(m\) be denoted by \(\{v_{m,j,1}, v_{m,j,2}, \ldots, v_{m,j,k_{mj}}\}\). The search space at node \(m\) includes:
- Feature candidates: \(\{X_1, X_2, \ldots, X_p\}\).
- Threshold candidates for \(X_j\): \[ \left\{ \frac{v_{m,j,i} + v_{m,j,i+1}}{2} \mid 1 \leq i < k_{mj} \right\}. \]
While the complexity of this search can be substantial, particularly for high-dimensional data or features with numerous unique values, efficient algorithms use sorting and single-pass scanning techniques to mitigate the computational cost.
10.2.3 Metrics
10.2.3.1 Classification
In classification, the split quality metric measures how pure the resulting nodes are after a split. A pure node contains observations that predominantly belong to a single class.
Gini Index: The Gini index measures node impurity by the probability that two observations randomly drawn from the node belong to different classes. A perfect split (all instances belong to one class) has a Gini index of 0. At node \(m\), the Gini index is \[ H(Q_m) = \sum_{k=1}^{K} p_{mk} (1 - p_{mk}) = 1 - \sum_{k=1}^n p_{mk}^2, \] where \(p_{mk}\) is the proportion of samples of class \(k\) at node \(m\); and \(K\) is the total number of classes The Gini index is often preferred for its speed and simplicity, and it’s used by default in many implementations of decision trees, including
sklearn.The Gini index originates from the Gini coefficient, introduced by Corrado Gini in 1912 to quantify inequality in income distributions. In that context, the Gini coefficient measures how unevenly a quantity (such as wealth) is distributed across a population. Decision tree algorithms adapt this concept of inequality to measure the impurity of a node: instead of wealth, the distribution concerns class membership. A perfectly pure node, where all observations belong to the same class, represents complete equality and yields a Gini index of zero. As class proportions become more mixed, inequality in class membership increases, leading to higher impurity values. Thus, the Gini index used in decision trees can be viewed as a statistical measure of diversity or heterogeneity derived from Gini’s original work on inequality.
Entropy (Information Gain): Derived from information theory, entropy quantifies the disorder of the data at a node. Lower entropy means higher purity. At node \(m\), it is defined as \[ H(Q_m) = - \sum_{k=1}^{K} p_{mk} \log p_{mk}. \] Entropy is commonly used in decision tree algorithms like ID3 and C4.5. The choice between Gini and entropy often depends on specific use cases, but both perform similarly in practice.
Misclassification Error: Misclassification error focuses on the most frequent class in the node. It measures the proportion of samples that do not belong to the majority class. Although less sensitive than Gini and entropy, it can be useful for classification when simplicity is preferred. At node \(m\), it is defined as \[ H(Q_m) = 1 - \max_k p_{mk}, \] where \(\max_k p_{mk}\) is the largest proportion of samples belonging to any class \(k\).
10.2.3.2 Regression Criteria
In regression, the goal is to minimize the spread or variance of the target variable within each node.
Mean Squared Error (MSE): MSE is the average squared difference between observed and predicted values (mean of the target in the node). The smaller the MSE, the better the fit. At node \(m\), it is \[ H(Q_m) = \frac{1}{n_m} \sum_{i=1}^{n_m} (y_i - \bar{y}_m)^2, \] where
- \(y_i\) is the actual value for sample \(i\);
- \(\bar{y}_m\) is the mean value of the target at node \(m\);
- \(n_m\) is the number of samples at node \(m\).
MSE works well when the target is continuous and normally distributed.
Half Poisson Deviance: Used for count target, the Poisson deviance measures the variance in the number of occurrences of an event. At node \(m\), it is \[ H(Q_m) = \sum_{i=1}^{n_m} \left( y_i \log\left(\frac{y_i}{\hat{y}_i}\right) - (y_i - \hat{y}_i) \right), \] where \(\hat{y}_i\) is the predicted count. This criterion is especially useful when the target variable represents discrete counts, such as predicting the number of occurrences of an event.
Mean Absolute Error (MAE): MAE aims to minimize the absolute differences between actual and predicted values. While it is more robust to outliers than MSE, it is slower computationally due to the lack of a closed-form solution for minimization. At node \(m\), it is \[ H(Q_m) = \frac{1}{n_m} \sum_{i=1}^{n_m} |y_i - \bar{y}_m|. \] MAE is useful when you want to minimize large deviations and can be more robust in cases where outliers are present in the data.
10.2.4 Ames Housing Example
The Ames Housing data are used to illustrate a regression tree model for predicting log house price.
As before, we retrive the data from OpenML.
A decision tree partitions the feature space into regions where the average log price is relatively constant.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import plotnine as gg
import pandas as pd
numeric_features = [
"OverallQual", "GrLivArea", "GarageCars",
"TotalBsmtSF", "YearBuilt", "FullBath"
]
categorical_features = ["KitchenQual"]
preprocessor = ColumnTransformer([
("num", "passthrough", numeric_features),
("cat", OneHotEncoder(drop="first"), categorical_features)
])
X = df[numeric_features + categorical_features]
y = df["LogPrice"]
depths = range(2, 11)
cv_scores = [
cross_val_score(
Pipeline([
("pre", preprocessor),
("model", DecisionTreeRegressor(max_depth=d, random_state=0))
]),
X, y, cv=5, scoring="r2"
).mean()
for d in depths
]
list(zip(depths, cv_scores))[(2, 0.594134995704976),
(3, 0.6712972027857058),
(4, 0.7141792234890973),
(5, 0.748794485599919),
(6, 0.7587964739851765),
(7, 0.7343953839481492),
(8, 0.7186324525934304),
(9, 0.7112790937242873),
(10, 0.6938966980572193)]
Cross-validation identifies an appropriate tree depth that balances fit and generalization. A too-deep tree overfits, while a shallow tree misses structure.
dt = Pipeline([
("pre", preprocessor),
("model", DecisionTreeRegressor(max_depth=4, random_state=0))
])
dt.fit(X, y)
y_pred = dt.predict(X)
df_pred = pd.DataFrame({"Observed": y, "Predicted": y_pred})
(gg.ggplot(df_pred, gg.aes(x="Observed", y="Predicted")) +
gg.geom_point(alpha=0.5) +
gg.geom_abline(slope=1, intercept=0, linetype="dashed") +
gg.labs(title="Decision Tree Regression on Ames Housing",
x="Observed Log Price", y="Predicted Log Price"))
The plot shows predicted versus observed log prices. A well-fitted model has points close to the diagonal. The decision tree naturally captures nonlinear effects and interactions, though its predictions are piecewise constant, producing visible step patterns.
10.3 Gradient-Boosted Models
Gradient boosting is a powerful ensemble technique in machine learning that combines multiple weak learners into a strong predictive model. Unlike bagging methods, which train models independently, gradient boosting fits models sequentially, with each new model correcting errors made by the previous ensemble (Friedman, 2001). While decision trees are commonly used as weak learners, gradient boosting can be generalized to other base models. This iterative method optimizes a specified loss function by repeatedly adding models designed to reduce residual errors.
10.3.1 Introduction
Gradient boosting builds on the general concept of boosting, aiming to construct a strong predictor from an ensemble of sequentially trained weak learners. The weak learners are often shallow decision trees (stumps), linear models, or generalized additive models (Hastie et al., 2009). Each iteration adds a new learner focusing primarily on the data points poorly predicted by the existing ensemble, thereby progressively enhancing predictive accuracy.
Gradient boosting’s effectiveness stems from:
- Error Correction: Each iteration specifically targets previous errors, refining predictive accuracy.
- Weighted Learning: Iteratively focuses more heavily on difficult-to-predict data points.
- Flexibility: Capable of handling diverse loss functions and various types of predictive tasks.
The effectiveness of gradient-boosted models has made them popular across diverse tasks, including classification, regression, and ranking. Gradient boosting forms the foundation for algorithms such as XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018), known for their high performance and scalability.
10.3.2 Gradient Boosting Process
Gradient boosting builds an ensemble by iteratively minimizing the residual errors from previous models. This iterative approach optimizes a loss function, \(L(y, F(x))\), where \(y\) represents the observed target variable and \(F(x)\) the model’s prediction for a given feature vector \(x\).
Key concepts:
- Loss Function: Guides model optimization, such as squared error for regression or logistic loss for classification.
- Learning Rate: Controls incremental updates, balancing training speed and generalization.
- Regularization: Reduces overfitting through tree depth limitation, subsampling, and L1/L2 penalties.
10.3.2.1 Model Iteration
The gradient boosting algorithm proceeds as follows:
Initialization: Define a base model \(F_0(x)\), typically the mean of the target variable for regression or the log-odds for classification.
Iterative Boosting: At each iteration \(m\):
Compute pseudo-residuals representing the negative gradient of the loss function at the current predictions. For each observation \(i\): \[ r_i^{(m)} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F(x)=F_{m-1}(x)}, \] where \(x_i\) and \(y_i\) denote the feature vector and observed value for the \(i\)-th observation, respectively. The residuals represent the direction of steepest descent in function space, so fitting a learner to them approximates a gradient descent step minimizing \(L(y, F(x))\).
Fit a new weak learner \(h_m(x)\) to these residuals.
Update the model: \[ F_m(x) = F_{m-1}(x) + \eta \, h_m(x), \] where \(\eta\) is a small positive learning rate (e.g., 0.01–0.1), controlling incremental improvement and reducing overfitting.
In some implementations, the update step includes an additional multiplier determined by a one-dimensional line search that minimizes the loss function at each iteration. Specifically, the optimal step length is defined as \[ \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\bigl(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr), \] leading to an updated model of the form \[ F_m(x) = F_{m-1}(x) + \eta\, \gamma_m\, h_m(x), \] where \(\eta\) remains a shrinkage factor controlling the overall rate of learning, while \(\gamma_m\) adjusts the step size adaptively at each iteration.
Final Model: After \(M\) iterations, the ensemble model is: \[ F_M(x) = F_0(x) + \sum_{m=1}^M \eta \, h_m(x). \]
Stochastic gradient boosting is a variant that enhances gradient boosting by introducing randomness through subsampling at each iteration, selecting a random fraction of data points (typically 50%–80%) to fit the model . This randomness helps reduce correlation among trees, improve model robustness, and lower the risk of overfitting.
10.3.3 Boosted Trees with Ames Housing
Boosted trees apply the gradient boosting framework to decision trees. They build an ensemble of shallow trees, each trained to correct the residual errors of the preceding ones. By sequentially emphasizing observations that are difficult to predict, the model progressively improves its overall predictive accuracy. We now apply gradient boosting using the same preprocessed features. Boosting combines many shallow trees, each correcting the residual errors of its predecessors, to improve predictive accuracy.
from sklearn.ensemble import GradientBoostingRegressor
# Define a range of tree counts for tuning
n_estimators_list = [50, 100, 200, 400]
cv_scores_gb = [
cross_val_score(
Pipeline([
("pre", preprocessor),
("model", GradientBoostingRegressor(
n_estimators=n,
learning_rate=0.05,
max_depth=3,
random_state=0))
]),
X, y, cv=5, scoring="r2"
).mean()
for n in n_estimators_list
]
list(zip(n_estimators_list, cv_scores_gb))[(50, 0.8199019428994425),
(100, 0.8437789746745888),
(200, 0.8451433195728157),
(400, 0.8405078307788265)]
Cross-validation shows how increasing the number of boosting rounds initially improves performance but eventually risks overfitting when too many trees are added.
gb = Pipeline([
("pre", preprocessor),
("model", GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.05,
max_depth=3,
random_state=0))
])
gb.fit(X, y)
y_pred_gb = gb.predict(X)
df_pred_gb = pd.DataFrame({"Observed": y, "Predicted": y_pred_gb})
(gg.ggplot(df_pred_gb, gg.aes(x="Observed", y="Predicted")) +
gg.geom_point(alpha=0.5) +
gg.geom_abline(slope=1, intercept=0, linetype="dashed") +
gg.labs(title="Gradient-Boosted Regression on Ames Housing",
x="Observed Log Price", y="Predicted Log Price"))
The boosted model produces predictions that are generally closer to the 45-degree line than a single tree, reflecting improved accuracy and smoother response across the feature space.
Gradient-boosted trees introduce several parameters that govern model complexity, learning stability, and overfitting control:
n_estimators: the number of trees (boosting rounds). More trees can reduce bias but increase computation and risk of overfitting. learning_rate — the shrinkage parameter \(\eta\) controlling the contribution of each new tree. Smaller values (e.g., 0.05 or 0.01) require more trees but often yield better generalization.max_depth: the maximum depth of each individual tree, limiting the model’s ability to overfit local noise. Shallow trees (depth 2–4) are typical weak learners.subsample: the fraction of data used in each iteration. Values below 1.0 introduce randomness (stochastic boosting), improving robustness and reducing correlation among trees.min_samples_splitandmin_samples_leaf: minimum numbers of observations required for splitting or forming leaves. These control tree granularity and help regularize the model.
In practice, moderate learning rates with a sufficiently large number of estimators and shallow trees often perform best, balancing bias, variance, and computational cost.
10.3.4 XGBoost: Extreme Gradient Boosting
XGBoost is a scalable and efficient implementation of gradient-boosted decision trees (Chen & Guestrin, 2016). It has become one of the most widely used machine learning methods for structured data due to its high predictive performance, regularization capabilities, and speed. XGBoost builds an ensemble of decision trees in a stage-wise fashion, minimizing a regularized objective that balances training loss and model complexity.
The core idea of XGBoost is to fit each new tree to the gradient of the loss function with respect to the model’s predictions. Unlike traditional boosting algorithms like AdaBoost, which use only first-order gradients, XGBoost optionally uses second-order derivatives (Hessians), enabling better convergence and stability (Friedman, 2001).
XGBoost is widely used in data science competitions and real-world applications. It supports regularization (L1 and L2), handles missing values internally, and is designed for distributed computing.
XGBoost builds upon the same foundational idea as gradient boosted machines—sequentially adding trees to improve the predictive model— but introduces a number of enhancements:
| Aspect | Traditional GBM | XGBoost |
|---|---|---|
| Implementation | Basic gradient boosting | Optimized, regularized boosting |
| Regularization | Shrinkage only | L1 and L2 regularization |
| Loss Optimization | First-order gradients | First- and second-order |
| Missing Data | Requires manual imputation | Handled automatically |
| Tree Construction | Depth-wise | Level-wise (faster) |
| Parallelization | Limited | Built-in |
| Sparsity Handling | No | Yes |
| Objective Functions | Few options | Custom supported |
| Cross-validation | External via GridSearchCV |
Built-in xgb.cv |
XGBoost is therefore more suitable for large-scale problems and provides better generalization performance in many practical tasks.
10.4 Random Forest
This section was prepared by Alex Plotnikov, first-year master’s student in statistics.
10.4.1 Random Forest Algorithm
Random forests are an ensemble method that was developed by Leo Breiman in 2001 to improve the performance of tree-based models. This model creates multiple trees in parallel. There are a few key differences in how trees in a random forest are constructed compared to the decision tree algorithm. The main differences are:
- Sample of data used for each tree.
- Number of features selected at each node.
- Trees are allowed to grow deep in random forests.
- Prediction is made based on aggregating results from all trees.

10.4.1.1 Bagging (Bootstrap Aggregation)
Unlike decision trees which are trained on the entire dataset, random forests use bagging to create a sample for each tree. Afterwards, the tree is trained on that specific sample.
The term bagging comes from bootstrap aggregation. This process involves creating a sample set with the same number of records as the original by sampling from the original dataset with replacement. This means that observations may appear multiple times in the sample dataset.
For one draw:
\[ P(\text{not selected}) = 1 - \frac{1}{N}. \]
For N draws (since we sample N times with replacement):
\[ P(\text{not selected in entire sample}) = \left(1 - \frac{1}{N}\right)^N. \]
As N approaches infinity:
\[ \left(1 - \frac{1}{N}\right)^N \approx e^{-1} \approx 0.3679. \]
So, about 36.8% of the original observations are not included in a given bootstrap sample.
Therefore, the remaining portion is:
\[ 1 - e^{-1} \approx 0.632. \]
On average, 63.2% of the values in each sample are non-repeated. Bagging helps reduce overfitting to noise and reducing variance of the outcome for a more stable model.
10.4.1.2 Feature Sampling and Tree Depth
Random forests also calculate impurity on a subset of all features rather than all features as in a decision tree. At each node, the algorithm selects a random subset of features and calculates the impurity metric for these features. At the next node, the algorithm takes another sample of the entire feature set and calculates the metric on this new set of features.
For \(p\) features, the algorithm uses (\(\sqrt{p}\)) features for classification and (\(\frac{p}{3}\)) features for regression.
Trees in random forests are also usually allowed to grow to max depth. The goal of having multiple trees is reducing variance of output and overfitting, and this is achieved through averaging of trees.
10.4.1.3 Prediction
Random forests make predictions by aggregating results from all trees.
For classification, this is done by each tree predicting a label and then the forest combines predictions from all trees to output a label based on a majority vote.
For regression, the forest takes the numerical value predicted by each tree and then calculates the average for the output.
10.4.1.4 Algorithm Comparison
| Decision Trees | Random Forests | |
|---|---|---|
| Data Set | Entire dataset | Bootstrap sample for each tree |
| Features | All features | Random subset at each node |
| Tree Size | Based on input | Can be modified with input but usually until tree fully grows |
| Prediction | Leaf where new data lands | Majority vote of all trees for classification, average for regression |
10.4.2 RandomForestClassifier Parameters
These concepts can be adjusted as parameters in the RandomForestClassifier class in Python. Not all parameters are shown here as they are similar to decision tree parameters:
n_estimators: number of trees to train in forest.max_depth: max depth of the tree in the forest, default value is set toNonewhich allows the tree to grow to max depth.max_features: number of features to use at each node. Default value is'sqrt'and can also be set to'log2'andNone.bootstrap: whether or not to use bootstrap for each tree, ifFalsethen each tree will use the entire dataset.max_samples: the maximum number of samples for each tree can be adjusted rather than the same size as the dataset.
10.4.3 Random Forests in Python (Ames Data)
The below example shows a random forest regression on the Ames data. We construct a plot to show the observed log price against the predicted log price.
import openml
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier # For classification
from sklearn.ensemble import RandomForestRegressor # For regression
from sklearn.metrics import accuracy_score, mean_squared_error # For evaluation
import plotnine as gg
# Load Ames Housing (OpenML ID 42165)
dataset = openml.datasets.get_dataset(42165)
df, *_ = dataset.get_data()
df["LogPrice"] = np.log(df["SalePrice"])
ind_columns = df.columns.drop(['Id','SalePrice','LogPrice']).tolist()
df_encoded = pd.get_dummies(df[ind_columns], drop_first=True)
X = df_encoded
y = df['LogPrice']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# # For Classification
# model = RandomForestClassifier(n_estimators=100, random_state=42)
# model.fit(X_train, y_train)
# For Regression (if 'target' was continuous)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# For graph
dt = RandomForestRegressor(n_estimators=100, random_state=42)
dt.fit(X, y)
y_pred = dt.predict(X)
df_pred = pd.DataFrame({"Observed": y, "Predicted": y_pred})
(gg.ggplot(df_pred, gg.aes(x="Observed", y="Predicted")) +
gg.geom_point(alpha=0.5) +
gg.geom_abline(slope=1, intercept=0, linetype="dashed") +
gg.labs(title="Random Forest Regression on Ames Housing",
x="Observed Log Price", y="Predicted Log Price"))
10.4.4 Further Reading
- Classification and Regression Trees. Brieman, Friedman, Olshen, Stone 1986.
- Bagging Predictors. Brieman 1994.
- Random Forests. Brieman 2001.
- Random Forest Class - scikit
10.5 Neural Networks
Neural networks learn flexible, nonlinear relationships by composing simple computational units. Each unit (or neuron) applies an affine transformation to its inputs followed by a nonlinearity; stacking layers of such units yields a feed-forward mapping from features to predictions. A helpful anchor is logistic regression: it can be viewed as a one-layer neural network with a sigmoid activation. Hidden layers generalize this idea by transforming inputs into progressively more useful internal representations, enabling the model to capture complex patterns beyond linear decision boundaries.
Historically, the field traces back to the early formal model of a neuron by McCulloch & Pitts (1943), followed by Rosenblatt’s perceptron (Rosenblatt, 1958), which was the first trainable linear classifier. The limitations of single-layer perceptrons, notably their inability to model nonlinearly separable functions, were rigorously analyzed by Minsky & Papert (1969), leading to a temporary decline in interest. The introduction of the backpropagation algorithm by Rumelhart et al. (1986) revived the study of neural networks in the 1980s. With the growth of data, advances in hardware, and improved optimization methods, neural networks became the foundation for modern deep learning. Multilayer networks are universal function approximators in principle; in practice, their success depends on careful architectural design, efficient optimization, and appropriate regularization.
For an input vector \(x \in \mathbb{R}^d\), a hidden layer with \(m\) neurons computes \(h = \sigma(Wx + b)\) through an activation function \(\sigma\) \(W \in \mathbb{R}^{m\times d}\) and \(b \in \mathbb{R}^m\). The output layer applies another affine map (and possibly a final activation) to \(h\). Keeping track of these shapes prevents many implementation bugs.
The figure below shows an input layer, one hidden layer, and an output layer with directed connections. (We will formalize the mathematics in the next subsection.)
10.5.1 Structure of a Neural Network
A neural network is composed of layers of interconnected processing units called neurons or nodes. Each neuron receives inputs, applies a linear transformation, and passes the result through a nonlinear activation function. The layers are arranged sequentially so that information flows from input features to intermediate representations and finally to the output.
10.5.2 Mathematical Formulation
Let the input vector be \(x \in \mathbb{R}^{d}\). A hidden layer with \(m\) neurons computes
\[ h = \sigma(Wx + b), \]
where \(W \in \mathbb{R}^{m \times d}\) is the weight matrix, \(b \in \mathbb{R}^{m}\) is the bias vector, and \(\sigma(\cdot)\) is an activation function applied element-wise.
Subsequent layers take \(h\) as input and repeat the same computation, producing successively transformed representations.
If the output layer contains \(k\) neurons with linear activations, the network computes
\[ \hat{y} = Vh + c, \]
where \(V \in \mathbb{R}^{k \times m}\) and \(c \in \mathbb{R}^{k}\) are the output weights and biases.
Collectively, these parameters define the model’s trainable structure.
During a forward pass, inputs propagate layer by layer: \(x \rightarrow h^{(1)} \rightarrow h^{(2)} \rightarrow \cdots
\rightarrow \hat{y}\).
Backward propagation of gradients (discussed later) updates all weights based on prediction error.
10.5.3 Network Diagram
Figure 10.2 illustrates a feed-forward neural network with two hidden layers. Arrows indicate the direction of information flow from one layer to the next. Each connection carries a learnable weight, and nonlinear activations transform the signals as they propagate forward.
Each connection (edge) has an associated weight adjusted during training. The depth of a network refers to the number of hidden layers, and the width refers to the number of neurons per layer.
10.5.4 Activation Functions
Activation functions introduce nonlinearity into neural networks. Without them, the entire network would collapse into a single linear transformation, regardless of the number of layers. Nonlinear activations allow networks to approximate arbitrary complex functions, enabling them to model curved decision boundaries and hierarchical representations.
Let \(z\) denote the input to a neuron before activation. The most common choices are:
| Function | Formula | Range | Key Property |
|---|---|---|---|
| Sigmoid | \(\displaystyle \sigma(z) = \frac{1}{1 + e^{-z}}\) | \((0,1)\) | Smooth, bounded; used in early networks and binary outputs |
| Tanh | \(\displaystyle \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\) | \((-1,1)\) | Zero-centered; stronger gradients than sigmoid |
| ReLU | \(\displaystyle \mathrm{ReLU}(z) = \max(0, z)\) | \([0, \infty)\) | Sparse activations; efficient to compute |
| Leaky ReLU | \(\displaystyle \mathrm{LReLU}(z) = \max(0.01z, z)\) | \((-\infty, \infty)\) | Avoids “dead” neurons of ReLU |
Figure 10.3 shows the shape of these activation functions. Notice how ReLU sharply truncates negative values while sigmoid and tanh saturate for large magnitudes.
10.5.5 Training Neural Networks
Training a neural network means adjusting its parameters so that its predictions \(\hat{y}\) align closely with the true outputs \(y\). This is done by minimizing a loss function that quantifies prediction error. Optimization proceeds iteratively through forward and backward passes.
10.5.5.1 Loss Functions
The choice of loss depends on the task:
| Task | Typical Loss Function | Expression |
|---|---|---|
| Regression | Mean squared error (MSE) | \(\displaystyle L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\) |
| Binary classification | Binary cross-entropy | \(\displaystyle L = -\frac{1}{n}\sum_{i=1}^{n} \left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]\) |
| Multiclass classification | Categorical cross-entropy | \(\displaystyle L = -\frac{1}{n}\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik}\log \hat{y}_{ik}\) |
The loss function \(L(\theta)\) depends on all network parameters \(\theta = \{W, b, V, c, \ldots\}\) and guides the optimization.
10.5.5.2 Gradient Descent
Neural networks are trained by gradient descent, which updates parameters in the opposite direction of the gradient of the loss:
\[ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta^{(t)}), \]
where \(\eta > 0\) is the learning rate controlling the step size. Choosing \(\eta\) too small leads to slow convergence; too large can cause divergence.
Variants such as stochastic and mini-batch gradient descent compute gradients on subsets of data to speed learning and improve generalization.
Figure 10.4 gives a two-dimensional analogy of gradient descent. In practice, a neural network may contain millions of parameters, meaning the loss function is defined over an extremely high-dimensional space. The contours shown here merely represent a projection of that space onto two dimensions for illustration.
The true optimization surface is much more complex—full of flat regions, sharp valleys, and numerous local minima—but the intuition of moving “downhill” along the loss gradient remains valid.
10.5.6 Regularization and Overfitting
Because neural networks contain large numbers of parameters, they can easily overfit training data—capturing noise or random fluctuations rather than generalizable patterns. A model that overfits performs well on the training set but poorly on new data. Regularization techniques introduce constraints or randomness that help the network generalize.
10.5.6.1 The Bias–Variance Trade-Off
Adding model complexity (more layers or neurons) reduces bias but increases variance. The goal of regularization is to find a balance between these forces. Figure 10.5 provides a schematic view: the unregularized model follows every data fluctuation, while a regularized model captures only the underlying trend.
10.5.6.2 Common Regularization Techniques
Among the simplest and most widely used techniques is L2 weight decay, which adds a penalty term to discourage large weights:
\[ L_{\text{total}} = L_{\text{data}} + \lambda \sum_{l}\sum_{i,j} (w_{ij}^{(l)})^2, \]
where \(\lambda\) controls the strength of the constraint. Other general approaches include dropout, which randomly deactivates neurons during training to prevent reliance on any single pathway, and early stopping, which halts training once the validation loss stops improving. These ideas apply to networks of all depths and form the foundation for more advanced regularization strategies discussed later in the deep learning section.
Regularization methods restrict how freely the model parameters can adapt to training data. This constraint encourages smoother mappings, reduces sensitivity to noise, and improves generalization to unseen inputs.
- Start with small \(\lambda\) for L2 weight decay.
- Use dropout (e.g., \(p = 0.2\)–\(0.5\)) between dense layers.
- Always track training vs. validation loss curves to detect overfitting.
10.5.7 Example: A Simple Feed-Forward Network
To illustrate how a neural network operates in practice, we train a simple multilayer perceptron (MLP) on a two-dimensional nonlinear dataset. The model learns a curved decision boundary that cannot be captured by linear classifiers.
The two-moons dataset consists of two interleaving half-circles, forming a pattern that resembles a pair of crescents. Each point represents an observation with two features \((x_1, x_2)\), and the color indicates its class. The two classes overlap slightly due to added random noise.
A linear classifier such as logistic regression would draw a straight line through the plane, misclassifying many points. In contrast, a neural network can learn the nonlinear boundary that curves along the interface between the moons. This problem, although simple, captures the essence of nonlinear learning in higher dimensions.
We generate the data using make_moons() from scikit-learn and train a feed-forward neural network with two hidden layers. The model’s task is to assign each observation to one of the two moon shapes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
# Generate synthetic data
X, y = make_moons(n_samples=500, noise=0.25, random_state=1023)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Define and train a small neural network
mlp = MLPClassifier(hidden_layer_sizes=(10, 5),
activation='relu',
solver='adam',
alpha=0.001,
max_iter=2000,
random_state=1023)
mlp.fit(X_train, y_train)MLPClassifier(alpha=0.001, hidden_layer_sizes=(10, 5), max_iter=2000,
random_state=1023)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
| hidden_layer_sizes | (10, ...) | |
| activation | 'relu' | |
| solver | 'adam' | |
| alpha | 0.001 | |
| batch_size | 'auto' | |
| learning_rate | 'constant' | |
| learning_rate_init | 0.001 | |
| power_t | 0.5 | |
| max_iter | 2000 | |
| shuffle | True | |
| random_state | 1023 | |
| tol | 0.0001 | |
| verbose | False | |
| warm_start | False | |
| momentum | 0.9 | |
| nesterovs_momentum | True | |
| early_stopping | False | |
| validation_fraction | 0.1 | |
| beta_1 | 0.9 | |
| beta_2 | 0.999 | |
| epsilon | 1e-08 | |
| n_iter_no_change | 10 | |
| max_fun | 15000 |
Afer the model is trained, we evaluate its accuracy.
# Evaluate accuracy
acc_train = mlp.score(X_train, y_train)
acc_test = mlp.score(X_test, y_test)
# Prepare grid for decision boundary
xx, yy = np.meshgrid(np.linspace(-2, 3, 300),
np.linspace(-1.5, 2, 300))
Z = mlp.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# Plot
fig, ax = plt.subplots(figsize=(6, 5))
ax.contourf(xx, yy, Z, cmap="coolwarm", alpha=0.3)
scatter = ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="coolwarm", edgecolor="k", s=40)
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_title(f"Two-Moons Classification (Train Acc: {acc_train:.2f}, Test Acc: {acc_test:.2f})")
plt.show()
The trained network achieves high accuracy on both the training and test sets, demonstrating good generalization. The contour regions in Figure Figure 10.6 reveal the curved boundary the network has learned. Each hidden layer transforms the input coordinates into a new representation where the two classes become more separable. After a few layers of nonlinear transformations, the output layer can classify the points with a simple linear threshold.
Even this small network, with only a few dozen parameters, can learn a highly nonlinear decision surface. This illustrates the expressive power of multilayer neural networks, even in low-dimensional spaces.
Increase the noise level in the data to test robustness. Change the hidden layer sizes (e.g., (3,), (20,10)), or the activation function (tanh, logistic). Adjust the regularization parameter alpha to see how it affects smoothness of the boundary.
10.6 Neural Networks for Data with Temporal Dependence
Many real-world datasets are sequential, where earlier observations influence what happens later. Examples include electricity demand over hours, temperature across days, and stock prices through trading sessions. Such data exhibit temporal dependence, meaning that successive observations are not independent.
Traditional supervised learning models, such as linear regression and feedforward neural networks, treat each observation as if it were independent. When applied directly to time-ordered data, they fail to capture how information evolves through time. A prediction for one step does not reflect patterns that unfolded earlier.
To learn from sequential patterns, we need models that can remember what has already occurred and use that information to improve predictions. Neural networks designed for temporal dependence achieve this by introducing internal states that are updated as the sequence unfolds. The simplest such model is the recurrent neural network (RNN), which forms the basis for more advanced architectures such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks.
10.6.1 Recurrent Neural Networks (RNNs)
To model data with temporal dependence, a neural network must be able to retain information about what has happened previously. A recurrent neural network (RNN) accomplishes this by maintaining an internal hidden state that evolves over time. The hidden state acts as a summary of all past inputs and is updated as new data arrive.
At each time step \(t\), an RNN receives an input vector \(x_t\) and produces a hidden state \(h_t\) according to
\[ h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h), \]
where \(W_h\) and \(W_x\) are weight matrices and \(b_h\) is a bias term. The output at the same step can be expressed as
\[ \hat{y}_t = \sigma(W_y h_t + b_y), \]
with \(\sigma(\cdot)\) representing an activation or link function. Because \(h_t\) depends on \(h_{t-1}\), the network can in principle capture relationships across time.
The initial hidden state \(h_0\) must be specified before the sequence starts. In most applications, \(h_0\) is set to a vector of zeros with the same dimension as \(h_t\), allowing the network to begin without prior memory. This default works well because the recurrent updates quickly overwrite the initial state as new inputs arrive. In some advanced or stateful applications, \(h_0\) can instead be learned during training or carried over from the final state of a previous sequence, enabling the model to preserve continuity across batches.
Before training can begin, an objective function must be defined to measure how well the network predicts the target sequence. For a series of observations \(\{(x_t, y_t)\}_{t=1}^T\), the total loss is typically the sum of stepwise prediction errors, \[ \mathcal{L} = \sum_{t=1}^T \ell(y_t, \hat{y}_t), \] where \(\ell\) is a suitable loss such as mean squared error for regression or cross-entropy for classification. The gradients of \(\mathcal{L}\) with respect to the network parameters are then computed and used to update the weights through backpropagation through time.
Figure 10.7 illustrates how an RNN can be unrolled across time steps, showing that the same set of weights is reused at each step. The hidden state serves as a bridge between past and present inputs, allowing the network to accumulate information through time.
Training an RNN is done by backpropagation through time (BPTT), which unrolls the network over all time steps and applies gradient descent. However, when sequences are long, the repeated multiplication of gradients can lead to vanishing or exploding gradients. This makes it difficult for a standard RNN to learn long-term dependencies, limiting its ability to remember events far in the past.
In many applications, temporal dependence is only one part of the problem. Alongside the time-varying input \(x_t\), there may be additional covariates \(z\) that describe static or slowly changing characteristics, such as a station ID, region, or weather condition. These can be incorporated into an RNN by concatenating them with \(x_t\) at each time step or by feeding them into separate layers whose outputs are combined with the recurrent representation. In practice, the design depends on whether such covariates are constant across time or vary together with the sequence.
To address the limitations of standard RNNs, researchers developed architectures that explicitly control how information is remembered or forgotten. The most influential of these is the LSTM network, which introduces a structured memory cell and gating mechanisms to stabilize learning over longer sequences.
10.6.2 Long Short-Term Memory (LSTM)
The main limitation of a standard RNN is its inability to retain information over long sequences. During backpropagation through time, gradients tend to either vanish or explode, preventing effective learning of long-term dependencies. The Long Short-Term Memory (LSTM) network, proposed by Hochreiter and Schmidhuber (1997), was designed to overcome this problem.
An LSTM introduces a separate cell state \(C_t\) that acts as a highway for information to flow across time steps, along with gating mechanisms that regulate what to remember and what to forget. The gates use sigmoid activations to produce values between 0 and 1, allowing the network to scale information rather than overwrite it.
The key update equations of an LSTM are
\[ \begin{aligned} f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f), \\ i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i), \\ \tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C), \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \\ o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o), \\ h_t &= o_t \odot \tanh(C_t), \end{aligned} \]
where \(\odot\) denotes element-wise (Hadamard) multiplication and \(\sigma(\cdot)\) is the logistic sigmoid function. Each gate \(f_t\), \(i_t\), and \(o_t\) outputs values between 0 and 1 that determine how information flows through the cell.
The activation functions \(\tanh(\cdot)\) and \(\sigma(\cdot)\) play specific roles in the LSTM design. The sigmoid \(\sigma\) compresses values to the range \((0,1)\), making it suitable for gate control because it behaves like a smooth on–off switch. The hyperbolic tangent \(\tanh\) maps inputs to \((-1,1)\), allowing both positive and negative contributions to the cell state.
Other activation functions can in principle replace \(\tanh\), such as ReLU or Leaky ReLU, but this is uncommon in practice. ReLU may cause the cell state to grow without bound, and smooth symmetric activations like \(\tanh\) are generally more stable for recurrent updates. Some modern variants, such as the Peephole LSTM and GRU, adjust or simplify these activations, but the original combination of \(\sigma\) and \(\tanh\) remains the standard choice.
Each of the three gates in an LSTM serves a distinct role.
The forget gate \(f_t\) determines how much of the previous cell state \(C_{t-1}\) should be retained, effectively deciding what information to discard. The input gate \(i_t\) controls how much new information \(\tilde{C}_t\) enters the cell state, allowing the network to incorporate relevant updates. The output gate \(o_t\) regulates how much of the cell state is exposed as the hidden state \(h_t\), influencing the network’s prediction at the current step. Together, these gates maintain a balance between remembering long-term patterns and adapting to new signals. Figure Figure 10.8 illustrates how the three gates interact with the cell state and hidden states to manage information flow through time.
10.6.3 Gated Recurrent Unit (GRU)
The Gated Recurrent Unit (GRU), introduced by Cho et al. (2014), is a simplified variant of the LSTM that retains its ability to capture long-term dependencies while using fewer parameters. The GRU combines the roles of the input and forget gates into a single update gate and omits the separate cell state \(C_t\), relying only on the hidden state \(h_t\) to store information.
The GRU update equations are
\[ \begin{aligned} z_t &= \sigma(W_z [h_{t-1}, x_t] + b_z), \\ r_t &= \sigma(W_r [h_{t-1}, x_t] + b_r), \\ \tilde{h}_t &= \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h), \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, \end{aligned} \]
where \(z_t\) is the update gate and \(r_t\) is the reset gate. The update gate controls how much of the previous hidden state to keep, while the reset gate determines how strongly past information should influence the new candidate state \(\tilde{h}_t\).
The structure of a GRU cell is illustrated in Figure 10.9. Compared with an LSTM, the GRU is computationally simpler because it has no separate cell state and fewer matrix operations. Despite this simplification, GRUs often perform as well as LSTMs, especially when datasets are smaller or sequence lengths are moderate.
10.6.4 Example: Forecasting a Synthetic Sequential Signal (PyTorch)
To compare recurrent architectures in a reproducible way, we use a synthetic sine-wave signal with random noise. This allows us to train RNN, LSTM, and GRU models side-by-side without large datasets or external dependencies.
10.6.4.1 Step 1. Generate the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)
time = np.arange(0, 200)
signal = np.sin(time / 6) + 0.3 * np.random.randn(200)
df = pd.DataFrame({"time": time, "signal": signal})
plt.figure(figsize=(6, 3))
plt.plot(df["time"], df["signal"])
plt.xlabel("Time")
plt.ylabel("Signal")
plt.tight_layout()
plt.show()
10.6.4.2 Step 2. Prepare input sequences
Each training example uses the previous 20 observations to predict the next value.
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import torch
from torch import nn
L = 20
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["signal"]])
X, y = [], []
for t in range(L, len(scaled)):
X.append(scaled[t-L:t, 0])
y.append(scaled[t, 0])
X, y = np.array(X), np.array(y)
X = X.reshape(X.shape[0], L, 1)
# split and convert to tensors
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, shuffle=False)
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)10.6.4.3 Step 3. Define recurrent models
class RecurrentModel(nn.Module):
def __init__(self, rnn_type="RNN", hidden_size=50):
super().__init__()
if rnn_type == "LSTM":
self.rnn = nn.LSTM(1, hidden_size, batch_first=True)
elif rnn_type == "GRU":
self.rnn = nn.GRU(1, hidden_size, batch_first=True)
else:
self.rnn = nn.RNN(1, hidden_size, batch_first=True)
self.fc = nn.Linear(hidden_size, 1)
def forward(self, x):
out, _ = self.rnn(x)
return self.fc(out[:, -1, :])10.6.4.4 Step 4. Train and evaluate
def train_model(model, X, y, epochs=50, lr=0.01):
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
for _ in range(epochs):
optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
optimizer.step()
return loss.item()
def predict(model, X):
model.eval()
with torch.no_grad():
return model(X).numpy()
models = {
"RNN": RecurrentModel("RNN"),
"LSTM": RecurrentModel("LSTM"),
"GRU": RecurrentModel("GRU")
}
for name, m in models.items():
final_loss = train_model(m, X_train, y_train)
print(f"{name} final training loss: {final_loss:.5f}")
y_preds = {name: predict(m, X_test) for name, m in models.items()}RNN final training loss: 0.01107
LSTM final training loss: 0.02883
GRU final training loss: 0.01470
10.6.4.5 Step 5. Compare RMSE and MAE
from sklearn.metrics import mean_squared_error, mean_absolute_error
def metrics(y_true, y_pred):
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
return rmse, mae
for name, y_hat in y_preds.items():
rmse, mae = metrics(y_test, y_hat)
print(f"{name:5s} – RMSE: {rmse:.4f}, MAE: {mae:.4f}")RNN – RMSE: 0.0992, MAE: 0.0828
LSTM – RMSE: 0.1803, MAE: 0.1533
GRU – RMSE: 0.1121, MAE: 0.0928
10.6.4.6 Step 6. Visual comparison
plt.figure(figsize=(6.5, 4))
plt.plot(y_test[:100], label="Observed", color="black")
for name, y_hat in y_preds.items():
plt.plot(y_hat[:100], label=name, alpha=0.7)
plt.xlabel("Time index")
plt.ylabel("Scaled signal")
plt.legend()
plt.tight_layout()
plt.show()
10.6.4.7 Discussion
All three networks capture the oscillatory pattern, but the vanilla RNN has difficulty preserving phase alignment when the sequence is long. Both the LSTM and GRU learn the dependency structure more reliably. The GRU reaches nearly the same accuracy as the LSTM while training faster, thanks to its simpler gating design.