9 Unsupervised Learning
9.1 K-Means Clustering
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
So far, we have explored various supervised learning algorithms such as Decision Trees and Random Forests, which rely on labeled data with known outcomes. In contrast, unsupervised learning techniques analyze unlabeled data to identify patterns, making them particularly useful for clustering and association problems. Among these, K-means clustering stands out as one of the simplest and most widely used algorithms.
K-means clustering aims to divide a dataset into non-overlapping groups based on similarity. Given a set of data points, each represented as a vector in a multi-dimensional space, the algorithm assigns each point to one of \(k\) clusters in a way that minimizes the variation within each cluster. This is done by reducing the sum of squared distances between each point and its assigned cluster center. Mathematically, we seek to minimize:
\[\begin{equation*} \sum_{i=1}^{k}\sum_{\boldsymbol{x}\in S_i} \left\|\boldsymbol{x}-\boldsymbol{\mu}_i\right\|^2 \end{equation*}\]
where \(S_i\) represents each cluster and \(\boldsymbol{\mu}_i\) is the mean of the points within that cluster.
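To make the objective concrete, here is a minimal sketch (using a tiny made-up dataset and a fixed assignment of points to \(k = 2\) clusters) that evaluates this sum directly; a fitted scikit-learn KMeans model reports the same quantity in its inertia_ attribute.
# Evaluate the within-cluster sum of squares for a toy example
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.1, 4.9]])
labels_toy = np.array([0, 0, 0, 1, 1])  # fixed assignment to k = 2 clusters

wcss = 0.0
for i in np.unique(labels_toy):
    cluster = X_toy[labels_toy == i]
    mu_i = cluster.mean(axis=0)              # cluster mean (centroid)
    wcss += ((cluster - mu_i) ** 2).sum()    # squared distances to the centroid

print("Within-cluster sum of squares:", wcss)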
9.1.1 Lloyd’s Algorithm
K-means clustering is typically solved using Lloyd's algorithm, which operates iteratively as follows (a minimal implementation sketch appears after the list):
- Initialization: Select \(k\) initial cluster centroids \(\boldsymbol{\mu}_i\) randomly.
- Iteration:
  - Assignment step: Assign each point \(\boldsymbol{x}\) to the cluster whose centroid is closest based on the squared Euclidean distance.
  - Update step: Recompute the centroids as the mean of all points assigned to each cluster:
\[\begin{equation*} \boldsymbol{\mu}_i \leftarrow \frac{1}{|S_i|} \sum_{\boldsymbol{x}_j \in S_i} \boldsymbol{x}_j \end{equation*}\]
- Termination: The process stops when either the assignments no longer change or a predefined number of iterations is reached.
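The following minimal NumPy sketch implements these steps directly. It assumes X is a NumPy array of shape (n, d), uses random initialization by sampling \(k\) distinct points, caps the number of iterations, and does not handle empty clusters; scikit-learn's implementation is more robust.
# Minimal sketch of Lloyd's algorithm (illustrative only; assumes no cluster
# ever becomes empty during the iterations)
def lloyd_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialization
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Termination: stop when the centroids (and hence assignments) are stable
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels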
9.1.2 Example: Iris Data
K-means clustering can be implemented using the scikit-learn library. Below, we apply it to the Iris dataset.
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # Using only two features
y = iris.target
We visualize the observations based on their true species labels.
# Scatter plot of true species labels
fig, ax = plt.subplots()
scatter = ax.scatter(X[:, 0], X[:, 1], c=y,
                     cmap='viridis', edgecolors='k')
ax.legend(*scatter.legend_elements(), loc="upper left",
          title="Species")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("True Species Distribution")
plt.show()
Now, we apply K-means clustering to the data.
# Train K-means model
Kmean = KMeans(n_clusters=3, init='k-means++',
               n_init=10, random_state=42)
Kmean.fit(X)
KMeans(n_clusters=3, n_init=10, random_state=42)
Several parameters can be adjusted for better performance. See: <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html>
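For example, the number of restarts, the iteration cap, and the convergence tolerance can all be adjusted; the sketch below shows a few of these parameters with illustrative (not recommended) values.
# Illustrative alternative configuration (values chosen for demonstration only)
Kmean_alt = KMeans(
    n_clusters=3,      # number of clusters k
    init='random',     # random initialization instead of k-means++
    n_init=25,         # number of restarts; the run with the lowest inertia is kept
    max_iter=500,      # maximum Lloyd iterations per restart
    tol=1e-5,          # tolerance on centroid movement used to declare convergence
    random_state=42,
)
Kmean_alt.fit(X)
print("Within-cluster sum of squares (inertia):", Kmean_alt.inertia_)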
K-means returns the cluster centroids, i.e., the coordinates of each cluster's center.
# Print predicted cluster centers
print("Cluster Centers:")
print(Kmean.cluster_centers_)
Cluster Centers:
[[6.81276596 3.07446809]
[5.77358491 2.69245283]
[5.006 3.428 ]]
We plot the centroids along with the clustered points.
# Plot centroids on the scatter plot
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=Kmean.labels_,
           cmap='viridis', edgecolors='k', alpha=0.5)
ax.scatter(Kmean.cluster_centers_[:, 0],
           Kmean.cluster_centers_[:, 1],
           c="black", s=200, marker='s',
           label="Centroids")
ax.legend()
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("K-means Clustering Results")
plt.show()
9.1.2.1 Comparing True and Predicted Labels
By plotting the results side by side, we can see how well K-means clustering approximates the true labels.
# Compare true vs. predicted labels
fig, axs = plt.subplots(ncols=2, figsize=(12, 5),
                        constrained_layout=True)

# True labels plot
axs[0].scatter(X[:, 0], X[:, 1], c=y,
               cmap='viridis', alpha=0.5,
               edgecolors='k')
axs[0].set_title("True Labels")
axs[0].set_xlabel("Feature 1")
axs[0].set_ylabel("Feature 2")

# Predicted clusters plot
axs[1].scatter(X[:, 0], X[:, 1], c=Kmean.labels_,
               cmap='viridis', alpha=0.5,
               edgecolors='k')
axs[1].scatter(Kmean.cluster_centers_[:, 0],
               Kmean.cluster_centers_[:, 1],
               marker="s", c="black", s=200,
               alpha=1, label="Centroids")
axs[1].set_title("Predicted Clusters")
axs[1].set_xlabel("Feature 1")
axs[1].set_ylabel("Feature 2")
axs[1].legend()
plt.show()
9.1.3 Making Predictions on New Data
Once trained, the model can assign new data points to their nearest cluster.
# Sample test data points
sample_test = np.array([[3, 4], [7, 4]])

# Predict cluster assignment
print("Predicted Clusters:", Kmean.predict(sample_test))
Predicted Clusters: [2 0]
9.1.4 Discussion
K-means is intuitive but has limitations:
- Sensitivity to initialization: Poor initialization can yield suboptimal results; the k-means++ scheme mitigates this issue.
- Choosing the number of clusters: The choice of \(k\) is critical. The elbow method (see the sketch after this list) helps determine an optimal value.
- Assumption of spherical clusters: K-means struggles when clusters have irregular shapes. Alternative methods such as kernel-based clustering may be more effective.
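As a concrete illustration of the elbow method, the following sketch (reusing the two Iris features in X) fits K-means for a range of values of \(k\) and plots the within-cluster sum of squares; the point where the curve bends and flattens suggests a reasonable choice of \(k\).
# Elbow method: within-cluster sum of squares (inertia) versus k
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel("Number of clusters k")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow Method for Choosing k")
plt.show()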
Despite its limitations, K-means is a fundamental tool in exploratory data analysis and practical applications.
9.2 Stochastic Neighbor Embedding
Stochastic Neighbor Embedding (SNE) is a dimensionality reduction technique used to project high-dimensional data into a lower-dimensional space (often 2D or 3D) while preserving local neighborhoods of points. It is particularly popular for visualization tasks, helping to reveal clusters or groupings among similar points. Key characteristics include:
- Unsupervised: It does not require labels, relying on similarity or distance metrics among data points.
- Probabilistic framework: Pairwise distances in the original space are interpreted as conditional probabilities, which SNE attempts to replicate in the lower-dimensional space.
- Common for exploratory data analysis: Especially useful for high-dimensional datasets such as images, text embeddings, or genetic data.
9.2.1 Statistical Rationale
The core idea behind SNE is to preserve local neighborhoods of each point in the data:
- For each point \(x_i\) in the high-dimensional space, SNE defines a conditional probability \(p_{j|i}\) that represents how likely \(x_j\) is to be a neighbor of \(x_i\).
- The probability \(p_{j|i}\) is modeled using a Gaussian distribution centered on \(x_i\):
\[ p_{j|i} = \frac{\exp\left(- \| x_i - x_j \|^2 / 2 \sigma_i^2\right)}{\sum_{k \neq i} \exp\left(- \| x_i - x_k \|^2 / 2 \sigma_i^2\right)}, \] where \(\sigma_i\) is a variance parameter controlling the neighborhood size.
- Each point \(x_i\) is mapped to a lower-dimensional counterpart \(y_i\), and a corresponding probability \(q_{j|i}\) is defined similarly in that space.
- The objective function minimizes the Kullback–Leibler (KL) divergence between the high-dimensional and low-dimensional conditional probabilities, encouraging a faithful representation of local neighborhoods.
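Written out over all points, this cost function is
\[ C \;=\; \sum_{i} \mathrm{KL}\!\left(P_i \,\middle\|\, Q_i\right) \;=\; \sum_{i} \sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \]
which penalizes most heavily the cases where points that are close in the original space (large \(p_{j|i}\)) are placed far apart in the embedding (small \(q_{j|i}\)).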
9.2.2 t-SNE Variation
t-SNE (t-distributed Stochastic Neighbor Embedding) addresses two main issues in the original formulation of SNE:
- The crowding problem: A high-dimensional space has far more room for points at moderate distances than 2D or 3D does, so a faithful low-dimensional map tends to squeeze such points together. t-SNE uses a Student t-distribution (with one degree of freedom) in the low-dimensional space, which has heavier tails than a Gaussian.
- Symmetric probabilities: t-SNE symmetrizes probabilities \(p_{ij} = (p_{j|i} + p_{i|j}) / (2N)\), simplifying computation.
The Student t-distribution for low-dimensional similarity is given by: \[ q_{ij} = \frac{\bigl(1 + \| y_i - y_j \|^2 \bigr)^{-1}}{\sum_{k \neq l} \bigl(1 + \| y_k - y_l \|^2 \bigr)^{-1}}. \] This heavier tail ensures that distant points are not forced too close, thus reducing the crowding effect.
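To make these quantities concrete, here is a small NumPy sketch on made-up data. It uses a single fixed \(\sigma\) for every point (real t-SNE tunes each \(\sigma_i\) to match a user-chosen perplexity) and builds the symmetrized \(p_{ij}\), the Student-t \(q_{ij}\), and the KL divergence that the embedding coordinates \(y_i\) are optimized to reduce.
# Toy illustration of the t-SNE similarity matrices (single fixed sigma)
import numpy as np

rng = np.random.default_rng(0)
X_hi = rng.normal(size=(6, 10))   # six points in a 10-dimensional space
Y_lo = rng.normal(size=(6, 2))    # a candidate 2-D embedding of the same points
sigma = 1.0                       # shared sigma; real t-SNE picks sigma_i per point

def pairwise_sq_dists(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Conditional probabilities p_{j|i} from a Gaussian kernel, rows normalized
D_hi = pairwise_sq_dists(X_hi)
P_cond = np.exp(-D_hi / (2 * sigma ** 2))
np.fill_diagonal(P_cond, 0.0)                 # exclude j = i
P_cond /= P_cond.sum(axis=1, keepdims=True)

# Symmetrized joint probabilities p_{ij}
N = X_hi.shape[0]
P = (P_cond + P_cond.T) / (2 * N)

# Low-dimensional similarities q_{ij} from a Student t kernel (1 d.o.f.)
D_lo = pairwise_sq_dists(Y_lo)
Q = 1.0 / (1.0 + D_lo)
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# KL divergence between P and Q; t-SNE moves the y_i to reduce this value
kl = np.sum(P * np.log(np.maximum(P, 1e-12) / np.maximum(Q, 1e-12)))
print("KL(P || Q):", kl)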
9.2.3 Supervised Variation
Although SNE and t-SNE are fundamentally unsupervised, it is possible to integrate label information. In a supervised variant, distances between similarly labeled points may be reduced (or differently weighted), and additional constraints can be imposed to promote class separation in the lower-dimensional embedding. These approaches can help when partial label information is available and you want to blend supervised and unsupervised insights.
9.2.4 Demonstration with a Subset of the NIST Digits Data
Below is a brief example in Python using t-SNE on a small subset of the MNIST digits (which is itself a curated subset of the original NIST data).
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

mnist = fetch_openml('mnist_784', version=1)
X = mnist.data[:2000]
y = mnist.target[:2000]

tsne = TSNE(n_components=2, perplexity=30, learning_rate='auto',
            init='random', random_state=42)
X_embedded = tsne.fit_transform(X)

# Create a separate scatter plot for each digit to show a legend
plt.figure()
digits = np.unique(y)
for digit in digits:
    idx = (y == digit)
    plt.scatter(
        X_embedded[idx, 0],
        X_embedded[idx, 1],
        label=f"Digit {digit}",
        alpha=0.5
    )
plt.title("t-SNE on a Subset of MNIST Digits (by class)")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.legend()
plt.show()
In the visualization:
- Points belonging to the same digit typically cluster together.
- Ambiguous or poorly written digits often end up bridging two clusters.
- Some digits, such as 3 and 5, may be visually similar and can appear partially overlapping in the 2D space.