Classification is one of the most widely used tasks in data science, concerned with predicting categorical outcomes rather than continuous quantities. Many real-world problems can be framed as classification, such as diagnosing a disease from medical records, determining whether a loan applicant is likely to default, or identifying spam emails. Compared with regression, which models numeric responses, classification methods aim to assign observations into predefined classes based on their features. Logistic regression, introduced in the previous chapter, provides a natural transition: it uses a regression framework to model the probability of class membership. In this chapter, we expand beyond logistic regression to study how classification models are evaluated and to introduce other methods developed for classification tasks.
9.1 Introduction to Classification
Classification problems arise when the outcome of interest is categorical rather than continuous. Instead of predicting a numerical quantity, the task is to assign each observation to one of several predefined classes. Examples include deciding whether an email is spam or not, predicting a patient’s disease status from clinical measures, or determining whether a financial transaction is fraudulent. These problems are ubiquitous across domains and often require different tools from those used in regression.
A widely used dataset for illustrating binary classification is the Breast Cancer Wisconsin (Diagnostic) dataset. It contains information on 569 patients, each described by 30 numerical features computed from digitized images of fine needle aspirates of breast masses. These features summarize characteristics of the cell nuclei, such as radius, texture, perimeter, smoothness, and concavity, with versions capturing mean, variation, and extreme values. The outcome variable records whether the tumor is malignant or benign. Because the features are all numeric and the outcome is binary, this dataset provides an ideal setting for introducing classification methods and performance evaluation.
Before building classification models, it is useful to perform exploratory data analysis (EDA) to understand the structure of the data. We first load the dataset from scikit-learn.
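A minimal loading sketch is shown below; it assumes the features are assembled into a pandas DataFrame named df with a categorical diagnosis column, which is how the summary and plotting code later in this section refers to the data.

from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the data as a Bunch with features, target, and metadata
data = load_breast_cancer()

# Assemble the 30 numeric features into a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add a readable label: in scikit-learn's encoding, 0 = malignant, 1 = benign
df["diagnosis"] = pd.Categorical.from_codes(data.target, data.target_names)

df.shape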
We can also examine summary statistics of the numeric features.
df.describe().T.head(10)
                           count        mean         std        min        25%        50%        75%         max
mean radius                569.0   14.127292    3.524049    6.98100   11.70000   13.37000   15.78000   28.11000
mean texture               569.0   19.289649    4.301036    9.71000   16.17000   18.84000   21.80000   39.28000
mean perimeter             569.0   91.969033   24.298981   43.79000   75.17000   86.24000  104.10000  188.50000
mean area                  569.0  654.889104  351.914129  143.50000  420.30000  551.10000  782.70000 2501.00000
mean smoothness            569.0    0.096360    0.014064    0.05263    0.08637    0.09587    0.10530    0.16340
mean compactness           569.0    0.104341    0.052813    0.01938    0.06492    0.09263    0.13040    0.34540
mean concavity             569.0    0.088799    0.079720    0.00000    0.02956    0.06154    0.13070    0.42680
mean concave points        569.0    0.048919    0.038803    0.00000    0.02031    0.03350    0.07400    0.20120
mean symmetry              569.0    0.181162    0.027414    0.10600    0.16190    0.17920    0.19570    0.30400
mean fractal dimension     569.0    0.062798    0.007060    0.04996    0.05770    0.06154    0.06612    0.09744
Visualization helps reveal differences between classes. For example, we can compare the distributions of a few key features by diagnosis.
from plotnine import ggplot, aes, geom_histogram, labs

features = ["mean radius", "mean texture", "mean area"]
for feature in features:
    p = (
        ggplot(df, aes(x=feature, fill="diagnosis"))
        + geom_histogram(bins=20, alpha=0.5, position="identity")
        + labs(title=feature)
    )
    print(p)  # render one overlaid histogram per feature
These plots suggest that malignant and benign tumors differ in several features, such as mean radius and mean area. Such separation indicates that classification methods can be effective in distinguishing between the two groups.
9.2 Evaluating Classifiers
Validating the performance of classification models is crucial to assess their effectiveness and reliability. This section explores key metrics used to evaluate classifiers, starting with the confusion matrix and then moving on to accuracy, precision, recall, the F1 score, and the area under the ROC curve (AUC). We demonstrate how to calculate and interpret these metrics in Python, using small illustrative examples and the breast cancer data.
9.2.1 Confusion Matrix
The confusion matrix is a fundamental tool used for calculating several other classification metrics. It is a table used to describe the performance of a classification model on a set of data for which the true values are known. The matrix displays the actual values against the predicted values, providing insight into the number of correct and incorrect predictions.
                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)
Four entries in the confusion matrix:
True Positive (TP): The cases in which the model correctly predicted the positive class.
False Positive (FP): The cases in which the model incorrectly predicted the positive class (i.e., the model predicted positive, but the actual class was negative).
True Negative (TN): The cases in which the model correctly predicted the negative class.
False Negative (FN): The cases in which the model incorrectly predicted the negative class (i.e., the model predicted negative, but the actual class was positive).
Four rates from the confusion matrix with actual (row) margins:
True positive rate (TPR): TP / (TP + FN). Also known as sensitivity or recall.
False negative rate (FNR): FN / (TP + FN). Also known as miss rate.
False positive rate (FPR): FP / (FP + TN). Also known as false alarm, fall-out.
True negative rate (TNR): TN / (FP + TN). Also known as specificity.
Note that TPR + FNR = 1 and FPR + TNR = 1, but TPR and FPR do not add up to one, and neither do FNR and TNR.
Four other rates with predicted (column) margins:
Positive predictive value (PPV): TP / (TP + FP). Also known as precision.
False discovery rate (FDR): FP / (TP + FP).
False omission rate (FOR): FN / (FN + TN).
Negative predictive value (NPV): TN / (FN + TN).
Note that PPV + FDR = 1 and NPV + FOR = 1, but PPV and NPV do not add up to one.
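To make these definitions concrete, the following sketch computes the four cells with scikit-learn's confusion_matrix on a small set of hypothetical labels and then derives the row-margin and column-margin rates defined above.

from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# With labels=[0, 1], rows are actual (negative, positive) and
# columns are predicted (negative, positive)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Rates with actual (row) margins
tpr = tp / (tp + fn)   # sensitivity, recall
fnr = fn / (tp + fn)   # miss rate
fpr = fp / (fp + tn)   # fall-out
tnr = tn / (fp + tn)   # specificity

# Rates with predicted (column) margins
ppv = tp / (tp + fp)   # precision
fdr = fp / (tp + fp)   # false discovery rate
fomr = fn / (fn + tn)  # false omission rate
npv = tn / (fn + tn)   # negative predictive value

print(tpr, fnr, fpr, tnr)
print(ppv, fdr, fomr, npv)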
9.2.2 Accuracy
Accuracy measures the overall correctness of the model and is defined as the ratio of correct predictions (both positive and negative) to the total number of cases examined.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Imbalanced Classes: Accuracy can be misleading if there is a significant imbalance between the classes. For instance, in a dataset where 95% of the samples are of one class, a model that naively predicts the majority class for all instances will still achieve 95% accuracy, which does not reflect true predictive performance.
Misleading Interpretations: High overall accuracy might hide the fact that the model is performing poorly on a smaller, yet important, segment of the data.
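A toy sketch (with made-up labels) illustrates the imbalance problem: a model that always predicts the majority class reaches 95% accuracy while detecting none of the positives.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy data: 95% negatives, 5% positives
y_true = np.array([0] * 95 + [1] * 5)

# A "classifier" that always predicts the majority (negative) class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.95, despite learning nothing
print(recall_score(y_true, y_pred))    # 0.0: no positive case is found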
9.2.3 Precision
Precision (or PPV) measures the accuracy of positive predictions. It quantifies the number of correct positive predictions made.
Precision = TP / (TP + FP)
Neglect of False Negatives: Precision focuses solely on the positive class predictions. It does not take into account false negatives (instances where the actual class is positive but predicted as negative). This can be problematic in cases like disease screening where missing a positive case (disease present) could be dangerous.
Not a Standalone Metric: High precision alone does not indicate good model performance, especially if recall is low. This situation could mean the model is too conservative in predicting positives, thus missing out on a significant number of true positive instances.
9.2.4 Recall
Recall (Sensitivity or TPR) measures the ability of a model to find all relevant cases (all actual positives).
Recall = TP / (TP + FN)
Neglect of False Positives: Recall does not consider false positives (instances where the actual class is negative but predicted as positive). High recall can be achieved at the expense of precision, leading to a large number of false positives which can be costly or undesirable in certain contexts, such as in spam detection.
Trade-off with Precision: Often, increasing recall decreases precision. This trade-off needs to be managed carefully, especially in contexts where both false positives and false negatives carry significant costs or risks.
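The trade-off can be seen by sweeping the decision threshold applied to predicted scores. In the sketch below, which uses hypothetical scores, lowering the threshold raises recall while lowering precision, and raising it does the opposite.

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted scores for the positive class
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.35, 0.4, 0.7, 0.3, 0.55, 0.65, 0.8, 0.9])

for threshold in [0.3, 0.5, 0.75]:
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")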
9.2.5 F-beta Score
The F-beta score is a weighted harmonic mean of precision and recall, taking into account a \(\beta\) parameter such that recall is considered \(\beta\) times as important as precision: \[
F_\beta = (1 + \beta^2)\, \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}.
\]
See the Stack Exchange post on this topic for the motivation of using \(\beta^2\) instead of just \(\beta\).
The F-beta score reaches its best value at 1 (perfect precision and recall) and worst at 0.
If reducing false negatives is more important (as might be the case in medical diagnostics where missing a positive diagnosis could be critical), you might choose a beta value greater than 1. If reducing false positives is more important (as in spam detection, where incorrectly classifying an email as spam could be inconvenient), a beta value less than 1 might be appropriate.
The F1 Score is a specific case of the F-beta score where beta is 1, giving equal weight to precision and recall. It is the harmonic mean of Precision and Recall and is a useful measure when you seek a balance between Precision and Recall and there is an uneven class distribution (large number of actual negatives).
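In scikit-learn these metrics are available as fbeta_score and f1_score. The sketch below, again with hypothetical labels, shows how the choice of \(\beta\) shifts the score toward recall or precision.

from sklearn.metrics import fbeta_score, f1_score

# Hypothetical labels: precision is 0.75 but recall is only 0.5
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(fbeta_score(y_true, y_pred, beta=2))    # emphasizes recall, lowest here
print(f1_score(y_true, y_pred))               # beta = 1: balanced
print(fbeta_score(y_true, y_pred, beta=0.5))  # emphasizes precision, highest here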
9.2.6 ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied: it plots the TPR against the FPR at each threshold, showing the trade-off between the two. The ROC curve is particularly useful for evaluating classifiers when the class distribution is imbalanced. Several properties are worth noting.
Increasing from \((0, 0)\) to \((1, 1)\). The ROC curve always starts at (0, 0) and ends at (1, 1) because these points represent the extreme threshold settings of the classifier. When the threshold is so high that all predictions are negative, both TPR and FPR are zero, corresponding to the point (0, 0). When the threshold is so low that all predictions are positive, both TPR and FPR are one, corresponding to the point (1, 1).
Best classification passes \((0, 1)\). The ideal classifier would achieve a TPR of 1 while keeping the FPR at 0. This corresponds to the point (0, 1) in the ROC space. In practice, the closer a classifier’s ROC curve approaches this top-left corner, the better its discriminative performance.
Classification by random guess gives the 45-degree line. For every threshold, the TPR equals the FPR, because the classifier is just as likely to label a negative instance as positive as it is to label a positive instance correctly. Thus, its points fall on the line where TPR = FPR. This diagonal serves as a baseline: a model whose ROC curve lies on or below this line has no discriminative ability, equivalent to random guessing. Any useful classifier should produce a curve that bows above the diagonal, showing higher TPRs than FPRs across thresholds. The greater the area between the ROC curve and the diagonal, the more informative the model.
The area between the ROC curve and the 45-degree line is related to the Gini coefficient, a measure of inequality: the Gini coefficient equals twice this area, that is, \(2\,\text{AUC} - 1\).
The area under the ROC curve (AUC) summarizes the entire curve in a single scalar, providing an important metric for comparing classifiers. A higher AUC indicates that the model more consistently ranks positive instances above negative ones, reflecting stronger separability between the classes. The value of AUC ranges from 0 to 1:
AUC = 1: A perfect classifier, which perfectly separates positive and negative classes.
AUC = 0.5: A classifier that performs no better than random chance.
AUC < 0.5: A classifier performing worse than random.
The AUC value provides insight into the model’s ability to discriminate between positive and negative classes across all possible threshold values.
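The ranking interpretation can be checked numerically: with hypothetical scores, roc_auc_score matches the proportion of (positive, negative) pairs in which the positive case receives the higher score (ties counted as one half).

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.2, 0.4, 0.5, 0.7, 0.3, 0.6, 0.8, 0.9])

# AUC from scikit-learn
print(roc_auc_score(y_true, scores))

# Same quantity as the fraction of correctly ordered positive-negative pairs
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(np.mean(pairs))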
9.2.7 Breast Cancer Example
Since logistic regression provides a natural starting point for classification, we apply it to the breast cancer data, using a subset of features for simplicity, to illustrate these metrics.
The response variable y is a binary array indicating whether a tumor is malignant or benign.
y = 0 corresponds to malignant (cancerous) tumors.
y = 1 corresponds to benign (non-cancerous) tumors.
This encoding follows scikit-learn’s convention for the Wisconsin Diagnostic Breast Cancer dataset. The dataset includes 569 samples with 30 numeric features derived from digitized images of fine needle aspirate biopsies.
When fitting a logistic regression model, the predicted probabilities (y_pred_prob) represent the estimated probability that a tumor is benign (1). Consequently, high predicted probabilities correspond to benign cases, and low probabilities indicate malignant ones. The ROC and AUC computations use these probabilities as scores for the positive class (benign).
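A fitting sketch consistent with this setup is given below. The particular feature subset, train/test split, and solver settings are illustrative assumptions, so the printed classification report may differ slightly from the figures discussed next.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pandas as pd

# Load the data and keep a small feature subset for simplicity (an illustrative choice)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)[
    ["mean radius", "mean texture", "mean smoothness"]
]
y = data.target  # 0 = malignant, 1 = benign

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit a plain logistic regression model
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Classification report: precision, recall, and F1 per class
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))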
Malignant tumors: Precision of 0.91 means that 91% of tumors predicted malignant were truly malignant. Recall of 0.83 shows that the model correctly identified 83% of actual malignant tumors but missed 17% (false negatives). The F1-score of 0.87 balances these two aspects.
Benign tumors: Precision of 0.90 means 90% of predicted benign tumors were correct. Recall of 0.95 shows the model caught 95% of actual benign tumors, misclassifying only 5% as malignant. The F1-score of 0.93 reflects this strong performance.
Overall: The accuracy of 0.91 indicates that about 91% of tumors were classified correctly. The macro average (simple mean across classes) is slightly lower than the weighted average, reflecting the imbalance in sample sizes. Since benign cases are more common, the weighted average leans closer to their stronger performance.
The model seems quite accurate overall, but performs better at identifying benign tumors than malignant ones. The relatively lower recall for malignant cases means that some malignant tumors were misclassified as benign. In medical applications, such false negatives are especially serious and motivate the use of evaluation metrics beyond accuracy alone.
from sklearn.metrics import roc_curve, roc_auc_score
import pandas as pd
from plotnine import ggplot, aes, geom_line, geom_abline, labs, theme_minimal

# Assume model and data are from previous logistic regression example
# y_test: true labels (0/1)
# y_pred_prob: predicted probabilities for the positive class
y_pred_prob = model.predict_proba(X_test)[:, 1]

# Compute ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc_value = roc_auc_score(y_test, y_pred_prob)

# Create ROC DataFrame
roc_df = pd.DataFrame({
    'False Positive Rate': fpr,
    'True Positive Rate': tpr
})

# Plot ROC curve
(
    ggplot(roc_df, aes(x='False Positive Rate', y='True Positive Rate'))
    + geom_line(color='blue')
    + geom_abline(linetype='dashed')
    + labs(
        title=f'ROC Curve (AUC = {auc_value:.2f})',
        x='False Positive Rate (1 - Specificity)',
        y='True Positive Rate (Sensitivity)'
    )
    + theme_minimal()
)
9.3 Tuning Regularized Logistic Models
The logistic regression model with an L1 (lasso) penalty requires a tuning parameter controlling the strength of regularization. In scikit-learn, this parameter is expressed as \(C = 1 / \lambda\). Smaller \(C\) values correspond to stronger penalties, shrinking more coefficients toward zero. The goal is to select \(C\) that optimizes a performance metric such as F1 or AUC via cross-validation.
Step 1: Data and Setup
We continue with the breast cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd
from plotnine import ggplot, aes, geom_line, labs, scale_x_log10, theme_minimal

# Load the data
X, y = load_breast_cancer(return_X_y=True)
Step 2: Automatic Selection of Candidate C Values
Rather than manually picking a grid, LogisticRegressionCV can generate the candidate values itself: when Cs is given as an integer, it constructs that many \(C\) values on a logarithmic scale (by default spanning \(10^{-4}\) to \(10^{4}\)), ensuring coverage from strongly regularized to nearly unregularized models.
# Use LogisticRegressionCV to determine reasonable C values
auto_lasso = LogisticRegressionCV(
    Cs=20,
    penalty="l1",
    solver="saga",
    cv=5,
    scoring="roc_auc",
    max_iter=5000,
    n_jobs=-1
)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", auto_lasso)
])
pipeline.fit(X, y)

## set tested Cs
C_values = np.logspace(-2, 1, 20)
## C_values = auto_lasso.Cs_
Step 3: Cross-Validation for F1 and AUC
While LogisticRegressionCV provides built-in AUC-based selection, we can also evaluate each \(C\) value using different metrics such as F1-score.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate each candidate C value with 5-fold cross-validation
f1_scores = []
auc_scores = []
for C in C_values:
    model = LogisticRegression(
        penalty="l1",
        solver="saga",
        C=C,
        max_iter=5000
    )
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", model)
    ])
    f1_scores.append(cross_val_score(pipe, X, y, cv=cv, scoring="f1").mean())
    auc_scores.append(cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean())

cv_results = pd.DataFrame({"C": C_values, "F1": f1_scores, "AUC": auc_scores})
Step 4: Identify the Optimal Regularization Strength
best_f1 = cv_results.loc[cv_results["F1"].idxmax()]
best_auc = cv_results.loc[cv_results["AUC"].idxmax()]
print("Best by F1:", best_f1)
print("Best by AUC:", best_auc)
Best by F1: C 1.623777
F1 0.979323
AUC 0.994523
Name: 14, dtype: float64
Best by AUC: C 0.784760
F1 0.976486
AUC 0.995644
Name: 12, dtype: float64
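The plotnine imports in Step 1 suggest visualizing the cross-validated scores against \(C\) on a log scale; one possible sketch, reusing the cv_results frame built above, is:

# Reshape to long format so both metrics can be plotted together
cv_long = cv_results.melt(id_vars="C", var_name="Metric", value_name="Score")

(
    ggplot(cv_long, aes(x="C", y="Score", color="Metric"))
    + geom_line()
    + scale_x_log10()
    + labs(
        title="Cross-validated F1 and AUC across regularization strengths",
        x="C (inverse regularization strength, log scale)",
        y="Mean cross-validated score"
    )
    + theme_minimal()
)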
This approach avoids arbitrary grids. The smallest \(C\) corresponds to the strongest regularization (the simplest model), while the largest \(C\) allows almost no penalty (the most flexible model). The optimal \(C\) often lies between these extremes. A higher AUC indicates better class separation, whereas a higher F1-score indicates balanced precision and recall. Depending on the application, one metric may be prioritized over the other.