11 Case study: When the data seem to cheat — Simpson’s paradox
11.1 Background
One of the most dangerous moments in data science is when the data appear to tell a clean story too quickly.
You compute an overall average, make a chart, and the conclusion seems obvious. Then you stratify by an important variable, and the story reverses. What looked like evidence for one claim now looks like evidence against it.
This is Simpson’s paradox: an association seen in aggregated data can weaken, disappear, or reverse after accounting for a third variable.
For data science students, this is not just a cute paradox. It is a warning. Data can cheat in the sense that a superficial summary can hide the mechanism generating the data. Data can also appear contradictory when two analyses are both numerically correct but answer different questions.
Simpson’s paradox does not mean one table is wrong. It means the overall table and the stratified tables are answering different questions.
In this case study, we use two real examples from the literature:
an example based on developmental-services expenditures in California, and
a longitudinal-study example from South Africa involving medical-aid status and follow-up participation.
We then use simulation to show how Simpson’s paradox arises naturally from subgroup composition.
11.2 Learning goals
By the end of this case study, you should be able to:
explain Simpson’s paradox in plain language;
distinguish an overall association from a conditional association;
identify a lurking or confounding variable that changes the story;
use weighted averages to explain why the reversal happens;
recognize why responsible data science requires deep statistical thinking, not just computation.
11.3 Load packages
Show Python
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import chi2_contingency

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)
11.4 What is Simpson’s paradox?
Suppose group A looks better than group B within each subgroup. For example, maybe A has a higher success rate in every age band, every department, or every disease severity category.
You might think that A must also look better overall. But that need not happen. If A is overrepresented in harder cases and B is overrepresented in easier cases, the overall average can reverse direction.
So the paradox is not that mathematics is broken. The paradox is that aggregation changes the weights.
The lesson is simple:
The first summary you compute may be numerically correct and still scientifically misleading.
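A two-stratum toy example makes this concrete. The counts below are invented for illustration: method A beats method B in both the easy and the hard stratum, yet loses overall, because A handles mostly hard cases while B handles mostly easy ones.

```python
# Invented counts: (successes, trials) for each method in each stratum.
counts = {
    "A": {"easy": (9, 10), "hard": (72, 90)},   # A: 90% easy, 80% hard
    "B": {"easy": (80, 90), "hard": (7, 10)},   # B: ~89% easy, 70% hard
}

def rate(successes, trials):
    return successes / trials

# A wins within both strata...
for stratum in ["easy", "hard"]:
    a = rate(*counts["A"][stratum])
    b = rate(*counts["B"][stratum])
    print(f"{stratum}: A={a:.2f}  B={b:.2f}")

# ...but the pooled rates go the other way, because the two methods
# aggregate over very different mixes of easy and hard cases.
overall_A = rate(9 + 72, 10 + 90)   # 81/100 = 0.81
overall_B = rate(80 + 7, 90 + 10)   # 87/100 = 0.87
print(f"overall: A={overall_A:.2f}  B={overall_B:.2f}")
```

The arithmetic is unremarkable; what changes between the stratified and pooled views is only the weights attached to each stratum.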
11.5 Example 1: California expenditures and apparent ethnic discrimination
This example comes from a teaching case based on developmental-services expenditures in California. The paper shows that the overall average annual expenditures for Hispanic consumers were much lower than those for White non-Hispanic consumers, which could easily lead to a quick claim of discrimination. But after stratifying by age cohort, the pattern reversed in all but one cohort. The explanation was that the Hispanic population in the sample was much younger overall, and expenditures increase sharply with age.
This example shows two ideas at once:
the overall disparity is real as a descriptive summary, and
the overall disparity alone is not enough to explain why it appears.
Within five of the six age cohorts, the Hispanic average is actually higher than the White non-Hispanic average. The overall result told one story; the age-stratified analysis tells another.
This does not prove that discrimination is absent. It shows something narrower and very important: the overall mean by itself is not a valid basis for that conclusion.
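To make the plots below reproducible, here is one way to construct the two data frames they use. The cohort-level numbers are invented placeholders that mimic the qualitative pattern described above (Hispanic consumers concentrated in the younger cohorts, expenditures rising steeply with age, Hispanic means higher in five of six cohorts); only the two reported overall means are the paper's actual figures.

```python
import pandas as pd

# Illustrative cohort-level numbers only -- NOT the paper's data.
dds = pd.DataFrame(
    {
        "age_cohort": ["0-5", "6-12", "13-17", "18-21", "22-50", "51+"],
        "hispanic_n": [4000, 3500, 1500, 800, 500, 200],     # skews young
        "white_n": [800, 1200, 1000, 900, 2500, 1500],       # skews old
        "hispanic_mean": [1400, 2300, 3700, 10000, 40000, 53000],
        "white_mean": [1200, 2100, 3600, 9000, 41000, 52000],
    }
)

# Overall means as reported in the paper.
overall_table = pd.DataFrame(
    {
        "group": ["Hispanic", "White non-Hispanic"],
        "overall_mean_reported": [11066, 24698],
    }
)

# Sanity check: Hispanic means are higher in five of six cohorts...
within_hisp_higher = (dds["hispanic_mean"] > dds["white_mean"]).sum()
# ...yet the pooled (count-weighted) means reverse.
hisp_overall = (dds["hispanic_n"] * dds["hispanic_mean"]).sum() / dds["hispanic_n"].sum()
white_overall = (dds["white_n"] * dds["white_mean"]).sum() / dds["white_n"].sum()
print(within_hisp_higher, round(hisp_overall), round(white_overall))
```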
11.5.4 A better picture: slope chart of the reversal
This graph makes the paradox easier to see. The overall comparison goes one way, while almost every within-age comparison goes the other way.
Show Python
rows = []
for _, r in dds.iterrows():
    rows.append({"panel": r["age_cohort"], "x": "Hispanic", "y": r["hispanic_mean"]})
    rows.append({"panel": r["age_cohort"], "x": "White non-Hispanic", "y": r["white_mean"]})
for _, r in overall_table.iterrows():
    rows.append({"panel": "Overall", "x": r["group"], "y": r["overall_mean_reported"]})
slope_df = pd.DataFrame(rows)
panel_order = ["Overall", "0-5", "6-12", "13-17", "18-21", "22-50", "51+"]
slope_df["panel"] = pd.Categorical(slope_df["panel"], categories=panel_order, ordered=True)
fig = px.line(
    slope_df,
    x="x",
    y="y",
    color="panel",
    markers=True,
    line_group="panel",
    category_orders={"x": ["Hispanic", "White non-Hispanic"]},
    title="Each line compares the two groups: overall vs within age cohort",
    hover_data={"panel": True, "y": ":,.0f"},
)
fig.update_layout(xaxis_title="Group", yaxis_title="Average annual expenditures ($)")
fig.show()
11.5.5 Why does the reversal happen?
The key is that the two groups have very different age distributions.
Show Python
dds["hispanic_pct"] = dds["hispanic_n"] / dds["hispanic_n"].sum()
dds["white_pct"] = dds["white_n"] / dds["white_n"].sum()
dds_pct = dds.melt(
    id_vars="age_cohort",
    value_vars=["hispanic_pct", "white_pct"],
    var_name="group",
    value_name="proportion",
)
dds_pct["group"] = dds_pct["group"].map(
    {"hispanic_pct": "Hispanic", "white_pct": "White non-Hispanic"}
)
fig = px.bar(
    dds_pct,
    x="age_cohort",
    y="proportion",
    color="group",
    barmode="group",
    title="Age composition differs sharply between the two groups",
)
fig.update_layout(yaxis_title="Proportion within ethnicity")
fig.show()
Hispanic consumers are much more concentrated in the younger cohorts, while White non-Hispanic consumers are much more concentrated in the older cohorts. But expenditures rise steeply with age. So the overall mean is a weighted average, and the two groups use very different weights.
That is the core mechanism of Simpson’s paradox in this example: not different arithmetic, but different subgroup composition.
11.5.6 Weighted-average decomposition
This plot combines the two ingredients of the paradox:
bars show the weights,
lines show the within-age means.
The overall mean comes from multiplying those two pieces together.
Show Python
fig = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.12,
    subplot_titles=("Weights: proportion in each age cohort", "Within-age average expenditures"),
)
for grp, pct_col, line_col in [
    ("Hispanic", "hispanic_pct", "hispanic_mean"),
    ("White non-Hispanic", "white_pct", "white_mean"),
]:
    fig.add_trace(
        go.Bar(x=dds["age_cohort"], y=dds[pct_col], name=f"{grp} weight", legendgroup=grp),
        row=1,
        col=1,
    )
    fig.add_trace(
        go.Scatter(
            x=dds["age_cohort"],
            y=dds[line_col],
            mode="lines+markers",
            name=f"{grp} mean",
            legendgroup=grp,
        ),
        row=2,
        col=1,
    )
fig.update_layout(
    title="Why the overall means reverse",
    barmode="group",
    height=700,
)
fig.update_yaxes(title_text="Proportion", row=1, col=1)
fig.update_yaxes(title_text="Average annual expenditures ($)", row=2, col=1)
fig.show()
You can verify the weighted-average explanation numerically.
Show Python

# Recompute each group's mean under a common set of age weights; here we
# use the combined age distribution of both groups as the common weights.
common_w = (dds["hispanic_n"] + dds["white_n"]) / (dds["hispanic_n"] + dds["white_n"]).sum()
same_weight_hisp = (common_w * dds["hispanic_mean"]).sum()
same_weight_white = (common_w * dds["white_mean"]).sum()
compare_weights = pd.DataFrame(
    {
        "group": ["Hispanic", "White non-Hispanic", "Hispanic", "White non-Hispanic"],
        "mean": [11066, 24698, same_weight_hisp, same_weight_white],
        "scenario": ["Original overall", "Original overall", "Same age weights", "Same age weights"],
    }
)
fig = px.bar(
    compare_weights,
    x="group",
    y="mean",
    color="scenario",
    barmode="group",
    text="mean",
    title="What happens if both groups use the same age distribution?",
)
fig.update_traces(texttemplate="$%{text:,.0f}", textposition="outside")
fig.update_layout(yaxis_title="Average annual expenditures ($)")
fig.show()
So the contradiction produced by Simpson’s paradox is not magic: it comes from changing the weights.
11.6 Example 2: A longitudinal study in South Africa
In the Birth to Ten study, the overall proportion whose mothers had medical aid was lower in the five-year cohort than among children not traced. But once race was taken into account, the direction reversed slightly within both the white and black groups.
11.6.1 Build the contingency tables from the paper
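The code below rebuilds the two summaries used in the plots that follow. The rates are the ones reported in the paper; the table layout and the `overall_sa` / `race_sa` names are this chapter's own construction.

```python
import pandas as pd

# Overall medical-aid rates reported in the paper.
overall_sa = pd.DataFrame(
    {
        "cohort": ["Five-year cohort", "Not traced"],
        "aid_rate": [0.111, 0.166],
    }
)

# The same comparison after stratifying by race.
race_sa = pd.DataFrame(
    {
        "race": ["White", "White", "Black", "Black"],
        "cohort": ["Five-year cohort", "Not traced"] * 2,
        "aid_rate": [0.833, 0.825, 0.089, 0.087],
    }
)
print(overall_sa)
print(race_sa)
```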
Show Python

fig = px.bar(
    overall_sa,
    x="cohort",
    y="aid_rate",
    text=overall_sa["aid_rate"].map(lambda x: f"{100*x:.1f}%"),
    title="Overall medical-aid rate",
)
fig.update_traces(textposition="outside")
fig.update_layout(yaxis_title="Proportion with medical aid")
fig.show()
Overall, the five-year group looks worse: about 11.1% versus 16.6%.
11.6.3 Conditional on race
Show Python
fig = px.bar(
    race_sa,
    x="race",
    y="aid_rate",
    color="cohort",
    barmode="group",
    text=race_sa["aid_rate"].map(lambda x: f"{100*x:.1f}%"),
    title="Medical-aid rate after stratifying by race",
)
fig.update_traces(textposition="outside")
fig.update_layout(yaxis_title="Proportion with medical aid")
fig.show()
Within the white group, the five-year group is slightly higher: 83.3% vs 82.5%.
Within the black group, the five-year group is also slightly higher: 8.9% vs 8.7%.
So the overall association goes in the opposite direction from the within-race comparisons.
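We can back out how different the racial composition of the two cohorts must be. Treating each overall rate as a weighted average of the two within-race rates, overall = p · white_rate + (1 − p) · black_rate, and solving for the white proportion p:

```python
def implied_white_share(overall, white_rate, black_rate):
    # Solve overall = p * white_rate + (1 - p) * black_rate for p.
    return (overall - black_rate) / (white_rate - black_rate)

p_five = implied_white_share(0.111, 0.833, 0.089)   # ~0.03
p_lost = implied_white_share(0.166, 0.825, 0.087)   # ~0.11
print(f"five-year cohort: {p_five:.1%} white;  not traced: {p_lost:.1%} white")
```

The five-year cohort contains a far smaller share of white children, and white children had much higher medical-aid rates, so the overall comparison is dominated by composition rather than by the within-race differences.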
The paper reports an overall significant difference but no meaningful within-race difference. We can reproduce the corresponding chi-square tests from the tabulated counts.
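A sketch of that computation with `scipy.stats.chi2_contingency` follows. The counts below are hypothetical placeholders (the paper's tabulated counts are not reproduced here), so the printed p-value illustrates the workflow rather than the paper's result; substitute the actual counts to reproduce its tests.

```python
import numpy as np
from scipy.stats import chi2_contingency

def aid_test(table):
    """Chi-square test on a 2x2 table whose rows are the two cohorts and
    whose columns are [with medical aid, without medical aid]."""
    chi2, p, dof, expected = chi2_contingency(np.array(table))
    return chi2, p

# Hypothetical counts for illustration only (rates 11.1% vs 16.6%).
overall_counts = [[111, 889], [166, 834]]   # five-year cohort vs not traced
chi2, p = aid_test(overall_counts)
print(f"overall: chi2={chi2:.2f}, p={p:.4f}")
```

Running the same function on each within-race 2×2 table completes the comparison between the overall and conditional tests.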
Again, the message is not “trust p-values” or “ignore p-values.” The message is that what you condition on matters.
It is also worth emphasizing that a statistically significant overall table can coexist with weak or negligible within-subgroup differences.
11.7 Simulation
The two real examples above are convincing because they come from real studies. But it is also useful to see that Simpson’s paradox is neither rare nor mysterious. It can arise whenever:
the subgroup outcome levels differ a lot; and
the group composition across subgroups differs enough.
11.7.1 A simple synthetic example
Suppose we compare two methods, A and B, over two difficulty levels: easy and hard.
Within both difficulty levels, A is better.
But A is used much more often on hard cases, while B is used much more often on easy cases.
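A minimal version of such a simulation is sketched below. The success probabilities, case counts, and allocation ranges are our own assumptions, chosen so that A is genuinely better in both strata while the case mix varies from dataset to dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_sims, n_cases = 5000, 400           # assumed simulation sizes
# True success probabilities: A is better in BOTH strata.
p_easy_A, p_easy_B = 0.92, 0.88
p_hard_A, p_hard_B = 0.55, 0.50

# Random case allocations: how many of each method's cases are easy.
A_easy = rng.integers(40, 360, n_sims)
B_easy = rng.integers(40, 360, n_sims)
A_hard, B_hard = n_cases - A_easy, n_cases - B_easy

# Realized successes per stratum.
a_se = rng.binomial(A_easy, p_easy_A)
a_sh = rng.binomial(A_hard, p_hard_A)
b_se = rng.binomial(B_easy, p_easy_B)
b_sh = rng.binomial(B_hard, p_hard_B)

sim_df = pd.DataFrame(
    {"A_easy": A_easy, "A_hard": A_hard, "B_easy": B_easy, "B_hard": B_hard}
)
sim_df["A_overall"] = (a_se + a_sh) / n_cases
sim_df["B_overall"] = (b_se + b_sh) / n_cases
sim_df["overall_diff_B_minus_A"] = sim_df["B_overall"] - sim_df["A_overall"]

# Reversal: A wins in BOTH realized strata yet loses overall.
a_wins_strata = (a_se / A_easy > b_se / B_easy) & (a_sh / A_hard > b_sh / B_hard)
sim_df["reversal_label"] = np.where(
    a_wins_strata & (sim_df["overall_diff_B_minus_A"] > 0), "reversal", "no reversal"
)
print(sim_df["reversal_label"].eq("reversal").mean())
```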
A useful summary statistic is the fraction of simulated datasets in which method A is better in both strata but still loses overall.
Show Python
fig = px.histogram(
    sim_df,
    x="overall_diff_B_minus_A",
    color="reversal_label",
    nbins=50,
    title="Distribution of overall difference: B - A",
)
fig.add_vline(x=0, line_dash="dash")
fig.update_layout(xaxis_title="Overall success-rate difference")
fig.show()
Show Python
fig = px.scatter(
    sim_df,
    x="A_easy",
    y="B_easy",
    color="reversal_label",
    opacity=0.45,
    title="Different subgroup compositions drive the paradox",
    hover_data=["A_hard", "B_hard", "A_overall", "B_overall"],
)
fig.update_layout(xaxis_title="Easy cases assigned to A", yaxis_title="Easy cases assigned to B")
fig.show()
11.7.3 Another simulation example in regression settings
Simpson’s paradox also appears in regression settings. In the toy GDP example below, each development group has a negative within-group relationship between country size and GDP per capita, but the pooled data look positive because the most populous countries are concentrated in the higher-GDP group.
Population weights matter because the most populous countries contribute more to an aggregated summary. When the high-population observations sit mostly in one group, the pooled regression can flip even though both group-specific trends are negative.
The pooled line is positive because the larger countries also tend to belong to the higher-GDP region. Within each region, however, the fitted line slopes downward.
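A minimal simulation of that pattern (all numbers invented) shows the slope flip directly: each group has a built-in slope of −1, yet the pooled fit is positive because the high-GDP group also contains the high-population observations.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_slope(x, y):
    # Ordinary least-squares slope of y on x.
    return np.polyfit(x, y, 1)[0]

# Two "development groups": within each, GDP per capita falls with
# (log) population, but the high-GDP group sits at larger populations.
n = 200
pop_low = rng.uniform(1, 4, n)                            # smaller countries
gdp_low = 5 - 1.0 * pop_low + rng.normal(0, 0.3, n)       # low-GDP group
pop_high = rng.uniform(4, 7, n)                           # more populous countries
gdp_high = 14 - 1.0 * pop_high + rng.normal(0, 0.3, n)    # high-GDP group

slope_low = fit_slope(pop_low, gdp_low)      # negative within group
slope_high = fit_slope(pop_high, gdp_high)   # negative within group
slope_pooled = fit_slope(
    np.concatenate([pop_low, pop_high]),
    np.concatenate([gdp_low, gdp_high]),
)                                            # positive when pooled
print(slope_low, slope_high, slope_pooled)
```

The between-group contrast (higher population and higher GDP in the same group) overwhelms the within-group trends once the two clouds are pooled, which is exactly the tabular paradox restated in regression form.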
11.8 Why this matters in data science
Simpson’s paradox is an important lesson for data science students because it captures a core danger of real-world analysis:
dashboards aggregate;
machine-learning pipelines summarize;
business reports compress complexity;
and people love a single number.
But the “single number” can be deeply misleading when important subgroups are mixed together.
A few places where this matters:
admissions data by department,
treatment comparisons by disease severity,
hiring or promotion outcomes by role or unit,
educational outcomes by school and student background,
model-performance comparisons by subgroup difficulty.
The paradox teaches intellectual discipline:
Ask what variables were aggregated over.
Check whether subgroup composition differs across the groups being compared.
Separate descriptive truth from causal interpretation.
Do not call something discrimination, bias, or superiority until you understand the structure of the data.
Data do not literally lie, but they can absolutely mislead when we use them carelessly.
That is why good data science requires more than coding and computation. It requires statistical thinking: judgment about structure, about context, and about which comparison is scientifically meaningful.
11.9 Discussion
Can Simpson’s paradox occur in regression or machine-learning settings, not just in tables? Explain.
When should we report overall results, and when should we prioritize subgroup results?
Find a modern data-science setting where a subgroup analysis would be essential before making a conclusion.