11  Case study: When the data seem to cheat — Simpson’s paradox

11.1 Background

One of the most dangerous moments in data science is when the data appear to tell a clean story too quickly.

You compute an overall average, make a chart, and the conclusion seems obvious. Then you stratify by an important variable, and the story reverses. What looked like evidence for one claim now looks like evidence against it.

This is Simpson’s paradox: an association seen in aggregated data can weaken, disappear, or reverse after accounting for a third variable.

For data science students, this is not just a cute paradox. It is a warning. Data can cheat in the sense that a superficial summary can hide the mechanism generating the data. Data can also appear contradictory when two analyses are both numerically correct but answer different questions.

Simpson’s paradox does not mean one table is wrong. It means the overall table and the stratified tables are answering different questions.

In this case study, we use two real examples from the literature:

  1. an example based on developmental-services expenditures in California, and
  2. a longitudinal-study example from South Africa involving medical-aid status and follow-up participation.

We then use simulation to show how Simpson’s paradox arises naturally from subgroup composition.

11.2 Learning goals

By the end of this case study, you should be able to:

  • explain Simpson’s paradox in plain language;
  • distinguish an overall association from a conditional association;
  • identify a lurking or confounding variable that changes the story;
  • use weighted averages to explain why the reversal happens;
  • see why responsible data science requires deep statistical thinking, not just computation.

11.3 Load packages

Show Python
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import chi2_contingency

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)

11.4 What is Simpson’s paradox?

Suppose group A looks better than group B within each subgroup. For example, maybe A has a higher success rate in every age band, every department, or every disease severity category.

You might think that A must also look better overall. But that need not happen. If A is overrepresented in harder cases and B is overrepresented in easier cases, the overall average can reverse direction.

So the paradox is not that mathematics is broken. The paradox is that aggregation changes the weights.

The lesson is simple:

The first summary you compute may be numerically correct and still scientifically misleading.
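A tiny arithmetic sketch (with made-up numbers) shows how this works: A beats B inside both strata, yet B wins the overall comparison because the two methods weight the strata differently.

```python
# Hypothetical success rates: method A beats method B inside both strata.
rates  = {"A": {"easy": 0.80, "hard": 0.50},
          "B": {"easy": 0.70, "hard": 0.40}}
# But the two methods face very different case mixes.
counts = {"A": {"easy": 200, "hard": 800},
          "B": {"easy": 800, "hard": 200}}

def overall(method):
    # Overall rate = weighted average of stratum rates; weights = case mix.
    total = sum(counts[method].values())
    return sum(rates[method][s] * counts[method][s] / total for s in rates[method])

print(round(overall("A"), 2))  # 0.56: mostly hard cases drag A down
print(round(overall("B"), 2))  # 0.64: mostly easy cases push B up
```

A is better in both strata (0.80 vs 0.70, 0.50 vs 0.40), yet B is higher overall. Nothing in the arithmetic is wrong; only the weights differ.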

11.5 Example 1: California expenditures and apparent ethnic discrimination

This example comes from a teaching case based on developmental-services expenditures in California. The paper shows that the overall average annual expenditures for Hispanic consumers were much lower than those for White non-Hispanic consumers, which could easily lead to a quick claim of discrimination. But after stratifying by age cohort, the pattern reversed in all but one cohort. The explanation was that the Hispanic population in the sample was much younger overall, and expenditures increase sharply with age.

This example shows two ideas at once:

  • the overall disparity is real as a descriptive summary, and
  • the overall disparity alone does not explain why it appears.

11.5.1 Enter the summary tables from the paper

Show Python
dds = pd.DataFrame(
    {
        "age_cohort": ["0-5", "6-12", "13-17", "18-21", "22-50", "51+"],
        "hispanic_mean": [1393, 2312, 3955, 9960, 40924, 55585],
        "white_mean": [1367, 2052, 3904, 10133, 40188, 52670],
        "hispanic_n": [44, 91, 103, 78, 43, 17],
        "white_n": [20, 46, 67, 69, 133, 66],
    }
)

overall_table = pd.DataFrame(
    {
        "group": ["Hispanic", "White non-Hispanic"],
        "overall_mean_reported": [11066, 24698],
        "n_total": [376, 401],
    }
)

dds
age_cohort hispanic_mean white_mean hispanic_n white_n
0 0-5 1393 1367 44 20
1 6-12 2312 2052 91 46
2 13-17 3955 3904 103 67
3 18-21 9960 10133 78 69
4 22-50 40924 40188 43 133
5 51+ 55585 52670 17 66

11.5.2 The misleading first look: overall means only

Show Python
fig = px.bar(
    overall_table,
    x="group",
    y="overall_mean_reported",
    text="overall_mean_reported",
    title="Overall average expenditures: the misleading first look",
)
fig.update_traces(texttemplate="$%{text:,.0f}", textposition="outside")
fig.update_layout(yaxis_title="Average annual expenditures ($)")
fig.show()

If we stop here, the conclusion seems obvious: the White non-Hispanic group appears to receive far more support.

That conclusion is too fast because we have not yet asked whether the two groups have comparable age distributions.

11.5.3 Now stratify by age cohort

Show Python
dds_long = dds.melt(
    id_vars="age_cohort",
    value_vars=["hispanic_mean", "white_mean"],
    var_name="group",
    value_name="mean_expenditure",
)
dds_long["group"] = dds_long["group"].map(
    {"hispanic_mean": "Hispanic", "white_mean": "White non-Hispanic"}
)

fig = px.line(
    dds_long,
    x="age_cohort",
    y="mean_expenditure",
    color="group",
    markers=True,
    title="Average expenditures within age cohort",
)
fig.update_layout(yaxis_title="Average annual expenditures ($)")
fig.show()
Show Python
dds[["age_cohort", "hispanic_mean", "white_mean"]]
age_cohort hispanic_mean white_mean
0 0-5 1393 1367
1 6-12 2312 2052
2 13-17 3955 3904
3 18-21 9960 10133
4 22-50 40924 40188
5 51+ 55585 52670

Within five of the six age cohorts, the Hispanic average is actually higher than the White non-Hispanic average. The overall result told one story; the age-stratified analysis tells another.

This does not prove that discrimination is absent. It shows something narrower and very important: the overall mean by itself is not a valid basis for that conclusion.

11.5.4 A better picture: slope chart of the reversal

This graph makes the paradox easier to see. The overall comparison goes one way, while almost every within-age comparison goes the other way.

Show Python
rows = []
for _, r in dds.iterrows():
    rows.append({"panel": r["age_cohort"], "x": "Hispanic", "y": r["hispanic_mean"]})
    rows.append({"panel": r["age_cohort"], "x": "White non-Hispanic", "y": r["white_mean"]})

for _, r in overall_table.iterrows():
    rows.append({"panel": "Overall", "x": r["group"], "y": r["overall_mean_reported"]})

slope_df = pd.DataFrame(rows)
panel_order = ["Overall", "0-5", "6-12", "13-17", "18-21", "22-50", "51+"]
slope_df["panel"] = pd.Categorical(slope_df["panel"], categories=panel_order, ordered=True)

fig = px.line(
    slope_df,
    x="x",
    y="y",
    color="panel",
    markers=True,
    line_group="panel",
    category_orders={"x": ["Hispanic", "White non-Hispanic"]},
    title="Each line compares the two groups: overall vs within age cohort",
    hover_data={"panel": True, "y": ":,.0f"},
)
fig.update_layout(xaxis_title="Group", yaxis_title="Average annual expenditures ($)")
fig.show()

11.5.5 Why does the reversal happen?

The key is that the two groups have very different age distributions.

Show Python
dds["hispanic_pct"] = dds["hispanic_n"] / dds["hispanic_n"].sum()
dds["white_pct"] = dds["white_n"] / dds["white_n"].sum()

dds_pct = dds.melt(
    id_vars="age_cohort",
    value_vars=["hispanic_pct", "white_pct"],
    var_name="group",
    value_name="proportion",
)
dds_pct["group"] = dds_pct["group"].map(
    {"hispanic_pct": "Hispanic", "white_pct": "White non-Hispanic"}
)

fig = px.bar(
    dds_pct,
    x="age_cohort",
    y="proportion",
    color="group",
    barmode="group",
    title="Age composition differs sharply between the two groups",
)
fig.update_layout(yaxis_title="Proportion within ethnicity")
fig.show()

Hispanic consumers are much more concentrated in the younger cohorts, while White non-Hispanic consumers are much more concentrated in the older cohorts. But expenditures rise steeply with age. So the overall mean is a weighted average, and the two groups use very different weights.

That is the core mechanism of Simpson’s paradox in this example: not different arithmetic, but different subgroup composition.

11.5.6 Weighted-average decomposition

This plot combines the two ingredients of the paradox:

  • bars show the weights,
  • lines show the within-age means.

The overall mean comes from multiplying those two pieces together.

Show Python
fig = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.12,
    subplot_titles=("Weights: proportion in each age cohort", "Within-age average expenditures"),
)

for grp, pct_col, line_col in [
    ("Hispanic", "hispanic_pct", "hispanic_mean"),
    ("White non-Hispanic", "white_pct", "white_mean"),
]:
    fig.add_trace(
        go.Bar(x=dds["age_cohort"], y=dds[pct_col], name=f"{grp} weight", legendgroup=grp),
        row=1,
        col=1,
    )
    fig.add_trace(
        go.Scatter(x=dds["age_cohort"], y=dds[line_col], mode="lines+markers", name=f"{grp} mean", legendgroup=grp),
        row=2,
        col=1,
    )

fig.update_layout(
    title="Why the overall means reverse",
    barmode="group",
    height=700,
)
fig.update_yaxes(title_text="Proportion", row=1, col=1)
fig.update_yaxes(title_text="Average annual expenditures ($)", row=2, col=1)
fig.show()

You can verify the weighted-average explanation numerically.

Show Python
hispanic_weighted = np.sum(dds["hispanic_pct"] * dds["hispanic_mean"])
white_weighted = np.sum(dds["white_pct"] * dds["white_mean"])

pd.DataFrame(
    {
        "group": ["Hispanic", "White non-Hispanic"],
        "Weighted mean from age cohorts": [hispanic_weighted, white_weighted],
        "Reported overall mean": [11066, 24698],
    }
)
group Weighted mean from age cohorts Reported overall mean
0 Hispanic 11065.441489 11066
1 White non-Hispanic 24697.508728 24698

What if we apply the same age weights (pooled age distribution) to both groups? This isolates the within-age differences from the composition effect.

Show Python
dds["pooled_pct"] = (dds["hispanic_n"] + dds["white_n"]) / (dds["hispanic_n"].sum() + dds["white_n"].sum())

same_weight_hisp = np.sum(dds["pooled_pct"] * dds["hispanic_mean"])
same_weight_white = np.sum(dds["pooled_pct"] * dds["white_mean"])

pd.DataFrame(
    {
        "group": ["Hispanic", "White non-Hispanic"],
        "Pooled weights mean": [same_weight_hisp, same_weight_white],
        "Reported overall mean": [11066, 24698],
    }
)
group Pooled weights mean Reported overall mean
0 Hispanic 18479.465894 11066
1 White non-Hispanic 17974.956242 24698
Show Python
compare_weights = pd.DataFrame(
    {
        "group": ["Hispanic", "White non-Hispanic", "Hispanic", "White non-Hispanic"],
        "mean": [11066, 24698, same_weight_hisp, same_weight_white],
        "scenario": ["Original overall", "Original overall", "Same age weights", "Same age weights"],
    }
)

fig = px.bar(
    compare_weights,
    x="group",
    y="mean",
    color="scenario",
    barmode="group",
    text="mean",
    title="What happens if both groups use the same age distribution?",
)
fig.update_traces(texttemplate="$%{text:,.0f}", textposition="outside")
fig.update_layout(yaxis_title="Average annual expenditures ($)")
fig.show()

When both groups are given the same age weights, the two means become nearly equal, with the Hispanic mean slightly higher. The contradiction in Simpson’s paradox is not magic: it comes from changing the weights.

11.6 Example 2: A longitudinal study in South Africa

In the Birth to Ten study, the overall proportion whose mothers had medical aid was lower in the five-year cohort than among children not traced. But once race was taken into account, the direction reversed slightly within both the white and black groups.

11.6.1 Build the contingency tables from the paper

Show Python
overall_sa = pd.DataFrame(
    {
        "cohort": ["Not traced", "Five-year group"],
        "had_medical_aid": [195, 46],
        "no_medical_aid": [979, 370],
    }
)
overall_sa["total"] = overall_sa["had_medical_aid"] + overall_sa["no_medical_aid"]
overall_sa["aid_rate"] = overall_sa["had_medical_aid"] / overall_sa["total"]

race_sa = pd.DataFrame(
    {
        "race": ["White", "White", "Black", "Black"],
        "cohort": ["Not traced", "Five-year group", "Not traced", "Five-year group"],
        "had_medical_aid": [104, 10, 91, 36],
        "no_medical_aid": [22, 2, 957, 368],
    }
)
race_sa["total"] = race_sa["had_medical_aid"] + race_sa["no_medical_aid"]
race_sa["aid_rate"] = race_sa["had_medical_aid"] / race_sa["total"]

overall_sa
cohort had_medical_aid no_medical_aid total aid_rate
0 Not traced 195 979 1174 0.166099
1 Five-year group 46 370 416 0.110577

11.6.2 Overall association

Show Python
fig = px.bar(
    overall_sa,
    x="cohort",
    y="aid_rate",
    text=overall_sa["aid_rate"].map(lambda x: f"{100*x:.1f}%"),
    title="Overall medical-aid rate",
)
fig.update_traces(textposition="outside")
fig.update_layout(yaxis_title="Proportion with medical aid")
fig.show()

Overall, the five-year group looks worse: about 11.1% versus 16.6%.

11.6.3 Conditional on race

Show Python
fig = px.bar(
    race_sa,
    x="race",
    y="aid_rate",
    color="cohort",
    barmode="group",
    text=race_sa["aid_rate"].map(lambda x: f"{100*x:.1f}%"),
    title="Medical-aid rate after stratifying by race",
)
fig.update_traces(textposition="outside")
fig.update_layout(yaxis_title="Proportion with medical aid")
fig.show()

Within the white group, the five-year group is slightly higher: 83.3% vs 82.5%.

Within the black group, the five-year group is also slightly higher: 8.9% vs 8.7%.

So the overall association goes in the opposite direction from the within-race comparisons.

11.6.4 A compact reversal plot

Show Python
south_africa_plot = pd.concat(
    [
        pd.DataFrame(
            {
                "comparison": ["Overall", "Overall"],
                "group": ["Not traced", "Five-year group"],
                "rate": overall_sa["aid_rate"].values,
            }
        ),
        pd.DataFrame(
            {
                "comparison": ["White", "White", "Black", "Black"],
                "group": race_sa["cohort"].values,
                "rate": race_sa["aid_rate"].values,
            }
        ),
    ],
    ignore_index=True,
)

fig = px.line(
    south_africa_plot,
    x="group",
    y="rate",
    color="comparison",
    markers=True,
    line_group="comparison",
    category_orders={"group": ["Not traced", "Five-year group"]},
    title="Overall vs within-race comparisons",
    hover_data={"rate": ":.3f"},
)
fig.update_layout(yaxis_title="Medical-aid rate")
fig.show()

11.6.5 Why did this happen?

Because the racial composition differs across the two cohorts, and medical-aid access differs dramatically by race.

Show Python
comp = race_sa.pivot(index="cohort", columns="race", values="total").reset_index()
for col in ["Black", "White"]:
    comp[f"{col}_pct"] = comp[col] / (comp["Black"] + comp["White"])
comp
race cohort Black White Black_pct White_pct
0 Five-year group 404 12 0.971154 0.028846
1 Not traced 1048 126 0.892675 0.107325
Show Python
comp_long = comp.melt(
    id_vars="cohort",
    value_vars=["White_pct", "Black_pct"],
    var_name="race",
    value_name="proportion",
)
comp_long["race"] = comp_long["race"].str.replace("_pct", "", regex=False)

fig = px.bar(
    comp_long,
    x="cohort",
    y="proportion",
    color="race",
    barmode="stack",
    title="Race composition differs across cohorts",
)
fig.update_layout(yaxis_title="Proportion")
fig.show()

11.6.6 A quick significance check

The paper reports an overall significant difference but no meaningful within-race difference. We can reproduce the corresponding chi-square tests from the tabulated counts.

Show Python
overall_tab = np.array([[195, 979], [46, 370]])
chi2_overall, p_overall, _, _ = chi2_contingency(overall_tab, correction=False)

white_tab = np.array([[104, 22], [10, 2]])
chi2_white, p_white, _, _ = chi2_contingency(white_tab, correction=False)

black_tab = np.array([[91, 957], [36, 368]])
chi2_black, p_black, _, _ = chi2_contingency(black_tab, correction=False)

pd.DataFrame(
    {
        "table": ["Overall", "White only", "Black only"],
        "p_value": [p_overall, p_white, p_black],
    }
)
table p_value
0 Overall 0.006658
1 White only 0.944744
2 Black only 0.890541

Again, the message is not “trust p-values” or “ignore p-values.” The message is that what you condition on matters.

It is also worth emphasizing that a statistically significant overall table can coexist with weak or negligible within-subgroup differences.
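A standard way to test association while conditioning on the strata is the Cochran-Mantel-Haenszel test. Below is a minimal hand-rolled sketch applied to the two race-specific tables above (statsmodels also provides this test through its contingency-table tools, but the formula is short enough to write directly):

```python
import numpy as np
from scipy.stats import chi2

# 2x2 tables per stratum: rows = (not traced, five-year), cols = (aid, no aid).
strata = [
    np.array([[104, 22], [10, 2]]),    # White
    np.array([[91, 957], [36, 368]]),  # Black
]

def cmh_test(tables):
    """Cochran-Mantel-Haenszel chi-square test (1 df, no continuity correction)."""
    obs = exp = var = 0.0
    for t in tables:
        n = t.sum()
        r1, c1 = t[0].sum(), t[:, 0].sum()
        obs += t[0, 0]                     # observed top-left cell
        exp += r1 * c1 / n                 # expected count under no association
        var += r1 * (n - r1) * c1 * (n - c1) / (n**2 * (n - 1))
    stat = (obs - exp) ** 2 / var
    return stat, chi2.sf(stat, df=1)

stat, p = cmh_test(strata)
print(f"CMH statistic = {stat:.3f}, p = {p:.3f}")  # large p: no conditional association
```

Unlike the pooled chi-square test, this statistic aggregates evidence within strata, so it is not fooled by the compositional difference between cohorts.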

11.7 Simulation

The two real examples above are convincing because they come from real studies. But it is also useful to see that Simpson’s paradox is not rare magic. It can arise whenever:

  1. the subgroup outcome levels differ a lot; and
  2. the group composition across subgroups differs enough.

11.7.1 A simple synthetic example

Suppose we compare two methods, A and B, over two difficulty levels: easy and hard.

  • Within both difficulty levels, A is better.
  • But A is used much more often on hard cases, while B is used much more often on easy cases.
Show Python
simpson_simple = pd.DataFrame(
    {
        "difficulty": ["Easy", "Hard"],
        "A_success_rate": [0.90, 0.60],
        "B_success_rate": [0.85, 0.55],
        "A_cases": [100, 900],
        "B_cases": [900, 100],
    }
)
simpson_simple
difficulty A_success_rate B_success_rate A_cases B_cases
0 Easy 0.9 0.85 100 900
1 Hard 0.6 0.55 900 100
Show Python
A_overall = np.average(simpson_simple["A_success_rate"], weights=simpson_simple["A_cases"])
B_overall = np.average(simpson_simple["B_success_rate"], weights=simpson_simple["B_cases"])

pd.DataFrame(
    {
        "method": ["A", "B"],
        "overall_success_rate": [A_overall, B_overall],
    }
)
method overall_success_rate
0 A 0.63
1 B 0.82
Show Python
simple_long = simpson_simple.melt(
    id_vars="difficulty",
    value_vars=["A_success_rate", "B_success_rate"],
    var_name="method",
    value_name="success_rate",
)
simple_long["method"] = simple_long["method"].str.replace("_success_rate", "", regex=False)

fig = px.bar(
    simple_long,
    x="difficulty",
    y="success_rate",
    color="method",
    barmode="group",
    title="Within each difficulty level, A is better",
)
fig.update_layout(yaxis_title="Success rate")
fig.show()
Show Python
fig = px.bar(
    pd.DataFrame({"method": ["A", "B"], "overall_success_rate": [A_overall, B_overall]}),
    x="method",
    y="overall_success_rate",
    text="overall_success_rate",
    title="But overall, the ranking reverses",
)
fig.update_traces(texttemplate="%{text:.3f}", textposition="outside")
fig.update_layout(yaxis_title="Overall success rate")
fig.show()

11.7.2 Simulate many datasets with varying subgroup composition

Now we simulate many datasets where A is truly better within both strata, but subgroup mixing varies from one dataset to another.

Show Python
rng = np.random.default_rng(3255)

def one_sim():
    pA_easy = rng.uniform(0.75, 0.95)
    pB_easy = pA_easy - rng.uniform(0.01, 0.08)

    pA_hard = rng.uniform(0.35, 0.70)
    pB_hard = pA_hard - rng.uniform(0.01, 0.08)

    A_easy = rng.integers(20, 250)
    A_hard = rng.integers(300, 1200)
    B_easy = rng.integers(300, 1200)
    B_hard = rng.integers(20, 250)

    A_overall = np.average([pA_easy, pA_hard], weights=[A_easy, A_hard])
    B_overall = np.average([pB_easy, pB_hard], weights=[B_easy, B_hard])

    return {
        "pA_easy": pA_easy,
        "pB_easy": pB_easy,
        "pA_hard": pA_hard,
        "pB_hard": pB_hard,
        "A_easy": A_easy,
        "A_hard": A_hard,
        "B_easy": B_easy,
        "B_hard": B_hard,
        "A_overall": A_overall,
        "B_overall": B_overall,
        "reversal": B_overall > A_overall,
    }

sim_df = pd.DataFrame([one_sim() for _ in range(2000)])
sim_df["overall_diff_B_minus_A"] = sim_df["B_overall"] - sim_df["A_overall"]
sim_df["easy_case_gap"] = sim_df["B_easy"] - sim_df["A_easy"]
sim_df["reversal_label"] = sim_df["reversal"].map({True: "Reversal", False: "No reversal"})
sim_df.head()
pA_easy pB_easy pA_hard pB_hard A_easy A_hard B_easy B_hard A_overall B_overall reversal overall_diff_B_minus_A easy_case_gap reversal_label
0 0.835463 0.803197 0.522610 0.496160 205 982 610 237 0.576641 0.717285 True 0.140644 405 Reversal
1 0.869031 0.856481 0.677502 0.628806 148 716 524 147 0.710310 0.806603 True 0.096293 376 Reversal
2 0.753094 0.709083 0.374724 0.321039 169 410 667 97 0.485164 0.659815 True 0.174652 498 Reversal
3 0.877914 0.806613 0.508721 0.458813 191 1085 322 36 0.563984 0.771639 True 0.207655 131 Reversal
4 0.837741 0.776933 0.454775 0.381841 245 717 673 162 0.552308 0.700281 True 0.147973 428 Reversal
Show Python
sim_df["reversal"].mean()
np.float64(0.995)

The value above is the fraction of simulated datasets where method A is better in both strata but still loses overall.

Show Python
fig = px.histogram(
    sim_df,
    x="overall_diff_B_minus_A",
    color="reversal_label",
    nbins=50,
    title="Distribution of overall difference: B - A",
)
fig.add_vline(x=0, line_dash="dash")
fig.update_layout(xaxis_title="Overall success-rate difference")
fig.show()
Show Python
fig = px.scatter(
    sim_df,
    x="A_easy",
    y="B_easy",
    color="reversal_label",
    opacity=0.45,
    title="Different subgroup compositions drive the paradox",
    hover_data=["A_hard", "B_hard", "A_overall", "B_overall"],
)
fig.update_layout(xaxis_title="Easy cases assigned to A", yaxis_title="Easy cases assigned to B")
fig.show()

11.7.3 Another simulation example in regression settings

Simpson’s paradox also appears in regression settings. In the toy GDP example below, each development group has a negative within-group relationship between country size and GDP per capita, but the pooled data look positive because the developed group sits at both larger country sizes and higher GDP levels, so the pooled fit tracks the gap between groups rather than the trend within either one.

Show Python
rng = np.random.default_rng(3255)

rows = []
for development_group, lo, hi, intercept, slope, pop_mean in [
    ("Developing", 5, 35, 95, -1.4, 2.3),
    ("Developed", 40, 100, 240, -0.9, 3.2),
]:
    country_size = rng.uniform(lo, hi, 40)
    population = rng.lognormal(mean=pop_mean, sigma=0.45, size=40)
    gdp_per_capita = intercept + slope * country_size + rng.normal(0, 6, 40)
    rows.append(
        pd.DataFrame(
            {
                "development_group": development_group,
                "country_size": country_size,
                "population": population,
                "gdp_per_capita": gdp_per_capita,
            }
        )
    )

df = pd.concat(rows, ignore_index=True)

corr_rows = [{"group": "Overall", "correlation": df["country_size"].corr(df["gdp_per_capita"])}]
for development_group, sub in df.groupby("development_group"):
    corr_rows.append(
        {
            "group": development_group,
            "correlation": sub["country_size"].corr(sub["gdp_per_capita"]),
        }
    )
corr_df = pd.DataFrame(corr_rows)

fig = go.Figure()
colors = {"Developing": "#1f77b4", "Developed": "#ff7f0e"}

for development_group, sub in df.groupby("development_group"):
    fig.add_trace(
        go.Scatter(
            x=sub["country_size"],
            y=sub["gdp_per_capita"],
            mode="markers",
            name=development_group,
            marker=dict(
                size=4 + 16 * sub["population"] / df["population"].max(),
                color=colors[development_group],
                opacity=0.7,
                line=dict(width=0.5, color="white"),
            ),
            text=[development_group] * len(sub),
            hovertemplate="Group=%{text}<br>Country size=%{x:.1f}<br>GDP per capita=%{y:.1f}<extra></extra>",
        )
    )

    m, b = np.polyfit(sub["country_size"], sub["gdp_per_capita"], 1)
    x_fit = np.linspace(sub["country_size"].min(), sub["country_size"].max(), 50)
    fig.add_trace(
        go.Scatter(
            x=x_fit,
            y=m * x_fit + b,
            mode="lines",
            name=f"{development_group} trend",
            line=dict(color=colors[development_group], dash="dash"),
            showlegend=False,
        )
    )

m_all, b_all = np.polyfit(df["country_size"], df["gdp_per_capita"], 1)
x_all = np.linspace(df["country_size"].min(), df["country_size"].max(), 100)
fig.add_trace(
    go.Scatter(
        x=x_all,
        y=m_all * x_all + b_all,
        mode="lines",
        name="Overall trend",
        line=dict(color="black", width=3),
    )
)

fig.update_layout(
    title="Overall trend is positive, but each development group slopes downward",
    xaxis_title="Country size",
    yaxis_title="GDP per capita",
)
fig.show()

# Correlation summary
fig = px.bar(
    corr_df,
    x="group",
    y="correlation",
    text="correlation",
    title="Correlation flips after stratifying by development group",
)
fig.update_traces(texttemplate="%{text:.2f}", textposition="outside")
fig.update_layout(yaxis_title="Pearson correlation")
fig.show()

corr_df
group correlation
0 Overall 0.744340
1 Developed -0.910302
2 Developing -0.923694

The pooled line slopes upward because the developed group occupies the larger country sizes and the much higher GDP levels, so the unweighted pooled fit is dominated by the between-group gap and flips sign even though both group-specific trends are negative. Within each group, the fitted line slopes downward. (Population enters this plot only through marker size, but the general point stands: when the heavily weighted observations sit mostly in one group, an aggregated summary leans toward that group.)
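A common remedy is to adjust for the grouping variable in the model. The sketch below regenerates synthetic data with the same intercepts and slopes as the chapter's simulation (the variable names are illustrative): a pooled fit gives a positive slope, while adding a group dummy recovers the negative within-group slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups with negative within-group slopes but very different
# levels and size ranges, mirroring the simulation above.
x_dev = rng.uniform(5, 35, 40)                    # "Developing" sizes
y_dev = 95 - 1.4 * x_dev + rng.normal(0, 6, 40)
x_adv = rng.uniform(40, 100, 40)                  # "Developed" sizes
y_adv = 240 - 0.9 * x_adv + rng.normal(0, 6, 40)

x = np.concatenate([x_dev, x_adv])
y = np.concatenate([y_dev, y_adv])
g = np.concatenate([np.zeros(40), np.ones(40)])   # group dummy

# Pooled fit ignores the group and picks up the between-group gap.
pooled_slope = np.polyfit(x, y, 1)[0]

# Fit y ~ 1 + x + group: the x coefficient is now a within-group slope.
X = np.column_stack([np.ones_like(x), x, g])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted_slope = coef[1]

print(f"pooled slope:   {pooled_slope:.2f}")      # positive, misleading
print(f"adjusted slope: {adjusted_slope:.2f}")    # negative, within-group
```

The dummy absorbs the level difference between groups, so the slope estimate comes from variation within groups only. This is the regression analogue of applying the same age weights to both groups in Example 1.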

11.8 Why this matters in data science

Simpson’s paradox is an important lesson for data science students because it captures a core danger of real-world analysis:

  • dashboards aggregate;
  • machine-learning pipelines summarize;
  • business reports compress complexity;
  • and people love a single number.

But the “single number” can be deeply misleading when important subgroups are mixed together.

A few places where this matters:

  • admissions data by department,
  • treatment comparisons by disease severity,
  • hiring or promotion outcomes by role or unit,
  • educational outcomes by school and student background,
  • model-performance comparisons by subgroup difficulty.

The paradox teaches intellectual discipline:

  1. Ask what variables were aggregated over.
  2. Check whether subgroup composition differs across the groups being compared.
  3. Separate descriptive truth from causal interpretation.
  4. Do not call something discrimination, bias, or superiority until you understand the structure of the data.
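Checklist items 1 and 2 are easy to automate. A minimal sketch with hypothetical record-level data (the column names are illustrative): compute the overall rates, the stratified rates, and each group's composition across strata before trusting any single-number comparison.

```python
import pandas as pd

# Hypothetical record-level data: one row per case.
df = pd.DataFrame({
    "group":   ["A"] * 5 + ["B"] * 6,
    "stratum": ["easy", "hard", "hard", "hard", "hard",
                "easy", "easy", "easy", "easy", "easy", "hard"],
    "success": [1, 1, 1, 0, 0,
                1, 1, 1, 1, 0, 0],
})

# 1. Overall rates: what a single-number dashboard would show.
overall = df.groupby("group")["success"].mean()

# 2. Stratified rates: the conditional comparison.
by_stratum = df.groupby(["group", "stratum"])["success"].mean().unstack()

# 3. Composition: how each group's cases spread across strata.
composition = pd.crosstab(df["group"], df["stratum"], normalize="index")

print(overall, by_stratum, composition, sep="\n\n")
```

In this toy data, A is better within each stratum, yet B looks better overall because A's cases are concentrated in the hard stratum; the composition table makes the imbalance visible at a glance.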

Data do not literally lie, but they can absolutely mislead when we use them carelessly.

That is why good data science requires more than coding and computation. It requires statistical thinking: judgment about data structure, context, and which comparison is scientifically meaningful.

11.9 Discussion

  1. Can Simpson’s paradox occur in regression or machine-learning settings, not just in tables? Explain.
  2. When should we report overall results, and when should we prioritize subgroup results?
  3. Find a modern data-science setting where a subgroup analysis would be essential before making a conclusion.