11 Case study: When the data seem to cheat — Simpson’s paradox
11.1 Background
One of the most dangerous moments in data science is when the data appear to tell a clean story too quickly.
You compute an overall average, make a chart, and the conclusion seems obvious. Then you stratify by an important variable, and the story reverses. What looked like evidence for one claim now looks like evidence against it.
This is Simpson’s paradox: an association seen in aggregated data can weaken, disappear, or reverse after accounting for a third variable.
For data science students, this is not just a cute paradox. It is a warning. Data can cheat in the sense that a superficial summary can hide the mechanism generating the data. Data can also appear contradictory when two analyses are both numerically correct but answer different questions.
Simpson’s paradox does not mean one table is wrong. It means the overall table and the stratified tables are answering different questions.
In this case study, we use two real examples from the literature:
an example based on developmental-services expenditures in California, and
a longitudinal-study example from South Africa involving medical-aid status and follow-up participation.
We then use simulation to show how Simpson’s paradox arises naturally from subgroup composition.
11.2 Learning goals
By the end of this case study, you should be able to:
explain Simpson’s paradox in plain language;
distinguish an overall association from a conditional association;
identify a lurking or confounding variable that changes the story;
use weighted averages to explain why the reversal happens;
recognize why responsible data science requires deep statistical thinking, not just computation.
11.3 Load packages
Show Python
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import chi2_contingency

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)
11.4 What is Simpson’s paradox?
Suppose group A looks better than group B within each subgroup. For example, maybe A has a higher success rate in every age band, every department, or every disease severity category.
You might think that A must also look better overall. But that need not happen. If A is overrepresented in harder cases and B is overrepresented in easier cases, the overall average can reverse direction.
So the paradox is not that mathematics is broken. The paradox is that aggregation changes the weights.
The lesson is simple:
The first summary you compute may be numerically correct and still scientifically misleading.
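A two-stratum toy example makes this concrete. The counts below are invented for illustration: method A beats method B in both the easy and the hard stratum, yet loses overall, because A handles mostly hard cases while B handles mostly easy ones.

```python
# Invented counts: (successes, trials) for each method in each stratum.
counts = {
    "A": {"easy": (9, 10), "hard": (72, 90)},   # A: 90% easy, 80% hard
    "B": {"easy": (80, 90), "hard": (7, 10)},   # B: ~89% easy, 70% hard
}

def rate(successes, trials):
    return successes / trials

# A wins within both strata...
for stratum in ["easy", "hard"]:
    a = rate(*counts["A"][stratum])
    b = rate(*counts["B"][stratum])
    print(f"{stratum}: A={a:.2f}  B={b:.2f}")

# ...but the pooled rates go the other way, because the two methods
# aggregate over very different mixes of easy and hard cases.
overall_A = rate(9 + 72, 10 + 90)   # 81/100 = 0.81
overall_B = rate(80 + 7, 90 + 10)   # 87/100 = 0.87
print(f"overall: A={overall_A:.2f}  B={overall_B:.2f}")
```

The arithmetic is unremarkable; what changes between the stratified and pooled views is only the weights attached to each stratum.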
11.5 Example 1: California expenditures and apparent ethnic discrimination
This example comes from a teaching case based on developmental-services expenditures in California. The paper shows that the overall average annual expenditures for Hispanic consumers were much lower than those for White non-Hispanic consumers, which could easily lead to a quick claim of discrimination. But after stratifying by age cohort, the pattern reversed in all but one cohort. The explanation was that the Hispanic population in the sample was much younger overall, and expenditures increase sharply with age.
This example shows two ideas at once:
the overall disparity is real as a descriptive summary, and
the overall disparity alone is not enough to explain why it appears.
Within five of the six age cohorts, the Hispanic average is actually higher than the White non-Hispanic average. The overall result told one story; the age-stratified analysis tells another.
This does not prove that discrimination is absent. It shows something narrower and very important: the overall mean by itself is not a valid basis for that conclusion.
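To make the plots below reproducible, here is one way to construct the two data frames they use. The cohort-level numbers are invented placeholders that mimic the qualitative pattern described above (Hispanic consumers concentrated in the younger cohorts, expenditures rising steeply with age, Hispanic means higher in five of six cohorts); only the two reported overall means are the paper's actual figures.

```python
import pandas as pd

# Illustrative cohort-level numbers only -- NOT the paper's data.
dds = pd.DataFrame(
    {
        "age_cohort": ["0-5", "6-12", "13-17", "18-21", "22-50", "51+"],
        "hispanic_n": [4000, 3500, 1500, 800, 500, 200],     # skews young
        "white_n": [800, 1200, 1000, 900, 2500, 1500],       # skews old
        "hispanic_mean": [1400, 2300, 3700, 10000, 40000, 53000],
        "white_mean": [1200, 2100, 3600, 9000, 41000, 52000],
    }
)

# Overall means as reported in the paper.
overall_table = pd.DataFrame(
    {
        "group": ["Hispanic", "White non-Hispanic"],
        "overall_mean_reported": [11066, 24698],
    }
)

# Sanity check: Hispanic means are higher in five of six cohorts...
within_hisp_higher = (dds["hispanic_mean"] > dds["white_mean"]).sum()
# ...yet the pooled (count-weighted) means reverse.
hisp_overall = (dds["hispanic_n"] * dds["hispanic_mean"]).sum() / dds["hispanic_n"].sum()
white_overall = (dds["white_n"] * dds["white_mean"]).sum() / dds["white_n"].sum()
print(within_hisp_higher, round(hisp_overall), round(white_overall))
```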
11.5.4 A better picture: slope chart of the reversal
This graph makes the paradox easier to see. The overall comparison goes one way, while almost every within-age comparison goes the other way.
Show Python
rows = []
for _, r in dds.iterrows():
    rows.append({"panel": r["age_cohort"], "x": "Hispanic", "y": r["hispanic_mean"]})
    rows.append({"panel": r["age_cohort"], "x": "White non-Hispanic", "y": r["white_mean"]})
for _, r in overall_table.iterrows():
    rows.append({"panel": "Overall", "x": r["group"], "y": r["overall_mean_reported"]})
slope_df = pd.DataFrame(rows)
panel_order = ["Overall", "0-5", "6-12", "13-17", "18-21", "22-50", "51+"]
slope_df["panel"] = pd.Categorical(slope_df["panel"], categories=panel_order, ordered=True)
fig = px.line(
    slope_df,
    x="x",
    y="y",
    color="panel",
    markers=True,
    line_group="panel",
    category_orders={"x": ["Hispanic", "White non-Hispanic"]},
    title="Each line compares the two groups: overall vs within age cohort",
    hover_data={"panel": True, "y": ":,.0f"},
)
fig.update_layout(xaxis_title="Group", yaxis_title="Average annual expenditures ($)")
fig.show()
11.5.5 Why does the reversal happen?
The key is that the two groups have very different age distributions.
Show Python
dds["hispanic_pct"] = dds["hispanic_n"] / dds["hispanic_n"].sum()
dds["white_pct"] = dds["white_n"] / dds["white_n"].sum()
dds_pct = dds.melt(
    id_vars="age_cohort",
    value_vars=["hispanic_pct", "white_pct"],
    var_name="group",
    value_name="proportion",
)
dds_pct["group"] = dds_pct["group"].map(
    {"hispanic_pct": "Hispanic", "white_pct": "White non-Hispanic"}
)
fig = px.bar(
    dds_pct,
    x="age_cohort",
    y="proportion",
    color="group",
    barmode="group",
    title="Age composition differs sharply between the two groups",
)
fig.update_layout(yaxis_title="Proportion within ethnicity")
fig.show()
Hispanic consumers are much more concentrated in the younger cohorts, while White non-Hispanic consumers are much more concentrated in the older cohorts. But expenditures rise steeply with age. So the overall mean is a weighted average, and the two groups use very different weights.
That is the core mechanism of Simpson’s paradox in this example: not different arithmetic, but different subgroup composition.
11.5.6 Weighted-average decomposition
This plot combines the two ingredients of the paradox:
bars show the weights,
lines show the within-age means.
The overall mean comes from multiplying those two pieces together.
Show Python
fig = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.12,
    subplot_titles=("Weights: proportion in each age cohort", "Within-age average expenditures"),
)
for grp, pct_col, line_col in [
    ("Hispanic", "hispanic_pct", "hispanic_mean"),
    ("White non-Hispanic", "white_pct", "white_mean"),
]:
    fig.add_trace(
        go.Bar(x=dds["age_cohort"], y=dds[pct_col], name=f"{grp} weight", legendgroup=grp),
        row=1,
        col=1,
    )
    fig.add_trace(
        go.Scatter(
            x=dds["age_cohort"],
            y=dds[line_col],
            mode="lines+markers",
            name=f"{grp} mean",
            legendgroup=grp,
        ),
        row=2,
        col=1,
    )
fig.update_layout(
    title="Why the overall means reverse",
    barmode="group",
    height=700,
)
fig.update_yaxes(title_text="Proportion", row=1, col=1)
fig.update_yaxes(title_text="Average annual expenditures ($)", row=2, col=1)
fig.show()
You can verify the weighted-average explanation numerically.
Show Python

# Recompute each group's mean under a common set of age weights; here we
# use the combined age distribution of both groups as the common weights.
common_w = (dds["hispanic_n"] + dds["white_n"]) / (dds["hispanic_n"] + dds["white_n"]).sum()
same_weight_hisp = (common_w * dds["hispanic_mean"]).sum()
same_weight_white = (common_w * dds["white_mean"]).sum()
compare_weights = pd.DataFrame(
    {
        "group": ["Hispanic", "White non-Hispanic", "Hispanic", "White non-Hispanic"],
        "mean": [11066, 24698, same_weight_hisp, same_weight_white],
        "scenario": ["Original overall", "Original overall", "Same age weights", "Same age weights"],
    }
)
fig = px.bar(
    compare_weights,
    x="group",
    y="mean",
    color="scenario",
    barmode="group",
    text="mean",
    title="What happens if both groups use the same age distribution?",
)
fig.update_traces(texttemplate="$%{text:,.0f}", textposition="outside")
fig.update_layout(yaxis_title="Average annual expenditures ($)")
fig.show()
So the contradiction produced by Simpson’s paradox is not magic: it comes from changing the weights.
11.6 Example 2: A longitudinal study in South Africa
In the Birth to Ten study, the overall proportion whose mothers had medical aid was lower in the five-year cohort than among children not traced. But once race was taken into account, the direction reversed slightly within both the white and black groups.
11.6.1 Build the contingency tables from the paper
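The code below rebuilds the two summaries used in the plots that follow. The rates are the ones reported in the paper; the table layout and the `overall_sa` / `race_sa` names are this chapter's own construction.

```python
import pandas as pd

# Overall medical-aid rates reported in the paper.
overall_sa = pd.DataFrame(
    {
        "cohort": ["Five-year cohort", "Not traced"],
        "aid_rate": [0.111, 0.166],
    }
)

# The same comparison after stratifying by race.
race_sa = pd.DataFrame(
    {
        "race": ["White", "White", "Black", "Black"],
        "cohort": ["Five-year cohort", "Not traced"] * 2,
        "aid_rate": [0.833, 0.825, 0.089, 0.087],
    }
)
print(overall_sa)
print(race_sa)
```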
Show Python

fig = px.bar(
    overall_sa,
    x="cohort",
    y="aid_rate",
    text=overall_sa["aid_rate"].map(lambda x: f"{100*x:.1f}%"),
    title="Overall medical-aid rate",
)
fig.update_traces(textposition="outside")
fig.update_layout(yaxis_title="Proportion with medical aid")
fig.show()
Overall, the five-year group looks worse: about 11.1% versus 16.6%.
11.6.3 Conditional on race
Show Python
fig = px.bar(
    race_sa,
    x="race",
    y="aid_rate",
    color="cohort",
    barmode="group",
    text=race_sa["aid_rate"].map(lambda x: f"{100*x:.1f}%"),
    title="Medical-aid rate after stratifying by race",
)
fig.update_traces(textposition="outside")
fig.update_layout(yaxis_title="Proportion with medical aid")
fig.show()
Within the white group, the five-year group is slightly higher: 83.3% vs 82.5%.
Within the black group, the five-year group is also slightly higher: 8.9% vs 8.7%.
So the overall association goes in the opposite direction from the within-race comparisons.
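We can back out how different the racial composition of the two cohorts must be. Treating each overall rate as a weighted average of the two within-race rates, overall = p · white_rate + (1 − p) · black_rate, and solving for the white proportion p:

```python
def implied_white_share(overall, white_rate, black_rate):
    # Solve overall = p * white_rate + (1 - p) * black_rate for p.
    return (overall - black_rate) / (white_rate - black_rate)

p_five = implied_white_share(0.111, 0.833, 0.089)   # ~0.03
p_lost = implied_white_share(0.166, 0.825, 0.087)   # ~0.11
print(f"five-year cohort: {p_five:.1%} white;  not traced: {p_lost:.1%} white")
```

The five-year cohort contains a far smaller share of white children, and white children had much higher medical-aid rates, so the overall comparison is dominated by composition rather than by the within-race differences.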
The paper reports an overall significant difference but no meaningful within-race difference. We can reproduce the corresponding chi-square tests from the tabulated counts.
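A sketch of that computation with `scipy.stats.chi2_contingency` follows. The counts below are hypothetical placeholders (the paper's tabulated counts are not reproduced here), so the printed p-value illustrates the workflow rather than the paper's result; substitute the actual counts to reproduce its tests.

```python
import numpy as np
from scipy.stats import chi2_contingency

def aid_test(table):
    """Chi-square test on a 2x2 table whose rows are the two cohorts and
    whose columns are [with medical aid, without medical aid]."""
    chi2, p, dof, expected = chi2_contingency(np.array(table))
    return chi2, p

# Hypothetical counts for illustration only (rates 11.1% vs 16.6%).
overall_counts = [[111, 889], [166, 834]]   # five-year cohort vs not traced
chi2, p = aid_test(overall_counts)
print(f"overall: chi2={chi2:.2f}, p={p:.4f}")
```

Running the same function on each within-race 2×2 table completes the comparison between the overall and conditional tests.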
Again, the message is not “trust p-values” or “ignore p-values.” The message is that what you condition on matters.
It is also worth emphasizing that a statistically significant overall table can coexist with weak or negligible within-subgroup differences.
11.7 Simulation
The two real examples above are convincing because they come from real studies. But it is also useful to see that Simpson’s paradox is neither rare nor mysterious. It can arise whenever:
the subgroup outcome levels differ a lot; and
the group composition across subgroups differs enough.
11.7.1 A simple synthetic example
Suppose we compare two methods, A and B, over two difficulty levels: easy and hard.
Within both difficulty levels, A is better.
But A is used much more often on hard cases, while B is used much more often on easy cases.
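A minimal version of such a simulation is sketched below. The success probabilities, case counts, and allocation ranges are our own assumptions, chosen so that A is genuinely better in both strata while the case mix varies from dataset to dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_sims, n_cases = 5000, 400           # assumed simulation sizes
# True success probabilities: A is better in BOTH strata.
p_easy_A, p_easy_B = 0.92, 0.88
p_hard_A, p_hard_B = 0.55, 0.50

# Random case allocations: how many of each method's cases are easy.
A_easy = rng.integers(40, 360, n_sims)
B_easy = rng.integers(40, 360, n_sims)
A_hard, B_hard = n_cases - A_easy, n_cases - B_easy

# Realized successes per stratum.
a_se = rng.binomial(A_easy, p_easy_A)
a_sh = rng.binomial(A_hard, p_hard_A)
b_se = rng.binomial(B_easy, p_easy_B)
b_sh = rng.binomial(B_hard, p_hard_B)

sim_df = pd.DataFrame(
    {"A_easy": A_easy, "A_hard": A_hard, "B_easy": B_easy, "B_hard": B_hard}
)
sim_df["A_overall"] = (a_se + a_sh) / n_cases
sim_df["B_overall"] = (b_se + b_sh) / n_cases
sim_df["overall_diff_B_minus_A"] = sim_df["B_overall"] - sim_df["A_overall"]

# Reversal: A wins in BOTH realized strata yet loses overall.
a_wins_strata = (a_se / A_easy > b_se / B_easy) & (a_sh / A_hard > b_sh / B_hard)
sim_df["reversal_label"] = np.where(
    a_wins_strata & (sim_df["overall_diff_B_minus_A"] > 0), "reversal", "no reversal"
)
print(sim_df["reversal_label"].eq("reversal").mean())
```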
A useful summary statistic is the fraction of simulated datasets in which method A is better in both strata but still loses overall.
Show Python
fig = px.histogram(
    sim_df,
    x="overall_diff_B_minus_A",
    color="reversal_label",
    nbins=50,
    title="Distribution of overall difference: B - A",
)
fig.add_vline(x=0, line_dash="dash")
fig.update_layout(xaxis_title="Overall success-rate difference")
fig.show()
Show Python
fig = px.scatter(
    sim_df,
    x="A_easy",
    y="B_easy",
    color="reversal_label",
    opacity=0.45,
    title="Different subgroup compositions drive the paradox",
    hover_data=["A_hard", "B_hard", "A_overall", "B_overall"],
)
fig.update_layout(xaxis_title="Easy cases assigned to A", yaxis_title="Easy cases assigned to B")
fig.show()
11.7.3 Another simulation example in regression settings
Simpson’s paradox also appears in regression settings. In the toy GDP example below, each development group has a negative within-group relationship between country size and GDP per capita, but the pooled data look positive because the most populous countries are concentrated in the higher-GDP group.
Population weights matter because the most populous countries contribute more to an aggregated summary. When the high-population observations sit mostly in one group, the pooled regression can flip even though both group-specific trends are negative.
The pooled line is positive because the larger countries also tend to belong to the higher-GDP region. Within each region, however, the fitted line slopes downward.
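A minimal simulation of that pattern (all numbers invented) shows the slope flip directly: each group has a built-in slope of −1, yet the pooled fit is positive because the high-GDP group also contains the high-population observations.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_slope(x, y):
    # Ordinary least-squares slope of y on x.
    return np.polyfit(x, y, 1)[0]

# Two "development groups": within each, GDP per capita falls with
# (log) population, but the high-GDP group sits at larger populations.
n = 200
pop_low = rng.uniform(1, 4, n)                            # smaller countries
gdp_low = 5 - 1.0 * pop_low + rng.normal(0, 0.3, n)       # low-GDP group
pop_high = rng.uniform(4, 7, n)                           # more populous countries
gdp_high = 14 - 1.0 * pop_high + rng.normal(0, 0.3, n)    # high-GDP group

slope_low = fit_slope(pop_low, gdp_low)      # negative within group
slope_high = fit_slope(pop_high, gdp_high)   # negative within group
slope_pooled = fit_slope(
    np.concatenate([pop_low, pop_high]),
    np.concatenate([gdp_low, gdp_high]),
)                                            # positive when pooled
print(slope_low, slope_high, slope_pooled)
```

The between-group contrast (higher population and higher GDP in the same group) overwhelms the within-group trends once the two clouds are pooled, which is exactly the tabular paradox restated in regression form.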
11.8 Why this matters in data science
Simpson’s paradox is an important lesson for data science students because it captures a core danger of real-world analysis:
dashboards aggregate;
machine-learning pipelines summarize;
business reports compress complexity;
and people love a single number.
But the “single number” can be deeply misleading when important subgroups are mixed together.
A few places where this matters:
admissions data by department,
treatment comparisons by disease severity,
hiring or promotion outcomes by role or unit,
educational outcomes by school and student background,
model-performance comparisons by subgroup difficulty.
The paradox teaches intellectual discipline:
Ask what variables were aggregated over.
Check whether subgroup composition differs across the groups being compared.
Separate descriptive truth from causal interpretation.
Do not call something discrimination, bias, or superiority until you understand the structure of the data.
Data do not literally lie, but they can absolutely mislead when we use them carelessly.
That is why good data science requires more than coding and computation. It requires statistical thinking: judgment about structure, about context, and about which comparison is scientifically meaningful.
11.9 Discussion
Can Simpson’s paradox occur in regression or machine-learning settings, not just in tables? Explain.
When should we report overall results, and when should we prioritize subgroup results?
Find a modern data-science setting where a subgroup analysis would be essential before making a conclusion.