import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import ttest_ind, norm
from statsmodels.nonparametric.smoothers_lowess import lowess
from statsmodels.stats.multitest import multipletests
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)13 Case study: Multiple testing for differential gene expression
13.1 Background
In a microarray study, we do not test one hypothesis. We test thousands of them at once.
That changes the statistical problem in an important way. If we test a single gene at level \(0.05\), then a false positive is relatively rare. But if we test thousands of genes, then even when no genes truly differ, we should still expect many small p-values just by chance.
This case study uses a classic microarray example from Callow et al. (2000), where the goal is to identify genes with altered expression in HDL-deficient mice. It is a useful teaching example because it combines scientific interpretation with a core statistical difficulty: how to detect genuine signals while controlling false discoveries across thousands of simultaneous tests.
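The point about small p-values arising by chance alone can be seen directly in a quick simulation. The sketch below (illustrative only, not part of the Callow data) tests 5000 "genes" with no true differences and counts how many p-values still fall below 0.05:

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulate a complete null: 5000 "genes", two groups of 8, no true differences.
rng = np.random.default_rng(0)
g1 = rng.normal(0, 1, size=(5000, 8))
g2 = rng.normal(0, 1, size=(5000, 8))
p = ttest_ind(g1, g2, axis=1, equal_var=False).pvalue

# Even with zero real signal, roughly 5% of p-values land below 0.05,
# i.e. on the order of 250 "discoveries" that are all false.
print((p < 0.05).sum())
```

This is exactly the phenomenon the rest of the case study is built around.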
13.2 Objective
The purpose of this case study is to use microarray data to learn four closely related ideas:
- how to construct gene-level test statistics,
- why multiple testing creates a serious Type I error problem,
- how stricter thresholds can reduce false positives but increase false negatives,
- how to estimate the false discovery rate (FDR) from permutations.
The main workflow is:
- compute log-ratios and log-intensities,
- apply block-wise loess normalization,
- calculate Welch two-sample t-statistics,
- examine the consequences of multiple testing,
- estimate the false discovery rate (FDR) by permutation.
13.3 Data & Resources
- Callow-data1.txt: raw two-color microarray intensities for 6384 spots across 16 arrays
- Spot-location1.txt: printing-pin block locations for the spots
- Callow et al 2000.pdf: paper describing the scientific setting and the multiple-testing framework
13.4 Why this case study matters
This is a very good data-science example because the challenge is not just computing p-values. The real challenge is decision-making under many simultaneous comparisons.
At the level of one gene:
- A Type I error means calling a gene differentially expressed when it is not.
- A Type II error means missing a gene that truly is differentially expressed.
At the level of thousands of genes:
- Controlling the chance of any false positive can become too strict.
- Ignoring multiplicity can produce too many false discoveries.
- The false discovery rate provides a useful compromise: among the genes we call significant, what fraction do we expect to be false positives?
That is exactly the issue this case study is designed to illustrate.
13.5 Load packages and data
path_data = "Callow-data1.txt"
path_spot = "Spot-location1.txt"
raw = pd.read_csv(path_data, sep="\t")
spot = pd.read_csv(path_spot, sep="\t")
print("raw shape:", raw.shape)
print("spot shape:", spot.shape)
raw.head()
raw shape: (6384, 33)
spot shape: (6384, 5)
| | row.names | c1G | c1R | c2G | c2R | c3G | c3R | c4G | c4R | c5G | c5R | c6G | c6R | c7G | c7R | c8G | c8R | k1G | k1R | k2G | k2R | k3G | k3R | k4G | k4R | k5G | k5R | k6G | k6R | k7G | k7R | k8G | k8R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5592.58 | 2765.58 | 4749.89 | 1768.22 | 3709.31 | 1440.54 | 2666.13 | 763.06 | 3535.22 | 2027.94 | 2499.53 | 864.05 | 2445.94 | 958.68 | 1631.58 | 644.58 | 580.80 | 747.11 | 2318.73 | 1388.79 | 1441.88 | 1588.76 | 2055.08 | 1280.17 | 2437.16 | 1881.72 | 2269.43 | 1733.53 | 1670.33 | 1170.84 | 2882.78 | 1512.45 |
| 1 | 2 | 4746.38 | 2868.43 | 3088.12 | 2277.18 | 2259.42 | 1599.92 | 1518.12 | 1238.33 | 1647.30 | 1513.43 | 1658.00 | 1079.33 | 1386.56 | 1228.66 | 918.27 | 757.33 | 1179.10 | 1930.25 | 1704.40 | 2093.00 | 835.86 | 1369.81 | 979.33 | 1071.17 | 2028.75 | 3218.58 | 1294.91 | 2451.04 | 1137.45 | 1605.00 | 1652.50 | 1700.82 |
| 2 | 3 | 2108.48 | 1236.32 | 3669.53 | 1546.84 | 7774.40 | 2639.45 | 2514.81 | 999.48 | 2815.00 | 3689.67 | 3345.40 | 1505.20 | 1720.96 | 785.10 | 1387.41 | 994.86 | 734.24 | 753.52 | 2165.77 | 1300.00 | 2761.36 | 1301.61 | 3454.85 | 3292.26 | 2525.40 | 1149.23 | 2694.90 | 3424.30 | 2524.82 | 1901.06 | 2172.53 | 2200.82 |
| 3 | 4 | 548.46 | 383.62 | 708.16 | 532.50 | 462.67 | 323.55 | 680.06 | 585.14 | 210.58 | 250.74 | 633.78 | 566.58 | 548.89 | 409.18 | 425.00 | 417.79 | 582.64 | 829.82 | 334.90 | 402.84 | 639.40 | 513.91 | 493.01 | 459.69 | 536.83 | 391.09 | 552.51 | 601.00 | 540.60 | 438.03 | 538.80 | 507.25 |
| 4 | 5 | 856.48 | 377.36 | 715.64 | 525.44 | 347.27 | 280.67 | 721.93 | 553.43 | 297.49 | 190.80 | 417.00 | 461.89 | 333.50 | 225.63 | 478.59 | 465.65 | 511.31 | 634.80 | 307.33 | 390.93 | 425.76 | 355.76 | 362.12 | 318.87 | 380.99 | 325.59 | 422.22 | 323.65 | 298.20 | 343.06 | 874.53 | 602.28 |
13.6 Understand the experimental layout
The 16 arrays are divided into two groups of 8:
- c1 to c8: control arrays
- k1 to k8: knockout arrays
For each array, the data contain two fluorescence channels, one green and one red. The code constructs
- a log-ratio \[M = \log_2(R/G),\]
- and a log-intensity \[A = \frac{1}{2}\log_2(RG).\]
The spot-location file provides the printing-pin blocks used for block-wise normalization.
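The two transforms are easy to check by hand on a single spot. The values below are made up for illustration (they are not taken from the data):

```python
import numpy as np

# Toy check of the M and A transforms for one spot with red R = 800, green G = 200.
R, G = 800.0, 200.0
M = np.log2(R / G)        # log-ratio: here log2(4) = 2.0
A = 0.5 * np.log2(R * G)  # log-intensity: half the log2 of the product

print(M)            # 2.0, i.e. the red channel is 2^2 = 4 times the green
print(round(A, 3))  # 8.644
```

A positive M means the red channel dominates; A measures overall brightness, which is why the M-A plots below put A on the x-axis.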
col_names = raw.columns.tolist()
print(col_names[:12])
print("...")
print(col_names[16:24])
block1 = spot.iloc[:, 1].to_numpy()
block2 = spot.iloc[:, 2].to_numpy()
print("Unique block1 values:", np.unique(block1))
print("Unique block2 values:", np.unique(block2))
print("Number of spots:", len(block1))['row.names', 'c1G', 'c1R', 'c2G', 'c2R', 'c3G', 'c3R', 'c4G', 'c4R', 'c5G', 'c5R', 'c6G']
...
['c8R', 'k1G', 'k1R', 'k2G', 'k2R', 'k3G', 'k3R', 'k4G']
Unique block1 values: [1 2 3 4]
Unique block2 values: [1 2 3 4]
Number of spots: 6384
13.7 Preprocess the raw intensities
For invalid intensities, we use a simple preprocessing rule:
- if either channel is nonpositive, set the log-ratio to 0,
- and set the log-intensity to 8.
This keeps the transformed values well-defined and lets us focus on the multiple-testing ideas rather than on missing-value edge cases.
n = 8
G = raw.shape[0]
datm = raw.iloc[:, 1:33].to_numpy(dtype=float)
def build_log_matrices(arr):
logr = np.zeros((G, n))
logi = np.zeros((G, n))
for i in range(n):
g = arr[:, 2 * i]
r = arr[:, 2 * i + 1]
ok = (g > 0) & (r > 0)
logr[:, i] = 0.0
logi[:, i] = 8.0
logr[ok, i] = np.log2(r[ok] / g[ok])
logi[ok, i] = 0.5 * np.log2(r[ok] * g[ok])
return logr, logi
logr1, logi1 = build_log_matrices(datm[:, :16])
logr2, logi2 = build_log_matrices(datm[:, 16:32])
print(logr1.shape, logi1.shape, logr2.shape, logi2.shape)
(6384, 8) (6384, 8) (6384, 8) (6384, 8)
13.8 Explore the raw M-A structure
Before normalization, we should check whether the log-ratios depend on intensity. If they do, then naive testing would be distorted by systematic bias.
13.8.1 M-A plots for the first control and first knockout arrays
plot_df = pd.DataFrame({
"A_control": logi1[:, 0],
"M_control": logr1[:, 0],
"A_knockout": logi2[:, 0],
"M_knockout": logr2[:, 0],
})
fig = make_subplots(rows=1, cols=2, subplot_titles=("Control array 1", "Knockout array 1"))
fig.add_trace(
go.Scattergl(x=plot_df["A_control"], y=plot_df["M_control"], mode="markers", marker=dict(size=3), opacity=0.35, showlegend=False),
row=1, col=1,
)
fig.add_trace(
go.Scattergl(x=plot_df["A_knockout"], y=plot_df["M_knockout"], mode="markers", marker=dict(size=3), opacity=0.35, showlegend=False),
row=1, col=2,
)
fig.update_xaxes(title_text="log intensity", row=1, col=1)
fig.update_xaxes(title_text="log intensity", row=1, col=2)
fig.update_yaxes(title_text="log ratio", row=1, col=1)
fig.update_yaxes(title_text="log ratio", row=1, col=2)
fig.update_layout(title="Raw M-A plots", height=450)
fig.show()
13.8.2 Block-specific structure for one array
We normalize within printing-pin blocks, not globally. That already tells us something important: the technical bias may vary by block.
block_panels = []
for i in range(1, 5):
for j in range(1, 5):
bk = (block1 == i) & (block2 == j)
block_panels.append(
pd.DataFrame({
"A": logi1[bk, 0],
"M": logr1[bk, 0],
"block": f"({i},{j})"
})
)
block_df = pd.concat(block_panels, ignore_index=True)
fig = px.scatter(
block_df,
x="A",
y="M",
facet_col="block",
facet_col_wrap=4,
opacity=0.35,
title="Raw M-A plots by printing-pin block for control array 1",
)
fig.update_traces(marker=dict(size=3))
fig.update_layout(height=900)
fig.show()
13.9 Block-wise loess normalization
Now we carry out the main normalization step. For each array and each printing-pin block, we fit a robust loess curve of log-ratio on log-intensity, then subtract the fitted curve.
def blockwise_lowess_normalize(logr, logi, block1, block2, frac=0.2, it=3):
ndat = logr.copy()
fitted = np.zeros_like(logr)
for k in range(logr.shape[1]):
for i in range(1, 5):
for j in range(1, 5):
bk = (block1 == i) & (block2 == j)
x = logi[bk, k]
y = logr[bk, k]
yfit = lowess(y, x, frac=frac, it=it, return_sorted=False)
fitted[bk, k] = yfit
ndat[bk, k] = y - yfit
return ndat, fitted
ndat1, ncurv1 = blockwise_lowess_normalize(logr1, logi1, block1, block2)
ndat2, ncurv2 = blockwise_lowess_normalize(logr2, logi2, block1, block2)
print(ndat1.shape, ndat2.shape)
(6384, 8) (6384, 8)
13.9.1 Before and after normalization
bk = (block1 == 1) & (block2 == 1)
curve_df = pd.DataFrame({
"A": logi1[bk, 0],
"raw_M": logr1[bk, 0],
"fit": ncurv1[bk, 0],
"normalized_M": ndat1[bk, 0],
}).sort_values("A")
fig = make_subplots(rows=1, cols=2, subplot_titles=("Raw with loess fit", "After normalization"))
fig.add_trace(
go.Scattergl(x=curve_df["A"], y=curve_df["raw_M"], mode="markers", marker=dict(size=4), opacity=0.4, name="raw"),
row=1, col=1,
)
fig.add_trace(
go.Scatter(x=curve_df["A"], y=curve_df["fit"], mode="lines", name="loess fit"),
row=1, col=1,
)
fig.add_trace(
go.Scattergl(x=curve_df["A"], y=curve_df["normalized_M"], mode="markers", marker=dict(size=4), opacity=0.4, showlegend=False),
row=1, col=2,
)
fig.update_xaxes(title_text="log intensity", row=1, col=1)
fig.update_xaxes(title_text="log intensity", row=1, col=2)
fig.update_yaxes(title_text="log ratio", row=1, col=1)
fig.update_yaxes(title_text="normalized log ratio", row=1, col=2)
fig.update_layout(title="Effect of block-wise loess normalization", height=450)
fig.show()
13.10 Gene-wise Welch t-tests
After normalization, we compare the two groups gene by gene using Welch’s two-sample t-test, which allows unequal variances across the two groups.
mean_control = ndat1.mean(axis=1)
mean_knockout = ndat2.mean(axis=1)
mean_diff = mean_control - mean_knockout
welch = ttest_ind(ndat1, ndat2, axis=1, equal_var=False)
t_stat = welch.statistic
p_value = welch.pvalue
results = pd.DataFrame({
"gene_id": np.arange(1, G + 1),
"mean_control": mean_control,
"mean_knockout": mean_knockout,
"mean_diff": mean_diff,
"t_stat": t_stat,
"p_value": p_value,
"avg_intensity": np.concatenate([logi1, logi2], axis=1).mean(axis=1),
})
results.nsmallest(10, "p_value")
| | gene_id | mean_control | mean_knockout | mean_diff | t_stat | p_value | avg_intensity |
|---|---|---|---|---|---|---|---|
| 2148 | 2149 | 0.131644 | -3.054269 | 3.185913 | 23.141263 | 5.249463e-10 | 12.468029 |
| 4138 | 4139 | -0.249742 | -1.284847 | 1.035105 | 13.086721 | 6.315123e-09 | 12.605722 |
| 5355 | 5356 | 0.024074 | -1.821041 | 1.845115 | 11.617349 | 1.423105e-08 | 12.926597 |
| 539 | 540 | 0.188983 | -2.868751 | 3.057734 | 11.901011 | 1.336281e-06 | 12.280637 |
| 1738 | 1739 | -0.171758 | -1.086553 | 0.914794 | 10.163451 | 1.643646e-06 | 13.737438 |
| 1495 | 1496 | -0.173019 | -1.140107 | 0.967088 | 9.396916 | 1.743839e-06 | 12.228907 |
| 2536 | 2537 | -0.126446 | -1.133750 | 1.007304 | 9.569900 | 3.485925e-06 | 13.630117 |
| 4940 | 4941 | -0.422332 | -1.363049 | 0.940717 | 7.223691 | 1.177628e-05 | 13.287497 |
| 953 | 954 | 0.018889 | -0.268728 | 0.287617 | 4.374211 | 7.895915e-04 | 12.144535 |
| 946 | 947 | -0.023616 | -0.605393 | 0.581778 | 4.361786 | 9.000834e-04 | 10.542906 |
13.11 Visualize the testing results
13.11.1 Volcano plot
results["neglog10_p"] = -np.log10(results["p_value"])
fig = px.scatter(
results,
x="mean_diff",
y="neglog10_p",
hover_data=["gene_id", "t_stat", "p_value"],
opacity=0.45,
title="Volcano plot after normalization",
)
fig.show()
13.11.2 Normal Q-Q plot of t-statistics
The paper uses a normal Q-Q plot to see whether the extreme t-statistics depart from what we would expect under a mostly-null distribution.
t_sorted = np.sort(t_stat)
probs = (np.arange(1, G + 1) - 0.5) / G
theoretical = norm.ppf(probs)
qq_df = pd.DataFrame({"theoretical": theoretical, "observed": t_sorted})
fig = px.scatter(
qq_df,
x="theoretical",
y="observed",
opacity=0.45,
title="Normal Q-Q plot of gene-wise t-statistics",
)
fig.add_trace(
go.Scatter(
x=[qq_df["theoretical"].min(), qq_df["theoretical"].max()],
y=[qq_df["theoretical"].min(), qq_df["theoretical"].max()],
mode="lines",
name="45-degree line",
)
)
fig.show()
13.11.3 t-statistics versus intensity
The paper also argues that large t-statistics do not simply correspond to low-intensity artifacts. This plot helps students see that issue directly.
fig = px.scatter(
results,
x="avg_intensity",
y="t_stat",
hover_data=["gene_id", "p_value"],
opacity=0.4,
title="Gene-wise t-statistics versus average intensity",
)
fig.show()
13.12 Why multiple testing changes the problem
Suppose, for a moment, that all 6384 genes were null.
Then even under complete null truth, we would expect approximately
- \(6384 \times 0.05 \approx 319\) p-values below 0.05,
- \(6384 \times 0.01 \approx 64\) p-values below 0.01,
- \(6384 \times 0.001 \approx 6.4\) p-values below 0.001.
That simple calculation is often the first moment students realize that “p < 0.05” is not remotely enough in a high-dimensional setting.
alpha_grid = np.array([0.05, 0.01, 0.001, 0.0001])
expected_false_positives = G * alpha_grid
observed_hits = np.array([(p_value < a).sum() for a in alpha_grid])
mult_df = pd.DataFrame({
"alpha": alpha_grid,
"expected_false_positives_under_global_null": expected_false_positives,
"observed_hits": observed_hits,
})
mult_df
| | alpha | expected_false_positives_under_global_null | observed_hits |
|---|---|---|---|
| 0 | 0.0500 | 319.2000 | 381 |
| 1 | 0.0100 | 63.8400 | 78 |
| 2 | 0.0010 | 6.3840 | 10 |
| 3 | 0.0001 | 0.6384 | 8 |
mult_long = mult_df.melt(id_vars="alpha", var_name="quantity", value_name="count")
fig = px.bar(
mult_long,
x="alpha",
y="count",
color="quantity",
barmode="group",
title="Observed discoveries versus what we would expect under a complete null",
)
fig.update_xaxes(type="category")
fig.show()
13.13 Type I and Type II errors: a quick simulation
In the real dataset, we do not know the truth for each gene. So to illustrate the Type I / Type II trade-off clearly, we create a small simulation where the truth is known.
This is not part of the original paper. It is included here because it makes the error concepts concrete.
rng = np.random.default_rng(3255)
m = 2000
n_per_group = 8
n_nonnull = 80
true_effect = np.zeros(m)
true_effect[:n_nonnull] = 1.0
rng.shuffle(true_effect)
x1 = rng.normal(0, 1, size=(m, n_per_group))
x2 = rng.normal(true_effect[:, None], 1, size=(m, n_per_group))
sim_test = ttest_ind(x1, x2, axis=1, equal_var=False)
sim_p = sim_test.pvalue
thresholds = np.array([0.05, 0.01, 0.001, 0.0001])
rows = []
for th in thresholds:
reject = sim_p < th
type1 = np.sum(reject & (true_effect == 0))
type2 = np.sum((~reject) & (true_effect != 0))
rows.append({
"threshold": th,
"type_I_errors": type1,
"type_II_errors": type2,
"discoveries": reject.sum(),
})
sim_summary = pd.DataFrame(rows)
sim_summary
| | threshold | type_I_errors | type_II_errors | discoveries |
|---|---|---|---|---|
| 0 | 0.0500 | 92 | 46 | 126 |
| 1 | 0.0100 | 15 | 63 | 32 |
| 2 | 0.0010 | 2 | 77 | 5 |
| 3 | 0.0001 | 0 | 80 | 0 |
sim_long = sim_summary.melt(id_vars="threshold", value_vars=["type_I_errors", "type_II_errors"],
var_name="error_type", value_name="count")
fig = px.line(
sim_long,
x="threshold",
y="count",
color="error_type",
markers=True,
title="In a simple simulation, stricter thresholds reduce Type I errors but increase Type II errors",
)
fig.update_xaxes(type="category")
fig.show()
The basic lesson is simple but important:
- making the threshold stricter usually reduces false positives,
- but it also makes it easier to miss real signals.
That is one reason FDR-based thinking is so useful.
13.14 Classical multiplicity adjustments
Before moving to permutation-based FDR estimation, it is useful to compare two standard multiplicity adjustments available in many software packages:
- Bonferroni, which is very strict,
- Benjamini-Hochberg (BH), which controls the false discovery rate under standard assumptions.
reject_bonf, p_bonf, _, _ = multipletests(p_value, alpha=0.05, method="bonferroni")
reject_bh, p_bh, _, _ = multipletests(p_value, alpha=0.05, method="fdr_bh")
summary_adjust = pd.DataFrame({
"method": ["raw p < 0.0001", "Bonferroni 0.05", "BH FDR 0.05"],
"discoveries": [int((p_value < 1e-4).sum()), int(reject_bonf.sum()), int(reject_bh.sum())],
})
summary_adjust
| | method | discoveries |
|---|---|---|
| 0 | raw p < 0.0001 | 8 |
| 1 | Bonferroni 0.05 | 7 |
| 2 | BH FDR 0.05 | 8 |
fig = px.bar(summary_adjust, x="method", y="discoveries", title="How many genes are called significant under different decision rules")
fig.show()results["p_bonf"] = p_bonf
results["p_bh"] = p_bh
results.nsmallest(10, "p_value")[["gene_id", "mean_diff", "t_stat", "p_value", "p_bonf", "p_bh"]]
| | gene_id | mean_diff | t_stat | p_value | p_bonf | p_bh |
|---|---|---|---|---|---|---|
| 2148 | 2149 | 3.185913 | 23.141263 | 5.249463e-10 | 0.000003 | 0.000003 |
| 4138 | 4139 | 1.035105 | 13.086721 | 6.315123e-09 | 0.000040 | 0.000020 |
| 5355 | 5356 | 1.845115 | 11.617349 | 1.423105e-08 | 0.000091 | 0.000030 |
| 539 | 540 | 3.057734 | 11.901011 | 1.336281e-06 | 0.008531 | 0.001855 |
| 1738 | 1739 | 0.914794 | 10.163451 | 1.643646e-06 | 0.010493 | 0.001855 |
| 1495 | 1496 | 0.967088 | 9.396916 | 1.743839e-06 | 0.011133 | 0.001855 |
| 2536 | 2537 | 1.007304 | 9.569900 | 3.485925e-06 | 0.022254 | 0.003179 |
| 4940 | 4941 | 0.940717 | 7.223691 | 1.177628e-05 | 0.075180 | 0.009397 |
| 953 | 954 | 0.287617 | 4.374211 | 7.895915e-04 | 1.000000 | 0.560084 |
| 946 | 947 | 0.581778 | 4.361786 | 9.000834e-04 | 1.000000 | 0.574613 |
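To demystify what `multipletests(..., method="fdr_bh")` computes, here is a minimal sketch of the BH step-up adjustment on a small set of made-up p-values (the values are for illustration only):

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values: sort p, multiply the i-th
    smallest by m/i, enforce monotonicity from the largest downward,
    cap at 1, and map back to the original order."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Running minimum from the largest p-value down keeps adjusted
    # p-values non-decreasing in the original p-values.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(ranked, 1.0)
    return adj

# All four adjusted values become 0.04: the largest raw p-value, 0.04,
# already satisfies 0.04 <= (4/4) * 0.04, and the running minimum
# propagates it downward.
print(bh_adjust([0.01, 0.04, 0.03, 0.02]))
```

Rejecting the genes whose adjusted p-value is below 0.05 is equivalent to the usual BH step-up rule at FDR level 0.05.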
13.15 Permutation-based FDR estimation
Now we turn to the main teaching goal of this case study: estimating how many of our apparent discoveries may be false.
One intuitive approach is to estimate FDR by permutation:
- combine the normalized data from both groups,
- randomly permute the 16 group labels,
- recompute all gene-wise p-values,
- count how many genes fall below a chosen threshold,
- average these counts over many permutations,
- divide by the number of observed discoveries at that threshold.
This yields an empirical estimate of
\[ \mathrm{FDR}(c) \approx \frac{E[V(c)]}{R(c)}, \]
where \(V(c)\) is the number of false discoveries under threshold \(c\), and \(R(c)\) is the number of observed discoveries.
rng = np.random.default_rng(123)
ndat12 = np.concatenate([ndat1, ndat2], axis=1)
thresholds = np.array([0.01, 0.001, 0.0001])
perm_no = 50
perm_counts = np.zeros((perm_no, len(thresholds)), dtype=int)
for b in range(perm_no):
ind = rng.permutation(ndat12.shape[1])
fake1 = ndat12[:, ind[:8]]
fake2 = ndat12[:, ind[8:]]
p_perm = ttest_ind(fake1, fake2, axis=1, equal_var=False).pvalue
perm_counts[b, :] = [(p_perm < th).sum() for th in thresholds]
observed_counts = np.array([(p_value < th).sum() for th in thresholds])
expected_false = perm_counts.mean(axis=0)
empirical_fdr = np.divide(
expected_false,
observed_counts,
out=np.full_like(expected_false, np.nan, dtype=float),
where=observed_counts > 0,
)
perm_summary = pd.DataFrame({
"threshold": thresholds,
"observed_discoveries": observed_counts,
"expected_false_discoveries": expected_false,
"estimated_fdr": empirical_fdr,
})
perm_summary
| | threshold | observed_discoveries | expected_false_discoveries | estimated_fdr |
|---|---|---|---|---|
| 0 | 0.0100 | 78 | 45.74 | 0.58641 |
| 1 | 0.0010 | 10 | 3.94 | 0.39400 |
| 2 | 0.0001 | 8 | 0.40 | 0.05000 |
fig = px.bar(
perm_summary.melt(id_vars="threshold", value_vars=["observed_discoveries", "expected_false_discoveries"],
var_name="quantity", value_name="count"),
x="threshold",
y="count",
color="quantity",
barmode="group",
title="Observed discoveries and permutation-based expected false discoveries",
)
fig.update_xaxes(type="category")
fig.show()
fig = px.line(
perm_summary,
x="threshold",
y="estimated_fdr",
markers=True,
title="Estimated FDR from permutations",
)
fig.update_xaxes(type="category")
fig.show()
In this dataset, the estimated FDR is much lower at the most stringent threshold. That illustrates the main trade-off: a stricter threshold gives fewer discoveries, but the discoveries we keep are more credible.
13.16 Identify a small high-confidence gene set
A common teaching question at this point is: after all of this adjustment and checking, what genes would we actually carry forward for follow-up?
One simple answer is to focus on the genes with raw p-values below 0.0001, which is a very stringent cutoff in this example.
selected = results.loc[results["p_value"] < 1e-4, ["gene_id", "p_value", "mean_diff", "t_stat"]].copy()
selected = selected.sort_values("p_value")
selected
| | gene_id | p_value | mean_diff | t_stat |
|---|---|---|---|---|
| 2148 | 2149 | 5.249463e-10 | 3.185913 | 23.141263 |
| 4138 | 4139 | 6.315123e-09 | 1.035105 | 13.086721 |
| 5355 | 5356 | 1.423105e-08 | 1.845115 | 11.617349 |
| 539 | 540 | 1.336281e-06 | 3.057734 | 11.901011 |
| 1738 | 1739 | 1.643646e-06 | 0.914794 | 10.163451 |
| 1495 | 1496 | 1.743839e-06 | 0.967088 | 9.396916 |
| 2536 | 2537 | 3.485925e-06 | 1.007304 | 9.569900 |
| 4940 | 4941 | 1.177628e-05 | 0.940717 | 7.223691 |
This is not the only possible decision rule, but it is a useful one for classroom discussion because it produces a short list that students can inspect directly and compare against the broader BH and permutation-based summaries.
13.17 What should we report?
A common beginner mistake is to report only a list of “significant genes.” A better report should state at least the following:
- how the raw intensities were transformed,
- how normalization was performed,
- what gene-wise test statistic was used,
- how multiplicity was handled,
- what threshold or FDR target was chosen,
- and how many genes were ultimately selected.
For this case study, a concise summary for a results section is:
final_summary = pd.DataFrame({
"quantity": [
"number of spots tested",
"raw hits at p < 0.01",
"raw hits at p < 0.001",
"raw hits at p < 0.0001",
"BH discoveries at FDR 0.05",
"Bonferroni discoveries at 0.05",
"estimated FDR at p < 0.0001",
],
"value": [
G,
int((p_value < 0.01).sum()),
int((p_value < 0.001).sum()),
int((p_value < 0.0001).sum()),
int(reject_bh.sum()),
int(reject_bonf.sum()),
float(perm_summary.loc[perm_summary["threshold"] == 0.0001, "estimated_fdr"].iloc[0]),
],
})
final_summary
| | quantity | value |
|---|---|---|
| 0 | number of spots tested | 6384.00 |
| 1 | raw hits at p < 0.01 | 78.00 |
| 2 | raw hits at p < 0.001 | 10.00 |
| 3 | raw hits at p < 0.0001 | 8.00 |
| 4 | BH discoveries at FDR 0.05 | 8.00 |
| 5 | Bonferroni discoveries at 0.05 | 7.00 |
| 6 | estimated FDR at p < 0.0001 | 0.05 |
13.18 Discussion questions
- Why is p < 0.05 not a sensible default rule when testing 6384 genes?
- Why does permutation help in a microarray problem where the null distribution and dependence structure may be complicated?
- Why can a method that reduces Type I errors too aggressively also be scientifically unsatisfying?
- What is the conceptual difference between controlling the family-wise error rate and controlling the false discovery rate?
- The paper discusses adjusted p-values, but the supplied teaching code estimates FDR at a few thresholds. What are the strengths and weaknesses of each strategy?
13.19 Exercises
- Increase the number of permutations from 50 to 200 and see how stable the FDR estimates become.
- Add a histogram of the p-values and explain what shape you would expect under a complete null.
- Compare Welch’s t-test with the equal-variance two-sample t-test and check whether the top-ranked genes change.
- Use a grid of thresholds from 1e-5 to 1e-2 and plot the entire estimated FDR curve.
- Repeat the Type I / Type II simulation with a smaller true effect and explain what happens to power.