import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import ttest_ind, norm
from statsmodels.nonparametric.smoothers_lowess import lowess
from statsmodels.stats.multitest import multipletests
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)13 Case study: Multiple testing for differential gene expression
13.1 Background
In a microarray study, we do not test one hypothesis. We test thousands of them at once.
That changes the statistical problem in an important way. If we test a single gene at level \(0.05\), then a false positive is relatively rare. But if we test thousands of genes, then even when no genes truly differ, we should still expect many small p-values just by chance.
This case study uses a classic microarray example from Callow et al. (2000), where the goal is to identify genes with altered expression in HDL-deficient mice. It is a useful teaching example because it combines scientific interpretation with a core statistical difficulty: how to detect genuine signals while controlling false discoveries across thousands of simultaneous tests.
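The point about small p-values arising by chance alone can be seen directly in a quick simulation. The sketch below (illustrative only, not part of the Callow data) tests 5000 "genes" with no true differences and counts how many p-values still fall below 0.05:

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulate a complete null: 5000 "genes", two groups of 8, no true differences.
rng = np.random.default_rng(0)
g1 = rng.normal(0, 1, size=(5000, 8))
g2 = rng.normal(0, 1, size=(5000, 8))
p = ttest_ind(g1, g2, axis=1, equal_var=False).pvalue

# Even with zero real signal, roughly 5% of p-values land below 0.05,
# i.e. on the order of 250 "discoveries" that are all false.
print((p < 0.05).sum())
```

This is exactly the phenomenon the rest of the case study is built around.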
13.2 Objective
The purpose of this case study is to use microarray data to learn four closely related ideas:
- how to construct gene-level test statistics,
- why multiple testing creates a serious Type I error problem,
- how stricter thresholds can reduce false positives but increase false negatives,
- how to estimate the false discovery rate (FDR) from permutations.
The main workflow is:
- compute log-ratios and log-intensities,
- apply block-wise loess normalization,
- calculate Welch two-sample t-statistics,
- examine the consequences of multiple testing,
- estimate the false discovery rate (FDR) by permutation.
13.3 Data & Resources
- Callow-data1.txt: raw two-color microarray intensities for 6384 spots across 16 arrays
- Spot-location1.txt: printing-pin block locations for the spots
- Callow et al 2000.pdf: paper describing the scientific setting and the multiple-testing framework
13.4 Why this case study matters
This is a very good data-science example because the challenge is not just computing p-values. The real challenge is decision-making under many simultaneous comparisons.
At the level of one gene:
- A Type I error means calling a gene differentially expressed when it is not.
- A Type II error means missing a gene that truly is differentially expressed.
At the level of thousands of genes:
- Controlling the chance of any false positive can become too strict.
- Ignoring multiplicity can produce too many false discoveries.
- The false discovery rate provides a useful compromise: among the genes we call significant, what fraction do we expect to be false positives?
That is exactly the issue this case study is designed to illustrate.
13.5 Load packages and data
path_data = "Callow-data1.txt"
path_spot = "Spot-location1.txt"
raw = pd.read_csv(path_data, sep="\t")
spot = pd.read_csv(path_spot, sep="\t")
print("raw shape:", raw.shape)
print("spot shape:", spot.shape)
raw.head()
raw shape: (6384, 33)
spot shape: (6384, 5)
| | row.names | c1G | c1R | c2G | c2R | c3G | c3R | c4G | c4R | c5G | c5R | c6G | c6R | c7G | c7R | c8G | c8R | k1G | k1R | k2G | k2R | k3G | k3R | k4G | k4R | k5G | k5R | k6G | k6R | k7G | k7R | k8G | k8R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5592.58 | 2765.58 | 4749.89 | 1768.22 | 3709.31 | 1440.54 | 2666.13 | 763.06 | 3535.22 | 2027.94 | 2499.53 | 864.05 | 2445.94 | 958.68 | 1631.58 | 644.58 | 580.80 | 747.11 | 2318.73 | 1388.79 | 1441.88 | 1588.76 | 2055.08 | 1280.17 | 2437.16 | 1881.72 | 2269.43 | 1733.53 | 1670.33 | 1170.84 | 2882.78 | 1512.45 |
| 1 | 2 | 4746.38 | 2868.43 | 3088.12 | 2277.18 | 2259.42 | 1599.92 | 1518.12 | 1238.33 | 1647.30 | 1513.43 | 1658.00 | 1079.33 | 1386.56 | 1228.66 | 918.27 | 757.33 | 1179.10 | 1930.25 | 1704.40 | 2093.00 | 835.86 | 1369.81 | 979.33 | 1071.17 | 2028.75 | 3218.58 | 1294.91 | 2451.04 | 1137.45 | 1605.00 | 1652.50 | 1700.82 |
| 2 | 3 | 2108.48 | 1236.32 | 3669.53 | 1546.84 | 7774.40 | 2639.45 | 2514.81 | 999.48 | 2815.00 | 3689.67 | 3345.40 | 1505.20 | 1720.96 | 785.10 | 1387.41 | 994.86 | 734.24 | 753.52 | 2165.77 | 1300.00 | 2761.36 | 1301.61 | 3454.85 | 3292.26 | 2525.40 | 1149.23 | 2694.90 | 3424.30 | 2524.82 | 1901.06 | 2172.53 | 2200.82 |
| 3 | 4 | 548.46 | 383.62 | 708.16 | 532.50 | 462.67 | 323.55 | 680.06 | 585.14 | 210.58 | 250.74 | 633.78 | 566.58 | 548.89 | 409.18 | 425.00 | 417.79 | 582.64 | 829.82 | 334.90 | 402.84 | 639.40 | 513.91 | 493.01 | 459.69 | 536.83 | 391.09 | 552.51 | 601.00 | 540.60 | 438.03 | 538.80 | 507.25 |
| 4 | 5 | 856.48 | 377.36 | 715.64 | 525.44 | 347.27 | 280.67 | 721.93 | 553.43 | 297.49 | 190.80 | 417.00 | 461.89 | 333.50 | 225.63 | 478.59 | 465.65 | 511.31 | 634.80 | 307.33 | 390.93 | 425.76 | 355.76 | 362.12 | 318.87 | 380.99 | 325.59 | 422.22 | 323.65 | 298.20 | 343.06 | 874.53 | 602.28 |
13.6 Understand the experimental layout
The 16 arrays are divided into two groups of 8:
- c1 to c8: control arrays
- k1 to k8: knockout arrays
For each array, the data contain two fluorescence channels, one green and one red. The code constructs
- a log-ratio \[M = \log_2(R/G),\]
- and a log-intensity \[A = \frac{1}{2}\log_2(RG).\]
The spot-location file provides the printing-pin blocks used for block-wise normalization.
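The two transforms are easy to check by hand on a single spot. The values below are made up for illustration (they are not taken from the data):

```python
import numpy as np

# Toy check of the M and A transforms for one spot with red R = 800, green G = 200.
R, G = 800.0, 200.0
M = np.log2(R / G)        # log-ratio: here log2(4) = 2.0
A = 0.5 * np.log2(R * G)  # log-intensity: half the log2 of the product

print(M)            # 2.0, i.e. the red channel is 2^2 = 4 times the green
print(round(A, 3))  # 8.644
```

A positive M means the red channel dominates; A measures overall brightness, which is why the M-A plots below put A on the x-axis.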
col_names = raw.columns.tolist()
print(col_names[:12])
print("...")
print(col_names[16:24])
block1 = spot.iloc[:, 1].to_numpy()
block2 = spot.iloc[:, 2].to_numpy()
print("Unique block1 values:", np.unique(block1))
print("Unique block2 values:", np.unique(block2))
print("Number of spots:", len(block1))['row.names', 'c1G', 'c1R', 'c2G', 'c2R', 'c3G', 'c3R', 'c4G', 'c4R', 'c5G', 'c5R', 'c6G']
...
['c8R', 'k1G', 'k1R', 'k2G', 'k2R', 'k3G', 'k3R', 'k4G']
Unique block1 values: [1 2 3 4]
Unique block2 values: [1 2 3 4]
Number of spots: 6384
13.7 Preprocess the raw intensities
For invalid intensities, we use a simple preprocessing rule:
- if either channel is nonpositive, set the log-ratio to 0,
- and set the log-intensity to 8.
This keeps the transformed values well-defined and lets us focus on the multiple-testing ideas rather than on missing-value edge cases.
n = 8
G = raw.shape[0]
datm = raw.iloc[:, 1:33].to_numpy(dtype=float)
def build_log_matrices(arr):
logr = np.zeros((G, n))
logi = np.zeros((G, n))
for i in range(n):
g = arr[:, 2 * i]
r = arr[:, 2 * i + 1]
ok = (g > 0) & (r > 0)
logr[:, i] = 0.0
logi[:, i] = 8.0
logr[ok, i] = np.log2(r[ok] / g[ok])
logi[ok, i] = 0.5 * np.log2(r[ok] * g[ok])
return logr, logi
logr1, logi1 = build_log_matrices(datm[:, :16])
logr2, logi2 = build_log_matrices(datm[:, 16:32])
print(logr1.shape, logi1.shape, logr2.shape, logi2.shape)
(6384, 8) (6384, 8) (6384, 8) (6384, 8)
13.8 Explore the raw M-A structure
Before normalization, we should check whether the log-ratios depend on intensity. If they do, then naive testing would be distorted by systematic bias.
13.8.1 M-A plots for the first control and first knockout arrays
plot_df = pd.DataFrame({
"A_control": logi1[:, 0],
"M_control": logr1[:, 0],
"A_knockout": logi2[:, 0],
"M_knockout": logr2[:, 0],
})
fig = make_subplots(rows=1, cols=2, subplot_titles=("Control array 1", "Knockout array 1"))
fig.add_trace(
go.Scattergl(x=plot_df["A_control"], y=plot_df["M_control"], mode="markers", marker=dict(size=3), opacity=0.35, showlegend=False),
row=1, col=1,
)
fig.add_trace(
go.Scattergl(x=plot_df["A_knockout"], y=plot_df["M_knockout"], mode="markers", marker=dict(size=3), opacity=0.35, showlegend=False),
row=1, col=2,
)
fig.update_xaxes(title_text="log intensity", row=1, col=1)
fig.update_xaxes(title_text="log intensity", row=1, col=2)
fig.update_yaxes(title_text="log ratio", row=1, col=1)
fig.update_yaxes(title_text="log ratio", row=1, col=2)
fig.update_layout(title="Raw M-A plots", height=450)
fig.show()
13.8.2 Block-specific structure for one array
We normalize within printing-pin blocks, not globally. That already tells us something important: the technical bias may vary by block.
block_panels = []
for i in range(1, 5):
for j in range(1, 5):
bk = (block1 == i) & (block2 == j)
block_panels.append(
pd.DataFrame({
"A": logi1[bk, 0],
"M": logr1[bk, 0],
"block": f"({i},{j})"
})
)
block_df = pd.concat(block_panels, ignore_index=True)
fig = px.scatter(
block_df,
x="A",
y="M",
facet_col="block",
facet_col_wrap=4,
opacity=0.35,
title="Raw M-A plots by printing-pin block for control array 1",
)
fig.update_traces(marker=dict(size=3))
fig.update_layout(height=900)
fig.show()
13.9 Block-wise loess normalization
Now we carry out the main normalization step. For each array and each printing-pin block, we fit a robust loess curve of log-ratio on log-intensity, then subtract the fitted curve.
def blockwise_lowess_normalize(logr, logi, block1, block2, frac=0.2, it=3):
ndat = logr.copy()
fitted = np.zeros_like(logr)
for k in range(logr.shape[1]):
for i in range(1, 5):
for j in range(1, 5):
bk = (block1 == i) & (block2 == j)
x = logi[bk, k]
y = logr[bk, k]
yfit = lowess(y, x, frac=frac, it=it, return_sorted=False)
fitted[bk, k] = yfit
ndat[bk, k] = y - yfit
return ndat, fitted
ndat1, ncurv1 = blockwise_lowess_normalize(logr1, logi1, block1, block2)
ndat2, ncurv2 = blockwise_lowess_normalize(logr2, logi2, block1, block2)
print(ndat1.shape, ndat2.shape)
(6384, 8) (6384, 8)
13.9.1 Before and after normalization
bk = (block1 == 1) & (block2 == 1)
curve_df = pd.DataFrame({
"A": logi1[bk, 0],
"raw_M": logr1[bk, 0],
"fit": ncurv1[bk, 0],
"normalized_M": ndat1[bk, 0],
}).sort_values("A")
fig = make_subplots(rows=1, cols=2, subplot_titles=("Raw with loess fit", "After normalization"))
fig.add_trace(
go.Scattergl(x=curve_df["A"], y=curve_df["raw_M"], mode="markers", marker=dict(size=4), opacity=0.4, name="raw"),
row=1, col=1,
)
fig.add_trace(
go.Scatter(x=curve_df["A"], y=curve_df["fit"], mode="lines", name="loess fit"),
row=1, col=1,
)
fig.add_trace(
go.Scattergl(x=curve_df["A"], y=curve_df["normalized_M"], mode="markers", marker=dict(size=4), opacity=0.4, showlegend=False),
row=1, col=2,
)
fig.update_xaxes(title_text="log intensity", row=1, col=1)
fig.update_xaxes(title_text="log intensity", row=1, col=2)
fig.update_yaxes(title_text="log ratio", row=1, col=1)
fig.update_yaxes(title_text="normalized log ratio", row=1, col=2)
fig.update_layout(title="Effect of block-wise loess normalization", height=450)
fig.show()
13.10 Gene-wise Welch t-tests
After normalization, we compare the two groups gene by gene using Welch’s two-sample t-test, which allows unequal variances across the two groups.
mean_control = ndat1.mean(axis=1)
mean_knockout = ndat2.mean(axis=1)
mean_diff = mean_control - mean_knockout
welch = ttest_ind(ndat1, ndat2, axis=1, equal_var=False)
t_stat = welch.statistic
p_value = welch.pvalue
results = pd.DataFrame({
"gene_id": np.arange(1, G + 1),
"mean_control": mean_control,
"mean_knockout": mean_knockout,
"mean_diff": mean_diff,
"t_stat": t_stat,
"p_value": p_value,
"avg_intensity": np.concatenate([logi1, logi2], axis=1).mean(axis=1),
})
results.nsmallest(10, "p_value")
| | gene_id | mean_control | mean_knockout | mean_diff | t_stat | p_value | avg_intensity |
|---|---|---|---|---|---|---|---|
| 2148 | 2149 | 0.131644 | -3.054269 | 3.185913 | 23.141263 | 5.249463e-10 | 12.468029 |
| 4138 | 4139 | -0.249742 | -1.284847 | 1.035105 | 13.086721 | 6.315123e-09 | 12.605722 |
| 5355 | 5356 | 0.024074 | -1.821041 | 1.845115 | 11.617349 | 1.423105e-08 | 12.926597 |
| 539 | 540 | 0.188983 | -2.868751 | 3.057734 | 11.901011 | 1.336281e-06 | 12.280637 |
| 1738 | 1739 | -0.171758 | -1.086553 | 0.914794 | 10.163451 | 1.643646e-06 | 13.737438 |
| 1495 | 1496 | -0.173019 | -1.140107 | 0.967088 | 9.396916 | 1.743839e-06 | 12.228907 |
| 2536 | 2537 | -0.126446 | -1.133750 | 1.007304 | 9.569900 | 3.485925e-06 | 13.630117 |
| 4940 | 4941 | -0.422332 | -1.363049 | 0.940717 | 7.223691 | 1.177628e-05 | 13.287497 |
| 953 | 954 | 0.018889 | -0.268728 | 0.287617 | 4.374211 | 7.895915e-04 | 12.144535 |
| 946 | 947 | -0.023616 | -0.605393 | 0.581778 | 4.361786 | 9.000834e-04 | 10.542906 |
13.11 Visualize the testing results
13.11.1 Volcano plot
results["neglog10_p"] = -np.log10(results["p_value"])
fig = px.scatter(
results,
x="mean_diff",
y="neglog10_p",
hover_data=["gene_id", "t_stat", "p_value"],
opacity=0.45,
title="Volcano plot after normalization",
)
fig.show()
13.11.2 Normal Q-Q plot of t-statistics
The paper uses a normal Q-Q plot to see whether the extreme t-statistics depart from what we would expect under a mostly-null distribution.
t_sorted = np.sort(t_stat)
probs = (np.arange(1, G + 1) - 0.5) / G
theoretical = norm.ppf(probs)
qq_df = pd.DataFrame({"theoretical": theoretical, "observed": t_sorted})
fig = px.scatter(
qq_df,
x="theoretical",
y="observed",
opacity=0.45,
title="Normal Q-Q plot of gene-wise t-statistics",
)
fig.add_trace(
go.Scatter(
x=[qq_df["theoretical"].min(), qq_df["theoretical"].max()],
y=[qq_df["theoretical"].min(), qq_df["theoretical"].max()],
mode="lines",
name="45-degree line",
)
)
fig.show()
13.11.3 t-statistics versus intensity
The paper also argues that large t-statistics do not simply correspond to low-intensity artifacts. This plot helps students see that issue directly.
fig = px.scatter(
results,
x="avg_intensity",
y="t_stat",
hover_data=["gene_id", "p_value"],
opacity=0.4,
title="Gene-wise t-statistics versus average intensity",
)
fig.show()
13.12 Why multiple testing changes the problem
Suppose, for a moment, that all 6384 genes were null.
Then even under complete null truth, we would expect approximately
- \(6384 \times 0.05 \approx 319\) p-values below 0.05,
- \(6384 \times 0.01 \approx 64\) p-values below 0.01,
- \(6384 \times 0.001 \approx 6.4\) p-values below 0.001.
That simple calculation is often the first moment students realize that “p < 0.05” is not remotely enough in a high-dimensional setting.
alpha_grid = np.array([0.05, 0.01, 0.001, 0.0001])
expected_false_positives = G * alpha_grid
observed_hits = np.array([(p_value < a).sum() for a in alpha_grid])
mult_df = pd.DataFrame({
"alpha": alpha_grid,
"expected_false_positives_under_global_null": expected_false_positives,
"observed_hits": observed_hits,
})
mult_df
| | alpha | expected_false_positives_under_global_null | observed_hits |
|---|---|---|---|
| 0 | 0.0500 | 319.2000 | 381 |
| 1 | 0.0100 | 63.8400 | 78 |
| 2 | 0.0010 | 6.3840 | 10 |
| 3 | 0.0001 | 0.6384 | 8 |
mult_long = mult_df.melt(id_vars="alpha", var_name="quantity", value_name="count")
fig = px.bar(
mult_long,
x="alpha",
y="count",
color="quantity",
barmode="group",
title="Observed discoveries versus what we would expect under a complete null",
)
fig.update_xaxes(type="category")
fig.show()
13.13 Type I and Type II errors: a quick simulation
In the real dataset, we do not know the truth for each gene. So to illustrate the Type I / Type II trade-off clearly, we create a small simulation where the truth is known.
This is not part of the original paper. It is included here because it makes the error concepts concrete.
rng = np.random.default_rng(3255)
m = 2000
n_per_group = 8
n_nonnull = 80
true_effect = np.zeros(m)
true_effect[:n_nonnull] = 1.0
rng.shuffle(true_effect)
x1 = rng.normal(0, 1, size=(m, n_per_group))
x2 = rng.normal(true_effect[:, None], 1, size=(m, n_per_group))
sim_test = ttest_ind(x1, x2, axis=1, equal_var=False)
sim_p = sim_test.pvalue
thresholds = np.array([0.05, 0.01, 0.001, 0.0001])
rows = []
for th in thresholds:
reject = sim_p < th
type1 = np.sum(reject & (true_effect == 0))
type2 = np.sum((~reject) & (true_effect != 0))
rows.append({
"threshold": th,
"type_I_errors": type1,
"type_II_errors": type2,
"discoveries": reject.sum(),
})
sim_summary = pd.DataFrame(rows)
sim_summary
| | threshold | type_I_errors | type_II_errors | discoveries |
|---|---|---|---|---|
| 0 | 0.0500 | 92 | 46 | 126 |
| 1 | 0.0100 | 15 | 63 | 32 |
| 2 | 0.0010 | 2 | 77 | 5 |
| 3 | 0.0001 | 0 | 80 | 0 |
sim_long = sim_summary.melt(id_vars="threshold", value_vars=["type_I_errors", "type_II_errors"],
var_name="error_type", value_name="count")
fig = px.line(
sim_long,
x="threshold",
y="count",
color="error_type",
markers=True,
title="In a simple simulation, stricter thresholds reduce Type I errors but increase Type II errors",
)
fig.update_xaxes(type="category")
fig.show()
The basic lesson is simple but important:
- making the threshold stricter usually reduces false positives,
- but it also makes it easier to miss real signals.
That is one reason FDR-based thinking is so useful.
13.14 Classical multiplicity adjustments
Before moving to permutation-based FDR estimation, it is useful to compare two standard multiplicity adjustments available in many software packages:
- Bonferroni, which is very strict,
- Benjamini-Hochberg (BH), which controls the false discovery rate under standard assumptions.
reject_bonf, p_bonf, _, _ = multipletests(p_value, alpha=0.05, method="bonferroni")
reject_bh, p_bh, _, _ = multipletests(p_value, alpha=0.05, method="fdr_bh")
summary_adjust = pd.DataFrame({
"method": ["raw p < 0.0001", "Bonferroni 0.05", "BH FDR 0.05"],
"discoveries": [int((p_value < 1e-4).sum()), int(reject_bonf.sum()), int(reject_bh.sum())],
})
summary_adjust
| | method | discoveries |
|---|---|---|
| 0 | raw p < 0.0001 | 8 |
| 1 | Bonferroni 0.05 | 7 |
| 2 | BH FDR 0.05 | 8 |
fig = px.bar(summary_adjust, x="method", y="discoveries", title="How many genes are called significant under different decision rules")
fig.show()results["p_bonf"] = p_bonf
results["p_bh"] = p_bh
results.nsmallest(10, "p_value")[["gene_id", "mean_diff", "t_stat", "p_value", "p_bonf", "p_bh"]]
| | gene_id | mean_diff | t_stat | p_value | p_bonf | p_bh |
|---|---|---|---|---|---|---|
| 2148 | 2149 | 3.185913 | 23.141263 | 5.249463e-10 | 0.000003 | 0.000003 |
| 4138 | 4139 | 1.035105 | 13.086721 | 6.315123e-09 | 0.000040 | 0.000020 |
| 5355 | 5356 | 1.845115 | 11.617349 | 1.423105e-08 | 0.000091 | 0.000030 |
| 539 | 540 | 3.057734 | 11.901011 | 1.336281e-06 | 0.008531 | 0.001855 |
| 1738 | 1739 | 0.914794 | 10.163451 | 1.643646e-06 | 0.010493 | 0.001855 |
| 1495 | 1496 | 0.967088 | 9.396916 | 1.743839e-06 | 0.011133 | 0.001855 |
| 2536 | 2537 | 1.007304 | 9.569900 | 3.485925e-06 | 0.022254 | 0.003179 |
| 4940 | 4941 | 0.940717 | 7.223691 | 1.177628e-05 | 0.075180 | 0.009397 |
| 953 | 954 | 0.287617 | 4.374211 | 7.895915e-04 | 1.000000 | 0.560084 |
| 946 | 947 | 0.581778 | 4.361786 | 9.000834e-04 | 1.000000 | 0.574613 |
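To demystify what `multipletests(..., method="fdr_bh")` computes, here is a minimal sketch of the BH step-up adjustment on a small set of made-up p-values (the values are for illustration only):

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values: sort p, multiply the i-th
    smallest by m/i, enforce monotonicity from the largest downward,
    cap at 1, and map back to the original order."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Running minimum from the largest p-value down keeps adjusted
    # p-values non-decreasing in the original p-values.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(ranked, 1.0)
    return adj

# All four adjusted values become 0.04: the largest raw p-value, 0.04,
# already satisfies 0.04 <= (4/4) * 0.04, and the running minimum
# propagates it downward.
print(bh_adjust([0.01, 0.04, 0.03, 0.02]))
```

Rejecting the genes whose adjusted p-value is below 0.05 is equivalent to the usual BH step-up rule at FDR level 0.05.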
13.15 Permutation-based FDR estimation
Now we turn to the main teaching goal of this case study: estimating how many of our apparent discoveries may be false.
One intuitive approach is to estimate FDR by permutation:
- combine the normalized data from both groups,
- randomly permute the 16 group labels,
- recompute all gene-wise p-values,
- count how many genes fall below a chosen threshold,
- average these counts over many permutations,
- divide by the number of observed discoveries at that threshold.
This yields an empirical estimate of
\[ \mathrm{FDR}(c) \approx \frac{E[V(c)]}{R(c)}, \]
where \(V(c)\) is the number of false discoveries under threshold \(c\), and \(R(c)\) is the number of observed discoveries.
rng = np.random.default_rng(123)
ndat12 = np.concatenate([ndat1, ndat2], axis=1)
thresholds = np.array([0.01, 0.001, 0.0001])
perm_no = 50
perm_counts = np.zeros((perm_no, len(thresholds)), dtype=int)
for b in range(perm_no):
ind = rng.permutation(ndat12.shape[1])
fake1 = ndat12[:, ind[:8]]
fake2 = ndat12[:, ind[8:]]
p_perm = ttest_ind(fake1, fake2, axis=1, equal_var=False).pvalue
perm_counts[b, :] = [(p_perm < th).sum() for th in thresholds]
observed_counts = np.array([(p_value < th).sum() for th in thresholds])
expected_false = perm_counts.mean(axis=0)
empirical_fdr = np.divide(
expected_false,
observed_counts,
out=np.full_like(expected_false, np.nan, dtype=float),
where=observed_counts > 0,
)
perm_summary = pd.DataFrame({
"threshold": thresholds,
"observed_discoveries": observed_counts,
"expected_false_discoveries": expected_false,
"estimated_fdr": empirical_fdr,
})
perm_summary
| | threshold | observed_discoveries | expected_false_discoveries | estimated_fdr |
|---|---|---|---|---|
| 0 | 0.0100 | 78 | 45.74 | 0.58641 |
| 1 | 0.0010 | 10 | 3.94 | 0.39400 |
| 2 | 0.0001 | 8 | 0.40 | 0.05000 |
fig = px.bar(
perm_summary.melt(id_vars="threshold", value_vars=["observed_discoveries", "expected_false_discoveries"],
var_name="quantity", value_name="count"),
x="threshold",
y="count",
color="quantity",
barmode="group",
title="Observed discoveries and permutation-based expected false discoveries",
)
fig.update_xaxes(type="category")
fig.show()
fig = px.line(
perm_summary,
x="threshold",
y="estimated_fdr",
markers=True,
title="Estimated FDR from permutations",
)
fig.update_xaxes(type="category")
fig.show()
In this dataset, the estimated FDR is much lower at the most stringent threshold. That illustrates the main trade-off: a stricter threshold gives fewer discoveries, but the discoveries we keep are more credible.
13.16 Identify a small high-confidence gene set
A common teaching question at this point is: after all of this adjustment and checking, what genes would we actually carry forward for follow-up?
One simple answer is to focus on the genes with raw p-values below 0.0001, which is a very stringent cutoff in this example.
selected = results.loc[results["p_value"] < 1e-4, ["gene_id", "p_value", "mean_diff", "t_stat"]].copy()
selected = selected.sort_values("p_value")
selected
| | gene_id | p_value | mean_diff | t_stat |
|---|---|---|---|---|
| 2148 | 2149 | 5.249463e-10 | 3.185913 | 23.141263 |
| 4138 | 4139 | 6.315123e-09 | 1.035105 | 13.086721 |
| 5355 | 5356 | 1.423105e-08 | 1.845115 | 11.617349 |
| 539 | 540 | 1.336281e-06 | 3.057734 | 11.901011 |
| 1738 | 1739 | 1.643646e-06 | 0.914794 | 10.163451 |
| 1495 | 1496 | 1.743839e-06 | 0.967088 | 9.396916 |
| 2536 | 2537 | 3.485925e-06 | 1.007304 | 9.569900 |
| 4940 | 4941 | 1.177628e-05 | 0.940717 | 7.223691 |
This is not the only possible decision rule, but it is a useful one for classroom discussion because it produces a short list that students can inspect directly and compare against the broader BH and permutation-based summaries.
13.17 What should we report?
A common beginner mistake is to report only a list of “significant genes.” A better report should state at least the following:
- how the raw intensities were transformed,
- how normalization was performed,
- what gene-wise test statistic was used,
- how multiplicity was handled,
- what threshold or FDR target was chosen,
- and how many genes were ultimately selected.
For this case study, a concise summary for a results section is:
final_summary = pd.DataFrame({
"quantity": [
"number of spots tested",
"raw hits at p < 0.01",
"raw hits at p < 0.001",
"raw hits at p < 0.0001",
"BH discoveries at FDR 0.05",
"Bonferroni discoveries at 0.05",
"estimated FDR at p < 0.0001",
],
"value": [
G,
int((p_value < 0.01).sum()),
int((p_value < 0.001).sum()),
int((p_value < 0.0001).sum()),
int(reject_bh.sum()),
int(reject_bonf.sum()),
float(perm_summary.loc[perm_summary["threshold"] == 0.0001, "estimated_fdr"].iloc[0]),
],
})
final_summary
| | quantity | value |
|---|---|---|
| 0 | number of spots tested | 6384.00 |
| 1 | raw hits at p < 0.01 | 78.00 |
| 2 | raw hits at p < 0.001 | 10.00 |
| 3 | raw hits at p < 0.0001 | 8.00 |
| 4 | BH discoveries at FDR 0.05 | 8.00 |
| 5 | Bonferroni discoveries at 0.05 | 7.00 |
| 6 | estimated FDR at p < 0.0001 | 0.05 |
13.18 Discussion questions
- Why is p < 0.05 not a sensible default rule when testing 6384 genes?
- Why does permutation help in a microarray problem where the null distribution and dependence structure may be complicated?
- Why can a method that reduces Type I errors too aggressively also be scientifically unsatisfying?
- What is the conceptual difference between controlling the family-wise error rate and controlling the false discovery rate?
- The paper discusses adjusted p-values, but the supplied teaching code estimates FDR at a few thresholds. What are the strengths and weaknesses of each strategy?
13.19 Exercises
- Increase the number of permutations from 50 to 200 and see how stable the FDR estimates become.
- Add a histogram of the p-values and explain what shape you would expect under a complete null.
- Compare Welch’s t-test with the equal-variance two-sample t-test and check whether the top-ranked genes change.
- Use a grid of thresholds from 1e-5 to 1e-2 and plot the entire estimated FDR curve.
- Repeat the Type I / Type II simulation with a smaller true effect and explain what happens to power.