13  Case study: Multiple testing for differential gene expression

13.1 Background

In a microarray study, we do not test one hypothesis. We test thousands of them at once.

That changes the statistical problem in an important way. If we test a single gene at level \(0.05\), a false positive occurs only 5% of the time when the null holds. But if we test thousands of genes, then even when no gene truly differs, we should expect hundreds of small p-values by chance alone.
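
A quick simulation, not part of the original study, makes this concrete: generate thousands of purely null "genes" and count how many clear the 0.05 bar anyway.

```python
# Illustrative simulation (not from the paper): 10,000 null genes,
# 8 samples per group, tested at level 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
m, n = 10_000, 8
x1 = rng.normal(0, 1, size=(m, n))   # group 1: noise only
x2 = rng.normal(0, 1, size=(m, n))   # group 2: noise only

p = ttest_ind(x1, x2, axis=1, equal_var=False).pvalue
print((p < 0.05).sum())              # roughly 0.05 * 10,000 = 500 "hits"
```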

This case study uses a classic microarray example from Callow et al. (2000), where the goal is to identify genes with altered expression in HDL-deficient mice. It is a useful teaching example because it combines scientific interpretation with a core statistical difficulty: how to detect genuine signals while controlling false discoveries across thousands of simultaneous tests.

13.2 Objective

The purpose of this case study is to use microarray data to learn four closely related ideas:

  1. how to construct gene-level test statistics,
  2. why multiple testing creates a serious Type I error problem,
  3. how stricter thresholds can reduce false positives but increase false negatives,
  4. how to estimate the false discovery rate (FDR) from permutations.

The main workflow is:

  1. compute log-ratios and log-intensities,
  2. apply block-wise loess normalization,
  3. calculate Welch two-sample t-statistics,
  4. examine the consequences of multiple testing,
  5. estimate the false discovery rate (FDR) by permutation.

13.3 Data & Resources

  • Callow-data1.txt: raw two-color microarray intensities for 6384 spots across 16 arrays
  • Spot-location1.txt: printing-pin block locations for the spots
  • Callow et al 2000.pdf: paper describing the scientific setting and the multiple-testing framework

13.4 Why this case study matters

This is a good data-science example because the hard part is not computing p-values. The real challenge is decision-making under many simultaneous comparisons.

At the level of one gene:

  • A Type I error means calling a gene differentially expressed when it is not.
  • A Type II error means missing a gene that truly is differentially expressed.

At the level of thousands of genes:

  • Controlling the chance of any false positive can become too strict.
  • Ignoring multiplicity can produce too many false discoveries.
  • The false discovery rate provides a useful compromise: among the genes we call significant, what fraction do we expect to be false positives?

That is exactly the issue this case study is designed to illustrate.
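
Both failure modes can be quantified with a back-of-the-envelope calculation. This is an idealization, since it assumes the 6384 tests are independent, but it shows why controlling any false positive at all is such a demanding standard here:

```python
m, alpha = 6384, 0.05

# Chance of at least one false positive across m independent null tests:
fwer = 1 - (1 - alpha) ** m          # essentially 1: a false positive is near-certain
# Expected number of false positives under the same global null:
expected_fp = m * alpha              # about 319

print(fwer, expected_fp)
```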

13.5 Load packages and data

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from scipy.stats import ttest_ind, norm
from statsmodels.nonparametric.smoothers_lowess import lowess
from statsmodels.stats.multitest import multipletests

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)
path_data = "Callow-data1.txt"
path_spot = "Spot-location1.txt"

raw = pd.read_csv(path_data, sep="\t")
spot = pd.read_csv(path_spot, sep="\t")

print("raw shape:", raw.shape)
print("spot shape:", spot.shape)

raw.head()
raw shape: (6384, 33)
spot shape: (6384, 5)
row.names c1G c1R c2G c2R c3G c3R c4G c4R c5G c5R c6G c6R c7G c7R c8G c8R k1G k1R k2G k2R k3G k3R k4G k4R k5G k5R k6G k6R k7G k7R k8G k8R
0 1 5592.58 2765.58 4749.89 1768.22 3709.31 1440.54 2666.13 763.06 3535.22 2027.94 2499.53 864.05 2445.94 958.68 1631.58 644.58 580.80 747.11 2318.73 1388.79 1441.88 1588.76 2055.08 1280.17 2437.16 1881.72 2269.43 1733.53 1670.33 1170.84 2882.78 1512.45
1 2 4746.38 2868.43 3088.12 2277.18 2259.42 1599.92 1518.12 1238.33 1647.30 1513.43 1658.00 1079.33 1386.56 1228.66 918.27 757.33 1179.10 1930.25 1704.40 2093.00 835.86 1369.81 979.33 1071.17 2028.75 3218.58 1294.91 2451.04 1137.45 1605.00 1652.50 1700.82
2 3 2108.48 1236.32 3669.53 1546.84 7774.40 2639.45 2514.81 999.48 2815.00 3689.67 3345.40 1505.20 1720.96 785.10 1387.41 994.86 734.24 753.52 2165.77 1300.00 2761.36 1301.61 3454.85 3292.26 2525.40 1149.23 2694.90 3424.30 2524.82 1901.06 2172.53 2200.82
3 4 548.46 383.62 708.16 532.50 462.67 323.55 680.06 585.14 210.58 250.74 633.78 566.58 548.89 409.18 425.00 417.79 582.64 829.82 334.90 402.84 639.40 513.91 493.01 459.69 536.83 391.09 552.51 601.00 540.60 438.03 538.80 507.25
4 5 856.48 377.36 715.64 525.44 347.27 280.67 721.93 553.43 297.49 190.80 417.00 461.89 333.50 225.63 478.59 465.65 511.31 634.80 307.33 390.93 425.76 355.76 362.12 318.87 380.99 325.59 422.22 323.65 298.20 343.06 874.53 602.28

13.6 Understand the experimental layout

The 16 arrays are divided into two groups of 8:

  • c1 to c8: control arrays
  • k1 to k8: knockout arrays

For each array, the data contain two fluorescence channels, green (\(G\)) and red (\(R\)). From these we construct

  • a log-ratio \[M = \log_2(R/G),\]
  • and a log-intensity \[A = \frac{1}{2}\log_2(RG).\]

The spot-location file provides the printing-pin blocks used for block-wise normalization.
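
As a quick numeric check of these definitions (illustrative values, not taken from the data): a spot with \(R = 2000\) and \(G = 1000\) is twofold brighter in red, so its log-ratio should be exactly 1.

```python
import numpy as np

g, r = 1000.0, 2000.0          # illustrative green and red intensities
M = np.log2(r / g)             # log-ratio: 1.0 (a twofold change)
A = 0.5 * np.log2(r * g)       # log-intensity: about 10.47
print(M, A)
```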

col_names = raw.columns.tolist()
print(col_names[:12])
print("...")
print(col_names[16:24])

block1 = spot.iloc[:, 1].to_numpy()
block2 = spot.iloc[:, 2].to_numpy()

print("Unique block1 values:", np.unique(block1))
print("Unique block2 values:", np.unique(block2))
print("Number of spots:", len(block1))
['row.names', 'c1G', 'c1R', 'c2G', 'c2R', 'c3G', 'c3R', 'c4G', 'c4R', 'c5G', 'c5R', 'c6G']
...
['c8R', 'k1G', 'k1R', 'k2G', 'k2R', 'k3G', 'k3R', 'k4G']
Unique block1 values: [1 2 3 4]
Unique block2 values: [1 2 3 4]
Number of spots: 6384

13.7 Preprocess the raw intensities

For invalid intensities, we use a simple preprocessing rule:

  • if either channel is nonpositive, set the log-ratio to 0,
  • and set the log-intensity to 8.

This keeps the transformed values well-defined and lets us focus on the multiple-testing ideas rather than on missing-value edge cases.

n = 8
G = raw.shape[0]
datm = raw.iloc[:, 1:33].to_numpy(dtype=float)

def build_log_matrices(arr):
    logr = np.zeros((G, n))
    logi = np.zeros((G, n))
    for i in range(n):
        g = arr[:, 2 * i]
        r = arr[:, 2 * i + 1]
        ok = (g > 0) & (r > 0)
        logr[:, i] = 0.0
        logi[:, i] = 8.0
        logr[ok, i] = np.log2(r[ok] / g[ok])
        logi[ok, i] = 0.5 * np.log2(r[ok] * g[ok])
    return logr, logi

logr1, logi1 = build_log_matrices(datm[:, :16])
logr2, logi2 = build_log_matrices(datm[:, 16:32])

print(logr1.shape, logi1.shape, logr2.shape, logi2.shape)
(6384, 8) (6384, 8) (6384, 8) (6384, 8)

13.8 Explore the raw M-A structure

Before normalization, we should check whether the log-ratios depend on intensity. If they do, then naive testing would be distorted by systematic bias.

13.8.1 M-A plots for the first control and first knockout arrays

plot_df = pd.DataFrame({
    "A_control": logi1[:, 0],
    "M_control": logr1[:, 0],
    "A_knockout": logi2[:, 0],
    "M_knockout": logr2[:, 0],
})

fig = make_subplots(rows=1, cols=2, subplot_titles=("Control array 1", "Knockout array 1"))
fig.add_trace(
    go.Scattergl(x=plot_df["A_control"], y=plot_df["M_control"], mode="markers", marker=dict(size=3), opacity=0.35, showlegend=False),
    row=1, col=1,
)
fig.add_trace(
    go.Scattergl(x=plot_df["A_knockout"], y=plot_df["M_knockout"], mode="markers", marker=dict(size=3), opacity=0.35, showlegend=False),
    row=1, col=2,
)
fig.update_xaxes(title_text="log intensity", row=1, col=1)
fig.update_xaxes(title_text="log intensity", row=1, col=2)
fig.update_yaxes(title_text="log ratio", row=1, col=1)
fig.update_yaxes(title_text="log ratio", row=1, col=2)
fig.update_layout(title="Raw M-A plots", height=450)
fig.show()

13.8.2 Block-specific structure for one array

We normalize within printing-pin blocks, not globally. That already tells us something important: the technical bias may vary by block.

block_panels = []
for i in range(1, 5):
    for j in range(1, 5):
        bk = (block1 == i) & (block2 == j)
        block_panels.append(
            pd.DataFrame({
                "A": logi1[bk, 0],
                "M": logr1[bk, 0],
                "block": f"({i},{j})"
            })
        )
block_df = pd.concat(block_panels, ignore_index=True)

fig = px.scatter(
    block_df,
    x="A",
    y="M",
    facet_col="block",
    facet_col_wrap=4,
    opacity=0.35,
    title="Raw M-A plots by printing-pin block for control array 1",
)
fig.update_traces(marker=dict(size=3))
fig.update_layout(height=900)
fig.show()

13.9 Block-wise loess normalization

Now we carry out the main normalization step. For each array and each printing-pin block, we fit a robust loess curve of log-ratio on log-intensity, then subtract the fitted curve.

def blockwise_lowess_normalize(logr, logi, block1, block2, frac=0.2, it=3):
    ndat = logr.copy()
    fitted = np.zeros_like(logr)
    for k in range(logr.shape[1]):
        for i in range(1, 5):
            for j in range(1, 5):
                bk = (block1 == i) & (block2 == j)
                x = logi[bk, k]
                y = logr[bk, k]
                yfit = lowess(y, x, frac=frac, it=it, return_sorted=False)
                fitted[bk, k] = yfit
                ndat[bk, k] = y - yfit
    return ndat, fitted

ndat1, ncurv1 = blockwise_lowess_normalize(logr1, logi1, block1, block2)
ndat2, ncurv2 = blockwise_lowess_normalize(logr2, logi2, block1, block2)

print(ndat1.shape, ndat2.shape)
(6384, 8) (6384, 8)

13.9.1 Before and after normalization

bk = (block1 == 1) & (block2 == 1)
curve_df = pd.DataFrame({
    "A": logi1[bk, 0],
    "raw_M": logr1[bk, 0],
    "fit": ncurv1[bk, 0],
    "normalized_M": ndat1[bk, 0],
}).sort_values("A")

fig = make_subplots(rows=1, cols=2, subplot_titles=("Raw with loess fit", "After normalization"))
fig.add_trace(
    go.Scattergl(x=curve_df["A"], y=curve_df["raw_M"], mode="markers", marker=dict(size=4), opacity=0.4, name="raw"),
    row=1, col=1,
)
fig.add_trace(
    go.Scatter(x=curve_df["A"], y=curve_df["fit"], mode="lines", name="loess fit"),
    row=1, col=1,
)
fig.add_trace(
    go.Scattergl(x=curve_df["A"], y=curve_df["normalized_M"], mode="markers", marker=dict(size=4), opacity=0.4, showlegend=False),
    row=1, col=2,
)
fig.update_xaxes(title_text="log intensity", row=1, col=1)
fig.update_xaxes(title_text="log intensity", row=1, col=2)
fig.update_yaxes(title_text="log ratio", row=1, col=1)
fig.update_yaxes(title_text="normalized log ratio", row=1, col=2)
fig.update_layout(title="Effect of block-wise loess normalization", height=450)
fig.show()

13.10 Gene-wise Welch t-tests

After normalization, we compare the two groups gene by gene using Welch’s two-sample t-test, which allows unequal variances across the two groups.

mean_control = ndat1.mean(axis=1)
mean_knockout = ndat2.mean(axis=1)
mean_diff = mean_control - mean_knockout

welch = ttest_ind(ndat1, ndat2, axis=1, equal_var=False)
t_stat = welch.statistic
p_value = welch.pvalue

results = pd.DataFrame({
    "gene_id": np.arange(1, G + 1),
    "mean_control": mean_control,
    "mean_knockout": mean_knockout,
    "mean_diff": mean_diff,
    "t_stat": t_stat,
    "p_value": p_value,
    "avg_intensity": np.concatenate([logi1, logi2], axis=1).mean(axis=1),
})

results.nsmallest(10, "p_value")
gene_id mean_control mean_knockout mean_diff t_stat p_value avg_intensity
2148 2149 0.131644 -3.054269 3.185913 23.141263 5.249463e-10 12.468029
4138 4139 -0.249742 -1.284847 1.035105 13.086721 6.315123e-09 12.605722
5355 5356 0.024074 -1.821041 1.845115 11.617349 1.423105e-08 12.926597
539 540 0.188983 -2.868751 3.057734 11.901011 1.336281e-06 12.280637
1738 1739 -0.171758 -1.086553 0.914794 10.163451 1.643646e-06 13.737438
1495 1496 -0.173019 -1.140107 0.967088 9.396916 1.743839e-06 12.228907
2536 2537 -0.126446 -1.133750 1.007304 9.569900 3.485925e-06 13.630117
4940 4941 -0.422332 -1.363049 0.940717 7.223691 1.177628e-05 13.287497
953 954 0.018889 -0.268728 0.287617 4.374211 7.895915e-04 12.144535
946 947 -0.023616 -0.605393 0.581778 4.361786 9.000834e-04 10.542906

13.11 Visualize the testing results

13.11.1 Volcano plot

results["neglog10_p"] = -np.log10(results["p_value"])
fig = px.scatter(
    results,
    x="mean_diff",
    y="neglog10_p",
    hover_data=["gene_id", "t_stat", "p_value"],
    opacity=0.45,
    title="Volcano plot after normalization",
)
fig.show()

13.11.2 Normal Q-Q plot of t-statistics

The paper uses a normal Q-Q plot to see whether the extreme t-statistics depart from what we would expect under a mostly-null distribution.

t_sorted = np.sort(t_stat)
probs = (np.arange(1, G + 1) - 0.5) / G
theoretical = norm.ppf(probs)
qq_df = pd.DataFrame({"theoretical": theoretical, "observed": t_sorted})

fig = px.scatter(
    qq_df,
    x="theoretical",
    y="observed",
    opacity=0.45,
    title="Normal Q-Q plot of gene-wise t-statistics",
)
fig.add_trace(
    go.Scatter(
        x=[qq_df["theoretical"].min(), qq_df["theoretical"].max()],
        y=[qq_df["theoretical"].min(), qq_df["theoretical"].max()],
        mode="lines",
        name="45-degree line",
    )
)
fig.show()

13.11.3 t-statistics versus intensity

The paper also argues that large t-statistics do not simply correspond to low-intensity artifacts. This plot helps students see that issue directly.

fig = px.scatter(
    results,
    x="avg_intensity",
    y="t_stat",
    hover_data=["gene_id", "p_value"],
    opacity=0.4,
    title="Gene-wise t-statistics versus average intensity",
)
fig.show()

13.12 Why multiple testing changes the problem

Suppose, for a moment, that all 6384 genes were null.

Then, even though nothing would truly be differentially expressed, we would expect approximately

  • \(6384 \times 0.05 \approx 319\) p-values below 0.05,
  • \(6384 \times 0.01 \approx 64\) p-values below 0.01,
  • \(6384 \times 0.001 \approx 6.4\) p-values below 0.001.

That simple calculation is often the first moment students realize that “p < 0.05” is not remotely enough in a high-dimensional setting.

alpha_grid = np.array([0.05, 0.01, 0.001, 0.0001])
expected_false_positives = G * alpha_grid
observed_hits = np.array([(p_value < a).sum() for a in alpha_grid])

mult_df = pd.DataFrame({
    "alpha": alpha_grid,
    "expected_false_positives_under_global_null": expected_false_positives,
    "observed_hits": observed_hits,
})
mult_df
alpha expected_false_positives_under_global_null observed_hits
0 0.0500 319.2000 381
1 0.0100 63.8400 78
2 0.0010 6.3840 10
3 0.0001 0.6384 8
mult_long = mult_df.melt(id_vars="alpha", var_name="quantity", value_name="count")
fig = px.bar(
    mult_long,
    x="alpha",
    y="count",
    color="quantity",
    barmode="group",
    title="Observed discoveries versus what we would expect under a complete null",
)
fig.update_xaxes(type="category")
fig.show()

13.13 Type I and Type II errors: a quick simulation

In the real dataset, we do not know the truth for each gene. So to illustrate the Type I / Type II trade-off clearly, we create a small simulation where the truth is known.

This is not part of the original paper. It is included here because it makes the error concepts concrete.

rng = np.random.default_rng(3255)

m = 2000
n_per_group = 8
n_nonnull = 80
true_effect = np.zeros(m)
true_effect[:n_nonnull] = 1.0
rng.shuffle(true_effect)

x1 = rng.normal(0, 1, size=(m, n_per_group))
x2 = rng.normal(true_effect[:, None], 1, size=(m, n_per_group))

sim_test = ttest_ind(x1, x2, axis=1, equal_var=False)
sim_p = sim_test.pvalue

thresholds = np.array([0.05, 0.01, 0.001, 0.0001])
rows = []
for th in thresholds:
    reject = sim_p < th
    type1 = np.sum(reject & (true_effect == 0))
    type2 = np.sum((~reject) & (true_effect != 0))
    rows.append({
        "threshold": th,
        "type_I_errors": type1,
        "type_II_errors": type2,
        "discoveries": reject.sum(),
    })

sim_summary = pd.DataFrame(rows)
sim_summary
threshold type_I_errors type_II_errors discoveries
0 0.0500 92 46 126
1 0.0100 15 63 32
2 0.0010 2 77 5
3 0.0001 0 80 0
sim_long = sim_summary.melt(id_vars="threshold", value_vars=["type_I_errors", "type_II_errors"],
                            var_name="error_type", value_name="count")
fig = px.line(
    sim_long,
    x="threshold",
    y="count",
    color="error_type",
    markers=True,
    title="In a simple simulation, stricter thresholds reduce Type I errors but increase Type II errors",
)
fig.update_xaxes(type="category")
fig.show()

The basic lesson is simple but important:

  • making the threshold stricter usually reduces false positives,
  • but it also makes it easier to miss real signals.

That is one reason FDR-based thinking is so useful.

13.14 Classical multiplicity adjustments

Before moving to permutation-based FDR estimation, it is useful to compare two standard multiplicity adjustments available in many software packages:

  • Bonferroni, which is very strict,
  • Benjamini-Hochberg (BH), which controls the false discovery rate under standard assumptions.
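
To demystify what BH actually does before calling the library, here is a minimal hand-rolled version (a sketch for intuition; the `multipletests` call below is what we actually use):

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg: sort the p-values, find the largest k with
    p_(k) <= k*q/m, and reject the k smallest p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # last sorted index under its threshold
        reject[order[: k + 1]] = True
    return reject

# Tiny example: at q = 0.05 the three smallest p-values are rejected.
print(bh_reject([0.001, 0.01, 0.03, 0.5]))
```

Note the "step-up" behavior: 0.03 exceeds the Bonferroni bar of 0.0125 but is still rejected, because it falls below its rank-specific threshold of \(3 \times 0.05 / 4\).
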
reject_bonf, p_bonf, _, _ = multipletests(p_value, alpha=0.05, method="bonferroni")
reject_bh, p_bh, _, _ = multipletests(p_value, alpha=0.05, method="fdr_bh")

summary_adjust = pd.DataFrame({
    "method": ["raw p < 0.0001", "Bonferroni 0.05", "BH FDR 0.05"],
    "discoveries": [int((p_value < 1e-4).sum()), int(reject_bonf.sum()), int(reject_bh.sum())],
})
summary_adjust
method discoveries
0 raw p < 0.0001 8
1 Bonferroni 0.05 7
2 BH FDR 0.05 8
fig = px.bar(summary_adjust, x="method", y="discoveries", title="How many genes are called significant under different decision rules")
fig.show()
results["p_bonf"] = p_bonf
results["p_bh"] = p_bh
results.nsmallest(10, "p_value")[["gene_id", "mean_diff", "t_stat", "p_value", "p_bonf", "p_bh"]]
gene_id mean_diff t_stat p_value p_bonf p_bh
2148 2149 3.185913 23.141263 5.249463e-10 0.000003 0.000003
4138 4139 1.035105 13.086721 6.315123e-09 0.000040 0.000020
5355 5356 1.845115 11.617349 1.423105e-08 0.000091 0.000030
539 540 3.057734 11.901011 1.336281e-06 0.008531 0.001855
1738 1739 0.914794 10.163451 1.643646e-06 0.010493 0.001855
1495 1496 0.967088 9.396916 1.743839e-06 0.011133 0.001855
2536 2537 1.007304 9.569900 3.485925e-06 0.022254 0.003179
4940 4941 0.940717 7.223691 1.177628e-05 0.075180 0.009397
953 954 0.287617 4.374211 7.895915e-04 1.000000 0.560084
946 947 0.581778 4.361786 9.000834e-04 1.000000 0.574613

13.15 Permutation-based FDR estimation

Now we turn to the main teaching goal of this case study: estimating how many of our apparent discoveries may be false.

One intuitive approach is to estimate FDR by permutation:

  1. combine the normalized data from both groups,
  2. randomly permute the 16 group labels,
  3. recompute all gene-wise p-values,
  4. count how many genes fall below a chosen threshold,
  5. average these counts over many permutations,
  6. divide by the number of observed discoveries at that threshold.

This yields an empirical estimate of

\[ \mathrm{FDR}(c) \approx \frac{E[V(c)]}{R(c)}, \]

where \(V(c)\) is the number of false discoveries under threshold \(c\), and \(R(c)\) is the number of observed discoveries.

rng = np.random.default_rng(123)
ndat12 = np.concatenate([ndat1, ndat2], axis=1)
thresholds = np.array([0.01, 0.001, 0.0001])
perm_no = 50
perm_counts = np.zeros((perm_no, len(thresholds)), dtype=int)

for b in range(perm_no):
    ind = rng.permutation(ndat12.shape[1])
    fake1 = ndat12[:, ind[:8]]
    fake2 = ndat12[:, ind[8:]]
    p_perm = ttest_ind(fake1, fake2, axis=1, equal_var=False).pvalue
    perm_counts[b, :] = [(p_perm < th).sum() for th in thresholds]

observed_counts = np.array([(p_value < th).sum() for th in thresholds])
expected_false = perm_counts.mean(axis=0)
empirical_fdr = np.divide(
    expected_false,
    observed_counts,
    out=np.full_like(expected_false, np.nan, dtype=float),
    where=observed_counts > 0,
)

perm_summary = pd.DataFrame({
    "threshold": thresholds,
    "observed_discoveries": observed_counts,
    "expected_false_discoveries": expected_false,
    "estimated_fdr": empirical_fdr,
})
perm_summary
threshold observed_discoveries expected_false_discoveries estimated_fdr
0 0.0100 78 45.74 0.58641
1 0.0010 10 3.94 0.39400
2 0.0001 8 0.40 0.05000
fig = px.bar(
    perm_summary.melt(id_vars="threshold", value_vars=["observed_discoveries", "expected_false_discoveries"],
                      var_name="quantity", value_name="count"),
    x="threshold",
    y="count",
    color="quantity",
    barmode="group",
    title="Observed discoveries and permutation-based expected false discoveries",
)
fig.update_xaxes(type="category")
fig.show()
fig = px.line(
    perm_summary,
    x="threshold",
    y="estimated_fdr",
    markers=True,
    title="Estimated FDR from permutations",
)
fig.update_xaxes(type="category")
fig.show()

In this dataset, the estimated FDR is much lower at the most stringent threshold. That illustrates the main trade-off: a stricter threshold gives fewer discoveries, but the discoveries we keep are more credible.

13.16 Identify a small high-confidence gene set

A common teaching question at this point is: after all of this adjustment and checking, what genes would we actually carry forward for follow-up?

One simple answer is to focus on the genes with raw p-values below 0.0001, which is a very stringent cutoff in this example.

selected = results.loc[results["p_value"] < 1e-4, ["gene_id", "p_value", "mean_diff", "t_stat"]].copy()
selected = selected.sort_values("p_value")
selected
gene_id p_value mean_diff t_stat
2148 2149 5.249463e-10 3.185913 23.141263
4138 4139 6.315123e-09 1.035105 13.086721
5355 5356 1.423105e-08 1.845115 11.617349
539 540 1.336281e-06 3.057734 11.901011
1738 1739 1.643646e-06 0.914794 10.163451
1495 1496 1.743839e-06 0.967088 9.396916
2536 2537 3.485925e-06 1.007304 9.569900
4940 4941 1.177628e-05 0.940717 7.223691

This is not the only possible decision rule, but it is a useful one for classroom discussion because it produces a short list that students can inspect directly and compare against the broader BH and permutation-based summaries.

13.17 What should we report?

A common beginner mistake is to report only a list of “significant genes.” A better report should state at least the following:

  • how the raw intensities were transformed,
  • how normalization was performed,
  • what gene-wise test statistic was used,
  • how multiplicity was handled,
  • what threshold or FDR target was chosen,
  • and how many genes were ultimately selected.

For this case study, a concise summary for a results section is:

final_summary = pd.DataFrame({
    "quantity": [
        "number of spots tested",
        "raw hits at p < 0.01",
        "raw hits at p < 0.001",
        "raw hits at p < 0.0001",
        "BH discoveries at FDR 0.05",
        "Bonferroni discoveries at 0.05",
        "estimated FDR at p < 0.0001",
    ],
    "value": [
        G,
        int((p_value < 0.01).sum()),
        int((p_value < 0.001).sum()),
        int((p_value < 0.0001).sum()),
        int(reject_bh.sum()),
        int(reject_bonf.sum()),
        float(perm_summary.loc[perm_summary["threshold"] == 0.0001, "estimated_fdr"].iloc[0]),
    ],
})
final_summary
quantity value
0 number of spots tested 6384.00
1 raw hits at p < 0.01 78.00
2 raw hits at p < 0.001 10.00
3 raw hits at p < 0.0001 8.00
4 BH discoveries at FDR 0.05 8.00
5 Bonferroni discoveries at 0.05 7.00
6 estimated FDR at p < 0.0001 0.05

13.18 Discussion questions

  1. Why is p < 0.05 not a sensible default rule when testing 6384 genes?
  2. Why does permutation help in a microarray problem where the null distribution and dependence structure may be complicated?
  3. Why can a method that reduces Type I errors too aggressively also be scientifically unsatisfying?
  4. What is the conceptual difference between controlling the family-wise error rate and controlling the false discovery rate?
  5. The paper discusses adjusted p-values, but the supplied teaching code estimates FDR at a few thresholds. What are the strengths and weaknesses of each strategy?

13.19 Exercises

  1. Increase the number of permutations from 50 to 200 and see how stable the FDR estimates become.
  2. Add a histogram of the p-values and explain what shape you would expect under a complete null.
  3. Compare Welch’s t-test with the equal-variance two-sample t-test and check whether the top-ranked genes change.
  4. Use a grid of thresholds from 1e-5 to 1e-2 and plot the entire estimated FDR curve.
  5. Repeat the Type I / Type II simulation with a smaller true effect and explain what happens to power.