14  Homework

14.1 Homework 1 (due 02/06)

Quarto and Git setup: Quarto and Git are two important tools for data science: Quarto for communicating results (reports/websites/books) and Git/GitHub for version control and collaboration. This homework is about getting familiar with them through the following tasks.

Your goal: produce a short, beginner-friendly, step-by-step manual for yourself (and future students) that documents what you did, what went wrong, and how you fixed it. Please use templates/hw.qmd as your starting point.

Work in your GitHub Classroom repository (the private repo created for you when you accept the assignment link). Use the command line interface unless a step explicitly requires a browser (e.g., adding an SSH key).

Submission (what to include in your repo)

  • hw.qmd (your completed manual)
  • hw.html (rendered from hw.qmd)
  • hw.pdf

Tasks

  1. Accept the GitHub Classroom assignment and clone your repo (SSH).
  • Accept the assignment using the GitHub Classroom link provided on HuskyCT.
  • In a terminal, verify Git is installed:
    • git --version
  • Set your Git identity (do this once per computer):
    • git config --global user.name "Your Name"
    • git config --global user.email "XXX@uconn.edu" (use the email tied to your GitHub account if different).
  2. Set up SSH authentication between your computer and GitHub. There are online tutorials with more details.
  • Check whether you already have an SSH key:
    • ls -al ~/.ssh
  • If you do not see something like id_ed25519 and id_ed25519.pub, create a new key:
    • ssh-keygen -t ed25519 -C "your_email@example.com"
    • Press Enter to accept the default location (~/.ssh/id_ed25519)
    • Choose a passphrase or press Enter for none
  • Start the SSH agent and add your key:
    • eval "$(ssh-agent -s)"
    • ssh-add ~/.ssh/id_ed25519
  • Copy your public key and add it to GitHub:
    • cat ~/.ssh/id_ed25519.pub
    • In your browser: GitHub → Settings → SSH and GPG keys → New SSH key
  • Test the connection:
    • ssh -T git@github.com
    • You should see a message that you have successfully authenticated.
  • Now clone your GitHub Classroom repo (SSH) and enter it:
    • git clone git@github.com:<ORG>/<REPO>.git
    • cd <REPO>
  3. Set up the homework file in your repo.
  • Download/copy the template into your Classroom repo as hw.qmd.
  • Make an initial commit:
    • git status
    • git add hw.qmd
    • git commit -m "Start HW1"
    • git push
  4. Install Quarto and verify it works. Follow the official installation instructions at https://quarto.org/docs/get-started/ for your operating system. Then verify:
    • quarto --version
    • quarto check

In hw.qmd, record exactly what you installed (version and method), any issues you hit (PATH problems are common), and how you fixed them.

  5. Pick an editor/tool and reproduce the “line plot on polar axis” example.
  • Choose one tool (e.g., VS Code, Jupyter, Positron, Emacs) and follow the Quarto documentation to reproduce the example plot.
  • Minimum requirement: include the plot in your hw.qmd using a Python code cell that produces a polar line plot with Matplotlib. For example:
import numpy as np
import matplotlib.pyplot as plt

r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(theta, r)
ax.set_rmax(2.0)
ax.grid(True)
plt.show()

Render the homework into an HTML file. Print the HTML file to a PDF and put both files into a release in your GitHub repo.

14.2 Homework 2 (due 02/13)

14.2.1 Part 1. Contributing to the Class Notes

This part of the homework is to make sure that you know how to contribute to our class notes by making pull requests, and to complete the updates of your wishlist and presentation topic.

To contribute to the class notes repository, you need a working copy of the sources on your computer. Document the following steps in a .qmd file in the form of a step-by-step manual, as if you were explaining them to someone who wants to contribute too. Make at least 5 commits for this task, each with an informative message.

  1. Create a fork of the notes repo with your own GitHub account.
  2. Clone it to an appropriate folder on your computer.
  3. Render the classnotes on your computer. If you encounter difficulties, try to document and resolve the issues (Quarto/Python installation, environment setup/activation, etc.).
  4. Make a new branch (and name it appropriately) to experiment with your changes.
  5. Check out your branch and add your wishes to the wish list; check the rendered notes to make sure your changes look good.
  6. Commit with an informative message; and push the changes to your GitHub account.
  7. Make a pull request to the class notes repo from your fork on GitHub. Make sure you write clear messages documenting the changes.
  8. Make another branch and repeat the process to update your presentation topic.

The grading will be based on whether you can successfully make pull requests to update your wishlist and presentation topic on time.

14.2.2 Part 2. Generalized “Fibonacci-like” recurrence

Define a sequence \(\{a_n\}\) by

  • \(a_1 = 1, a_2 = 1\)
  • \(a_n = c_1 a_{n-1} + c_2 a_{n-2}\) for \(n \ge 3\)

where \(c_1, c_2\) are integers. Note that the Fibonacci sequence corresponds to \(c_1 = 1, c_2 = 1\).
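As a quick sanity check (not required), the first ten terms for \(c_1 = 2, c_2 = 1\) can be generated directly from the definition:

```python
# Apply the recurrence a_n = 2*a_{n-1} + 1*a_{n-2} with a_1 = a_2 = 1.
a = [1, 1]
for _ in range(8):
    a.append(2 * a[-1] + 1 * a[-2])
print(a)  # [1, 1, 3, 7, 17, 41, 99, 239, 577, 1393]
```

The last term, 1393, is the value you will use to verify your implementations below.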

  1. Based on the Fibonacci example in the notes, implement four functions:

    • seq_rs(n, c1, c2) (naive recursion)
    • seq_dm(n, c1, c2) (memoization)
    • seq_dbu(n, c1, c2) (bottom-up with list)
    • seq_dbu_m(n, c1, c2) (bottom-up constant memory)
  2. (optional) Add input checks and raise ValueError with a helpful message if:

    • n is not an integer or n < 1
    • c1 or c2 is not an integer
  3. Benchmark all four methods for:

    • n = 10, 20, 40
    • coefficients (c1, c2) = (1, 1) and (2, 1)
    • Verify your functions by computing the result for n = 10, (c1, c2) = (2, 1). The correct answer should be 1393.
    • Use %timeit or timeit and present results in a small table.
  4. Briefly explain (2–5 sentences):

    • Why recursion slows dramatically
    • How memoization changes time complexity
    • Memory usage differences between the bottom-up list vs constant-memory version
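For orientation, here is one possible shape for the constant-memory bottom-up version — a sketch assuming the base cases \(a_1 = a_2 = 1\) above, not the required solution:

```python
def seq_dbu_m(n, c1, c2):
    """Bottom-up with O(1) memory: keep only the last two terms."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    prev, curr = 1, 1  # a_1, a_2
    for _ in range(n - 2):
        prev, curr = curr, c1 * curr + c2 * prev
    return curr
```

The other three variants differ only in how they organize the same recurrence (call stack, cache dictionary, or full list).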

14.2.3 Part 3. Fast method using matrices

For the recurrence \(a_n = c_1 a_{n-1} + c_2 a_{n-2}\), define the matrix:

\[ M = \begin{pmatrix} c_1 & c_2 \\ 1 & 0 \end{pmatrix} \]

Then, one can show that:

\[ \begin{pmatrix} a_n \\ a_{n-1} \end{pmatrix} = M^{n-2} \begin{pmatrix} a_2 \\ a_1 \end{pmatrix} \quad \text{for } n \ge 2. \]

It turns out this gives a very efficient way of computing the sequence: with exponentiation by squaring, computing \(M^{n-2}\) takes only O(log n) matrix multiplications.

  1. Implement seq_log(n, c1, c2) that computes \(a_n\). You may use the matrix functions provided below.
    • Verify your function by computing the result for n = 10, (c1, c2) = (2, 1). The correct answer should be 1393.
  2. Benchmark seq_dbu_m vs seq_log for large n (choose values like 10**3, 10**4 or as large as your machine can handle).
  3. In a few sentences, write down the biggest take-home message you learn from using five different ways of solving the same problem.

Render your homework into an HTML file. Print the HTML file to a PDF and put them, together with the source file (.qmd), into a release in your GitHub repo.


Example code: 2×2 matrices + operations

Here are some 2×2 matrix functions that could be useful. You may use these functions directly when implementing seq_log(n, c1, c2).

# A 2x2 matrix is represented as:
# [[a, b],
#  [c, d]]

def mat2_mul(A, B):
    """Multiply two 2x2 matrices A and B."""
    return [
        [
            A[0][0] * B[0][0] + A[0][1] * B[1][0],
            A[0][0] * B[0][1] + A[0][1] * B[1][1],
        ],
        [
            A[1][0] * B[0][0] + A[1][1] * B[1][0],
            A[1][0] * B[0][1] + A[1][1] * B[1][1],
        ],
    ]

def mat2_vec_mul(A, v):
    """Multiply 2x2 matrix A by 2-vector v = [x, y]."""
    return [
        A[0][0] * v[0] + A[0][1] * v[1],
        A[1][0] * v[0] + A[1][1] * v[1],
    ]

def mat2_pow(M, n):
    """
    Compute M^n for a 2x2 matrix M and integer n >= 0
    using exponentiation by squaring.
    """
    if not isinstance(n, int) or n < 0:
        raise ValueError("n must be an integer >= 0")

    # Identity matrix I
    result = [[1, 0],
              [0, 1]]
    base = M

    while n > 0:
        if n % 2 == 1:
            result = mat2_mul(result, base)
        base = mat2_mul(base, base)
        n //= 2

    return result


# Quick sanity checks
I = [[1, 0], [0, 1]]
A = [[2, 3], [4, 5]]

assert mat2_mul(I, A) == A
assert mat2_mul(A, I) == A
assert mat2_pow(A, 0) == I
assert mat2_pow(A, 1) == A

14.3 Homework 3

You are asked to model a binary outcome using a set of features. The data come from two groups:

  • Group A (large)
  • Group B (small)

You are asked to use the provided code to run the analysis and answer the questions at the end. The code simulates a dataset with a strong signal for group A and a weaker signal plus higher noise for group B, and then runs three different model strategies: 1) a pooled model, 2) two separate models, and 3) a partial-pooling model.

14.3.1 Creating the data and the model strategies

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

# -----------------------------
# 0) Create a dataset
# -----------------------------
rng = np.random.default_rng(3255)

nA = 2500   # large group
nB = 120    # small group
p_noise = 40

def sim_group(n, group):
    x1 = rng.normal(0, 1, n)
    x2 = rng.normal(0, 1, n)
    noise = {f"z{i}": rng.normal(0, 1, n) for i in range(p_noise)}

    # Group-specific relationship between features and outcome
    # A: strong signal, easier to learn
    # B: weaker signal + higher noise, hard to learn with small n
    if group == "A":
        b0, b1, b2, sig = -0.2,  2.2, -1.4, 1.0
    else:
        b0, b1, b2, sig = -0.2, -0.8,  0.6, 1.0

    lin = b0 + b1*x1 + b2*x2 + rng.normal(0, sig, n)
    p = 1 / (1 + np.exp(-lin))
    y = rng.binomial(1, p)

    d = {"group": group, "x1": x1, "x2": x2, "y": y}
    d.update(noise)
    return pd.DataFrame(d)

df = pd.concat([sim_group(nA, "A"), sim_group(nB, "B")], ignore_index=True)

print("Counts:\n", df["group"].value_counts())
print("Base rate y=1 overall:", round(df["y"].mean(), 3))
print("Base rate by group:\n", df.groupby("group")["y"].mean().round(3))
Counts:
 group
A    2500
B     120
Name: count, dtype: int64
Base rate y=1 overall: 0.472
Base rate by group:
 group
A    0.472
B    0.458
Name: y, dtype: float64
# Split within each group so both groups appear in train/test
train_parts, test_parts = [], []
for g, sub in df.groupby("group"):
    tr, te = train_test_split(sub, test_size=0.30, random_state=1, stratify=sub["y"])
    train_parts.append(tr)
    test_parts.append(te)

train = pd.concat(train_parts).sample(frac=1, random_state=1)
test  = pd.concat(test_parts).sample(frac=1, random_state=1)

features = ["x1", "x2"] + [f"z{i}" for i in range(p_noise)]

Xtr = train[features]
ytr = train["y"]
Xte = test[features]
yte = test["y"]
gte = test["group"]

# -----------------------------
# Helper: evaluate by group
# -----------------------------
def eval_by_group(model, X_test, y_test, g_test):
    proba = model.predict_proba(X_test)[:, 1]
    pred = (proba >= 0.5).astype(int)

    rows = []
    for g in sorted(g_test.unique()):
        idx = (g_test == g)
        yt = y_test[idx]
        pb = proba[idx]
        pr = pred[idx]

        acc = accuracy_score(yt, pr)
        auc = roc_auc_score(yt, pb) if yt.nunique() == 2 else np.nan

        tn, fp, fn, tp = confusion_matrix(yt, pr, labels=[0, 1]).ravel()
        fpr = fp / (fp + tn) if (fp + tn) else np.nan
        fnr = fn / (fn + tp) if (fn + tp) else np.nan

        rows.append({"group": g, "n_test": idx.sum(), "acc": acc, "auc": auc, "FPR": fpr, "FNR": fnr})

    return pd.DataFrame(rows)
# -----------------------------
# 1) POOLED model
# -----------------------------
pooled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
pooled.fit(Xtr, ytr)

print("\n=== 1) Pooled (ignores group) ===")
print(eval_by_group(pooled, Xte, yte, gte).round(3))

=== 1) Pooled (ignores group) ===
  group  n_test    acc    auc    FPR    FNR
0     A     750  0.768  0.856  0.197  0.271
1     B      36  0.361  0.310  0.474  0.824
# -----------------------------
# 2) SEPARATE models
# -----------------------------
print("\n=== 2) Separate models (A model, B model) ===")
rows = []
for g in ["A", "B"]:
    tr_g = train[train["group"] == g]
    te_g = test[test["group"] == g]
    m = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
    m.fit(tr_g[features], tr_g["y"])
    res = eval_by_group(m, te_g[features], te_g["y"], te_g["group"])
    rows.append(res)

sep_results = pd.concat(rows, ignore_index=True)
print(sep_results.round(3))

=== 2) Separate models (A model, B model) ===
  group  n_test    acc    auc    FPR    FNR
0     A     750  0.779  0.860  0.184  0.263
1     B      36  0.556  0.563  0.421  0.471
# -----------------------------
# 3) "Partial pooling"
# -----------------------------
def add_interactions(dfX_with_group):
    X = dfX_with_group.copy()
    gB = (X["group"] == "B").astype(int)
    X = X.drop(columns=["group"])
    X["gB"] = gB
    X["x1_gB"] = dfX_with_group["x1"] * gB
    X["x2_gB"] = dfX_with_group["x2"] * gB
    return X

Xtr_int = add_interactions(train[["group"] + features])
Xte_int = add_interactions(test[["group"] + features])

partial = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
partial.fit(Xtr_int, ytr)

print("\n=== 3) One model with group interaction (regularized) ===")
print(eval_by_group(partial, Xte_int, yte, gte).round(3))

=== 3) One model with group interaction (regularized) ===
  group  n_test    acc    auc    FPR    FNR
0     A     750  0.773  0.861  0.189  0.268
1     B      36  0.639  0.666  0.316  0.412
  • Explain the data generating process and how the two groups differ in terms of signal and noise. Why is it harder to learn for group B?

  • Explain the evaluation metrics (accuracy, AUC, FPR, FNR) and what they mean in this binary classification context.

  • Explain how the pooled model works and how it performs for each group. Why does it perform worse for B?

  • Explain how separate models work and how they perform for each group. Why does the B model perform better?

  • Explain how the interaction model works and how it performs for each group. Why does it lead to the best performance for B?

  • What’s the ethical/real-world lesson here about how to handle small groups in modeling?
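As a warm-up for the metrics question, here is a toy example (the labels are made up for illustration) showing how FPR and FNR fall out of confusion-matrix counts:

```python
# Toy labels: 4 true negatives/positives each; values chosen for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Count the four confusion-matrix cells by comparing each pair.
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

fpr = fp / (fp + tn)  # 0.25: one of four negatives wrongly flagged
fnr = fn / (fn + tp)  # 0.5: two of four positives missed
```

These are the same formulas used in eval_by_group above; there they are computed per group, which is why the pooled model's high FNR for group B stands out.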

14.4 Homework 4

This homework asks you to carefully study two case studies from our notes.

Submission

  • hw4.qmd
  • hw4.html
  • hw4.pdf

14.4.1 Part 1. Community-level suicide risk case study

Answer the following questions based on 01-case-study/01-case-study.qmd.

This part focuses on two ideas from the case study:

  • rare-event rates can be unstable, especially for small populations;
  • when many districts are tested at once, multiple-testing adjustment matters.
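To see why the adjustment matters, a back-of-the-envelope calculation (assuming independent tests, which is a simplification):

```python
# With m independent tests at level alpha and every null true,
# P(at least one false "discovery") = 1 - (1 - alpha)^m.
alpha, m = 0.05, 120  # m matches the default number of districts below
p_any = 1 - (1 - alpha) ** m
print(round(p_any, 3))  # 0.998 -- a false alarm is almost guaranteed
```

With 120 districts, some "significant" raw p-values are expected even when nothing is truly elevated.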

You may use the starter code below and extend it as needed.

Show starter code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import poisson
from statsmodels.stats.multitest import multipletests

def simulate_districts(
  n_districts=120,
  base_rate=15 / 10000,
  pop_low=800,
  pop_high=20000,
  elevated_idx=None,
  elevated_rate=30 / 10000,
  seed=3255,
):
  """
  Generate district-level counts from a Poisson model.

  Parameters
  ----------
  elevated_idx : list-like of integer row positions or None
    Districts with truly elevated rates (used in Q4).
  """
  rng = np.random.default_rng(seed)
  pop = rng.integers(pop_low, pop_high, size=n_districts)

  true_rate = np.full(n_districts, base_rate, dtype=float)
  if elevated_idx is not None and len(elevated_idx) > 0:
    true_rate[np.array(elevated_idx, dtype=int)] = elevated_rate

  lam = pop * true_rate
  count = rng.poisson(lam)

  out = pd.DataFrame(
    {
      "district": [f"D{i:03d}" for i in range(n_districts)],
      "population": pop,
      "count": count,
      "true_rate": true_rate,
      "expected_count": lam,
    }
  )
  out["rate"] = out["count"] / out["population"]
  out["rate_per_10k"] = out["rate"] * 10000
  out["is_elevated"] = out["true_rate"] > base_rate
  return out


def poisson_two_sided_pvalue(count, pop, rate0):
  """Two-sided exact p-value against H0: count ~ Poisson(pop * rate0)."""
  mu = pop * rate0
  p_lo = poisson.cdf(count, mu)
  p_hi = poisson.sf(count - 1, mu)
  return min(1.0, 2 * min(p_lo, p_hi))


def add_poisson_tests(df, rate0=15 / 10000, alpha=0.05):
  """Add raw and adjusted p-values (BH/FDR and Bonferroni) to a copy of df."""
  out = df.copy()

  # Extract one p-value per district.
  out["p_raw"] = [
    poisson_two_sided_pvalue(c, p, rate0)
    for c, p in zip(out["count"], out["population"])
  ]

  # Multiple-testing adjustments.
  out["p_bh"] = multipletests(out["p_raw"], method="fdr_bh")[1]
  out["p_bonf"] = multipletests(out["p_raw"], method="bonferroni")[1]

  # Decision columns at alpha.
  out["sig_raw"] = out["p_raw"] < alpha
  out["sig_bh"] = out["p_bh"] < alpha
  out["sig_bonf"] = out["p_bonf"] < alpha
  return out

  1. Rare-event instability

    Simulate one dataset under a common true rate of 15 per 10,000 for all districts.

    • Make a scatterplot of rate_per_10k versus population.
    • Identify the 10 districts with the highest observed rates.
    • For those 10 districts, report population and count.

    Then answer briefly: what pattern do you see, and why should this make us cautious about ranking districts only by raw observed rate?

Show code for Question 1
df1 = simulate_districts(seed=3255)

plt.figure(figsize=(7, 4))
plt.scatter(df1["population"], df1["rate_per_10k"], alpha=0.7)
plt.xlabel("Population")
plt.ylabel("Observed rate per 10,000")
plt.title("Rare-event rates vs district population")
plt.show()

top10 = (
  df1.sort_values("rate_per_10k", ascending=False)
  .head(10)[["district", "population", "count", "rate_per_10k"]]
)
top10

district population count rate_per_10k
50 D050 3910 10 25.575448
100 D100 1957 5 25.549310
2 D002 8950 22 24.581006
5 D005 10268 25 24.347487
61 D061 7969 19 23.842389
85 D085 15057 34 22.580859
93 D093 14978 32 21.364668
108 D108 10776 23 21.343727
76 D076 18095 38 21.000276
79 D079 8242 17 20.626062

  2. Repeat the simulation many times

    Repeat the simulation in Question 1 at least 300 times.

    For each simulated dataset, record:

    • the maximum observed rate;
    • the population of the district with the maximum observed rate;
    • whether the top-ranked district had population below 3000.

    Summarize your results in a small table and one plot. Then explain what this says about instability in rare-event settings.

Show code for Question 2
records = []

for s in range(300):
  d = simulate_districts(seed=9000 + s)
  i = d["rate_per_10k"].idxmax()
  records.append(
    {
      "sim": s,
      "max_rate_per_10k": d.loc[i, "rate_per_10k"],
      "top_population": d.loc[i, "population"],
      "top_pop_below_3000": d.loc[i, "population"] < 3000,
    }
  )

rep = pd.DataFrame(records)

summary_q2 = pd.DataFrame(
  {
    "quantity": [
      "mean max rate per 10k",
      "median population of top district",
      "share top district pop < 3000",
    ],
    "value": [
      rep["max_rate_per_10k"].mean(),
      rep["top_population"].median(),
      rep["top_pop_below_3000"].mean(),
    ],
  }
)
summary_q2

plt.figure(figsize=(7, 4))
plt.hist(rep["top_population"], bins=20, edgecolor="white")
plt.xlabel("Population of top-ranked district")
plt.ylabel("Frequency across simulations")
plt.title("Who becomes the top-ranked district?")
plt.show()


  3. Screening districts against a statewide benchmark

    For one simulated dataset, test each district against the true rate using a two-sided Poisson test.

    • Compute all raw p-values.
    • Count how many districts satisfy p < 0.05.

    In this simulation, all districts truly have the same rate. Explain why raw p-values can still produce “discoveries.”

Show code for Question 3
tested = add_poisson_tests(df1, rate0=15 / 10000, alpha=0.05)

# Extract district-level p-values.
pvalue_table = tested[
  ["district", "population", "count", "p_raw", "p_bh", "p_bonf"]
].sort_values("p_raw")
pvalue_table.head(10)

# Count discoveries under each method.
tested[["sig_raw", "sig_bh", "sig_bonf"]].sum()
sig_raw     4
sig_bh      0
sig_bonf    0
dtype: int64

  4. Add a few truly elevated districts

    Modify the simulation so that most districts still have true rate 15 per 10,000, but 5 districts have true rate 30 per 10,000.

    • Compare raw p-values.
    • Compute how many of the 5 elevated districts are detected by each method.
    • Compute how many non-elevated districts are incorrectly flagged.

    Based on your simulation, what tradeoff do you see between sensitivity and false discoveries?

Show code for Question 4
rng_q4 = np.random.default_rng(3255)
elevated_idx = rng_q4.choice(120, size=5, replace=False)

df4 = simulate_districts(
  seed=2026,
  elevated_idx=elevated_idx,
  elevated_rate=30 / 10000,
)
tested4 = add_poisson_tests(df4, rate0=15 / 10000, alpha=0.05)

comparison_q4 = pd.DataFrame(
  {
    "method": ["raw", "BH/FDR", "Bonferroni"],
    "detected_elevated": [
      (tested4["is_elevated"] & tested4["sig_raw"]).sum(),
      (tested4["is_elevated"] & tested4["sig_bh"]).sum(),
      (tested4["is_elevated"] & tested4["sig_bonf"]).sum(),
    ],
    "false_flags": [
      ((~tested4["is_elevated"]) & tested4["sig_raw"]).sum(),
      ((~tested4["is_elevated"]) & tested4["sig_bh"]).sum(),
      ((~tested4["is_elevated"]) & tested4["sig_bonf"]).sum(),
    ],
  }
)
comparison_q4

# Optional: inspect which districts were truly elevated.
tested4.loc[tested4["is_elevated"], ["district", "population", "count", "p_raw", "p_bh", "p_bonf"]]
district population count p_raw p_bh p_bonf
5 D005 9771 41 2.525044e-08 0.000003 0.000003
37 D037 6135 19 6.149559e-03 0.147589 0.737947
47 D047 1046 3 4.173245e-01 1.000000 1.000000
49 D049 9395 28 1.400344e-03 0.059040 0.168041
50 D050 19821 49 1.476008e-03 0.059040 0.177121

14.4.2 Part 2. Simpson’s paradox case study

Answer the following questions based on 04-case-study/04_case_study_simpsons_paradox.qmd.

You may use the starter code below and extend it as needed.

Show starter code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

simpson_df = pd.DataFrame(
    {
        "group": ["A", "A", "B", "B"],
        "difficulty": ["Easy", "Hard", "Easy", "Hard"],
        "success_rate": [0.90, 0.60, 0.85, 0.55],
        "n": [100, 900, 900, 100],
    }
)

simpson_df["successes"] = np.round(simpson_df["success_rate"] * simpson_df["n"]).astype(int)
simpson_df["failures"] = simpson_df["n"] - simpson_df["successes"]
simpson_df
group difficulty success_rate n successes failures
0 A Easy 0.90 100 90 10
1 A Hard 0.60 900 540 360
2 B Easy 0.85 900 765 135
3 B Hard 0.55 100 55 45

  1. Verify the paradox numerically

    Using the starter table:

    • compute the overall success rate for group A and group B;
    • compute the success rate within each difficulty level;
    • show numerically that A is better within each subgroup but worse overall.

    Present your results in one clean table.

Show code for Question 1
within_rates = simpson_df[["group", "difficulty", "successes", "n", "success_rate"]].copy()

overall_rates = (
  simpson_df.groupby("group", as_index=False)
  .agg(successes=("successes", "sum"), n=("n", "sum"))
)
overall_rates["difficulty"] = "Overall"
overall_rates["success_rate"] = overall_rates["successes"] / overall_rates["n"]

result_table = pd.concat(
  [within_rates, overall_rates[within_rates.columns]],
  ignore_index=True,
)
result_table.sort_values(["difficulty", "group"])

# TODO: point out where A > B and where A < B.
group difficulty successes n success_rate
0 A Easy 90 100 0.90
2 B Easy 765 900 0.85
1 A Hard 540 900 0.60
3 B Hard 55 100 0.55
4 A Overall 630 1000 0.63
5 B Overall 820 1000 0.82

  2. Visualize the reversal

    Create at least two plots:

    • one plot showing the within-difficulty comparison;
    • one plot showing the overall comparison.

    Then write 3–5 sentences explaining what a careless analyst might conclude from the overall plot alone.

Show code for Question 2
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

within_plot = simpson_df.pivot(index="difficulty", columns="group", values="success_rate")
within_plot.plot(kind="bar", ax=axes[0], rot=0)
axes[0].set_title("Within difficulty")
axes[0].set_ylabel("Success rate")
axes[0].legend(title="Group")

overall_plot = overall_rates.set_index("group")["success_rate"]
overall_plot.plot(kind="bar", ax=axes[1], rot=0, color=["C0", "C1"])
axes[1].set_title("Overall")
axes[1].set_ylabel("Success rate")

plt.tight_layout()
plt.show()