10  Case study: Community risk for adolescent suicide

10.1 Background

Suicide is one of the most serious public-health problems affecting adolescents and young adults in the United States. Beyond deaths by suicide, many more youths experience suicidal thoughts or attempts, events that may lead to medical care and hospitalization.

10.2 Objective

The purpose of this study is to use statewide data on inpatient hospitalizations and mortality to identify communities and school districts at risk for adolescent suicide.

10.3 Data & Resources

  • Five years of data (2010-2014) from the Office of the Connecticut Medical Examiner
  • The Connecticut Hospital Inpatient Discharge Database (HIDD)
  • Other publicly available data (Census data, ACS, open data from the state, etc.)

10.4 1 Explore the data

Before we do any analysis or build any model, we need to be clear about what question we’re actually trying to answer, what our data can (and cannot) measure, and what outcome makes sense (for analysis at the community/school-district level).

10.4.1 1.1 What is the “outcome” we want to study?

From a public-health perspective, the most direct outcome is suicide mortality. But from a data-science perspective, suicide deaths among adolescents are (fortunately) very rare events, especially when we break the state into many districts and look year-by-year. That rarity creates several immediate issues:

  • Instability
    • A single death in a small district can make its rate appear extreme, even if it’s mostly random variation.
  • Low statistical power
    • With very few events, it is hard to distinguish a truly high-risk district from one that simply had a bad year.
  • Trouble in predictive modeling
    • Predictive models may not perform well on very rare events or highly imbalanced data and are thus unlikely to generate useful insight.

This is why the dataset includes not only deaths, but also inpatient hospitalizations for suicide attempts. These events are more common and can provide additional information about severe suicidal behavior, especially when we are trying to detect differences across communities.
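To make the instability concrete, here is a small simulation (the population sizes and the rate are hypothetical, chosen only for illustration): two districts share the same true event rate, one small and one large, and we compare the spread of their observed yearly rates.

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate = 0.0015  # hypothetical yearly event rate per adolescent (illustrative)

results = {}
for name, pop in {"small": 500, "large": 10_000}.items():
    counts = rng.poisson(true_rate * pop, size=10_000)  # simulated yearly counts
    results[name] = counts / pop                        # observed crude rates
    print(f"{name} (pop {pop:>6}): sd of rate = {results[name].std():.5f}, "
          f"share of zero-event years = {(counts == 0).mean():.2f}")
```

The small district's observed rate is far noisier, and nearly half of its simulated years have zero events even though the underlying risk is identical, which is exactly why raw district-level rates must be interpreted with care.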

10.4.1.1 Option A: Deaths only (mortality)

  • Pros
    • Clear definition; less ambiguity about classification.
    • Closest to the most severe outcome of interest.
  • Cons
    • Very small counts at the district level.
    • Rates will be noisy; many districts may have zeros for multiple years.
    • Hard to identify meaningful geographic variation without pooling heavily.

10.4.1.2 Option B: Deaths + inpatient suicide-attempt hospitalizations

  • Pros
    • More events → more stable rates and better ability to detect variation.
    • Captures severe nonfatal outcomes that also demand prevention resources.
  • Cons
    • Hospitalization reflects both behavior and health-care access/usage.
    • Not all attempts result in inpatient admission (many are treated outpatient or not treated).
    • Coding practices and help-seeking patterns may differ across communities.

A useful way to phrase this is:

Mortality measures the rarest and most severe outcome; hospitalization measures a broader set of severe crises, but may be influenced by the health-care system and is thus a less direct measure of the underlying risk.

10.4.2 1.2 What is the “unit” of analysis?

Our goal is to identify communities and school districts with unusually high risk. That means we will not model individual-level outcomes. Instead, we will work with aggregated counts:

  • counts of deaths by district and year
  • counts of hospitalizations by district and year
  • or combined counts of both by district and year

10.4.2.1 Are “counts” enough?

To compare districts fairly, we also need a denominator: the population at risk (e.g., adolescent population in each district). This is what turns counts into rates.

So, at minimum, our analysis dataset should look like:

  • district, year
  • deaths, attempt_hospitalizations (or similar)
  • population (adolescent population)
  • plus district-level context variables (we’ll come back to these later)
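As a minimal sketch of this layout (all district names and numbers below are invented, not taken from the actual data), the analysis table and a crude rate could be built as:

```python
import pandas as pd

# Hypothetical long-format analysis table: one row per district-year
toy = pd.DataFrame({
    "district": ["A", "A", "B", "B"],
    "year": [2010, 2011, 2010, 2011],
    "deaths": [0, 1, 0, 0],
    "attempt_hospitalizations": [2, 1, 0, 3],
    "population": [1200, 1210, 800, 805],   # adolescents at risk
})

# Combined outcome and crude rate per 10,000 adolescents
toy["events"] = toy["deaths"] + toy["attempt_hospitalizations"]
toy["rate_per_10k"] = toy["events"] / toy["population"] * 10_000
print(toy)
```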

10.4.3 1.3 Data exploration (no modeling yet)

We’ll use Python to explore the data and answer a few practical questions:

  1. How many events are we talking about?
    • Totals by year and across all years for deaths and hospitalizations.
  2. How sparse are counts at the district level?
    • How many district-years have zero deaths?
    • How many have zero hospitalizations?
    • How many have zero combined events?
  3. How much do raw rates vary?
    • We’ll compute crude rates per district (events divided by population) and visualize their distribution. This will help us see how “noisy” the raw rates are, especially for small populations.
  4. Does combining deaths + attempts change the picture?
    • We’ll compare summaries and basic plots for:
      • mortality only
      • hospitalizations only
      • combined outcome

The point of these steps is not to “discover the answer” yet: it is to make sure we are asking a question the data can reasonably support.
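The sparsity check in step 2 can be sketched as follows, using a tiny made-up wide table in place of the real data:

```python
import pandas as pd

# Hypothetical wide table of combined yearly event counts per district
toy = pd.DataFrame({
    "school_district": ["A", "B", "C"],
    "ALL2010": [1, 0, 0],
    "ALL2011": [0, 0, 2],
})

# Reshape to one row per district-year, then count zero-event cells
year_cols = [c for c in toy.columns if c.startswith("ALL")]
long = toy.melt(id_vars="school_district", value_vars=year_cols,
                var_name="year", value_name="events")
n_zero = int((long["events"] == 0).sum())
print(f"{n_zero} of {len(long)} district-years have zero events")
```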

10.4.3.1 Load packages

Show Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)

print("Libraries loaded.")
Libraries loaded.
Show Python
import os
print("CWD:", os.getcwd())
CWD: /Users/kun.chen/Dropbox/Teaching/Spring-2026/STAT3255/ids-s26/01-case-study

10.4.4 Set file paths

Show Python
import pandas as pd
import numpy as np

# File paths (relative to project root)
path_sch = "data/yearlycounts_bySuicideCodings_2010_2014_15-19.csv"
# This death data is sensitive and not shared. Only processing code is shown.
# path_death = "data/data_death_2004_2015.csv"

years = [2010, 2011, 2012, 2013, 2014]

print("Paths:")
print("  path_sch  =", path_sch)
# print("  path_death=", path_death)
print("Years:", years)
Paths:
  path_sch  = data/yearlycounts_bySuicideCodings_2010_2014_15-19.csv
Years: [2010, 2011, 2012, 2013, 2014]

10.4.4.1 Read the school-district admissions/attempt counts

Show Python
sch_data = pd.read_csv(path_sch)

print("sch_data shape:", sch_data.shape)
print("sch_data columns (first 25):", list(sch_data.columns)[:25])
display(sch_data.head())
sch_data shape: (119, 36)
sch_data columns (first 25): ['Unnamed: 0', 'school_district', 'E952010', 'E952011', 'E952012', 'E952013', 'E952014', 'EXTRA2010', 'EXTRA2011', 'EXTRA2012', 'EXTRA2013', 'EXTRA2014', 'UNDET2010', 'UNDET2011', 'UNDET2012', 'UNDET2013', 'UNDET2014', 'VCODE2010', 'VCODE2011', 'VCODE2012', 'VCODE2013', 'VCODE2014', 'ALL2010', 'ALL2011', 'ALL2012']
Unnamed: 0 school_district E952010 E952011 E952012 E952013 E952014 EXTRA2010 EXTRA2011 EXTRA2012 EXTRA2013 EXTRA2014 UNDET2010 UNDET2011 UNDET2012 UNDET2013 UNDET2014 VCODE2010 VCODE2011 VCODE2012 VCODE2013 VCODE2014 ALL2010 ALL2011 ALL2012 ALL2013 ALL2014 DRUG2010 DRUG2011 DRUG2012 DRUG2013 DRUG2014 pop_sch_15_19 pop_sch_all long lat
0 1 Ansonia School District 1 0 1 0 1 1 0 0 0 2 0 0 0 0 0 0 0 0 0 1 1 0 1 0 2 1 0 0 0 0 1289.310 19187.216 -73.074460 41.342690
1 2 Avon School District 0 1 1 0 1 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1 0 1283.500 18391.590 -72.864310 41.789698
2 3 Berlin School District 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 2 2 0 1 1 0 1 0 0 0 1020.375 19975.086 -72.743755 41.615898
3 4 Bethel School District 1 0 1 1 2 1 0 1 0 2 0 1 0 0 0 0 0 1 0 0 1 1 1 1 2 0 1 0 0 2 1221.685 18663.863 -73.401050 41.379978
4 5 Bloomfield School District 1 1 2 5 1 1 1 1 3 1 0 0 0 0 0 0 0 0 1 0 1 1 2 5 1 1 0 0 0 1 1094.930 20483.190 -72.726420 41.832798

10.4.4.2 Create admcount: keep school_district and ALL* columns

Show Python
# Keep: school_district + ALL columns
adm_cols = [c for c in sch_data.columns if "ALL" in c]

print("Detected adm_cols:", adm_cols)

# Keep school_district plus the yearly ALL columns
admcount = sch_data[["school_district"] + adm_cols].copy()

print("admcount shape:", admcount.shape)
display(admcount.head())

# Check for missing district labels
print("Missing school_district in admcount:", admcount["school_district"].isna().sum())
Detected adm_cols: ['ALL2010', 'ALL2011', 'ALL2012', 'ALL2013', 'ALL2014']
admcount shape: (119, 6)
school_district ALL2010 ALL2011 ALL2012 ALL2013 ALL2014
0 Ansonia School District 1 0 1 0 2
1 Avon School District 1 1 1 1 1
2 Berlin School District 2 2 0 1 1
3 Bethel School District 1 1 1 1 2
4 Bloomfield School District 1 1 2 5 1
Missing school_district in admcount: 0

10.4.4.3 Read the death records

Show Python
# data_death = pd.read_csv(path_death)

# print("data_death shape:", data_death.shape)
# print("data_death columns (first 25):", list(data_death.columns)[:25])
# display(data_death.head())

# Filter out deleted/invalid records (matches R: del_ind == 0)
# data_death_used = data_death[data_death["del_ind"] == 0].copy()
# print("data_death_used shape (del_ind==0):", data_death_used.shape)

# Quick check: del_ind values
# print("del_ind value counts:")
# print(data_death["del_ind"].value_counts(dropna=False).head(10))

10.4.4.4 Extract year from dateofdeath + keep 2010–2014

Show Python
# Extract year from dateofdeath
# This is slightly more robust than substr(); it handles real dates when possible.
# data_death_used["year"] = pd.to_datetime(
#     data_death_used["dateofdeath"], errors="coerce"
# ).dt.year

# print("Year parsing check:")
# print(data_death_used["year"].value_counts(dropna=False).sort_index().head(15))

# Keep target years only
# data_death_used_2010_2014 = data_death_used[data_death_used["year"].isin(years)].copy()
# print("data_death_used_2010_2014 shape:", data_death_used_2010_2014.shape)

# Check missing/invalid date parsing
# n_bad_dates = data_death_used["year"].isna().sum()
# print("Rows with unparsed dateofdeath (year is NA):", n_bad_dates)

10.4.4.5 Build deathcount: district × year table

Show Python
# death_pivot = (
#    data_death_used_2010_2014
#    .groupby(["school_district", "year"])
#    .size()
#    .unstack(fill_value=0)
#    .reindex(columns=years, fill_value=0)
#    .reset_index()
# )

# Rename to d2010..d2014
# death_pivot.columns = ["school_district"] + [f"d{y}" for y in years]
# deathcount = death_pivot.copy()

# print("deathcount shape:", deathcount.shape)
# print("deathcount columns:", list(deathcount.columns))
# display(deathcount.head())

# Sanity check totals
# print("Total deaths 2010–2014:", deathcount[[f"d{y}" for y in years]].to_numpy().sum())

# Save aggregated counts
# os.makedirs("data", exist_ok=True)
# deathcount_path = "data/deathcount_by_district_2010_2014.csv"
# deathcount.to_csv(deathcount_path, index=False)


# print("Saved aggregated death counts to:", deathcount_path)
# print("Rows, cols:", deathcount.shape)
# print("Preview:")
# display(deathcount.head())

10.4.4.6 Merge admcount + deathcount to form allcount

Show Python
deathcount_path = "data/deathcount_by_district_2010_2014.csv"
deathcount = pd.read_csv(deathcount_path)

print("Loaded deathcount from:", deathcount_path)
print("deathcount shape:", deathcount.shape)
display(deathcount.head())

allcount = admcount.merge(deathcount, on="school_district", how="outer")

print("allcount shape before fillna:", allcount.shape)
print("Missing values before fillna:", allcount.isna().sum().sum())

allcount = allcount.fillna(0)

print("Missing values after fillna:", allcount.isna().sum().sum())
display(allcount.head())

# Ensure numeric columns are ints
death_cols = [f"d{y}" for y in years]
for c in adm_cols + death_cols:
    allcount[c] = pd.to_numeric(allcount[c], errors="coerce").fillna(0).astype(int)

print("Numeric conversion done. Dtypes:")
print(allcount[adm_cols + death_cols].dtypes.value_counts())
Loaded deathcount from: data/deathcount_by_district_2010_2014.csv
deathcount shape: (29, 6)
school_district d2010 d2011 d2012 d2013 d2014
0 Ansonia School District 0 1 0 0 0
1 Bristol School District 1 0 0 1 1
2 Brookfield Area School District 0 1 0 0 0
3 Cromwell School District 0 0 1 0 0
4 Danbury School District 0 0 0 0 1
allcount shape before fillna: (119, 11)
Missing values before fillna: 450
Missing values after fillna: 0
school_district ALL2010 ALL2011 ALL2012 ALL2013 ALL2014 d2010 d2011 d2012 d2013 d2014
0 Ansonia School District 1 0 1 0 2 0.0 1.0 0.0 0.0 0.0
1 Avon School District 1 1 1 1 1 0.0 0.0 0.0 0.0 0.0
2 Berlin School District 2 2 0 1 1 0.0 0.0 0.0 0.0 0.0
3 Bethel School District 1 1 1 1 2 0.0 0.0 0.0 0.0 0.0
4 Bloomfield School District 1 1 2 5 1 0.0 0.0 0.0 0.0 0.0
Numeric conversion done. Dtypes:
int64    10
Name: count, dtype: int64

10.4.4.7 Compute yearlycount = (ALL counts) + (death counts)

Show Python
# Here we align by year using lists and numpy arrays.

yearlycount = allcount[["school_district"]].copy()
yearlycount[adm_cols] = allcount[adm_cols].to_numpy() + allcount[death_cols].to_numpy()

print("yearlycount shape:", yearlycount.shape)
display(yearlycount.head())

print("Total events (deaths + attempts proxy) 2010–2014:",
      yearlycount[adm_cols].to_numpy().sum())

# Drop districts not present in sch_data (matches the match() + removal)
before = yearlycount.shape[0]
yearlycount = yearlycount[yearlycount["school_district"].isin(sch_data["school_district"])].reset_index(drop=True)
after = yearlycount.shape[0]

print(f"Filtered yearlycount to districts in sch_data: {before} -> {after}")
print("Final yearlycount totals (2010–2014):", yearlycount[adm_cols].to_numpy().sum())
yearlycount shape: (119, 6)
school_district ALL2010 ALL2011 ALL2012 ALL2013 ALL2014
0 Ansonia School District 1 1 1 0 2
1 Avon School District 1 1 1 1 1
2 Berlin School District 2 2 0 1 1
3 Bethel School District 1 1 1 1 2
4 Bloomfield School District 1 1 2 5 1
Total events (deaths + attempts proxy) 2010–2014: 1878
Filtered yearlycount to districts in sch_data: 119 -> 119
Final yearlycount totals (2010–2014): 1878

10.4.5 1.5 A refined working question

After exploration, we’ll pick one primary outcome. A reasonable working question is:

“Which districts show unusually high rates of severe suicidal behavior among adolescents from 2010–2014, where ‘severe suicidal behavior’ is measured using suicide deaths and/or inpatient hospitalizations for suicide attempts?”

Notice what this question does:

  • it specifies the population (adolescents),
  • the time window (2010–2014),
  • the geographic unit (district),
  • and leaves open an explicit choice about the outcome definition that we will justify using exploration.

10.4.5.1 Aggregate 5 years of counts per district

Show Python
nyear = 5

# These are the 5 yearly total columns (e.g., ALL2010..ALL2014)
adm_cols = [c for c in yearlycount.columns if c != "school_district"]

print("Using yearly columns:", adm_cols)

# Sum across the 5 years for each district
count_schlevel_15_19 = yearlycount[adm_cols].sum(axis=1)

print("count_schlevel_15_19 head:")
print(count_schlevel_15_19.head())

print("Total events across all districts:", int(count_schlevel_15_19.sum()))
print("Check (sum of yearlycount without district col):", int(yearlycount[adm_cols].to_numpy().sum()))
Using yearly columns: ['ALL2010', 'ALL2011', 'ALL2012', 'ALL2013', 'ALL2014']
count_schlevel_15_19 head:
0     5
1     5
2     6
3     6
4    10
dtype: int64
Total events across all districts: 1878
Check (sum of yearlycount without district col): 1878

10.4.5.2 Compute 5-year population per district

Show Python
# Here: use sch_data (same alignment as before) and multiply by 5
mydata = sch_data.copy()

if "pop_sch_15_19" not in mydata.columns:
    raise KeyError("Expected column 'pop_sch_15_19' in sch_data. Please check the column name.")

pop_schlevel_15_19 = np.round(mydata["pop_sch_15_19"] * nyear).astype("Int64")  # Int64 allows NA

print("pop_schlevel_15_19 head:")
print(pop_schlevel_15_19.head())

print("Any missing population?", pop_schlevel_15_19.isna().sum())
pop_schlevel_15_19 head:
0    6447
1    6418
2    5102
3    6108
4    5475
Name: pop_sch_15_19, dtype: Int64
Any missing population? 0

10.4.5.3 Compute district rates and drop missing (NA) rates

Show Python
# We assume the row order matches sch_data; verify quickly.
same_order = (yearlycount["school_district"].reset_index(drop=True) 
              .equals(sch_data["school_district"].reset_index(drop=True)))
print("District order matches between yearlycount and sch_data?", same_order)

if not same_order:
    # safer: merge to align by school_district
    tmp = yearlycount[["school_district"]].copy()
    tmp["count_5yr"] = count_schlevel_15_19.to_numpy()
    tmp = tmp.merge(sch_data[["school_district", "pop_sch_15_19"]], on="school_district", how="left")
    count_schlevel_15_19_aligned = tmp["count_5yr"]
    pop_schlevel_15_19_aligned = np.round(tmp["pop_sch_15_19"] * nyear)
    schdis_list_15_19 = tmp["school_district"]
else:
    count_schlevel_15_19_aligned = count_schlevel_15_19
    pop_schlevel_15_19_aligned = pop_schlevel_15_19
    schdis_list_15_19 = sch_data["school_district"]

# Compute rates
rate_schlevel_15_19 = count_schlevel_15_19_aligned / pop_schlevel_15_19_aligned

# Identify NA rates (typically due to missing population)
na_mask = rate_schlevel_15_19.isna() | np.isinf(rate_schlevel_15_19)

print("Number of NA/inf rates:", int(na_mask.sum()))

# Drop NA rows
if na_mask.any():
    pop_schlevel_15_19_aligned = pop_schlevel_15_19_aligned[~na_mask].reset_index(drop=True)
    count_schlevel_15_19_aligned = count_schlevel_15_19_aligned[~na_mask].reset_index(drop=True)
    rate_schlevel_15_19 = rate_schlevel_15_19[~na_mask].reset_index(drop=True)
    schdis_list_15_19 = schdis_list_15_19[~na_mask].reset_index(drop=True)
    mydata = mydata.loc[~na_mask].reset_index(drop=True)

print("After removing NA rates:")
print("  n districts:", len(rate_schlevel_15_19))
print("  total events:", int(count_schlevel_15_19_aligned.sum()))
print("  total pop (5-year):", float(pop_schlevel_15_19_aligned.sum()))
District order matches between yearlycount and sch_data? True
Number of NA/inf rates: 0
After removing NA rates:
  n districts: 119
  total events: 1878
  total pop (5-year): 1209993.0

10.4.5.4 Sort districts by rate (descending), show top results

Show Python
order_idx = np.argsort(-rate_schlevel_15_19.to_numpy())  # descending

print("Top 10 districts by rate:")
top_k = 10
for i in range(top_k):
    j = order_idx[i]
    print(f"{i+1:>2}. {schdis_list_15_19.iloc[j]:<35} "
          f"rate={rate_schlevel_15_19.iloc[j]:.6f}  "
          f"pop5={int(pop_schlevel_15_19_aligned.iloc[j]):>7}  "
          f"count={int(count_schlevel_15_19_aligned.iloc[j]):>4}")
Top 10 districts by rate:
 1. Clinton School District             rate=0.005318  pop5=   3573  count=  19
 2. Regional School District 04         rate=0.004779  pop5=    837  count=   4
 3. Vernon School District              rate=0.003991  pop5=   7016  count=  28
 4. Granby School District              rate=0.003923  pop5=   2294  count=   9
 5. Tolland School District             rate=0.003717  pop5=   4304  count=  16
 6. East Windsor School District        rate=0.003489  pop5=   2866  count=  10
 7. Branford School District            rate=0.003481  pop5=   6895  count=  24
 8. East Haven School District          rate=0.003450  pop5=   8115  count=  28
 9. Guilford School District            rate=0.003418  pop5=   7021  count=  24
10. Bolton School District              rate=0.003316  pop5=   3619  count=  12

10.4.5.5 Compute state average rate

Show Python
rate_state_15_19 = count_schlevel_15_19_aligned.sum() / pop_schlevel_15_19_aligned.sum()
print("State average rate (2010–2014 combined):", float(rate_state_15_19))
State average rate (2010–2014 combined): 0.001552075094649308

10.5 2 Preliminary Analysis and Visualization

We can now visualize the data and carry out some preliminary statistical analysis. It is useful to visualize the counts, the rates, the at-risk population sizes, and the set of “distinctive” school districts on a spatial map of the State of Connecticut.

10.5.0.1 Testing for rates above and below the state average

Show Python
#import numpy as np
#import pandas as pd
from scipy.stats import poisson, chi2
from statsmodels.stats.multitest import multipletests

# Helper: exact Poisson CI for a rate (Garwood) using chi-square
def poisson_rate_ci(x, T, alpha=0.05):
    """
    Exact (Garwood) CI for Poisson rate lambda = x/T.
    Returns (lower, upper).
    """
    x = int(x)
    if T <= 0:
        return (np.nan, np.nan)

    if x == 0:
        lower = 0.0
        upper = 0.5 * chi2.ppf(1 - alpha/2, 2*(x+1)) / T
    else:
        lower = 0.5 * chi2.ppf(alpha/2, 2*x) / T
        upper = 0.5 * chi2.ppf(1 - alpha/2, 2*(x+1)) / T

    return (lower, upper)

# Helper: Poisson test against a known rate r0
def poisson_test_against_rate(x, T, r0, alternative="two-sided"):
    """
    Test H0: lambda = r0*T for observed x ~ Poisson(lambda).
    alternative in {"two-sided","greater","less"}.
    Returns p-value.
    """
    x = int(x)
    mu = float(T) * float(r0)

    if alternative == "greater":
        # P(X >= x) = 1 - P(X <= x-1)
        return poisson.sf(x-1, mu)

    if alternative == "less":
        # P(X <= x)
        return poisson.cdf(x, mu)

    if alternative == "two-sided":
        # "Exact" two-sided p-value via doubling the smaller tail (common approach)
        p_lo = poisson.cdf(x, mu)
        p_hi = poisson.sf(x-1, mu)
        p = 2 * min(p_lo, p_hi)
        return min(p, 1.0)

    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
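As a quick sanity check, the computations behind these helpers can be applied directly to a single hypothetical district (10 events over a 5-year person-total of 3,000, tested against an assumed reference rate of 0.0015; none of these numbers come from the data). This sketch repeats the chi-square and Poisson-tail formulas so it runs on its own:

```python
import numpy as np
from scipy.stats import poisson, chi2

# Hypothetical district: x events over a 5-year person-total T,
# tested against an assumed reference rate r0 (not from the data)
x, T, r0 = 10, 3000.0, 0.0015

rate = x / T
# Garwood exact 95% CI via the chi-square construction (x > 0 case)
lo = 0.5 * chi2.ppf(0.025, 2 * x) / T
hi = 0.5 * chi2.ppf(0.975, 2 * (x + 1)) / T
# One-sided upper-tail p-value: P(X >= x) under the reference rate
p_above = poisson.sf(x - 1, T * r0)

print(f"rate = {rate:.5f}, 95% CI = ({lo:.5f}, {hi:.5f}), p_above = {p_above:.4f}")
```

With these numbers the observed rate is roughly double the reference rate and the one-sided p-value falls below 0.05, so this hypothetical district would flag as elevated.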

10.5.0.2 Build district-level vectors (counts, population, rates)

Show Python
# Convert to clean numpy arrays / Series
districts = pd.Series(schdis_list_15_19).reset_index(drop=True)
counts = pd.Series(count_schlevel_15_19_aligned).astype(int).reset_index(drop=True)
pop5 = pd.Series(pop_schlevel_15_19_aligned).astype(float).reset_index(drop=True)  # 5-year pop
rates = pd.Series(rate_schlevel_15_19).astype(float).reset_index(drop=True)

print("Sanity checks:")
print("  n districts:", len(districts))
print("  total counts:", int(counts.sum()))
print("  total pop5:", float(pop5.sum()))
print("  state rate:", float(rate_state_15_19))
Sanity checks:
  n districts: 119
  total counts: 1878
  total pop5: 1209993.0
  state rate: 0.001552075094649308

10.5.0.3 Compute p-values and confidence intervals

Show Python
p_above = np.empty(len(districts), dtype=float)
p_below = np.empty(len(districts), dtype=float)
p_twoside = np.empty(len(districts), dtype=float)

lower = np.empty(len(districts), dtype=float)
upper = np.empty(len(districts), dtype=float)

p_oneside_dir = np.empty(len(districts), dtype=float)
lower_oneside = np.empty(len(districts), dtype=float)
upper_oneside = np.empty(len(districts), dtype=float)

for i in range(len(districts)):
    x = counts.iloc[i]
    T = pop5.iloc[i]

    # One-sided Poisson tail p-values
    # R: higher = 1-ppois(x,mu)+dpois(x,mu)  == P(X >= x)
    # R: lower  = ppois(x,mu)               == P(X <= x)
    p_above[i] = poisson.sf(x-1, T * rate_state_15_19)
    p_below[i] = poisson.cdf(x, T * rate_state_15_19)

    # Two-sided Poisson test vs expected mean under state rate
    p_twoside[i] = poisson_test_against_rate(x, T, rate_state_15_19, alternative="two-sided")

    # Two-sided 95% CI for district rate (Garwood)
    lo, hi = poisson_rate_ci(x, T, alpha=0.05)
    lower[i], upper[i] = lo, hi

    # Directional one-sided test (choose based on observed rate vs state rate)
    alt = "less" if rates.iloc[i] < rate_state_15_19 else "greater"
    p_oneside_dir[i] = poisson_test_against_rate(x, T, rate_state_15_19, alternative=alt)

    # One-sided CI
    # We'll provide the one-sided 95% bound:
    # - if alternative="greater": lower bound at 95%, upper=inf
    # - if alternative="less": upper bound at 95%, lower=0
    # Using chi-square based bounds.
    if alt == "greater":
        # lower one-sided (1-alpha): chi2(alpha, 2x)/2T ; upper infinity
        if x == 0:
            lo1 = 0.0
        else:
            lo1 = 0.5 * chi2.ppf(0.05, 2*x) / T
        lower_oneside[i], upper_oneside[i] = lo1, np.inf
    else:
        # upper one-sided (1-alpha): chi2(1-alpha, 2(x+1))/2T ; lower zero
        up1 = 0.5 * chi2.ppf(0.95, 2*(x+1)) / T
        lower_oneside[i], upper_oneside[i] = 0.0, up1

print("Example p-values (first 5):")
print(pd.DataFrame({
    "district": districts.head(),
    "count": counts.head(),
    "p_above": p_above[:5],
    "p_below": p_below[:5],
    "p_twoside": p_twoside[:5],
}))
Example p-values (first 5):
                     district  count   p_above   p_below  p_twoside
0     Ansonia School District      5  0.970865  0.066851   0.133701
1        Avon School District      5  0.970005  0.068568   0.137135
2      Berlin School District      6  0.801201  0.323406   0.646813
3      Bethel School District      6  0.910505  0.166478   0.332955
4  Bloomfield School District     10  0.346716  0.763626   0.693432

10.5.0.4 BH adjustment (Benjamini–Hochberg) on two-sided p-values

Show Python
# Equivalent to R: p.adjust(p_twoside, method="BH")
p_twoside_adj = multipletests(p_twoside, method="fdr_bh")[1]

print("How many significant at FDR <= 0.10?",
      int((p_twoside_adj <= 0.10).sum()))

print("How many significant at FDR <= 0.05?",
      int((p_twoside_adj <= 0.05).sum()))
How many significant at FDR <= 0.10? 17
How many significant at FDR <= 0.05? 15
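For transparency, the BH adjustment can be reproduced by hand: sort the p-values, scale the i-th smallest by n/i, then take a running minimum from the largest down. This sketch (with made-up p-values) checks the manual version against `multipletests`:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.740])  # made up

n = len(pvals)
order = np.argsort(pvals)
scaled = pvals[order] * n / np.arange(1, n + 1)       # p_(i) * n / i
scaled = np.minimum.accumulate(scaled[::-1])[::-1]    # enforce monotonicity
adj_manual = np.empty(n)
adj_manual[order] = np.minimum(scaled, 1.0)

adj_bh = multipletests(pvals, method="fdr_bh")[1]
print("manual:     ", adj_manual.round(4))
print("statsmodels:", adj_bh.round(4))
```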

10.5.0.5 Assemble results table

Show Python
# population per year (for display) like round(mydata$pop_sch_15_19)
pop_per_year = np.round(mydata["pop_sch_15_19"]).astype("Int64")

results = pd.DataFrame({
    "school_district": districts,
    "pop_15_19_times_nyear": pop5.round().astype("Int64"),
    "pop": pop_per_year,
    "count_15_19": counts,
    "rate": rates,
    "lower": lower,
    "upper": upper,
    "lower_onside": lower_oneside,
    "upper_oneside": upper_oneside,
    "p_value_diff_state_avg": p_twoside,
    "p_value_diff_state_avg_adjusted": p_twoside_adj,
    "significance_diff_state_avg": (p_twoside_adj <= 0.10),
    "p_value_oneside_state_avg": p_oneside_dir,
    "significance_oneside_state_avg": (p_oneside_dir < 0.05),
    "p_value_above_state_avg": p_above,
    "significance_above_state_avg": (p_above < 0.05),
    "p_value_below_state_avg": p_below,
    "significance_below_state_avg": (p_below < 0.05),
    "state_rate": float(rate_state_15_19),
    "ratio": rates / float(rate_state_15_19),
    "school_district_short": districts.str.replace(" School District", "", regex=False),
})

# Order by rate decreasing
results = results.sort_values("rate", ascending=False).reset_index(drop=True)

print("results shape:", results.shape)
display(results.head(15))
results shape: (119, 21)
school_district pop_15_19_times_nyear pop count_15_19 rate lower upper lower_oneside upper_oneside p_value_diff_state_avg p_value_diff_state_avg_adjusted significance_diff_state_avg p_value_oneside_state_avg significance_oneside_state_avg p_value_above_state_avg significance_above_state_avg p_value_below_state_avg significance_below_state_avg state_rate ratio school_district_short
0 Clinton School District 3573 715 19 0.005318 0.003202 0.008304 0.003482 inf 0.000012 0.000835 True 0.000006 True 0.000006 True 0.999998 False 0.001552 3.426162 Clinton
1 Regional School District 04 837 167 4 0.004779 0.001302 0.012236 0.001632 inf 0.086009 0.308538 False 0.043004 True 0.043004 True 0.989367 False 0.001552 3.079086 Regional 04
2 Vernon School District 7016 1403 28 0.003991 0.002652 0.005768 0.002836 inf 0.000021 0.000835 True 0.000011 True 0.000011 True 0.999996 False 0.001552 2.571318 Vernon
3 Granby School District 2294 459 9 0.003923 0.001794 0.007448 0.002047 inf 0.021867 0.144566 False 0.010934 True 0.010934 True 0.996269 False 0.001552 2.527763 Granby
4 Tolland School District 4304 861 16 0.003717 0.002125 0.006037 0.002332 inf 0.003048 0.025906 True 0.001524 True 0.001524 True 0.999420 False 0.001552 2.395163 Tolland
5 East Windsor School District 2866 573 10 0.003489 0.001673 0.006417 0.001893 inf 0.031849 0.172275 False 0.015925 True 0.015925 True 0.993854 False 0.001552 2.248076 East Windsor
6 Branford School District 6895 1379 24 0.003481 0.002230 0.005179 0.002400 inf 0.000634 0.008088 True 0.000317 True 0.000317 True 0.999868 False 0.001552 2.242664 Branford
7 East Haven School District 8115 1623 28 0.003450 0.002293 0.004987 0.002452 inf 0.000247 0.004896 True 0.000123 True 0.000123 True 0.999948 False 0.001552 2.223089 East Haven
8 Guilford School District 7021 1404 24 0.003418 0.002190 0.005086 0.002357 inf 0.000815 0.008818 True 0.000408 True 0.000408 True 0.999827 False 0.001552 2.202417 Guilford
9 Bolton School District 3619 724 12 0.003316 0.001713 0.005792 0.001913 inf 0.025513 0.153066 False 0.012756 True 0.012756 True 0.994730 False 0.001552 2.136387 Bolton
10 Stafford School District 4064 813 13 0.003199 0.001703 0.005470 0.001892 inf 0.025725 0.153066 False 0.012863 True 0.012863 True 0.994459 False 0.001552 2.060995 Stafford
11 Regional School District 08 2938 588 9 0.003063 0.001401 0.005815 0.001598 inf 0.086204 0.308538 False 0.043102 True 0.043102 True 0.981475 False 0.001552 1.973686 Regional 08
12 East Granby School District 1324 265 4 0.003021 0.000823 0.007735 0.001032 inf 0.305851 0.674004 False 0.152925 False 0.152925 False 0.942253 False 0.001552 1.946522 East Granby
13 Southington School District 11258 2252 33 0.002931 0.002018 0.004117 0.002145 inf 0.001185 0.011749 True 0.000592 True 0.000592 True 0.999704 False 0.001552 1.888600 Southington
14 Bristol School District 15724 3145 44 0.002798 0.002033 0.003757 0.002142 inf 0.000451 0.007466 True 0.000226 True 0.000226 True 0.999880 False 0.001552 1.802922 Bristol

10.5.0.6 Save the full results table

Show Python
import os

agerange = "15-19"

# Make sure output folder exists
os.makedirs("results", exist_ok=True)

out_path = f"results/results_schooldistrict_unadjusted_2010_2014_{agerange}.csv"
results.to_csv(out_path, index=False)

print("Wrote:", out_path)
print("results shape:", results.shape)
Wrote: results/results_schooldistrict_unadjusted_2010_2014_15-19.csv
results shape: (119, 21)
Show Python
# Filter significant by BH-adjusted p-value <= 0.10
table1 = results.loc[results["p_value_diff_state_avg_adjusted"] <= 0.10, [
    "school_district_short",   
    "ratio",                  
    "pop",                    
    "count_15_19",            
    "rate", "lower", "upper", 
    "p_value_diff_state_avg", 
    "p_value_diff_state_avg_adjusted"
]].copy()

# Multiply rates by 10,000 
for col in ["rate", "lower", "upper"]:
    table1[col] = table1[col] * 10_000

# Optional: nice rounding for presentation
table1["ratio"] = table1["ratio"].astype(float).round(3)
table1[["rate", "lower", "upper"]] = table1[["rate", "lower", "upper"]].round(3)
table1[["p_value_diff_state_avg", "p_value_diff_state_avg_adjusted"]] = (
    table1[["p_value_diff_state_avg", "p_value_diff_state_avg_adjusted"]].astype(float).round(6)
)

print("table1 shape:", table1.shape)
display(table1)
table1 shape: (17, 9)
school_district_short ratio pop count_15_19 rate lower upper p_value_diff_state_avg p_value_diff_state_avg_adjusted
0 Clinton 3.426 715 19 53.177 32.016 83.042 0.000012 0.000835
2 Vernon 2.571 1403 28 39.909 26.519 57.679 0.000021 0.000835
4 Tolland 2.395 861 16 37.175 21.249 60.369 0.003048 0.025906
6 Branford 2.243 1379 24 34.808 22.302 51.791 0.000634 0.008088
7 East Haven 2.223 1623 28 34.504 22.928 49.868 0.000247 0.004896
8 Guilford 2.202 1404 24 34.183 21.902 50.862 0.000815 0.008818
13 Southington 1.889 2252 33 29.312 20.177 41.166 0.001185 0.011749
14 Bristol 1.803 3145 44 27.983 20.332 37.565 0.000451 0.007466
15 West Haven 1.797 3872 54 27.895 20.956 36.398 0.000105 0.003132
16 Manchester 1.731 3498 47 26.872 19.745 35.735 0.000680 0.008088
82 Hartford 0.752 12330 72 11.679 9.138 14.708 0.013932 0.097523
105 Groton 0.488 3434 13 7.570 4.031 12.946 0.005409 0.042909
113 Windham 0.344 2996 8 5.341 2.306 10.524 0.000502 0.007466
115 Monroe 0.237 1632 3 3.676 0.758 10.743 0.002730 0.024988
116 New London 0.219 2946 5 3.395 1.102 7.923 0.000016 0.000835
117 Regional 19 0.185 2085 3 2.878 0.594 8.411 0.000161 0.003830
118 Regional 09 0.000 688 0 0.000 0.000 10.723 0.009600 0.071400
Show Python
out_path_table1 = f"results/Table_1_2010_2014_{agerange}.csv"
table1.to_csv(out_path_table1, index=False)

print("Wrote:", out_path_table1)
Wrote: results/Table_1_2010_2014_15-19.csv
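Table 1 keeps districts whose Benjamini–Hochberg (BH) adjusted p-values are at or below 0.10. For readers unfamiliar with the adjustment, here is a minimal NumPy sketch of the BH step-up procedure; `bh_adjust` is a hypothetical helper written for illustration, not part of the case-study code (a library routine such as `statsmodels.stats.multitest.multipletests` would typically be used in practice).

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg (step-up) adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                         # sort p-values ascending
    ranked = p[order] * n / np.arange(1, n + 1)   # p_(i) * n / i
    # enforce monotonicity from the largest rank downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.clip(adj, 0.0, 1.0)
    out = np.empty(n)
    out[order] = adj                              # restore original order
    return out

print(np.round(bh_adjust([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]), 4))
```

With many districts tested at once, controlling the false discovery rate this way keeps a handful of small-district flukes from being reported as “significant.”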

10.5.1 Visualization

10.5.1.1 Build the plotting dataframe + subset significant districts

Show Python
import os
import numpy as np
import pandas as pd

# Basic checks
need = ["rate","lower","upper","significance_diff_state_avg","school_district","state_rate"]
missing = [c for c in need if c not in results.columns]
print("Missing columns:", missing)

# Build df like the R code
df = pd.DataFrame({
    "est": results["rate"].astype(float) * 10_000,
    "lower": results["lower"].astype(float) * 10_000,
    "upper": results["upper"].astype(float) * 10_000,
    "sig": results["significance_diff_state_avg"].astype(bool),
    "lev_names": results["school_district"].astype(str).str.replace(" School District", "", regex=False),
})

# Sort by estimate (like order(df$est))
df = df.sort_values("est").reset_index(drop=True)

state_rate_10k = float(results["state_rate"].iloc[0]) * 10_000
print("State avg rate per 10,000:", round(state_rate_10k, 3))

# factor: 1=not used, 2=sig above, 3=sig below
df["factor"] = 1
df.loc[df["sig"] & (df["est"] > state_rate_10k), "factor"] = 2
df.loc[df["sig"] & (df["est"] < state_rate_10k), "factor"] = 3

# Keep only significant above/below (like subset factor %in% c(2,3))
df_plot = df[df["factor"].isin([2, 3])].copy().reset_index(drop=True)

print("Number of significant districts plotted:", df_plot.shape[0])
df_plot.head()
Missing columns: []
State avg rate per 10,000: 15.521
Number of significant districts plotted: 17
est lower upper sig lev_names factor
0 0.000000 0.000000 10.723487 True Regional 09 3
1 2.878250 0.593564 8.411468 True Regional 19 3
2 3.394894 1.102313 7.922550 True New London 3
3 3.676020 0.758084 10.742891 True Monroe 3
4 5.340810 2.305783 10.523526 True Windham 3

10.5.1.2 Make the plot of significant districts

Show Python
import matplotlib.pyplot as plt
import numpy as np

# Create output folder
os.makedirs("results", exist_ok=True)

fig, ax = plt.subplots(figsize=(9, 6))

# y positions for districts
y = np.arange(len(df_plot))

# Reference line at state average (vertical line since rate is on x-axis)
ax.axvline(state_rate_10k, linestyle="--", linewidth=1)

# Plot below and above separately (different default colors automatically)
for fval, label in [(3, "Below state avg (sig)"), (2, "Above state avg (sig)")]:
    sub = df_plot[df_plot["factor"] == fval]
    if sub.empty:
        continue

    yy = sub.index.to_numpy()
    x = sub["est"].to_numpy()
    # asymmetric x-error: [x-lower, upper-x]
    xerr = np.vstack([x - sub["lower"].to_numpy(), sub["upper"].to_numpy() - x])

    ax.errorbar(
        x, yy,
        xerr=xerr,
        fmt="o",
        capsize=3,
        elinewidth=1,
        markersize=5,
        label=label
    )

# Labels and ticks (this is the "flip": districts on y-axis)
ax.set_yticks(y)
ax.set_yticklabels(df_plot["lev_names"].tolist())
ax.set_xlabel("Rate per 10,000 (2010–2014 combined)")
ax.set_ylabel("School District")


# Make a legend 
import matplotlib.lines as mlines
state_handle = mlines.Line2D(
    [], [], linestyle="--", linewidth=1,
    label=f"State average rate ({state_rate_10k:.2f})"
)
## Get existing handles/labels 
handles, labels = ax.get_legend_handles_labels()
## Put state average entry at the top of the legend
handles = [state_handle] + handles
labels = [state_handle.get_label()] + labels
## Styling similar in spirit to theme tweaks
ax.grid(True, axis="y", linestyle=":", linewidth=0.5)
ax.grid(False, axis="x")
## Draw legend (state avg appears above other labels)
ax.legend(handles, labels, frameon=False, loc="best")

fig.tight_layout()
plt.show()

Show Python
# Save as PDF
agerange = "15-19"
out_pdf = f"results/barplot_est_beforeadjustment_twosided_2010_2014_{agerange}.pdf"
fig.savefig(out_pdf)
print("Saved:", out_pdf)
Saved: results/barplot_est_beforeadjustment_twosided_2010_2014_15-19.pdf

10.5.2 Making spatial plots

10.5.2.1 Combine results + classify districts

Show Python
# Merge 
school_district_rate_results = mydata.merge(results, on="school_district", how="inner")

print("school_district_rate_results shape:", school_district_rate_results.shape)

# Short name
school_district_rate_results["school_district_short"] = (
    school_district_rate_results["school_district"]
    .astype(str)
    .str.replace(" School District", "", regex=False)
)

# Drop NA rate
before = school_district_rate_results.shape[0]
school_district_rate_results = school_district_rate_results.dropna(subset=["rate"]).reset_index(drop=True)
after = school_district_rate_results.shape[0]
print(f"Dropped NA rate rows: {before} -> {after}")

# Classification:
# 3 = significant & above avg; 2 = not significant; 1 = significant & below avg
ind = np.zeros(after, dtype=int)
sig = school_district_rate_results["significance_diff_state_avg"].astype(bool).to_numpy()
ratio = school_district_rate_results["ratio"].astype(float).to_numpy()

ind[(sig) & (ratio > 1)] = 3
ind[(~sig)] = 2
ind[(sig) & (ratio < 1)] = 1
school_district_rate_results["ind_beforeadj"] = ind

# subsets for labeling (two-sided significant)
data1 = school_district_rate_results[(sig) & (ratio > 1)].copy()  # above
data2 = school_district_rate_results[(sig) & (ratio < 1)].copy()  # below
print("Significant above:", data1.shape[0], "Significant below:", data2.shape[0])
school_district_rate_results shape: (119, 56)
Dropped NA rate rows: 119 -> 119
Significant above: 10 Significant below: 7

10.5.2.2 Download TIGER/Line school district boundaries

We’ll download the Unified (UNSD) and Secondary (SCSD) school district boundaries for Connecticut and combine them. A CT school district can appear under either type depending on how it is defined, so combining the two files makes matching easier.

Show Python
import os
import zipfile
from pathlib import Path
import requests

# TIGER/Line 2024 CT school district shapefiles (CT FIPS = 09)
# Index page shows tl_2024_09_unsd.zip exists here:
# https://www2.census.gov/geo/tiger/TIGER2024/UNSD/  (and SCSD similarly)
base = "https://www2.census.gov/geo/tiger/TIGER2024"

urls = {
    "UNSD": f"{base}/UNSD/tl_2024_09_unsd.zip",
    "SCSD": f"{base}/SCSD/tl_2024_09_scsd.zip",
}

data_dir = Path("data/geoshapes")
data_dir.mkdir(parents=True, exist_ok=True)

def download_and_unzip(url, out_dir):
    zip_path = out_dir / Path(url).name
    if not zip_path.exists():
        print("Downloading:", url)
        r = requests.get(url, timeout=60)
        r.raise_for_status()
        zip_path.write_bytes(r.content)
    else:
        print("Already downloaded:", zip_path)

    # unzip into its own folder
    folder = out_dir / zip_path.stem
    folder.mkdir(exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as z:
        z.extractall(folder)
    return folder

unsd_dir = download_and_unzip(urls["UNSD"], data_dir)
scsd_dir = download_and_unzip(urls["SCSD"], data_dir)

print("UNSD extracted to:", unsd_dir)
print("SCSD extracted to:", scsd_dir)
Already downloaded: data/geoshapes/tl_2024_09_unsd.zip
Already downloaded: data/geoshapes/tl_2024_09_scsd.zip
UNSD extracted to: data/geoshapes/tl_2024_09_unsd
SCSD extracted to: data/geoshapes/tl_2024_09_scsd

This uses the Census Bureau’s TIGER/Line distribution, which incorporates updates from the School District Review Program.

10.5.2.3 Read boundaries with GeoPandas + combine

Show Python
import geopandas as gpd

# Read the shapefiles
g_unsd = gpd.read_file(unsd_dir)
g_scsd = gpd.read_file(scsd_dir)

print("UNSD:", g_unsd.shape, "columns:", list(g_unsd.columns)[:12])
print("SCSD:", g_scsd.shape, "columns:", list(g_scsd.columns)[:12])

# Both typically have a NAME field
name_col = "NAME"
if name_col not in g_unsd.columns or name_col not in g_scsd.columns:
    raise KeyError("Expected a NAME column in TIGER school district shapefiles.")

g_unsd["school_district"] = g_unsd[name_col].astype(str)
g_scsd["school_district"] = g_scsd[name_col].astype(str)

# Combine into one GeoDataFrame (drop duplicates by name+geometry id-ish if desired)
g_ct = pd.concat([g_unsd, g_scsd], ignore_index=True)
g_ct = gpd.GeoDataFrame(g_ct, geometry="geometry", crs=g_unsd.crs)

print("Combined g_ct:", g_ct.shape)
print("Example district names:", g_ct["school_district"].head(5).tolist())
UNSD: (115, 16) columns: ['STATEFP', 'UNSDLEA', 'GEOID', 'GEOIDFQ', 'NAME', 'LSAD', 'LOGRADE', 'HIGRADE', 'MTFCC', 'SDTYP', 'FUNCSTAT', 'ALAND']
SCSD: (8, 16) columns: ['STATEFP', 'SCSDLEA', 'GEOID', 'GEOIDFQ', 'NAME', 'LSAD', 'LOGRADE', 'HIGRADE', 'MTFCC', 'SDTYP', 'FUNCSTAT', 'ALAND']
Combined g_ct: (123, 18)
Example district names: ['Ansonia School District', 'Avon School District', 'Berlin School District', 'Bethel School District', 'Bloomfield School District']

10.5.2.4 Harmonize a few district names

Show Python
import re
from district_zip_convert import district_zip_convert

def normalize_regional_high(name: str) -> str:
    """Convert TIGER-style 'Regional High School District NN' -> 'Regional School District NN'."""
    s = str(name).strip()
    s = re.sub(r"^Regional High School District\s+", "Regional School District ", s)
    return s

# 1) Normalize TIGER regional names first (geometry side)
g_ct["school_district"] = g_ct["school_district"].map(normalize_regional_high)

# 2) Then apply full converter on both sides
g_ct["school_district"] = (
    district_zip_convert(g_ct["school_district"])["school_district"].astype(str).str.strip()
)

school_district_rate_results["school_district"] = (
    district_zip_convert(school_district_rate_results["school_district"])["school_district"].astype(str).str.strip()
)

# 3) Drop TIGER "not defined" (not a real district for our purposes)
g_ct = g_ct[g_ct["school_district"] != "School District Not Defined"].copy()

# 4) Re-check mismatches
have_geom = set(g_ct["school_district"].dropna().unique())
have_data = set(school_district_rate_results["school_district"].dropna().unique())

missing_in_geom = sorted([d for d in have_data if d not in have_geom])
missing_in_data = sorted([d for d in have_geom if d not in have_data])

print("After fixing regional names:")
print("  Districts in results but not in geometry:", missing_in_geom)
print("  Districts in geometry but not in results (first 50):", missing_in_data[:50])
After fixing regional names:
  Districts in results but not in geometry: ['Woodstock School District']
  Districts in geometry but not in results (first 50): ['Oxford School District']

10.5.2.5 A simple fix for the mismatches

Show Python
# Exclude Woodstock from mapping join (keep it in tables/results)
school_district_rate_results_map = school_district_rate_results[
    school_district_rate_results["school_district"] != "Woodstock School District"
].copy()

# Re-check mismatches after fixes
have_geom = set(g_ct["school_district"].dropna().unique())
have_data = set(school_district_rate_results_map["school_district"].dropna().unique())

missing_in_geom = sorted([d for d in have_data if d not in have_geom])
missing_in_data = sorted([d for d in have_geom if d not in have_data])

print("After fixes:")
print("  Districts in results (mapped) but not in geometry:", missing_in_geom)
print("  Districts in geometry but not in results (first 50):", missing_in_data[:50])
print("  Woodstock kept in results but excluded from map join.")
After fixes:
  Districts in results (mapped) but not in geometry: []
  Districts in geometry but not in results (first 50): ['Oxford School District']
  Woodstock kept in results but excluded from map join.

10.5.2.6 Join attributes to geometry + prepare centroids for labels

Show Python
# Join: geometry + results info (INCLUDING counts + population + rates) 
# Pick the columns we want to carry into the map GeoDataFrame
join_cols = [
    "school_district",
    "ind_beforeadj",                 
    "significance_diff_state_avg",
    "ratio",
    "rate",
    "lower",
    "upper",
    "count_15_19",                   # <-- counts (5-year total)
    "pop_15_19_times_nyear",         # <-- 5-year population
    "school_district_short",
]

missing = [c for c in join_cols if c not in school_district_rate_results.columns]
if missing:
    raise KeyError(f"school_district_rate_results is missing columns needed for mapping: {missing}")

# Left-join so every polygon stays, even if no data (e.g., Oxford)
g_plot = g_ct.merge(
    school_district_rate_results[join_cols],
    on="school_district",
    how="left"
)

# Fill region type: 1=below(sig), 2=neither, 3=above(sig), 4=not included/unmatched
g_plot["fill_region"] = g_plot["ind_beforeadj"].fillna(4).astype(int)

# Add a convenient scaled rate for plotting
g_plot["rate_per_10k"] = g_plot["rate"] * 10_000

print("g_plot shape:", g_plot.shape)
print("Fill_region value counts:\n", g_plot["fill_region"].value_counts(dropna=False).sort_index())

# Project for nicer geometry computations/plotting
g_plot = g_plot.to_crs(epsg=3857)

# Label points inside polygons (robust alternative to centroid)
g_plot["label_pt"] = g_plot.geometry.representative_point()
g_plot["x"] = g_plot["label_pt"].x
g_plot["y"] = g_plot["label_pt"].y

# Significant polygons for labels (above/below)
sig_pts = g_plot[g_plot["fill_region"].isin([1, 3])].copy()
print("Significant polygons available for labels:", sig_pts.shape[0])

# Quick peek (including count + rate fields)
display(g_plot[[
    "school_district", "count_15_19", "pop_15_19_times_nyear", "rate", "rate_per_10k",
    "fill_region"
]].head())
g_plot shape: (122, 29)
Fill_region value counts:
 fill_region
1      7
2    104
3     10
4      1
Name: count, dtype: int64
Significant polygons available for labels: 17
school_district count_15_19 pop_15_19_times_nyear rate rate_per_10k fill_region
0 Ansonia School District 5.0 6447 0.000776 7.755545 2
1 Avon School District 5.0 6418 0.000779 7.790589 2
2 Berlin School District 6.0 5102 0.001176 11.760094 2
3 Bethel School District 6.0 6108 0.000982 9.823183 2
4 Bloomfield School District 10.0 5475 0.001826 18.264840 2

10.5.2.7 Map 1: 5-year total counts (can be misleading)

Show Python
import matplotlib.pyplot as plt
import numpy as np

# Ensure count column exists 
if "count_15_19" not in g_plot.columns:
    raise KeyError("g_plot is missing 'count_15_19'. Make sure you merged results columns into g_plot.")

fig, ax = plt.subplots(figsize=(10, 6))

# Plot counts as choropleth (automatic colormap)
g_plot.plot(
    column="count_15_19",
    ax=ax,
    legend=True,
    edgecolor="white",
    linewidth=0.4,
    missing_kwds={"color": "lightgrey", "label": "No data"}
)

ax.set_axis_off()
ax.set_title("Connecticut school districts: total count (2010–2014)", pad=12)

plt.tight_layout()
# Save from the figure object BEFORE closing; calling plt.savefig() after
# plt.show() would write a new, empty figure.
fig.savefig(f"results/CT_counts_heatmap_2010_2014_{agerange}.pdf")
plt.show()
plt.close(fig)


10.5.2.8 Map 2: Population size

Show Python
if "pop_15_19_times_nyear" in g_plot.columns:
    fig, ax = plt.subplots(figsize=(10, 6))

    g_plot.plot(
        column="pop_15_19_times_nyear",
        ax=ax,
        legend=True,
        edgecolor="white",
        linewidth=0.4,
        missing_kwds={"color": "lightgrey", "label": "No data"}
    )

    ax.set_axis_off()
    ax.set_title("Connecticut school districts: population at risk (15–19) × 5 years", pad=12)

    plt.tight_layout()
    # Save from the figure object before showing/closing (plt.savefig after
    # plt.show() would write an empty figure)
    fig.savefig(f"results/CT_pop_heatmap_2010_2014_{agerange}.pdf")
    plt.show()
    plt.close(fig)
else:
    print("Skipping population map: g_plot lacks 'pop_15_19_times_nyear'.")


10.5.2.9 Map 3: Rates per 10,000 (more comparable, but noisier)

Show Python
fig, ax = plt.subplots(figsize=(10, 6))

# Create a more readable rate scale
g_plot["rate_per_10k"] = g_plot["rate"] * 10_000

g_plot.plot(
    column="rate_per_10k",
    ax=ax,
    legend=True,
    edgecolor="white",
    linewidth=0.4,
    missing_kwds={"color": "lightgrey", "label": "No data"}
)

ax.set_axis_off()
ax.set_title("Connecticut school districts: rate per 10,000 (2010–2014)", pad=12)

plt.tight_layout()
# Save from the figure object before showing/closing
fig.savefig(f"results/CT_rates_heatmap_2010_2014_{agerange}.pdf")
plt.show()
plt.close(fig)


10.5.2.10 Map 4: Significant districts

Show Python
import matplotlib.pyplot as plt

agerange = "15-19"
os.makedirs("results", exist_ok=True)

# Color mapping 
fill_colors = {
    1: "green",    # below average
    2: "blue",     # neither
    3: "red",      # above average
    4: "darkgrey", # not included / unmatched
}

g_plot["color"] = g_plot["fill_region"].map(fill_colors)

fig, ax = plt.subplots(figsize=(10, 6))

# Polygons
g_plot.plot(
    ax=ax,
    color=g_plot["color"],
    edgecolor="white",
    linewidth=0.7
)

ax.set_axis_off()
ax.set_title("Connecticut school districts: significant above/below state average", pad=12)

# Overlay labels for significant districts only
# District name + rate per 10,000
for _, row in sig_pts.iterrows():
    if pd.isna(row["rate"]):
        continue
    ax.plot(row["x"], row["y"], marker="o", markersize=3, color="black")
    ax.text(row["x"], row["y"] + 3000, str(row["school_district_short"]), fontsize=9, color="black")
    ax.text(row["x"], row["y"] - 3000, f"{row['rate']*10000:.2f}", fontsize=9, color="black")

# Legend
import matplotlib.patches as mpatches
legend_handles = [
    mpatches.Patch(color=fill_colors[1], label="Below average (sig)"),
    mpatches.Patch(color=fill_colors[2], label="Neither below nor above"),
    mpatches.Patch(color=fill_colors[3], label="Above average (sig)"),
    mpatches.Patch(color=fill_colors[4], label="Not included / unmatched"),
]
ax.legend(handles=legend_handles, loc="lower right", frameon=False)

out_pdf = f"results/CT_sig_rate_2010_2014_{agerange}.pdf"
plt.tight_layout()
plt.savefig(out_pdf)
plt.show()

print("Saved:", out_pdf)

Saved: results/CT_sig_rate_2010_2014_15-19.pdf
Show Python
import matplotlib.pyplot as plt
import os
import pandas as pd

agerange = "15-19"
os.makedirs("results", exist_ok=True)

# --- Map 4 (alt): color by rate_per_10k, outline only significant districts ---

# Make sure rate_per_10k exists
if "rate_per_10k" not in g_plot.columns:
    g_plot["rate_per_10k"] = g_plot["rate"] * 10_000

# Significant polygons: those with fill_region 1 (below sig) or 3 (above sig)
sig_polys = g_plot[g_plot["fill_region"].isin([1, 3])].copy()

fig, ax = plt.subplots(figsize=(10, 6))

# Base layer: choropleth of rates 
g_plot.plot(
    column="rate_per_10k",
    ax=ax,
    legend=True,
    edgecolor="white",
    linewidth=0.4,
    missing_kwds={"color": "lightgrey", "label": "No data"}
)

# Overlay: thick outline around significant districts (no fill)
sig_polys.boundary.plot(
    ax=ax,
    linewidth=2.5,   # thick border
    color="black"
)

# Optional: keep labels for significant districts (name + rate)
for _, row in sig_polys.iterrows():
    if pd.isna(row["rate_per_10k"]):
        continue
    ax.plot(row["x"], row["y"], marker="o", markersize=3, color="black")
    ax.text(row["x"], row["y"] + 3000, str(row["school_district_short"]), fontsize=9, color="black")
    ax.text(row["x"], row["y"] - 3000, f"{row['rate_per_10k']:.2f}", fontsize=9, color="black")

ax.set_axis_off()
ax.set_title("CT school districts: rate per 10,000 (fill) with significant districts outlined", pad=12)

out_pdf = f"results/CT_sig_outline_rate_heatmap_2010_2014_{agerange}.pdf"
plt.tight_layout()
plt.savefig(out_pdf)
plt.show()

print("Saved:", out_pdf)

Saved: results/CT_sig_outline_rate_heatmap_2010_2014_15-19.pdf

10.6 3 Explaining differences in district rates

In Parts 1–2, we treated each school district’s 2010–2014 rate as a signal and asked a simple screening question: which districts look unusually high or low compared to the state average? That’s a useful first step for surveillance, but it immediately raises the next question we discussed with our collaborators:

Why do rates differ across districts?

A high (or low) rate could reflect many things besides “true underlying risk,” including differences in population size, random fluctuation, access to care, and the mix of community conditions that shape adolescent mental health. If our goal is not just to flag outliers, but to understand patterns and guide prevention planning, we need to move beyond “district A is higher than average” toward “district A is higher than expected given what we know about the district.”

10.6.1 A working hypothesis: context matters

From both prior public-health research and local knowledge from stakeholders, we hypothesize that differences in severe suicidal behavior across districts may be related to community context, especially factors that are plausibly connected to stress, opportunity, support, and school climate.

Two broad domains came up repeatedly in our conversations:

  1. Socioeconomic conditions

    • Economic hardship, instability, and limited access to resources can increase chronic stress for youth and families.
    • Community-level disadvantage can also be linked to reduced access to mental health services, transportation barriers, and fewer supports outside school.
  2. Academic environment and performance

    • School performance indicators can be proxies for school resources, student engagement, and broader opportunity structures.
    • Academic pressure, school climate, and the availability of support services may vary across districts, potentially shaping risk and help-seeking behavior.

Importantly, we are not claiming these factors “cause” suicidal behavior from this study alone. Rather, we want to test a more modest and practical idea:

Do socioeconomic context and academic indicators help explain systematic differences in district rates, beyond what we would expect from random variation?

10.6.2 Why this step matters

This explanatory step is useful, but it comes with responsibilities:

  • Avoid blaming communities or schools. Our goal is to understand patterns, not assign fault.
  • Separate signal from noise. Some districts looked extreme in the maps simply because they are small. Explanatory modeling can stabilize inference by borrowing strength across districts and adjusting for known differences.
  • Recognize measurement limits. Our outcome combines rare deaths with inpatient hospitalizations, which are influenced by both behavior and the health-care system. Community predictors are also imperfect proxies.
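To make “borrowing strength” concrete: one common device is empirical-Bayes shrinkage under a Poisson–gamma model, where each district’s rate estimate is pulled toward the overall mean in proportion to how little data it contributes. The sketch below uses made-up counts and a crude method-of-moments prior; it is an illustration of the idea, not the model used in this study.

```python
import numpy as np

# Toy counts and person-year denominators for a few hypothetical districts
counts = np.array([0, 3, 19, 44])        # events over 5 years (illustrative)
pop = np.array([688, 1632, 715, 3145])   # 15-19 population x years (illustrative)

raw = counts / pop

# Method-of-moments gamma prior: assume district rates ~ Gamma(alpha, beta)
# and match the prior mean/variance to the spread of the raw rates,
# subtracting the average Poisson sampling variance (rate/n).
m = raw.mean()
v = max(raw.var(ddof=1) - (raw / pop).mean(), 1e-12)
alpha = m**2 / v
beta = m / v

# Posterior mean: small districts are shrunk hard toward the prior mean,
# large districts barely move.
shrunk = (counts + alpha) / (pop + beta)

for c, n, r, s in zip(counts, pop, raw, shrunk):
    print(f"count={c:3d} pop={n:5d} raw={r*1e4:6.2f} shrunk={s*1e4:6.2f} (per 10,000)")
```

Note how the zero-count district no longer sits at exactly 0, and the small district with an extreme raw rate is pulled back toward the overall mean.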

10.6.3 What we do next

To investigate these hypotheses, we worked with collaborators to obtain district-level measures related to:

  • Socioeconomic context (e.g., income/poverty proxies, family structure, housing, employment-related indicators)
  • Academic performance and school characteristics (e.g., performance metrics, graduation/achievement indicators, school system measures)

In the next part of the case study, we will:

  • Assemble predictors into an analysis-ready district dataset. Combine district-level academic and socioeconomic measures into a single table aligned with the outcomes we computed earlier.

  • Explore correlations and redundancy. Many community indicators move together (e.g., poverty and unemployment). We’ll examine correlation structure to avoid feeding highly redundant variables into a model without thinking.

  • Build a model to estimate the effects of district-level factors. We will fit a model that relates the outcome rate to district-level predictors and interpret estimated associations (direction, magnitude, uncertainty), while being transparent about the limits of observational data.
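As a preview of the redundancy check, the snippet below flags covariate pairs with |r| > 0.8 from the correlation matrix. The column names mirror the real covariate table, but the values are simulated here and the 0.8 threshold is an illustrative choice, not a rule from the study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated district-level covariates; poverty_rate_grade9_12 and
# freelunch_rate are constructed to be strongly correlated on purpose.
n = 119
poverty = rng.uniform(0.02, 0.30, n)
toy = pd.DataFrame({
    "poverty_rate_grade9_12": poverty,
    "freelunch_rate": poverty * 2 + rng.normal(0, 0.02, n),
    "median_income": rng.normal(80_000, 20_000, n),
    "attendance_rate": rng.uniform(0.92, 0.97, n),
})

corr = toy.corr()

# Keep the upper triangle only, then list highly redundant pairs
pairs = (
    corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        .stack()
        .rename("r")
        .reset_index()
)
redundant = pairs.loc[pairs["r"].abs() > 0.8]
print(redundant)
```

In the real analysis the same pattern applies to `cov_matched`: pairs like poverty and free-lunch eligibility tend to move together, so feeding both into a regression without thought invites unstable coefficients.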

10.6.3.1 Load covariate datasets + build one district-level analysis table

Show Python
import pandas as pd

# Path (project-relative)
path_cov = "data/school_district_data_unstandardized_covariates_2010_2014.csv"

def clean_district_name(s: pd.Series) -> pd.Series:
    return (
        s.astype(str)
         .str.strip()
         .str.replace(r"\s+", " ", regex=True)
    )

# Load covariates
cov = pd.read_csv(path_cov)
cov["school_district"] = clean_district_name(cov["school_district"])

# Make sure district list matches OUR analysis list (from sch_data / yearlycount)
# Use sch_data as the source of truth (same list/order used earlier in the case study)
districts = clean_district_name(sch_data["school_district"])

# Keep only districts in our list (and in the same order)
cov_matched = (
    pd.DataFrame({"school_district": districts})
    .merge(cov, on="school_district", how="left", validate="one_to_one")
)

# Quick checks
print("Districts in our analysis list:", len(districts))
print("Rows in cov_matched:", cov_matched.shape[0])
print("Covariate columns:", cov_matched.shape[1] - 1)

missing_all = cov_matched.drop(columns=["school_district"]).isna().all(axis=1).sum()
print("Districts with all covariates missing after merge:", int(missing_all))

cov_matched.head()
Districts in our analysis list: 119
Rows in cov_matched: 119
Covariate columns: 16
Districts with all covariates missing after merge: 0
school_district male_householder_rate household_size pop_under18_rate pop_white_rate poverty_rate_grade9_12 enrollment_grade9_12 avg_CAPT graduation_rate median_income dropout_rate serious_incidence_prop incidence_rate freelunch_rate attendance_rate grant serious_incidence_rate
0 Ansonia School District 0.526854 2.50 0.236811 0.803528 0.158730 1323 228.921010 0.902492 56541.0 0.0245 0.018684 0.134039 0.576952 0.940333 0 0.017690
1 Avon School District 0.600697 2.57 0.268469 0.886412 0.088729 1251 284.714930 0.993920 105116.0 0.0025 0.009104 0.017137 0.042286 0.964667 0 0.008655
2 Berlin School District 0.590951 2.62 0.227605 0.929049 0.094435 1186 266.920952 0.940516 86211.0 0.0155 0.023847 0.112369 0.073633 0.962667 0 0.023225
3 Bethel School District 0.630085 2.68 0.236056 0.899712 0.064176 1044 271.106559 0.997161 83483.0 0.0020 0.022074 0.088827 0.127582 0.956333 1 0.020406
4 Bloomfield School District 0.518382 2.29 0.170798 0.356054 0.145251 1074 222.934220 0.916214 68372.0 0.0295 0.050884 0.113349 0.459150 0.956000 0 0.048834

10.6.3.2 Visualization by PCA and K-means

Show Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# --- Input: cov_matched (from previous step) ---
print("cov_matched shape:", cov_matched.shape)
display(cov_matched.head())

# Select numeric covariates
feature_cols = [c for c in cov_matched.columns if c != "school_district"]
X = cov_matched[feature_cols].apply(pd.to_numeric, errors="coerce")

print("Number of features:", len(feature_cols))
print("Total missing values:", int(X.isna().sum().sum()))
cov_matched shape: (119, 17)
school_district male_householder_rate household_size pop_under18_rate pop_white_rate poverty_rate_grade9_12 enrollment_grade9_12 avg_CAPT graduation_rate median_income dropout_rate serious_incidence_prop incidence_rate freelunch_rate attendance_rate grant serious_incidence_rate
0 Ansonia School District 0.526854 2.50 0.236811 0.803528 0.158730 1323 228.921010 0.902492 56541.0 0.0245 0.018684 0.134039 0.576952 0.940333 0 0.017690
1 Avon School District 0.600697 2.57 0.268469 0.886412 0.088729 1251 284.714930 0.993920 105116.0 0.0025 0.009104 0.017137 0.042286 0.964667 0 0.008655
2 Berlin School District 0.590951 2.62 0.227605 0.929049 0.094435 1186 266.920952 0.940516 86211.0 0.0155 0.023847 0.112369 0.073633 0.962667 0 0.023225
3 Bethel School District 0.630085 2.68 0.236056 0.899712 0.064176 1044 271.106559 0.997161 83483.0 0.0020 0.022074 0.088827 0.127582 0.956333 1 0.020406
4 Bloomfield School District 0.518382 2.29 0.170798 0.356054 0.145251 1074 222.934220 0.916214 68372.0 0.0295 0.050884 0.113349 0.459150 0.956000 0 0.048834
Number of features: 16
Total missing values: 0
Show Python
# --- Impute + standardize ---
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

X_imp = imputer.fit_transform(X)
X_std = scaler.fit_transform(X_imp)

print("X_std shape:", X_std.shape)
print("Any NaN after impute/scale?", np.isnan(X_std).any())
X_std shape: (119, 16)
Any NaN after impute/scale? False
Show Python
# --- PCA (2D for visualization) ---
pca = PCA(n_components=2, random_state=0)
PC = pca.fit_transform(X_std)

expl = pca.explained_variance_ratio_
print(f"Explained variance: PC1={expl[0]:.3f}, PC2={expl[1]:.3f}, total={expl.sum():.3f}")

pc_df = pd.DataFrame({
    "school_district": cov_matched["school_district"].astype(str),
    "PC1": PC[:, 0],
    "PC2": PC[:, 1],
})
display(pc_df.head())
Explained variance: PC1=0.565, PC2=0.136, total=0.701
school_district PC1 PC2
0 Ansonia School District 2.254028 -0.400363
1 Avon School District -2.544499 0.570444
2 Berlin School District -0.909244 -0.110284
3 Bethel School District -1.604884 0.329483
4 Bloomfield School District 3.630223 -1.710817
Show Python
# --- Choose k for k-means using silhouette on PC scores ---
# We'll cluster using the first 2 PCs (what students see). You can switch to more PCs later.
k_range = range(2, 9)
sil = []

for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(PC)
    sil.append(silhouette_score(PC, labels))

best_k = int(k_range[int(np.argmax(sil))])
print("Silhouette scores:", dict(zip(k_range, [round(s, 3) for s in sil])))
print("Chosen k (max silhouette):", best_k)

# Plot silhouette scores
plt.figure(figsize=(7, 4))
plt.plot(list(k_range), sil, marker="o")
plt.xticks(list(k_range))
plt.xlabel("k (number of clusters)")
plt.ylabel("Silhouette score")
plt.title("Choosing k via silhouette score (on PC1–PC2)")
plt.tight_layout()
plt.show()
Silhouette scores: {2: 0.691, 3: 0.422, 4: 0.465, 5: 0.457, 6: 0.394, 7: 0.401, 8: 0.401}
Chosen k (max silhouette): 2

Show Python
# --- Fit k-means with chosen k and plot PC1 vs PC2 colored by cluster ---
kmeans = KMeans(n_clusters=best_k, n_init=25, random_state=0)
pc_df["cluster"] = kmeans.fit_predict(PC)

# Scatter plot (no manual colors; matplotlib defaults)
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(pc_df["PC1"], pc_df["PC2"], c=pc_df["cluster"], s=35)

ax.set_xlabel(f"PC1 ({expl[0]*100:.1f}% var)")
ax.set_ylabel(f"PC2 ({expl[1]*100:.1f}% var)")
ax.set_title(f"PCA of school districts + k-means clusters (k={best_k})")

# Label a few “most extreme” districts (farthest from origin in PC space)
center = pc_df[["PC1", "PC2"]].mean()
pc_df["dist_center"] = np.sqrt((pc_df["PC1"] - center["PC1"])**2 + (pc_df["PC2"] - center["PC2"])**2)

to_label = pc_df.nlargest(15, "dist_center")  # label the 15 most extreme points
for _, r in to_label.iterrows():
    ax.text(r["PC1"], r["PC2"], r["school_district"], fontsize=8)

plt.tight_layout()
plt.show()

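Since PC1 explains over half the variance, it is worth asking which covariates drive it. One way is to inspect the loadings (the weight each covariate contributes to a component). The sketch below shows the pattern on simulated data with a shared latent factor; in the case study you would build the same `loadings` table from the fitted `pca` and `feature_cols` above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulated standardized covariates: four variables share a latent factor
# (like a general "disadvantage" axis); one is unrelated noise.
feature_cols = ["poverty", "freelunch", "income", "capt", "attendance"]
latent = rng.normal(size=(119, 1))
X_std = np.hstack([
    latent + rng.normal(0, 0.5, (119, 1)),    # poverty
    latent + rng.normal(0, 0.5, (119, 1)),    # freelunch
    -latent + rng.normal(0, 0.5, (119, 1)),   # income (opposite sign)
    -latent + rng.normal(0, 0.5, (119, 1)),   # capt (opposite sign)
    rng.normal(size=(119, 1)),                # attendance (unrelated)
])

pca = PCA(n_components=2, random_state=0).fit(X_std)

# Rows = covariates, columns = components; sort by |PC1| loading
loadings = pd.DataFrame(
    pca.components_.T, index=feature_cols, columns=["PC1", "PC2"]
)
print(loadings.reindex(loadings["PC1"].abs().sort_values(ascending=False).index))
```

The correlated block dominates PC1 (with signs reflecting direction), while the unrelated variable loads near zero; this is the kind of reading that turns “PC1” into an interpretable community-context axis.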

10.6.3.3 t-SNE + k-means clustering

Show Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Use the same district labels
district_names = cov_matched["school_district"].astype(str).reset_index(drop=True)

# t-SNE (2D)
tsne = TSNE(
    n_components=2,
    perplexity=30,      # common default; adjust if n is small/large
    learning_rate="auto",
    init="pca",
    random_state=0
)
Z_tsne = tsne.fit_transform(X_std)

tsne_df = pd.DataFrame({
    "school_district": district_names,
    "Z1": Z_tsne[:, 0],
    "Z2": Z_tsne[:, 1],
})

print("t-SNE embedding shape:", Z_tsne.shape)
display(tsne_df.head())
t-SNE embedding shape: (119, 2)
school_district Z1 Z2
0 Ansonia School District 5.708340 3.680246
1 Avon School District 0.024575 -5.296662
2 Berlin School District 3.552342 -2.793097
3 Bethel School District -2.953779 -2.597292
4 Bloomfield School District 7.341845 4.114298
Show Python
# Choose k using silhouette on the 2D embedding
k_range = range(2, 9)
sil = []
for k in k_range:
    km = KMeans(n_clusters=k, n_init=25, random_state=0)
    labels = km.fit_predict(Z_tsne)
    sil.append(silhouette_score(Z_tsne, labels))

best_k_tsne = int(k_range[int(np.argmax(sil))])
print("Silhouette scores (t-SNE):", dict(zip(k_range, [round(s, 3) for s in sil])))
print("Chosen k (t-SNE):", best_k_tsne)

# Plot silhouette scores
plt.figure(figsize=(7, 4))
plt.plot(list(k_range), sil, marker="o")
plt.xticks(list(k_range))
plt.xlabel("k")
plt.ylabel("Silhouette score")
plt.title("Choosing k via silhouette (t-SNE embedding)")
plt.tight_layout()
plt.show()
Silhouette scores (t-SNE): {2: 0.431, 3: 0.397, 4: 0.451, 5: 0.484, 6: 0.46, 7: 0.466, 8: 0.451}
Chosen k (t-SNE): 5

Show Python
# Fit k-means with chosen k and plot t-SNE colored by cluster
kmeans_tsne = KMeans(n_clusters=best_k_tsne, n_init=25, random_state=0)
tsne_df["cluster"] = kmeans_tsne.fit_predict(Z_tsne)

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(tsne_df["Z1"], tsne_df["Z2"], c=tsne_df["cluster"], s=35)

ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
ax.set_title(f"t-SNE of school districts + k-means clusters (k={best_k_tsne})")

# Label a few “extreme” points (optional)
center = tsne_df[["Z1", "Z2"]].mean()
tsne_df["dist_center"] = np.sqrt((tsne_df["Z1"] - center["Z1"])**2 + (tsne_df["Z2"] - center["Z2"])**2)
to_label = tsne_df.nlargest(15, "dist_center")
for _, r in to_label.iterrows():
    ax.text(r["Z1"], r["Z2"], r["school_district"], fontsize=8)

plt.tight_layout()
plt.show()

10.6.3.4 UMAP + k-means clustering (with fallback)

Show Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

district_names = cov_matched["school_district"].astype(str).reset_index(drop=True)

# Try to import UMAP; if not available, print a helpful message
try:
    import umap

    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=15,
        min_dist=0.1,
        random_state=0
    )
    Z_umap = reducer.fit_transform(X_std)

    umap_df = pd.DataFrame({
        "school_district": district_names,
        "U1": Z_umap[:, 0],
        "U2": Z_umap[:, 1],
    })

    print("UMAP embedding shape:", Z_umap.shape)
    display(umap_df.head())

except Exception as e:
    print("UMAP is not available in this environment.")
    print("Install it with: pip install umap-learn")
    print("Error was:", repr(e))
    Z_umap = None

(Output truncated: importing umap pulled in a torch module compiled against
NumPy 1.x, which printed a NumPy 2.x compatibility warning with a long
traceback; UMAP also warned that n_jobs was overridden to 1 because
random_state was set. The import still succeeded.)
UMAP embedding shape: (119, 2)
school_district U1 U2
0 Ansonia School District 3.574197 13.022011
1 Avon School District 7.949921 9.183983
2 Berlin School District 7.664978 12.153422
3 Bethel School District 4.986406 9.323744
4 Bloomfield School District 3.655679 13.467139
Show Python
# Only run clustering/plot if UMAP worked
if Z_umap is not None:
    k_range = range(2, 9)
    sil = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=25, random_state=0)
        labels = km.fit_predict(Z_umap)
        sil.append(silhouette_score(Z_umap, labels))

    best_k_umap = int(k_range[int(np.argmax(sil))])
    print("Silhouette scores (UMAP):", dict(zip(k_range, [round(s, 3) for s in sil])))
    print("Chosen k (UMAP):", best_k_umap)

    plt.figure(figsize=(7, 4))
    plt.plot(list(k_range), sil, marker="o")
    plt.xticks(list(k_range))
    plt.xlabel("k")
    plt.ylabel("Silhouette score")
    plt.title("Choosing k via silhouette (UMAP embedding)")
    plt.tight_layout()
    plt.show()

    kmeans_umap = KMeans(n_clusters=best_k_umap, n_init=25, random_state=0)
    umap_df["cluster"] = kmeans_umap.fit_predict(Z_umap)

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(umap_df["U1"], umap_df["U2"], c=umap_df["cluster"], s=35)

    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    ax.set_title(f"UMAP of school districts + k-means clusters (k={best_k_umap})")

    # Optional: label a few “extreme” points
    center = umap_df[["U1", "U2"]].mean()
    umap_df["dist_center"] = np.sqrt((umap_df["U1"] - center["U1"])**2 + (umap_df["U2"] - center["U2"])**2)


    to_label = umap_df.nlargest(15, "dist_center")
    for _, r in to_label.iterrows():
        ax.text(r["U1"], r["U2"], r["school_district"], fontsize=8)

    plt.tight_layout()
    plt.show()
Silhouette scores (UMAP): {2: 0.472, 3: 0.592, 4: 0.591, 5: 0.535, 6: 0.509, 7: 0.534, 8: 0.542}
Chosen k (UMAP): 3

10.6.3.5 Arcsin–sqrt transform for rate/proportion covariates

Show Python
import numpy as np
import pandas as pd

# Start from covariates table
cov = cov_matched.copy()

# Which columns are "rate/proportion" variables?
# We'll transform those that look like proportions: *_rate, *_prop
rate_cols = [c for c in cov.columns if (c.endswith("_rate") or c.endswith("_prop"))]

print("Rate/proportion columns to transform:")
print(rate_cols)

def arcsin_sqrt(p):
    """Arcsin-sqrt transform for proportions in [0,1]."""
    p = pd.to_numeric(p, errors="coerce")
    # clip for safety in case of small numeric issues
    p = p.clip(lower=0, upper=1)
    return np.arcsin(np.sqrt(p))

# Apply transform (leave non-rate variables unchanged)
for c in rate_cols:
    cov[c] = arcsin_sqrt(cov[c])

display(cov.head())
Rate/proportion columns to transform:
['male_householder_rate', 'pop_under18_rate', 'pop_white_rate', 'graduation_rate', 'dropout_rate', 'serious_incidence_prop', 'incidence_rate', 'freelunch_rate', 'attendance_rate', 'serious_incidence_rate']
school_district male_householder_rate household_size pop_under18_rate pop_white_rate poverty_rate_grade9_12 enrollment_grade9_12 avg_CAPT graduation_rate median_income dropout_rate serious_incidence_prop incidence_rate freelunch_rate attendance_rate grant serious_incidence_rate
0 Ansonia School District 0.812265 2.50 0.508230 1.111573 0.158730 1323 228.921010 1.253222 56541.0 0.157171 0.137118 0.374829 0.862657 1.324032 0 0.133400
1 Avon School District 0.886789 2.57 0.544675 1.227038 0.088729 1251 284.714930 1.492740 105116.0 0.050021 0.095560 0.131284 0.207114 1.381700 0 0.093166
2 Berlin School District 0.876858 2.62 0.497328 1.301175 0.094435 1186 266.920952 1.324418 86211.0 0.124823 0.155045 0.341833 0.274800 1.376355 0 0.152993
3 Bethel School District 0.916998 2.68 0.507342 1.248565 0.064176 1044 271.106559 1.517488 83483.0 0.044736 0.149127 0.302638 0.365254 1.360279 1 0.143340
4 Bloomfield School District 0.803784 2.29 0.426050 0.639386 0.145251 1074 222.934220 1.277135 68372.0 0.172612 0.227533 0.343381 0.744502 1.359465 0 0.222825
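To see why this transform is used, here is a self-contained simulation (synthetic binomial proportions, not the district data): the variance of a raw proportion depends strongly on p, roughly p(1-p)/n, while after the arcsin–sqrt transform it is approximately constant at 1/(4n).

Show Python

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200          # binomial trials per simulated unit
draws = 20000    # simulated units per value of p

var_raw, var_trans = {}, {}
for p in [0.05, 0.50, 0.95]:
    phat = rng.binomial(n, p, size=draws) / n
    var_raw[p] = phat.var()
    var_trans[p] = np.arcsin(np.sqrt(phat)).var()
    # raw variance peaks at p = 0.5; transformed variance stays near 1/(4n)
    print(f"p={p:.2f}  var(phat)={var_raw[p]:.6f}  "
          f"var(arcsin-sqrt)={var_trans[p]:.6f}")
```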

10.6.3.6 Correlation structure of covariates (after transform)

Show Python
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

# Numeric matrix excluding the district name
feature_cols = [c for c in cov.columns if c != "school_district"]
X = cov[feature_cols].apply(pd.to_numeric, errors="coerce")

# Impute missing values for correlation computation
imputer = SimpleImputer(strategy="median")
X_imp = pd.DataFrame(imputer.fit_transform(X), columns=feature_cols)

# Correlation matrix
corr = X_imp.corr()

print("Correlation matrix shape:", corr.shape)

fig, ax = plt.subplots(figsize=(11, 9))
im = ax.imshow(corr.to_numpy(), aspect="auto")  # default colormap
ax.set_title("Correlation among district covariates (after arcsin–sqrt transform for rates)")

ax.set_xticks(range(len(feature_cols)))
ax.set_yticks(range(len(feature_cols)))
ax.set_xticklabels(feature_cols, rotation=90)
ax.set_yticklabels(feature_cols)

fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)

plt.tight_layout()
plt.show()
Correlation matrix shape: (16, 16)

10.6.3.7 Build a socioeconomic factor (PC1) and an academic factor (PC1)

We’ll use two groups of covariates and extract a single latent factor from each group using PCA on standardized variables.

Show Python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# --- Define the two blocks you specified ---
social_cols = [
    "male_householder_rate",
    "household_size",
    "pop_under18_rate",
    "poverty_rate_grade9_12",
    "median_income",
]

academic_cols = [
    "avg_CAPT",
    "graduation_rate",
    "dropout_rate",
    "serious_incidence_prop",
    "attendance_rate",
]

# Check columns exist
missing_social = [c for c in social_cols if c not in cov.columns]
missing_acad = [c for c in academic_cols if c not in cov.columns]
print("Missing social cols:", missing_social)
print("Missing academic cols:", missing_acad)

# Impute + standardize helper
def pca_factor(df, cols, factor_name):
    Xb = df[cols].apply(pd.to_numeric, errors="coerce")
    Xb_imp = SimpleImputer(strategy="median").fit_transform(Xb)
    Xb_std = StandardScaler().fit_transform(Xb_imp)

    pca = PCA(n_components=1, random_state=0)
    score = pca.fit_transform(Xb_std).ravel()

    loadings = pd.Series(pca.components_[0], index=cols, name="loading").sort_values(key=np.abs, ascending=False)
    explained = float(pca.explained_variance_ratio_[0])

    out = df[["school_district"]].copy()
    out[factor_name] = score
    return out, loadings, explained

social_factor_df, social_loadings, social_expl = pca_factor(cov, social_cols, "factor_social")
acad_factor_df, acad_loadings, acad_expl = pca_factor(cov, academic_cols, "factor_academic")

print(f"Social factor (PC1) variance explained: {social_expl:.3f}")
display(social_loadings.to_frame())

print(f"Academic factor (PC1) variance explained: {acad_expl:.3f}")
display(acad_loadings.to_frame())
Missing social cols: []
Missing academic cols: []
Social factor (PC1) variance explained: 0.619
loading
median_income 0.525252
household_size 0.466464
male_householder_rate 0.434371
pop_under18_rate 0.417091
poverty_rate_grade9_12 -0.379312
Academic factor (PC1) variance explained: 0.749
loading
avg_CAPT 0.478099
graduation_rate 0.467988
dropout_rate -0.465195
serious_incidence_prop -0.444015
attendance_rate 0.372631
10.6.3.7.1 Add the two factor scores back into district covariate table
Show Python
cov_with_factors = (
    cov.merge(social_factor_df, on="school_district", how="left")
       .merge(acad_factor_df, on="school_district", how="left")
)

display(cov_with_factors[["school_district", "factor_social", "factor_academic"]].head())
school_district factor_social factor_academic
0 Ansonia School District -1.269748 -1.663473
1 Avon School District 1.089770 2.615316
2 Berlin School District 0.251322 0.409451
3 Bethel School District 0.894830 1.765405
4 Bloomfield School District -2.451703 -1.952717
  • Instead of putting many redundant variables into a model, we form two interpretable summary factors:

    • Socioeconomic factor: captures a “context” gradient (family structure / household composition / youth share / poverty / income).
    • Academic factor: captures a “school performance & climate” gradient (CAPT, graduation, dropout, attendance, serious incidents).
  • PCA gives a transparent, reproducible way to extract these summaries while keeping the interpretation tied to the chosen variables.
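The loading tables above show this pattern: variables aligned with the latent gradient load with one sign, and opposed variables (poverty, dropout) flip sign. A minimal synthetic sketch of why a block PC1 behaves this way (hypothetical variable names, not the district covariates):

Show Python

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Three variables driven by one latent "context" gradient,
# with one variable opposed to it (like poverty vs. income)
rng = np.random.default_rng(1)
n = 200
context = rng.normal(size=n)
income = context + 0.5 * rng.normal(size=n)
hh_size = context + 0.5 * rng.normal(size=n)
poverty = -context + 0.5 * rng.normal(size=n)

X_std = StandardScaler().fit_transform(np.column_stack([income, hh_size, poverty]))
pca = PCA(n_components=1)
score = pca.fit_transform(X_std).ravel()

# Variables aligned with the gradient share one sign; the opposed one flips
load = pca.components_[0]
print("loadings (income, hh_size, poverty):", np.round(load, 3))
```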

10.6.3.8 Poisson regression

Poisson regression for yearly district counts, using a log(population) offset and fixed effects for factor_social, factor_academic, and year.

In Python, we fit this with statsmodels' GLM interface (a Poisson family with a log link and an offset). A district random intercept could be added later, e.g. via statsmodels.genmod.bayes_mixed_glm.PoissonBayesMixedGLM (a Poisson GLMM fit with variational Bayes), but we start with the simpler fixed-effects model.

10.6.3.8.1 1) Build dat1 in long format (district × year)
Show Python
import numpy as np
import pandas as pd

# --- Inputs assumed from earlier steps ---
# yearlycount: columns school_district + 5 year columns (2010-2014 totals)
# cov_with_factors: has school_district + factor_social + factor_academic (and other covariates)
# sch_data: has school_district + pop_sch_15_19 (annual pop at risk)

years = [2010, 2011, 2012, 2013, 2014]

def clean_district_name(s: pd.Series) -> pd.Series:
    return (
        s.astype(str)
         .str.strip()
         .str.replace(r"\s+", " ", regex=True)
    )

# Standardize join keys
yearlycount2 = yearlycount.copy()
yearlycount2["school_district"] = clean_district_name(yearlycount2["school_district"])

covf = cov_with_factors.copy()
covf["school_district"] = clean_district_name(covf["school_district"])

pop_df = sch_data[["school_district", "pop_sch_15_19"]].copy()
pop_df["school_district"] = clean_district_name(pop_df["school_district"])

# Long outcome table: y = count in that year
# (Rename year columns if needed: we assume they correspond to 2010..2014)
year_cols = [c for c in yearlycount2.columns if c != "school_district"]
if len(year_cols) != 5:
    print("Year columns detected:", year_cols)

dat_y = yearlycount2.melt(
    id_vars="school_district",
    value_vars=year_cols,
    var_name="year_col",
    value_name="y"
)

# If columns aren't literal 2010..2014, extract year digits from the column name
dat_y["year"] = dat_y["year_col"].astype(str).str.extract(r"(\d{4})").astype(int)

# Merge factors + population
dat1 = (dat_y
        .merge(covf[["school_district", "factor_social", "factor_academic"]], on="school_district", how="left")
        .merge(pop_df, on="school_district", how="left")
       )

# Keep only 2010–2014 and drop missing essentials
dat1 = dat1[dat1["year"].isin(years)].copy()
dat1 = dat1.dropna(subset=["y", "pop_sch_15_19", "factor_social", "factor_academic"]).reset_index(drop=True)

# Offset = log(population)
dat1["offset"] = np.log(dat1["pop_sch_15_19"].astype(float))

# District id (kept as a category; useful if a random intercept is added later)
dat1["obs"] = dat1["school_district"].astype("category")

print("dat1 shape:", dat1.shape)
display(dat1.head())
print("y total:", int(dat1["y"].sum()))
dat1 shape: (595, 9)
school_district year_col y year factor_social factor_academic pop_sch_15_19 offset obs
0 Ansonia School District ALL2010 1 2010 -1.269748 -1.663473 1289.310 7.161862 Ansonia School District
1 Avon School District ALL2010 1 2010 1.089770 2.615316 1283.500 7.157346 Avon School District
2 Berlin School District ALL2010 2 2010 0.251322 0.409451 1020.375 6.927925 Berlin School District
3 Bethel School District ALL2010 1 2010 0.894830 1.765405 1221.685 7.107986 Bethel School District
4 Bloomfield School District ALL2010 1 2010 -2.451703 -1.952717 1094.930 6.998446 Bloomfield School District
y total: 1878
10.6.3.8.2 2) Fit Poisson regression model

Model: \[ y_{d,t} \sim \mathrm{Poisson}(\mu_{d,t}), \qquad \log(\mu_{d,t}) = \log(\mathrm{pop}_{d,t}) + \beta_0 + \beta_1\,\mathrm{social}_d + \beta_2\,\mathrm{acad}_d + \gamma_t \]

Show Python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# dat1 should already be district-year long data with:
# y, pop_sch_15_19, factor_social, factor_academic, year
dat_glm = dat1.dropna(subset=["y","pop_sch_15_19","factor_social","factor_academic","year"]).copy()

# Offset = log(pop)
dat_glm["log_pop"] = np.log(dat_glm["pop_sch_15_19"].astype(float))

print("dat_glm shape:", dat_glm.shape)
display(dat_glm.head())

# Poisson GLM with log link and offset
glm_pois = smf.glm(
    formula="y ~ factor_social + factor_academic + C(year)",
    data=dat_glm,
    family=sm.families.Poisson(),
    offset=dat_glm["log_pop"]
).fit()

print(glm_pois.summary())
dat_glm shape: (595, 10)
school_district year_col y year factor_social factor_academic pop_sch_15_19 offset obs log_pop
0 Ansonia School District ALL2010 1 2010 -1.269748 -1.663473 1289.310 7.161862 Ansonia School District 7.161862
1 Avon School District ALL2010 1 2010 1.089770 2.615316 1283.500 7.157346 Avon School District 7.157346
2 Berlin School District ALL2010 2 2010 0.251322 0.409451 1020.375 6.927925 Berlin School District 6.927925
3 Bethel School District ALL2010 1 2010 0.894830 1.765405 1221.685 7.107986 Bethel School District 7.107986
4 Bloomfield School District ALL2010 1 2010 -2.451703 -1.952717 1094.930 6.998446 Bloomfield School District 6.998446
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  595
Model:                            GLM   Df Residuals:                      588
Model Family:                 Poisson   Df Model:                            6
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -1204.4
Date:                Mon, 16 Mar 2026   Deviance:                       1006.0
Time:                        15:42:07   Pearson chi2:                 1.02e+03
No. Iterations:                     5   Pseudo R-squ. (CS):            0.07123
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -6.6008      0.056   -117.033      0.000      -6.711      -6.490
C(year)[T.2011]     0.1277      0.076      1.674      0.094      -0.022       0.277
C(year)[T.2012]     0.1029      0.077      1.341      0.180      -0.048       0.253
C(year)[T.2013]     0.2312      0.075      3.102      0.002       0.085       0.377
C(year)[T.2014]     0.2697      0.074      3.650      0.000       0.125       0.415
factor_social      -0.1035      0.022     -4.801      0.000      -0.146      -0.061
factor_academic     0.0811      0.016      5.045      0.000       0.050       0.113
===================================================================================

10.6.3.8.3 3) Exponentiate fixed effects for rate ratios

Show Python
import numpy as np
coef = glm_pois.params
se = glm_pois.bse

rr = np.exp(coef)
rr_lo = np.exp(coef - 1.96*se)
rr_hi = np.exp(coef + 1.96*se)

rr_table = pd.DataFrame({
    "coef": coef,
    "rate_ratio": rr,
    "rr_2.5%": rr_lo,
    "rr_97.5%": rr_hi
})

display(rr_table)
coef rate_ratio rr_2.5% rr_97.5%
Intercept -6.600770 0.001359 0.001217 0.001518
C(year)[T.2011] 0.127710 1.136223 0.978410 1.319491
C(year)[T.2012] 0.102881 1.108359 0.953579 1.288262
C(year)[T.2013] 0.231161 1.260062 1.088831 1.458220
C(year)[T.2014] 0.269720 1.309598 1.133025 1.513687
factor_social -0.103500 0.901676 0.864369 0.940594
factor_academic 0.081144 1.084528 1.050872 1.119261
Show Python
print("Social loadings:\n", social_loadings.sort_values())
print("Academic loadings:\n", acad_loadings.sort_values())
Social loadings:
 poverty_rate_grade9_12   -0.379312
pop_under18_rate          0.417091
male_householder_rate     0.434371
household_size            0.466464
median_income             0.525252
Name: loading, dtype: float64
Academic loadings:
 dropout_rate             -0.465195
serious_incidence_prop   -0.444015
attendance_rate           0.372631
graduation_rate           0.467988
avg_CAPT                  0.478099
Name: loading, dtype: float64
  • Intercept RR = 0.001359 is the baseline expected annual rate when:
    • year = 2010 (the reference year),

    • factor_social = 0 and factor_academic = 0 (i.e., an “average” district in the standardized factor scale).

    • Because the offset accounts for population, this number is “events per person-year.” That is, 0.001359 × 10,000 ≈ 13.6 events per 10,000 person-years.

  • The year terms compare each year to 2010:
    • 2013: RR = 1.26 (CI 1.09–1.46) → holding factors constant, the statewide rate in 2013 was about 26% higher than 2010.
    • 2014: RR = 1.31 (CI 1.13–1.51) → about 31% higher than 2010.
    • 2011 and 2012 have CIs that include 1 → not clearly different from 2010 at the 95% level.
  • factor_social: RR = 0.902 (0.864–0.941)
    • A 1-unit increase in the social factor is associated with a ~9.8% lower rate (1 − 0.902 = 0.098).
  • factor_academic: RR = 1.085 (1.051–1.119)
    • A 1-unit increase in the academic factor is associated with a ~8.5% higher rate.
  • The “1 unit” issue (what is a unit of a factor?)
    • Because we created the factors via PCA on standardized variables, the factor score is centered at 0, but its standard deviation equals the square root of the first eigenvalue, which is typically larger than 1 when the block variables are correlated. Check cov_with_factors["factor_social"].std() before reading “1 unit” as one SD; reporting the rate ratio per SD of the factor gives the cleaner statement of effect size.
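One way to check the scale of a PC1 factor is to compute the standard deviation of its scores: with sklearn, the score SD equals the square root of the first eigenvalue (explained_variance_), not 1. A self-contained sketch on synthetic data (the district factors can be checked the same way with .std()):

Show Python

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Five correlated standardized covariates for 150 synthetic "districts"
latent = rng.normal(size=(150, 1))
X = latent + 0.8 * rng.normal(size=(150, 5))
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)
score = pca.fit_transform(X_std).ravel()

# PC1 scores are centered, but their SD is sqrt(first eigenvalue), not 1
sd = score.std(ddof=1)
print(f"SD of PC1 score:   {sd:.3f}")
print(f"sqrt(eigenvalue):  {np.sqrt(pca.explained_variance_[0]):.3f}")
```

Dividing the factor score (or multiplying the fitted coefficient) by this SD puts the Poisson-regression effect on a per-SD scale.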
