12  Case Study: Detecting Proteins of Concordant Change

12.1 Background

In many biomedical studies, we do not compare just one treatment group to one control group. Sometimes we have multiple related disease-producing strains and want to find proteins that move consistently in the same direction across all of them.

That is the setting in this case study. We have proteomics data from four disease-producing strains of Clostridium perfringens and one baseline nondisease-producing strain. For each protein, the data record log intensity ratios comparing each disease-producing strain to the baseline, with three experimental replications.

Our data science goal is simple to state but not trivial to solve:

  • find proteins that are upregulated across all four strains,
  • find proteins that are downregulated across all four strains, and
  • avoid being misled by proteins with discordant or unusual patterns.

This case study follows the same general flow as the first case study:

  1. understand the scientific question,
  2. inspect the structure of the data,
  3. do exploratory analysis,
  4. fit a simple baseline model,
  5. fit a more tailored model,
  6. interpret the results.

12.2 Why this is an interesting data science problem

If we analyzed each strain separately, we would miss the fact that the real scientific target is concordant change across strains. On the other hand, if we just average everything together, discordant proteins can blur the signal.

So this is a good example of a broader lesson in data science: the model should reflect the scientific structure of the problem.

12.3 Data files

This case study uses two files directly:

  • protein_logratio_py.npz: a Python-ready NumPy data file converted from the original R data file
  • mixture_source_py_checked.py: a Python translation of the tailored penalized mixture method (imported below)

12.4 Load packages and data

Show Python
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats
from sklearn.mixture import GaussianMixture

from mixture_source_py_checked import (
    MixtureControl,
    initial_values,
    mixture_penalized,
    mixture_penalized_path,
)

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)
Show Python
loaded = np.load("protein_logratio_py.npz", allow_pickle=True)
Y = loaded["Y"].astype(float)
protein_names = loaded["protein_names"].astype(str)

P, M, S = Y.shape

print("Array shape:", Y.shape)
print("Number of proteins   =", P)
print("Number of strains    =", M)
print("Number of replicates =", S)
Array shape: (2492, 4, 3)
Number of proteins   = 2492
Number of strains    = 4
Number of replicates = 3

12.5 Understand the data structure

Each protein has repeated measurements.

  • P: number of proteins
  • M: number of strains being compared to the baseline
  • S: number of replicated experiments

So this is not an ordinary rectangular data set where each row is independent and identically structured. The unit of analysis is the protein, and each protein contributes a small multivariate profile observed over replicates.
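To make the layout concrete, here is a toy array with the same (protein, strain, replicate) axes. The numbers are synthetic, not the real data; the point is only to show how one protein's strain-by-replicate profile is sliced out and summarized while ignoring missing replicates.

```python
import numpy as np

# Hypothetical toy array with the same layout as Y: (proteins, strains, replicates)
rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 4, 3))
toy[0, :, 2] = np.nan  # protein 0 is missing its third replicate entirely

profile = toy[0]                          # one protein: a 4 x 3 strain-by-replicate matrix
per_strain = np.nanmean(profile, axis=1)  # average over replicates, skipping NaNs

print(profile.shape)     # (4, 3)
print(per_strain.shape)  # (4,)
```

Each protein thus contributes a small matrix rather than a single row, which is why the unit of analysis is the protein.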

Show Python
# (P, S) mask: True where a protein is missing an entire replicate across all strains
missing_replicate = np.isnan(Y).all(axis=1)
summary_df = pd.DataFrame(
    {
        "n_proteins": [P],
        "n_strains": [M],
        "n_replications": [S],
        "pct_missing_replicates": [100 * missing_replicate.mean()],
    }
)
summary_df
n_proteins n_strains n_replications pct_missing_replicates
0 2492 4 3 12.640449

12.6 Build a protein-level summary for exploration

Before fitting the tailored method, it is helpful to create a simple protein-level summary.

We first estimate a crude replicate effect by averaging over proteins and strains within each replicate. Then we subtract that effect and average across replications.

Show Python
alpha_simple = np.nanmean(Y, axis=(0, 1))
# Ensure any fully-missing replicates don't propagate NaNs in the subtraction
alpha_simple = np.nan_to_num(alpha_simple, nan=0.0)
with np.errstate(invalid="ignore"):
    Y_adj = Y - alpha_simple.reshape(1, 1, S)
Y_avg = np.nanmean(Y_adj, axis=2)

strain_cols = [f"strain_{j}" for j in range(1, M + 1)]
Y_avg_df = pd.DataFrame(Y_avg, columns=strain_cols)
Y_avg_df["protein"] = protein_names
Y_avg_df.head()
strain_1 strain_2 strain_3 strain_4 protein
0 0.267764 0.886674 1.515244 1.153430 Prot1
1 0.959323 1.472828 0.754082 0.952676 Prot2
2 -0.950755 -0.575378 -0.991952 -0.662052 Prot3
3 -3.004162 -2.086893 -2.976018 -2.715235 Prot4
4 -0.757199 -0.087382 -0.536672 1.148365 Prot5

12.7 Exploratory analysis

12.7.1 Correlation across strains

If proteins tend to move together across strains, we should see positive dependence.

Show Python
corr = Y_avg_df[strain_cols].corr()
fig = px.imshow(
    corr,
    text_auto=".2f",
    aspect="auto",
    title="Correlation among adjusted protein log-ratios",
)
fig.show()

12.7.2 Scatter-matrix

Show Python
fig = px.scatter_matrix(
    Y_avg_df,
    dimensions=strain_cols,
    opacity=0.35,
    title="Scatter-matrix of protein-level adjusted log-ratios",
)
fig.update_traces(diagonal_visible=False, showupperhalf=False)
fig.show()

12.7.3 Interactive 3D view

A static 3D figure is hard to read. With Plotly, we can rotate the cloud and inspect the geometry directly.

Show Python
fig = px.scatter_3d(
    Y_avg_df,
    x="strain_1",
    y="strain_2",
    z="strain_3",
    hover_name="protein",
    opacity=0.45,
    title="Protein cloud in the first three strains",
)
fig.show()

12.7.4 QQ plots by strain

Show Python
def make_qq_df(x, name):
    x = np.asarray(x)
    x = x[np.isfinite(x)]
    # A single probplot call returns both the ordered points and the fitted line
    (osm, osr), (slope, intercept, _) = stats.probplot(x, dist="norm")
    return pd.DataFrame(
        {
            "theoretical": osm,
            "sample": osr,
            "line": slope * np.asarray(osm) + intercept,
            "strain": name,
        }
    )

qq_df = pd.concat(
    [make_qq_df(Y_avg[:, j], f"strain_{j+1}") for j in range(M)],
    ignore_index=True,
)

fig = px.scatter(
    qq_df,
    x="theoretical",
    y="sample",
    facet_col="strain",
    facet_col_wrap=2,
    opacity=0.45,
    title="QQ plots by strain",
)

strains = sorted(qq_df["strain"].unique())
for i, strain in enumerate(strains):
    d = qq_df[qq_df["strain"] == strain]
    col = (i % 2) + 1
    row = (i // 2) + 1
    fig.add_trace(
        go.Scatter(
            x=d["theoretical"],
            y=d["line"],
            mode="lines",
            showlegend=False,
        ),
        row=row,
        col=col,
    )

fig.show()

12.7.5 Sign-pattern diagnostic

A protein with all positive strain summaries is a crude candidate for concordant upregulation. A protein with all negative values is a crude candidate for concordant downregulation. Proteins with mixed signs are harder to classify.

Show Python
all_positive = (Y_avg > 0).all(axis=1)
all_negative = (Y_avg < 0).all(axis=1)
mixed_sign = (Y_avg > 0).any(axis=1) & (Y_avg < 0).any(axis=1)

sign_summary = pd.DataFrame(
    {
        "pattern": ["all positive", "all negative", "mixed sign"],
        "count": [all_positive.sum(), all_negative.sum(), mixed_sign.sum()],
    }
)

fig = px.bar(
    sign_summary,
    x="pattern",
    y="count",
    title="Concordant vs discordant sign patterns",
)
fig.show()

Already, the exploratory work tells us something important: not every protein follows a clean concordant pattern.

12.8 A first model: K-means clustering

K-means is a simple and widely used clustering method. We start with three clusters to match the expected structure: a reference cluster (near zero), a downregulated cluster, and an upregulated cluster.

Show Python
from sklearn.cluster import KMeans

means_init = np.vstack(
    [
        np.zeros(M),
        -2 * np.ones(M),
        2 * np.ones(M),
    ]
)

kmeans = KMeans(
    n_clusters=3,
    init=means_init,
    n_init=1,  # with explicit initial centers, scikit-learn performs only one init anyway
    random_state=1,
)
kmeans_class = kmeans.fit_predict(Y_avg)

Y_avg_df["kmeans_class"] = kmeans_class.astype(str)
Show Python
fig = px.scatter_3d(
    Y_avg_df,
    x="strain_1",
    y="strain_2",
    z="strain_3",
    color="kmeans_class",
    hover_name="protein",
    opacity=0.55,
    title="K-means clustering (k=3)",
)
fig.update_traces(marker=dict(size=4))
fig.show()

12.9 A second model: ordinary Gaussian mixture

As a baseline, we fit a three-component Gaussian mixture to the protein-level averages.

This is not yet the paper’s method. It is just a useful first attempt.

Show Python
means_init = np.vstack(
    [
        np.zeros(M),
        -2 * np.ones(M),
        2 * np.ones(M),
    ]
)

gmm = GaussianMixture(
    n_components=3,
    covariance_type="full",
    means_init=means_init,
    random_state=1,
    n_init=10,
)
gmm.fit(Y_avg)

post = gmm.predict_proba(Y_avg)
gmm_class = post.argmax(axis=1)
Y_avg_df["gmm_class"] = gmm_class.astype(str)
Show Python
fig = px.scatter_3d(
    Y_avg_df,
    x="strain_1",
    y="strain_2",
    z="strain_3",
    color="gmm_class",
    hover_name="protein",
    opacity=0.55,
    title="3-component Gaussian mixture",
)
fig.update_traces(marker=dict(size=4))
fig.show()

This baseline is useful, but it has a weakness: proteins with odd or discordant patterns can distort the three main components.
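A small synthetic illustration of that weakness, assuming a simplified two-dimensional stand-in for the real protein cloud: three clean clusters on the concordant diagonal, plus a batch of discordant points off the diagonal. Fitting a plain three-component Gaussian mixture with and without the discordant points shows how some component's mean gets pulled away.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the three concordant clusters on the diagonal
clean = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(300, 2)),
    rng.normal([-2.0, -2.0], 0.3, size=(100, 2)),
    rng.normal([2.0, 2.0], 0.3, size=(100, 2)),
])
# Discordant points well off the concordant diagonal
outliers = rng.normal([3.0, -3.0], 0.3, size=(60, 2))

init = np.array([[0.0, 0.0], [-2.0, -2.0], [2.0, 2.0]])
gmm_clean = GaussianMixture(n_components=3, means_init=init, random_state=0).fit(clean)
gmm_noisy = GaussianMixture(n_components=3, means_init=init, random_state=0).fit(
    np.vstack([clean, outliers])
)

# With only three global components, some component must absorb the
# discordant points, so its estimated mean drifts from its clean position.
print(gmm_clean.means_.round(2))
print(gmm_noisy.means_.round(2))
```

The tailored method in the next section addresses exactly this: it gives discordant observations somewhere else to go.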

12.10 The tailored penalized mixture method

Now we fit the translated Python version of the tailored method from the project.

Conceptually, the model has three main components:

  • a reference component near zero,
  • a concordantly decreased component,
  • a concordantly increased component.

Before fitting the model, it helps to picture the geometry we are trying to capture. The figure below is a conceptual sketch rather than a plot of the real data: the middle ellipsoid represents the reference distribution, the nearby orange points represent discordant proteins that do not follow a clean all-up or all-down pattern, and the two corner clusters represent concordantly decreased and increased proteins.

Show conceptual geometry
rng = np.random.default_rng(3255)

# Central reference cloud
ref = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=np.diag([0.35, 0.20, 0.12]),
    size=450,
)

# Discordant proteins: close enough to the center to overlap with the reference,
# but shifted in irregular directions rather than along a clean concordant diagonal.
dirs = rng.normal(size=(65, 3))
dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
radii = rng.uniform(1.2, 2.0, size=65)
discordant = dirs * radii[:, None] * np.array([1.2, 1.0, 0.9])
discordant += rng.normal(scale=0.06, size=discordant.shape)

# Concordant corner clusters
down = rng.multivariate_normal(
    mean=[-2.8, -2.5, -2.6],
    cov=np.diag([0.12, 0.10, 0.10]),
    size=120,
)
up = rng.multivariate_normal(
    mean=[2.8, 2.5, 2.6],
    cov=np.diag([0.12, 0.10, 0.10]),
    size=120,
)

# A translucent ellipsoid helps visualize the reference component.
u = np.linspace(0, 2 * np.pi, 40)
v = np.linspace(0, np.pi, 24)
rx, ry, rz = 1.4, 1.05, 0.8
x = rx * np.outer(np.cos(u), np.sin(v))
y = ry * np.outer(np.sin(u), np.sin(v))
z = rz * np.outer(np.ones_like(u), np.cos(v))

fig = go.Figure()

fig.add_trace(
    go.Surface(
        x=x,
        y=y,
        z=z,
        opacity=0.16,
        showscale=False,
        colorscale=[[0, "#9ecae1"], [1, "#9ecae1"]],
        hoverinfo="skip",
        name="Reference envelope",
    )
)

fig.add_trace(
    go.Scatter3d(
        x=ref[:, 0],
        y=ref[:, 1],
        z=ref[:, 2],
        mode="markers",
        name="Reference proteins",
        marker=dict(size=3, color="#6baed6", opacity=0.45),
        hoverinfo="skip",
    )
)

fig.add_trace(
    go.Scatter3d(
        x=discordant[:, 0],
        y=discordant[:, 1],
        z=discordant[:, 2],
        mode="markers",
        name="Discordant proteins",
        marker=dict(size=5, color="#f28e2b", symbol="diamond", opacity=0.9),
        hoverinfo="skip",
    )
)

fig.add_trace(
    go.Scatter3d(
        x=down[:, 0],
        y=down[:, 1],
        z=down[:, 2],
        mode="markers",
        name="Concordant down",
        marker=dict(size=4, color="#d62728", opacity=0.85),
        hoverinfo="skip",
    )
)

fig.add_trace(
    go.Scatter3d(
        x=up[:, 0],
        y=up[:, 1],
        z=up[:, 2],
        mode="markers",
        name="Concordant up",
        marker=dict(size=4, color="#2ca02c", opacity=0.85),
        hoverinfo="skip",
    )
)

fig.add_trace(
    go.Scatter3d(
        x=[0, -3.15, 3.15, 1.9],
        y=[0, -2.75, 2.75, 0.6],
        z=[1.35, -2.85, 2.85, 1.25],
        mode="text",
        text=[
            "Reference component",
            "Concordant down",
            "Concordant up",
            "Discordant proteins",
        ],
        textfont=dict(size=12, color="black"),
        showlegend=False,
        hoverinfo="skip",
    )
)

fig.update_layout(
    title="Conceptual geometry for the tailored penalized mixture",
    scene=dict(
        xaxis_title="Strain 1 summary",
        yaxis_title="Strain 2 summary",
        zaxis_title="Strain 3 summary",
        camera=dict(eye=dict(x=1.5, y=1.4, z=1.15)),
    ),
    legend=dict(x=0.02, y=0.98),
    margin=dict(l=0, r=0, b=0, t=40),
)
fig.show()

A conventional method such as K-means or a plain Gaussian mixture tries to explain all of these points using only three global clusters. That can force the discordant proteins into the wrong component and pull the central reference cluster away from the clean near-zero pattern. The tailored method keeps the three main concordant components but adds a sparse mean-shift mechanism, which gives the model a principled way to absorb those discordant proteins without letting them distort the main scientific signal.

The key extra idea is a protein-specific mean-shift vector for the reference component. If a protein has a nonzero shift, the model is saying that this protein does not really behave like a clean member of the main concordant groups. That gives the method a built-in way to handle discordant proteins.

In this case study, we use a Python translation of the hard-thresholded sparse mean-shift version with diagonal covariance matrices.
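The exact update lives in the translated source file. As a simplified sketch of the group hard-thresholding idea only (the real estimator also involves mixture responsibilities and covariance scaling), a protein keeps its mean-shift vector when the residual from the reference mean is large, and the shift is zeroed out otherwise:

```python
import numpy as np

def hard_threshold_shift(residual, lam):
    """Simplified group hard-thresholding: the whole protein-specific
    shift vector is kept intact or set to zero, depending on whether
    its norm exceeds the penalty level."""
    residual = np.asarray(residual, dtype=float)
    if np.linalg.norm(residual) > lam:
        return residual             # discordant: keep the full shift
    return np.zeros_like(residual)  # concordant enough: no shift

print(hard_threshold_shift([0.2, -0.1, 0.3, 0.1], lam=2.5))   # -> all zeros
print(hard_threshold_shift([2.9, -1.2, 0.6, -1.1], lam=2.5))  # -> kept as-is
```

The all-or-nothing character of the group threshold is what makes the flagged set sparse: a protein is either a clean member of a concordant component or it carries its own shift.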

12.11 Fit the translated method

Show Python
ini = initial_values(Y)

control = MixtureControl(
    tol=1e-5,
    maxit=400,
    iter_inner=10,
    sign_cons=True,
    mu_cons=True,
    sigma_type="diag",
    warm_start=False,
    trace=False,
)

fit = mixture_penalized(
    Y,
    lam=2.5,
    ini=ini,
    method="ghard",
    control=control,
)

print("Estimated mixture proportions:", fit.Pi)
print("Estimated number of mean-shift proteins:", fit.nout)
Estimated mixture proportions: [0.64256284 0.19053902 0.16689813]
Estimated number of mean-shift proteins: 1438
Show Python
mu_df = pd.DataFrame(
    fit.Mu,
    index=strain_cols,
    columns=["component_1", "component_2", "component_3"],
)
mu_df
component_1 component_2 component_3
strain_1 0.0 -0.806311 0.806311
strain_2 0.0 -1.442712 1.442712
strain_3 0.0 -1.102325 1.102325
strain_4 0.0 -1.658476 1.658476
Show Python
class_labels = np.argmax(fit.Pik, axis=1) + 1
is_shifted = np.linalg.norm(fit.Gamma, axis=1) > 0

result_df = Y_avg_df.copy()
result_df["component"] = class_labels.astype(str)
result_df["mean_shift"] = np.where(is_shifted, "yes", "no")
result_df["gamma_norm"] = np.linalg.norm(fit.Gamma, axis=1)

fig = px.scatter_3d(
    result_df,
    x="strain_1",
    y="strain_2",
    z="strain_3",
    color="component",
    symbol="mean_shift",
    hover_name="protein",
    hover_data={"component": True, "gamma_norm": ":.4f"},
    opacity=0.65,
    title="Tailored penalized mixture: fitted components + mean-shift (lambda=2.5)",
)
fig.update_traces(marker=dict(size=4))
fig.show()

12.12 Tune the penalty along a path

The penalty parameter controls how willing the model is to assign proteins to the mean-shift set.

  • smaller lambda: easier to flag proteins as discordant
  • larger lambda: harder to flag proteins as discordant

So it is natural to inspect a path of fitted models.
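A toy illustration of why the flagged count shrinks along the path: under a hard-threshold rule, raising the cutoff can only remove proteins from the flagged set, never add them. The residual norms below are simulated, not taken from the fitted model.

```python
import numpy as np

# Simulated stand-ins for protein-specific shift sizes
rng = np.random.default_rng(2)
norms = np.abs(rng.normal(scale=1.5, size=500))

lambda_grid = np.linspace(0.5, 5.0, 10)
n_flagged = [int((norms > lam).sum()) for lam in lambda_grid]

# Larger penalties flag the same proteins or fewer, never more
assert all(a >= b for a, b in zip(n_flagged, n_flagged[1:]))
print(n_flagged)
```

The real path in the table below behaves the same way: nout falls monotonically from 1709 at lambda = 1 to 0 at lambda = 10.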

Show Python
lambda_grid = np.linspace(1, 10.0, 200)

path_fits = mixture_penalized_path(
    Y,
    lambdas=lambda_grid,
    ini=ini,
    method="ghard",
    control=MixtureControl(
        tol=1e-5,
        maxit=300,
        iter_inner=10,
        sign_cons=True,
        sigma_type="diag",
        warm_start=True,
        trace=False,
    ),
)

path_df = pd.DataFrame(
    {
        "lambda": lambda_grid,
        "aic": [f.aic for f in path_fits],
        "bic": [f.bic for f in path_fits],
        "nout": [f.nout for f in path_fits],
        "loglike": [f.loglike[-1] for f in path_fits],
    }
)
path_df
lambda aic bic nout loglike
0 1.000000 90409.773309 247456.231124 1709 -23414.473049
1 1.045226 90090.180239 245973.917550 1705 -23416.005595
2 1.090452 90090.176939 245973.914249 1705 -23416.003944
3 1.135678 89774.680458 244507.460481 1701 -23417.952622
4 1.180905 89696.687432 244143.545243 1700 -23418.628212
... ... ... ... ... ...
195 9.819095 68429.800527 68624.393179 0 -34187.900264
196 9.864322 68429.800527 68624.393179 0 -34187.900264
197 9.909548 68429.800527 68624.393179 0 -34187.900264
198 9.954774 68429.800527 68624.393179 0 -34187.900264
199 10.000000 68429.800527 68624.393179 0 -34187.900264

200 rows × 5 columns

Show Python
fig = px.line(
    path_df,
    x="lambda",
    y=["aic", "bic"],
    markers=True,
    title="Model selection over the penalty path",
)
fig.show()
Show Python
fig = px.line(
    path_df,
    x="lambda",
    y="nout",
    markers=True,
    title="Number of mean-shift proteins over the penalty path",
)
fig.show()
Show Python
best_idx = path_df["aic"].idxmin()
best_lambda = path_df.loc[best_idx, "lambda"]
best_fit = path_fits[int(best_idx)]

print("Best lambda by AIC:", best_lambda)
print("Estimated discordant proteins:", best_fit.nout)
Best lambda by AIC: 3.8944723618090453
Estimated discordant proteins: 635

12.13 Interpret the fitted model

12.13.1 Posterior classification and mean-shift flag

Show Python
class_labels = np.argmax(best_fit.Pik, axis=1) + 1
is_shifted = np.linalg.norm(best_fit.Gamma, axis=1) > 0

result_df = Y_avg_df.copy()
result_df["component"] = class_labels.astype(str)
result_df["mean_shift"] = np.where(is_shifted, "yes", "no")
result_df["gamma_norm"] = np.linalg.norm(best_fit.Gamma, axis=1)
Show Python
# Create a combined category for component and shift status
# Only component 1 can have mean-shift proteins
result_df["category"] = result_df.apply(
    lambda row: (
        "Component " + row["component"] + " (shifted)" 
        if row["component"] == "1" and row["mean_shift"] == "yes"
        else "Component " + row["component"]
    ),
    axis=1
)

color_map = {
    "Component 1": "lightblue",
    "Component 1 (shifted)": "darkblue",
    "Component 2": "red",
    "Component 3": "green",
}

fig = px.scatter_3d(
    result_df,
    x="strain_1",
    y="strain_2",
    z="strain_3",
    color="category",
    symbol="mean_shift",
    hover_name="protein",
    hover_data={"component": True, "gamma_norm": ":.4f", "category": False},
    opacity=0.65,
    title="Tailored penalized mixture: fitted components and mean-shift proteins",
    color_discrete_map=color_map,
    symbol_map={"no": "circle", "yes": "diamond"},
)
fig.update_traces(marker=dict(size=4))
fig.show()
Show Python
fig = px.scatter_3d(
    result_df,
    x="strain_1",
    y="strain_2",
    z="strain_4",
    color="category",
    symbol="mean_shift",
    hover_name="protein",
    hover_data={"component": True, "gamma_norm": ":.4f", "category": False},
    opacity=0.65,
    title="Tailored penalized mixture: fitted components and mean-shift proteins",
    color_discrete_map=color_map,
    symbol_map={"no": "circle", "yes": "diamond"},
)
fig.update_traces(marker=dict(size=4))
fig.show()
Show Python
fig = px.scatter_3d(
    result_df,
    x="strain_2",
    y="strain_3",
    z="strain_4",
    color="category",
    symbol="mean_shift",
    hover_name="protein",
    hover_data={"component": True, "gamma_norm": ":.4f", "category": False},
    opacity=0.65,
    title="Tailored penalized mixture: fitted components and mean-shift proteins",
    color_discrete_map=color_map,
    symbol_map={"no": "circle", "yes": "diamond"},
)
fig.update_traces(marker=dict(size=4))
fig.show()

12.13.2 Compare with the Gaussian mixture

Show Python
comparison_df = pd.crosstab(
    result_df["component"],
    result_df["gmm_class"],
    rownames=["tailored"],
    colnames=["gmm"],
)
comparison_df
gmm 0 1 2
tailored
1 1507 113 98
2 185 197 18
3 62 42 270

12.13.3 Top flagged discordant proteins

Show Python
top_shifted = (
    result_df.loc[
        result_df["mean_shift"] == "yes",
        ["protein", *strain_cols, "gamma_norm"],
    ]
    .sort_values("gamma_norm", ascending=False)
    .head(20)
)

top_shifted
protein strain_1 strain_2 strain_3 strain_4 gamma_norm
1625 Prot1626 -0.844723 1.706193 2.013329 2.199800 3.414507
549 Prot550 2.622349 -1.174605 -1.476886 -0.876112 3.375571
345 Prot346 2.902211 -0.870206 0.228407 -1.128387 3.215256
1876 Prot1877 2.876183 -0.920910 -0.874390 -0.439072 3.158302
1870 Prot1871 -1.620029 0.040732 -2.095967 -1.307510 3.103632
39 Prot40 2.921001 -0.240246 0.573151 -0.933393 3.068194
2073 Prot2074 1.106576 -0.605516 -1.475594 -2.136992 3.042810
1049 Prot1050 2.864055 -0.531243 -0.561096 -0.725399 3.028768
1228 Prot1229 -0.233381 0.967474 -2.604404 -0.934341 3.027949
1083 Prot1084 -0.342661 0.545986 -2.458829 -1.269562 3.018178
143 Prot144 -1.433899 -1.453507 -0.693350 -1.625977 3.008929
280 Prot281 -0.885670 0.748456 -2.589188 0.536309 2.979421
1667 Prot1668 1.273775 -1.800601 -1.280725 -1.041496 2.905415
2424 Prot2425 -1.046020 -0.978768 -2.089650 -0.950710 2.870676
1398 Prot1399 0.323411 0.433759 -2.721112 -0.069632 2.844123
1520 Prot1521 1.372007 -1.133865 -0.829376 -1.915565 2.827581
1291 Prot1292 2.043481 -1.506729 0.621848 -1.029092 2.814384
1979 Prot1980 2.857910 0.060808 -0.151366 -0.015565 2.782637
604 Prot605 2.204867 -1.491763 -0.026268 -0.666822 2.752788
1493 Prot1494 2.421951 -0.961228 -0.088740 -0.859445 2.734451

12.13.4 Concordant candidates

Show Python
component_means = result_df.groupby("component")[strain_cols].mean().round(3)
component_means
strain_1 strain_2 strain_3 strain_4
component
1 0.410 0.027 -0.286 -0.083
2 -0.497 -1.527 -1.394 -2.148
3 1.676 1.620 0.909 1.256
Show Python
concordant_up = result_df[
    (result_df["component"] == "3") & (result_df["mean_shift"] == "no")
]
concordant_down = result_df[
    (result_df["component"] == "2") & (result_df["mean_shift"] == "no")
]

print("Concordant up candidates  :", concordant_up.shape[0])
print("Concordant down candidates:", concordant_down.shape[0])
Concordant up candidates  : 374
Concordant down candidates: 400

12.13.5 Top concordant upregulated proteins

Show Python
# Copy first to avoid pandas' SettingWithCopyWarning when adding a column,
# then rank by average expression across strains
concordant_up = concordant_up.copy()
concordant_up["avg_expression"] = concordant_up[strain_cols].mean(axis=1)
top_up = concordant_up.nlargest(10, "avg_expression")[["protein", "avg_expression", *strain_cols]]
top_up
protein avg_expression strain_1 strain_2 strain_3 strain_4
1741 Prot1742 4.938695 5.260141 3.942771 5.010008 5.541860
833 Prot834 4.066938 3.630839 2.087594 5.625105 4.924214
324 Prot325 3.846332 2.722372 1.575353 5.721773 5.365831
795 Prot796 3.783079 5.433310 3.598923 2.816042 3.284040
835 Prot836 3.683458 1.798190 1.858815 5.641284 5.435541
2167 Prot2168 3.447612 3.589540 3.638072 3.227994 3.334843
2089 Prot2090 3.396286 3.953616 2.790982 4.791982 2.048563
1122 Prot1123 3.383940 0.894082 4.140691 4.697778 3.803208
327 Prot328 3.264763 3.363528 0.621514 4.819597 4.254415
836 Prot837 3.229349 4.354592 1.684909 1.481666 5.396232

12.13.6 Top concordant downregulated proteins

Show Python
# Copy first to avoid pandas' SettingWithCopyWarning when adding a column,
# then rank by average expression across strains
concordant_down = concordant_down.copy()
concordant_down["avg_expression"] = concordant_down[strain_cols].mean(axis=1)
top_down = concordant_down.nsmallest(10, "avg_expression")[["protein", "avg_expression", *strain_cols]]
top_down
protein avg_expression strain_1 strain_2 strain_3 strain_4
54 Prot55 -5.240091 -4.159184 -3.281133 -6.341529 -7.178517
278 Prot279 -5.054727 -4.017561 -1.761703 -6.593440 -7.846203
374 Prot375 -4.611558 -4.065321 -2.835847 -5.763760 -5.781305
2280 Prot2281 -4.200996 -4.034536 -2.531507 -5.216161 -5.021779
228 Prot229 -4.198589 -4.138364 -2.804714 -3.995263 -5.856016
1588 Prot1589 -4.181097 -3.526249 -2.081360 -4.734060 -6.382719
2470 Prot2471 -4.176689 -3.777873 -2.620420 -5.482621 -4.825841
151 Prot152 -4.153596 -3.536090 -2.297432 -5.327850 -5.453013
732 Prot733 -3.626877 -4.302374 -3.070910 -1.669567 -5.464656
1995 Prot1996 -3.510702 -3.103987 -2.225621 -4.016707 -4.696494

12.14 What did the tailored method buy us?

This is the key modeling lesson.

A plain Gaussian mixture tries to explain everything with only three global components. That sounds reasonable, but it can be too rigid. Proteins with mixed or discordant behavior can pull those components away from the clean scientific patterns we actually care about.

The penalized mixture adds a sparse mean-shift layer. That gives the model a way to say:

most proteins should be explained by the main concordant structure, but a smaller set may need protein-specific adjustments.

That is a good example of how domain knowledge and statistical modeling work together.

12.15 Discussion questions

  1. Why is it not enough to fit four separate one-dimensional models, one for each strain?
  2. Why is a sparse mean-shift model a reasonable way to model concordant and discordant proteins?
  3. What does the penalty parameter do conceptually?
  4. How would you validate whether the identified proteins are biologically meaningful?

12.16 Exercises

  1. Compare the tailored method with K-means and a plain Gaussian mixture.

  2. Check whether proteins flagged as discordant also tend to have mixed signs across strains.

  3. Modify the case study so that users can click a selected protein and inspect its replicate-level profile.