HBC Spatial Transcriptomics – Aligned Results Analysis¶
Dataset: Human Breast Cancer (HBC) spatial transcriptomics (10x Visium)
This tutorial walks through a complete analysis of Clumppling alignment results
for the HBC Visium dataset, using the ace_of_clust (ACE) package.
The dataset contains 3 798 spatial spots annotated with 20 fine-grained
ground-truth regions (IDC, DCIS/LCIS, Tumor Edge, Healthy) and was clustered
with a matrix-mixture model (MMC) across K = 3–10.
We cover:
- Loading metadata, spatial coordinates, and gene-expression data
- Loading and inspecting Clumppling alignment results
- Q-matrix and spatial scatter visualizations
- Structure and alignment plots for a focused mode subset
- Feature (gene) importance analysis using P matrices
- Focal-gene spatial exploration
- Gene set enrichment analysis
Imports¶
from pathlib import Path
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import os
from clumppling.plot import (
plot_membership,
plot_alignment_graph,
plot_alignment_list,
)
from clumppling.utils import get_uniq_lb_sep
import ace_of_clust as aoc
from ace_of_clust.plot import make_mode_grid, make_mode_grid_by_K
Note: to be able to use all crisp methods, you need to install some additional packages: {'bayanpy', 'infomap', 'graph_tool', 'wurlitzer'}
Note: to be able to use all crisp methods, you need to install some additional packages: {'ASLPAw', 'pyclustering'}
Note: to be able to use all crisp methods, you need to install some additional packages: {'infomap', 'wurlitzer'}
Paths¶
# ace-of-clust root (two levels up from docs/tutorials/)
base_dir = Path("../..").resolve()
example_data_dir = base_dir / "examples" / "data" / "hbc"
# clustering Q files and per-K performance CSVs
cls_dir = example_data_dir / "clustering" / "full_mmc"
# metadata / gene lists (sit directly in 'clustering/')
cls_base_dir = example_data_dir / "clustering"
# Clumppling alignment output
align_dir = example_data_dir / "aligned" / "full_mmc"
# Output directory for figures
fig_dir = base_dir / "examples" / "figures" / "hbc"
fig_dir.mkdir(parents=True, exist_ok=True)
# Path to HBC.h5ad – required only for gene-expression visualization and
# feature analysis with P matrices. Update if your copy lives elsewhere.
HBC_H5AD_PATH = Path("/oscar/data/sramacha/users/xliu293/rnaseq-clumppling/data/HBC.h5ad")
Load Data¶
We load spot metadata (ground-truth labels and spatial coordinates), gene metadata, and optionally the full AnnData object for gene-expression visualization.
Metadata and ground-truth labels¶
df_meta_sorted = pd.read_csv(cls_base_dir / "meta_sorted.csv", index_col=0)
# Canonical ordering of the 20 fine-grained ground-truth classes
gt_cat_sorted = [
'IDC_1', 'IDC_2', 'IDC_3', 'IDC_4', 'IDC_5',
'IDC_6', 'IDC_7', 'DCIS/LCIS_1', 'DCIS/LCIS_2', 'DCIS/LCIS_3',
'DCIS/LCIS_4', 'DCIS/LCIS_5', 'Tumor Edge_1', 'Tumor Edge_2', 'Tumor Edge_3',
'Tumor Edge_4', 'Tumor Edge_5', 'Tumor Edge_6', 'Healthy_1', 'Healthy_2',
]
# Normalise 'Tumor_edge' → 'Tumor Edge' in both label columns
df_meta_sorted['ground_truth_type'] = pd.Categorical(
df_meta_sorted['ground_truth_type'].str.replace('_edge', ' Edge'),
categories=['IDC', 'DCIS/LCIS', 'Tumor Edge', 'Healthy'],
ordered=True,
)
df_meta_sorted['ground_truth'] = pd.Categorical(
df_meta_sorted['ground_truth'].str.replace('_edge', ' Edge'),
categories=gt_cat_sorted,
ordered=True,
)
print(df_meta_sorted.shape)
# Ground-truth fine-grained labels (20 regions)
gt_lbs = df_meta_sorted['ground_truth'].values
# Coarser morphotype labels (IDC / DCIS/LCIS / Tumor Edge / Healthy)
gtt_lbs = df_meta_sorted['ground_truth_type'].values
n = df_meta_sorted.shape[0]
print(f"n cells = {n}")
(3798, 5) n cells = 3798
Gene metadata¶
# All genes
df_genes_meta = pd.read_csv(cls_base_dir / "genes_all.csv")
df_genes_meta = df_genes_meta.rename(columns={'Unnamed: 0': 'gene_symbols'})
gene_names = df_genes_meta['gene_symbols'].values
m = df_genes_meta.shape[0]
print(f"Total genes: {m}")
# Highly variable genes (HVGs)
df_genes_hvg = pd.read_csv(cls_base_dir / "genes_hvg.csv")
df_genes_hvg = df_genes_hvg.rename(columns={'Unnamed: 0': 'gene_symbols'})
gene_names_hvg = df_genes_hvg['gene_symbols'].values
print(f"HVGs: {len(gene_names_hvg)}")
Total genes: 36601 HVGs: 3000
Colors¶
# Colorblind-friendly palette used throughout the notebook
cb_cmap = [
'#E69F00', '#009E73', '#CC79A7', '#56B4E9',
'#F0E442', '#D55E00', '#0072B2', '#7F68A1',
'#A3C644', '#FF1493',
]
cmap_listed = mcolors.ListedColormap(cb_cmap)
Spatial coordinates¶
x_spatial = df_meta_sorted['X'].values
y_spatial = df_meta_sorted['Y'].values
spatial_coords = np.vstack((x_spatial, y_spatial)).T
print("spatial_coords shape:", spatial_coords.shape)
spatial_coords shape: (3798, 2)
(Optional) Load gene expression data¶
The gene-expression AnnData object (HBC.h5ad) is required for:
- Feature (gene) importance analysis with P matrices
- Focal-gene spatial expression plots
Set HBC_H5AD_PATH in the Paths cell above to point to your local copy.
If the file is not found, these sections will be skipped.
import scanpy as sc
if not HBC_H5AD_PATH.exists():
raise FileNotFoundError(f"HBC h5ad not found: {HBC_H5AD_PATH}")
data_full = sc.read_h5ad(HBC_H5AD_PATH)
print(data_full)
# Non-zero genes (used for P-matrix feature analysis)
idx_genes_nz = np.where(np.sum(data_full.X, axis=0) > 0)[1]
gene_names_nz = gene_names[idx_genes_nz]
used_genes = gene_names_nz
print(f"{len(idx_genes_nz)} / {len(gene_names)} genes have non-zero counts")
# Unsorted spatial coords from h5ad (for gene-expression scatter)
spatial_coords_unsorted = data_full.obs[['X', 'Y']].values
AnnData object with n_obs × n_vars = 3798 × 36601
obs: 'ground_truth', 'X', 'Y'
var: 'gene_ids', 'feature_types', 'genome'
uns: 'ground_truth_colors'
obsm: 'spatial'
24923 / 36601 genes have non-zero counts
Unique label separations¶
Pre-compute sorted label arrays and separation indices used to order cells in structure plots throughout the notebook.
uniq_lbs, uniq_lbs_indices, uniq_lbs_sep_idx = get_uniq_lb_sep(gt_lbs)
uniq_gtt_lbs, uniq_gtt_lbs_indices, uniq_gtt_lbs_sep_idx = get_uniq_lb_sep(gtt_lbs)
print("Fine-grained labels:", uniq_lbs)
print("Coarse labels:", uniq_gtt_lbs)
Fine-grained labels: ['DCIS/LCIS_1', 'DCIS/LCIS_2', 'DCIS/LCIS_3', 'DCIS/LCIS_4', 'DCIS/LCIS_5', 'Healthy_1', 'Healthy_2', 'IDC_1', 'IDC_2', 'IDC_3', 'IDC_4', 'IDC_5', 'IDC_6', 'IDC_7', 'Tumor Edge_1', 'Tumor Edge_2', 'Tumor Edge_3', 'Tumor Edge_4', 'Tumor Edge_5', 'Tumor Edge_6'] Coarse labels: ['DCIS/LCIS', 'Healthy', 'IDC', 'Tumor Edge']
Load Clumppling Results¶
Load the Clumppling alignment output and, if available, the P matrices (feature-weight matrices, one per run). P matrices are required for the gene-importance analysis later in this notebook.
# Load alignment results (including P matrices if available)
results = aoc.load_clumppling_results(
align_dir=align_dir,
suffix="rep",
cls_dir=cls_dir,
load_P=True,
strict_P=False,
)
# Compute per-gene metrics for all modes (requires P matrices)
df_by_mode = aoc.compute_all_feature_metrics(results, feature_names=used_genes)
# Select top genes by weighted_Psum across modes
selected_by_mode, df_selected_all, overlap = aoc.select_top_features(
df_by_mode,
top_quantile=0.1,
)
# Pairwise mode mappings
pair_mappings = aoc.get_mode_pair_mappings(
mode_names=results.modes,
all_modes_alignment=results.all_modes_alignment,
alignment_acrossK=results.alignment_acrossK,
)
# Summary of all identified modes
print("All modes:", results.modes)
print()
print(results.mode_stats)
All modes: ['K3M1', 'K4M1', 'K5M1', 'K5M2', 'K6M1', 'K6M2', 'K7M1', 'K7M2', 'K8M1', 'K8M2', 'K9M1', 'K9M2', 'K10M1', 'K10M2']
Mode Representative Size Cost Performance
0 K3M1 16_K3R16 20 7.442808e-11 0.999993
1 K4M1 40_K4R20 20 3.982283e-09 0.999947
2 K5M1 57_K5R17 13 4.114238e-08 0.999849
3 K5M2 49_K5R9 7 1.580895e-07 0.999655
4 K6M1 63_K6R3 18 4.207866e-09 0.999943
5 K6M2 77_K6R17 2 1.911964e-08 0.999862
6 K7M1 99_K7R19 15 4.233527e-09 0.999938
7 K7M2 92_K7R12 5 8.733948e-03 0.940366
8 K8M1 115_K8R15 15 5.851910e-07 0.999317
9 K8M2 111_K8R11 5 1.837209e-05 0.996302
10 K9M1 122_K9R2 15 4.403278e-03 0.964594
11 K9M2 128_K9R8 5 1.126936e-02 0.911796
12 K10M1 146_K10R6 14 1.814668e-03 0.982250
13 K10M2 143_K10R3 6 1.779872e-02 0.883337
Q-Matrix Visualizations¶
Quick look at the soft-membership (Q) matrices returned by Clumppling.
plot_Q_heatmap shows the full Q matrix for a single mode; plot_Q_grid
tiles all modes side by side; plot_cluster_bars gives cluster-size bars.
mode = "K6M1"
# Q-matrix heatmap for a single mode
fig, ax = plt.subplots(figsize=(5, 4))
aoc.plot_Q_heatmap(results, mode, cmap="Blues", ax=ax)
(<Figure size 500x400 with 2 Axes>,
<Axes: title={'center': 'Q heatmap: K6M1'}, xlabel='Cluster', ylabel='Individual'>)
# Q-matrix heatmaps for all modes in a grid
fig, axes = aoc.plot_Q_grid(results, cmap="Blues")
# Cluster size bars for one mode
fig, ax = aoc.plot_cluster_bars(results, mode, colors=cb_cmap[:results.mode_K[mode]])
Spatial Scatter Plots¶
Project cluster memberships back onto the tissue slide.
plot_cluster_overlay overlays all clusters in a single panel per mode
(thresholded by a minimum membership value), while plot_cluster_panels
gives one panel per cluster.
val_threshold = 0.4
sup = f"membership > {val_threshold}"
# Overlay all clusters for all modes
fig, axes = aoc.plot_cluster_overlay(
results,
coords=spatial_coords,
cluster_colors=cb_cmap[:results.K_max],
val_threshold=val_threshold,
s=1, alpha=0.9,
suptitle=sup,
)
cluster_idx = 0 # first cluster (zero-based)
Q = results.Q_by_mode[mode] # shape (n_cells, K)
ax, sp = aoc.plot_spatial_membership(
Q, spatial_coords,
s=5, alpha=1.0,
ref_color=cb_cmap[cluster_idx],
cls_idx=cluster_idx,
)
ax.set_title(f"{mode}, Cluster {cluster_idx + 1}")
Text(0.5, 1.0, 'K6M1, Cluster 1')
Alignment Overview¶
The alignment graph and list show how modes across different values of K relate to each other – which clusters split, merge, or remain stable as K increases. This overview uses all identified modes and helps decide which subset to carry forward.
fig = plot_alignment_graph(
results.K_range,
names_list=results.mode_names_list,
cmap=cb_cmap[:results.K_max],
alignment_acrossK=results.alignment_acrossK,
all_modes_alignment=results.all_modes_alignment,
alt_color=False,
ls_alt=['-', '--'],
)
Subset of Modes¶
Based on the alignment overview above, we select a meaningful subset of modes (typically K = 3–7) that balances model-fit quality with interpretability. Modes with a very small size or high alignment cost can be excluded. Below is the full mode summary to guide the selection:
modes_subset = results.modes[:8]
# Recompute pairwise mappings and costs for the subset
sub_full = aoc.subset_results(
results,
modes_subset,
)
print("Subset modes:", sub_full.modes)
print(sub_full.mode_stats)
Subset modes: ['K3M1', 'K4M1', 'K5M1', 'K5M2', 'K6M1', 'K6M2', 'K7M1', 'K7M2'] Mode Representative Size Cost Performance 0 K3M1 16_K3R16 20 7.442808e-11 0.999993 1 K4M1 40_K4R20 20 3.982283e-09 0.999947 2 K5M1 57_K5R17 13 4.114238e-08 0.999849 3 K5M2 49_K5R9 7 1.580895e-07 0.999655 4 K6M1 63_K6R3 18 4.207866e-09 0.999943 5 K6M2 77_K6R17 2 1.911964e-08 0.999862 6 K7M1 99_K7R19 15 4.233527e-09 0.999938 7 K7M2 92_K7R12 5 8.733948e-03 0.940366
mode = "K5M1"
# Structure bar-chart for a single mode, cells sorted by cluster and
# then by ground-truth label within each cluster.
fig, ax = plt.subplots(figsize=(10, 2), dpi=150)
aoc.plot_membership_reordered(
results.Q_by_mode[mode],
cmap=cb_cmap[:results.mode_K[mode]],
lbs=gt_lbs, ax=ax, title="", annot="",
)
ax.set_title(f"Mode {mode}", fontsize=12, weight='bold')
Text(0.5, 1.0, 'Mode K5M1')
Subset Visualizations¶
Spatial cluster panels, combined spatial-structure grid, structure plots, and alignment diagrams – now restricted to the selected mode subset.
fig, axes = aoc.plot_cluster_panels(
sub_full,
coords=spatial_coords,
cluster_colors=cb_cmap[:sub_full.K_max],
val_threshold=-1,
s=2, alpha=1,
suptitle="Subset of modes",
)
fig = aoc.plot_spatial_structure_grid(
sub_full,
spatial_coords,
grps=gt_lbs,
cmap=cb_cmap[:sub_full.K_max],
reorder_cls=True,
dpi=100,
)
fig.savefig(fig_dir / "hbc_spatial_structure_membership_grid.png",
bbox_inches="tight", dpi=100)
fig = aoc.plot_structure_two_level(
sub_full,
cmap=cb_cmap[:results.K_max],
grp_labels=gt_lbs,
supgrp_labels=gtt_lbs,
lb_suffix_sep="_",
reorder_clsind=True,
)
fig.savefig(fig_dir / "hbc_structure_modes.png",
bbox_inches='tight', dpi=300)
fig = plot_alignment_graph(
sub_full.K_range,
names_list=sub_full.mode_names_list,
cmap=cb_cmap[:sub_full.K_max],
alignment_acrossK=sub_full.alignment_acrossK,
all_modes_alignment=sub_full.all_modes_alignment,
alt_color=False,
ls_alt=['-', '--'],
)
fig.savefig(fig_dir / "hbc_alignment_graph.png",
bbox_inches='tight', dpi=300)
Feature (Gene) Analysis¶
Requires HBC.h5ad – loaded in the Load Data section above.
Each clustering mode comes with a P matrix that records how strongly each gene drives each cluster. We summarise gene importance with two metrics:
- weighted_Psum: overall contribution of a gene across clusters (weighted by cluster size)
- sepLFC: log-fold change between the gene's top cluster and the rest (a measure of cluster specificity)
We first inspect a single mode in detail, then scan all subset modes at once, and finally use KDE-based outlier detection to identify the most informative genes.
mode = "K5M1"
df_mode = df_by_mode[mode]
# Scatter: weighted Psum vs. sepLFC
fig, ax = aoc.plot_feature_scatter(df_mode, mode_name=mode)
# Top genes by weighted Psum
fig, ax = aoc.plot_feature_bar(df_mode, mode_name=mode, top_n=20)
# Top genes by sepLFC
fig, ax = aoc.plot_feature_bar(df_mode, mode_name=mode, metric="sepLFC", top_n=20)
# P-matrix sorted by cluster membership
fig, ax = aoc.plot_mode_P_profile(results, mode_name=mode)
fig.tight_layout()
# sepLFC distribution
fig, ax = plt.subplots(figsize=(4, 4))
aoc.plot_sepLFC_dist(results, mode_name=mode, lfc_threshold=10, ax=ax)
fig.tight_layout()
modes = sub_full.modes
# P-sorted grids for all subset modes
fig, ax_by_mode = make_mode_grid(modes, n_cols=4, panel_size=(3, 3))
for mode_name in modes:
aoc.plot_mode_P_profile(results, mode_name, ax=ax_by_mode[mode_name])
fig.tight_layout()
fig.savefig(fig_dir / "hbc_modes_P_sorted.png",
bbox_inches='tight', dpi=150)
# sepLFC distribution grids
fig, ax_by_mode = make_mode_grid(modes, n_cols=4, panel_size=(3, 3))
for mode_name in modes:
aoc.plot_sepLFC_dist(
results, mode_name,
ax=ax_by_mode[mode_name], lfc_threshold=10,
)
fig.tight_layout()
fig.savefig(fig_dir / "hbc_modes_sepLFC_distribution.png",
bbox_inches='tight', dpi=150)
# Recompute feature metrics for the mode subset
df_pvs_modes = aoc.compute_all_feature_metrics(
sub_full,
feature_names=used_genes,
)
selected_by_mode, df_selected_all, overlap = aoc.select_top_features(
df_pvs_modes,
top_quantile=0.1,
)
print(f"{len(df_selected_all)} genes selected across modes")
2465 genes selected across modes
mode = 'K7M2'
outliers_genes, outliers_mask = aoc.get_kde_outliers(
selected_by_mode[mode],
x_col=f"weighted_Psum_{mode}",
y_col=f"sepLFC_{mode}",
levels=8,
min_x=0.0,
scale="robust",
return_mask=True,
)
print(f"{len(outliers_genes)} outlier genes")
outliers_mask &= (selected_by_mode[mode][f"sepLFC_{mode}"] > 1)
print(f"Number of outliers with sepLFC > 1: {np.sum(outliers_mask)}")
fig, ax = aoc.plot_feature_kde(
selected_by_mode[mode],
x_col=f"weighted_Psum_{mode}",
y_col=f"sepLFC_{mode}",
mode_name=mode,
outlier_mask=outliers_mask,
bg_alpha=0.3,
)
ax.set_ylim(0, 40)
ax.set_xlabel("weighted_Psum")
ax.set_ylabel("sepLFC")
ax.set_xlim(0, 0.014)
fig.savefig(fig_dir / f"hbc_{mode}_kde_outliers.png",
bbox_inches='tight', dpi=300)
59 outlier genes Number of outliers with sepLFC > 1: 45
Focal Gene Analysis¶
Zoom in on a single gene of interest (here MGP). We visualise its clustering-informativeness metrics across all subset modes and its raw expression on the tissue slide, then show which spatial clusters it drives most strongly.
focal_gene = "COX6C" # alternatives: "APOE", "IGKC", etc.
modes = sub_full.modes
# Color palette: one color per cluster + grey for multi-cluster points
custom_palette = {f'Cls.{i+1}': c for i, c in enumerate(cb_cmap[:sub_full.K_max])}
custom_palette['Multi.Cls'] = '#A9A9A9'
fig, ax = plt.subplots(figsize=(4, 5), dpi=150)
_, _, df_gene = aoc.plot_feature_across_modes(
df_pvs_modes=df_pvs_modes,
modes=modes,
selected_feature=focal_gene, # replaces 'focal_gene' kwarg from sc_clumppling
custom_color_dict=custom_palette,
xlim=[0.00, 0.015],
ylim=[40],
legend_loc="upper left",
legend_bbox_to_anchor=(0.0, 1.0),
ax=ax,
)
ax.set_title("Distribution of metrics", fontsize=14, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_pvs_across_modes.png",
bbox_inches='tight', dpi=300)
i_g = np.where(gene_names == focal_gene)[0][0]
gene_counts = np.array(data_full[:, i_g].X.todense()).squeeze()
fig, ax = plt.subplots(1, 1, figsize=(5, 4), dpi=150)
fig = aoc.plot_feature_count(
gene_counts, spatial_coords_unsorted,
feature_name=focal_gene, ax=ax,
log_transformed=False, vmax=6.0, cmap="RdYlBu_r", cbar_loc='right',
)
ax.set_title("Gene expression", fontsize=13, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_spatial_expression.png",
bbox_inches='tight', dpi=300)
# Fewer-side separated clusters (compact view)
fig, axes = aoc.plot_feature_cluster_panels(
results=sub_full,
coords=spatial_coords,
df_pvs_modes=df_pvs_modes,
selected_feature=focal_gene, # replaces 'focal_gene' kwarg from sc_clumppling
modes=modes,
colors=cb_cmap,
plot_both_sides=False,
val_threshold=0.0,
)
fig.suptitle("Separated clusters", fontsize=13, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_separated_clusters.png",
bbox_inches='tight', dpi=300)
# Both-sides separated clusters (expanded view)
fig, axes = aoc.plot_feature_cluster_panels(
results=sub_full,
coords=spatial_coords,
df_pvs_modes=df_pvs_modes,
selected_feature=focal_gene,
modes=modes,
colors=cb_cmap,
plot_both_sides=True,
val_threshold=0.0,
h_scale=2.0,
w_scale=2.0,
)
fig.suptitle(f"Separated clusters for Gene {focal_gene}",
y=0.92, fontsize=16, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_separated_clusters_both_sides.png",
bbox_inches='tight', dpi=300)
Gene Set Analysis¶
Gene set (GS) analysis tests whether a curated set of biologically related genes is systematically enriched in one or more of the clustering modes' feature weight (P) matrices. Three complementary enrichment statistics are computed:
- P enrichment: is the mean P value of GS genes higher than expected by chance within each cluster?
- Pairwise LFC enrichment: are pairwise log-fold-changes in mean P between clusters larger than the null?
- sepLFC enrichment: does the GS achieve a larger cluster-separation log-fold-change than a random gene set of the same size?
Null distributions are generated by permutation (n_perm random draws of the
same number of genes from the full feature set).
We use the HALLMARK_E2F_TARGETS gene set from MSigDB as a worked example.
Load Gene Set¶
Gene sets are stored as plain-text files (one gene symbol per line) under
examples/data/hallmark_gene_sets/. load_gene_set reads the file and
resolve_gene_set maps the symbols to indices in used_genes (the non-zero
genes from the expression data), dropping any that are absent.
hallmark_data_dir = base_dir / "examples" / "data" / "hallmark_gene_sets"
gene_set_name = "HALLMARK_E2F_TARGETS"
raw_symbols = aoc.load_gene_set(gene_set_name, hallmark_data_dir)
gene_set, gene_set_indices = aoc.resolve_gene_set(raw_symbols, used_genes)
n_gene_set = len(gene_set)
print(f"{gene_set_name}: {n_gene_set} genes resolved")
print(", ".join(gene_set[:10]), "...")
HALLMARK_E2F_TARGETS: 198 genes resolved AK2, ANP32E, ASF1A, ASF1B, ATAD2, AURKA, AURKB, BARD1, BIRC5, BRCA1 ...
# Permutation settings shared by all enrichment tests
# (increase n_perm to 10_000 for more precise p-values)
n_perm = 1_000
rng_seed = 1
Single-Mode Analysis¶
We first examine one representative mode in depth. The mode's P matrix
(shape: n_features × K) encodes how strongly each gene contributes to
each cluster. We extract the rows corresponding to the gene set, then run
all three enrichment tests against a shared permutation null.
gs_mode = "K3M1"
K_mode = sub_full.mode_K[gs_mode]
df_mode = df_by_mode[gs_mode]
cluster_labels = [f"Cls.{k+1}" for k in range(K_mode)]
cluster_colors = cb_cmap[:K_mode]
gs_title = f"GS: {gene_set_name} (mode {gs_mode})"
print(f"Mode: {gs_mode} | K={K_mode}")
print(f"Cluster labels: {cluster_labels}")
Mode: K3M1 | K=3 Cluster labels: ['Cls.1', 'Cls.2', 'Cls.3']
# Extract P sub-matrix for the gene set
P = sub_full.P_aligned_by_mode[gs_mode] # shape: (n_features, K)
P_gs = P[gene_set_indices] # shape: (n_gs, K)
print(f"P shape: {P.shape} | P_gs shape: {P_gs.shape}")
P shape: (24923, 3) | P_gs shape: (198, 3)
# Build shared permutation null (draw n_perm random same-size gene sets)
null_mean_P = aoc.sample_null_P(P, gene_set_indices, n_perm, rng_seed)
# 1. P enrichment: is mean P of GS genes elevated in each cluster?
p_res = aoc.test_P_enrichment(P, gene_set_indices, null_mean_P)
print("gs_mean_P:", dict(zip(cluster_labels, p_res["gs_mean_P"].round(6))))
print("p_emp: ", dict(zip(cluster_labels, p_res["p_emp"].round(4))))
# 2. Pairwise LFC enrichment: are pairwise LFCs in mean P larger than null?
lfc_res = aoc.test_LFC_enrichment(P, gene_set_indices, null_mean_P)
# 3. sepLFC enrichment: does the GS achieve greater cluster separation than null?
sep_res = aoc.test_sepLFC_enrichment(P, gene_set_indices, null_mean_P)
print(f"\nsepLFC (arithmetic) = {sep_res['gs_sepLFC']:.4f}")
print(f"sepLFC (geometric) = {sep_res['gs_geomean_LFC']:.4f}")
print(f" low clusters: {[cluster_labels[k] for k in sep_res['sepL']]}")
print(f" high clusters: {[cluster_labels[k] for k in sep_res['sepH']]}")
print(f" p (vs null best): {sep_res['p_vs_null_sep']:.4e}")
print(f" p (vs null fixed clusters): {sep_res['p_vs_null_fixed']:.4e}")
gs_mean_P: {'Cls.1': 6.1e-05, 'Cls.2': 3.8e-05, 'Cls.3': 7.5e-05}
p_emp: {'Cls.1': 0.0839, 'Cls.2': 0.3836, 'Cls.3': 0.026}
sepLFC (arithmetic) = 0.6954
sepLFC (geometric) = 0.7322
low clusters: ['Cls.2']
high clusters: ['Cls.1', 'Cls.3']
p (vs null best): 6.5934e-02
p (vs null fixed clusters): 1.2987e-02
# Top GS genes ranked by sepLFC between the identified separation clusters
df_sepCls = aoc.analyze_sep_genes(
df_mode, sep_res['sepH'], sep_res['sepL'], gene_set, top_n=15
)
df_sepCls['gene'] = df_sepCls.index
sepCls ordered: ((1,), (0, 2)) | n_genes (all): 3756 sepCls unordered: 2 permutations | n_genes (all): 7590 gene_set matches: 120 / 198 | returning top 15
# Genes (GS and non-GS) that share the gene set's separation clusters
sepL_t = tuple(int(x) for x in sep_res['sepL'])
sepH_t = tuple(int(x) for x in sep_res['sepH'])
# Ordered match: sepCls == (sepL, sepH) exactly
sepCls_ordered = (sepL_t, sepH_t)
df_sep_ordered = df_mode[df_mode['sepCls'] == sepCls_ordered].copy()
# Unordered match: any permutation of the two groups
sepCls_unordered = {
(permL, permH)
for permL in itertools.permutations(sepL_t)
for permH in itertools.permutations(sepH_t)
}
df_sep_unordered = df_mode[df_mode['sepCls'].apply(lambda x: x in sepCls_unordered)].copy()
for _df in [df_sep_ordered, df_sep_unordered]:
_df['rank_sepLFC'] = _df['sepLFC'].rank(ascending=False, method='min').astype(int)
df_sep_ordered = df_sep_ordered.sort_values('sepLFC', ascending=False)
df_sep_unordered = df_sep_unordered.sort_values('sepLFC', ascending=False)
print(f"Genes in ordered sepCls: {len(set(df_sep_ordered.index) & set(gene_set))} / {len(gene_set)}")
print(f"Genes in unordered sepCls: {len(set(df_sep_unordered.index) & set(gene_set))} / {len(gene_set)}")
Genes in ordered sepCls: 89 / 198 Genes in unordered sepCls: 114 / 198
Visualizations¶
The plots below cover all three enrichment angles for the selected mode.
# P enrichment: empirical p-value and z-score per cluster
fig, axes = plt.subplots(1, 2, figsize=(10, 4), dpi=150)
aoc.plot_P_enrichment_pval(p_res, cluster_labels, cluster_colors, ax=axes[0])
aoc.plot_P_enrichment_zscore(p_res, cluster_labels, cluster_colors, ax=axes[1])
fig.suptitle(f"{gs_title} – P enrichment", fontsize=11, weight="bold")
fig.tight_layout()
# Pairwise LFC enrichment: observed LFC (lower triangle) vs. z-score (upper triangle)
fig, ax = aoc.plot_pairwise_heatmap_bidir(
upper_mat=lfc_res["test"]["z_mat"],
lower_mat=lfc_res["obs_lfc"],
upper_sig=lfc_res["test"]["q_mat"],
lower_sig=lfc_res["test"]["q_mat"],
upper_label="z-score",
lower_label="obs LFC",
labels=cluster_labels,
title=f"{gs_title} – pairwise LFC",
)
# sepLFC null distributions: largest sepLFC in random sets vs. GS sepCls LFC
fig, axes = plt.subplots(1, 2, figsize=(10, 4), dpi=150)
aoc.plot_sepLFC_null_sep(sep_res, ax=axes[0])
aoc.plot_sepLFC_null_fixed(sep_res, cluster_labels, ax=axes[1])
fig.suptitle(f"{gs_title} – sepLFC enrichment", fontsize=11, weight="bold", y=1.02)
fig.tight_layout()
# Stacked-bar profile: how much of each cluster's P weight do GS genes hold?
fig, ax = aoc.plot_gene_P_stacked(
P_gs, gene_set, cluster_labels,
gs_title=gs_title, sort_by_sum=True, top_n=10,
)
# Top GS genes by cluster-specific P weight
fig, _ = aoc.plot_gene_P_bars(
P_gs, gene_set, cluster_labels, cluster_colors, top_n=10
)
# Per-gene LFC between the gene set's separation clusters (mean across cluster group)
df_gene_lfc = aoc.compute_gene_lfc(
P_gs, gene_set, sep_res["sepL"], sep_res["sepH"], kind="mean"
)
fig, ax = aoc.plot_gene_lfc(
df_gene_lfc, cluster_labels,
sep_res["sepL"], sep_res["sepH"],
sep_res["gs_sepLFC"], cluster_colors, kind="mean", show_labels=False,
)
ax.axvline(0, color='k', linestyle='-', linewidth=0.5)
ax.axvline(sep_res["gs_sepLFC"], color='gray', linestyle='--', linewidth=0.5)
ax.text(
sep_res["gs_sepLFC"], len(gene_set) - 1,
f"GS sepLFC = {sep_res['gs_sepLFC']:.4f}",
ha="left", va="top", fontsize=8, color="gray", style="italic",
)
ax.set_title("Per-gene mean LFC between GS sepCls", fontsize=11)
fig.tight_layout()
# Bipartite diagram: top GS genes linking low- and high-expression clusters
fig, ax = aoc.plot_seplfc_bipartite(
gene_set, sep_res["sepH"], sep_res["sepL"], cluster_labels,
df_mode=df_mode,
top_n_per_pair=5,
gs_title=gs_title,
colors=cluster_colors,
cmap='cividis',
figsize=(8, 5),
)
# Top GS genes by sepLFC, annotated with their separation cluster pair
fig, ax = plt.subplots(figsize=(6, 4), dpi=150)
sns.barplot(
x='sepLFC', y='gene', data=df_sepCls,
hue='sepCls', palette='Paired', ax=ax, legend=False,
)
existing_labels = set()
for i_gene, row in df_sepCls.iterrows():
low_str = "Cls." + "+".join(str(k + 1) for k in row['sepCls'][0])
high_str = "Cls." + "+".join(str(k + 1) for k in row['sepCls'][1])
lbl = f"[{high_str}]/[{low_str}]"
if lbl in existing_labels:
lbl = ""
else:
existing_labels.add(lbl)
x_pos = 0.1 if (row['sepLFC'] > 10 or max(df_sepCls['sepLFC']) < 15) else row['sepLFC'] + 0.1
ax.text(x_pos, i_gene, lbl, ha='left', va='center', fontsize=8)
ax.set_title("Top GS genes by sepLFC", fontsize=11)
fig.tight_layout()
# sepLFC distribution across genes sharing the GS separation clusters (unordered)
fig, ax = plt.subplots(figsize=(7, 4), dpi=150)
aoc.plot_sepLFC_distribution(
df_sep_unordered, gene_set,
title="", kind="auto",
show_gs_textbox=True, gs_textbox_threshold=10, n_gs_textbox=10,
show_non_gs_textbox=True, n_non_gs_textbox=10,
ax=ax,
)
n_in = len(set(df_sep_unordered.index) & set(gene_set))
ax.set_title(
f"sepLFC distribution – {n_in} / {len(gene_set)} GS genes in unordered sepCls",
fontsize=11,
)
fig.tight_layout()
Cross-Mode Analysis¶
Running enrichment across all subset modes reveals whether the gene set's
signal is consistent across values of K and across alternative modes at the
same K. run_gs_enrichment batches all three tests for every mode in one call.
res_by_mode = aoc.run_gs_enrichment(
sub_full, gene_set_indices, sub_full.modes, n_perm, rng_seed
)
# P enrichment p-values across all subset modes, grouped by K
fig, ax_by_mode = make_mode_grid_by_K(sub_full)
aoc.plot_P_enrichment_grid(res_by_mode, ax_by_mode, sub_full, cb_cmap, kind="pval")
fig.suptitle(f"{gene_set_name} – P enrichment across modes", fontsize=11, weight="bold")
fig.tight_layout()
# Pairwise LFC z-scores across all subset modes
fig, ax_by_mode = make_mode_grid_by_K(sub_full)
aoc.plot_LFC_enrichment_grid(res_by_mode, ax_by_mode, sub_full, cb_cmap)
fig.suptitle(f"{gene_set_name} – pairwise LFC z-score across modes", fontsize=11, weight="bold")
fig.tight_layout()
# sepLFC enrichment across all subset modes
fig, ax_by_mode = make_mode_grid_by_K(sub_full)
aoc.plot_sepLFC_enrichment_grid(
res_by_mode, ax_by_mode, sub_full, cb_cmap, kind="null_sep"
)
fig.suptitle(f"{gene_set_name} – sepLFC enrichment across modes", fontsize=11, weight="bold")
fig.tight_layout()
# P enrichment by cluster
fig1, _ = aoc.plot_P_enrichment_by_cluster(
res_by_mode, sub_full, cb_cmap, kind='pval', ncols=4,
sig_threshold=0.05)
fig1.tight_layout()
fig1.suptitle('P enrichment by cluster across modes (p-values)',
fontsize=12, weight='bold', y=1.01)
pass
# sepLFC distribution heatmaps for all modes, annotated with p-values (largest separation)
fig2, _ = aoc.plot_sepLFC_distribution_heatmap(
res_by_mode, n_bins=60, kind='null_sep', pval_threshold=0.05, annotate_pval=True)
fig2.suptitle('Largest sepLFC',
fontsize=12, weight='bold', x=0.45, y=1.01)
Text(0.45, 1.01, 'Largest sepLFC')
# sepLFC distribution heatmaps for all modes, annotated with p-values (fixed clusters)
fig3, _ = aoc.plot_sepLFC_distribution_heatmap(
res_by_mode, n_bins=60, kind='null_fixed', pval_threshold=0.05, annotate_pval=True)
fig3.suptitle('LFC across GS sepCls',
fontsize=12, weight='bold', x=0.45, y=1.01)
Text(0.45, 1.01, 'LFC across GS sepCls')