HBC Spatial Transcriptomics – Aligned Results Analysis¶

Dataset: Human Breast Cancer (HBC) spatial transcriptomics (10x Visium)

This tutorial walks through a complete analysis of Clumppling alignment results for the HBC Visium dataset, using the ace_of_clust (ACE) package. The dataset contains 3 798 spatial spots annotated with 20 fine-grained ground-truth regions (IDC, DCIS/LCIS, Tumor Edge, Healthy) and was clustered with a matrix-mixture model (MMC) across K = 3–10.

We cover:

Loading metadata, spatial coordinates, and gene-expression data
Loading and inspecting Clumppling alignment results
Q-matrix and spatial scatter visualizations
Structure and alignment plots for a focused mode subset
Feature (gene) importance analysis using P matrices
Focal-gene spatial exploration
Gene set enrichment analysis

Imports¶

In [2]:

Copied!





from pathlib import Path
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import os

from clumppling.plot import (
    plot_membership,
    plot_alignment_graph,
    plot_alignment_list,
)
from clumppling.utils import get_uniq_lb_sep

import ace_of_clust as aoc

from ace_of_clust.plot import make_mode_grid, make_mode_grid_by_K
from pathlib import Path
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import os

from clumppling.plot import (
    plot_membership,
    plot_alignment_graph,
    plot_alignment_list,
)
from clumppling.utils import get_uniq_lb_sep

import ace_of_clust as aoc

from ace_of_clust.plot import make_mode_grid, make_mode_grid_by_K

Note: to be able to use all crisp methods, you need to install some additional packages:  {'bayanpy', 'infomap', 'graph_tool', 'wurlitzer'}
Note: to be able to use all crisp methods, you need to install some additional packages:  {'ASLPAw', 'pyclustering'}
Note: to be able to use all crisp methods, you need to install some additional packages:  {'infomap', 'wurlitzer'}

Paths¶

In [3]:

Copied!





# ace-of-clust root (two levels up from docs/tutorials/)
base_dir = Path("../..").resolve()
example_data_dir = base_dir / "examples" / "data" / "hbc"

# clustering Q files and per-K performance CSVs
cls_dir = example_data_dir / "clustering" / "full_mmc"

# metadata / gene lists (sit directly in 'clustering/')
cls_base_dir = example_data_dir / "clustering"

# Clumppling alignment output
align_dir = example_data_dir / "aligned" / "full_mmc"

# Output directory for figures
fig_dir = base_dir / "examples" / "figures" / "hbc"
fig_dir.mkdir(parents=True, exist_ok=True)

# Path to HBC.h5ad – required only for gene-expression visualization and
# feature analysis with P matrices. Update if your copy lives elsewhere.
HBC_H5AD_PATH = Path("/oscar/data/sramacha/users/xliu293/rnaseq-clumppling/data/HBC.h5ad")
# ace-of-clust root (two levels up from docs/tutorials/)
base_dir = Path("../..").resolve()
example_data_dir = base_dir / "examples" / "data" / "hbc"

# clustering Q files and per-K performance CSVs
cls_dir = example_data_dir / "clustering" / "full_mmc"

# metadata / gene lists (sit directly in 'clustering/')
cls_base_dir = example_data_dir / "clustering"

# Clumppling alignment output
align_dir = example_data_dir / "aligned" / "full_mmc"

# Output directory for figures
fig_dir = base_dir / "examples" / "figures" / "hbc"
fig_dir.mkdir(parents=True, exist_ok=True)

# Path to HBC.h5ad – required only for gene-expression visualization and
# feature analysis with P matrices. Update if your copy lives elsewhere.
HBC_H5AD_PATH = Path("/oscar/data/sramacha/users/xliu293/rnaseq-clumppling/data/HBC.h5ad")

Load Data¶

We load spot metadata (ground-truth labels and spatial coordinates), gene metadata, and optionally the full AnnData object for gene-expression visualization.

Metadata and ground-truth labels¶

In [4]:

Copied!





df_meta_sorted = pd.read_csv(cls_base_dir / "meta_sorted.csv", index_col=0)

# Canonical ordering of the 20 fine-grained ground-truth classes
gt_cat_sorted = [
    'IDC_1', 'IDC_2', 'IDC_3', 'IDC_4', 'IDC_5',
    'IDC_6', 'IDC_7', 'DCIS/LCIS_1', 'DCIS/LCIS_2', 'DCIS/LCIS_3',
    'DCIS/LCIS_4', 'DCIS/LCIS_5', 'Tumor Edge_1', 'Tumor Edge_2', 'Tumor Edge_3',
    'Tumor Edge_4', 'Tumor Edge_5', 'Tumor Edge_6', 'Healthy_1', 'Healthy_2',
]

# Normalise 'Tumor_edge' → 'Tumor Edge' in both label columns
df_meta_sorted['ground_truth_type'] = pd.Categorical(
    df_meta_sorted['ground_truth_type'].str.replace('_edge', ' Edge'),
    categories=['IDC', 'DCIS/LCIS', 'Tumor Edge', 'Healthy'],
    ordered=True,
)
df_meta_sorted['ground_truth'] = pd.Categorical(
    df_meta_sorted['ground_truth'].str.replace('_edge', ' Edge'),
    categories=gt_cat_sorted,
    ordered=True,
)

print(df_meta_sorted.shape)

# Ground-truth fine-grained labels (20 regions)
gt_lbs  = df_meta_sorted['ground_truth'].values
# Coarser morphotype labels (IDC / DCIS/LCIS / Tumor Edge / Healthy)
gtt_lbs = df_meta_sorted['ground_truth_type'].values

n = df_meta_sorted.shape[0]
print(f"n cells = {n}")
df_meta_sorted = pd.read_csv(cls_base_dir / "meta_sorted.csv", index_col=0)

# Canonical ordering of the 20 fine-grained ground-truth classes
gt_cat_sorted = [
    'IDC_1', 'IDC_2', 'IDC_3', 'IDC_4', 'IDC_5',
    'IDC_6', 'IDC_7', 'DCIS/LCIS_1', 'DCIS/LCIS_2', 'DCIS/LCIS_3',
    'DCIS/LCIS_4', 'DCIS/LCIS_5', 'Tumor Edge_1', 'Tumor Edge_2', 'Tumor Edge_3',
    'Tumor Edge_4', 'Tumor Edge_5', 'Tumor Edge_6', 'Healthy_1', 'Healthy_2',
]

# Normalise 'Tumor_edge' → 'Tumor Edge' in both label columns
df_meta_sorted['ground_truth_type'] = pd.Categorical(
    df_meta_sorted['ground_truth_type'].str.replace('_edge', ' Edge'),
    categories=['IDC', 'DCIS/LCIS', 'Tumor Edge', 'Healthy'],
    ordered=True,
)
df_meta_sorted['ground_truth'] = pd.Categorical(
    df_meta_sorted['ground_truth'].str.replace('_edge', ' Edge'),
    categories=gt_cat_sorted,
    ordered=True,
)

print(df_meta_sorted.shape)

# Ground-truth fine-grained labels (20 regions)
gt_lbs  = df_meta_sorted['ground_truth'].values
# Coarser morphotype labels (IDC / DCIS/LCIS / Tumor Edge / Healthy)
gtt_lbs = df_meta_sorted['ground_truth_type'].values

n = df_meta_sorted.shape[0]
print(f"n cells = {n}")

(3798, 5)
n cells = 3798

Gene metadata¶

In [5]:

Copied!





# All genes
df_genes_meta = pd.read_csv(cls_base_dir / "genes_all.csv")
df_genes_meta = df_genes_meta.rename(columns={'Unnamed: 0': 'gene_symbols'})
gene_names = df_genes_meta['gene_symbols'].values
m = df_genes_meta.shape[0]
print(f"Total genes: {m}")

# Highly variable genes (HVGs)
df_genes_hvg = pd.read_csv(cls_base_dir / "genes_hvg.csv")
df_genes_hvg = df_genes_hvg.rename(columns={'Unnamed: 0': 'gene_symbols'})
gene_names_hvg = df_genes_hvg['gene_symbols'].values
print(f"HVGs: {len(gene_names_hvg)}")
# All genes
df_genes_meta = pd.read_csv(cls_base_dir / "genes_all.csv")
df_genes_meta = df_genes_meta.rename(columns={'Unnamed: 0': 'gene_symbols'})
gene_names = df_genes_meta['gene_symbols'].values
m = df_genes_meta.shape[0]
print(f"Total genes: {m}")

# Highly variable genes (HVGs)
df_genes_hvg = pd.read_csv(cls_base_dir / "genes_hvg.csv")
df_genes_hvg = df_genes_hvg.rename(columns={'Unnamed: 0': 'gene_symbols'})
gene_names_hvg = df_genes_hvg['gene_symbols'].values
print(f"HVGs: {len(gene_names_hvg)}")

Total genes: 36601
HVGs: 3000

Colors¶

In [6]:

Copied!





# Colorblind-friendly palette used throughout the notebook
cb_cmap = [
    '#E69F00', '#009E73', '#CC79A7', '#56B4E9',
    '#F0E442', '#D55E00', '#0072B2', '#7F68A1',
    '#A3C644', '#FF1493',
]
cmap_listed = mcolors.ListedColormap(cb_cmap)
# Colorblind-friendly palette used throughout the notebook
cb_cmap = [
    '#E69F00', '#009E73', '#CC79A7', '#56B4E9',
    '#F0E442', '#D55E00', '#0072B2', '#7F68A1',
    '#A3C644', '#FF1493',
]
cmap_listed = mcolors.ListedColormap(cb_cmap)

Spatial coordinates¶

In [7]:

Copied!





x_spatial = df_meta_sorted['X'].values
y_spatial = df_meta_sorted['Y'].values
spatial_coords = np.vstack((x_spatial, y_spatial)).T
print("spatial_coords shape:", spatial_coords.shape)
x_spatial = df_meta_sorted['X'].values
y_spatial = df_meta_sorted['Y'].values
spatial_coords = np.vstack((x_spatial, y_spatial)).T
print("spatial_coords shape:", spatial_coords.shape)

spatial_coords shape: (3798, 2)

(Optional) Load gene expression data¶

The gene-expression AnnData object (HBC.h5ad) is required for:

Feature (gene) importance analysis with P matrices
Focal-gene spatial expression plots

Set HBC_H5AD_PATH in the Paths cell above to point to your local copy. If the file is not found, these sections will be skipped.

In [8]:

Copied!





import scanpy as sc

if not HBC_H5AD_PATH.exists():
    raise FileNotFoundError(f"HBC h5ad not found: {HBC_H5AD_PATH}")

data_full = sc.read_h5ad(HBC_H5AD_PATH)
print(data_full)

# Non-zero genes (used for P-matrix feature analysis)
idx_genes_nz = np.where(np.sum(data_full.X, axis=0) > 0)[1]
gene_names_nz = gene_names[idx_genes_nz]
used_genes = gene_names_nz
print(f"{len(idx_genes_nz)} / {len(gene_names)} genes have non-zero counts")

# Unsorted spatial coords from h5ad (for gene-expression scatter)
spatial_coords_unsorted = data_full.obs[['X', 'Y']].values
import scanpy as sc

if not HBC_H5AD_PATH.exists():
    raise FileNotFoundError(f"HBC h5ad not found: {HBC_H5AD_PATH}")

data_full = sc.read_h5ad(HBC_H5AD_PATH)
print(data_full)

# Non-zero genes (used for P-matrix feature analysis)
idx_genes_nz = np.where(np.sum(data_full.X, axis=0) > 0)[1]
gene_names_nz = gene_names[idx_genes_nz]
used_genes = gene_names_nz
print(f"{len(idx_genes_nz)} / {len(gene_names)} genes have non-zero counts")

# Unsorted spatial coords from h5ad (for gene-expression scatter)
spatial_coords_unsorted = data_full.obs[['X', 'Y']].values

AnnData object with n_obs × n_vars = 3798 × 36601
    obs: 'ground_truth', 'X', 'Y'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'ground_truth_colors'
    obsm: 'spatial'
24923 / 36601 genes have non-zero counts

Unique label separations¶

Pre-compute sorted label arrays and separation indices used to order cells in structure plots throughout the notebook.

In [9]:

Copied!





uniq_lbs, uniq_lbs_indices, uniq_lbs_sep_idx = get_uniq_lb_sep(gt_lbs)
uniq_gtt_lbs, uniq_gtt_lbs_indices, uniq_gtt_lbs_sep_idx = get_uniq_lb_sep(gtt_lbs)
print("Fine-grained labels:", uniq_lbs)
print("Coarse labels:", uniq_gtt_lbs)
uniq_lbs, uniq_lbs_indices, uniq_lbs_sep_idx = get_uniq_lb_sep(gt_lbs)
uniq_gtt_lbs, uniq_gtt_lbs_indices, uniq_gtt_lbs_sep_idx = get_uniq_lb_sep(gtt_lbs)
print("Fine-grained labels:", uniq_lbs)
print("Coarse labels:", uniq_gtt_lbs)

Fine-grained labels: ['DCIS/LCIS_1', 'DCIS/LCIS_2', 'DCIS/LCIS_3', 'DCIS/LCIS_4', 'DCIS/LCIS_5', 'Healthy_1', 'Healthy_2', 'IDC_1', 'IDC_2', 'IDC_3', 'IDC_4', 'IDC_5', 'IDC_6', 'IDC_7', 'Tumor Edge_1', 'Tumor Edge_2', 'Tumor Edge_3', 'Tumor Edge_4', 'Tumor Edge_5', 'Tumor Edge_6']
Coarse labels: ['DCIS/LCIS', 'Healthy', 'IDC', 'Tumor Edge']

Load Clumppling Results¶

Load the Clumppling alignment output and, if available, the P matrices (feature-weight matrices, one per run). P matrices are required for the gene-importance analysis later in this notebook.

In [10]:

Copied!





# Load alignment results (including P matrices if available)
results = aoc.load_clumppling_results(
    align_dir=align_dir,
    suffix="rep",
    cls_dir=cls_dir,
    load_P=True,
    strict_P=False,
)

# Compute per-gene metrics for all modes (requires P matrices)
df_by_mode = aoc.compute_all_feature_metrics(results, feature_names=used_genes)

# Select top genes by weighted_Psum across modes
selected_by_mode, df_selected_all, overlap = aoc.select_top_features(
    df_by_mode,
    top_quantile=0.1,
)

# Pairwise mode mappings
pair_mappings = aoc.get_mode_pair_mappings(
    mode_names=results.modes,
    all_modes_alignment=results.all_modes_alignment,
    alignment_acrossK=results.alignment_acrossK,
)
# Load alignment results (including P matrices if available)
results = aoc.load_clumppling_results(
    align_dir=align_dir,
    suffix="rep",
    cls_dir=cls_dir,
    load_P=True,
    strict_P=False,
)

# Compute per-gene metrics for all modes (requires P matrices)
df_by_mode = aoc.compute_all_feature_metrics(results, feature_names=used_genes)

# Select top genes by weighted_Psum across modes
selected_by_mode, df_selected_all, overlap = aoc.select_top_features(
    df_by_mode,
    top_quantile=0.1,
)

# Pairwise mode mappings
pair_mappings = aoc.get_mode_pair_mappings(
    mode_names=results.modes,
    all_modes_alignment=results.all_modes_alignment,
    alignment_acrossK=results.alignment_acrossK,
)

In [11]:

Copied!





# Summary of all identified modes
print("All modes:", results.modes)
print()
print(results.mode_stats)
# Summary of all identified modes
print("All modes:", results.modes)
print()
print(results.mode_stats)

All modes: ['K3M1', 'K4M1', 'K5M1', 'K5M2', 'K6M1', 'K6M2', 'K7M1', 'K7M2', 'K8M1', 'K8M2', 'K9M1', 'K9M2', 'K10M1', 'K10M2']

     Mode Representative  Size          Cost  Performance
0    K3M1       16_K3R16    20  7.442808e-11     0.999993
1    K4M1       40_K4R20    20  3.982283e-09     0.999947
2    K5M1       57_K5R17    13  4.114238e-08     0.999849
3    K5M2        49_K5R9     7  1.580895e-07     0.999655
4    K6M1        63_K6R3    18  4.207866e-09     0.999943
5    K6M2       77_K6R17     2  1.911964e-08     0.999862
6    K7M1       99_K7R19    15  4.233527e-09     0.999938
7    K7M2       92_K7R12     5  8.733948e-03     0.940366
8    K8M1      115_K8R15    15  5.851910e-07     0.999317
9    K8M2      111_K8R11     5  1.837209e-05     0.996302
10   K9M1       122_K9R2    15  4.403278e-03     0.964594
11   K9M2       128_K9R8     5  1.126936e-02     0.911796
12  K10M1      146_K10R6    14  1.814668e-03     0.982250
13  K10M2      143_K10R3     6  1.779872e-02     0.883337

Q-Matrix Visualizations¶

Quick look at the soft-membership (Q) matrices returned by Clumppling. plot_Q_heatmap shows the full Q matrix for a single mode; plot_Q_grid tiles all modes side by side; plot_cluster_bars gives cluster-size bars.

In [11]:

Copied!

mode = "K6M1"

# Q-matrix heatmap for a single mode
fig, ax = plt.subplots(figsize=(5, 4))
aoc.plot_Q_heatmap(results, mode, cmap="Blues", ax=ax)
mode = "K6M1"

# Q-matrix heatmap for a single mode
fig, ax = plt.subplots(figsize=(5, 4))
aoc.plot_Q_heatmap(results, mode, cmap="Blues", ax=ax)

Out[11]:

(<Figure size 500x400 with 2 Axes>,
 <Axes: title={'center': 'Q heatmap: K6M1'}, xlabel='Cluster', ylabel='Individual'>)

No description has been provided for this image

In [12]:

Copied!

# Q-matrix heatmaps for all modes in a grid
fig, axes = aoc.plot_Q_grid(results, cmap="Blues")
# Q-matrix heatmaps for all modes in a grid
fig, axes = aoc.plot_Q_grid(results, cmap="Blues")

In [13]:

Copied!

# Cluster size bars for one mode
fig, ax = aoc.plot_cluster_bars(results, mode, colors=cb_cmap[:results.mode_K[mode]])
# Cluster size bars for one mode
fig, ax = aoc.plot_cluster_bars(results, mode, colors=cb_cmap[:results.mode_K[mode]])

Spatial Scatter Plots¶

Project cluster memberships back onto the tissue slide. plot_cluster_overlay overlays all clusters in a single panel per mode (thresholded by a minimum membership value), while plot_cluster_panels gives one panel per cluster.

In [14]:

Copied!





val_threshold = 0.4
sup = f"membership > {val_threshold}"

# Overlay all clusters for all modes
fig, axes = aoc.plot_cluster_overlay(
    results,
    coords=spatial_coords,
    cluster_colors=cb_cmap[:results.K_max],
    val_threshold=val_threshold,
    s=1, alpha=0.9,
    suptitle=sup,
)
val_threshold = 0.4
sup = f"membership > {val_threshold}"

# Overlay all clusters for all modes
fig, axes = aoc.plot_cluster_overlay(
    results,
    coords=spatial_coords,
    cluster_colors=cb_cmap[:results.K_max],
    val_threshold=val_threshold,
    s=1, alpha=0.9,
    suptitle=sup,
)

In [15]:

Copied!





cluster_idx = 0  # first cluster (zero-based)

Q = results.Q_by_mode[mode]  # shape (n_cells, K)
ax, sp = aoc.plot_spatial_membership(
    Q, spatial_coords,
    s=5, alpha=1.0,
    ref_color=cb_cmap[cluster_idx],
    cls_idx=cluster_idx,
)
ax.set_title(f"{mode}, Cluster {cluster_idx + 1}")
cluster_idx = 0  # first cluster (zero-based)

Q = results.Q_by_mode[mode]  # shape (n_cells, K)
ax, sp = aoc.plot_spatial_membership(
    Q, spatial_coords,
    s=5, alpha=1.0,
    ref_color=cb_cmap[cluster_idx],
    cls_idx=cluster_idx,
)
ax.set_title(f"{mode}, Cluster {cluster_idx + 1}")

Out[15]:

Text(0.5, 1.0, 'K6M1, Cluster 1')

Alignment Overview¶

The alignment graph and list show how modes across different values of K relate to each other – which clusters split, merge, or remain stable as K increases. This overview uses all identified modes and helps decide which subset to carry forward.

In [16]:

Copied!





fig = plot_alignment_graph(
    results.K_range,
    names_list=results.mode_names_list,
    cmap=cb_cmap[:results.K_max],
    alignment_acrossK=results.alignment_acrossK,
    all_modes_alignment=results.all_modes_alignment,
    alt_color=False,
    ls_alt=['-', '--'],
)
fig = plot_alignment_graph(
    results.K_range,
    names_list=results.mode_names_list,
    cmap=cb_cmap[:results.K_max],
    alignment_acrossK=results.alignment_acrossK,
    all_modes_alignment=results.all_modes_alignment,
    alt_color=False,
    ls_alt=['-', '--'],
)

Subset of Modes¶

Based on the alignment overview above, we select a meaningful subset of modes (typically K = 3–7) that balances model-fit quality with interpretability. Modes with a very small size or high alignment cost can be excluded. Below is the full mode summary to guide the selection:

In [12]:

Copied!





modes_subset = results.modes[:8]

# Recompute pairwise mappings and costs for the subset
sub_full = aoc.subset_results(
    results,
    modes_subset,
)

print("Subset modes:", sub_full.modes)
print(sub_full.mode_stats)
modes_subset = results.modes[:8]

# Recompute pairwise mappings and costs for the subset
sub_full = aoc.subset_results(
    results,
    modes_subset,
)

print("Subset modes:", sub_full.modes)
print(sub_full.mode_stats)

Subset modes: ['K3M1', 'K4M1', 'K5M1', 'K5M2', 'K6M1', 'K6M2', 'K7M1', 'K7M2']
   Mode Representative  Size          Cost  Performance
0  K3M1       16_K3R16    20  7.442808e-11     0.999993
1  K4M1       40_K4R20    20  3.982283e-09     0.999947
2  K5M1       57_K5R17    13  4.114238e-08     0.999849
3  K5M2        49_K5R9     7  1.580895e-07     0.999655
4  K6M1        63_K6R3    18  4.207866e-09     0.999943
5  K6M2       77_K6R17     2  1.911964e-08     0.999862
6  K7M1       99_K7R19    15  4.233527e-09     0.999938
7  K7M2       92_K7R12     5  8.733948e-03     0.940366

In [18]:

Copied!





mode = "K5M1"

# Structure bar-chart for a single mode, cells sorted by cluster and
# then by ground-truth label within each cluster.
fig, ax = plt.subplots(figsize=(10, 2), dpi=150)
aoc.plot_membership_reordered(
    results.Q_by_mode[mode],
    cmap=cb_cmap[:results.mode_K[mode]],
    lbs=gt_lbs, ax=ax, title="", annot="",
)
ax.set_title(f"Mode {mode}", fontsize=12, weight='bold')
mode = "K5M1"

# Structure bar-chart for a single mode, cells sorted by cluster and
# then by ground-truth label within each cluster.
fig, ax = plt.subplots(figsize=(10, 2), dpi=150)
aoc.plot_membership_reordered(
    results.Q_by_mode[mode],
    cmap=cb_cmap[:results.mode_K[mode]],
    lbs=gt_lbs, ax=ax, title="", annot="",
)
ax.set_title(f"Mode {mode}", fontsize=12, weight='bold')

Out[18]:

Text(0.5, 1.0, 'Mode K5M1')

Subset Visualizations¶

Spatial cluster panels, combined spatial-structure grid, structure plots, and alignment diagrams – now restricted to the selected mode subset.

In [19]:

Copied!





fig, axes = aoc.plot_cluster_panels(
    sub_full,
    coords=spatial_coords,
    cluster_colors=cb_cmap[:sub_full.K_max],
    val_threshold=-1,
    s=2, alpha=1,
    suptitle="Subset of modes",
)
fig, axes = aoc.plot_cluster_panels(
    sub_full,
    coords=spatial_coords,
    cluster_colors=cb_cmap[:sub_full.K_max],
    val_threshold=-1,
    s=2, alpha=1,
    suptitle="Subset of modes",
)

In [20]:

Copied!





fig = aoc.plot_spatial_structure_grid(
    sub_full,
    spatial_coords,
    grps=gt_lbs,
    cmap=cb_cmap[:sub_full.K_max],
    reorder_cls=True,
    dpi=100,
)
fig.savefig(fig_dir / "hbc_spatial_structure_membership_grid.png",
            bbox_inches="tight", dpi=100)
fig = aoc.plot_spatial_structure_grid(
    sub_full,
    spatial_coords,
    grps=gt_lbs,
    cmap=cb_cmap[:sub_full.K_max],
    reorder_cls=True,
    dpi=100,
)
fig.savefig(fig_dir / "hbc_spatial_structure_membership_grid.png",
            bbox_inches="tight", dpi=100)

In [21]:

Copied!





fig = aoc.plot_structure_two_level(
    sub_full,
    cmap=cb_cmap[:results.K_max],
    grp_labels=gt_lbs,
    supgrp_labels=gtt_lbs,
    lb_suffix_sep="_",
    reorder_clsind=True,
)
fig.savefig(fig_dir / "hbc_structure_modes.png",
            bbox_inches='tight', dpi=300)
fig = aoc.plot_structure_two_level(
    sub_full,
    cmap=cb_cmap[:results.K_max],
    grp_labels=gt_lbs,
    supgrp_labels=gtt_lbs,
    lb_suffix_sep="_",
    reorder_clsind=True,
)
fig.savefig(fig_dir / "hbc_structure_modes.png",
            bbox_inches='tight', dpi=300)

In [22]:

Copied!





fig = plot_alignment_graph(
    sub_full.K_range,
    names_list=sub_full.mode_names_list,
    cmap=cb_cmap[:sub_full.K_max],
    alignment_acrossK=sub_full.alignment_acrossK,
    all_modes_alignment=sub_full.all_modes_alignment,
    alt_color=False,
    ls_alt=['-', '--'],
)
fig.savefig(fig_dir / "hbc_alignment_graph.png",
            bbox_inches='tight', dpi=300)
fig = plot_alignment_graph(
    sub_full.K_range,
    names_list=sub_full.mode_names_list,
    cmap=cb_cmap[:sub_full.K_max],
    alignment_acrossK=sub_full.alignment_acrossK,
    all_modes_alignment=sub_full.all_modes_alignment,
    alt_color=False,
    ls_alt=['-', '--'],
)
fig.savefig(fig_dir / "hbc_alignment_graph.png",
            bbox_inches='tight', dpi=300)

Feature (Gene) Analysis¶

Requires HBC.h5ad – loaded in the Load Data section above.

Each clustering mode comes with a P matrix that records how strongly each gene drives each cluster. We summarise gene importance with two metrics:

weighted_Psum: overall contribution of a gene across clusters (weighted by cluster size)
sepLFC: log-fold change between the gene's top cluster and the rest (a measure of cluster specificity)

We first inspect a single mode in detail, then scan all subset modes at once, and finally use KDE-based outlier detection to identify the most informative genes.

In [23]:

Copied!





mode = "K5M1"
df_mode = df_by_mode[mode]

# Scatter: weighted Psum vs. sepLFC
fig, ax = aoc.plot_feature_scatter(df_mode, mode_name=mode)

# Top genes by weighted Psum
fig, ax = aoc.plot_feature_bar(df_mode, mode_name=mode, top_n=20)

# Top genes by sepLFC
fig, ax = aoc.plot_feature_bar(df_mode, mode_name=mode, metric="sepLFC", top_n=20)

# P-matrix sorted by cluster membership
fig, ax = aoc.plot_mode_P_profile(results, mode_name=mode)
fig.tight_layout()

# sepLFC distribution
fig, ax = plt.subplots(figsize=(4, 4))
aoc.plot_sepLFC_dist(results, mode_name=mode, lfc_threshold=10, ax=ax)
fig.tight_layout()
mode = "K5M1"
df_mode = df_by_mode[mode]

# Scatter: weighted Psum vs. sepLFC
fig, ax = aoc.plot_feature_scatter(df_mode, mode_name=mode)

# Top genes by weighted Psum
fig, ax = aoc.plot_feature_bar(df_mode, mode_name=mode, top_n=20)

# Top genes by sepLFC
fig, ax = aoc.plot_feature_bar(df_mode, mode_name=mode, metric="sepLFC", top_n=20)

# P-matrix sorted by cluster membership
fig, ax = aoc.plot_mode_P_profile(results, mode_name=mode)
fig.tight_layout()

# sepLFC distribution
fig, ax = plt.subplots(figsize=(4, 4))
aoc.plot_sepLFC_dist(results, mode_name=mode, lfc_threshold=10, ax=ax)
fig.tight_layout()

In [24]:

Copied!





modes = sub_full.modes

# P-sorted grids for all subset modes
fig, ax_by_mode = make_mode_grid(modes, n_cols=4, panel_size=(3, 3))
for mode_name in modes:
    aoc.plot_mode_P_profile(results, mode_name, ax=ax_by_mode[mode_name])
fig.tight_layout()
fig.savefig(fig_dir / "hbc_modes_P_sorted.png",
            bbox_inches='tight', dpi=150)

# sepLFC distribution grids
fig, ax_by_mode = make_mode_grid(modes, n_cols=4, panel_size=(3, 3))
for mode_name in modes:
    aoc.plot_sepLFC_dist(
        results, mode_name,
        ax=ax_by_mode[mode_name], lfc_threshold=10,
    )
fig.tight_layout()
fig.savefig(fig_dir / "hbc_modes_sepLFC_distribution.png",
            bbox_inches='tight', dpi=150)
modes = sub_full.modes

# P-sorted grids for all subset modes
fig, ax_by_mode = make_mode_grid(modes, n_cols=4, panel_size=(3, 3))
for mode_name in modes:
    aoc.plot_mode_P_profile(results, mode_name, ax=ax_by_mode[mode_name])
fig.tight_layout()
fig.savefig(fig_dir / "hbc_modes_P_sorted.png",
            bbox_inches='tight', dpi=150)

# sepLFC distribution grids
fig, ax_by_mode = make_mode_grid(modes, n_cols=4, panel_size=(3, 3))
for mode_name in modes:
    aoc.plot_sepLFC_dist(
        results, mode_name,
        ax=ax_by_mode[mode_name], lfc_threshold=10,
    )
fig.tight_layout()
fig.savefig(fig_dir / "hbc_modes_sepLFC_distribution.png",
            bbox_inches='tight', dpi=150)

In [12]:

Copied!





# Recompute feature metrics for the mode subset
df_pvs_modes = aoc.compute_all_feature_metrics(
    sub_full,
    feature_names=used_genes,
)
selected_by_mode, df_selected_all, overlap = aoc.select_top_features(
    df_pvs_modes,
    top_quantile=0.1,
)
print(f"{len(df_selected_all)} genes selected across modes")
# Recompute feature metrics for the mode subset
df_pvs_modes = aoc.compute_all_feature_metrics(
    sub_full,
    feature_names=used_genes,
)
selected_by_mode, df_selected_all, overlap = aoc.select_top_features(
    df_pvs_modes,
    top_quantile=0.1,
)
print(f"{len(df_selected_all)} genes selected across modes")

2465 genes selected across modes

In [13]:

Copied!





mode = 'K7M2'

outliers_genes, outliers_mask = aoc.get_kde_outliers(
    selected_by_mode[mode],
    x_col=f"weighted_Psum_{mode}",
    y_col=f"sepLFC_{mode}",
    levels=8,
    min_x=0.0,
    scale="robust",
    return_mask=True,
)
print(f"{len(outliers_genes)} outlier genes")

outliers_mask &= (selected_by_mode[mode][f"sepLFC_{mode}"] > 1)
print(f"Number of outliers with sepLFC > 1: {np.sum(outliers_mask)}")

fig, ax = aoc.plot_feature_kde(
    selected_by_mode[mode],
    x_col=f"weighted_Psum_{mode}",
    y_col=f"sepLFC_{mode}",
    mode_name=mode,
    outlier_mask=outliers_mask,
    bg_alpha=0.3,
)
ax.set_ylim(0, 40)
ax.set_xlabel("weighted_Psum")
ax.set_ylabel("sepLFC")
ax.set_xlim(0, 0.014)
fig.savefig(fig_dir / f"hbc_{mode}_kde_outliers.png",
            bbox_inches='tight', dpi=300)
mode = 'K7M2'

outliers_genes, outliers_mask = aoc.get_kde_outliers(
    selected_by_mode[mode],
    x_col=f"weighted_Psum_{mode}",
    y_col=f"sepLFC_{mode}",
    levels=8,
    min_x=0.0,
    scale="robust",
    return_mask=True,
)
print(f"{len(outliers_genes)} outlier genes")

outliers_mask &= (selected_by_mode[mode][f"sepLFC_{mode}"] > 1)
print(f"Number of outliers with sepLFC > 1: {np.sum(outliers_mask)}")

fig, ax = aoc.plot_feature_kde(
    selected_by_mode[mode],
    x_col=f"weighted_Psum_{mode}",
    y_col=f"sepLFC_{mode}",
    mode_name=mode,
    outlier_mask=outliers_mask,
    bg_alpha=0.3,
)
ax.set_ylim(0, 40)
ax.set_xlabel("weighted_Psum")
ax.set_ylabel("sepLFC")
ax.set_xlim(0, 0.014)
fig.savefig(fig_dir / f"hbc_{mode}_kde_outliers.png",
            bbox_inches='tight', dpi=300)

59 outlier genes
Number of outliers with sepLFC > 1: 45

Focal Gene Analysis¶

Zoom in on a single gene of interest (here MGP). We visualise its clustering-informativeness metrics across all subset modes and its raw expression on the tissue slide, then show which spatial clusters it drives most strongly.

In [ ]:

Copied!





focal_gene = "COX6C"  # alternatives: "APOE", "IGKC", etc.
modes = sub_full.modes

# Color palette: one color per cluster + grey for multi-cluster points
custom_palette = {f'Cls.{i+1}': c for i, c in enumerate(cb_cmap[:sub_full.K_max])}
custom_palette['Multi.Cls'] = '#A9A9A9'
focal_gene = "COX6C"  # alternatives: "APOE", "IGKC", etc.
modes = sub_full.modes

# Color palette: one color per cluster + grey for multi-cluster points
custom_palette = {f'Cls.{i+1}': c for i, c in enumerate(cb_cmap[:sub_full.K_max])}
custom_palette['Multi.Cls'] = '#A9A9A9'

In [23]:

Copied!





fig, ax = plt.subplots(figsize=(4, 5), dpi=150)
_, _, df_gene = aoc.plot_feature_across_modes(
    df_pvs_modes=df_pvs_modes,
    modes=modes,
    selected_feature=focal_gene,   # replaces 'focal_gene' kwarg from sc_clumppling
    custom_color_dict=custom_palette,
    xlim=[0.00, 0.015],
    ylim=[40],
    legend_loc="upper left",
    legend_bbox_to_anchor=(0.0, 1.0),
    ax=ax,
)
ax.set_title("Distribution of metrics", fontsize=14, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_pvs_across_modes.png",
            bbox_inches='tight', dpi=300)
fig, ax = plt.subplots(figsize=(4, 5), dpi=150)
_, _, df_gene = aoc.plot_feature_across_modes(
    df_pvs_modes=df_pvs_modes,
    modes=modes,
    selected_feature=focal_gene,   # replaces 'focal_gene' kwarg from sc_clumppling
    custom_color_dict=custom_palette,
    xlim=[0.00, 0.015],
    ylim=[40],
    legend_loc="upper left",
    legend_bbox_to_anchor=(0.0, 1.0),
    ax=ax,
)
ax.set_title("Distribution of metrics", fontsize=14, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_pvs_across_modes.png",
            bbox_inches='tight', dpi=300)

In [24]:

Copied!





i_g = np.where(gene_names == focal_gene)[0][0]
gene_counts = np.array(data_full[:, i_g].X.todense()).squeeze()

fig, ax = plt.subplots(1, 1, figsize=(5, 4), dpi=150)
fig = aoc.plot_feature_count(
    gene_counts, spatial_coords_unsorted,
    feature_name=focal_gene, ax=ax,
    log_transformed=False, vmax=6.0, cmap="RdYlBu_r", cbar_loc='right',
)
ax.set_title("Gene expression", fontsize=13, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_spatial_expression.png",
            bbox_inches='tight', dpi=300)
i_g = np.where(gene_names == focal_gene)[0][0]
gene_counts = np.array(data_full[:, i_g].X.todense()).squeeze()

fig, ax = plt.subplots(1, 1, figsize=(5, 4), dpi=150)
fig = aoc.plot_feature_count(
    gene_counts, spatial_coords_unsorted,
    feature_name=focal_gene, ax=ax,
    log_transformed=False, vmax=6.0, cmap="RdYlBu_r", cbar_loc='right',
)
ax.set_title("Gene expression", fontsize=13, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_spatial_expression.png",
            bbox_inches='tight', dpi=300)

In [25]:

Copied!





# Fewer-side separated clusters (compact view)
fig, axes = aoc.plot_feature_cluster_panels(
    results=sub_full,
    coords=spatial_coords,
    df_pvs_modes=df_pvs_modes,
    selected_feature=focal_gene,   # replaces 'focal_gene' kwarg from sc_clumppling
    modes=modes,
    colors=cb_cmap,
    plot_both_sides=False,
    val_threshold=0.0,
)
fig.suptitle("Separated clusters", fontsize=13, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_separated_clusters.png",
            bbox_inches='tight', dpi=300)
# Fewer-side separated clusters (compact view)
fig, axes = aoc.plot_feature_cluster_panels(
    results=sub_full,
    coords=spatial_coords,
    df_pvs_modes=df_pvs_modes,
    selected_feature=focal_gene,   # replaces 'focal_gene' kwarg from sc_clumppling
    modes=modes,
    colors=cb_cmap,
    plot_both_sides=False,
    val_threshold=0.0,
)
fig.suptitle("Separated clusters", fontsize=13, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_separated_clusters.png",
            bbox_inches='tight', dpi=300)

In [26]:

Copied!





# Both-sides separated clusters (expanded view)
fig, axes = aoc.plot_feature_cluster_panels(
    results=sub_full,
    coords=spatial_coords,
    df_pvs_modes=df_pvs_modes,
    selected_feature=focal_gene,
    modes=modes,
    colors=cb_cmap,
    plot_both_sides=True,
    val_threshold=0.0,
    h_scale=2.0,
    w_scale=2.0,
)
fig.suptitle(f"Separated clusters for Gene {focal_gene}",
             y=0.92, fontsize=16, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_separated_clusters_both_sides.png",
            bbox_inches='tight', dpi=300)
# Both-sides separated clusters (expanded view)
fig, axes = aoc.plot_feature_cluster_panels(
    results=sub_full,
    coords=spatial_coords,
    df_pvs_modes=df_pvs_modes,
    selected_feature=focal_gene,
    modes=modes,
    colors=cb_cmap,
    plot_both_sides=True,
    val_threshold=0.0,
    h_scale=2.0,
    w_scale=2.0,
)
fig.suptitle(f"Separated clusters for Gene {focal_gene}",
             y=0.92, fontsize=16, weight='bold')
fig.savefig(fig_dir / f"hbc_{focal_gene}_separated_clusters_both_sides.png",
            bbox_inches='tight', dpi=300)

Gene Set Analysis¶

Gene set (GS) analysis tests whether a curated set of biologically related genes is systematically enriched in one or more of the clustering modes' feature weight (P) matrices. Three complementary enrichment statistics are computed:

P enrichment: is the mean P value of GS genes higher than expected by chance within each cluster?
Pairwise LFC enrichment: are pairwise log-fold-changes in mean P between clusters larger than the null?
sepLFC enrichment: does the GS achieve a larger cluster-separation log-fold-change than a random gene set of the same size?

Null distributions are generated by permutation (n_perm random draws of the same number of genes from the full feature set).

We use the HALLMARK_E2F_TARGETS gene set from MSigDB as a worked example.

Load Gene Set¶

Gene sets are stored as plain-text files (one gene symbol per line) under examples/data/hallmark_gene_sets/. load_gene_set reads the file and resolve_gene_set maps the symbols to indices in used_genes (the non-zero genes from the expression data), dropping any that are absent.

In [13]:

Copied!





hallmark_data_dir = base_dir / "examples" / "data" / "hallmark_gene_sets"
gene_set_name     = "HALLMARK_E2F_TARGETS"

raw_symbols = aoc.load_gene_set(gene_set_name, hallmark_data_dir)
gene_set, gene_set_indices = aoc.resolve_gene_set(raw_symbols, used_genes)
n_gene_set = len(gene_set)
print(f"{gene_set_name}: {n_gene_set} genes resolved")
print(", ".join(gene_set[:10]), "...")
hallmark_data_dir = base_dir / "examples" / "data" / "hallmark_gene_sets"
gene_set_name     = "HALLMARK_E2F_TARGETS"

raw_symbols = aoc.load_gene_set(gene_set_name, hallmark_data_dir)
gene_set, gene_set_indices = aoc.resolve_gene_set(raw_symbols, used_genes)
n_gene_set = len(gene_set)
print(f"{gene_set_name}: {n_gene_set} genes resolved")
print(", ".join(gene_set[:10]), "...")

HALLMARK_E2F_TARGETS: 198 genes resolved
AK2, ANP32E, ASF1A, ASF1B, ATAD2, AURKA, AURKB, BARD1, BIRC5, BRCA1 ...

In [14]:

Copied!





# Permutation settings shared by all enrichment tests
# (increase n_perm to 10_000 for more precise p-values)
n_perm   = 1_000
rng_seed = 1
# Permutation settings shared by all enrichment tests
# (increase n_perm to 10_000 for more precise p-values)
n_perm   = 1_000
rng_seed = 1

Single-Mode Analysis¶

We first examine one representative mode in depth. The mode's P matrix (shape: n_features × K) encodes how strongly each gene contributes to each cluster. We extract the rows corresponding to the gene set, then run all three enrichment tests against a shared permutation null.

In [15]:

Copied!





gs_mode = "K3M1"
K_mode  = sub_full.mode_K[gs_mode]
df_mode = df_by_mode[gs_mode]

cluster_labels = [f"Cls.{k+1}" for k in range(K_mode)]
cluster_colors = cb_cmap[:K_mode]
gs_title = f"GS: {gene_set_name}  (mode {gs_mode})"

print(f"Mode: {gs_mode}  |  K={K_mode}")
print(f"Cluster labels: {cluster_labels}")
gs_mode = "K3M1"
K_mode  = sub_full.mode_K[gs_mode]
df_mode = df_by_mode[gs_mode]

cluster_labels = [f"Cls.{k+1}" for k in range(K_mode)]
cluster_colors = cb_cmap[:K_mode]
gs_title = f"GS: {gene_set_name}  (mode {gs_mode})"

print(f"Mode: {gs_mode}  |  K={K_mode}")
print(f"Cluster labels: {cluster_labels}")

Mode: K3M1  |  K=3
Cluster labels: ['Cls.1', 'Cls.2', 'Cls.3']

In [16]:

Copied!

# Extract P sub-matrix for the gene set
P    = sub_full.P_aligned_by_mode[gs_mode]   # shape: (n_features, K)
P_gs = P[gene_set_indices]                   # shape: (n_gs, K)

print(f"P shape: {P.shape}  |  P_gs shape: {P_gs.shape}")
# Extract P sub-matrix for the gene set
P    = sub_full.P_aligned_by_mode[gs_mode]   # shape: (n_features, K)
P_gs = P[gene_set_indices]                   # shape: (n_gs, K)

print(f"P shape: {P.shape}  |  P_gs shape: {P_gs.shape}")

P shape: (24923, 3)  |  P_gs shape: (198, 3)

In [17]:

Copied!





# Build shared permutation null (draw n_perm random same-size gene sets)
null_mean_P = aoc.sample_null_P(P, gene_set_indices, n_perm, rng_seed)

# 1. P enrichment: is mean P of GS genes elevated in each cluster?
p_res = aoc.test_P_enrichment(P, gene_set_indices, null_mean_P)
print("gs_mean_P:", dict(zip(cluster_labels, p_res["gs_mean_P"].round(6))))
print("p_emp:    ", dict(zip(cluster_labels, p_res["p_emp"].round(4))))

# 2. Pairwise LFC enrichment: are pairwise LFCs in mean P larger than null?
lfc_res = aoc.test_LFC_enrichment(P, gene_set_indices, null_mean_P)

# 3. sepLFC enrichment: does the GS achieve greater cluster separation than null?
sep_res = aoc.test_sepLFC_enrichment(P, gene_set_indices, null_mean_P)
print(f"\nsepLFC (arithmetic) = {sep_res['gs_sepLFC']:.4f}")
print(f"sepLFC (geometric)  = {sep_res['gs_geomean_LFC']:.4f}")
print(f"  low clusters:  {[cluster_labels[k] for k in sep_res['sepL']]}")
print(f"  high clusters: {[cluster_labels[k] for k in sep_res['sepH']]}")
print(f"  p (vs null best):           {sep_res['p_vs_null_sep']:.4e}")
print(f"  p (vs null fixed clusters): {sep_res['p_vs_null_fixed']:.4e}")
# Build shared permutation null (draw n_perm random same-size gene sets)
null_mean_P = aoc.sample_null_P(P, gene_set_indices, n_perm, rng_seed)

# 1. P enrichment: is mean P of GS genes elevated in each cluster?
p_res = aoc.test_P_enrichment(P, gene_set_indices, null_mean_P)
print("gs_mean_P:", dict(zip(cluster_labels, p_res["gs_mean_P"].round(6))))
print("p_emp:    ", dict(zip(cluster_labels, p_res["p_emp"].round(4))))

# 2. Pairwise LFC enrichment: are pairwise LFCs in mean P larger than null?
lfc_res = aoc.test_LFC_enrichment(P, gene_set_indices, null_mean_P)

# 3. sepLFC enrichment: does the GS achieve greater cluster separation than null?
sep_res = aoc.test_sepLFC_enrichment(P, gene_set_indices, null_mean_P)
print(f"\nsepLFC (arithmetic) = {sep_res['gs_sepLFC']:.4f}")
print(f"sepLFC (geometric)  = {sep_res['gs_geomean_LFC']:.4f}")
print(f"  low clusters:  {[cluster_labels[k] for k in sep_res['sepL']]}")
print(f"  high clusters: {[cluster_labels[k] for k in sep_res['sepH']]}")
print(f"  p (vs null best):           {sep_res['p_vs_null_sep']:.4e}")
print(f"  p (vs null fixed clusters): {sep_res['p_vs_null_fixed']:.4e}")

gs_mean_P: {'Cls.1': 6.1e-05, 'Cls.2': 3.8e-05, 'Cls.3': 7.5e-05}
p_emp:     {'Cls.1': 0.0839, 'Cls.2': 0.3836, 'Cls.3': 0.026}

sepLFC (arithmetic) = 0.6954
sepLFC (geometric)  = 0.7322
  low clusters:  ['Cls.2']
  high clusters: ['Cls.1', 'Cls.3']
  p (vs null best):           6.5934e-02
  p (vs null fixed clusters): 1.2987e-02

In [18]:

Copied!





# Top GS genes ranked by sepLFC between the identified separation clusters
df_sepCls = aoc.analyze_sep_genes(
    df_mode, sep_res['sepH'], sep_res['sepL'], gene_set, top_n=15
)
df_sepCls['gene'] = df_sepCls.index
# Top GS genes ranked by sepLFC between the identified separation clusters
df_sepCls = aoc.analyze_sep_genes(
    df_mode, sep_res['sepH'], sep_res['sepL'], gene_set, top_n=15
)
df_sepCls['gene'] = df_sepCls.index

sepCls ordered:   ((1,), (0, 2))  |  n_genes (all): 3756
sepCls unordered: 2 permutations  |  n_genes (all): 7590
gene_set matches: 120 / 198  |  returning top 15

In [19]:

Copied!





# Genes (GS and non-GS) that share the gene set's separation clusters
sepL_t = tuple(int(x) for x in sep_res['sepL'])
sepH_t = tuple(int(x) for x in sep_res['sepH'])

# Ordered match: sepCls == (sepL, sepH) exactly
sepCls_ordered   = (sepL_t, sepH_t)
df_sep_ordered   = df_mode[df_mode['sepCls'] == sepCls_ordered].copy()

# Unordered match: any permutation of the two groups
sepCls_unordered = {
    (permL, permH)
    for permL in itertools.permutations(sepL_t)
    for permH in itertools.permutations(sepH_t)
}
df_sep_unordered = df_mode[df_mode['sepCls'].apply(lambda x: x in sepCls_unordered)].copy()

for _df in [df_sep_ordered, df_sep_unordered]:
    _df['rank_sepLFC'] = _df['sepLFC'].rank(ascending=False, method='min').astype(int)
df_sep_ordered   = df_sep_ordered.sort_values('sepLFC', ascending=False)
df_sep_unordered = df_sep_unordered.sort_values('sepLFC', ascending=False)

print(f"Genes in ordered sepCls:   {len(set(df_sep_ordered.index)   & set(gene_set))} / {len(gene_set)}")
print(f"Genes in unordered sepCls: {len(set(df_sep_unordered.index) & set(gene_set))} / {len(gene_set)}")
# Genes (GS and non-GS) that share the gene set's separation clusters
sepL_t = tuple(int(x) for x in sep_res['sepL'])
sepH_t = tuple(int(x) for x in sep_res['sepH'])

# Ordered match: sepCls == (sepL, sepH) exactly
sepCls_ordered   = (sepL_t, sepH_t)
df_sep_ordered   = df_mode[df_mode['sepCls'] == sepCls_ordered].copy()

# Unordered match: any permutation of the two groups
sepCls_unordered = {
    (permL, permH)
    for permL in itertools.permutations(sepL_t)
    for permH in itertools.permutations(sepH_t)
}
df_sep_unordered = df_mode[df_mode['sepCls'].apply(lambda x: x in sepCls_unordered)].copy()

for _df in [df_sep_ordered, df_sep_unordered]:
    _df['rank_sepLFC'] = _df['sepLFC'].rank(ascending=False, method='min').astype(int)
df_sep_ordered   = df_sep_ordered.sort_values('sepLFC', ascending=False)
df_sep_unordered = df_sep_unordered.sort_values('sepLFC', ascending=False)

print(f"Genes in ordered sepCls:   {len(set(df_sep_ordered.index)   & set(gene_set))} / {len(gene_set)}")
print(f"Genes in unordered sepCls: {len(set(df_sep_unordered.index) & set(gene_set))} / {len(gene_set)}")

Genes in ordered sepCls:   89 / 198
Genes in unordered sepCls: 114 / 198

Visualizations¶

The plots below cover all three enrichment angles for the selected mode.

In [20]:

Copied!





# P enrichment: empirical p-value and z-score per cluster
fig, axes = plt.subplots(1, 2, figsize=(10, 4), dpi=150)
aoc.plot_P_enrichment_pval(p_res, cluster_labels, cluster_colors, ax=axes[0])
aoc.plot_P_enrichment_zscore(p_res, cluster_labels, cluster_colors, ax=axes[1])
fig.suptitle(f"{gs_title} – P enrichment", fontsize=11, weight="bold")
fig.tight_layout()
# P enrichment: empirical p-value and z-score per cluster
fig, axes = plt.subplots(1, 2, figsize=(10, 4), dpi=150)
aoc.plot_P_enrichment_pval(p_res, cluster_labels, cluster_colors, ax=axes[0])
aoc.plot_P_enrichment_zscore(p_res, cluster_labels, cluster_colors, ax=axes[1])
fig.suptitle(f"{gs_title} – P enrichment", fontsize=11, weight="bold")
fig.tight_layout()

In [21]:

Copied!





# Pairwise LFC enrichment: observed LFC (lower triangle) vs. z-score (upper triangle)
fig, ax = aoc.plot_pairwise_heatmap_bidir(
    upper_mat=lfc_res["test"]["z_mat"],
    lower_mat=lfc_res["obs_lfc"],
    upper_sig=lfc_res["test"]["q_mat"],
    lower_sig=lfc_res["test"]["q_mat"],
    upper_label="z-score",
    lower_label="obs LFC",
    labels=cluster_labels,
    title=f"{gs_title} – pairwise LFC",
)
# Pairwise LFC enrichment: observed LFC (lower triangle) vs. z-score (upper triangle)
fig, ax = aoc.plot_pairwise_heatmap_bidir(
    upper_mat=lfc_res["test"]["z_mat"],
    lower_mat=lfc_res["obs_lfc"],
    upper_sig=lfc_res["test"]["q_mat"],
    lower_sig=lfc_res["test"]["q_mat"],
    upper_label="z-score",
    lower_label="obs LFC",
    labels=cluster_labels,
    title=f"{gs_title} – pairwise LFC",
)

In [22]:

Copied!





# sepLFC null distributions: largest sepLFC in random sets vs. GS sepCls LFC
fig, axes = plt.subplots(1, 2, figsize=(10, 4), dpi=150)
aoc.plot_sepLFC_null_sep(sep_res, ax=axes[0])
aoc.plot_sepLFC_null_fixed(sep_res, cluster_labels, ax=axes[1])
fig.suptitle(f"{gs_title} – sepLFC enrichment", fontsize=11, weight="bold", y=1.02)
fig.tight_layout()
# sepLFC null distributions: largest sepLFC in random sets vs. GS sepCls LFC
fig, axes = plt.subplots(1, 2, figsize=(10, 4), dpi=150)
aoc.plot_sepLFC_null_sep(sep_res, ax=axes[0])
aoc.plot_sepLFC_null_fixed(sep_res, cluster_labels, ax=axes[1])
fig.suptitle(f"{gs_title} – sepLFC enrichment", fontsize=11, weight="bold", y=1.02)
fig.tight_layout()

In [23]:

Copied!





# Stacked-bar profile: how much of each cluster's P weight do GS genes hold?
fig, ax = aoc.plot_gene_P_stacked(
    P_gs, gene_set, cluster_labels,
    gs_title=gs_title, sort_by_sum=True, top_n=10,
)
# Stacked-bar profile: how much of each cluster's P weight do GS genes hold?
fig, ax = aoc.plot_gene_P_stacked(
    P_gs, gene_set, cluster_labels,
    gs_title=gs_title, sort_by_sum=True, top_n=10,
)

In [24]:

Copied!





# Top GS genes by cluster-specific P weight
fig, _ = aoc.plot_gene_P_bars(
    P_gs, gene_set, cluster_labels, cluster_colors, top_n=10
)
# Top GS genes by cluster-specific P weight
fig, _ = aoc.plot_gene_P_bars(
    P_gs, gene_set, cluster_labels, cluster_colors, top_n=10
)

In [25]:

Copied!





# Per-gene LFC between the gene set's separation clusters (mean across cluster group)
df_gene_lfc = aoc.compute_gene_lfc(
    P_gs, gene_set, sep_res["sepL"], sep_res["sepH"], kind="mean"
)
fig, ax = aoc.plot_gene_lfc(
    df_gene_lfc, cluster_labels,
    sep_res["sepL"], sep_res["sepH"],
    sep_res["gs_sepLFC"], cluster_colors, kind="mean", show_labels=False,
)
ax.axvline(0, color='k', linestyle='-', linewidth=0.5)
ax.axvline(sep_res["gs_sepLFC"], color='gray', linestyle='--', linewidth=0.5)
ax.text(
    sep_res["gs_sepLFC"], len(gene_set) - 1,
    f"GS sepLFC = {sep_res['gs_sepLFC']:.4f}",
    ha="left", va="top", fontsize=8, color="gray", style="italic",
)
ax.set_title("Per-gene mean LFC between GS sepCls", fontsize=11)
fig.tight_layout()
# Per-gene LFC between the gene set's separation clusters (mean across cluster group)
df_gene_lfc = aoc.compute_gene_lfc(
    P_gs, gene_set, sep_res["sepL"], sep_res["sepH"], kind="mean"
)
fig, ax = aoc.plot_gene_lfc(
    df_gene_lfc, cluster_labels,
    sep_res["sepL"], sep_res["sepH"],
    sep_res["gs_sepLFC"], cluster_colors, kind="mean", show_labels=False,
)
ax.axvline(0, color='k', linestyle='-', linewidth=0.5)
ax.axvline(sep_res["gs_sepLFC"], color='gray', linestyle='--', linewidth=0.5)
ax.text(
    sep_res["gs_sepLFC"], len(gene_set) - 1,
    f"GS sepLFC = {sep_res['gs_sepLFC']:.4f}",
    ha="left", va="top", fontsize=8, color="gray", style="italic",
)
ax.set_title("Per-gene mean LFC between GS sepCls", fontsize=11)
fig.tight_layout()

In [26]:

Copied!





# Bipartite diagram: top GS genes linking low- and high-expression clusters
fig, ax = aoc.plot_seplfc_bipartite(
    gene_set, sep_res["sepH"], sep_res["sepL"], cluster_labels,
    df_mode=df_mode,
    top_n_per_pair=5,
    gs_title=gs_title,
    colors=cluster_colors,
    cmap='cividis',
    figsize=(8, 5),
)
# Bipartite diagram: top GS genes linking low- and high-expression clusters
fig, ax = aoc.plot_seplfc_bipartite(
    gene_set, sep_res["sepH"], sep_res["sepL"], cluster_labels,
    df_mode=df_mode,
    top_n_per_pair=5,
    gs_title=gs_title,
    colors=cluster_colors,
    cmap='cividis',
    figsize=(8, 5),
)

In [27]:

Copied!





# Top GS genes by sepLFC, annotated with their separation cluster pair
fig, ax = plt.subplots(figsize=(6, 4), dpi=150)
sns.barplot(
    x='sepLFC', y='gene', data=df_sepCls,
    hue='sepCls', palette='Paired', ax=ax, legend=False,
)
existing_labels = set()
for i_gene, row in df_sepCls.iterrows():
    low_str  = "Cls." + "+".join(str(k + 1) for k in row['sepCls'][0])
    high_str = "Cls." + "+".join(str(k + 1) for k in row['sepCls'][1])
    lbl = f"[{high_str}]/[{low_str}]"
    if lbl in existing_labels:
        lbl = ""
    else:
        existing_labels.add(lbl)
    x_pos = 0.1 if (row['sepLFC'] > 10 or max(df_sepCls['sepLFC']) < 15) else row['sepLFC'] + 0.1
    ax.text(x_pos, i_gene, lbl, ha='left', va='center', fontsize=8)
ax.set_title("Top GS genes by sepLFC", fontsize=11)
fig.tight_layout()
# Top GS genes by sepLFC, annotated with their separation cluster pair
fig, ax = plt.subplots(figsize=(6, 4), dpi=150)
sns.barplot(
    x='sepLFC', y='gene', data=df_sepCls,
    hue='sepCls', palette='Paired', ax=ax, legend=False,
)
existing_labels = set()
for i_gene, row in df_sepCls.iterrows():
    low_str  = "Cls." + "+".join(str(k + 1) for k in row['sepCls'][0])
    high_str = "Cls." + "+".join(str(k + 1) for k in row['sepCls'][1])
    lbl = f"[{high_str}]/[{low_str}]"
    if lbl in existing_labels:
        lbl = ""
    else:
        existing_labels.add(lbl)
    x_pos = 0.1 if (row['sepLFC'] > 10 or max(df_sepCls['sepLFC']) < 15) else row['sepLFC'] + 0.1
    ax.text(x_pos, i_gene, lbl, ha='left', va='center', fontsize=8)
ax.set_title("Top GS genes by sepLFC", fontsize=11)
fig.tight_layout()

In [28]:

Copied!





# sepLFC distribution across genes sharing the GS separation clusters (unordered)
fig, ax = plt.subplots(figsize=(7, 4), dpi=150)
aoc.plot_sepLFC_distribution(
    df_sep_unordered, gene_set,
    title="", kind="auto",
    show_gs_textbox=True, gs_textbox_threshold=10, n_gs_textbox=10,
    show_non_gs_textbox=True, n_non_gs_textbox=10,
    ax=ax,
)
n_in = len(set(df_sep_unordered.index) & set(gene_set))
ax.set_title(
    f"sepLFC distribution – {n_in} / {len(gene_set)} GS genes in unordered sepCls",
    fontsize=11,
)
fig.tight_layout()
# sepLFC distribution across genes sharing the GS separation clusters (unordered)
fig, ax = plt.subplots(figsize=(7, 4), dpi=150)
aoc.plot_sepLFC_distribution(
    df_sep_unordered, gene_set,
    title="", kind="auto",
    show_gs_textbox=True, gs_textbox_threshold=10, n_gs_textbox=10,
    show_non_gs_textbox=True, n_non_gs_textbox=10,
    ax=ax,
)
n_in = len(set(df_sep_unordered.index) & set(gene_set))
ax.set_title(
    f"sepLFC distribution – {n_in} / {len(gene_set)} GS genes in unordered sepCls",
    fontsize=11,
)
fig.tight_layout()

Cross-Mode Analysis¶

Running enrichment across all subset modes reveals whether the gene set's signal is consistent across values of K and across alternative modes at the same K. run_gs_enrichment batches all three tests for every mode in one call.

In [29]:

Copied!

res_by_mode = aoc.run_gs_enrichment(
    sub_full, gene_set_indices, sub_full.modes, n_perm, rng_seed
)
res_by_mode = aoc.run_gs_enrichment(
    sub_full, gene_set_indices, sub_full.modes, n_perm, rng_seed
)

In [30]:

Copied!





# P enrichment p-values across all subset modes, grouped by K
fig, ax_by_mode = make_mode_grid_by_K(sub_full)
aoc.plot_P_enrichment_grid(res_by_mode, ax_by_mode, sub_full, cb_cmap, kind="pval")
fig.suptitle(f"{gene_set_name} – P enrichment across modes", fontsize=11, weight="bold")
fig.tight_layout()
# P enrichment p-values across all subset modes, grouped by K
fig, ax_by_mode = make_mode_grid_by_K(sub_full)
aoc.plot_P_enrichment_grid(res_by_mode, ax_by_mode, sub_full, cb_cmap, kind="pval")
fig.suptitle(f"{gene_set_name} – P enrichment across modes", fontsize=11, weight="bold")
fig.tight_layout()

In [31]:

Copied!





# Pairwise LFC z-scores across all subset modes
fig, ax_by_mode = make_mode_grid_by_K(sub_full)
aoc.plot_LFC_enrichment_grid(res_by_mode, ax_by_mode, sub_full, cb_cmap)
fig.suptitle(f"{gene_set_name} – pairwise LFC z-score across modes", fontsize=11, weight="bold")
fig.tight_layout()
# Pairwise LFC z-scores across all subset modes
fig, ax_by_mode = make_mode_grid_by_K(sub_full)
aoc.plot_LFC_enrichment_grid(res_by_mode, ax_by_mode, sub_full, cb_cmap)
fig.suptitle(f"{gene_set_name} – pairwise LFC z-score across modes", fontsize=11, weight="bold")
fig.tight_layout()

In [32]:

Copied!





# sepLFC enrichment across all subset modes
fig, ax_by_mode = make_mode_grid_by_K(sub_full)
aoc.plot_sepLFC_enrichment_grid(
    res_by_mode, ax_by_mode, sub_full, cb_cmap, kind="null_sep"
)
fig.suptitle(f"{gene_set_name} – sepLFC enrichment across modes", fontsize=11, weight="bold")
fig.tight_layout()
# sepLFC enrichment across all subset modes
fig, ax_by_mode = make_mode_grid_by_K(sub_full)
aoc.plot_sepLFC_enrichment_grid(
    res_by_mode, ax_by_mode, sub_full, cb_cmap, kind="null_sep"
)
fig.suptitle(f"{gene_set_name} – sepLFC enrichment across modes", fontsize=11, weight="bold")
fig.tight_layout()

In [33]:

Copied!





# P enrichment by cluster
fig1, _ = aoc.plot_P_enrichment_by_cluster(
    res_by_mode, sub_full, cb_cmap, kind='pval', ncols=4, 
    sig_threshold=0.05)
fig1.tight_layout()
fig1.suptitle('P enrichment by cluster across modes (p-values)',
               fontsize=12, weight='bold', y=1.01)
pass
# P enrichment by cluster
fig1, _ = aoc.plot_P_enrichment_by_cluster(
    res_by_mode, sub_full, cb_cmap, kind='pval', ncols=4, 
    sig_threshold=0.05)
fig1.tight_layout()
fig1.suptitle('P enrichment by cluster across modes (p-values)',
               fontsize=12, weight='bold', y=1.01)
pass

In [36]:

Copied!





# sepLFC distribution heatmaps for all modes, annotated with p-values (largest separation)
fig2, _ = aoc.plot_sepLFC_distribution_heatmap(
    res_by_mode, n_bins=60, kind='null_sep', pval_threshold=0.05, annotate_pval=True)

fig2.suptitle('Largest sepLFC',
            fontsize=12, weight='bold', x=0.45, y=1.01)
# sepLFC distribution heatmaps for all modes, annotated with p-values (largest separation)
fig2, _ = aoc.plot_sepLFC_distribution_heatmap(
    res_by_mode, n_bins=60, kind='null_sep', pval_threshold=0.05, annotate_pval=True)

fig2.suptitle('Largest sepLFC',
            fontsize=12, weight='bold', x=0.45, y=1.01)

Out[36]:

Text(0.45, 1.01, 'Largest sepLFC')

In [35]:

Copied!





# sepLFC distribution heatmaps for all modes, annotated with p-values (fixed clusters)
fig3, _ = aoc.plot_sepLFC_distribution_heatmap(
    res_by_mode, n_bins=60, kind='null_fixed', pval_threshold=0.05, annotate_pval=True)

fig3.suptitle('LFC across GS sepCls',
               fontsize=12, weight='bold', x=0.45, y=1.01)
# sepLFC distribution heatmaps for all modes, annotated with p-values (fixed clusters)
fig3, _ = aoc.plot_sepLFC_distribution_heatmap(
    res_by_mode, n_bins=60, kind='null_fixed', pval_threshold=0.05, annotate_pval=True)

fig3.suptitle('LFC across GS sepCls',
               fontsize=12, weight='bold', x=0.45, y=1.01)

Out[35]:

Text(0.45, 1.01, 'LFC across GS sepCls')