Analysis

analysis.py

Functions for organizing, processing, and analyzing clumppling results, including core profile/feature metrics and alignment mapping.

Functions

analyze_sep_genes(df_mode, sepH, sepL, gene_set, top_n=10)

Summarise which genes in gene_set match a given (sepH, sepL) split.

Prints the number of genes in df_mode whose sepCls matches the split exactly (ordered) and as an unordered set, then returns a DataFrame of the top top_n genes from gene_set ranked by sepLFC.

Parameters:

Name Type Description Default
df_mode DataFrame

Feature-metrics DataFrame indexed by gene name (from compute_feature_metrics / compute_all_feature_metrics).

required
sepH sequence of int

Cluster indices in the high group. Accepts lists, tuples, or arrays.

required
sepL sequence of int

Cluster indices in the low group. Accepts lists, tuples, or arrays.

required
gene_set list of str

Gene names to filter and rank.

required
top_n int

Number of top genes to return. Default 10.

10

Returns:

Type Description
DataFrame

Rows for the top top_n matching genes, columns ['sepLFC', 'sepCls', 'weighted_Psum'].
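
The two kinds of match can be illustrated with a small pandas sketch. This is not the library's code: `summarize_split` is a hypothetical name, and the sketch assumes that each `sepCls` entry is stored as a `(high, low)` pair of tuples, that "unordered" means the order of cluster indices within each group is ignored, and that the returned rows are the genes in `gene_set` whose split matches as an unordered set.

```python
import pandas as pd


def summarize_split(df_mode, sepH, sepL, gene_set, top_n=10):
    """Hypothetical sketch of the matching logic; all conventions are assumptions."""
    high = tuple(int(i) for i in sepH)
    low = tuple(int(i) for i in sepL)

    # Ordered match: both groups equal the target, index order included.
    ordered = df_mode["sepCls"].apply(
        lambda s: (tuple(s[0]), tuple(s[1])) == (high, low)
    )
    # Unordered match: compare each group as a set of cluster indices.
    unordered = df_mode["sepCls"].apply(
        lambda s: (frozenset(s[0]), frozenset(s[1]))
        == (frozenset(high), frozenset(low))
    )
    print(f"ordered: {int(ordered.sum())}, unordered: {int(unordered.sum())}")

    # Keep genes from gene_set with an unordered match, ranked by sepLFC.
    in_set = df_mode.index.isin(list(gene_set))
    cols = ["sepLFC", "sepCls", "weighted_Psum"]
    matched = df_mode.loc[unordered.to_numpy() & in_set, cols]
    return matched.sort_values("sepLFC", ascending=False).head(top_n)


# Toy feature-metrics table (made-up values).
df = pd.DataFrame(
    {
        "sepLFC": [2.0, 1.0, 3.0, 0.5],
        "sepCls": [((0, 2), (1,)), ((2, 0), (1,)), ((1,), (0, 2)), ((0, 2), (1,))],
        "weighted_Psum": [0.1, 0.2, 0.3, 0.4],
    },
    index=["g1", "g2", "g3", "g4"],
)
top = summarize_split(df, [0, 2], [1], ["g1", "g2", "g3"], top_n=2)
```

In this toy table, `g3` has the two groups swapped and therefore does not match, while `g4` matches the split but is not in `gene_set`.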

compute_all_feature_metrics(results, feature_names)

Compute feature-level metrics (weighted_Psum, sepLFC, sepCls) for all modes.

Parameters:

Name Type Description Default
results ClumpplingResults

Must have P_aligned_by_mode populated (i.e. load_clumppling_results was called with cls_dir=...).

required
feature_names sequence of str

Names for each row of P (e.g. gene IDs/symbols).

required

Returns:

Name Type Description
df_by_mode dict

{mode_name -> DataFrame as returned by compute_feature_metrics}

compute_feature_metrics(P, Q, feature_names)

Compute sepLFC, sepCls, and weighted_Psum for a single mode.

Parameters:

Name Type Description Default
P array, shape (n_features, K)

Feature-by-cluster loadings.

required
Q array, shape (n_cells, K)

Cell-by-cluster memberships (aligned).

required
feature_names sequence of str

Names for each row of P (e.g. gene IDs/symbols). Must have length equal to P.shape[0].

required

Returns:

Name Type Description
df DataFrame

Index = feature_names; columns = ["weighted_Psum", "sepLFC", "sepCls"].

compute_profile(P)

Compute the clustering profile for a feature-level P matrix.

Parameters:

Name Type Description Default
P array-like, shape (M, K)

Per-feature values over clusters (e.g. log-P or scores).

required

Returns:

Name Type Description
LFC_sorted (M, K - 1)

Log2 fold-changes between consecutive sorted values per feature.

idx_sorted (M, K)

Indices of clusters sorted per feature (ascending).

compute_profile_unnorm(P)

Sort cluster values and compute log2 ratios between consecutive sorted entries.

Unlike compute_profile, this version does not normalise P before sorting, making it suitable for operating directly on mean loading vectors rather than per-feature rows of a full P matrix.

Parameters:

Name Type Description Default
P ndarray, shape (M, K)

Per-row values over K clusters (e.g. null mean loading vectors).

required

Returns:

Name Type Description
LFC_sorted ndarray, shape (M, K - 1)

Log2 fold-changes between consecutive sorted values per row.

idx_sorted ndarray, shape (M, K)

Column indices that sort each row in ascending order.
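
As a rough illustration of the sort-then-ratio step described above, the following NumPy sketch reproduces the documented output shapes. `sorted_log2_ratios` is a hypothetical name, and the sketch assumes strictly positive inputs; whether the library adds a pseudocount or handles zeros differently is not stated here.

```python
import numpy as np


def sorted_log2_ratios(P):
    """Hypothetical sketch: per-row ascending sort, then log2 ratios of neighbours."""
    P = np.asarray(P, dtype=float)
    # Column indices that sort each row in ascending order, shape (M, K).
    idx_sorted = np.argsort(P, axis=1)
    P_sorted = np.take_along_axis(P, idx_sorted, axis=1)
    # log2 of each sorted value over its predecessor, shape (M, K - 1).
    # Assumes all values are > 0.
    LFC_sorted = np.log2(P_sorted[:, 1:] / P_sorted[:, :-1])
    return LFC_sorted, idx_sorted


P = np.array([[4.0, 1.0, 2.0],
              [2.0, 8.0, 1.0]])
LFC, idx = sorted_log2_ratios(P)
# LFC -> [[1., 1.], [1., 2.]]
# idx -> [[1, 2, 0], [2, 0, 1]]
```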

compute_weighted_Psum(P, Q)

Compute a weighted sum of P across clusters, using cluster weights from Q.

Parameters:

Name Type Description Default
P ndarray

Feature-by-cluster loadings.

required
Q ndarray

Cell-by-cluster memberships (aligned).

required

Returns:

Name Type Description
weighted_Psum ndarray

Weighted sum of P across clusters.
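
The exact weighting is not spelled out above. One plausible reading, shown here purely as an assumption, is that each cluster's weight is its mean membership across cells, so that the result is one value per feature. `weighted_psum_sketch` is a hypothetical name.

```python
import numpy as np


def weighted_psum_sketch(P, Q):
    """Hypothetical sketch: weight clusters by mean membership (an assumption)."""
    P = np.asarray(P, dtype=float)  # (n_features, K)
    Q = np.asarray(Q, dtype=float)  # (n_cells, K)
    weights = Q.mean(axis=0)        # (K,), assumed cluster weights
    return P @ weights              # (n_features,)


P = np.array([[1.0, 0.0],
              [0.5, 0.5]])
Q = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])
wps = weighted_psum_sketch(P, Q)
# Cluster weights are [0.75, 0.25], so wps -> [0.75, 0.5]
```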

get_mode_pair_mappings(mode_names, all_modes_alignment, alignment_acrossK)

For each pair of modes (A, B), compute the mapping from clusters in B to clusters in A, in aligned column space, using paths through intermediate modes.

Parameters:

Name Type Description Default
mode_names list of str

Modes you care about (e.g. sorted list of all_modes_alignment.keys()).

required
all_modes_alignment dict

{mode_name -> reordering}, where reordering is the alignment pattern used for columns in that mode (same object you indexed in your plots).

required
alignment_acrossK dict

{"A-B" -> mapping}, where for key "A-B", mapping[k_B] = k_A maps original cluster index in mode B to original index in mode A.

required

Returns:

Name Type Description
pair_mappings dict

{"A-B": [(col_idx_in_A, col_idx_in_B), ...], ...}. All indices are in the current aligned column order (after alignment), i.e. the x-axis column indices in your plots. For each pair, the mapping goes from clusters of B to clusters of A.

get_sepLFC(LFC_sorted, idx_sorted)

Compute sepLFC and sepCls from clustering profile.

Parameters:

Name Type Description Default
LFC_sorted ndarray

Log2 fold-changes between consecutive sorted values per feature.

required
idx_sorted ndarray

Indices of clusters sorted per feature (ascending).

required

Returns:

Name Type Description
sepLFC ndarray

Maximum log2 fold-change per feature.

sepCls list of tuples

Each tuple contains two tuples representing the indices of clusters on each side of the maximum gap, in original cluster indices.
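
A minimal sketch of how the largest gap and the two cluster groups could be derived from a profile. `largest_gap_split` is a hypothetical name, and the `(high, low)` ordering of the two groups within each `sepCls` entry is an assumption; the documentation above does not state which side comes first.

```python
import numpy as np


def largest_gap_split(LFC_sorted, idx_sorted):
    """Hypothetical sketch: find the largest gap and split clusters around it."""
    LFC_sorted = np.asarray(LFC_sorted, dtype=float)
    idx_sorted = np.asarray(idx_sorted)
    # Position of the largest consecutive log2 fold-change in each row.
    gap = np.argmax(LFC_sorted, axis=1)
    sepLFC = LFC_sorted[np.arange(len(gap)), gap]
    sepCls = []
    for row, g in zip(idx_sorted, gap):
        low = tuple(int(i) for i in row[: g + 1])   # clusters below the gap
        high = tuple(int(i) for i in row[g + 1:])   # clusters above the gap
        sepCls.append((high, low))                  # assumed ordering
    return sepLFC, sepCls


LFC = np.array([[1.0, 2.0],
                [3.0, 0.5]])
idx = np.array([[1, 2, 0],
                [2, 0, 1]])
sep_lfc, sep_cls = largest_gap_split(LFC, idx)
# sep_lfc -> [2., 3.]
# sep_cls -> [((0,), (1, 2)), ((0, 1), (2,))]
```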

map_alt_to_ref(ref_Q, alt_Q, pair_mapping)

Map alt_Q into ref_Q space using pair_mapping.

Parameters:

Name Type Description Default
ref_Q ndarray

Reference membership matrix (n_cells, ref_K).

required
alt_Q ndarray

Alternative membership matrix (n_cells, alt_K), where ref_K <= alt_K.

required
pair_mapping Sequence[Tuple[int, int]]

Mapping pairs (i_ref, j_alt) indicating how clusters in alt_Q map to clusters in ref_Q.

required

Returns:

Name Type Description
alt_Q_mapped ndarray

Mapped alternative membership matrix (n_cells, ref_K).

diff_Q ndarray

Absolute difference between ref_Q and alt_Q_mapped.
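
One way to picture the mapping is as a column collapse. The sketch below assumes that alternative clusters mapped to the same reference cluster have their memberships summed. `collapse_alt_to_ref` is a hypothetical name and the library may combine columns differently.

```python
import numpy as np


def collapse_alt_to_ref(ref_Q, alt_Q, pair_mapping):
    """Hypothetical sketch: sum alt columns into their mapped ref columns."""
    ref_Q = np.asarray(ref_Q, dtype=float)
    alt_Q = np.asarray(alt_Q, dtype=float)
    alt_Q_mapped = np.zeros_like(ref_Q)
    for i_ref, j_alt in pair_mapping:
        # Assumption: memberships of alt clusters sharing a ref cluster add up.
        alt_Q_mapped[:, i_ref] += alt_Q[:, j_alt]
    diff_Q = np.abs(ref_Q - alt_Q_mapped)
    return alt_Q_mapped, diff_Q


ref_Q = np.array([[0.6, 0.4],
                  [0.1, 0.9]])
alt_Q = np.array([[0.5, 0.2, 0.3],
                  [0.1, 0.3, 0.6]])
mapped, diff = collapse_alt_to_ref(ref_Q, alt_Q, [(0, 0), (1, 1), (1, 2)])
# mapped is approximately [[0.5, 0.5], [0.1, 0.9]]
```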

select_top_features(df_by_mode, top_quantile=0.1)

For each mode, select the top features by weighted_Psum, along with their other metrics.

Parameters:

Name Type Description Default
df_by_mode mapping

{mode_name -> feature metrics DataFrame}.

required
top_quantile float

Features with weighted_Psum above the (1 - top_quantile) quantile are kept, i.e. the top top_quantile fraction (the top 10% with the default of 0.1).

0.1

Returns:

Name Type Description
selected_by_mode dict

{mode_name -> DataFrame of selected features, with columns suffixed by f"_{mode_name}"}.

df_selected_all DataFrame

Inner join of all per-mode selected DataFrames.

overlap set

Set of feature names present in all per-mode selections.
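
The selection and join can be sketched with pandas as follows. `select_top_sketch` and the mode names `K3M1` and `K4M1` are hypothetical, and the sketch assumes the threshold is the `1 - top_quantile` quantile of `weighted_Psum` with a strict comparison. The toy tables carry only the `weighted_Psum` column for brevity.

```python
import pandas as pd


def select_top_sketch(df_by_mode, top_quantile=0.1):
    """Hypothetical sketch: per-mode quantile filter, then inner join."""
    selected_by_mode = {}
    for mode, df in df_by_mode.items():
        # Assumed threshold: the (1 - top_quantile) quantile of weighted_Psum.
        threshold = df["weighted_Psum"].quantile(1 - top_quantile)
        top = df[df["weighted_Psum"] > threshold]
        selected_by_mode[mode] = top.add_suffix(f"_{mode}")
    # Inner join on the feature index keeps features selected in every mode.
    df_selected_all = pd.concat(
        list(selected_by_mode.values()), axis=1, join="inner"
    )
    overlap = set(df_selected_all.index)
    return selected_by_mode, df_selected_all, overlap


genes = [f"g{i}" for i in range(1, 11)]
df_a = pd.DataFrame({"weighted_Psum": [float(v) for v in range(1, 11)]}, index=genes)
df_b = pd.DataFrame(
    {"weighted_Psum": [9.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0]},
    index=genes,
)
selected, joined, overlap = select_top_sketch(
    {"K3M1": df_a, "K4M1": df_b}, top_quantile=0.2
)
# K3M1 keeps g9 and g10, K4M1 keeps g1 and g10, so only g10 survives the join.
```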

subset_results(results, modes_subset)

Return a new ClumpplingResults object containing only a subset of modes.

Parameters:

Name Type Description Default
results ClumpplingResults

Original full results.

required
modes_subset sequence of str

Mode names to keep (must exist in results.Q_by_mode).

required

Returns:

Name Type Description
subset ClumpplingResults

New object with the same fields as the original, but restricted to the selected modes.