Analysis
analysis.py
Functions for organizing, processing, and analyzing clumppling results, including core profile/feature metrics and alignment mapping.
Classes
Functions
analyze_sep_genes(df_mode, sepH, sepL, gene_set, top_n=10)
Summarise which genes in gene_set match a given (sepH, sepL) split.
Prints the number of genes in df_mode whose sepCls matches the split
exactly (ordered) and as an unordered set, then returns a DataFrame of the
top top_n genes from gene_set ranked by sepLFC.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_mode | DataFrame | Feature-metrics DataFrame indexed by gene name (from | required |
| sepH | sequence of int | Cluster indices in the high group. Accepts lists, tuples, or arrays. | required |
| sepL | sequence of int | Cluster indices in the low group. Accepts lists, tuples, or arrays. | required |
| gene_set | list of str | Gene names to filter and rank. | required |
| top_n | int | Number of top genes to return. | 10 |
Returns:
| Type | Description |
|---|---|
| DataFrame | Rows for the top top_n matching genes, columns |
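The difference between an ordered and an unordered match can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `sketch_sep_match` is hypothetical, and it assumes that each `sepCls` entry is a `(high_clusters, low_clusters)` pair of tuples and that only genes matching as unordered sets are ranked.

```python
import pandas as pd


def sketch_sep_match(df_mode, sepH, sepL, gene_set, top_n=10):
    """Hypothetical illustration of the matching logic, not the library code."""
    # Assumption: each sepCls entry is a (high_clusters, low_clusters) pair.
    ordered_target = (tuple(sepH), tuple(sepL))
    unordered_target = (frozenset(sepH), frozenset(sepL))

    # Ordered match: same cluster indices in the same order on both sides.
    ordered = df_mode["sepCls"].apply(
        lambda s: (tuple(s[0]), tuple(s[1])) == ordered_target
    )
    # Unordered match: same cluster indices on each side, order ignored.
    unordered = df_mode["sepCls"].apply(
        lambda s: (frozenset(s[0]), frozenset(s[1])) == unordered_target
    )
    n_ordered, n_unordered = int(ordered.sum()), int(unordered.sum())
    print(f"ordered matches: {n_ordered}, unordered matches: {n_unordered}")

    # Assumption: only gene_set members that match as unordered sets are ranked.
    keep = unordered & df_mode.index.isin(gene_set)
    top = df_mode[keep].sort_values("sepLFC", ascending=False).head(top_n)
    return top, n_ordered, n_unordered


# Toy feature-metrics table with made-up values.
df_mode = pd.DataFrame(
    {
        "sepLFC": [3.0, 1.0, 2.0, 5.0],
        "sepCls": [((2, 0), (1,)), ((0, 2), (1,)), ((1,), (0, 2)), ((0, 2), (1,))],
    },
    index=["g1", "g2", "g3", "g4"],
)
top, n_ordered, n_unordered = sketch_sep_match(
    df_mode, sepH=[0, 2], sepL=[1], gene_set=["g1", "g2", "g3"]
)
```

In this toy table, `g1` stores its high group as `(2, 0)`, so it counts as an unordered match but not an ordered one.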
compute_all_feature_metrics(results, feature_names)
Compute feature-level metrics (weighted_Psum, sepLFC, sepCls) for all modes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | ClumpplingResults | Must have P_aligned_by_mode populated (i.e. load_clumppling_results was called with cls_dir=...). | required |
| feature_names | sequence of str | Names for each row of P (e.g. gene IDs/symbols). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| df_by_mode | dict | {mode_name -> DataFrame as returned by compute_feature_metrics} |
compute_feature_metrics(P, Q, feature_names)
Compute sepLFC, sepCls, and weighted_Psum for a single mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| P | array, shape (n_features, K) | Feature-by-cluster loadings. | required |
| Q | array, shape (n_cells, K) | Cell-by-cluster memberships (aligned). | required |
| feature_names | sequence of str | Names for each row of P (e.g. gene IDs/symbols). Must have length equal to P.shape[0]. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | Index = feature_names; columns = ["weighted_Psum", "sepLFC", "sepCls"]. |
compute_profile(P)
Compute clustering profile for feature-level P.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| P | array-like, shape (M, K) | Per-feature values over clusters (e.g. log-P or scores). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| LFC_sorted | (M, K - 1) | Log2 fold-changes between consecutive sorted values per feature. |
| idx_sorted | (M, K) | Indices of clusters sorted per feature (ascending). |
compute_profile_unnorm(P)
Sort cluster values and compute log2 ratios between consecutive sorted entries.
Unlike compute_profile, this version does not normalise P before
sorting, making it suitable for operating directly on mean loading vectors
rather than per-feature rows of a full P matrix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| P | ndarray, shape (M, K) | Per-row values over K clusters (e.g. null mean loading vectors). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| LFC_sorted | ndarray, shape (M, K - 1) | Log2 fold-changes between consecutive sorted values per row. |
| idx_sorted | ndarray, shape (M, K) | Column indices that sort each row in ascending order. |
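The sort-then-ratio step described above can be sketched with NumPy. This is a minimal sketch, not the library's code: the name `sketch_profile_unnorm` is hypothetical, and it assumes every entry of P is strictly positive so that the log2 ratios are finite.

```python
import numpy as np


def sketch_profile_unnorm(P):
    """Hypothetical illustration: sort each row, then take log2 ratios."""
    P = np.asarray(P, dtype=float)
    # Column indices that sort each row in ascending order.
    idx_sorted = np.argsort(P, axis=1)
    P_sorted = np.take_along_axis(P, idx_sorted, axis=1)
    # Log2 ratio between each sorted value and the one before it.
    # Assumption: all values are strictly positive.
    LFC_sorted = np.log2(P_sorted[:, 1:] / P_sorted[:, :-1])
    return LFC_sorted, idx_sorted


P = np.array([[4.0, 1.0, 2.0],
              [0.5, 8.0, 2.0]])
LFC_sorted, idx_sorted = sketch_profile_unnorm(P)
# idx_sorted → [[1, 2, 0], [0, 2, 1]]
# LFC_sorted → [[1.0, 1.0], [2.0, 2.0]]
```

Each row of `LFC_sorted` has K - 1 entries because there is one ratio per pair of neighbouring sorted values.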
compute_weighted_Psum(P, Q)
Compute a weighted sum of P across clusters, using cluster weights from Q.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| P | ndarray | Feature-by-cluster loadings. | required |
| Q | ndarray | Cell-by-cluster memberships (aligned). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| weighted_Psum | ndarray | Weighted sum of P across clusters. |
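How the cluster weights are derived from Q is not stated here, so the following is only one plausible reading: it assumes each cluster's weight is its mean membership across cells. The name `sketch_weighted_Psum` is hypothetical.

```python
import numpy as np


def sketch_weighted_Psum(P, Q):
    """Hypothetical illustration; the weighting scheme is an assumption."""
    P = np.asarray(P, dtype=float)   # shape (n_features, K)
    Q = np.asarray(Q, dtype=float)   # shape (n_cells, K)
    # Assumption: a cluster's weight is its average membership over all cells.
    weights = Q.mean(axis=0)         # shape (K,)
    # Weighted sum over clusters gives one value per feature.
    return P @ weights


P = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 0.0]])
weighted = sketch_weighted_Psum(P, Q)
# weights are [0.75, 0.25], so weighted → [1.25, 3.25]
```

Under this assumption, features loaded on clusters that hold more cells receive a larger weighted sum.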
get_mode_pair_mappings(mode_names, all_modes_alignment, alignment_acrossK)
For each pair of modes (A, B), compute the mapping from clusters in B to clusters in A, in aligned column space, using paths through intermediate modes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mode_names | list of str | Modes you care about (e.g. sorted list of all_modes_alignment.keys()). | required |
| all_modes_alignment | dict | {mode_name -> reordering}, where | required |
| alignment_acrossK | dict | {"A-B" -> mapping}, where for key "A-B", | required |
Returns:
| Name | Type | Description |
|---|---|---|
| pair_mappings | dict | { "A-B": [(col_idx_in_A, col_idx_in_B), ...], ... } All indices are in the current aligned column order (after alignment), i.e. x-axis column indices in your plots. For each pair, the mapping is from clusters of B → clusters of A. |
get_sepLFC(LFC_sorted, idx_sorted)
Compute sepLFC and sepCls from clustering profile.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| LFC_sorted | ndarray | Log2 fold-changes between consecutive sorted values per feature. | required |
| idx_sorted | ndarray | Indices of clusters sorted per feature (ascending). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sepLFC | ndarray | Maximum log2 fold-change per feature. |
| sepCls | list of tuples | Each tuple contains two tuples representing the indices of clusters on each side of the maximum gap, in original cluster indices. |
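The split at the largest gap can be sketched as below. This is a minimal illustration rather than the library's code: the name `sketch_sepLFC` is hypothetical, and the `(high, low)` ordering of each pair and the sorting of indices within each side are assumptions, since the order is not specified above.

```python
import numpy as np


def sketch_sepLFC(LFC_sorted, idx_sorted):
    """Hypothetical illustration of splitting clusters at the largest gap."""
    LFC_sorted = np.asarray(LFC_sorted, dtype=float)
    idx_sorted = np.asarray(idx_sorted)
    # Position of the largest consecutive log2 fold-change in each row.
    gap = np.argmax(LFC_sorted, axis=1)
    sepLFC = LFC_sorted[np.arange(len(gap)), gap]

    sepCls = []
    for row, g in zip(idx_sorted, gap):
        # Clusters up to the gap are the low side, the rest the high side.
        low = tuple(sorted(int(i) for i in row[: g + 1]))
        high = tuple(sorted(int(i) for i in row[g + 1:]))
        # Assumption: each pair is stored as (high, low).
        sepCls.append((high, low))
    return sepLFC, sepCls


LFC_sorted = np.array([[1.0, 3.0],
                       [2.0, 0.5]])
idx_sorted = np.array([[1, 2, 0],
                       [0, 2, 1]])
sepLFC, sepCls = sketch_sepLFC(LFC_sorted, idx_sorted)
# sepLFC → [3.0, 2.0]
# sepCls → [((0,), (1, 2)), ((1, 2), (0,))]
```

Because `idx_sorted` holds original cluster indices, the resulting groups refer to the columns of P and not to sorted positions.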
map_alt_to_ref(ref_Q, alt_Q, pair_mapping)
Map alt_Q into ref_Q space using pair_mapping.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref_Q | ndarray | Reference membership matrix (n_cells, ref_K). | required |
| alt_Q | ndarray | Alternative membership matrix (n_cells, alt_K), where ref_K <= alt_K. | required |
| pair_mapping | Sequence[Tuple[int, int]] | Mapping pairs (i_ref, j_alt) indicating how clusters in alt_Q map to clusters in ref_Q. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| alt_Q_mapped | ndarray | Mapped alternative membership matrix (n_cells, ref_K). |
| diff_Q | ndarray | Absolute difference between ref_Q and alt_Q_mapped. |
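One way to read this mapping, given that ref_K <= alt_K, is that several alt clusters can collapse onto one reference cluster. The sketch below assumes their memberships are summed. That aggregation rule and the name `sketch_map_alt_to_ref` are assumptions, not confirmed behaviour.

```python
import numpy as np


def sketch_map_alt_to_ref(ref_Q, alt_Q, pair_mapping):
    """Hypothetical illustration: fold alt clusters into reference clusters."""
    ref_Q = np.asarray(ref_Q, dtype=float)
    alt_Q = np.asarray(alt_Q, dtype=float)
    alt_Q_mapped = np.zeros_like(ref_Q)
    for i_ref, j_alt in pair_mapping:
        # Assumption: alt clusters sharing a reference cluster are summed.
        alt_Q_mapped[:, i_ref] += alt_Q[:, j_alt]
    diff_Q = np.abs(ref_Q - alt_Q_mapped)
    return alt_Q_mapped, diff_Q


ref_Q = np.array([[0.6, 0.4],
                  [0.1, 0.9]])
alt_Q = np.array([[0.5, 0.2, 0.3],
                  [0.1, 0.3, 0.6]])
# Alt clusters 1 and 2 both map onto reference cluster 1.
pair_mapping = [(0, 0), (1, 1), (1, 2)]
alt_Q_mapped, diff_Q = sketch_map_alt_to_ref(ref_Q, alt_Q, pair_mapping)
```

With summation, each row of `alt_Q_mapped` still sums to 1 whenever every alt cluster appears exactly once in `pair_mapping`.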
select_top_features(df_by_mode, top_quantile=0.1)
For each mode, select the top features by weighted_Psum, then combine the per-mode selections and report the features common to all modes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_by_mode | mapping | {mode_name -> feature metrics DataFrame}. | required |
| top_quantile | float | We keep features with weighted_Psum above this upper quantile (i.e. the top (1 - top_quantile) fraction). | 0.1 |
Returns:
| Name | Type | Description |
|---|---|---|
| selected_by_mode | dict | {mode_name -> DataFrame of selected features, with columns suffixed by f"_{mode_name}"}. |
| df_selected_all | DataFrame | Inner join of all per-mode selections. |
| overlap | set | Set of feature names present in all per-mode selections. |
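The three return values can be illustrated with pandas. This is a hedged sketch, not the library's code: the name `sketch_select_top` is hypothetical, and it assumes the cut-off is the (1 - top_quantile) quantile of weighted_Psum, so that the default of 0.1 keeps roughly the top 10% of features. The parameter description can also be read the other way, in which case the threshold would be `quantile(top_quantile)`.

```python
import pandas as pd


def sketch_select_top(df_by_mode, top_quantile=0.1):
    """Hypothetical illustration of per-mode selection and overlap."""
    selected_by_mode = {}
    for mode, df in df_by_mode.items():
        # Assumption: threshold at the (1 - top_quantile) quantile level.
        thr = df["weighted_Psum"].quantile(1 - top_quantile)
        sel = df[df["weighted_Psum"] > thr]
        # Suffix columns so that modes stay distinguishable after joining.
        selected_by_mode[mode] = sel.add_suffix(f"_{mode}")

    # Inner join on the index keeps only features selected in every mode.
    df_selected_all = pd.concat(
        list(selected_by_mode.values()), axis=1, join="inner"
    )
    overlap = set(df_selected_all.index)
    return selected_by_mode, df_selected_all, overlap


genes = ["g1", "g2", "g3", "g4", "g5"]
df_by_mode = {
    "A": pd.DataFrame({"weighted_Psum": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=genes),
    "B": pd.DataFrame({"weighted_Psum": [5.0, 1.0, 2.0, 3.0, 4.0]}, index=genes),
}
selected_by_mode, df_selected_all, overlap = sketch_select_top(
    df_by_mode, top_quantile=0.4
)
# Mode A keeps g4 and g5, mode B keeps g1 and g5, so overlap → {"g5"}
```

The inner join is what makes `overlap` and the index of `df_selected_all` coincide.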
subset_results(results, modes_subset)
Return a new ClumpplingResults object containing only a subset of modes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | ClumpplingResults | Original full results. | required |
| modes_subset | sequence of str | Mode names to keep (must exist in results.Q_by_mode). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| subset | ClumpplingResults | New object with the same fields as the original, but restricted to the selected modes. |