I/O
io.py
Functions for loading clumppling's main and compModel outputs, as well as other related data.
Classes
ClumpplingResults
dataclass
Container for all core clumppling outputs needed for analysis/plots.
Attributes:
| Name | Type | Description |
|---|---|---|
| `align_dir` | `Path` | Directory containing clumppling outputs (e.g. output/clumppling/pbmc10k-tutorial_hc_output). |
| `suffix` | `str` | Suffix used in aligned Q filenames (e.g. "rep" or "avg"). |
| `mode_alignment` | `DataFrame` | DataFrame loaded from 'mode_alignment.txt'. |
| `mode_stats` | `DataFrame` | DataFrame loaded from 'mode_stats.txt'. |
| `modes` | `List[str]` | Flat list of mode names (e.g. ["K5M1", "K5M2", ...]). |
| `mode_K` | `Dict[str, int]` | Mapping from mode name to K for that mode. |
| `K_range` | `List[int]` | Sorted unique K values across all modes. |
| `K_max` | `int` | Maximum K value across all modes. |
| `mode_names_list` | `List[List[str]]` | Grouped mode names by K (same structure as notebook). |
| `Q_by_mode` | `Dict[str, ndarray]` | Mapping from mode name to aligned membership matrix. |
| `alignment_acrossK` | `Dict[str, Sequence[int]]` | {"A-B" -> mapping from B->A (original indices)} |
| `cost_acrossK` | `Dict[str, float]` | Mapping from mode name to alignment cost. |
| `all_modes_alignment` | `Dict[str, Sequence[int]]` | {mode_name -> reordering (aligned columns)} |
| `mode_coord_dict` | `Dict[str, Tuple[int, int]]` | {mode_name -> (row_idx, col_idx)} grid by K. |
| `mode_sep_coord_dict` | `Dict[Tuple[str, int], Tuple[int, int]]` | {(mode_name, cls_idx) -> (row_idx, col_idx)}. |
| `input_meta` | `DataFrame \| None` | DataFrame loaded from 'input_meta.txt', if available. |
| `Q_unaligned_by_mode` | `Dict[str, ndarray] \| None` | {mode_name -> unaligned membership matrix}, if loaded. |
| `P_unaligned_by_mode` | `Dict[str, ndarray] \| None` | {mode_name -> unaligned feature matrix}, if loaded. |
| `P_aligned_by_mode` | `Dict[str, ndarray] \| None` | {mode_name -> aligned feature matrix}, if loaded. |
Functions
reorder_inds(reorder_idx)
Return a new ClumpplingResults with all Q matrices reordered according to reorder_idx.
CompModelsResults
dataclass
Container for clumppling.compModels outputs and associated metadata.
Attributes:
| Name | Type | Description |
|---|---|---|
| `res_dir` | `Path` | Directory containing compModels outputs (e.g. output/comp_models/pbmc10k-tutorial_hc_output). |
| `input_dir` | `Optional[Path]` | Directory containing per-model input stats for compModels (e.g. output/comp_models/pbmc10k-tutorial_hc). |
| `models` | `List[str]` | Model names (e.g. "rna.seurat.louvain", "rna.seurat.leiden", ...). |
| `modes_by_model` | `Dict[str, List[str]]` | For each model, a list of short mode names with the model prefix stripped, e.g. {"rna.seurat.louvain": ["K21M1", "K21M2", ...]}. |
| `full_mode_names_by_model` | `Dict[str, List[str]]` | For each model, the full mode names as used in filenames, e.g. {"rna.seurat.louvain": ["rna.seurat.louvain_K21M1", ...]}. |
| `full_mode_names` | `List[str]` | Flat list of all full mode names across all models. |
| `Q_by_mode` | `Dict[str, ndarray]` | Mapping full mode name -> aligned membership matrix Q loaded from res_dir / "aligned" / f"{mode}.Q". |
| `all_modes_alignment` | `Dict[str, Sequence[int]]` | Mapping full mode name -> global alignment pattern, parsed from res_dir / "aligned" / "all_modes_alignment.txt" if present. (Keys are full mode names; values are index patterns.) |
| `alignment_across_all` | `Optional[Dict[str, Sequence[int]]]` | Reserved for cross-mode alignment patterns (currently left as None unless you later decide to parse additional files). |
| `cost_across_all` | `Optional[Dict[str, float]]` | Reserved for cross-mode alignment costs (currently None). |
| `mode_stats_by_model` | `Dict[str, DataFrame]` | For each model, its original mode_stats DataFrame loaded from input_dir / f"{model}_mode_stats.txt", if available. |
| `K_max` | `int` | |
| `K_max_by_model` | `Dict[str, int]` | |
Functions
get_Q(full_mode_name)
Return the aligned Q matrix for a full mode name.
get_Q_for(model, mode_short)
Return the aligned Q matrix for a (model, short_mode_name) pair, where short_mode_name is e.g. 'K21M1'.
reorder_inds(reorder_idx)
Return a new CompModelsResults with all Q matrices reordered according to reorder_idx.
Functions
add_pairwise_alignment(res, alignment)
Return a new ClumpplingResults with alignment updated.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `res` | `ClumpplingResults` | Original results object. | required |
| `alignment` | `dict` | New alignment patterns per mode, e.g. {'K17M1': [2,0,1], ...} | required |
Returns:
| Type | Description |
|---|---|
| `ClumpplingResults` | New results object with updated Q_by_mode, P_aligned_by_mode, and all_modes_alignment according to the new alignment. |
filter_bed_by_peaks(bed_path, peaks, *, ccre_id_col=3)
Stream a BED file and keep only lines that overlap any of the given peaks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `bed_path` | `str or Path` | Path to the BED file to filter. | required |
| `peaks` | `iterable of str` | Iterable of peak strings, e.g. ['chr1:10109-10357', ...] | required |
| `ccre_id_col` | `int` | Column index (0-based) in the BED file where the cCRE ID is located. | `3` |
Returns:
| Name | Type | Description |
|---|---|---|
| `filtered_rows` | `list of list[str]` | Each inner list is the BED line split by ' '. |
| `kept_ids` | `set of str` | Set of cCRE IDs (from column |
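The overlap filtering described for filter_bed_by_peaks can be illustrated with a small standalone sketch. This is not the library's implementation: the helper names `parse_peak` and `filter_bed_sketch`, the tab separator, the half-open overlap test, and the cCRE IDs in the sample data are all assumptions made for illustration.

```python
import io
from collections import defaultdict


def parse_peak(peak):
    """Split a peak string such as 'chr1:10109-10357' into (chrom, start, end)."""
    chrom, span = peak.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)


def filter_bed_sketch(lines, peaks, ccre_id_col=3, sep="\t"):
    """Keep BED rows overlapping any peak; return (filtered_rows, kept_ids)."""
    # Index peak intervals by chromosome so each row is only compared
    # against peaks on its own chromosome.
    by_chrom = defaultdict(list)
    for peak in peaks:
        chrom, start, end = parse_peak(peak)
        by_chrom[chrom].append((start, end))

    filtered_rows, kept_ids = [], set()
    for line in lines:  # streamed one line at a time
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split(sep)  # assumption: tab-separated columns
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Assumed overlap rule: two intervals overlap when each one
        # starts before the other ends.
        if any(start < p_end and p_start < end
               for p_start, p_end in by_chrom.get(chrom, ())):
            filtered_rows.append(fields)
            kept_ids.add(fields[ccre_id_col])
    return filtered_rows, kept_ids


# Illustrative data only; the IDs below are placeholders.
bed_text = (
    "chr1\t10100\t10200\tccre_0001\n"
    "chr1\t20000\t20100\tccre_0002\n"
    "chr2\t10150\t10250\tccre_0003\n"
)
rows, ids = filter_bed_sketch(io.StringIO(bed_text), ["chr1:10109-10357"])
print(rows, ids)
```

The sketch scans every peak on a chromosome for each row, which is fine for small peak lists; a sorted interval index would be a reasonable swap for large inputs.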
filter_gene_links_by_ccre(gene_links_path, kept_ids, *, ccre_id_col=0, keep_header=True)
Stream a gene-link file and keep only rows whose cCRE ID is in kept_ids.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `gene_links_path` | `str or Path` | Path to the gene-link file to filter. | required |
| `kept_ids` | `set of str` | Set of cCRE IDs to keep. | required |
| `ccre_id_col` | `int` | Column index (0-based) in the gene-link file where the cCRE ID is located. | `0` |
| `keep_header` | `bool` | Whether to keep and return the header line (first line) of the file. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `header` | `list[str] or None` | Header columns if keep_header=True and file has at least one line, otherwise None. |
| `filtered_rows` | `list of list[str]` | Data rows (split by ' ') with cCRE IDs in kept_ids. |
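As a rough illustration of the streaming filter described for filter_gene_links_by_ccre, the sketch below works on any iterable of lines. It is a standalone approximation, not the library's code: `filter_gene_links_sketch`, the tab separator, the sample column names, and the choice to treat the first line as data when keep_header=False are all assumptions.

```python
import io


def filter_gene_links_sketch(lines, kept_ids, ccre_id_col=0,
                             keep_header=True, sep="\t"):
    """Return (header, filtered_rows) for rows whose cCRE ID is in kept_ids."""
    header = None
    filtered_rows = []
    first_line = True
    for line in lines:  # streamed one line at a time
        line = line.rstrip("\n")
        if first_line:
            first_line = False
            if keep_header:
                # Treat the first line as the header and do not filter it.
                header = line.split(sep)
                continue
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        if fields[ccre_id_col] in kept_ids:
            filtered_rows.append(fields)
    return header, filtered_rows


# Illustrative data only; IDs and gene names are placeholders.
links_text = "ccre_id\tgene\nE1\tGENE_A\nE2\tGENE_B\nE1\tGENE_C\n"
header, link_rows = filter_gene_links_sketch(io.StringIO(links_text), {"E1"})
print(header, link_rows)
```

An empty input leaves `header` as None, which matches the documented return value for a file with no lines.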
group_modes_by_K(mode_names)
Group a flat list of mode names into a list-of-lists by K.
Given a list like ['K17M1', 'K17M2', 'K18M1', 'K19M1'], return [['K17M1', 'K17M2'], ['K18M1'], ['K19M1']].
This mirrors the notebook helper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mode_names` | `sequence of str` | | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `mode_names_list` | `list of list of str` | One sublist per K, ordered by increasing K. |
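The grouping behavior documented above can be reproduced with a short standalone sketch. It assumes mode names follow the `K<digits>M<digits>` pattern seen in the examples; `group_modes_by_K_sketch` is a hypothetical name, and the real helper may parse names differently.

```python
import re
from collections import defaultdict

# Assumed naming scheme: "K<digits>M<digits>", e.g. "K17M1".
_MODE_RE = re.compile(r"^K(\d+)M(\d+)$")


def group_modes_by_K_sketch(mode_names):
    """Group mode names into one sublist per K, ordered by increasing K."""
    groups = defaultdict(list)
    for name in mode_names:
        match = _MODE_RE.match(name)
        if match is None:
            raise ValueError(f"unexpected mode name: {name!r}")
        groups[int(match.group(1))].append(name)
    # Sort by the integer K; names keep their input order within each K.
    return [groups[k] for k in sorted(groups)]


grouped = group_modes_by_K_sketch(["K17M1", "K17M2", "K18M1", "K19M1"])
print(grouped)
```

Sorting on the parsed integer rather than the string keeps K=9 ahead of K=10.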
infer_K_range(mode_names)
Convenience helper to get sorted K values from mode names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mode_names` | `sequence of str` | Mode names, e.g. ['K17M1', 'K17M2', 'K18M1']. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `K_range` | `list of int` | Sorted K values, e.g. [17, 18]. |
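A minimal standalone sketch of the same idea follows, assuming the `K<digits>M<digits>` naming pattern from the example; `infer_K_range_sketch` is a hypothetical stand-in, not the library function.

```python
import re


def infer_K_range_sketch(mode_names):
    """Return the sorted unique K values parsed from mode names."""
    k_values = set()
    for name in mode_names:
        # Assumed naming scheme: "K<digits>M<digits>".
        match = re.match(r"^K(\d+)M\d+$", name)
        if match is None:
            raise ValueError(f"unexpected mode name: {name!r}")
        k_values.add(int(match.group(1)))
    return sorted(k_values)


k_range = infer_K_range_sketch(["K17M1", "K17M2", "K18M1"])
print(k_range)
```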
load_aligned_Qs(align_dir, modes, suffix='rep', *, delimiter=' ')
Load aligned membership matrices for each mode from 'modes_aligned'.
Files are expected to be named like:
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `align_dir` | `path-like` | Main clumppling output directory (same as used in load_mode_alignment). | required |
| `modes` | `sequence of str` | Mode names to load, e.g. from _get_mode_names(...). | required |
| `suffix` | `('rep', 'avg')` | Suffix used by clumppling when writing aligned modes. | `"rep"` |
| `delimiter` | `str` | Delimiter for the Q files. | `" "` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Q_by_mode` | `dict` | {mode_name -> np.ndarray of shape (n_individuals, K_mode)} |
load_alignment_across_K(align_file)
Load alignment_acrossK and cost_acrossK from the file written by clumppling.write_alignment_across_k.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `align_file` | `str \| PathLike` | Path to the alignment file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `alignment_acrossK` | `dict` | {pair_label -> alignment pattern}, where |
| `cost_acrossK` | `dict` | {pair_label -> float cost} |
load_all_modes_alignment(align_dir, suffix='rep', *, filename=None)
Load all_modes_alignment from the file written by clumppling.write_reordered_across_k.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `align_dir` | `path-like` | Main clumppling output directory. | required |
| `suffix` | `('rep', 'avg')` | Suffix used when writing the all-modes alignment file. | `"rep"` |
| `filename` | `str` | Override the default filename if needed. If None, uses f"all_modes_alignment_{suffix}.txt". | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `all_modes_alignment` | `dict` | {mode_label -> alignment_pattern}, where alignment_pattern is obtained by applying str_to_pattern to the stored pattern string. |
load_clumppling_results(align_dir, *, suffix='rep', round_Q=False, cls_dir=None, load_unaligned=False, load_P=True, strict_P=False, p_ext=None)
Load clumppling results from the specified directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `align_dir` | `path-like` | The clumppling output directory used as | required |
| `suffix` | `('rep', 'avg')` | Suffix used in the aligned Q filenames, e.g. "K17M1_rep.Q". | `"rep"` |
| `round_Q` | `bool` | If True, apply np.rint to each aligned Q matrix to get hard cluster memberships. | `False` |
| `cls_dir` | `path-like` | Directory containing the original clustering outputs (*.P files). If provided and load_P is True, P matrices will be loaded. | `None` |
| `load_P` | `bool` | Whether to attempt loading P matrices at all. Set to False if you know this run has no P files (e.g. hard clustering only). | `True` |
| `strict_P` | `bool` | If True, missing P files raise FileNotFoundError. If False, missing P files will emit a warning and skip P loading. | `False` |
| `p_ext` | `str` | File extension for P matrices. Use this when the tool writes P files with a non-standard extension, e.g. | `None` |
load_compmodels_results(res_dir, input_dir=None)
Load outputs from clumppling.compModels into a CompModelsResults object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `res_dir` | `str or Path` | Directory containing compModels outputs, e.g. .../output/comp_models/pbmc10k-tutorial_hc_output | required |
| `input_dir` | `str or Path` | Directory containing per-model input stats used for compModels, e.g. .../output/comp_models/pbmc10k-tutorial_hc. If None, mode_stats_by_model will be empty. | `None` |
Returns:
| Type | Description |
|---|---|
| `CompModelsResults` | Structured container for multi-model Q matrices, mode lists, global alignment patterns, and per-model mode_stats. |
load_gene_intervals(gtf_file, *, upstream=5000, downstream=0, feature_type='gene', source='HAVANA', gene_type_allowlist=None)
Stream a (possibly gzipped) GTF and extract only the intervals needed. Returns dict: chrom -> sorted list of (start, end, gene_name).
load_gene_set(name, gene_set_dir, *, prefix=None)
Load a gene-set file (one symbol per line) from a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Gene-set name, used as the file stem (e.g. | required |
| `gene_set_dir` | `path-like` | Directory containing | required |
| `prefix` | `str` | If provided and | `None` |
Returns:
| Type | Description |
|---|---|
| `list of str` | Gene symbols, one per line, with blank lines removed. |
load_input_meta(align_dir)
Load the 'input_meta.txt' table that links original Q/P files to modes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `align_dir` | `path-like` | The main clumppling output directory. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `input_meta` | `DataFrame` | |
load_mode_alignment(align_dir)
Load the 'mode_alignment.txt' table produced by clumppling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `align_dir` | `path-like` | The main output directory passed as | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `mode_alignment` | `DataFrame` | |
load_mode_stats(align_dir)
Load the 'mode_stats.txt' table produced by clumppling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `align_dir` | `path-like` | The main output directory passed as | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `mode_stats` | `DataFrame` | |
load_unaligned_for_modes(cls_dir, align_dir, *, mat_type, modes=None, mode_stats=None, input_meta=None, delimiter=None, p_ext=None)
Load the unaligned matrices (P or Q, selected by mat_type) for each mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cls_dir` | `path-like` | Directory that contains the original MMC / clustering outputs, i.e. where the '*.P' files live. | required |
| `align_dir` | `path-like` | Main clumppling output directory (used to load mode_stats and input_meta if they are not provided). | required |
| `mat_type` | `Literal['P', 'Q']` | Type of matrix to load. Currently "P" and "Q" are supported. | required |
| `modes` | `sequence of str` | Mode names to load. If None, they are inferred from mode_alignment. | `None` |
| `mode_stats` | `DataFrame` | If already loaded, pass it to avoid re-reading. | `None` |
| `input_meta` | `DataFrame` | If already loaded, pass it to avoid re-reading. | `None` |
| `delimiter` | `str or None` | Delimiter for the matrix files. | `None` |
| `p_ext` | `str` | Extension to use for P files when | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `mat_by_mode` | `dict` | {mode_name -> np.ndarray of shape (n_features, K_mode)} |
subset_compmodels(comp_res, K_min=None, K_max=None, K_values=None)
Return a new CompModelsResults object restricted to a subset of K values.
A mode is kept if its number of clusters K = Q.shape[1] satisfies:
- if K_values is not None: K in K_values
- else: K_min <= K <= K_max (with open ends if K_min/K_max is None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `comp_res` | `CompModelsResults` | Original full comparison results. | required |
| `K_min` | `int` | Lower bound for K (inclusive). Ignored if K_values is provided. | `None` |
| `K_max` | `int` | Upper bound for K (inclusive). Ignored if K_values is provided. | `None` |
| `K_values` | `sequence of int` | Explicit set of K values to keep. | `None` |
Returns:
| Type | Description |
|---|---|
| `CompModelsResults` | New results object with only the selected modes and updated metadata. |
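The selection rule for subset_compmodels can be written as a small predicate. The sketch below is only an illustration of that rule under stated assumptions: `keep_mode` is a hypothetical helper, and the `mode_K` dictionary stands in for the K = Q.shape[1] values that the real function reads from the loaded matrices.

```python
def keep_mode(K, K_min=None, K_max=None, K_values=None):
    """Return True if a mode with K clusters passes the documented rule."""
    if K_values is not None:
        # An explicit list of K values takes precedence over the bounds.
        return K in set(K_values)
    if K_min is not None and K < K_min:
        return False
    if K_max is not None and K > K_max:
        return False
    return True  # a bound left as None is treated as open


# Hypothetical full mode names mapped to their K, for illustration only.
mode_K = {
    "rna.seurat.louvain_K5M1": 5,
    "rna.seurat.louvain_K6M1": 6,
    "rna.seurat.leiden_K8M1": 8,
}

kept_by_range = [m for m, k in mode_K.items() if keep_mode(k, K_min=6)]
kept_by_values = [m for m, k in mode_K.items()
                  if keep_mode(k, K_min=6, K_values=[5, 8])]
print(kept_by_range)
print(kept_by_values)
```

In the second call the K_min bound has no effect because K_values is provided, matching the note that the bounds are ignored in that case.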