I/O

io.py

Functions for loading the outputs of clumppling's main pipeline and of clumppling.compModels, as well as other related data.

Classes

`ClumpplingResults` `dataclass`

Container for all core clumppling outputs needed for analysis/plots.

Attributes:

Name	Type	Description
`align_dir`	`Path`	Directory containing clumppling outputs (e.g. output/clumppling/pbmc10k-tutorial_hc_output).
`suffix`	`str`	Suffix used in aligned Q filenames (e.g. "rep" or "avg").
`mode_alignment`	`DataFrame`	DataFrame loaded from 'mode_alignment.txt'.
`mode_stats`	`DataFrame`	DataFrame loaded from 'mode_stats.txt'.
`modes`	`List[str]`	Flat list of mode names (e.g. ["K5M1", "K5M2", ...]).
`mode_K`	`Dict[str, int]`	Mapping from mode name to K for that mode.
`K_range`	`List[int]`	Sorted unique K values across all modes.
`K_max`	`int`	Maximum K value across all modes.
`mode_names_list`	`List[List[str]]`	Grouped mode names by K (same structure as notebook).
`Q_by_mode`	`Dict[str, ndarray]`	Mapping from mode name to aligned membership matrix.
`alignment_acrossK`	`Dict[str, Sequence[int]]`	{"A-B" -> mapping from B->A (original indices)}
`cost_acrossK`	`Dict[str, float]`	Mapping from pair label (e.g. "A-B") to the alignment cost for that pair of modes.
`all_modes_alignment`	`Dict[str, Sequence[int]]`	{mode_name -> reordering (aligned columns)}
`mode_coord_dict`	`Dict[str, Tuple[int, int]]`	{mode_name -> (row_idx, col_idx)} grid by K.
`mode_sep_coord_dict`	`Dict[Tuple[str, int], Tuple[int, int]]`	{(mode_name, cls_idx) -> (row_idx, col_idx)}.
`input_meta`	`DataFrame \| None`	DataFrame loaded from 'input_meta.txt', if available.
`Q_unaligned_by_mode`	`Dict[str, ndarray] \| None`	{mode_name -> unaligned membership matrix}, if loaded.
`P_unaligned_by_mode`	`Dict[str, ndarray] \| None`	{mode_name -> unaligned feature matrix}, if loaded.
`P_aligned_by_mode`	`Dict[str, ndarray] \| None`	{mode_name -> aligned feature matrix}, if loaded.

Functions

`reorder_inds(reorder_idx)`

Return a new ClumpplingResults with all Q matrices reordered according to reorder_idx.

`CompModelsResults` `dataclass`

Container for clumppling.compModels outputs and associated metadata.

Attributes:

Name	Type	Description
`res_dir`	`Path`	Directory containing compModels outputs (e.g. output/comp_models/pbmc10k-tutorial_hc_output).
`input_dir`	`Optional[Path]`	Directory containing per-model input stats for compModels (e.g. output/comp_models/pbmc10k-tutorial_hc).
`models`	`List[str]`	Model names (e.g. "rna.seurat.louvain", "rna.seurat.leiden", ...).
`modes_by_model`	`Dict[str, List[str]]`	For each model, a list of short mode names with the model prefix stripped, e.g. {"rna.seurat.louvain": ["K21M1", "K21M2", ...]}.
`full_mode_names_by_model`	`Dict[str, List[str]]`	For each model, the full mode names as used in filenames, e.g. {"rna.seurat.louvain": ["rna.seurat.louvain_K21M1", ...]}.
`full_mode_names`	`List[str]`	Flat list of all full mode names across all models.
`Q_by_mode`	`Dict[str, ndarray]`	Mapping full mode name -> aligned membership matrix Q loaded from res_dir / "aligned" / f"{mode}.Q".
`all_modes_alignment`	`Dict[str, Sequence[int]]`	Mapping full mode name -> global alignment pattern, parsed from res_dir / "aligned" / "all_modes_alignment.txt" if present. (Keys are full mode names; values are index patterns.)
`alignment_across_all`	`Optional[Dict[str, Sequence[int]]]`	Reserved for cross-mode alignment patterns (currently None; no additional files are parsed to populate it).
`cost_across_all`	`Optional[Dict[str, float]]`	Reserved for cross-mode alignment costs (currently None).
`mode_stats_by_model`	`Dict[str, DataFrame]`	For each model, its original mode_stats DataFrame loaded from input_dir / f"{model}_mode_stats.txt", if available.
`K_max`	`int`	Maximum K across all modes of all models.
`K_max_by_model`	`Dict[str, int]`	For each model, the maximum K across that model's modes.

Functions

`get_Q(full_mode_name)`

Return the aligned Q matrix for a full mode name.

`get_Q_for(model, mode_short)`

Return the aligned Q matrix for a (model, short_mode_name) pair, where short_mode_name is e.g. 'K21M1'.
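The relationship between the two lookups can be illustrated with a small sketch. It assumes, based only on the examples given for `modes_by_model` and `full_mode_names_by_model`, that a full mode name is the model name and the short mode name joined by an underscore; the helper names `full_mode_name` and `split_full_mode_name` are hypothetical and not part of `io.py`.

```python
from typing import Tuple


def full_mode_name(model: str, mode_short: str) -> str:
    """Join a model name and a short mode name (assumed "<model>_<mode>" convention)."""
    return f"{model}_{mode_short}"


def split_full_mode_name(full_name: str) -> Tuple[str, str]:
    """Split on the last underscore, so dots inside the model name are left alone."""
    model, _, mode_short = full_name.rpartition("_")
    return model, mode_short


full = full_mode_name("rna.seurat.louvain", "K21M1")
print(full)                        # rna.seurat.louvain_K21M1
print(split_full_mode_name(full))  # ('rna.seurat.louvain', 'K21M1')
```

Under that assumption, `get_Q_for(model, mode_short)` would be equivalent to `get_Q(full_mode_name(model, mode_short))`.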

`reorder_inds(reorder_idx)`

Return a new CompModelsResults with all Q matrices reordered according to reorder_idx.

Functions

`add_pairwise_alignment(res, alignment)`

Return a new ClumpplingResults with alignment updated.

Parameters:

Name	Type	Description	Default
`res`	`ClumpplingResults`	Original results object.	required
`alignment`	`dict`	New alignment patterns per mode, e.g. {'K17M1': [2,0,1], ...}	required

Returns:

Type	Description
`ClumpplingResults`	New results object with updated Q_by_mode, P_aligned_by_mode, and all_modes_alignment according to the new alignment.
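What applying an alignment pattern to a membership matrix might look like is sketched below. This is an illustration only: it assumes that a pattern such as `[2, 0, 1]` means "new column i is taken from old column pattern[i]", which the reference does not confirm, and `apply_alignment_sketch` is a hypothetical helper rather than the library's implementation.

```python
from typing import Dict, Sequence

import numpy as np


def apply_alignment_sketch(
    Q_by_mode: Dict[str, np.ndarray],
    alignment: Dict[str, Sequence[int]],
) -> Dict[str, np.ndarray]:
    """Return a new dict with the columns of each listed mode permuted.

    Assumed convention: reordered[:, i] == original[:, pattern[i]].
    Modes without an entry in `alignment` are passed through unchanged.
    """
    out = {}
    for mode, Q in Q_by_mode.items():
        pattern = alignment.get(mode)
        out[mode] = Q if pattern is None else Q[:, list(pattern)]
    return out


Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
reordered = apply_alignment_sketch({"K3M1": Q}, {"K3M1": [2, 0, 1]})
print(reordered["K3M1"].tolist())  # [[0.1, 0.7, 0.2], [0.6, 0.1, 0.3]]
```

Because the operation is a pure column permutation, each row still sums to the same total as before.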

`filter_bed_by_peaks(bed_path, peaks, *, ccre_id_col=3)`

Stream a BED file and keep only lines that overlap any of the given peaks.

Parameters:

Name	Type	Description	Default
`bed_path`	`str or Path`	Path to the BED file to filter.	required
`peaks`	`iterable of str`	Iterable of peak strings, e.g. ['chr1:10109-10357', ...]	required
`ccre_id_col`	`int`	Column index (0-based) in the BED file where the cCRE ID is located.	`3`

Returns:

Name	Type	Description
`filtered_rows`	`list of list[str]`	Each inner list is the BED line split by ' '.
`kept_ids`	`set of str`	Set of cCRE IDs (from column `ccre_id_col`) for filtered rows.
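The overlap filter described above can be sketched as follows. The sketch works on an iterable of lines instead of a path, assumes tab-separated BED fields with half-open intervals, and uses a simple linear scan over the peaks of each chromosome; `parse_peak` and `filter_bed_lines_sketch` are hypothetical names, and the cCRE IDs in the example are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def parse_peak(peak: str) -> Tuple[str, int, int]:
    """Parse 'chr1:10109-10357' into ('chr1', 10109, 10357)."""
    chrom, span = peak.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)


def filter_bed_lines_sketch(
    lines: Iterable[str],
    peaks: Iterable[str],
    ccre_id_col: int = 3,
    sep: str = "\t",  # assumption: tab-separated BED
) -> Tuple[List[List[str]], Set[str]]:
    # Index peaks by chromosome so each BED line is only compared to relevant peaks.
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for peak in peaks:
        chrom, start, end = parse_peak(peak)
        by_chrom[chrom].append((start, end))

    filtered_rows: List[List[str]] = []
    kept_ids: Set[str] = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split(sep)
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < p_end and p_start < end
               for p_start, p_end in by_chrom.get(chrom, ())):
            filtered_rows.append(fields)
            kept_ids.add(fields[ccre_id_col])
    return filtered_rows, kept_ids


bed_lines = [
    "chr1\t10000\t10200\tccre_A\n",
    "chr1\t20000\t20300\tccre_B\n",
    "chr2\t10100\t10300\tccre_C\n",
]
rows, ids = filter_bed_lines_sketch(bed_lines, ["chr1:10109-10357"])
print(rows)  # [['chr1', '10000', '10200', 'ccre_A']]
print(ids)   # {'ccre_A'}
```

For large peak sets, a sorted list with binary search or an interval tree per chromosome would scale better than the linear scan used here.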

`filter_gene_links_by_ccre(gene_links_path, kept_ids, *, ccre_id_col=0, keep_header=True)`

Stream a gene-link file and keep only rows whose cCRE ID is in kept_ids.

Parameters:

Name	Type	Description	Default
`gene_links_path`	`str or Path`	Path to the gene-link file to filter.	required
`kept_ids`	`set of str`	Set of cCRE IDs to keep.	required
`ccre_id_col`	`int`	Column index (0-based) in the gene-link file where the cCRE ID is located.	`0`
`keep_header`	`bool`	Whether to keep and return the header line (first line) of the file.	`True`

Returns:

Name	Type	Description
`header`	`list[str] or None`	Header columns if keep_header=True and file has at least one line, otherwise None.
`filtered_rows`	`list of list[str]`	Data rows (split by ' ') with cCRE IDs in kept_ids.
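A minimal sketch of this kind of streaming filter is shown below. It operates on an iterable of lines, assumes tab-separated fields, and assumes that with `keep_header=False` the first line is treated as ordinary data; none of these details are confirmed by the reference, and `filter_gene_links_sketch` is a hypothetical name. The IDs and gene names are illustrative.

```python
from typing import Iterable, List, Optional, Set, Tuple


def filter_gene_links_sketch(
    lines: Iterable[str],
    kept_ids: Set[str],
    ccre_id_col: int = 0,
    keep_header: bool = True,
    sep: str = "\t",  # assumption: tab-separated file
) -> Tuple[Optional[List[str]], List[List[str]]]:
    it = iter(lines)
    header: Optional[List[str]] = None
    if keep_header:
        first = next(it, None)  # None when the input is empty
        if first is not None:
            header = first.rstrip("\n").split(sep)

    filtered_rows: List[List[str]] = []
    for line in it:
        fields = line.rstrip("\n").split(sep)
        if fields[ccre_id_col] in kept_ids:
            filtered_rows.append(fields)
    return header, filtered_rows


link_lines = [
    "ccre_id\tgene\n",
    "ccre_A\tGENE1\n",
    "ccre_B\tGENE2\n",
    "ccre_A\tGENE3\n",
]
header, kept = filter_gene_links_sketch(link_lines, {"ccre_A"})
print(header)  # ['ccre_id', 'gene']
print(kept)    # [['ccre_A', 'GENE1'], ['ccre_A', 'GENE3']]
```

The `kept_ids` set returned by `filter_bed_by_peaks` is the natural input here, which lets the two filters be chained.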

`group_modes_by_K(mode_names)`

Group a flat list of mode names into a list-of-lists by K.

Given a list like: ['K17M1', 'K17M2', 'K18M1', 'K19M1'] return: [['K17M1', 'K17M2'], ['K18M1'], ['K19M1']]

This mirrors the notebook helper.

Parameters:

Name	Type	Description	Default
`mode_names`	`sequence of str`	Flat list of mode names, e.g. ['K17M1', 'K17M2', 'K18M1', 'K19M1'].	required

Returns:

Name	Type	Description
`mode_names_list`	`list of list of str`	One sublist per K, ordered by increasing K.
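The grouping can be sketched in a few lines. The version below assumes mode names follow the `K<number>M<number>` pattern seen in the examples and raises on anything else; `group_modes_by_k_sketch` is a hypothetical stand-in, not the function's actual source.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence

# Assumed naming pattern: 'K' + number of clusters + 'M' + mode index.
_MODE_RE = re.compile(r"^K(\d+)M(\d+)$")


def group_modes_by_k_sketch(mode_names: Sequence[str]) -> List[List[str]]:
    """Group names like 'K17M1' into one sublist per K, ordered by increasing K."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for name in mode_names:
        match = _MODE_RE.match(name)
        if match is None:
            raise ValueError(f"unexpected mode name: {name!r}")
        groups[int(match.group(1))].append(name)
    # Sort numerically so that K9 comes before K10.
    return [groups[k] for k in sorted(groups)]


grouped = group_modes_by_k_sketch(["K17M1", "K17M2", "K18M1", "K19M1"])
print(grouped)  # [['K17M1', 'K17M2'], ['K18M1'], ['K19M1']]
```

Within each sublist the input order is preserved.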

`infer_K_range(mode_names)`

Convenience helper to get sorted K values from mode names.

Parameters:

Name	Type	Description	Default
`mode_names`	`sequence of str`	Mode names, e.g. ['K17M1', 'K17M2', 'K18M1'].	required

Returns:

Name	Type	Description
`K_range`	`list of int`	Sorted unique K values, e.g. [17, 18].
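As a rough illustration, the K values can be recovered by parsing the number between `K` and `M` in each name. The sketch assumes the `K<number>M<number>` naming seen in the examples; `infer_k_range_sketch` is a hypothetical name.

```python
import re
from typing import List, Sequence


def infer_k_range_sketch(mode_names: Sequence[str]) -> List[int]:
    """Return the sorted, de-duplicated K values found in names like 'K17M1'."""
    ks = set()
    for name in mode_names:
        match = re.match(r"^K(\d+)M\d+$", name)
        if match is None:
            raise ValueError(f"unexpected mode name: {name!r}")
        ks.add(int(match.group(1)))
    return sorted(ks)


k_range = infer_k_range_sketch(["K17M1", "K17M2", "K18M1"])
print(k_range)  # [17, 18]
```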

`load_aligned_Qs(align_dir, modes, suffix='rep', *, delimiter=' ')`

Load aligned membership matrices for each mode from 'modes_aligned'.

Files are expected to be named like: <mode>_<suffix>.Q, e.g. "K17M1_rep.Q"

Parameters:

Name	Type	Description	Default
`align_dir`	`path-like`	Main clumppling output directory (same as used in load_mode_alignment).	required
`modes`	`sequence of str`	Mode names to load, e.g. from _get_mode_names(...).	required
`suffix`	`('rep', 'avg')`	Suffix used by clumppling when writing aligned modes.	`"rep"`
`delimiter`	`str`	Delimiter for the Q files.	`" "`

Returns:

Name	Type	Description
`Q_by_mode`	`dict`	{mode_name -> np.ndarray of shape (n_individuals, K_mode)}
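A minimal sketch of this loading step, assuming the files sit in a `modes_aligned` subdirectory of `align_dir` and are plain numeric text readable by `numpy.loadtxt`, could look like the following. `load_aligned_qs_sketch` is a hypothetical name, and the example writes a tiny fake Q file to a temporary directory so that it is self-contained.

```python
import tempfile
from pathlib import Path
from typing import Dict, Sequence

import numpy as np


def load_aligned_qs_sketch(
    align_dir,
    modes: Sequence[str],
    suffix: str = "rep",
    delimiter: str = " ",
) -> Dict[str, np.ndarray]:
    """Load <align_dir>/modes_aligned/<mode>_<suffix>.Q for each mode."""
    q_dir = Path(align_dir) / "modes_aligned"
    return {
        # ndmin=2 keeps a 2-D shape even for a single-row or single-column file.
        mode: np.loadtxt(q_dir / f"{mode}_{suffix}.Q", delimiter=delimiter, ndmin=2)
        for mode in modes
    }


with tempfile.TemporaryDirectory() as tmp:
    q_dir = Path(tmp) / "modes_aligned"
    q_dir.mkdir()
    (q_dir / "K2M1_rep.Q").write_text("0.9 0.1\n0.2 0.8\n0.5 0.5\n")
    Q_by_mode = load_aligned_qs_sketch(tmp, ["K2M1"])

print(Q_by_mode["K2M1"].shape)  # (3, 2)
```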

`load_alignment_across_K(align_file)`

Load alignment_acrossK and cost_acrossK from the file written by clumppling.write_alignment_across_k.

Parameters:

Name	Type	Description	Default
`align_file`	`str \| PathLike`	Path to the alignment file.	required

Returns:

Name	Type	Description
`alignment_acrossK`	`dict`	{pair_label -> alignment pattern}, where `alignment pattern` is a sequence of ints mapping clusters between two modes.
`cost_acrossK`	`dict`	{pair_label -> float cost}

`load_all_modes_alignment(align_dir, suffix='rep', *, filename=None)`

Load all_modes_alignment from the file written by clumppling.write_reordered_across_k.

Parameters:

Name	Type	Description	Default
`align_dir`	`path-like`	Main clumppling output directory.	required
`suffix`	`('rep', 'avg')`	Suffix used when writing the all-modes alignment file.	`"rep"`
`filename`	`str`	Override the default filename if needed. If None, uses f"all_modes_alignment_{suffix}.txt".	`None`

Returns:

Name	Type	Description
`all_modes_alignment`	`dict`	{mode_label -> alignment_pattern}, where alignment_pattern is obtained by applying str_to_pattern to the stored pattern string.

`load_clumppling_results(align_dir, *, suffix='rep', round_Q=False, cls_dir=None, load_unaligned=False, load_P=True, strict_P=False, p_ext=None)`

Load clumppling results from the specified directory.

Parameters:

Name	Type	Description	Default
`align_dir`	`path-like`	The clumppling output directory used as `-o/--output`.	required
`suffix`	`('rep', 'avg')`	Suffix used in the aligned Q filenames, e.g. "K17M1_rep.Q".	`"rep"`
`round_Q`	`bool`	If True, apply np.rint to each aligned Q matrix to get hard cluster memberships.	`False`
`cls_dir`	`path-like`	Directory containing the original clustering outputs (*.P files). If provided and load_P is True, P matrices will be loaded.	`None`
`load_unaligned`	`bool`	Whether to also load the unaligned matrices for each mode.	`False`
`load_P`	`bool`	Whether to attempt loading P matrices at all. Set to False if you know this run has no P files (e.g. hard clustering only).	`True`
`strict_P`	`bool`	If True, missing P files raise FileNotFoundError. If False, missing P files will emit a warning and skip P loading.	`False`
`p_ext`	`str`	File extension for P matrices. Use this when the tool writes P files with a non-standard extension, e.g. `p_ext="meanP"` for fastStructure (whose P files end in `.meanP` rather than `.P`). When None (default), the extension is derived from the matrix type (see `load_unaligned_for_modes`).	`None`
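The effect of `round_Q=True` can be previewed directly with `numpy.rint`, which is the operation named in the parameter description. The matrix below is made up for illustration.

```python
import numpy as np

# A small illustrative membership matrix: 3 individuals, K = 2.
Q = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])

# Round each entry to the nearest integer.
Q_hard = np.rint(Q)
print(Q_hard.tolist())  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```

Note that `np.rint` rounds exact halves to the nearest even value, so a row such as `[0.5, 0.5]` becomes all zeros instead of a one-hot assignment. Rounding is therefore best suited to Q matrices that are already close to hard assignments.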

`load_compmodels_results(res_dir, input_dir=None)`

Load outputs from clumppling.compModels into a CompModelsResults object.

Parameters:

Name	Type	Description	Default
`res_dir`	`str or Path`	Directory containing compModels outputs, e.g. .../output/comp_models/pbmc10k-tutorial_hc_output	required
`input_dir`	`str or Path`	Directory containing per-model input stats used for compModels, e.g. .../output/comp_models/pbmc10k-tutorial_hc. If None, mode_stats_by_model will be empty.	`None`

Returns:

Type	Description
`CompModelsResults`	Structured container for multi-model Q matrices, mode lists, global alignment patterns, and per-model mode_stats.

`load_gene_intervals(gtf_file, *, upstream=5000, downstream=0, feature_type='gene', source='HAVANA', gene_type_allowlist=None)`

Stream a (possibly gzipped) GTF and extract only the intervals needed. Returns dict: chrom -> sorted list of (start, end, gene_name).

`load_gene_set(name, gene_set_dir, *, prefix=None)`

Load a gene-set file (one symbol per line) from a directory.

Parameters:

Name	Type	Description	Default
`name`	`str`	Gene-set name, used as the file stem (e.g. `"HALLMARK_E2F_TARGETS"` or just `"E2F_TARGETS"` when `prefix="HALLMARK_"`).	required
`gene_set_dir`	`path-like`	Directory containing `.txt` gene-set files.	required
`prefix`	`str`	If provided and `name` does not already start with it, the prefix is prepended before constructing the filename. Useful for MSigDB Hallmark collections where files are named `HALLMARK_<name>.txt`.	`None`

Returns:

Type	Description
`list of str`	Gene symbols, one per line, with blank lines removed.
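The prefix handling and blank-line removal described above can be sketched as follows. `load_gene_set_sketch` is a hypothetical stand-in for the real function, and the example creates a small gene-set file in a temporary directory so that it runs on its own.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def load_gene_set_sketch(name: str, gene_set_dir, prefix: Optional[str] = None) -> List[str]:
    """Read <gene_set_dir>/<stem>.txt, one gene symbol per line, skipping blanks."""
    stem = name
    # Prepend the prefix only when the name does not already carry it.
    if prefix and not name.startswith(prefix):
        stem = prefix + name
    path = Path(gene_set_dir) / f"{stem}.txt"
    with path.open() as fh:
        return [line.strip() for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "HALLMARK_E2F_TARGETS.txt").write_text("E2F1\n\nMCM2\n")
    short = load_gene_set_sketch("E2F_TARGETS", tmp, prefix="HALLMARK_")
    full = load_gene_set_sketch("HALLMARK_E2F_TARGETS", tmp, prefix="HALLMARK_")

print(short)  # ['E2F1', 'MCM2']
print(full)   # ['E2F1', 'MCM2']
```

Both spellings of the name resolve to the same file, which matches the behaviour described for `prefix`.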

`load_input_meta(align_dir)`

Load the 'input_meta.txt' table that links original Q/P files to modes.

Parameters:

Name	Type	Description	Default
`align_dir`	`path-like`	The main clumppling output directory.	required

Returns:

Name	Type	Description
`input_meta`	`DataFrame`

`load_mode_alignment(align_dir)`

Load the 'mode_alignment.txt' table produced by clumppling.

Parameters:

Name	Type	Description	Default
`align_dir`	`path-like`	The main output directory passed as `-o/--output` to clumppling.	required

Returns:

Name	Type	Description
`mode_alignment`	`DataFrame`

`load_mode_stats(align_dir)`

Load the 'mode_stats.txt' table produced by clumppling.

Parameters:

Name	Type	Description	Default
`align_dir`	`path-like`	The main output directory passed as `-o/--output` to clumppling.	required

Returns:

Name	Type	Description
`mode_stats`	`DataFrame`

`load_unaligned_for_modes(cls_dir, align_dir, *, mat_type, modes=None, mode_stats=None, input_meta=None, delimiter=None, p_ext=None)`

Load the unaligned matrices of the requested type (`mat_type`, P or Q) for each mode.

Parameters:

Name	Type	Description	Default
`cls_dir`	`path-like`	Directory that contains the original MMC / clustering outputs, i.e. where the '*.P' files live.	required
`align_dir`	`path-like`	Main clumppling output directory (used to load mode_stats and input_meta if they are not provided).	required
`mat_type`	`Literal['P', 'Q']`	Type of matrix to load. Currently "P" and "Q" are supported.	required
`modes`	`sequence of str`	Mode names to load. If None, they are inferred from mode_alignment.	`None`
`mode_stats`	`DataFrame`	If already loaded, pass it to avoid re-reading.	`None`
`input_meta`	`DataFrame`	If already loaded, pass it to avoid re-reading.	`None`
`delimiter`	`str or None`	Delimiter for the matrix files. `None` (default) splits on any whitespace, which correctly handles both single- and double-space separated outputs (e.g. fastStructure `meanP`/`meanQ` files).	`None`
`p_ext`	`str`	Extension to use for P files when `mat_type="P"`. When set, the extension already present on `orig_file_name` (e.g. `.meanQ` for fastStructure) is stripped via `Path.stem` and replaced with this value. For example, pass `p_ext="meanP"` for fastStructure runs whose P files end in `.meanP`. If None (default), `mat_type` is appended directly to `orig_file_name`.	`None`

Returns:

Name	Type	Description
`mat_by_mode`	`dict`	{mode_name -> np.ndarray of shape (n_features, K_mode)}
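The `p_ext` substitution relies on `Path.stem`, which drops only the last extension of a filename. The sketch below illustrates that step in isolation; joining the stem and `p_ext` with a dot is an assumption, `p_file_name_sketch` is a hypothetical helper, and the filename is made up.

```python
from pathlib import Path


def p_file_name_sketch(orig_file_name: str, p_ext: str) -> str:
    """Swap the last extension of `orig_file_name` for `p_ext`."""
    # Path.stem removes only the final suffix, so 'run1.3.meanQ' -> 'run1.3'.
    return f"{Path(orig_file_name).stem}.{p_ext}"


p_name = p_file_name_sketch("run1.3.meanQ", "meanP")
print(p_name)  # run1.3.meanP
```

Because only the final suffix is removed, dots earlier in the filename (such as the `.3` for K in fastStructure-style names) are preserved.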

`subset_compmodels(comp_res, K_min=None, K_max=None, K_values=None)`

Return a new CompModelsResults object restricted to a subset of K values.

A mode is kept if its number of clusters K = Q.shape[1] satisfies:

- if K_values is not None: K in K_values
- else: K_min <= K <= K_max (with open ends if K_min/K_max is None)

Parameters:

Name	Type	Description	Default
`comp_res`	`CompModelsResults`	Original full comparison results.	required
`K_min`	`int`	Inclusive lower bound for K; no lower bound if None. Ignored if K_values is provided.	`None`
`K_max`	`int`	Inclusive upper bound for K; no upper bound if None. Ignored if K_values is provided.	`None`
`K_values`	`sequence of int`	Explicit set of K values to keep.	`None`

Returns:

Type	Description
`CompModelsResults`	New results object with only the selected modes and updated metadata.
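The selection rule stated above can be written as a small predicate. This is a sketch of the rule as documented, not the function's source; `keep_k` is a hypothetical name.

```python
from typing import Optional, Sequence


def keep_k(
    k: int,
    k_min: Optional[int] = None,
    k_max: Optional[int] = None,
    k_values: Optional[Sequence[int]] = None,
) -> bool:
    """Return True if a mode with k clusters passes the documented filter."""
    # An explicit list of K values takes precedence over the bounds.
    if k_values is not None:
        return k in set(k_values)
    if k_min is not None and k < k_min:
        return False
    if k_max is not None and k > k_max:
        return False
    return True


print([k for k in range(15, 22) if keep_k(k, k_min=17, k_max=19)])    # [17, 18, 19]
print([k for k in range(15, 22) if keep_k(k, k_min=20)])               # [20, 21]
print([k for k in range(15, 22) if keep_k(k, 17, 19, k_values=[21])])  # [21]
```

The last call shows that the bounds are ignored once `K_values` is given.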
