
I/O

io.py

Functions for loading clumppling's main and compModel outputs, as well as other related data.

Classes

ClumpplingResults dataclass

Container for all core clumppling outputs needed for analysis/plots.

Attributes:

Name Type Description
align_dir Path

Directory containing clumppling outputs (e.g. output/clumppling/pbmc10k-tutorial_hc_output).

suffix str

Suffix used in aligned Q filenames (e.g. "rep" or "avg").

mode_alignment DataFrame

DataFrame loaded from 'mode_alignment.txt'.

mode_stats DataFrame

DataFrame loaded from 'mode_stats.txt'.

modes List[str]

Flat list of mode names (e.g. ["K5M1", "K5M2", ...]).

mode_K Dict[str, int]

Mapping from mode name to K for that mode.

K_range List[int]

Sorted unique K values across all modes.

K_max int

Maximum K value across all modes.

mode_names_list List[List[str]]

Grouped mode names by K (same structure as notebook).

Q_by_mode Dict[str, ndarray]

Mapping from mode name to aligned membership matrix.

alignment_acrossK Dict[str, Sequence[int]]

{"A-B" -> mapping from B->A (original indices)}

cost_acrossK Dict[str, float]

Mapping from pair label (e.g. "A-B") to the alignment cost for that pair.

all_modes_alignment Dict[str, Sequence[int]]

{mode_name -> reordering (aligned columns)}

mode_coord_dict Dict[str, Tuple[int, int]]

{mode_name -> (row_idx, col_idx)} grid by K.

mode_sep_coord_dict Dict[Tuple[str, int], Tuple[int, int]]

{(mode_name, cls_idx) -> (row_idx, col_idx)}.

input_meta DataFrame | None

DataFrame loaded from 'input_meta.txt', if available.

Q_unaligned_by_mode Dict[str, ndarray] | None

{mode_name -> unaligned membership matrix}, if loaded.

P_unaligned_by_mode Dict[str, ndarray] | None

{mode_name -> unaligned feature matrix}, if loaded.

P_aligned_by_mode Dict[str, ndarray] | None

{mode_name -> aligned feature matrix}, if loaded.

Functions
reorder_inds(reorder_idx)

Return a new ClumpplingResults with all Q matrices reordered according to reorder_idx.

CompModelsResults dataclass

Container for clumppling.compModels outputs and associated metadata.

Attributes:

Name Type Description
res_dir Path

Directory containing compModels outputs (e.g. output/comp_models/pbmc10k-tutorial_hc_output).

input_dir Optional[Path]

Directory containing per-model input stats for compModels (e.g. output/comp_models/pbmc10k-tutorial_hc).

models List[str]

Model names (e.g. "rna.seurat.louvain", "rna.seurat.leiden", ...).

modes_by_model Dict[str, List[str]]

For each model, a list of short mode names with the model prefix stripped, e.g. {"rna.seurat.louvain": ["K21M1", "K21M2", ...]}.

full_mode_names_by_model Dict[str, List[str]]

For each model, the full mode names as used in filenames, e.g. {"rna.seurat.louvain": ["rna.seurat.louvain_K21M1", ...]}.

full_mode_names List[str]

Flat list of all full mode names across all models.

Q_by_mode Dict[str, ndarray]

Mapping full mode name -> aligned membership matrix Q loaded from res_dir / "aligned" / f"{mode}.Q".

all_modes_alignment Dict[str, Sequence[int]]

Mapping full mode name -> global alignment pattern, parsed from res_dir / "aligned" / "all_modes_alignment.txt" if present. (Keys are full mode names; values are index patterns.)

alignment_across_all Optional[Dict[str, Sequence[int]]]

Reserved for cross-mode alignment patterns (currently left as None unless you later decide to parse additional files).

cost_across_all Optional[Dict[str, float]]

Reserved for cross-mode alignment costs (currently None).

mode_stats_by_model Dict[str, DataFrame]

For each model, its original mode_stats DataFrame loaded from input_dir / f"{model}_mode_stats.txt", if available.

K_max int
K_max_by_model Dict[str, int]
Functions
get_Q(full_mode_name)

Return the aligned Q matrix for a full mode name.

get_Q_for(model, mode_short)

Return the aligned Q matrix for a (model, short_mode_name) pair, where short_mode_name is e.g. 'K21M1'.
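The relationship between short and full mode names can be illustrated with a small sketch. It assumes, based only on the examples above (e.g. "rna.seurat.louvain_K21M1"), that a full mode name is the model name and the short mode name joined by an underscore. The helper names `full_mode_name` and `get_Q_for_sketch` are hypothetical and not part of the module.

```python
def full_mode_name(model, mode_short):
    """Build a full mode name, assuming the '<model>_<short mode>' convention."""
    return f"{model}_{mode_short}"


def get_Q_for_sketch(Q_by_mode, model, mode_short):
    """Look up the aligned Q matrix for a (model, short mode name) pair.

    Q_by_mode is assumed to be keyed by full mode names.
    """
    key = full_mode_name(model, mode_short)
    try:
        return Q_by_mode[key]
    except KeyError:
        raise KeyError(f"No aligned Q matrix for mode {key!r}") from None


# Toy data: one individual, two clusters (plain lists stand in for arrays).
Q_by_mode = {"rna.seurat.louvain_K21M1": [[0.4, 0.6]]}
print(get_Q_for_sketch(Q_by_mode, "rna.seurat.louvain", "K21M1"))
```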

reorder_inds(reorder_idx)

Return a new CompModelsResults with all Q matrices reordered according to reorder_idx.

Functions

add_pairwise_alignment(res, alignment)

Return a new ClumpplingResults with alignment updated.

Parameters:

Name Type Description Default
res ClumpplingResults

Original results object.

required
alignment dict

New alignment patterns per mode, e.g. {'K17M1': [2,0,1], ...}

required

Returns:

Type Description
ClumpplingResults

New results object with updated Q_by_mode, P_aligned_by_mode, and all_modes_alignment according to the new alignment.
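As an illustration of what applying an alignment pattern to a matrix could look like, here is a minimal sketch. It assumes that position j of the pattern names the original column that ends up in column j; the module may use the inverse convention, so treat the direction as an assumption. `reorder_columns` is a hypothetical helper, and plain lists stand in for NumPy arrays to keep the example dependency-free.

```python
def reorder_columns(matrix, pattern):
    """Return a copy of `matrix` whose column j is the original column pattern[j].

    Assumed convention only; the real implementation may permute the other way.
    """
    if sorted(pattern) != list(range(len(pattern))):
        raise ValueError(f"Not a permutation of 0..{len(pattern) - 1}: {pattern}")
    return [[row[j] for j in pattern] for row in matrix]


Q = [[0.1, 0.2, 0.7],
     [0.5, 0.5, 0.0]]
alignment = {"K3M1": [2, 0, 1]}  # same shape as the example pattern above

print(reorder_columns(Q, alignment["K3M1"]))
```

With a NumPy array the same operation under this convention would be the fancy-indexing expression `Q[:, pattern]`.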

filter_bed_by_peaks(bed_path, peaks, *, ccre_id_col=3)

Stream a BED file and keep only lines that overlap any of the given peaks.

Parameters:

Name Type Description Default
bed_path str or Path

Path to the BED file to filter.

required
peaks iterable of str

Iterable of peak strings, e.g. ['chr1:10109-10357', ...]

required
ccre_id_col int

Column index (0-based) in the BED file where the cCRE ID is located.

3

Returns:

Name Type Description
filtered_rows list of list[str]

Each inner list is the BED line split by ' '.

kept_ids set of str

Set of cCRE IDs (from column ccre_id_col) for filtered rows.
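A minimal sketch of this kind of streaming filter is shown below. It is not the module's implementation. It assumes tab-separated BED lines with half-open intervals, treats "overlap" as any shared base, and skips blank and `#` comment lines. The `sep` parameter and the `_sketch` function name are hypothetical additions for illustration.

```python
from collections import defaultdict


def parse_peak(peak):
    """Split a peak string such as 'chr1:10109-10357' into (chrom, start, end)."""
    chrom, span = peak.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)


def filter_bed_by_peaks_sketch(bed_path, peaks, *, ccre_id_col=3, sep="\t"):
    """Stream a BED file, keeping rows that overlap at least one peak."""
    peaks_by_chrom = defaultdict(list)
    for peak in peaks:
        chrom, start, end = parse_peak(peak)
        peaks_by_chrom[chrom].append((start, end))

    filtered_rows, kept_ids = [], set()
    with open(bed_path) as handle:
        for line in handle:  # one line in memory at a time
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            fields = line.split(sep)
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            # Two intervals overlap when each one starts before the other ends.
            if any(start < p_end and p_start < end
                   for p_start, p_end in peaks_by_chrom.get(chrom, ())):
                filtered_rows.append(fields)
                kept_ids.add(fields[ccre_id_col])
    return filtered_rows, kept_ids
```

The linear scan over peaks per line is fine for short peak lists; for many peaks, sorting them per chromosome and using `bisect` would be the natural next step.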

Stream a gene-link file and keep only rows whose cCRE ID is in kept_ids.

Parameters:

Name Type Description Default
gene_links_path str or Path

Path to the gene-link file to filter.

required
kept_ids set of str

Set of cCRE IDs to keep.

required
ccre_id_col int

Column index (0-based) in the gene-link file where the cCRE ID is located.

0
keep_header bool

Whether to keep and return the header line (first line) of the file.

True

Returns:

Name Type Description
header list[str] or None

Header columns if keep_header=True and file has at least one line, otherwise None.

filtered_rows list of list[str]

Data rows (split by ' ') with cCRE IDs in kept_ids.
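The filtering described here can be sketched as follows. The function name `filter_gene_links_sketch` and the `sep` parameter are hypothetical, the file is assumed to be tab-separated, and when `keep_header=False` the first line is assumed to be ordinary data; the actual function may differ on any of these points.

```python
def filter_gene_links_sketch(gene_links_path, kept_ids, *,
                             ccre_id_col=0, keep_header=True, sep="\t"):
    """Stream a gene-link file, keeping rows whose cCRE ID is in kept_ids."""
    header = None
    filtered_rows = []
    with open(gene_links_path) as handle:
        if keep_header:
            first = handle.readline()
            if first:  # an empty file leaves header as None
                header = first.rstrip("\n").split(sep)
        for line in handle:
            fields = line.rstrip("\n").split(sep)
            # Guard against short or blank lines before indexing.
            if len(fields) > ccre_id_col and fields[ccre_id_col] in kept_ids:
                filtered_rows.append(fields)
    return header, filtered_rows
```

Passing the `kept_ids` set returned by a BED filtering step keeps the membership test at constant time per row.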

group_modes_by_K(mode_names)

Group a flat list of mode names into a list-of-lists by K.

Given a list like:

['K17M1', 'K17M2', 'K18M1', 'K19M1']

return:

[['K17M1', 'K17M2'], ['K18M1'], ['K19M1']]

This mirrors the notebook helper.

Parameters:

Name Type Description Default
mode_names sequence of str
required

Returns:

Name Type Description
mode_names_list list of list of str

One sublist per K, ordered by increasing K.
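One way such a grouping could be written is sketched below. It assumes mode names follow the `K<number>M<number>` pattern seen in the examples; the regular expression and the `_sketch` name are illustrative and not taken from the module.

```python
import re
from collections import defaultdict

# Assumed naming pattern: 'K' + number of clusters + 'M' + mode index.
_MODE_RE = re.compile(r"^K(\d+)M(\d+)$")


def group_modes_by_K_sketch(mode_names):
    """Group names like 'K17M1' into one sublist per K, ordered by increasing K."""
    groups = defaultdict(list)
    for name in mode_names:
        match = _MODE_RE.match(name)
        if match is None:
            raise ValueError(f"Unrecognized mode name: {name!r}")
        groups[int(match.group(1))].append(name)
    # Sort on the integer K so that K9 comes before K10.
    return [groups[k] for k in sorted(groups)]


print(group_modes_by_K_sketch(["K17M1", "K17M2", "K18M1", "K19M1"]))
# [['K17M1', 'K17M2'], ['K18M1'], ['K19M1']]
```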

infer_K_range(mode_names)

Convenience helper to get sorted K values from mode names.

Parameters:

Name Type Description Default
mode_names sequence of str, e.g. ['K17M1', 'K17M2', 'K18M1']
required

Returns:

Name Type Description
K_range list of int, e.g. [17, 18]
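A compact sketch of this helper, assuming the same `K<number>M<number>` naming pattern as in the examples (the `_sketch` name is hypothetical):

```python
import re


def infer_K_range_sketch(mode_names):
    """Return the sorted unique K values found in names like 'K17M1'."""
    k_values = set()
    for name in mode_names:
        match = re.match(r"^K(\d+)M\d+$", name)
        if match is None:
            raise ValueError(f"Unrecognized mode name: {name!r}")
        k_values.add(int(match.group(1)))
    return sorted(k_values)


print(infer_K_range_sketch(["K17M1", "K17M2", "K18M1"]))
# [17, 18]
```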

load_aligned_Qs(align_dir, modes, suffix='rep', *, delimiter=' ')

Load aligned membership matrices for each mode from 'modes_aligned'.

Files are expected to be named like: <mode>_<suffix>.Q, e.g. "K17M1_rep.Q".

Parameters:

Name Type Description Default
align_dir path-like

Main clumppling output directory (same as used in load_mode_alignment).

required
modes sequence of str

Mode names to load, e.g. from _get_mode_names(...).

required
suffix ('rep', 'avg')

Suffix used by clumppling when writing aligned modes.

"rep"
delimiter str

Delimiter for the Q files.

" "

Returns:

Name Type Description
Q_by_mode dict

{mode_name -> np.ndarray of shape (n_individuals, K_mode)}
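The file layout described above can be sketched with the standard library alone. This is an illustration under assumptions: the `modes_aligned` subdirectory sits directly under `align_dir`, each line of a `.Q` file is one individual, and nested lists stand in for the NumPy arrays the real function returns. The `_sketch` name is hypothetical.

```python
from pathlib import Path


def load_aligned_Qs_sketch(align_dir, modes, suffix="rep", *, delimiter=" "):
    """Read '<align_dir>/modes_aligned/<mode>_<suffix>.Q' for each mode."""
    aligned_dir = Path(align_dir) / "modes_aligned"
    Q_by_mode = {}
    for mode in modes:
        q_path = aligned_dir / f"{mode}_{suffix}.Q"
        rows = []
        with open(q_path) as handle:
            for line in handle:
                # Drop empty tokens so repeated delimiters do not break parsing.
                values = [float(x) for x in line.strip().split(delimiter) if x]
                if values:
                    rows.append(values)
        Q_by_mode[mode] = rows  # n_individuals rows, K_mode columns
    return Q_by_mode
```

With NumPy available, `np.loadtxt(q_path, delimiter=delimiter)` would be the usual way to read each file into an array.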

load_alignment_across_K(align_file)

Load alignment_acrossK and cost_acrossK from the file written by clumppling.write_alignment_across_k.

Parameters:

Name Type Description Default
align_file str | PathLike

Path to the alignment file.

required

Returns:

Name Type Description
alignment_acrossK dict

{pair_label -> alignment pattern}, where alignment pattern is a sequence of ints mapping clusters between two modes.

cost_acrossK dict

{pair_label -> float cost}

load_all_modes_alignment(align_dir, suffix='rep', *, filename=None)

Load all_modes_alignment from the file written by clumppling.write_reordered_across_k.

Parameters:

Name Type Description Default
align_dir path-like

Main clumppling output directory.

required
suffix ('rep', 'avg')

Suffix used when writing the all-modes alignment file.

"rep"
filename str

Override the default filename if needed. If None, uses f"all_modes_alignment_{suffix}.txt".

None

Returns:

Name Type Description
all_modes_alignment dict

{mode_label -> alignment_pattern}, where alignment_pattern is obtained by applying str_to_pattern to the stored pattern string.

load_clumppling_results(align_dir, *, suffix='rep', round_Q=False, cls_dir=None, load_unaligned=False, load_P=True, strict_P=False, p_ext=None)

Load clumppling results from the specified directory.

Parameters:

Name Type Description Default
align_dir path-like

The clumppling output directory used as -o/--output.

required
suffix ('rep', 'avg')

Suffix used in the aligned Q filenames, e.g. "K17M1_rep.Q".

"rep"
round_Q bool

If True, apply np.rint to each aligned Q matrix to get hard cluster memberships.

False
cls_dir path-like

Directory containing the original clustering outputs (*.P files). If provided and load_P is True, P matrices will be loaded.

None
load_P bool

Whether to attempt loading P matrices at all. Set to False if you know this run has no P files (e.g. hard clustering only).

True
strict_P bool

If True, missing P files raise FileNotFoundError. If False, missing P files will emit a warning and skip P loading.

False
p_ext str

File extension for P matrices. Use this when the tool writes P files with a non-standard extension, e.g. p_ext="meanP" for fastStructure (whose P files end in .meanP rather than .P). When None (default), the extension is derived from mat_type as before.

None

load_compmodels_results(res_dir, input_dir=None)

Load outputs from clumppling.compModels into a CompModelsResults object.

Parameters:

Name Type Description Default
res_dir str or Path

Directory containing compModels outputs, e.g. .../output/comp_models/pbmc10k-tutorial_hc_output

required
input_dir str or Path

Directory containing per-model input stats used for compModels, e.g. .../output/comp_models/pbmc10k-tutorial_hc. If None, mode_stats_by_model will be empty.

None

Returns:

Type Description
CompModelsResults

Structured container for multi-model Q matrices, mode lists, global alignment patterns, and per-model mode_stats.

load_gene_intervals(gtf_file, *, upstream=5000, downstream=0, feature_type='gene', source='HAVANA', gene_type_allowlist=None)

Stream a (possibly gzipped) GTF and extract only the intervals needed. Returns dict: chrom -> sorted list of (start, end, gene_name).

load_gene_set(name, gene_set_dir, *, prefix=None)

Load a gene-set file (one symbol per line) from a directory.

Parameters:

Name Type Description Default
name str

Gene-set name, used as the file stem (e.g. "HALLMARK_E2F_TARGETS" or just "E2F_TARGETS" when prefix="HALLMARK_").

required
gene_set_dir path-like

Directory containing .txt gene-set files.

required
prefix str

If provided and name does not already start with it, the prefix is prepended before constructing the filename. Useful for MSigDB Hallmark collections where files are named HALLMARK_<name>.txt.

None

Returns:

Type Description
list of str

Gene symbols, one per line, with blank lines removed.
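The prefix handling and file lookup described above can be sketched as follows. This is a minimal illustration rather than the module's code; the `_sketch` name is hypothetical, and surrounding whitespace on each line is assumed to be insignificant.

```python
from pathlib import Path


def load_gene_set_sketch(name, gene_set_dir, *, prefix=None):
    """Read '<gene_set_dir>/<name>.txt' and return its non-blank lines."""
    # Prepend the prefix only when the name does not already carry it.
    if prefix and not name.startswith(prefix):
        name = prefix + name
    path = Path(gene_set_dir) / f"{name}.txt"
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

Under these assumptions, `load_gene_set_sketch("E2F_TARGETS", d, prefix="HALLMARK_")` and `load_gene_set_sketch("HALLMARK_E2F_TARGETS", d, prefix="HALLMARK_")` both read `HALLMARK_E2F_TARGETS.txt`.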

load_input_meta(align_dir)

Load the 'input_meta.txt' table that links original Q/P files to modes.

Parameters:

Name Type Description Default
align_dir path-like

The main clumppling output directory.

required

Returns:

Name Type Description
input_meta DataFrame

load_mode_alignment(align_dir)

Load the 'mode_alignment.txt' table produced by clumppling.

Parameters:

Name Type Description Default
align_dir path-like

The main output directory passed as -o/--output to clumppling.

required

Returns:

Name Type Description
mode_alignment DataFrame

load_mode_stats(align_dir)

Load the 'mode_stats.txt' table produced by clumppling.

Parameters:

Name Type Description Default
align_dir path-like

The main output directory passed as -o/--output to clumppling.

required

Returns:

Name Type Description
mode_stats DataFrame

load_unaligned_for_modes(cls_dir, align_dir, *, mat_type, modes=None, mode_stats=None, input_meta=None, delimiter=None, p_ext=None)

Load the unaligned P or Q matrices (selected by mat_type) for each mode.

Parameters:

Name Type Description Default
cls_dir path-like

Directory that contains the original MMC / clustering outputs, i.e. where the '*.P' files live.

required
align_dir path-like

Main clumppling output directory (used to load mode_stats and input_meta if they are not provided).

required
mat_type Literal['P', 'Q']

Type of matrix to load. Currently "P" and "Q" are supported.

required
modes sequence of str

Mode names to load. If None, they are inferred from mode_alignment.

None
mode_stats DataFrame

If already loaded, pass it to avoid re-reading.

None
input_meta DataFrame

If already loaded, pass it to avoid re-reading.

None
delimiter str or None

Delimiter for the matrix files. None (default) splits on any whitespace, which correctly handles both single- and double-space separated outputs (e.g. fastStructure meanP/meanQ files).

None
p_ext str

Extension to use for P files when mat_type="P". When set, the extension already present on orig_file_name (e.g. .meanQ for fastStructure) is stripped via Path.stem and replaced with this value. For example, pass p_ext="meanP" for fastStructure runs whose P files end in .meanP. If None (default), the old behaviour is preserved: mat_type is appended directly to orig_file_name.

None

Returns:

Name Type Description
mat_by_mode dict

{mode_name -> np.ndarray of shape (n_features, K_mode)}
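The extension swap described for `p_ext` can be illustrated with `pathlib`. The sketch covers only the case where `p_ext` is set; the default behaviour (`p_ext=None`) is left out because the exact concatenation is not spelled out above. The helper name `p_file_name_with_ext` and the sample file name are hypothetical.

```python
from pathlib import Path


def p_file_name_with_ext(orig_file_name, p_ext):
    """Replace the last extension of orig_file_name with p_ext.

    Path.stem drops only the final suffix, so 'run.3.meanQ' becomes 'run.3'.
    """
    stem = Path(orig_file_name).stem
    return f"{stem}.{p_ext.lstrip('.')}"  # accept 'meanP' or '.meanP'


print(p_file_name_with_ext("run.3.meanQ", "meanP"))
# run.3.meanP
```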

subset_compmodels(comp_res, K_min=None, K_max=None, K_values=None)

Return a new CompModelsResults object restricted to a subset of K values.

A mode is kept if its number of clusters K = Q.shape[1] satisfies:

- if K_values is not None: K in K_values
- else: K_min <= K <= K_max (with open ends if K_min/K_max is None)

Parameters:

Name Type Description Default
comp_res CompModelsResults

Original full comparison results.

required
K_min int

Lower bound for K (inclusive). Ignored if K_values is provided.

None
K_max int

Upper bound for K (inclusive). Ignored if K_values is provided.

None
K_values sequence of int

Explicit set of K values to keep.

None

Returns:

Type Description
CompModelsResults

New results object with only the selected modes and updated metadata.
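The selection rule for K stated above can be written as a small predicate. This sketch covers only the rule itself, not the construction of the new results object; `keep_mode` is a hypothetical helper, and K is passed in directly rather than read from `Q.shape[1]`.

```python
def keep_mode(K, K_min=None, K_max=None, K_values=None):
    """Decide whether a mode with K clusters passes the subset filter."""
    if K_values is not None:
        # An explicit list takes precedence; K_min and K_max are ignored.
        return K in set(K_values)
    if K_min is not None and K < K_min:
        return False
    if K_max is not None and K > K_max:
        return False
    return True  # a bound of None leaves that end open


K_by_mode = {"m_K5M1": 5, "m_K6M1": 6, "m_K8M1": 8}  # toy data
kept = [mode for mode, K in K_by_mode.items() if keep_mode(K, K_min=6)]
print(kept)
# ['m_K6M1', 'm_K8M1']
```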