load_lrgb_mol_splits

skfp.datasets.lrgb.load_lrgb_mol_splits(dataset_name: str, valid_sequences_only: bool = False, data_dir: str | PathLike | None = None, as_dict: bool = False, verbose: bool = False, force_update: bool = False) → tuple[list[int], list[int], list[int]] | dict[str, list[int]]

Load the official LRGB splits for molecular datasets.

The Long Range Graph Benchmark (LRGB) [1] uses a precomputed stratified random split for both the Peptides-func and Peptides-struct datasets.

Dataset names here are the same as those returned by the load_moleculenet_benchmark function, and are case-sensitive.

Parameters:

dataset_name ({"Peptides-func", "Peptides-struct"}) – Name of the dataset to load splits for.
valid_sequences_only (bool, default=False) – Whether to load only rows with valid amino acid sequences, which can be loaded as RDKit Mol objects. This removes some sequences that have valid SMILES but use custom notation for chemical modifications.
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, by default $HOME/scikit_learn_data.
as_dict (bool, default=False) – If True, returns the splits as a dictionary with keys “train”, “valid” and “test”, and index lists as values. Otherwise, returns three lists of split indexes.
verbose (bool, default=False) – If True, a progress bar is shown while downloading or loading files.
force_update (bool, default=False) – If True, always re-download the dataset from HuggingFace Hub, even if it is already present locally. If False, the dataset is downloaded only if it is not yet available locally.
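
As an illustration of the parameters above, the following is a minimal sketch of a wrapper around the call. The helpers build_split_kwargs and fetch_official_splits are hypothetical names introduced here and are not part of scikit-fingerprints. Only the keyword names come from the signature documented on this page. The early name check is an assumption made so that a miscased dataset name fails before any download is attempted.

```python
# The two dataset names documented for this function (case-sensitive).
VALID_NAMES = ("Peptides-func", "Peptides-struct")


def build_split_kwargs(dataset_name, data_dir=None):
    """Hypothetical helper: assemble keyword arguments for load_lrgb_mol_splits."""
    if dataset_name not in VALID_NAMES:
        raise ValueError(
            f"Unknown dataset name {dataset_name!r}, expected one of {VALID_NAMES}"
        )
    return {
        "dataset_name": dataset_name,
        "valid_sequences_only": False,  # keep all rows
        "data_dir": data_dir,           # None -> scikit-learn data directory
        "as_dict": True,                # dict with "train", "valid", "test" keys
        "verbose": False,               # no progress bar
        "force_update": False,          # reuse local files when present
    }


def fetch_official_splits(dataset_name, data_dir=None):
    """Hypothetical wrapper: load the official splits as a dictionary."""
    # Imported lazily so the helper above works without scikit-fingerprints.
    from skfp.datasets.lrgb import load_lrgb_mol_splits

    return load_lrgb_mol_splits(**build_split_kwargs(dataset_name, data_dir))
```

Calling fetch_official_splits("Peptides-func") needs scikit-fingerprints installed and may download files on first use, so the call is left out of the sketch itself.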

Returns:

data – Depending on the as_dict argument, one of:
- three lists of integer indexes
- a dictionary with “train”, “valid” and “test” keys, whose values are lists of split indexes

Return type:

tuple(list[int], list[int], list[int]) or dict
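
Because the return value has two possible forms, downstream code may want to normalize it before use. The sketch below shows one way to do that and to select dataset rows by index. The helpers as_split_dict and take are hypothetical, and the SMILES strings and index lists are toy stand-ins, not real LRGB data.

```python
def as_split_dict(splits):
    """Hypothetical helper: normalize either return form to a dictionary."""
    if isinstance(splits, dict):
        return {name: list(splits[name]) for name in ("train", "valid", "test")}
    # Otherwise assume the tuple form: (train, valid, test).
    train, valid, test = splits
    return {"train": list(train), "valid": list(valid), "test": list(test)}


def take(items, indexes):
    """Select the elements of items at the given integer positions."""
    return [items[i] for i in indexes]


# Toy stand-ins for a dataset and its splits (not real LRGB data).
smiles = ["C", "CC", "CCC", "CCO", "CCN"]
toy_tuple = ([0, 3], [1], [2, 4])
toy_dict = {"train": [0, 3], "valid": [1], "test": [2, 4]}

parts = as_split_dict(toy_tuple)
train_smiles = take(smiles, parts["train"])
valid_smiles = take(smiles, parts["valid"])
test_smiles = take(smiles, parts["test"])
```

The same indexing applies to labels or precomputed fingerprints, provided they are ordered like the rows the splits were loaded for and the same valid_sequences_only setting was used for both.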

References