load_moleculeace_splits

skfp.datasets.moleculeace.load_moleculeace_splits(dataset_name: str, split_type: str = 'activity_cliff', data_dir: str | PathLike | None = None, as_dict: bool = False, verbose: bool = False, force_update: bool = False) → tuple[list[int], list[int]] | dict[str, list[int]]

Load pre-generated dataset splits for the MoleculeACE benchmark.

MoleculeACE [1] provides two stratified split types based on activity-cliff membership. Each divides the data into train and test partitions:

  • random

  • activity_cliff

Random splits use an 80/20 train/test split. Activity cliff splits additionally restrict the test set to molecules that are part of activity-cliff pairs. Activity cliff splits are recommended in the literature.

Dataset names are the same as those returned by load_moleculeace_benchmark() and are case-sensitive.
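As a minimal sketch of how the split indexes might be applied, the helper below selects the train and test parts of two parallel sequences. The names `take` and `split_dataset` are hypothetical and not part of scikit-fingerprints. The sketch assumes that the returned indexes are zero-based positions into the dataset and that `smiles` and `labels` were loaded separately for the same dataset.

```python
from collections.abc import Sequence


def take(items: Sequence, indexes: Sequence[int]) -> list:
    """Select the elements of `items` at the given integer positions."""
    return [items[i] for i in indexes]


def split_dataset(smiles, labels, dataset_name, split_type="activity_cliff"):
    """Hypothetical helper: return (X_train, y_train, X_test, y_test)."""
    # Imported lazily so that `take` stays usable without scikit-fingerprints.
    from skfp.datasets.moleculeace import load_moleculeace_splits

    # as_dict=True gives index lists keyed by split name.
    splits = load_moleculeace_splits(
        dataset_name, split_type=split_type, as_dict=True
    )
    train_idx, test_idx = splits["train"], splits["test"]

    return (
        take(smiles, train_idx),
        take(labels, train_idx),
        take(smiles, test_idx),
        take(labels, test_idx),
    )
```

For instance, `split_dataset(smiles, labels, "CHEMBL204_Ki")` would request the activity cliff split; the dataset name here is only illustrative and must match one of the benchmark names exactly, including case.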

Parameters:
  • dataset_name (str) – Name of the dataset to load splits for.

  • split_type ({"random", "activity_cliff"}, default="activity_cliff") – Type of the split to load.

  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, by default $HOME/scikit_learn_data.

  • as_dict (bool, default=False) – If True, returns the splits as a dictionary with keys “train” and “test”, and index lists as values. Otherwise, returns two lists of split indexes (train, test).

  • verbose (bool, default=False) – If True, a progress bar is shown while downloading or loading files.

  • force_update (bool, default=False) – If True, always re-download the dataset from HuggingFace Hub, even if it is already present locally. If False, the dataset is downloaded only if it is not yet available locally.

Returns:

data – Depending on the as_dict argument, one of:

  • two lists of integer indexes (train, test)

  • dictionary with “train” and “test” keys, and lists of split indexes as values

Return type:

tuple(list[int], list[int]) or dict
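A quick sanity check on the dictionary form of the result can be sketched as follows. The function `summarize_splits` is a hypothetical helper, not part of the library. It assumes the indexes are zero-based positions into a dataset of `n_samples` molecules, and it verifies that the parts contain no duplicates, do not overlap, and stay in range before reporting the fraction of the dataset in each part.

```python
def summarize_splits(
    splits: dict[str, list[int]], n_samples: int
) -> dict[str, float]:
    """Validate split indexes and return the dataset fraction in each part."""
    seen: set[int] = set()
    for name, indexes in splits.items():
        part = set(indexes)
        if len(part) != len(indexes):
            raise ValueError(f"duplicate indexes inside the {name!r} split")
        if part & seen:
            raise ValueError(f"the {name!r} split overlaps another split")
        if any(i < 0 or i >= n_samples for i in part):
            raise ValueError(f"the {name!r} split has out-of-range indexes")
        seen |= part

    # Fraction of the whole dataset covered by each part.
    return {name: len(indexes) / n_samples for name, indexes in splits.items()}
```

The dictionary returned with `as_dict=True` can be passed to this helper directly. With the activity cliff split, the reported fractions need not match a nominal 80/20 ratio, so the check does not enforce one.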

References