load_biogen_adme_benchmark

skfp.datasets.biogen_adme.load_biogen_adme_benchmark(subset: list[str] | None = None, data_dir: str | PathLike | None = None, as_frames: bool = False, verbose: bool = False, force_update: bool = False) → Iterator[tuple[str, DataFrame]] | Iterator[tuple[str, list[str], ndarray]]

Load the Biogen ADME benchmark datasets.

The Biogen ADME benchmark [1] consists of 3521 diverse compounds from commercial libraries, tested against 6 in vitro ADME endpoints. All label values are log10-transformed. The “Internal ID” column reflects the ordering of the compounds and can be used for temporal splitting.

For more details, see the loading functions for the individual datasets. The allowed dataset names are listed below. Names are case-sensitive, and each dataset's name is returned together with its data.

  • HLM CLint

  • MDR1-MDCK ER

  • Solubility

  • hPPB

  • rPPB

  • RLM CLint
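Because the names are case-sensitive, it can help to check a subset before passing it to the loader. The following is a minimal sketch, not part of the library: `BIOGEN_ADME_NAMES` and `check_subset` are hypothetical names introduced here for illustration, and the way the loader itself reacts to an unknown name is not specified on this page.

```python
from __future__ import annotations

# Dataset names copied from the list above (hypothetical constant, not part of skfp).
BIOGEN_ADME_NAMES = (
    "HLM CLint",
    "MDR1-MDCK ER",
    "Solubility",
    "hPPB",
    "rPPB",
    "RLM CLint",
)


def check_subset(subset: list[str] | None) -> list[str]:
    """Return the dataset names to load, rejecting names not in the list above.

    The comparison is exact, so "hppb" is rejected while "hPPB" is accepted.
    """
    if subset is None:
        # Mirrors the documented behaviour: None selects every dataset.
        return list(BIOGEN_ADME_NAMES)
    unknown = [name for name in subset if name not in BIOGEN_ADME_NAMES]
    if unknown:
        raise ValueError(f"Unknown Biogen ADME dataset names: {unknown}")
    return list(subset)
```

The returned list could then be passed as the `subset` argument.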

Parameters:
  • subset (None or list of strings, default=None) – If None, all datasets are loaded. If a list of strings, only the datasets with the given names are loaded.

  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, by default $HOME/scikit_learn_data.

  • as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array for each dataset.

  • verbose (bool, default=False) – If True, a progress bar is shown while downloading or loading files.

  • force_update (bool, default=False) – If True, always re-download the dataset from HuggingFace Hub, even if it is already present locally. If False, the dataset is downloaded only if it is not yet available locally.

Returns:

data – Datasets are loaded lazily and returned one at a time by a generator. Each yielded item starts with the dataset name. The remaining elements depend on the as_frames parameter and are either:

  • Pandas DataFrame with columns: “SMILES”, “label”

  • list of strings (SMILES) and NumPy array (labels)

Return type:

generator of tuples (str, pd.DataFrame) or tuples (str, list[str], np.ndarray)
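As an illustration of consuming the generator with the default as_frames=False, here is a minimal sketch. `summarize_benchmark` and `main` are hypothetical helpers written for this example, not part of the library. The call inside `main` assumes scikit-fingerprints is installed and that the data can be downloaded or is already cached locally.

```python
from __future__ import annotations

from collections.abc import Iterable, Sequence


def summarize_benchmark(
    datasets: Iterable[tuple[str, Sequence[str], Sequence[float]]],
) -> dict[str, int]:
    """Consume (name, SMILES, labels) tuples and count molecules per dataset."""
    counts: dict[str, int] = {}
    for name, smiles, labels in datasets:
        # SMILES and labels are expected to be aligned one-to-one.
        if len(smiles) != len(labels):
            raise ValueError(f"{name}: {len(smiles)} SMILES but {len(labels)} labels")
        counts[name] = len(smiles)
    return counts


def main() -> None:
    # Imported lazily so the helper above works without scikit-fingerprints.
    from skfp.datasets.biogen_adme import load_biogen_adme_benchmark

    # Load two of the six datasets; the generator yields them one at a time.
    benchmark = load_biogen_adme_benchmark(subset=["Solubility", "hPPB"])
    for name, n_molecules in summarize_benchmark(benchmark).items():
        print(f"{name}: {n_molecules} molecules")
```

Calling `main()` prints one line per loaded dataset. Because the result is a generator, it can be iterated only once; wrap it in `list(...)` if the datasets are needed again.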

References