load_expansionrx_benchmark

skfp.datasets.expansionrx.load_expansionrx_benchmark(subset: list[str] | None = None, data_dir: str | PathLike | None = None, as_frames: bool = False, verbose: bool = False, force_update: bool = False) → Iterator[tuple[str, DataFrame]] | Iterator[tuple[str, list[str], ndarray]]

Load the ExpansionRx-OpenADMET challenge datasets.

Expansion Therapeutics - OpenADMET challenge [1] [2] datasets come from real-world ADMET experiments run during a series of drug discovery campaigns by Expansion Therapeutics on RNA-mediated diseases. The data was obtained during late-stage optimization and carries time-ordering information: IDs in the “Molecule name” column reflect measurement order.

For more details, see the loading functions for the individual datasets. Allowed dataset names are listed below. Names are case-sensitive, and each dataset's name is returned along with its data.

LogD
KSOL
HLM CLint
RLM CLint
MLM CLint
Caco-2 Permeability Papp A>B
Caco-2 Permeability Efflux
MPPB
MBPB
MGMB

Note that RLM CLint was not part of the original challenge. The organizers provided it afterward as an additional endpoint.
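Because dataset names are case-sensitive, it can help to check a requested subset before passing it to the loader. The following is a minimal sketch, not part of scikit-fingerprints: `ALLOWED_DATASETS` and `check_subset` are hypothetical names introduced here, with the names copied from the list above. How the loader itself reacts to an unrecognized name is not stated on this page.

```python
# Dataset names copied from the list above; matching is case-sensitive.
ALLOWED_DATASETS = (
    "LogD",
    "KSOL",
    "HLM CLint",
    "RLM CLint",
    "MLM CLint",
    "Caco-2 Permeability Papp A>B",
    "Caco-2 Permeability Efflux",
    "MPPB",
    "MBPB",
    "MGMB",
)


def check_subset(subset):
    """Hypothetical helper (not part of skfp): return the names to load.

    None means "all datasets"; otherwise every name must match exactly.
    """
    if subset is None:
        return list(ALLOWED_DATASETS)
    unknown = [name for name in subset if name not in ALLOWED_DATASETS]
    if unknown:
        raise ValueError(f"Unknown dataset names: {unknown}")
    return list(subset)
```

A name such as `"logd"` is rejected by this check because it differs in case from `"LogD"`.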

Parameters:

subset (None or list of strings) – If None, returns all datasets. If a list of strings, loads only the datasets with the given names.
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, by default $HOME/scikit_learn_data.
as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array for each dataset.
verbose (bool, default=False) – If True, a progress bar is shown while downloading or loading files.
force_update (bool, default=False) – If True, always re-download the dataset from HuggingFace Hub, even if it is already present locally. If False, the dataset is downloaded only if it is not yet available locally.

Returns:

data – Loads and yields datasets one at a time with a generator, each together with its dataset name. Returned types depend on the as_frames parameter, either:

- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

generator of tuples (str, pd.DataFrame) or tuples (str, list[str], np.ndarray)
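The function signature indicates that, with as_frames=False, each yielded item is a three-element tuple of dataset name, SMILES list, and labels. The sketch below shows one way to consume such a generator. `summarize` and `toy_benchmark` are hypothetical names, the toy values are made up so the example runs offline, and the one-to-one alignment of SMILES and labels is an assumption rather than something this page guarantees.

```python
def summarize(datasets):
    """Collect the number of molecules per dataset name.

    `datasets` is any iterable of (name, smiles, labels) tuples.
    """
    sizes = {}
    for name, smiles, labels in datasets:
        # Assumption: SMILES and labels are aligned one-to-one.
        if len(smiles) != len(labels):
            raise ValueError(
                f"{name}: {len(smiles)} SMILES vs {len(labels)} labels"
            )
        sizes[name] = len(smiles)
    return sizes


def toy_benchmark():
    """Stand-in generator with made-up values (not real measurements)."""
    yield "LogD", ["CCO", "c1ccccc1"], [0.5, 2.0]
    yield "KSOL", ["CC(=O)O"], [1.0]


sizes = summarize(toy_benchmark())
print(sizes)

# With scikit-fingerprints installed, the real generator could be passed instead:
# from skfp.datasets.expansionrx import load_expansionrx_benchmark
# sizes = summarize(load_expansionrx_benchmark(subset=["LogD", "KSOL"]))
```

Since the loader returns a generator, datasets are produced lazily during iteration, so wrapping the call in `list(...)` is only needed when all datasets must be held in memory at once.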

References