load_cyp3a4_substrate_carbonmangels

skfp.datasets.tdc.adme.load_cyp3a4_substrate_carbonmangels(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False, force_update: bool = False) → DataFrame | tuple[list[str], ndarray]

Load the CYP3A4 subset of the Substrate Carbon-Mangels dataset.

CYP3A4 is an important enzyme found mainly in the liver and the intestine. It oxidizes small foreign organic molecules (xenobiotics), such as toxins or drugs, so that they can be removed from the body [1] [2] [3]. Substrates are drugs that are metabolized by the enzyme. The task is to predict whether a molecule is a substrate of CYP3A4.

All Substrate Carbon-Mangels subsets:

load_cyp2c9_substrate_carbonmangels()

load_cyp2d6_substrate_carbonmangels()

load_cyp3a4_substrate_carbonmangels()

This dataset is part of the “metabolism” subset of ADME tasks.

Tasks	1
Task type	classification
Total samples	670
Recommended split	scaffold
Recommended metric	AUROC
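
To make the recommended metric concrete, the following is a minimal pure-Python sketch of AUROC for binary substrate labels. The helper name `auroc` and the toy labels and scores are illustrative assumptions, not part of skfp; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random positive outranks a random negative.

    Illustrative O(n_pos * n_neg) implementation; fine for small examples.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    # Count pairwise comparisons won by the positive; ties count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = substrate, 0 = non-substrate.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.1]
print(auroc(y_true, y_score))  # 3 of 4 positive/negative pairs are ranked correctly
```

Because AUROC depends only on how predictions are ranked, it should be computed from predicted probabilities or scores rather than from hard 0/1 predictions.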

Parameters:

data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, which defaults to $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array (1D integer binary vector).
verbose (bool, default=False) – If True, a progress bar is shown while downloading or loading files.
force_update (bool, default=False) – If True, always re-download the dataset from HuggingFace Hub, even if it is already present locally. If False, the dataset is downloaded only if it is not yet available locally.

Returns:

data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)
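
A short usage sketch of the two return forms described above might look as follows. The helpers `summarize_labels` and `load_and_summarize` are hypothetical names introduced only for illustration. The sketch assumes skfp is installed and that the dataset can be downloaded on first use.

```python
from collections import Counter


def summarize_labels(labels):
    """Count non-substrates (0) and substrates (1) in a binary label vector."""
    counts = Counter(int(y) for y in labels)
    return {"non_substrate": counts.get(0, 0), "substrate": counts.get(1, 0)}


def load_and_summarize():
    # Imported inside the function so the helper above works without skfp.
    from skfp.datasets.tdc.adme import load_cyp3a4_substrate_carbonmangels

    # Default call: a list of SMILES strings and a 1D NumPy array of labels.
    smiles, labels = load_cyp3a4_substrate_carbonmangels()

    # as_frame=True: a DataFrame with "SMILES" and "label" columns.
    df = load_cyp3a4_substrate_carbonmangels(as_frame=True)

    return len(smiles), summarize_labels(labels), list(df.columns)
```

Calling `load_and_summarize()` downloads the data on first use unless it is already cached in the data directory. Passing `force_update=True` to the loader would re-download it regardless.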

References