BoundingBoxADChecker
- class skfp.applicability_domain.BoundingBoxADChecker(percentile_lower: float | str = 0, percentile_upper: float | str = 100, num_allowed_violations: int = 0, n_jobs: int | None = None, verbose: int | dict = 0)
Bounding box method.
Defines applicability domain based on feature ranges in the training data. This creates a “bounding box” using their extreme values, and new molecules should lie in this distribution, i.e. have properties in the same ranges [1].
Typically, physicochemical properties (continuous features) are used as inputs. Consider scaling, normalizing, or transforming them before computing AD to lessen the effect of outliers, e.g. with PowerTransformer or RobustScaler. This is particularly important if "three_sigma" is used as the percentile bound, as it assumes a normal distribution.
By default, the full range of training descriptors is allowed as AD. For a stricter check, use the percentile_lower and percentile_upper arguments to disallow extremely low or high values, respectively. For a looser check, use num_allowed_violations to allow a number of descriptors to lie outside the given ranges.
This method scales very well with both the number of samples and features.
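The interplay of percentile bounds and allowed violations can be illustrated with a NumPy-only sketch. This is not the library's implementation: the helper names `fit_bounds` and `inside_box` are hypothetical, the bounds are assumed to be inclusive, and `np.percentile` with its default linear interpolation is used as a stand-in for however the class computes percentiles.

```python
import numpy as np


def fit_bounds(X_train, lower=0.0, upper=100.0):
    """Per-feature bounds taken as percentiles of the training data (hypothetical helper)."""
    lower_bounds = np.percentile(X_train, lower, axis=0)
    upper_bounds = np.percentile(X_train, upper, axis=0)
    return lower_bounds, upper_bounds


def inside_box(X, lower_bounds, upper_bounds, num_allowed_violations=0):
    """True for rows with at most `num_allowed_violations` out-of-range features."""
    outside = (X < lower_bounds) | (X > upper_bounds)
    return outside.sum(axis=1) <= num_allowed_violations


X_train = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0], [4.0, 50.0]])
X_new = np.array([[2.0, 30.0], [0.0, 30.0], [0.0, 50.0]])

# Full range (0th to 100th percentile): every row lies inside the box.
full = inside_box(X_new, *fit_bounds(X_train))

# Stricter check: 25th to 75th percentile gives ranges [1, 3] and [20, 40].
strict = inside_box(X_new, *fit_bounds(X_train, 25, 75))

# Looser check: same ranges, but one out-of-range feature per row is tolerated.
loose = inside_box(X_new, *fit_bounds(X_train, 25, 75), num_allowed_violations=1)

print(full, strict, loose)
```

Narrowing the percentiles rejects the second and third rows, while allowing one violation readmits the second row, which is out of range in only one feature.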
- Parameters:
percentile_lower (float or "three_sigma", default=0) – Lower bound of accepted feature value ranges. Float or integer value is interpreted as a percentile of descriptors in the training data for each feature. "three_sigma" uses 3 standard deviations from the mean, a common rule-of-thumb for outliers assuming the normal distribution.
percentile_upper (float or "three_sigma", default=100) – Upper bound of accepted feature value ranges. Float or integer value is interpreted as a percentile of descriptors in the training data for each feature. "three_sigma" uses 3 standard deviations from the mean, a common rule-of-thumb for outliers assuming the normal distribution.
num_allowed_violations (int, default=0) – Number of allowed violations of feature ranges. By default, all descriptors must lie inside the bounding box.
n_jobs (int, default=None) – The number of jobs to run in parallel. transform_x_y() and transform() are parallelized over the input molecules. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See scikit-learn documentation on n_jobs for more details.
verbose (int or dict, default=0) – Controls the verbosity when filtering molecules. If a dictionary is passed, it is treated as kwargs for tqdm(), and can be used to control the progress bar.
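As a rough illustration of the "three_sigma" rule described above, the sketch below computes mean ± 3 standard deviations per feature with plain NumPy. It is an assumption-laden sketch, not the library's code: in particular, the population standard deviation (`ddof=0`) and inclusive bounds are assumptions, and the class may compute these differently.

```python
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0); an assumption, not confirmed by the docs

# Three-sigma bounds per feature: roughly [-1.24, 7.24] for this data.
lower_bounds = mean - 3 * std
upper_bounds = mean + 3 * std

X_new = np.array([[7.0], [8.0]])
inside = ((X_new >= lower_bounds) & (X_new <= upper_bounds)).all(axis=1)
print(lower_bounds, upper_bounds, inside)
```

Because the rule relies on the mean and standard deviation, heavily skewed features can produce bounds that are far wider or narrower than intended, which is why transforming the inputs first is recommended.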
References
Examples
>>> import numpy as np
>>> from skfp.applicability_domain import BoundingBoxADChecker
>>> X_train = np.array([[0.1, 0.2, 0.3], [1.0, 0.9, 0.8], [0.5, 0.5, 0.5]])
>>> X_test = np.array([[0.3, 0.3, 0.3], [0.6, 0.6, 0.6], [0.0, 0.9, 1.0]])
>>> bb_ad_checker = BoundingBoxADChecker()
>>> bb_ad_checker
BoundingBoxADChecker()
>>> bb_ad_checker.fit(X_train)
BoundingBoxADChecker()
>>> bb_ad_checker.predict(X_test)
array([ True,  True, False])
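To see why the third test sample is rejected, the default behavior (full training range, no allowed violations) can be reproduced by hand. This is a NumPy-only sketch that assumes inclusive minimum and maximum bounds; it does not call the class itself.

```python
import numpy as np

X_train = np.array([[0.1, 0.2, 0.3], [1.0, 0.9, 0.8], [0.5, 0.5, 0.5]])
X_test = np.array([[0.3, 0.3, 0.3], [0.6, 0.6, 0.6], [0.0, 0.9, 1.0]])

# With the default percentiles (0 and 100), the box is the per-feature min and max.
lower_bounds = X_train.min(axis=0)  # [0.1, 0.2, 0.3]
upper_bounds = X_train.max(axis=0)  # [1.0, 0.9, 0.8]

# Boolean mask of features falling outside the box, then a per-sample count.
outside = (X_test < lower_bounds) | (X_test > upper_bounds)
num_violations = outside.sum(axis=1)

# With zero allowed violations, a sample is inside only if nothing is out of range.
inside = num_violations <= 0
print(outside[2], num_violations, inside)
```

The third sample is below the minimum in the first feature (0.0 < 0.1) and above the maximum in the third (1.0 > 0.8), so it has two violations; its second feature sits exactly on the upper bound and counts as inside under the inclusive-bounds assumption.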
Methods
fit(X[, y]) – Fit applicability domain estimator.
fit_predict(X[, y]) – Perform fit on X and return labels for X.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
predict(X) – Predict labels (1 inside AD, 0 outside AD) of X according to the fitted model.
score_samples(X) – Calculate the applicability domain score of samples.
set_params(**params) – Set the parameters of this estimator.
- fit(X: ndarray, y: ndarray | None = None)
Fit applicability domain estimator.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input samples.
y (any) – Unused, kept for scikit-learn compatibility.
- Returns:
self – Fitted estimator.
- Return type:
object
- fit_predict(X, y=None, **kwargs)
Perform fit on X and return labels for X.
Returns -1 for outliers and 1 for inliers.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples.
y (Ignored) – Not used, present for API consistency by convention.
**kwargs (dict) – Arguments to be passed to fit.
Added in version 1.4.
- Returns:
y – 1 for inliers, -1 for outliers.
- Return type:
ndarray of shape (n_samples,)
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params – Parameter names mapped to their values.
- Return type:
dict
- predict(X: ndarray) → ndarray
Predict labels (1 inside AD, 0 outside AD) of X according to the fitted model.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data matrix.
- Returns:
is_inside_applicability_domain – Returns 1 for molecules inside the applicability domain, and 0 for those outside (outliers).
- Return type:
ndarray of shape (n_samples,)
- score_samples(X: ndarray) → ndarray
Calculate the applicability domain score of samples. It is the number of feature ranges violated by a sample. It ranges between 0 and num_features, where 0 means that all descriptors lie inside the training data ranges.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data matrix.
- Returns:
scores – Applicability domain scores of samples.
- Return type:
ndarray of shape (n_samples,)
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance