KNNADChecker
- class skfp.applicability_domain.KNNADChecker(k: int = 1, metric: str | Callable = 'tanimoto_binary_distance', agg: str = 'mean', threshold: float = 95, n_jobs: int | None = None, verbose: int | dict = 0)
k-nearest neighbors applicability domain checker.
This method determines whether a query molecule falls within the applicability domain by comparing its distance to k nearest neighbors [1] [2] [3] in the training set, using a threshold derived from the training data.
The distance of a molecule to the training set is aggregated over its k nearest neighbors in one of three ways:
the mean distance to the k nearest neighbors,
the distance to the k-th nearest neighbor (max distance),
the distance to the closest neighbor from the training set (min distance).
The threshold is then set at a percentile (the 95th by default) of these aggregated distances, computed on the training data. Query molecules whose aggregated distance to their k nearest neighbors is below this threshold are considered within the applicability domain.
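The procedure above can be sketched in plain NumPy. This is an illustrative sketch, not the library's implementation: all helper names (`binary_tanimoto_dist`, `aggregated_knn_distances`, `fit_threshold`, `is_in_domain`) are hypothetical, and two details are assumptions. First, each training sample is kept in its own neighbor list when the threshold is fitted (distance 0 to itself). Second, "below" is read as strictly less than.

```python
import numpy as np


def binary_tanimoto_dist(a, b):
    """Tanimoto (Jaccard) distance between two binary vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Convention chosen for this sketch: two all-zero vectors are identical.
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union


def aggregated_knn_distances(X_query, X_train, k, agg):
    """Aggregate each query row's distances to its k nearest training rows."""
    reducers = {"mean": np.mean, "max": np.max, "min": np.min}
    reduce_fn = reducers[agg]
    result = []
    for q in X_query:
        dists = np.sort([binary_tanimoto_dist(q, t) for t in X_train])
        result.append(reduce_fn(dists[:k]))
    return np.array(result)


def fit_threshold(X_train, k=1, agg="mean", percentile=95):
    # Assumption: training rows are compared against the full training set,
    # so each row sees itself as a neighbor at distance 0.
    train_dists = aggregated_knn_distances(X_train, X_train, k, agg)
    return float(np.percentile(train_dists, percentile))


def is_in_domain(X_query, X_train, threshold, k=1, agg="mean"):
    # Assumption: "below the threshold" means strictly less than.
    return aggregated_knn_distances(X_query, X_train, k, agg) < threshold


X_train = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 1]])
X_test = 1 - X_train

threshold = fit_threshold(X_train, k=2, agg="mean", percentile=95)
inside = is_in_domain(X_test, X_train, threshold, k=2, agg="mean")
print(threshold, inside)
```

Under these assumptions, all three inverted fingerprints fall outside the domain, which agrees with the output shown in the Examples section.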
This implementation supports binary and count Tanimoto similarity metrics.
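For count fingerprints, the Tanimoto coefficient is commonly generalized with dot products: sim(a, b) = a·b / (‖a‖² + ‖b‖² − a·b). The sketch below uses that definition. Whether the library's count metric follows exactly this formula is an assumption here, and the function name `count_tanimoto_dist` is hypothetical.

```python
import numpy as np


def count_tanimoto_dist(a, b):
    """1 - a.b / (|a|^2 + |b|^2 - a.b) for non-negative count vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = float(a @ b)
    denom = float(a @ a + b @ b) - dot
    if denom == 0.0:
        # Both vectors are all-zero; treated as identical in this sketch.
        return 0.0
    return 1.0 - dot / denom


d_counts = count_tanimoto_dist([1, 2, 0], [1, 1, 1])
d_binary = count_tanimoto_dist([1, 1, 1], [0, 1, 1])
print(d_counts, d_binary)
```

On 0/1 vectors the dot products count shared and set bits, so the formula reduces to the binary Tanimoto distance.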
- Parameters:
k (int, default=1) – Number of nearest neighbors to consider for distance calculations. Must be smaller than the number of training samples.
metric (Callable or string, default="tanimoto_binary_distance") – Distance metric to use.
agg ("mean" or "max" or "min", default="mean") –
Aggregation method for distances to k nearest neighbors:
"mean": average distance
"max": maximum distance, to k-th neighbor
"min": minimal distance, to the nearest neighbor
threshold (float, default=95) – Percentile of distance distribution, used as the threshold for determining the applicability domain. Value in range [0, 100].
n_jobs (int, default=None) – The number of jobs to run in parallel. transform_x_y() and transform() are parallelized over the input molecules. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See scikit-learn documentation on n_jobs for more details.
verbose (int or dict, default=0) – Controls the verbosity when filtering molecules. If a dictionary is passed, it is treated as kwargs for tqdm(), and can be used to control the progress bar.
References
Examples
>>> from skfp.applicability_domain import KNNADChecker
>>> import numpy as np
>>> X_train_binary = np.array([
...     [1, 1, 1],
...     [0, 1, 1],
...     [0, 0, 1]
... ])
>>> X_test_binary = 1 - X_train_binary
>>> knn_ad_checker_binary = KNNADChecker(k=2, metric="tanimoto_binary_distance", agg="mean")
>>> knn_ad_checker_binary.fit(X_train_binary)
KNNADChecker(k=2)
>>> knn_ad_checker_binary.predict(X_test_binary)
array([False, False, False])
Methods
fit(X[, y]) – Fit applicability domain estimator.
fit_predict(X[, y]) – Perform fit on X and return labels for X.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
predict(X) – Predict labels (1 inside AD, 0 outside AD) of X according to the fitted model.
score_samples(X) – Calculate the applicability domain score of samples.
set_params(**params) – Set the parameters of this estimator.
- fit(X: ndarray, y: ndarray | None = None)
Fit applicability domain estimator.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input samples.
y (any) – Unused, kept for scikit-learn compatibility.
- Returns:
self – Fitted estimator.
- Return type:
object
- fit_predict(X, y=None, **kwargs)
Perform fit on X and return labels for X.
Returns -1 for outliers and 1 for inliers.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples.
y (Ignored) – Not used, present for API consistency by convention.
**kwargs (dict) – Arguments to be passed to fit.
Added in version 1.4.
- Returns:
y – 1 for inliers, -1 for outliers.
- Return type:
ndarray of shape (n_samples,)
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params – Parameter names mapped to their values.
- Return type:
dict
- predict(X: ndarray) → ndarray
Predict labels (1 inside AD, 0 outside AD) of X according to the fitted model.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data matrix.
- Returns:
is_inside_applicability_domain – Returns 1 for molecules inside the applicability domain, and 0 for those outside (outliers).
- Return type:
ndarray of shape (n_samples,)
- score_samples(X: ndarray) → ndarray
Calculate the applicability domain score of samples. It is simply a 0/1 decision equal to .predict().
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data matrix.
- Returns:
scores – Applicability domain scores of samples.
- Return type:
ndarray of shape (n_samples,)
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
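The `<component>__<parameter>` naming can be illustrated with a standard scikit-learn pipeline. The step names `scaler` and `clf` and the chosen estimators are arbitrary choices for this sketch and are unrelated to KNNADChecker itself.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# Update parameters of the nested steps through the pipeline,
# addressing each one as <step name>__<parameter name>.
pipe.set_params(clf__C=0.5, scaler__with_mean=False)

params = pipe.get_params()
print(params["clf__C"], params["scaler__with_mean"])
```

The same double-underscore keys appear in the dictionary returned by `get_params(deep=True)`, which is how tools such as grid search address nested parameters.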