KNNADChecker

class skfp.applicability_domain.KNNADChecker(k: int = 1, metric: str | Callable = 'tanimoto_binary_distance', agg: str = 'mean', threshold: float = 95, n_jobs: int | None = None, verbose: int | dict = 0)

k-nearest neighbors applicability domain checker.

This method determines whether a query molecule falls within the applicability domain by comparing its distance to its k nearest neighbors in the training set [1] [2] [3] against a threshold derived from the training data.

For each molecule, the distances to its k nearest training neighbors are aggregated in one of three ways:

  • the mean distance to the k nearest neighbors,

  • the distance to the k-th nearest neighbor (max distance),

  • the distance to the closest neighbor from the training set (min distance).

A threshold is then set at a percentile of these aggregated distances (the 95th by default, configurable via threshold). Query molecules whose aggregated distance to their k nearest neighbors is below this threshold are considered within the applicability domain.
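The procedure described above can be sketched in plain NumPy. This is an illustration only, not the library's implementation: the helper names tanimoto_binary_distance_matrix and aggregate_knn are hypothetical, and the sketch assumes that each training sample counts itself (distance 0) among its own neighbors and that "below" means strictly less than. Under those assumptions, the toy data from the Examples section yields the same all-False result shown there.

```python
import numpy as np


def tanimoto_binary_distance_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Pairwise 1 - |a AND b| / |a OR b| between binary rows of A and B."""
    A = A.astype(bool)
    B = B.astype(bool)
    intersection = (A[:, None, :] & B[None, :, :]).sum(axis=2)
    union = (A[:, None, :] | B[None, :, :]).sum(axis=2)
    # Assumption: two all-zero vectors are treated as identical (similarity 1).
    similarity = np.where(union == 0, 1.0, intersection / np.maximum(union, 1))
    return 1.0 - similarity


def aggregate_knn(distances: np.ndarray, k: int, agg: str = "mean") -> np.ndarray:
    """Aggregate each row's k smallest distances with mean, max, or min."""
    k_nearest = np.sort(distances, axis=1)[:, :k]
    if agg == "mean":
        return k_nearest.mean(axis=1)
    if agg == "max":
        return k_nearest[:, -1]  # distance to the k-th nearest neighbor
    if agg == "min":
        return k_nearest[:, 0]  # distance to the closest neighbor
    raise ValueError(f"Unknown aggregation: {agg!r}")


X_train = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 1]])
X_test = 1 - X_train

# Assumption: self-distances are kept when scoring the training set.
train_scores = aggregate_knn(
    tanimoto_binary_distance_matrix(X_train, X_train), k=2, agg="mean"
)
threshold = np.percentile(train_scores, 95)

test_scores = aggregate_knn(
    tanimoto_binary_distance_matrix(X_test, X_train), k=2, agg="mean"
)
inside_domain = test_scores < threshold
print(threshold, test_scores, inside_domain)
```

With only three training rows the percentile is interpolated between the two largest training scores, so the cutoff is sensitive to how self-distances are handled; on realistic training sets this matters far less.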

This implementation supports binary and count Tanimoto similarity metrics.

Parameters:
  • k (int, default=1) – Number of nearest neighbors to consider for distance calculations. Must be smaller than the number of training samples.

  • metric (Callable or string, default="tanimoto_binary_distance") – Distance metric to use.

  • agg ("mean" or "max" or "min", default="mean") –

    Aggregation method for distances to k nearest neighbors:

    • "mean": average distance

    • "max": maximum distance, to k-th neighbor

    • "min": minimal distance, to the nearest neighbor

  • threshold (float, default=95) – Percentile of the distribution of aggregated training distances, used as the threshold for determining the applicability domain. Value in range [0, 100].

  • n_jobs (int, default=None) – The number of jobs to run in parallel. transform_x_y() and transform() are parallelized over the input molecules. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See scikit-learn documentation on n_jobs for more details.

  • verbose (int or dict, default=0) – Controls the verbosity when filtering molecules. If a dictionary is passed, it is treated as kwargs for tqdm(), and can be used to control the progress bar.
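To make the three agg options concrete, the following sketch applies them to a single query molecule. The distance values are made up for illustration, and the code is a plain NumPy approximation of the description above rather than the library's own code.

```python
import numpy as np

# Hypothetical distances from one query molecule to five training molecules.
distances = np.array([0.6, 0.2, 0.9, 0.4, 0.3])
k = 3

# Keep only the k smallest distances, i.e. the k nearest neighbors.
k_nearest = np.sort(distances)[:k]

aggregated = {
    "mean": float(k_nearest.mean()),  # average over the k nearest neighbors
    "max": float(k_nearest.max()),    # distance to the k-th nearest neighbor
    "min": float(k_nearest.min()),    # distance to the single nearest neighbor
}
print(aggregated)
```

Note that "min" depends only on the single closest neighbor, so its value is the same for every k.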

References

Examples

>>> from skfp.applicability_domain import KNNADChecker
>>> import numpy as np
>>> X_train_binary = np.array([
...     [1, 1, 1],
...     [0, 1, 1],
...     [0, 0, 1]
... ])
>>> X_test_binary = 1 - X_train_binary
>>> knn_ad_checker_binary = KNNADChecker(k=2, metric="tanimoto_binary_distance", agg="mean")
>>> knn_ad_checker_binary.fit(X_train_binary)
KNNADChecker(k=2)
>>> knn_ad_checker_binary.predict(X_test_binary)
array([False, False, False])

Methods

fit(X[, y])

Fit applicability domain estimator.

fit_predict(X[, y])

Fit on X and return labels for X.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

predict(X)

Predict labels (1 inside AD, 0 outside AD) of X according to the fitted model.

score_samples(X)

Calculate the applicability domain score of samples.

set_params(**params)

Set the parameters of this estimator.

fit(X: ndarray, y: ndarray | None = None)

Fit applicability domain estimator.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The input samples.

  • y (any) – Unused, kept for scikit-learn compatibility.

Returns:

self – Fitted estimator.

Return type:

object

fit_predict(X, y=None, **kwargs)

Fit on X and return labels for X.

Returns -1 for outliers and 1 for inliers.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples.

  • y (Ignored) – Not used, present for API consistency by convention.

  • **kwargs (dict) –

    Arguments to be passed to fit.

    Added in version 1.4.

Returns:

y – 1 for inliers, -1 for outliers.

Return type:

ndarray of shape (n_samples,)

get_metadata_routing()

Get metadata routing of this object.

See the scikit-learn User Guide for how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(X: ndarray) → ndarray

Predict labels (1 inside AD, 0 outside AD) of X according to the fitted model.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data matrix.

Returns:

is_inside_applicability_domain – Returns 1 for molecules inside the applicability domain, and 0 for those outside (outliers).

Return type:

ndarray of shape (n_samples,)
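A common use of the returned array is as a mask for keeping only in-domain molecules. The sketch below uses a hypothetical feature matrix and hand-written labels in place of a real predict() call. Converting to bool first means it works whether the labels arrive as 0/1 integers or as booleans.

```python
import numpy as np

# Hypothetical feature matrix and predict() output for four molecules.
X_query = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
])
labels = np.array([1, 0, 1, 0])  # could equally be [True, False, True, False]

# Normalize to a boolean mask, then split the rows.
inside_mask = np.asarray(labels).astype(bool)
X_inside = X_query[inside_mask]
X_outside = X_query[~inside_mask]
print(X_inside.shape, X_outside.shape)
```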

score_samples(X: ndarray) → ndarray

Calculate the applicability domain score of samples. It is simply a 0/1 decision equal to .predict().

Parameters:

X (array-like of shape (n_samples, n_features)) – The data matrix.

Returns:

scores – Applicability domain scores of samples.

Return type:

ndarray of shape (n_samples,)

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance