TOPKATADChecker

class skfp.applicability_domain.TOPKATADChecker(threshold: float | None = None, n_jobs: int | None = None, verbose: int | dict = 0)

TOPKAT (Optimal Prediction Space) method.

Defines applicability domain using the Optimal Prediction Space (OPS) approach introduced in the TOPKAT system [1]. The method transforms the input feature space into a normalized, centered and rotated space using PCA on a scaled [-1, 1] version of the training data (S-space). Each new sample is projected into the same OPS space, and a weighted distance (dOPS) from the center is computed.
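The transformation described above can be illustrated with a short NumPy sketch. It is not the library's code: the helper names `fit_ops` and `d_ops` are hypothetical, and the choices made here (min-max scaling to [-1, 1], centering on the training mean, rotating with the eigenvectors of the scatter matrix, weighting each squared coordinate by the inverse eigenvalue) are assumptions made for illustration.

```python
import numpy as np


def fit_ops(X_train):
    """Hypothetical helper: learn the scaling, center and rotation from training data."""
    x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
    # Guard against constant features to avoid division by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    S = (2 * X_train - x_max - x_min) / span  # S-space, each feature in [-1, 1]
    center = S.mean(axis=0)
    # Eigendecomposition of the centered scatter matrix gives the PCA rotation.
    eigvals, eigvecs = np.linalg.eigh((S - center).T @ (S - center))
    return {"x_min": x_min, "x_max": x_max, "span": span,
            "center": center, "eigvals": eigvals, "eigvecs": eigvecs}


def d_ops(X, ops):
    """Hypothetical helper: eigenvalue-weighted squared distance from the center."""
    S = (2 * X - ops["x_max"] - ops["x_min"]) / ops["span"]
    rotated = (S - ops["center"]) @ ops["eigvecs"]  # coordinates in the rotated space
    return (rotated ** 2) @ (1.0 / ops["eigvals"])


rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
ops = fit_ops(X_train)

train_scores = d_ops(X_train, ops)               # one score per training sample
far_score = d_ops(np.full((1, 5), 10.0), ops)    # a point far from the training data
print(train_scores.max(), far_score[0])
```

In this sketch the training scores sum to the number of features (for full-rank data), so the far-away point receives a score well above any training sample.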

Samples are considered in-domain if their dOPS is below a threshold. By default, this threshold is computed as \(5 * D / (2 * N)\), where:

- \(D\) is the number of input features,
- \(N\) is the number of training samples.
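As a quick arithmetic check of the default threshold formula, the snippet below evaluates it for 5 features and 100 training samples, the same shape as the data used in the Examples section. The variable names are illustrative only.

```python
n_features = 5    # D, number of input features
n_samples = 100   # N, number of training samples

# Default analytical threshold: 5 * D / (2 * N)
default_threshold = 5 * n_features / (2 * n_samples)
print(default_threshold)  # 0.125
```

Because N is in the denominator, the default threshold shrinks as the training set grows: doubling the number of training samples halves it.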

This method captures both the variance and correlation structure of the descriptors, and performs well for detecting global outliers in descriptor space.

Parameters:

threshold (float, default=None) – Optional user-defined threshold for dOPS. If provided, overrides the default analytical threshold \(5 * D / (2 * N)\). Lower values produce stricter domains.
n_jobs (int, default=None) – The number of jobs to run in parallel. transform_x_y() and transform() are parallelized over the input molecules. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See scikit-learn documentation on n_jobs for more details.
verbose (int or dict, default=0) – Controls the verbosity when filtering molecules. If a dictionary is passed, it is treated as kwargs for tqdm(), and can be used to control the progress bar.

References

Examples

>>> import numpy as np
>>> from skfp.applicability_domain import TOPKATADChecker
>>> from sklearn.datasets import make_blobs
>>> X_train, _ = make_blobs(n_samples=100, centers=2, n_features=5, random_state=0)
>>> X_test = X_train[:5]

>>> topkat_ad_checker = TOPKATADChecker()
>>> topkat_ad_checker.fit(X_train)
TOPKATADChecker()

>>> topkat_ad_checker.predict(X_test)
array([ True,  True,  True,  True, False])

Methods

`fit`(X[, y])	Fit applicability domain estimator.
`fit_predict`(X[, y])	Perform fit on X and return labels for X.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`predict`(X)	Predict labels (1 inside AD, 0 outside AD) of X according to the fitted model.
`score_samples`(X)	Calculate the applicability domain score of samples.
`set_params`(**params)	Set the parameters of this estimator.

fit(X: ndarray, y: ndarray | None = None)

Fit applicability domain estimator.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input samples.
y (any) – Unused, kept for scikit-learn compatibility.

Returns:

self – Fitted estimator.

Return type:

object

fit_predict(X, y=None, **kwargs)

Perform fit on X and return labels for X.

Returns -1 for outliers and 1 for inliers.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples.
y (Ignored) – Not used, present for API consistency by convention.
**kwargs (dict) – Arguments to be passed to fit.

Added in version 1.4.

Returns:

y – 1 for inliers, -1 for outliers.

Return type:

ndarray of shape (n_samples,)

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(X: ndarray) → ndarray

Predict labels (1 inside AD, 0 outside AD) of X according to the fitted model.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data matrix.

Returns:

is_inside_applicability_domain – Returns True (1) for molecules inside the applicability domain, and False (0) for those outside (outliers).

Return type:

ndarray of shape (n_samples,)

score_samples(X: ndarray) → ndarray

Calculate the applicability domain score of samples. It is defined as the weighted distance (dOPS) from the center of the training data in the OPS-transformed space. Lower values indicate higher similarity to the training data.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data matrix.

Returns:

scores – Distance scores from the center of the Optimal Prediction Space (dOPS).

Return type:

ndarray of shape (n_samples,)

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance