MaxMinClustering

class skfp.clustering.MaxMinClustering(distance_threshold: float = 0.1, random_state: int | RandomState | None = None)

MaxMin clustering.

This is a centroid-based clustering algorithm using binary fingerprints and Tanimoto similarity. Centroids are selected to maximize their minimal pairwise distance, distributing them uniformly across the space. Clusters tend to be convex and similarly sized, in contrast to density-based clustering methods like Butina clustering.

Centroid selection uses the MaxMin heuristic originally described by Ashton et al. [1]: centroids are picked iteratively, each new one maximizing its minimal distance to the previously chosen centroids. The implementation uses the RDKit MaxMinPicker with the given minimal distance threshold.
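The selection heuristic described above can be sketched in plain NumPy. This is an illustrative sketch only, not the RDKit MaxMinPicker implementation: the helper names `tanimoto_distance_matrix` and `maxmin_pick` are hypothetical, the full distance matrix is computed eagerly for simplicity, and stopping as soon as the best candidate falls below the threshold is an assumption about the boundary behavior.

```python
import numpy as np


def tanimoto_distance_matrix(X):
    """Pairwise Tanimoto distances between rows of a binary matrix."""
    X = np.asarray(X, dtype=np.int64)
    intersection = X @ X.T
    counts = X.sum(axis=1)
    union = counts[:, None] + counts[None, :] - intersection
    # Assumption: two all-zero fingerprints count as identical (similarity 1).
    similarity = np.where(union > 0, intersection / np.maximum(union, 1), 1.0)
    return 1.0 - similarity


def maxmin_pick(X, distance_threshold=0.1, random_state=None):
    """Hypothetical MaxMin picker: returns indices of the chosen centroids."""
    rng = np.random.default_rng(random_state)
    dist = tanimoto_distance_matrix(X)
    first = int(rng.integers(len(dist)))  # random first centroid
    picked = [first]
    # min_dist[i] = distance from sample i to its closest picked centroid
    min_dist = dist[first].copy()
    while len(picked) < len(dist):
        candidate = int(np.argmax(min_dist))
        # Stop once no remaining sample is far enough from all centroids.
        if min_dist[candidate] <= 0.0 or min_dist[candidate] < distance_threshold:
            break
        picked.append(candidate)
        min_dist = np.minimum(min_dist, dist[candidate])
    return picked


X = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],  # duplicate of the first fingerprint
        [0, 0, 1, 1],
        [1, 1, 1, 0],
    ]
)
picked = maxmin_pick(X, distance_threshold=0.5, random_state=0)
print(picked)
```

With a threshold of 0.5, this toy data yields two centroids whichever sample is drawn first, because only the third fingerprint is at least 0.5 away from every other one.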

After selecting centroids, each sample is assigned to the centroid with the highest Tanimoto similarity. The same assignment rule is used to predict clusters for new samples.
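The assignment step can be illustrated with a short NumPy sketch. The function names `tanimoto_similarity` and `assign_to_centroids` are hypothetical and not part of skfp, and breaking ties in favor of the lowest centroid index is an assumption of this sketch.

```python
import numpy as np


def tanimoto_similarity(A, B):
    """Tanimoto similarity between each row of A and each row of B."""
    A = np.asarray(A, dtype=np.int64)
    B = np.asarray(B, dtype=np.int64)
    intersection = A @ B.T
    union = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - intersection
    # Assumption: two all-zero fingerprints count as identical (similarity 1).
    return np.where(union > 0, intersection / np.maximum(union, 1), 1.0)


def assign_to_centroids(X, centroids):
    """Label each sample with the index of its most similar centroid."""
    # np.argmax returns the first maximum, so ties go to the lowest index.
    return np.argmax(tanimoto_similarity(X, centroids), axis=1)


centroids = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
X_new = np.array([[1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 0, 0]])
labels = assign_to_centroids(X_new, centroids)
print(labels)
```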

Parameters:

distance_threshold (float, default=0.1) – Distance threshold, i.e. the minimal Tanimoto distance between clusters (distance = 1 - Tanimoto similarity). Must be between 0 and 1. The default value was chosen based on analysis of multiple chemical datasets [2].
random_state (int, RandomState instance or None, default=None) – Determines random number generation for selection of the first centroid. Pass an integer for reproducible output across multiple function calls.
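As a small worked example of the distance that distance_threshold refers to, consider two made-up fingerprints A and B with 3 and 2 bits set, sharing 2 bits. The numbers are chosen purely for illustration.

```latex
\[
T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} = \frac{2}{3 + 2 - 2} = \frac{2}{3},
\qquad
d(A, B) = 1 - T(A, B) = \frac{1}{3}.
\]
```

Since 1/3 exceeds the default threshold of 0.1, a threshold at that value would not by itself prevent both fingerprints from being selected as centroids.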

centroid_indices_

Indices of samples chosen as centroids after fit().

Type: list of int

centroid_bitvectors_

Centroid fingerprints as RDKit ExplicitBitVect objects.

Type: list of ExplicitBitVect

centroids_

Centroids represented as boolean NumPy arrays when the input was a dense array or sparse matrix.

Type: ndarray of bool, shape (n_centroids, n_bits)

labels_

Cluster labels for each sample.

Type: ndarray of int, shape (n_samples,)

Notes

This estimator follows the scikit-learn estimator API and accepts dense NumPy arrays, SciPy sparse arrays, or lists/tuples of RDKit ExplicitBitVect objects as input.

References

Methods

`fit`(X[, y])	Fit the MaxMin clustering model.
`fit_predict`(X[, y])	Fit the MaxMin clustering model and return cluster labels.
`get_clusters_and_points`()	Return clusters as a mapping from cluster ID to sample indices.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`predict`(X)	Assign new samples to existing centroids.
`set_params`(**params)	Set the parameters of this estimator.

fit(X: ndarray | csr_array | Sequence[ExplicitBitVect], y=None)

Fit the MaxMin clustering model.

Parameters:

X ({array-like, sparse matrix, sequence of ExplicitBitVect}) – Binary fingerprint data. Expected shapes are (n_samples, n_bits) for arrays and sparse arrays. Alternatively, a list/tuple of RDKit ExplicitBitVect objects is accepted.
y (ignored) – Not used, present for API consistency with scikit-learn.

Returns:

self – Fitted estimator.

Return type:

MaxMinClustering

fit_predict(X: ndarray | csr_array | Sequence[ExplicitBitVect], y=None) → ndarray

Fit the MaxMin clustering model and return cluster labels.

This is a convenience method that calls fit() and returns the cluster labels.

Parameters:

X ({array-like, sparse matrix, sequence of ExplicitBitVect}) – Binary fingerprint data. Expected shapes are (n_samples, n_bits) for arrays and sparse arrays. Alternatively, a list/tuple of RDKit ExplicitBitVect objects is accepted.
y (ignored) – Not used, present for API consistency with scikit-learn.

Returns:

labels – Cluster labels for X.

Return type:

ndarray of int, shape (n_samples,)

get_clusters_and_points() → dict[int, ndarray]

Return clusters as a mapping from cluster ID to sample indices.

Returns:

clusters – Mapping from integer cluster ID to a 1D NumPy array containing the indices of samples belonging to that cluster.

Return type:

dict
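The shape of this return value can be illustrated by building an equivalent mapping from a label array. This is a sketch of the data structure only: `clusters_from_labels` is a hypothetical helper, not the method's actual implementation, and the labels are made up.

```python
import numpy as np


def clusters_from_labels(labels):
    """Map each cluster ID to the indices of the samples carrying that label."""
    labels = np.asarray(labels)
    return {int(c): np.flatnonzero(labels == c) for c in np.unique(labels)}


labels = np.array([0, 1, 0, 2, 1])
clusters = clusters_from_labels(labels)
for cluster_id, indices in clusters.items():
    print(cluster_id, indices)
```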

get_metadata_routing()

Get metadata routing of this object.

See the scikit-learn User Guide for how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(X: ndarray | csr_array | Sequence[ExplicitBitVect]) → ndarray

Assign new samples to existing centroids.

Parameters:

X ({array-like, sparse matrix, sequence of ExplicitBitVect}) – New samples to assign to clusters. The input formats match those accepted by fit().

Returns:

labels – Cluster labels for the input samples.

Return type:

ndarray of int, shape (n_samples,)

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance