MaxMinClustering
- class skfp.clustering.MaxMinClustering(distance_threshold: float = 0.1, random_state: int | RandomState | None = None)
MaxMin clustering.
This is a centroid-based clustering algorithm using binary fingerprints and Tanimoto similarity. Centroids are selected to maximize their minimal pairwise distance, distributing them uniformly across the space. Clusters tend to be convex and similarly sized, in contrast to density-based clustering methods like Butina clustering.
Centroid selection uses the MaxMin heuristic originally described by Ashton et al. [1]: centroids are picked iteratively, each one chosen to maximize its minimal distance to the previously chosen centroids. RDKit MaxMinPicker with a given minimal distance threshold is used.
After selecting centroids, each sample is assigned to the centroid with the highest Tanimoto similarity. The same process is used to predict labels for new samples.
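The selection step described above can be sketched in plain Python. This is a minimal illustration, not the RDKit MaxMinPicker implementation: fingerprints are modeled as sets of on-bit indices, and the names `tanimoto_distance` and `maxmin_select`, the stopping rule, and the tie-breaking are assumptions made for the example.

```python
import random


def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Tanimoto distance (1 - similarity) between two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here: two empty fingerprints are identical
    return 1.0 - len(a & b) / union


def maxmin_select(fps, distance_threshold=0.1, seed=None):
    """Greedy MaxMin selection; returns indices of the chosen centroids."""
    if not fps:
        return []
    rng = random.Random(seed)
    first = rng.randrange(len(fps))  # the first centroid is picked at random
    centroids = [first]
    remaining = [i for i in range(len(fps)) if i != first]
    # min_dist[i] = distance from sample i to its nearest chosen centroid
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while remaining:
        # Candidate that is farthest from all centroids chosen so far.
        candidate = max(remaining, key=min_dist.__getitem__)
        if min_dist[candidate] < distance_threshold:
            break  # every remaining sample is already close to some centroid
        centroids.append(candidate)
        remaining.remove(candidate)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[candidate]))
    return centroids
```

In this sketch, a larger `distance_threshold` stops the loop earlier and so yields fewer, more widely separated centroids.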
- Parameters:
distance_threshold (float, default=0.1) – Distance threshold: the minimal Tanimoto distance between cluster centroids (distance = 1 - Tanimoto similarity). Must be between 0 and 1. The default value was chosen based on analysis of multiple chemical datasets [2].
random_state (int, RandomState instance or None, default=None) – Determines random number generation for selection of the first centroid. Pass an integer for reproducible output across multiple function calls.
- centroid_bitvectors_
Centroid fingerprints as RDKit ExplicitBitVect objects.
- Type:
list of ExplicitBitVect
- centroids_
Centroids represented as boolean NumPy arrays when the input was a dense array or sparse matrix.
- Type:
ndarray of bool, shape (n_centroids, n_bits)
- labels_
Cluster labels for each sample.
- Type:
ndarray of int, shape (n_samples,)
Notes
This estimator follows the scikit-learn estimator API and accepts dense NumPy arrays, SciPy sparse arrays, or lists/tuples of RDKit ExplicitBitVect objects as input.
References
Methods
fit(X[, y]) – Fit the MaxMin clustering model.
fit_predict(X[, y]) – Fit the MaxMin clustering model and return cluster labels.
get_clusters_and_points() – Return clusters as a mapping from cluster ID to sample indices.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
predict(X) – Assign new samples to existing centroids.
set_params(**params) – Set the parameters of this estimator.
- fit(X: ndarray | csr_array | Sequence[ExplicitBitVect], y=None)
Fit the MaxMin clustering model.
- Parameters:
X ({array-like, sparse matrix, sequence of ExplicitBitVect}) – Binary fingerprint data. Expected shapes are (n_samples, n_bits) for arrays and sparse arrays. Alternatively, a list/tuple of RDKit ExplicitBitVect objects is accepted.
y (ignored) – Not used, present for API consistency with scikit-learn.
- Returns:
self – Fitted estimator.
- Return type:
- fit_predict(X: ndarray | csr_array | Sequence[ExplicitBitVect], y=None) -> ndarray
Fit the MaxMin clustering model and return cluster labels.
This is a convenience method that calls fit() and returns the cluster labels.
- Parameters:
X ({array-like, sparse matrix, sequence of ExplicitBitVect}) – Binary fingerprint data. Expected shapes are (n_samples, n_bits) for arrays and sparse arrays. Alternatively, a list/tuple of RDKit ExplicitBitVect objects is accepted.
y (ignored) – Not used, present for API consistency with scikit-learn.
- Returns:
labels – Cluster labels for X.
- Return type:
ndarray of int, shape (n_samples,)
- get_clusters_and_points() -> dict[int, ndarray]
Return clusters as a mapping from cluster ID to sample indices.
- Returns:
clusters – Mapping from integer cluster ID to a 1D NumPy array containing the indices of samples belonging to that cluster.
- Return type:
dict
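The relationship between per-sample labels and this mapping can be illustrated with a short, dependency-free sketch. The helper name `clusters_from_labels` is hypothetical, and plain lists stand in for the 1D NumPy arrays that the method is documented to return.

```python
def clusters_from_labels(labels):
    """Group sample indices by their cluster label."""
    clusters = {}
    for index, label in enumerate(labels):
        # Create the cluster entry on first sight, then record the sample index.
        clusters.setdefault(label, []).append(index)
    return clusters
```

For example, the labels `[0, 1, 0]` would place samples 0 and 2 in cluster 0 and sample 1 in cluster 1.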
- get_metadata_routing()
Get metadata routing of this object.
See the scikit-learn User Guide for details on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params – Parameter names mapped to their values.
- Return type:
dict
- predict(X: ndarray | csr_array | Sequence[ExplicitBitVect]) -> ndarray
Assign new samples to existing centroids.
- Parameters:
X ({array-like, sparse matrix, sequence of ExplicitBitVect}) – New samples to assign to clusters. The input formats match those accepted by fit().
- Returns:
labels – Cluster labels for the input samples.
- Return type:
ndarray of int, shape (n_samples,)
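The assignment rule used for prediction, picking the centroid with the highest Tanimoto similarity, can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: fingerprints are modeled as sets of on-bit indices, the function names are hypothetical, and ties go to the lowest centroid index, which the documentation does not specify.

```python
def tanimoto_similarity(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    # Convention assumed here: two empty fingerprints are fully similar.
    return len(a & b) / union if union else 1.0


def assign_to_centroids(samples, centroids):
    """Return, for each sample, the index of its most similar centroid."""
    labels = []
    for fp in samples:
        sims = [tanimoto_similarity(fp, c) for c in centroids]
        # max() keeps the first maximum, so ties resolve to the lowest index.
        labels.append(max(range(len(centroids)), key=sims.__getitem__))
    return labels
```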
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance