Molecular Clustering Algorithms
scikit-fingerprints provides tools to partition chemical space using fingerprint-based methods. These clustering algorithms make it possible to divide large molecular collections into smaller, representative groups, which is useful in a variety of settings such as:
organizing compounds in virtual screening campaigns
exploring chemical diversity
constructing stratified splits for machine-learning model validation
In the following tutorial, we demonstrate how to work with built-in datasets, preprocess molecular data, and apply clustering algorithms to partition chemical space in practice.
[1]:
import pandas as pd
from skfp.clustering import MaxMinClustering
from skfp.datasets.moleculenet import load_bace
from skfp.fingerprints import ECFPFingerprint
[2]:
# Get and preprocess data - here, we use a built-in MoleculeNet dataset and compute binary fingerprints
smiles, _ = load_bace()
fps = ECFPFingerprint().fit_transform(smiles)
Available clustering algorithms
The table below summarizes the clustering algorithms currently implemented in scikit-fingerprints. Additional methods may be added in future releases.
| Algorithm | Distance / Similarity | Centroid-based | Produces labels | Predict on new data | Typical use case |
|---|---|---|---|---|---|
| MaxMin clustering | Tanimoto (binary fingerprints) | Yes (diverse representatives) | Yes | Yes | Partitioning chemical space into representative clusters |
MaxMin clustering
MaxMin clustering is a centroid-based clustering method designed for binary molecular fingerprints. It combines diversity-based centroid selection with similarity-based cluster assignment, making it particularly suitable for partitioning chemical space into representative regions.
Unlike many classical clustering algorithms, MaxMin clustering explicitly selects cluster centers to maximize diversity before assigning molecules to clusters.
Algorithm overview
MaxMin clustering proceeds in two stages:
Centroid selection (MaxMin diversity picking): A set of representative molecules is selected using RDKit’s MaxMinPicker. Centroids are chosen iteratively such that each newly selected centroid maximizes the minimum Tanimoto distance to all previously selected centroids. This encourages broad coverage of chemical space.
Cluster assignment: Once centroids are fixed, all molecules (including the centroids themselves) are assigned to the nearest centroid using Tanimoto similarity. Each molecule is assigned to the cluster corresponding to the centroid with the highest similarity.
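The two stages can be sketched in plain Python. This is a minimal illustration, not the scikit-fingerprints or RDKit implementation: fingerprints are represented as sets of on-bit indices, the first centroid is drawn at random, and picking stops once no remaining molecule is at least `distance_threshold` away from every selected centroid. The names `tanimoto_distance` and `maxmin_cluster`, the toy fingerprints, and the exact stopping rule are assumptions made for this sketch.

```python
import random


def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch: two empty fingerprints are identical
    return 1.0 - len(a & b) / union


def maxmin_cluster(fps, distance_threshold, seed=0):
    """Hypothetical helper: MaxMin centroid picking followed by nearest-centroid assignment."""
    rng = random.Random(seed)

    # Stage 1: centroid selection. Start from a random molecule (an assumption of this sketch).
    first = rng.randrange(len(fps))
    centroids = [first]
    # min_dist[i] is the distance from molecule i to its closest centroid so far.
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while True:
        # The next candidate is the molecule farthest from all current centroids.
        candidate = max(range(len(fps)), key=min_dist.__getitem__)
        if min_dist[candidate] == 0.0 or min_dist[candidate] < distance_threshold:
            break  # every molecule is already close enough to some centroid
        centroids.append(candidate)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[candidate]))

    # Stage 2: assign each molecule to its nearest centroid (smallest distance,
    # which is the same as highest Tanimoto similarity).
    labels = [
        min(range(len(centroids)), key=lambda k: tanimoto_distance(fp, fps[centroids[k]]))
        for fp in fps
    ]
    return centroids, labels


# Toy data: three well-separated groups of fingerprints.
toy_fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},  # group A
    {10, 11, 12}, {10, 11, 13},                   # group B
    {20, 21},                                     # group C
]
coarse_centroids, coarse_labels = maxmin_cluster(toy_fps, distance_threshold=0.6)
fine_centroids, fine_labels = maxmin_cluster(toy_fps, distance_threshold=0.3)
print(len(coarse_centroids), len(fine_centroids))
```

On this toy data the threshold of 0.6 yields one centroid per group, while 0.3 also splits groups whose members are more than 0.3 apart.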
Distance and similarity
MaxMin clustering operates on binary fingerprints and uses:
Tanimoto similarity to measure molecular similarity
Tanimoto distance (1 − similarity) during centroid selection
This choice is standard in cheminformatics and well suited for sparse binary representations such as ECFP (Morgan) fingerprints.
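As a small worked example of these two quantities, the sketch below computes Tanimoto similarity as the number of shared on-bits divided by the number of bits set in either fingerprint. The function name `tanimoto_similarity` and the set-based representation are illustrative assumptions, not part of the scikit-fingerprints API.

```python
def tanimoto_similarity(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


fp_a = {1, 4, 7, 9}  # on-bits of a hypothetical molecule A
fp_b = {1, 4, 8}     # on-bits of a hypothetical molecule B

similarity = tanimoto_similarity(fp_a, fp_b)  # 2 shared bits / 5 distinct bits = 0.4
distance = 1.0 - similarity                   # 0.6
print(similarity, distance)
```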
Controlling cluster granularity
The number and spread of clusters are controlled by the distance threshold used during centroid selection:
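Lower thresholds produce more, finer-grained clusters, because centroids are allowed to lie closer together
Higher thresholds produce fewer, broader clusters, because centroids must be more dissimilar from one another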
The exact number of clusters emerges from the data and the chosen threshold.
The algorithm does not balance cluster sizes or select fixed-size subsets; such operations are left to downstream processing.
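One possible form of such downstream processing is to cap the number of molecules kept per cluster. The helper below is a hypothetical sketch that works on any sequence of cluster labels; `cap_cluster_sizes` and `max_per_cluster` are names introduced here for illustration only.

```python
from collections import defaultdict


def cap_cluster_sizes(labels, max_per_cluster):
    """Return indices of molecules to keep, at most `max_per_cluster` per cluster.

    Molecules are kept in their original order, so the first members of each
    cluster survive; shuffle beforehand if a random subset is preferred.
    """
    counts = defaultdict(int)
    kept = []
    for idx, label in enumerate(labels):
        if counts[label] < max_per_cluster:
            counts[label] += 1
            kept.append(idx)
    return kept


toy_labels = [0, 0, 0, 1, 2, 2, 0, 1]
kept = cap_cluster_sizes(toy_labels, max_per_cluster=2)
print(kept)  # [0, 1, 3, 4, 5, 7]
```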
[3]:
# run MaxMin clustering (distance threshold 0.4) - this outputs a label for each molecule
clusterer = MaxMinClustering(distance_threshold=0.4, random_state=0)
labels = clusterer.fit_predict(fps)
labels
[3]:
array([124, 307, 194, ..., 263, 263, 216], shape=(1513,))
Inspecting Clustering Results
To better understand the outcome, we first examine how many clusters were created and how many molecules each cluster contains on average.
[4]:
n_molecules = len(smiles)
n_clusters = len(clusterer.centroid_indices_)
print(f"Number of molecules: {n_molecules}")
print(f"Number of clusters (distance_threshold=0.4): {n_clusters}")
print(f"Average molecules per cluster: {n_molecules / n_clusters:.1f}")
Number of molecules: 1513
Number of clusters (distance_threshold=0.4): 347
Average molecules per cluster: 4.4
Cluster size distribution
Next, we attach the cluster labels to a table and rank clusters by size.
Larger clusters correspond to densely populated regions of chemical space, while smaller clusters often represent more unique or isolated chemotypes.
[5]:
df = pd.DataFrame(
    {
        "smiles": smiles,
        "cluster": labels,
    }
)
cluster_sizes = df.groupby("cluster").size().sort_values(ascending=False)
print("Top 10 largest clusters:")
cluster_sizes.head(10)
Top 10 largest clusters:
[5]:
cluster
0 33
17 32
261 29
281 29
187 24
15 21
324 18
218 18
190 18
145 17
dtype: int64
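Cluster labels like these can also drive the cluster-aware splits mentioned in the introduction, where whole clusters go to either the training or the test set so that close analogues do not end up on both sides. The following is a minimal sketch under that assumption; `cluster_split` and `test_fraction` are hypothetical names, and the greedy fill strategy is only one of several reasonable choices.

```python
import random
from collections import defaultdict


def cluster_split(labels, test_fraction=0.2, seed=0):
    """Split molecule indices into train/test so that no cluster is shared."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    # Visit clusters in a reproducible random order.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    n_test_target = test_fraction * len(labels)
    train, test = [], []
    for cluster_id in cluster_ids:
        # Fill the test set with whole clusters until it reaches the target size.
        target = test if len(test) < n_test_target else train
        target.extend(members[cluster_id])
    return sorted(train), sorted(test)


split_labels = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
train_idx, test_idx = cluster_split(split_labels, test_fraction=0.3)
print(train_idx, test_idx)
```

Because whole clusters are moved at once, the test set can end up somewhat larger than the requested fraction.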
Effect of the distance threshold
Finally, we explore how changing the distance threshold affects the number of clusters.
[6]:
for t in [0.3, 0.4, 0.5, 0.7]:
    c = MaxMinClustering(distance_threshold=t, random_state=0)
    c.fit(fps)
    print(f"threshold={t:0.2f} -> n_clusters={len(c.centroid_indices_)}")
threshold=0.30 -> n_clusters=600
threshold=0.40 -> n_clusters=347
threshold=0.50 -> n_clusters=211
threshold=0.70 -> n_clusters=83
Increasing the distance threshold forces centroids to be more dissimilar, so fewer centroids are selected and the clusters become fewer and broader: the count drops from 600 at a threshold of 0.3 to 83 at 0.7.