Molecular Clustering Algorithms

scikit-fingerprints provides tools to partition chemical space using fingerprint-based methods. These clustering algorithms make it possible to divide large molecular collections into smaller, representative groups, which is useful in a variety of settings such as:

  • organizing compounds in virtual screening campaigns

  • exploring chemical diversity

  • constructing stratified splits for machine-learning model validation

In the following tutorial, we demonstrate how to work with built-in datasets, preprocess molecular data, and apply clustering algorithms to partition chemical space in practice.

[1]:
import pandas as pd

from skfp.clustering import MaxMinClustering
from skfp.datasets.moleculenet import load_bace
from skfp.fingerprints import ECFPFingerprint
[2]:
# Get and preprocess data - here, we use a built-in MoleculeNet dataset and compute binary fingerprints

smiles, _ = load_bace()
fps = ECFPFingerprint().fit_transform(smiles)

Available clustering algorithms

The table below summarizes the clustering algorithms currently implemented in scikit-fingerprints. Additional methods may be added in future releases.

| Algorithm | Distance / Similarity | Centroid-based | Produces labels | Predict on new data | Typical use case |
|---|---|---|---|---|---|
| MaxMin clustering | Tanimoto (binary fingerprints) | Yes (diverse representatives) | Yes | Yes | Partitioning chemical space into representative clusters |

MaxMin clustering

MaxMin clustering is a centroid-based clustering method designed for binary molecular fingerprints. It combines diversity-based centroid selection with similarity-based cluster assignment, making it particularly suitable for partitioning chemical space into representative regions.

Unlike many classical clustering algorithms, MaxMin clustering explicitly selects cluster centers to maximize diversity before assigning molecules to clusters.

Algorithm overview

MaxMin clustering proceeds in two stages:

  • Centroid selection (MaxMin diversity picking): A set of representative molecules is selected using RDKit’s MaxMinPicker. Centroids are chosen iteratively such that each newly selected centroid maximizes the minimum Tanimoto distance to all previously selected centroids. This encourages broad coverage of chemical space.

  • Cluster assignment: Once centroids are fixed, every molecule (including the centroids themselves) is assigned to the cluster of the centroid with the highest Tanimoto similarity, i.e. its nearest centroid.
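The two stages above can be sketched in plain Python. This is an illustrative re-implementation, not the scikit-fingerprints or RDKit code: fingerprints are represented as sets of on-bit indices, the helper names `tanimoto` and `maxmin_cluster` are hypothetical, the first centroid is fixed rather than drawn at random, and the stopping rule (stop once no molecule is at least `distance_threshold` away from every centroid) is an assumption about how the threshold is applied.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def maxmin_cluster(fps, distance_threshold, first=0):
    """Greedy MaxMin centroid picking followed by nearest-centroid assignment."""
    if distance_threshold <= 0:
        raise ValueError("distance_threshold must be positive")

    # Stage 1: centroid selection.
    centroids = [first]
    # Distance from each molecule to its closest centroid chosen so far.
    min_dist = [1.0 - tanimoto(fp, fps[first]) for fp in fps]
    while True:
        # Candidate = molecule farthest from all current centroids.
        candidate = max(range(len(fps)), key=min_dist.__getitem__)
        if min_dist[candidate] < distance_threshold:
            break  # assumed stopping rule: nothing is far enough away
        centroids.append(candidate)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fps[candidate]))

    # Stage 2: assign every molecule to its most similar centroid.
    labels = [
        max(range(len(centroids)), key=lambda k: tanimoto(fp, fps[centroids[k]]))
        for fp in fps
    ]
    return centroids, labels


# Toy fingerprints: two pairs of similar molecules and one outlier.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 12, 13}, {20, 21}]
centroids, labels = maxmin_cluster(fps, distance_threshold=0.5)
print(centroids)  # [0, 2, 4]
print(labels)     # [0, 0, 1, 1, 2]
```

In this sketch the label of a molecule is the position of its centroid in the selection order, which is one possible convention; the library may number clusters differently.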

Distance and similarity

MaxMin clustering operates on binary fingerprints and uses:

  • Tanimoto similarity to measure molecular similarity

  • Tanimoto distance (1 − similarity) during centroid selection

This choice is standard in cheminformatics and well suited for sparse binary representations such as ECFP (Morgan) fingerprints.
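As a small worked example of these two quantities, the sketch below computes Tanimoto similarity and distance for fingerprints written as sets of on-bit indices. The function names are illustrative, and returning 1.0 for two empty fingerprints is a convention chosen here, not a statement about the library.

```python
def tanimoto_similarity(a, b):
    """Shared on-bits divided by the total number of distinct on-bits."""
    union = len(a | b)
    if union == 0:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / union


def tanimoto_distance(a, b):
    """Tanimoto distance, defined as 1 - similarity."""
    return 1.0 - tanimoto_similarity(a, b)


fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}

# 2 shared bits ({1, 4}) out of 5 distinct bits ({1, 4, 7, 8, 9}).
print(f"{tanimoto_similarity(fp_a, fp_b):.2f}")  # 0.40
print(f"{tanimoto_distance(fp_a, fp_b):.2f}")    # 0.60
```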

Controlling cluster granularity

The number and spread of clusters are controlled by the distance threshold used during centroid selection:

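  • Lower thresholds allow centroids to lie closer together, so more centroids are selected and the clusters are more numerous and finer-grained

  • Higher thresholds require centroids to be farther apart, so fewer centroids are selected and the clusters are fewer and broader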

The exact number of clusters emerges from the data and the chosen threshold.

The algorithm does not balance cluster sizes or select fixed-size subsets; such operations are left to downstream processing.
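One possible form of such downstream processing is to cap how many molecules are kept per cluster. The sketch below is a hypothetical helper (`cap_cluster_sizes` is not part of scikit-fingerprints) that works on any sequence of cluster labels and keeps the first few members of each cluster in input order.

```python
from collections import defaultdict


def cap_cluster_sizes(labels, max_per_cluster):
    """Return indices that keep at most max_per_cluster molecules per cluster."""
    kept = []
    counts = defaultdict(int)
    for idx, label in enumerate(labels):
        if counts[label] < max_per_cluster:
            counts[label] += 1
            kept.append(idx)
    return kept


# Toy labels: cluster 0 has four members, cluster 1 has one, cluster 2 has two.
toy_labels = [0, 0, 0, 1, 2, 2, 0]
kept = cap_cluster_sizes(toy_labels, max_per_cluster=2)
print(kept)  # [0, 1, 3, 4, 5]
```

Keeping the first members in input order is the simplest choice; a random or similarity-ranked selection within each cluster would be equally valid, depending on the application.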

[3]:
# run MaxMin clustering (distance threshold 0.4) - this outputs a label for each molecule
clusterer = MaxMinClustering(distance_threshold=0.4, random_state=0)
labels = clusterer.fit_predict(fps)
labels
[3]:
array([124, 307, 194, ..., 263, 263, 216], shape=(1513,))

Inspecting Clustering Results

To better understand the outcome, we first examine how many clusters were created and how many molecules each cluster contains on average.

[4]:
n_molecules = len(smiles)
n_clusters = len(clusterer.centroid_indices_)

print(f"Number of molecules: {n_molecules}")
print(f"Number of clusters (distance_threshold=0.4): {n_clusters}")
print(f"Average molecules per cluster: {n_molecules / n_clusters:.1f}")
Number of molecules: 1513
Number of clusters (distance_threshold=0.4): 347
Average molecules per cluster: 4.4

Cluster size distribution

Next, we attach the cluster labels to a table and rank clusters by size.

Larger clusters correspond to densely populated regions of chemical space, while smaller clusters often represent more unique or isolated chemotypes.

[5]:
df = pd.DataFrame(
    {
        "smiles": smiles,
        "cluster": labels,
    }
)

cluster_sizes = df.groupby("cluster").size().sort_values(ascending=False)

print("Top 10 largest clusters:")
cluster_sizes.head(10)
Top 10 largest clusters:
[5]:
cluster
0      33
17     32
261    29
281    29
187    24
15     21
324    18
218    18
190    18
145    17
dtype: int64
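Cluster labels like these can also support the model-validation use case mentioned in the introduction. The sketch below is one simple way to build a cluster-aware train/test split in which whole clusters go to one side; the helper `cluster_split` and its smallest-clusters-first policy are assumptions made for illustration, not a scikit-fingerprints API.

```python
from collections import defaultdict


def cluster_split(labels, test_fraction=0.2):
    """Split indices so that no cluster is shared between train and test."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    target = test_fraction * len(labels)
    train, test = [], []
    # Visit the smallest clusters first, so isolated chemotypes are held out.
    for label in sorted(members, key=lambda lab: (len(members[lab]), lab)):
        if len(test) + len(members[label]) <= target:
            test.extend(members[label])
        else:
            train.extend(members[label])
    return sorted(train), sorted(test)


# Toy labels for 12 molecules in four clusters of sizes 6, 2, 1 and 3.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 3, 3]
train_idx, test_idx = cluster_split(toy_labels, test_fraction=0.25)
print(train_idx)  # [0, 1, 2, 3, 4, 5, 9, 10, 11]
print(test_idx)   # [6, 7, 8]
```

Because entire clusters are moved together, the test set may end up slightly smaller than the requested fraction.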

Effect of the distance threshold

Finally, we explore how changing the distance threshold affects the number of clusters.

[6]:
for t in [0.3, 0.4, 0.5, 0.7]:
    c = MaxMinClustering(distance_threshold=t, random_state=0)
    c.fit(fps)
    print(f"threshold={t:0.2f} -> n_clusters={len(c.centroid_indices_)}")
threshold=0.30 -> n_clusters=600
threshold=0.40 -> n_clusters=347
threshold=0.50 -> n_clusters=211
threshold=0.70 -> n_clusters=83

Increasing the distance threshold forces centroids to be more dissimilar, so fewer centroids are selected and the clusters become fewer and broader: here the count drops from 600 clusters at a threshold of 0.3 to 83 at 0.7.