maxmin_stratified_train_valid_test_split

Split using MaxMin algorithm with stratification.

A variant of MaxMin split (see maxmin_train_valid_test_split()), modified to split each class separately. The goal is to preserve the class distribution in the resulting train, valid, and test splits, while also distributing points in each subset across the chemical space.
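The per-class idea can be sketched in plain Python. This is only an illustration under stated assumptions: it uses a greedy farthest-point (MaxMin-style) selection on toy 2D points with Euclidean distance instead of molecular fingerprints, and the picking order (test first, then valid, remainder to train) is assumed rather than taken from the library. The helper names maxmin_pick and stratified_maxmin_split are hypothetical.

```python
import math
import random


def maxmin_pick(points, k, seed=0):
    """Greedily pick up to k indices; each new pick maximizes its
    minimum distance to the points already picked."""
    k = min(k, len(points))
    if k <= 0:
        return []
    rng = random.Random(seed)
    picked = [rng.randrange(len(points))]  # random initial point
    while len(picked) < k:
        best = max(
            (i for i in range(len(points)) if i not in picked),
            key=lambda i: min(math.dist(points[i], points[j]) for j in picked),
        )
        picked.append(best)
    return picked


def stratified_maxmin_split(points, labels, valid_frac, test_frac, seed=0):
    """Split every class separately, then merge the per-class subsets.
    Assumed order: test is picked first, then valid, the rest is train."""
    train, valid, test = [], [], []
    for cls in sorted(set(labels)):
        remaining = [i for i, y in enumerate(labels) if y == cls]
        n_test = round(len(remaining) * test_frac)
        n_valid = round(len(remaining) * valid_frac)
        for subset, k in ((test, n_test), (valid, n_valid)):
            local = maxmin_pick([points[i] for i in remaining], k, seed)
            chosen = [remaining[j] for j in local]
            subset.extend(chosen)
            remaining = [i for i in remaining if i not in chosen]
        train.extend(remaining)
    return train, valid, test


# Toy data: two classes of 10 points each, lying on two parallel lines.
points = [(float(i), 0.0) for i in range(10)] + [(float(i), 5.0) for i in range(10)]
labels = [0] * 10 + [1] * 10
train, valid, test = stratified_maxmin_split(points, labels, 0.2, 0.2)
```

Because each class is handled on its own, every subset receives the same share of each class (here 2 test and 2 valid points per class), while distances are only maximized within a class.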

Note that results may differ quite strongly from regular MaxMin split, as here classes are treated independently of each other. While the distances between compounds in each class are maximized, there are no guarantees for the overall dataset. However, generally this split should also result in relatively uniform coverage of the entire chemical space.

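Resulting sizes of train, valid, and test sets may differ slightly from the requested ones, particularly for very small datasets. This is because each class is split separately with the given percentage, so rounding happens per class rather than once for the whole dataset.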
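A small arithmetic sketch shows how per-class rounding can shift the subset sizes. The rounding rule used here (truncate the train and valid counts, give the remainder to test) and the helper name split_counts are assumptions made for illustration; the rule actually applied by the function may differ.

```python
def split_counts(n, train_frac, valid_frac):
    """Hypothetical rounding rule: truncate train and valid counts,
    assign whatever is left to test."""
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return n_train, n_valid, n - n_train - n_valid


# Tiny dataset: 7 samples of class 0 and 4 samples of class 1.
class_counts = {0: 7, 1: 4}
total = sum(class_counts.values())

# Splitting all 11 samples at once with an 80/10/10 split.
global_counts = split_counts(total, 0.8, 0.1)

# Splitting each class separately, then summing the per-class subset sizes.
per_class = [split_counts(n, 0.8, 0.1) for n in class_counts.values()]
stratified_counts = tuple(sum(c) for c in zip(*per_class))

print(global_counts)      # (8, 1, 2)
print(stratified_counts)  # (8, 0, 3)
```

With classes this small, neither class is large enough to contribute a validation sample on its own, so the stratified totals differ from the single global split.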

If train_size, valid_size and test_size are integers, they must sum up to the data length. If they are floats, they must sum up to 1.

Parameters:

data (sequence) – A sequence representing either SMILES strings or RDKit Mol objects.
labels (array-like) – An array or list with class labels as integers.
additional_data (sequence) – Additional sequences to be split alongside the main data, e.g. labels.
train_size (float, default=None) – The fraction of data to be used for the train subset. If None, it is set to 1 - test_size - valid_size. If valid_size is not provided, train_size is set to 1 - test_size. If train_size, test_size and valid_size aren’t set, train_size is set to 0.8.
valid_size (float, default=None) – The fraction of data to be used for the validation subset. If None, it is set to 1 - train_size - test_size. If train_size, test_size and valid_size aren’t set, valid_size is set to 0.1.
test_size (float, default=None) – The fraction of data to be used for the test subset. If None, it is set to 1 - train_size - valid_size. If valid_size is not provided, test_size is set to 1 - train_size. If train_size, test_size and valid_size aren’t set, test_size is set to 0.1.
return_indices (bool, default=False) – Whether to return the indices of the subsets instead of the data itself, i.e. the SMILES strings or RDKit Mol objects.
random_state (int, default=0) – Random generator seed that will be used for selecting initial molecules.
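
The default rules for the three size parameters can be summarized in a short sketch. The helper resolve_sizes is hypothetical and is not part of the library; it only mirrors the rules as stated in the parameter descriptions for float sizes. Integer sizes are not handled, and the case where only valid_size is given is not covered by those descriptions, so the sketch rejects it.

```python
import math


def resolve_sizes(train_size=None, valid_size=None, test_size=None):
    """Hypothetical helper: fill in missing float sizes following the
    documented defaults. Not the library's implementation."""
    if train_size is None and valid_size is None and test_size is None:
        return 0.8, 0.1, 0.1
    if train_size is None and test_size is None:
        # Only valid_size given: the documented rules do not cover this case.
        raise ValueError("provide train_size or test_size")
    if train_size is None:
        train_size = 1.0 - test_size - (valid_size or 0.0)
    if test_size is None:
        test_size = 1.0 - train_size - (valid_size or 0.0)
    if valid_size is None:
        # Clamp tiny negative values caused by floating-point error.
        valid_size = max(0.0, 1.0 - train_size - test_size)
    if not math.isclose(train_size + valid_size + test_size, 1.0):
        raise ValueError("float sizes must sum up to 1")
    return train_size, valid_size, test_size


defaults = resolve_sizes()
only_test = resolve_sizes(test_size=0.2)
no_test = resolve_sizes(train_size=0.7, valid_size=0.2)
```

Under these assumptions, passing only test_size=0.2 leaves no room for a validation subset, since train_size becomes 1 - test_size.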

Returns:

subsets – Tuple with train-valid-test subsets of provided arrays. First three are lists of SMILES strings or RDKit Mol objects, depending on the input type. Next three are NumPy arrays with labels of train-valid-test subsets. If return_indices is True, lists of indices are returned instead of actual data as the first three elements.

Return type:

tuple[list, list, …]
