Diversity-Aware Top-k Metric
Overview
The diversity-aware top-k metric is an evaluation metric for molecular generation tasks that balances quality and chemical diversity. Unlike the standard top-k metric, which considers only scores, this metric requires the selected molecules to be sufficiently different from each other.
Key Concepts
Diversity Constraint
The metric uses a similarity threshold (parameter t) to enforce diversity:
- Selected molecules must have Tanimoto similarity < t with each other
- Lower t values enforce stricter diversity (e.g., t=0.1 requires very different molecules), and the constraint becomes increasingly hard to satisfy as t decreases.
- Higher t values allow more similar molecules (e.g., t=0.9 is more lenient), and the metric converges to the top-k score as t approaches 1 (see Top-k Metric).
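To make the constraint above concrete, here is a minimal sketch of Tanimoto similarity computed on sets of "on" fingerprint bits, followed by the threshold check. The helper names `tanimoto` and `is_diverse` and the toy bit sets are assumptions made for illustration; they are not part of the package, which works on real fingerprints.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Convention chosen for this sketch when both fingerprints are empty.
        return 0.0
    return len(bits_a & bits_b) / union


def is_diverse(sim: float, t: float) -> bool:
    """True when a pair satisfies the constraint (similarity strictly below t)."""
    return sim < t


# Toy fingerprints: hypothetical bit indices, not real ECFP output.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}

sim_ab = tanimoto(fp_a, fp_b)  # 2 shared bits out of 5 distinct bits
print(sim_ab)                      # 0.4
print(is_diverse(sim_ab, t=0.9))   # True: a lenient threshold accepts the pair
print(is_diverse(sim_ab, t=0.1))   # False: a strict threshold rejects it
```

The same pair passes or fails depending only on t, which is why lower thresholds make it harder to fill all k slots.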
Greedy Selection Algorithm
The selection process uses a greedy approach:
- Sort molecules by score (highest first)
- Select the highest-scoring molecule
- For each remaining molecule (in descending score order):
  - If it's sufficiently different from all selected molecules, select it
  - Otherwise, skip it and check the next candidate
- Stop when k molecules are selected or all candidates are evaluated
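The steps above can be sketched in a few lines of plain Python operating on a square similarity matrix. The function name `greedy_diverse_select` is hypothetical and this is not the package's implementation; it only illustrates the selection order and the skip rule.

```python
def greedy_diverse_select(sim, scores, k, t):
    """Return indices of up to k items, highest score first,
    such that every selected pair has similarity strictly below t."""
    # Candidate order: highest score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == k:
            break
        # Keep the candidate only if it differs enough from every earlier pick.
        if all(sim[i][j] < t for j in selected):
            selected.append(i)
    return selected


# Similarity matrix and scores taken from the pre-computed matrix example on this page.
sim = [
    [1.0, 0.3, 0.9, 0.2],  # benzene
    [0.3, 1.0, 0.4, 0.6],  # ibuprofen
    [0.9, 0.4, 1.0, 0.3],  # naphthalene
    [0.2, 0.6, 0.3, 1.0],  # ethanol
]
scores = [8.5, 9.2, 8.0, 6.5]

print(greedy_diverse_select(sim, scores, k=3, t=0.7))   # [1, 0, 3]
print(greedy_diverse_select(sim, scores, k=3, t=0.95))  # [1, 0, 2]
```

With t=0.7, naphthalene is skipped because its similarity to benzene (0.9) is not below the threshold, so ethanol takes the third slot. With t=0.95 nothing is skipped and the selection matches plain top-3.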
Padding Mechanism
If fewer than k diverse molecules are found, the remaining slots are padded with 0.0 scores, penalizing models that cannot generate k molecules meeting the diversity constraint.
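As a small worked example of the padding arithmetic, the sketch below averages the selected scores over k slots, filling any missing slot with 0.0. The helper name `padded_top_k_mean` is an assumption for illustration only.

```python
def padded_top_k_mean(selected_scores, k):
    """Average over exactly k slots; slots without a selected molecule count as 0.0."""
    padded = list(selected_scores)[:k]
    padded += [0.0] * (k - len(padded))
    return sum(padded) / k


# Only one molecule passed the diversity check, but k=2 slots are averaged.
one_selected = padded_top_k_mean([9.2], k=2)
print(one_selected)  # 4.6

# Both slots filled: no padding is applied.
two_selected = padded_top_k_mean([9.2, 8.5], k=2)
print(round(two_selected, 2))  # 8.85
```

The first case reproduces the 4.6 shown for the strict threshold in the usage examples: a single selected score of 9.2 divided across two slots.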
Molecular Fingerprints
The metric uses molecular fingerprints to compute similarity:
- Default: ECFP4-1024 (Extended Connectivity Fingerprint, diameter 4, 1024 bits)
- Supported: ECFP, MACCS, RDKit, Gobbi2D, Avalon
- See Fingerprint Utilities for details
Usage Examples
Basic Usage with SMILES
from mol_gen_docking.evaluation.diversity_aware_top_k import diversity_aware_top_k
# Generated molecules as SMILES
smiles = [
"c1ccccc1", # (score: 8.5)
"CC(C)Cc1ccc(cc1)C(C)C(O)=O", # (score: 9.2)
"c1ccc2ccccc2c1", # (score: 8.0) - similar to benzene
"CCO" # (score: 6.5)
]
scores = [8.5, 9.2, 8.0, 6.5]
# Select top 2 molecules with similarity threshold 0.9
# (molecules must have similarity < 0.9, i.e., distance > 0.1)
metric = diversity_aware_top_k(
smiles,
scores,
k=2,
t=0.9,
fingerprint_name="ecfp4-1024"
)
print(f"Diversity-aware top-2 score: {metric}")
# Output:
# >>> Diversity-aware top-2 score: 8.85
metric = diversity_aware_top_k(
smiles,
scores,
k=2,
t=0.05,
fingerprint_name="ecfp4-1024"
)
print(f"Diversity-aware top-2 score: {metric}")
# Output:
# >>> Diversity-aware top-2 score: 4.6
Using RDKit Mol Objects
from mol_gen_docking.evaluation.diversity_aware_top_k import diversity_aware_top_k
from rdkit import Chem
# Convert to Mol objects (smiles and scores are the lists defined in the previous example)
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
# Same interface works with Mol objects
metric = diversity_aware_top_k(mols, scores, k=3, t=0.7)
print(metric)
# Output:
# >>> 8.566666666666666
Using Pre-computed Similarity Matrix
import numpy as np
from mol_gen_docking.evaluation.diversity_aware_top_k import diversity_aware_top_k
# Pre-computed similarity matrix (diagonal = 1.0)
sim_matrix = np.array([
[1.0, 0.3, 0.9, 0.2], # benzene
[0.3, 1.0, 0.4, 0.6], # ibuprofen
[0.9, 0.4, 1.0, 0.3], # naphthalene
[0.2, 0.6, 0.3, 1.0] # ethanol
])
scores = [8.5, 9.2, 8.0, 6.5]
# Use similarity matrix directly
metric = diversity_aware_top_k(
sim_matrix,
scores,
k=2,
t=0.7
)
print(metric)
# Output:
# >>> 8.85
Function Reference
Diversity-aware top-k evaluation metric for molecular generation.
This module implements a diversity-aware variant of the top-k metric that selects molecules not only based on their scores but also on their chemical diversity. It ensures selected molecules are sufficiently different from each other, preventing the selection of similar redundant compounds.
div_aware_top_k_from_dist(dist, weights, k, t)
Select at most k molecules with highest weights while enforcing minimum distance.
This function implements a greedy selection algorithm that selects molecules with the highest weights while ensuring each selected molecule is at distance (dissimilarity) of at least t from all previously selected molecules.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dist | ndarray[float] | Condensed distance matrix (1D array of upper triangle distances). This should be from scipy.spatial.distance.squareform or similar. Distance values should be in range [0, 1] where 0 = identical, 1 = completely different. | required |
| weights | ndarray[float] | 1D array of weights/scores for each molecule. Higher weights are selected first. Must have length n where n*(n-1)/2 == len(dist). | required |
| k | int | Maximum number of molecules to select. | required |
| t | float | Minimum distance threshold. Selected molecules must be at distance >= t from each other (i.e., dissimilarity >= t). | required |
Returns:
| Type | Description |
|---|---|
| ndarray | 1D NumPy array of indices of selected molecules, sorted by weight (descending). Array may contain fewer than k elements if not enough molecules satisfy the distance constraint. |
Raises:
| Type | Description |
|---|---|
| AssertionError | If the distance matrix size doesn't match the weights array. |
Example
import numpy as np
from mol_gen_docking.evaluation.diversity_aware_top_k import div_aware_top_k_from_dist
# 3 molecules: dist matrix has 3 pairwise distances
dist = np.array([0.3, 0.8, 0.2]) # condensed distance matrix
weights = np.array([8.5, 9.2, 7.1])
selected = div_aware_top_k_from_dist(dist, weights, k=2, t=0.5)
print(f"Selected indices: {selected}") # Molecules at distance >= 0.5
Notes
- Uses a greedy algorithm: sorts by weight and selects molecules in order
- Once a molecule is selected, it acts as a constraint for future selections
- No backtracking: if a high-weight molecule can't be selected due to distance constraints, it's skipped (lower-weight candidates are checked next)
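The condensed layout can be hard to picture, so here is a pure-Python sketch of how a pair (i, j) maps to a position in the condensed vector, combined with the greedy rule described in the notes. Both `condensed_index` and `greedy_select_from_condensed` are hypothetical names written for this illustration; they are not the library's code and return a plain list rather than a NumPy array.

```python
def condensed_index(i: int, j: int, n: int) -> int:
    """Position of the pair (i, j), i != j, in a condensed distance vector for n points."""
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)


def greedy_select_from_condensed(dist, weights, k, t):
    """Pick up to k indices by descending weight, keeping pairwise distance >= t."""
    n = len(weights)
    assert len(dist) == n * (n - 1) // 2, "dist length must equal n*(n-1)/2"
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == k:
            break
        if all(dist[condensed_index(i, j, n)] >= t for j in selected):
            selected.append(i)
    return selected


# Same inputs as the example above: d(0,1)=0.3, d(0,2)=0.8, d(1,2)=0.2
dist = [0.3, 0.8, 0.2]
weights = [8.5, 9.2, 7.1]

print(greedy_select_from_condensed(dist, weights, k=2, t=0.5))  # [1]
print(greedy_select_from_condensed(dist, weights, k=2, t=0.2))  # [1, 0]
```

With t=0.5, molecule 1 is picked first and both remaining molecules are closer than 0.5 to it, so only one index comes back even though k=2. This is the no-backtracking behaviour: the sketch never drops molecule 1 to fit the pair (0, 2), whose distance of 0.8 would satisfy the constraint.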
Source code in mol_gen_docking/evaluation/diversity_aware_top_k.py
diversity_aware_top_k(mols, scores, k, t, fingerprint_name='ecfp4-1024')
Calculate diversity-aware top-k metric for molecular generation.
This function computes a diversity-aware top-k metric that selects up to k molecules with the highest scores, subject to the constraint that selected molecules must have chemical similarity below a threshold (i.e., dissimilarity above 1-t). This prevents selecting multiple similar molecules and encourages chemical diversity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mols | List[Mol] \| List[str] \| ndarray | Molecules in one of three formats: a list of SMILES strings (str), a list of RDKit Mol objects (Chem.Mol), or a 2D NumPy array representing a similarity matrix. | required |
| scores | Sequence[float \| int] | Sequence of scores corresponding to each molecule (e.g., docking scores). Must have the same length as mols (unless mols is a similarity matrix). | required |
| k | int | Maximum number of molecules to select. | required |
| t | float | Similarity threshold (range 0.0 to 1.0). Selected molecules must have Tanimoto similarity < t to be considered diverse enough. Lower values enforce higher diversity. | required |
| fingerprint_name | Optional[str] | Name of the molecular fingerprint to use for similarity calculation. Only used when mols are SMILES or Mol objects. Default is "ecfp4-1024". See mol_gen_docking.evaluation.fingeprints_utils.fp_name_to_fn for options. | 'ecfp4-1024' |
Returns:
| Type | Description |
|---|---|
| float | Average score of the selected k diverse molecules. If fewer than k molecules are selected due to diversity constraints, unselected slots are padded with 0.0. |
Raises:
| Type | Description |
|---|---|
| AssertionError | If mols and scores have different lengths, or if input types are inconsistent. |
Example
from mol_gen_docking.evaluation.diversity_aware_top_k import diversity_aware_top_k
import numpy as np  # needed for the similarity-matrix part of this example
# Using SMILES strings
smiles = [
"c1ccccc1", # benzene
"CC(C)Cc1ccc(cc1)C(C)C(O)=O", # ibuprofen
"c1ccc2ccccc2c1", # naphthalene (similar to benzene)
"CCO" # ethanol
]
scores = [8.5, 9.2, 8.0, 6.5]
# Select top 2 molecules with similarity threshold 0.8
metric = diversity_aware_top_k(
smiles, scores, k=2, t=0.8, fingerprint_name="ecfp4-1024"
)
print(f"Diversity-aware top-2 score: {metric}")
# Using a pre-computed similarity matrix
sim_matrix = np.array([
[1.0, 0.3, 0.9, 0.2],
[0.3, 1.0, 0.4, 0.6],
[0.9, 0.4, 1.0, 0.3],
[0.2, 0.6, 0.3, 1.0]
])
metric = diversity_aware_top_k(
sim_matrix, scores, k=2, t=0.8
)
Notes
- The function converts similarity matrices to distance matrices (1 - similarity)
- Higher t values (closer to 1.0) allow selection of more similar molecules
- Lower t values enforce stricter diversity constraints
- If a similarity matrix is provided directly, it should be a 2D NumPy array with diagonal elements equal to 1.0
- Padding with 0.0 for unselected slots means diversity constraints can result in lower average scores than unconstrained top-k
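The first note mentions the conversion from similarity to distance. As a rough illustration, the sketch below turns a square similarity matrix into a condensed distance vector (upper triangle, row by row) and shows how a similarity threshold corresponds to a distance threshold. The function name `similarity_to_condensed_distance` is hypothetical, and how the library treats the boundary case (similarity exactly equal to t) is not shown in this page, so treat the comparison below as an assumption.

```python
def similarity_to_condensed_distance(sim):
    """Upper-triangle distances (1 - similarity), in row-major pair order."""
    n = len(sim)
    return [1.0 - sim[i][j] for i in range(n) for j in range(i + 1, n)]


# Similarity matrix from the example above.
sim = [
    [1.0, 0.3, 0.9, 0.2],
    [0.3, 1.0, 0.4, 0.6],
    [0.9, 0.4, 1.0, 0.3],
    [0.2, 0.6, 0.3, 1.0],
]

dist = similarity_to_condensed_distance(sim)
print(len(dist))                      # 6 pairs for 4 molecules
print([round(d, 2) for d in dist])    # [0.7, 0.1, 0.8, 0.6, 0.4, 0.7]

# A similarity threshold t corresponds to a distance threshold of 1 - t:
# similarity < t is the same condition as distance > 1 - t.
t = 0.8
too_similar = [round(d, 2) for d in dist if not d > 1.0 - t]
print(too_similar)                    # [0.1] -> the benzene/naphthalene pair
```

Only the benzene/naphthalene pair (similarity 0.9) violates a threshold of t=0.8, which is why the two highest-scoring molecules in that example can still be selected together.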
References
This metric is commonly used in molecular generation benchmarks to evaluate both quality and diversity of generated molecules (e.g., De Novo Generation task).
Source code in mol_gen_docking/evaluation/diversity_aware_top_k.py
Common Pitfalls
- Confusing similarity and distance: In diversity_aware_top_k, t is a similarity threshold (distance = 1 - similarity), whereas div_aware_top_k_from_dist takes t as a minimum distance
- Invalid SMILES: Always validate SMILES strings before passing to function
- Threshold interpretation: Lower t = stricter diversity, not lenient