Fingerprint Utilities
Supported Fingerprints
The module supports several standard fingerprint types used in cheminformatics:
| Fingerprint | Type | Description | Bits |
|---|---|---|---|
| ECFP | Structural | Extended Connectivity Fingerprint (Morgan) | Configurable |
| MACCS | Structural | MACCS Keys fingerprint | 166 |
| RDKit | Structural | RDKit topological fingerprint | 2048 |
| Gobbi2D | Pharmacophore | Gobbi pharmacophore 2D fingerprint | Variable |
| Avalon | Structural | Avalon fingerprint | Variable |
ECFP Parameters
ECFP (Extended Connectivity Fingerprint, also known as the Morgan fingerprint) names are specified as follows:
- Format: "ecfp{diameter}-{nbits}"
- Examples: "ecfp2-1024", "ecfp4-1024", "ecfp6-2048"
- Diameter: Topological diameter (internally converted to radius = diameter/2)
- Bits: Number of bits in the fingerprint (256, 512, 1024, 2048, etc.)
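The naming scheme above can be sketched as a small parser. This is only an illustration under stated assumptions: `parse_ecfp_name` is a hypothetical helper, not a function of the documented module, and the even-diameter check and the exception type are choices made for this sketch rather than confirmed behavior.

```python
import re
from typing import Tuple

# Hypothetical helper for illustration; not part of the documented module.
_ECFP_PATTERN = re.compile(r"ecfp(\d+)-(\d+)")


def parse_ecfp_name(fp_name: str) -> Tuple[int, int]:
    """Split an "ecfp{diameter}-{nbits}" name into (radius, nbits)."""
    match = _ECFP_PATTERN.fullmatch(fp_name)
    if match is None:
        raise ValueError(f"Not an ECFP name: {fp_name!r}")
    diameter, nbits = int(match.group(1)), int(match.group(2))
    # Assumption: only even diameters are meaningful, since radius = diameter / 2.
    if diameter % 2 != 0:
        raise ValueError(f"ECFP diameter must be even, got {diameter}")
    return diameter // 2, nbits


print(parse_ecfp_name("ecfp4-1024"))  # (2, 1024)
```

Under this reading, "ecfp4-1024" corresponds to a Morgan fingerprint of radius 2 folded into 1024 bits.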
Tanimoto Similarity
The module uses Tanimoto (Jaccard) similarity to measure molecular similarity:
- Range: 0.0 (completely dissimilar) to 1.0 (identical)
- Formula: |A ∩ B| / |A ∪ B|
- Suitable for bit vectors and sparse representations
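As a worked instance of the formula, the sketch below computes Tanimoto similarity on plain Python sets of on-bit indices. The function name `tanimoto` and the handling of two empty fingerprints are assumptions of this sketch; libraries may define the empty case differently.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity |A ∩ B| / |A ∪ B| for two sets of on-bit indices."""
    union_size = len(bits_a | bits_b)
    if union_size == 0:
        # Assumption for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(bits_a & bits_b) / union_size


a = {1, 4, 7, 9}
b = {1, 4, 8}
print(tanimoto(a, b))  # 2 shared bits out of 5 distinct bits: 0.4
```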
Function Reference
Molecular fingerprint utilities for similarity calculations.
This module provides utilities for computing molecular fingerprints and similarity matrices. It supports multiple fingerprint types (ECFP, MACCS, RDKit, Gobbi2D, Avalon) and computes Tanimoto similarity between molecules for use in diversity-aware metrics.
fp_name_to_fn(fp_name)
Convert a fingerprint name to a function that generates that fingerprint.
This function returns a callable that takes an RDKit Mol object and returns the corresponding fingerprint. It supports several standard fingerprint types used in cheminformatics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fp_name | str | Name of the fingerprint to use. Supported options: "ecfp{diameter}-{nbits}": Extended Connectivity Fingerprint (Morgan), examples: "ecfp2-1024", "ecfp4-1024", "ecfp6-2048"; "maccs": MACCS Keys fingerprint (166 bit); "rdkit": RDKit fingerprint; "Gobbi2d": Gobbi pharmacophore 2D fingerprint; "Avalon": Avalon fingerprint. | required |
Returns:
| Type | Description |
|---|---|
| Callable[[Mol], ExplicitBitVect] | A callable function that takes an RDKit Mol object and returns an ExplicitBitVect fingerprint. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the fingerprint name is not recognized or has invalid format. |
| AssertionError | If the ECFP parameters don't match expected format. |
Example
Notes
- ECFP diameter is specified as topological diameter (divided by 2 internally to get the radius for Morgan fingerprints)
- All returned fingerprints are explicit bit vectors for similarity calculations
Source code in mol_gen_docking/evaluation/fingeprints_utils.py
get_sim_matrix(mols, fingerprint_name='ecfp4-1024')
Compute a pairwise similarity matrix for a list of molecules.
This function computes Tanimoto similarities between all pairs of molecules using the specified fingerprint method. The result is returned in condensed form: a 1D NumPy array holding only the upper triangle of the pairwise similarity matrix, using the same layout as a SciPy condensed distance matrix.
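To make the condensed layout concrete, here is a minimal pure-Python sketch that computes the pairwise similarities in upper-triangle order. It is not the module's implementation: fingerprints are stood in for by non-empty sets of on-bit indices, and `condensed_similarities` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Set


def condensed_similarities(fingerprints: List[Set[int]]) -> List[float]:
    """Tanimoto similarity for every pair (i, j) with i < j, in row-major order.

    Assumes every fingerprint has at least one on bit, so the union is never empty.
    """
    return [len(a & b) / len(a | b) for a, b in combinations(fingerprints, 2)]


fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
# Pair order is (0, 1), (0, 2), (1, 2).
print(condensed_similarities(fps))  # [0.5, 0.0, 0.0]
```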
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mols | list[Mol] | List of RDKit Mol objects. | required |
| fingerprint_name | str | Name of the fingerprint to use for similarity calculation. Default is "ecfp4-1024" (ECFP with diameter 4 and 1024 bits). See fp_name_to_fn for supported options. | 'ecfp4-1024' |
Returns:
| Type | Description |
|---|---|
| ndarray[float] | A 1D NumPy array containing the upper triangle of the pairwise similarity matrix. The array is in condensed form (as used by scipy.spatial.distance.squareform). For n molecules, the length is n*(n-1)/2. |
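To look up a single pair without expanding the array, the position of pair (i, j) in a condensed array can be computed directly. The sketch below assumes the row-major upper-triangle ordering used by SciPy's condensed form; `condensed_index` is a hypothetical helper, not part of the module.

```python
def condensed_index(i: int, j: int, n: int) -> int:
    """Position of the pair (i, j), i != j, in a condensed array for n items."""
    if not (0 <= i < n and 0 <= j < n):
        raise IndexError(f"indices ({i}, {j}) out of range for n={n}")
    if i == j:
        raise ValueError("the diagonal is not stored in condensed form")
    if i > j:
        i, j = j, i  # the matrix is symmetric, so order does not matter
    # Entries contributed by rows 0..i-1, plus the offset inside row i.
    return n * i - i * (i + 1) // 2 + (j - i - 1)


# For n = 4 the stored order is (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
print(condensed_index(1, 2, 4))  # 3
print(condensed_index(3, 2, 4))  # 5
```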
Example
```python
from rdkit import Chem

from mol_gen_docking.evaluation.fingeprints_utils import get_sim_matrix

# Benzene, ibuprofen, and ethanol
smiles = ["c1ccccc1", "CC(C)Cc1ccc(cc1)C(C)C(O)=O", "CCO"]
mols = [Chem.MolFromSmiles(smi) for smi in smiles]

# Pairwise Tanimoto similarities in condensed (upper-triangle) form
sim_matrix = get_sim_matrix(mols, fingerprint_name="ecfp4-1024")
print(f"Shape: {sim_matrix.shape}")  # (3,) for 3 molecules
```
Notes
- The returned array is in condensed form. Use scipy.spatial.distance.squareform to convert to a full square similarity matrix.
- Similarity values range from 0.0 (completely dissimilar) to 1.0 (identical)
- The matrix is symmetric, so only the upper triangle is computed
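One detail worth keeping in mind when expanding the condensed array: scipy.spatial.distance.squareform fills the diagonal with zeros, which suits distances but not similarities, where each molecule has similarity 1.0 to itself. The pure-Python sketch below shows the expansion with an explicit diagonal value; `to_square` is a hypothetical helper written for illustration, not a function of the module.

```python
from typing import List, Sequence


def to_square(condensed: Sequence[float], n: int, diagonal: float = 1.0) -> List[List[float]]:
    """Expand a condensed upper-triangle array into a full symmetric n x n matrix."""
    if len(condensed) != n * (n - 1) // 2:
        raise ValueError("condensed length must equal n*(n-1)/2")
    square = [[diagonal] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Mirror each stored value across the diagonal.
            square[i][j] = square[j][i] = condensed[k]
            k += 1
    return square


for row in to_square([0.5, 0.0, 0.25], n=3):
    print(row)
# [1.0, 0.5, 0.0]
# [0.5, 1.0, 0.25]
# [0.0, 0.25, 1.0]
```

With NumPy and SciPy available, a comparable result could be obtained by calling squareform and then setting the diagonal to 1.0.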