What Are Molecular Fingerprints and ECFPs?
Molecular fingerprints are representations that encode key functionalities and properties of chemical compounds. They were originally designed for substructure search in databases, but later gained popularity for similarity searching and molecule clustering.
Extended-connectivity fingerprints (ECFPs) are a type of molecular fingerprint specifically designed for predicting and analyzing molecular activity and properties. First introduced in 2000, they have since been widely adopted across various fields. (Interestingly, they are conceptually similar to convolutional operations: like the convolution operator in CNNs that apply the same function in all directions before global pooling, ECFPs use iterative circular hashing to capture local chemical environments before proceeding to a global pooling step.)
ECFPs aim to represent precise substructural features relevant to properties like solubility and biological activity. Key aspects covered here include:
- Their generation via iterative hashing to capture varying-size circular substructures around atoms
- Comparison to other fingerprint methods
- Applications in virtual screening, QSAR modeling, and library analysis
- Implementation from scratch in code using Python, and using open-source tools
ECFP Generation Process - Overview
The ECFP algorithm works by iteratively hashing circular substructures of increasing size around each non-hydrogen atom. There are three main stages (each process occurs for every atom in the molecule):
Step 1 – Initial Atom Assignment
- Atoms have integer IDs assigned to them (generated from a hash function, based on properties like atomic number and ring membership). These are collected into an initial fingerprint set.
Step 2 – Iterative Updating
- In this stage, each atom identifier is updated to reflect the identifiers of each atom’s neighbors
- At each iteration, new IDs are calculated from neighbor IDs by hashing an array that contains its the current atom identifier, the neighboring atom identifiers and the corresponding connectivity, order, and bond type.
- For every iteration, as the context diameter increases, larger and larger circular substructures are represented
Step 3 – Duplicate Removal
- During this stage, multiple occurrences of identical features/substructural regions are reduced to a single feature (alternatively, the counts of the redundant feature can be kept if one desires to have a set of counts as opposed to a standard binary/one-hot encoded fingerprint)
The number of iterations controls the maximum feature size and fingerprint complexity. For example, ECFP_6 has a maximum diameter of 6 bonds. An overview of this process for benzoic acid amide (benzamide) is depicted in the figure below. At the beginning of the process, the initial identifier only represents information about the atom itself and its attached bonds (the “A” atom type represents an arbitrary attachment atom). After one iteration, the identifier now contains information about immediate neighbors. After two iterations, the identifier contains information for a larger substructure (including the amide group plus the aromatic ring).
The iterative updating process enables a local operation in an atomic environment to capture information that represents larger substructures. Initially, the atom identifiers only encode information about the atom itself and its direct bonds. However, by collecting the identifiers of neighboring atoms at each step, the substructure captured rapidly expands outwards from the central atom.
So while the updates use only local neighbor information, this results in atomic identifiers that can represent quite substantial circular substructures. Since the process runs on all atoms, the final fingerprint contains a mixture of small common substructures from the initial iteration as well as large, precise substructures like specific substituted ring systems from later iterations.
Controlling the number of iterations allows fine-tuning the fingerprint – fewer iterations for smaller, more generic features versus more iterations for larger, more specific features. This control over feature size and specificity is a key advantage of the ECFP design.
Below, we will dive deeper into each step.
ECFP Step 1 - Initial Assignment of Atom Identifiers
The first step in generating ECFPs is assigning each non-hydrogen atom a unique integer ID to serve as its initial identifier. These IDs should encode key atomic properties while being independent of atom order/numbering.
Properties Encoded
For standard ECFPs, the properties hashed are:
- Atomic number
- Heavy atom neighbor count
- Valence (minus number of hydrogens)
- Total attached hydrogen count
- Ring membership
- Atomic charge
These properties (with the exception of ring membership) are taken from the Daylight atomic invariants rule and are independent of initial atom numbering. They are then hashed into a 32-bit integer. For example, the carboxylic acid carbon in butyramide in the figure below gets an initial ID of -1100000244. (Depending on the choice of hash, an identifier can be a positive or negative number. Any hash function can be chosen, as long as it yields integers for atoms and is independent of atom numbering. We’ll discuss this more in detail below.)
Variant Fingerprints
Other property rules give fingerprint variants like:
- FCFP: Pharmacophore roles (as opposed to ECFPs, which intend to capture precise environment substructures, FCFPs aim to capture abstract role-based substructures)
- SCFP: Sybyl atom types
- LCFP: AlogP atom codes
The naming convention is a 4-character base (ECFP) plus number of iterations (ECFP_6). The number of iterations impacts the effective coverage diameter for the largest possible feature: i.e., the effective diameter of the largest feature is equal to twice the number of iterations (if three iterations are performed, the largest possible substructure feature will have a width of 6 bonds).
Initializing Fingerprint
The initial IDs for all atoms comprise the starting fingerprint set. For butyramide this set is:
[734603939, 1559650422, 1559650422, -1100000244, 1572579716, -1074141656]
Each atom also tracks the bonds currently represented in its feature (this starts as an empty set during initialization and will grow as the algorithm proceeds through the iterative updating steps).
ECFP Step 2 - Iterative Updating of Identifiers
Iterative updating enables the creation of new, larger features by incorporating connectivity and neighboring atoms to represent larger circular substructures centered on each atom.
Process
To update an atom’s ID in each iteration:
- Create array with iteration number and current atom ID
- Sort neighbors by bond order and neighbor atom ID
- Append the (neighbor ID, bond order) pairs to the array
- (Optional) Add stereochemistry flags
- Hash array into new 32-bit ID, which serves as the new ID for the current atom
Continuing with carbon 4 in butyramide, we initialize the array to:
[1, -1100000244]
1 is for the current iteration, and -1100000244 is the initial atom ID for carbon 4 (calculated using the hash function of the “daylight” extracted features from the section above).
For each non-hydrogen neighbor atom to carbon 4, two numbers are appended to the array:
- The bond order to that neighbor, encoded as integers (1 = single, 2 = double, 3 = triple, 4 = aromatic)
- The current identifier of the neighbor atom
To ensure consistency, the neighbors are sorted by these number pairs before appending them. This establishes a deterministic order independent of atom indexing.
In the butyramide example (the figure above), atom 4 has three neighbors. The sorted pairs are:
- (1, 1559650422) – Single bond to atom 2
- (1, 1572579716) – Single bond to atom 5
- (2, -1074141656) – Double bond to atom 6
After appending the sorted pairs, the final array contains eight elements:
[1, -1100000244, 1, 1559650422, 1, 1572579716, 2, -1074141656]
Hashing this final array gives the new ID -1708545601 for carbon 4, which now encodes the first-level connectivity/features. Repeating this process for each atom in butyramide yields the following array (represented in the figure below):
[734603939, 1559650422, 1559650422, -1100000244, 1572579716, -1074141656, 863188371, -1793471910, -1789102870, -1708545601, -932108170, 2099970318]
Expanding Substructural Features
As iterations increase, the atom IDs represent larger circular substructures outward from the central atom.
For instance, in butyramide, the ID for oxygen (atom 6) goes from representing a double-bonded oxygen initially to a carbonyl after one iteration to a full aliphatic carboxylic acid amide substructure after two iterations (see figure below).
Controlling Feature Size
The number of iterations controls the maximum feature size. More iterations allow larger maximum diameter but take longer to compute. Values of 2-4 are typical.
ECFP Step 3 - Duplicate Removal
Duplicate Structure Removal
As the iteration process expands the circular substructures represented by atom identifiers, it becomes possible for different starting atoms to end up encoding the same structural regions in the molecule. This situation is termed structural duplication – where two atom-centered environments represent an identical substructural unit, despite having different hashed identifiers. (This can occur because at the start, different atoms may have different identifiers due to the hash function, and thus as the iterations proceed, if duplicate structures are present, they may have different hashed IDs.)
For example, in benzoic acid amide, the oxygen and nitrogen atoms both end up representing the same amide substructure after two iterations (see figure below). Their identifiers are different but the underlying bonds covered are equivalent.
This duplication needs to be addressed to avoid adding redundant features to the final fingerprint. The goal is to eliminate repetition to keep the fingerprint concise while retaining all uniquely represented structural fragments.
Identifying Duplicates
To detect duplication, each feature tracks the set of bonds representing its substructure. At each iteration, this set is updated to include:
- Bonds from previous iteration
- Bonds to neighbor attachments
- Bonds between attachments
Before the newly generated features from an iteration are appended to the fingerprint set, they are checked to see if any structural duplicates exist to either previously generated features or newly discovered features. When two features are discovered to be from equivalent bond sets, the following rules are used to remove one:
Removal Rules
When duplicate features are found:
- If generated from different iterations, prefer the feature generated from fewer iterations
- If generated from the same number of iterations, discard the feature represented by the larger ID
Following these rules avoids useless redundancy in the final fingerprint.
Impact on Feature Count
Duplicate removal causes fewer new features in later iterations. For butyramide, there are:
- 5 initial features
- +6 after iteration 1
- +3 after iteration 2
After iteration 2, there are no new features, as the full molecule has been covered (see figure below, showing the final features for butyramide).
Duplicate Identifier Removal
In some cases, the final fingerprint contains duplicate identifiers representing identical features. This happens when the same substructure occurs multiple times in the molecule.
For example, different methyl groups may generate the same initial ID and iterated IDs. A feature could also collide to the same bit identifier as another feature.
Handling Duplicates
There are two approaches:
- Remove duplicate IDs for a condensed fingerprint
- Retain duplicate IDs and represent them as counts (this approach captures multiplicity)
For the standard ECFP, duplicates are removed to keep the fingerprint concise. The absence of a “bit” in the ECFP fingerprint representing a feature means that the feature is not present. On the other hand, the presence of a “bit” in the ECFP fingerprint representing a feature is only suggestive and indicates it likely occurs in the molecule. This is because collisions can occur, meaning there could be other representations mapping to the same ID. This collision handling maximizes use of the large identifier space to represent many possible features. (See the main image of this post at the top for an example of IDs represented as bits and bit collisions.)
Overall, the iteration and duplicate removal stages yield a rich, concise structural description containing up to billions of possible features to characterize molecules.
Hash Function & Interpretation
Choosing the Hash Function
Any reasonable hash function can be used to generate identifiers, as long as it can map arrays of integers randomly and uniformly into a 232-bit space of possible integers. If the hash function can’t map integers uniformly, then the collision rate (i.e., when multiple IDs map to the same “bit”) may increase leading to a loss of information.
A larger space (e.g. 64-bit integers) could reduce collisions, but in practice there is no measurable improvement in analysis.
Additionally, there are hardware considerations: for example, 64-bit operations are slower on 32-bit hardware. So 32-bit integers provide the best tradeoff between identifier uniqueness and computational performance.
The output space and randomness/uniformity are what primarily matter rather than the exact hashing algorithm used.
Interpreting ECFP Identifiers
While the numeric identifiers themselves contain no inherent structural/semantic information/meaning, it is possible to map features to the underlying substructures they represent.
Conceptualizing Identifiers
Identifiers can be considered:
- Indices of bits in a large bit array, or
- Keys in a hash table
But the fingerprint is stored as a list, not explicitly expanded out:
- The fingerprints are stored as lists of integer IDs representing the “on” features. They are not explicitly stored as long bit vectors with zeros everywhere and 1s for those IDs.
Explicitly writing out all 232 bits, when only a small fraction would be non-zero for a given molecule is inefficient. So instead, only the nonzero “on” features, represented as integer IDs, are tracked in the list for storage and manipulation. This variable-length list contains the pertinent data in a compact representation.
The bitset interpretation is an analogy rather than an implementation detail. At runtime, the integers are manipulated directly as hash table keys or array indices rather than as physical bit vectors. Software like RDKit can convert these to bit vectors when preparing a dataset for analysis or modeling.
Regenerating Substructures
During fingerprint generation, each feature tracks:
- Atoms covered
- Bonds covered
This can be used to regenerate the substructure later, represented as a SMARTS string (see figure below).
So while an identifier like “1499521844” reveals nothing, the corresponding substructure can be looked up. (Caveat: the mapping is often NOT one-to-one, due to collisions from hashing.) An example of this is displayed in the figure below for the FCFP pharmacophore fingerprint protocol.
Applications
Since their introduction around 2000, ECFPs have been widely applied across several areas:
Virtual Screening
- Categorizing actives/inactives in noisy high-throughput screening data
Structure-Activity Modeling
- Predicting bioactivity and properties from molecular structures
ADMET Prediction
- Model pharmacokinetic behaviors like absorption, distribution, metabolism, excretion
- Used for diverse endpoints: toxicity, clearance, drug-drug interactions
Compound Library Analysis
- Analyze aspects of molecule collections besides activity (3D docking, diversity, etc.)
ECFP Implementation - Python
ECFP implementations exist in software libraries like RDKit, DeepChem, and others. We will first create a simplified version to illustrate the process. Next, we will show how to compute the actual ECFP using RDKit.
Before we proceed, we’ll need to install the necessary libraries, so run this in the command line terminal:
pip install rdkit bitarray
Simple Python Implementation from Scratch
# Imports
from rdkit import Chem
import hashlib
from bitarray import bitarray
# Helper function to flatten a list of tuples
def flatten(lst):
return [item for sublist in lst for item in sublist]
def atom_invariant(atom):
"""
Generates a hash of atom invariants, capturing detailed properties of each atom.
These properties help in uniquely identifying atoms based on their chemical environment.
The hash of these properties serves as a compact identifier for the atom.
(Note: does not include all properties covered in ECFP)
Parameters:
- atom: An RDKit Atom object.
Returns:
- A hash of a tuple containing atom properties including atomic number, total degree (bonds count including hydrogens),
total number of hydrogens, explicit valence, atomic mass, formal charge, and ring membership.
"""
properties = (
atom.GetAtomicNum(), # Atomic number, identifies the element type.
atom.GetNumImplicitHs() + atom.GetNumExplicitHs(), # Number of attached hydrogens (both implicit and explicit)
atom.GetExplicitValence(), # Explicit valence
atom.GetMass(), # Atomic mass
atom.GetFormalCharge(), # Atom's formal charge.
atom.IsInRing() # Boolean indicating if the atom is part of a ring structure.
)
# Return the hash of the properties tuple
return hash(properties)
def get_neighborhood_hash(mol, atom_idx, iteration, previous_neighborhoods, hash_to_bonds):
"""
Calculates a hash for the neighborhood of an atom at a given iteration depth,
and tracks the bonds involved in that neighborhood. The hash and bond tracking
are used to generate extended connectivity fingerprints (ECFPs).
This function iterates over the neighbors of the specified atom, collects
their properties and bond orders, and creates a hash representing the
neighborhood's substructure at the current iteration. Additionally, it
accumulates and updates the set of bonds involved for each unique neighborhood
hash seen across iterations, supporting substructure tracking within the molecule.
Parameters:
- mol (RDKit Mol): The molecule being analyzed.
- atom_idx (int): Index of the atom whose neighborhood hash is being calculated.
- iteration (int): The current iteration depth, controlling the "radius" of
the neighborhood being considered.
- previous_neighborhoods (set): A set of previously encountered neighborhood
hashes to avoid duplications.
- hash_to_bonds (dict): A dictionary mapping unique neighborhood hashes to sets
of bonds (as frozensets of atom indices and bond types) included in each
neighborhood.
Returns:
- (tuple or None, set): A tuple containing the new neighborhood hash (or None
if the neighborhood hash is not new) and the updated set of bonds for the
current neighborhood. If the neighborhood hash is not new, it returns None
instead of the hash to indicate that this particular neighborhood structure
has been previously encountered.
"""
### STEP 1 - GET NEIGHBORHOOD HASH ###
atom = mol.GetAtomWithIdx(atom_idx)
# Start with a list containing the iteration number and the core atom's hash
neighborhood = [(iteration, atom_invariant(atom))]
bond_order_neighbor_pairs = []
# Loop through neighbors and collect each neighbors bond order and hash
for nbr in atom.GetNeighbors():
bond = mol.GetBondBetweenAtoms(atom_idx, nbr.GetIdx())
bond_order = bond.GetBondTypeAsDouble()
neighbor_hash = atom_invariant(nbr)
bond_order_neighbor_pairs.append((bond_order, neighbor_hash))
# Sort the neighbor pairs by bond order, then by neighbor hash
bond_order_neighbor_pairs.sort()
# Extend the neighborhood list with the sorted neighbor information
for pair in bond_order_neighbor_pairs:
neighborhood.append(pair)
# Flatten the list and convert to a tuple for hashing
neighborhood_tuple = tuple(flatten(neighborhood))
neighborhood_hash = hash(neighborhood_tuple)
# Check if the neighborhood hash is new before adding it to previous_neighborhoods
is_new_hash = neighborhood_hash not in previous_neighborhoods
### STEP 2 - BOND TRACKING ###
# Initialize an empty set to accumulate bonds for the current neighborhood hash.
accumulated_bonds = set()
# Accumulate all bonds associated with the core atom and its neighbors from the previous iteration
if iteration > 0:
for prev_hash, prev_bonds in hash_to_bonds.items():
if any(atom_idx in bond_atoms for bond_atoms, bond_type, iter_num in prev_bonds if iter_num == iteration - 1):
accumulated_bonds.update({(bond_atoms, bond_type, iter_num) for bond_atoms, bond_type, iter_num in prev_bonds if iter_num == iteration - 1})
# Collect bonds for tracking, using frozenset for non-directionality (in the context of molecular graphs, a bond between
# atom 0 and atom 1 is the same as a bond between atom 1 and atom 0)
# Each element of the bonds set is a tuple where the first element is a frozenset identifying the bond by the indices of the atoms
# involved, and the second element is the bond type represented as a double. This allows tracking both the connectivity
# and the type of each bond.
for nbr in atom.GetNeighbors():
bond = mol.GetBondBetweenAtoms(atom_idx, nbr.GetIdx())
accumulated_bonds.add((frozenset({atom_idx, nbr.GetIdx()}), bond.GetBondTypeAsDouble(), iteration))
### STEP 3 - UPDATE AND RETURN ###
# Updating ID set and initial collection of bonds for this neighborhood, if it's a new hash
# Decide on what to return based on whether the neighborhood_hash is new
if is_new_hash:
# Add hash to the unique set of IDs passed in during the ecfp call
previous_neighborhoods.add(neighborhood_hash)
hash_to_bonds[neighborhood_hash] = accumulated_bonds
return neighborhood_hash, accumulated_bonds
else:
# Update the existing entry with newly accumulated bonds
hash_to_bonds[neighborhood_hash].update(accumulated_bonds)
return None, accumulated_bonds
def ecfp(mol, radius=2):
"""
Computes the Extended Connectivity Fingerprint (ECFP) for a given molecule, starting with iteration 0
for just the atom identifiers, then expanding in further iterations. Additionally, collects all unique
identifiers generated throughout the iterations.
Parameters:
- mol: An RDKit Molecule object.
- radius: The radius of neighborhoods to consider, corresponding to the iteration count.
Returns:
- A bitarray representing the ECFP of the molecule.
- A list of all unique identifiers generated through all iterations.
- A set of "tracked bonds and substructures"
"""
fp_length = 2048 # Length of the fingerprint bitarray.
fingerprint = bitarray(fp_length)
fingerprint.setall(0)
all_unique_identifiers = set() # Set to keep track of all unique identifiers generated
hash_to_bonds = {} # Dictionary to map unique hashes to their bonds
for iteration in range(0, radius + 1):
for atom_idx in range(mol.GetNumAtoms()):
if iteration == 0:
# Directly use atom_invariant as the unique identifier without neighbor info
atom_hash = atom_invariant(mol.GetAtomWithIdx(atom_idx))
# Add to the set of unique identifiers
all_unique_identifiers.add(atom_hash)
# Map the atom's hash to a position in the fingerprint and set the bit
bit_pos = int(hashlib.sha256(str(atom_hash).encode('utf-8')).hexdigest(), 16) % fp_length
fingerprint[bit_pos] = 1
else:
# Include neighbor information in subsequent iterations
neighborhood_hash, bonds = get_neighborhood_hash(mol, atom_idx, iteration, all_unique_identifiers, hash_to_bonds)
# print(neighborhood_hash)
if neighborhood_hash is not None:
# Add to the set of unique identifiers
all_unique_identifiers.add(neighborhood_hash)
# Store the bonds for this unique hash
hash_to_bonds[neighborhood_hash] = bonds
# Map the hash to a position in the fingerprint and set the bit
bit_pos = int(hashlib.sha256(str(neighborhood_hash).encode('utf-8')).hexdigest(), 16) % fp_length
fingerprint[bit_pos] = 1
# Convert the set of all unique identifiers to a list
unique_identifiers_list = list(all_unique_identifiers)
return fingerprint, unique_identifiers_list, hash_to_bonds
And to put it all together, we run the ecfp function using butyramide as an example:
# Example using butyramide
smiles = 'CCCC(=O)N'
mol = Chem.MolFromSmiles(smiles)
fingerprint, ids, bond_tracking = ecfp(mol, radius=3)
RDKit Implementation
RDKit offers a ton of functionality for computing a variety of molecular fingerprints, including ECFPs, in just a few lines of code. This “Getting Started in Python” guide is an excellent resource. Here is an example of calculating ECFPs for butyramide using RDKit.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors, AllChem
import numpy as np
def ecfp_from_smiles(smiles, radius=3, nBits=2048):
mol = Chem.MolFromSmiles(smiles)
fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius, nBits=nBits)
# Convert the RDKit explicit bit vector to a Numpy array for easier use
np_fp = np.zeros((1,))
Chem.DataStructs.ConvertToNumpyArray(fp, np_fp)
return np_fp
smiles_string = 'CCCC(=O)N'
fingerprint = ecfp_from_smiles(smiles_string)
RDKit provides comprehensive fingerprint information, which can be obtained through various methods. For example, from the RDKit docs:
“The fingerprint generators can collect information about the atoms/bonds involved in setting bits when a fingerprint is generated. This information is quite useful for understanding which parts of a molecule were involved in each bit.
Each fingerprinting method provides different information, but this is all accessed using the additionalOutput argument to the fingerprinting functions.
Information is available about the atoms that contribute to particular bits in the Morgan fingerprint via the bit info map. This is a dictionary with one entry per bit set in the fingerprint, the keys are the bit ids, the values are lists of (atom index, radius) tuples.”
As an example:
m = Chem.MolFromSmiles('CCCC(=O)N')
fpgen = AllChem.GetMorganGenerator(radius=3)
ao = AllChem.AdditionalOutput()
ao.CollectBitInfoMap()
fp = fpgen.GetSparseCountFingerprint(m,additionalOutput=ao)
print(len(fp.GetNonzeroElements()))
info = ao.GetBitInfoMap()
print(len(info))
Conclusion
Thanks for reading! Next up, we’ll cover graph neural networks for learning molecule fingerprints. So stay tuned!