Property Prediction Datasets
This page describes the molecular property prediction datasets used in our benchmark, extracted from the Polaris Hub. These datasets cover a wide range of ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties and target-specific bioactivity predictions.
Overview
| Dataset | Task Type | Property |
|---|---|---|
| ASAP Discovery - MERS-CoV Mpro | Regression | Antiviral Potency |
| AstraZeneca LogD | Regression | Lipophilicity |
| AstraZeneca PPB Clearance | Regression | Plasma Protein Binding |
| Novartis CYP3A4 | Regression | CYP Inactivation |
| PKIS2 Kinase Inhibition | Classification | Kinase Inhibition |
| Biogen ADME | Regression | Various ADME Properties |
| TDCommons | Mixed | Various ADMET Properties |
Assay description
-
ASAP Discovery - MERS-CoV Mpro
Source: asap-discovery/antiviral-potency-2025-unblinded
Task Type: Regression
Target Property: pIC50 against MERS-CoV main protease
This dataset contains compounds tested for their inhibitory activity against the MERS-CoV main protease (Mpro), essential for viral replication. Main proteases are highly conserved across coronaviruses, making them attractive targets for broad-spectrum antiviral development.
-
AstraZeneca LogD
Source: polaris/az-logd-74-v1
Task Type: Regression
Target Property: Octan-1-ol/water (pH 7.4) distribution coefficient (LogD)
LogD at pH 7.4 measures lipophilicity under physiological conditions, accounting for ionization state. Values of roughly 1 to 3 are commonly associated with good oral bioavailability and CNS penetration.
-
AstraZeneca PPB Clearance
Source: polaris/az-ppb-clearance-v1
Task Type: Regression
Target Property: Log percent of compound unbound to whole human plasma
Plasma protein binding measures how much drug binds to plasma proteins (mainly albumin). Only the unbound fraction is pharmacologically active. High binding affects distribution, clearance, and drug-drug interactions.
-
Novartis CYP3A4
Source: novartis/novartis-cyp3a4-v1
Task Type: Regression
Target Property: Log inactivation rate constant (log k_obs) of CYP3A4
Measures time-dependent inhibition of CYP3A4, the most abundant hepatic cytochrome P450 enzyme, which metabolizes roughly 50% of marketed drugs. CYP3A4 inhibition can cause serious drug-drug interactions.
-
PKIS2 - EGFR
Source: polaris/drewry2017-pkis2-subset-v2
Task Type: Classification
Target Property: Inhibitor of the EGFR kinase
EGFR is a receptor tyrosine kinase involved in cell proliferation. EGFR inhibitors are used in cancer therapy for non-small cell lung cancer and colorectal cancer.
-
PKIS2 - KIT
Source: polaris/drewry2017-pkis2-subset-v2
Task Type: Classification
Target Property: Inhibitor of the KIT kinase
KIT plays a role in cell survival and proliferation. KIT inhibitors treat gastrointestinal stromal tumors (GIST) and certain leukemias.
-
PKIS2 - RET
Source: polaris/drewry2017-pkis2-subset-v2
Task Type: Classification
Target Property: Inhibitor of the RET kinase
RET is involved in cell growth and differentiation. RET inhibitors are approved for RET-fusion positive cancers and medullary thyroid carcinoma.
-
PKIS2 - LOK
Source: polaris/drewry2017-pkis2-subset-v2
Task Type: Classification
Target Property: Inhibitor of the LOK kinase
LOK (STK10) is a serine/threonine kinase involved in lymphocyte migration and immune cell function.
-
PKIS2 - SLK
Source: polaris/drewry2017-pkis2-subset-v2
Task Type: Classification
Target Property: Inhibitor of the SLK kinase
SLK is involved in cell cycle regulation, apoptosis, and cytoskeletal organization.
-
Solubility
Source: biogen/adme-fang-solu-reg-v1
Task Type: Regression
Target Property: Log-solubility
Aqueous solubility affects drug absorption and formulation. Poor solubility is a major cause of drug development failures.
-
Rat Plasma Protein Binding
Source: biogen/adme-fang-rppb-reg-v1
Task Type: Regression
Target Property: Log-rat plasma protein binding rate
Fraction of compound bound to rat plasma proteins, useful for preclinical pharmacokinetic studies.
-
Human Plasma Protein Binding
Source: biogen/adme-fang-hppb-reg-v1
Task Type: Regression
Target Property: Log-human plasma protein binding rate
Fraction bound to human plasma proteins, critical for clinical pharmacokinetic predictions.
-
Permeability (MDR1-MDCK)
Source: biogen/adme-fang-perm-reg-v1
Task Type: Regression
Target Property: Log-MDR1 MDCK efflux ratio
Measures P-glycoprotein mediated efflux. P-gp substrates may have limited brain penetration and variable oral bioavailability.
-
Human Liver Microsomal Stability
Source: biogen/adme-fang-hclint-reg-v1
Task Type: Regression
Target Property: Log-human liver microsomal stability (CLint)
Intrinsic clearance predicts hepatic metabolic stability. Higher values indicate faster metabolism affecting exposure and dosing.
-
Rat Liver Microsomal Stability
Source: biogen/adme-fang-rclint-reg-v1
Task Type: Regression
Target Property: Log-rat liver microsomal stability (CLint)
Intrinsic clearance in rat liver microsomes, used for preclinical species selection and PK prediction.
-
P-glycoprotein Inhibition
Source: tdcommons/pgp-broccatelli
Task Type: Classification
Target Property: Inhibitor of P-glycoprotein (P-gp)
P-gp is an efflux transporter. P-gp inhibitors can enhance brain penetration but may cause drug-drug interactions.
-
Blood-Brain Barrier Penetration
Source: tdcommons/bbb-martins
Task Type: Classification
Target Property: Ability to penetrate the blood-brain barrier (BBB)
Critical for CNS drug development. BBB penetration is necessary for brain-targeting drugs but should be avoided for peripheral drugs.
-
Caco-2 Permeability
Source: tdcommons/caco2-wang
Task Type: Regression
Target Property: Rate of compounds passing through Caco-2 cells
Caco-2 cells model intestinal absorption. Low permeability often correlates with poor oral bioavailability.
-
Volume of Distribution
Source: tdcommons/vdss-lombardo
Task Type: Regression
Target Property: Volume of distribution at steady state (Vdss)
High Vdss indicates extensive tissue distribution; low Vdss suggests the drug remains in plasma. Affects dosing and accumulation.
-
Half-Life
Source: tdcommons/half-life-obach
Task Type: Regression
Target Property: Duration for drug concentration to be reduced by half
Short half-life requires frequent dosing; long half-life may lead to accumulation.
-
Hepatocyte Clearance
Source: tdcommons/clearance-hepatocyte-az
Task Type: Regression
Target Property: Drug clearance measured in hepatocytes
Hepatocyte assays are more physiologically relevant than microsomal assays because they capture both Phase I and Phase II metabolism.
-
Microsome Clearance
Source: tdcommons/clearance-microsome-az
Task Type: Regression
Target Property: Drug clearance measured in microsomes
Primarily reflects CYP-mediated Phase I metabolism.
-
Lipophilicity
Source: tdcommons/lipophilicity-astrazeneca
Task Type: Regression
Target Property: Lipophilicity (LogD)
Affects membrane permeability, protein binding, metabolism, and overall pharmacokinetics.
-
Drug-Induced Liver Injury (DILI)
Source: tdcommons/dili
Task Type: Classification
Target Property: Potential to induce liver injuries
DILI is a major cause of drug withdrawal. Hepatotoxicity is a leading cause of clinical trial failures.
-
hERG Inhibition
Source: tdcommons/herg
Task Type: Classification
Target Property: Blocker of hERG channel
hERG inhibition can cause QT prolongation and fatal cardiac arrhythmias. Screening is mandatory in drug development.
-
Ames Mutagenicity
Source: tdcommons/ames
Task Type: Classification
Target Property: Mutagenic potential (Ames test positive/negative)
Mutagenic compounds are potential carcinogens. A positive Ames test often disqualifies a compound from further development.
-
Acute Toxicity (LD50)
Source: tdcommons/ld50-zhu
Task Type: Regression
Target Property: Acute toxicity LD50 (lethal dose for 50% of test animals)
Provides initial safety assessment and helps establish safe starting doses.
-
CYP2C9 Substrate
Source: tdcommons/cyp2c9-substrate-carbonmangels
Task Type: Classification
Target Property: Substrate of CYP2C9
CYP2C9 metabolizes ~15% of drugs including warfarin and NSAIDs.
-
CYP2D6 Substrate
Source: tdcommons/cyp2d6-substrate-carbonmangels
Task Type: Classification
Target Property: Substrate of CYP2D6
CYP2D6 is highly polymorphic and is involved in the metabolism of ~25% of drugs. Genetic variants give rise to poor, intermediate, extensive, and ultra-rapid metabolizer phenotypes.
-
CYP3A4 Substrate
Source: tdcommons/cyp3a4-substrate-carbonmangels
Task Type: Classification
Target Property: Substrate of CYP3A4
CYP3A4 is the most important drug-metabolizing enzyme, processing ~50% of drugs.
-
Solubility (AqSolDB)
Source: tdcommons/solubility-aqsoldb
Task Type: Regression
Target Property: Aqueous solubility
Data from AqSolDB, one of the largest curated aqueous solubility datasets. Essential for drug absorption and formulation.
Reward Functions
Property prediction tasks use different reward functions depending on whether the task is regression or classification:
-
Regression Reward
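\[R = 1 - \frac{(\hat{y} - y)^2}{\sigma^2}\]
Where \(\hat{y}\) is the predicted value, \(y\) is the ground truth, and \(\sigma\) is the standard deviation of the training labels. Dividing by \(\sigma^2\) normalizes the squared error across properties with different scales. As written, the expression is bounded above by 1 but becomes negative when \(|\hat{y} - y| > \sigma\), so rewards stay in [0, 1] only if the value is clipped at 0.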
Training samples: ~44,000
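The regression reward above can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark's actual implementation: the function name `regression_reward` and the `clip` flag are hypothetical, and clamping at 0 is an assumption made so that the result matches the stated [0, 1] range.

```python
def regression_reward(y_pred: float, y_true: float, sigma: float, clip: bool = True) -> float:
    """Variance-normalized squared-error reward: 1 - (y_pred - y_true)^2 / sigma^2.

    `sigma` is the standard deviation of the training labels.
    `clip` is a hypothetical option: when True, negative rewards are clamped to 0.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    reward = 1.0 - (y_pred - y_true) ** 2 / sigma ** 2
    # Assumption: clamp at 0 so large errors do not produce negative rewards.
    return max(0.0, reward) if clip else reward


# Example usage with made-up values (not taken from any dataset).
exact = regression_reward(2.0, 2.0, sigma=1.0)      # perfect prediction
half_sigma = regression_reward(2.5, 2.0, sigma=1.0)  # error of 0.5 sigma
far_off = regression_reward(4.0, 2.0, sigma=1.0)     # error of 2 sigma, clamped
```

With `clip=False` the last call returns the raw value of the formula, which is negative because the error exceeds one standard deviation.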
-
Classification Reward
\[R = \mathbb{1}[\hat{y} = y]\]
Binary reward: 1 if the predicted label \(\hat{y}\) exactly matches the ground-truth label \(y\), 0 otherwise.
Training samples: ~11,000
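The exact-match rule is simple enough to express directly. The sketch below assumes labels are encoded as integers (0 or 1); the function name `classification_reward` and the integer encoding are illustrative assumptions, not details given by the benchmark.

```python
def classification_reward(pred: int, label: int) -> float:
    """Indicator reward: 1.0 on an exact match with the ground truth, else 0.0."""
    return 1.0 if pred == label else 0.0


# Example usage with made-up labels.
hit = classification_reward(1, 1)
miss = classification_reward(0, 1)
```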
Invalid Predictions
If the model generates an invalid or unparseable prediction, the reward is automatically set to 0.
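One way to apply this rule to a regression task is to parse the model output first and fall back to a reward of 0 when parsing fails. The sketch below is only an assumption about how that could look: the page does not specify the expected output format, so treating the prediction as a bare number, rejecting non-finite values, and the helper names `parse_numeric_prediction` and `reward_or_zero` are all hypothetical.

```python
import math
from typing import Callable, Optional


def parse_numeric_prediction(text: str) -> Optional[float]:
    """Return the prediction as a float, or None if it cannot be parsed.

    Assumption: a valid prediction is a bare finite number, e.g. "3.5".
    """
    try:
        value = float(text.strip())
    except (ValueError, AttributeError):
        return None
    # Treat NaN and infinities as invalid as well (an assumption).
    return value if math.isfinite(value) else None


def reward_or_zero(text: str, score: Callable[[float], float]) -> float:
    """Score a parsed prediction with `score`; invalid output earns 0."""
    value = parse_numeric_prediction(text)
    return 0.0 if value is None else score(value)


# Example usage with a toy scoring function for a made-up target of 3.0.
toy_score = lambda v: 1.0 if abs(v - 3.0) < 0.5 else 0.0
valid = reward_or_zero(" 3.2 ", toy_score)
invalid = reward_or_zero("roughly three", toy_score)
```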
References
- ASAP Discovery Consortium. Antiviral Potency Dataset (2025).
- Drewry, D.H., et al. "Progress towards a public chemogenomic set for protein kinases and a call for contributions." PLOS ONE (2017).
- Lombardo, F., et al. "Trend Analysis of a Database of Intravenous Pharmacokinetic Parameters in Humans for 670 Drug Compounds." Drug Metabolism and Disposition (2008).
- Martins, I.F., et al. "A Bayesian Approach to in Silico Blood-Brain Barrier Penetration Modeling." Journal of Chemical Information and Modeling (2012).
- Therapeutics Data Commons: https://tdcommons.ai/
- Polaris Hub: https://polarishub.io/