Data Access
Overview
MolGenDocking addresses the challenge of de novo molecular generation, with a benchmark designed for Large Language Models (LLMs) and other generative architectures. Our dataset currently supports 3 downstream tasks:
| Dataset | Size | Source | Purpose |
|---|---|---|---|
| De Novo Molecular Generation | ~50k prompts | SAIR and SIU | Generate molecules optimizing a set of up to three properties |
| Molecular Property prediction | ~50k prompts | Polaris | Predict the properties of a molecule (regression + classification) |
| RetroSynthesis Tasks | ~50k reactions | Enamine | Retro-synthesis planning, reactants, products, SMARTS prediction |
Downloading Datasets
Our dataset is available on huggingface.
The three tasks are separated into three compressed folders: molgendata.tar.gz, property_predictions.tar.gz, and synthesis_tasks.tar.gz.
HF datasets are also available, but do not include the metadatas necessary for the reward computation, and are thus only meant for inference.
# Extract tarball datasets
tar -xzf molgendata.tar.gz
tar -xzf property_prediction.tar.gz
tar -xzf synthesis_tasks.tar.gz
Warning
To perform de novo molecular generation with docking constraints, the path to the molgendata folder should be provided to the reward server via the DATA_PATH environment variable. (See Reward Server Configuration for more details.)
Data Organization
molgendata/ # Docking targets
├── docking_targets.json
├── names_mapping.json
├── pockets_info.json
├── pdb_files/
├── train_data/
│ └── train_prompts.jsonl
│ └── train_prompts_boxed.jsonl
├── eval_data/
│ └── eval_prompts_ood_boxed.jsonl
│ └── eval_prompts_ood.jsonl
└── test_data/
└── test_prompts_ood_boxed.jsonl
└── test_prompts_ood.jsonl
Data Format
Our data are stored in JSONL format, represented by a pydantic base model (source):
class Message(BaseModel):
"""Represents a single message in a conversation"""
role: Literal["system", "user", "assistant"]
content: str
meta: Dict[str, Any] = Field(default_factory=dict)
identifier: Optional[str] = None
multimodal_document: Optional[Dict[str, Any]] = None
class Conversation(BaseModel):
"""Complete conversation structure"""
meta: Dict[
str, Any
]
messages: List[Message]
system_prompt: Optional[str] = None
available_tools: Optional[List[str]] = None
truncate_at_max_tokens: Optional[int] = None
truncate_at_max_image_tokens: Optional[int] = None
output_modalities: Optional[List[str]] = None
identifier: str
references: List[Any] = Field(default_factory=list)
rating: Optional[float] = None
source: Optional[str] = None
training_masks_strategy: str
custom_training_masks: Optional[Dict[str, Any]] = None
class Sample(BaseModel):
"""Root model containing all conversations"""
identifier: str
conversations: List[Conversation]
trajectories: List[Any] = Field(default_factory=list)
meta: Dict[str, Any] = Field(default_factory=dict)
source: Optional[str] = None
JSONL Format
Each line in a JSONL file represents a Sample object with the following structure:
{
"identifier": "sample_001",
"conversations": [
{
"meta": {
"properties": ["QED", ...],
"objectives": ["above", ...],
"target": [0.5, ...],
... # Additional metadata such as pocket box, target information, etc.
},
"messages": [
{
"role": "system",
"content": "You are a molecular generation assistant...",
"meta": {}
},
{
"role": "user",
"content": "Generate a molecule that binds to GSK3B with high affinity",
"meta": {},
"identifier": "msg_001"
}
]
}
]
}
{
"identifier": "id0",
"conversations": [
{
"meta": {
"properties": ["biogen/adme-fang-hclint-reg-v1"],
"objectives": ["regression"],
"target": [0.67],
"smiles": ["Brc1ccc(-c2nnc(Cn3cnc4ccccc43)o2)o1"],
"norm_var": 0.62,
... # Additional metadata
},
"messages": [
{
"role": "system",
"content": "You are a property prediction assistant...",
},
{
"role": "user",
"content": "Predict the clearance value for the molecule with SMILES: Brc1ccc(-c2nnc(Cn3cnc4ccccc43)o2)o1",
}
]
}
]
}
{
"identifier": "id0",
"conversations": [
{
"meta": {
"properties": ["full_path_bb_ref"],
"objectives": ["full_path_bb_ref"],
"target": ["CCCC[C@H](NC(=O)[C@H](CCCCN)NC(=O)[C@H](CCCNC(=N)N)NC(=O)c1ccc(/C=C2\\SC(=O)N(c3ccc(C)cc3)C2=O)cc1)C(N)=O"],
... # Additional metadata
},
"messages": [
{
"role": "system",
"content": "You are a retrosynthesis planning assistant...",
},
{
"role": "user",
"content": "Given the product molecule with SMILES: CC(=O)OC1=CC=CC=C1C(=O)O, provide the reactants used in its synthesis.",
}
]
}
]
}