Data Models
Conversation
Bases: BaseModel
Complete conversation structure with messages and metadata.
This model represents a full conversation between a user and assistant, including all messages, system configuration, and evaluation metrics. The meta field should contain all information necessary to compute rewards, ground truth values, and other evaluation metrics.
Attributes:
| Name | Type | Description |
|---|---|---|
meta |
Dict[str, Any]
|
Metadata dictionary containing evaluation and reward computation data. |
messages |
List[Message]
|
List of Message objects representing conversation turns. |
system_prompt |
Optional[str]
|
Optional system prompt that defines assistant behavior. |
available_tools |
Optional[List[str]]
|
Optional list of tool names available to the assistant (e.g., 'calculator', 'web_search', 'code_interpreter'). |
truncate_at_max_tokens |
Optional[int]
|
Optional maximum token limit for conversation. |
truncate_at_max_image_tokens |
Optional[int]
|
Optional maximum image token limit. |
output_modalities |
Optional[List[str]]
|
Optional list of output types the model can produce (e.g., 'text', 'image', 'audio'). |
identifier |
str
|
Unique identifier for this conversation. |
references |
List[Any]
|
Optional list of reference materials, ground truth data, or expected outputs for evaluation. |
rating |
Optional[float]
|
Optional quality rating or score (typically 0.0 to 1.0). |
source |
Optional[str]
|
Optional source or origin (e.g., 'human-generated', 'synthetic'). |
training_masks_strategy |
str
|
Strategy for applying attention masks during training (e.g., 'full', 'partial', 'causal'). |
custom_training_masks |
Optional[Dict[str, Any]]
|
Optional custom mask configuration for advanced masking strategies. |
Example
Source code in mol_gen_docking/data/pydantic_dataset.py
Message
Bases: BaseModel
Represents a single message in a conversation.
A message is a single turn in a multi-turn conversation. It contains the speaker's role, the message content, and optional metadata.
Attributes:
| Name | Type | Description |
|---|---|---|
role |
Literal['system', 'user', 'assistant']
|
The role of the message sender (system, user, or assistant). |
content |
str
|
The text content of the message. |
meta |
Dict[str, Any]
|
Optional metadata dictionary (e.g., timestamps, token counts). |
identifier |
Optional[str]
|
Optional unique identifier for this message. |
multimodal_document |
Optional[Dict[str, Any]]
|
Optional dictionary containing multimodal content like images, audio, or file attachments. |
Example
Source code in mol_gen_docking/data/pydantic_dataset.py
Sample
Bases: BaseModel
Root model containing all conversations for a single sample.
A sample is the top-level container that groups one or more related conversations together with shared metadata and optional trajectories. This is typically the unit used for dataset serialization.
Attributes:
| Name | Type | Description |
|---|---|---|
identifier |
str
|
Unique identifier for this sample. |
conversations |
List[Conversation]
|
List of Conversation objects in this sample. |
trajectories |
List[Any]
|
Optional list of trajectories or execution paths for this sample (e.g., for reasoning or planning tasks). |
meta |
Dict[str, Any]
|
Optional metadata about the sample (difficulty, domain, etc.). |
source |
Optional[str]
|
Optional source or origin (e.g., 'benchmark', 'user-submitted'). |
Example
Source code in mol_gen_docking/data/pydantic_dataset.py
read_jsonl(input_file)
Read Sample objects from a JSONL file.
Each line in the file should be a valid JSON object representing a Sample. Lines are parsed sequentially into Sample instances.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_file
|
Path
|
Path to the input JSONL file. |
required |
Returns:
| Type | Description |
|---|---|
list[Sample]
|
List of Sample objects parsed from the file. |
Raises:
| Type | Description |
|---|---|
AssertionError
|
If the input file does not exist. |
Example
Source code in mol_gen_docking/data/pydantic_dataset.py
write_jsonl(output_file, samples)
Write Sample objects to a JSONL file.
Each sample is serialized to JSON and written on a separate line. Parent directories are created automatically if they don't exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
output_file
|
Path
|
Path to the output JSONL file. |
required |
samples
|
list[Sample]
|
List of Sample objects to write. |
required |