Data Models

Conversation

Bases: BaseModel

Complete conversation structure with messages and metadata.

This model represents a full conversation between a user and assistant, including all messages, system configuration, and evaluation metrics. The meta field should contain all information necessary to compute rewards, ground truth values, and other evaluation metrics.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `meta` | `Dict[str, Any]` | Metadata dictionary containing evaluation and reward computation data. |
| `messages` | `List[Message]` | List of `Message` objects representing conversation turns. |
| `system_prompt` | `Optional[str]` | Optional system prompt that defines assistant behavior. |
| `available_tools` | `Optional[List[str]]` | Optional list of tool names available to the assistant (e.g., 'calculator', 'web_search', 'code_interpreter'). |
| `truncate_at_max_tokens` | `Optional[int]` | Optional maximum token limit for the conversation. |
| `truncate_at_max_image_tokens` | `Optional[int]` | Optional maximum image token limit. |
| `output_modalities` | `Optional[List[str]]` | Optional list of output types the model can produce (e.g., 'text', 'image', 'audio'). |
| `identifier` | `str` | Unique identifier for this conversation. |
| `references` | `List[Any]` | Optional list of reference materials, ground truth data, or expected outputs for evaluation. |
| `rating` | `Optional[float]` | Optional quality rating or score (typically 0.0 to 1.0). |
| `source` | `Optional[str]` | Optional source or origin (e.g., 'human-generated', 'synthetic'). |
| `training_masks_strategy` | `str` | Strategy for applying attention masks during training (e.g., 'full', 'partial', 'causal'). |
| `custom_training_masks` | `Optional[Dict[str, Any]]` | Optional custom mask configuration for advanced masking strategies. |

Example
```python
conversation = Conversation(
    meta={"task": "qa", "difficulty": "hard"},
    messages=[message1, message2],
    identifier="conv_001",
    training_masks_strategy="causal",
    source="human-generated"
)
```
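Because `Conversation` is a Pydantic model, the required fields (`meta`, `messages`, `identifier`, `training_masks_strategy`) are enforced at construction time. A minimal sketch of that behavior, assuming the models are importable from `mol_gen_docking.data.pydantic_dataset`:

```python
from pydantic import ValidationError

from mol_gen_docking.data.pydantic_dataset import Conversation

# meta, messages, identifier, and training_masks_strategy are declared
# with Field(...), so omitting any of them fails validation.
try:
    Conversation(messages=[], identifier="conv_002")
except ValidationError as exc:
    print(exc.error_count(), "validation error(s)")  # one per missing field
```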
Source code in mol_gen_docking/data/pydantic_dataset.py
````python
class Conversation(BaseModel):
    """Complete conversation structure with messages and metadata.

    This model represents a full conversation between a user and assistant,
    including all messages, system configuration, and evaluation metrics.
    The meta field should contain all information necessary to compute
    rewards, ground truth values, and other evaluation metrics.

    Attributes:
        meta: Metadata dictionary containing evaluation and reward computation data.
        messages: List of Message objects representing conversation turns.
        system_prompt: Optional system prompt that defines assistant behavior.
        available_tools: Optional list of tool names available to the assistant
            (e.g., 'calculator', 'web_search', 'code_interpreter').
        truncate_at_max_tokens: Optional maximum token limit for conversation.
        truncate_at_max_image_tokens: Optional maximum image token limit.
        output_modalities: Optional list of output types the model can produce
            (e.g., 'text', 'image', 'audio').
        identifier: Unique identifier for this conversation.
        references: Optional list of reference materials, ground truth data,
            or expected outputs for evaluation.
        rating: Optional quality rating or score (typically 0.0 to 1.0).
        source: Optional source or origin (e.g., 'human-generated', 'synthetic').
        training_masks_strategy: Strategy for applying attention masks during
            training (e.g., 'full', 'partial', 'causal').
        custom_training_masks: Optional custom mask configuration for
            advanced masking strategies.

    Example:
        ```python
        conversation = Conversation(
            meta={"task": "qa", "difficulty": "hard"},
            messages=[message1, message2],
            identifier="conv_001",
            training_masks_strategy="causal",
            source="human-generated"
        )
        ```
    """

    meta: Dict[str, Any] = Field(
        ..., description="Metadata for reward computation and evaluation (required)"
    )
    messages: List[Message] = Field(
        ..., description="List of messages in conversation order"
    )
    system_prompt: Optional[str] = Field(
        None, description="System prompt defining assistant behavior and constraints"
    )
    available_tools: Optional[List[str]] = Field(
        None, description="List of tools available for the assistant to use"
    )
    truncate_at_max_tokens: Optional[int] = Field(
        None, description="Maximum token count before truncation"
    )
    truncate_at_max_image_tokens: Optional[int] = Field(
        None, description="Maximum image token count before truncation"
    )
    output_modalities: Optional[List[str]] = Field(
        None, description="Supported output modalities (text, image, audio, etc.)"
    )
    identifier: str = Field(..., description="Unique conversation identifier")
    references: List[Any] = Field(
        default_factory=list,
        description="Reference data for evaluation and ground truth",
    )
    rating: Optional[float] = Field(
        None, description="Quality rating (typically 0.0-1.0 scale)"
    )
    source: Optional[str] = Field(
        None, description="Source or origin of the conversation"
    )
    training_masks_strategy: str = Field(
        ..., description="Attention mask strategy during training"
    )
    custom_training_masks: Optional[Dict[str, Any]] = Field(
        None, description="Custom mask configuration"
    )
````

Message

Bases: BaseModel

Represents a single message in a conversation.

A message is a single turn in a multi-turn conversation. It contains the speaker's role, the message content, and optional metadata.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `role` | `Literal['system', 'user', 'assistant']` | The role of the message sender (system, user, or assistant). |
| `content` | `str` | The text content of the message. |
| `meta` | `Dict[str, Any]` | Optional metadata dictionary (e.g., timestamps, token counts). |
| `identifier` | `Optional[str]` | Optional unique identifier for this message. |
| `multimodal_document` | `Optional[Dict[str, Any]]` | Optional dictionary containing multimodal content such as images, audio, or file attachments. |

Example
```python
message = Message(
    role="user",
    content="What is the capital of France?",
    meta={"tokens": 10}
)
```
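Because `role` is typed as a `Literal`, Pydantic rejects any value outside `system`, `user`, and `assistant`. A minimal sketch, assuming the same import path as above:

```python
from pydantic import ValidationError

from mol_gen_docking.data.pydantic_dataset import Message

# role is Literal["system", "user", "assistant"]; other values fail validation.
try:
    Message(role="tool", content="42")
except ValidationError:
    print("'tool' is not an accepted role")
```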
Source code in mol_gen_docking/data/pydantic_dataset.py
````python
class Message(BaseModel):
    """Represents a single message in a conversation.

    A message is a single turn in a multi-turn conversation. It contains
    the speaker's role, the message content, and optional metadata.

    Attributes:
        role: The role of the message sender (system, user, or assistant).
        content: The text content of the message.
        meta: Optional metadata dictionary (e.g., timestamps, token counts).
        identifier: Optional unique identifier for this message.
        multimodal_document: Optional dictionary containing multimodal content
            like images, audio, or file attachments.

    Example:
        ```python
        message = Message(
            role="user",
            content="What is the capital of France?",
            meta={"tokens": 10}
        )
        ```
    """

    role: Literal["system", "user", "assistant"] = Field(
        ..., description="The role of the sender: 'system', 'user', or 'assistant'"
    )
    content: str = Field(..., description="The text content of the message")
    meta: Dict[str, Any] = Field(
        default_factory=dict,
        description="Optional metadata (timestamps, token counts, flags, etc.)",
    )
    identifier: Optional[str] = Field(
        None, description="Optional unique identifier for the message"
    )
    multimodal_document: Optional[Dict[str, Any]] = Field(
        None, description="Optional multimodal content (images, audio, files, etc.)"
    )
````
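Note that `multimodal_document` is an unconstrained `Dict[str, Any]`, so the model does not prescribe its schema. The keys in the sketch below are purely hypothetical, chosen only to illustrate attaching an image reference:

```python
# The structure of multimodal_document is not validated beyond being a dict;
# "type" and "path" here are illustrative, not part of the model's contract.
message = Message(
    role="user",
    content="What does this ligand look like?",
    multimodal_document={"type": "image", "path": "ligand.png"},
)
```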

Sample

Bases: BaseModel

Root model containing all conversations for a single sample.

A sample is the top-level container that groups one or more related conversations together with shared metadata and optional trajectories. This is typically the unit used for dataset serialization.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `identifier` | `str` | Unique identifier for this sample. |
| `conversations` | `List[Conversation]` | List of `Conversation` objects in this sample. |
| `trajectories` | `List[Any]` | Optional list of trajectories or execution paths for this sample (e.g., for reasoning or planning tasks). |
| `meta` | `Dict[str, Any]` | Optional metadata about the sample (difficulty, domain, etc.). |
| `source` | `Optional[str]` | Optional source or origin (e.g., 'benchmark', 'user-submitted'). |

Example
```python
sample = Sample(
    identifier="sample_001",
    conversations=[conversation1, conversation2],
    meta={"domain": "chemistry", "difficulty": "expert"},
    source="benchmark"
)
```
Source code in mol_gen_docking/data/pydantic_dataset.py
````python
class Sample(BaseModel):
    """Root model containing all conversations for a single sample.

    A sample is the top-level container that groups one or more related
    conversations together with shared metadata and optional trajectories.
    This is typically the unit used for dataset serialization.

    Attributes:
        identifier: Unique identifier for this sample.
        conversations: List of Conversation objects in this sample.
        trajectories: Optional list of trajectories or execution paths
            for this sample (e.g., for reasoning or planning tasks).
        meta: Optional metadata about the sample (difficulty, domain, etc.).
        source: Optional source or origin (e.g., 'benchmark', 'user-submitted').

    Example:
        ```python
        sample = Sample(
            identifier="sample_001",
            conversations=[conversation1, conversation2],
            meta={"domain": "chemistry", "difficulty": "expert"},
            source="benchmark"
        )
        ```
    """

    identifier: str = Field(..., description="Unique sample identifier")
    conversations: List[Conversation] = Field(
        ..., description="List of conversations in this sample"
    )
    trajectories: List[Any] = Field(
        default_factory=list, description="Optional trajectories or execution paths"
    )
    meta: Dict[str, Any] = Field(
        default_factory=dict,
        description="Sample metadata (difficulty, domain, annotations, etc.)",
    )
    source: Optional[str] = Field(None, description="Source or origin of the sample")
````
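The three models nest naturally: `Message` objects make up a `Conversation`, and conversations are grouped into a `Sample`, which is the unit that gets serialized. A minimal end-to-end sketch, again assuming the import path used above:

```python
from mol_gen_docking.data.pydantic_dataset import Conversation, Message, Sample

# Build the hierarchy bottom-up: messages -> conversation -> sample.
messages = [
    Message(role="user", content="What is the capital of France?"),
    Message(role="assistant", content="Paris."),
]
conversation = Conversation(
    meta={"task": "qa", "answer": "Paris"},  # carries what reward computation needs
    messages=messages,
    identifier="conv_001",
    training_masks_strategy="full",
)
sample = Sample(identifier="sample_001", conversations=[conversation])

# model_dump() recursively converts nested models to plain dicts,
# which is what write_jsonl serializes.
print(sample.model_dump()["conversations"][0]["identifier"])  # conv_001
```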

read_jsonl(input_file)

Read Sample objects from a JSONL file.

Each line in the file should be a valid JSON object representing a Sample. Lines are parsed sequentially into Sample instances.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_file` | `Path` | Path to the input JSONL file. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `list[Sample]` | List of `Sample` objects parsed from the file. |

Raises:

| Type | Description |
| --- | --- |
| `AssertionError` | If the input file does not exist. |

Example
```python
from pathlib import Path
samples = read_jsonl(Path("data/input.jsonl"))
for sample in samples:
    print(sample.identifier)
```
Source code in mol_gen_docking/data/pydantic_dataset.py
````python
def read_jsonl(input_file: Path) -> list[Sample]:
    """Read Sample objects from a JSONL file.

    Each line in the file should be a valid JSON object representing
    a Sample. Lines are parsed sequentially into Sample instances.

    Args:
        input_file: Path to the input JSONL file.

    Returns:
        List of Sample objects parsed from the file.

    Raises:
        AssertionError: If the input file does not exist.

    Example:
        ```python
        from pathlib import Path
        samples = read_jsonl(Path("data/input.jsonl"))
        for sample in samples:
            print(sample.identifier)
        ```
    """
    assert input_file.exists(), input_file

    samples: list[Sample] = []
    with open(input_file) as fin:
        for line in fin:
            samples.append(Sample(**json.loads(line)))

    return samples
````
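Note that `read_jsonl` parses lines eagerly and raises on the first malformed record. If skipping bad lines is acceptable, a tolerant variant is straightforward; the sketch below is not part of the module and assumes Pydantic v2 (consistent with the `model_dump` call in `write_jsonl`):

```python
from pathlib import Path

from pydantic import ValidationError


def read_jsonl_tolerant(input_file: Path) -> list[Sample]:
    """Like read_jsonl, but skips blank or invalid lines instead of raising."""
    samples: list[Sample] = []
    with open(input_file) as fin:
        for lineno, line in enumerate(fin, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                # model_validate_json parses and validates in one step.
                samples.append(Sample.model_validate_json(line))
            except ValidationError:
                print(f"Skipping invalid sample on line {lineno}")
    return samples
```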

write_jsonl(output_file, samples)

Write Sample objects to a JSONL file.

Each sample is serialized to JSON and written on a separate line. Parent directories are created automatically if they don't exist.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `output_file` | `Path` | Path to the output JSONL file. | *required* |
| `samples` | `list[Sample]` | List of `Sample` objects to write. | *required* |
Example
```python
from pathlib import Path
samples = [sample1, sample2, sample3]
write_jsonl(Path("data/output.jsonl"), samples)
```
Source code in mol_gen_docking/data/pydantic_dataset.py
````python
def write_jsonl(output_file: Path, samples: list[Sample]) -> None:
    """Write Sample objects to a JSONL file.

    Each sample is serialized to JSON and written on a separate line.
    Parent directories are created automatically if they don't exist.

    Args:
        output_file: Path to the output JSONL file.
        samples: List of Sample objects to write.

    Example:
        ```python
        from pathlib import Path
        samples = [sample1, sample2, sample3]
        write_jsonl(Path("data/output.jsonl"), samples)
        ```
    """
    output_file.parent.mkdir(exist_ok=True, parents=True)

    with open(output_file, "w") as fout:
        for sample in samples:
            fout.write(json.dumps(sample.model_dump(), ensure_ascii=False) + "\n")
````
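`write_jsonl` and `read_jsonl` together form a round trip for `Sample` objects. A quick sanity-check sketch, reusing the `sample` built in the `Sample` example above:

```python
import tempfile
from pathlib import Path

# Write a sample out and read it back; field values should survive intact.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip.jsonl"
    write_jsonl(path, [sample])
    restored = read_jsonl(path)
    assert restored[0].identifier == sample.identifier
    assert restored[0].conversations[0].messages[1].content == "Paris."
```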