# Object Type Definitions

## Core Object Types
This section defines the data structures used by the Supervised AI Testing API. These objects represent the fundamental entities you will interact with when managing test suites, executing evaluations, and analyzing model performance.
### TestSet
A TestSet is a logical container for a collection of test cases. It is used to organize testing data by version, task type (e.g., summarization, classification), or specific model requirements.
| Property | Type | Description |
| :--- | :--- | :--- |
| id | string | The unique identifier for the TestSet. |
| name | string | A human-readable name for the test suite. |
| description | string | A brief explanation of the TestSet's purpose. |
| version | string | Semantic versioning or custom version string (e.g., v1.0.2). |
| created_at | string | ISO 8601 timestamp of when the set was created. |
Example Usage:
```json
{
  "id": "ts_88219",
  "name": "LLM Hallucination Benchmark",
  "description": "Tests for factual consistency in RAG pipelines.",
  "version": "1.2.0",
  "created_at": "2023-11-15T10:30:00Z"
}
```
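As an informal sketch, the schema above can be mirrored client-side with a Python dataclass. The `TestSet` class and the parse-from-JSON approach below are illustrative assumptions, not part of any official SDK:

```python
import json
from dataclasses import dataclass

@dataclass
class TestSet:
    # Field names mirror the property table above; types are
    # client-side approximations of the wire format.
    id: str
    name: str
    description: str
    version: str
    created_at: str  # ISO 8601 timestamp, kept as a string here

payload = """
{
  "id": "ts_88219",
  "name": "LLM Hallucination Benchmark",
  "description": "Tests for factual consistency in RAG pipelines.",
  "version": "1.2.0",
  "created_at": "2023-11-15T10:30:00Z"
}
"""

# Unpack the decoded JSON object directly into the dataclass.
ts = TestSet(**json.loads(payload))
```

Keeping `created_at` as a string avoids timezone-parsing surprises; convert to a `datetime` only where you actually need date arithmetic.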
### TestCase
A TestCase is the smallest unit of data within a TestSet. It holds the inputs supplied to the model and, where applicable, the ground truth used for evaluation.
| Property | Type | Description |
| :--- | :--- | :--- |
| id | string | The unique identifier for the test case. |
| inputs | object | Key-value pairs representing model prompt variables or parameters. |
| expected_output | string | (Optional) The gold-standard response for comparison. |
| metadata | object | Arbitrary key-value pairs for filtering (e.g., difficulty: "high"). |
Example Usage:
```json
{
  "id": "tc_001",
  "inputs": {
    "prompt": "What is the capital of France?",
    "context": "Geography quiz data."
  },
  "expected_output": "The capital of France is Paris.",
  "metadata": {
    "category": "factual_recall"
  }
}
```
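Since `expected_output` and `metadata` are optional while `id` and `inputs` are required, a light validation pass before upload can catch malformed cases early. The helper below is a hypothetical sketch against the table above, not an API-provided function:

```python
def validate_test_case(case: dict) -> list:
    """Return a list of schema problems; an empty list means the case looks valid."""
    errors = []
    # Required fields.
    if not isinstance(case.get("id"), str):
        errors.append("id must be a string")
    if not isinstance(case.get("inputs"), dict):
        errors.append("inputs must be an object")
    # Optional fields are only checked when present.
    if "expected_output" in case and not isinstance(case["expected_output"], str):
        errors.append("expected_output, if present, must be a string")
    if "metadata" in case and not isinstance(case["metadata"], dict):
        errors.append("metadata, if present, must be an object")
    return errors

sample = {
    "id": "tc_001",
    "inputs": {"prompt": "What is the capital of France?"},
    "metadata": {"category": "factual_recall"},
}
```

For production use, a declarative JSON Schema would be a more maintainable way to express the same checks.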
### EvaluationRun
An EvaluationRun represents a specific execution instance where a model's outputs are generated and scored against a TestSet.
| Property | Type | Description |
| :--- | :--- | :--- |
| run_id | string | Unique identifier for the execution instance. |
| model_id | string | The identifier of the model being tested. |
| status | enum | The current state: PENDING, RUNNING, COMPLETED, FAILED. |
| summary | object | Aggregated scores (e.g., mean accuracy, average latency). |
Example Usage:
```json
{
  "run_id": "run_9942",
  "model_id": "gpt-4-turbo",
  "status": "COMPLETED",
  "summary": {
    "accuracy": 0.94,
    "avg_latency_ms": 450
  }
}
```
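Because a run moves through PENDING and RUNNING before reaching a terminal state, clients typically poll until it is COMPLETED or FAILED. The loop below is a generic sketch; `fetch_run` stands in for whatever call your client uses to retrieve an EvaluationRun by id:

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED"}

def wait_for_run(fetch_run, run_id: str, poll_seconds: float = 2.0,
                 timeout: float = 600.0) -> dict:
    """Poll fetch_run(run_id) until the run reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = fetch_run(run_id)
        if run["status"] in TERMINAL_STATES:
            return run
        time.sleep(poll_seconds)
    raise TimeoutError("run %s did not finish within %ss" % (run_id, timeout))
```

A fixed poll interval is the simplest choice; for long-running evaluations, exponential backoff reduces needless requests.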
### MetricResult
A MetricResult provides the specific outcome of a single evaluation metric applied to a single TestCase output.
| Property | Type | Description |
| :--- | :--- | :--- |
| metric_name | string | The name of the metric (e.g., BLEU, ROUGE, Toxicity). |
| score | float | The numerical result of the metric evaluation. |
| reasoning | string | (Optional) Explanation of the score when produced by an LLM-based evaluator. |
| status | string | Typically PASS, FAIL, or WARN based on defined thresholds. |
Example Usage:
```json
{
  "metric_name": "exact_match",
  "score": 1.0,
  "reasoning": "The model output matched the expected output string perfectly.",
  "status": "PASS"
}
```
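One common way to derive the PASS/WARN/FAIL status is from two score thresholds. The exact thresholding rules are configuration-dependent; the function below is only an illustrative assumption, with hypothetical threshold names:

```python
def metric_status(score: float, fail_below: float, warn_below: float) -> str:
    """Map a metric score to PASS, WARN, or FAIL using two cutoffs.

    Scores below `fail_below` fail outright; scores between the two
    cutoffs warn; everything at or above `warn_below` passes.
    """
    if score < fail_below:
        return "FAIL"
    if score < warn_below:
        return "WARN"
    return "PASS"

# Example: exact_match scored 1.0 with cutoffs of 0.5 (fail) and 0.8 (warn).
status = metric_status(1.0, fail_below=0.5, warn_below=0.8)
```

This assumes higher scores are better; metrics like Toxicity, where lower is better, would invert the comparisons.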
### ModelOutput
This object captures the raw response from the target model before it is processed by the evaluation engine.
| Property | Type | Description |
| :--- | :--- | :--- |
| raw_response | string | The literal text or JSON returned by the model. |
| latency | float | Time taken in seconds for the model to respond. |
| tokens_used | integer | Total count of prompt and completion tokens. |
Example Usage:
```json
{
  "raw_response": "Paris is the capital.",
  "latency": 0.82,
  "tokens_used": 15
}
```
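A ModelOutput can be assembled by timing the model call itself. The wrapper below is a sketch; the whitespace-based token count is a crude stand-in for the provider's real tokenizer, which is what the API's `tokens_used` field would actually reflect:

```python
import time

def capture_output(call_model, prompt: str) -> dict:
    """Invoke a model callable and package its response as a ModelOutput dict."""
    start = time.perf_counter()
    text = call_model(prompt)
    latency = time.perf_counter() - start  # seconds, per the latency field above
    return {
        "raw_response": text,
        "latency": round(latency, 3),
        # Whitespace word count as a rough proxy; real token counts
        # come from the model provider's tokenizer.
        "tokens_used": len(prompt.split()) + len(text.split()),
    }
```

In practice you would also want to record failures (timeouts, refusals) so the evaluation engine can score them rather than silently dropping them.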