Core Entities
This section defines the fundamental data structures used within the Supervised AI testing-api. Understanding these entities is essential for configuring test suites, interpreting evaluation results, and integrating with the platform's automated testing workflows.
TestCase
The TestCase is the atomic unit of testing. It represents a single input scenario and its corresponding expected behavior or criteria.
| Field | Type | Description |
| :--- | :--- | :--- |
| id | string | A unique identifier for the test case. |
| input | object | The data payload sent to the AI model (e.g., prompt, images, or parameters). |
| expected_output | any | (Optional) The ground truth or reference value to compare against. |
| metadata | object | Key-value pairs used for filtering or categorization (e.g., priority: "high"). |
Usage Example:

```json
{
  "id": "tc_001",
  "input": {
    "prompt": "Summarize the following text..."
  },
  "expected_output": "The text discusses the rise of...",
  "metadata": {
    "feature_area": "summarization",
    "version": "1.2"
  }
}
```
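The same structure can be sketched as a client-side Python dataclass. This is illustrative only: the field names mirror the table above, but the class itself is an assumption, not part of the official SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class TestCase:
    """Client-side sketch of a TestCase, mirroring the field table above."""
    id: str
    input: dict                             # payload sent to the model (prompt, images, params)
    expected_output: Optional[Any] = None   # ground truth or reference value, if any
    metadata: dict = field(default_factory=dict)  # e.g. {"priority": "high"}

# Hypothetical construction matching the JSON example:
tc = TestCase(
    id="tc_001",
    input={"prompt": "Summarize the following text..."},
    expected_output="The text discusses the rise of...",
    metadata={"feature_area": "summarization", "version": "1.2"},
)
```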
Metric
A Metric defines how the performance of the AI model is measured for a given test execution. Metrics can be quantitative (e.g., latency) or qualitative (e.g., toxicity score).
| Field | Type | Description |
| :--- | :--- | :--- |
| name | string | The human-readable name of the metric (e.g., accuracy). |
| value | float/int | The calculated score for the metric. |
| threshold | float | The minimum or maximum acceptable value for a PASS status. |
| unit | string | The unit of measurement (e.g., ms, percent). |
Usage Example:

```python
metric = Metric(
    name="response_latency",
    value=150.5,
    threshold=200.0,
    unit="ms"
)
```
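Whether a threshold acts as a floor or a ceiling depends on the metric: accuracy-style scores must meet or exceed it, while latency must stay under it. A minimal sketch of that check, using a hypothetical `higher_is_better` flag that is not a field of the Metric entity itself:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Metric:
    name: str
    value: float
    threshold: Optional[float] = None
    unit: Optional[str] = None

def passes(metric: Metric, higher_is_better: bool = True) -> bool:
    """Return True if the metric meets its threshold.

    `higher_is_better` is an illustrative flag: accuracy-style metrics must
    meet or exceed the threshold; latency-style metrics must stay at or
    below it.
    """
    if metric.threshold is None:
        return True  # no threshold configured: nothing to fail against
    if higher_is_better:
        return metric.value >= metric.threshold
    return metric.value <= metric.threshold

latency = Metric(name="response_latency", value=150.5, threshold=200.0, unit="ms")
print(passes(latency, higher_is_better=False))  # 150.5 ms <= 200.0 ms -> True
```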
TestResult
The TestResult captures the outcome of an individual TestCase execution. It aggregates the raw output from the AI model and the calculated metrics.
| Field | Type | Description |
| :--- | :--- | :--- |
| test_case_id | string | Reference to the original TestCase. |
| actual_output | any | The raw response generated by the system under test. |
| metrics | List[Metric] | A collection of metric objects evaluated for this specific result. |
| status | string | The final state: PASS, FAIL, or ERROR. |
Usage Example:

```json
{
  "test_case_id": "tc_001",
  "actual_output": "The provided text is summarized as...",
  "status": "PASS",
  "metrics": [
    { "name": "semantic_similarity", "value": 0.92 }
  ]
}
```
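The status field can be understood as a function of the metrics and any execution error. The sketch below shows one plausible derivation, not the platform's actual logic, and assumes higher metric values are better:

```python
def derive_status(metrics: list, error: bool = False) -> str:
    """Illustrative derivation of a TestResult status:
    ERROR if execution failed, FAIL if any metric with a threshold falls
    short of it, PASS otherwise. Assumes higher values are better."""
    if error:
        return "ERROR"
    for m in metrics:
        threshold = m.get("threshold")
        if threshold is not None and m["value"] < threshold:
            return "FAIL"
    return "PASS"

metrics = [{"name": "semantic_similarity", "value": 0.92, "threshold": 0.85}]
print(derive_status(metrics))  # PASS
```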
EvaluationRun
An EvaluationRun represents a complete execution of a test suite against a specific version of an AI model. It acts as a container for multiple TestResults.
| Field | Type | Description |
| :--- | :--- | :--- |
| run_id | string | Unique identifier for the execution session. |
| model_config | object | Configuration of the model tested (e.g., temperature, model_name). |
| results | List[TestResult] | The list of all results generated during the run. |
| summary | object | Aggregated statistics (e.g., total pass/fail count). |
Usage Example:

```python
# Initiating a run via the API
run = testing_client.create_run(
    model_id="gpt-4-summarizer",
    dataset_id="prod_eval_set_v1",
    config={"temperature": 0.7}
)
```
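The summary object aggregates per-result statuses. A minimal sketch of that aggregation (illustrative only; the platform computes its own summary, and the exact key names here are assumptions):

```python
from collections import Counter

def summarize(results: list) -> dict:
    """Aggregate TestResult statuses into run-level statistics."""
    counts = Counter(r["status"] for r in results)
    total = len(results)
    return {
        "total": total,
        "passed": counts.get("PASS", 0),
        "failed": counts.get("FAIL", 0),
        "errors": counts.get("ERROR", 0),
        "pass_rate": counts.get("PASS", 0) / total if total else 0.0,
    }

results = [
    {"test_case_id": "tc_001", "status": "PASS"},
    {"test_case_id": "tc_002", "status": "FAIL"},
]
print(summarize(results))  # {'total': 2, 'passed': 1, 'failed': 1, ...}
```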
Dataset Reference (Internal)
While datasets are managed via the platform UI or Storage API, the testing-api uses a DatasetReference to link test cases to a specific run.
Note: This is an internal pointer used by the API to fetch batches of TestCases during execution and is typically not modified by the user directly.