System Architecture
System Overview
The testing-api serves as the centralized orchestration layer for model evaluation and data validation within the Supervised AI platform. It acts as the bridge between the core platform services (where datasets and models are managed) and the distributed execution environments where testing logic is applied.
By standardizing the communication protocol for testing, this API ensures that diverse AI models—ranging from LLMs to computer vision systems—can be evaluated against a unified set of performance metrics and safety benchmarks.
Integration Architecture
The testing-api is designed as a stateless service that interacts with three primary architectural tiers:
- Platform Core (Client): Initiates testing jobs, manages versioning of models/datasets, and consumes the final evaluation reports.
- Execution Engines: Specialized workers or ephemeral containers that run the actual test scripts and validation logic.
- Data Lake/Storage: Provides the source data (ground truth) and receives the telemetry/artifacts generated during the testing process.
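The handoff between these tiers can be sketched in a few lines. This is an illustrative model only: the `TestJob` type, function names, and statuses shown are invented for this example, not part of the actual service.

```python
from dataclasses import dataclass, field

# Illustrative sketch of how a test job moves across the three tiers.
# All names here are hypothetical; only the tier roles come from the doc.
@dataclass
class TestJob:
    model_id: str
    dataset_id: str
    status: str = "PENDING"
    artifacts: list = field(default_factory=list)

def dispatch(job: TestJob) -> TestJob:
    """Platform Core (client) submits the job to the testing-api."""
    job.status = "RUNNING"
    return job

def execute(job: TestJob, ground_truth: list) -> TestJob:
    """Execution engine runs tests against ground truth from storage,
    then pushes artifacts back to the data lake."""
    job.artifacts.append({"samples_evaluated": len(ground_truth)})
    job.status = "COMPLETED"
    return job

job = execute(
    dispatch(TestJob("llama-3-finetuned-v1", "qa-benchmark-gold")),
    ground_truth=["q1", "q2", "q3"],
)
```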
High-Level Data Flow
Public Interface & Usage
The system exposes a RESTful interface to manage the lifecycle of a test. Users primarily interact with the API to define test suites, trigger executions, and retrieve metrics.
Test Execution Lifecycle
To initiate a test through the platform, the client must provide a configuration payload defining the scope and parameters of the evaluation.
1. Dispatching a Test Job
POST /v1/tests/run
Request Schema:
| Field | Type | Description |
| :--- | :--- | :--- |
| model_id | String | Unique identifier of the model version under test. |
| dataset_id | String | The reference dataset used for evaluation. |
| test_suite | Array | List of specific test cases or metric names to run (e.g., accuracy, latency, bias_check). |
| callback_url | String | (Optional) The URL where the platform will send results upon completion. |
Example Usage:
```shell
curl -X POST https://api.supervised.ai/testing/v1/tests/run \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "llama-3-finetuned-v1",
    "dataset_id": "qa-benchmark-gold",
    "test_suite": ["hallucination_rate", "response_time"],
    "callback_url": "https://hooks.supervised.ai/results/123"
  }'
```
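For clients not using curl, a minimal Python sketch of the same dispatch is shown below. The endpoint, headers, and field names mirror the schema and example above; the helper function name is our own, and the token is a placeholder.

```python
import json
import urllib.request

API_BASE = "https://api.supervised.ai/testing"

def build_run_request(token, model_id, dataset_id, test_suite,
                      callback_url=None):
    """Construct (but do not send) a POST /v1/tests/run request.

    Field names follow the request schema; callback_url is optional.
    """
    payload = {
        "model_id": model_id,
        "dataset_id": dataset_id,
        "test_suite": test_suite,
    }
    if callback_url:
        payload["callback_url"] = callback_url
    return urllib.request.Request(
        f"{API_BASE}/v1/tests/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_run_request(
    "<token>",
    "llama-3-finetuned-v1",
    "qa-benchmark-gold",
    ["hallucination_rate", "response_time"],
    callback_url="https://hooks.supervised.ai/results/123",
)
# urllib.request.urlopen(req) would actually dispatch the job.
```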
2. Monitoring Status
GET /v1/tests/{job_id}/status
The API provides real-time status updates. Common states include PENDING, RUNNING, COMPLETED, and FAILED.
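A typical client polls this endpoint until the job reaches a terminal state. The sketch below uses the states listed above; `fetch_status` is a stand-in for the actual HTTP call, stubbed here with a canned sequence.

```python
import time

# Terminal states taken from the status list above.
TERMINAL_STATES = {"COMPLETED", "FAILED"}

def poll_until_done(fetch_status, job_id, interval=5.0, max_attempts=120):
    """Poll GET /v1/tests/{job_id}/status until a terminal state is seen."""
    for _ in range(max_attempts):
        status = fetch_status(job_id)
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {max_attempts} polls")

# Example with a stubbed status sequence instead of a real HTTP call:
sequence = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
final = poll_until_done(lambda _job: next(sequence), "job-123", interval=0.0)
# final == "COMPLETED"
```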
Result Reporting
Once execution is complete, the testing-api aggregates metrics from the execution engine and returns a structured JSON object containing:
- Summary Metrics: Aggregate scores for the entire test run.
- Trace Details: Per-sample logs or failure reasons for debugging.
- Artifact Links: Signed URLs to download generated reports or visualization plots.
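A consumer of the report might process those three sections as follows. The key names and values in this sketch are hypothetical, shaped after the three bullets above; the real response schema may differ.

```python
import json

# Hypothetical result payload; key names are assumptions, not the
# documented schema.
raw = json.dumps({
    "summary_metrics": {"hallucination_rate": 0.04, "response_time_p95_ms": 820},
    "trace_details": [
        {"sample_id": "s-001", "passed": False, "reason": "timeout"},
        {"sample_id": "s-002", "passed": True, "reason": None},
    ],
    "artifact_links": ["https://storage.supervised.ai/reports/123"],  # placeholder URL
})

report = json.loads(raw)

# Pull the aggregate scores and isolate failing samples for debugging.
scores = report["summary_metrics"]
failures = [t for t in report["trace_details"] if not t["passed"]]
```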
Security and Authentication
- Bearer Token Authentication: All requests to the testing-api must be authenticated via the Supervised AI Identity Provider.
- Isolation: Each test execution is scoped to a specific project or organization, ensuring that model parameters and dataset contents are never leaked across different platform tenants.