# System Design

## Architecture Overview
The testing-api serves as the bridge between development environments and the Supervised AI core infrastructure. It is designed to standardize how AI models are validated, benchmarked, and monitored before deployment.
The system operates as a stateless middleware that orchestrates communication between your model endpoints, the Supervised AI dataset repository, and the evaluation engine.
## Integration Layer
The API integrates with Supervised AI services through three primary channels:
- Data Ingress: Pulls ground-truth data and test cases from the platform's Dataset Management service.
- Execution Engine: Sends model inference requests and captures raw outputs.
- Reporting Hub: Streams real-time metrics and final evaluation reports back to the Supervised AI dashboard.
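The three channels can be pictured as stages in a single orchestration loop. The sketch below is illustrative only: every function here (`fetch_test_cases`, `run_inference`, `stream_metrics`) is a hypothetical stand-in with stubbed data, not part of the published testing-api surface.

```python
# Illustrative sketch of the three integration channels; all names and data
# here are hypothetical stand-ins, not the real Supervised AI services.

def fetch_test_cases(suite_id: str) -> list[dict]:
    """Data Ingress: pull ground-truth cases from Dataset Management (stubbed)."""
    return [{"input": "2+2", "expected": "4"}, {"input": "3+3", "expected": "6"}]

def run_inference(case: dict) -> str:
    """Execution Engine: send one inference request, capture raw output (stubbed)."""
    return case["expected"]  # pretend the model answers correctly

def stream_metrics(results: list[dict]) -> dict:
    """Reporting Hub: aggregate metrics for the dashboard (stubbed)."""
    passed = sum(r["output"] == r["expected"] for r in results)
    return {"total": len(results), "passed": passed}

def orchestrate(suite_id: str) -> dict:
    results = [
        {"expected": c["expected"], "output": run_inference(c)}
        for c in fetch_test_cases(suite_id)
    ]
    return stream_metrics(results)

summary = orchestrate("suite_v1_nlp_benchmark")
```

Because the stubbed model always matches the expected output, this toy run reports every case as passed; the point is only the shape of the ingress → execution → reporting flow.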
## Core Components

### 1. Testing Client
The TestingClient is the primary entry point for all operations. It handles authentication, session management, and provides the interface for triggering test suites.
Usage Example:

```python
from testing_api import TestingClient

# Initialize the client with your platform credentials
client = TestingClient(
    api_key="your_supervised_ai_key",
    environment="production"
)
```
### 2. Test Suites
A Test Suite is a logical grouping of test cases defined within the Supervised AI platform. The API allows you to trigger these suites programmatically.
| Parameter | Type | Description |
| :--- | :--- | :--- |
| `suite_id` | string | The unique identifier of the suite in the Supervised AI dashboard. |
| `model_endpoint` | string | The URL of the model being tested. |
| `config` | dict | Optional parameters (timeout, retry logic, concurrency). |
Usage Example:

```python
results = client.run_suite(
    suite_id="suite_v1_nlp_benchmark",
    model_endpoint="https://api.your-model.com/v1/predict",
    config={"concurrency": 5}
)
```
### 3. Evaluators (Internal)
While evaluators are managed internally by the platform, the API allows you to specify which evaluation logic to apply to a specific run.
- Role: Evaluators compare model output against ground truth using metrics like Precision, Recall, F1-Score, or custom LLM-based grading.
- Interaction: Users specify the `evaluator_type` in the request payload to determine how results are processed.
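To make the metrics concrete, the sketch below computes Precision, Recall, and F1 from binary predictions against ground truth. This mirrors the standard definitions of those metrics; it is not Supervised AI's internal evaluator implementation.

```python
# Standard precision/recall/F1 over binary labels; a sketch of the kind of
# scoring an F1-style evaluator performs, not the platform's internal code.

def precision_recall_f1(predicted: list[int], truth: list[int]) -> dict:
    tp = sum(p == 1 and t == 1 for p, t in zip(predicted, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(predicted, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# One false positive, no false negatives: precision 2/3, recall 1.0, F1 0.8.
scores = precision_recall_f1([1, 1, 0, 1], [1, 0, 0, 1])
```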
## Data Flow and Interaction
The following workflow illustrates how a standard testing request is processed:
1. Request Initiation: The user calls the API with a `model_endpoint` and `suite_id`.
2. Dataset Retrieval: The API fetches the associated test dataset from Supervised AI's internal storage.
3. Inference Loop: The system iterates through the dataset, sending payloads to the provided `model_endpoint`.
4. Scoring: Raw responses are sent to the Evaluator Service.
5. Persistence: Results are saved to the platform's database, and a summary object is returned to the user.
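The five steps can be condensed into a single driver function. Everything below is a stubbed sketch: the inline dataset, the echo "model", the exact-match scoring, and the in-memory `DATABASE` dict are all stand-ins for the real Supervised AI services.

```python
# Condensed sketch of the request lifecycle; dataset, model, evaluator, and
# storage are all stubbed stand-ins for the real platform services.
import uuid

DATABASE = {}  # stands in for the platform's persistence layer

def process_request(suite_id: str, model_endpoint: str) -> dict:
    # 1. Request Initiation: caller supplies suite_id and model_endpoint.
    # 2. Dataset Retrieval (stubbed inline).
    dataset = [{"input": "ping", "expected": "pong"},
               {"input": "foo", "expected": "bar"}]
    # 3. Inference Loop: one call per test case (the stub always answers "pong").
    raw = [{"expected": c["expected"], "output": "pong"} for c in dataset]
    # 4. Scoring: the stub evaluator marks exact matches as passed.
    passed = sum(r["output"] == r["expected"] for r in raw)
    # 5. Persistence: save the result, then return a summary object.
    test_id = str(uuid.uuid4())
    summary = {"test_id": test_id, "status": "completed",
               "summary": {"total_cases": len(raw), "passed": passed,
                           "failed": len(raw) - passed}}
    DATABASE[test_id] = summary
    return summary

result = process_request("suite_v1_nlp_benchmark",
                         "https://api.your-model.com/v1/predict")
```

With this stub, one of the two cases passes and one fails, and the persisted record is the same object the caller receives.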
## API Output Schema
Every successful test execution returns a `TestResult` object:

```json
{
  "test_id": "uuid-string",
  "status": "completed",
  "summary": {
    "total_cases": 100,
    "passed": 95,
    "failed": 5,
    "metrics": {
      "accuracy": 0.95,
      "latency_p95": "120ms"
    }
  },
  "report_url": "https://app.supervised.ai/reports/uuid-string"
}
```
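A client consuming this schema might validate it as shown below. The field names follow the example object; the 90% pass-rate gate is an arbitrary illustrative threshold, not a platform default.

```python
import json

# The payload mirrors the TestResult example from the schema section.
payload = """
{
  "test_id": "uuid-string",
  "status": "completed",
  "summary": {
    "total_cases": 100,
    "passed": 95,
    "failed": 5,
    "metrics": {"accuracy": 0.95, "latency_p95": "120ms"}
  },
  "report_url": "https://app.supervised.ai/reports/uuid-string"
}
"""

result = json.loads(payload)
# Gate a CI step on the outcome; the 0.9 threshold is an illustrative choice.
pass_rate = result["summary"]["passed"] / result["summary"]["total_cases"]
ci_ok = result["status"] == "completed" and pass_rate >= 0.9
```

This pattern makes the `report_url` available for CI logs while letting the pipeline fail fast on a low pass rate.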
## Configuration & Environment
The API behavior can be modified via environment variables or the initialization config to align with different stages of the CI/CD pipeline:
- `SAI_TESTING_TIMEOUT`: Maximum time (in seconds) to wait for a model response.
- `SAI_RETRY_COUNT`: Number of attempts for failed network calls to the model endpoint.
- `SAI_LOG_LEVEL`: Controls the verbosity of the client (`DEBUG`, `INFO`, `ERROR`).
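A minimal sketch of how a client might resolve these variables, with fallbacks applied when a variable is unset. The default values (30 seconds, 3 retries, `INFO`) are assumptions for illustration, not documented platform defaults.

```python
import os

# Fallback values below are illustrative assumptions, not documented defaults.
timeout = float(os.environ.get("SAI_TESTING_TIMEOUT", "30"))
retries = int(os.environ.get("SAI_RETRY_COUNT", "3"))
log_level = os.environ.get("SAI_LOG_LEVEL", "INFO")

config = {"timeout": timeout, "retries": retries, "log_level": log_level}
```

Reading all three in one place keeps CI/CD stage differences (e.g. longer timeouts in staging, `DEBUG` logs locally) out of the test code itself.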