Design Principles
Core Design Philosophy
The testing-api is designed to provide a standardized, scalable interface for evaluating Supervised AI models. Its architecture is built on the premise that model testing should be as rigorous and repeatable as software unit testing, but tailored to the nuances of machine learning workflows—specifically ground-truth validation and performance benchmarking.
The following principles guide the structure and evolution of the API:
Supervised-First Architecture
Unlike general-purpose logging or monitoring tools, this API is explicitly structured around supervised learning primitives. Every interface assumes a relationship between an input, a model prediction, and a verified ground-truth label. This focus ensures that metrics such as precision, recall, and F1-score are computed consistently and remain comparable across different model versions.
Deterministic Evaluation
To ensure reliability across the Supervised AI platform, the API enforces a deterministic approach to testing. For a given dataset and model output, the API guarantees consistent metric results.
- Idempotency: Submitting the same prediction set against the same ground truth will always yield the same evaluation signature.
- Traceability: Every test result is linked to a specific dataset version and model metadata.
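The idempotency guarantee can be pictured as hashing a canonicalized view of the inputs. The following sketch is illustrative only: the `evaluation_signature` helper and its canonicalization scheme are assumptions for demonstration, not part of the documented API.

```python
import hashlib
import json

def evaluation_signature(predictions, ground_truth, dataset_version):
    """Derive a deterministic signature for a prediction set.

    Canonicalizes the inputs (sorted keys, sorted pairs, fixed
    separators) so the same data always hashes to the same value,
    regardless of input ordering. Hypothetical helper for illustration.
    """
    payload = json.dumps(
        {
            "dataset_version": dataset_version,
            "pairs": sorted(zip(predictions, ground_truth)),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Idempotency: identical inputs always yield the identical signature,
# and the signature is traceable to a specific dataset version.
sig_a = evaluation_signature(["cat", "dog"], ["cat", "cat"], "v1.2.0")
sig_b = evaluation_signature(["cat", "dog"], ["cat", "cat"], "v1.2.0")
```

Because the signature also covers the dataset version, two evaluations can only share a signature when both the prediction set and the ground-truth version match.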
Schema-Driven Data Contracts
The API utilizes strict data contracts to handle the hand-off between the inference engine and the testing suite. This prevents common "silent failures" in ML pipelines, such as label mismatch or data type drift.
// Example of the standardized input structure for evaluation
{
  "test_session_id": "uuid-v4",
  "data_points": [
    {
      "input_id": "ref-001",
      "prediction": "cat",
      "ground_truth": "cat",
      "confidence": 0.98
    }
  ]
}
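Enforcing such a contract before evaluation runs is what turns silent failures into loud ones. A minimal sketch, assuming a hypothetical `validate_data_point` checker (the validator name and error behavior are illustrative, not the API's actual implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataPoint:
    """One prediction/ground-truth pair under the standardized contract."""
    input_id: str
    prediction: str
    ground_truth: str
    confidence: float

def validate_data_point(raw: dict) -> DataPoint:
    """Reject malformed points instead of failing silently downstream.

    Hypothetical validator: checks required keys and basic types (label
    mismatch, type drift) before the point enters metric calculation.
    """
    required = {"input_id": str, "prediction": str,
                "ground_truth": str, "confidence": float}
    for key, expected in required.items():
        if key not in raw:
            raise ValueError(f"missing field: {key}")
        if not isinstance(raw[key], expected):
            raise TypeError(f"field {key!r} must be {expected.__name__}")
    if not 0.0 <= raw["confidence"] <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return DataPoint(**raw)

point = validate_data_point({
    "input_id": "ref-001", "prediction": "cat",
    "ground_truth": "cat", "confidence": 0.98,
})
```

A point whose `ground_truth` arrives as an integer, for example, is rejected at the contract boundary rather than corrupting a confusion matrix later.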
Decoupling of Inference and Validation
The design maintains a strict separation of concerns:
- Inference: Handled by the model deployment.
- Validation: Handled by the testing-api.
By decoupling these, users can swap out evaluation metrics or update testing requirements without redeploying the underlying model. This allows for "Retroactive Testing," where historical model outputs can be re-evaluated against newly updated ground-truth datasets.
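Retroactive Testing falls out of this separation naturally: because stored predictions are just data, they can be rescored against any ground-truth version. A hedged sketch under assumed names (`stored_predictions`, the versioned truth mappings, and the `accuracy` helper are all illustrative):

```python
def accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of stored predictions matching the (possibly updated)
    ground truth. Only ids present in both mappings are scored."""
    shared = predictions.keys() & ground_truth.keys()
    correct = sum(predictions[i] == ground_truth[i] for i in shared)
    return correct / len(shared)

# Historical model outputs, persisted at inference time.
stored_predictions = {"ref-001": "cat", "ref-002": "dog", "ref-003": "cat"}

# Ground truth v1 vs. a later corrected v2 (ref-003 was relabeled).
truth_v1 = {"ref-001": "cat", "ref-002": "dog", "ref-003": "dog"}
truth_v2 = {"ref-001": "cat", "ref-002": "dog", "ref-003": "cat"}

# No redeployment of the model: only the validation side changes.
score_v1 = accuracy(stored_predictions, truth_v1)
score_v2 = accuracy(stored_predictions, truth_v2)
```

The model is never re-run; only the validation step is repeated against the new ground-truth version.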
Public Interface Design
The API exposes a developer-centric interface that prioritizes automation and integration into CI/CD pipelines.
Standardized Metric Payloads
The output of the API is designed to be consumed by both automated systems and human-readable dashboards. Every evaluation endpoint returns a structured object containing:
- Aggregated Metrics: High-level performance (e.g., Accuracy, MSE).
- Per-Class Breakdown: Detailed performance for specific labels in a supervised set.
- Outlier Identification: Automatic flagging of data points where the model failed significantly.
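The three-part payload above can be sketched end to end. Field names and the outlier rule used here (a confident misclassification) are assumptions for illustration, not the API's exact schema:

```python
from collections import defaultdict

def build_metric_payload(points, outlier_confidence=0.9):
    """Assemble a three-part evaluation payload: aggregated metrics,
    per-class breakdown, and flagged outliers. Illustrative sketch."""
    per_class = defaultdict(lambda: {"correct": 0, "total": 0})
    outliers = []
    correct = 0
    for p in points:
        hit = p["prediction"] == p["ground_truth"]
        correct += hit
        stats = per_class[p["ground_truth"]]
        stats["total"] += 1
        stats["correct"] += hit
        # Flag confident misclassifications as significant failures.
        if not hit and p["confidence"] >= outlier_confidence:
            outliers.append(p["input_id"])
    return {
        "aggregated": {"accuracy": correct / len(points)},
        "per_class": {
            label: s["correct"] / s["total"]
            for label, s in per_class.items()
        },
        "outliers": outliers,
    }

payload = build_metric_payload([
    {"input_id": "a", "prediction": "cat", "ground_truth": "cat", "confidence": 0.98},
    {"input_id": "b", "prediction": "cat", "ground_truth": "dog", "confidence": 0.95},
    {"input_id": "c", "prediction": "dog", "ground_truth": "dog", "confidence": 0.60},
])
```

An automated CI gate can assert on `aggregated`, while a dashboard renders `per_class` and `outliers` for human review.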
Type-Safe Integration
For developers using the platform, the API structure favors strong typing. This ensures that when a user defines a "Label," it is consistently treated as a string or integer throughout the testing lifecycle, minimizing errors during the calculation of confusion matrices or loss functions.
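One way a client could realize this strong typing is with a closed label type, so every comparison in a confusion-matrix cell is between members of the same set. The `Label` enum and `confusion_cell` helper below are hypothetical, shown only to illustrate the principle:

```python
from enum import Enum

class Label(str, Enum):
    """A closed, string-typed label set: a stray integer or typo cannot
    silently enter confusion-matrix construction. Illustrative sketch."""
    CAT = "cat"
    DOG = "dog"

def confusion_cell(predictions, truths,
                   pred_label: Label, true_label: Label) -> int:
    """Count points predicted as pred_label whose truth is true_label."""
    return sum(
        p == pred_label and t == true_label
        for p, t in zip(predictions, truths)
    )

preds = [Label.CAT, Label.CAT, Label.DOG]
truths = [Label.CAT, Label.DOG, Label.DOG]
cell = confusion_cell(preds, truths, Label.CAT, Label.DOG)
```

Constructing `Label("bird")` raises immediately, catching label drift at the boundary instead of during metric calculation.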
Extensibility for AI Domains
While the core structure is rigid to ensure data integrity, the API is designed to be extensible across different supervised domains:
- Classification: Optimized for categorical cross-entropy and accuracy.
- Regression: Optimized for continuous variable error analysis.
- Sequence Labeling: Structured to handle multi-token ground truth comparisons.
By adhering to these principles, the testing-api provides a robust foundation for building trust in Supervised AI systems through continuous, automated, and standardized evaluation.