Dataset Validation
Overview
Dataset validation is a critical step in the Supervised AI platform workflow. Before a dataset can be used for training or testing, it must pass a series of integrity and compatibility checks to ensure the data is structured correctly and contains no malformed entries that could lead to training failure.
The testing-api provides a validation interface to verify schemas, check for missing values, and ensure label consistency across your training sets.
Validation Endpoint
To validate a dataset, use the validation endpoint. This process evaluates your dataset against the platform's required schema for specific AI task types (e.g., classification, object detection, or NLP).
POST /v1/datasets/validate
Validates a local or remote dataset against the Supervised AI training requirements.
Request Headers:
Content-Type: application/json
Authorization: Bearer <YOUR_TOKEN>
Request Body Parameters:
| Parameter | Type | Description |
| :--- | :--- | :--- |
| dataset_path | string | The URI or path to the dataset files (S3, GCS, or local path). |
| task_type | string | The AI task (e.g., image_classification, text_summarization). |
| schema_version | string | (Optional) The specific schema version to validate against. Defaults to latest. |
| strict_mode | boolean | If true, validation fails on warnings (e.g., class imbalance). |
Usage Example:
```shell
curl -X POST "https://api.supervised.ai/v1/datasets/validate" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_path": "s3://my-bucket/training-data/v1/",
    "task_type": "image_classification",
    "strict_mode": false
  }'
```
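The same request can be issued from Python. A minimal sketch using only the standard library (the helper names below are illustrative, not part of any official SDK); note that `schema_version` is omitted when unset so the API falls back to the latest schema:

```python
import json
import urllib.request

API_URL = "https://api.supervised.ai/v1/datasets/validate"

def build_validation_request(dataset_path, task_type,
                             schema_version=None, strict_mode=False):
    """Assemble the request body described in the parameter table."""
    body = {
        "dataset_path": dataset_path,
        "task_type": task_type,
        "strict_mode": strict_mode,
    }
    if schema_version is not None:
        body["schema_version"] = schema_version
    return body

def validate_dataset(token, **params):
    """POST the validation request and return the parsed JSON report."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_validation_request(**params)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```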
Validation Rules
The API executes three primary layers of validation:
1. Structural Validation
Ensures the file directory or manifest matches the expected format.
- CSV/JSON Check: Verifies headers and data types.
- Directory Structure: Ensures images and annotations are correctly mapped.
2. Data Integrity
Checks for the physical health of the data points.
- Corrupt File Detection: Identifies unreadable images or truncated text files.
- Null Values: Detects missing labels or empty feature sets.
- Size Constraints: Ensures individual files do not exceed platform limits.
3. Distribution & Labeling
Analyzes the content for training readiness.
- Label Consistency: Validates that all labels in the dataset are defined in the project metadata.
- Class Balance: Warns if specific classes have insufficient samples for meaningful training.
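To catch obvious problems before uploading, the three layers can be approximated client-side. The sketch below is an illustrative pre-check over a manifest of rows, not the platform's actual `ValidatorCore` logic; the `file_path`/`label` column names are assumptions:

```python
from collections import Counter

REQUIRED_COLUMNS = {"file_path", "label"}   # hypothetical manifest schema

def precheck_manifest(rows, imbalance_ratio=0.2):
    """Approximate the three validation layers locally:
    structure (headers), integrity (missing labels),
    distribution (class balance). Returns (errors, warnings)."""
    errors, warnings = [], []
    if not rows:
        return ["manifest is empty"], warnings
    # 1. Structural: every row must carry the required columns.
    missing = REQUIRED_COLUMNS - rows[0].keys()
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
        return errors, warnings
    # 2. Integrity: flag rows with empty labels.
    labels = []
    for i, row in enumerate(rows):
        if not row["label"]:
            errors.append(f"row {i}: missing label")
        else:
            labels.append(row["label"])
    # 3. Distribution: warn when a class is badly under-represented.
    counts = Counter(labels)
    if counts:
        largest = max(counts.values())
        for cls, n in counts.items():
            if n < largest * imbalance_ratio:
                warnings.append(f"class '{cls}' has only {n} samples")
    return errors, warnings
```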
Response Format
The API returns a detailed report of the validation status.
Success Response (Status 200):
```json
{
  "status": "success",
  "validation_id": "val_8829102",
  "summary": {
    "total_records": 5000,
    "valid_records": 4998,
    "invalid_records": 2,
    "warnings": 1
  },
  "details": {
    "errors": [
      { "row": 452, "message": "Missing label attribute" },
      { "row": 1022, "message": "Unsupported file format (.tiff)" }
    ],
    "warnings": [
      { "type": "class_imbalance", "message": "Class 'Cat' has 80% fewer samples than 'Dog'" }
    ]
  }
}
```
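A report with this shape is straightforward to post-process. The helper below (an illustrative sketch, using only the field names shown in the example above) reduces a report to a pass/fail decision plus readable issue lines:

```python
def summarize_report(report):
    """Condense a validation report into (passed, issue_lines).
    The dataset is treated as passing when no records are invalid."""
    summary = report["summary"]
    passed = summary["invalid_records"] == 0
    lines = [f"row {e['row']}: {e['message']}"
             for e in report["details"]["errors"]]
    lines += [f"warning ({w['type']}): {w['message']}"
              for w in report["details"]["warnings"]]
    return passed, lines
```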
Internal Validation Logic
While the validation logic is managed by the internal ValidatorCore module, users can configure validation behavior through the validation_config block in the platform's main configuration file.
The ValidatorCore is responsible for checksum verification and cross-referencing labels with the internal taxonomy engine. Users do not interact with this module directly but may see it referenced in error logs during high-level debugging.
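The exact keys of the `validation_config` block are not documented here; as an illustration only, a block of roughly this shape (all key names hypothetical) could expose the behaviors described above:

```yaml
# Hypothetical validation_config sketch -- key names are illustrative,
# not the platform's actual configuration schema.
validation_config:
  strict_mode: false          # fail on warnings such as class imbalance
  schema_version: latest      # schema to validate against
  max_file_size_mb: 50        # per-file size constraint
  class_balance:
    warn_threshold: 0.2       # warn when a class falls below 20% of the largest
```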
Error Handling
If validation cannot complete or the dataset fails outright (e.g., the dataset is unreachable or the schema does not match the task type), the API returns a 422 Unprocessable Entity error.
| Status Code | Reason |
| :--- | :--- |
| 400 | Malformed request body. |
| 401 | Unauthorized; check your API token. |
| 422 | Validation failed (check the errors array for details). |
| 500 | Internal processing error during validation. |
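Client code can map these status codes to actionable messages. A minimal sketch (the function name is illustrative); note that a 422 response still carries the JSON report, so its `errors` array can be surfaced:

```python
import json

def explain_failure(status_code, body_text=""):
    """Translate a documented error status into an actionable message."""
    if status_code == 400:
        return "Malformed request body: check the JSON payload."
    if status_code == 401:
        return "Unauthorized: check your API token."
    if status_code == 422:
        try:
            # The 422 body still contains the validation report.
            errors = json.loads(body_text)["details"]["errors"]
            return f"Validation failed: {errors}"
        except (ValueError, KeyError):
            return "Validation failed: see the errors array in the response."
    if status_code == 500:
        return "Internal processing error during validation: retry later."
    return f"Unexpected status {status_code}"
```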