Dataset Validation
Overview
Dataset validation is a critical step in the Supervised AI platform workflow. Before a dataset can be used for training or testing, it must pass a series of integrity and compatibility checks to ensure the data is structured correctly and contains no malformed entries that could lead to training failure.
The testing-api provides a validation interface to verify schemas, check for missing values, and ensure label consistency across your training sets.
Validation Endpoint
To validate a dataset, use the validation endpoint. This process evaluates your dataset against the platform's required schema for specific AI task types (e.g., classification, object detection, or NLP).
POST /v1/datasets/validate
Validates a local or remote dataset against the Supervised AI training requirements.
Request Headers:
Content-Type: application/json
Authorization: Bearer <YOUR_TOKEN>
Request Body Parameters:
| Parameter | Type | Description |
| :--- | :--- | :--- |
| dataset_path | string | The URI or path to the dataset files (S3, GCS, or local path). |
| task_type | string | The AI task (e.g., image_classification, text_summarization). |
| schema_version | string | (Optional) The specific schema version to validate against. Defaults to latest. |
| strict_mode | boolean | If true, validation fails on warnings (e.g., class imbalance). |
Usage Example:
```shell
curl -X POST "https://api.supervised.ai/v1/datasets/validate" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_path": "s3://my-bucket/training-data/v1/",
    "task_type": "image_classification",
    "strict_mode": false
  }'
```
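The same request can be issued from Python. A minimal sketch using only the standard library (the helper names below are illustrative, not part of any official SDK); note that `schema_version` is omitted when unset so the API falls back to the latest schema:

```python
import json
import urllib.request

API_URL = "https://api.supervised.ai/v1/datasets/validate"

def build_validation_request(dataset_path, task_type,
                             schema_version=None, strict_mode=False):
    """Assemble the request body described in the parameter table."""
    body = {
        "dataset_path": dataset_path,
        "task_type": task_type,
        "strict_mode": strict_mode,
    }
    if schema_version is not None:
        body["schema_version"] = schema_version
    return body

def validate_dataset(token, **params):
    """POST the validation request and return the parsed JSON report."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_validation_request(**params)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```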
Validation Rules
The API executes three primary layers of validation:
1. Structural Validation
Ensures the file directory or manifest matches the expected format.
- CSV/JSON Check: Verifies headers and data types.
- Directory Structure: Ensures images and annotations are correctly mapped.
2. Data Integrity
Checks for the physical health of the data points.
- Corrupt File Detection: Identifies unreadable images or truncated text files.
- Null Values: Detects missing labels or empty feature sets.
- Size Constraints: Ensures individual files do not exceed platform limits.
3. Distribution & Labeling
Analyzes the content for training readiness.
- Label Consistency: Validates that all labels in the dataset are defined in the project metadata.
- Class Balance: Warns if specific classes have insufficient samples for meaningful training.
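To catch obvious problems before uploading, the three layers can be approximated client-side. The sketch below is an illustrative pre-check over a manifest of rows, not the platform's actual `ValidatorCore` logic; the `file_path`/`label` column names are assumptions:

```python
from collections import Counter

REQUIRED_COLUMNS = {"file_path", "label"}   # hypothetical manifest schema

def precheck_manifest(rows, imbalance_ratio=0.2):
    """Approximate the three validation layers locally:
    structure (headers), integrity (missing labels),
    distribution (class balance). Returns (errors, warnings)."""
    errors, warnings = [], []
    if not rows:
        return ["manifest is empty"], warnings
    # 1. Structural: every row must carry the required columns.
    missing = REQUIRED_COLUMNS - rows[0].keys()
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
        return errors, warnings
    # 2. Integrity: flag rows with empty labels.
    labels = []
    for i, row in enumerate(rows):
        if not row["label"]:
            errors.append(f"row {i}: missing label")
        else:
            labels.append(row["label"])
    # 3. Distribution: warn when a class is badly under-represented.
    counts = Counter(labels)
    if counts:
        largest = max(counts.values())
        for cls, n in counts.items():
            if n < largest * imbalance_ratio:
                warnings.append(f"class '{cls}' has only {n} samples")
    return errors, warnings
```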
Response Format
The API returns a detailed report of the validation status.
Success Response (Status 200):
```json
{
  "status": "success",
  "validation_id": "val_8829102",
  "summary": {
    "total_records": 5000,
    "valid_records": 4998,
    "invalid_records": 2,
    "warnings": 1
  },
  "details": {
    "errors": [
      { "row": 452, "message": "Missing label attribute" },
      { "row": 1022, "message": "Unsupported file format (.tiff)" }
    ],
    "warnings": [
      { "type": "class_imbalance", "message": "Class 'Cat' has 80% fewer samples than 'Dog'" }
    ]
  }
}
```
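A report with this shape is straightforward to post-process. The helper below (an illustrative sketch, using only the field names shown in the example above) reduces a report to a pass/fail decision plus readable issue lines:

```python
def summarize_report(report):
    """Condense a validation report into (passed, issue_lines).
    The dataset is treated as passing when no records are invalid."""
    summary = report["summary"]
    passed = summary["invalid_records"] == 0
    lines = [f"row {e['row']}: {e['message']}"
             for e in report["details"]["errors"]]
    lines += [f"warning ({w['type']}): {w['message']}"
              for w in report["details"]["warnings"]]
    return passed, lines
```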
Internal Validation Logic
While the validation logic is managed by the internal ValidatorCore module, users can configure validation behavior through the validation_config block in the platform's main configuration file.
The ValidatorCore is responsible for checksum verification and cross-referencing labels with the internal taxonomy engine. Users do not interact with this module directly but may see it referenced in error logs during high-level debugging.
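The exact keys of the `validation_config` block are not documented here; as an illustration only, a block of roughly this shape (all key names hypothetical) could expose the behaviors described above:

```yaml
# Hypothetical validation_config sketch -- key names are illustrative,
# not the platform's actual configuration schema.
validation_config:
  strict_mode: false          # fail on warnings such as class imbalance
  schema_version: latest      # schema to validate against
  max_file_size_mb: 50        # per-file size constraint
  class_balance:
    warn_threshold: 0.2       # warn when a class falls below 20% of the largest
```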
Error Handling
If validation cannot complete or the dataset fails outright (e.g., the dataset is unreachable or the schema does not match the task type), the API returns a 422 Unprocessable Entity error.
| Status Code | Reason |
| :--- | :--- |
| 400 | Malformed request body. |
| 401 | Unauthorized; check your API token. |
| 422 | Validation failed (check the errors array for details). |
| 500 | Internal processing error during validation. |
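Client code can map these status codes to actionable messages. A minimal sketch (the function name is illustrative); note that a 422 response still carries the JSON report, so its `errors` array can be surfaced:

```python
import json

def explain_failure(status_code, body_text=""):
    """Translate a documented error status into an actionable message."""
    if status_code == 400:
        return "Malformed request body: check the JSON payload."
    if status_code == 401:
        return "Unauthorized: check your API token."
    if status_code == 422:
        try:
            # The 422 body still contains the validation report.
            errors = json.loads(body_text)["details"]["errors"]
            return f"Validation failed: {errors}"
        except (ValueError, KeyError):
            return "Validation failed: see the errors array in the response."
    if status_code == 500:
        return "Internal processing error during validation: retry later."
    return f"Unexpected status {status_code}"
```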