In production machine learning environments, the adage "garbage in, garbage out" holds particularly true. Model performance, reliability, and trustworthiness are directly tied to the quality of data flowing through your ML pipelines. Before training models or serving predictions, robust data validation mechanisms must verify that incoming data conforms to expected schemas and quality standards.
Dataset Integrity Analysis is a critical component of MLOps infrastructure. It involves systematically evaluating datasets against predefined schemas to compute quantitative quality metrics. These metrics serve as gatekeepers—allowing clean data to proceed while flagging anomalies that could corrupt model behavior or cause downstream failures.
Given a collection of data records (represented as a list of dictionaries) and a schema specification, your task is to compute four essential integrity metrics:
**1. Completeness**

Measures the percentage of non-null values across all expected fields in the dataset. A completeness score of 100% indicates that every expected field in every record contains a value.
$$\text{Completeness} = \frac{\text{Number of non-null values}}{\text{Total expected values (rows × fields)}} \times 100$$
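As a sketch of this formula (function and variable names are illustrative), completeness counts non-null values over rows × fields. Note one assumption the problem statement does not spell out: a field missing from a record is treated the same as an explicit `None` here.

```python
def completeness(records, schema):
    """Percentage of non-null values over rows x fields, rounded to 2 dp."""
    total = len(records) * len(schema)  # total expected values
    non_null = sum(
        1
        for row in records
        for field in schema
        # .get() treats an absent key like an explicit None (assumption)
        if row.get(field) is not None
    )
    return round(non_null / total * 100, 2)

records = [{'a': 1, 'b': None}, {'a': 2, 'b': 'x'}]
schema = {'a': {}, 'b': {}}
print(completeness(records, schema))  # 3 of 4 values present -> 75.0
```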
**2. Type Validity**

Measures the percentage of values that conform to their expected data types. The expected types are:

- `numeric`: `int` and `float` values (but explicitly not booleans, since Python's `bool` is a subclass of `int`)
- `categorical`: `str` (string) values only
- `boolean`: the `True` or `False` literals

Null handling for type validity considers the nullable constraint:

- If a value is `None` and the field is marked as nullable, it counts as type-valid
- If a value is `None` and the field is marked as not nullable, it counts as type-invalid

$$\text{Type Validity} = \frac{\text{Number of type-valid values}}{\text{Total values (rows × fields)}} \times 100$$
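A per-value check following these rules might look like the sketch below (the helper name is illustrative). The `bool`-before-`int` ordering matters because `isinstance(True, int)` is `True` in Python.

```python
def is_type_valid(value, field_type, nullable):
    """True if `value` satisfies `field_type` under the nullable constraint."""
    if value is None:
        # Nulls are valid only when the field allows them.
        return nullable
    if field_type == 'numeric':
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == 'categorical':
        return isinstance(value, str)
    if field_type == 'boolean':
        return isinstance(value, bool)
    return False

print(is_type_valid(True, 'numeric', False))    # False: bool is not numeric
print(is_type_valid(None, 'categorical', True)) # True: null allowed
```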
**3. Uniqueness Ratio**

Measures the percentage of distinct records in the dataset. Duplicate records can introduce bias in training and skew statistical analyses.
$$\text{Uniqueness Ratio} = \frac{\text{Number of unique records}}{\text{Total records}} \times 100$$
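Since dicts are unhashable, one way to count distinct records is to canonicalize each record into a sorted tuple of its items before putting it in a set (the helper name is illustrative):

```python
def uniqueness_ratio(records):
    """Percentage of distinct records, rounded to 2 decimal places."""
    # Sorting the items gives a canonical, hashable form per record,
    # so key order within a dict does not affect the comparison.
    seen = {tuple(sorted(r.items())) for r in records}
    return round(len(seen) / len(records) * 100, 2)

rows = [{'x': 1}, {'x': 1}, {'x': 2}]
print(uniqueness_ratio(rows))  # 2 unique of 3 -> 66.67
```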
**4. Overall Score**

A weighted aggregate of the three individual metrics, providing a single quality indicator:
$$\text{Overall Score} = 0.4 \times \text{Completeness} + 0.4 \times \text{Type Validity} + 0.2 \times \text{Uniqueness Ratio}$$
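One subtlety worth noting: applying the weights to already-rounded metrics can drift the result by ±0.01, so a sketch would weight the unrounded values and round only at the end (the function name is illustrative):

```python
def overall_score(completeness, type_validity, uniqueness):
    """Weighted aggregate (0.4 / 0.4 / 0.2), rounded to 2 decimal places.

    Pass in the *unrounded* component metrics: rounding them first
    can shift the final score by +/- 0.01.
    """
    return round(0.4 * completeness + 0.4 * type_validity + 0.2 * uniqueness, 2)

# 83.3333...% completeness and type validity, 100% uniqueness:
print(overall_score(250 / 3, 250 / 3, 100.0))  # 86.67
```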
The schema is a dictionary mapping each field name to a specification with:

- `'type'`: one of `'numeric'`, `'categorical'`, or `'boolean'`
- `'nullable'`: a boolean indicating whether null values are acceptable

Implement a Python function that evaluates dataset integrity and returns a dictionary containing all four metrics. Handle edge cases appropriately: return an empty dictionary if the input dataset is empty. All percentage values must be rounded to 2 decimal places.
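Putting the pieces together, one possible sketch of the required function follows. The name `evaluate_dataset_integrity` is illustrative (the problem does not fix one), and a field absent from a record is treated as null, an assumption the examples do not exercise.

```python
def evaluate_dataset_integrity(records, schema):
    """Compute completeness, type validity, uniqueness, and overall score."""
    if not records:
        return {}

    total = len(records) * len(schema)  # rows x fields
    non_null = 0
    type_valid = 0

    for row in records:
        for field, spec in schema.items():
            value = row.get(field)  # absent key counts as null (assumption)
            if value is not None:
                non_null += 1
            # Nulls count as type-valid only for nullable fields.
            if value is None:
                ok = spec['nullable']
            elif spec['type'] == 'numeric':
                # bool is a subclass of int, so exclude it explicitly.
                ok = isinstance(value, (int, float)) and not isinstance(value, bool)
            elif spec['type'] == 'categorical':
                ok = isinstance(value, str)
            else:  # 'boolean'
                ok = isinstance(value, bool)
            type_valid += ok

    # Canonical hashable form per record to count distinct rows.
    unique = len({tuple(sorted(r.items())) for r in records})

    completeness = non_null / total * 100
    validity = type_valid / total * 100
    uniqueness = unique / len(records) * 100
    # Weight the unrounded metrics; round only the reported values.
    overall = 0.4 * completeness + 0.4 * validity + 0.2 * uniqueness

    return {
        'completeness': round(completeness, 2),
        'type_validity': round(validity, 2),
        'uniqueness_ratio': round(uniqueness, 2),
        'overall_score': round(overall, 2),
    }
```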
**Example 1**

```python
records = [
    {'age': 25, 'name': 'Alice', 'active': True},
    {'age': 'thirty', 'name': 'Bob', 'active': False},
    {'age': None, 'name': None, 'active': True},
    {'age': 40, 'name': 'Dave', 'active': 'yes'}
]
schema_definition = {
    'age': {'type': 'numeric', 'nullable': True},
    'name': {'type': 'categorical', 'nullable': True},
    'active': {'type': 'boolean', 'nullable': False}
}
```

Output:

```python
{'completeness': 83.33, 'type_validity': 83.33, 'uniqueness_ratio': 100.0, 'overall_score': 86.67}
```

Completeness Analysis:
• Total expected values = 4 rows × 3 fields = 12
• Non-null values = 10 (row 3 has 2 null values, for 'age' and 'name')
• Completeness = (10 / 12) × 100 = 83.33%
Type Validity Analysis:
• Row 1: age=25 (numeric ✓), name='Alice' (categorical ✓), active=True (boolean ✓) → 3 valid
• Row 2: age='thirty' (string, not numeric ✗), name='Bob' (✓), active=False (✓) → 2 valid
• Row 3: age=None (nullable ✓), name=None (nullable ✓), active=True (✓) → 3 valid
• Row 4: age=40 (✓), name='Dave' (✓), active='yes' (string, not boolean ✗) → 2 valid
• Type validity = (10 / 12) × 100 = 83.33%
Uniqueness Analysis:
• All 4 records have distinct value combinations → 100% unique
Overall Score:
• The weights are applied to the unrounded metrics: (0.4 × 83.3333) + (0.4 × 83.3333) + (0.2 × 100) = 33.3333 + 33.3333 + 20 = 86.6667 ≈ 86.67%
• Note that weighting the already-rounded values (33.332 + 33.332 + 20 = 86.664) would round to 86.66, not the expected 86.67
**Example 2**

```python
records = [
    {'id': 1, 'status': True},
    {'id': 2, 'status': False},
    {'id': 3, 'status': True}
]
schema_definition = {
    'id': {'type': 'numeric', 'nullable': False},
    'status': {'type': 'boolean', 'nullable': False}
}
```

Output:

```python
{'completeness': 100.0, 'type_validity': 100.0, 'uniqueness_ratio': 100.0, 'overall_score': 100.0}
```

Perfect Dataset:
• All 6 values (3 rows × 2 fields) are present → Completeness = 100%
• All values match expected types (integers for 'id', booleans for 'status') → Type validity = 100%
• All 3 records are unique → Uniqueness ratio = 100%
• Overall = (0.4 × 100) + (0.4 × 100) + (0.2 × 100) = 100%
This represents an ideal dataset with perfect integrity scores across all dimensions.
**Example 3**

```python
records = [
    {'score': 95.5, 'grade': 'A', 'passed': True},
    {'score': None, 'grade': 'B', 'passed': True},
    {'score': 70.0, 'grade': None, 'passed': False}
]
schema_definition = {
    'score': {'type': 'numeric', 'nullable': True},
    'grade': {'type': 'categorical', 'nullable': True},
    'passed': {'type': 'boolean', 'nullable': False}
}
```

Output:

```python
{'completeness': 77.78, 'type_validity': 100.0, 'uniqueness_ratio': 100.0, 'overall_score': 91.11}
```

Completeness Analysis:
• Total expected values = 3 rows × 3 fields = 9
• Non-null values = 7 (row 2 has a null 'score', row 3 has a null 'grade')
• Completeness = (7 / 9) × 100 = 77.78%
Type Validity Analysis:
• All present values match their expected types
• Both null values are in nullable fields ('score' and 'grade' are nullable)
• Since nulls in nullable fields are type-valid → Type validity = 100%

Uniqueness Analysis:
• All 3 records are distinct → 100%

Overall Score:
• Using the unrounded completeness: (0.4 × 77.7778) + (0.4 × 100) + (0.2 × 100) = 31.1111 + 40 + 20 = 91.1111 ≈ 91.11%
This example demonstrates how completeness and type validity are independent metrics—missing values affect completeness but may still be type-valid if the field allows nulls.
Constraints