In production machine learning environments, the adage "garbage in, garbage out" holds particularly true. Model performance, reliability, and trustworthiness are directly tied to the quality of data flowing through your ML pipelines. Before training models or serving predictions, robust data validation mechanisms must verify that incoming data conforms to expected schemas and quality standards.
Dataset Integrity Analysis is a critical component of MLOps infrastructure. It involves systematically evaluating datasets against predefined schemas to compute quantitative quality metrics. These metrics serve as gatekeepers—allowing clean data to proceed while flagging anomalies that could corrupt model behavior or cause downstream failures.
Given a collection of data records (represented as a list of dictionaries) and a schema specification, your task is to compute four essential integrity metrics:
**1. Completeness**

Measures the percentage of non-null values across all expected fields in the dataset. A completeness score of 100% indicates that every expected field in every record contains a value.
$$\text{Completeness} = \frac{\text{Number of non-null values}}{\text{Total expected values (rows × fields)}} \times 100$$
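As a sketch of this formula (function and variable names are illustrative), completeness counts non-null values over rows × fields. Note one assumption the problem statement does not spell out: a field missing from a record is treated the same as an explicit `None` here.

```python
def completeness(records, schema):
    """Percentage of non-null values over rows x fields, rounded to 2 dp."""
    total = len(records) * len(schema)  # total expected values
    non_null = sum(
        1
        for row in records
        for field in schema
        # .get() treats an absent key like an explicit None (assumption)
        if row.get(field) is not None
    )
    return round(non_null / total * 100, 2)

records = [{'a': 1, 'b': None}, {'a': 2, 'b': 'x'}]
schema = {'a': {}, 'b': {}}
print(completeness(records, schema))  # 3 of 4 values present -> 75.0
```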
**2. Type Validity**

Measures the percentage of values that conform to their expected data types. The expected types are:

- `numeric`: `int` and `float` values (but explicitly not booleans, since Python's `bool` is a subclass of `int`)
- `categorical`: `str` (string) values only
- `boolean`: the `True` or `False` literals

Null handling for type validity considers the nullable constraint:

- If a value is `None` and the field is marked as nullable, it counts as type-valid
- If a value is `None` and the field is marked as not nullable, it counts as type-invalid

$$\text{Type Validity} = \frac{\text{Number of type-valid values}}{\text{Total values (rows × fields)}} \times 100$$
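A per-value check following these rules might look like the sketch below (the helper name is illustrative). The `bool`-before-`int` ordering matters because `isinstance(True, int)` is `True` in Python.

```python
def is_type_valid(value, field_type, nullable):
    """True if `value` satisfies `field_type` under the nullable constraint."""
    if value is None:
        # Nulls are valid only when the field allows them.
        return nullable
    if field_type == 'numeric':
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == 'categorical':
        return isinstance(value, str)
    if field_type == 'boolean':
        return isinstance(value, bool)
    return False

print(is_type_valid(True, 'numeric', False))    # False: bool is not numeric
print(is_type_valid(None, 'categorical', True)) # True: null allowed
```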
**3. Uniqueness Ratio**

Measures the percentage of distinct records in the dataset. Duplicate records can introduce bias in training and skew statistical analyses.
$$\text{Uniqueness Ratio} = \frac{\text{Number of unique records}}{\text{Total records}} \times 100$$
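Since dicts are unhashable, one way to count distinct records is to canonicalize each record into a sorted tuple of its items before putting it in a set (the helper name is illustrative):

```python
def uniqueness_ratio(records):
    """Percentage of distinct records, rounded to 2 decimal places."""
    # Sorting the items gives a canonical, hashable form per record,
    # so key order within a dict does not affect the comparison.
    seen = {tuple(sorted(r.items())) for r in records}
    return round(len(seen) / len(records) * 100, 2)

rows = [{'x': 1}, {'x': 1}, {'x': 2}]
print(uniqueness_ratio(rows))  # 2 unique of 3 -> 66.67
```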
**4. Overall Score**

A weighted aggregate of the three individual metrics, providing a single quality indicator:
$$\text{Overall Score} = 0.4 \times \text{Completeness} + 0.4 \times \text{Type Validity} + 0.2 \times \text{Uniqueness Ratio}$$
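One subtlety worth noting: applying the weights to already-rounded metrics can drift the result by ±0.01, so a sketch would weight the unrounded values and round only at the end (the function name is illustrative):

```python
def overall_score(completeness, type_validity, uniqueness):
    """Weighted aggregate (0.4 / 0.4 / 0.2), rounded to 2 decimal places.

    Pass in the *unrounded* component metrics: rounding them first
    can shift the final score by +/- 0.01.
    """
    return round(0.4 * completeness + 0.4 * type_validity + 0.2 * uniqueness, 2)

# 83.3333...% completeness and type validity, 100% uniqueness:
print(overall_score(250 / 3, 250 / 3, 100.0))  # 86.67
```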
The schema is a dictionary mapping each field name to a specification with:

- `'type'`: one of `'numeric'`, `'categorical'`, or `'boolean'`
- `'nullable'`: a boolean indicating whether null values are acceptable

Implement a Python function that evaluates dataset integrity and returns a dictionary containing all four metrics. Handle edge cases appropriately: return an empty dictionary if the input dataset is empty. All percentage values must be rounded to 2 decimal places.
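Putting the pieces together, one possible sketch of the required function follows. The name `evaluate_dataset_integrity` is illustrative (the problem does not fix one), and a field absent from a record is treated as null, an assumption the examples do not exercise.

```python
def evaluate_dataset_integrity(records, schema):
    """Compute completeness, type validity, uniqueness, and overall score."""
    if not records:
        return {}

    total = len(records) * len(schema)  # rows x fields
    non_null = 0
    type_valid = 0

    for row in records:
        for field, spec in schema.items():
            value = row.get(field)  # absent key counts as null (assumption)
            if value is not None:
                non_null += 1
            # Nulls count as type-valid only for nullable fields.
            if value is None:
                ok = spec['nullable']
            elif spec['type'] == 'numeric':
                # bool is a subclass of int, so exclude it explicitly.
                ok = isinstance(value, (int, float)) and not isinstance(value, bool)
            elif spec['type'] == 'categorical':
                ok = isinstance(value, str)
            else:  # 'boolean'
                ok = isinstance(value, bool)
            type_valid += ok

    # Canonical hashable form per record to count distinct rows.
    unique = len({tuple(sorted(r.items())) for r in records})

    completeness = non_null / total * 100
    validity = type_valid / total * 100
    uniqueness = unique / len(records) * 100
    # Weight the unrounded metrics; round only the reported values.
    overall = 0.4 * completeness + 0.4 * validity + 0.2 * uniqueness

    return {
        'completeness': round(completeness, 2),
        'type_validity': round(validity, 2),
        'uniqueness_ratio': round(uniqueness, 2),
        'overall_score': round(overall, 2),
    }
```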
**Example 1**

```python
records = [
    {'age': 25, 'name': 'Alice', 'active': True},
    {'age': 'thirty', 'name': 'Bob', 'active': False},
    {'age': None, 'name': None, 'active': True},
    {'age': 40, 'name': 'Dave', 'active': 'yes'}
]
schema_definition = {
    'age': {'type': 'numeric', 'nullable': True},
    'name': {'type': 'categorical', 'nullable': True},
    'active': {'type': 'boolean', 'nullable': False}
}
```

Output:

```python
{'completeness': 83.33, 'type_validity': 83.33, 'uniqueness_ratio': 100.0, 'overall_score': 86.67}
```

Completeness Analysis:
• Total expected values = 4 rows × 3 fields = 12
• Non-null values = 10 (row 3 has 2 null values, for 'age' and 'name')
• Completeness = (10 / 12) × 100 = 83.33%
Type Validity Analysis:
• Row 1: age=25 (numeric ✓), name='Alice' (categorical ✓), active=True (boolean ✓) → 3 valid
• Row 2: age='thirty' (string, not numeric ✗), name='Bob' (✓), active=False (✓) → 2 valid
• Row 3: age=None (nullable ✓), name=None (nullable ✓), active=True (✓) → 3 valid
• Row 4: age=40 (✓), name='Dave' (✓), active='yes' (string, not boolean ✗) → 2 valid
• Type validity = (10 / 12) × 100 = 83.33%
Uniqueness Analysis:
• All 4 records have distinct value combinations → 100% unique
Overall Score:
• The weights are applied to the unrounded metrics: (0.4 × 83.3333) + (0.4 × 83.3333) + (0.2 × 100) = 33.3333 + 33.3333 + 20 = 86.6667 ≈ 86.67%
• Note that weighting the already-rounded values (33.332 + 33.332 + 20 = 86.664) would round to 86.66, not the expected 86.67
**Example 2**

```python
records = [
    {'id': 1, 'status': True},
    {'id': 2, 'status': False},
    {'id': 3, 'status': True}
]
schema_definition = {
    'id': {'type': 'numeric', 'nullable': False},
    'status': {'type': 'boolean', 'nullable': False}
}
```

Output:

```python
{'completeness': 100.0, 'type_validity': 100.0, 'uniqueness_ratio': 100.0, 'overall_score': 100.0}
```

Perfect Dataset:
• All 6 values (3 rows × 2 fields) are present → Completeness = 100%
• All values match expected types (integers for 'id', booleans for 'status') → Type validity = 100%
• All 3 records are unique → Uniqueness ratio = 100%
• Overall = (0.4 × 100) + (0.4 × 100) + (0.2 × 100) = 100%
This represents an ideal dataset with perfect integrity scores across all dimensions.
**Example 3**

```python
records = [
    {'score': 95.5, 'grade': 'A', 'passed': True},
    {'score': None, 'grade': 'B', 'passed': True},
    {'score': 70.0, 'grade': None, 'passed': False}
]
schema_definition = {
    'score': {'type': 'numeric', 'nullable': True},
    'grade': {'type': 'categorical', 'nullable': True},
    'passed': {'type': 'boolean', 'nullable': False}
}
```

Output:

```python
{'completeness': 77.78, 'type_validity': 100.0, 'uniqueness_ratio': 100.0, 'overall_score': 91.11}
```

Completeness Analysis:
• Total expected values = 3 rows × 3 fields = 9
• Non-null values = 7 (row 2 has a null 'score', row 3 has a null 'grade')
• Completeness = (7 / 9) × 100 = 77.78%
Type Validity Analysis:
• All present values match their expected types
• Both null values are in nullable fields ('score' and 'grade' are nullable)
• Since nulls in nullable fields are type-valid → Type validity = 100%

Uniqueness Analysis:
• All 3 records are distinct → 100%

Overall Score:
• Using the unrounded completeness: (0.4 × 77.7778) + (0.4 × 100) + (0.2 × 100) = 31.1111 + 40 + 20 = 91.1111 ≈ 91.11%
This example demonstrates how completeness and type validity are independent metrics—missing values affect completeness but may still be type-valid if the field allows nulls.
Constraints