In enterprise machine learning operations (MLOps), canary deployments represent a critical strategy for safely releasing new model versions into production environments. Rather than immediately deploying a new model to all users—which carries inherent risk—organizations route a small percentage of live traffic (typically 1-10%) to the candidate model while the majority of requests continue flowing to the battle-tested production model.
This traffic splitting approach enables real-time performance comparison between the experimental "canary" model and the established "baseline" model using actual production data and conditions. By monitoring key performance indicators (KPIs) such as prediction accuracy and response latency, engineering teams can make data-driven decisions about whether to promote the canary to full production or roll it back.
You are tasked with implementing a deployment health analyzer that processes prediction results from both canary and baseline models to compute comprehensive comparison metrics and generate an automated promotion recommendation.
Each prediction result in both input lists is represented as a dictionary with the following structure:
| Field | Type | Description |
|---|---|---|
| `latency_ms` | float | Response latency in milliseconds for this prediction |
| `prediction` | any | The model's predicted output value |
| `ground_truth` | any | The actual correct value (label) |
Implement the function `evaluate_canary_rollout(canary_results, baseline_results, accuracy_tolerance, latency_tolerance)` that computes and returns a dictionary with the following metrics:
- `canary_accuracy`: Fraction of correct predictions for the canary model (range 0-1)
- `baseline_accuracy`: Fraction of correct predictions for the baseline model (range 0-1)
- `accuracy_change_pct`: Relative change in accuracy expressed as a percentage, calculated as `(canary_accuracy - baseline_accuracy) / baseline_accuracy * 100`
- `canary_avg_latency`: Mean response latency of the canary model (milliseconds)
- `baseline_avg_latency`: Mean response latency of the baseline model (milliseconds)
- `latency_change_pct`: Relative change in latency expressed as a percentage, calculated as `(canary_avg_latency - baseline_avg_latency) / baseline_avg_latency * 100`
- `promote_recommended`: A boolean indicating whether the canary model should be promoted to full production. This should be True if and only if both of the following hold:
  - The accuracy degradation is within `accuracy_tolerance` (i.e., `accuracy_change_pct >= -accuracy_tolerance`)
  - The latency increase is within `latency_tolerance` (i.e., `latency_change_pct <= latency_tolerance`)

Rounding: round the accuracy values (`canary_accuracy`, `baseline_accuracy`) to 4 decimal places, and the latency and percentage values to 2 decimal places.

Edge case: if either input list is empty, return an empty dictionary `{}`.

A negative `accuracy_change_pct` indicates the canary model is performing worse than baseline.
A negative latency_change_pct indicates the canary model is faster than baseline (improvement).
The promotion decision uses tolerance thresholds to allow for minor statistical variations in real-world traffic patterns.
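The specification can be implemented along these lines. This is a sketch, not a reference solution; in particular, to match the worked examples, the percentage changes here are computed from the already-rounded accuracies and average latencies, which is an interpretation inferred from the example outputs:

```python
def evaluate_canary_rollout(canary_results, baseline_results,
                            accuracy_tolerance, latency_tolerance):
    # Edge case: without data from both models, no comparison is possible.
    if not canary_results or not baseline_results:
        return {}

    def accuracy(results):
        # Fraction of predictions that exactly match their ground truth.
        correct = sum(1 for r in results if r['prediction'] == r['ground_truth'])
        return round(correct / len(results), 4)

    def avg_latency(results):
        return round(sum(r['latency_ms'] for r in results) / len(results), 2)

    canary_acc, baseline_acc = accuracy(canary_results), accuracy(baseline_results)
    canary_lat, baseline_lat = avg_latency(canary_results), avg_latency(baseline_results)

    # Relative changes, expressed as percentages of the baseline values.
    acc_change = round((canary_acc - baseline_acc) / baseline_acc * 100, 2)
    lat_change = round((canary_lat - baseline_lat) / baseline_lat * 100, 2)

    return {
        'canary_accuracy': canary_acc,
        'baseline_accuracy': baseline_acc,
        'accuracy_change_pct': acc_change,
        'canary_avg_latency': canary_lat,
        'baseline_avg_latency': baseline_lat,
        'latency_change_pct': lat_change,
        'promote_recommended': (acc_change >= -accuracy_tolerance
                                and lat_change <= latency_tolerance),
    }
```

The guard clause at the top returns `{}` before any arithmetic runs, so a zero-traffic canary never causes a division by zero.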
**Example 1**

Input:
```python
canary_results = [
    {'latency_ms': 45, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 50, 'prediction': 0, 'ground_truth': 0},
    {'latency_ms': 48, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 0},
    {'latency_ms': 47, 'prediction': 0, 'ground_truth': 0}
]
baseline_results = [
    {'latency_ms': 50, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 55, 'prediction': 0, 'ground_truth': 0},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 0},
    {'latency_ms': 58, 'prediction': 0, 'ground_truth': 0},
    {'latency_ms': 53, 'prediction': 1, 'ground_truth': 1}
]
accuracy_tolerance = 5.0
latency_tolerance = 10.0
```

Expected output:
```python
{
    'canary_accuracy': 0.8,
    'baseline_accuracy': 0.8,
    'accuracy_change_pct': 0.0,
    'canary_avg_latency': 48.4,
    'baseline_avg_latency': 53.6,
    'latency_change_pct': -9.7,
    'promote_recommended': True
}
```

Accuracy Analysis:
- Canary model: 4 out of 5 predictions correct → accuracy = 4/5 = 0.8
- Baseline model: 4 out of 5 predictions correct → accuracy = 4/5 = 0.8
- Accuracy change: (0.8 - 0.8) / 0.8 × 100 = 0.0%

Latency Analysis:
- Canary average latency: (45 + 50 + 48 + 52 + 47) / 5 = 242 / 5 = 48.4 ms
- Baseline average latency: (50 + 55 + 52 + 58 + 53) / 5 = 268 / 5 = 53.6 ms
- Latency change: (48.4 - 53.6) / 53.6 × 100 ≈ -9.70% (canary is faster!)

Promotion Decision:
- Accuracy change (0.0%) >= -5.0 (tolerance) ✓
- Latency change (-9.70%) <= 10.0 (tolerance) ✓
- Both conditions satisfied → promote_recommended = True

The canary model maintains identical accuracy while delivering nearly 10% faster response times, making it an excellent candidate for promotion.
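The promotion arithmetic for this example can be checked directly; a quick sanity check using the accuracies and average latencies worked out above:

```python
# Relative changes as percentages of the baseline values (Example 1 figures).
accuracy_change_pct = round((0.8 - 0.8) / 0.8 * 100, 2)    # 0.0
latency_change_pct = round((48.4 - 53.6) / 53.6 * 100, 2)  # -9.7

# Both tolerance conditions must hold for a promotion recommendation.
promote_recommended = (accuracy_change_pct >= -5.0   # accuracy_tolerance
                       and latency_change_pct <= 10.0)  # latency_tolerance
print(accuracy_change_pct, latency_change_pct, promote_recommended)  # 0.0 -9.7 True
```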
**Example 2**

Input:
```python
canary_results = [
    {'latency_ms': 100, 'prediction': 0, 'ground_truth': 1},
    {'latency_ms': 110, 'prediction': 0, 'ground_truth': 1},
    {'latency_ms': 105, 'prediction': 1, 'ground_truth': 1}
]
baseline_results = [
    {'latency_ms': 50, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 55, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 1}
]
accuracy_tolerance = 10.0
latency_tolerance = 5.0
```

Expected output:
```python
{
    'canary_accuracy': 0.3333,
    'baseline_accuracy': 1.0,
    'accuracy_change_pct': -66.67,
    'canary_avg_latency': 105.0,
    'baseline_avg_latency': 52.33,
    'latency_change_pct': 100.65,
    'promote_recommended': False
}
```

Accuracy Analysis:
- Canary model: 1 out of 3 predictions correct → accuracy = 1/3 = 0.3333
- Baseline model: 3 out of 3 predictions correct → accuracy = 3/3 = 1.0
- Accuracy change: (0.3333 - 1.0) / 1.0 × 100 = -66.67% (severe degradation!)

Latency Analysis:
- Canary average latency: (100 + 110 + 105) / 3 = 315 / 3 = 105.0 ms
- Baseline average latency: (50 + 55 + 52) / 3 = 157 / 3 ≈ 52.33 ms
- Latency change: (105.0 - 52.33) / 52.33 × 100 ≈ 100.65% (doubled latency!)

Promotion Decision:
- Accuracy change (-66.67%) >= -10.0 (tolerance)? NO ✗
- Latency change (100.65%) <= 5.0 (tolerance)? NO ✗
- Neither condition satisfied → promote_recommended = False

The canary model exhibits catastrophic accuracy degradation (66.67% worse) and doubles response latency. Immediate rollback is warranted.
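Note that the expected `latency_change_pct` of 100.65 is reproduced when the change is computed from the already-rounded average latencies, as in the explanation above; a quick check:

```python
# Averages rounded to 2 decimal places, as in the expected output.
canary_avg_latency = round((100 + 110 + 105) / 3, 2)  # 105.0
baseline_avg_latency = round((50 + 55 + 52) / 3, 2)   # 52.33

# Relative latency change as a percentage of the (rounded) baseline average.
latency_change_pct = round((canary_avg_latency - baseline_avg_latency)
                           / baseline_avg_latency * 100, 2)
print(latency_change_pct)  # 100.65
```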
**Example 3**

Input:
```python
canary_results = []
baseline_results = [
    {'latency_ms': 50, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 55, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 1}
]
accuracy_tolerance = 5.0
latency_tolerance = 10.0
```

Expected output:
```python
{}
```

Edge Case Handling: The canary_results list is empty, meaning no traffic was routed to the canary model during the observation period. Without any canary data to analyze, meaningful performance comparison is impossible. The function correctly returns an empty dictionary to signal that the health analysis cannot be performed.

This could occur in production when:
- The canary deployment just started and no requests have been processed
- Traffic routing configuration prevented any requests from reaching the canary
- The canary model crashed immediately and processed zero requests
Constraints