In enterprise machine learning operations (MLOps), canary deployments represent a critical strategy for safely releasing new model versions into production environments. Rather than immediately deploying a new model to all users—which carries inherent risk—organizations route a small percentage of live traffic (typically 1-10%) to the candidate model while the majority of requests continue flowing to the battle-tested production model.
This traffic splitting approach enables real-time performance comparison between the experimental "canary" model and the established "baseline" model using actual production data and conditions. By monitoring key performance indicators (KPIs) such as prediction accuracy and response latency, engineering teams can make data-driven decisions about whether to promote the canary to full production or roll it back.
You are tasked with implementing a deployment health analyzer that processes prediction results from both canary and baseline models to compute comprehensive comparison metrics and generate an automated promotion recommendation.
Each prediction result in both input lists is represented as a dictionary with the following structure:
| Field | Type | Description |
|---|---|---|
| `latency_ms` | float | Response latency in milliseconds for this prediction |
| `prediction` | any | The model's predicted output value |
| `ground_truth` | any | The actual correct value (label) |
Implement the function `evaluate_canary_rollout(canary_results, baseline_results, accuracy_tolerance, latency_tolerance)` that computes and returns a dictionary with the following metrics:
- `canary_accuracy`: Fraction of correct predictions for the canary model (range 0-1)
- `baseline_accuracy`: Fraction of correct predictions for the baseline model (range 0-1)
- `accuracy_change_pct`: Relative change in accuracy expressed as a percentage, calculated as `(canary_accuracy - baseline_accuracy) / baseline_accuracy * 100`
- `canary_avg_latency`: Mean response latency of the canary model (milliseconds)
- `baseline_avg_latency`: Mean response latency of the baseline model (milliseconds)
- `latency_change_pct`: Relative change in latency expressed as a percentage, calculated as `(canary_avg_latency - baseline_avg_latency) / baseline_avg_latency * 100`
- `promote_recommended`: A boolean indicating whether the canary model should be promoted to full production. This should be True if and only if both of the following hold:
  - The accuracy degradation is within `accuracy_tolerance` (i.e., `accuracy_change_pct >= -accuracy_tolerance`)
  - The latency increase is within `latency_tolerance` (i.e., `latency_change_pct <= latency_tolerance`)

Rounding: round the accuracy values (`canary_accuracy`, `baseline_accuracy`) to 4 decimal places, and the latency and percentage values to 2 decimal places.

Edge case: if either input list is empty, return an empty dictionary `{}`.

A negative `accuracy_change_pct` indicates the canary model is performing worse than baseline.
A negative latency_change_pct indicates the canary model is faster than baseline (improvement).
The promotion decision uses tolerance thresholds to allow for minor statistical variations in real-world traffic patterns.
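The specification can be implemented along these lines. This is a sketch, not a reference solution; in particular, to match the worked examples, the percentage changes here are computed from the already-rounded accuracies and average latencies, which is an interpretation inferred from the example outputs:

```python
def evaluate_canary_rollout(canary_results, baseline_results,
                            accuracy_tolerance, latency_tolerance):
    # Edge case: without data from both models, no comparison is possible.
    if not canary_results or not baseline_results:
        return {}

    def accuracy(results):
        # Fraction of predictions that exactly match their ground truth.
        correct = sum(1 for r in results if r['prediction'] == r['ground_truth'])
        return round(correct / len(results), 4)

    def avg_latency(results):
        return round(sum(r['latency_ms'] for r in results) / len(results), 2)

    canary_acc, baseline_acc = accuracy(canary_results), accuracy(baseline_results)
    canary_lat, baseline_lat = avg_latency(canary_results), avg_latency(baseline_results)

    # Relative changes, expressed as percentages of the baseline values.
    acc_change = round((canary_acc - baseline_acc) / baseline_acc * 100, 2)
    lat_change = round((canary_lat - baseline_lat) / baseline_lat * 100, 2)

    return {
        'canary_accuracy': canary_acc,
        'baseline_accuracy': baseline_acc,
        'accuracy_change_pct': acc_change,
        'canary_avg_latency': canary_lat,
        'baseline_avg_latency': baseline_lat,
        'latency_change_pct': lat_change,
        'promote_recommended': (acc_change >= -accuracy_tolerance
                                and lat_change <= latency_tolerance),
    }
```

The guard clause at the top returns `{}` before any arithmetic runs, so a zero-traffic canary never causes a division by zero.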
**Example 1**

Input:
```python
canary_results = [
    {'latency_ms': 45, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 50, 'prediction': 0, 'ground_truth': 0},
    {'latency_ms': 48, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 0},
    {'latency_ms': 47, 'prediction': 0, 'ground_truth': 0}
]
baseline_results = [
    {'latency_ms': 50, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 55, 'prediction': 0, 'ground_truth': 0},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 0},
    {'latency_ms': 58, 'prediction': 0, 'ground_truth': 0},
    {'latency_ms': 53, 'prediction': 1, 'ground_truth': 1}
]
accuracy_tolerance = 5.0
latency_tolerance = 10.0
```

Expected output:
```python
{
    'canary_accuracy': 0.8,
    'baseline_accuracy': 0.8,
    'accuracy_change_pct': 0.0,
    'canary_avg_latency': 48.4,
    'baseline_avg_latency': 53.6,
    'latency_change_pct': -9.7,
    'promote_recommended': True
}
```

Accuracy Analysis:
- Canary model: 4 out of 5 predictions correct → accuracy = 4/5 = 0.8
- Baseline model: 4 out of 5 predictions correct → accuracy = 4/5 = 0.8
- Accuracy change: (0.8 - 0.8) / 0.8 × 100 = 0.0%

Latency Analysis:
- Canary average latency: (45 + 50 + 48 + 52 + 47) / 5 = 242 / 5 = 48.4 ms
- Baseline average latency: (50 + 55 + 52 + 58 + 53) / 5 = 268 / 5 = 53.6 ms
- Latency change: (48.4 - 53.6) / 53.6 × 100 ≈ -9.70% (canary is faster!)

Promotion Decision:
- Accuracy change (0.0%) >= -5.0 (tolerance) ✓
- Latency change (-9.70%) <= 10.0 (tolerance) ✓
- Both conditions satisfied → promote_recommended = True

The canary model maintains identical accuracy while delivering nearly 10% faster response times, making it an excellent candidate for promotion.
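The promotion arithmetic for this example can be checked directly; a quick sanity check using the accuracies and average latencies worked out above:

```python
# Relative changes as percentages of the baseline values (Example 1 figures).
accuracy_change_pct = round((0.8 - 0.8) / 0.8 * 100, 2)    # 0.0
latency_change_pct = round((48.4 - 53.6) / 53.6 * 100, 2)  # -9.7

# Both tolerance conditions must hold for a promotion recommendation.
promote_recommended = (accuracy_change_pct >= -5.0   # accuracy_tolerance
                       and latency_change_pct <= 10.0)  # latency_tolerance
print(accuracy_change_pct, latency_change_pct, promote_recommended)  # 0.0 -9.7 True
```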
**Example 2**

Input:
```python
canary_results = [
    {'latency_ms': 100, 'prediction': 0, 'ground_truth': 1},
    {'latency_ms': 110, 'prediction': 0, 'ground_truth': 1},
    {'latency_ms': 105, 'prediction': 1, 'ground_truth': 1}
]
baseline_results = [
    {'latency_ms': 50, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 55, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 1}
]
accuracy_tolerance = 10.0
latency_tolerance = 5.0
```

Expected output:
```python
{
    'canary_accuracy': 0.3333,
    'baseline_accuracy': 1.0,
    'accuracy_change_pct': -66.67,
    'canary_avg_latency': 105.0,
    'baseline_avg_latency': 52.33,
    'latency_change_pct': 100.65,
    'promote_recommended': False
}
```

Accuracy Analysis:
- Canary model: 1 out of 3 predictions correct → accuracy = 1/3 = 0.3333
- Baseline model: 3 out of 3 predictions correct → accuracy = 3/3 = 1.0
- Accuracy change: (0.3333 - 1.0) / 1.0 × 100 = -66.67% (severe degradation!)

Latency Analysis:
- Canary average latency: (100 + 110 + 105) / 3 = 315 / 3 = 105.0 ms
- Baseline average latency: (50 + 55 + 52) / 3 = 157 / 3 ≈ 52.33 ms
- Latency change: (105.0 - 52.33) / 52.33 × 100 ≈ 100.65% (doubled latency!)

Promotion Decision:
- Accuracy change (-66.67%) >= -10.0 (tolerance)? NO ✗
- Latency change (100.65%) <= 5.0 (tolerance)? NO ✗
- Neither condition satisfied → promote_recommended = False

The canary model exhibits catastrophic accuracy degradation (66.67% worse) and doubles response latency. Immediate rollback is warranted.
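Note that the expected `latency_change_pct` of 100.65 is reproduced when the change is computed from the already-rounded average latencies, as in the explanation above; a quick check:

```python
# Averages rounded to 2 decimal places, as in the expected output.
canary_avg_latency = round((100 + 110 + 105) / 3, 2)  # 105.0
baseline_avg_latency = round((50 + 55 + 52) / 3, 2)   # 52.33

# Relative latency change as a percentage of the (rounded) baseline average.
latency_change_pct = round((canary_avg_latency - baseline_avg_latency)
                           / baseline_avg_latency * 100, 2)
print(latency_change_pct)  # 100.65
```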
**Example 3**

Input:
```python
canary_results = []
baseline_results = [
    {'latency_ms': 50, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 55, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 1}
]
accuracy_tolerance = 5.0
latency_tolerance = 10.0
```

Expected output:
```python
{}
```

Edge Case Handling: The canary_results list is empty, meaning no traffic was routed to the canary model during the observation period. Without any canary data to analyze, meaningful performance comparison is impossible. The function correctly returns an empty dictionary to signal that the health analysis cannot be performed.

This could occur in production when:
- The canary deployment just started and no requests have been processed
- Traffic routing configuration prevented any requests from reaching the canary
- The canary model crashed immediately and processed zero requests
Constraints