The most dangerous moment in an ML project is the transition from development to production. Models that performed excellently in offline evaluation have caused production outages, financial losses, and reputational damage when launched without proper safeguards.
Launch criteria are the systematic gates that a model must pass before serving real users. They protect against the failures described above: production outages, financial losses, and reputational damage from models that looked strong offline but were not ready for real traffic.
This page provides a comprehensive launch readiness framework that enables confident, safe deployment.
By completing this page, you will be able to: (1) Define comprehensive launch readiness criteria for ML models, (2) Design validation gates that catch issues before production impact, (3) Select appropriate deployment strategies for different risk profiles, (4) Implement monitoring that detects production issues early, and (5) Plan rollback procedures for failed launches.
A model is launch-ready when it has passed all required validation gates across five dimensions: Model Quality, Infrastructure Readiness, Safety & Fairness, Operational Readiness, and Stakeholder Alignment.
The Launch Readiness Checklist:
| Dimension | Criteria | Evidence Required |
|---|---|---|
| Model Quality | Meets performance thresholds on holdout | Evaluation report with confidence intervals |
| Model Quality | Passes A/B test statistical significance | Experiment report with p-values |
| Infrastructure | Latency meets SLA requirements | Load test results |
| Infrastructure | Handles peak traffic without degradation | Stress test results |
| Safety | Fairness metrics within acceptable bounds | Fairness audit report |
| Safety | No harmful output patterns detected | Safety evaluation results |
| Operational | Monitoring dashboards operational | Dashboard screenshots |
| Operational | Runbooks documented for incidents | Runbook links |
| Stakeholder | Product owner sign-off obtained | Approval record |
| Stakeholder | Legal/compliance review complete | Review documentation |
Quality Thresholds:
Explicit, quantitative thresholds must be defined for each quality dimension:
| Metric | Minimum Threshold | Target | Blocking? |
|---|---|---|---|
| Primary model metric | Defined in scoping | Above baseline | Yes |
| Latency P50 | < 50ms | < 20ms | Yes |
| Latency P99 | < 200ms | < 100ms | Yes |
| Error rate | < 0.1% | < 0.01% | Yes |
| Fairness gap | < 10% | < 5% | Yes for sensitive applications |
Blocking criteria must be met; non-blocking criteria are tracked but don't prevent launch.
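One way to produce the "evaluation report with confidence intervals" evidence is to bootstrap the primary metric and compare the lower bound of the interval against the blocking threshold, so a lucky point estimate cannot slip through. A minimal sketch, assuming a binary classifier scored with AUC-ROC; the synthetic data, helper name, and 0.85 threshold are illustrative:

```python
# Minimal sketch: bootstrap confidence interval for the primary metric,
# compared against a blocking launch threshold. Assumes a binary classifier
# evaluated with AUC-ROC; data and threshold values are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_metric_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Return a (lower, upper) percentile bootstrap CI for AUC-ROC."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample with replacement
        if len(np.unique(y_true[idx])) < 2:   # skip degenerate resamples
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower = np.percentile(stats, 100 * alpha / 2)
    upper = np.percentile(stats, 100 * (1 - alpha / 2))
    return lower, upper

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=2000)
    y_score = 0.6 * y_true + 0.4 * rng.random(2000)   # synthetic scores for demo only
    lower, upper = bootstrap_metric_ci(y_true, y_score)
    print(f"AUC-ROC 95% CI: [{lower:.3f}, {upper:.3f}]")
    # Blocking check: the *lower* bound must clear the minimum threshold.
    assert lower >= 0.85, f"CI lower bound {lower:.3f} below launch threshold"
```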
Launch criteria lose their value if they're routinely waived. Every exception sets precedent. If criteria are legitimately too strict, update them formally—don't bypass them informally. Organizations with 'this one time' exception cultures experience more production incidents.
A validation gate is a checkpoint that must be passed before proceeding to the next stage. Gates create hard stops that prevent premature launches.
The Gate Sequence:
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class GateStatus(Enum):
    NOT_STARTED = "not_started"
    IN_PROGRESS = "in_progress"
    PASSED = "passed"
    FAILED = "failed"
    BLOCKED = "blocked"


@dataclass
class LaunchGate:
    name: str
    description: str
    criteria: Dict[str, float]  # metric: threshold
    evidence_required: List[str]
    blocking: bool = True
    # Metrics listed here must stay at or below their threshold (e.g. latency);
    # all other metrics must meet or exceed it.
    lower_is_better: List[str] = field(default_factory=list)


@dataclass
class LaunchReadinessReport:
    model_version: str
    gates: Dict[str, GateStatus]
    blocking_issues: List[str]
    warnings: List[str]
    ready_to_launch: bool


def evaluate_launch_readiness(
    model_version: str,
    evaluation_results: Dict[str, float],
    gate_definitions: List[LaunchGate],
) -> LaunchReadinessReport:
    """Evaluate a model version against all launch gates."""
    gate_statuses: Dict[str, GateStatus] = {}
    blocking_issues: List[str] = []
    warnings: List[str] = []

    for gate in gate_definitions:
        passed = True
        for metric, threshold in gate.criteria.items():
            actual = evaluation_results.get(metric)
            if actual is None:
                passed = False
                issue = f"{gate.name}: Missing metric '{metric}'"
            elif metric in gate.lower_is_better and actual > threshold:
                passed = False
                issue = f"{gate.name}: {metric}={actual:.3f} > {threshold}"
            elif metric not in gate.lower_is_better and actual < threshold:
                passed = False
                issue = f"{gate.name}: {metric}={actual:.3f} < {threshold}"
            else:
                continue
            if gate.blocking:
                blocking_issues.append(issue)
            else:
                warnings.append(issue)

        gate_statuses[gate.name] = (
            GateStatus.PASSED if passed else GateStatus.FAILED
        )

    return LaunchReadinessReport(
        model_version=model_version,
        gates=gate_statuses,
        blocking_issues=blocking_issues,
        warnings=warnings,
        ready_to_launch=len(blocking_issues) == 0,
    )


# Example gate definitions
STANDARD_GATES = [
    LaunchGate(
        name="Model Quality",
        description="Offline evaluation metrics",
        criteria={
            "auc_roc": 0.85,
            "precision_at_10pct": 0.50,
        },
        evidence_required=["evaluation_report.pdf"],
        blocking=True,
    ),
    LaunchGate(
        name="Latency",
        description="Inference performance",
        criteria={
            "latency_p50_ms": 50,
            "latency_p99_ms": 200,
        },
        evidence_required=["load_test_results.json"],
        blocking=True,
        lower_is_better=["latency_p50_ms", "latency_p99_ms"],
    ),
    LaunchGate(
        name="Fairness",
        description="Demographic parity",
        criteria={
            "max_group_gap": 0.10,
        },
        evidence_required=["fairness_audit.pdf"],
        blocking=True,
        lower_is_better=["max_group_gap"],
    ),
]
```

Shadow mode runs the new model on production traffic without serving its results. It catches issues that offline evaluation misses: feature computation bugs, data format mismatches, and unexpected edge cases. Run shadow mode for at least one business cycle (often one week) before shifting any live traffic.
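A minimal sketch of what shadow scoring can look like, assuming a request handler that already calls the incumbent model; the champion, challenger, and logging names are illustrative, and the challenger is kept strictly off the serving path:

```python
# Minimal shadow-mode sketch (names are illustrative, not a specific framework).
# The incumbent ("champion") model serves the response; the new ("challenger")
# model scores the same request asynchronously, and only its logs are kept.
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

logger = logging.getLogger("shadow")
_shadow_pool = ThreadPoolExecutor(max_workers=4)

def serve_with_shadow(
    features: Dict[str, Any],
    champion: Callable[[Dict[str, Any]], float],
    challenger: Callable[[Dict[str, Any]], float],
) -> float:
    """Serve the champion's prediction; score the challenger off the hot path."""
    response = champion(features)  # only this result reaches users

    def _shadow_score() -> None:
        try:
            shadow_pred = challenger(features)
            # Log both predictions so offline analysis can compare distributions,
            # agreement rates, and latency before any live traffic shift.
            logger.info("shadow", extra={"champion": response, "challenger": shadow_pred})
        except Exception:
            # Challenger failures must never affect serving; they are themselves
            # launch-readiness signal (e.g., feature computation bugs).
            logger.exception("shadow scoring failed")

    _shadow_pool.submit(_shadow_score)
    return response
```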
The deployment strategy determines how traffic shifts from the old model (or no model) to the new model. Strategy selection depends on risk tolerance, rollback requirements, and infrastructure capabilities.
Strategy Comparison:
| Strategy | Risk Level | Rollback Speed | Best For |
|---|---|---|---|
| Big Bang | High | Slow | Low-stakes, simple models |
| Canary | Medium | Fast | Standard production models |
| Blue-Green | Medium | Instant | When instant rollback required |
| Rolling | Low | Gradual | Large-scale distributed systems |
| Feature Flag | Low | Instant | Fine-grained control needed |
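Of these, canary is the most common default for standard production models. A minimal sketch of a stepwise canary ramp, where route_traffic, canary_is_healthy, and rollback are hypothetical hooks into your router and monitoring:

```python
# Minimal sketch of a stepwise canary ramp. The helper functions are hypothetical:
# route_traffic() would update the load balancer / router, canary_is_healthy()
# would query monitoring for error-rate and latency regressions.
import time

CANARY_STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]   # fraction of traffic on the new model
SOAK_SECONDS = 30 * 60                           # observation window per step

def run_canary(route_traffic, canary_is_healthy, rollback) -> bool:
    """Ramp traffic step by step; roll back at the first sign of trouble."""
    for fraction in CANARY_STEPS:
        route_traffic(new_model_fraction=fraction)
        time.sleep(SOAK_SECONDS)                 # let metrics accumulate at this step
        if not canary_is_healthy():
            rollback()                           # revert all traffic to the old model
            return False
    return True                                  # new model now serves 100% of traffic
```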
A/B Testing as Deployment:
For models with measurable business impact, deployment can be structured as an A/B test: hold out a control group on the incumbent model (or no model), serve the new model to the treatment group, and compare the pre-registered business metrics once enough traffic has accumulated.
This approach provides causal evidence of model value, but it requires a longer deployment timeline and sufficient traffic for statistical power.
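For a binary success metric such as conversion, the significance check can be a simple two-proportion z-test. The sketch below uses illustrative counts and assumes the significance level was fixed before the experiment started:

```python
# Minimal sketch: two-proportion z-test for a deployment framed as an A/B test.
# Assumes a binary success metric (e.g., conversion); the counts are illustrative.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Return the two-sided p-value for the difference in success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)        # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Example: control (old model) vs. treatment (new model)
p_value = two_proportion_z_test(success_a=1150, n_a=25_000, success_b=1290, n_b=25_000)
print(f"p-value: {p_value:.4f}")   # compare against the pre-registered alpha, e.g. 0.05
```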
Every deployment must have a tested rollback path. If the new model causes issues, you must be able to revert within minutes, not hours. Rollback procedures should be automated and tested before launch—never rely on untested manual procedures during an incident.
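A sketch of what an automated, pre-tested rollback guard can look like; the metrics_client and model_registry interfaces are hypothetical stand-ins for your serving platform's real APIs, and the limits mirror the blocking thresholds above:

```python
# Minimal sketch of an automated rollback guard. The metrics_client and
# model_registry objects are hypothetical stand-ins for real platform APIs.
import time

ERROR_RATE_LIMIT = 0.001          # from the launch criteria: error rate < 0.1%
LATENCY_P99_LIMIT_MS = 200        # from the launch criteria: P99 < 200 ms

def rollback_guard(metrics_client, model_registry, previous_version: str,
                   check_interval_s: int = 60, checks: int = 60) -> None:
    """Watch post-launch metrics and revert automatically if SLOs are breached."""
    for _ in range(checks):
        error_rate = metrics_client.error_rate(window="5m")
        latency_p99 = metrics_client.latency_p99_ms(window="5m")
        if error_rate > ERROR_RATE_LIMIT or latency_p99 > LATENCY_P99_LIMIT_MS:
            # Revert first, investigate second: the rollback path was tested pre-launch.
            model_registry.set_serving_version(previous_version)
            return
        time.sleep(check_interval_s)
```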
Launching is the beginning of monitoring, not the end of the journey. ML models degrade over time due to data drift, concept drift, and changing user behavior. Monitoring must detect issues before they cause significant harm.
The Monitoring Stack:
| Layer | What to Monitor | Alert Thresholds |
|---|---|---|
| Infrastructure | Latency, throughput, error rates, memory | P99 latency > 2× baseline |
| Data | Input distributions, missing values, schema | Distribution drift > threshold |
| Model | Prediction distributions, confidence scores | Unexpected prediction patterns |
| Business | Downstream metric impact | Significant metric degradation |
| Feedback | Ground truth when available | Accuracy drop below threshold |
Prediction distribution and feature drift are leading indicators—they signal potential problems before business impact. Business metrics are lagging indicators—by the time they move, damage is done. Monitor leading indicators for early warning; alert on lagging indicators for confirmed issues.
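One common leading-indicator check is the Population Stability Index (PSI) between a baseline distribution (training data or launch week) and current production traffic, applied to input features or to the prediction distribution itself. A minimal sketch with synthetic data; the bin count and the conventional 0.1/0.2 alert levels are assumptions to tune for your use case:

```python
# Minimal PSI (Population Stability Index) sketch for feature / prediction drift.
# Alert levels are conventions, not laws: ~0.1 is often "watch", ~0.2 "investigate".
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               n_bins: int = 10, eps: float = 1e-6) -> float:
    """Compare two samples of one variable using bins fit on the expected sample."""
    # Bin edges come from the reference (training / launch-baseline) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Widen the outer edges so production values outside the baseline range still count.
    edges[0] = min(edges[0], actual.min()) - eps
    edges[-1] = max(edges[-1], actual.max()) + eps
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    expected_pct = np.clip(expected_pct, eps, None)   # avoid log(0)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Example with synthetic data: the production distribution has shifted slightly.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 50_000)
production = rng.normal(0.3, 1.1, 50_000)
print(f"PSI: {population_stability_index(baseline, production):.3f}")
```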
Despite careful validation, production incidents happen. The quality of incident response determines whether a minor issue becomes a major outage.
The Incident Response Playbook:
Rollback Decision Matrix:
| Severity | Symptoms | Action | Timeline |
|---|---|---|---|
| Critical | Complete model failure, revenue impact | Immediate rollback | < 5 minutes |
| High | Significant quality degradation | Rollback + investigate | < 15 minutes |
| Medium | Noticeable issues, limited impact | Investigate first, rollback if needed | < 1 hour |
| Low | Minor anomalies, no user impact | Monitor closely, no rollback | Monitor |
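The matrix above can be encoded directly in incident tooling so the on-call response is pre-agreed rather than improvised. A minimal sketch; the severity names mirror the table and everything else is illustrative:

```python
# Minimal sketch: the rollback decision matrix as data, so on-call tooling can
# surface the pre-agreed action instead of relying on in-the-moment judgment.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass(frozen=True)
class IncidentPolicy:
    action: str
    deadline_minutes: Optional[int]   # None = no hard deadline, monitor only

ROLLBACK_POLICY = {
    Severity.CRITICAL: IncidentPolicy("immediate rollback", 5),
    Severity.HIGH:     IncidentPolicy("rollback, then investigate", 15),
    Severity.MEDIUM:   IncidentPolicy("investigate first, rollback if needed", 60),
    Severity.LOW:      IncidentPolicy("monitor closely, no rollback", None),
}

def respond(severity: Severity) -> IncidentPolicy:
    """Look up the pre-agreed response for a triaged severity level."""
    return ROLLBACK_POLICY[severity]

print(respond(Severity.HIGH))
```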
Post-Deployment Checklist:
After every launch:
✓ Confirm monitoring dashboards show expected patterns
✓ Verify no anomalous alerts in the first hour
✓ Check downstream systems for unexpected behavior
✓ Review sample predictions for sanity
✓ Confirm rollback mechanisms remain functional
✓ Document any observations or concerns
Post-Mortem Culture:
Every significant incident should produce a blameless post-mortem that documents what happened, the timeline of detection and response, the root cause, the user and business impact, and the follow-up actions with owners.
During incidents, err on the side of rolling back. The cost of unnecessary rollback (lost experiment time) is always less than the cost of extended production issues. You can always re-deploy after investigation; you can't undo user harm or revenue loss.
Module Complete:
You have now completed the ML Project Management module. You understand problem scoping, data collection planning, experiment management, iteration strategies, and launch criteria—the comprehensive skill set that enables successful ML projects from inception to production.
These project management capabilities are as important as technical ML skills. Many models fail not due to algorithmic limitations but due to poor problem definition, inadequate data planning, chaotic experimentation, or premature launch. You now have the frameworks to avoid these failures.
Congratulations! You have mastered ML Project Management. From problem scoping to launch criteria, you now possess the systematic approach that distinguishes successful ML practitioners. Apply these frameworks to your projects, and continuously refine them based on your experience.