Deploying a model to production is not the finish line—it's the starting line. Unlike traditional software that behaves consistently until code changes, ML models operate on a moving target. The data that feeds them shifts. The world that generated the training data evolves. User behavior changes. Competitors adapt. The model that performed brilliantly at launch can quietly degrade until it's actively harming the business.
Monitoring and maintenance for ML systems is fundamentally different from traditional software operations. You're not just watching for crashes and latency—you're watching for silent failures where the system continues running but predictions become worthless. You're detecting drift that accumulates gradually over weeks or months. You're making decisions about when and how to retrain, and how to validate that retraining actually helps.
This page covers the principles, techniques, and operational practices for maintaining healthy ML systems throughout their lifecycle.
By the end of this page, you will understand how to monitor ML systems in production—detecting data drift, model degradation, and silent failures. You'll learn when and how to retrain models, how to automate the ML lifecycle, and how to maintain system health over months and years of operation.
ML monitoring operates across three distinct layers, each catching different types of problems. Comprehensive monitoring requires coverage of all three.
The Three Layers of ML Monitoring:
| Layer | What It Monitors | Failure Symptoms | Detection Speed |
|---|---|---|---|
| Infrastructure | Servers, network, resources | Crashes, timeouts, resource exhaustion | Seconds to minutes |
| Model Performance | Prediction quality, accuracy | Accuracy drops, biased predictions | Hours to days |
| Business Impact | Business metrics, outcomes | Revenue decline, user complaints | Days to weeks |
Layer 1: Infrastructure Monitoring
Traditional system monitoring—necessary but not sufficient for ML:
| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| Latency (P50, P95, P99) | P99 > 200ms | User experience, SLA compliance |
| Error rate | > 0.1% | Service reliability |
| Throughput (QPS) | < 80% baseline | Capacity issue or traffic drop |
| CPU/GPU utilization | > 90% sustained | Capacity planning |
| Memory usage | > 85% | Potential OOM crashes |
| Request queue depth | Growing trend | Processing bottleneck |
Layer 2: Model Performance Monitoring
ML-specific metrics that track prediction quality:
| Metric Category | Specific Metrics | Detection Goal |
|---|---|---|
| Prediction statistics | Mean, variance, distribution | Detect output drift |
| Feature statistics | Values, distributions, nulls | Detect input drift |
| Error analysis | Error rates by segment | Detect targeted degradation |
| Confidence distribution | Entropy, calibration | Detect uncertainty changes |
| Comparison to baseline | A/B metrics, reference model | Detect relative degradation |
Layer 3: Business Impact Monitoring
Connect ML predictions to business outcomes:
| Business Metric | Connection to ML | Monitoring Approach |
|---|---|---|
| Conversion rate | Recommendation quality | A/B test, trend analysis |
| False positive cost | Classification threshold | Track operational costs |
| Customer churn | Prediction accuracy | Compare predicted vs. actual |
| Revenue per user | Personalization effectiveness | Cohort analysis |
The most dangerous failures are those that affect business metrics but not infrastructure metrics. The model keeps serving predictions, latency stays green, error rates are zero—but conversion rate drops 5% and nobody notices for weeks. Always connect model monitoring to business outcomes, even if the connection requires delayed analysis.
Drift is the fundamental challenge of ML maintenance. The statistical properties of production data change over time, causing models trained on historical data to become increasingly inaccurate.
Types of Drift:
- **Covariate (data) drift:** The input distribution P(X) changes while the input-to-label relationship stays stable.
- **Label (prior) drift:** The label distribution P(Y) shifts—for example, a rising base churn rate.
- **Concept drift:** The relationship P(Y|X) itself changes, so the same inputs now map to different outcomes.
Drift Detection Methods:
Statistical Tests:
For numerical features: two-sample Kolmogorov-Smirnov (KS) test or Population Stability Index (PSI).
For categorical features: chi-square test or Cramér's V on category frequencies.
Example: PSI Calculation

```python
import numpy as np

def calculate_psi(expected, actual, buckets=10):
    """
    Calculate Population Stability Index.
    PSI < 0.1: No significant shift
    0.1 <= PSI < 0.2: Moderate shift (investigate)
    PSI >= 0.2: Significant shift (action required)
    """
    # Bucket edges come from the expected (reference) distribution
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    # Open the outer edges so actual values outside the reference
    # range still land in the first/last bucket instead of being dropped
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)
    # Avoid division by zero in empty buckets
    expected_percents = np.maximum(expected_percents, 0.001)
    actual_percents = np.maximum(actual_percents, 0.001)
    psi = np.sum(
        (actual_percents - expected_percents) *
        np.log(actual_percents / expected_percents)
    )
    return psi
```
Window-Based Monitoring:
┌─────────────────────────────────────────────────────────────────┐
│ Time Windows │
│ │
│ Training │ Reference │ Current │ Future │
│ (historical)│ (baseline) │ (monitored) │ (incoming) │
│ │ │ │ │
│ ═══════════ │ ─────────── │ ▓▓▓▓▓▓▓▓▓▓▓▓▓│ │
│ │ │ │ │
│ Compare reference window to current window │
│ to detect shift │
└─────────────────────────────────────────────────────────────────┘
Typical window configurations: a fixed reference window (commonly the training period or the last 30 days) compared against a sliding current window (commonly the last 24 hours or 7 days), recomputed on each monitoring run.
Detection Methods by Feature Type:
| Feature Type | Detection Method | Alert Threshold | Example |
|---|---|---|---|
| Numeric (continuous) | KS test, PSI | p < 0.01 or PSI > 0.2 | User age distribution shift |
| Numeric (bounded) | Z-score on mean/std | 3 sigma from baseline | Session duration change |
| Categorical (low card.) | Chi-square, Cramér's V | p < 0.01 | Device type distribution |
| Categorical (high card.) | Embedding cosine distance | Distance > threshold | New product categories |
| Text | Vocabulary OOV rate | > 10% new tokens | New slang, topics |
| Model output | Prediction distribution | KS or PSI on scores | Score distribution shift |
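The numeric and categorical tests from the table can be sketched with `scipy.stats` (a sketch under the assumption that scipy is available; the wrapper function names here are illustrative, not a standard API):

```python
import numpy as np
from scipy import stats

def detect_numeric_drift(reference, current, alpha=0.01):
    """Two-sample KS test on a numeric feature."""
    statistic, p_value = stats.ks_2samp(reference, current)
    return {"statistic": statistic, "p_value": p_value, "drifted": p_value < alpha}

def detect_categorical_drift(reference_counts, current_counts, alpha=0.01):
    """Chi-square test on category frequency counts (same category order)."""
    table = np.array([reference_counts, current_counts])
    _, p_value, _, _ = stats.chi2_contingency(table)
    return {"p_value": p_value, "drifted": p_value < alpha}

rng = np.random.default_rng(0)
same = detect_numeric_drift(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
shifted = detect_numeric_drift(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000))
cat_same = detect_categorical_drift([100, 200, 300], [110, 190, 310])
```

Note that with large samples even tiny, harmless shifts become statistically significant, which is why PSI-style effect-size thresholds are often preferred at scale.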
Drift detection tells you inputs have changed; it doesn't tell you whether model performance has degraded. A model can experience significant drift but remain accurate if it generalizes well. Conversely, small drift in critical features can cause severe degradation. Always pair drift detection with outcome monitoring.
Detecting model performance degradation is challenging because ground truth is often delayed or unavailable. Different strategies apply depending on when (or if) you receive labels.
Ground Truth Availability Scenarios:
| Scenario | Example | Detection Strategy | Detection Delay |
|---|---|---|---|
| Immediate labels | Click-through prediction | Direct accuracy tracking | Minutes |
| Delayed labels (hours) | Conversion prediction | Leading indicator proxies | Hours |
| Delayed labels (days) | Churn prediction (30-day) | Historical label comparison | Days-weeks |
| Delayed labels (months) | Loan default prediction | Early warning signals | Months |
| No labels | Unsupervised anomaly detection | Indirect quality signals | Varies |
Strategy 1: Direct Accuracy Monitoring (When Labels Are Available)
```python
# Track accuracy metrics over time
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score
)

class AccuracyMonitor:
    def __init__(self, window_size=3600):  # 1-hour window, in seconds
        self.predictions = []  # assumed to be probabilities in [0, 1]
        self.labels = []
        self.timestamps = []
        self.window_size = window_size

    def record(self, prediction, label, timestamp):
        self.predictions.append(prediction)
        self.labels.append(label)
        self.timestamps.append(timestamp)
        self._trim_old_records(timestamp)

    def _trim_old_records(self, now):
        # Drop records that have aged out of the sliding window
        cutoff = now - self.window_size
        while self.timestamps and self.timestamps[0] < cutoff:
            self.predictions.pop(0)
            self.labels.pop(0)
            self.timestamps.pop(0)

    def get_current_metrics(self):
        if len(self.predictions) < 100:  # Minimum sample size
            return None
        # Threshold scores for label-based metrics; AUC needs the raw scores
        hard_preds = [int(p >= 0.5) for p in self.predictions]
        return {
            'accuracy': accuracy_score(self.labels, hard_preds),
            'precision': precision_score(self.labels, hard_preds),
            'recall': recall_score(self.labels, hard_preds),
            'auc': roc_auc_score(self.labels, self.predictions),
            'sample_size': len(self.predictions),
        }

    def check_degradation(self, baseline_metrics, threshold=0.05):
        current = self.get_current_metrics()
        if current is None:
            return None
        degradations = {}
        for metric, baseline_value in baseline_metrics.items():
            current_value = current.get(metric, 0)
            relative_drop = (baseline_value - current_value) / baseline_value
            if relative_drop > threshold:
                degradations[metric] = {
                    'baseline': baseline_value,
                    'current': current_value,
                    'drop': relative_drop,
                }
        return degradations
```
Strategy 2: Proxy Metrics (When Labels Are Delayed)
Use correlated, faster-available metrics as leading indicators:
| Delayed Metric | Proxy Metrics | Rationale |
|---|---|---|
| 30-day churn | 7-day engagement drop, support tickets | Early churn signals |
| 90-day loan default | Payment delay, account activity | Financial stress indicators |
| Purchase conversion | Add-to-cart rate, session depth | Purchase funnel metrics |
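Before relying on a proxy, it is worth confirming from history that it actually tracks the delayed outcome. A minimal sketch (the data and the `validate_proxy` helper are hypothetical):

```python
import numpy as np

def validate_proxy(proxy_values, delayed_outcomes, min_correlation=0.5):
    """Check that a candidate proxy metric historically correlates with
    the delayed outcome strongly enough to serve as a leading indicator."""
    r = np.corrcoef(proxy_values, delayed_outcomes)[0, 1]
    return {"correlation": r, "usable": abs(r) >= min_correlation}

# Hypothetical history: 7-day engagement drop vs. observed 30-day churn
engagement_drop = np.array([0.1, 0.4, 0.8, 0.2, 0.9, 0.05, 0.7, 0.3])
churned = np.array([0, 0, 1, 0, 1, 0, 1, 0])
result = validate_proxy(engagement_drop, churned)
```

A simple correlation check like this should be revalidated periodically: the proxy-to-outcome relationship can itself drift.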
Strategy 3: Prediction Distribution Monitoring (No Labels)
Even without labels, significant changes in prediction patterns suggest issues:
- Shifts in the mean or variance of prediction scores (track with KS or PSI, as for inputs)
- Collapse or spikes in prediction confidence or entropy
- Sudden changes in the rate of positive predictions or in predicted class balance
Reference Model Comparison:
Maintain a reference model to compare production predictions:
Production Request
│
├──→ [Production Model] ──→ Production Prediction
│ │
└──→ [Reference Model] ──→ Reference Prediction
│
▼
[Comparison Analysis]
Alert if: Production significantly diverges from Reference
AND we believe Reference is still accurate
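The comparison step in the diagram can be sketched as follows (an illustrative sketch; the thresholds and function name are examples, not recommendations):

```python
import numpy as np

def compare_to_reference(prod_scores, ref_scores,
                         max_mean_gap=0.1, min_agreement=0.9, threshold=0.5):
    """Compare production predictions against a trusted reference model
    on the same requests; flag divergence worth investigating."""
    prod = np.asarray(prod_scores, dtype=float)
    ref = np.asarray(ref_scores, dtype=float)
    mean_gap = abs(prod.mean() - ref.mean())
    # Fraction of requests where both models make the same hard decision
    agreement = np.mean((prod >= threshold) == (ref >= threshold))
    return {
        "mean_gap": mean_gap,
        "agreement": agreement,
        "diverged": mean_gap > max_mean_gap or agreement < min_agreement,
    }

healthy = compare_to_reference([0.2, 0.8, 0.6, 0.1], [0.25, 0.75, 0.55, 0.15])
```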
Reference model can be:
- The previous production version (known-good behavior)
- A simpler, more stable baseline model
- A frozen snapshot of the current model from before the suspected change
Segment-Level Monitoring:
Aggregate metrics can hide segment-level problems:
Overall Accuracy: 92% ✓
By User Segment:
- Power Users: 95% ✓
- New Users: 70% ✗ ← Problem hidden in aggregate!
- Mobile Users: 85% ✓
- Desktop Users: 94% ✓
Monitor metrics for important segments independently:
- New vs. established users
- Platform (mobile, desktop, app version)
- Geography and language
- High-value customers and other business-critical cohorts
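Segment-level tracking can be sketched in plain Python (illustrative; real systems typically compute this in the metrics pipeline):

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """records: iterable of (segment, prediction, label) tuples.
    Returns overall accuracy plus per-segment accuracy, so a weak
    segment cannot hide inside a healthy aggregate."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, pred, label in records:
        total[segment] += 1
        correct[segment] += int(pred == label)
    per_segment = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_segment

# Toy data: aggregate looks fine while the "new" segment struggles
records = (
    [("power", 1, 1)] * 19 + [("power", 1, 0)] * 1 +   # 95% accurate
    [("new", 1, 1)] * 7 + [("new", 0, 1)] * 3          # 70% accurate
)
overall, per_segment = accuracy_by_segment(records)
```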
Model predictions can influence the data you use to evaluate the model. If recommendations determine what users see, and user interactions become training labels, you can't directly measure what users would have wanted without the model. This requires careful experimental design (holdout groups, exploration) to obtain unbiased evaluation signals.
Effective alerting turns monitoring data into actionable notifications. Too many alerts lead to alert fatigue; too few miss real problems. The goal is high signal-to-noise ratio with fast detection of genuine issues.
Alert Design Principles:
- Alert on symptoms (what users or the business experience), not only on internal causes
- Make every alert actionable; if nothing can be done, it belongs on a dashboard instead
- Route notifications by severity so urgent issues page a human and minor ones don't
| Severity | Response Time | Examples | Notification Channel |
|---|---|---|---|
| Critical (P1) | Immediate | Model serving down, severe accuracy drop | Page on-call, war room |
| High (P2) | < 1 hour | Significant performance degradation, data pipeline failure | Slack + page if no ack |
| Medium (P3) | < 4 hours | Moderate drift detected, minor accuracy drop | Slack notification |
| Low (P4) | < 24 hours | Slight distribution shift, non-critical feature issues | Email, ticket |
| Informational | Next business day | Scheduled retraining reminder, capacity planning | Dashboard, weekly report |
ML Incident Response Playbook:
Phase 1: Detection and Triage (< 15 minutes)
Confirm the alert is real:
- Check sample sizes (small windows produce noisy metrics)
- Rule out monitoring or data pipeline glitches
- Compare against recent baselines and adjacent metrics
Assess impact:
- How many predictions and users are affected?
- Are downstream systems or campaigns consuming these predictions?
Decide on immediate action:
- Severe impact: roll back or fail over to a fallback now
- Moderate impact: mitigate while investigating
- Minor impact: monitor closely and investigate
Phase 2: Stabilization (< 1 hour)
Implement mitigation:
- Roll back to the last known-good model version, or
- Switch to a rule-based fallback or cached predictions
Preserve evidence:
- Snapshot recent inputs, predictions, logs, and feature values before they age out of retention
Phase 3: Investigation (hours to days)
Root cause analysis:
- Trace the failure through data pipelines, features, model version, and serving infrastructure
- Check for recent deployments, upstream schema changes, and external events
Remediation:
- Fix the root cause (pipeline bug, bad retrain, config error) and validate before redeploying
Phase 4: Postmortem and Prevention (< 1 week)
Document the incident:
- Timeline, impact, root cause, and what worked or failed in detection and response
Identify preventive actions:
- New alerts, validation checks, or guardrails that would have caught the issue earlier
Implement improvements:
- Turn action items into tracked work with owners and deadlines
The most important incident response skill is quickly deciding whether to rollback. If you're unsure about the cause but the impact is significant, rollback first and investigate later. A fast rollback limits blast radius; a slow investigation while users suffer compounds the damage.
Models degrade over time. Retraining refreshes the model with recent data, adapting to current patterns. The key decisions are when to retrain and how to retrain.
Retraining Triggers:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Scheduled | Retrain on fixed cadence (daily, weekly) | Predictable, simple | May retrain unnecessarily or too late |
| Performance-triggered | Retrain when accuracy drops below threshold | Efficient, responsive | Requires reliable monitoring |
| Drift-triggered | Retrain when data drift exceeds threshold | Proactive, catches early | Drift may not affect accuracy |
| Data-triggered | Retrain when significant new data available | Uses fresh data efficiently | Depends on data arrival patterns |
| Hybrid | Scheduled + triggered early when needed | Balanced approach | More complex to implement |
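The hybrid strategy from the table can be sketched as a small policy function (illustrative; the thresholds are examples, not recommendations):

```python
from datetime import datetime, timedelta

def should_retrain(last_trained, now, max_age=timedelta(days=7),
                   psi=None, psi_threshold=0.2,
                   accuracy_drop=None, drop_threshold=0.05):
    """Hybrid retraining policy: retrain on schedule, or early when
    drift or degradation crosses its threshold. Returns (decision, reason)."""
    if accuracy_drop is not None and accuracy_drop > drop_threshold:
        return True, "performance"
    if psi is not None and psi > psi_threshold:
        return True, "drift"
    if now - last_trained >= max_age:
        return True, "schedule"
    return False, "none"

now = datetime(2024, 1, 10)
decision, reason = should_retrain(datetime(2024, 1, 8), now, psi=0.35)
```

Ordering matters: confirmed performance degradation is the strongest signal, so it is checked before drift, which is checked before the fallback schedule.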
Retraining Approaches:
Full Retraining: Train from scratch on all available data.
[All Historical Data] → [Training] → [New Model]
Incremental/Online Update: Update existing model with new data.
[Existing Model] + [New Data] → [Update] → [Updated Model]
Sliding Window: Retrain on a fixed-length window of recent data.
[Last N Months of Data] → [Training] → [New Model]
Window slides forward: discard oldest data, add newest
Weighted/Decay Training: Weight recent data more heavily than old data.
Sample weights: w(t) = e^(-λ * age(t))
Older samples contribute less to training objective
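The decay formula can be illustrated directly; parameterizing λ by a half-life is one common choice (a sketch, not a prescribed setting):

```python
import numpy as np

def decay_weights(ages_days, half_life_days=30.0):
    """Exponential-decay sample weights w(t) = exp(-lambda * age),
    with lambda chosen so the weight halves every `half_life_days`."""
    lam = np.log(2) / half_life_days
    return np.exp(-lam * np.asarray(ages_days, dtype=float))

# Samples aged 0, 30, 60, and 90 days get weights 1, 0.5, 0.25, 0.125
weights = decay_weights([0, 30, 60, 90])
```

Most training libraries accept these values via a `sample_weight`-style argument, so recency weighting requires no change to the model itself.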
Retraining Validation:
Never deploy a retrained model without validation:
Offline evaluation: compare the retrained model to the current model on held-out recent data, using the same metrics and thresholds.
Shadow deployment: serve the new model on live traffic in parallel without affecting users; compare its predictions and latency to the current model.
Canary deployment: route a small fraction of traffic to the new model and expand progressively only while health metrics hold.
Retraining Pipeline Example:
Trigger (scheduled/drift)
│
▼
[Data Collection] ←── Collect recent labeled data
│
▼
[Data Validation] ←── Check data quality
│
▼
[Feature Engineering] ←── Compute training features
│
▼
[Model Training] ←── Train with current hyperparameters
│
▼
[Offline Evaluation] ←── Compare to baseline
│
├── Fail → Alert, investigate
│
▼ Pass
[Model Registration] ←── Version, store artifacts
│
▼
[Shadow Deployment] ←── Validate on production traffic
│
├── Fail → Rollback, investigate
│
▼ Pass
[Canary Deployment] ←── Progressive rollout
│
▼
[Full Production] ←── New model is live
Sometimes retraining makes things worse—corrupted training data, bugs in feature engineering, or concept drift that confuses the model. Always have automated checks that prevent deploying a model that's significantly worse than the current one. Define minimum performance thresholds that must be met.
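As a sketch, such an automated gate might check candidate metrics against both absolute minimums and the current model (the function name and thresholds are illustrative):

```python
def validation_gate(candidate, current, min_absolute=None, max_relative_drop=0.02):
    """Block deployment of a retrained model that misses minimum
    thresholds or regresses too far from the current model.
    `candidate` and `current` map metric names to values."""
    min_absolute = min_absolute or {}
    failures = []
    for metric, current_value in current.items():
        cand_value = candidate.get(metric, 0.0)
        if metric in min_absolute and cand_value < min_absolute[metric]:
            failures.append(f"{metric} below absolute minimum")
        if current_value > 0 and (current_value - cand_value) / current_value > max_relative_drop:
            failures.append(f"{metric} regressed vs current model")
    return len(failures) == 0, failures

ok, why = validation_gate(
    candidate={"auc": 0.88, "accuracy": 0.91},
    current={"auc": 0.87, "accuracy": 0.92},
    min_absolute={"auc": 0.85},
)
bad_ok, bad_why = validation_gate({"auc": 0.80}, {"auc": 0.87}, {"auc": 0.85})
```

The dual check matters: an absolute floor catches models that were never good, while the relative check catches regressions even when metrics still clear the floor.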
Manual ML operations don't scale. As models multiply and retraining becomes routine, automation is essential to maintain quality and velocity. MLOps brings DevOps principles to ML systems.
The ML Automation Maturity Model:
| Level | Characteristics | Training | Deployment | Monitoring |
|---|---|---|---|---|
| Level 0: Manual | Everything by hand | Notebooks | Manual copy | Manual checks |
| Level 1: Scripts | Repeatable scripts | Script-based | CI triggers | Basic dashboards |
| Level 2: Pipelines | Automated workflows | Orchestrated | Automated + approval | Automated alerts |
| Level 3: Continuous | Full automation | Continuous training | Continuous deployment | Continuous monitoring |
| Level 4: Autonomous | Self-healing systems | Auto-triggered | Auto-rollout/rollback | Auto-remediation |
CI/CD for ML:
Continuous Integration (CI):
Code Change → [Build] → [Unit Tests] → [Integration Tests] → Merge
│ │
│ └── Data validation tests
│ └── Feature transform tests
│ └── Model inference tests
│
└── Code quality, linting
└── Security scanning
Continuous Training (CT):
Trigger → [Data Pipeline] → [Train] → [Evaluate] → [Register] → Staging
│
├── Scheduled (daily/weekly)
├── Data-triggered (new data arrived)
├── Performance-triggered (degradation detected)
└── Manual (on-demand)
Continuous Deployment (CD):
Staging Model → [Shadow Test] → [Canary] → [Progressive Rollout] → Production
│ │ │
Auto-validation Auto-validation Auto-validation
│ │ │
Fail → Halt Fail → Rollback Fail → Rollback
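The progressive rollout with auto-rollback above can be sketched as a simple control loop (illustrative; `health_check` stands in for whatever validation your platform runs at each stage):

```python
def progressive_rollout(stages, health_check):
    """Advance traffic through canary stages (traffic percentages),
    rolling back to 0% if any stage's health check fails."""
    for pct in stages:
        if not health_check(pct):
            return {"final_pct": 0, "status": f"rolled_back_at_{pct}"}
    return {"final_pct": stages[-1], "status": "fully_rolled_out"}

# Hypothetical health check: healthy up to 25% traffic, fails at 100%
result = progressive_rollout([1, 5, 25, 100], lambda pct: pct <= 25)
full = progressive_rollout([1, 5, 25, 100], lambda pct: True)
```

In practice each stage would also wait long enough to accumulate a statistically meaningful sample before advancing.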
Key Automation Infrastructure:
| Component | Purpose | Examples |
|---|---|---|
| Workflow Orchestration | Pipeline execution, dependencies | Airflow, Kubeflow, Prefect |
| Feature Store | Feature management, serving | Feast, Tecton, SageMaker |
| Model Registry | Model versioning, artifacts | MLflow, SageMaker, Weights & Biases |
| Experiment Tracking | Metrics, hyperparameters, comparison | MLflow, W&B, Neptune |
| Model Serving | Inference infrastructure | TensorFlow Serving, Triton, Seldon |
| Monitoring | Metrics, alerts, dashboards | Prometheus, Grafana, custom |
Infrastructure as Code for ML:
Define ML infrastructure declaratively:
```yaml
# Example: ML pipeline configuration
pipeline:
  name: churn-prediction-training
  schedule: "0 2 * * *"  # Daily at 2 AM
  stages:
    - name: data-preparation
      image: data-prep:v1
      resources:
        cpu: "4"
        memory: "16Gi"
      inputs:
        - source: s3://data/transactions
          date_range: last_90_days
      outputs:
        - destination: s3://features/training
    - name: model-training
      image: training:v2
      resources:
        gpu: "1"  # NVIDIA T4
        memory: "32Gi"
      inputs:
        - source: s3://features/training
      hyperparameters:
        learning_rate: 0.001
        epochs: 100
        batch_size: 256
      outputs:
        - destination: s3://models/churn
    - name: model-evaluation
      image: evaluation:v1
      inputs:
        - model: s3://models/churn
        - validation_data: s3://features/validation
      thresholds:
        accuracy: 0.85
        auc: 0.90
      on_failure: alert_and_halt
    - name: model-registration
      image: registry:v1
      inputs:
        - model: s3://models/churn
      destination: model-registry/churn-prediction
      promote_to: staging
```
This configuration:
- Runs the pipeline daily at 2 AM on a cron schedule
- Declares per-stage container images and resource requirements
- Encodes evaluation thresholds (accuracy 0.85, AUC 0.90) as a deployment gate
- Registers passing models and promotes them to staging automatically
Full automation (Level 3-4) requires significant investment. Start by making current processes repeatable with scripts (Level 1), then add orchestration (Level 2). Only pursue continuous automation when you have the volume and maturity to justify it. A single model doesn't need the same infrastructure as a platform serving hundreds of models.
ML systems are complex, and knowledge about them often exists only in the heads of their creators. When team members leave or forget, this knowledge is lost. Systematic documentation preserves institutional knowledge and enables effective maintenance.
Model Cards:
Model cards standardize model documentation:
```markdown
# Model Card: Churn Prediction v2.3

## Model Details
- **Developed by:** ML Team
- **Model date:** 2024-01-15
- **Model version:** 2.3.0
- **Model type:** Gradient Boosted Trees (LightGBM)
- **Training data:** User transactions Jan 2023 - Dec 2023

## Intended Use
- **Primary use:** Predict 30-day churn probability for subscription users
- **Out-of-scope:** Free tier users, B2B accounts, new users (<7 days)

## Metrics
| Metric | Value | Threshold |
|--------|-------|-----------|
| AUC-ROC | 0.87 | > 0.85 |
| Precision@10% | 0.45 | > 0.40 |
| Recall@10% | 0.62 | > 0.50 |

## Training Data
- **Size:** 1.2M users, 14.3% churn rate
- **Features:** 47 features covering engagement, billing, support
- **Label definition:** No activity for 30+ consecutive days

## Ethical Considerations
- **Fairness:** Tested for demographic parity across age groups
- **Privacy:** No PII used directly; only behavioral features

## Limitations
- Performs poorly for users with <10 sessions (AUC drops to 0.72)
- Seasonal effects not fully captured (holiday periods)
- Does not account for promotional offers

## Caveats and Recommendations
- Retrain monthly to maintain accuracy
- Monitor for drift in session features
- High-value user predictions should be reviewed manually
```
Operational Runbooks:
Document procedures for common operational tasks:
```markdown
# Runbook: Churn Model Degradation Alert

## Trigger
- Alert: "Churn model AUC dropped below 0.82 (threshold: 0.85)"

## Immediate Actions
1. **Verify alert is genuine:**
   - Check dashboard for sample size (< 1000 may be noise)
   - Compare to previous day's metrics
   - Check if there are infrastructure issues
2. **Assess impact:**
   - How many predictions affected?
   - Are intervention campaigns running on this model?
3. **Decide on mitigation:**
   - If degradation > 10%: Consider rollback to previous version
   - If degradation 5-10%: Monitor closely, prepare rollback
   - If degradation < 5%: Continue monitoring

## Investigation Steps
1. Check for data issues:
   - Feature freshness in feature store
   - Data pipeline success status
   - Feature distribution shifts
2. Check for model issues:
   - Recent model deployments
   - Serving infrastructure health
   - Memory/CPU utilization
3. Check for external factors:
   - Seasonal effects
   - Product changes
   - Marketing campaigns

## Escalation
- If cause unclear after 30 minutes: Page senior ML engineer
- If rollback fails: Page infrastructure on-call
- If business impact significant: Notify product stakeholders

## Recovery Validation
- Confirm metrics return to baseline
- Run smoke tests on predictions
- Document root cause for postmortem
```
| Document Type | Purpose | Update Frequency | Owner |
|---|---|---|---|
| Model Card | Model overview, capabilities, limitations | Each model version | ML Engineer |
| Architecture Doc | System design, components, dependencies | Major changes | Tech Lead |
| Runbooks | Operational procedures for incidents | After each incident | On-call rotation |
| Training Guide | How to retrain and validate | Pipeline changes | ML Engineer |
| Data Dictionary | Feature definitions, sources, schemas | Feature additions | Data Engineer |
| Experiment Log | A/B tests, results, decisions | Each experiment | Data Scientist |
Keep documentation close to code—in the same repository, updated in the same pull requests. Documentation that lives in a separate wiki drifts from reality. Model cards can be generated automatically from training logs. Runbooks can be versioned alongside the systems they describe.
Monitoring and maintenance are what separate experimental ML projects from production ML systems. The techniques covered in this page ensure your models continue to provide value long after initial deployment—adapting to changing data, detecting silent failures, and recovering from degradation.
Module Complete:
You've now completed Module 1: ML System Design, which took you from requirements gathering through production maintenance.
These skills enable you to design and operate complete ML systems—not just train models in notebooks, but build systems that deliver value in production.
You've mastered the fundamentals of ML system design—from requirements gathering through production maintenance. You can now design, deploy, and operate ML systems that reliably deliver business value. The next module will cover ML debugging—how to diagnose and fix problems when things go wrong.