Deploying a model to production is not the finish line—it's the starting line. Unlike traditional software that behaves consistently until code changes, ML models operate on a moving target. The data that feeds them shifts. The world that generated the training data evolves. User behavior changes. Competitors adapt. The model that performed brilliantly at launch can quietly degrade until it's actively harming the business.
Monitoring and maintenance for ML systems is fundamentally different from traditional software operations. You're not just watching for crashes and latency—you're watching for silent failures where the system continues running but predictions become worthless. You're detecting drift that accumulates gradually over weeks or months. You're making decisions about when and how to retrain, and how to validate that retraining actually helps.
This page covers the principles, techniques, and operational practices for maintaining healthy ML systems throughout their lifecycle.
By the end of this page, you will understand how to monitor ML systems in production—detecting data drift, model degradation, and silent failures. You'll learn when and how to retrain models, how to automate the ML lifecycle, and how to maintain system health over months and years of operation.
ML monitoring operates across three distinct layers, each catching different types of problems. Comprehensive monitoring requires coverage of all three.
The Three Layers of ML Monitoring:
| Layer | What It Monitors | Failure Symptoms | Detection Speed |
|---|---|---|---|
| Infrastructure | Servers, network, resources | Crashes, timeouts, resource exhaustion | Seconds to minutes |
| Model Performance | Prediction quality, accuracy | Accuracy drops, biased predictions | Hours to days |
| Business Impact | Business metrics, outcomes | Revenue decline, user complaints | Days to weeks |
Layer 1: Infrastructure Monitoring
Traditional system monitoring—necessary but not sufficient for ML:
| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| Latency (P50, P95, P99) | P99 > 200ms | User experience, SLA compliance |
| Error rate | > 0.1% | Service reliability |
| Throughput (QPS) | < 80% baseline | Capacity issue or traffic drop |
| CPU/GPU utilization | > 90% sustained | Capacity planning |
| Memory usage | > 85% | Potential OOM crashes |
| Request queue depth | Growing trend | Processing bottleneck |
Layer 2: Model Performance Monitoring
ML-specific metrics that track prediction quality:
| Metric Category | Specific Metrics | Detection Goal |
|---|---|---|
| Prediction statistics | Mean, variance, distribution | Detect output drift |
| Feature statistics | Values, distributions, nulls | Detect input drift |
| Error analysis | Error rates by segment | Detect targeted degradation |
| Confidence distribution | Entropy, calibration | Detect uncertainty changes |
| Comparison to baseline | A/B metrics, reference model | Detect relative degradation |
Layer 3: Business Impact Monitoring
Connect ML predictions to business outcomes:
| Business Metric | Connection to ML | Monitoring Approach |
|---|---|---|
| Conversion rate | Recommendation quality | A/B test, trend analysis |
| False positive cost | Classification threshold | Track operational costs |
| Customer churn | Prediction accuracy | Compare predicted vs. actual |
| Revenue per user | Personalization effectiveness | Cohort analysis |
The most dangerous failures are those that affect business metrics but not infrastructure metrics. The model keeps serving predictions, latency stays green, error rates are zero—but conversion rate drops 5% and nobody notices for weeks. Always connect model monitoring to business outcomes, even if the connection requires delayed analysis.
Drift is the fundamental challenge of ML maintenance. The statistical properties of production data change over time, causing models trained on historical data to become increasingly inaccurate.
Types of Drift:
- **Covariate (data) drift:** The input distribution P(X) changes while the input-to-label relationship stays stable.
- **Label (prior) drift:** The label distribution P(Y) shifts—for example, a rising base churn rate.
- **Concept drift:** The relationship P(Y|X) itself changes, so the same inputs now map to different outcomes.
Drift Detection Methods:
Statistical Tests:
For numerical features: two-sample Kolmogorov-Smirnov (KS) test or Population Stability Index (PSI).
For categorical features: chi-square test or Cramér's V on category frequencies.
Example: PSI Calculation

```python
import numpy as np

def calculate_psi(expected, actual, buckets=10):
    """
    Calculate Population Stability Index.
    PSI < 0.1: No significant shift
    0.1 <= PSI < 0.2: Moderate shift (investigate)
    PSI >= 0.2: Significant shift (action required)
    """
    # Bucket edges come from the expected (reference) distribution
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    # Open the outer edges so actual values outside the reference
    # range still land in the first/last bucket instead of being dropped
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)
    # Avoid division by zero in empty buckets
    expected_percents = np.maximum(expected_percents, 0.001)
    actual_percents = np.maximum(actual_percents, 0.001)
    psi = np.sum(
        (actual_percents - expected_percents) *
        np.log(actual_percents / expected_percents)
    )
    return psi
```
Window-Based Monitoring:
┌─────────────────────────────────────────────────────────────────┐
│ Time Windows │
│ │
│ Training │ Reference │ Current │ Future │
│ (historical)│ (baseline) │ (monitored) │ (incoming) │
│ │ │ │ │
│ ═══════════ │ ─────────── │ ▓▓▓▓▓▓▓▓▓▓▓▓▓│ │
│ │ │ │ │
│ Compare reference window to current window │
│ to detect shift │
└─────────────────────────────────────────────────────────────────┘
Typical window configurations: a fixed reference window (commonly the training period or the last 30 days) compared against a sliding current window (commonly the last 24 hours or 7 days), recomputed on each monitoring run.
Detection Methods by Feature Type:
| Feature Type | Detection Method | Alert Threshold | Example |
|---|---|---|---|
| Numeric (continuous) | KS test, PSI | p < 0.01 or PSI > 0.2 | User age distribution shift |
| Numeric (bounded) | Z-score on mean/std | 3 sigma from baseline | Session duration change |
| Categorical (low card.) | Chi-square, Cramér's V | p < 0.01 | Device type distribution |
| Categorical (high card.) | Embedding cosine distance | Distance > threshold | New product categories |
| Text | Vocabulary OOV rate | > 10% new tokens | New slang, topics |
| Model output | Prediction distribution | KS or PSI on scores | Score distribution shift |
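The numeric and categorical tests from the table can be sketched with `scipy.stats` (a sketch under the assumption that scipy is available; the wrapper function names here are illustrative, not a standard API):

```python
import numpy as np
from scipy import stats

def detect_numeric_drift(reference, current, alpha=0.01):
    """Two-sample KS test on a numeric feature."""
    statistic, p_value = stats.ks_2samp(reference, current)
    return {"statistic": statistic, "p_value": p_value, "drifted": p_value < alpha}

def detect_categorical_drift(reference_counts, current_counts, alpha=0.01):
    """Chi-square test on category frequency counts (same category order)."""
    table = np.array([reference_counts, current_counts])
    _, p_value, _, _ = stats.chi2_contingency(table)
    return {"p_value": p_value, "drifted": p_value < alpha}

rng = np.random.default_rng(0)
same = detect_numeric_drift(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
shifted = detect_numeric_drift(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000))
cat_same = detect_categorical_drift([100, 200, 300], [110, 190, 310])
```

Note that with large samples even tiny, harmless shifts become statistically significant, which is why PSI-style effect-size thresholds are often preferred at scale.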
Drift detection tells you inputs have changed; it doesn't tell you whether model performance has degraded. A model can experience significant drift but remain accurate if it generalizes well. Conversely, small drift in critical features can cause severe degradation. Always pair drift detection with outcome monitoring.
Detecting model performance degradation is challenging because ground truth is often delayed or unavailable. Different strategies apply depending on when (or if) you receive labels.
Ground Truth Availability Scenarios:
| Scenario | Example | Detection Strategy | Detection Delay |
|---|---|---|---|
| Immediate labels | Click-through prediction | Direct accuracy tracking | Minutes |
| Delayed labels (hours) | Conversion prediction | Leading indicator proxies | Hours |
| Delayed labels (days) | Churn prediction (30-day) | Historical label comparison | Days-weeks |
| Delayed labels (months) | Loan default prediction | Early warning signals | Months |
| No labels | Unsupervised anomaly detection | Indirect quality signals | Varies |
Strategy 1: Direct Accuracy Monitoring (When Labels Are Available)
```python
# Track accuracy metrics over time
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score
)

class AccuracyMonitor:
    def __init__(self, window_size=3600):  # 1-hour window, in seconds
        self.predictions = []  # assumed to be probabilities in [0, 1]
        self.labels = []
        self.timestamps = []
        self.window_size = window_size

    def record(self, prediction, label, timestamp):
        self.predictions.append(prediction)
        self.labels.append(label)
        self.timestamps.append(timestamp)
        self._trim_old_records(timestamp)

    def _trim_old_records(self, now):
        # Drop records that have aged out of the sliding window
        cutoff = now - self.window_size
        while self.timestamps and self.timestamps[0] < cutoff:
            self.predictions.pop(0)
            self.labels.pop(0)
            self.timestamps.pop(0)

    def get_current_metrics(self):
        if len(self.predictions) < 100:  # Minimum sample size
            return None
        # Threshold scores for label-based metrics; AUC needs the raw scores
        hard_preds = [int(p >= 0.5) for p in self.predictions]
        return {
            'accuracy': accuracy_score(self.labels, hard_preds),
            'precision': precision_score(self.labels, hard_preds),
            'recall': recall_score(self.labels, hard_preds),
            'auc': roc_auc_score(self.labels, self.predictions),
            'sample_size': len(self.predictions),
        }

    def check_degradation(self, baseline_metrics, threshold=0.05):
        current = self.get_current_metrics()
        if current is None:
            return None
        degradations = {}
        for metric, baseline_value in baseline_metrics.items():
            current_value = current.get(metric, 0)
            relative_drop = (baseline_value - current_value) / baseline_value
            if relative_drop > threshold:
                degradations[metric] = {
                    'baseline': baseline_value,
                    'current': current_value,
                    'drop': relative_drop,
                }
        return degradations
```
Strategy 2: Proxy Metrics (When Labels Are Delayed)
Use correlated, faster-available metrics as leading indicators:
| Delayed Metric | Proxy Metrics | Rationale |
|---|---|---|
| 30-day churn | 7-day engagement drop, support tickets | Early churn signals |
| 90-day loan default | Payment delay, account activity | Financial stress indicators |
| Purchase conversion | Add-to-cart rate, session depth | Purchase funnel metrics |
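Before relying on a proxy, it is worth confirming from history that it actually tracks the delayed outcome. A minimal sketch (the data and the `validate_proxy` helper are hypothetical):

```python
import numpy as np

def validate_proxy(proxy_values, delayed_outcomes, min_correlation=0.5):
    """Check that a candidate proxy metric historically correlates with
    the delayed outcome strongly enough to serve as a leading indicator."""
    r = np.corrcoef(proxy_values, delayed_outcomes)[0, 1]
    return {"correlation": r, "usable": abs(r) >= min_correlation}

# Hypothetical history: 7-day engagement drop vs. observed 30-day churn
engagement_drop = np.array([0.1, 0.4, 0.8, 0.2, 0.9, 0.05, 0.7, 0.3])
churned = np.array([0, 0, 1, 0, 1, 0, 1, 0])
result = validate_proxy(engagement_drop, churned)
```

A simple correlation check like this should be revalidated periodically: the proxy-to-outcome relationship can itself drift.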
Strategy 3: Prediction Distribution Monitoring (No Labels)
Even without labels, significant changes in prediction patterns suggest issues:
- Shifts in the mean or variance of prediction scores (track with KS or PSI, as for inputs)
- Collapse or spikes in prediction confidence or entropy
- Sudden changes in the rate of positive predictions or in predicted class balance
Reference Model Comparison:
Maintain a reference model to compare production predictions:
Production Request
│
├──→ [Production Model] ──→ Production Prediction
│ │
└──→ [Reference Model] ──→ Reference Prediction
│
▼
[Comparison Analysis]
Alert if: Production significantly diverges from Reference
AND we believe Reference is still accurate
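The comparison step in the diagram can be sketched as follows (an illustrative sketch; the thresholds and function name are examples, not recommendations):

```python
import numpy as np

def compare_to_reference(prod_scores, ref_scores,
                         max_mean_gap=0.1, min_agreement=0.9, threshold=0.5):
    """Compare production predictions against a trusted reference model
    on the same requests; flag divergence worth investigating."""
    prod = np.asarray(prod_scores, dtype=float)
    ref = np.asarray(ref_scores, dtype=float)
    mean_gap = abs(prod.mean() - ref.mean())
    # Fraction of requests where both models make the same hard decision
    agreement = np.mean((prod >= threshold) == (ref >= threshold))
    return {
        "mean_gap": mean_gap,
        "agreement": agreement,
        "diverged": mean_gap > max_mean_gap or agreement < min_agreement,
    }

healthy = compare_to_reference([0.2, 0.8, 0.6, 0.1], [0.25, 0.75, 0.55, 0.15])
```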
Reference model can be:
- The previous production version (known-good behavior)
- A simpler, more stable baseline model
- A frozen snapshot of the current model from before the suspected change
Segment-Level Monitoring:
Aggregate metrics can hide segment-level problems:
Overall Accuracy: 92% ✓
By User Segment:
- Power Users: 95% ✓
- New Users: 70% ✗ ← Problem hidden in aggregate!
- Mobile Users: 85% ✓
- Desktop Users: 94% ✓
Monitor metrics for important segments independently:
- New vs. established users
- Platform (mobile, desktop, app version)
- Geography and language
- High-value customers and other business-critical cohorts
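Segment-level tracking can be sketched in plain Python (illustrative; real systems typically compute this in the metrics pipeline):

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """records: iterable of (segment, prediction, label) tuples.
    Returns overall accuracy plus per-segment accuracy, so a weak
    segment cannot hide inside a healthy aggregate."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, pred, label in records:
        total[segment] += 1
        correct[segment] += int(pred == label)
    per_segment = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_segment

# Toy data: aggregate looks fine while the "new" segment struggles
records = (
    [("power", 1, 1)] * 19 + [("power", 1, 0)] * 1 +   # 95% accurate
    [("new", 1, 1)] * 7 + [("new", 0, 1)] * 3          # 70% accurate
)
overall, per_segment = accuracy_by_segment(records)
```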
Model predictions can influence the data you use to evaluate the model. If recommendations determine what users see, and user interactions become training labels, you can't directly measure what users would have wanted without the model. This requires careful experimental design (holdout groups, exploration) to obtain unbiased evaluation signals.
Effective alerting turns monitoring data into actionable notifications. Too many alerts lead to alert fatigue; too few miss real problems. The goal is high signal-to-noise ratio with fast detection of genuine issues.
Alert Design Principles:
- Alert on symptoms (what users or the business experience), not only on internal causes
- Make every alert actionable; if nothing can be done, it belongs on a dashboard instead
- Route notifications by severity so urgent issues page a human and minor ones don't
| Severity | Response Time | Examples | Notification Channel |
|---|---|---|---|
| Critical (P1) | Immediate | Model serving down, severe accuracy drop | Page on-call, war room |
| High (P2) | < 1 hour | Significant performance degradation, data pipeline failure | Slack + page if no ack |
| Medium (P3) | < 4 hours | Moderate drift detected, minor accuracy drop | Slack notification |
| Low (P4) | < 24 hours | Slight distribution shift, non-critical feature issues | Email, ticket |
| Informational | Next business day | Scheduled retraining reminder, capacity planning | Dashboard, weekly report |
ML Incident Response Playbook:
Phase 1: Detection and Triage (< 15 minutes)
Confirm the alert is real:
- Check sample sizes (small windows produce noisy metrics)
- Rule out monitoring or data pipeline glitches
- Compare against recent baselines and adjacent metrics
Assess impact:
- How many predictions and users are affected?
- Are downstream systems or campaigns consuming these predictions?
Decide on immediate action:
- Severe impact: roll back or fail over to a fallback now
- Moderate impact: mitigate while investigating
- Minor impact: monitor closely and investigate
Phase 2: Stabilization (< 1 hour)
Implement mitigation:
- Roll back to the last known-good model version, or
- Switch to a rule-based fallback or cached predictions
Preserve evidence:
- Snapshot recent inputs, predictions, logs, and feature values before they age out of retention
Phase 3: Investigation (hours to days)
Root cause analysis:
- Trace the failure through data pipelines, features, model version, and serving infrastructure
- Check for recent deployments, upstream schema changes, and external events
Remediation:
- Fix the root cause (pipeline bug, bad retrain, config error) and validate before redeploying
Phase 4: Postmortem and Prevention (< 1 week)
Document the incident:
- Timeline, impact, root cause, and what worked or failed in detection and response
Identify preventive actions:
- New alerts, validation checks, or guardrails that would have caught the issue earlier
Implement improvements:
- Turn action items into tracked work with owners and deadlines
The most important incident response skill is quickly deciding whether to rollback. If you're unsure about the cause but the impact is significant, rollback first and investigate later. A fast rollback limits blast radius; a slow investigation while users suffer compounds the damage.
Models degrade over time. Retraining refreshes the model with recent data, adapting to current patterns. The key decisions are when to retrain and how to retrain.
Retraining Triggers:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Scheduled | Retrain on fixed cadence (daily, weekly) | Predictable, simple | May retrain unnecessarily or too late |
| Performance-triggered | Retrain when accuracy drops below threshold | Efficient, responsive | Requires reliable monitoring |
| Drift-triggered | Retrain when data drift exceeds threshold | Proactive, catches early | Drift may not affect accuracy |
| Data-triggered | Retrain when significant new data available | Uses fresh data efficiently | Depends on data arrival patterns |
| Hybrid | Scheduled + triggered early when needed | Balanced approach | More complex to implement |
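The hybrid strategy from the table can be sketched as a small policy function (illustrative; the thresholds are examples, not recommendations):

```python
from datetime import datetime, timedelta

def should_retrain(last_trained, now, max_age=timedelta(days=7),
                   psi=None, psi_threshold=0.2,
                   accuracy_drop=None, drop_threshold=0.05):
    """Hybrid retraining policy: retrain on schedule, or early when
    drift or degradation crosses its threshold. Returns (decision, reason)."""
    if accuracy_drop is not None and accuracy_drop > drop_threshold:
        return True, "performance"
    if psi is not None and psi > psi_threshold:
        return True, "drift"
    if now - last_trained >= max_age:
        return True, "schedule"
    return False, "none"

now = datetime(2024, 1, 10)
decision, reason = should_retrain(datetime(2024, 1, 8), now, psi=0.35)
```

Ordering matters: confirmed performance degradation is the strongest signal, so it is checked before drift, which is checked before the fallback schedule.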
Retraining Approaches:
Full Retraining: Train from scratch on all available data.
[All Historical Data] → [Training] → [New Model]
Incremental/Online Update: Update existing model with new data.
[Existing Model] + [New Data] → [Update] → [Updated Model]
Sliding Window: Retrain on a fixed-length window of recent data.
[Last N Months of Data] → [Training] → [New Model]
Window slides forward: discard oldest data, add newest
Weighted/Decay Training: Weight recent data more heavily than old data.
Sample weights: w(t) = e^(-λ * age(t))
Older samples contribute less to training objective
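The decay formula can be illustrated directly; parameterizing λ by a half-life is one common choice (a sketch, not a prescribed setting):

```python
import numpy as np

def decay_weights(ages_days, half_life_days=30.0):
    """Exponential-decay sample weights w(t) = exp(-lambda * age),
    with lambda chosen so the weight halves every `half_life_days`."""
    lam = np.log(2) / half_life_days
    return np.exp(-lam * np.asarray(ages_days, dtype=float))

# Samples aged 0, 30, 60, and 90 days get weights 1, 0.5, 0.25, 0.125
weights = decay_weights([0, 30, 60, 90])
```

Most training libraries accept these values via a `sample_weight`-style argument, so recency weighting requires no change to the model itself.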
Retraining Validation:
Never deploy a retrained model without validation:
Offline evaluation: compare the retrained model to the current model on held-out recent data, using the same metrics and thresholds.
Shadow deployment: serve the new model on live traffic in parallel without affecting users; compare its predictions and latency to the current model.
Canary deployment: route a small fraction of traffic to the new model and expand progressively only while health metrics hold.
Retraining Pipeline Example:
Trigger (scheduled/drift)
│
▼
[Data Collection] ←── Collect recent labeled data
│
▼
[Data Validation] ←── Check data quality
│
▼
[Feature Engineering] ←── Compute training features
│
▼
[Model Training] ←── Train with current hyperparameters
│
▼
[Offline Evaluation] ←── Compare to baseline
│
├── Fail → Alert, investigate
│
▼ Pass
[Model Registration] ←── Version, store artifacts
│
▼
[Shadow Deployment] ←── Validate on production traffic
│
├── Fail → Rollback, investigate
│
▼ Pass
[Canary Deployment] ←── Progressive rollout
│
▼
[Full Production] ←── New model is live
Sometimes retraining makes things worse—corrupted training data, bugs in feature engineering, or concept drift that confuses the model. Always have automated checks that prevent deploying a model that's significantly worse than the current one. Define minimum performance thresholds that must be met.
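As a sketch, such an automated gate might check candidate metrics against both absolute minimums and the current model (the function name and thresholds are illustrative):

```python
def validation_gate(candidate, current, min_absolute=None, max_relative_drop=0.02):
    """Block deployment of a retrained model that misses minimum
    thresholds or regresses too far from the current model.
    `candidate` and `current` map metric names to values."""
    min_absolute = min_absolute or {}
    failures = []
    for metric, current_value in current.items():
        cand_value = candidate.get(metric, 0.0)
        if metric in min_absolute and cand_value < min_absolute[metric]:
            failures.append(f"{metric} below absolute minimum")
        if current_value > 0 and (current_value - cand_value) / current_value > max_relative_drop:
            failures.append(f"{metric} regressed vs current model")
    return len(failures) == 0, failures

ok, why = validation_gate(
    candidate={"auc": 0.88, "accuracy": 0.91},
    current={"auc": 0.87, "accuracy": 0.92},
    min_absolute={"auc": 0.85},
)
bad_ok, bad_why = validation_gate({"auc": 0.80}, {"auc": 0.87}, {"auc": 0.85})
```

The dual check matters: an absolute floor catches models that were never good, while the relative check catches regressions even when metrics still clear the floor.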
Manual ML operations don't scale. As models multiply and retraining becomes routine, automation is essential to maintain quality and velocity. MLOps brings DevOps principles to ML systems.
The ML Automation Maturity Model:
| Level | Characteristics | Training | Deployment | Monitoring |
|---|---|---|---|---|
| Level 0: Manual | Everything by hand | Notebooks | Manual copy | Manual checks |
| Level 1: Scripts | Repeatable scripts | Script-based | CI triggers | Basic dashboards |
| Level 2: Pipelines | Automated workflows | Orchestrated | Automated + approval | Automated alerts |
| Level 3: Continuous | Full automation | Continuous training | Continuous deployment | Continuous monitoring |
| Level 4: Autonomous | Self-healing systems | Auto-triggered | Auto-rollout/rollback | Auto-remediation |
CI/CD for ML:
Continuous Integration (CI):
Code Change → [Build] → [Unit Tests] → [Integration Tests] → Merge
│ │
│ └── Data validation tests
│ └── Feature transform tests
│ └── Model inference tests
│
└── Code quality, linting
└── Security scanning
Continuous Training (CT):
Trigger → [Data Pipeline] → [Train] → [Evaluate] → [Register] → Staging
│
├── Scheduled (daily/weekly)
├── Data-triggered (new data arrived)
├── Performance-triggered (degradation detected)
└── Manual (on-demand)
Continuous Deployment (CD):
Staging Model → [Shadow Test] → [Canary] → [Progressive Rollout] → Production
│ │ │
Auto-validation Auto-validation Auto-validation
│ │ │
Fail → Halt Fail → Rollback Fail → Rollback
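The progressive rollout with auto-rollback above can be sketched as a simple control loop (illustrative; `health_check` stands in for whatever validation your platform runs at each stage):

```python
def progressive_rollout(stages, health_check):
    """Advance traffic through canary stages (traffic percentages),
    rolling back to 0% if any stage's health check fails."""
    for pct in stages:
        if not health_check(pct):
            return {"final_pct": 0, "status": f"rolled_back_at_{pct}"}
    return {"final_pct": stages[-1], "status": "fully_rolled_out"}

# Hypothetical health check: healthy up to 25% traffic, fails at 100%
result = progressive_rollout([1, 5, 25, 100], lambda pct: pct <= 25)
full = progressive_rollout([1, 5, 25, 100], lambda pct: True)
```

In practice each stage would also wait long enough to accumulate a statistically meaningful sample before advancing.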
Key Automation Infrastructure:
| Component | Purpose | Examples |
|---|---|---|
| Workflow Orchestration | Pipeline execution, dependencies | Airflow, Kubeflow, Prefect |
| Feature Store | Feature management, serving | Feast, Tecton, SageMaker |
| Model Registry | Model versioning, artifacts | MLflow, SageMaker, Weights & Biases |
| Experiment Tracking | Metrics, hyperparameters, comparison | MLflow, W&B, Neptune |
| Model Serving | Inference infrastructure | TensorFlow Serving, Triton, Seldon |
| Monitoring | Metrics, alerts, dashboards | Prometheus, Grafana, custom |
Infrastructure as Code for ML:
Define ML infrastructure declaratively:
```yaml
# Example: ML pipeline configuration
pipeline:
  name: churn-prediction-training
  schedule: "0 2 * * *"  # Daily at 2 AM
  stages:
    - name: data-preparation
      image: data-prep:v1
      resources:
        cpu: "4"
        memory: "16Gi"
      inputs:
        - source: s3://data/transactions
          date_range: last_90_days
      outputs:
        - destination: s3://features/training
    - name: model-training
      image: training:v2
      resources:
        gpu: "1"  # NVIDIA T4
        memory: "32Gi"
      inputs:
        - source: s3://features/training
      hyperparameters:
        learning_rate: 0.001
        epochs: 100
        batch_size: 256
      outputs:
        - destination: s3://models/churn
    - name: model-evaluation
      image: evaluation:v1
      inputs:
        - model: s3://models/churn
        - validation_data: s3://features/validation
      thresholds:
        accuracy: 0.85
        auc: 0.90
      on_failure: alert_and_halt
    - name: model-registration
      image: registry:v1
      inputs:
        - model: s3://models/churn
      destination: model-registry/churn-prediction
      promote_to: staging
```
This configuration:
- Runs the pipeline daily at 2 AM on a cron schedule
- Declares per-stage container images and resource requirements
- Encodes evaluation thresholds (accuracy 0.85, AUC 0.90) as a deployment gate
- Registers passing models and promotes them to staging automatically
Full automation (Level 3-4) requires significant investment. Start by making current processes repeatable with scripts (Level 1), then add orchestration (Level 2). Only pursue continuous automation when you have the volume and maturity to justify it. A single model doesn't need the same infrastructure as a platform serving hundreds of models.
ML systems are complex, and knowledge about them often exists only in the heads of their creators. When team members leave or forget, this knowledge is lost. Systematic documentation preserves institutional knowledge and enables effective maintenance.
Model Cards:
Model cards standardize model documentation:
```markdown
# Model Card: Churn Prediction v2.3

## Model Details
- **Developed by:** ML Team
- **Model date:** 2024-01-15
- **Model version:** 2.3.0
- **Model type:** Gradient Boosted Trees (LightGBM)
- **Training data:** User transactions Jan 2023 - Dec 2023

## Intended Use
- **Primary use:** Predict 30-day churn probability for subscription users
- **Out-of-scope:** Free tier users, B2B accounts, new users (<7 days)

## Metrics
| Metric | Value | Threshold |
|--------|-------|-----------|
| AUC-ROC | 0.87 | > 0.85 |
| Precision@10% | 0.45 | > 0.40 |
| Recall@10% | 0.62 | > 0.50 |

## Training Data
- **Size:** 1.2M users, 14.3% churn rate
- **Features:** 47 features covering engagement, billing, support
- **Label definition:** No activity for 30+ consecutive days

## Ethical Considerations
- **Fairness:** Tested for demographic parity across age groups
- **Privacy:** No PII used directly; only behavioral features

## Limitations
- Performs poorly for users with <10 sessions (AUC drops to 0.72)
- Seasonal effects not fully captured (holiday periods)
- Does not account for promotional offers

## Caveats and Recommendations
- Retrain monthly to maintain accuracy
- Monitor for drift in session features
- High-value user predictions should be reviewed manually
```
Operational Runbooks:
Document procedures for common operational tasks:
```markdown
# Runbook: Churn Model Degradation Alert

## Trigger
- Alert: "Churn model AUC dropped below 0.82 (threshold: 0.85)"

## Immediate Actions
1. **Verify alert is genuine:**
   - Check dashboard for sample size (< 1000 may be noise)
   - Compare to previous day's metrics
   - Check if there are infrastructure issues
2. **Assess impact:**
   - How many predictions affected?
   - Are intervention campaigns running on this model?
3. **Decide on mitigation:**
   - If degradation > 10%: Consider rollback to previous version
   - If degradation 5-10%: Monitor closely, prepare rollback
   - If degradation < 5%: Continue monitoring

## Investigation Steps
1. Check for data issues:
   - Feature freshness in feature store
   - Data pipeline success status
   - Feature distribution shifts
2. Check for model issues:
   - Recent model deployments
   - Serving infrastructure health
   - Memory/CPU utilization
3. Check for external factors:
   - Seasonal effects
   - Product changes
   - Marketing campaigns

## Escalation
- If cause unclear after 30 minutes: Page senior ML engineer
- If rollback fails: Page infrastructure on-call
- If business impact significant: Notify product stakeholders

## Recovery Validation
- Confirm metrics return to baseline
- Run smoke tests on predictions
- Document root cause for postmortem
```
| Document Type | Purpose | Update Frequency | Owner |
|---|---|---|---|
| Model Card | Model overview, capabilities, limitations | Each model version | ML Engineer |
| Architecture Doc | System design, components, dependencies | Major changes | Tech Lead |
| Runbooks | Operational procedures for incidents | After each incident | On-call rotation |
| Training Guide | How to retrain and validate | Pipeline changes | ML Engineer |
| Data Dictionary | Feature definitions, sources, schemas | Feature additions | Data Engineer |
| Experiment Log | A/B tests, results, decisions | Each experiment | Data Scientist |
Keep documentation close to code—in the same repository, updated in the same pull requests. Documentation that lives in a separate wiki drifts from reality. Model cards can be generated automatically from training logs. Runbooks can be versioned alongside the systems they describe.
Monitoring and maintenance are what separate experimental ML projects from production ML systems. The techniques covered in this page ensure your models continue to provide value long after initial deployment—adapting to changing data, detecting silent failures, and recovering from degradation.
Module Complete:
You've now completed Module 1: ML System Design, which took you from requirements gathering through production maintenance.
These skills enable you to design and operate complete ML systems—not just train models in notebooks, but build systems that deliver value in production.
You've mastered the fundamentals of ML system design—from requirements gathering through production maintenance. You can now design, deploy, and operate ML systems that reliably deliver business value. The next module will cover ML debugging—how to diagnose and fix problems when things go wrong.