A model that performs brilliantly in a Jupyter notebook but never reaches production delivers zero value. Yet countless ML projects end precisely here—in a graveyard of experiments that seemed promising but never saw real users.
Evaluation and deployment are where ML projects transition from research to impact. Evaluation answers: How good is this model really? Will it work in production? What could go wrong? Deployment answers: How do we serve predictions reliably, monitor for failures, and maintain performance over time?
These stages are fundamentally different from development. In development, you can iterate quickly, tolerate failures, and explore freely. In production, you need reliability, latency guarantees, graceful degradation, and continuous monitoring. The skills that make a good ML researcher are different from those that make a good ML engineer—and successful projects require both.
By the end of this page, you will understand how to evaluate models rigorously beyond simple accuracy, conduct proper offline evaluation before deployment, design and analyze A/B tests, deploy models to production environments, monitor model performance in production, and handle model degradation and retraining.
A single metric rarely captures everything that matters. Comprehensive evaluation requires examining model behavior from multiple angles.
Classification Metrics:
| Metric | Formula | Intuition | When to Use |
|---|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Fraction of correct predictions | Balanced classes, all errors equal |
| Precision | TP/(TP+FP) | Of predicted positives, how many are correct | When false positives are costly |
| Recall (Sensitivity) | TP/(TP+FN) | Of actual positives, how many did we catch | When false negatives are costly |
| F1 Score | 2·(P·R)/(P+R) | Harmonic mean of precision and recall | Balance precision and recall |
| AUC-ROC | Area under TPR vs FPR curve | Ranking quality across all thresholds | General quality, comparing models |
| AUC-PR | Area under Precision-Recall curve | Ranking quality for positives | Imbalanced data, rare positives |
| Log Loss | -Σ(y·log(p)+(1-y)·log(1-p)) | Penalizes confident wrong predictions | Probabilistic predictions needed |
| Specificity | TN/(TN+FP) | Of actual negatives, how many correctly identified | When true negatives matter |
Regression Metrics:
| Metric | Formula | Intuition | When to Use |
|---|---|---|---|
| MSE | Σ(y-ŷ)²/n | Average squared error | When large errors are very bad |
| RMSE | √MSE | Same units as target | Interpretability + penalize large errors |
| MAE | Σ|y-ŷ|/n | Average absolute error | Robust to outliers, interpretable |
| MAPE | Σ(|y-ŷ|/|y|)/n | Average percentage error | Scale-free comparison; avoid when y is near zero |
| R² | 1 - SS_res/SS_tot | Variance explained | Comparing to baseline (mean) |
Multi-Class Metrics:
For multi-class problems, per-class precision, recall, and F1 can be aggregated in several ways: macro averaging (unweighted mean over classes), micro averaging (pool all predictions, then compute the metric once), and weighted averaging (mean over classes weighted by class support).
If false positives cost $10 and false negatives cost $1000, optimize for recall over precision. If both cost the same, F1 balances them. Always trace metrics back to business outcomes—the best metric is the one that correlates with what the business actually cares about.
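To make this concrete, here is a small sketch that sweeps decision thresholds and picks the one with the lowest total error cost. The $10/$1,000 costs match the hypothetical example above; the data reuses the toy labels and probabilities from the metrics code in this section.

```python
import numpy as np

def expected_cost(y_true, y_proba, threshold, fp_cost=10, fn_cost=1000):
    """Total dollar cost of errors at a given decision threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return fp * fp_cost + fn * fn_cost

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_proba = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.95]

# Sweep thresholds and pick the cheapest one
thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y_true, y_proba, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"Cheapest threshold: {best:.2f}, total cost: ${min(costs)}")
```

Because false negatives are 100x costlier here, the optimal threshold is well below the default 0.5: the model should flag positives aggressively and tolerate some false positives.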
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, log_loss,
    confusion_matrix, classification_report,
    mean_squared_error, mean_absolute_error, r2_score
)
import numpy as np

# ===== CLASSIFICATION METRICS =====
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
y_proba = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.95]

print("Classification Metrics:")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_true, y_pred):.4f}")
print(f"AUC-ROC: {roc_auc_score(y_true, y_proba):.4f}")
print(f"AUC-PR: {average_precision_score(y_true, y_proba):.4f}")
print(f"Log Loss: {log_loss(y_true, y_proba):.4f}")

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

# Full classification report
print("Classification Report:")
print(classification_report(y_true, y_pred))

# ===== REGRESSION METRICS =====
y_true_reg = [3.0, 5.0, 2.5, 7.0, 4.5]
y_pred_reg = [2.8, 5.2, 2.0, 6.5, 4.8]

print("Regression Metrics:")
print(f"MSE: {mean_squared_error(y_true_reg, y_pred_reg):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)):.4f}")
print(f"MAE: {mean_absolute_error(y_true_reg, y_pred_reg):.4f}")
print(f"R²: {r2_score(y_true_reg, y_pred_reg):.4f}")
```

Aggregate metrics (accuracy, AUC, RMSE) summarize overall performance but can hide critical failures. Comprehensive evaluation examines model behavior across segments, edge cases, and over time.
Sliced Evaluation:
Aggregate metrics average over all examples. But models often perform differently on different subgroups: demographic segments, geographic regions, device types, new versus long-tenured users.
Slicing evaluates metrics on subsets of data:
```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Evaluate metrics across different slices
def evaluate_by_slice(df, y_true_col, y_pred_col, slice_col, metric_fn):
    """Compute metric for each value of slice_col."""
    results = []
    for slice_val in df[slice_col].unique():
        mask = df[slice_col] == slice_val
        n_samples = mask.sum()
        if n_samples >= 30:  # Only evaluate if enough samples
            score = metric_fn(
                df.loc[mask, y_true_col],
                df.loc[mask, y_pred_col]
            )
            results.append({
                'slice': slice_val,
                'n_samples': n_samples,
                'score': score
            })
    return pd.DataFrame(results).sort_values('score')

# Example usage (assumes `df` holds 'actual', 'predicted', and the slice columns)
slices = ['age_group', 'country', 'device_type', 'user_tenure']
for slice_col in slices:
    print(f"Performance by {slice_col}:")
    slice_results = evaluate_by_slice(
        df, 'actual', 'predicted', slice_col, accuracy_score
    )
    print(slice_results)

    # Flag concerning slices
    overall = accuracy_score(df['actual'], df['predicted'])
    underperforming = slice_results[slice_results['score'] < overall * 0.9]
    if len(underperforming) > 0:
        print("⚠️ Underperforming slices:")
        print(underperforming)
```

If your model's 80% confidence predictions are actually correct only 50% of the time, any downstream decisions based on those probabilities will be systematically wrong. Always validate calibration using calibration curves or reliability diagrams before using probability outputs.
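A calibration check can be sketched with scikit-learn's `calibration_curve`, which bins predictions and compares the predicted probability in each bin against the observed frequency of positives. The data here is synthetic, constructed so that outcomes occur at exactly the predicted rate (a perfectly calibrated model):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic example: predicted probabilities and matching actual outcomes
rng = np.random.default_rng(0)
y_proba = rng.uniform(0, 1, 5000)
# A well-calibrated model: each outcome occurs at its predicted rate
y_true = (rng.uniform(0, 1, 5000) < y_proba).astype(int)

# Bin predictions and compare predicted vs. observed frequency per bin
frac_pos, mean_pred = calibration_curve(y_true, y_proba, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted {mp:.2f} → observed {fp:.2f}")
```

For a calibrated model the two columns track each other closely; a model whose 0.8 bin shows only 0.5 observed positives is overconfident and needs recalibration (e.g. Platt scaling or isotonic regression).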
Offline evaluation happens before deployment, using historical data. It's your last chance to catch problems before users are affected.
Golden Rules of Offline Evaluation: touch the held-out test set only once, after all development decisions are final; apply exactly the same preprocessing to training and evaluation data; respect chronology whenever the data has a temporal dimension; and always compare against a trivial baseline.
Common Offline Evaluation Mistakes:
| Mistake | Why It's Dangerous | How to Avoid |
|---|---|---|
| Looking at test set during development | Optimistic estimates, won't generalize | Strict holdout, don't peek |
| Inconsistent preprocessing | Train/test have different transformations | Use pipelines, fit on train only |
| Wrong splits for time series | Information leakage from future | TimeSeriesSplit, respect chronology |
| Ignoring data drift | Eval data doesn't match production | Use recent data, test on new distributions |
| Not testing edge cases | Model fails on unusual inputs | Create adversarial test sets |
Backtesting for Time-Sensitive Models:
For models predicting future events (forecasting, churn, fraud), offline evaluation must simulate the temporal structure of production: train only on data available before a cutoff date, evaluate on the period that follows, then roll the cutoff forward and repeat. Averaging over several such windows estimates how the model would have performed had it been live.
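The rolling-cutoff scheme above is exactly what scikit-learn's `TimeSeriesSplit` produces. A minimal sketch on twelve chronologically ordered periods:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 periods of data, ordered chronologically
X = np.arange(12).reshape(-1, 1)

# Each fold trains on the past and validates on the period right after it
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
folds = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {fold}: train on {train_idx.tolist()}, test on {test_idx.tolist()}")
```

Note that, unlike standard k-fold, the training window only ever grows forward in time, so no fold ever "sees the future" of its test period.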
Before celebrating your model's 85% accuracy, ask: what does a trivial baseline achieve? If always predicting the majority class gives 80% accuracy, your model only adds 5 percentage points. For many problems, simple heuristics are surprisingly hard to beat significantly.
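A quick way to keep yourself honest is scikit-learn's `DummyClassifier`. This sketch uses a synthetic imbalanced dataset (roughly 80% negative class) to show how much of a model's accuracy comes "for free" from the class balance:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: ~80% negative class
X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Majority-class baseline: the score any real model must beat
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(f"Baseline accuracy: {baseline.score(X_te, y_te):.3f}")
print(f"Model accuracy:    {model.score(X_te, y_te):.3f}")
print(f"Lift over baseline: {model.score(X_te, y_te) - baseline.score(X_te, y_te):.3f}")
```

Report the lift over the baseline, not the raw accuracy; it is the honest measure of what the model adds.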
Offline evaluation predicts production performance but can't guarantee it. Online evaluation measures actual performance with real users, typically via A/B tests.
Why Online Evaluation Is Necessary: offline metrics are computed on historical data, which may no longer reflect current traffic; proxy metrics can diverge from the business outcomes you actually care about; and a deployed model changes user behavior in ways historical data cannot capture.
A/B Testing Fundamentals:
An A/B test randomly assigns users to groups: a control group that continues to receive the current system and a treatment group that receives the new model, with everything else held constant.
You then compare outcomes between groups to determine if the treatment is better.
Statistical Analysis:
A/B test analysis determines if observed differences are real or due to chance: state a null hypothesis of no difference, fix the significance level and required power up front, run an appropriate hypothesis test (e.g. a two-proportion z-test for conversion rates), and only then read off the p-value and confidence interval.
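As an illustration, here is a hand-rolled two-proportion z-test for conversion rates, using hypothetical counts (500 conversions out of 10,000 control users vs. 560 out of 10,000 treatment users):

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Control: 500 / 10,000 users convert; treatment: 560 / 10,000
z, p = two_proportion_ztest(500, 10_000, 560, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Despite a 12% relative lift, the p-value here lands just above 0.05, so at the conventional threshold you could not reject the null hypothesis; you would need more traffic to call this difference real.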
Common Pitfalls:
| Pitfall | Consequence | Solution |
|---|---|---|
| Peeking at results early | Inflated false positive rate | Pre-commit to sample size and duration |
| Multiple testing | Increased false positives | Correct for multiple comparisons (Bonferroni, FDR) |
| Ignoring novelty effects | Short-term gains don't persist | Run tests long enough for novelty to wear off |
| Simpson's paradox | Wrong conclusion due to confounders | Check segment-level effects |
| Underpowered tests | Can't detect real effects | Power analysis before launch |
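To avoid an underpowered test, estimate the required sample size before launch. The sketch below uses the standard normal-approximation formula for comparing two proportions; the base rate, minimum detectable effect (MDE), significance level, and power are illustrative inputs you would replace with your own:

```python
import numpy as np
from scipy.stats import norm

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.8):
    """Approximate users per arm to detect an absolute lift of `mde`."""
    p_alt = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    n = (z_alpha + z_beta) ** 2 * variance / mde ** 2
    return int(np.ceil(n))

# Detecting smaller lifts on a 5% base rate needs far more users
n1 = sample_size_per_arm(0.05, 0.01)    # 1-point absolute lift
n2 = sample_size_per_arm(0.05, 0.005)   # 0.5-point absolute lift
print(f"1-point lift:   {n1:,} users per arm")
print(f"0.5-point lift: {n2:,} users per arm")
```

Note how halving the detectable effect roughly quadruples the required sample size; this is why small-but-real improvements often demand long-running tests.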
Statistical significance means the observed difference is unlikely due to random chance. Practical significance means the difference is large enough to matter. A 0.01% improvement can be statistically significant with enough data, but if it requires months of engineering to maintain, it's not practically significant.
Deploying a model means making it available to serve predictions in production. Several patterns exist, each with different tradeoffs.
Deployment Patterns:
| Pattern | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Batch Prediction | Run predictions periodically on batches of data, store results | Simple, cost-effective, easy debugging | Stale predictions, storage overhead | Recommendations, reports, non-real-time |
| Real-time API | Model served via REST/gRPC endpoint, called per request | Fresh predictions, dynamic inputs | Latency requirements, scaling challenges | Fraud detection, personalization |
| Embedded Model | Model bundled with application (mobile, edge) | No network latency, works offline | Update complexity, device constraints | Mobile apps, IoT devices |
| Streaming | Predictions on data streams (Kafka, Kinesis) | Continuous processing, low latency | Infrastructure complexity | Real-time monitoring, event processing |
| Shadow Mode | Run new model in parallel, don't serve yet | Safe testing in production conditions | No impact data | Before risky changes |
| Canary Deployment | Route small percentage of traffic to new model | Limited blast radius if problems | Slower rollout | Gradual, safe rollouts |
Real-Time Serving Architecture:
For real-time prediction serving, typical components include a load balancer, stateless model servers behind a REST/gRPC API, a feature store for low-latency feature lookup, and logging of every request and prediction for later monitoring and retraining.
Feature Store Importance:
Real-time models need real-time features. But features that took hours to compute during training can't be recomputed per request. Feature stores solve this by precomputing features offline, serving them from a low-latency online store keyed by entity ID, and sharing the same feature definitions between training and inference.
Training-serving skew occurs when features computed at training time differ from those at serving time. Maybe a bug in the serving code, maybe different library versions, maybe different data freshness. This silent killer degrades predictions without obvious errors. Always validate feature consistency between training and serving.
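One way to catch skew is to periodically recompute features through the serving path for a sample of entities and diff them against the offline values. This is a minimal sketch; the feature names and values are hypothetical:

```python
import numpy as np
import pandas as pd

def check_feature_consistency(train_row, serving_row, rtol=1e-6):
    """Compare features computed offline vs. by the serving path."""
    mismatches = []
    for feat in train_row.index:
        offline, online = train_row[feat], serving_row.get(feat)
        if online is None:
            mismatches.append((feat, "missing in serving", offline, online))
        elif not np.isclose(offline, online, rtol=rtol, equal_nan=True):
            mismatches.append((feat, "value differs", offline, online))
    return mismatches

# Hypothetical: the same user's features from the training pipeline vs. the API
train_feats = pd.Series({"avg_spend_30d": 42.5, "n_orders": 7.0, "days_since_login": 3.0})
serving_feats = {"avg_spend_30d": 42.5, "n_orders": 7.0, "days_since_login": 12.0}  # stale!

mismatches = check_feature_consistency(train_feats, serving_feats)
for feat, reason, offline, online in mismatches:
    print(f"⚠️ {feat}: {reason} (offline={offline}, online={online})")
```

Running a check like this on a daily sample turns a silent skew bug into a visible alert.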
A deployed model isn't finished—it's just beginning its production lifecycle. Continuous monitoring is essential to catch degradation before it causes damage.
What to Monitor:
| Signal | What It Detects | Alerting Threshold |
|---|---|---|
| Prediction distribution shift | Model behavior change | Significant change from baseline (KL divergence, PSI) |
| Feature distribution shift | Input data drift | Population stability index > 0.25 |
| Latency spike | Infrastructure or model issues | p99 > 2x baseline |
| Error rate increase | Code or data bugs | Errors > baseline + 3σ |
| Null feature rate spike | Data pipeline failures | Nulls > baseline by significant margin |
| Prediction rate change | System-wide behavior shift | Volume differs from expected by >20% |
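The Population Stability Index referenced in the table has no standard scipy/sklearn implementation, but it is short to write by hand: bin the baseline distribution (here by deciles), then compare bin fractions between baseline and live data. The synthetic normal samples below are only for illustration:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a baseline and a live sample."""
    # Bin edges come from the baseline (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor tiny fractions to avoid log(0)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
baseline = rng.normal(0, 1, 10_000)
psi_stable = psi(baseline, rng.normal(0, 1, 10_000))
psi_shifted = psi(baseline, rng.normal(0.5, 1, 10_000))
print(f"No drift:   PSI = {psi_stable:.3f}")
print(f"Mean shift: PSI = {psi_shifted:.3f}")
```

A common rule of thumb: PSI below 0.1 is stable, 0.1 to 0.25 warrants investigation, and above 0.25 (the table's alerting threshold) indicates significant drift.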
Data Drift Detection:
The world changes. User behavior shifts, products evolve, external events disrupt patterns. When the data distribution in production differs from training, model performance degrades.
Types of Drift:
- Covariate shift: the input distribution P(X) changes while the relationship between inputs and target stays the same.
- Label shift: the target distribution P(y) changes (e.g. fraud becomes more common).
- Concept drift: the relationship P(y|X) itself changes, so the same inputs now mean something different.
Drift Detection Methods: compare the live distribution of each feature (and of the model's predictions) against a training-time baseline, using measures such as the Population Stability Index, KL divergence, or two-sample statistical tests like Kolmogorov-Smirnov.
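As a sketch of the two-sample approach, scipy's `ks_2samp` compares a feature's training-time sample against a production window. The distributions here are synthetic; in practice you would pull both samples from your logs:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
train_feature = rng.normal(50, 10, 5_000)     # feature at training time

# Production window 1: same distribution; window 2: drifted upward
stable = rng.normal(50, 10, 5_000)
drifted = rng.normal(55, 12, 5_000)

stat_ok, p_ok = ks_2samp(train_feature, stable)
stat_drift, p_drift = ks_2samp(train_feature, drifted)
print(f"stable:  KS={stat_ok:.3f}, p={p_ok:.4f}")
print(f"drifted: KS={stat_drift:.3f}, p={p_drift:.2e}")
```

One caveat: with large samples the KS test flags even tiny, harmless shifts as significant, so pair the p-value with an effect-size threshold on the KS statistic itself.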
Model degradation is often invisible. The model still returns predictions, there are no errors, latency is fine—but predictions are less accurate. Without monitoring, you only discover this when business metrics tank. By then, weeks of damage may have accumulated.
Models are not static artifacts. They require ongoing maintenance to remain effective.
Why Models Degrade: the world drifts away from the training data, the relationships the model learned change, upstream data pipelines and schemas evolve, and the model's own predictions can alter the very behavior it is trying to predict.
Retraining Strategies:
| Strategy | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed schedule | Retrain every N days/weeks | Predictable, simple | May retrain unnecessarily or not enough | Stable environments |
| Trigger-based | Retrain when drift or performance drops | Efficient, reactive | Requires good monitoring | Fast-changing environments |
| Continuous training | Pipeline constantly ingests data, updates model | Always fresh | Complex infrastructure, risk of instability | High-volume, fast-changing |
| Online learning | Model updates with each new example | Real-time adaptation | Stability concerns, not all algorithms support | Real-time systems |
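The trigger-based strategy from the table can be sketched as a small policy object. The thresholds below (PSI, AUC floor, maximum model age) are illustrative defaults, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    """Trigger-based policy: retrain on drift, performance drop, or age."""
    psi_threshold: float = 0.25
    min_auc: float = 0.75
    max_model_age_days: int = 30

    def should_retrain(self, feature_psi, recent_auc, model_age_days):
        if feature_psi > self.psi_threshold:
            return True, f"feature drift (PSI={feature_psi:.2f})"
        if recent_auc < self.min_auc:
            return True, f"performance drop (AUC={recent_auc:.2f})"
        if model_age_days > self.max_model_age_days:
            return True, f"model age ({model_age_days} days)"
        return False, "healthy"

policy = RetrainPolicy()
decision1 = policy.should_retrain(feature_psi=0.31, recent_auc=0.82, model_age_days=10)
decision2 = policy.should_retrain(feature_psi=0.05, recent_auc=0.82, model_age_days=10)
print(decision1)
print(decision2)
```

A policy like this would run on a schedule against the monitoring metrics described above, with its decision (and reason) logged and, typically, reviewed before kicking off an automated retraining job.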
Retraining Pipeline Components: an automated retraining pipeline typically covers data collection and validation, feature computation, model training, evaluation against the current production model, and gated deployment of the winner.
Champion-Challenger Pattern:
In this pattern, the current production model (the champion) is continuously compared against a newly trained candidate (the challenger). The challenger runs in shadow mode or on a small traffic slice. If it outperforms the champion, it becomes the new champion. This protects against deploying worse models.
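The promotion decision itself can be reduced to a guard function. This is a simplified sketch: the metric name and the minimum required lift are hypothetical, and a real system would also check sample sizes and statistical significance before promoting:

```python
def decide_promotion(champion_auc, challenger_auc, min_lift=0.005):
    """Promote the challenger only if it beats the champion by a clear margin."""
    lift = challenger_auc - champion_auc
    if lift >= min_lift:
        return "promote", lift
    return "keep champion", lift

# Both models scored in shadow mode on the same traffic slice
decision_big = decide_promotion(champion_auc=0.861, challenger_auc=0.874)
decision_tiny = decide_promotion(champion_auc=0.861, challenger_auc=0.862)
print(decision_big)
print(decision_tiny)
```

Requiring a minimum lift (rather than any improvement at all) avoids churning models over noise-level differences.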
Model Versioning:
Track what's deployed and ensure reproducibility: version the model artifact together with the training code, data snapshot, hyperparameters, and feature definitions, so that any model in production can be traced back to exactly how it was built and rebuilt if needed.
Manual retraining doesn't scale. Automate data collection, training, validation, and deployment. The goal is a pipeline that can refresh models with minimal human intervention—humans design the pipeline and set policies, but execution is automatic.
Evaluation and deployment are where ML projects deliver value—or fail to. Rigorous evaluation catches problems before users are affected; robust deployment ensures reliable, maintainable systems. The key lessons: evaluate beyond a single aggregate metric, validate offline before testing online, roll out incrementally, monitor continuously, and automate retraining.
Module Complete: The ML Pipeline
You've now journeyed through the complete machine learning pipeline: from problem formulation and data collection, through feature engineering and model development, to evaluation, deployment, and ongoing monitoring.
This pipeline isn't waterfall—it's iterative. Evaluation findings loop back to improve features; production monitoring reveals problems that require reformulating the problem. Each stage informs the others.
Congratulations! You now understand the complete ML pipeline from problem formulation through production deployment. This framework applies to every ML project—whether you're predicting churn, detecting fraud, recommending products, or any other application. The specific techniques vary, but the pipeline structure remains constant. You're ready to build ML systems that deliver real-world value.