A model that performs brilliantly in a Jupyter notebook but never reaches production delivers zero value. Yet countless ML projects end precisely here—in a graveyard of experiments that seemed promising but never saw real users.
Evaluation and deployment are where ML projects transition from research to impact. Evaluation answers: How good is this model really? Will it work in production? What could go wrong? Deployment answers: How do we serve predictions reliably, monitor for failures, and maintain performance over time?
These stages are fundamentally different from development. In development, you can iterate quickly, tolerate failures, and explore freely. In production, you need reliability, latency guarantees, graceful degradation, and continuous monitoring. The skills that make a good ML researcher are different from those that make a good ML engineer—and successful projects require both.
By the end of this page, you will understand how to evaluate models rigorously beyond simple accuracy, conduct proper offline evaluation before deployment, design and analyze A/B tests, deploy models to production environments, monitor model performance in production, and handle model degradation and retraining.
A single metric rarely captures everything that matters. Comprehensive evaluation requires examining model behavior from multiple angles.
Classification Metrics:
| Metric | Formula | Intuition | When to Use |
|---|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Fraction of correct predictions | Balanced classes, all errors equal |
| Precision | TP/(TP+FP) | Of predicted positives, how many are correct | When false positives are costly |
| Recall (Sensitivity) | TP/(TP+FN) | Of actual positives, how many did we catch | When false negatives are costly |
| F1 Score | 2·(P·R)/(P+R) | Harmonic mean of precision and recall | Balance precision and recall |
| AUC-ROC | Area under TPR vs FPR curve | Ranking quality across all thresholds | General quality, comparing models |
| AUC-PR | Area under Precision-Recall curve | Ranking quality for positives | Imbalanced data, rare positives |
| Log Loss | -Σ(y·log(p)+(1-y)·log(1-p)) | Penalizes confident wrong predictions | Probabilistic predictions needed |
| Specificity | TN/(TN+FP) | Of actual negatives, how many correctly identified | When true negatives matter |
Regression Metrics:
| Metric | Formula | Intuition | When to Use |
|---|---|---|---|
| MSE | Σ(y-ŷ)²/n | Average squared error | When large errors are very bad |
| RMSE | √MSE | Same units as target | Interpretability + penalize large errors |
| MAE | Σ|y-ŷ|/n | Average absolute error | Robust to outliers, interpretable |
| MAPE | Σ(|y-ŷ|/|y|)/n | Average percentage error | Scale-free comparison; avoid when y is near zero |
| R² | 1 - SS_res/SS_tot | Variance explained | Comparing to baseline (mean) |
Multi-Class Metrics:
For multi-class problems, per-class precision, recall, and F1 can be aggregated in several ways: macro averaging (unweighted mean over classes), micro averaging (pool all predictions, then compute the metric once), and weighted averaging (mean over classes weighted by class support).
If false positives cost $10 and false negatives cost $1000, optimize for recall over precision. If both cost the same, F1 balances them. Always trace metrics back to business outcomes—the best metric is the one that correlates with what the business actually cares about.
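To make this concrete, here is a small sketch that sweeps decision thresholds and picks the one with the lowest total error cost. The $10/$1,000 costs match the hypothetical example above; the data reuses the toy labels and probabilities from the metrics code in this section.

```python
import numpy as np

def expected_cost(y_true, y_proba, threshold, fp_cost=10, fn_cost=1000):
    """Total dollar cost of errors at a given decision threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return fp * fp_cost + fn * fn_cost

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_proba = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.95]

# Sweep thresholds and pick the cheapest one
thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y_true, y_proba, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"Cheapest threshold: {best:.2f}, total cost: ${min(costs)}")
```

Because false negatives are 100x costlier here, the optimal threshold is well below the default 0.5: the model should flag positives aggressively and tolerate some false positives.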
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, log_loss,
    confusion_matrix, classification_report,
    mean_squared_error, mean_absolute_error, r2_score
)
import numpy as np

# ===== CLASSIFICATION METRICS =====
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
y_proba = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.95]

print("Classification Metrics:")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_true, y_pred):.4f}")
print(f"AUC-ROC: {roc_auc_score(y_true, y_proba):.4f}")
print(f"AUC-PR: {average_precision_score(y_true, y_proba):.4f}")
print(f"Log Loss: {log_loss(y_true, y_proba):.4f}")

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

# Full classification report
print("Classification Report:")
print(classification_report(y_true, y_pred))

# ===== REGRESSION METRICS =====
y_true_reg = [3.0, 5.0, 2.5, 7.0, 4.5]
y_pred_reg = [2.8, 5.2, 2.0, 6.5, 4.8]

print("Regression Metrics:")
print(f"MSE: {mean_squared_error(y_true_reg, y_pred_reg):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)):.4f}")
print(f"MAE: {mean_absolute_error(y_true_reg, y_pred_reg):.4f}")
print(f"R²: {r2_score(y_true_reg, y_pred_reg):.4f}")
```

Aggregate metrics (accuracy, AUC, RMSE) summarize overall performance but can hide critical failures. Comprehensive evaluation examines model behavior across segments, edge cases, and over time.
Sliced Evaluation:
Aggregate metrics average over all examples. But models often perform differently on different subgroups: demographic segments, geographic regions, device types, new versus long-tenured users.
Slicing evaluates metrics on subsets of data:
```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Evaluate metrics across different slices
def evaluate_by_slice(df, y_true_col, y_pred_col, slice_col, metric_fn):
    """Compute metric for each value of slice_col."""
    results = []
    for slice_val in df[slice_col].unique():
        mask = df[slice_col] == slice_val
        n_samples = mask.sum()
        if n_samples >= 30:  # Only evaluate if enough samples
            score = metric_fn(
                df.loc[mask, y_true_col],
                df.loc[mask, y_pred_col]
            )
            results.append({
                'slice': slice_val,
                'n_samples': n_samples,
                'score': score
            })
    return pd.DataFrame(results).sort_values('score')

# Example usage (assumes `df` holds 'actual', 'predicted', and the slice columns)
slices = ['age_group', 'country', 'device_type', 'user_tenure']
for slice_col in slices:
    print(f"Performance by {slice_col}:")
    slice_results = evaluate_by_slice(
        df, 'actual', 'predicted', slice_col, accuracy_score
    )
    print(slice_results)

    # Flag concerning slices
    overall = accuracy_score(df['actual'], df['predicted'])
    underperforming = slice_results[slice_results['score'] < overall * 0.9]
    if len(underperforming) > 0:
        print("⚠️ Underperforming slices:")
        print(underperforming)
```

If your model's 80% confidence predictions are actually correct only 50% of the time, any downstream decisions based on those probabilities will be systematically wrong. Always validate calibration using calibration curves or reliability diagrams before using probability outputs.
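A calibration check can be sketched with scikit-learn's `calibration_curve`, which bins predictions and compares the predicted probability in each bin against the observed frequency of positives. The data here is synthetic, constructed so that outcomes occur at exactly the predicted rate (a perfectly calibrated model):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic example: predicted probabilities and matching actual outcomes
rng = np.random.default_rng(0)
y_proba = rng.uniform(0, 1, 5000)
# A well-calibrated model: each outcome occurs at its predicted rate
y_true = (rng.uniform(0, 1, 5000) < y_proba).astype(int)

# Bin predictions and compare predicted vs. observed frequency per bin
frac_pos, mean_pred = calibration_curve(y_true, y_proba, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted {mp:.2f} → observed {fp:.2f}")
```

For a calibrated model the two columns track each other closely; a model whose 0.8 bin shows only 0.5 observed positives is overconfident and needs recalibration (e.g. Platt scaling or isotonic regression).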
Offline evaluation happens before deployment, using historical data. It's your last chance to catch problems before users are affected.
Golden Rules of Offline Evaluation: touch the held-out test set only once, after all development decisions are final; apply exactly the same preprocessing to training and evaluation data; respect chronology whenever the data has a temporal dimension; and always compare against a trivial baseline.
Common Offline Evaluation Mistakes:
| Mistake | Why It's Dangerous | How to Avoid |
|---|---|---|
| Looking at test set during development | Optimistic estimates, won't generalize | Strict holdout, don't peek |
| Inconsistent preprocessing | Train/test have different transformations | Use pipelines, fit on train only |
| Wrong splits for time series | Information leakage from future | TimeSeriesSplit, respect chronology |
| Ignoring data drift | Eval data doesn't match production | Use recent data, test on new distributions |
| Not testing edge cases | Model fails on unusual inputs | Create adversarial test sets |
Backtesting for Time-Sensitive Models:
For models predicting future events (forecasting, churn, fraud), offline evaluation must simulate the temporal structure of production: train only on data available before a cutoff date, evaluate on the period that follows, then roll the cutoff forward and repeat. Averaging over several such windows estimates how the model would have performed had it been live.
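The rolling-cutoff scheme above is exactly what scikit-learn's `TimeSeriesSplit` produces. A minimal sketch on twelve chronologically ordered periods:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 periods of data, ordered chronologically
X = np.arange(12).reshape(-1, 1)

# Each fold trains on the past and validates on the period right after it
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
folds = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {fold}: train on {train_idx.tolist()}, test on {test_idx.tolist()}")
```

Note that, unlike standard k-fold, the training window only ever grows forward in time, so no fold ever "sees the future" of its test period.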
Before celebrating your model's 85% accuracy, ask: what does a trivial baseline achieve? If always predicting the majority class gives 80% accuracy, your model only adds 5 percentage points. For many problems, simple heuristics are surprisingly hard to beat significantly.
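A quick way to keep yourself honest is scikit-learn's `DummyClassifier`. This sketch uses a synthetic imbalanced dataset (roughly 80% negative class) to show how much of a model's accuracy comes "for free" from the class balance:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: ~80% negative class
X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Majority-class baseline: the score any real model must beat
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(f"Baseline accuracy: {baseline.score(X_te, y_te):.3f}")
print(f"Model accuracy:    {model.score(X_te, y_te):.3f}")
print(f"Lift over baseline: {model.score(X_te, y_te) - baseline.score(X_te, y_te):.3f}")
```

Report the lift over the baseline, not the raw accuracy; it is the honest measure of what the model adds.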
Offline evaluation predicts production performance but can't guarantee it. Online evaluation measures actual performance with real users, typically via A/B tests.
Why Online Evaluation Is Necessary: offline metrics are computed on historical data, which may no longer reflect current traffic; proxy metrics can diverge from the business outcomes you actually care about; and a deployed model changes user behavior in ways historical data cannot capture.
A/B Testing Fundamentals:
An A/B test randomly assigns users to groups: a control group that continues to receive the current system and a treatment group that receives the new model, with everything else held constant.
You then compare outcomes between groups to determine if the treatment is better.
Statistical Analysis:
A/B test analysis determines if observed differences are real or due to chance: state a null hypothesis of no difference, fix the significance level and required power up front, run an appropriate hypothesis test (e.g. a two-proportion z-test for conversion rates), and only then read off the p-value and confidence interval.
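As an illustration, here is a hand-rolled two-proportion z-test for conversion rates, using hypothetical counts (500 conversions out of 10,000 control users vs. 560 out of 10,000 treatment users):

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Control: 500 / 10,000 users convert; treatment: 560 / 10,000
z, p = two_proportion_ztest(500, 10_000, 560, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Despite a 12% relative lift, the p-value here lands just above 0.05, so at the conventional threshold you could not reject the null hypothesis; you would need more traffic to call this difference real.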
Common Pitfalls:
| Pitfall | Consequence | Solution |
|---|---|---|
| Peeking at results early | Inflated false positive rate | Pre-commit to sample size and duration |
| Multiple testing | Increased false positives | Correct for multiple comparisons (Bonferroni, FDR) |
| Ignoring novelty effects | Short-term gains don't persist | Run tests long enough for novelty to wear off |
| Simpson's paradox | Wrong conclusion due to confounders | Check segment-level effects |
| Underpowered tests | Can't detect real effects | Power analysis before launch |
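To avoid an underpowered test, estimate the required sample size before launch. The sketch below uses the standard normal-approximation formula for comparing two proportions; the base rate, minimum detectable effect (MDE), significance level, and power are illustrative inputs you would replace with your own:

```python
import numpy as np
from scipy.stats import norm

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.8):
    """Approximate users per arm to detect an absolute lift of `mde`."""
    p_alt = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    n = (z_alpha + z_beta) ** 2 * variance / mde ** 2
    return int(np.ceil(n))

# Detecting smaller lifts on a 5% base rate needs far more users
n1 = sample_size_per_arm(0.05, 0.01)    # 1-point absolute lift
n2 = sample_size_per_arm(0.05, 0.005)   # 0.5-point absolute lift
print(f"1-point lift:   {n1:,} users per arm")
print(f"0.5-point lift: {n2:,} users per arm")
```

Note how halving the detectable effect roughly quadruples the required sample size; this is why small-but-real improvements often demand long-running tests.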
Statistical significance means the observed difference is unlikely due to random chance. Practical significance means the difference is large enough to matter. A 0.01% improvement can be statistically significant with enough data, but if it requires months of engineering to maintain, it's not practically significant.
Deploying a model means making it available to serve predictions in production. Several patterns exist, each with different tradeoffs.
Deployment Patterns:
| Pattern | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Batch Prediction | Run predictions periodically on batches of data, store results | Simple, cost-effective, easy debugging | Stale predictions, storage overhead | Recommendations, reports, non-real-time |
| Real-time API | Model served via REST/gRPC endpoint, called per request | Fresh predictions, dynamic inputs | Latency requirements, scaling challenges | Fraud detection, personalization |
| Embedded Model | Model bundled with application (mobile, edge) | No network latency, works offline | Update complexity, device constraints | Mobile apps, IoT devices |
| Streaming | Predictions on data streams (Kafka, Kinesis) | Continuous processing, low latency | Infrastructure complexity | Real-time monitoring, event processing |
| Shadow Mode | Run new model in parallel, don't serve yet | Safe testing in production conditions | No impact data | Before risky changes |
| Canary Deployment | Route small percentage of traffic to new model | Limited blast radius if problems | Slower rollout | Gradual, safe rollouts |
Real-Time Serving Architecture:
For real-time prediction serving, typical components include a load balancer, stateless model servers behind a REST/gRPC API, a feature store for low-latency feature lookup, and logging of every request and prediction for later monitoring and retraining.
Feature Store Importance:
Real-time models need real-time features. But features that took hours to compute during training can't be recomputed per request. Feature stores solve this by precomputing features offline, serving them from a low-latency online store keyed by entity ID, and sharing the same feature definitions between training and inference.
Training-serving skew occurs when features computed at training time differ from those at serving time. Maybe a bug in the serving code, maybe different library versions, maybe different data freshness. This silent killer degrades predictions without obvious errors. Always validate feature consistency between training and serving.
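One way to catch skew is to periodically recompute features through the serving path for a sample of entities and diff them against the offline values. This is a minimal sketch; the feature names and values are hypothetical:

```python
import numpy as np
import pandas as pd

def check_feature_consistency(train_row, serving_row, rtol=1e-6):
    """Compare features computed offline vs. by the serving path."""
    mismatches = []
    for feat in train_row.index:
        offline, online = train_row[feat], serving_row.get(feat)
        if online is None:
            mismatches.append((feat, "missing in serving", offline, online))
        elif not np.isclose(offline, online, rtol=rtol, equal_nan=True):
            mismatches.append((feat, "value differs", offline, online))
    return mismatches

# Hypothetical: the same user's features from the training pipeline vs. the API
train_feats = pd.Series({"avg_spend_30d": 42.5, "n_orders": 7.0, "days_since_login": 3.0})
serving_feats = {"avg_spend_30d": 42.5, "n_orders": 7.0, "days_since_login": 12.0}  # stale!

mismatches = check_feature_consistency(train_feats, serving_feats)
for feat, reason, offline, online in mismatches:
    print(f"⚠️ {feat}: {reason} (offline={offline}, online={online})")
```

Running a check like this on a daily sample turns a silent skew bug into a visible alert.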
A deployed model isn't finished—it's just beginning its production lifecycle. Continuous monitoring is essential to catch degradation before it causes damage.
What to Monitor:
| Signal | What It Detects | Alerting Threshold |
|---|---|---|
| Prediction distribution shift | Model behavior change | Significant change from baseline (KL divergence, PSI) |
| Feature distribution shift | Input data drift | Population stability index > 0.25 |
| Latency spike | Infrastructure or model issues | p99 > 2x baseline |
| Error rate increase | Code or data bugs | Errors > baseline + 3σ |
| Null feature rate spike | Data pipeline failures | Nulls > baseline by significant margin |
| Prediction rate change | System-wide behavior shift | Volume differs from expected by >20% |
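The Population Stability Index referenced in the table has no standard scipy/sklearn implementation, but it is short to write by hand: bin the baseline distribution (here by deciles), then compare bin fractions between baseline and live data. The synthetic normal samples below are only for illustration:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a baseline and a live sample."""
    # Bin edges come from the baseline (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor tiny fractions to avoid log(0)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
baseline = rng.normal(0, 1, 10_000)
psi_stable = psi(baseline, rng.normal(0, 1, 10_000))
psi_shifted = psi(baseline, rng.normal(0.5, 1, 10_000))
print(f"No drift:   PSI = {psi_stable:.3f}")
print(f"Mean shift: PSI = {psi_shifted:.3f}")
```

A common rule of thumb: PSI below 0.1 is stable, 0.1 to 0.25 warrants investigation, and above 0.25 (the table's alerting threshold) indicates significant drift.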
Data Drift Detection:
The world changes. User behavior shifts, products evolve, external events disrupt patterns. When the data distribution in production differs from training, model performance degrades.
Types of Drift:
- Covariate shift: the input distribution P(X) changes while the relationship between inputs and target stays the same.
- Label shift: the target distribution P(y) changes (e.g. fraud becomes more common).
- Concept drift: the relationship P(y|X) itself changes, so the same inputs now mean something different.
Drift Detection Methods: compare the live distribution of each feature (and of the model's predictions) against a training-time baseline, using measures such as the Population Stability Index, KL divergence, or two-sample statistical tests like Kolmogorov-Smirnov.
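As a sketch of the two-sample approach, scipy's `ks_2samp` compares a feature's training-time sample against a production window. The distributions here are synthetic; in practice you would pull both samples from your logs:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
train_feature = rng.normal(50, 10, 5_000)     # feature at training time

# Production window 1: same distribution; window 2: drifted upward
stable = rng.normal(50, 10, 5_000)
drifted = rng.normal(55, 12, 5_000)

stat_ok, p_ok = ks_2samp(train_feature, stable)
stat_drift, p_drift = ks_2samp(train_feature, drifted)
print(f"stable:  KS={stat_ok:.3f}, p={p_ok:.4f}")
print(f"drifted: KS={stat_drift:.3f}, p={p_drift:.2e}")
```

One caveat: with large samples the KS test flags even tiny, harmless shifts as significant, so pair the p-value with an effect-size threshold on the KS statistic itself.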
Model degradation is often invisible. The model still returns predictions, there are no errors, latency is fine—but predictions are less accurate. Without monitoring, you only discover this when business metrics tank. By then, weeks of damage may have accumulated.
Models are not static artifacts. They require ongoing maintenance to remain effective.
Why Models Degrade: the world drifts away from the training data, the relationships the model learned change, upstream data pipelines and schemas evolve, and the model's own predictions can alter the very behavior it is trying to predict.
Retraining Strategies:
| Strategy | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed schedule | Retrain every N days/weeks | Predictable, simple | May retrain unnecessarily or not enough | Stable environments |
| Trigger-based | Retrain when drift or performance drops | Efficient, reactive | Requires good monitoring | Fast-changing environments |
| Continuous training | Pipeline constantly ingests data, updates model | Always fresh | Complex infrastructure, risk of instability | High-volume, fast-changing |
| Online learning | Model updates with each new example | Real-time adaptation | Stability concerns, not all algorithms support | Real-time systems |
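The trigger-based strategy from the table can be sketched as a small policy object. The thresholds below (PSI, AUC floor, maximum model age) are illustrative defaults, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    """Trigger-based policy: retrain on drift, performance drop, or age."""
    psi_threshold: float = 0.25
    min_auc: float = 0.75
    max_model_age_days: int = 30

    def should_retrain(self, feature_psi, recent_auc, model_age_days):
        if feature_psi > self.psi_threshold:
            return True, f"feature drift (PSI={feature_psi:.2f})"
        if recent_auc < self.min_auc:
            return True, f"performance drop (AUC={recent_auc:.2f})"
        if model_age_days > self.max_model_age_days:
            return True, f"model age ({model_age_days} days)"
        return False, "healthy"

policy = RetrainPolicy()
decision1 = policy.should_retrain(feature_psi=0.31, recent_auc=0.82, model_age_days=10)
decision2 = policy.should_retrain(feature_psi=0.05, recent_auc=0.82, model_age_days=10)
print(decision1)
print(decision2)
```

A policy like this would run on a schedule against the monitoring metrics described above, with its decision (and reason) logged and, typically, reviewed before kicking off an automated retraining job.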
Retraining Pipeline Components: an automated retraining pipeline typically covers data collection and validation, feature computation, model training, evaluation against the current production model, and gated deployment of the winner.
Champion-Challenger Pattern:
In this pattern, the current production model (the champion) is continuously compared against a newly trained candidate (the challenger). The challenger runs in shadow mode or on a small traffic slice. If it outperforms the champion, it becomes the new champion. This protects against deploying worse models.
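The promotion decision itself can be reduced to a guard function. This is a simplified sketch: the metric name and the minimum required lift are hypothetical, and a real system would also check sample sizes and statistical significance before promoting:

```python
def decide_promotion(champion_auc, challenger_auc, min_lift=0.005):
    """Promote the challenger only if it beats the champion by a clear margin."""
    lift = challenger_auc - champion_auc
    if lift >= min_lift:
        return "promote", lift
    return "keep champion", lift

# Both models scored in shadow mode on the same traffic slice
decision_big = decide_promotion(champion_auc=0.861, challenger_auc=0.874)
decision_tiny = decide_promotion(champion_auc=0.861, challenger_auc=0.862)
print(decision_big)
print(decision_tiny)
```

Requiring a minimum lift (rather than any improvement at all) avoids churning models over noise-level differences.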
Model Versioning:
Track what's deployed and ensure reproducibility: version the model artifact together with the training code, data snapshot, hyperparameters, and feature definitions, so that any model in production can be traced back to exactly how it was built and rebuilt if needed.
Manual retraining doesn't scale. Automate data collection, training, validation, and deployment. The goal is a pipeline that can refresh models with minimal human intervention—humans design the pipeline and set policies, but execution is automatic.
Evaluation and deployment are where ML projects deliver value—or fail to. Rigorous evaluation catches problems before users are affected; robust deployment ensures reliable, maintainable systems. The key lessons: evaluate beyond a single aggregate metric, validate offline before testing online, roll out incrementally, monitor continuously, and automate retraining.
Module Complete: The ML Pipeline
You've now journeyed through the complete machine learning pipeline: from problem formulation and data collection, through feature engineering and model development, to evaluation, deployment, and ongoing monitoring.
This pipeline isn't waterfall—it's iterative. Evaluation findings loop back to improve features; production monitoring reveals problems that require reformulating the problem. Each stage informs the others.
Congratulations! You now understand the complete ML pipeline from problem formulation through production deployment. This framework applies to every ML project—whether you're predicting churn, detecting fraud, recommending products, or any other application. The specific techniques vary, but the pipeline structure remains constant. You're ready to build ML systems that deliver real-world value.