Selecting evaluation metrics is one of the most consequential decisions in ML development. The wrong metric leads to models optimized for the wrong objective—technically successful but practically useless.
The Selection Challenge:
With dozens of available metrics—accuracy, precision, recall, F1, AUC, log-loss, MAE, RMSE, NDCG, and countless domain-specific variants—how do you choose? The answer depends on your problem type, class balance, error cost structure, and the stakeholders who will consume the results.
This page provides a systematic framework for metric selection.
By the end of this page, you will have a systematic process for metric selection, understand when to use each major metric family, know how to validate metric choices, and be able to communicate metric decisions to stakeholders.
The first filter for metric selection is problem type. Different problem structures require fundamentally different metrics.
| Problem Type | Primary Metrics | When to Use |
|---|---|---|
| Binary Classification | AUC-ROC, F1, Precision/Recall | Two mutually exclusive classes |
| Multi-class Classification | Macro/Micro F1, Cohen's Kappa | Multiple mutually exclusive classes |
| Multi-label Classification | Hamming Loss, Subset Accuracy | Multiple non-exclusive labels per instance |
| Regression | RMSE, MAE, R² | Continuous numeric output |
| Ranking | NDCG, MAP, MRR | Ordered list of items |
| Probabilistic Prediction | Log-loss, Brier Score | Calibrated probability estimates needed |
| Anomaly Detection | Precision@k, AUPRC | Rare event detection, imbalanced |
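As a concrete starting point, here is a minimal sketch (assuming scikit-learn is available) that maps each problem type from the table to representative metric functions. The mapping and names are illustrative rather than exhaustive or canonical.

```python
# Illustrative mapping from problem type to representative scikit-learn metrics.
from sklearn import metrics

METRICS_BY_PROBLEM_TYPE = {
    "binary_classification": {
        "roc_auc": metrics.roc_auc_score,        # threshold-independent
        "f1": metrics.f1_score,                  # threshold-dependent
        "precision": metrics.precision_score,
        "recall": metrics.recall_score,
    },
    "multiclass_classification": {
        "macro_f1": lambda y, p: metrics.f1_score(y, p, average="macro"),
        "cohens_kappa": metrics.cohen_kappa_score,
    },
    "multilabel_classification": {
        "hamming_loss": metrics.hamming_loss,
        "subset_accuracy": metrics.accuracy_score,   # exact-match ratio on label sets
    },
    "regression": {
        "rmse": lambda y, p: metrics.mean_squared_error(y, p) ** 0.5,
        "mae": metrics.mean_absolute_error,
        "r2": metrics.r2_score,
    },
    "ranking": {
        "ndcg": metrics.ndcg_score,               # expects 2D arrays of relevances/scores
    },
    "probabilistic": {
        "log_loss": metrics.log_loss,
        "brier": metrics.brier_score_loss,
    },
}
```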
AUC-ROC and AUPRC evaluate across all thresholds. Precision, recall, and F1 require a specific threshold. If you haven't decided on a threshold, use threshold-independent metrics for model comparison; use threshold-dependent metrics for final deployment evaluation.
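The sketch below, again assuming scikit-learn and using synthetic scores, illustrates the distinction: AUC-ROC and PR-AUC consume scores directly, while F1 only exists once a threshold converts scores into labels.

```python
# Threshold-independent vs. threshold-dependent evaluation on synthetic scores.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(y_true * 0.3 + rng.normal(0.35, 0.25, size=1000), 0, 1)

# Threshold-independent: suitable for comparing candidate models.
print("ROC-AUC:", roc_auc_score(y_true, y_score))
print("PR-AUC :", average_precision_score(y_true, y_score))

# Threshold-dependent: suitable once a deployment threshold is chosen.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    print(f"F1 at threshold {threshold}: {f1_score(y_true, y_pred):.3f}")
```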
Class imbalance dramatically affects metric behavior. Metrics that look impressive on balanced data can be misleading on imbalanced data.
The Accuracy Trap:
With 99% negative class, a trivial "predict all negative" classifier achieves 99% accuracy. Accuracy fails as a useful signal.
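A quick synthetic demonstration of the trap (a minimal sketch, assuming scikit-learn): the trivial classifier scores 99% accuracy while detecting nothing, which recall and balanced accuracy expose immediately.

```python
# The accuracy trap on a 99%-negative synthetic dataset.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, balanced_accuracy_score

y_true = np.array([1] * 10 + [0] * 990)   # 1% positive class
y_trivial = np.zeros_like(y_true)         # predict all negative

print("Accuracy         :", accuracy_score(y_true, y_trivial))           # 0.99
print("Recall           :", recall_score(y_true, y_trivial))             # 0.0
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_trivial))  # 0.5
```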
| Metric | Imbalance Robust? | Notes |
|---|---|---|
| Accuracy | ❌ No | Dominated by majority class performance |
| Precision | ⚠️ Partial | Still meaningful, but depends on threshold |
| Recall | ⚠️ Partial | Meaningful for minority class detection |
| F1 Score | ⚠️ Partial | Better than accuracy, still threshold-dependent |
| AUC-ROC | ⚠️ Partial | Can be optimistic with high imbalance |
| AUPRC (PR-AUC) | ✅ Yes | More informative for imbalanced data |
| Matthews Correlation | ✅ Yes | Accounts for all confusion matrix elements |
| Cohen's Kappa | ✅ Yes | Corrects for chance agreement |
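The following sketch, assuming scikit-learn and a synthetic 99/1 class split, shows how the robust metrics in the table can tell a very different story from accuracy and ROC-AUC. The dataset and model are placeholders, not a benchmark.

```python
# Comparing imbalance-robust metrics against accuracy and ROC-AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, roc_auc_score, average_precision_score,
                             matthews_corrcoef, cohen_kappa_score)

X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]
preds = model.predict(X_te)

print("Accuracy:", accuracy_score(y_te, preds))            # inflated by the majority class
print("ROC-AUC :", roc_auc_score(y_te, scores))            # can look optimistic
print("PR-AUC  :", average_precision_score(y_te, scores))  # more informative here
print("MCC     :", matthews_corrcoef(y_te, preds))
print("Kappa   :", cohen_kappa_score(y_te, preds))
```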
Error cost asymmetry should directly inform metric selection.
Decision Framework:
| Error Cost Structure | Recommended Metrics |
|---|---|
| Symmetric costs | Accuracy, F1, AUC |
| FN >> FP (missing is costly) | Recall, sensitivity, AUPRC |
| FP >> FN (false alarms costly) | Precision, specificity, PPV |
| Instance-varying costs | Expected cost, weighted metrics |
| Unknown costs | AUC (threshold-independent), then threshold sweep |
The F-beta score with β = sqrt(C_FN / C_FP) balances precision and recall according to your cost ratio. F2 (β = 2) emphasizes recall and corresponds to a false negative being about four times as costly as a false positive; F0.5 (β = 0.5) emphasizes precision and corresponds to a false positive being about four times as costly as a false negative.
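To see where the square root comes from: F-beta is a weighted harmonic mean of precision and recall in which recall receives β² times the weight of precision, so matching that weight ratio to the cost ratio gives the formula used in the code below.

$$
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad
\frac{w_{\text{recall}}}{w_{\text{precision}}} = \beta^2
\;\Longrightarrow\;
\beta = \sqrt{\frac{C_{FN}}{C_{FP}}}
$$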
```python
import numpy as np


def recommend_metrics_from_costs(cost_fp, cost_fn, cost_tp=0, cost_tn=0):
    """
    Recommend appropriate metrics based on cost structure.
    """
    cost_ratio = cost_fn / cost_fp if cost_fp > 0 else float('inf')

    recommendations = {
        'primary_metrics': [],
        'secondary_metrics': [],
        'beta_for_fbeta': np.sqrt(cost_ratio),
        'optimal_threshold_formula': (
            f"t* = {cost_fp} / ({cost_fp} + {cost_fn}) = "
            f"{cost_fp / (cost_fp + cost_fn):.4f}"
        ),
        'cost_ratio': cost_ratio,
    }

    if 0.8 <= cost_ratio <= 1.2:
        # Approximately symmetric
        recommendations['primary_metrics'] = ['accuracy', 'F1', 'AUC-ROC']
        recommendations['secondary_metrics'] = ['balanced_accuracy']
        recommendations['rationale'] = "Symmetric costs: standard metrics apply"
    elif cost_ratio > 5:
        # FN much costlier
        recommendations['primary_metrics'] = ['recall', 'AUPRC', f'F{min(cost_ratio, 10):.1f}']
        recommendations['secondary_metrics'] = ['sensitivity', 'miss_rate']
        recommendations['rationale'] = f"FN {cost_ratio:.1f}x costlier: prioritize recall"
    elif cost_ratio < 0.2:
        # FP much costlier
        recommendations['primary_metrics'] = ['precision', 'specificity', 'F0.5']
        recommendations['secondary_metrics'] = ['PPV', 'false_discovery_rate']
        recommendations['rationale'] = f"FP {1/cost_ratio:.1f}x costlier: prioritize precision"
    else:
        # Moderate asymmetry
        recommendations['primary_metrics'] = ['F1', 'AUC-ROC', f'F{recommendations["beta_for_fbeta"]:.1f}']
        recommendations['secondary_metrics'] = ['precision', 'recall']
        recommendations['rationale'] = "Moderate asymmetry: use F-beta with appropriate beta"

    return recommendations


# Example scenarios
scenarios = [
    ("Medical screening", 10, 1000),   # FN very costly
    ("Spam filter", 100, 5),           # FP costly (lost email)
    ("Ad click prediction", 1, 1),     # ~Symmetric
    ("Fraud detection", 10, 150),      # FN quite costly
]

print("Metric Recommendations by Cost Structure")
print("=" * 60)
for name, cfp, cfn in scenarios:
    rec = recommend_metrics_from_costs(cfp, cfn)
    print(f"\n{name}:")
    print(f"  Cost ratio (FN/FP): {rec['cost_ratio']:.1f}")
    print(f"  Rationale: {rec['rationale']}")
    print(f"  Primary: {', '.join(rec['primary_metrics'])}")
    print(f"  F-beta: β = {rec['beta_for_fbeta']:.2f}")
```

Different stakeholders need different metrics:
For Data Scientists / ML Engineers: diagnostic, threshold-independent metrics (AUC-ROC, PR-AUC, log-loss, calibration) that support model comparison and debugging.
For Product Managers: threshold-dependent metrics at the planned operating point (precision, recall, false-alarm rate), expressed in terms of user experience.
For Executives: business-level outcomes (expected cost, savings, revenue impact) that the technical metrics serve as proxies for.
Selected metrics should be validated before committing to them:
Validation Questions:
Alignment Test: Do metric improvements correlate with business improvements in historical data?
Sensitivity Test: Is the metric sensitive enough to distinguish meaningfully different models?
Robustness Test: Does the metric behave consistently across data subsets and time periods?
Gaming Test: Can the metric be optimized in ways that hurt real objectives?
Interpretability Test: Can stakeholders understand what the metric means?
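As a rough illustration, the sensitivity and robustness tests can be automated with bootstrapping and per-subset evaluation. The sketch below assumes held-out labels and model scores; the function names, data, and segments are hypothetical, not a standard API.

```python
# Sensitivity (bootstrap confidence interval) and robustness (per-subset) checks.
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_metric(y_true, y_score, metric, n_boot=1000, seed=0):
    """Bootstrap a metric to estimate a 95% confidence interval (sensitivity test)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(y_true[idx])) < 2:   # skip degenerate resamples
            continue
        values.append(metric(y_true[idx], y_score[idx]))
    return np.percentile(values, [2.5, 97.5])


def metric_by_subset(y_true, y_score, groups, metric):
    """Compute the metric per data subset (robustness test)."""
    return {g: metric(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}


# If a competing model's point estimate falls inside this interval, the metric
# may not be sensitive enough to separate the two models.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
scores = np.clip(y * 0.4 + rng.normal(0.3, 0.2, 500), 0, 1)
groups = rng.choice(["segment_1", "segment_2"], 500)

print("AUC 95% CI    :", bootstrap_metric(y, scores, roc_auc_score))
print("AUC by segment:", metric_by_subset(y, scores, groups, roc_auc_score))
```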
Any metric can be gamed. Precision can be gamed by predicting only high-confidence positives, recall by predicting everything positive, and AUC by overfitting to the evaluation distribution. Always pair primary metrics with sanity-check metrics that catch gaming.
Establish a clear hierarchy: 1-2 primary metrics for decisions, 3-5 secondary metrics for monitoring, and sanity-check metrics for guardrails. More than this creates confusion.
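One way to record such a hierarchy is as explicit configuration next to the model. The metric names and alert bands below are illustrative for a hypothetical fraud-detection model, not prescriptive.

```python
# Hypothetical metric hierarchy recorded as configuration.
METRIC_HIERARCHY = {
    "primary": ["recall_at_1pct_fpr", "AUPRC"],                          # drive decisions
    "secondary": ["precision", "calibration_error", "latency_p95_ms"],   # monitored
    "sanity_checks": {
        "positive_rate": (0.005, 0.05),   # alert if the flagged rate leaves this band
        "roc_auc": (0.80, 1.00),          # alert if discrimination collapses
    },
}
```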
Step-by-Step Metric Selection:
1. Identify the problem type (classification, regression, ranking, probabilistic, anomaly detection) to narrow the candidate metric families.
2. Assess class balance and switch to imbalance-robust metrics (AUPRC, MCC, Cohen's Kappa) where needed.
3. Quantify the relative costs of false positives and false negatives and choose metrics, and β for F-beta, accordingly.
4. Map metrics to stakeholder needs so each audience gets numbers it can act on.
5. Validate the choices with the alignment, sensitivity, robustness, gaming, and interpretability tests.
6. Establish the hierarchy: 1-2 primary metrics, 3-5 secondary metrics, and sanity-check guardrails.
Congratulations! You've completed the Custom and Business Metrics module. You now understand cost-sensitive evaluation, threshold optimization, business alignment, multi-objective evaluation, and metric selection strategy. These skills enable you to design evaluation frameworks that capture what truly matters for your ML applications.