Selecting evaluation metrics is one of the most consequential decisions in ML development. The wrong metric leads to models optimized for the wrong objective—technically successful but practically useless.
The Selection Challenge:
With dozens of available metrics—accuracy, precision, recall, F1, AUC, log-loss, MAE, RMSE, NDCG, and countless domain-specific variants—how do you choose? The answer depends on your problem type, class balance, error cost structure, and the stakeholders who will consume the results.
This page provides a systematic framework for metric selection.
By the end of this page, you will have a systematic process for metric selection, understand when to use each major metric family, know how to validate metric choices, and be able to communicate metric decisions to stakeholders.
The first filter for metric selection is problem type. Different problem structures require fundamentally different metrics.
| Problem Type | Primary Metrics | When to Use |
|---|---|---|
| Binary Classification | AUC-ROC, F1, Precision/Recall | Two mutually exclusive classes |
| Multi-class Classification | Macro/Micro F1, Cohen's Kappa | Multiple mutually exclusive classes |
| Multi-label Classification | Hamming Loss, Subset Accuracy | Multiple non-exclusive labels per instance |
| Regression | RMSE, MAE, R² | Continuous numeric output |
| Ranking | NDCG, MAP, MRR | Ordered list of items |
| Probabilistic Prediction | Log-loss, Brier Score | Calibrated probability estimates needed |
| Anomaly Detection | Precision@k, AUPRC | Rare event detection, imbalanced |
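As a concrete starting point, here is a minimal sketch (assuming scikit-learn is available) that maps each problem type from the table to representative metric functions. The mapping and names are illustrative rather than exhaustive or canonical.

```python
# Illustrative mapping from problem type to representative scikit-learn metrics.
from sklearn import metrics

METRICS_BY_PROBLEM_TYPE = {
    "binary_classification": {
        "roc_auc": metrics.roc_auc_score,        # threshold-independent
        "f1": metrics.f1_score,                  # threshold-dependent
        "precision": metrics.precision_score,
        "recall": metrics.recall_score,
    },
    "multiclass_classification": {
        "macro_f1": lambda y, p: metrics.f1_score(y, p, average="macro"),
        "cohens_kappa": metrics.cohen_kappa_score,
    },
    "multilabel_classification": {
        "hamming_loss": metrics.hamming_loss,
        "subset_accuracy": metrics.accuracy_score,   # exact-match ratio on label sets
    },
    "regression": {
        "rmse": lambda y, p: metrics.mean_squared_error(y, p) ** 0.5,
        "mae": metrics.mean_absolute_error,
        "r2": metrics.r2_score,
    },
    "ranking": {
        "ndcg": metrics.ndcg_score,               # expects 2D arrays of relevances/scores
    },
    "probabilistic": {
        "log_loss": metrics.log_loss,
        "brier": metrics.brier_score_loss,
    },
}
```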
AUC-ROC and AUPRC evaluate across all thresholds. Precision, recall, and F1 require a specific threshold. If you haven't decided on a threshold, use threshold-independent metrics for model comparison; use threshold-dependent metrics for final deployment evaluation.
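The sketch below, again assuming scikit-learn and using synthetic scores, illustrates the distinction: AUC-ROC and PR-AUC consume scores directly, while F1 only exists once a threshold converts scores into labels.

```python
# Threshold-independent vs. threshold-dependent evaluation on synthetic scores.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(y_true * 0.3 + rng.normal(0.35, 0.25, size=1000), 0, 1)

# Threshold-independent: suitable for comparing candidate models.
print("ROC-AUC:", roc_auc_score(y_true, y_score))
print("PR-AUC :", average_precision_score(y_true, y_score))

# Threshold-dependent: suitable once a deployment threshold is chosen.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    print(f"F1 at threshold {threshold}: {f1_score(y_true, y_pred):.3f}")
```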
Class imbalance dramatically affects metric behavior. Metrics that look impressive on balanced data can be misleading on imbalanced data.
The Accuracy Trap:
With 99% negative class, a trivial "predict all negative" classifier achieves 99% accuracy. Accuracy fails as a useful signal.
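A quick synthetic demonstration of the trap (a minimal sketch, assuming scikit-learn): the trivial classifier scores 99% accuracy while detecting nothing, which recall and balanced accuracy expose immediately.

```python
# The accuracy trap on a 99%-negative synthetic dataset.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, balanced_accuracy_score

y_true = np.array([1] * 10 + [0] * 990)   # 1% positive class
y_trivial = np.zeros_like(y_true)         # predict all negative

print("Accuracy         :", accuracy_score(y_true, y_trivial))           # 0.99
print("Recall           :", recall_score(y_true, y_trivial))             # 0.0
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_trivial))  # 0.5
```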
| Metric | Imbalance Robust? | Notes |
|---|---|---|
| Accuracy | ❌ No | Dominated by majority class performance |
| Precision | ⚠️ Partial | Still meaningful, but depends on threshold |
| Recall | ⚠️ Partial | Meaningful for minority class detection |
| F1 Score | ⚠️ Partial | Better than accuracy, still threshold-dependent |
| AUC-ROC | ⚠️ Partial | Can be optimistic with high imbalance |
| AUPRC (PR-AUC) | ✅ Yes | More informative for imbalanced data |
| Matthews Correlation | ✅ Yes | Accounts for all confusion matrix elements |
| Cohen's Kappa | ✅ Yes | Corrects for chance agreement |
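The following sketch, assuming scikit-learn and a synthetic 99/1 class split, shows how the robust metrics in the table can tell a very different story from accuracy and ROC-AUC. The dataset and model are placeholders, not a benchmark.

```python
# Comparing imbalance-robust metrics against accuracy and ROC-AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, roc_auc_score, average_precision_score,
                             matthews_corrcoef, cohen_kappa_score)

X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]
preds = model.predict(X_te)

print("Accuracy:", accuracy_score(y_te, preds))            # inflated by the majority class
print("ROC-AUC :", roc_auc_score(y_te, scores))            # can look optimistic
print("PR-AUC  :", average_precision_score(y_te, scores))  # more informative here
print("MCC     :", matthews_corrcoef(y_te, preds))
print("Kappa   :", cohen_kappa_score(y_te, preds))
```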
Error cost asymmetry should directly inform metric selection.
Decision Framework:
| Error Cost Structure | Recommended Metrics |
|---|---|
| Symmetric costs | Accuracy, F1, AUC |
| FN >> FP (missing is costly) | Recall, sensitivity, AUPRC |
| FP >> FN (false alarms costly) | Precision, specificity, PPV |
| Instance-varying costs | Expected cost, weighted metrics |
| Unknown costs | AUC (threshold-independent), then threshold sweep |
The F-beta score with β = sqrt(C_FN / C_FP) balances precision and recall according to your cost ratio. F2 (β = 2) emphasizes recall and corresponds to a false negative being about four times as costly as a false positive; F0.5 (β = 0.5) emphasizes precision and corresponds to a false positive being about four times as costly as a false negative.
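To see where the square root comes from: F-beta is a weighted harmonic mean of precision and recall in which recall receives β² times the weight of precision, so matching that weight ratio to the cost ratio gives the formula used in the code below.

$$
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad
\frac{w_{\text{recall}}}{w_{\text{precision}}} = \beta^2
\;\Longrightarrow\;
\beta = \sqrt{\frac{C_{FN}}{C_{FP}}}
$$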
```python
import numpy as np


def recommend_metrics_from_costs(cost_fp, cost_fn, cost_tp=0, cost_tn=0):
    """
    Recommend appropriate metrics based on cost structure.
    """
    cost_ratio = cost_fn / cost_fp if cost_fp > 0 else float('inf')

    recommendations = {
        'primary_metrics': [],
        'secondary_metrics': [],
        'beta_for_fbeta': np.sqrt(cost_ratio),
        'optimal_threshold_formula': (
            f"t* = {cost_fp} / ({cost_fp} + {cost_fn}) = "
            f"{cost_fp / (cost_fp + cost_fn):.4f}"
        ),
        'cost_ratio': cost_ratio,
    }

    if 0.8 <= cost_ratio <= 1.2:
        # Approximately symmetric
        recommendations['primary_metrics'] = ['accuracy', 'F1', 'AUC-ROC']
        recommendations['secondary_metrics'] = ['balanced_accuracy']
        recommendations['rationale'] = "Symmetric costs: standard metrics apply"
    elif cost_ratio > 5:
        # FN much costlier
        recommendations['primary_metrics'] = ['recall', 'AUPRC', f'F{min(cost_ratio, 10):.1f}']
        recommendations['secondary_metrics'] = ['sensitivity', 'miss_rate']
        recommendations['rationale'] = f"FN {cost_ratio:.1f}x costlier: prioritize recall"
    elif cost_ratio < 0.2:
        # FP much costlier
        recommendations['primary_metrics'] = ['precision', 'specificity', 'F0.5']
        recommendations['secondary_metrics'] = ['PPV', 'false_discovery_rate']
        recommendations['rationale'] = f"FP {1/cost_ratio:.1f}x costlier: prioritize precision"
    else:
        # Moderate asymmetry
        recommendations['primary_metrics'] = ['F1', 'AUC-ROC', f'F{recommendations["beta_for_fbeta"]:.1f}']
        recommendations['secondary_metrics'] = ['precision', 'recall']
        recommendations['rationale'] = "Moderate asymmetry: use F-beta with appropriate beta"

    return recommendations


# Example scenarios
scenarios = [
    ("Medical screening", 10, 1000),   # FN very costly
    ("Spam filter", 100, 5),           # FP costly (lost email)
    ("Ad click prediction", 1, 1),     # ~Symmetric
    ("Fraud detection", 10, 150),      # FN quite costly
]

print("Metric Recommendations by Cost Structure")
print("=" * 60)
for name, cfp, cfn in scenarios:
    rec = recommend_metrics_from_costs(cfp, cfn)
    print(f"\n{name}:")
    print(f"  Cost ratio (FN/FP): {rec['cost_ratio']:.1f}")
    print(f"  Rationale: {rec['rationale']}")
    print(f"  Primary: {', '.join(rec['primary_metrics'])}")
    print(f"  F-beta: β = {rec['beta_for_fbeta']:.2f}")
```

Different stakeholders need different metrics:
For Data Scientists / ML Engineers: diagnostic, threshold-independent metrics (AUC-ROC, PR-AUC, log-loss, calibration) that support model comparison and debugging.
For Product Managers: threshold-dependent metrics at the planned operating point (precision, recall, false-alarm rate), expressed in terms of user experience.
For Executives: business-level outcomes (expected cost, savings, revenue impact) that the technical metrics serve as proxies for.
Selected metrics should be validated before committing to them:
Validation Questions:
Alignment Test: Do metric improvements correlate with business improvements in historical data?
Sensitivity Test: Is the metric sensitive enough to distinguish meaningfully different models?
Robustness Test: Does the metric behave consistently across data subsets and time periods?
Gaming Test: Can the metric be optimized in ways that hurt real objectives?
Interpretability Test: Can stakeholders understand what the metric means?
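As a rough illustration, the sensitivity and robustness tests can be automated with bootstrapping and per-subset evaluation. The sketch below assumes held-out labels and model scores; the function names, data, and segments are hypothetical, not a standard API.

```python
# Sensitivity (bootstrap confidence interval) and robustness (per-subset) checks.
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_metric(y_true, y_score, metric, n_boot=1000, seed=0):
    """Bootstrap a metric to estimate a 95% confidence interval (sensitivity test)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(y_true[idx])) < 2:   # skip degenerate resamples
            continue
        values.append(metric(y_true[idx], y_score[idx]))
    return np.percentile(values, [2.5, 97.5])


def metric_by_subset(y_true, y_score, groups, metric):
    """Compute the metric per data subset (robustness test)."""
    return {g: metric(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}


# If a competing model's point estimate falls inside this interval, the metric
# may not be sensitive enough to separate the two models.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
scores = np.clip(y * 0.4 + rng.normal(0.3, 0.2, 500), 0, 1)
groups = rng.choice(["segment_1", "segment_2"], 500)

print("AUC 95% CI    :", bootstrap_metric(y, scores, roc_auc_score))
print("AUC by segment:", metric_by_subset(y, scores, groups, roc_auc_score))
```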
Any metric can be gamed. Precision can be gamed by predicting only high-confidence positives, recall by predicting everything positive, and AUC by overfitting to the evaluation distribution. Always pair primary metrics with sanity-check metrics that catch gaming.
Establish a clear hierarchy: 1-2 primary metrics for decisions, 3-5 secondary metrics for monitoring, and sanity-check metrics for guardrails. More than this creates confusion.
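One way to record such a hierarchy is as explicit configuration next to the model. The metric names and alert bands below are illustrative for a hypothetical fraud-detection model, not prescriptive.

```python
# Hypothetical metric hierarchy recorded as configuration.
METRIC_HIERARCHY = {
    "primary": ["recall_at_1pct_fpr", "AUPRC"],                          # drive decisions
    "secondary": ["precision", "calibration_error", "latency_p95_ms"],   # monitored
    "sanity_checks": {
        "positive_rate": (0.005, 0.05),   # alert if the flagged rate leaves this band
        "roc_auc": (0.80, 1.00),          # alert if discrimination collapses
    },
}
```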
Step-by-Step Metric Selection:
1. Identify the problem type (classification, regression, ranking, probabilistic, anomaly detection) to narrow the candidate metric families.
2. Assess class balance and switch to imbalance-robust metrics (AUPRC, MCC, Cohen's Kappa) where needed.
3. Quantify the relative costs of false positives and false negatives and choose metrics, and β for F-beta, accordingly.
4. Map metrics to stakeholder needs so each audience gets numbers it can act on.
5. Validate the choices with the alignment, sensitivity, robustness, gaming, and interpretability tests.
6. Establish the hierarchy: 1-2 primary metrics, 3-5 secondary metrics, and sanity-check guardrails.
Congratulations! You've completed the Custom and Business Metrics module. You now understand cost-sensitive evaluation, threshold optimization, business alignment, multi-objective evaluation, and metric selection strategy. These skills enable you to design evaluation frameworks that capture what truly matters for your ML applications.