In the pristine world of textbook machine learning, all errors are created equal. A false positive carries the same weight as a false negative. Misclassifying one sample is just as bad as misclassifying another. But step into any real-world deployment, and this assumption crumbles instantly.
Consider these scenarios:
• A fraud model misses a fraudulent transaction: the bank absorbs a chargeback worth hundreds of dollars. It flags a legitimate one: an analyst spends a few dollars' worth of time and a customer is briefly inconvenienced.
• A diagnostic model misses a disease: treatment is delayed, sometimes with fatal consequences. It raises a false alarm: the patient undergoes an unnecessary follow-up test.
In each case, the costs of different error types are dramatically asymmetric. Yet standard accuracy treats them identically. This fundamental mismatch between evaluation metrics and real-world impact is what cost-sensitive evaluation addresses.
By the end of this page, you will understand how to formulate cost matrices that capture asymmetric error costs, compute expected cost metrics that reflect true business impact, design cost-sensitive training and evaluation pipelines, and apply these concepts to real-world systems where the consequences of different errors vary dramatically.
Cost-sensitive learning represents a fundamental paradigm shift from traditional machine learning. Instead of treating all predictions equally, we explicitly model the consequences of each prediction-outcome combination. This approach acknowledges a simple truth: in the real world, we don't minimize errors—we minimize losses.
The Core Insight:
Traditional evaluation asks: "How many mistakes did the model make?"
Cost-sensitive evaluation asks: "How much damage did the model's mistakes cause?"
This shift from counting errors to measuring impact requires us to define what "damage" means in our specific context—which brings us to the cost matrix.
Standard metrics like accuracy, precision, and recall implicitly assume equal error costs. When this assumption holds (rare in practice), these metrics work fine. When it doesn't (common in practice), optimizing these metrics can lead to models that perform well on paper but poorly in production—sometimes catastrophically so.
Historical Context:
The formal study of cost-sensitive learning emerged from decision theory and statistical decision making in the mid-20th century. Abraham Wald's work on statistical decision functions in the 1940s laid theoretical groundwork. The field gained momentum in machine learning through the 1990s and 2000s, driven by practical applications in fraud detection, medical diagnosis, and financial modeling where asymmetric costs were impossible to ignore.
Today, cost-sensitive evaluation is essential in any domain where error costs are asymmetric, classes are heavily imbalanced, or predictions trigger interventions with real monetary or human consequences.
The cost matrix (also called the loss matrix or misclassification cost matrix) is the fundamental building block of cost-sensitive evaluation. It explicitly encodes the cost associated with each combination of predicted class and actual class.
Binary Classification Cost Matrix:
For a binary classification problem with classes Positive (P) and Negative (N), the cost matrix C is typically structured as:
| | Actual Positive | Actual Negative |
|---|---|---|
| Predict Positive | C(P\|P) = Cost of True Positive | C(P\|N) = Cost of False Positive |
| Predict Negative | C(N\|P) = Cost of False Negative | C(N\|N) = Cost of True Negative |
Interpreting Cost Matrix Elements:
C(P|P) — Cost when we predict positive and the instance is actually positive (True Positive). Often set to 0 or a small negative value (representing benefit/reward).
C(P|N) — Cost when we predict positive but the instance is actually negative (False Positive). The cost of false alarms, unnecessary interventions, or wasted resources.
C(N|P) — Cost when we predict negative but the instance is actually positive (False Negative). The cost of missed detections, overlooked opportunities, or failed interventions.
C(N|N) — Cost when we predict negative and the instance is actually negative (True Negative). Often set to 0 (no cost for correct non-action).
Convention Note: There are two common conventions for cost matrices: a cost convention, in which errors carry positive costs and correct predictions cost zero (or a small amount), and a benefit (utility) convention, in which correct predictions earn positive rewards and the objective is maximization rather than minimization.
We'll use the cost convention throughout, where correct predictions typically have cost 0 and incorrect predictions have positive costs.
Cost matrices are related to but distinct from class weights. Class weights typically reweight the importance of each class equally across all instances of that class. Cost matrices allow different costs for different error types and can even vary by instance. Cost matrices are strictly more expressive.
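To make the relationship concrete, here is a minimal sketch of collapsing a cost matrix into class weights; the helper name is illustrative, and the conversion only works when correct predictions cost zero and costs do not vary by instance:

```python
# Minimal sketch: collapsing a cost matrix into class weights.
# Assumes the cost convention above (correct predictions cost 0).
# `cost_matrix_to_class_weights` is an illustrative name, not a library API.

def cost_matrix_to_class_weights(cost_matrix):
    """Derive per-class weights from a cost matrix.

    Only the error costs survive: misclassifying a positive costs C(N|P),
    so positives get weight 'fn'; negatives get weight 'fp'. Any nonzero
    TP/TN costs or per-instance variation is lost in the conversion,
    which is why cost matrices are strictly more expressive.
    """
    return {0: cost_matrix['fp'], 1: cost_matrix['fn']}

fraud_costs = {'tp': 5, 'fp': 7, 'fn': 150, 'tn': 0}
print(cost_matrix_to_class_weights(fraud_costs))  # {0: 7, 1: 150}
# Usable as, e.g., RandomForestClassifier(class_weight=cost_matrix_to_class_weights(fraud_costs))
```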
Concrete Example: Credit Card Fraud Detection
Let's construct a realistic cost matrix for fraud detection. Assume that investigating a flagged transaction costs about $5 of analyst time, a false alarm adds roughly $2 of customer friction on top of that investigation, and a missed fraud results in an average of $150 in chargebacks and associated losses:
| | Actual Fraud | Actual Legitimate |
|---|---|---|
| Predict Fraud | $5 (investigation) | $7 (investigation + friction) |
| Predict Legitimate | $150 (chargeback + losses) | $0 |
This cost matrix reveals a 21:1 ratio between false negative and false positive costs ($150 vs $7). Any evaluation metric that treats these errors equally fundamentally misunderstands the problem.
Deriving the Cost Ratio:
The cost ratio CR = C(N|P) / C(P|N) = 150/7 ≈ 21.4 tells us that a single missed fraud is as costly as approximately 21 false alarms. This ratio drives threshold selection, which we'll explore in detail later.
Once we've defined a cost matrix, we can compute the expected cost (or total cost) of a classifier's predictions. This becomes our primary evaluation metric, replacing or augmenting accuracy.
Total Cost Calculation:
Given a cost matrix C and a confusion matrix with counts TP, FP, FN, TN, the total cost is:
$$\text{Total Cost} = TP \cdot C(P|P) + FP \cdot C(P|N) + FN \cdot C(N|P) + TN \cdot C(N|N)$$
When correct predictions have zero cost (common convention):
$$\text{Total Cost} = FP \cdot C_{FP} + FN \cdot C_{FN}$$
where $C_{FP} = C(P|N)$ and $C_{FN} = C(N|P)$.
To compare models across datasets of different sizes, divide by the number of samples: Expected Cost Per Sample = Total Cost / N. This gives a cost rate that is directly comparable between test sets.
Cost-Weighted Confusion Matrix:
A powerful visualization combines the confusion matrix with the cost matrix to show the contribution of each cell to total cost:
| | Actual Fraud (10) | Actual Legitimate (990) | Cost Contribution |
|---|---|---|---|
| Predict Fraud | TP=8 × $5 = $40 | FP=50 × $7 = $350 | $390 |
| Predict Legitimate | FN=2 × $150 = $300 | TN=940 × $0 = $0 | $300 |
| Total | | | $690 |
Interpreting the Cost-Weighted Matrix:
This visualization immediately reveals where costs accumulate: the 50 false alarms contribute $350, while just 2 missed frauds contribute $300. Nearly half of the total cost comes from a handful of false negatives, a fact that raw error counts completely obscure.
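The cost-weighted matrix is simply the elementwise product of the confusion-matrix counts and the cost matrix; a minimal numpy sketch using the numbers from the fraud example above:

```python
import numpy as np

# Counts and costs from the fraud example, laid out as
# rows = predicted (fraud, legitimate), columns = actual (fraud, legitimate).
counts = np.array([[8,   50],    # predict fraud:      TP,  FP
                   [2,  940]])   # predict legitimate: FN,  TN
costs  = np.array([[5,    7],    # C(P|P), C(P|N)
                   [150,  0]])   # C(N|P), C(N|N)

cost_weighted = counts * costs   # elementwise: each cell's contribution to total cost
print(cost_weighted)             # [[ 40 350]
                                 #  [300   0]]
print("Total cost:", cost_weighted.sum())                       # 690
print("Cost per sample:", cost_weighted.sum() / counts.sum())   # 0.69
```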
Comparison to Accuracy:
This same classifier achieves: accuracy of 94.8% (948/1000 correct), recall of 80% (8 of 10 frauds caught), and precision of roughly 13.8% (8 of 58 fraud flags correct).
Accuracy looks excellent (94.8%!), but the model is still costing $0.69 per transaction. A different model with lower accuracy but fewer false negatives could have dramatically lower total cost.
```python
import numpy as np
from sklearn.metrics import confusion_matrix

def compute_expected_cost(y_true, y_pred, cost_matrix):
    """
    Compute total expected cost given predictions and a cost matrix.

    Parameters
    ----------
    y_true : array-like
        True labels (0 or 1)
    y_pred : array-like
        Predicted labels (0 or 1)
    cost_matrix : dict
        Dictionary with keys 'tp', 'fp', 'fn', 'tn' representing costs

    Returns
    -------
    dict
        Dictionary containing total_cost, cost_per_sample, and breakdown
    """
    # Compute confusion matrix elements
    # Note: sklearn's confusion_matrix returns [[TN, FP], [FN, TP]]
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()

    # Compute cost contributions
    cost_tp = tp * cost_matrix.get('tp', 0)
    cost_fp = fp * cost_matrix.get('fp', 0)
    cost_fn = fn * cost_matrix.get('fn', 0)
    cost_tn = tn * cost_matrix.get('tn', 0)

    total_cost = cost_tp + cost_fp + cost_fn + cost_tn
    n_samples = len(y_true)

    return {
        'total_cost': total_cost,
        'cost_per_sample': total_cost / n_samples,
        'breakdown': {
            'tp_contribution': cost_tp,
            'fp_contribution': cost_fp,
            'fn_contribution': cost_fn,
            'tn_contribution': cost_tn,
        },
        'confusion_matrix': {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn}
    }

# Example: Fraud detection cost matrix
fraud_cost_matrix = {
    'tp': 5,    # Investigation cost for caught fraud
    'fp': 7,    # Investigation + customer friction
    'fn': 150,  # Chargeback and losses from missed fraud
    'tn': 0     # No cost for correctly passing legitimate
}

# Simulated predictions
np.random.seed(42)
n_samples = 10000
fraud_rate = 0.01

y_true = np.random.binomial(1, fraud_rate, n_samples)

# Model A: Conservative (fewer false negatives, more false positives)
y_pred_conservative = y_true.copy()
# Add some false positives
false_positive_mask = np.random.random(n_samples) < 0.08
y_pred_conservative[false_positive_mask & (y_true == 0)] = 1
# Miss some frauds (10% false negative rate)
false_negative_mask = np.random.random(n_samples) < 0.10
y_pred_conservative[false_negative_mask & (y_true == 1)] = 0

# Model B: Aggressive (more false negatives, fewer false positives)
y_pred_aggressive = y_true.copy()
# Fewer false positives (2%)
false_positive_mask = np.random.random(n_samples) < 0.02
y_pred_aggressive[false_positive_mask & (y_true == 0)] = 1
# More missed frauds (30% false negative rate)
false_negative_mask = np.random.random(n_samples) < 0.30
y_pred_aggressive[false_negative_mask & (y_true == 1)] = 0

# Evaluate both models
result_conservative = compute_expected_cost(y_true, y_pred_conservative, fraud_cost_matrix)
result_aggressive = compute_expected_cost(y_true, y_pred_aggressive, fraud_cost_matrix)

print("=== Model Comparison (Cost-Sensitive Evaluation) ===\n")
print("Conservative Model (favors catching fraud):")
print(f"  Total Cost: ${result_conservative['total_cost']:,.2f}")
print(f"  Cost per Transaction: ${result_conservative['cost_per_sample']:.4f}")
print(f"  Confusion Matrix: {result_conservative['confusion_matrix']}")
print(f"  Cost Breakdown: {result_conservative['breakdown']}\n")

print("Aggressive Model (favors fewer investigations):")
print(f"  Total Cost: ${result_aggressive['total_cost']:,.2f}")
print(f"  Cost per Transaction: ${result_aggressive['cost_per_sample']:.4f}")
print(f"  Confusion Matrix: {result_aggressive['confusion_matrix']}")
print(f"  Cost Breakdown: {result_aggressive['breakdown']}")
```

Beyond raw expected cost, several derived metrics help us understand and compare cost-sensitive model performance.
1. Cost Reduction Ratio (CRR)
The Cost Reduction Ratio compares a model's cost to the cost of a baseline strategy (typically always predicting the majority class or a random classifier):
$$CRR = 1 - \frac{\text{Cost}_{\text{model}}}{\text{Cost}_{\text{baseline}}}$$
2. Expected Cost Per Action (ECPA)
When the "action" (positive prediction) triggers some intervention, ECPA measures the expected cost per intervention:
$$ECPA = \frac{TP \cdot C_{TP} + FP \cdot C_{FP}}{TP + FP}$$
This is useful when intervention capacity is limited and we want to prioritize cost-effective actions.
Some organizations prefer to report 'cost savings' relative to a no-model baseline rather than absolute costs. Both framings are valid; make sure stakeholders know which one is being used to avoid confusion.
3. Weighted Accuracy
Weighted accuracy incorporates costs into the accuracy calculation:
$$\text{Weighted Accuracy} = \frac{w_{TP} \cdot TP + w_{TN} \cdot TN}{w_{TP} \cdot TP + w_{TN} \cdot TN + w_{FP} \cdot FP + w_{FN} \cdot FN}$$
where weights are inversely related to costs (cells with lower cost receive higher weight). This maintains the intuitive 0-1 range of accuracy while incorporating cost awareness.
4. Normalized Expected Cost (NEC)
NEC normalizes the expected cost to the [0, 1] range:
$$NEC = \frac{\text{Cost}_{\text{model}} - \text{Cost}_{\text{best}}}{\text{Cost}_{\text{worst}} - \text{Cost}_{\text{best}}}$$
where Cost_best is the cost of a perfect classifier (0 if correct predictions are free) and Cost_worst is the maximum possible cost.
5. Cost Curves
Analogous to ROC curves, cost curves visualize classifier performance across the full range of cost ratios. For each operating point of the classifier, we plot the expected cost at different assumed cost ratios. This reveals which classifier is optimal for which cost assumptions—crucial when cost estimates are uncertain.
```python
import numpy as np
from sklearn.metrics import confusion_matrix

def cost_sensitive_metrics(y_true, y_pred, cost_matrix):
    """
    Compute comprehensive cost-sensitive evaluation metrics.

    Returns multiple perspectives on cost-sensitive performance.
    """
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    n = len(y_true)
    n_pos = sum(y_true)
    n_neg = n - n_pos

    # Extract costs (with defaults)
    c_tp = cost_matrix.get('tp', 0)
    c_fp = cost_matrix.get('fp', 1)
    c_fn = cost_matrix.get('fn', 1)
    c_tn = cost_matrix.get('tn', 0)

    # 1. Total and per-sample cost
    total_cost = tp * c_tp + fp * c_fp + fn * c_fn + tn * c_tn
    cost_per_sample = total_cost / n

    # 2. Baseline cost (always predict majority class)
    if n_pos > n_neg:
        # Majority is positive: predict all positive
        baseline_cost = n_pos * c_tp + n_neg * c_fp
    else:
        # Majority is negative: predict all negative
        baseline_cost = n_pos * c_fn + n_neg * c_tn

    # 3. Cost Reduction Ratio
    crr = 1 - (total_cost / baseline_cost) if baseline_cost > 0 else 0

    # 4. Expected Cost Per Action (per positive prediction)
    n_pred_pos = tp + fp
    ecpa = (tp * c_tp + fp * c_fp) / n_pred_pos if n_pred_pos > 0 else 0

    # 5. Weighted Accuracy (weights = inverse of costs, normalized)
    # Higher weight for cells with lower cost
    max_cost = max(c_tp, c_fp, c_fn, c_tn) + 1  # +1 to avoid division by zero
    w_tp = (max_cost - c_tp)
    w_fp = (max_cost - c_fp)
    w_fn = (max_cost - c_fn)
    w_tn = (max_cost - c_tn)
    weighted_acc = (w_tp * tp + w_tn * tn) / (w_tp * tp + w_tn * tn + w_fp * fp + w_fn * fn)

    # 6. Normalized Expected Cost
    # Best case: perfect predictions -> cost = c_tp * n_pos + c_tn * n_neg
    best_cost = n_pos * c_tp + n_neg * c_tn
    # Worst case: all wrong predictions
    worst_cost = n_pos * c_fn + n_neg * c_fp
    nec = (total_cost - best_cost) / (worst_cost - best_cost) if worst_cost != best_cost else 0

    return {
        'total_cost': total_cost,
        'cost_per_sample': cost_per_sample,
        'baseline_cost': baseline_cost,
        'cost_reduction_ratio': crr,
        'expected_cost_per_action': ecpa,
        'weighted_accuracy': weighted_acc,
        'normalized_expected_cost': nec,
        'standard_metrics': {
            'accuracy': (tp + tn) / n,
            'precision': tp / (tp + fp) if (tp + fp) > 0 else 0,
            'recall': tp / (tp + fn) if (tp + fn) > 0 else 0,
        }
    }

# Example usage
np.random.seed(42)
y_true = np.array([1]*100 + [0]*9900)  # 1% positive rate
np.random.shuffle(y_true)

# Simulate model predictions (80% recall, 10% precision)
y_pred = np.zeros_like(y_true)
y_pred[(y_true == 1)] = np.random.binomial(1, 0.80, sum(y_true))  # 80% TP rate
y_pred[(y_true == 0) & (np.random.random(len(y_true)) < 0.08)] = 1  # ~8% FP rate

cost_matrix = {'tp': 5, 'fp': 7, 'fn': 150, 'tn': 0}

metrics = cost_sensitive_metrics(y_true, y_pred, cost_matrix)

print("Cost-Sensitive Evaluation Metrics")
print("=" * 50)
print(f"Total Cost: ${metrics['total_cost']:,.2f}")
print(f"Cost per Sample: ${metrics['cost_per_sample']:.4f}")
print(f"Baseline Cost: ${metrics['baseline_cost']:,.2f}")
print(f"Cost Reduction Ratio: {metrics['cost_reduction_ratio']:.2%}")
print(f"Expected Cost per Action: ${metrics['expected_cost_per_action']:.2f}")
print(f"Weighted Accuracy: {metrics['weighted_accuracy']:.4f}")
print(f"Normalized Expected Cost: {metrics['normalized_expected_cost']:.4f}")
print(f"\nStandard Metrics for Reference:")
for k, v in metrics['standard_metrics'].items():
    print(f"  {k.capitalize()}: {v:.4f}")
```

The cost matrices we've discussed so far assume uniform costs—every false positive has the same cost, every false negative has the same cost. In reality, costs often vary by instance.
Examples of Instance-Dependent Costs:
Instance-dependent costs require us to move from a single cost matrix to per-instance cost functions.
Formal Framework:
Let $C_i = \begin{bmatrix} c_i^{TN} & c_i^{FP} \\ c_i^{FN} & c_i^{TP} \end{bmatrix}$ be the cost matrix for instance $i$.
The total expected cost becomes:
$$\text{Total Cost} = \sum_{i=1}^{n} C_i[y_i, \hat{y}_i]$$
where $y_i$ is the true label and $\hat{y}_i$ is the predicted label for instance $i$.
Practical Implementation:
Instance-dependent costs are typically implemented by:
• Feature-based cost functions: define costs as functions of instance features
• Cost columns in the data: store per-instance costs alongside the features in the dataset
• Lookup tables: map instance categories to cost matrices (a short sketch of the cost-column and lookup-table patterns follows below)
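As a complement to the longer feature-based example that follows, here is a minimal sketch of the cost-column and lookup-table patterns; the segment names, multipliers, and dollar values are illustrative assumptions, not figures from the text:

```python
import pandas as pd

# Hypothetical transactions with a customer segment and an amount column.
transactions = pd.DataFrame({
    'amount':  [25.0, 4200.0, 310.0],
    'segment': ['retail', 'business', 'retail'],
})

# Lookup table: each segment maps to its own cost parameters.
segment_costs = {
    'retail':   {'fp': 5.0,  'fn_multiplier': 1.5},
    'business': {'fp': 12.0, 'fn_multiplier': 2.0},  # costlier relationships
}

# Cost columns: materialize per-instance costs next to the features,
# so any evaluation routine can consume them directly.
transactions['cost_fp'] = transactions['segment'].map(lambda s: segment_costs[s]['fp'])
transactions['cost_fn'] = transactions.apply(
    lambda row: row['amount'] * segment_costs[row['segment']]['fn_multiplier'], axis=1
)
print(transactions)
```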
```python
import numpy as np
import pandas as pd

def compute_instance_dependent_cost(y_true, y_pred, cost_fn_false_neg, cost_fp_false_pos):
    """
    Compute total cost with instance-dependent costs.

    Parameters
    ----------
    y_true : array-like
        True labels
    y_pred : array-like
        Predicted labels
    cost_fn_false_neg : array-like
        Per-instance cost of false negative
    cost_fp_false_pos : array-like
        Per-instance cost of false positive

    Returns
    -------
    dict
        Cost breakdown and totals
    """
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    cost_fn = np.array(cost_fn_false_neg)
    cost_fp = np.array(cost_fp_false_pos)

    # Identify error types
    false_negatives = (y_true == 1) & (y_pred == 0)
    false_positives = (y_true == 0) & (y_pred == 1)

    # Compute costs
    fn_cost = cost_fn[false_negatives].sum()
    fp_cost = cost_fp[false_positives].sum()
    total_cost = fn_cost + fp_cost

    return {
        'total_cost': total_cost,
        'fn_cost': fn_cost,
        'fp_cost': fp_cost,
        'n_false_negatives': false_negatives.sum(),
        'n_false_positives': false_positives.sum(),
        'avg_fn_cost': cost_fn[false_negatives].mean() if false_negatives.any() else 0,
        'avg_fp_cost': cost_fp[false_positives].mean() if false_positives.any() else 0,
    }

# Create a realistic fraud detection dataset
np.random.seed(42)
n_transactions = 10000

# Generate transaction amounts (log-normal distribution)
amounts = np.random.lognormal(mean=4, sigma=1.5, size=n_transactions)
amounts = np.clip(amounts, 10, 50000)  # $10 to $50,000 range

# Generate fraud labels (higher amounts slightly more likely to be fraud)
fraud_prob = 0.005 + 0.0001 * (amounts / 1000)  # Base 0.5% + amount factor
is_fraud = np.random.random(n_transactions) < fraud_prob

# Instance-dependent costs
# FN cost: amount * 1.5 (chargeback + admin + reputation)
cost_false_negative = amounts * 1.5
# FP cost: fixed investigation cost + small customer friction
cost_false_positive = np.full(n_transactions, 5.0) + amounts * 0.001

print(f"Dataset Statistics:")
print(f"  Total transactions: {n_transactions:,}")
print(f"  Fraud rate: {is_fraud.mean():.2%}")
print(f"  Average transaction: ${amounts.mean():,.2f}")
print(f"  Average FN cost: ${cost_false_negative.mean():,.2f}")
print(f"  Average FP cost: ${cost_false_positive.mean():.2f}")
print()

# Simulate two models
# Model A: High recall, lower precision
pred_a = is_fraud.copy()
pred_a[np.random.random(n_transactions) < 0.05] = 1  # 5% extra FP
pred_a[(is_fraud) & (np.random.random(n_transactions) < 0.10)] = 0  # 10% FN

# Model B: Lower recall, higher precision
pred_b = is_fraud.copy()
pred_b[np.random.random(n_transactions) < 0.01] = 1  # 1% extra FP
pred_b[(is_fraud) & (np.random.random(n_transactions) < 0.25)] = 0  # 25% FN

# Evaluate with instance-dependent costs
result_a = compute_instance_dependent_cost(is_fraud, pred_a, cost_false_negative, cost_false_positive)
result_b = compute_instance_dependent_cost(is_fraud, pred_b, cost_false_negative, cost_false_positive)

print("Model A (High Recall):")
print(f"  Total Cost: ${result_a['total_cost']:,.2f}")
print(f"  FN Cost: ${result_a['fn_cost']:,.2f} ({result_a['n_false_negatives']} missed)")
print(f"  FP Cost: ${result_a['fp_cost']:,.2f} ({result_a['n_false_positives']} false alarms)")
print()

print("Model B (High Precision):")
print(f"  Total Cost: ${result_b['total_cost']:,.2f}")
print(f"  FN Cost: ${result_b['fn_cost']:,.2f} ({result_b['n_false_negatives']} missed)")
print(f"  FP Cost: ${result_b['fp_cost']:,.2f} ({result_b['n_false_positives']} false alarms)")
print()

print("Winner:", "Model A" if result_a['total_cost'] < result_b['total_cost'] else "Model B")
print(f"Cost Difference: ${abs(result_a['total_cost'] - result_b['total_cost']):,.2f}")
```

Instance-dependent costs are powerful but require careful estimation. Overestimating costs for certain instances can lead to models that over-prioritize those cases. Regularly validate cost assumptions against actual business outcomes.
So far, we've focused on cost-sensitive evaluation—measuring model performance using cost-aware metrics. But we can go further: cost-sensitive training incorporates costs directly into the learning process.
Three Main Approaches:
• Data-level: resample or reweight training examples in proportion to their misclassification costs.
• Algorithm-level: pass costs directly to the learner, for example through sample weights or class weights.
• Decision-level: train a probabilistic model as usual, then shift the decision threshold according to the cost ratio (explored in the calibration discussion below).
The code below compares standard training against the sample-weight and class-weight variants.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def compute_sample_weights_from_costs(y, cost_fp, cost_fn):
    """
    Compute sample weights from cost matrix.
    Weight samples by the cost of misclassifying them.
    """
    weights = np.ones(len(y), dtype=float)
    # Positive samples: weight by cost of false negative
    weights[y == 1] = cost_fn
    # Negative samples: weight by cost of false positive
    weights[y == 0] = cost_fp
    # Normalize weights to sum to len(y) for stability
    weights = weights * len(y) / weights.sum()
    return weights

def train_and_evaluate_cost_sensitive(X, y, cost_matrix, test_size=0.3):
    """
    Compare standard vs cost-sensitive training.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42
    )
    cost_fp = cost_matrix['fp']
    cost_fn = cost_matrix['fn']

    # Compute sample weights for training
    sample_weights = compute_sample_weights_from_costs(y_train, cost_fp, cost_fn)

    # Standard training (no weights)
    model_standard = RandomForestClassifier(n_estimators=100, random_state=42)
    model_standard.fit(X_train, y_train)

    # Cost-sensitive training (with weights)
    model_weighted = RandomForestClassifier(n_estimators=100, random_state=42)
    model_weighted.fit(X_train, y_train, sample_weight=sample_weights)

    # Also try class_weight parameter
    # Compute class weights from cost ratio
    cost_ratio = cost_fn / cost_fp
    model_class_weight = RandomForestClassifier(
        n_estimators=100,
        class_weight={0: 1, 1: cost_ratio},
        random_state=42
    )
    model_class_weight.fit(X_train, y_train)

    # Evaluate all models
    results = {}
    for name, model in [
        ('Standard', model_standard),
        ('Sample Weighted', model_weighted),
        ('Class Weighted', model_class_weight)
    ]:
        y_pred = model.predict(X_test)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
        total_cost = fp * cost_fp + fn * cost_fn
        results[name] = {
            'confusion_matrix': {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn},
            'total_cost': total_cost,
            'accuracy': (tp + tn) / len(y_test),
            'recall': tp / (tp + fn) if (tp + fn) > 0 else 0,
            'precision': tp / (tp + fp) if (tp + fp) > 0 else 0,
        }
    return results

# Generate synthetic imbalanced data
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=10,
    n_redundant=5, n_clusters_per_class=2,
    weights=[0.95, 0.05],  # 5% positive class
    random_state=42
)

# Define cost matrix with high asymmetry
cost_matrix = {'tp': 0, 'fp': 10, 'fn': 100, 'tn': 0}  # 10:1 cost ratio

results = train_and_evaluate_cost_sensitive(X, y, cost_matrix)

print("Cost-Sensitive Training Comparison")
print("=" * 60)
print(f"Cost Matrix: FP=${cost_matrix['fp']}, FN=${cost_matrix['fn']}")
print()

for name, metrics in results.items():
    print(f"{name}:")
    print(f"  Total Cost: ${metrics['total_cost']:,.2f}")
    print(f"  Accuracy: {metrics['accuracy']:.4f}")
    print(f"  Recall: {metrics['recall']:.4f}")
    print(f"  Precision: {metrics['precision']:.4f}")
    cm = metrics['confusion_matrix']
    print(f"  FP: {cm['fp']}, FN: {cm['fn']}")
    print()
```

Cost-sensitive evaluation often relies on probability estimates rather than hard predictions. When a model outputs P(Fraud) = 0.15, we need that probability to be calibrated—meaning that among transactions where the model predicts 15% fraud probability, approximately 15% should actually be fraud.
Why Calibration Matters for Costs:
The optimal decision rule given costs is:
$$\text{Predict Positive if } P(y=1|x) > \frac{C_{FP}}{C_{FP} + C_{FN}}$$
This threshold, called the cost-proportionate threshold, requires well-calibrated probabilities to work correctly. If probabilities are systematically over- or under-estimated, the threshold won't achieve optimal expected cost.
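Where this threshold comes from, in one step (assuming the zero-cost-for-correct-predictions convention used throughout this page): compare the expected cost of each possible action.

$$E[\text{cost} \mid \text{predict positive}] = \big(1 - P(y=1|x)\big)\, C_{FP}, \qquad E[\text{cost} \mid \text{predict negative}] = P(y=1|x)\, C_{FN}$$

Predicting positive is the cheaper action exactly when $\big(1 - P(y=1|x)\big) C_{FP} < P(y=1|x)\, C_{FN}$, which rearranges to $P(y=1|x) > \frac{C_{FP}}{C_{FP} + C_{FN}}$.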
With cost ratio CR = C_FN / C_FP = 15, we should predict positive when P(positive) > 1 / (1 + 15) = 0.0625. This is much lower than the default 0.5 threshold, reflecting that missing positives is 15x more costly than false alarms.
Common Calibration Methods: Platt scaling (fitting a sigmoid to the model's scores) and isotonic regression (fitting a monotone, piecewise-constant mapping) are the two workhorses; both are available through scikit-learn's CalibratedClassifierCV, as used in the code below.
Evaluating Calibration: the Brier score and reliability diagrams (calibration curves) measure how closely predicted probabilities track observed frequencies; the code below reports both alongside the resulting total cost.
```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

def cost_optimal_threshold(cost_fp, cost_fn):
    """Compute the cost-optimal decision threshold."""
    return cost_fp / (cost_fp + cost_fn)

def cost_optimal_predictions(y_proba, cost_fp, cost_fn):
    """
    Make predictions using the cost-optimal threshold.

    Parameters
    ----------
    y_proba : array-like
        Predicted probabilities for positive class
    cost_fp, cost_fn : float
        Costs of false positive and false negative

    Returns
    -------
    array
        Binary predictions using cost-optimal threshold
    """
    threshold = cost_optimal_threshold(cost_fp, cost_fn)
    return (y_proba >= threshold).astype(int)

def evaluate_calibration_for_costs(X, y, cost_matrix):
    """
    Compare calibrated vs uncalibrated models for cost-sensitive decisions.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )
    cost_fp = cost_matrix['fp']
    cost_fn = cost_matrix['fn']
    optimal_threshold = cost_optimal_threshold(cost_fp, cost_fn)

    # Uncalibrated model
    model_raw = RandomForestClassifier(n_estimators=100, random_state=42)
    model_raw.fit(X_train, y_train)

    # Calibrated model (Platt scaling)
    model_platt = CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=100, random_state=42),
        method='sigmoid', cv=5
    )
    model_platt.fit(X_train, y_train)

    # Calibrated model (Isotonic)
    model_isotonic = CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=100, random_state=42),
        method='isotonic', cv=5
    )
    model_isotonic.fit(X_train, y_train)

    results = {}
    for name, model in [
        ('Uncalibrated', model_raw),
        ('Platt Calibrated', model_platt),
        ('Isotonic Calibrated', model_isotonic)
    ]:
        # Get probabilities
        y_proba = model.predict_proba(X_test)[:, 1]
        # Make cost-optimal predictions
        y_pred = cost_optimal_predictions(y_proba, cost_fp, cost_fn)

        # Compute confusion matrix
        from sklearn.metrics import confusion_matrix
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

        # Compute costs
        total_cost = fp * cost_fp + fn * cost_fn

        # Compute calibration metrics
        brier = brier_score_loss(y_test, y_proba)
        # Reliability diagram data
        prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=10)
        calibration_error = np.mean(np.abs(prob_true - prob_pred))

        results[name] = {
            'total_cost': total_cost,
            'brier_score': brier,
            'mean_calibration_error': calibration_error,
            'confusion': {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn},
            'threshold_used': optimal_threshold,
        }
    return results

# Generate imbalanced data
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=5000, n_features=15, n_informative=8, n_redundant=4,
    weights=[0.95, 0.05], random_state=42
)

cost_matrix = {'fp': 10, 'fn': 150}

print(f"Cost-Optimal Threshold: {cost_optimal_threshold(cost_matrix['fp'], cost_matrix['fn']):.4f}")
print(f"(Compare to default 0.5)\n")

results = evaluate_calibration_for_costs(X, y, cost_matrix)

print("Calibration Impact on Cost-Sensitive Decisions")
print("=" * 60)

for name, metrics in results.items():
    print(f"\n{name}:")
    print(f"  Total Cost: ${metrics['total_cost']:,.2f}")
    print(f"  Brier Score: {metrics['brier_score']:.4f}")
    print(f"  Mean Calibration Error: {metrics['mean_calibration_error']:.4f}")
    cm = metrics['confusion']
    print(f"  Confusion: TP={cm['tp']}, FP={cm['fp']}, FN={cm['fn']}, TN={cm['tn']}")
```

Cost-sensitive evaluation and training are essential across numerous domains. Let's examine how these techniques apply to specific real-world challenges.
Medical Diagnosis:
Healthcare presents some of the most dramatic cost asymmetries: a false negative (a missed disease) can mean delayed treatment or loss of life, while a false positive typically means follow-up testing, patient anxiety, and added expense. Because the cost of a missed diagnosis usually dwarfs the cost of an unnecessary test, realistic cost matrices in screening settings put the false-negative cost far above the false-positive cost, and decision thresholds are pushed strongly toward high recall.
Cost-sensitive evaluation transforms model assessment from an academic exercise into a business-aligned practice. Let's consolidate the key principles:
• Encode the real consequences of each prediction-outcome pair in a cost matrix grounded in business data, not guesswork.
• Evaluate and compare models by expected cost (total or per sample), using accuracy, precision, and recall only as supporting context.
• Use calibrated probabilities and the cost-proportionate threshold rather than the default 0.5 cutoff.
• Move to instance-dependent costs when the stakes vary from case to case, and revisit cost estimates as the business changes.
Common Pitfalls:
• Using arbitrary cost ratios without business justification
• Ignoring instance-dependent cost variation
• Optimizing costs on the test set without proper validation
• Assuming costs are stable when they may change over time
• Neglecting indirect and long-term costs (reputation, customer lifetime value)
What's Next:
With cost-sensitive evaluation established, we'll next explore threshold optimization—the systematic process of selecting decision thresholds that minimize expected cost or maximize business value. This natural extension connects cost matrices to actionable deployment decisions.
You now understand the foundations of cost-sensitive evaluation: cost matrices, expected cost metrics, instance-dependent costs, and cost-sensitive training. You can formulate costs that reflect real-world consequences and evaluate models based on their true business impact rather than abstract accuracy measures.