MSE and MAE both use the mean to aggregate errors. But means have a well-known weakness: they're sensitive to extreme values. A single catastrophic prediction can dominate your entire metric.
The median offers a robust alternative. By definition, the median is the value that splits data in half—50% above, 50% below. It completely ignores the magnitude of extreme values, caring only about how many values are above or below.
Median-based error metrics provide a clear picture of typical model performance, unaffected by the handful of cases where your model fails spectacularly.
By the end of this page, you will understand:

- Median Absolute Error (MedAE) and its properties
- Why medians provide robustness to outliers
- Median Absolute Deviation (MAD) as a robust scale measure
- Quantile-based error metrics (beyond just the median)
- When to prefer median metrics over mean metrics
- How to combine median and mean metrics for comprehensive evaluation
Median Absolute Error is simply the median of all absolute errors:
$$\text{MedAE} = \text{median}(|y_1 - \hat{y}_1|, |y_2 - \hat{y}_2|, ..., |y_n - \hat{y}_n|)$$
Compare this to MAE, which averages the same absolute errors instead of taking their median:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
The difference is profound in the presence of outliers.
```python
import numpy as np

def median_absolute_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Median Absolute Error.

    MedAE is the median of all absolute residuals.
    It's robust to outliers and represents the 'typical' error.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    absolute_errors = np.abs(y_true - y_pred)
    return np.median(absolute_errors)

def compare_mae_medae(y_true: np.ndarray, y_pred: np.ndarray):
    """
    Compare MAE and MedAE, showing robustness difference.
    """
    absolute_errors = np.abs(y_true - y_pred)
    mae = np.mean(absolute_errors)
    medae = np.median(absolute_errors)

    print("Absolute errors:", np.sort(absolute_errors))
    print(f"MAE (mean):      {mae:.2f}")
    print(f"MedAE (median):  {medae:.2f}")
    print(f"Ratio MAE/MedAE: {mae/medae:.2f}x")

    if mae > 1.5 * medae:
        print("⚠️ MAE >> MedAE suggests outlier errors are inflating the mean")

    return mae, medae

# Example: Good predictions with one catastrophic failure
print("=== Example with Outlier ===")
y_true = np.array([100, 110, 95, 105, 108, 100, 102, 97, 103, 1000])  # Last is outlier
y_pred = np.array([98, 112, 93, 107, 105, 101, 104, 95, 100, 200])    # Big miss on outlier

compare_mae_medae(y_true, y_pred)

print("\n" + "="*50)
print("=== Same data without outlier ===")
compare_mae_medae(y_true[:-1], y_pred[:-1])
```

Key Properties of MedAE
- **50% Breakdown Point:** Up to 50% of predictions can be arbitrarily wrong before MedAE is affected. (MAE's breakdown point is 0%—a single outlier affects it.)
- **Same Units as Target:** Like MAE, MedAE is in the original units of y.
- **Interpretation:** 'Half of predictions are within ±MedAE of the true value' (approximately).
- **Computational Simplicity:** Just sort the errors and take the middle value(s).
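To make the 'half of predictions' interpretation concrete, here is a small illustrative sketch (not part of the lesson's code) that checks what fraction of absolute errors falls at or below MedAE on synthetic data:

```python
import numpy as np

# Illustrative check: about half of the absolute errors sit at or below MedAE
rng = np.random.default_rng(0)
abs_errors = np.abs(rng.normal(0, 5, size=1001))  # odd n, so the median is an actual sample

medae = np.median(abs_errors)
frac_within = np.mean(abs_errors <= medae)

print(f"MedAE: {medae:.2f}")
print(f"Fraction of errors <= MedAE: {frac_within:.3f}")  # ~0.5 by definition
```

The fraction is only approximately 0.5 because ties and even sample sizes can shift it slightly.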
| Property | MAE | MedAE |
|---|---|---|
| Central tendency | Mean of absolute errors | Median of absolute errors |
| Outlier sensitivity | High | None (robust) |
| Breakdown point | 0% | 50% |
| Interpretation | Average absolute error | Typical absolute error |
| Differentiable | Subgradient only | Not differentiable |
| Optimization | Can minimize directly | Difficult to optimize |
MedAE is particularly valuable when you expect your model to fail badly on some samples (outliers, distribution shift, edge cases) but want to understand typical performance. It answers: 'Ignoring my worst predictions, how well am I doing?'
Robustness in statistics measures how much extreme observations affect an estimator. The key concept is the breakdown point.
Breakdown Point Definition
The breakdown point is the proportion of arbitrary contamination an estimator can handle before giving arbitrarily bad results.
```python
import numpy as np

def breakdown_point_demonstration():
    """
    Show the breakdown point difference between mean and median.
    """
    print("=== Breakdown Point Demonstration ===")

    # Original errors (all small)
    n = 21
    original_errors = np.abs(np.random.randn(n) * 5)
    print(f"Original {n} errors (all reasonable):")
    print(f"  MAE:   {np.mean(original_errors):.2f}")
    print(f"  MedAE: {np.median(original_errors):.2f}")

    # Contaminate with different proportions
    contamination_levels = [0.05, 0.10, 0.20, 0.40, 0.50]

    print(f"\nEffect of contamination (replacing with value 1,000,000):")
    print(f"{'Contamination':>15} | {'MAE':>12} | {'MedAE':>12} | {'MAE stable?':>12}")
    print("-" * 60)

    for frac in contamination_levels:
        errors = original_errors.copy()
        n_contaminated = int(frac * n)
        errors[:n_contaminated] = 1_000_000  # Extreme outliers

        mae = np.mean(errors)
        medae = np.median(errors)

        # Is MAE "stable"? (within 10x of original)
        mae_stable = "Yes" if mae < 10 * np.mean(original_errors) else "No (broken)"

        print(f"{frac*100:>14.0f}% | {mae:>12.0f} | {medae:>12.2f} | {mae_stable:>12}")

    print("\n*** Key Insight ***")
    print("MAE is 'broken' immediately with any outliers.")
    print("MedAE remains stable until contamination exceeds 50%.")

breakdown_point_demonstration()

def influence_function_intuition():
    """
    Show how each observation influences MAE vs MedAE.
    """
    print("=== Influence of Individual Observations ===")

    errors = np.array([2, 3, 4, 5, 6])
    print(f"Original errors: {errors}")
    print(f"MAE: {np.mean(errors):.2f}, MedAE: {np.median(errors):.2f}")

    # Change the largest error progressively
    print(f"\nChange last value from 6 to...")
    for new_val in [6, 10, 20, 50, 100, 1000]:
        modified = errors.copy()
        modified[-1] = new_val
        print(f"  {modified}: MAE={np.mean(modified):.2f}, MedAE={np.median(modified):.2f}")

    print("→ MedAE is completely unchanged by the last value's magnitude!")

influence_function_intuition()
```

While MedAE measures typical error magnitude, sometimes we want to measure error dispersion—how spread out the errors are. The standard deviation does this but is sensitive to outliers.
Median Absolute Deviation (MAD) provides a robust measure of dispersion:
$$\text{MAD} = \text{median}(|X_i - \text{median}(X)|)$$
For regression residuals: $$\text{MAD}_{residuals} = \text{median}(|e_i - \text{median}(e)|)$$
Where $e_i = y_i - \hat{y}_i$ are the residuals.
```python
import numpy as np

def median_absolute_deviation(x: np.ndarray, scale: bool = True) -> float:
    """
    Compute Median Absolute Deviation (MAD).

    Parameters:
    -----------
    x : np.ndarray
        Data array
    scale : bool
        If True, scale MAD to be comparable to std dev
        (multiply by 1.4826 for normal data)
    """
    x = np.asarray(x)
    median = np.median(x)
    mad = np.median(np.abs(x - median))

    if scale:
        # Scale factor makes MAD consistent with std dev for normal data
        mad *= 1.4826

    return mad

def compare_std_mad(data: np.ndarray, name: str = "Data"):
    """
    Compare standard deviation and MAD.
    """
    std = np.std(data, ddof=1)
    mad = median_absolute_deviation(data, scale=True)
    mad_unscaled = median_absolute_deviation(data, scale=False)

    print(f"=== {name} ===")
    print(f"Mean: {np.mean(data):.2f}, Median: {np.median(data):.2f}")
    print(f"Std Dev:        {std:.2f}")
    print(f"MAD (unscaled): {mad_unscaled:.2f}")
    print(f"MAD (scaled):   {mad:.2f}")
    print(f"Std/MAD ratio:  {std/mad:.2f}x")

    if std > 2 * mad:
        print("⚠️ Std >> MAD suggests outliers are inflating variance")

    return std, mad

# Example 1: Normal data
np.random.seed(42)
normal_data = np.random.randn(100)
compare_std_mad(normal_data, "Normal Distribution (no outliers)")

# Example 2: Data with outliers
contaminated = normal_data.copy()
contaminated[0:5] = [10, -10, 15, -15, 20]  # Add outliers
compare_std_mad(contaminated, "Contaminated with 5% outliers")

# Example 3: Regression residuals
print("\n" + "="*50)
print("=== Application to Regression Residuals ===")

y_true = np.array([100, 105, 98, 102, 110, 95, 108, 103, 500, 50])  # Two outliers
y_pred = np.array([101, 104, 99, 101, 108, 97, 106, 105, 120, 95])

residuals = y_true - y_pred
print(f"Residuals: {residuals}")
compare_std_mad(residuals, "Regression Residuals")
```

The 1.4826 Scale Factor
For normally distributed data, the raw MAD works out to about $0.6745\,\sigma$, because the 75th percentile of the standard normal is $\Phi^{-1}(0.75) \approx 0.6745$. Multiplying by $1/0.6745 \approx 1.4826$ therefore rescales MAD into an estimate of the standard deviation:

$$\hat{\sigma} \approx 1.4826 \times \text{MAD}$$
This scaled MAD is a consistent estimator of standard deviation for normal data—but much more robust.
Use Cases for MAD
A common robust outlier rule: points where |x - median| > 3 × MAD are outliers. This is more reliable than the mean ± 3σ rule, which is itself affected by outliers.
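The 3 × MAD rule can be sketched in a few lines. This is an illustrative implementation (the helper name mad_outlier_mask is ours, not a library function), using the raw, unscaled MAD:

```python
import numpy as np

def mad_outlier_mask(x: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag points where |x - median| > k * MAD (raw, unscaled MAD)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) > k * mad

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 250.0])  # one obvious outlier
mask = mad_outlier_mask(data)
print(data[mask])  # -> [250.]
```

Note that because both the median and the MAD ignore the outlier's magnitude, the threshold itself stays sensible even when the contamination is extreme.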
The median is the 50th percentile, but we can generalize to any quantile. Quantile-based error metrics give a richer picture of error distribution than any single summary statistic.
Common Quantile Metrics
```python
import numpy as np
from typing import Dict, List

def quantile_error_profile(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    quantiles: List[float] = [0.50, 0.75, 0.90, 0.95, 0.99]
) -> Dict[str, float]:
    """
    Compute error at multiple quantiles for comprehensive view.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    absolute_errors = np.abs(y_true - y_pred)

    results = {
        'MAE': np.mean(absolute_errors),
        'MedAE': np.median(absolute_errors),
        'Max': np.max(absolute_errors),
    }
    for q in quantiles:
        key = f'P{int(q*100)}'
        results[key] = np.percentile(absolute_errors, q * 100)

    return results

def comprehensive_error_report(y_true: np.ndarray, y_pred: np.ndarray, name: str = "Model"):
    """
    Generate comprehensive quantile-based error report.
    """
    profile = quantile_error_profile(y_true, y_pred)

    print(f"╔═══════════════════════════════════════════════╗")
    print(f"║ Error Profile: {name:^28} ║")
    print(f"╠═══════════════════════════════════════════════╣")
    print(f"║ {'Statistic':15} │ {'Value':>10} │ {'Interpretation':^15} ║")
    print(f"╠═══════════════════════════════════════════════╣")
    print(f"║ {'MAE':15} │ {profile['MAE']:>10.2f} │ {'Average error':^15} ║")
    print(f"║ {'P50 (Median)':15} │ {profile['P50']:>10.2f} │ {'Typical error':^15} ║")
    print(f"║ {'P75':15} │ {profile['P75']:>10.2f} │ {'75% ≤ this':^15} ║")
    print(f"║ {'P90':15} │ {profile['P90']:>10.2f} │ {'90% ≤ this':^15} ║")
    print(f"║ {'P95':15} │ {profile['P95']:>10.2f} │ {'95% ≤ this':^15} ║")
    print(f"║ {'P99':15} │ {profile['P99']:>10.2f} │ {'99% ≤ this':^15} ║")
    print(f"║ {'Max':15} │ {profile['Max']:>10.2f} │ {'Worst case':^15} ║")
    print(f"╚═══════════════════════════════════════════════╝")

    # Analysis
    print(f"\nAnalysis:")
    ratio_95_50 = profile['P95'] / profile['P50']
    ratio_max_95 = profile['Max'] / profile['P95']

    print(f"  P95/P50 ratio: {ratio_95_50:.1f}x — ", end="")
    if ratio_95_50 < 2:
        print("tight distribution")
    elif ratio_95_50 < 4:
        print("moderate spread")
    else:
        print("heavy right tail (some large errors)")

    print(f"  Max/P95 ratio: {ratio_max_95:.1f}x — ", end="")
    if ratio_max_95 < 2:
        print("no extreme outliers")
    else:
        print("extreme outliers present")

    return profile

# Example with realistic model predictions
np.random.seed(42)
n = 1000

# Good model with occasional failures
y_true = np.random.uniform(100, 500, n)
# Most predictions within 5%, but some failures
noise = np.random.randn(n)
noise[noise > 2] *= 5  # Inflate tail errors
y_pred = y_true * (1 + noise * 0.05)

comprehensive_error_report(y_true, y_pred, "Production Model")
```

Interquartile Range of Errors (IQR)
The IQR measures the middle 50% of errors: $$\text{IQR} = P_{75} - P_{25}$$
This is a robust measure of error spread, playing the same role for dispersion that MAD does.
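As a quick illustration (synthetic numbers, not from the lesson), the IQR of a set of absolute errors barely moves when one extreme error is added, while the standard deviation explodes:

```python
import numpy as np

# Illustrative sketch: IQR of absolute errors, with and without an outlier
abs_errors = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0])

p25, p75 = np.percentile(abs_errors, [25, 75])
iqr = p75 - p25
print(f"IQR: {iqr:.2f}")

# Adding one huge error barely moves the IQR, unlike the standard deviation
with_outlier = np.append(abs_errors, 500.0)
p25o, p75o = np.percentile(with_outlier, [25, 75])
print(f"IQR with outlier: {p75o - p25o:.2f}")
print(f"Std without/with outlier: {np.std(abs_errors):.2f} / {np.std(with_outlier):.2f}")
```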
In production ML systems, quantile metrics directly translate to SLAs. 'P95 latency < 200ms' or 'P99 error < $50' are common SLA formats. Understanding quantile error profiles helps you set and meet realistic service commitments.
The relationship between MAE and MedAE reveals important information about your error distribution.
The MAE/MedAE Ratio
For symmetric distributions, MAE ≈ MedAE. Deviations indicate asymmetry or outliers:
```python
import numpy as np
from scipy import stats

def diagnose_error_distribution(y_true: np.ndarray, y_pred: np.ndarray):
    """
    Use MAE/MedAE ratio and other statistics to diagnose error distribution.
    """
    residuals = y_true - y_pred
    abs_errors = np.abs(residuals)

    mae = np.mean(abs_errors)
    medae = np.median(abs_errors)
    std = np.std(abs_errors)
    ratio = mae / medae

    # Distribution characteristics
    skewness = stats.skew(abs_errors)
    kurtosis = stats.kurtosis(abs_errors)  # Excess kurtosis

    print("=== Error Distribution Diagnosis ===")
    print(f"MAE:    {mae:.4f}")
    print(f"MedAE:  {medae:.4f}")
    print(f"StdDev: {std:.4f}")
    print(f"MAE/MedAE ratio: {ratio:.2f}")

    # Interpretation
    print("\nDiagnosis:")
    if ratio < 1.2:
        print("  ✓ Errors are symmetric and well-behaved")
        print("  → MAE and MedAE are both appropriate")
    elif ratio < 1.5:
        print("  ○ Slight asymmetry or mild outliers")
        print("  → Both metrics valid, report both")
    elif ratio < 2.0:
        print("  ⚠ Moderate right skew in errors")
        print("  → Consider using MedAE or investigating outliers")
    else:
        print("  ⚠ Heavy outliers inflating MAE significantly")
        print("  → MedAE more representative of typical performance")
        print("  → Investigate and possibly exclude/correct outliers")

    print(f"\nDistribution stats:")
    print(f"  Skewness: {skewness:.2f} (>1 = right-skewed)")
    print(f"  Kurtosis: {kurtosis:.2f} (>0 = heavy tails)")

    # Outlier count
    threshold = np.median(abs_errors) + 3 * stats.median_abs_deviation(abs_errors)
    n_outliers = np.sum(abs_errors > threshold)
    print(f"  Outliers (>3 MAD): {n_outliers} ({n_outliers/len(abs_errors)*100:.1f}%)")

    return ratio

# Example 1: Well-behaved errors
np.random.seed(42)
y_true_good = np.random.uniform(100, 200, 1000)
y_pred_good = y_true_good + np.random.randn(1000) * 5

print("Example 1: Normal errors")
diagnose_error_distribution(y_true_good, y_pred_good)

# Example 2: With outliers
print("\n" + "="*50)
print("Example 2: Errors with outliers")
y_pred_bad = y_pred_good.copy()
y_pred_bad[:50] += np.random.uniform(50, 100, 50)  # 5% severe errors
diagnose_error_distribution(y_true_good, y_pred_bad)
```

| Scenario | Primary Metric | Reason |
|---|---|---|
| No outliers expected | MAE | Standard, familiar, optimizable |
| Outliers likely/present | MedAE | Robust to contamination |
| Both perspectives needed | MAE & MedAE together | MAE for overall, MedAE for typical |
| Production SLAs | Quantiles (P95, P99) | Directly map to guarantees |
| Model comparison | Both + ratio analysis | Understand behavior differences |
Let's consolidate all the robust regression metrics into a practical toolkit.
```python
import numpy as np
from scipy import stats
from typing import Dict, Any

class RobustRegressionMetrics:
    """
    Complete toolkit for robust regression evaluation.
    Includes both standard and robust metrics.
    """

    def __init__(self, y_true: np.ndarray, y_pred: np.ndarray):
        self.y_true = np.asarray(y_true)
        self.y_pred = np.asarray(y_pred)
        self.residuals = self.y_true - self.y_pred
        self.abs_errors = np.abs(self.residuals)

    @property
    def mse(self) -> float:
        """Mean Squared Error"""
        return np.mean(self.residuals ** 2)

    @property
    def rmse(self) -> float:
        """Root Mean Squared Error"""
        return np.sqrt(self.mse)

    @property
    def mae(self) -> float:
        """Mean Absolute Error"""
        return np.mean(self.abs_errors)

    @property
    def medae(self) -> float:
        """Median Absolute Error (robust)"""
        return np.median(self.abs_errors)

    @property
    def mad(self) -> float:
        """Median Absolute Deviation of residuals (scaled to estimate std dev)"""
        # scale='normal' divides by ~0.6745, i.e. multiplies the raw MAD by ~1.4826
        return stats.median_abs_deviation(self.residuals, scale='normal')

    def trimmed_mae(self, trim_pct: float = 0.1) -> float:
        """
        Trimmed MAE - excludes extreme values.
        trim_pct = 0.1 means scipy's trim_mean cuts 10% from each tail.
        """
        return stats.trim_mean(self.abs_errors, trim_pct)

    def winsorized_mae(self, limits: tuple = (0.05, 0.05)) -> float:
        """
        Winsorized MAE - caps extreme values instead of removing them.
        """
        winsorized = stats.mstats.winsorize(self.abs_errors, limits=limits)
        return float(np.mean(winsorized))

    def quantile_errors(self, quantiles: list = [0.5, 0.75, 0.9, 0.95, 0.99]) -> Dict[str, float]:
        """Error at specified quantiles"""
        return {
            f'P{int(q*100)}': np.percentile(self.abs_errors, q * 100)
            for q in quantiles
        }

    @property
    def iqr(self) -> float:
        """Interquartile Range of errors"""
        return np.percentile(self.abs_errors, 75) - np.percentile(self.abs_errors, 25)

    @property
    def mae_medae_ratio(self) -> float:
        """Ratio indicating outlier presence"""
        return self.mae / self.medae if self.medae > 0 else float('inf')

    def comprehensive_report(self) -> Dict[str, Any]:
        """Generate complete metric report"""
        return {
            # Location (central tendency)
            'mae': self.mae,
            'medae': self.medae,
            'trimmed_mae_10pct': self.trimmed_mae(),

            # Scale (dispersion)
            'rmse': self.rmse,
            'std_residuals': np.std(self.residuals),
            'mad': self.mad,
            'iqr': self.iqr,

            # Quantiles
            **self.quantile_errors(),
            'max_error': np.max(self.abs_errors),

            # Diagnostics
            'mae_medae_ratio': self.mae_medae_ratio,
            'skewness': stats.skew(self.abs_errors),
            'kurtosis': stats.kurtosis(self.abs_errors),
            'n_samples': len(self.y_true),
        }

    def print_report(self, name: str = "Model"):
        """Pretty print the comprehensive report"""
        report = self.comprehensive_report()

        print(f"\n{'='*60}")
        print(f"ROBUST REGRESSION METRICS: {name}")
        print('='*60)

        print("\n📊 LOCATION (Central Tendency of Errors)")
        print(f"  MAE:               {report['mae']:.4f}")
        print(f"  MedAE:             {report['medae']:.4f} (robust)")
        print(f"  Trimmed MAE (10%): {report['trimmed_mae_10pct']:.4f}")

        print("\n📏 SCALE (Dispersion of Errors)")
        print(f"  RMSE: {report['rmse']:.4f}")
        print(f"  MAD:  {report['mad']:.4f} (robust)")
        print(f"  IQR:  {report['iqr']:.4f}")

        print("\n📈 QUANTILES")
        print(f"  P50: {report['P50']:.4f}")
        print(f"  P90: {report['P90']:.4f}")
        print(f"  P95: {report['P95']:.4f}")
        print(f"  P99: {report['P99']:.4f}")
        print(f"  Max: {report['max_error']:.4f}")

        print("\n🔍 DIAGNOSTICS")
        print(f"  MAE/MedAE ratio: {report['mae_medae_ratio']:.2f}", end="")
        if report['mae_medae_ratio'] > 1.5:
            print("  ⚠️ (outliers affecting mean)")
        else:
            print("  ✓")
        print(f"  Skewness: {report['skewness']:.2f}")
        print(f"  Kurtosis: {report['kurtosis']:.2f}")
        print('='*60)

# Example usage
np.random.seed(42)
y_true = np.random.uniform(100, 500, 500)
noise = np.random.randn(500)
noise[noise > 2.5] *= 3  # Create some outliers
y_pred = y_true + noise * 15

metrics = RobustRegressionMetrics(y_true, y_pred)
metrics.print_report("Production Forecaster")
```

Robust metrics aren't always necessary or appropriate. Here's guidance for when they add value.
Report BOTH mean and median metrics. If they're similar, you have well-behaved errors. If they differ significantly, explain why and which is more relevant for your use case. This transparency builds trust and demonstrates statistical sophistication.
Here's how to effectively integrate median metrics into your ML workflow.
```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    median_absolute_error,  # Built-in!
    mean_squared_error
)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.datasets import make_regression

def sklearn_median_metrics():
    """
    Using sklearn's built-in median_absolute_error.
    """
    # Generate data with outliers
    X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

    # Add outliers to y
    y[0:10] += np.random.uniform(100, 200, 10)

    # Split
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Compare models
    models = {
        'Linear Regression': LinearRegression(),
        'Huber Regressor': HuberRegressor()  # Robust to outliers
    }

    print("=== Sklearn Model Comparison ===")
    print(f"{'Model':25s} | {'MAE':>10s} | {'MedAE':>10s} | {'MAE/MedAE':>10s}")
    print("-" * 62)

    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        mae = mean_absolute_error(y_test, y_pred)
        medae = median_absolute_error(y_test, y_pred)  # sklearn built-in!
        ratio = mae / medae

        print(f"{name:25s} | {mae:>10.2f} | {medae:>10.2f} | {ratio:>10.2f}")

sklearn_median_metrics()

def custom_robust_scorer():
    """
    Create custom scorer using MedAE for cross-validation.
    """
    from sklearn.metrics import make_scorer

    # Create MedAE scorer (higher is worse, so neg_)
    medae_scorer = make_scorer(
        median_absolute_error,
        greater_is_better=False  # Negate for sklearn convention
    )

    # Generate data
    X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)
    y[0:25] += 200  # Add outliers

    model = LinearRegression()

    # Cross-validate with different scorers
    mae_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    medae_scores = cross_val_score(model, X, y, cv=5, scoring=medae_scorer)

    print("\n=== Cross-Validation with Different Metrics ===")
    print(f"MAE scores:   {-mae_scores.mean():.2f} ± {mae_scores.std():.2f}")
    print(f"MedAE scores: {-medae_scores.mean():.2f} ± {medae_scores.std():.2f}")
    print(f"Ratio: {-mae_scores.mean() / -medae_scores.mean():.2f}x")

custom_robust_scorer()
```

In production, track both MAE and MedAE over time. A sudden increase in MAE while MedAE stays stable often indicates a few predictions going badly wrong (outliers, edge cases, data quality issues)—much easier to diagnose than a general degradation.
Median-based metrics provide robust alternatives to mean-based metrics, giving you a clearer picture of typical model performance. Let's consolidate the key insights:
scikit-learn provides median_absolute_error out of the box; SciPy offers median_abs_deviation for robust scale estimation.

Module Complete
Congratulations! You've now completed the Regression Metrics module. You understand the full spectrum of metrics from MSE to MAPE to median errors, and you can select and interpret them appropriately for any regression problem.
Next, you might explore Ranking Metrics to understand how to evaluate models when the order of predictions matters more than their exact values.
You now command a comprehensive toolkit for regression evaluation. From MSE's probabilistic foundations to MedAE's robustness, from R²'s variance explanation to MAPE's percentage interpretation—you can evaluate, compare, and communicate regression model performance with confidence and rigor.