MSE and MAE both use the mean to aggregate errors. But means have a well-known weakness: they're sensitive to extreme values. A single catastrophic prediction can dominate your entire metric.
The median offers a robust alternative. By definition, the median is the value that splits data in half—50% above, 50% below. It completely ignores the magnitude of extreme values, caring only about how many values are above or below.
Median-based error metrics provide a clear picture of typical model performance, unaffected by the handful of cases where your model fails spectacularly.
By the end of this page, you will understand:

- Median Absolute Error (MedAE) and its properties
- Why medians provide robustness to outliers
- Median Absolute Deviation (MAD) as a robust scale measure
- Quantile-based error metrics (beyond just the median)
- When to prefer median metrics over mean metrics
- How to combine median and mean metrics for comprehensive evaluation
Median Absolute Error is simply the median of all absolute errors:
$$\text{MedAE} = \text{median}(|y_1 - \hat{y}_1|, |y_2 - \hat{y}_2|, ..., |y_n - \hat{y}_n|)$$
Compare this to MAE, which averages the same absolute errors instead of taking their median:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
The difference is profound in the presence of outliers.
```python
import numpy as np

def median_absolute_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Median Absolute Error.

    MedAE is the median of all absolute residuals.
    It's robust to outliers and represents the 'typical' error.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    absolute_errors = np.abs(y_true - y_pred)
    return np.median(absolute_errors)

def compare_mae_medae(y_true: np.ndarray, y_pred: np.ndarray):
    """
    Compare MAE and MedAE, showing robustness difference.
    """
    absolute_errors = np.abs(y_true - y_pred)
    mae = np.mean(absolute_errors)
    medae = np.median(absolute_errors)

    print("Absolute errors:", np.sort(absolute_errors))
    print(f"MAE (mean):      {mae:.2f}")
    print(f"MedAE (median):  {medae:.2f}")
    print(f"Ratio MAE/MedAE: {mae/medae:.2f}x")

    if mae > 1.5 * medae:
        print("⚠️ MAE >> MedAE suggests outlier errors are inflating the mean")

    return mae, medae

# Example: Good predictions with one catastrophic failure
print("=== Example with Outlier ===")
y_true = np.array([100, 110, 95, 105, 108, 100, 102, 97, 103, 1000])  # Last is outlier
y_pred = np.array([98, 112, 93, 107, 105, 101, 104, 95, 100, 200])    # Big miss on outlier

compare_mae_medae(y_true, y_pred)

print("\n" + "="*50)
print("=== Same data without outlier ===")
compare_mae_medae(y_true[:-1], y_pred[:-1])
```

Key Properties of MedAE
- **50% Breakdown Point:** Up to 50% of predictions can be arbitrarily wrong before MedAE is affected. (MAE's breakdown point is 0%—a single outlier affects it.)
- **Same Units as Target:** Like MAE, MedAE is in the original units of y.
- **Interpretation:** 'Half of predictions are within ±MedAE of the true value' (approximately).
- **Computational Simplicity:** Just sort the errors and take the middle value(s).
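To make the 'half of predictions' interpretation concrete, here is a small illustrative sketch (not part of the lesson's code) that checks what fraction of absolute errors falls at or below MedAE on synthetic data:

```python
import numpy as np

# Illustrative check: about half of the absolute errors sit at or below MedAE
rng = np.random.default_rng(0)
abs_errors = np.abs(rng.normal(0, 5, size=1001))  # odd n, so the median is an actual sample

medae = np.median(abs_errors)
frac_within = np.mean(abs_errors <= medae)

print(f"MedAE: {medae:.2f}")
print(f"Fraction of errors <= MedAE: {frac_within:.3f}")  # ~0.5 by definition
```

The fraction is only approximately 0.5 because ties and even sample sizes can shift it slightly.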
| Property | MAE | MedAE |
|---|---|---|
| Central tendency | Mean of absolute errors | Median of absolute errors |
| Outlier sensitivity | High | None (robust) |
| Breakdown point | 0% | 50% |
| Interpretation | Average absolute error | Typical absolute error |
| Differentiable | Subgradient only | Not differentiable |
| Optimization | Can minimize directly | Difficult to optimize |
MedAE is particularly valuable when you expect your model to fail badly on some samples (outliers, distribution shift, edge cases) but want to understand typical performance. It answers: 'Ignoring my worst predictions, how well am I doing?'
Robustness in statistics measures how much extreme observations affect an estimator. The key concept is the breakdown point.
Breakdown Point Definition
The breakdown point is the proportion of arbitrary contamination an estimator can handle before giving arbitrarily bad results.
```python
import numpy as np

def breakdown_point_demonstration():
    """
    Show the breakdown point difference between mean and median.
    """
    print("=== Breakdown Point Demonstration ===")

    # Original errors (all small)
    n = 21
    original_errors = np.abs(np.random.randn(n) * 5)
    print(f"Original {n} errors (all reasonable):")
    print(f"  MAE:   {np.mean(original_errors):.2f}")
    print(f"  MedAE: {np.median(original_errors):.2f}")

    # Contaminate with different proportions
    contamination_levels = [0.05, 0.10, 0.20, 0.40, 0.50]

    print(f"\nEffect of contamination (replacing with value 1,000,000):")
    print(f"{'Contamination':>15} | {'MAE':>12} | {'MedAE':>12} | {'MAE stable?':>12}")
    print("-" * 60)

    for frac in contamination_levels:
        errors = original_errors.copy()
        n_contaminated = int(frac * n)
        errors[:n_contaminated] = 1_000_000  # Extreme outliers

        mae = np.mean(errors)
        medae = np.median(errors)

        # Is MAE "stable"? (within 10x of original)
        mae_stable = "Yes" if mae < 10 * np.mean(original_errors) else "No (broken)"

        print(f"{frac*100:>14.0f}% | {mae:>12.0f} | {medae:>12.2f} | {mae_stable:>12}")

    print("\n*** Key Insight ***")
    print("MAE is 'broken' immediately with any outliers.")
    print("MedAE remains stable until contamination exceeds 50%.")

breakdown_point_demonstration()

def influence_function_intuition():
    """
    Show how each observation influences MAE vs MedAE.
    """
    print("=== Influence of Individual Observations ===")

    errors = np.array([2, 3, 4, 5, 6])
    print(f"Original errors: {errors}")
    print(f"MAE: {np.mean(errors):.2f}, MedAE: {np.median(errors):.2f}")

    # Change the largest error progressively
    print(f"\nChange last value from 6 to...")
    for new_val in [6, 10, 20, 50, 100, 1000]:
        modified = errors.copy()
        modified[-1] = new_val
        print(f"  {modified}: MAE={np.mean(modified):.2f}, MedAE={np.median(modified):.2f}")

    print("→ MedAE is completely unchanged by the last value's magnitude!")

influence_function_intuition()
```

While MedAE measures typical error magnitude, sometimes we want to measure error dispersion—how spread out the errors are. The standard deviation does this but is sensitive to outliers.
Median Absolute Deviation (MAD) provides a robust measure of dispersion:
$$\text{MAD} = \text{median}(|X_i - \text{median}(X)|)$$
For regression residuals: $$\text{MAD}_{residuals} = \text{median}(|e_i - \text{median}(e)|)$$
Where $e_i = y_i - \hat{y}_i$ are the residuals.
```python
import numpy as np

def median_absolute_deviation(x: np.ndarray, scale: bool = True) -> float:
    """
    Compute Median Absolute Deviation (MAD).

    Parameters:
    -----------
    x : np.ndarray
        Data array
    scale : bool
        If True, scale MAD to be comparable to std dev
        (multiply by 1.4826 for normal data)
    """
    x = np.asarray(x)
    median = np.median(x)
    mad = np.median(np.abs(x - median))

    if scale:
        # Scale factor makes MAD consistent with std dev for normal data
        mad *= 1.4826

    return mad

def compare_std_mad(data: np.ndarray, name: str = "Data"):
    """
    Compare standard deviation and MAD.
    """
    std = np.std(data, ddof=1)
    mad = median_absolute_deviation(data, scale=True)
    mad_unscaled = median_absolute_deviation(data, scale=False)

    print(f"=== {name} ===")
    print(f"Mean: {np.mean(data):.2f}, Median: {np.median(data):.2f}")
    print(f"Std Dev:        {std:.2f}")
    print(f"MAD (unscaled): {mad_unscaled:.2f}")
    print(f"MAD (scaled):   {mad:.2f}")
    print(f"Std/MAD ratio:  {std/mad:.2f}x")

    if std > 2 * mad:
        print("⚠️ Std >> MAD suggests outliers are inflating variance")

    return std, mad

# Example 1: Normal data
np.random.seed(42)
normal_data = np.random.randn(100)
compare_std_mad(normal_data, "Normal Distribution (no outliers)")

# Example 2: Data with outliers
contaminated = normal_data.copy()
contaminated[0:5] = [10, -10, 15, -15, 20]  # Add outliers
compare_std_mad(contaminated, "Contaminated with 5% outliers")

# Example 3: Regression residuals
print("\n" + "="*50)
print("=== Application to Regression Residuals ===")

y_true = np.array([100, 105, 98, 102, 110, 95, 108, 103, 500, 50])  # Two outliers
y_pred = np.array([101, 104, 99, 101, 108, 97, 106, 105, 120, 95])

residuals = y_true - y_pred
print(f"Residuals: {residuals}")
compare_std_mad(residuals, "Regression Residuals")
```

The 1.4826 Scale Factor
For normally distributed data, the raw MAD works out to about $0.6745\,\sigma$, because the 75th percentile of the standard normal is $\Phi^{-1}(0.75) \approx 0.6745$. Multiplying by $1/0.6745 \approx 1.4826$ therefore rescales MAD into an estimate of the standard deviation:

$$\hat{\sigma} \approx 1.4826 \times \text{MAD}$$
This scaled MAD is a consistent estimator of standard deviation for normal data—but much more robust.
Use Cases for MAD
A common robust outlier rule: points where |x - median| > 3 × MAD are outliers. This is more reliable than the mean ± 3σ rule, which is itself affected by outliers.
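The 3 × MAD rule can be sketched in a few lines. This is an illustrative implementation (the helper name mad_outlier_mask is ours, not a library function), using the raw, unscaled MAD:

```python
import numpy as np

def mad_outlier_mask(x: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag points where |x - median| > k * MAD (raw, unscaled MAD)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) > k * mad

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 250.0])  # one obvious outlier
mask = mad_outlier_mask(data)
print(data[mask])  # -> [250.]
```

Note that because both the median and the MAD ignore the outlier's magnitude, the threshold itself stays sensible even when the contamination is extreme.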
The median is the 50th percentile, but we can generalize to any quantile. Quantile-based error metrics give a richer picture of error distribution than any single summary statistic.
Common Quantile Metrics
```python
import numpy as np
from typing import Dict, List

def quantile_error_profile(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    quantiles: List[float] = [0.50, 0.75, 0.90, 0.95, 0.99]
) -> Dict[str, float]:
    """
    Compute error at multiple quantiles for comprehensive view.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    absolute_errors = np.abs(y_true - y_pred)

    results = {
        'MAE': np.mean(absolute_errors),
        'MedAE': np.median(absolute_errors),
        'Max': np.max(absolute_errors),
    }
    for q in quantiles:
        key = f'P{int(q*100)}'
        results[key] = np.percentile(absolute_errors, q * 100)

    return results

def comprehensive_error_report(y_true: np.ndarray, y_pred: np.ndarray, name: str = "Model"):
    """
    Generate comprehensive quantile-based error report.
    """
    profile = quantile_error_profile(y_true, y_pred)

    print(f"╔═══════════════════════════════════════════════╗")
    print(f"║ Error Profile: {name:^28} ║")
    print(f"╠═══════════════════════════════════════════════╣")
    print(f"║ {'Statistic':15} │ {'Value':>10} │ {'Interpretation':^15} ║")
    print(f"╠═══════════════════════════════════════════════╣")
    print(f"║ {'MAE':15} │ {profile['MAE']:>10.2f} │ {'Average error':^15} ║")
    print(f"║ {'P50 (Median)':15} │ {profile['P50']:>10.2f} │ {'Typical error':^15} ║")
    print(f"║ {'P75':15} │ {profile['P75']:>10.2f} │ {'75% ≤ this':^15} ║")
    print(f"║ {'P90':15} │ {profile['P90']:>10.2f} │ {'90% ≤ this':^15} ║")
    print(f"║ {'P95':15} │ {profile['P95']:>10.2f} │ {'95% ≤ this':^15} ║")
    print(f"║ {'P99':15} │ {profile['P99']:>10.2f} │ {'99% ≤ this':^15} ║")
    print(f"║ {'Max':15} │ {profile['Max']:>10.2f} │ {'Worst case':^15} ║")
    print(f"╚═══════════════════════════════════════════════╝")

    # Analysis
    print(f"\nAnalysis:")
    ratio_95_50 = profile['P95'] / profile['P50']
    ratio_max_95 = profile['Max'] / profile['P95']

    print(f"  P95/P50 ratio: {ratio_95_50:.1f}x — ", end="")
    if ratio_95_50 < 2:
        print("tight distribution")
    elif ratio_95_50 < 4:
        print("moderate spread")
    else:
        print("heavy right tail (some large errors)")

    print(f"  Max/P95 ratio: {ratio_max_95:.1f}x — ", end="")
    if ratio_max_95 < 2:
        print("no extreme outliers")
    else:
        print("extreme outliers present")

    return profile

# Example with realistic model predictions
np.random.seed(42)
n = 1000

# Good model with occasional failures
y_true = np.random.uniform(100, 500, n)
# Most predictions within 5%, but some failures
noise = np.random.randn(n)
noise[noise > 2] *= 5  # Inflate tail errors
y_pred = y_true * (1 + noise * 0.05)

comprehensive_error_report(y_true, y_pred, "Production Model")
```

Interquartile Range of Errors (IQR)
The IQR measures the middle 50% of errors: $$\text{IQR} = P_{75} - P_{25}$$
This is a robust measure of error spread, playing the same role for dispersion that MAD does.
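As a quick illustration (synthetic numbers, not from the lesson), the IQR of a set of absolute errors barely moves when one extreme error is added, while the standard deviation explodes:

```python
import numpy as np

# Illustrative sketch: IQR of absolute errors, with and without an outlier
abs_errors = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0])

p25, p75 = np.percentile(abs_errors, [25, 75])
iqr = p75 - p25
print(f"IQR: {iqr:.2f}")

# Adding one huge error barely moves the IQR, unlike the standard deviation
with_outlier = np.append(abs_errors, 500.0)
p25o, p75o = np.percentile(with_outlier, [25, 75])
print(f"IQR with outlier: {p75o - p25o:.2f}")
print(f"Std without/with outlier: {np.std(abs_errors):.2f} / {np.std(with_outlier):.2f}")
```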
In production ML systems, quantile metrics directly translate to SLAs. 'P95 latency < 200ms' or 'P99 error < $50' are common SLA formats. Understanding quantile error profiles helps you set and meet realistic service commitments.
The relationship between MAE and MedAE reveals important information about your error distribution.
The MAE/MedAE Ratio
For symmetric distributions, MAE ≈ MedAE. Deviations indicate asymmetry or outliers:
```python
import numpy as np
from scipy import stats

def diagnose_error_distribution(y_true: np.ndarray, y_pred: np.ndarray):
    """
    Use MAE/MedAE ratio and other statistics to diagnose error distribution.
    """
    residuals = y_true - y_pred
    abs_errors = np.abs(residuals)

    mae = np.mean(abs_errors)
    medae = np.median(abs_errors)
    std = np.std(abs_errors)
    ratio = mae / medae

    # Distribution characteristics
    skewness = stats.skew(abs_errors)
    kurtosis = stats.kurtosis(abs_errors)  # Excess kurtosis

    print("=== Error Distribution Diagnosis ===")
    print(f"MAE:    {mae:.4f}")
    print(f"MedAE:  {medae:.4f}")
    print(f"StdDev: {std:.4f}")
    print(f"MAE/MedAE ratio: {ratio:.2f}")

    # Interpretation
    print("\nDiagnosis:")
    if ratio < 1.2:
        print("  ✓ Errors are symmetric and well-behaved")
        print("  → MAE and MedAE are both appropriate")
    elif ratio < 1.5:
        print("  ○ Slight asymmetry or mild outliers")
        print("  → Both metrics valid, report both")
    elif ratio < 2.0:
        print("  ⚠ Moderate right skew in errors")
        print("  → Consider using MedAE or investigating outliers")
    else:
        print("  ⚠ Heavy outliers inflating MAE significantly")
        print("  → MedAE more representative of typical performance")
        print("  → Investigate and possibly exclude/correct outliers")

    print(f"\nDistribution stats:")
    print(f"  Skewness: {skewness:.2f} (>1 = right-skewed)")
    print(f"  Kurtosis: {kurtosis:.2f} (>0 = heavy tails)")

    # Outlier count
    threshold = np.median(abs_errors) + 3 * stats.median_abs_deviation(abs_errors)
    n_outliers = np.sum(abs_errors > threshold)
    print(f"  Outliers (>3 MAD): {n_outliers} ({n_outliers/len(abs_errors)*100:.1f}%)")

    return ratio

# Example 1: Well-behaved errors
np.random.seed(42)
y_true_good = np.random.uniform(100, 200, 1000)
y_pred_good = y_true_good + np.random.randn(1000) * 5

print("Example 1: Normal errors")
diagnose_error_distribution(y_true_good, y_pred_good)

# Example 2: With outliers
print("\n" + "="*50)
print("Example 2: Errors with outliers")
y_pred_bad = y_pred_good.copy()
y_pred_bad[:50] += np.random.uniform(50, 100, 50)  # 5% severe errors
diagnose_error_distribution(y_true_good, y_pred_bad)
```

| Scenario | Primary Metric | Reason |
|---|---|---|
| No outliers expected | MAE | Standard, familiar, optimizable |
| Outliers likely/present | MedAE | Robust to contamination |
| Both perspectives needed | MAE & MedAE together | MAE for overall, MedAE for typical |
| Production SLAs | Quantiles (P95, P99) | Directly map to guarantees |
| Model comparison | Both + ratio analysis | Understand behavior differences |
Let's consolidate all the robust regression metrics into a practical toolkit.
```python
import numpy as np
from scipy import stats
from typing import Dict, Any

class RobustRegressionMetrics:
    """
    Complete toolkit for robust regression evaluation.
    Includes both standard and robust metrics.
    """

    def __init__(self, y_true: np.ndarray, y_pred: np.ndarray):
        self.y_true = np.asarray(y_true)
        self.y_pred = np.asarray(y_pred)
        self.residuals = self.y_true - self.y_pred
        self.abs_errors = np.abs(self.residuals)

    @property
    def mse(self) -> float:
        """Mean Squared Error"""
        return np.mean(self.residuals ** 2)

    @property
    def rmse(self) -> float:
        """Root Mean Squared Error"""
        return np.sqrt(self.mse)

    @property
    def mae(self) -> float:
        """Mean Absolute Error"""
        return np.mean(self.abs_errors)

    @property
    def medae(self) -> float:
        """Median Absolute Error (robust)"""
        return np.median(self.abs_errors)

    @property
    def mad(self) -> float:
        """Median Absolute Deviation of residuals (scaled to estimate std dev)"""
        # scale='normal' divides by ~0.6745, i.e. multiplies the raw MAD by ~1.4826
        return stats.median_abs_deviation(self.residuals, scale='normal')

    def trimmed_mae(self, trim_pct: float = 0.1) -> float:
        """
        Trimmed MAE - excludes extreme values.
        trim_pct = 0.1 means scipy's trim_mean cuts 10% from each tail.
        """
        return stats.trim_mean(self.abs_errors, trim_pct)

    def winsorized_mae(self, limits: tuple = (0.05, 0.05)) -> float:
        """
        Winsorized MAE - caps extreme values instead of removing them.
        """
        winsorized = stats.mstats.winsorize(self.abs_errors, limits=limits)
        return float(np.mean(winsorized))

    def quantile_errors(self, quantiles: list = [0.5, 0.75, 0.9, 0.95, 0.99]) -> Dict[str, float]:
        """Error at specified quantiles"""
        return {
            f'P{int(q*100)}': np.percentile(self.abs_errors, q * 100)
            for q in quantiles
        }

    @property
    def iqr(self) -> float:
        """Interquartile Range of errors"""
        return np.percentile(self.abs_errors, 75) - np.percentile(self.abs_errors, 25)

    @property
    def mae_medae_ratio(self) -> float:
        """Ratio indicating outlier presence"""
        return self.mae / self.medae if self.medae > 0 else float('inf')

    def comprehensive_report(self) -> Dict[str, Any]:
        """Generate complete metric report"""
        return {
            # Location (central tendency)
            'mae': self.mae,
            'medae': self.medae,
            'trimmed_mae_10pct': self.trimmed_mae(),

            # Scale (dispersion)
            'rmse': self.rmse,
            'std_residuals': np.std(self.residuals),
            'mad': self.mad,
            'iqr': self.iqr,

            # Quantiles
            **self.quantile_errors(),
            'max_error': np.max(self.abs_errors),

            # Diagnostics
            'mae_medae_ratio': self.mae_medae_ratio,
            'skewness': stats.skew(self.abs_errors),
            'kurtosis': stats.kurtosis(self.abs_errors),
            'n_samples': len(self.y_true),
        }

    def print_report(self, name: str = "Model"):
        """Pretty print the comprehensive report"""
        report = self.comprehensive_report()

        print(f"\n{'='*60}")
        print(f"ROBUST REGRESSION METRICS: {name}")
        print('='*60)

        print("\n📊 LOCATION (Central Tendency of Errors)")
        print(f"  MAE:               {report['mae']:.4f}")
        print(f"  MedAE:             {report['medae']:.4f} (robust)")
        print(f"  Trimmed MAE (10%): {report['trimmed_mae_10pct']:.4f}")

        print("\n📏 SCALE (Dispersion of Errors)")
        print(f"  RMSE: {report['rmse']:.4f}")
        print(f"  MAD:  {report['mad']:.4f} (robust)")
        print(f"  IQR:  {report['iqr']:.4f}")

        print("\n📈 QUANTILES")
        print(f"  P50: {report['P50']:.4f}")
        print(f"  P90: {report['P90']:.4f}")
        print(f"  P95: {report['P95']:.4f}")
        print(f"  P99: {report['P99']:.4f}")
        print(f"  Max: {report['max_error']:.4f}")

        print("\n🔍 DIAGNOSTICS")
        print(f"  MAE/MedAE ratio: {report['mae_medae_ratio']:.2f}", end="")
        if report['mae_medae_ratio'] > 1.5:
            print("  ⚠️ (outliers affecting mean)")
        else:
            print("  ✓")
        print(f"  Skewness: {report['skewness']:.2f}")
        print(f"  Kurtosis: {report['kurtosis']:.2f}")
        print('='*60)

# Example usage
np.random.seed(42)
y_true = np.random.uniform(100, 500, 500)
noise = np.random.randn(500)
noise[noise > 2.5] *= 3  # Create some outliers
y_pred = y_true + noise * 15

metrics = RobustRegressionMetrics(y_true, y_pred)
metrics.print_report("Production Forecaster")
```

Robust metrics aren't always necessary or appropriate. Here's guidance for when they add value.
Report BOTH mean and median metrics. If they're similar, you have well-behaved errors. If they differ significantly, explain why and which is more relevant for your use case. This transparency builds trust and demonstrates statistical sophistication.
Here's how to effectively integrate median metrics into your ML workflow.
```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    median_absolute_error,  # Built-in!
    mean_squared_error
)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.datasets import make_regression

def sklearn_median_metrics():
    """
    Using sklearn's built-in median_absolute_error.
    """
    # Generate data with outliers
    X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

    # Add outliers to y
    y[0:10] += np.random.uniform(100, 200, 10)

    # Split
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Compare models
    models = {
        'Linear Regression': LinearRegression(),
        'Huber Regressor': HuberRegressor()  # Robust to outliers
    }

    print("=== Sklearn Model Comparison ===")
    print(f"{'Model':25s} | {'MAE':>10s} | {'MedAE':>10s} | {'MAE/MedAE':>10s}")
    print("-" * 62)

    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        mae = mean_absolute_error(y_test, y_pred)
        medae = median_absolute_error(y_test, y_pred)  # sklearn built-in!
        ratio = mae / medae

        print(f"{name:25s} | {mae:>10.2f} | {medae:>10.2f} | {ratio:>10.2f}")

sklearn_median_metrics()

def custom_robust_scorer():
    """
    Create custom scorer using MedAE for cross-validation.
    """
    from sklearn.metrics import make_scorer

    # Create MedAE scorer (higher is worse, so neg_)
    medae_scorer = make_scorer(
        median_absolute_error,
        greater_is_better=False  # Negate for sklearn convention
    )

    # Generate data
    X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)
    y[0:25] += 200  # Add outliers

    model = LinearRegression()

    # Cross-validate with different scorers
    mae_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    medae_scores = cross_val_score(model, X, y, cv=5, scoring=medae_scorer)

    print("\n=== Cross-Validation with Different Metrics ===")
    print(f"MAE scores:   {-mae_scores.mean():.2f} ± {mae_scores.std():.2f}")
    print(f"MedAE scores: {-medae_scores.mean():.2f} ± {medae_scores.std():.2f}")
    print(f"Ratio: {-mae_scores.mean() / -medae_scores.mean():.2f}x")

custom_robust_scorer()
```

In production, track both MAE and MedAE over time. A sudden increase in MAE while MedAE stays stable often indicates a few predictions going badly wrong (outliers, edge cases, data quality issues)—much easier to diagnose than a general degradation.
Median-based metrics provide robust alternatives to mean-based metrics, giving you a clearer picture of typical model performance. Let's consolidate the key insights:
scikit-learn provides median_absolute_error out of the box; SciPy offers median_abs_deviation for robust scale estimation.

Module Complete
Congratulations! You've now completed the Regression Metrics module. You understand the full spectrum of metrics from MSE to MAPE to median errors, and you can select and interpret them appropriately for any regression problem.
Next, you might explore Ranking Metrics to understand how to evaluate models when the order of predictions matters more than their exact values.
You now command a comprehensive toolkit for regression evaluation. From MSE's probabilistic foundations to MedAE's robustness, from R²'s variance explanation to MAPE's percentage interpretation—you can evaluate, compare, and communicate regression model performance with confidence and rigor.