While Mean Squared Error dominates regression evaluation, it has a well-known weakness: it's highly sensitive to outliers. A single large error can completely dominate the metric, potentially distorting your assessment of model quality.
Mean Absolute Error (MAE) offers a more robust alternative. Instead of squaring errors, MAE takes their absolute values—treating a prediction error of +10 exactly the same as -10, and penalizing all errors proportionally to their magnitude.
But MAE's simplicity hides subtle tradeoffs. Understanding when to choose MAE over MSE requires deep knowledge of both metrics' properties.
By the end of this page, you will understand:

- MAE's mathematical definition and properties
- Why linear penalties provide robustness to outliers
- The statistical interpretation as median optimization
- Optimization challenges due to non-differentiability at zero
- When MAE is preferable to MSE and vice versa
- Practical implementation considerations
Mean Absolute Error measures the average of the absolute differences between predicted and actual values. For a dataset with $n$ samples:
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Compare this to MSE:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The only difference is that the absolute value replaces the square.
This seemingly small change has profound implications for how the metric behaves.
| Property | MAE | MSE |
|---|---|---|
| Formula | $\frac{1}{n}\sum|y_i - \hat{y}_i|$ | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ |
| Error penalty | Linear | Quadratic |
| Units | Same as target | Squared units |
| Outlier sensitivity | Low | High |
| Differentiability | Not at zero | Everywhere |
| Optimal prediction | Median | Mean |
```python
import numpy as np

def mean_absolute_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Mean Absolute Error from scratch.

    Parameters
    ----------
    y_true : np.ndarray
        Array of true target values
    y_pred : np.ndarray
        Array of predicted values

    Returns
    -------
    float
        The mean absolute error
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"Shape mismatch: {y_true.shape} vs {y_pred.shape}")

    # Calculate absolute residuals
    absolute_residuals = np.abs(y_true - y_pred)

    # Average over all samples
    mae = np.mean(absolute_residuals)
    return mae

# Example: House price predictions (in $1000s)
y_true = np.array([250, 300, 350, 400, 450])
y_pred = np.array([260, 290, 340, 420, 440])

mae = mean_absolute_error(y_true, y_pred)
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

print(f"Residuals: {y_true - y_pred}")
print(f"Absolute Residuals: {np.abs(y_true - y_pred)}")
print(f"MAE: {mae:.2f} $1000")
print(f"RMSE: {rmse:.2f} $1000")
print(f"Interpretation: On average, predictions are off by ${mae:.0f},000")
# Output:
# Residuals: [-10  10  10 -20  10]
# Absolute Residuals: [10 10 10 20 10]
# MAE: 12.00 $1000
# RMSE: 12.65 $1000
```

One of MAE's practical advantages is that it's in the same units as the target variable—no square root needed. An MAE of 10 for house prices means 'on average, predictions are off by about $10,000.' This makes MAE arguably more intuitive than MSE for stakeholder communication.
The fundamental difference between MAE and MSE comes down to how they penalize errors of different sizes:
MSE's Quadratic Penalty
MAE's Linear Penalty
With MSE, doubling the error quadruples the penalty. With MAE, doubling the error merely doubles the penalty.
```python
import numpy as np

def compare_penalties():
    """
    Illustrate how MAE and MSE penalize errors differently.
    """
    errors = np.array([1, 2, 5, 10, 20, 50, 100])

    print("Error Magnitude | MAE Penalty | MSE Penalty | MSE/MAE Ratio")
    print("-" * 60)
    for e in errors:
        mae_penalty = abs(e)
        mse_penalty = e ** 2
        ratio = mse_penalty / mae_penalty
        print(f"{e:14d} | {mae_penalty:11d} | {mse_penalty:11d} | {ratio:13.0f}x")

compare_penalties()
# Output:
# Error Magnitude | MAE Penalty | MSE Penalty | MSE/MAE Ratio
# ------------------------------------------------------------
#              1 |           1 |           1 |             1x
#              2 |           2 |           4 |             2x
#              5 |           5 |          25 |             5x
#             10 |          10 |         100 |            10x
#             20 |          20 |         400 |            20x
#             50 |          50 |        2500 |            50x
#            100 |         100 |       10000 |           100x

def outlier_impact_comparison():
    """
    Show how a single outlier affects MAE vs MSE.
    """
    # Baseline: four predictions with small errors
    errors_clean = np.array([2, 3, 2, 3])

    # Add one large error (outlier)
    errors_with_outlier = np.array([2, 3, 2, 3, 50])

    # Calculate metrics
    mae_clean = np.mean(np.abs(errors_clean))
    mse_clean = np.mean(errors_clean ** 2)
    mae_outlier = np.mean(np.abs(errors_with_outlier))
    mse_outlier = np.mean(errors_with_outlier ** 2)

    print("=== Outlier Impact Analysis ===")
    print(f"Clean data (errors: {errors_clean}):")
    print(f"  MAE: {mae_clean:.2f}")
    print(f"  MSE: {mse_clean:.2f}")
    print(f"With outlier (errors: {errors_with_outlier}):")
    print(f"  MAE: {mae_outlier:.2f} (increase: {(mae_outlier/mae_clean - 1)*100:.0f}%)")
    print(f"  MSE: {mse_outlier:.2f} (increase: {(mse_outlier/mse_clean - 1)*100:.0f}%)")

    # Contribution analysis
    print("Outlier's contribution:")
    print(f"  To total absolute error: {50 / np.sum(np.abs(errors_with_outlier)) * 100:.1f}%")
    print(f"  To total squared error: {2500 / np.sum(errors_with_outlier ** 2) * 100:.1f}%")

outlier_impact_comparison()
# The single outlier increases MSE by ~7672% but MAE by only ~380%
```

MAE's robustness is a double-edged sword. If your model occasionally makes catastrophic predictions, MAE will partially hide this.
Always examine the full error distribution, not just the average—consider looking at max error, percentiles (e.g., 95th percentile error), and residual plots.
Here's a profound connection that illuminates MAE's behavior:
Minimizing MAE finds the conditional median, just as minimizing MSE finds the conditional mean.
If we want a single number $c$ that minimizes the sum of absolute deviations from a set of values $\{y_1, y_2, \ldots, y_n\}$:
$$\underset{c}{\text{argmin}} \sum_{i=1}^n |y_i - c|$$
The solution is $c = \text{median}(y_1, ..., y_n)$, not the mean!
Why the Median?
Intuitively, the median balances the number of values above and below it. For absolute deviations, moving the prediction up by a small amount $\epsilon$ decreases each deviation above it by $\epsilon$ and increases each deviation below it by $\epsilon$.

The total change is $\epsilon \times (\text{count below} - \text{count above})$. This is zero only when counts are balanced—i.e., at the median.
```python
import numpy as np
from scipy.optimize import minimize_scalar

def demonstrate_median_minimizes_mae():
    """
    Prove empirically that the median minimizes MAE
    while the mean minimizes MSE.
    """
    # Sample data with outlier
    y = np.array([10, 12, 11, 13, 12, 100])  # 100 is an outlier

    mean_y = np.mean(y)
    median_y = np.median(y)

    print(f"Data: {y}")
    print(f"Mean: {mean_y:.2f}")
    print(f"Median: {median_y:.2f}")

    # Calculate MAE and MSE for different prediction values
    def mae_for_constant(c):
        return np.mean(np.abs(y - c))

    def mse_for_constant(c):
        return np.mean((y - c) ** 2)

    # Test at mean and median
    print(f"MAE when predicting mean ({mean_y:.2f}): {mae_for_constant(mean_y):.2f}")
    print(f"MAE when predicting median ({median_y:.2f}): {mae_for_constant(median_y):.2f}")
    print("→ Median gives lower MAE!")

    print(f"MSE when predicting mean ({mean_y:.2f}): {mse_for_constant(mean_y):.2f}")
    print(f"MSE when predicting median ({median_y:.2f}): {mse_for_constant(median_y):.2f}")
    print("→ Mean gives lower MSE!")

    # Find optimal values numerically
    opt_mae = minimize_scalar(mae_for_constant, bounds=(0, 150), method='bounded')
    opt_mse = minimize_scalar(mse_for_constant, bounds=(0, 150), method='bounded')

    print("Numerical optimization:")
    print(f"  Optimal for MAE: {opt_mae.x:.2f} (median = {median_y:.2f})")
    print(f"  Optimal for MSE: {opt_mse.x:.2f} (mean = {mean_y:.2f})")

demonstrate_median_minimizes_mae()
# Output shows median is optimal for MAE, mean is optimal for MSE
```

Implications for Regression
When you train a model by minimizing MAE: your model learns to predict the conditional median of the target given the features.

When you train by minimizing MSE: your model learns to predict the conditional mean.
These are the same only when the conditional distribution is symmetric. For skewed distributions, mean ≠ median, and the choice of loss function determines which central tendency your model predicts.
| Distribution Shape | Mean vs Median | MAE-Optimal Prediction | MSE-Optimal Prediction |
|---|---|---|---|
| Symmetric (Normal) | Mean = Median | Either | Either |
| Right-skewed (income) | Mean > Median | Lower value | Higher value |
| Left-skewed (test scores) | Mean < Median | Higher value | Lower value |
| Heavy outliers | Mean pulled by outliers | Robust estimate | Outlier-influenced |
For house price prediction, prices are often right-skewed (few very expensive houses). MSE-optimal models will predict higher (toward the mean, influenced by mansions), while MAE-optimal models predict lower (toward the median, the 'typical' house). Neither is 'wrong'—they answer different questions.
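The effect is easy to see on a toy sample. This sketch uses made-up house prices (hypothetical numbers, not from the text) with a few mansions in the tail, showing how the mean is pulled upward while the median stays with the typical house:

```python
import numpy as np

# Hypothetical right-skewed house prices (in $1000s): mostly typical
# homes, plus a few mansions that stretch the right tail.
prices = np.array([200, 220, 240, 250, 260, 280, 300, 1500, 2000])

mean_price = np.mean(prices)      # pulled toward the mansions
median_price = np.median(prices)  # the 'typical' house

print(f"Mean:   ${mean_price:.0f}k")   # ~583k
print(f"Median: ${median_price:.0f}k")  # 260k
# An MSE-optimal constant prediction is the mean;
# an MAE-optimal constant prediction is the median.
```

Both statistics summarize the same data; the loss function you minimize decides which one your model chases.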
MAE has a significant mathematical limitation: the absolute value function is not differentiable at zero.
Recall that: $$|x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}$$
The derivative is: $$\frac{d|x|}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ \text{undefined} & \text{if } x = 0 \end{cases}$$
At $x = 0$, the left-derivative is -1 and the right-derivative is +1—there's a 'kink' in the function.
Why Does This Matter?
Gradient-based optimization (gradient descent, Adam, etc.) requires computing gradients. When a residual is exactly zero, the gradient of the absolute value is undefined, and the optimizer must fall back on a subgradient—some value chosen from the interval $[-1, 1]$.
Practical Solutions
In practice, residuals rarely equal exactly zero due to floating-point representation, so this is more of a theoretical concern. But implementations handle it: subgradient methods simply pick a value in $[-1, 1]$ at the kink (using 0, as `np.sign(0)` does, is a common choice), and smoothed losses such as Huber replace the kink with a quadratic region.
```python
import numpy as np

def mae_gradient(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """
    Compute the (sub)gradient of MAE with respect to predictions.

    For |y - y_hat|:
    - If y > y_hat: gradient = -1 (increase prediction to reduce error)
    - If y < y_hat: gradient = +1 (decrease prediction to reduce error)
    - If y = y_hat: gradient = 0 (arbitrary choice from subgradient [-1, 1])
    """
    residuals = y_true - y_pred

    # Subgradient: negative sign of residual
    gradient = -np.sign(residuals)
    # Note: np.sign(0) = 0, which is a valid subgradient choice

    # Normalize by n (for the mean)
    gradient = gradient / len(y_true)
    return gradient

def compare_gradients():
    """
    Compare MAE and MSE gradients at various residual values.
    """
    residuals = np.array([-10, -5, -1, 0, 1, 5, 10])

    # MSE gradient: d/dy_hat of (y - y_hat)^2 = -2(y - y_hat)
    mse_grad = -2 * residuals

    # MAE gradient: d/dy_hat of |y - y_hat| = -sign(y - y_hat)
    mae_grad = -np.sign(residuals)

    print("Residual | MAE Gradient | MSE Gradient | Ratio")
    print("-" * 55)
    for r, mg, sg in zip(residuals, mae_grad, mse_grad):
        ratio = sg / mg if mg != 0 else 0
        print(f"{r:8.1f} | {mg:12.1f} | {sg:12.1f} | {ratio:5.1f}x")

compare_gradients()
# Output:
# Residual | MAE Gradient | MSE Gradient | Ratio
# -------------------------------------------------------
#    -10.0 |          1.0 |         20.0 |  20.0x
#     -5.0 |          1.0 |         10.0 |  10.0x
#     -1.0 |          1.0 |          2.0 |   2.0x
#      0.0 |          0.0 |          0.0 |   0.0x
#      1.0 |         -1.0 |         -2.0 |   2.0x
#      5.0 |         -1.0 |        -10.0 |  10.0x
#     10.0 |         -1.0 |        -20.0 |  20.0x
```

Key Insight: Constant Learning Signal
Notice that MAE's gradient magnitude is always 1 (or 0 at the optimum), regardless of error size. This is fundamentally different from MSE, where gradient magnitude scales with error size.
Implications:

- Outliers cannot produce exploding gradients; a huge error contributes the same gradient magnitude as a modest one.
- Near the optimum, the gradient does not shrink, so a fixed learning rate can oscillate around the minimum; learning-rate schedules help.
- Convergence on small residuals is slower than with MSE, whose gradients scale with error size.
Huber loss combines MAE and MSE: behaves like MSE for small errors (differentiable, proportional gradients) and like MAE for large errors (robust, bounded gradients). We'll cover Huber loss in detail in the next section. It's often the practical choice when you want robustness without sacrificing optimization stability.
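As a preview of the next section, here is a minimal sketch of the Huber idea; the `delta` threshold and the standalone `huber_loss` helper are illustrative choices, not a library API:

```python
import numpy as np

def huber_loss(y_true: np.ndarray, y_pred: np.ndarray, delta: float = 1.0) -> float:
    """Quadratic for |residual| <= delta, linear beyond: a sketch of the
    MSE/MAE hybrid covered in the next section."""
    residuals = y_true - y_pred
    abs_r = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2        # MSE-like near zero
    linear = delta * (abs_r - 0.5 * delta)  # MAE-like in the tails
    return float(np.mean(np.where(abs_r <= delta, quadratic, linear)))

# Small errors are penalized quadratically, large ones only linearly:
print(huber_loss(np.array([0.0]), np.array([0.5])))   # 0.125 = 0.5 * 0.5^2
print(huber_loss(np.array([0.0]), np.array([10.0])))  # 9.5 = 1.0 * (10 - 0.5)
```

Note how the transition at `delta` keeps the loss differentiable everywhere while bounding the gradient for large residuals.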
MAE is actually a special case of a broader framework: quantile regression. Understanding this connection reveals MAE's true statistical nature.
The Quantile Loss Function
For a given quantile $\tau \in (0, 1)$, the pinball loss (or quantile loss) is:
$$L_\tau(y, \hat{y}) = \begin{cases} \tau \cdot (y - \hat{y}) & \text{if } y \geq \hat{y} \\ (1-\tau) \cdot (\hat{y} - y) & \text{if } y < \hat{y} \end{cases}$$
Simplified: $$L_\tau(y, \hat{y}) = \max(\tau(y - \hat{y}), (1-\tau)(\hat{y} - y))$$
MAE as 50th Percentile Quantile Loss
When $\tau = 0.5$: $$L_{0.5}(y, \hat{y}) = 0.5 \times |y - \hat{y}|$$
This is just half of MAE! Minimizing MAE is equivalent to quantile regression at the median.
```python
import numpy as np
from scipy.optimize import minimize_scalar

def quantile_loss(y_true: np.ndarray, y_pred: np.ndarray, tau: float) -> float:
    """
    Compute quantile loss (pinball loss) for a given quantile tau.

    tau = 0.5 gives half of MAE (median regression)
    tau = 0.1 gives 10th percentile regression
    tau = 0.9 gives 90th percentile regression
    """
    residuals = y_true - y_pred
    # Asymmetric weighting
    loss = np.where(
        residuals >= 0,
        tau * residuals,
        (tau - 1) * residuals  # = (1 - tau) * |residuals|
    )
    return np.mean(loss)

def demonstrate_quantile_mae_equivalence():
    """
    Show that quantile loss at tau = 0.5 is proportional to MAE.
    """
    np.random.seed(42)
    y_true = np.random.randn(100)
    y_pred = y_true + np.random.randn(100) * 0.5  # Predictions with noise

    mae = np.mean(np.abs(y_true - y_pred))
    q_loss_50 = quantile_loss(y_true, y_pred, 0.5)

    print(f"MAE: {mae:.4f}")
    print(f"Quantile loss (τ=0.5): {q_loss_50:.4f}")
    print(f"MAE / 2: {mae/2:.4f}")
    print(f"Ratio: {mae / q_loss_50:.4f}x (should be 2)")

demonstrate_quantile_mae_equivalence()

def quantile_regression_demo():
    """
    Show how different quantiles predict different parts of the distribution.
    """
    # Simulate skewed data
    np.random.seed(42)
    y = np.concatenate([
        np.random.normal(50, 10, 80),   # Most values around 50
        np.random.normal(100, 5, 20)    # Some high values around 100
    ])

    print("Data Statistics:")
    print(f"  Mean: {np.mean(y):.1f}")
    print(f"  Median: {np.median(y):.1f}")
    print(f"  10th percentile: {np.percentile(y, 10):.1f}")
    print(f"  90th percentile: {np.percentile(y, 90):.1f}")

    # Find optimal constant prediction for each quantile
    for tau in [0.1, 0.5, 0.9]:
        result = minimize_scalar(
            lambda c: quantile_loss(y, np.full_like(y, c), tau),
            bounds=(0, 150), method='bounded'
        )
        actual_percentile = np.percentile(y, tau * 100)
        print(f"τ={tau}: Optimal prediction = {result.x:.1f} "
              f"(actual {tau*100:.0f}th percentile = {actual_percentile:.1f})")

quantile_regression_demo()
```

Why This Matters
Understanding MAE as quantile regression at the median gives you powerful options:

- Change τ to predict any quantile of the conditional distribution, not just the median.
- Fit several quantiles (e.g., τ = 0.1 and τ = 0.9) to produce prediction intervals.
- Encode asymmetric error costs directly in the loss by choosing τ away from 0.5.
Asymmetric MAE Variants
By varying τ away from 0.5, you can create asymmetric versions of MAE that penalize over- and under-prediction differently: τ > 0.5 weights under-prediction (predicting too low) more heavily, while τ < 0.5 weights over-prediction more heavily.
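A quick numeric sketch, using the pinball-loss definition above with the convention residual = y − ŷ (the scalar `pinball` helper is ours, for illustration):

```python
def pinball(residual: float, tau: float) -> float:
    """Pinball loss for a single residual (y_true - y_pred)."""
    return tau * residual if residual >= 0 else (tau - 1) * residual

# With tau = 0.9, under-prediction (positive residual) costs 9x more
# than over-prediction of the same magnitude:
print(pinball(+5.0, 0.9))  # 4.5  (we predicted too low)
print(pinball(-5.0, 0.9))  # 0.5  (we predicted too high)
```

At τ = 0.5 the two branches weigh errors equally, recovering half of the absolute error on each side.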
Modern forecasting often uses quantile regression to produce prediction distributions, not just point estimates. Instead of predicting 'sales will be 100 units,' you predict 'there's a 90% chance sales will be between 80 and 130 units.' MAE at τ=0.5 is just one point in this richer picture.
Choosing between MAE and MSE isn't arbitrary—it should reflect your problem's characteristics and requirements. Here's a comprehensive framework for making this decision.
| Scenario | Choose MAE | Choose MSE |
|---|---|---|
| Outliers present | ✓ If outliers are noise/errors | ✓ If outliers are real and important |
| Error cost structure | ✓ Cost linear in error size | ✓ Cost grows super-linearly |
| Target distribution | ✓ Skewed, want median prediction | ✓ Symmetric, want mean prediction |
| Interpretability need | ✓ Stakeholders need intuitive units | RMSE for interpretability |
| Optimization priority | May need special handling | ✓ Clean gradients, convex |
| Safety-critical | Only if large errors acceptable | ✓ Never underestimate risk |
Decision Tree Approach
1. Are large errors catastrophic in your domain?
   → YES: Consider MSE or even higher-order penalties
   → NO: Continue to question 2
2. Are there outliers that represent noise or measurement error?
   → YES: MAE or Huber loss
   → NO: Continue to question 3
3. Is your target distribution symmetric?
   → YES: Either works; MSE for optimization convenience
   → NO (skewed): MAE if you want median prediction
4. Do stakeholders need intuitive interpretation?
   → YES: MAE (or RMSE from MSE)
   → NO: Either, based on the criteria above
```python
import numpy as np
from scipy.stats import skew, kurtosis

def metric_analysis(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """
    Analyze data to help choose between MAE and MSE.
    Returns guidance based on data characteristics.
    """
    residuals = y_true - y_pred
    abs_residuals = np.abs(residuals)

    # Calculate both metrics
    mae = np.mean(abs_residuals)
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)

    # Outlier indicators
    std_residual = np.std(residuals)
    outlier_threshold = 3 * std_residual
    n_outliers = np.sum(np.abs(residuals) > outlier_threshold)
    pct_outliers = n_outliers / len(residuals) * 100

    # Contribution concentration (how much do the top 10% of errors contribute?)
    sorted_sq = np.sort(residuals ** 2)[::-1]
    top_10_pct = int(0.1 * len(residuals))
    top_10_contribution_mse = np.sum(sorted_sq[:top_10_pct]) / np.sum(sorted_sq) * 100

    sorted_abs = np.sort(abs_residuals)[::-1]
    top_10_contribution_mae = np.sum(sorted_abs[:top_10_pct]) / np.sum(sorted_abs) * 100

    # Distribution shape
    residual_skew = skew(residuals)
    residual_kurtosis = kurtosis(residuals)

    # Recommendations
    recommendations = []
    if pct_outliers > 5:
        recommendations.append("High outlier rate: Consider MAE or Huber loss")
    if top_10_contribution_mse > 50:
        recommendations.append(f"Top 10% errors contribute {top_10_contribution_mse:.1f}% of MSE: Consider MAE")
    if abs(residual_kurtosis) > 3:
        recommendations.append(f"Heavy tails (kurtosis={residual_kurtosis:.2f}): MAE more stable")
    if rmse > 1.5 * mae:
        recommendations.append("RMSE >> MAE indicates large errors dominating: Consider MAE")

    return {
        'mae': mae,
        'mse': mse,
        'rmse': rmse,
        'rmse_over_mae': rmse / mae,
        'pct_outliers': pct_outliers,
        'top_10_contribution_mse': top_10_contribution_mse,
        'top_10_contribution_mae': top_10_contribution_mae,
        'skewness': residual_skew,
        'kurtosis': residual_kurtosis,
        'recommendations': recommendations,
    }

# Example with outlier-prone data
np.random.seed(42)
y_true = np.random.normal(100, 10, 200)
y_pred = y_true + np.random.normal(0, 5, 200)

# Add some outliers
y_pred[0:10] += 50  # Large over-predictions

result = metric_analysis(y_true, y_pred)
print("=== Metric Selection Analysis ===")
for key, value in result.items():
    if key != 'recommendations':
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")
        else:
            print(f"  {key}: {value}")
print("Recommendations:")
for rec in result['recommendations']:
    print(f"  → {rec}")
```

Let's cover practical aspects of working with MAE in real ML workflows.
Training with MAE
Most ML frameworks support MAE as a loss function, but optimization may require adjustments:
```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

def compare_mae_mse_training():
    """
    Compare models trained with MSE vs (approximately) MAE objectives.
    """
    # Generate data with outliers
    np.random.seed(42)
    n_samples = 500
    X = np.random.randn(n_samples, 5)

    # True relationship
    true_coef = np.array([1.0, -2.0, 3.0, -1.5, 0.5])
    y_clean = X @ true_coef

    # Add noise with outliers
    noise = np.random.randn(n_samples) * 2
    outlier_idx = np.random.choice(n_samples, size=25, replace=False)
    noise[outlier_idx] = np.random.randn(25) * 30  # Large outliers
    y = y_clean + noise

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Model 1: Standard Linear Regression (minimizes MSE)
    lr_mse = LinearRegression()
    lr_mse.fit(X_train, y_train)

    # Model 2: Huber regression with the smallest allowed epsilon.
    # sklearn has no pure-MAE linear regressor; HuberRegressor requires
    # epsilon >= 1.0, and epsilon=1.0 is its closest approximation to MAE.
    # (SGDRegressor with loss='epsilon_insensitive', epsilon=0 is another option.)
    lr_mae = HuberRegressor(epsilon=1.0)
    lr_mae.fit(X_train, y_train)

    # Evaluate both
    print("=== Model Comparison ===")
    print("True coefficients:", true_coef)
    print("MSE model coefficients:", lr_mse.coef_.round(2))
    print("MAE model coefficients:", lr_mae.coef_.round(2))

    pred_mse = lr_mse.predict(X_test)
    pred_mae = lr_mae.predict(X_test)

    print("--- Test Set Performance ---")
    print("MSE-trained model:")
    print(f"  MAE: {mean_absolute_error(y_test, pred_mse):.3f}")
    print(f"  MSE: {mean_squared_error(y_test, pred_mse):.3f}")
    print("MAE-trained model (Huber approximation):")
    print(f"  MAE: {mean_absolute_error(y_test, pred_mae):.3f}")
    print(f"  MSE: {mean_squared_error(y_test, pred_mae):.3f}")

    # Coefficient recovery analysis
    print("--- Coefficient Recovery (closer to true = better) ---")
    mse_dist = np.linalg.norm(lr_mse.coef_ - true_coef)
    mae_dist = np.linalg.norm(lr_mae.coef_ - true_coef)
    print(f"MSE model distance from true: {mse_dist:.3f}")
    print(f"MAE model distance from true: {mae_dist:.3f}")

    if mae_dist < mse_dist:
        print("→ MAE model recovered true coefficients better (more robust to outliers)")
    else:
        print("→ MSE model recovered true coefficients better")

compare_mae_mse_training()
```

Evaluation Best Practices
When using MAE for model evaluation:

- Report MAE alongside RMSE; a large RMSE/MAE ratio signals that a few large errors dominate.
- Examine the full error distribution: median error, high percentiles (e.g., 95th), and the maximum.
- Compare against a naive baseline, such as always predicting the training median (the MAE-optimal constant).
- Keep units consistent with the target so stakeholders can interpret the number directly.
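A small sketch of such an evaluation report; the `error_report` helper is ours, and the 1.5 rule-of-thumb threshold is borrowed from the analysis code earlier in this page:

```python
import numpy as np

def error_report(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    """Report MAE alongside distribution-aware error statistics."""
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    mae = np.mean(abs_err)
    rmse = np.sqrt(np.mean(abs_err ** 2))
    print(f"MAE:            {mae:.3f}")
    print(f"RMSE:           {rmse:.3f}")
    print(f"RMSE/MAE ratio: {rmse / mae:.2f}  (> 1.5 suggests dominant large errors)")
    print(f"Median |error|: {np.median(abs_err):.3f}")
    print(f"95th pct error: {np.percentile(abs_err, 95):.3f}")
    print(f"Max error:      {np.max(abs_err):.3f}")

# Demo on synthetic residuals
np.random.seed(0)
y_true_demo = np.random.normal(0, 1, 1000)
y_pred_demo = y_true_demo + np.random.normal(0, 0.5, 1000)
error_report(y_true_demo, y_pred_demo)
```

For roughly Gaussian residuals the RMSE/MAE ratio hovers near 1.25; heavy tails push it well above that.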
Mean Absolute Error provides a robust, interpretable alternative to MSE. Let's consolidate the key insights:

- MAE averages absolute errors, penalizing every error linearly in its magnitude.
- It is reported in the same units as the target, making it easy to communicate.
- Its linear penalty makes it far less sensitive to outliers than MSE.
- Minimizing MAE yields the conditional median; minimizing MSE yields the conditional mean.
- The absolute value is non-differentiable at zero, so optimizers rely on subgradients or smooth approximations.
- MAE is the τ = 0.5 case of quantile (pinball) loss, connecting it to quantile regression.
What's Next
We've seen that MSE is sensitive to outliers and MAE is fully robust. But what if we want something in between? Huber Loss provides exactly this—behaving like MSE for small errors (nice optimization properties) and like MAE for large errors (robustness). We'll explore this elegant hybrid next.
You now understand MAE deeply—its mathematics, properties, and practical applications. You can confidently choose between MAE and MSE based on your problem's requirements and interpret results appropriately.