While Mean Squared Error dominates regression evaluation, it has a well-known weakness: it's highly sensitive to outliers. A single large error can completely dominate the metric, potentially distorting your assessment of model quality.
Mean Absolute Error (MAE) offers a more robust alternative. Instead of squaring errors, MAE takes their absolute values—treating a prediction error of +10 exactly the same as -10, and penalizing all errors proportionally to their magnitude.
But MAE's simplicity hides subtle tradeoffs. Understanding when to choose MAE over MSE requires deep knowledge of both metrics' properties.
By the end of this page, you will understand:

- MAE's mathematical definition and properties
- Why linear penalties provide robustness to outliers
- The statistical interpretation as median optimization
- Optimization challenges due to non-differentiability at zero
- When MAE is preferable to MSE and vice versa
- Practical implementation considerations
Mean Absolute Error measures the average of the absolute differences between predicted and actual values. For a dataset with $n$ samples:
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Compare this to MSE:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The only difference is that the absolute value replaces the square.
This seemingly small change has profound implications for how the metric behaves.
| Property | MAE | MSE |
|---|---|---|
| Formula | $\frac{1}{n}\sum|y_i - \hat{y}_i|$ | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ |
| Error penalty | Linear | Quadratic |
| Units | Same as target | Squared units |
| Outlier sensitivity | Low | High |
| Differentiability | Not at zero | Everywhere |
| Optimal prediction | Median | Mean |
```python
import numpy as np

def mean_absolute_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Mean Absolute Error from scratch.

    Parameters
    ----------
    y_true : np.ndarray
        Array of true target values
    y_pred : np.ndarray
        Array of predicted values

    Returns
    -------
    float
        The mean absolute error
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"Shape mismatch: {y_true.shape} vs {y_pred.shape}")

    # Calculate absolute residuals
    absolute_residuals = np.abs(y_true - y_pred)

    # Average over all samples
    mae = np.mean(absolute_residuals)
    return mae

# Example: House price predictions (in $1000s)
y_true = np.array([250, 300, 350, 400, 450])
y_pred = np.array([260, 290, 340, 420, 440])

mae = mean_absolute_error(y_true, y_pred)
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

print(f"Residuals: {y_true - y_pred}")
print(f"Absolute Residuals: {np.abs(y_true - y_pred)}")
print(f"MAE: {mae:.2f} $1000")
print(f"RMSE: {rmse:.2f} $1000")
print(f"Interpretation: On average, predictions are off by ${mae:.0f},000")
# Output:
# Residuals: [-10  10  10 -20  10]
# Absolute Residuals: [10 10 10 20 10]
# MAE: 12.00 $1000
# RMSE: 12.65 $1000
```

One of MAE's practical advantages is that it's in the same units as the target variable—no square root needed. An MAE of 10 for house prices means 'on average, predictions are off by about $10,000.' This makes MAE arguably more intuitive than MSE for stakeholder communication.
The fundamental difference between MAE and MSE comes down to how they penalize errors of different sizes:
MSE's Quadratic Penalty
MAE's Linear Penalty
With MSE, doubling the error quadruples the penalty. With MAE, doubling the error merely doubles the penalty.
```python
import numpy as np

def compare_penalties():
    """
    Illustrate how MAE and MSE penalize errors differently.
    """
    errors = np.array([1, 2, 5, 10, 20, 50, 100])

    print("Error Magnitude | MAE Penalty | MSE Penalty | MSE/MAE Ratio")
    print("-" * 60)
    for e in errors:
        mae_penalty = abs(e)
        mse_penalty = e ** 2
        ratio = mse_penalty / mae_penalty
        print(f"{e:14d} | {mae_penalty:11d} | {mse_penalty:11d} | {ratio:13.0f}x")

compare_penalties()
# Output:
# Error Magnitude | MAE Penalty | MSE Penalty | MSE/MAE Ratio
# ------------------------------------------------------------
#              1 |           1 |           1 |             1x
#              2 |           2 |           4 |             2x
#              5 |           5 |          25 |             5x
#             10 |          10 |         100 |            10x
#             20 |          20 |         400 |            20x
#             50 |          50 |        2500 |            50x
#            100 |         100 |       10000 |           100x

def outlier_impact_comparison():
    """
    Show how a single outlier affects MAE vs MSE.
    """
    # Baseline: four predictions with small errors
    errors_clean = np.array([2, 3, 2, 3])

    # Add one large error (outlier)
    errors_with_outlier = np.array([2, 3, 2, 3, 50])

    # Calculate metrics
    mae_clean = np.mean(np.abs(errors_clean))
    mse_clean = np.mean(errors_clean ** 2)
    mae_outlier = np.mean(np.abs(errors_with_outlier))
    mse_outlier = np.mean(errors_with_outlier ** 2)

    print("=== Outlier Impact Analysis ===")
    print(f"Clean data (errors: {errors_clean}):")
    print(f"  MAE: {mae_clean:.2f}")
    print(f"  MSE: {mse_clean:.2f}")
    print(f"With outlier (errors: {errors_with_outlier}):")
    print(f"  MAE: {mae_outlier:.2f} (increase: {(mae_outlier/mae_clean - 1)*100:.0f}%)")
    print(f"  MSE: {mse_outlier:.2f} (increase: {(mse_outlier/mse_clean - 1)*100:.0f}%)")

    # Contribution analysis
    print("Outlier's contribution:")
    print(f"  To total absolute error: {50 / np.sum(np.abs(errors_with_outlier)) * 100:.1f}%")
    print(f"  To total squared error: {2500 / np.sum(errors_with_outlier ** 2) * 100:.1f}%")

outlier_impact_comparison()
# The single outlier increases MSE by ~7672% but MAE by only ~380%
```

MAE's robustness is a double-edged sword. If your model occasionally makes catastrophic predictions, MAE will partially hide this.
Always examine the full error distribution, not just the average—consider looking at max error, percentiles (e.g., 95th percentile error), and residual plots.
Here's a profound connection that illuminates MAE's behavior:
Minimizing MAE finds the conditional median, just as minimizing MSE finds the conditional mean.
If we want a single number $c$ that minimizes the sum of absolute deviations from a set of values $\{y_1, y_2, \ldots, y_n\}$:
$$\underset{c}{\text{argmin}} \sum_{i=1}^n |y_i - c|$$
The solution is $c = \text{median}(y_1, ..., y_n)$, not the mean!
Why the Median?
Intuitively, the median balances the number of values above and below it. For absolute deviations, moving the prediction up by a small amount $\epsilon$ decreases each deviation above it by $\epsilon$ and increases each deviation below it by $\epsilon$.

The total change is $\epsilon \times (\text{count below} - \text{count above})$. This is zero only when counts are balanced—i.e., at the median.
```python
import numpy as np
from scipy.optimize import minimize_scalar

def demonstrate_median_minimizes_mae():
    """
    Prove empirically that the median minimizes MAE
    while the mean minimizes MSE.
    """
    # Sample data with outlier
    y = np.array([10, 12, 11, 13, 12, 100])  # 100 is an outlier

    mean_y = np.mean(y)
    median_y = np.median(y)

    print(f"Data: {y}")
    print(f"Mean: {mean_y:.2f}")
    print(f"Median: {median_y:.2f}")

    # Calculate MAE and MSE for different prediction values
    def mae_for_constant(c):
        return np.mean(np.abs(y - c))

    def mse_for_constant(c):
        return np.mean((y - c) ** 2)

    # Test at mean and median
    print(f"MAE when predicting mean ({mean_y:.2f}): {mae_for_constant(mean_y):.2f}")
    print(f"MAE when predicting median ({median_y:.2f}): {mae_for_constant(median_y):.2f}")
    print("→ Median gives lower MAE!")

    print(f"MSE when predicting mean ({mean_y:.2f}): {mse_for_constant(mean_y):.2f}")
    print(f"MSE when predicting median ({median_y:.2f}): {mse_for_constant(median_y):.2f}")
    print("→ Mean gives lower MSE!")

    # Find optimal values numerically
    opt_mae = minimize_scalar(mae_for_constant, bounds=(0, 150), method='bounded')
    opt_mse = minimize_scalar(mse_for_constant, bounds=(0, 150), method='bounded')

    print("Numerical optimization:")
    print(f"  Optimal for MAE: {opt_mae.x:.2f} (median = {median_y:.2f})")
    print(f"  Optimal for MSE: {opt_mse.x:.2f} (mean = {mean_y:.2f})")

demonstrate_median_minimizes_mae()
# Output shows median is optimal for MAE, mean is optimal for MSE
```

Implications for Regression
When you train a model by minimizing MAE: your model learns to predict the conditional median of the target given the features.

When you train by minimizing MSE: your model learns to predict the conditional mean.
These are the same only when the conditional distribution is symmetric. For skewed distributions, mean ≠ median, and the choice of loss function determines which central tendency your model predicts.
| Distribution Shape | Mean vs Median | MAE-Optimal Prediction | MSE-Optimal Prediction |
|---|---|---|---|
| Symmetric (Normal) | Mean = Median | Either | Either |
| Right-skewed (income) | Mean > Median | Lower value | Higher value |
| Left-skewed (test scores) | Mean < Median | Higher value | Lower value |
| Heavy outliers | Mean pulled by outliers | Robust estimate | Outlier-influenced |
For house price prediction, prices are often right-skewed (few very expensive houses). MSE-optimal models will predict higher (toward the mean, influenced by mansions), while MAE-optimal models predict lower (toward the median, the 'typical' house). Neither is 'wrong'—they answer different questions.
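The effect is easy to see on a toy sample. This sketch uses made-up house prices (hypothetical numbers, not from the text) with a few mansions in the tail, showing how the mean is pulled upward while the median stays with the typical house:

```python
import numpy as np

# Hypothetical right-skewed house prices (in $1000s): mostly typical
# homes, plus a few mansions that stretch the right tail.
prices = np.array([200, 220, 240, 250, 260, 280, 300, 1500, 2000])

mean_price = np.mean(prices)      # pulled toward the mansions
median_price = np.median(prices)  # the 'typical' house

print(f"Mean:   ${mean_price:.0f}k")   # ~583k
print(f"Median: ${median_price:.0f}k")  # 260k
# An MSE-optimal constant prediction is the mean;
# an MAE-optimal constant prediction is the median.
```

Both statistics summarize the same data; the loss function you minimize decides which one your model chases.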
MAE has a significant mathematical limitation: the absolute value function is not differentiable at zero.
Recall that: $$|x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}$$
The derivative is: $$\frac{d|x|}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ \text{undefined} & \text{if } x = 0 \end{cases}$$
At $x = 0$, the left-derivative is -1 and the right-derivative is +1—there's a 'kink' in the function.
Why Does This Matter?
Gradient-based optimization (gradient descent, Adam, etc.) requires computing gradients. When a residual is exactly zero, the gradient of the absolute value is undefined, and the optimizer must fall back on a subgradient—some value chosen from the interval $[-1, 1]$.
Practical Solutions
In practice, residuals rarely equal exactly zero due to floating-point representation, so this is more of a theoretical concern. But implementations handle it: subgradient methods simply pick a value in $[-1, 1]$ at the kink (using 0, as `np.sign(0)` does, is a common choice), and smoothed losses such as Huber replace the kink with a quadratic region.
```python
import numpy as np

def mae_gradient(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """
    Compute the (sub)gradient of MAE with respect to predictions.

    For |y - y_hat|:
    - If y > y_hat: gradient = -1 (increase prediction to reduce error)
    - If y < y_hat: gradient = +1 (decrease prediction to reduce error)
    - If y = y_hat: gradient = 0 (arbitrary choice from subgradient [-1, 1])
    """
    residuals = y_true - y_pred

    # Subgradient: negative sign of residual
    gradient = -np.sign(residuals)
    # Note: np.sign(0) = 0, which is a valid subgradient choice

    # Normalize by n (for the mean)
    gradient = gradient / len(y_true)
    return gradient

def compare_gradients():
    """
    Compare MAE and MSE gradients at various residual values.
    """
    residuals = np.array([-10, -5, -1, 0, 1, 5, 10])

    # MSE gradient: d/dy_hat of (y - y_hat)^2 = -2(y - y_hat)
    mse_grad = -2 * residuals

    # MAE gradient: d/dy_hat of |y - y_hat| = -sign(y - y_hat)
    mae_grad = -np.sign(residuals)

    print("Residual | MAE Gradient | MSE Gradient | Ratio")
    print("-" * 55)
    for r, mg, sg in zip(residuals, mae_grad, mse_grad):
        ratio = sg / mg if mg != 0 else 0
        print(f"{r:8.1f} | {mg:12.1f} | {sg:12.1f} | {ratio:5.1f}x")

compare_gradients()
# Output:
# Residual | MAE Gradient | MSE Gradient | Ratio
# -------------------------------------------------------
#    -10.0 |          1.0 |         20.0 |  20.0x
#     -5.0 |          1.0 |         10.0 |  10.0x
#     -1.0 |          1.0 |          2.0 |   2.0x
#      0.0 |          0.0 |          0.0 |   0.0x
#      1.0 |         -1.0 |         -2.0 |   2.0x
#      5.0 |         -1.0 |        -10.0 |  10.0x
#     10.0 |         -1.0 |        -20.0 |  20.0x
```

Key Insight: Constant Learning Signal
Notice that MAE's gradient magnitude is always 1 (or 0 at the optimum), regardless of error size. This is fundamentally different from MSE, where gradient magnitude scales with error size.
Implications:

- Outliers cannot produce exploding gradients; a huge error contributes the same gradient magnitude as a modest one.
- Near the optimum, the gradient does not shrink, so a fixed learning rate can oscillate around the minimum; learning-rate schedules help.
- Convergence on small residuals is slower than with MSE, whose gradients scale with error size.
Huber loss combines MAE and MSE: behaves like MSE for small errors (differentiable, proportional gradients) and like MAE for large errors (robust, bounded gradients). We'll cover Huber loss in detail in the next section. It's often the practical choice when you want robustness without sacrificing optimization stability.
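As a preview of the next section, here is a minimal sketch of the Huber idea; the `delta` threshold and the standalone `huber_loss` helper are illustrative choices, not a library API:

```python
import numpy as np

def huber_loss(y_true: np.ndarray, y_pred: np.ndarray, delta: float = 1.0) -> float:
    """Quadratic for |residual| <= delta, linear beyond: a sketch of the
    MSE/MAE hybrid covered in the next section."""
    residuals = y_true - y_pred
    abs_r = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2        # MSE-like near zero
    linear = delta * (abs_r - 0.5 * delta)  # MAE-like in the tails
    return float(np.mean(np.where(abs_r <= delta, quadratic, linear)))

# Small errors are penalized quadratically, large ones only linearly:
print(huber_loss(np.array([0.0]), np.array([0.5])))   # 0.125 = 0.5 * 0.5^2
print(huber_loss(np.array([0.0]), np.array([10.0])))  # 9.5 = 1.0 * (10 - 0.5)
```

Note how the transition at `delta` keeps the loss differentiable everywhere while bounding the gradient for large residuals.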
MAE is actually a special case of a broader framework: quantile regression. Understanding this connection reveals MAE's true statistical nature.
The Quantile Loss Function
For a given quantile $\tau \in (0, 1)$, the pinball loss (or quantile loss) is:
$$L_\tau(y, \hat{y}) = \begin{cases} \tau \cdot (y - \hat{y}) & \text{if } y \geq \hat{y} \\ (1-\tau) \cdot (\hat{y} - y) & \text{if } y < \hat{y} \end{cases}$$
Simplified: $$L_\tau(y, \hat{y}) = \max(\tau(y - \hat{y}), (1-\tau)(\hat{y} - y))$$
MAE as 50th Percentile Quantile Loss
When $\tau = 0.5$: $$L_{0.5}(y, \hat{y}) = 0.5 \times |y - \hat{y}|$$
This is just half of MAE! Minimizing MAE is equivalent to quantile regression at the median.
```python
import numpy as np
from scipy.optimize import minimize_scalar

def quantile_loss(y_true: np.ndarray, y_pred: np.ndarray, tau: float) -> float:
    """
    Compute quantile loss (pinball loss) for a given quantile tau.

    tau = 0.5 gives half of MAE (median regression)
    tau = 0.1 gives 10th percentile regression
    tau = 0.9 gives 90th percentile regression
    """
    residuals = y_true - y_pred
    # Asymmetric weighting
    loss = np.where(
        residuals >= 0,
        tau * residuals,
        (tau - 1) * residuals  # = (1 - tau) * |residuals|
    )
    return np.mean(loss)

def demonstrate_quantile_mae_equivalence():
    """
    Show that quantile loss at tau = 0.5 is proportional to MAE.
    """
    np.random.seed(42)
    y_true = np.random.randn(100)
    y_pred = y_true + np.random.randn(100) * 0.5  # Predictions with noise

    mae = np.mean(np.abs(y_true - y_pred))
    q_loss_50 = quantile_loss(y_true, y_pred, 0.5)

    print(f"MAE: {mae:.4f}")
    print(f"Quantile loss (τ=0.5): {q_loss_50:.4f}")
    print(f"MAE / 2: {mae/2:.4f}")
    print(f"Ratio: {mae / q_loss_50:.4f}x (should be 2)")

demonstrate_quantile_mae_equivalence()

def quantile_regression_demo():
    """
    Show how different quantiles predict different parts of the distribution.
    """
    # Simulate skewed data
    np.random.seed(42)
    y = np.concatenate([
        np.random.normal(50, 10, 80),   # Most values around 50
        np.random.normal(100, 5, 20)    # Some high values around 100
    ])

    print("Data Statistics:")
    print(f"  Mean: {np.mean(y):.1f}")
    print(f"  Median: {np.median(y):.1f}")
    print(f"  10th percentile: {np.percentile(y, 10):.1f}")
    print(f"  90th percentile: {np.percentile(y, 90):.1f}")

    # Find optimal constant prediction for each quantile
    for tau in [0.1, 0.5, 0.9]:
        result = minimize_scalar(
            lambda c: quantile_loss(y, np.full_like(y, c), tau),
            bounds=(0, 150), method='bounded'
        )
        actual_percentile = np.percentile(y, tau * 100)
        print(f"τ={tau}: Optimal prediction = {result.x:.1f} "
              f"(actual {tau*100:.0f}th percentile = {actual_percentile:.1f})")

quantile_regression_demo()
```

Why This Matters
Understanding MAE as quantile regression at the median gives you powerful options:

- Change τ to predict any quantile of the conditional distribution, not just the median.
- Fit several quantiles (e.g., τ = 0.1 and τ = 0.9) to produce prediction intervals.
- Encode asymmetric error costs directly in the loss by choosing τ away from 0.5.
Asymmetric MAE Variants
By varying τ away from 0.5, you can create asymmetric versions of MAE that penalize over- and under-prediction differently: τ > 0.5 weights under-prediction (predicting too low) more heavily, while τ < 0.5 weights over-prediction more heavily.
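A quick numeric sketch, using the pinball-loss definition above with the convention residual = y − ŷ (the scalar `pinball` helper is ours, for illustration):

```python
def pinball(residual: float, tau: float) -> float:
    """Pinball loss for a single residual (y_true - y_pred)."""
    return tau * residual if residual >= 0 else (tau - 1) * residual

# With tau = 0.9, under-prediction (positive residual) costs 9x more
# than over-prediction of the same magnitude:
print(pinball(+5.0, 0.9))  # 4.5  (we predicted too low)
print(pinball(-5.0, 0.9))  # 0.5  (we predicted too high)
```

At τ = 0.5 the two branches weigh errors equally, recovering half of the absolute error on each side.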
Modern forecasting often uses quantile regression to produce prediction distributions, not just point estimates. Instead of predicting 'sales will be 100 units,' you predict 'there's a 90% chance sales will be between 80 and 130 units.' MAE at τ=0.5 is just one point in this richer picture.
Choosing between MAE and MSE isn't arbitrary—it should reflect your problem's characteristics and requirements. Here's a comprehensive framework for making this decision.
| Scenario | Choose MAE | Choose MSE |
|---|---|---|
| Outliers present | ✓ If outliers are noise/errors | ✓ If outliers are real and important |
| Error cost structure | ✓ Cost linear in error size | ✓ Cost grows super-linearly |
| Target distribution | ✓ Skewed, want median prediction | ✓ Symmetric, want mean prediction |
| Interpretability need | ✓ Stakeholders need intuitive units | RMSE for interpretability |
| Optimization priority | May need special handling | ✓ Clean gradients, convex |
| Safety-critical | Only if large errors acceptable | ✓ Never underestimate risk |
Decision Tree Approach
1. Are large errors catastrophic in your domain?
   → YES: Consider MSE or even higher-order penalties
   → NO: Continue to question 2
2. Are there outliers that represent noise or measurement error?
   → YES: MAE or Huber loss
   → NO: Continue to question 3
3. Is your target distribution symmetric?
   → YES: Either works; MSE for optimization convenience
   → NO (skewed): MAE if you want median prediction
4. Do stakeholders need intuitive interpretation?
   → YES: MAE (or RMSE from MSE)
   → NO: Either, based on the criteria above
```python
import numpy as np
from scipy.stats import skew, kurtosis

def metric_analysis(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """
    Analyze data to help choose between MAE and MSE.
    Returns guidance based on data characteristics.
    """
    residuals = y_true - y_pred
    abs_residuals = np.abs(residuals)

    # Calculate both metrics
    mae = np.mean(abs_residuals)
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)

    # Outlier indicators
    std_residual = np.std(residuals)
    outlier_threshold = 3 * std_residual
    n_outliers = np.sum(np.abs(residuals) > outlier_threshold)
    pct_outliers = n_outliers / len(residuals) * 100

    # Contribution concentration (how much do the top 10% of errors contribute?)
    sorted_sq = np.sort(residuals ** 2)[::-1]
    top_10_pct = int(0.1 * len(residuals))
    top_10_contribution_mse = np.sum(sorted_sq[:top_10_pct]) / np.sum(sorted_sq) * 100

    sorted_abs = np.sort(abs_residuals)[::-1]
    top_10_contribution_mae = np.sum(sorted_abs[:top_10_pct]) / np.sum(sorted_abs) * 100

    # Distribution shape
    residual_skew = skew(residuals)
    residual_kurtosis = kurtosis(residuals)

    # Recommendations
    recommendations = []
    if pct_outliers > 5:
        recommendations.append("High outlier rate: Consider MAE or Huber loss")
    if top_10_contribution_mse > 50:
        recommendations.append(f"Top 10% errors contribute {top_10_contribution_mse:.1f}% of MSE: Consider MAE")
    if abs(residual_kurtosis) > 3:
        recommendations.append(f"Heavy tails (kurtosis={residual_kurtosis:.2f}): MAE more stable")
    if rmse > 1.5 * mae:
        recommendations.append("RMSE >> MAE indicates large errors dominating: Consider MAE")

    return {
        'mae': mae,
        'mse': mse,
        'rmse': rmse,
        'rmse_over_mae': rmse / mae,
        'pct_outliers': pct_outliers,
        'top_10_contribution_mse': top_10_contribution_mse,
        'top_10_contribution_mae': top_10_contribution_mae,
        'skewness': residual_skew,
        'kurtosis': residual_kurtosis,
        'recommendations': recommendations,
    }

# Example with outlier-prone data
np.random.seed(42)
y_true = np.random.normal(100, 10, 200)
y_pred = y_true + np.random.normal(0, 5, 200)

# Add some outliers
y_pred[0:10] += 50  # Large over-predictions

result = metric_analysis(y_true, y_pred)
print("=== Metric Selection Analysis ===")
for key, value in result.items():
    if key != 'recommendations':
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")
        else:
            print(f"  {key}: {value}")
print("Recommendations:")
for rec in result['recommendations']:
    print(f"  → {rec}")
```

Let's cover practical aspects of working with MAE in real ML workflows.
Training with MAE
Most ML frameworks support MAE as a loss function, but optimization may require adjustments:
```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

def compare_mae_mse_training():
    """
    Compare models trained with MSE vs (approximately) MAE objectives.
    """
    # Generate data with outliers
    np.random.seed(42)
    n_samples = 500
    X = np.random.randn(n_samples, 5)

    # True relationship
    true_coef = np.array([1.0, -2.0, 3.0, -1.5, 0.5])
    y_clean = X @ true_coef

    # Add noise with outliers
    noise = np.random.randn(n_samples) * 2
    outlier_idx = np.random.choice(n_samples, size=25, replace=False)
    noise[outlier_idx] = np.random.randn(25) * 30  # Large outliers
    y = y_clean + noise

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Model 1: Standard Linear Regression (minimizes MSE)
    lr_mse = LinearRegression()
    lr_mse.fit(X_train, y_train)

    # Model 2: Huber regression with the smallest allowed epsilon.
    # sklearn has no pure-MAE linear regressor; HuberRegressor requires
    # epsilon >= 1.0, and epsilon=1.0 is its closest approximation to MAE.
    # (SGDRegressor with loss='epsilon_insensitive', epsilon=0 is another option.)
    lr_mae = HuberRegressor(epsilon=1.0)
    lr_mae.fit(X_train, y_train)

    # Evaluate both
    print("=== Model Comparison ===")
    print("True coefficients:", true_coef)
    print("MSE model coefficients:", lr_mse.coef_.round(2))
    print("MAE model coefficients:", lr_mae.coef_.round(2))

    pred_mse = lr_mse.predict(X_test)
    pred_mae = lr_mae.predict(X_test)

    print("--- Test Set Performance ---")
    print("MSE-trained model:")
    print(f"  MAE: {mean_absolute_error(y_test, pred_mse):.3f}")
    print(f"  MSE: {mean_squared_error(y_test, pred_mse):.3f}")
    print("MAE-trained model (Huber approximation):")
    print(f"  MAE: {mean_absolute_error(y_test, pred_mae):.3f}")
    print(f"  MSE: {mean_squared_error(y_test, pred_mae):.3f}")

    # Coefficient recovery analysis
    print("--- Coefficient Recovery (closer to true = better) ---")
    mse_dist = np.linalg.norm(lr_mse.coef_ - true_coef)
    mae_dist = np.linalg.norm(lr_mae.coef_ - true_coef)
    print(f"MSE model distance from true: {mse_dist:.3f}")
    print(f"MAE model distance from true: {mae_dist:.3f}")

    if mae_dist < mse_dist:
        print("→ MAE model recovered true coefficients better (more robust to outliers)")
    else:
        print("→ MSE model recovered true coefficients better")

compare_mae_mse_training()
```

Evaluation Best Practices
When using MAE for model evaluation:

- Report MAE alongside RMSE; a large RMSE/MAE ratio signals that a few large errors dominate.
- Examine the full error distribution: median error, high percentiles (e.g., 95th), and the maximum.
- Compare against a naive baseline, such as always predicting the training median (the MAE-optimal constant).
- Keep units consistent with the target so stakeholders can interpret the number directly.
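A small sketch of such an evaluation report; the `error_report` helper is ours, and the 1.5 rule-of-thumb threshold is borrowed from the analysis code earlier in this page:

```python
import numpy as np

def error_report(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    """Report MAE alongside distribution-aware error statistics."""
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    mae = np.mean(abs_err)
    rmse = np.sqrt(np.mean(abs_err ** 2))
    print(f"MAE:            {mae:.3f}")
    print(f"RMSE:           {rmse:.3f}")
    print(f"RMSE/MAE ratio: {rmse / mae:.2f}  (> 1.5 suggests dominant large errors)")
    print(f"Median |error|: {np.median(abs_err):.3f}")
    print(f"95th pct error: {np.percentile(abs_err, 95):.3f}")
    print(f"Max error:      {np.max(abs_err):.3f}")

# Demo on synthetic residuals
np.random.seed(0)
y_true_demo = np.random.normal(0, 1, 1000)
y_pred_demo = y_true_demo + np.random.normal(0, 0.5, 1000)
error_report(y_true_demo, y_pred_demo)
```

For roughly Gaussian residuals the RMSE/MAE ratio hovers near 1.25; heavy tails push it well above that.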
Mean Absolute Error provides a robust, interpretable alternative to MSE. Let's consolidate the key insights:

- MAE averages absolute errors, penalizing every error linearly in its magnitude.
- It is reported in the same units as the target, making it easy to communicate.
- Its linear penalty makes it far less sensitive to outliers than MSE.
- Minimizing MAE yields the conditional median; minimizing MSE yields the conditional mean.
- The absolute value is non-differentiable at zero, so optimizers rely on subgradients or smooth approximations.
- MAE is the τ = 0.5 case of quantile (pinball) loss, connecting it to quantile regression.
What's Next
We've seen that MSE is sensitive to outliers and MAE is fully robust. But what if we want something in between? Huber Loss provides exactly this—behaving like MSE for small errors (nice optimization properties) and like MAE for large errors (robustness). We'll explore this elegant hybrid next.
You now understand MAE deeply—its mathematics, properties, and practical applications. You can confidently choose between MAE and MSE based on your problem's requirements and interpret results appropriately.