While cross-entropy dominates classification, Mean Squared Error (MSE) reigns supreme in regression. When neural networks predict continuous quantities—stock prices, temperatures, ages, pixel intensities, embedding coordinates—MSE is typically the first loss function considered.
But MSE is more than just a convenient mathematical formula. It embeds a profound assumption about the nature of prediction errors: that they follow a Gaussian distribution. This assumption, when valid, makes MSE the theoretically optimal choice. When violated, MSE can lead to poor models that are overly sensitive to outliers.
This page will take you through MSE from first principles: its definition, statistical foundations, gradient behavior, and the circumstances under which you should—and should not—use it. We'll also explore robust alternatives like Mean Absolute Error (MAE) and Huber loss that address MSE's limitations.
By the end of this page, you will understand: (1) MSE's definition and mathematical properties, (2) its derivation from maximum likelihood under Gaussian noise, (3) gradient behavior and optimization characteristics, (4) sensitivity to outliers and when this matters, and (5) robust alternatives including MAE, Huber loss, and quantile regression.
For a dataset of $N$ samples with true values $\mathbf{y} = (y_1, \ldots, y_N)$ and predictions $\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_N)$, the Mean Squared Error is:
$$\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{N} \|\mathbf{y} - \hat{\mathbf{y}}\|_2^2$$
This is the squared $L_2$ norm of the error vector (the squared Euclidean distance), scaled by $1/N$.
MSE admits a beautiful decomposition that reveals its components:
$$\mathbb{E}[(Y - \hat{Y})^2] = \underbrace{(\mathbb{E}[\hat{Y}] - \mathbb{E}[Y])^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{Y})}_{\text{Variance}} + \underbrace{\text{Var}(Y)}_{\text{Irreducible Error}}$$
This decomposition shows that MSE captures both systematic errors (bias) and random fluctuations (variance) in our predictions, plus the inherent unpredictability of the target.
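The decomposition is easy to verify with a quick simulation, assuming (as the decomposition implicitly does) that the estimator's randomness is independent of the target noise. This is a sketch with made-up bias and variance values, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Target: true value 3.0 plus irreducible noise with Var(Y) = 1.0
y = 3.0 + rng.normal(0.0, 1.0, size=n)

# Estimator: biased by +0.5, with its own variance 0.25, independent of Y's noise
y_hat = 3.5 + rng.normal(0.0, 0.5, size=n)

mse = np.mean((y - y_hat) ** 2)
decomposed = 0.5 ** 2 + 0.25 + 1.0  # Bias^2 + Variance + Irreducible error
print(f"Empirical MSE:             {mse:.3f}")   # ~1.50
print(f"Bias^2 + Var + Irreducible: {decomposed:.2f}")
```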
Differentiability: MSE is smooth and differentiable everywhere, with simple gradients.
Convexity: MSE is a convex function of the predictions. For linear models this yields a unique global minimum. For neural networks the overall loss surface is non-convex, but the non-convexity comes entirely from the network parameterization; MSE itself introduces no additional local minima.
Scale Sensitivity: MSE is sensitive to the scale of targets. A target in thousands will produce losses in millions, dominating other objectives in multi-task settings.
| Property | Value | Implication |
|---|---|---|
| Range | [0, ∞) | Always non-negative; 0 is perfect prediction |
| Gradient | 2(ŷ - y)/N | Linear in prediction error—simple and stable |
| Hessian | 2I/N | Constant positive definite—strongly convex |
| Optimal predictor | $\mathbb{E}[Y \mid X]$ | Mean of the conditional distribution |
| Scale invariance | No | Multiply targets by k → Loss multiplied by k² |
The Root Mean Squared Error (RMSE) is simply $\sqrt{MSE}$:
$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$
Why use RMSE? Because it is in the same units as the target, reported errors are directly interpretable: an RMSE of 3.2 °C means typical errors of a few degrees.
Why optimize MSE rather than RMSE? They share the same minimizer, and MSE has the simpler, linear gradient; the square root only rescales the gradient without changing the optimum.
Common convention: train with MSE, report RMSE.
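A tiny numerical check of this convention, in the same NumPy style as the later examples (the values here are arbitrary illustrative data):

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([2.5, 3.0, 7.0])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# RMSE is in the target's units (easier to report);
# MSE has the simpler, linear gradient (easier to optimize).
print(f"MSE:  {mse:.4f}")   # 0.7500
print(f"RMSE: {rmse:.4f}")  # 0.8660
```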
MSE isn't an arbitrary choice—it emerges naturally from assuming prediction errors are Gaussian distributed. This is the most important theoretical justification for MSE.
The Model: Assume the true relationship is: $$y = f(x; \theta) + \epsilon$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise.
This means $y | x \sim \mathcal{N}(f(x; \theta), \sigma^2)$, and the likelihood of observing $y$ given prediction $f(x; \theta)$ is:
$$p(y | x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x; \theta))^2}{2\sigma^2}\right)$$
Log-Likelihood: For $N$ independent samples:
$$\log p(\mathbf{y} | \mathbf{X}, \theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f(x_i; \theta))^2$$
Maximum Likelihood = Minimize MSE: Since the first term is constant and $\sigma^2 > 0$, maximizing log-likelihood is equivalent to:
$$\min_\theta \sum_{i=1}^{N}(y_i - f(x_i; \theta))^2$$
which is exactly MSE (up to a constant factor).
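A quick numerical check of this equivalence, sketched under the stated Gaussian assumption: sweep a constant prediction $c$ and confirm that the Gaussian negative log-likelihood and MSE are minimized at the same value.

```python
import numpy as np

rng = np.random.default_rng(0)
y = 3.0 + rng.normal(0.0, 1.5, size=1000)  # targets with Gaussian noise around 3.0
sigma2 = 1.5 ** 2

def gaussian_nll(y, c, sigma2):
    """Average negative log-likelihood of y under N(c, sigma2)."""
    return 0.5 * np.log(2 * np.pi * sigma2) + np.mean((y - c) ** 2) / (2 * sigma2)

def mse(y, c):
    return np.mean((y - c) ** 2)

cs = np.linspace(0.0, 6.0, 601)
c_nll = cs[np.argmin([gaussian_nll(y, c, sigma2) for c in cs])]
c_mse = cs[np.argmin([mse(y, c) for c in cs])]

print(f"argmin NLL: {c_nll:.2f}")   # same value...
print(f"argmin MSE: {c_mse:.2f}")   # ...both at (approximately) mean(y)
print(f"mean(y):    {np.mean(y):.2f}")
```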
The profound implication: If your errors are truly Gaussian, MSE is the statistically optimal loss function. It extracts maximum information from your data.
Why is the Gaussian assumption often reasonable? The Central Limit Theorem states that the sum of many independent random effects tends toward Gaussian. If prediction errors arise from many small, independent sources of uncertainty, the aggregate error will be approximately Gaussian—justifying MSE even when no single error source is Gaussian.
MSE encourages the model to predict the conditional mean $\mathbb{E}[Y | X]$. This is because:
$$\arg\min_c \mathbb{E}[(Y - c)^2] = \mathbb{E}[Y]$$
Proof sketch: Let $\mu = \mathbb{E}[Y]$. Then: $$\mathbb{E}[(Y - c)^2] = \mathbb{E}[(Y - \mu + \mu - c)^2] = \text{Var}(Y) + (\mu - c)^2$$
This is minimized when $c = \mu$.
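The same fact can be checked empirically on a skewed sample (a small sketch with simulated data): the constant that minimizes mean squared error is the sample mean, while mean absolute error is minimized by the median.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed targets

cs = np.linspace(0.0, 5.0, 2001)
best_sq = cs[np.argmin([np.mean((y - c) ** 2) for c in cs])]
best_abs = cs[np.argmin([np.mean(np.abs(y - c)) for c in cs])]

print(f"argmin squared error:  {best_sq:.3f}  vs mean:   {np.mean(y):.3f}")
print(f"argmin absolute error: {best_abs:.3f}  vs median: {np.median(y):.3f}")
```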
Implication: A model trained with MSE learns to predict the average outcome for each input. This is the optimal point estimate under squared loss, but the mean can be misleading when the conditional distribution is skewed or multimodal, and a single point estimate conveys nothing about uncertainty.
In many real problems, noise variance varies with input: $\epsilon(x) \sim \mathcal{N}(0, \sigma^2(x))$.
Standard MSE assumes constant variance (homoscedasticity). For heteroscedastic problems, the Gaussian negative log-likelihood with an input-dependent variance replaces plain MSE:
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \left[\frac{(y_i - \mu(x_i))^2}{2\sigma^2(x_i)} + \frac{1}{2}\log\sigma^2(x_i)\right]$$
where the network outputs both $\mu(x)$ and $\sigma^2(x)$.
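A minimal NumPy sketch of this heteroscedastic loss, assuming the network's second head predicts $\log\sigma^2(x)$ rather than $\sigma^2(x)$ (a common practice that keeps the variance positive and the loss numerically stable); the example data are made up:

```python
import numpy as np

def heteroscedastic_gaussian_nll(y_true, mu, log_var):
    """
    Gaussian NLL with input-dependent variance.
    y_true, mu, log_var: arrays of shape (N,), where mu is the predicted mean
    and log_var is the predicted log(sigma^2).
    """
    inv_var = np.exp(-log_var)                        # 1 / sigma^2(x)
    squared_term = 0.5 * inv_var * (y_true - mu) ** 2
    penalty = 0.5 * log_var                           # stops sigma^2 from growing without bound
    return np.mean(squared_term + penalty)

# With a constant log_var of 0 (sigma^2 = 1), this reduces to MSE / 2:
y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.5, 2.0, 2.0])
print(heteroscedastic_gaussian_nll(y, mu, np.zeros_like(y)))  # 0.2083...
print(0.5 * np.mean((y - mu) ** 2))                           # same value
```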
For a single sample with prediction $\hat{y}$ and target $y$:
$$\frac{\partial}{\partial \hat{y}} (y - \hat{y})^2 = -2(y - \hat{y}) = 2(\hat{y} - y)$$
For the mean over $N$ samples: $$\frac{\partial \mathcal{L}_{MSE}}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$$
Gradient magnitude is proportional to error: Large errors produce large gradients; small errors produce small gradients. This is intuitive and leads to stable optimization.
Using chain rule, the gradient with respect to parameters $\theta$:
$$\frac{\partial \mathcal{L}_{MSE}}{\partial \theta} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i) \cdot \frac{\partial \hat{y}_i}{\partial \theta}$$
The loss contributes a linear factor $(\hat{y} - y)$, while the network's Jacobian $\frac{\partial \hat{y}}{\partial \theta}$ handles the transformation.
For linear models $\hat{y} = \mathbf{w}^T \mathbf{x}$, MSE creates a paraboloid loss surface:
$$\mathcal{L}(\mathbf{w}) = \frac{1}{N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$$
For neural networks, the network parameterization introduces non-convexity, but MSE itself doesn't add local minima.
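For the linear case, this convexity means the minimum can be found in closed form; a small sketch (with simulated data and a hypothetical learning rate) comparing the normal-equation solution with what gradient descent on MSE converges to:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

# Closed-form least-squares solution: the unique global minimum of the paraboloid
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on MSE reaches the same point
w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of (1/N)||y - Xw||^2
    w -= lr * grad

print(np.round(w_closed, 3))
print(np.round(w, 3))  # matches the closed-form solution
```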
```python
import numpy as np
import matplotlib.pyplot as plt

def mse_loss(y_true, y_pred):
    """Compute MSE loss."""
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    """
    Compute gradient of MSE with respect to predictions.
    dL/d(y_pred) = 2 * (y_pred - y_true) / N
    """
    n = len(y_true)
    return 2 * (y_pred - y_true) / n

def rmse_loss(y_true, y_pred):
    """Compute RMSE loss."""
    return np.sqrt(mse_loss(y_true, y_pred))

# Demonstrate gradient stability
print("MSE Gradient Analysis")
print("=" * 50)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Various prediction scenarios
scenarios = [
    ("Perfect", y_true.copy()),
    ("Small errors", y_true + np.array([0.1, -0.1, 0.05, -0.15, 0.1])),
    ("Medium errors", y_true + np.array([0.5, -0.3, 0.8, -0.6, 0.4])),
    ("Large errors", y_true + np.array([2.0, -1.5, 3.0, -2.0, 1.5])),
    ("One outlier", y_true + np.array([0.1, 0.1, 10.0, 0.1, 0.1])),
]

for name, y_pred in scenarios:
    loss = mse_loss(y_true, y_pred)
    grad = mse_gradient(y_true, y_pred)
    grad_norm = np.linalg.norm(grad)
    print(f"\n{name}:")
    print(f"  MSE Loss: {loss:.4f}")
    print(f"  RMSE: {np.sqrt(loss):.4f}")
    print(f"  Gradient: {grad}")
    print(f"  Gradient L2 norm: {grad_norm:.4f}")

# Demonstrate linear relationship between error and gradient
print("\n" + "=" * 50)
print("Error-Gradient Relationship:")
errors = np.linspace(-5, 5, 11)
for e in errors:
    grad = 2 * e  # For a single sample
    print(f"  Error {e:+.1f} → Gradient {grad:+.1f}")
```

MSE Gradient Behavior:
Contrast with Classification: Recall that for sigmoid + MSE in classification (which we don't recommend): $$\frac{\partial \mathcal{L}}{\partial z} = 2(\hat{y} - y) \cdot \sigma'(z) = 2(\hat{y} - y) \cdot \hat{y}(1 - \hat{y})$$
The $\hat{y}(1-\hat{y})$ factor vanishes for confident predictions, causing gradient vanishing. This is why MSE is problematic for classification—not for regression where no sigmoid is involved.
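A quick numeric illustration of that vanishing factor (a sketch with arbitrary pre-activation values): for a confidently wrong prediction, the sigmoid + MSE gradient is tiny, while sigmoid + cross-entropy keeps a large gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0  # true label
for z in [-6.0, -2.0, 0.0, 2.0]:  # very negative z = confidently wrong
    p = sigmoid(z)
    grad_mse = 2 * (p - y) * p * (1 - p)  # dMSE/dz, includes the sigma'(z) factor
    grad_ce = p - y                       # d(cross-entropy)/dz, no vanishing factor
    print(f"z={z:+.1f}  p={p:.4f}  MSE grad={grad_mse:+.5f}  CE grad={grad_ce:+.5f}")
```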
The same property that makes MSE stable for gradient descent—quadratic penalty—makes it extremely sensitive to outliers.
The Math: An error of magnitude $e$ contributes $e^2$ to the loss: an error of 1 adds 1, an error of 10 adds 100, and an error of 100 adds 10,000.
A single large outlier can dominate the entire loss, disproportionately influencing the model.
Real-World Example: Consider predicting house prices. Most houses sell for $200K-$500K, but one mansion sold for $50M. With MSE, the mansion's multi-million-dollar error is squared and dwarfs the combined loss of every ordinary house, so the model skews its predictions to appease a single data point.
Outliers as noise: Measurement errors, data entry mistakes, sensor glitches. These should NOT influence the model. MSE is inappropriate.
Outliers as signal: Rare but valid extreme events (fraud transactions, disease outbreaks). These SHOULD influence the model. But MSE may still be problematic—you likely want special handling, not just squared penalty.
We'll explore three key alternatives: Mean Absolute Error (MAE), Huber loss, and quantile loss. The table below compares how MSE, MAE, and Huber penalize errors of different sizes.
| Error | MSE | MAE | Huber (δ=1) | Sensitivity |
|---|---|---|---|---|
| 0.1 | 0.01 | 0.1 | 0.005 | All similar (small) |
| 0.5 | 0.25 | 0.5 | 0.125 | All similar (small) |
| 1.0 | 1.00 | 1.0 | 0.50 | MSE = MAE |
| 2.0 | 4.00 | 2.0 | 1.50 | MSE 2× MAE |
| 10.0 | 100.0 | 10.0 | 9.50 | MSE 10× MAE |
| 100.0 | 10000 | 100 | 99.5 | MSE 100× MAE |
Definition: $$\mathcal{L}_{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$$
Gradient: $$\frac{\partial \mathcal{L}_{MAE}}{\partial \hat{y}_i} = \frac{1}{N} \cdot \text{sign}(\hat{y}_i - y_i)$$
Key Properties: the penalty grows linearly with error, so a single outlier contributes proportionally rather than quadratically; the gradient has constant magnitude regardless of error size; and the loss is not differentiable at zero (in practice a subgradient is used).
Statistical Interpretation: MAE is maximum likelihood under Laplacian noise, $\epsilon \sim \text{Laplace}(0, b)$; the optimal predictor is the conditional median, not the mean.
Downsides: the constant-magnitude gradient does not shrink as predictions approach the target, which can cause oscillation around the minimum unless the learning rate is decayed; the kink at zero complicates second-order methods; and predicting the median may be undesirable when the mean is the quantity of interest.
```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def mae_gradient(y_true, y_pred):
    """Gradient of MAE (subgradient at 0)."""
    return np.sign(y_pred - y_true) / len(y_true)

def huber_loss(y_true, y_pred, delta=1.0):
    """
    Huber loss: MSE for small errors, MAE for large errors.

    L(e) = 0.5 * e^2                    if |e| <= delta
         = delta * |e| - 0.5 * delta^2  otherwise
    """
    error = y_true - y_pred
    abs_error = np.abs(error)
    # Quadratic for small errors, linear for large
    loss = np.where(
        abs_error <= delta,
        0.5 * error ** 2,
        delta * abs_error - 0.5 * delta ** 2
    )
    return np.mean(loss)

def huber_gradient(y_true, y_pred, delta=1.0):
    """
    Gradient of Huber loss.

    dL/d(y_pred) = (y_pred - y_true)              if |e| <= delta  (like MSE)
                 = delta * sign(y_pred - y_true)  otherwise        (like MAE)
    """
    error = y_pred - y_true
    abs_error = np.abs(error)
    grad = np.where(
        abs_error <= delta,
        error,                  # MSE-like gradient
        delta * np.sign(error)  # MAE-like gradient, capped
    )
    return grad / len(y_true)

def quantile_loss(y_true, y_pred, quantile=0.5):
    """
    Quantile loss (asymmetric MAE).

    For quantile=0.5, equivalent to MAE/2.
    For quantile=0.9, penalizes under-prediction 9x more than over-prediction.
    """
    error = y_true - y_pred
    loss = np.where(
        error >= 0,
        quantile * error,
        (quantile - 1) * error
    )
    return np.mean(loss)

# Demonstration: Outlier robustness
print("Loss Comparison with Outliers")
print("=" * 50)

# Normal data
y_true_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.2, 2.9, 3.8, 5.1])  # Reasonable predictions

print("Clean data:")
print(f"  MSE:   {np.mean((y_true_clean - y_pred)**2):.4f}")
print(f"  MAE:   {mae_loss(y_true_clean, y_pred):.4f}")
print(f"  Huber: {huber_loss(y_true_clean, y_pred, delta=1.0):.4f}")

# Data with outlier
y_true_outlier = np.array([1.0, 2.0, 3.0, 4.0, 50.0])  # Last point is an outlier

print("\nWith outlier (50 instead of 5):")
print(f"  MSE:   {np.mean((y_true_outlier - y_pred)**2):.4f}")
print(f"  MAE:   {mae_loss(y_true_outlier, y_pred):.4f}")
print(f"  Huber: {huber_loss(y_true_outlier, y_pred, delta=1.0):.4f}")

# Relative increase
mse_clean = np.mean((y_true_clean - y_pred)**2)
mse_outlier = np.mean((y_true_outlier - y_pred)**2)
mae_clean = mae_loss(y_true_clean, y_pred)
mae_outlier = mae_loss(y_true_outlier, y_pred)

print(f"\nOutlier impact (increase factor):")
print(f"  MSE increased by: {mse_outlier/mse_clean:.1f}x")
print(f"  MAE increased by: {mae_outlier/mae_clean:.1f}x")
```

Huber loss (also called smooth L1 loss) combines the advantages of MSE and MAE:
$$L_\delta(e) = \begin{cases} \frac{1}{2}e^2 & \text{if } |e| \leq \delta \\ \delta|e| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$
where $e = y - \hat{y}$ is the prediction error and $\delta$ is a threshold parameter.
Properties: quadratic (MSE-like) for small errors, so the loss is smooth near the optimum; linear (MAE-like) for large errors, so outliers have bounded influence; and continuously differentiable everywhere, including at $|e| = \delta$. Gradient:
$$\frac{\partial L_\delta}{\partial \hat{y}} = \begin{cases} \hat{y} - y & \text{if } |e| \leq \delta \\ \delta \cdot \text{sign}(\hat{y} - y) & \text{otherwise} \end{cases}$$
Key insight: the per-sample gradient magnitude is capped at $\delta$. No matter how extreme an outlier is, it contributes a gradient of magnitude at most $\delta$ (before the $1/N$ averaging). This is the core robustness mechanism.
The threshold δ should be set based on what you consider a 'normal' error magnitude:
• δ = 1.0 is a common default
• Estimate δ from data: set it to the median absolute deviation (MAD) or a percentile of residuals from an initial model
• Domain knowledge: if errors above X are likely outliers, set δ ≈ X
In practice, δ can be treated as a hyperparameter tuned on validation data.
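One of the heuristics from the list above, sketched in NumPy with simulated residuals: fit a quick baseline, then set δ from the median absolute deviation (or a high percentile) of its residuals. The 1.4826 factor is the usual constant that makes MAD comparable to a standard deviation under Gaussian noise; the function and data here are illustrative.

```python
import numpy as np

def delta_from_residuals(y_true, y_pred, method="mad"):
    """Heuristic Huber delta estimated from baseline residuals."""
    residuals = y_true - y_pred
    if method == "mad":
        # Median absolute deviation, scaled to be consistent with a Gaussian std
        return 1.4826 * np.median(np.abs(residuals - np.median(residuals)))
    # Otherwise: treat the 90th percentile of |residual| as the outlier threshold
    return np.percentile(np.abs(residuals), 90)

rng = np.random.default_rng(1)
y_true = rng.normal(0.0, 2.0, size=500)
y_pred = y_true + rng.normal(0.0, 0.5, size=500)  # baseline residual std ~ 0.5
y_true[:5] += 30.0                                # a few gross outliers

print(f"delta (MAD):       {delta_from_residuals(y_true, y_pred, 'mad'):.2f}")
print(f"delta (90th pct.): {delta_from_residuals(y_true, y_pred, 'p90'):.2f}")
```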
A common variant used in object detection (e.g., Faster R-CNN) is:
$$\text{smooth}_{L_1}(e) = \begin{cases} 0.5e^2 & \text{if } |e| < 1 \\ |e| - 0.5 & \text{otherwise} \end{cases}$$
This is Huber loss with $\delta = 1$, often used for bounding box regression where outliers (due to wrong anchor matches) are common.
Another option that's twice-differentiable and robust:
$$L_{\text{log-cosh}}(e) = \log(\cosh(e))$$
Properties: behaves like $\frac{1}{2}e^2$ for small errors and like $|e| - \log 2$ for large errors; its gradient is $\tanh(e)$, which is bounded by 1; and unlike Huber it is twice differentiable everywhere.
Use case: When you need second-order optimization methods (Newton, L-BFGS) that require the Hessian.
```python
import numpy as np

def huber_loss_single(error, delta=1.0):
    """Huber loss for a single error value."""
    abs_e = np.abs(error)
    if abs_e <= delta:
        return 0.5 * error ** 2
    else:
        return delta * abs_e - 0.5 * delta ** 2

def huber_gradient_single(error, delta=1.0):
    """Huber gradient for a single prediction."""
    if abs(error) <= delta:
        return -error  # d/d(y_pred) where error = y - y_pred
    else:
        return -delta * np.sign(error)

def smooth_l1_loss(y_true, y_pred, beta=1.0):
    """
    Smooth L1 loss (used in object detection).
    Parameterized by beta (= delta in Huber terms).
    """
    error = y_true - y_pred
    abs_error = np.abs(error)
    loss = np.where(
        abs_error < beta,
        0.5 * error ** 2 / beta,
        abs_error - 0.5 * beta
    )
    return np.mean(loss)

def log_cosh_loss(y_true, y_pred):
    """Log-cosh loss: smooth, robust, twice-differentiable."""
    error = y_true - y_pred
    return np.mean(np.log(np.cosh(error)))

# Compare loss curves
print("Loss Values at Different Error Magnitudes")
print("=" * 60)
print(f"{'Error':>8} {'MSE':>10} {'MAE':>10} {'Huber':>10} {'Log-cosh':>10}")
print("-" * 60)

for e in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    mse = 0.5 * e ** 2
    mae = e
    huber = huber_loss_single(e, delta=1.0)
    logcosh = np.log(np.cosh(e))
    print(f"{e:8.1f} {mse:10.4f} {mae:10.4f} {huber:10.4f} {logcosh:10.4f}")

# Gradient comparison
print("\nGradient Magnitudes (absolute value)")
print("=" * 60)
print(f"{'Error':>8} {'MSE':>10} {'MAE':>10} {'Huber':>10} {'Log-cosh':>10}")
print("-" * 60)

for e in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    mse_grad = e               # d/de of 0.5*e^2
    mae_grad = 1.0             # d/de of |e| for e > 0
    huber_grad = min(e, 1.0)   # capped at delta=1
    logcosh_grad = np.tanh(e)  # d/de of log(cosh(e))
    print(f"{e:8.1f} {mse_grad:10.4f} {mae_grad:10.4f} {huber_grad:10.4f} {logcosh_grad:10.4f}")
```

When predicting multiple targets $\mathbf{y} = (y_1, \ldots, y_K)$, MSE is computed across all outputs:
$$\mathcal{L} = \frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}(y_{ik} - \hat{y}_{ik})^2$$
Considerations: different outputs often live on very different scales and have different noise levels, yet a single averaged MSE weights every squared error equally in raw units.
Consider predicting several targets at once, say one output in the thousands (a price, for example) alongside one between 0 and 1 (a probability or a normalized score).
With raw MSE, the large-scale output dominates the total loss; the small-scale output contributes almost nothing to the gradient and is effectively ignored.
Solution 1: Output Normalization
Normalize targets to similar scales before training: $$y'_k = \frac{y_k - \mu_k}{\sigma_k}$$
This is almost always done in practice.
Solution 2: Per-Output Weights
$$\mathcal{L} = \sum_{k=1}^{K} w_k \cdot \frac{1}{N}\sum_{i=1}^{N}(y_{ik} - \hat{y}_{ik})^2$$
Weights $w_k$ can be set manually according to task importance, inversely to each target's variance (equivalent to normalization), or learned from data via uncertainty weighting, as in the code below.
```python
import numpy as np

def multi_output_mse(y_true, y_pred):
    """Standard MSE for multi-output (averages over all elements)."""
    return np.mean((y_true - y_pred) ** 2)

def weighted_multi_output_mse(y_true, y_pred, weights):
    """
    Weighted MSE for multi-output regression.

    Args:
        y_true: (N, K) array of targets
        y_pred: (N, K) array of predictions
        weights: (K,) array of per-output weights
    """
    per_output_mse = np.mean((y_true - y_pred) ** 2, axis=0)  # (K,)
    return np.sum(weights * per_output_mse)

def normalized_mse(y_true, y_pred, means, stds):
    """
    MSE computed on normalized targets.

    Normalizes both targets and predictions before computing MSE.
    This is equivalent to per-output weighting with w_k = 1/std_k^2.
    """
    y_true_norm = (y_true - means) / stds
    y_pred_norm = (y_pred - means) / stds
    return np.mean((y_true_norm - y_pred_norm) ** 2)

def uncertainty_weighted_loss(y_true, y_pred, log_variances):
    """
    Learned uncertainty weighting (Kendall et al., 2018).

    Network predicts both mean and log(variance) for each output.
    Loss automatically balances outputs based on learned uncertainty.

    L = sum_k [ 0.5 * exp(-s_k) * (y_k - y_pred_k)^2 + 0.5 * s_k ]
    where s_k = log(sigma_k^2) is learned.
    """
    mse_per_output = np.mean((y_true - y_pred) ** 2, axis=0)  # (K,)
    # exp(-s) * MSE + s: the first term is precision-weighted MSE;
    # the +s term prevents s → +∞ (i.e. sigma → ∞, which would zero out the first term)
    precision = np.exp(-log_variances)  # 1/sigma^2
    loss = 0.5 * np.sum(precision * mse_per_output + log_variances)
    return loss

# Demonstration
print("Multi-Output MSE with Different Scales")
print("=" * 60)

# Simulated multi-output data with different scales
np.random.seed(42)
N = 100

# Three outputs with very different scales
y_true = np.column_stack([
    np.random.randn(N) * 10 + 50,      # Output 1: mean=50, std=10
    np.random.randn(N) * 1000 + 5000,  # Output 2: mean=5000, std=1000
    np.random.randn(N) * 0.1 + 0.5,    # Output 3: mean=0.5, std=0.1
])

# Predictions with similar relative errors (~10% of std)
y_pred = y_true + np.column_stack([
    np.random.randn(N) * 1,     # ~10% error on output 1
    np.random.randn(N) * 100,   # ~10% error on output 2
    np.random.randn(N) * 0.01,  # ~10% error on output 3
])

# Raw MSE
raw_mse = multi_output_mse(y_true, y_pred)
per_output_mse = np.mean((y_true - y_pred) ** 2, axis=0)

print("Raw MSE:")
print(f"  Total: {raw_mse:.2f}")
print(f"  Per-output: {per_output_mse}")
print(f"  Output 2 contributes {per_output_mse[1]/np.sum(per_output_mse)*100:.1f}% of loss!")

# Normalized MSE
means = np.mean(y_true, axis=0)
stds = np.std(y_true, axis=0)
norm_mse = normalized_mse(y_true, y_pred, means, stds)

print(f"\nNormalized MSE: {norm_mse:.4f}")
print("  (Now all outputs contribute roughly equally)")
```

The uncertainty-weighted loss learns to balance multiple outputs automatically. The network predicts both the mean μ(x) and uncertainty σ(x) for each output. The loss becomes:
$$\mathcal{L} = \sum_{k}\left[\frac{(y_k - \mu_k)^2}{2\sigma_k^2} + \frac{1}{2}\log\sigma_k^2\right]$$
The first term weights MSE by precision (1/σ²). The second term prevents σ → ∞ (which would make the first term vanish). This elegant formulation emerges from Gaussian likelihood.
Where MSE shows up in deep learning:
1. Image Regression Tasks: per-pixel losses for depth estimation, super-resolution, and denoising.
2. Autoencoders and VAEs: the reconstruction term is typically pixel-wise MSE (the Gaussian likelihood case).
3. Reinforcement Learning: regression of value functions and Q-values toward bootstrapped targets (often with Huber loss for stability).
4. Regression Heads in Multi-Task Models: bounding-box or keypoint coordinates trained alongside classification heads.
For image generation, MSE has a critical limitation: it encourages blurriness.
Why? MSE penalizes deviations from the ground truth. For a pixel that should be bright (255), MSE is minimized by predicting exactly 255. But for a pixel that might be 0 or 255 (high uncertainty), MSE is minimized by predicting 127.5!
This averaging behavior results in blurry reconstructions that lack high-frequency details.
Solution: Perceptual losses compute MSE in feature space (e.g., VGG features) rather than pixel space: $$\mathcal{L}_{\text{perceptual}} = \|\phi(y) - \phi(\hat{y})\|_2^2$$
where $\phi$ is a pretrained network. This preserves perceptual quality while allowing pixel-level variation.
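A sketch of a perceptual loss in PyTorch, assuming a recent torchvision with pretrained VGG16 weights available; the truncation point (`layer_index=16`) and the random test images are illustrative choices, not a canonical formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(nn.Module):
    """MSE computed in VGG feature space rather than pixel space (illustrative)."""
    def __init__(self, layer_index=16):
        super().__init__()
        # Truncate VGG16 at an intermediate conv block; the cutoff is a design choice
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features
        self.phi = vgg[:layer_index].eval()
        for p in self.phi.parameters():
            p.requires_grad_(False)  # frozen, pretrained feature extractor

    def forward(self, y_pred, y_true):
        # Expects (N, 3, H, W) images, already normalized the way VGG expects
        return F.mse_loss(self.phi(y_pred), self.phi(y_true))

# Usage sketch
loss_fn = PerceptualLoss()
pred = torch.rand(2, 3, 224, 224)
target = torch.rand(2, 3, 224, 224)
print(loss_fn(pred, target).item())
```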
Every major framework provides optimized MSE:
```python
# PyTorch
import torch.nn as nn
loss_fn = nn.MSELoss(reduction='mean')  # or 'sum', 'none'
loss = loss_fn(predictions, targets)

# TensorFlow/Keras
import tensorflow as tf
loss = tf.keras.losses.MeanSquaredError()(targets, predictions)

# JAX
import jax.numpy as jnp
loss = jnp.mean((predictions - targets) ** 2)
```
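The robust alternatives from earlier on this page also ship as built-ins; a sketch assuming a reasonably recent PyTorch version (these losses accept the same `reduction` argument as `nn.MSELoss`):

```python
import torch
import torch.nn as nn

pred = torch.randn(8)
target = torch.randn(8)

mae = nn.L1Loss()(pred, target)                      # Mean Absolute Error
huber = nn.HuberLoss(delta=1.0)(pred, target)        # Huber loss
smooth_l1 = nn.SmoothL1Loss(beta=1.0)(pred, target)  # smooth L1 (detection convention)
print(mae.item(), huber.item(), smooth_l1.item())
```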
The reduction parameter controls averaging:
• `'mean'`: average over all elements (most common)
• `'sum'`: sum over all elements (for custom weighting)
• `'none'`: return per-element losses (for sample weighting)

We've explored Mean Squared Error from its statistical foundations to practical implementation. Here are the essential takeaways: MSE is maximum likelihood under Gaussian noise and drives the model toward the conditional mean; its gradient is linear in the error, which keeps optimization simple and stable; the same quadratic penalty makes it highly sensitive to outliers, where MAE, Huber, or quantile losses are better choices; and in multi-output settings, normalize targets or weight outputs so that no single scale dominates.
Looking Forward
MSE and cross-entropy cover regression and classification respectively, but many real-world problems call for specialized losses beyond these two.
You now understand MSE deeply—when to use it, why it works, and when to reach for alternatives. Combined with cross-entropy from the previous page, you have the foundational losses that power most neural network training.