While cross-entropy dominates classification, Mean Squared Error (MSE) reigns supreme in regression. When neural networks predict continuous quantities—stock prices, temperatures, ages, pixel intensities, embedding coordinates—MSE is typically the first loss function considered.
But MSE is more than just a convenient mathematical formula. It embeds a profound assumption about the nature of prediction errors: that they follow a Gaussian distribution. This assumption, when valid, makes MSE the theoretically optimal choice. When violated, MSE can lead to poor models that are overly sensitive to outliers.
This page will take you through MSE from first principles: its definition, statistical foundations, gradient behavior, and the circumstances under which you should—and should not—use it. We'll also explore robust alternatives like Mean Absolute Error (MAE) and Huber loss that address MSE's limitations.
By the end of this page, you will understand: (1) MSE's definition and mathematical properties, (2) its derivation from maximum likelihood under Gaussian noise, (3) gradient behavior and optimization characteristics, (4) sensitivity to outliers and when this matters, and (5) robust alternatives including MAE, Huber loss, and quantile regression.
For a dataset of $N$ samples with true values $\mathbf{y} = (y_1, \ldots, y_N)$ and predictions $\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_N)$, the Mean Squared Error is:
$$\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{N} \|\mathbf{y} - \hat{\mathbf{y}}\|_2^2$$
This is the squared $L_2$ norm of the error vector (the squared Euclidean distance), scaled by $1/N$.
MSE admits a beautiful decomposition that reveals its components:
$$\mathbb{E}[(Y - \hat{Y})^2] = \underbrace{(\mathbb{E}[\hat{Y}] - \mathbb{E}[Y])^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{Y})}_{\text{Variance}} + \underbrace{\text{Var}(Y)}_{\text{Irreducible Error}}$$
This decomposition shows that MSE captures both systematic errors (bias) and random fluctuations (variance) in our predictions, plus the inherent unpredictability of the target.
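The decomposition is easy to verify with a quick simulation, assuming (as the decomposition implicitly does) that the estimator's randomness is independent of the target noise. This is a sketch with made-up bias and variance values, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Target: true value 3.0 plus irreducible noise with Var(Y) = 1.0
y = 3.0 + rng.normal(0.0, 1.0, size=n)

# Estimator: biased by +0.5, with its own variance 0.25, independent of Y's noise
y_hat = 3.5 + rng.normal(0.0, 0.5, size=n)

mse = np.mean((y - y_hat) ** 2)
decomposed = 0.5 ** 2 + 0.25 + 1.0  # Bias^2 + Variance + Irreducible error
print(f"Empirical MSE:             {mse:.3f}")   # ~1.50
print(f"Bias^2 + Var + Irreducible: {decomposed:.2f}")
```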
Differentiability: MSE is smooth and differentiable everywhere, with simple gradients.
Convexity: MSE is a convex function of the predictions. For linear models this yields a unique global minimum. For neural networks the overall loss surface is non-convex, but the non-convexity comes entirely from the network parameterization; MSE itself introduces no additional local minima.
Scale Sensitivity: MSE is sensitive to the scale of targets. A target in thousands will produce losses in millions, dominating other objectives in multi-task settings.
| Property | Value | Implication |
|---|---|---|
| Range | [0, ∞) | Always non-negative; 0 is perfect prediction |
| Gradient | 2(ŷ - y)/N | Linear in prediction error—simple and stable |
| Hessian | 2I/N | Constant positive definite—strongly convex |
| Optimal predictor | $\mathbb{E}[Y \mid X]$ | Mean of the conditional distribution |
| Scale invariance | No | Multiply targets by k → Loss multiplied by k² |
The Root Mean Squared Error (RMSE) is simply $\sqrt{MSE}$:
$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$
Why use RMSE? Because it is in the same units as the target, reported errors are directly interpretable: an RMSE of 3.2 °C means typical errors of a few degrees.
Why optimize MSE rather than RMSE? They share the same minimizer, and MSE has the simpler, linear gradient; the square root only rescales the gradient without changing the optimum.
Common convention: train with MSE, report RMSE.
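A tiny numerical check of this convention, in the same NumPy style as the later examples (the values here are arbitrary illustrative data):

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([2.5, 3.0, 7.0])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# RMSE is in the target's units (easier to report);
# MSE has the simpler, linear gradient (easier to optimize).
print(f"MSE:  {mse:.4f}")   # 0.7500
print(f"RMSE: {rmse:.4f}")  # 0.8660
```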
MSE isn't an arbitrary choice—it emerges naturally from assuming prediction errors are Gaussian distributed. This is the most important theoretical justification for MSE.
The Model: Assume the true relationship is: $$y = f(x; \theta) + \epsilon$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise.
This means $y | x \sim \mathcal{N}(f(x; \theta), \sigma^2)$, and the likelihood of observing $y$ given prediction $f(x; \theta)$ is:
$$p(y | x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x; \theta))^2}{2\sigma^2}\right)$$
Log-Likelihood: For $N$ independent samples:
$$\log p(\mathbf{y} | \mathbf{X}, \theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f(x_i; \theta))^2$$
Maximum Likelihood = Minimize MSE: Since the first term is constant and $\sigma^2 > 0$, maximizing log-likelihood is equivalent to:
$$\min_\theta \sum_{i=1}^{N}(y_i - f(x_i; \theta))^2$$
which is exactly MSE (up to a constant factor).
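A quick numerical check of this equivalence, sketched under the stated Gaussian assumption: sweep a constant prediction $c$ and confirm that the Gaussian negative log-likelihood and MSE are minimized at the same value.

```python
import numpy as np

rng = np.random.default_rng(0)
y = 3.0 + rng.normal(0.0, 1.5, size=1000)  # targets with Gaussian noise around 3.0
sigma2 = 1.5 ** 2

def gaussian_nll(y, c, sigma2):
    """Average negative log-likelihood of y under N(c, sigma2)."""
    return 0.5 * np.log(2 * np.pi * sigma2) + np.mean((y - c) ** 2) / (2 * sigma2)

def mse(y, c):
    return np.mean((y - c) ** 2)

cs = np.linspace(0.0, 6.0, 601)
c_nll = cs[np.argmin([gaussian_nll(y, c, sigma2) for c in cs])]
c_mse = cs[np.argmin([mse(y, c) for c in cs])]

print(f"argmin NLL: {c_nll:.2f}")   # same value...
print(f"argmin MSE: {c_mse:.2f}")   # ...both at (approximately) mean(y)
print(f"mean(y):    {np.mean(y):.2f}")
```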
The profound implication: If your errors are truly Gaussian, MSE is the statistically optimal loss function. It extracts maximum information from your data.
Why is the Gaussian assumption often reasonable? The Central Limit Theorem states that the sum of many independent random effects tends toward Gaussian. If prediction errors arise from many small, independent sources of uncertainty, the aggregate error will be approximately Gaussian—justifying MSE even when no single error source is Gaussian.
MSE encourages the model to predict the conditional mean $\mathbb{E}[Y | X]$. This is because:
$$\arg\min_c \mathbb{E}[(Y - c)^2] = \mathbb{E}[Y]$$
Proof sketch: Let $\mu = \mathbb{E}[Y]$. Then: $$\mathbb{E}[(Y - c)^2] = \mathbb{E}[(Y - \mu + \mu - c)^2] = \text{Var}(Y) + (\mu - c)^2$$
This is minimized when $c = \mu$.
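The same fact can be checked empirically on a skewed sample (a small sketch with simulated data): the constant that minimizes mean squared error is the sample mean, while mean absolute error is minimized by the median.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed targets

cs = np.linspace(0.0, 5.0, 2001)
best_sq = cs[np.argmin([np.mean((y - c) ** 2) for c in cs])]
best_abs = cs[np.argmin([np.mean(np.abs(y - c)) for c in cs])]

print(f"argmin squared error:  {best_sq:.3f}  vs mean:   {np.mean(y):.3f}")
print(f"argmin absolute error: {best_abs:.3f}  vs median: {np.median(y):.3f}")
```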
Implication: A model trained with MSE learns to predict the average outcome for each input. This is the optimal point estimate under squared loss, but the mean can be misleading when the conditional distribution is skewed or multimodal, and a single point estimate conveys nothing about uncertainty.
In many real problems, noise variance varies with input: $\epsilon(x) \sim \mathcal{N}(0, \sigma^2(x))$.
Standard MSE assumes constant variance (homoscedasticity). For heteroscedastic problems, the Gaussian negative log-likelihood with an input-dependent variance replaces plain MSE:
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \left[\frac{(y_i - \mu(x_i))^2}{2\sigma^2(x_i)} + \frac{1}{2}\log\sigma^2(x_i)\right]$$
where the network outputs both $\mu(x)$ and $\sigma^2(x)$.
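A minimal NumPy sketch of this heteroscedastic loss, assuming the network's second head predicts $\log\sigma^2(x)$ rather than $\sigma^2(x)$ (a common practice that keeps the variance positive and the loss numerically stable); the example data are made up:

```python
import numpy as np

def heteroscedastic_gaussian_nll(y_true, mu, log_var):
    """
    Gaussian NLL with input-dependent variance.
    y_true, mu, log_var: arrays of shape (N,), where mu is the predicted mean
    and log_var is the predicted log(sigma^2).
    """
    inv_var = np.exp(-log_var)                        # 1 / sigma^2(x)
    squared_term = 0.5 * inv_var * (y_true - mu) ** 2
    penalty = 0.5 * log_var                           # stops sigma^2 from growing without bound
    return np.mean(squared_term + penalty)

# With a constant log_var of 0 (sigma^2 = 1), this reduces to MSE / 2:
y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.5, 2.0, 2.0])
print(heteroscedastic_gaussian_nll(y, mu, np.zeros_like(y)))  # 0.2083...
print(0.5 * np.mean((y - mu) ** 2))                           # same value
```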
For a single sample with prediction $\hat{y}$ and target $y$:
$$\frac{\partial}{\partial \hat{y}} (y - \hat{y})^2 = -2(y - \hat{y}) = 2(\hat{y} - y)$$
For the mean over $N$ samples: $$\frac{\partial \mathcal{L}_{MSE}}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$$
Gradient magnitude is proportional to error: Large errors produce large gradients; small errors produce small gradients. This is intuitive and leads to stable optimization.
Using chain rule, the gradient with respect to parameters $\theta$:
$$\frac{\partial \mathcal{L}_{MSE}}{\partial \theta} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i) \cdot \frac{\partial \hat{y}_i}{\partial \theta}$$
The loss contributes a linear factor $(\hat{y} - y)$, while the network's Jacobian $\frac{\partial \hat{y}}{\partial \theta}$ handles the transformation.
For linear models $\hat{y} = \mathbf{w}^T \mathbf{x}$, MSE creates a paraboloid loss surface:
$$\mathcal{L}(\mathbf{w}) = \frac{1}{N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$$
For neural networks, the network parameterization introduces non-convexity, but MSE itself doesn't add local minima.
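For the linear case, this convexity means the minimum can be found in closed form; a small sketch (with simulated data and a hypothetical learning rate) comparing the normal-equation solution with what gradient descent on MSE converges to:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

# Closed-form least-squares solution: the unique global minimum of the paraboloid
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on MSE reaches the same point
w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of (1/N)||y - Xw||^2
    w -= lr * grad

print(np.round(w_closed, 3))
print(np.round(w, 3))  # matches the closed-form solution
```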
```python
import numpy as np
import matplotlib.pyplot as plt

def mse_loss(y_true, y_pred):
    """Compute MSE loss."""
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    """
    Compute gradient of MSE with respect to predictions.
    dL/d(y_pred) = 2 * (y_pred - y_true) / N
    """
    n = len(y_true)
    return 2 * (y_pred - y_true) / n

def rmse_loss(y_true, y_pred):
    """Compute RMSE loss."""
    return np.sqrt(mse_loss(y_true, y_pred))

# Demonstrate gradient stability
print("MSE Gradient Analysis")
print("=" * 50)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Various prediction scenarios
scenarios = [
    ("Perfect", y_true.copy()),
    ("Small errors", y_true + np.array([0.1, -0.1, 0.05, -0.15, 0.1])),
    ("Medium errors", y_true + np.array([0.5, -0.3, 0.8, -0.6, 0.4])),
    ("Large errors", y_true + np.array([2.0, -1.5, 3.0, -2.0, 1.5])),
    ("One outlier", y_true + np.array([0.1, 0.1, 10.0, 0.1, 0.1])),
]

for name, y_pred in scenarios:
    loss = mse_loss(y_true, y_pred)
    grad = mse_gradient(y_true, y_pred)
    grad_norm = np.linalg.norm(grad)
    print(f"\n{name}:")
    print(f"  MSE Loss: {loss:.4f}")
    print(f"  RMSE: {np.sqrt(loss):.4f}")
    print(f"  Gradient: {grad}")
    print(f"  Gradient L2 norm: {grad_norm:.4f}")

# Demonstrate linear relationship between error and gradient
print("\n" + "=" * 50)
print("Error-Gradient Relationship:")
errors = np.linspace(-5, 5, 11)
for e in errors:
    grad = 2 * e  # For a single sample
    print(f"  Error {e:+.1f} → Gradient {grad:+.1f}")
```

MSE Gradient Behavior:
Contrast with Classification: Recall that for sigmoid + MSE in classification (which we don't recommend): $$\frac{\partial \mathcal{L}}{\partial z} = 2(\hat{y} - y) \cdot \sigma'(z) = 2(\hat{y} - y) \cdot \hat{y}(1 - \hat{y})$$
The $\hat{y}(1-\hat{y})$ factor vanishes for confident predictions, causing gradient vanishing. This is why MSE is problematic for classification—not for regression where no sigmoid is involved.
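A quick numeric illustration of that vanishing factor (a sketch with arbitrary pre-activation values): for a confidently wrong prediction, the sigmoid + MSE gradient is tiny, while sigmoid + cross-entropy keeps a large gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0  # true label
for z in [-6.0, -2.0, 0.0, 2.0]:  # very negative z = confidently wrong
    p = sigmoid(z)
    grad_mse = 2 * (p - y) * p * (1 - p)  # dMSE/dz, includes the sigma'(z) factor
    grad_ce = p - y                       # d(cross-entropy)/dz, no vanishing factor
    print(f"z={z:+.1f}  p={p:.4f}  MSE grad={grad_mse:+.5f}  CE grad={grad_ce:+.5f}")
```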
The same property that makes MSE stable for gradient descent—quadratic penalty—makes it extremely sensitive to outliers.
The Math: An error of magnitude $e$ contributes $e^2$ to the loss: an error of 1 adds 1, an error of 10 adds 100, and an error of 100 adds 10,000.
A single large outlier can dominate the entire loss, disproportionately influencing the model.
Real-World Example: Consider predicting house prices. Most houses sell for $200K-$500K, but one mansion sold for $50M. With MSE, the mansion's multi-million-dollar error is squared and dwarfs the combined loss of every ordinary house, so the model skews its predictions to appease a single data point.
Outliers as noise: Measurement errors, data entry mistakes, sensor glitches. These should NOT influence the model. MSE is inappropriate.
Outliers as signal: Rare but valid extreme events (fraud transactions, disease outbreaks). These SHOULD influence the model. But MSE may still be problematic—you likely want special handling, not just squared penalty.
We'll explore three key alternatives: Mean Absolute Error (MAE), Huber loss, and quantile loss. The table below compares how MSE, MAE, and Huber penalize errors of different sizes.
| Error | MSE | MAE | Huber (δ=1) | Sensitivity |
|---|---|---|---|---|
| 0.1 | 0.01 | 0.1 | 0.005 | All similar (small) |
| 0.5 | 0.25 | 0.5 | 0.125 | All similar (small) |
| 1.0 | 1.00 | 1.0 | 0.50 | MSE = MAE |
| 2.0 | 4.00 | 2.0 | 1.50 | MSE 2× MAE |
| 10.0 | 100.0 | 10.0 | 9.50 | MSE 10× MAE |
| 100.0 | 10000 | 100 | 99.5 | MSE 100× MAE |
Definition: $$\mathcal{L}_{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$$
Gradient: $$\frac{\partial \mathcal{L}_{MAE}}{\partial \hat{y}_i} = \frac{1}{N} \cdot \text{sign}(\hat{y}_i - y_i)$$
Key Properties: the penalty grows linearly with error, so a single outlier contributes proportionally rather than quadratically; the gradient has constant magnitude regardless of error size; and the loss is not differentiable at zero (in practice a subgradient is used).
Statistical Interpretation: MAE is maximum likelihood under Laplacian noise, $\epsilon \sim \text{Laplace}(0, b)$; the optimal predictor is the conditional median, not the mean.
Downsides: the constant-magnitude gradient does not shrink as predictions approach the target, which can cause oscillation around the minimum unless the learning rate is decayed; the kink at zero complicates second-order methods; and predicting the median may be undesirable when the mean is the quantity of interest.
```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def mae_gradient(y_true, y_pred):
    """Gradient of MAE (subgradient at 0)."""
    return np.sign(y_pred - y_true) / len(y_true)

def huber_loss(y_true, y_pred, delta=1.0):
    """
    Huber loss: MSE for small errors, MAE for large errors.

    L(e) = 0.5 * e^2                    if |e| <= delta
         = delta * |e| - 0.5 * delta^2  otherwise
    """
    error = y_true - y_pred
    abs_error = np.abs(error)
    # Quadratic for small errors, linear for large
    loss = np.where(
        abs_error <= delta,
        0.5 * error ** 2,
        delta * abs_error - 0.5 * delta ** 2
    )
    return np.mean(loss)

def huber_gradient(y_true, y_pred, delta=1.0):
    """
    Gradient of Huber loss.

    dL/d(y_pred) = (y_pred - y_true)              if |e| <= delta  (like MSE)
                 = delta * sign(y_pred - y_true)  otherwise        (like MAE)
    """
    error = y_pred - y_true
    abs_error = np.abs(error)
    grad = np.where(
        abs_error <= delta,
        error,                  # MSE-like gradient
        delta * np.sign(error)  # MAE-like gradient, capped
    )
    return grad / len(y_true)

def quantile_loss(y_true, y_pred, quantile=0.5):
    """
    Quantile loss (asymmetric MAE).

    For quantile=0.5, equivalent to MAE/2.
    For quantile=0.9, penalizes under-prediction 9x more than over-prediction.
    """
    error = y_true - y_pred
    loss = np.where(
        error >= 0,
        quantile * error,
        (quantile - 1) * error
    )
    return np.mean(loss)

# Demonstration: Outlier robustness
print("Loss Comparison with Outliers")
print("=" * 50)

# Normal data
y_true_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.2, 2.9, 3.8, 5.1])  # Reasonable predictions

print("Clean data:")
print(f"  MSE:   {np.mean((y_true_clean - y_pred)**2):.4f}")
print(f"  MAE:   {mae_loss(y_true_clean, y_pred):.4f}")
print(f"  Huber: {huber_loss(y_true_clean, y_pred, delta=1.0):.4f}")

# Data with outlier
y_true_outlier = np.array([1.0, 2.0, 3.0, 4.0, 50.0])  # Last point is an outlier

print("\nWith outlier (50 instead of 5):")
print(f"  MSE:   {np.mean((y_true_outlier - y_pred)**2):.4f}")
print(f"  MAE:   {mae_loss(y_true_outlier, y_pred):.4f}")
print(f"  Huber: {huber_loss(y_true_outlier, y_pred, delta=1.0):.4f}")

# Relative increase
mse_clean = np.mean((y_true_clean - y_pred)**2)
mse_outlier = np.mean((y_true_outlier - y_pred)**2)
mae_clean = mae_loss(y_true_clean, y_pred)
mae_outlier = mae_loss(y_true_outlier, y_pred)

print(f"\nOutlier impact (increase factor):")
print(f"  MSE increased by: {mse_outlier/mse_clean:.1f}x")
print(f"  MAE increased by: {mae_outlier/mae_clean:.1f}x")
```

Huber loss (also called smooth L1 loss) combines the advantages of MSE and MAE:
$$L_\delta(e) = \begin{cases} \frac{1}{2}e^2 & \text{if } |e| \leq \delta \\ \delta|e| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$
where $e = y - \hat{y}$ is the prediction error and $\delta$ is a threshold parameter.
Properties: quadratic (MSE-like) for small errors, so the loss is smooth near the optimum; linear (MAE-like) for large errors, so outliers have bounded influence; and continuously differentiable everywhere, including at $|e| = \delta$. Gradient:
$$\frac{\partial L_\delta}{\partial \hat{y}} = \begin{cases} \hat{y} - y & \text{if } |e| \leq \delta \\ \delta \cdot \text{sign}(\hat{y} - y) & \text{otherwise} \end{cases}$$
Key insight: the per-sample gradient magnitude is capped at $\delta$. No matter how extreme an outlier is, it contributes a gradient of magnitude at most $\delta$ (before the $1/N$ averaging). This is the core robustness mechanism.
The threshold δ should be set based on what you consider a 'normal' error magnitude:
• δ = 1.0 is a common default
• Estimate δ from data: set it to the median absolute deviation (MAD) or a percentile of residuals from an initial model
• Domain knowledge: if errors above X are likely outliers, set δ ≈ X
In practice, δ can be treated as a hyperparameter tuned on validation data.
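One of the heuristics from the list above, sketched in NumPy with simulated residuals: fit a quick baseline, then set δ from the median absolute deviation (or a high percentile) of its residuals. The 1.4826 factor is the usual constant that makes MAD comparable to a standard deviation under Gaussian noise; the function and data here are illustrative.

```python
import numpy as np

def delta_from_residuals(y_true, y_pred, method="mad"):
    """Heuristic Huber delta estimated from baseline residuals."""
    residuals = y_true - y_pred
    if method == "mad":
        # Median absolute deviation, scaled to be consistent with a Gaussian std
        return 1.4826 * np.median(np.abs(residuals - np.median(residuals)))
    # Otherwise: treat the 90th percentile of |residual| as the outlier threshold
    return np.percentile(np.abs(residuals), 90)

rng = np.random.default_rng(1)
y_true = rng.normal(0.0, 2.0, size=500)
y_pred = y_true + rng.normal(0.0, 0.5, size=500)  # baseline residual std ~ 0.5
y_true[:5] += 30.0                                # a few gross outliers

print(f"delta (MAD):       {delta_from_residuals(y_true, y_pred, 'mad'):.2f}")
print(f"delta (90th pct.): {delta_from_residuals(y_true, y_pred, 'p90'):.2f}")
```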
A common variant used in object detection (e.g., Faster R-CNN) is:
$$\text{smooth}_{L_1}(e) = \begin{cases} 0.5e^2 & \text{if } |e| < 1 \\ |e| - 0.5 & \text{otherwise} \end{cases}$$
This is Huber loss with $\delta = 1$, often used for bounding box regression where outliers (due to wrong anchor matches) are common.
Another option that's twice-differentiable and robust:
$$L_{\text{log-cosh}}(e) = \log(\cosh(e))$$
Properties: behaves like $\frac{1}{2}e^2$ for small errors and like $|e| - \log 2$ for large errors; its gradient is $\tanh(e)$, which is bounded by 1; and unlike Huber it is twice differentiable everywhere.
Use case: When you need second-order optimization methods (Newton, L-BFGS) that require the Hessian.
```python
import numpy as np

def huber_loss_single(error, delta=1.0):
    """Huber loss for a single error value."""
    abs_e = np.abs(error)
    if abs_e <= delta:
        return 0.5 * error ** 2
    else:
        return delta * abs_e - 0.5 * delta ** 2

def huber_gradient_single(error, delta=1.0):
    """Huber gradient for a single prediction."""
    if abs(error) <= delta:
        return -error  # d/d(y_pred) where error = y - y_pred
    else:
        return -delta * np.sign(error)

def smooth_l1_loss(y_true, y_pred, beta=1.0):
    """
    Smooth L1 loss (used in object detection).
    Parameterized by beta (= delta in Huber terms).
    """
    error = y_true - y_pred
    abs_error = np.abs(error)
    loss = np.where(
        abs_error < beta,
        0.5 * error ** 2 / beta,
        abs_error - 0.5 * beta
    )
    return np.mean(loss)

def log_cosh_loss(y_true, y_pred):
    """Log-cosh loss: smooth, robust, twice-differentiable."""
    error = y_true - y_pred
    return np.mean(np.log(np.cosh(error)))

# Compare loss curves
print("Loss Values at Different Error Magnitudes")
print("=" * 60)
print(f"{'Error':>8} {'MSE':>10} {'MAE':>10} {'Huber':>10} {'Log-cosh':>10}")
print("-" * 60)

for e in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    mse = 0.5 * e ** 2
    mae = e
    huber = huber_loss_single(e, delta=1.0)
    logcosh = np.log(np.cosh(e))
    print(f"{e:8.1f} {mse:10.4f} {mae:10.4f} {huber:10.4f} {logcosh:10.4f}")

# Gradient comparison
print("\nGradient Magnitudes (absolute value)")
print("=" * 60)
print(f"{'Error':>8} {'MSE':>10} {'MAE':>10} {'Huber':>10} {'Log-cosh':>10}")
print("-" * 60)

for e in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    mse_grad = e               # d/de of 0.5*e^2
    mae_grad = 1.0             # d/de of |e| for e > 0
    huber_grad = min(e, 1.0)   # capped at delta=1
    logcosh_grad = np.tanh(e)  # d/de of log(cosh(e))
    print(f"{e:8.1f} {mse_grad:10.4f} {mae_grad:10.4f} {huber_grad:10.4f} {logcosh_grad:10.4f}")
```

When predicting multiple targets $\mathbf{y} = (y_1, \ldots, y_K)$, MSE is computed across all outputs:
$$\mathcal{L} = \frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}(y_{ik} - \hat{y}_{ik})^2$$
Considerations: different outputs often live on very different scales and have different noise levels, yet a single averaged MSE weights every squared error equally in raw units.
Consider predicting several targets at once, say one output in the thousands (a price, for example) alongside one between 0 and 1 (a probability or a normalized score).
With raw MSE, the large-scale output dominates the total loss; the small-scale output contributes almost nothing to the gradient and is effectively ignored.
Solution 1: Output Normalization
Normalize targets to similar scales before training: $$y'_k = \frac{y_k - \mu_k}{\sigma_k}$$
This is almost always done in practice.
Solution 2: Per-Output Weights
$$\mathcal{L} = \sum_{k=1}^{K} w_k \cdot \frac{1}{N}\sum_{i=1}^{N}(y_{ik} - \hat{y}_{ik})^2$$
Weights $w_k$ can be set manually according to task importance, inversely to each target's variance (equivalent to normalization), or learned from data via uncertainty weighting, as in the code below.
```python
import numpy as np

def multi_output_mse(y_true, y_pred):
    """Standard MSE for multi-output (averages over all elements)."""
    return np.mean((y_true - y_pred) ** 2)

def weighted_multi_output_mse(y_true, y_pred, weights):
    """
    Weighted MSE for multi-output regression.

    Args:
        y_true: (N, K) array of targets
        y_pred: (N, K) array of predictions
        weights: (K,) array of per-output weights
    """
    per_output_mse = np.mean((y_true - y_pred) ** 2, axis=0)  # (K,)
    return np.sum(weights * per_output_mse)

def normalized_mse(y_true, y_pred, means, stds):
    """
    MSE computed on normalized targets.

    Normalizes both targets and predictions before computing MSE.
    This is equivalent to per-output weighting with w_k = 1/std_k^2.
    """
    y_true_norm = (y_true - means) / stds
    y_pred_norm = (y_pred - means) / stds
    return np.mean((y_true_norm - y_pred_norm) ** 2)

def uncertainty_weighted_loss(y_true, y_pred, log_variances):
    """
    Learned uncertainty weighting (Kendall et al., 2018).

    Network predicts both mean and log(variance) for each output.
    Loss automatically balances outputs based on learned uncertainty.

    L = sum_k [ 0.5 * exp(-s_k) * (y_k - y_pred_k)^2 + 0.5 * s_k ]
    where s_k = log(sigma_k^2) is learned.
    """
    mse_per_output = np.mean((y_true - y_pred) ** 2, axis=0)  # (K,)
    # exp(-s) * MSE + s: the first term is precision-weighted MSE;
    # the +s term prevents s → +∞ (i.e. sigma → ∞, which would zero out the first term)
    precision = np.exp(-log_variances)  # 1/sigma^2
    loss = 0.5 * np.sum(precision * mse_per_output + log_variances)
    return loss

# Demonstration
print("Multi-Output MSE with Different Scales")
print("=" * 60)

# Simulated multi-output data with different scales
np.random.seed(42)
N = 100

# Three outputs with very different scales
y_true = np.column_stack([
    np.random.randn(N) * 10 + 50,      # Output 1: mean=50, std=10
    np.random.randn(N) * 1000 + 5000,  # Output 2: mean=5000, std=1000
    np.random.randn(N) * 0.1 + 0.5,    # Output 3: mean=0.5, std=0.1
])

# Predictions with similar relative errors (~10% of std)
y_pred = y_true + np.column_stack([
    np.random.randn(N) * 1,     # ~10% error on output 1
    np.random.randn(N) * 100,   # ~10% error on output 2
    np.random.randn(N) * 0.01,  # ~10% error on output 3
])

# Raw MSE
raw_mse = multi_output_mse(y_true, y_pred)
per_output_mse = np.mean((y_true - y_pred) ** 2, axis=0)

print("Raw MSE:")
print(f"  Total: {raw_mse:.2f}")
print(f"  Per-output: {per_output_mse}")
print(f"  Output 2 contributes {per_output_mse[1]/np.sum(per_output_mse)*100:.1f}% of loss!")

# Normalized MSE
means = np.mean(y_true, axis=0)
stds = np.std(y_true, axis=0)
norm_mse = normalized_mse(y_true, y_pred, means, stds)

print(f"\nNormalized MSE: {norm_mse:.4f}")
print("  (Now all outputs contribute roughly equally)")
```

The uncertainty-weighted loss learns to balance multiple outputs automatically. The network predicts both the mean μ(x) and uncertainty σ(x) for each output. The loss becomes:
$$\mathcal{L} = \sum_{k}\left[\frac{(y_k - \mu_k)^2}{2\sigma_k^2} + \frac{1}{2}\log\sigma_k^2\right]$$
The first term weights MSE by precision (1/σ²). The second term prevents σ → ∞ (which would make the first term vanish). This elegant formulation emerges from Gaussian likelihood.
Where MSE shows up in deep learning:
1. Image Regression Tasks: per-pixel losses for depth estimation, super-resolution, and denoising.
2. Autoencoders and VAEs: the reconstruction term is typically pixel-wise MSE (the Gaussian likelihood case).
3. Reinforcement Learning: regression of value functions and Q-values toward bootstrapped targets (often with Huber loss for stability).
4. Regression Heads in Multi-Task Models: bounding-box or keypoint coordinates trained alongside classification heads.
For image generation, MSE has a critical limitation: it encourages blurriness.
Why? MSE penalizes deviations from the ground truth. For a pixel that should be bright (255), MSE is minimized by predicting exactly 255. But for a pixel that might be 0 or 255 (high uncertainty), MSE is minimized by predicting 127.5!
This averaging behavior results in blurry reconstructions that lack high-frequency details.
Solution: Perceptual losses compute MSE in feature space (e.g., VGG features) rather than pixel space: $$\mathcal{L}_{\text{perceptual}} = \|\phi(y) - \phi(\hat{y})\|_2^2$$
where $\phi$ is a pretrained network. This preserves perceptual quality while allowing pixel-level variation.
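A sketch of a perceptual loss in PyTorch, assuming a recent torchvision with pretrained VGG16 weights available; the truncation point (`layer_index=16`) and the random test images are illustrative choices, not a canonical formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(nn.Module):
    """MSE computed in VGG feature space rather than pixel space (illustrative)."""
    def __init__(self, layer_index=16):
        super().__init__()
        # Truncate VGG16 at an intermediate conv block; the cutoff is a design choice
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features
        self.phi = vgg[:layer_index].eval()
        for p in self.phi.parameters():
            p.requires_grad_(False)  # frozen, pretrained feature extractor

    def forward(self, y_pred, y_true):
        # Expects (N, 3, H, W) images, already normalized the way VGG expects
        return F.mse_loss(self.phi(y_pred), self.phi(y_true))

# Usage sketch
loss_fn = PerceptualLoss()
pred = torch.rand(2, 3, 224, 224)
target = torch.rand(2, 3, 224, 224)
print(loss_fn(pred, target).item())
```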
Every major framework provides optimized MSE:
```python
# PyTorch
import torch.nn as nn
loss_fn = nn.MSELoss(reduction='mean')  # or 'sum', 'none'
loss = loss_fn(predictions, targets)

# TensorFlow/Keras
import tensorflow as tf
loss = tf.keras.losses.MeanSquaredError()(targets, predictions)

# JAX
import jax.numpy as jnp
loss = jnp.mean((predictions - targets) ** 2)
```
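The robust alternatives from earlier on this page also ship as built-ins; a sketch assuming a reasonably recent PyTorch version (these losses accept the same `reduction` argument as `nn.MSELoss`):

```python
import torch
import torch.nn as nn

pred = torch.randn(8)
target = torch.randn(8)

mae = nn.L1Loss()(pred, target)                      # Mean Absolute Error
huber = nn.HuberLoss(delta=1.0)(pred, target)        # Huber loss
smooth_l1 = nn.SmoothL1Loss(beta=1.0)(pred, target)  # smooth L1 (detection convention)
print(mae.item(), huber.item(), smooth_l1.item())
```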
The reduction parameter controls averaging:
• `'mean'`: average over all elements (most common)
• `'sum'`: sum over all elements (for custom weighting)
• `'none'`: return per-element losses (for sample weighting)

We've explored Mean Squared Error from its statistical foundations to practical implementation. Here are the essential takeaways: MSE is maximum likelihood under Gaussian noise and drives the model toward the conditional mean; its gradient is linear in the error, which keeps optimization simple and stable; the same quadratic penalty makes it highly sensitive to outliers, where MAE, Huber, or quantile losses are better choices; and in multi-output settings, normalize targets or weight outputs so that no single scale dominates.
Looking Forward
MSE and cross-entropy cover regression and classification respectively, but many real-world problems call for specialized losses beyond these two.
You now understand MSE deeply—when to use it, why it works, and when to reach for alternatives. Combined with cross-entropy from the previous page, you have the foundational losses that power most neural network training.