In the world of regression modeling, we face a fundamental question: How do we measure how wrong our predictions are?
When your model predicts house prices, stock returns, or customer lifetime value, you need a single number that captures the quality of those predictions. Enter Mean Squared Error (MSE)—the most widely used, most theoretically grounded, and most computationally convenient metric for regression.
MSE isn't just popular by accident. It emerges naturally from probability theory, optimization algorithms, and geometric intuition. Understanding MSE deeply means understanding why most of machine learning works the way it does.
By the end of this page, you will understand:
- MSE's mathematical formulation and why squaring matters
- Its connection to maximum likelihood estimation
- The geometric interpretation in prediction space
- Its sensitivity to outliers and scale
- When MSE is the right choice—and when it isn't
- Practical considerations for computing and interpreting MSE
Mean Squared Error measures the average of the squared differences between predicted values and actual values. For a dataset with $n$ samples, where $y_i$ represents the true value and $\hat{y}_i$ represents the predicted value for sample $i$:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Let's dissect this formula component by component to build deep understanding.
RMSE is simply the square root of MSE: $\text{RMSE} = \sqrt{\text{MSE}}$. While MSE is in squared units (e.g., dollars²), RMSE returns to the original units (e.g., dollars), making it more interpretable. Both metrics rank models identically since the square root is monotonic.
```python
import numpy as np

def mean_squared_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Mean Squared Error from scratch.

    Parameters:
    -----------
    y_true : np.ndarray
        Array of true target values
    y_pred : np.ndarray
        Array of predicted values

    Returns:
    --------
    float
        The mean squared error
    """
    # Ensure arrays are numpy arrays
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Validate shapes match
    if y_true.shape != y_pred.shape:
        raise ValueError(f"Shape mismatch: {y_true.shape} vs {y_pred.shape}")

    # Calculate residuals
    residuals = y_true - y_pred

    # Square the residuals
    squared_residuals = residuals ** 2

    # Average over all samples
    mse = np.mean(squared_residuals)

    return mse

def root_mean_squared_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute RMSE - interpretable in original units."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Example: House price predictions (in $1000s)
y_true = np.array([250, 300, 350, 400, 450])
y_pred = np.array([260, 290, 340, 420, 440])

mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)

print(f"Residuals: {y_true - y_pred}")
print(f"Squared Residuals: {(y_true - y_pred) ** 2}")
print(f"MSE: {mse:.2f} ($1000)²")
print(f"RMSE: {rmse:.2f} $1000")
# Output:
# Residuals: [-10  10  10 -20  10]
# Squared Residuals: [100 100 100 400 100]
# MSE: 160.00 ($1000)²
# RMSE: 12.65 $1000
```

The choice to square errors isn't arbitrary—it has profound mathematical and practical justifications. Understanding these helps you know when MSE is appropriate and when alternatives are better.
Reason 1: Symmetry and Positivity
Raw residuals $(y_i - \hat{y}_i)$ can be positive or negative. Simply averaging them would allow positive and negative errors to cancel out—a model that overpredicts by 100 half the time and underpredicts by 100 half the time would have a "perfect" average error of zero.
Squaring forces all errors to be positive, ensuring they accumulate rather than cancel.
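A tiny numerical illustration of this cancellation (the values are arbitrary): every prediction is off by 100, yet the raw residuals average to zero while MSE does not.

```python
import numpy as np

# Two failure modes that cancel: half the predictions overshoot by 100,
# half undershoot by 100.
y_true = np.array([200.0, 200.0, 200.0, 200.0])
y_pred = np.array([300.0, 100.0, 300.0, 100.0])

residuals = y_true - y_pred
print("Mean residual:", residuals.mean())   # 0.0  -- errors cancel
print("MSE:", np.mean(residuals ** 2))      # 10000.0 -- errors accumulate
```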
Reason 2: Differentiability
Squaring produces a smooth, differentiable function everywhere. This is essential for gradient-based optimization. The derivative of $(y - \hat{y})^2$ with respect to $\hat{y}$ is simply $-2(y - \hat{y})$—clean and easy to compute.
Alternatives like absolute error $|y - \hat{y}|$ have a non-differentiable point at zero, complicating optimization.
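A small sketch of the two gradients near a zero residual (the residual values are illustrative only): the squared-error gradient shrinks smoothly to zero, while the absolute-error gradient jumps between $+1$ and $-1$.

```python
import numpy as np

# Gradients of each penalty with respect to the prediction y_hat,
# evaluated for residuals (y - y_hat) close to zero.
residuals = np.array([-0.5, -0.1, -0.001, 0.001, 0.1, 0.5])

grad_squared = -2 * residuals        # d/dy_hat (y - y_hat)^2 = -2(y - y_hat)
grad_absolute = -np.sign(residuals)  # d/dy_hat |y - y_hat| = -sign(y - y_hat), undefined at 0

print("residuals:    ", residuals)
print("squared grad: ", grad_squared)    # shrinks smoothly toward 0
print("absolute grad:", grad_absolute)   # jumps from +1 to -1 at zero
```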
Reason 3: Connection to Variance and Normal Distribution
MSE is intimately connected to variance. If we assume errors are normally distributed with mean zero:
$$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$$
Then minimizing MSE is equivalent to maximum likelihood estimation. The squared term in MSE corresponds to the exponent in the Gaussian probability density function:
$$p(y | \hat{y}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \hat{y})^2}{2\sigma^2}\right)$$
Maximizing this likelihood (or equivalently, minimizing negative log-likelihood) leads directly to minimizing MSE.
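To see this explicitly, write out the negative log-likelihood of $n$ independent observations under this Gaussian model:

$$-\sum_{i=1}^{n} \log p(y_i \mid \hat{y}_i) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

The first term does not depend on the predictions, and $\frac{1}{2\sigma^2}$ is a positive constant, so minimizing the negative log-likelihood over the predictions is exactly the same problem as minimizing $\sum_{i}(y_i - \hat{y}_i)^2$, and hence MSE.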
This is why linear regression learned by minimizing MSE is also called 'least squares'—and why it's the maximum likelihood estimator when errors are Gaussian. The theoretical elegance isn't coincidental; it reflects a deep connection between MSE and probabilistic modeling.
Reason 4: Quadratic Penalty for Large Errors
Squaring penalizes large errors disproportionately. An error of 10 contributes 100 to MSE; an error of 20 contributes 400—four times as much, not twice. This property is both a strength and a weakness:
This quadratic sensitivity makes MSE particularly appropriate when large errors are truly more costly than small errors—but problematic when outliers are present in the data.
MSE has a beautiful geometric interpretation that connects algebra to visual intuition.
Euclidean Distance in Prediction Space
Consider the full vector of predictions as a single point in $n$-dimensional space, where $n$ is the number of samples. The true values form one point $\mathbf{y} = (y_1, y_2, ..., y_n)$, and the predictions form another point $\mathbf{\hat{y}} = (\hat{y}_1, \hat{y}_2, ..., \hat{y}_n)$.
The Euclidean distance between these points is: $$d(\mathbf{y}, \mathbf{\hat{y}}) = \sqrt{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{n \cdot \text{MSE}}$$
Thus: $$\text{MSE} = \frac{d(\mathbf{y}, \mathbf{\hat{y}})^2}{n}$$
MSE is the squared Euclidean distance per sample—a measure of how far our prediction vector is from the truth vector in this high-dimensional space.
```python
import numpy as np
from numpy.linalg import norm

def geometric_mse_interpretation(y_true: np.ndarray, y_pred: np.ndarray):
    """
    Demonstrate the geometric interpretation of MSE
    as squared Euclidean distance per sample.
    """
    n = len(y_true)

    # Method 1: Standard MSE calculation
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)

    # Method 2: Via Euclidean distance
    euclidean_distance = norm(y_true - y_pred)
    mse_from_distance = (euclidean_distance ** 2) / n

    print(f"Number of samples: {n}")
    print(f"Residual vector: {residuals}")
    print(f"Euclidean distance: {euclidean_distance:.4f}")
    print(f"MSE (direct): {mse:.4f}")
    print(f"MSE (from distance²/n): {mse_from_distance:.4f}")
    print(f"RMSE ≈ Euclidean distance / √n: {euclidean_distance / np.sqrt(n):.4f}")

    return mse

# Example with 4 samples
y_true = np.array([100, 200, 150, 250])
y_pred = np.array([110, 190, 160, 240])

geometric_mse_interpretation(y_true, y_pred)
# Output:
# Number of samples: 4
# Residual vector: [-10  10 -10  10]
# Euclidean distance: 20.0000
# MSE (direct): 100.0000
# MSE (from distance²/n): 100.0000
# RMSE ≈ Euclidean distance / √n: 10.0000
```

The Projection Perspective
In linear regression, we can think of fitting a model as projecting the target vector $\mathbf{y}$ onto the subspace spanned by our feature matrix. The residual vector $\mathbf{y} - \mathbf{\hat{y}}$ is orthogonal to this subspace.
Minimizing MSE means finding the projection that minimizes the distance—the orthogonal projection. This is why the normal equations $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}$ express orthogonality: the residuals are perpendicular to all columns of $\mathbf{X}$.
The geometric view explains why linear regression has a unique solution (when X has full rank): there's exactly one orthogonal projection. It also explains regularization geometrically—L2 regularization constrains the solution to lie on a sphere, finding the closest point on the sphere to the unconstrained solution.
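As a quick numerical check of this orthogonality claim, here is a minimal sketch on synthetic data (the array shapes and coefficients are arbitrary assumptions):

```python
import numpy as np

# Verify the orthogonality condition X^T (y - X beta) = 0 for an OLS fit.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # intercept + 2 features
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=200)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes MSE
residuals = y - X @ beta

# Residuals are (numerically) perpendicular to every column of X
print(X.T @ residuals)  # all entries are ~0 up to floating-point error
```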
MSE possesses several important statistical properties that make it central to regression theory.
Bias-Variance Decomposition
For a fixed point $x$, we can decompose the expected MSE of a model into three components:
$$\mathbb{E}[(y - \hat{y})^2] = \text{Bias}[\hat{y}]^2 + \text{Var}[\hat{y}] + \sigma^2$$
Where the three components (squared bias, variance, and the irreducible noise $\sigma^2$) are defined in the table below. This decomposition is fundamental to understanding the bias-variance tradeoff in machine learning.
| Component | Formula | Interpretation | How to Reduce |
|---|---|---|---|
| Bias² | $(\mathbb{E}[\hat{y}] - y_{true})^2$ | Systematic error in predictions | Increase model complexity |
| Variance | $\mathbb{E}[(\hat{y} - \mathbb{E}[\hat{y}])^2]$ | Sensitivity to training data | Reduce complexity, add data, regularize |
| Irreducible Error | $\sigma^2$ | Inherent noise in problem | Cannot reduce—fundamental limit |
Unbiased Estimation of MSE
When estimating MSE on the training set, we're slightly biased optimistically because the model was fit to minimize error on this exact data. For an unbiased estimate, we need held-out data (test set) or cross-validation.
However, within the training set, we can compute an unbiased estimator of $\sigma^2$ (error variance) using the residual sum of squares:
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - p}$$
Where $p$ is the number of parameters. Dividing by $n - p$ instead of $n$ corrects for the degrees of freedom consumed by fitting the model.
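A small simulation sketch of this correction, with an assumed design of $n = 30$ samples and $p = 5$ parameters (values chosen only for illustration): dividing the residual sum of squares by $n$ systematically underestimates $\sigma^2$, while dividing by $n - p$ recovers it on average.

```python
import numpy as np

# Compare RSS/n and RSS/(n - p) as estimators of the true noise variance.
rng = np.random.default_rng(42)
n, p, sigma = 30, 5, 2.0
estimates_n, estimates_np = [], []

for _ in range(5000):
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])
    beta_true = rng.normal(size=p)
    y = X @ beta_true + rng.normal(scale=sigma, size=n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    estimates_n.append(rss / n)          # training MSE: biased low
    estimates_np.append(rss / (n - p))   # degrees-of-freedom corrected

print(f"True sigma^2:        {sigma**2:.3f}")                  # 4.000
print(f"Mean of RSS/n:       {np.mean(estimates_n):.3f}")      # ~ 4 * (n - p)/n ≈ 3.33
print(f"Mean of RSS/(n - p): {np.mean(estimates_np):.3f}")     # ~ 4.00
```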
Training MSE is always optimistically biased—it will be lower than the true expected error. This is why we always evaluate on held-out data. A model with near-zero training MSE and high test MSE is overfitting; it has memorized training data rather than learned generalizable patterns.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def bias_variance_simulation(n_simulations=100, n_samples=50):
    """
    Simulate bias-variance decomposition for polynomial regression.
    Shows how MSE decomposes into bias² + variance + noise.
    """
    np.random.seed(42)

    # True function: f(x) = sin(x)
    true_function = lambda x: np.sin(x)
    noise_std = 0.3  # Irreducible error

    # Test point where we'll measure bias and variance
    x_test = np.array([[1.5]])
    y_true = true_function(x_test[0, 0])

    # Store predictions across simulations for different model complexities
    results = {}

    for degree in [1, 3, 10]:  # Underfitting, appropriate, overfitting
        predictions = []

        for _ in range(n_simulations):
            # Generate fresh training data (different noise each time)
            X_train = np.random.uniform(0, 2 * np.pi, n_samples).reshape(-1, 1)
            y_train = true_function(X_train.ravel()) + np.random.normal(0, noise_std, n_samples)

            # Fit polynomial model
            model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
            model.fit(X_train, y_train)

            # Predict at test point
            predictions.append(model.predict(x_test)[0])

        predictions = np.array(predictions)

        # Compute bias and variance
        expected_prediction = np.mean(predictions)
        bias = expected_prediction - y_true
        variance = np.var(predictions)

        # Expected MSE = bias² + variance + irreducible_error²
        expected_mse = bias**2 + variance + noise_std**2

        results[degree] = {
            'bias_squared': bias**2,
            'variance': variance,
            'irreducible': noise_std**2,
            'expected_mse': expected_mse
        }

        print(f"\nDegree {degree} polynomial:")
        print(f"  Bias²:        {bias**2:.4f}")
        print(f"  Variance:     {variance:.4f}")
        print(f"  Irreducible:  {noise_std**2:.4f}")
        print(f"  Expected MSE: {expected_mse:.4f}")

    return results

# Run simulation
results = bias_variance_simulation()
# Output shows:
# - Degree 1 (linear): High bias, low variance
# - Degree 3 (good fit): Low bias, moderate variance
# - Degree 10 (overfit): Low bias, high variance
```

MSE's quadratic penalty is a double-edged sword. While it penalizes large errors heavily, this same property makes MSE highly sensitive to outliers.
The Outlier Problem
Consider five predictions with errors of 2, 2, 2, 2, and one outlier error of 20.
Without the outlier, the MSE of the first four errors is 4. Including it, MSE jumps to $(4 \cdot 2^2 + 20^2)/5 = 83.2$—an increase of nearly 2000%. The outlier's squared error (400) dominates the entire sum.
```python
import numpy as np

def demonstrate_outlier_sensitivity():
    """
    Show how a single outlier can devastate MSE.
    """
    # Base data: mostly accurate predictions
    y_true_base = np.array([100, 105, 98, 102, 99])
    y_pred_base = np.array([101, 104, 99, 101, 100])  # Errors of ~1

    # MSE without outliers
    mse_clean = np.mean((y_true_base - y_pred_base) ** 2)
    print(f"Clean data - MSE: {mse_clean:.2f}")

    # Add progressively larger outliers
    outlier_magnitudes = [10, 50, 100, 500]

    for outlier in outlier_magnitudes:
        y_true = np.append(y_true_base, 100)
        y_pred = np.append(y_pred_base, 100 + outlier)  # Outlier error

        mse_with_outlier = np.mean((y_true - y_pred) ** 2)

        # What percentage of total squared error comes from the outlier?
        total_sq_error = np.sum((y_true - y_pred) ** 2)
        outlier_contribution = outlier ** 2
        pct_from_outlier = (outlier_contribution / total_sq_error) * 100

        print(f"\nWith outlier (error={outlier}):")
        print(f"  MSE: {mse_with_outlier:.2f}")
        print(f"  Increase from clean: {(mse_with_outlier / mse_clean - 1) * 100:.0f}%")
        print(f"  Outlier contribution: {pct_from_outlier:.1f}% of total squared error")

demonstrate_outlier_sensitivity()
# Output:
# Clean data - MSE: 1.00
#
# With outlier (error=10):
#   MSE: 17.50
#   Increase from clean: 1650%
#   Outlier contribution: 95.2% of total squared error
#
# With outlier (error=50):
#   MSE: 417.50
#   Increase from clean: 41650%
#   Outlier contribution: 99.8% of total squared error
# ...
```

Implications for Model Training
When you minimize MSE during training, the optimizer will bend the model to reduce large errors—even if those large errors come from outliers that don't represent the general pattern.
This behavior can skew the fit toward a few extreme points, degrading accuracy on the bulk of the data that follows the general pattern.
When Outlier Sensitivity is Good
In some domains, large errors are truly catastrophic and should be minimized at all costs; in these settings, the quadratic penalty mirrors the real cost structure.
When Outlier Sensitivity is Bad
In other domains, outliers reflect data quality issues rather than true signal; letting them dominate the loss distorts the fit on legitimate data.
When outliers are problematic, consider Mean Absolute Error (MAE), which penalizes errors linearly; Huber loss, which transitions from quadratic to linear beyond a threshold; quantile regression, whose pinball loss is also linear in the residual and far less sensitive to extreme values; or a trimmed MSE that excludes the largest errors from the calculation.
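As a rough illustration of how these losses react to a single extreme residual, here is a minimal sketch; the residual values and the Huber threshold `delta` are arbitrary assumptions:

```python
import numpy as np

def huber_loss(residuals: np.ndarray, delta: float = 1.0) -> float:
    """Mean Huber loss: quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(residuals)
    quadratic = 0.5 * abs_r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.mean(np.where(abs_r <= delta, quadratic, linear))

# Illustrative residuals: five small errors plus one large outlier
residuals = np.array([1.0, -1.0, 2.0, -2.0, 1.0, 50.0])

print(f"MSE:   {np.mean(residuals ** 2):.2f}")     # 418.50 -- dominated by the 50
print(f"MAE:   {np.mean(np.abs(residuals)):.2f}")  # 9.50   -- grows only linearly with it
print(f"Huber: {huber_loss(residuals, delta=5.0):.2f}")  # 40.50
```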
MSE is scale-dependent—its value depends on the units and magnitude of the target variable. This has important implications for interpretation and model comparison.
The Scale Problem
Consider predicting house prices. Measured in dollars, a $10,000 error contributes 100,000,000 to the squared error; measured in thousands of dollars, the identical error contributes only 100 (see the table below).
Same predictions, same quality—but MSE differs by a factor of 1,000,000.
| Measurement Unit | True Value | Predicted | Error | Squared Error |
|---|---|---|---|---|
| Dollars | 350,000 | 340,000 | 10,000 | 100,000,000 |
| Thousands ($K) | 350 | 340 | 10 | 100 |
| Millions ($M) | 0.35 | 0.34 | 0.01 | 0.0001 |
Implications
Cannot compare across targets: An MSE of 100 for temperature prediction (°C) says nothing about whether a model is better than one with MSE of 50 for wind speed prediction (m/s).
Multi-output regression problems: When predicting multiple targets simultaneously with different scales, raw MSE will be dominated by the highest-magnitude target.
Relative vs. absolute: An error of $100 is huge for a $500 item but negligible for a $500,000 house.
Solutions to Scale Dependence
```python
import numpy as np

def normalized_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """
    Compute scale-normalized versions of MSE/RMSE
    for meaningful cross-problem comparison.
    """
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)

    # Normalization approaches
    y_range = y_true.max() - y_true.min()
    y_mean = np.mean(y_true)
    y_std = np.std(y_true)

    # Coefficient of Variation of RMSE (CV-RMSE)
    cv_rmse = rmse / y_mean

    # Normalized RMSE by range
    nrmse_range = rmse / y_range

    # Normalized RMSE by std (how many stds is a typical error?)
    nrmse_std = rmse / y_std

    # R² (coefficient of determination)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_mean) ** 2)
    r_squared = 1 - (ss_res / ss_tot)

    print(f"Target statistics:")
    print(f"  Range: {y_true.min():.2f} to {y_true.max():.2f}")
    print(f"  Mean: {y_mean:.2f}, Std: {y_std:.2f}")
    print(f"\nRaw metrics:")
    print(f"  MSE:  {mse:.4f}")
    print(f"  RMSE: {rmse:.4f}")
    print(f"\nNormalized metrics:")
    print(f"  CV-RMSE (RMSE/mean): {cv_rmse:.2%}")
    print(f"  NRMSE (RMSE/range):  {nrmse_range:.2%}")
    print(f"  NRMSE (RMSE/std):    {nrmse_std:.4f}")
    print(f"  R²: {r_squared:.4f}")

# Example: Same prediction quality, different scales
y_true_dollars = np.array([350000, 420000, 280000, 550000, 390000])
y_pred_dollars = np.array([340000, 430000, 290000, 540000, 400000])

y_true_thousands = y_true_dollars / 1000
y_pred_thousands = y_pred_dollars / 1000

print("=== Dollars ===")
normalized_metrics(y_true_dollars, y_pred_dollars)
print("\n=== Thousands ===")
normalized_metrics(y_true_thousands, y_pred_thousands)
# Note: Normalized metrics are identical despite 1000x scale difference
```

MSE's mathematical properties make it especially convenient for optimization—explaining why it's the default loss function for most regression algorithms.
Closed-Form Solution for Linear Regression
For linear regression $\hat{y} = \mathbf{X}\boldsymbol{\beta}$, minimizing MSE yields the famous normal equations:
$$\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
No other common loss function produces such an elegant closed-form solution.
Gradient Descent Properties
The gradient of MSE with respect to predictions is:
$$\frac{\partial \text{MSE}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$$
This gradient has desirable properties: it is linear in the residual, so larger errors produce proportionally larger updates, and it shrinks smoothly toward zero as predictions approach the targets, which stabilizes convergence near the optimum.
```python
import numpy as np

def mse_gradient_derivation():
    """
    Demonstrate MSE gradient for gradient descent.
    For linear regression: y_hat = X @ beta
    """
    # Sample data
    np.random.seed(42)
    n_samples, n_features = 100, 3
    X = np.random.randn(n_samples, n_features)
    X = np.hstack([np.ones((n_samples, 1)), X])  # Add bias term
    true_beta = np.array([5.0, 2.0, -1.0, 0.5])
    y = X @ true_beta + np.random.randn(n_samples) * 0.5

    def mse_loss(beta):
        """Compute MSE loss."""
        y_pred = X @ beta
        return np.mean((y - y_pred) ** 2)

    def mse_gradient(beta):
        """
        Compute gradient of MSE w.r.t. beta.

        MSE = (1/n) * sum((y - X @ beta)^2)
        dMSE/dbeta = (2/n) * X.T @ (X @ beta - y)
                   = -(2/n) * X.T @ (y - y_pred)
        """
        y_pred = X @ beta
        residuals = y - y_pred
        gradient = -(2 / n_samples) * X.T @ residuals
        return gradient

    # Gradient descent
    beta = np.zeros(4)  # Initialize at zero
    learning_rate = 0.1
    history = []

    for iteration in range(100):
        loss = mse_loss(beta)
        history.append(loss)

        grad = mse_gradient(beta)
        beta = beta - learning_rate * grad

        if iteration < 5 or iteration % 20 == 0:
            print(f"Iter {iteration:3d}: MSE = {loss:.4f}, |grad| = {np.linalg.norm(grad):.4f}")

    print(f"\nFinal beta:  {beta}")
    print(f"True beta:   {true_beta}")

    # Compare with closed-form solution
    beta_closed_form = np.linalg.inv(X.T @ X) @ X.T @ y
    print(f"Closed-form: {beta_closed_form}")

mse_gradient_derivation()
```

Convexity
MSE is a convex function of the predictions (and for linear models, convex in the parameters). This means every local minimum is the global minimum, so gradient-based optimization cannot get trapped in a suboptimal basin.
For non-linear models like neural networks, the model's predictions are non-linear functions of parameters, so the overall loss landscape isn't convex. However, MSE still provides clean gradients that flow well through backpropagation.
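A quick way to see the linear-model case numerically: the Hessian of MSE with respect to the parameters is $\frac{2}{n}\mathbf{X}^T\mathbf{X}$, which has no negative eigenvalues. A minimal check on random data (shapes chosen arbitrarily):

```python
import numpy as np

# For a linear model, MSE(beta) = (1/n) * ||y - X beta||^2 has Hessian (2/n) X^T X,
# which is positive semi-definite -- hence the loss is convex in beta.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))

hessian = (2 / X.shape[0]) * X.T @ X
eigenvalues = np.linalg.eigvalsh(hessian)

print("Hessian eigenvalues:", np.round(eigenvalues, 4))
print("All non-negative?  ", bool(np.all(eigenvalues >= -1e-12)))
```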
Beyond mathematical elegance, MSE is computationally cheap. It requires only n subtractions, n squarings, and n additions—all vectorizable operations. For large-scale ML where loss is computed millions of times, this efficiency matters. Alternatives like quantile losses or custom objectives may require sorting or other expensive operations.
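A rough timing sketch of that claim, comparing the vectorized computation to an explicit Python loop (absolute numbers depend on hardware; only the relative gap matters):

```python
import numpy as np
import timeit

y_true = np.random.rand(100_000)
y_pred = np.random.rand(100_000)

def mse_vectorized():
    # Single pass of vectorized subtract, square, and mean
    return np.mean((y_true - y_pred) ** 2)

def mse_loop():
    # Same arithmetic, but element by element in Python
    total = 0.0
    for t, p in zip(y_true, y_pred):
        total += (t - p) ** 2
    return total / len(y_true)

print("vectorized :", timeit.timeit(mse_vectorized, number=10))
print("python loop:", timeit.timeit(mse_loop, number=10))
```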
When using MSE in practice, several considerations can help you extract maximum value and avoid common pitfalls.
Interpretation Guidelines
MSE values have no universal 'good' or 'bad' threshold—interpretation is always context-dependent:
| Question to Ask | Why It Matters | Action |
|---|---|---|
| What's the baseline MSE? | Raw MSE is meaningless without reference | Always compute MSE for naive baseline (predict mean) |
| What do the units mean? | MSE is in squared units | Use RMSE for interpretability in original units |
| Are there outliers? | Outliers inflate MSE disproportionately | Check median error, examine residual distribution |
| Is scale comparable? | Cannot compare MSE across different targets | Use R² or normalized RMSE for cross-target comparison |
| Is MSE the right objective? | Not all problems care about squared error | Consider if large errors are truly more costly |
```python
import numpy as np
from scipy import stats

def comprehensive_mse_analysis(y_true: np.ndarray, y_pred: np.ndarray, name: str = "Model"):
    """
    Complete MSE analysis with context and diagnostics.
    """
    residuals = y_true - y_pred

    # Basic metrics
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(residuals))

    # Baseline comparison (predict mean)
    baseline_pred = np.full_like(y_true, np.mean(y_true))
    baseline_mse = np.mean((y_true - baseline_pred) ** 2)

    # R² (compares to baseline)
    r_squared = 1 - (mse / baseline_mse)

    # Residual analysis
    median_residual = np.median(residuals)
    residual_skew = stats.skew(residuals)

    # Outlier detection (residuals > 2 std devs)
    residual_std = np.std(residuals)
    n_outliers = np.sum(np.abs(residuals) > 2 * residual_std)
    pct_outliers = n_outliers / len(residuals) * 100

    # Contribution analysis
    squared_residuals = residuals ** 2
    top_10pct_threshold = np.percentile(squared_residuals, 90)
    top_10pct_contribution = np.sum(squared_residuals[squared_residuals >= top_10pct_threshold])
    pct_from_top_10 = top_10pct_contribution / np.sum(squared_residuals) * 100

    print(f"=== {name} Analysis ===")
    print(f"\nPerformance Metrics:")
    print(f"  MSE:  {mse:.4f}")
    print(f"  RMSE: {rmse:.4f} (in original units)")
    print(f"  MAE:  {mae:.4f} (for comparison)")
    print(f"  R²:   {r_squared:.4f} ({r_squared*100:.1f}% variance explained)")
    print(f"\nBaseline Comparison:")
    print(f"  Baseline MSE (predict mean): {baseline_mse:.4f}")
    print(f"  Improvement: {(1 - mse/baseline_mse)*100:.1f}%")
    print(f"\nResidual Diagnostics:")
    print(f"  Mean residual:   {np.mean(residuals):.4f} (should be ~0)")
    print(f"  Median residual: {median_residual:.4f}")
    print(f"  Skewness:        {residual_skew:.4f} (0 = symmetric)")
    print(f"\nOutlier Analysis:")
    print(f"  Outliers (>2σ): {n_outliers} ({pct_outliers:.1f}%)")
    print(f"  Top 10% errors contribute {pct_from_top_10:.1f}% of total MSE")

    if pct_from_top_10 > 50:
        print("  ⚠️ MSE dominated by large errors—consider robust loss")

    return {
        'mse': mse,
        'rmse': rmse,
        'mae': mae,
        'r_squared': r_squared,
        'pct_outliers': pct_outliers,
        'pct_from_top': pct_from_top_10
    }

# Example usage
np.random.seed(42)
y_true = np.random.normal(100, 20, 1000)
y_pred = y_true + np.random.normal(0, 10, 1000)  # Good predictions
y_pred[0:10] += 100  # Add some outliers

comprehensive_mse_analysis(y_true, y_pred, "Regression Model")
```

We've taken a deep dive into Mean Squared Error—the foundational metric for regression evaluation: its formula and squared units, its equivalence to maximum likelihood under Gaussian noise, its geometric reading as squared Euclidean distance per sample, its sensitivity to outliers and scale, and its convenience for optimization.
What's Next
MSE's sensitivity to outliers and its quadratic penalty aren't always desirable. In the next page, we'll explore Mean Absolute Error (MAE)—a metric that penalizes errors linearly, providing robustness to outliers at the cost of some mathematical convenience.
You now have deep understanding of MSE—its mathematics, interpretation, strengths, and limitations. This foundation prepares you to evaluate regression models rigorously and to select appropriate metrics for different scenarios.