Every machine learning practitioner eventually confronts a puzzling paradox: why do more complex models sometimes perform worse than simpler ones? A neural network with millions of parameters can fit training data perfectly, yet fail catastrophically on new examples. A linear model with just a handful of coefficients might outperform it.
This observation lies at the heart of machine learning theory, and its resolution comes from one of the most profound results in statistical learning: the bias-variance decomposition. This mathematical framework reveals that prediction error isn't monolithic—it arises from distinct, often competing sources that must be carefully balanced.
Understanding this decomposition transforms how you approach model selection, hyperparameter tuning, and debugging. It provides the theoretical foundation for regularization, ensemble methods, and the entire field of model complexity control. Without it, machine learning practice remains empirical guesswork; with it, you gain principled tools for building models that generalize.
By the end of this page, you will be able to derive the bias-variance decomposition from first principles, understand the precise mathematical meaning of each term, and recognize how this framework explains fundamental phenomena in machine learning. You'll develop intuition for why there's an inherent tension between fitting the training data and generalizing to new data.
To derive the bias-variance decomposition rigorously, we must first establish the mathematical framework. We work in the regression setting, where the goal is to learn a function that predicts a continuous target value from input features.
The Data-Generating Process:
Assume data is generated according to the following model:
$$y = f(\mathbf{x}) + \varepsilon$$
where:

- $f$ is the unknown true function we want to learn,
- $\varepsilon$ is a random noise term with $\mathbb{E}[\varepsilon] = 0$ and $\text{Var}(\varepsilon) = \sigma^2$.
This formulation captures the fundamental assumption that data contains both a deterministic signal (the function $f$) and irreducible randomness (the noise $\varepsilon$). The noise might arise from measurement error, unobserved variables, or genuine stochasticity in the underlying process.
We assume ε is independent of x. This means the noise level doesn't depend on the input location—a property called homoscedasticity. While this assumption simplifies the analysis, the bias-variance decomposition can be extended to heteroscedastic settings where Var(ε|x) varies with x.
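To make the setup concrete, here is a minimal sketch of this data-generating process (the sine function, the noise level σ = 0.1, and names like `true_f` are illustrative assumptions that anticipate the worked example later on this page):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # The deterministic signal f(x); the sine curve is just an illustrative choice.
    return np.sin(2 * np.pi * x)

def sample_dataset(n, sigma=0.1):
    # Homoscedastic noise: eps is drawn independently of x, with mean 0 and variance sigma^2.
    x = rng.uniform(0, 1, n)
    eps = rng.normal(0, sigma, n)
    return x, true_f(x) + eps

x_train, y_train = sample_dataset(30)
```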
The Learning Process:
Given a training dataset $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ drawn i.i.d. from the joint distribution of $(\mathbf{x}, y)$, our learning algorithm produces a predictor:
$$\hat{f}_{\mathcal{D}}(\mathbf{x})$$
The subscript $\mathcal{D}$ emphasizes a crucial point: the learned function depends on the particular training data we observe. If we drew a different sample from the same distribution, we would get a different learned function.
This randomness in $\hat{f}_{\mathcal{D}}$ is the source of variance in our predictions. The algorithm itself is deterministic—given the same training data, it produces the same model. But since training data is random, the model inherits that randomness.
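A short sketch of this point, assuming the same toy setup as the example later on this page: training the identical, deterministic algorithm on two independently drawn datasets generally yields two different predictions at the same test point.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

def fit_on_fresh_sample(n=30, degree=3, sigma=0.1):
    # The algorithm is deterministic; only the training sample D is random.
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    return model.fit(x.reshape(-1, 1), y)

x0 = [[0.25]]
# Two independent draws of D give two different predictions at the same test point.
print(fit_on_fresh_sample().predict(x0)[0], fit_on_fresh_sample().predict(x0)[0])
```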
Our goal is to understand the expected prediction error at a fixed test point $\mathbf{x}_0$. We use the squared error loss:
$$\text{EPE}(\mathbf{x}_0) = \mathbb{E}_{\mathcal{D}, y_0}\left[(y_0 - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))^2\right]$$
This expectation is taken over:

- the random draw of the training set $\mathcal{D}$, which determines $\hat{f}_{\mathcal{D}}$,
- the noise $\varepsilon_0$ in the test target $y_0 = f(\mathbf{x}_0) + \varepsilon_0$.
Why focus on a fixed test point?
By analyzing error at a specific $\mathbf{x}_0$, we can understand how bias and variance vary across the input space. Total test error is then obtained by averaging over the distribution of test points:
$$\text{Total EPE} = \mathbb{E}_{\mathbf{x}_0}[\text{EPE}(\mathbf{x}_0)]$$
This point-wise analysis reveals that bias and variance can be different in different regions—a model might be biased in one part of the input space and high-variance in another.
We write $\mathbb{E}_{\mathcal{D}}[\cdot]$ to emphasize expectation over different training sets, and $\mathbb{E}_{y_0|\mathbf{x}_0}[\cdot]$ for expectation over the noise in the test point. The subscripts matter—confusion between these expectations is a common source of error in derivations.
The Key Insight: Average Over Training Sets
The bias-variance decomposition arises because we consider how the predictor $\hat{f}_{\mathcal{D}}$ behaves on average over all possible training sets. This might seem strange—in practice, we train on one specific dataset. But this averaging perspective reveals structure that's invisible when staring at a single model:
$$\bar{f}(\mathbf{x}_0) = \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(\mathbf{x}_0)]$$
This quantity $\bar{f}$ is the average prediction of our learning algorithm. It represents what the algorithm would predict if we could somehow average over infinitely many training sets drawn from the same distribution.
The deviation of any single $\hat{f}_{\mathcal{D}}$ from this average is what we call variance. The deviation of this average from the true function $f$ is what we call bias.
We now derive the bias-variance decomposition step by step. This derivation is fundamental—every machine learning practitioner should work through it at least once.
Start with the Expected Prediction Error:
$$\text{EPE}(\mathbf{x}_0) = \mathbb{E}_{\mathcal{D}, y_0}\left[(y_0 - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))^2\right]$$
Step 1: Separate the Noise
Since $y_0 = f(\mathbf{x}_0) + \varepsilon_0$, we can rewrite:
$$= \mathbb{E}_{\mathcal{D}, \varepsilon_0}\left[(f(\mathbf{x}_0) + \varepsilon_0 - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))^2\right]$$
Expanding the square:
$$= \mathbb{E}\left[(f(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))^2\right] + \mathbb{E}[\varepsilon_0^2] + 2\,\mathbb{E}\left[(f(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))\,\varepsilon_0\right]$$
The cross-term equals zero because ε₀ is independent of both f(x₀) and $\hat{f}_{\mathcal{D}}$ (which depends only on training data, not on the noise at the test point), and E[ε₀] = 0. This independence is crucial—without it, the decomposition becomes more complex.
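Written out, independence lets the cross-term factor into a product of expectations:

$$\mathbb{E}_{\mathcal{D}, \varepsilon_0}\left[(f(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))\,\varepsilon_0\right] = \mathbb{E}_{\mathcal{D}}\left[f(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0)\right] \cdot \mathbb{E}[\varepsilon_0] = 0$$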
Since $\mathbb{E}[\varepsilon_0^2] = \sigma^2$ and the cross-term vanishes:
$$\text{EPE}(\mathbf{x}_0) = \mathbb{E}_{\mathcal{D}}\left[(f(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))^2\right] + \sigma^2$$
The $\sigma^2$ term is the irreducible error—no learning algorithm can reduce it because it represents genuine randomness in the targets.
Step 2: Add and Subtract the Average Prediction
Now we decompose the first term. Let $\bar{f}(\mathbf{x}_0) = \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(\mathbf{x}_0)]$. We add and subtract this quantity:
$$\mathbb{E}_{\mathcal{D}}\left[(f(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))^2\right]$$ $$= \mathbb{E}_{\mathcal{D}}\left[(f(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0) + \bar{f}(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))^2\right]$$
Expanding: $$= (f(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2 + \mathbb{E}_{\mathcal{D}}\left[(\bar{f}(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0))^2\right]$$ $$+ 2(f(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))\,\mathbb{E}_{\mathcal{D}}\left[\bar{f}(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0)\right]$$
Step 3: The Cross-Term Vanishes Again
$$\mathbb{E}_{\mathcal{D}}\left[\bar{f}(\mathbf{x}_0) - \hat{f}_{\mathcal{D}}(\mathbf{x}_0)\right] = \bar{f}(\mathbf{x}_0) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(\mathbf{x}_0)] = \bar{f}(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0) = 0$$
This is by definition of $\bar{f}$—it's the expected value of $\hat{f}_{\mathcal{D}}$, so the expected deviation from it is zero.
Final Result:
$$\boxed{\text{EPE}(\mathbf{x}_0) = \underbrace{(f(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\left[(\hat{f}_{\mathcal{D}}(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}}$$
Expected Prediction Error = Bias² + Variance + Irreducible Error. This decomposition is exact—no approximations were made. It holds for any learning algorithm, any true function, and any input point.
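Because the result is exact, it can be checked numerically. The sketch below (assuming the sine target, degree-3 least-squares fit, and σ = 0.1 used in the worked example later on this page) estimates $\text{EPE}(\mathbf{x}_0)$ directly by simulation and compares it with Bias² + Variance + σ²; the two should agree up to Monte Carlo noise.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
sigma, n, degree, x0 = 0.1, 30, 3, 0.3

def f(x):
    return np.sin(2 * np.pi * x)

preds, sq_errors = [], []
for _ in range(2000):
    # A fresh training set D, and a fresh noisy test target y0 at the fixed point x0
    x_tr = rng.uniform(0, 1, n)
    y_tr = f(x_tr) + rng.normal(0, sigma, n)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr.reshape(-1, 1), y_tr)

    pred = model.predict([[x0]])[0]
    y0 = f(x0) + rng.normal(0, sigma)
    preds.append(pred)
    sq_errors.append((y0 - pred) ** 2)

preds = np.array(preds)
epe = np.mean(sq_errors)                 # direct Monte Carlo estimate of EPE(x0)
bias2 = (preds.mean() - f(x0)) ** 2      # (f(x0) - f_bar(x0))^2
variance = preds.var()                   # E_D[(f_hat(x0) - f_bar(x0))^2]
print(f"EPE = {epe:.4f}   vs   Bias² + Variance + σ² = {bias2 + variance + sigma**2:.4f}")
```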
Now that we've derived the decomposition, let's build deep intuition for each term. Understanding these components is essential for diagnosing and addressing model performance issues.
| Component | Mathematical Form | Reducible? | Reduced By |
|---|---|---|---|
| Bias² | $(f(\mathbf{x}) - \bar{f}(\mathbf{x}))^2$ | Yes | More flexible model class |
| Variance | $\mathbb{E}_{\mathcal{D}}[(\hat{f}_{\mathcal{D}} - \bar{f})^2]$ | Yes | More training data, regularization |
| Irreducible Error | $\sigma^2$ | No | Cannot be reduced by modeling |
A powerful way to visualize bias and variance is through the classic dart board analogy. Imagine throwing darts at a target, where:

- each dart throw is the prediction of a model trained on a different random training set $\mathcal{D}$,
- the bullseye is the true value $f(\mathbf{x}_0)$ at the test point.
Four Scenarios:
| Scenario | Bias | Variance | Result |
|---|---|---|---|
| 🎯 Clustered at Center | Low | Low | Ideal: Darts consistently land near bullseye. Accurate and stable predictions. |
| 📍 Clustered Off-Center | High | Low | Underfitting: Darts consistently miss in the same direction. Systematic error, but predictions are stable. |
| 💨 Scattered Around Center | Low | High | Overfitting (subtle): Darts average to bullseye, but individual throws vary wildly. Right on average, wrong each time. |
| 💥 Scattered Off-Center | High | High | Worst case: Missing badly and inconsistently. Model is both inflexible and unstable. |
The "Low Bias, High Variance" scenario is particularly tricky. If you only run your algorithm once, you might get a prediction far from the truth—even though the algorithm is unbiased! This is why variance matters—being right on average doesn't help when you only get one shot with your particular training set.
The Key Insight:
In practice, you train on one specific dataset and get one specific model. You don't get to average over multiple training sets. This means variance directly affects your single-run performance, not just some theoretical average.
A model with high variance might give excellent predictions sometimes and terrible predictions other times—and you have no way of knowing which case you're in without access to test data. This is why controlling variance through regularization, cross-validation, and ensemble methods is so crucial.
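As a small illustration of one of these variance-control tools, the sketch below compares a single fully grown regression tree with a bagged average of trees fit to bootstrap resamples. The learner, sample sizes, and the sine target are illustrative assumptions rather than anything prescribed by this page; the point is that averaging typically shrinks the spread of predictions at a test point.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_data(n=30, sigma=0.1):
    x = rng.uniform(0, 1, n)
    return x.reshape(-1, 1), np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)

def single_vs_bagged(x0=0.3, n_datasets=200, n_bags=25):
    singles, bagged = [], []
    for _ in range(n_datasets):
        X, y = sample_data()
        # One fully grown tree: very flexible, hence high variance.
        singles.append(DecisionTreeRegressor().fit(X, y).predict([[x0]])[0])
        # Bagging: average the same learner over bootstrap resamples of this one dataset.
        ens = []
        for _ in range(n_bags):
            idx = rng.integers(0, len(y), len(y))   # bootstrap resample of D
            ens.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict([[x0]])[0])
        bagged.append(np.mean(ens))
    return np.array(singles), np.array(bagged)

singles, bagged = single_vs_bagged()
print(f"Var(single tree) = {singles.var():.4f},  Var(bagged trees) = {bagged.var():.4f}")
```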
Connecting to the Math:
The decomposition EPE = Bias² + Variance simply says: the average squared error to the bullseye equals the squared distance to the cluster center plus the average spread within the cluster.
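In symbols, this is the identity $\mathbb{E}[(Z - c)^2] = (\mathbb{E}[Z] - c)^2 + \text{Var}(Z)$ applied to the dart positions. A short numeric check (the thrower's offset and spread are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
bullseye = 0.0
throws = rng.normal(loc=0.4, scale=0.8, size=100_000)   # a biased (0.4) and noisy (0.8) thrower

avg_sq_error = np.mean((throws - bullseye) ** 2)   # average squared miss
bias_sq = (throws.mean() - bullseye) ** 2          # squared distance from cluster centre to bullseye
variance = throws.var()                            # spread within the cluster
print(f"{avg_sq_error:.4f} ≈ {bias_sq + variance:.4f}")
```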
Let's make the bias-variance decomposition concrete with a detailed example.
Setup:
Suppose the true function is: $$f(x) = \sin(2\pi x)$$
We observe $n$ training points $(x_i, y_i)$ where $x_i$ are uniformly distributed on $[0, 1]$ and: $$y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, 0.1^2)$$
We fit polynomials of degree $d$: $$\hat{f}(x) = \sum_{k=0}^{d} \hat{w}_k x^k$$
where coefficients $\hat{w}_k$ are determined by least squares.
Case 1: Degree 1 (Linear)
A line cannot capture the sine wave's curvature, so it misses in the same places no matter which training sample it sees: high bias, but low variance, since a model with two parameters barely changes from one dataset to the next.
Case 2: Degree 3 (Cubic)
A cubic polynomial can approximate the single period reasonably well, leaving only modest bias while remaining fairly stable across training sets.
Case 3: Degree 15 (High-Degree)
A degree-15 polynomial has enormous flexibility: it can pass close to every training point, driving bias toward zero, but the fitted curve swings wildly from one training sample to the next, producing high variance.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression


def true_function(x):
    """The true underlying function."""
    return np.sin(2 * np.pi * x)


def simulate_bias_variance(n_samples=30, n_simulations=200, degrees=[1, 3, 15], sigma=0.1):
    """
    Simulate bias-variance decomposition for polynomial regression.

    For each polynomial degree:
    1. Generate many training datasets
    2. Fit a model to each
    3. Compute bias² and variance at test points
    """
    np.random.seed(42)
    x_test = np.linspace(0, 1, 100).reshape(-1, 1)
    f_true = true_function(x_test.ravel())

    results = {}

    for degree in degrees:
        predictions = []

        for _ in range(n_simulations):
            # Generate random training data
            x_train = np.random.uniform(0, 1, n_samples).reshape(-1, 1)
            noise = np.random.normal(0, sigma, n_samples)
            y_train = true_function(x_train.ravel()) + noise

            # Fit polynomial
            poly = PolynomialFeatures(degree)
            X_train_poly = poly.fit_transform(x_train)
            X_test_poly = poly.transform(x_test)

            model = LinearRegression()
            model.fit(X_train_poly, y_train)
            y_pred = model.predict(X_test_poly)
            predictions.append(y_pred)

        predictions = np.array(predictions)  # Shape: (n_simulations, n_test_points)

        # Compute bias and variance at each test point
        f_bar = predictions.mean(axis=0)  # Average prediction
        bias_squared = (f_bar - f_true) ** 2
        variance = predictions.var(axis=0)

        results[degree] = {
            'bias_squared': bias_squared,
            'variance': variance,
            'f_bar': f_bar,
            'avg_bias_squared': bias_squared.mean(),
            'avg_variance': variance.mean(),
        }

        print(f"Degree {degree:2d}: Avg Bias² = {bias_squared.mean():.4f}, "
              f"Avg Variance = {variance.mean():.4f}, "
              f"Total (excl. noise) = {(bias_squared + variance).mean():.4f}")

    return results, x_test, f_true


# Run simulation
results, x_test, f_true = simulate_bias_variance()
print(f"\nIrreducible error σ² = {0.1**2:.4f}")
```

Expected Output:
```
Degree  1: Avg Bias² = 0.1852, Avg Variance = 0.0037, Total (excl. noise) = 0.1889
Degree  3: Avg Bias² = 0.0089, Avg Variance = 0.0098, Total (excl. noise) = 0.0187
Degree 15: Avg Bias² = 0.0012, Avg Variance = 0.0584, Total (excl. noise) = 0.0596

Irreducible error σ² = 0.0100
```
Analysis:
The optimal degree minimizes bias² + variance. Here, degree 3 achieves the best tradeoff despite having non-zero bias.
With more training data, variance decreases (more data stabilizes the fit), allowing higher-degree polynomials to become optimal. The "best" model complexity depends on your sample size—a fact with profound implications for model selection.
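A quick way to see this effect (a sketch with illustrative sample sizes, fixing the flexible degree-15 model from the example above): the variance of the fit at a single test point shrinks steadily as $n$ grows.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

def variance_at_point(n_samples, degree=15, x0=0.3, sigma=0.1, n_sims=300):
    # Variance of the flexible degree-15 fit at x0, over many training sets of size n_samples.
    preds = []
    for _ in range(n_sims):
        x = rng.uniform(0, 1, n_samples)
        y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n_samples)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(x.reshape(-1, 1), y).predict([[x0]])[0])
    return np.var(preds)

for n in [30, 100, 300, 1000]:
    print(f"n = {n:4d}: variance of degree-15 fit at x0 = {variance_at_point(n):.4f}")
```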
The bias-variance decomposition provides the theoretical foundation for understanding two fundamental failure modes in machine learning:
Underfitting (High Bias):
A model underfits when it's too simple to capture the underlying pattern. Symptoms include:

- high error on the training set,
- test error that is also high and close to the training error,
- little improvement from collecting more training data.
In bias-variance terms: Bias² dominates. The model systematically misses the truth, and no amount of data will fix it because the hypothesis class simply doesn't contain good approximations to $f$.
Overfitting (High Variance):
A model overfits when it's too complex and fits noise in the training data. Symptoms include:

- very low (often near-zero) training error,
- much higher test error, with a large train-test gap,
- predictions that change noticeably when the model is retrained on a slightly different sample.
In bias-variance terms: Variance dominates. The model is so flexible that it memorizes training noise, causing predictions to fluctuate wildly with different training sets.
| Symptom | Diagnosis | Bias-Variance Perspective | Remedy |
|---|---|---|---|
| High train error, high test error, similar | Underfitting | High bias, low variance | More complex model, more features |
| Low train error, high test error, large gap | Overfitting | Low bias, high variance | Simpler model, regularization, more data |
| Low train error, low test error, small gap | Good fit | Balanced bias-variance | Monitor for deployment drift |
The gap between training and test error is a direct reflection of variance. A model with high variance fits training data well (including its noise) but generalizes poorly. This gap is why cross-validation is essential—it estimates test error without wasting a held-out set.
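A minimal sketch of this diagnostic, assuming the same toy data as the example above: compare training MSE with 5-fold cross-validated MSE as the polynomial degree grows, and watch the gap open up for the most flexible model.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 30)
X = x.reshape(-1, 1)

for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_mse = np.mean((model.fit(X, y).predict(X) - y) ** 2)
    # 5-fold CV estimates test error without sacrificing a permanent held-out set.
    cv_mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, CV MSE = {cv_mse:.4f}, gap = {cv_mse - train_mse:.4f}")
```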
Why Can't We Just Minimize Both?
Here's the fundamental tension: techniques that reduce bias tend to increase variance, and vice versa.

- Making the model class more flexible (higher polynomial degree, deeper trees, more parameters) lowers bias but raises variance.
- Constraining the model (regularization, smaller hypothesis classes, averaging) lowers variance but introduces bias.
This is the tradeoff—you cannot freely minimize both. The art of machine learning is finding the sweet spot where their sum is minimized.
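As one concrete dial, the sketch below uses ridge regression on degree-15 polynomial features (the penalty values α and the sine setup are illustrative assumptions): a tiny penalty leaves variance high, a heavy penalty inflates bias, and some intermediate value typically minimizes their sum.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 100).reshape(-1, 1)
f_true = np.sin(2 * np.pi * x_test.ravel())

for alpha in [1e-8, 1e-4, 1e-2, 1.0]:
    preds = []
    for _ in range(200):
        # Fresh training set, same flexible degree-15 model, varying only the penalty strength
        x = rng.uniform(0, 1, 30)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 30)
        model = make_pipeline(PolynomialFeatures(15), Ridge(alpha=alpha))
        preds.append(model.fit(x.reshape(-1, 1), y).predict(x_test))
    preds = np.array(preds)
    bias2 = ((preds.mean(axis=0) - f_true) ** 2).mean()
    var = preds.var(axis=0).mean()
    print(f"alpha = {alpha:g}: bias² = {bias2:.4f}, variance = {var:.4f}, sum = {bias2 + var:.4f}")
```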
The bias-variance decomposition is a powerful conceptual tool, but it has important limitations and subtleties that practitioners must understand.
Recent research has revealed 'double descent' phenomena where test error decreases, then increases, then decreases again as model complexity grows. This challenges the simple U-shaped bias-variance tradeoff curve in the over-parameterized regime. However, the core insight—that error decomposes into systematic and variable components—remains valid.
Classification Setting:
For classification with 0-1 loss, the decomposition is more complex. A common formulation decomposes expected error into:
$$\text{Error} = \text{Bias} + \text{Variance} + \text{Noise}$$
But unlike regression:

- the terms do not combine as a simple sum of squared quantities; how bias and variance interact depends on the definitions chosen,
- there is no single agreed-upon decomposition for 0-1 loss, and several competing formulations exist,
- variance can occasionally reduce error at points where the model is biased, because a random fluctuation can flip a systematically wrong prediction to the correct class.
Despite these complications, the core intuition transfers: simple models consistently make the same mistakes (high bias), while complex models are unpredictable (high variance).
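For concreteness, here is a hedged sketch of one such formulation (a "main prediction" style of definition; the toy problem, classifier, and noise rate are all illustrative assumptions): bias records whether the most common prediction across training sets is wrong at a point, and variance records how often an individual model deviates from that most common prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def sample_training_set(n=100, flip_prob=0.1):
    # 1-D toy problem: the Bayes-optimal label is 1 for x > 0.5; labels are flipped with prob. 0.1.
    x = rng.uniform(0, 1, n)
    clean = (x > 0.5).astype(int)
    noisy = np.where(rng.uniform(size=n) < flip_prob, 1 - clean, clean)
    return x.reshape(-1, 1), noisy

x0, bayes_label = [[0.55]], 1

preds = []
for _ in range(500):
    X, y = sample_training_set()
    preds.append(DecisionTreeClassifier().fit(X, y).predict(x0)[0])
preds = np.array(preds)

main_pred = np.bincount(preds).argmax()   # the most frequent ("main") prediction across training sets
bias = int(main_pred != bayes_label)      # 0/1: is the typical model systematically wrong at x0?
variance = np.mean(preds != main_pred)    # how often a single model deviates from the main prediction
print(f"main prediction = {main_pred}, bias = {bias}, variance = {variance:.3f}")
```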
We've established one of the most important theoretical results in machine learning. Let's consolidate the key insights:

- Under squared loss, expected prediction error decomposes exactly into Bias² + Variance + Irreducible Error.
- Bias measures the systematic gap between the average prediction $\bar{f}$ and the truth $f$; variance measures how much a single trained model fluctuates around that average; $\sigma^2$ is noise that no model can remove.
- Bias and variance are both reducible in principle, but techniques that shrink one tend to inflate the other, so the practical goal is to minimize their sum.
- The decomposition is point-wise: a model can be bias-dominated in one region of the input space and variance-dominated in another, and the optimal complexity depends on how much training data you have.
What's Next:
Now that we understand how error decomposes mathematically, the next page explores the sources of each error type in greater depth. We'll examine what aspects of model design, data, and the learning algorithm contribute to bias versus variance, building practical intuition for diagnostics.
You now understand the mathematical foundation of the bias-variance decomposition—one of the most important theoretical results in machine learning. This framework will guide all subsequent discussions of model complexity, regularization, and generalization.