When you estimate a parameter from data, your estimate will almost certainly differ from the true value. But why does it differ? Understanding the sources of estimation error is crucial for designing better estimators and making informed statistical decisions.
It turns out that estimation error has two fundamentally distinct sources: systematic error, captured by bias, and sample-to-sample fluctuation, captured by variance.
These concepts form the bedrock of statistical estimation theory and have profound implications for machine learning, from model selection to the famous bias-variance tradeoff in prediction.
By the end of this page, you will master formal definitions of bias and variance, understand the bias-variance decomposition of mean squared error, analyze bias and variance of common estimators (sample mean, MLE variance estimator), explore the bias-variance tradeoff and its implications, and develop intuition through the dartboard analogy and mathematical rigor.
Before defining bias and variance, we must understand a crucial fact: estimators are random variables.
The setup: we observe a dataset D = {X₁, ..., Xₙ} drawn from a distribution with unknown true parameter θ*, and we compute an estimate θ̂ = θ̂(D) from it.
Since D is random (different samples give different data), θ̂ is also random. If we could repeat the sampling process many times, we'd get a distribution of estimates — the sampling distribution of θ̂.
Key insight:
We don't observe the sampling distribution directly—we only have one dataset and one estimate. But the theoretical properties of this distribution (its mean, variance, shape) tell us about the quality of our estimation procedure.
The frequentist perspective on quality:
A good estimator should have a sampling distribution that is centered at (or near) the true parameter θ* and tightly concentrated around it.
Thought experiment:
Imagine 1000 statisticians, each given a different random sample of size n from the same population. Each computes θ̂ from their sample. Their 1000 answers will all differ, and the histogram of those estimates traces out the sampling distribution of θ̂.
The expectation E[θ̂] is over the sampling distribution—averaging over all possible datasets, not over parameters. This is the frequentist perspective. In Bayesian terms, we'd instead look at the posterior distribution, which quantifies uncertainty about θ given a single fixed dataset.
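The thought experiment is easy to simulate. The sketch below is a minimal illustration (the Normal population, n = 25, and 1000 replications are arbitrary choices), using the sample mean as the estimator; each replication plays the role of one statistician.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
true_mu, sigma, n = 5.0, 2.0, 25      # illustrative population and sample size
n_statisticians = 1000

# Each "statistician" draws an independent sample and reports the sample mean
estimates = np.array([rng.normal(true_mu, sigma, n).mean()
                      for _ in range(n_statisticians)])

plt.hist(estimates, bins=40, density=True, alpha=0.7, edgecolor='k')
plt.axvline(true_mu, color='red', linestyle='--', label='true μ')
plt.axvline(estimates.mean(), color='green', label='mean of the 1000 estimates')
plt.xlabel('estimate')
plt.ylabel('density')
plt.title('Approximate sampling distribution of the sample mean')
plt.legend()
plt.show()
```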
Definition:
The bias of an estimator θ̂ is the difference between its expected value and the true parameter:
$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta^*$$
Interpretation: a positive bias means the estimator systematically overshoots θ*, a negative bias means it systematically undershoots, and zero bias means it is right on average.
An unbiased estimator satisfies E[θ̂] = θ* for all possible values of θ*.
Example 1: Sample Mean is Unbiased
Let X₁, ..., Xₙ be i.i.d. with E[Xᵢ] = μ. The sample mean is:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$$
Its expected value:
$$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n} \cdot n\mu = \mu$$
Since E[X̄] = μ, the sample mean is an unbiased estimator of the population mean.
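The derivation uses only linearity of expectation, so it holds for any distribution with a finite mean, not just the Normal. A quick empirical sanity check with a skewed Exponential population (the mean, sample size, and replication count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, n_reps = 2.0, 30, 100_000      # Exponential(scale=2) has mean 2

# One sample mean per replication, then average over replications
xbar = rng.exponential(scale=mu, size=(n_reps, n)).mean(axis=1)
print(f"Average of {n_reps} sample means: {xbar.mean():.4f} (true mean: {mu})")
# The average lands essentially on 2.0: no systematic error, i.e. no bias
```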
Example 2: MLE Variance Estimator is Biased
For data from N(μ, σ²), the MLE for variance is:
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$$
Its expected value:
$$E[\hat{\sigma}^2_{MLE}] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$$
$$\text{Bias}(\hat{\sigma}^2_{MLE}) = E[\hat{\sigma}^2_{MLE}] - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n} < 0$$
The MLE systematically underestimates the true variance! This motivates Bessel's correction: using n-1 in the denominator.
```python
import numpy as np
import matplotlib.pyplot as plt

def demonstrate_mle_variance_bias():
    """
    Empirically demonstrate that MLE variance estimator is biased.
    """
    np.random.seed(42)
    true_mu = 0
    true_sigma2 = 4  # True variance
    sample_sizes = [5, 10, 20, 50, 100, 500]
    n_simulations = 10000

    results = []
    for n in sample_sizes:
        mle_estimates = []
        unbiased_estimates = []
        for _ in range(n_simulations):
            # Generate sample
            data = np.random.normal(true_mu, np.sqrt(true_sigma2), size=n)
            # MLE variance (divide by n)
            mle_var = np.mean((data - np.mean(data))**2)
            # Unbiased variance (divide by n-1)
            unbiased_var = np.var(data, ddof=1)
            mle_estimates.append(mle_var)
            unbiased_estimates.append(unbiased_var)

        results.append({
            'n': n,
            'mle_mean': np.mean(mle_estimates),
            'unbiased_mean': np.mean(unbiased_estimates),
            'expected_mle': (n-1)/n * true_sigma2,
            'mle_bias': np.mean(mle_estimates) - true_sigma2,
            'unbiased_bias': np.mean(unbiased_estimates) - true_sigma2
        })

    # Print results
    print("Bias of Variance Estimators")
    print("=" * 70)
    print(f"True variance: σ² = {true_sigma2}")
    print(f"Number of simulations: {n_simulations}")
    print()
    print(f"{'n':>6} | {'E[MLE]':>10} | {'Theory':>10} | {'E[Unbiased]':>12} | {'MLE Bias':>10}")
    print("-" * 70)
    for r in results:
        print(f"{r['n']:>6} | {r['mle_mean']:>10.4f} | {r['expected_mle']:>10.4f} | "
              f"{r['unbiased_mean']:>12.4f} | {r['mle_bias']:>10.4f}")

    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: Expected values vs sample size
    ax1 = axes[0]
    ns = [r['n'] for r in results]
    ax1.plot(ns, [r['mle_mean'] for r in results], 'bo-', label='MLE (biased)', markersize=8)
    ax1.plot(ns, [r['unbiased_mean'] for r in results], 'gs-',
             label='Sample variance (unbiased)', markersize=8)
    ax1.axhline(y=true_sigma2, color='red', linestyle='--', label=f'True σ² = {true_sigma2}')
    ax1.set_xlabel('Sample Size n')
    ax1.set_ylabel('Expected Value of Estimator')
    ax1.set_title('The MLE Variance Estimator is Biased')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_xscale('log')

    # Right: Bias magnitude
    ax2 = axes[1]
    theoretical_bias = [(n-1)/n * true_sigma2 - true_sigma2 for n in ns]
    empirical_bias = [r['mle_bias'] for r in results]
    ax2.plot(ns, theoretical_bias, 'r-', lw=2, label='Theoretical: -σ²/n')
    ax2.plot(ns, empirical_bias, 'bo', markersize=8, label='Empirical MLE bias')
    ax2.plot(ns, [r['unbiased_bias'] for r in results], 'gs', markersize=8,
             label='Unbiased estimator bias ≈ 0')
    ax2.axhline(y=0, color='gray', linestyle=':')
    ax2.set_xlabel('Sample Size n')
    ax2.set_ylabel('Bias')
    ax2.set_title('Bias Decreases with Sample Size')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_xscale('log')

    plt.tight_layout()
    plt.show()

    return results

results = demonstrate_mle_variance_bias()
```

Definition:
The variance of an estimator θ̂ measures its spread around its own mean:
$$\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2] = E[\hat{\theta}^2] - (E[\hat{\theta}])^2$$
Interpretation: a high-variance estimator would give very different answers on different datasets of the same size, while a low-variance estimator gives stable answers from sample to sample.
Standard Error:
The standard error is the square root of variance:
$$\text{SE}(\hat{\theta}) = \sqrt{\text{Var}(\hat{\theta})}$$
It has the same units as θ, which makes it easier to interpret than the variance.
Example: Variance of the Sample Mean
For X₁, ..., Xₙ i.i.d. with Var(Xᵢ) = σ²:
$$\text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right)$$
$$= \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}$$
Key insight: Variance decreases with sample size!
The standard error scales as 1/√n:
$$\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$
To halve the standard error, you need 4 times as much data!
Standard error decreases as 1/√n, not 1/n. This has profound implications: going from n=100 to n=10,000 (100× more data) only reduces standard error by a factor of 10 (√100 = 10). Diminishing returns set in quickly. This is why sample size calculations are so important in experimental design.
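A short simulation makes the 1/√n scaling concrete; the standard Normal population and the sample sizes below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_reps = 1.0, 2000

for n in [100, 400, 1600, 10_000]:
    # Empirical spread of the sample mean across many replications
    xbar = rng.normal(0.0, sigma, size=(n_reps, n)).mean(axis=1)
    print(f"n={n:>6}  empirical SE={xbar.std():.4f}  theory σ/√n={sigma/np.sqrt(n):.4f}")
# Quadrupling n halves the SE; 100x more data shrinks it only by a factor of 10
```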
The Mean Squared Error (MSE) combines bias and variance into a single measure of estimator quality:
$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta^*)^2]$$
This measures the expected squared distance from the estimate to the true parameter.
The Bias-Variance Decomposition:
A beautiful mathematical fact is that MSE decomposes perfectly:
$$\boxed{\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})}$$
MSE = Bias² + Variance
Proof of the decomposition:
Let μ = E[θ̂] for brevity. Then:
$$\text{MSE} = E[(\hat{\theta} - \theta^*)^2]$$
Add and subtract μ:
$$= E[(\hat{\theta} - \mu + \mu - \theta^*)^2]$$
$$= E[(\hat{\theta} - \mu)^2] + 2E[(\hat{\theta} - \mu)(\mu - \theta^*)] + (\mu - \theta^*)^2$$
The middle term vanishes because μ − θ* is a constant and E[θ̂ − μ] = 0:
$$= E[(\hat{\theta} - \mu)^2] + (\mu - \theta^*)^2$$
$$= \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2 \quad \square$$
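Here is a numerical spot check of the decomposition, reusing the biased MLE variance estimator from above (the true variance, sample size, and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n, n_reps = 4.0, 10, 200_000

data = rng.normal(0.0, np.sqrt(true_var), size=(n_reps, n))
sigma2_mle = ((data - data.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

bias = sigma2_mle.mean() - true_var                      # ≈ -σ²/n = -0.4
variance = sigma2_mle.var()
mse = ((sigma2_mle - true_var) ** 2).mean()
print(f"Bias² + Variance = {bias**2 + variance:.4f}")
print(f"MSE              = {mse:.4f}")                   # the two numbers match
```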
Imagine throwing darts at a target (the true parameter θ*). High bias = consistently missing the center in one direction. High variance = darts scattered widely. MSE measures the average squared distance from bullseye. A slightly biased but precise thrower might outperform an unbiased but erratic one!
The dartboard analogy is so useful that it deserves a detailed exploration. Four scenarios illustrate all combinations of high/low bias and variance:
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_dartboard_analogy():
    """
    Visualize bias and variance using the dartboard analogy.
    """
    np.random.seed(42)
    n_darts = 50

    fig, axes = plt.subplots(2, 2, figsize=(12, 12))

    scenarios = [
        ('Low Bias, Low Variance\n(Ideal)', 0, 0.1),
        ('Low Bias, High Variance', 0, 0.8),
        ('High Bias, Low Variance', 0.7, 0.1),
        ('High Bias, High Variance', 0.7, 0.6),
    ]

    for ax, (title, bias, std) in zip(axes.flatten(), scenarios):
        # Generate dart positions
        x = np.random.normal(bias, std, n_darts)
        y = np.random.normal(bias, std, n_darts)

        # Draw target circles
        for r in [0.25, 0.5, 0.75, 1.0, 1.25]:
            circle = plt.Circle((0, 0), r, fill=False, color='gray', alpha=0.5)
            ax.add_patch(circle)

        # Draw bullseye
        bullseye = plt.Circle((0, 0), 0.1, fill=True, color='red', alpha=0.5)
        ax.add_patch(bullseye)

        # Plot darts
        ax.scatter(x, y, c='blue', alpha=0.6, s=50, edgecolors='darkblue')

        # Mark center of mass (mean)
        mean_x, mean_y = np.mean(x), np.mean(y)
        ax.scatter([mean_x], [mean_y], c='green', s=200, marker='X',
                   edgecolors='darkgreen', linewidths=2, label='Mean', zorder=5)

        # Calculate actual bias and variance
        actual_bias = np.sqrt(mean_x**2 + mean_y**2)
        actual_var = np.var(x) + np.var(y)
        mse = np.mean(x**2 + y**2)

        ax.set_xlim(-1.5, 1.5)
        ax.set_ylim(-1.5, 1.5)
        ax.set_aspect('equal')
        ax.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
        ax.axvline(x=0, color='gray', linestyle='-', alpha=0.3)
        ax.set_title(f'{title}\nBias²≈{actual_bias**2:.2f}, Var≈{actual_var:.2f}, MSE≈{mse:.2f}',
                     fontsize=11)
        ax.legend(loc='upper right')

    plt.suptitle('The Dartboard Analogy: Bias vs Variance', fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()

def plot_bias_variance_tradeoff():
    """
    Visualize the bias-variance tradeoff.
    """
    # Simulated example: varying regularization strength
    regularization = np.linspace(0, 3, 100)

    # Bias increases with regularization (we shrink toward prior/zero)
    bias_squared = 0.1 * regularization**2

    # Variance decreases with regularization (estimates become more stable)
    variance = 2 * np.exp(-regularization)

    # MSE = Bias² + Variance
    mse = bias_squared + variance

    # Find optimal regularization
    optimal_idx = np.argmin(mse)
    optimal_reg = regularization[optimal_idx]

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(regularization, bias_squared, 'r-', lw=2, label='Bias²')
    ax.plot(regularization, variance, 'b-', lw=2, label='Variance')
    ax.plot(regularization, mse, 'g-', lw=3, label='MSE = Bias² + Variance')
    ax.axvline(x=optimal_reg, color='gray', linestyle='--',
               label=f'Optimal λ ≈ {optimal_reg:.2f}')
    ax.scatter([optimal_reg], [mse[optimal_idx]], color='green', s=150,
               zorder=5, edgecolors='darkgreen', linewidths=2)

    ax.set_xlabel('Regularization Strength (λ)', fontsize=12)
    ax.set_ylabel('Error', fontsize=12)
    ax.set_title('The Bias-Variance Tradeoff', fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    ax.set_xlim([0, 3])
    ax.set_ylim([0, 2.5])

    # Add annotations
    ax.annotate('Underfitting\n(High Bias)', xy=(2.5, 0.7), fontsize=10, ha='center')
    ax.annotate('Overfitting\n(High Variance)', xy=(0.3, 1.8), fontsize=10, ha='center')
    ax.annotate('Sweet\nSpot', xy=(optimal_reg + 0.3, mse[optimal_idx] + 0.15),
                fontsize=10, ha='left')

    plt.tight_layout()
    plt.show()

# Run visualizations
plot_dartboard_analogy()
plot_bias_variance_tradeoff()
```

A common misconception is that unbiased estimators are always preferable. The MSE decomposition reveals why this is wrong.
The James-Stein Paradox:
In the 1950s and early 1960s, Charles Stein (later joined by Willard James) proved a shocking result: when estimating the mean of a multivariate Gaussian in d ≥ 3 dimensions, the sample mean (unbiased!) is inadmissible, meaning there exist estimators with strictly lower MSE for every value of the true mean.
The James-Stein estimator shrinks the sample mean toward zero:
$$\hat{\theta}_{JS} = \left(1 - \frac{(d-2)\sigma^2}{||\bar{X}||^2}\right)\bar{X}$$
This biased estimator has lower MSE than the unbiased sample mean for all true θ when d ≥ 3!
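A minimal sketch of the phenomenon, assuming a single observation per trial with known σ² = 1 (so X̄ is just X); the dimension d = 10 and the true mean vector are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_trials = 10, 50_000
theta = np.full(d, 0.5)                    # arbitrary true mean vector

X = rng.normal(theta, 1.0, size=(n_trials, d))             # unbiased estimate: X itself
shrink = 1.0 - (d - 2) / np.sum(X**2, axis=1, keepdims=True)
X_js = shrink * X                                           # James-Stein shrinkage toward 0

mse_unbiased = np.mean(np.sum((X - theta) ** 2, axis=1))    # ≈ d = 10
mse_js = np.mean(np.sum((X_js - theta) ** 2, axis=1))       # noticeably smaller
print(f"Total MSE  unbiased: {mse_unbiased:.2f}   James-Stein: {mse_js:.2f}")
```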
When biased estimators win: in small samples or high-dimensional problems, where accepting a modest bias buys a large reduction in variance; ridge regression and other regularized estimators are classic examples.
When unbiased estimators are preferred: in large samples, where their variance shrinks on its own, and in settings where systematic error is unacceptable, for example when many estimates will be combined and a shared bias would not average out.
| Criterion | Unbiased Estimator | Biased Estimator (e.g., Ridge) |
|---|---|---|
| Expected value | E[θ̂] = θ* | E[θ̂] ≠ θ* |
| Variance | Can be high | Usually lower |
| MSE | Bias² = 0, but Var can dominate | Bias² > 0, but Var reduction can compensate |
| Small sample performance | Often poor (high variance) | Often better (controlled variance) |
| Large sample behavior | Optimal (variance → 0) | Bias doesn't vanish (may be suboptimal) |
| Predictive accuracy | Variable | Often superior |
In modern machine learning, we rarely insist on unbiasedness. What matters is predictive performance, which is better captured by MSE or cross-validation error. Regularization techniques introduce bias precisely because the variance reduction outweighs the cost. The bias-variance tradeoff is fundamental to understanding model selection.
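The point can be checked directly. The sketch below compares OLS (unbiased) with ridge regression on small, noisy samples; the dimensions, noise level, and penalty α = 10 are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p, sigma, n_trials = 30, 10, 3.0, 2000
beta_true = np.full(p, 0.3)                # modest true coefficients

ols_coefs, ridge_coefs = [], []
for _ in range(n_trials):
    X = rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(0, sigma, n)
    ols_coefs.append(LinearRegression(fit_intercept=False).fit(X, y).coef_)
    ridge_coefs.append(Ridge(alpha=10.0, fit_intercept=False).fit(X, y).coef_)

for name, est in [('OLS   (unbiased)', np.array(ols_coefs)),
                  ('Ridge (biased)  ', np.array(ridge_coefs))]:
    bias_sq = np.sum((est.mean(axis=0) - beta_true) ** 2)
    var = np.sum(est.var(axis=0))
    print(f"{name}  Bias²={bias_sq:.3f}  Var={var:.3f}  MSE={bias_sq + var:.3f}")
# Ridge accepts a little bias in exchange for a large variance reduction -> lower MSE
```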
The bias-variance decomposition extends beyond parameter estimation to prediction error in machine learning. This is one of the most important concepts in the field.
Setup: data are generated as y = f(x) + ε, where the noise ε has mean 0 and variance σ², and we fit a model f̂ to a randomly drawn training set.
For a new point x₀, the expected squared prediction error is:
$$E[(y_0 - \hat{f}(x_0))^2] = \sigma^2 + \text{Bias}(\hat{f}(x_0))^2 + \text{Var}(\hat{f}(x_0))$$
The three components:
Irreducible error (σ²): Noise inherent in the data. No model can eliminate this.
Squared bias: How well the model can approximate the true function on average.
Variance: How much the model changes with different training samples.
The tradeoff:
As model complexity increases, squared bias typically falls (a richer model can track the true function more closely) while variance rises (the fit becomes more sensitive to the particular training sample); expected error is minimized at an intermediate complexity, as the simulation below demonstrates.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def bias_variance_polynomial():
    """
    Demonstrate bias-variance tradeoff with polynomial regression.
    """
    np.random.seed(42)

    # True function
    def true_f(x):
        return np.sin(2 * np.pi * x)

    # Generate many training datasets
    n_datasets = 200
    n_train = 20
    noise_std = 0.3

    x_test = np.linspace(0, 1, 100)
    y_true = true_f(x_test)

    degrees = [1, 3, 5, 9, 15]

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))

    for idx, degree in enumerate(degrees):
        ax = axes.flatten()[idx]

        # Store predictions from each dataset
        all_predictions = []
        for _ in range(n_datasets):
            # Generate training data
            x_train = np.random.uniform(0, 1, n_train)
            y_train = true_f(x_train) + np.random.normal(0, noise_std, n_train)

            # Fit polynomial
            poly = PolynomialFeatures(degree)
            X_train_poly = poly.fit_transform(x_train.reshape(-1, 1))
            X_test_poly = poly.transform(x_test.reshape(-1, 1))

            model = LinearRegression()
            model.fit(X_train_poly, y_train)
            y_pred = model.predict(X_test_poly)
            all_predictions.append(y_pred)

        all_predictions = np.array(all_predictions)

        # Compute bias and variance
        mean_prediction = np.mean(all_predictions, axis=0)
        bias_squared = np.mean((mean_prediction - y_true) ** 2)
        variance = np.mean(np.var(all_predictions, axis=0))
        mse = np.mean(np.mean((all_predictions - y_true) ** 2, axis=0))

        # Plot
        ax.plot(x_test, y_true, 'g-', lw=2, label='True f(x)')
        ax.plot(x_test, mean_prediction, 'r-', lw=2, label='Mean prediction')

        # Plot a few individual predictions
        for i in range(min(10, n_datasets)):
            ax.plot(x_test, all_predictions[i], 'b-', alpha=0.1)

        ax.set_title(f'Degree {degree}\nBias²={bias_squared:.3f}, Var={variance:.3f}\nMSE≈{mse:.3f}')
        ax.set_xlabel('x')
        ax.set_ylabel('y')
        ax.legend(fontsize=8)
        ax.set_ylim([-2, 2])

    # Summary plot
    ax_summary = axes.flatten()[5]
    degrees_all = range(1, 16)
    bias_sq_all = []
    var_all = []
    mse_all = []

    for degree in degrees_all:
        all_preds = []
        for _ in range(n_datasets):
            x_train = np.random.uniform(0, 1, n_train)
            y_train = true_f(x_train) + np.random.normal(0, noise_std, n_train)
            poly = PolynomialFeatures(degree)
            X_train_poly = poly.fit_transform(x_train.reshape(-1, 1))
            X_test_poly = poly.transform(x_test.reshape(-1, 1))
            model = LinearRegression()
            model.fit(X_train_poly, y_train)
            all_preds.append(model.predict(X_test_poly))

        all_preds = np.array(all_preds)
        mean_pred = np.mean(all_preds, axis=0)
        bias_sq_all.append(np.mean((mean_pred - y_true) ** 2))
        var_all.append(np.mean(np.var(all_preds, axis=0)))
        mse_all.append(np.mean(np.mean((all_preds - y_true) ** 2, axis=0)))

    ax_summary.plot(degrees_all, bias_sq_all, 'r-', lw=2, label='Bias²')
    ax_summary.plot(degrees_all, var_all, 'b-', lw=2, label='Variance')
    ax_summary.plot(degrees_all, mse_all, 'g-', lw=3, label='MSE')
    ax_summary.axhline(y=noise_std**2, color='gray', linestyle='--', label='Noise²')
    ax_summary.set_xlabel('Polynomial Degree')
    ax_summary.set_ylabel('Error')
    ax_summary.set_title('Bias-Variance Tradeoff Summary')
    ax_summary.legend()
    ax_summary.set_xlim([1, 15])
    ax_summary.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

bias_variance_polynomial()
```

Understanding bias and variance has direct implications for how we build and tune ML models:
Diagnosing model problems: high error on both training and validation data signals high bias (underfitting); low training error with much higher validation error signals high variance (overfitting).
Techniques that reduce variance: more training data, regularization, bagging and other ensembles, feature selection, and simpler model classes.
Techniques that reduce bias: more expressive models, additional or better features, weaker regularization, and boosting.
The gap between training and validation error tells you everything: Small gap, both high → High bias (underfit). Large gap → High variance (overfit). Small gap, both low → Just right! Use learning curves (error vs. training set size) to diagnose: if training and validation converge at high error, add complexity. If they diverge, add regularization or data.
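A minimal learning-curve sketch using scikit-learn; the synthetic data, the degree-9 polynomial model with a small ridge penalty, and the cross-validation settings are all illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 400)

model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1e-3))
sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5, scoring='neg_mean_squared_error')

plt.plot(sizes, -train_scores.mean(axis=1), 'o-', label='Training MSE')
plt.plot(sizes, -val_scores.mean(axis=1), 's-', label='Validation MSE')
plt.xlabel('Training set size')
plt.ylabel('MSE')
plt.title('Learning curves: a persistent train/validation gap signals high variance')
plt.legend()
plt.show()
```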
We've explored the fundamental concepts of estimator quality. Let's consolidate: bias is the systematic part of the error (E[θ̂] − θ*), variance is the sample-to-sample fluctuation, MSE decomposes exactly as Bias² + Variance, unbiasedness alone does not make an estimator good, and the bias-variance tradeoff governs how much complexity or regularization to use.
Looking ahead:
Bias and variance describe finite-sample behavior. But what happens as we collect more and more data? The next page introduces Consistency—the asymptotic property guaranteeing that our estimates converge to the truth.
You now understand the twin pillars of estimator quality: bias and variance. You can decompose MSE into its components, recognize when each dominates, diagnose underfitting vs. overfitting, and apply the bias-variance tradeoff to guide model selection. This framework is foundational to both statistical inference and machine learning.