We've seen that estimators have bias and variance—they're not perfect. But here's a comforting thought: with enough data, good estimators get arbitrarily close to the truth.
Consistency formalizes this intuition. It's the guarantee that as we gather more observations, our uncertainty vanishes and we can estimate parameters with arbitrary precision. Without consistency, all the data in the world might not help us—a deeply unsatisfying property for any estimation procedure.
This page explores what consistency means, when it holds, and why it's fundamental to statistical learning theory.
By the end of this page, you will understand the formal definition of consistency (convergence in probability), distinguish between consistency and unbiasedness, prove consistency of common estimators, understand the conditions under which MLE is consistent, and connect consistency to the law of large numbers.
Definition: Consistency
An estimator θ̂ₙ (based on n observations) is consistent for θ* if it converges to θ* in probability as n → ∞:
$$\hat{\theta}_n \xrightarrow{P} \theta^*$$
Formally, for any ε > 0:
$$\lim_{n \to \infty} P(|\hat{\theta}_n - \theta^*| > \epsilon) = 0$$
Interpretation:
As we collect more data, the probability of being "far" from the true value (by more than any fixed ε) shrinks to zero. We can make our estimate arbitrarily accurate with enough observations.
Types of convergence:
There are actually several modes of convergence, from weakest to strongest:
Convergence in probability (what we're discussing): $$P(|\hat{\theta}_n - \theta^*| > \epsilon) \to 0$$
Almost sure convergence (stronger): $$P(\lim_{n \to \infty} \hat{\theta}_n = \theta^*) = 1$$
Mean squared convergence (L² convergence): $$E[(\hat{\theta}_n - \theta^*)^2] \to 0$$
For practical purposes, convergence in probability (consistency) is usually sufficient. Mean squared convergence (MSE → 0) is often easier to prove and implies consistency.
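Why does MSE → 0 imply consistency? Markov's inequality applied to the squared error gives, for any ε > 0:
$$P(|\hat{\theta}_n - \theta^*| > \epsilon) = P\left((\hat{\theta}_n - \theta^*)^2 > \epsilon^2\right) \leq \frac{E[(\hat{\theta}_n - \theta^*)^2]}{\epsilon^2} = \frac{\text{MSE}(\hat{\theta}_n)}{\epsilon^2}$$
So if the MSE vanishes, the probability of an error larger than any fixed ε vanishes too.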
If Bias(θ̂ₙ) → 0 AND Var(θ̂ₙ) → 0 as n → ∞, then MSE → 0, which implies consistency. This gives us a practical recipe: show that both bias and variance vanish asymptotically.
Consistency and unbiasedness are often confused. They are independent properties—neither implies the other!
Four possibilities:
| | Unbiased | Biased |
|---|---|---|
| Consistent | Sample mean X̄ₙ for μ | MLE variance σ̂² = Σ(xᵢ - x̄)²/n |
| Inconsistent | θ̂ = X₁ (first observation only) | θ̂ = 0 (ignores data entirely) |
Example 1: Unbiased but Inconsistent
Consider estimating the population mean μ using only the first observation:
$$\hat{\mu} = X_1$$
This is unbiased: E[X₁] = μ.
But it's inconsistent! No matter how many observations we collect, we only use X₁. The variance Var(X₁) = σ² never decreases. The estimator doesn't improve with more data.
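A minimal simulation sketch (parameter values chosen only for illustration) makes the contrast concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0

for n in [10, 100, 10_000]:
    # 5000 repeated experiments of size n each
    data = rng.normal(mu, sigma, size=(5000, n))
    first_obs = data[:, 0]           # estimator that uses only X_1
    sample_mean = data.mean(axis=1)  # estimator that uses all n observations
    print(f"n={n:6d}  sd(X_1)={first_obs.std():.3f}  sd(mean)={sample_mean.std():.3f}")
```

Both estimators are centered on μ, but only the sample mean's spread shrinks (roughly σ/√n); the spread of X₁ stays near σ no matter how large n gets.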
Example 2: Biased but Consistent
The MLE for Gaussian variance:
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$$
This is biased: E[σ̂²] = (n-1)/n × σ² ≠ σ².
But it's consistent: the bias, E[σ̂²] - σ² = -σ²/n, vanishes as n → ∞, and the variance also shrinks to zero, so MSE → 0 and σ̂²_MLE →ᵖ σ².
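A small sketch (true variance chosen arbitrarily) showing the biased MLE still homing in on σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0  # true variance

for n in [10, 100, 1000, 100_000]:
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    mle_var = np.mean((x - x.mean()) ** 2)  # divides by n, hence biased
    print(f"n={n:7d}  sigma2_MLE={mle_var:.4f}  (true: {sigma2})")
```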
Unbiasedness is a finite-sample property—it holds for any fixed n. Consistency is an asymptotic property—it only describes what happens as n → ∞. An estimator can be unbiased for each n yet never converge (like θ̂ = X₁). Conversely, an estimator can be biased for each n yet have the bias vanish (like MLE variance).
The Law of Large Numbers (LLN) is the fundamental theorem underlying consistency.
Weak Law of Large Numbers (WLLN):
For i.i.d. random variables X₁, ..., Xₙ with E[Xᵢ] = μ:
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu$$
The sample mean converges in probability to the population mean.
Strong Law of Large Numbers (SLLN):
Under the same conditions:
$$P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$$
The sample mean converges almost surely (with probability 1).
Proof sketch (WLLN via Chebyshev):
By Chebyshev's inequality:
$$P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{\text{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}$$
As n → ∞, the right side → 0, proving convergence in probability.
Why LLN matters for estimation:
Many estimators can be written as sample averages (or continuous functions of sample averages). For example: the sample mean itself, the sample variance, empirical frequencies such as P̂(X > t) = (1/n)Σᵢ 1{Xᵢ > t}, and method-of-moments estimators built from sample moments.
The LLN immediately gives consistency for all these!
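As a quick illustration, here is a minimal sketch (the Exponential distribution and the threshold 3.0 are arbitrary choices for this example) showing two average-based estimators converging:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.5  # rate of an Exponential distribution

for n in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=1 / lam, size=n)
    second_moment = np.mean(x ** 2)   # estimates E[X^2] = 2/lam^2 = 8
    tail_prob = np.mean(x > 3.0)      # estimates P(X > 3) = exp(-1.5) ≈ 0.223
    print(f"n={n:8d}  E[X^2]≈{second_moment:.3f}  P(X>3)≈{tail_prob:.4f}")
```

The longer demonstration below visualizes the same convergence for the sample mean from several angles.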
```python
import numpy as np
import matplotlib.pyplot as plt


def demonstrate_lln():
    """
    Visualize the Law of Large Numbers in action.
    """
    np.random.seed(42)

    # Parameters
    true_mu = 5.0
    true_sigma = 2.0
    max_n = 10000
    n_simulations = 20

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Multiple sample paths converging
    ax1 = axes[0, 0]
    ns = np.arange(1, max_n + 1)
    for i in range(n_simulations):
        data = np.random.normal(true_mu, true_sigma, max_n)
        running_means = np.cumsum(data) / ns
        ax1.plot(ns, running_means, alpha=0.3)
    ax1.axhline(y=true_mu, color='red', linewidth=2, label=f'True μ = {true_mu}')
    ax1.set_xlabel('Sample Size n')
    ax1.set_ylabel('Sample Mean X̄ₙ')
    ax1.set_title('Law of Large Numbers: Sample Paths Converging')
    ax1.set_xscale('log')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Plot 2: Distribution of X̄ₙ for different n
    ax2 = axes[0, 1]
    sample_sizes = [10, 100, 1000, 10000]
    n_samples = 5000
    colors = plt.cm.viridis(np.linspace(0, 0.8, len(sample_sizes)))
    for n, color in zip(sample_sizes, colors):
        sample_means = []
        for _ in range(n_samples):
            data = np.random.normal(true_mu, true_sigma, n)
            sample_means.append(np.mean(data))
        ax2.hist(sample_means, bins=50, density=True, alpha=0.5,
                 color=color, label=f'n = {n}')
    ax2.axvline(x=true_mu, color='red', linewidth=2, linestyle='--')
    ax2.set_xlabel('Sample Mean X̄ₙ')
    ax2.set_ylabel('Density')
    ax2.set_title('Distribution of X̄ₙ Concentrates Around μ')
    ax2.legend()

    # Plot 3: P(|X̄ - μ| > ε) decreasing
    ax3 = axes[1, 0]
    epsilon = 0.1
    sample_sizes = np.logspace(1, 4, 50).astype(int)
    prob_exceeds = []
    chebyshev_bound = []
    for n in sample_sizes:
        sample_means = [np.mean(np.random.normal(true_mu, true_sigma, n))
                        for _ in range(1000)]
        prob = np.mean(np.abs(np.array(sample_means) - true_mu) > epsilon)
        prob_exceeds.append(prob)
        chebyshev_bound.append(true_sigma**2 / (n * epsilon**2))
    ax3.plot(sample_sizes, prob_exceeds, 'b-', lw=2, label=f'P(|X̄ₙ - μ| > {epsilon})')
    ax3.plot(sample_sizes, chebyshev_bound, 'r--', lw=2, label='Chebyshev bound')
    ax3.set_xlabel('Sample Size n')
    ax3.set_ylabel('Probability')
    ax3.set_title(f'Convergence in Probability (ε = {epsilon})')
    ax3.set_xscale('log')
    ax3.set_yscale('log')
    ax3.legend()
    ax3.grid(True, alpha=0.3)

    # Plot 4: Standard error decreasing as 1/√n
    ax4 = axes[1, 1]
    theoretical_se = true_sigma / np.sqrt(sample_sizes)
    empirical_se = []
    for n in sample_sizes:
        sample_means = [np.mean(np.random.normal(true_mu, true_sigma, n))
                        for _ in range(500)]
        empirical_se.append(np.std(sample_means))
    ax4.plot(sample_sizes, theoretical_se, 'r-', lw=2, label='Theoretical: σ/√n')
    ax4.plot(sample_sizes, empirical_se, 'bo', markersize=4, alpha=0.7,
             label='Empirical SE')
    ax4.set_xlabel('Sample Size n')
    ax4.set_ylabel('Standard Error of X̄ₙ')
    ax4.set_title('Standard Error Decreases as 1/√n')
    ax4.set_xscale('log')
    ax4.set_yscale('log')
    ax4.legend()
    ax4.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()


demonstrate_lln()
```

Let's rigorously prove consistency of the sample mean using the MSE approach.
Claim: The sample mean X̄ₙ = (1/n)Σᵢ₌₁ⁿ Xᵢ is a consistent estimator of μ = E[X].
Proof:
We'll show MSE(X̄ₙ) → 0, which implies consistency.
Step 1: Compute Bias
$$\text{Bias}(\bar{X}_n) = E[\bar{X}_n] - \mu = \mu - \mu = 0$$
The sample mean is unbiased for all n, so Bias = 0.
Step 2: Compute Variance
$$\text{Var}(\bar{X}_n) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
Step 3: Compute MSE
$$\text{MSE}(\bar{X}_n) = \text{Bias}^2 + \text{Var} = 0 + \frac{\sigma^2}{n} = \frac{\sigma^2}{n}$$
Step 4: Take limit
$$\lim_{n \to \infty} \text{MSE}(\bar{X}_n) = \lim_{n \to \infty} \frac{\sigma^2}{n} = 0$$
Conclusion: MSE → 0 implies X̄ₙ →ᵖ μ, so the sample mean is consistent. ∎
This proof illustrates a general strategy: (1) Show bias → 0 (or is already 0), (2) Show variance → 0, (3) Conclude MSE → 0 → consistency. This approach works for many estimators beyond the sample mean.
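A quick empirical check of the σ²/n formula, as a hedged sketch with arbitrary parameter choices:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.0, 2.0

for n in [10, 100, 1000]:
    # 20,000 repeated samples of size n; MSE of the sample mean around mu
    means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    mse = np.mean((means - mu) ** 2)
    print(f"n={n:5d}  empirical MSE={mse:.5f}  theory sigma^2/n={sigma**2 / n:.5f}")
```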
One of the most important results in statistical theory is that MLEs are consistent under mild regularity conditions.
Theorem (MLE Consistency):
Under regularity conditions, the MLE θ̂ₙ is consistent for θ*:
$$\hat{\theta}_{MLE} \xrightarrow{P} \theta^*$$
Intuition:
MLE maximizes the log-likelihood:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \frac{1}{n}\sum_{i=1}^n \log P(X_i | \theta)$$
By the Law of Large Numbers:
$$\frac{1}{n}\sum_{i=1}^n \log P(X_i | \theta) \xrightarrow{P} E_{X \sim P(\cdot|\theta^*)}[\log P(X | \theta)]$$
The expectation E[log P(X|θ)] (taken over the true distribution P(X|θ*)) is maximized at θ = θ*. This is because:
$$E[\log P(X|\theta)] - E[\log P(X|\theta^*)] = -D_{KL}\left(P(\cdot|\theta^*) \,\|\, P(\cdot|\theta)\right) \leq 0$$
KL divergence is non-negative, with equality iff θ = θ*.
Regularity conditions:
MLE consistency requires:
Identifiability: Different θ give different distributions. If P(X|θ₁) = P(X|θ₂) for all X, we can't distinguish θ₁ from θ₂.
Correct specification: The true data-generating process is in the model family. If reality is N(μ, σ²) but we fit Exponential(λ), MLE won't find the "true" parameters.
Compact parameter space or proper behavior at boundaries.
Smoothness: The log-likelihood is sufficiently smooth (differentiable, etc.).
Dominated convergence: Technical conditions for the LLN to apply uniformly.
MLE can fail to be consistent when: (1) The model is misspecified—MLE converges to the 'closest' distribution in KL sense, not the truth. (2) The number of parameters grows with n (as in some non-parametric settings). (3) There are multiple modes and we find the wrong one. (4) Edge cases violate regularity (e.g., estimating the endpoint of a uniform distribution).
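To illustrate failure mode (1), here is a hedged sketch in which the data actually come from a lognormal distribution but we fit an Exponential(λ) by MLE. The estimate still converges, but to the rate of the KL-closest exponential (λ = 1/E[X]), not to any "true" λ (there isn't one). The distributional choices here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Data come from a lognormal, but we (wrongly) fit Exponential(lambda) by MLE.
mu_log, sigma_log = 0.0, 0.75
kl_closest_lambda = 1 / np.exp(mu_log + sigma_log**2 / 2)  # = 1/E[X] ≈ 0.755

for n in [100, 10_000, 1_000_000]:
    x = rng.lognormal(mu_log, sigma_log, size=n)
    lam_mle = 1 / x.mean()  # Exponential MLE: 1 / sample mean
    print(f"n={n:8d}  lambda_MLE={lam_mle:.4f}  KL-closest={kl_closest_lambda:.4f}")
```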
```python
import numpy as np
import matplotlib.pyplot as plt


def demonstrate_mle_consistency():
    """
    Demonstrate that MLE estimates converge to true parameters.
    """
    np.random.seed(42)

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Example 1: Bernoulli MLE
    ax1 = axes[0, 0]
    true_p = 0.7
    sample_sizes = np.logspace(1, 4, 30).astype(int)
    n_simulations = 200
    mle_means = []
    mle_stds = []
    for n in sample_sizes:
        mles = []
        for _ in range(n_simulations):
            data = np.random.binomial(1, true_p, size=n)
            mle = np.mean(data)  # MLE for Bernoulli
            mles.append(mle)
        mle_means.append(np.mean(mles))
        mle_stds.append(np.std(mles))
    ax1.errorbar(sample_sizes, mle_means, yerr=mle_stds, fmt='o-',
                 capsize=3, label='MLE ± 1 std')
    ax1.axhline(y=true_p, color='red', linestyle='--', linewidth=2,
                label=f'True p = {true_p}')
    ax1.fill_between(sample_sizes, true_p - 0.02, true_p + 0.02,
                     alpha=0.2, color='red')
    ax1.set_xlabel('Sample Size n')
    ax1.set_ylabel('MLE Estimate')
    ax1.set_title('Bernoulli MLE: Convergence to True p')
    ax1.set_xscale('log')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Example 2: Gaussian MLE for mean
    ax2 = axes[0, 1]
    true_mu = 5.0
    true_sigma = 2.0
    mle_means_mu = []
    mle_stds_mu = []
    for n in sample_sizes:
        mles = []
        for _ in range(n_simulations):
            data = np.random.normal(true_mu, true_sigma, size=n)
            mle = np.mean(data)  # MLE for Gaussian mean
            mles.append(mle)
        mle_means_mu.append(np.mean(mles))
        mle_stds_mu.append(np.std(mles))
    ax2.errorbar(sample_sizes, mle_means_mu, yerr=mle_stds_mu, fmt='s-',
                 capsize=3, color='green', label='MLE ± 1 std')
    ax2.axhline(y=true_mu, color='red', linestyle='--', linewidth=2,
                label=f'True μ = {true_mu}')
    ax2.set_xlabel('Sample Size n')
    ax2.set_ylabel('MLE Estimate')
    ax2.set_title('Gaussian MLE: Convergence to True μ')
    ax2.set_xscale('log')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    # Example 3: Exponential MLE
    ax3 = axes[1, 0]
    true_lambda = 0.5
    mle_means_lambda = []
    mle_stds_lambda = []
    for n in sample_sizes:
        mles = []
        for _ in range(n_simulations):
            data = np.random.exponential(scale=1/true_lambda, size=n)
            mle = 1 / np.mean(data)  # MLE for Exponential rate
            mles.append(mle)
        mle_means_lambda.append(np.mean(mles))
        mle_stds_lambda.append(np.std(mles))
    ax3.errorbar(sample_sizes, mle_means_lambda, yerr=mle_stds_lambda, fmt='^-',
                 capsize=3, color='purple', label='MLE ± 1 std')
    ax3.axhline(y=true_lambda, color='red', linestyle='--', linewidth=2,
                label=f'True λ = {true_lambda}')
    ax3.set_xlabel('Sample Size n')
    ax3.set_ylabel('MLE Estimate')
    ax3.set_title('Exponential MLE: Convergence to True λ')
    ax3.set_xscale('log')
    ax3.legend()
    ax3.grid(True, alpha=0.3)

    # Example 4: Distribution of MLE for different n
    ax4 = axes[1, 1]
    true_p = 0.3
    sample_sizes_dist = [20, 100, 500, 2000]
    n_samples = 2000
    colors = plt.cm.plasma(np.linspace(0.2, 0.8, len(sample_sizes_dist)))
    for n, color in zip(sample_sizes_dist, colors):
        mles = [np.mean(np.random.binomial(1, true_p, size=n))
                for _ in range(n_samples)]
        ax4.hist(mles, bins=50, density=True, alpha=0.4, color=color,
                 label=f'n = {n}')
    ax4.axvline(x=true_p, color='red', linewidth=2, linestyle='--',
                label=f'True p = {true_p}')
    ax4.set_xlabel('MLE Estimate')
    ax4.set_ylabel('Density')
    ax4.set_title('MLE Distribution Concentrates as n → ∞')
    ax4.legend()

    plt.tight_layout()
    plt.show()


demonstrate_mle_consistency()
```

Consistency tells us that estimators converge to the truth. Asymptotic normality tells us how they converge—the shape of the sampling distribution for large n.
Central Limit Theorem (CLT):
For i.i.d. X₁, ..., Xₙ with mean μ and variance σ²:
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$$
Or equivalently:
$$\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$
The sample mean becomes approximately Gaussian regardless of the original distribution (as long as its variance is finite)!
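A quick sketch of this (using Exponential(1) data purely for illustration): standardized sample means line up with standard normal quantiles as n grows, even though the underlying distribution is heavily skewed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Exponential(1): mean 1, variance 1, strongly right-skewed
for n in [5, 50, 500]:
    means = rng.exponential(1.0, size=(50_000, n)).mean(axis=1)
    z = np.sqrt(n) * (means - 1.0) / 1.0  # standardize: sqrt(n)(X̄ - μ)/σ
    q = [0.05, 0.5, 0.95]
    print(f"n={n:4d}  quantiles of z:", np.round(np.quantile(z, q), 3),
          " vs N(0,1):", np.round(stats.norm.ppf(q), 3))
```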
Asymptotic normality of MLE:
Under regularity conditions:
$$\sqrt{n}(\hat{\theta}_{MLE} - \theta^*) \xrightarrow{d} N(0, I(\theta^*)^{-1})$$
where I(θ) is the Fisher Information.
This means MLE is approximately:
$$\hat{\theta}_{MLE} \approx N\left(\theta^*, \frac{1}{nI(\theta^*)}\right)$$
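As a concrete worked example, for a Bernoulli(p) model the Fisher information and the resulting approximation are:
$$I(p) = E\left[\left(\frac{\partial}{\partial p}\log P(X|p)\right)^2\right] = \frac{1}{p(1-p)}, \qquad \hat{p}_{MLE} \approx N\left(p^*, \frac{p^*(1-p^*)}{n}\right)$$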
Why asymptotic normality matters:
Confidence intervals: We can construct approximate 95% CIs (see the sketch after this list): $$\hat{\theta} \pm 1.96 \times \text{SE}(\hat{\theta}), \qquad \text{SE}(\hat{\theta}) = \frac{1}{\sqrt{n\,I(\hat{\theta})}}$$
Hypothesis testing: Standard tests (t-tests, z-tests) are justified
Efficiency comparison: The Fisher Information determines the "best" variance achievable
Universal behavior: Regardless of the underlying distribution, MLEs behave like Gaussians for large n
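Here is a minimal sketch of the confidence-interval recipe for a Bernoulli parameter (the sample size and true p are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
true_p, n = 0.3, 500

x = rng.binomial(1, true_p, size=n)
p_hat = x.mean()                             # MLE
se = np.sqrt(p_hat * (1 - p_hat) / n)        # 1 / sqrt(n * I(p_hat))
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)  # approximate 95% Wald interval
print(f"p_hat={p_hat:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), true p={true_p}")
```

The standard error uses the plug-in Fisher information from the worked Bernoulli example above: 1/√(n I(p̂)) = √(p̂(1 - p̂)/n).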
The rate of convergence:
Note the √n factor. This means the estimation error shrinks at rate 1/√n: halving the standard error requires 4× as many observations, and each additional decimal digit of precision costs roughly 100× more data.
For most well-behaved estimators, the standard error scales as 1/√n. This is the 'statistical speed limit'—you cannot generally do better. Some exceptions exist (superefficient estimators at isolated parameter values, and non-regular problems such as the uniform-endpoint MLE, which converges at the faster 1/n rate), but 1/√n is the typical rate. This explains why collecting 100× more data only gives 10× more precision.
Consistency has profound practical implications for machine learning and statistics:
1. Justification for MLE/MAP:
Consistency tells us that with enough data, MLE will find the truth (assuming correct model specification). This justifies using MLE even when we don't have closed-form solutions—we know we're converging to the right answer.
2. Model selection:
Consistent model selection criteria (like BIC) select the true model with probability approaching 1 as n → ∞. This gives theoretical backing to complexity penalties.
3. Cross-validation:
Leave-one-out and k-fold CV estimate prediction error, and under suitable conditions these estimates converge to the true generalization error. With enough data, they rank candidate models correctly.
4. Regularization tuning:
Optimal regularization strength typically decreases with n (we need less shrinkage with more data), consistent with the Bayesian interpretation where data overwhelms the prior.
5. Sample size planning:
Understanding that SE ∝ 1/√n guides experimental design: halving the target standard error requires four times the sample size, and for a target standard error the required n scales as (σ / SE_target)², as sketched below.
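A back-of-the-envelope calculation, assuming a known (or guessed) population standard deviation:

```python
import numpy as np

sigma = 2.0        # assumed population standard deviation
target_se = 0.05   # desired standard error of the sample mean

# SE = sigma / sqrt(n)  =>  n = (sigma / SE)^2
n_required = int(np.ceil((sigma / target_se) ** 2))
print(f"Need n ≈ {n_required} observations for SE ≈ {target_se}")  # 1600
```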
6. Convergence diagnostics:
When monitoring estimates computed on growing datasets, we expect them to stabilize, with run-to-run fluctuations shrinking roughly like 1/√n as more data arrives.
7. Limitations:
Consistency tells us about the limit as n → ∞, but we always have finite samples. An estimator can be consistent yet perform poorly for the sample sizes we actually have. Convergence can be slow (requiring billions of samples). Always consider both asymptotic properties AND finite-sample performance.
When working with a new estimator, how do we determine if it's consistent? Here's a toolkit:
Method 1: MSE Approach
Show that Bias(θ̂ₙ) → 0 and Var(θ̂ₙ) → 0 as n → ∞.
Then MSE = Bias² + Var → 0, implying consistency.
Method 2: Direct Application of LLN
If θ̂ₙ can be written as a sample average (or continuous function of sample averages), LLN applies directly.
Method 3: Contraction Argument
Show that as n increases, the estimator stays within a shrinking ball around θ* with high probability.
Method 4: Simulation Study
Empirical verification: simulate data with known parameters, compute the estimator for increasing n, and check that it concentrates around the true value (see the sketch below).
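A hedged, generic version of this check might look like the following sketch (check_consistency and its arguments are illustrative names, not a standard API):

```python
import numpy as np

def check_consistency(estimator, sampler, true_value,
                      ns=(100, 1_000, 10_000), reps=2_000, seed=0):
    """Crude simulation check: does the estimator concentrate around true_value as n grows?"""
    rng = np.random.default_rng(seed)
    for n in ns:
        estimates = np.array([estimator(sampler(rng, n)) for _ in range(reps)])
        rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
        print(f"n={n:7d}  RMSE={rmse:.4f}")

# Example: the sample median of N(5, 2) data as an estimator of the mean 5
check_consistency(np.median, lambda rng, n: rng.normal(5.0, 2.0, n), true_value=5.0)
```

A shrinking RMSE across increasing n is consistent with (though not a proof of) consistency.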
Method 5: Literature Search
For standard estimators (MLE, method of moments), consistency results are well-established under regularity conditions. Check if your problem satisfies these conditions.
We've explored the asymptotic property that guarantees our estimates improve with more data. Let's consolidate:
- Consistency means θ̂ₙ →ᵖ θ*: the probability of missing the truth by more than any fixed ε vanishes as n → ∞.
- Consistency and unbiasedness are independent properties; neither implies the other.
- Showing Bias → 0 and Variance → 0 gives MSE → 0, which implies consistency.
- The Law of Large Numbers is the engine: estimators that are sample averages (or smooth functions of them) inherit consistency.
- MLE is consistent under regularity conditions (identifiability, correct specification, smoothness) and is asymptotically normal with variance 1/(nI(θ*)).
- The typical convergence rate is 1/√n: 100× more data buys roughly 10× more precision.
Looking ahead:
Consistency tells us we'll eventually get it right. But among consistent estimators, some converge faster than others. The next page introduces Efficiency—the property that characterizes the best possible convergence rate and the estimators that achieve it.
You now understand consistency—the asymptotic guarantee that estimators converge to truth. You can distinguish it from unbiasedness, prove consistency using the MSE approach, recognize the Law of Large Numbers as the underlying engine, and appreciate both the power and limitations of asymptotic theory.