We've seen that estimators have bias and variance—they're not perfect. But here's a comforting thought: with enough data, good estimators get arbitrarily close to the truth.
Consistency formalizes this intuition. It's the guarantee that as we gather more observations, our uncertainty vanishes and we can estimate parameters with arbitrary precision. Without consistency, all the data in the world might not help us—a deeply unsatisfying property for any estimation procedure.
This page explores what consistency means, when it holds, and why it's fundamental to statistical learning theory.
By the end of this page, you will understand the formal definition of consistency (convergence in probability), distinguish between consistency and unbiasedness, prove consistency of common estimators, understand the conditions under which MLE is consistent, and connect consistency to the law of large numbers.
Definition: Consistency
An estimator θ̂ₙ (based on n observations) is consistent for θ* if it converges to θ* in probability as n → ∞:
$$\hat{\theta}_n \xrightarrow{P} \theta^*$$
Formally, for any ε > 0:
$$\lim_{n \to \infty} P(|\hat{\theta}_n - \theta^*| > \epsilon) = 0$$
Interpretation:
As we collect more data, the probability of being "far" from the true value (by more than any fixed ε) shrinks to zero. We can make our estimate arbitrarily accurate with enough observations.
Types of convergence:
There are actually several modes of convergence, from weakest to strongest:
Convergence in probability (what we're discussing): $$P(|\hat{\theta}_n - \theta^*| > \epsilon) \to 0$$
Almost sure convergence (stronger): $$P(\lim_{n \to \infty} \hat{\theta}_n = \theta^*) = 1$$
Mean squared convergence (L² convergence): $$E[(\hat{\theta}_n - \theta^*)^2] \to 0$$
For practical purposes, convergence in probability (consistency) is usually sufficient. Mean squared convergence (MSE → 0) is often easier to prove and implies consistency.
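Why does MSE → 0 imply consistency? Markov's inequality applied to the squared error gives, for any ε > 0:
$$P(|\hat{\theta}_n - \theta^*| > \epsilon) = P\left((\hat{\theta}_n - \theta^*)^2 > \epsilon^2\right) \leq \frac{E[(\hat{\theta}_n - \theta^*)^2]}{\epsilon^2} = \frac{\text{MSE}(\hat{\theta}_n)}{\epsilon^2}$$
So if the MSE vanishes, the probability of an error larger than any fixed ε vanishes too.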
If Bias(θ̂ₙ) → 0 AND Var(θ̂ₙ) → 0 as n → ∞, then MSE → 0, which implies consistency. This gives us a practical recipe: show that both bias and variance vanish asymptotically.
Consistency and unbiasedness are often confused. They are independent properties—neither implies the other!
Four possibilities:
| | Unbiased | Biased |
|---|---|---|
| Consistent | Sample mean X̄ₙ for μ | MLE variance σ̂² = Σ(xᵢ - x̄)²/n |
| Inconsistent | θ̂ = X₁ (first observation only) | θ̂ = 0 (ignores data entirely) |
Example 1: Unbiased but Inconsistent
Consider estimating the population mean μ using only the first observation:
$$\hat{\mu} = X_1$$
This is unbiased: E[X₁] = μ.
But it's inconsistent! No matter how many observations we collect, we only use X₁. The variance Var(X₁) = σ² never decreases. The estimator doesn't improve with more data.
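A minimal simulation sketch (parameter values chosen only for illustration) makes the contrast concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0

for n in [10, 100, 10_000]:
    # 5000 repeated experiments of size n each
    data = rng.normal(mu, sigma, size=(5000, n))
    first_obs = data[:, 0]           # estimator that uses only X_1
    sample_mean = data.mean(axis=1)  # estimator that uses all n observations
    print(f"n={n:6d}  sd(X_1)={first_obs.std():.3f}  sd(mean)={sample_mean.std():.3f}")
```

Both estimators are centered on μ, but only the sample mean's spread shrinks (roughly σ/√n); the spread of X₁ stays near σ no matter how large n gets.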
Example 2: Biased but Consistent
The MLE for Gaussian variance:
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$$
This is biased: E[σ̂²] = (n-1)/n × σ² ≠ σ².
But it's consistent: the bias, E[σ̂²] - σ² = -σ²/n, vanishes as n → ∞, and the variance also shrinks to zero, so MSE → 0 and σ̂²_MLE →ᵖ σ².
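A small sketch (true variance chosen arbitrarily) showing the biased MLE still homing in on σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0  # true variance

for n in [10, 100, 1000, 100_000]:
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    mle_var = np.mean((x - x.mean()) ** 2)  # divides by n, hence biased
    print(f"n={n:7d}  sigma2_MLE={mle_var:.4f}  (true: {sigma2})")
```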
Unbiasedness is a finite-sample property—it holds for any fixed n. Consistency is an asymptotic property—it only describes what happens as n → ∞. An estimator can be unbiased for each n yet never converge (like θ̂ = X₁). Conversely, an estimator can be biased for each n yet have the bias vanish (like MLE variance).
The Law of Large Numbers (LLN) is the fundamental theorem underlying consistency.
Weak Law of Large Numbers (WLLN):
For i.i.d. random variables X₁, ..., Xₙ with E[Xᵢ] = μ:
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu$$
The sample mean converges in probability to the population mean.
Strong Law of Large Numbers (SLLN):
Under the same conditions:
$$P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$$
The sample mean converges almost surely (with probability 1).
Proof sketch (WLLN via Chebyshev):
By Chebyshev's inequality:
$$P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{\text{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}$$
As n → ∞, the right side → 0, proving convergence in probability.
Why LLN matters for estimation:
Many estimators can be written as sample averages (or continuous functions of sample averages). For example: the sample mean itself, the sample variance, empirical frequencies such as P̂(X > t) = (1/n)Σᵢ 1{Xᵢ > t}, and method-of-moments estimators built from sample moments.
The LLN immediately gives consistency for all these!
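As a quick illustration, here is a minimal sketch (the Exponential distribution and the threshold 3.0 are arbitrary choices for this example) showing two average-based estimators converging:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.5  # rate of an Exponential distribution

for n in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=1 / lam, size=n)
    second_moment = np.mean(x ** 2)   # estimates E[X^2] = 2/lam^2 = 8
    tail_prob = np.mean(x > 3.0)      # estimates P(X > 3) = exp(-1.5) ≈ 0.223
    print(f"n={n:8d}  E[X^2]≈{second_moment:.3f}  P(X>3)≈{tail_prob:.4f}")
```

The longer demonstration below visualizes the same convergence for the sample mean from several angles.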
```python
import numpy as np
import matplotlib.pyplot as plt


def demonstrate_lln():
    """
    Visualize the Law of Large Numbers in action.
    """
    np.random.seed(42)

    # Parameters
    true_mu = 5.0
    true_sigma = 2.0
    max_n = 10000
    n_simulations = 20

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Multiple sample paths converging
    ax1 = axes[0, 0]
    ns = np.arange(1, max_n + 1)
    for i in range(n_simulations):
        data = np.random.normal(true_mu, true_sigma, max_n)
        running_means = np.cumsum(data) / ns
        ax1.plot(ns, running_means, alpha=0.3)
    ax1.axhline(y=true_mu, color='red', linewidth=2, label=f'True μ = {true_mu}')
    ax1.set_xlabel('Sample Size n')
    ax1.set_ylabel('Sample Mean X̄ₙ')
    ax1.set_title('Law of Large Numbers: Sample Paths Converging')
    ax1.set_xscale('log')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Plot 2: Distribution of X̄ₙ for different n
    ax2 = axes[0, 1]
    sample_sizes = [10, 100, 1000, 10000]
    n_samples = 5000
    colors = plt.cm.viridis(np.linspace(0, 0.8, len(sample_sizes)))
    for n, color in zip(sample_sizes, colors):
        sample_means = []
        for _ in range(n_samples):
            data = np.random.normal(true_mu, true_sigma, n)
            sample_means.append(np.mean(data))
        ax2.hist(sample_means, bins=50, density=True, alpha=0.5,
                 color=color, label=f'n = {n}')
    ax2.axvline(x=true_mu, color='red', linewidth=2, linestyle='--')
    ax2.set_xlabel('Sample Mean X̄ₙ')
    ax2.set_ylabel('Density')
    ax2.set_title('Distribution of X̄ₙ Concentrates Around μ')
    ax2.legend()

    # Plot 3: P(|X̄ - μ| > ε) decreasing
    ax3 = axes[1, 0]
    epsilon = 0.1
    sample_sizes = np.logspace(1, 4, 50).astype(int)
    prob_exceeds = []
    chebyshev_bound = []
    for n in sample_sizes:
        sample_means = [np.mean(np.random.normal(true_mu, true_sigma, n))
                        for _ in range(1000)]
        prob = np.mean(np.abs(np.array(sample_means) - true_mu) > epsilon)
        prob_exceeds.append(prob)
        chebyshev_bound.append(true_sigma**2 / (n * epsilon**2))
    ax3.plot(sample_sizes, prob_exceeds, 'b-', lw=2, label=f'P(|X̄ₙ - μ| > {epsilon})')
    ax3.plot(sample_sizes, chebyshev_bound, 'r--', lw=2, label='Chebyshev bound')
    ax3.set_xlabel('Sample Size n')
    ax3.set_ylabel('Probability')
    ax3.set_title(f'Convergence in Probability (ε = {epsilon})')
    ax3.set_xscale('log')
    ax3.set_yscale('log')
    ax3.legend()
    ax3.grid(True, alpha=0.3)

    # Plot 4: Standard error decreasing as 1/√n
    ax4 = axes[1, 1]
    theoretical_se = true_sigma / np.sqrt(sample_sizes)
    empirical_se = []
    for n in sample_sizes:
        sample_means = [np.mean(np.random.normal(true_mu, true_sigma, n))
                        for _ in range(500)]
        empirical_se.append(np.std(sample_means))
    ax4.plot(sample_sizes, theoretical_se, 'r-', lw=2, label='Theoretical: σ/√n')
    ax4.plot(sample_sizes, empirical_se, 'bo', markersize=4, alpha=0.7,
             label='Empirical SE')
    ax4.set_xlabel('Sample Size n')
    ax4.set_ylabel('Standard Error of X̄ₙ')
    ax4.set_title('Standard Error Decreases as 1/√n')
    ax4.set_xscale('log')
    ax4.set_yscale('log')
    ax4.legend()
    ax4.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()


demonstrate_lln()
```

Let's rigorously prove consistency of the sample mean using the MSE approach.
Claim: The sample mean X̄ₙ = (1/n)Σᵢ₌₁ⁿ Xᵢ is a consistent estimator of μ = E[X].
Proof:
We'll show MSE(X̄ₙ) → 0, which implies consistency.
Step 1: Compute Bias
$$\text{Bias}(\bar{X}_n) = E[\bar{X}_n] - \mu = \mu - \mu = 0$$
The sample mean is unbiased for all n, so Bias = 0.
Step 2: Compute Variance
$$\text{Var}(\bar{X}_n) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
Step 3: Compute MSE
$$\text{MSE}(\bar{X}_n) = \text{Bias}^2 + \text{Var} = 0 + \frac{\sigma^2}{n} = \frac{\sigma^2}{n}$$
Step 4: Take limit
$$\lim_{n \to \infty} \text{MSE}(\bar{X}_n) = \lim_{n \to \infty} \frac{\sigma^2}{n} = 0$$
Conclusion: MSE → 0 implies X̄ₙ →ᵖ μ, so the sample mean is consistent. ∎
This proof illustrates a general strategy: (1) Show bias → 0 (or is already 0), (2) Show variance → 0, (3) Conclude MSE → 0 → consistency. This approach works for many estimators beyond the sample mean.
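A quick empirical check of the σ²/n formula, as a hedged sketch with arbitrary parameter choices:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.0, 2.0

for n in [10, 100, 1000]:
    # 20,000 repeated samples of size n; MSE of the sample mean around mu
    means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    mse = np.mean((means - mu) ** 2)
    print(f"n={n:5d}  empirical MSE={mse:.5f}  theory sigma^2/n={sigma**2 / n:.5f}")
```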
One of the most important results in statistical theory is that MLEs are consistent under mild regularity conditions.
Theorem (MLE Consistency):
Under regularity conditions, the MLE θ̂ₙ is consistent for θ*:
$$\hat{\theta}_{MLE} \xrightarrow{P} \theta^*$$
Intuition:
MLE maximizes the log-likelihood:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \frac{1}{n}\sum_{i=1}^n \log P(X_i | \theta)$$
By the Law of Large Numbers:
$$\frac{1}{n}\sum_{i=1}^n \log P(X_i | \theta) \xrightarrow{P} E_{X \sim P(\cdot|\theta^*)}[\log P(X | \theta)]$$
The expectation E[log P(X|θ)] (taken over the true distribution P(X|θ*)) is maximized at θ = θ*. This is because:
$$E[\log P(X|\theta)] - E[\log P(X|\theta^*)] = -D_{KL}\left(P(\cdot|\theta^*) \,\|\, P(\cdot|\theta)\right) \leq 0$$
KL divergence is non-negative, with equality iff θ = θ*.
Regularity conditions:
MLE consistency requires:
Identifiability: Different θ give different distributions. If P(X|θ₁) = P(X|θ₂) for all X, we can't distinguish θ₁ from θ₂.
Correct specification: The true data-generating process is in the model family. If reality is N(μ, σ²) but we fit Exponential(λ), MLE won't find the "true" parameters.
Compact parameter space or proper behavior at boundaries.
Smoothness: The log-likelihood is sufficiently smooth (differentiable, etc.).
Dominated convergence: Technical conditions for the LLN to apply uniformly.
MLE can fail to be consistent when: (1) The model is misspecified—MLE converges to the 'closest' distribution in KL sense, not the truth. (2) The number of parameters grows with n (as in some non-parametric settings). (3) There are multiple modes and we find the wrong one. (4) Edge cases violate regularity (e.g., estimating the endpoint of a uniform distribution).
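To illustrate failure mode (1), here is a hedged sketch in which the data actually come from a lognormal distribution but we fit an Exponential(λ) by MLE. The estimate still converges, but to the rate of the KL-closest exponential (λ = 1/E[X]), not to any "true" λ (there isn't one). The distributional choices here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Data come from a lognormal, but we (wrongly) fit Exponential(lambda) by MLE.
mu_log, sigma_log = 0.0, 0.75
kl_closest_lambda = 1 / np.exp(mu_log + sigma_log**2 / 2)  # = 1/E[X] ≈ 0.755

for n in [100, 10_000, 1_000_000]:
    x = rng.lognormal(mu_log, sigma_log, size=n)
    lam_mle = 1 / x.mean()  # Exponential MLE: 1 / sample mean
    print(f"n={n:8d}  lambda_MLE={lam_mle:.4f}  KL-closest={kl_closest_lambda:.4f}")
```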
```python
import numpy as np
import matplotlib.pyplot as plt


def demonstrate_mle_consistency():
    """
    Demonstrate that MLE estimates converge to true parameters.
    """
    np.random.seed(42)

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Example 1: Bernoulli MLE
    ax1 = axes[0, 0]
    true_p = 0.7
    sample_sizes = np.logspace(1, 4, 30).astype(int)
    n_simulations = 200
    mle_means = []
    mle_stds = []
    for n in sample_sizes:
        mles = []
        for _ in range(n_simulations):
            data = np.random.binomial(1, true_p, size=n)
            mle = np.mean(data)  # MLE for Bernoulli
            mles.append(mle)
        mle_means.append(np.mean(mles))
        mle_stds.append(np.std(mles))
    ax1.errorbar(sample_sizes, mle_means, yerr=mle_stds, fmt='o-',
                 capsize=3, label='MLE ± 1 std')
    ax1.axhline(y=true_p, color='red', linestyle='--', linewidth=2,
                label=f'True p = {true_p}')
    ax1.fill_between(sample_sizes, true_p - 0.02, true_p + 0.02,
                     alpha=0.2, color='red')
    ax1.set_xlabel('Sample Size n')
    ax1.set_ylabel('MLE Estimate')
    ax1.set_title('Bernoulli MLE: Convergence to True p')
    ax1.set_xscale('log')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Example 2: Gaussian MLE for mean
    ax2 = axes[0, 1]
    true_mu = 5.0
    true_sigma = 2.0
    mle_means_mu = []
    mle_stds_mu = []
    for n in sample_sizes:
        mles = []
        for _ in range(n_simulations):
            data = np.random.normal(true_mu, true_sigma, size=n)
            mle = np.mean(data)  # MLE for Gaussian mean
            mles.append(mle)
        mle_means_mu.append(np.mean(mles))
        mle_stds_mu.append(np.std(mles))
    ax2.errorbar(sample_sizes, mle_means_mu, yerr=mle_stds_mu, fmt='s-',
                 capsize=3, color='green', label='MLE ± 1 std')
    ax2.axhline(y=true_mu, color='red', linestyle='--', linewidth=2,
                label=f'True μ = {true_mu}')
    ax2.set_xlabel('Sample Size n')
    ax2.set_ylabel('MLE Estimate')
    ax2.set_title('Gaussian MLE: Convergence to True μ')
    ax2.set_xscale('log')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    # Example 3: Exponential MLE
    ax3 = axes[1, 0]
    true_lambda = 0.5
    mle_means_lambda = []
    mle_stds_lambda = []
    for n in sample_sizes:
        mles = []
        for _ in range(n_simulations):
            data = np.random.exponential(scale=1/true_lambda, size=n)
            mle = 1 / np.mean(data)  # MLE for Exponential rate
            mles.append(mle)
        mle_means_lambda.append(np.mean(mles))
        mle_stds_lambda.append(np.std(mles))
    ax3.errorbar(sample_sizes, mle_means_lambda, yerr=mle_stds_lambda, fmt='^-',
                 capsize=3, color='purple', label='MLE ± 1 std')
    ax3.axhline(y=true_lambda, color='red', linestyle='--', linewidth=2,
                label=f'True λ = {true_lambda}')
    ax3.set_xlabel('Sample Size n')
    ax3.set_ylabel('MLE Estimate')
    ax3.set_title('Exponential MLE: Convergence to True λ')
    ax3.set_xscale('log')
    ax3.legend()
    ax3.grid(True, alpha=0.3)

    # Example 4: Distribution of MLE for different n
    ax4 = axes[1, 1]
    true_p = 0.3
    sample_sizes_dist = [20, 100, 500, 2000]
    n_samples = 2000
    colors = plt.cm.plasma(np.linspace(0.2, 0.8, len(sample_sizes_dist)))
    for n, color in zip(sample_sizes_dist, colors):
        mles = [np.mean(np.random.binomial(1, true_p, size=n))
                for _ in range(n_samples)]
        ax4.hist(mles, bins=50, density=True, alpha=0.4, color=color,
                 label=f'n = {n}')
    ax4.axvline(x=true_p, color='red', linewidth=2, linestyle='--',
                label=f'True p = {true_p}')
    ax4.set_xlabel('MLE Estimate')
    ax4.set_ylabel('Density')
    ax4.set_title('MLE Distribution Concentrates as n → ∞')
    ax4.legend()

    plt.tight_layout()
    plt.show()


demonstrate_mle_consistency()
```

Consistency tells us that estimators converge to the truth. Asymptotic normality tells us how they converge—the shape of the sampling distribution for large n.
Central Limit Theorem (CLT):
For i.i.d. X₁, ..., Xₙ with mean μ and variance σ²:
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$$
Or equivalently:
$$\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$
The sample mean becomes approximately Gaussian regardless of the original distribution (as long as its variance is finite)!
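A quick sketch of this (using Exponential(1) data purely for illustration): standardized sample means line up with standard normal quantiles as n grows, even though the underlying distribution is heavily skewed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Exponential(1): mean 1, variance 1, strongly right-skewed
for n in [5, 50, 500]:
    means = rng.exponential(1.0, size=(50_000, n)).mean(axis=1)
    z = np.sqrt(n) * (means - 1.0) / 1.0  # standardize: sqrt(n)(X̄ - μ)/σ
    q = [0.05, 0.5, 0.95]
    print(f"n={n:4d}  quantiles of z:", np.round(np.quantile(z, q), 3),
          " vs N(0,1):", np.round(stats.norm.ppf(q), 3))
```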
Asymptotic normality of MLE:
Under regularity conditions:
$$\sqrt{n}(\hat{\theta}_{MLE} - \theta^*) \xrightarrow{d} N(0, I(\theta^*)^{-1})$$
where I(θ) is the Fisher Information.
This means MLE is approximately:
$$\hat{\theta}_{MLE} \approx N\left(\theta^*, \frac{1}{nI(\theta^*)}\right)$$
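As a concrete worked example, for a Bernoulli(p) model the Fisher information and the resulting approximation are:
$$I(p) = E\left[\left(\frac{\partial}{\partial p}\log P(X|p)\right)^2\right] = \frac{1}{p(1-p)}, \qquad \hat{p}_{MLE} \approx N\left(p^*, \frac{p^*(1-p^*)}{n}\right)$$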
Why asymptotic normality matters:
Confidence intervals: We can construct approximate 95% CIs (see the sketch after this list): $$\hat{\theta} \pm 1.96 \times \text{SE}(\hat{\theta}), \qquad \text{SE}(\hat{\theta}) = \frac{1}{\sqrt{n\,I(\hat{\theta})}}$$
Hypothesis testing: Standard tests (t-tests, z-tests) are justified
Efficiency comparison: The Fisher Information determines the "best" variance achievable
Universal behavior: Regardless of the underlying distribution, MLEs behave like Gaussians for large n
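Here is a minimal sketch of the confidence-interval recipe for a Bernoulli parameter (the sample size and true p are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
true_p, n = 0.3, 500

x = rng.binomial(1, true_p, size=n)
p_hat = x.mean()                             # MLE
se = np.sqrt(p_hat * (1 - p_hat) / n)        # 1 / sqrt(n * I(p_hat))
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)  # approximate 95% Wald interval
print(f"p_hat={p_hat:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), true p={true_p}")
```

The standard error uses the plug-in Fisher information from the worked Bernoulli example above: 1/√(n I(p̂)) = √(p̂(1 - p̂)/n).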
The rate of convergence:
Note the √n factor. This means the estimation error shrinks at rate 1/√n: halving the standard error requires 4× as many observations, and each additional decimal digit of precision costs roughly 100× more data.
For most well-behaved estimators, the standard error scales as 1/√n. This is the 'statistical speed limit'—you cannot generally do better. Some exceptions exist (superefficient estimators at isolated parameter values, and non-regular problems such as the uniform-endpoint MLE, which converges at the faster 1/n rate), but 1/√n is the typical rate. This explains why collecting 100× more data only gives 10× more precision.
Consistency has profound practical implications for machine learning and statistics:
1. Justification for MLE/MAP:
Consistency tells us that with enough data, MLE will find the truth (assuming correct model specification). This justifies using MLE even when we don't have closed-form solutions—we know we're converging to the right answer.
2. Model selection:
Consistent model selection criteria (like BIC) select the true model with probability approaching 1 as n → ∞. This gives theoretical backing to complexity penalties.
3. Cross-validation:
Leave-one-out and k-fold CV estimate prediction error, and under suitable conditions these estimates converge to the true generalization error. With enough data, they rank candidate models correctly.
4. Regularization tuning:
Optimal regularization strength typically decreases with n (we need less shrinkage with more data), consistent with the Bayesian interpretation where data overwhelms the prior.
5. Sample size planning:
Understanding that SE ∝ 1/√n guides experimental design: halving the target standard error requires four times the sample size, and for a target standard error the required n scales as (σ / SE_target)², as sketched below.
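A back-of-the-envelope calculation, assuming a known (or guessed) population standard deviation:

```python
import numpy as np

sigma = 2.0        # assumed population standard deviation
target_se = 0.05   # desired standard error of the sample mean

# SE = sigma / sqrt(n)  =>  n = (sigma / SE)^2
n_required = int(np.ceil((sigma / target_se) ** 2))
print(f"Need n ≈ {n_required} observations for SE ≈ {target_se}")  # 1600
```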
6. Convergence diagnostics:
When monitoring estimates computed on growing datasets, we expect them to stabilize, with run-to-run fluctuations shrinking roughly like 1/√n as more data arrives.
7. Limitations:
Consistency tells us about the limit as n → ∞, but we always have finite samples. An estimator can be consistent yet perform poorly for the sample sizes we actually have. Convergence can be slow (requiring billions of samples). Always consider both asymptotic properties AND finite-sample performance.
When working with a new estimator, how do we determine if it's consistent? Here's a toolkit:
Method 1: MSE Approach
Show that Bias(θ̂ₙ) → 0 and Var(θ̂ₙ) → 0 as n → ∞.
Then MSE = Bias² + Var → 0, implying consistency.
Method 2: Direct Application of LLN
If θ̂ₙ can be written as a sample average (or continuous function of sample averages), LLN applies directly.
Method 3: Contraction Argument
Show that as n increases, the estimator stays within a shrinking ball around θ* with high probability.
Method 4: Simulation Study
Empirical verification: simulate data with known parameters, compute the estimator for increasing n, and check that it concentrates around the true value (see the sketch below).
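A hedged, generic version of this check might look like the following sketch (check_consistency and its arguments are illustrative names, not a standard API):

```python
import numpy as np

def check_consistency(estimator, sampler, true_value,
                      ns=(100, 1_000, 10_000), reps=2_000, seed=0):
    """Crude simulation check: does the estimator concentrate around true_value as n grows?"""
    rng = np.random.default_rng(seed)
    for n in ns:
        estimates = np.array([estimator(sampler(rng, n)) for _ in range(reps)])
        rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
        print(f"n={n:7d}  RMSE={rmse:.4f}")

# Example: the sample median of N(5, 2) data as an estimator of the mean 5
check_consistency(np.median, lambda rng, n: rng.normal(5.0, 2.0, n), true_value=5.0)
```

A shrinking RMSE across increasing n is consistent with (though not a proof of) consistency.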
Method 5: Literature Search
For standard estimators (MLE, method of moments), consistency results are well-established under regularity conditions. Check if your problem satisfies these conditions.
We've explored the asymptotic property that guarantees our estimates improve with more data. Let's consolidate:
- Consistency means θ̂ₙ →ᵖ θ*: the probability of missing the truth by more than any fixed ε vanishes as n → ∞.
- Consistency and unbiasedness are independent properties; neither implies the other.
- Showing Bias → 0 and Variance → 0 gives MSE → 0, which implies consistency.
- The Law of Large Numbers is the engine: estimators that are sample averages (or smooth functions of them) inherit consistency.
- MLE is consistent under regularity conditions (identifiability, correct specification, smoothness) and is asymptotically normal with variance 1/(nI(θ*)).
- The typical convergence rate is 1/√n: 100× more data buys roughly 10× more precision.
Looking ahead:
Consistency tells us we'll eventually get it right. But among consistent estimators, some converge faster than others. The next page introduces Efficiency—the property that characterizes the best possible convergence rate and the estimators that achieve it.
You now understand consistency—the asymptotic guarantee that estimators converge to truth. You can distinguish it from unbiasedness, prove consistency using the MSE approach, recognize the Law of Large Numbers as the underlying engine, and appreciate both the power and limitations of asymptotic theory.