We've established that good estimators are consistent (they converge to the truth) and have low MSE (squared bias plus variance). But among all consistent, unbiased estimators, is there a best one? Is there a fundamental limit to how precise our estimates can be?
The answer is yes, and it leads to one of the most beautiful results in statistics: the Cramér-Rao Lower Bound. This bound sets a floor on the variance of any unbiased estimator, and estimators that achieve this floor are called efficient.
Efficiency tells us when we're extracting the maximum possible information from our data—when no other unbiased estimator could do better.
By the end of this page, you will understand the Fisher Information and its interpretation, derive and apply the Cramér-Rao Lower Bound, recognize efficient estimators (those achieving the bound), appreciate that MLE is asymptotically efficient, and understand relative efficiency for comparing estimators.
The Fisher Information measures how much information an observation carries about the parameter θ. It's the key quantity that determines the precision limits of estimation.
Definition 1 (Score function):
The score function is the gradient of the log-likelihood:
$$S(\theta) = \frac{\partial}{\partial\theta} \log P(X | \theta)$$
Definition 2 (Fisher Information):
The Fisher Information is the variance of the score:
$$I(\theta) = \text{Var}\left[\frac{\partial}{\partial\theta} \log P(X | \theta)\right] = E\left[\left(\frac{\partial}{\partial\theta} \log P(X | \theta)\right)^2\right]$$
Under regularity conditions, an equivalent form is:
$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \log P(X | \theta)\right]$$
The second form is often easier to compute.
Intuition:
The score function S(θ) measures how "sensitive" the log-likelihood is to changes in θ. High sensitivity means we can distinguish between nearby parameter values—the data is informative.
Geometric interpretation:
Fisher Information is the expected curvature (negative second derivative) of the log-likelihood at the true parameter. Sharp curves mean precise estimation; flat curves mean imprecise estimation.
Additivity of Fisher Information:
For n i.i.d. observations, the total Fisher Information is:
$$I_n(\theta) = n \cdot I_1(\theta)$$
Information adds linearly! This is why more data allows more precise estimation—we accumulate information.
Example: Bernoulli Fisher Information
For X ~ Bernoulli(p):
$$\log P(X | p) = X \log p + (1-X) \log(1-p)$$
$$\frac{\partial}{\partial p} \log P = \frac{X}{p} - \frac{1-X}{1-p}$$
$$\frac{\partial^2}{\partial p^2} \log P = -\frac{X}{p^2} - \frac{1-X}{(1-p)^2}$$
$$I(p) = -E\left[-\frac{X}{p^2} - \frac{1-X}{(1-p)^2}\right] = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p(1-p)}$$
For Bernoulli, I(p) = 1/(p(1-p)) is minimized at p = 0.5 and diverges as p → 0 or p → 1. Equivalently, the CRLB p(1-p)/n is largest at p = 0.5: a near-fair coin is the hardest parameter to pin down in absolute terms. When p is extreme (e.g., 0.01), the log-likelihood is very sensitive to small changes in p: a single success is four times as likely under p = 0.02 as under p = 0.005, so even rare successes sharply discriminate between nearby values. When p = 0.5, the likelihood changes slowly with p, and each observation carries less distinguishing power.
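As a quick sanity check, both forms of the Fisher Information from Definition 2 can be verified by Monte Carlo; this is a minimal sketch (the helper name is ours, not from a library):

```python
import numpy as np

def bernoulli_fisher_check(p, n_samples=200_000, seed=0):
    """Monte Carlo check that Var(score) = -E[d²/dp² log P] = 1/(p(1-p))."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, p, size=n_samples)

    # Score: d/dp log P(x|p) = x/p - (1-x)/(1-p)
    score = x / p - (1 - x) / (1 - p)

    # Second derivative: d²/dp² log P(x|p) = -x/p² - (1-x)/(1-p)²
    second_deriv = -x / p**2 - (1 - x) / (1 - p)**2

    return np.var(score), -np.mean(second_deriv), 1 / (p * (1 - p))

for p in [0.1, 0.3, 0.5]:
    var_score, neg_curvature, exact = bernoulli_fisher_check(p)
    print(f"p={p}: Var(score)={var_score:.3f}, "
          f"-E[d2 logP]={neg_curvature:.3f}, 1/(p(1-p))={exact:.3f}")
```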
The Cramér-Rao Lower Bound (CRLB) is one of the most important results in estimation theory. It sets a fundamental limit on estimator precision.
Theorem (Cramér-Rao Lower Bound):
For any unbiased estimator θ̂ of θ:
$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$
For n i.i.d. observations:
$$\text{Var}(\hat{\theta}) \geq \frac{1}{nI_1(\theta)}$$
Interpretation:
No unbiased estimator can have variance lower than 1/(nI). The Fisher Information sets a speed limit on estimation precision. This is a fundamental lower bound—not a constraint we choose, but one that's mathematically inescapable.
Proof sketch:
The proof uses the Cauchy-Schwarz inequality. Define the score S(θ) = ∂/∂θ log P(X|θ).
Key facts: under the regularity conditions, E[S(θ)] = 0, so Var(S) = I(θ); and for any unbiased estimator θ̂, Cov(θ̂, S) = 1 (both derived just below the proof).
By Cauchy-Schwarz: $$1 = |\text{Cov}(\hat{\theta}, S)|^2 \leq \text{Var}(\hat{\theta}) \cdot \text{Var}(S) = \text{Var}(\hat{\theta}) \cdot I(\theta)$$
Rearranging: Var(θ̂) ≥ 1/I(θ). ∎
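The two key facts follow by differentiating under the integral sign, which is exactly what the regularity conditions permit (the second line uses E[S] = 0 and unbiasedness, E[θ̂] = θ):

$$E[S(\theta)] = \int \frac{\partial_\theta P(x \mid \theta)}{P(x \mid \theta)}\, P(x \mid \theta)\, dx = \frac{\partial}{\partial\theta}\int P(x \mid \theta)\, dx = \frac{\partial}{\partial\theta} 1 = 0$$

$$\text{Cov}(\hat{\theta}, S) = E[\hat{\theta}\, S] = \int \hat{\theta}(x)\, \partial_\theta P(x \mid \theta)\, dx = \frac{\partial}{\partial\theta} E[\hat{\theta}] = \frac{\partial}{\partial\theta}\, \theta = 1$$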
| Distribution | Parameter | Fisher Information I₁(θ) (per observation) | CRLB: Var(θ̂) ≥ |
|---|---|---|---|
| Bernoulli(p) | p | 1/(p(1-p)) | p(1-p)/n |
| N(μ, σ²) [known σ²] | μ | 1/σ² | σ²/n |
| N(μ, σ²) [known μ] | σ² | 1/(2σ⁴) | 2σ⁴/n |
| Exponential(λ) | λ | 1/λ² | λ²/n |
| Poisson(λ) | λ | 1/λ | λ/n |
| Uniform(0, θ) | θ | 1/θ² (naive; regularity fails) | θ²/n (not a valid bound; see note below) |
The CRLB requires regularity conditions (smoothness of the likelihood, interchangeability of differentiation and integration). For some distributions (e.g., Uniform(0,θ)), these fail, and better-than-CRLB estimators exist! The order statistic X₍ₙ₎ for Uniform(0,θ) has variance O(1/n²), beating the naive CRLB.
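A short simulation makes this concrete (a sketch; the helper name is ours): the unbiased estimator (n+1)X₍ₙ₎/n has exact variance θ²/(n(n+2)), far below the naive θ²/n bound even for moderate n.

```python
import numpy as np

def uniform_max_vs_naive_crlb(theta=2.0, n_sims=10_000, seed=1):
    """Compare Var of the bias-corrected MLE (n+1)/n * X_(n) with the naive CRLB θ²/n."""
    rng = np.random.default_rng(seed)
    print(f"{'n':>6} {'empirical Var':>14} {'theta^2/(n(n+2))':>18} {'naive theta^2/n':>16}")
    for n in [10, 50, 250, 1000]:
        samples = rng.uniform(0, theta, size=(n_sims, n))
        est = (n + 1) / n * samples.max(axis=1)      # unbiased estimator of θ
        exact_var = theta**2 / (n * (n + 2))         # exact variance of this estimator
        naive_crlb = theta**2 / n                    # bound that would apply if regularity held
        print(f"{n:>6} {est.var():>14.6f} {exact_var:>18.6f} {naive_crlb:>16.6f}")

uniform_max_vs_naive_crlb()
```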
An estimator is efficient if it achieves the Cramér-Rao Lower Bound with equality:
$$\text{Var}(\hat{\theta}) = \frac{1}{nI(\theta)}$$
When does efficiency occur?
The CRLB is achieved when the Cauchy-Schwarz inequality becomes an equality, which happens iff θ̂ and S(θ) are perfectly correlated—specifically, when:
$$\hat{\theta} - \theta = \frac{S(\theta)}{I(\theta)}$$
This is equivalent to the score being a linear function of the estimator.
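For instance, with n Bernoulli(p) observations the condition holds exactly:

$$S_n(p) = \sum_{i=1}^n \left(\frac{X_i}{p} - \frac{1 - X_i}{1 - p}\right) = \frac{\sum_i X_i - np}{p(1-p)} = \frac{n}{p(1-p)}\,(\hat{p} - p) = I_n(p)\,(\hat{p} - p)$$

so p̂ - p = Sₙ(p)/Iₙ(p), which is precisely the attainment condition; this is why the sample proportion is efficient.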
The Exponential Family:
Efficient estimators exist precisely for the exponential family of distributions:
$$P(x | \theta) = h(x) \exp(\eta(\theta) T(x) - A(\theta))$$
Examples: Bernoulli, Gaussian, Exponential, Poisson, Gamma, Beta.
For these families, when the target of estimation is the mean of the sufficient statistic T(X) (for example, p for Bernoulli, μ for a Gaussian with known variance, λ for Poisson), the MLE is the sample average of T(X) and achieves the CRLB exactly.
Example: Sample Mean for Gaussian
For X₁, ..., Xₙ ~ N(μ, σ²) with known σ²:
The Fisher Information for μ is I(μ) = 1/σ², so the CRLB is σ²/n. The sample mean X̄ has Var(X̄) = σ²/n, so it achieves the bound exactly—it's efficient!
Example: MLE for Bernoulli
For X₁, ..., Xₙ ~ Bernoulli(p):
Here I(p) = 1/(p(1-p)), so the CRLB is p(1-p)/n. The sample proportion p̂ = (1/n)ΣXᵢ has Var(p̂) = p(1-p)/n, so it achieves the bound exactly—the sample proportion is efficient!
```python
import numpy as np
import matplotlib.pyplot as plt

def verify_efficiency():
    """
    Empirically verify that MLE achieves the Cramér-Rao bound.
    """
    np.random.seed(42)
    n_simulations = 10000
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Example 1: Bernoulli MLE vs CRLB
    ax1 = axes[0, 0]
    true_p = 0.3
    sample_sizes = np.arange(10, 501, 20)
    crlb_values = []
    empirical_vars = []
    for n in sample_sizes:
        # Cramér-Rao Lower Bound
        crlb = true_p * (1 - true_p) / n
        crlb_values.append(crlb)
        # Empirical variance of MLE
        mles = [np.mean(np.random.binomial(1, true_p, size=n))
                for _ in range(n_simulations)]
        empirical_vars.append(np.var(mles))
    ax1.plot(sample_sizes, crlb_values, 'r-', lw=2, label='CRLB: p(1-p)/n')
    ax1.plot(sample_sizes, empirical_vars, 'b.-', alpha=0.7, label='Empirical Var(MLE)')
    ax1.set_xlabel('Sample Size n')
    ax1.set_ylabel('Variance')
    ax1.set_title(f'Bernoulli MLE Achieves CRLB (p={true_p})')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_yscale('log')

    # Example 2: Gaussian mean MLE vs CRLB
    ax2 = axes[0, 1]
    true_mu = 5.0
    true_sigma2 = 4.0
    crlb_gauss = []
    empirical_vars_gauss = []
    for n in sample_sizes:
        crlb = true_sigma2 / n
        crlb_gauss.append(crlb)
        mles = [np.mean(np.random.normal(true_mu, np.sqrt(true_sigma2), n))
                for _ in range(n_simulations)]
        empirical_vars_gauss.append(np.var(mles))
    ax2.plot(sample_sizes, crlb_gauss, 'r-', lw=2, label='CRLB: σ²/n')
    ax2.plot(sample_sizes, empirical_vars_gauss, 'g.-', alpha=0.7, label='Empirical Var(MLE)')
    ax2.set_xlabel('Sample Size n')
    ax2.set_ylabel('Variance')
    ax2.set_title(f'Gaussian Mean MLE Achieves CRLB (σ²={true_sigma2})')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_yscale('log')

    # Example 3: Exponential rate MLE vs CRLB
    ax3 = axes[1, 0]
    true_lambda = 0.5
    crlb_exp = []
    empirical_vars_exp = []
    for n in sample_sizes:
        # For exponential rate λ, I(λ) = 1/λ²
        crlb = true_lambda**2 / n
        crlb_exp.append(crlb)
        mles = [1 / np.mean(np.random.exponential(1/true_lambda, n))
                for _ in range(n_simulations)]
        empirical_vars_exp.append(np.var(mles))
    ax3.plot(sample_sizes, crlb_exp, 'r-', lw=2, label='CRLB: λ²/n')
    ax3.plot(sample_sizes, empirical_vars_exp, 'm.-', alpha=0.7, label='Empirical Var(MLE)')
    ax3.set_xlabel('Sample Size n')
    ax3.set_ylabel('Variance')
    ax3.set_title(f'Exponential Rate MLE Achieves CRLB (λ={true_lambda})')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    ax3.set_yscale('log')

    # Example 4: Efficiency ratio
    ax4 = axes[1, 1]
    efficiency_bern = [crlb / emp for crlb, emp in zip(crlb_values, empirical_vars)]
    efficiency_gauss = [crlb / emp for crlb, emp in zip(crlb_gauss, empirical_vars_gauss)]
    efficiency_exp = [crlb / emp for crlb, emp in zip(crlb_exp, empirical_vars_exp)]
    ax4.plot(sample_sizes, efficiency_bern, 'b.-', label='Bernoulli')
    ax4.plot(sample_sizes, efficiency_gauss, 'g.-', label='Gaussian Mean')
    ax4.plot(sample_sizes, efficiency_exp, 'm.-', label='Exponential Rate')
    ax4.axhline(y=1.0, color='red', linestyle='--', lw=2, label='Perfect efficiency')
    ax4.set_xlabel('Sample Size n')
    ax4.set_ylabel('Efficiency Ratio (CRLB / Empirical Var)')
    ax4.set_title('Efficiency Ratio ≈ 1 Confirms Achievability')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    ax4.set_ylim([0.9, 1.1])

    plt.tight_layout()
    plt.show()

    print("Efficiency Summary:")
    print("=" * 50)
    print(f"Bernoulli MLE efficiency: {np.mean(efficiency_bern):.4f}")
    print(f"Gaussian Mean MLE efficiency: {np.mean(efficiency_gauss):.4f}")
    print(f"Exponential Rate MLE efficiency: {np.mean(efficiency_exp):.4f}")
    print("(Values close to 1.0 indicate CRLB is achieved)")

verify_efficiency()
```

One of the most important results in estimation theory:
Theorem (Asymptotic Efficiency of MLE):
Under regularity conditions, the MLE is asymptotically efficient:
$$\sqrt{n}(\hat{\theta}_{MLE} - \theta^*) \xrightarrow{d} N(0, I(\theta^*)^{-1})$$
The asymptotic variance of √n·(θ̂ - θ*) is exactly 1/I(θ*), the Cramér-Rao bound.
What this means:
MLE is the best you can do asymptotically (among regular estimators).
Why MLE achieves asymptotic efficiency:
Recall MLE maximizes:
$$\frac{1}{n}\sum_{i=1}^n \log P(X_i | \theta)$$
The first-order condition is:
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log P(X_i | \theta) = 0$$
This is a sample average of the score—by CLT, it's asymptotically normal. A Taylor expansion around θ* connects the MLE to this average, yielding the result.
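In symbols, writing ℓᵢ(θ) = log P(Xᵢ | θ) and Taylor-expanding the first-order condition around θ* (a sketch under the usual regularity assumptions):

$$0 = \frac{1}{n}\sum_{i=1}^n \ell_i'(\hat{\theta}) \approx \frac{1}{n}\sum_{i=1}^n \ell_i'(\theta^*) + (\hat{\theta} - \theta^*)\,\frac{1}{n}\sum_{i=1}^n \ell_i''(\theta^*)$$

$$\Rightarrow\quad \sqrt{n}\,(\hat{\theta} - \theta^*) \approx \left[-\frac{1}{n}\sum_{i=1}^n \ell_i''(\theta^*)\right]^{-1} \cdot \frac{1}{\sqrt{n}}\sum_{i=1}^n \ell_i'(\theta^*)$$

The bracketed term converges to I(θ*) by the law of large numbers (the second form of the Fisher Information), and the scaled score sum converges in distribution to N(0, I(θ*)) by the CLT, giving the limiting distribution N(0, I(θ*)⁻¹).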
The remarkable universality:
Regardless of the specific model (Bernoulli, Gaussian, or a complex neural network), if you use MLE (equivalently, minimize the negative log-likelihood) and the model is correctly specified, you get the most precise estimates possible for large samples.
MLE's asymptotic efficiency is why it's the go-to method in statistics and ML. When you minimize cross-entropy in classification or MSE in regression (with Gaussian noise), you're doing MLE—and thereby achieving optimal efficiency. This isn't just convenient; it's theoretically optimal.
When an estimator doesn't achieve the CRLB, we can still quantify how close it comes using relative efficiency.
Definition:
The relative efficiency of estimator θ̂₁ compared to θ̂₂ is:
$$\text{RE}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}$$
Absolute efficiency is the ratio of CRLB to actual variance:
$$e(\hat{\theta}) = \frac{1/(nI(\theta))}{\text{Var}(\hat{\theta})} = \frac{\text{CRLB}}{\text{Var}(\hat{\theta})}$$
Example: Mean vs. Median for Gaussian
For N(μ, σ²), both the sample mean and median are consistent estimators of μ.
Relative efficiency: $$\text{RE}(\bar{X}, \text{median}) = \frac{(\pi/2) \cdot \sigma^2/n}{\sigma^2/n} = \frac{\pi}{2} \approx 1.57$$
The mean is 57% more efficient than the median for Gaussian data.
Interpretation: You need 57% more data using the median to achieve the same precision as the mean.
But for heavy-tailed distributions, the median can be more efficient due to robustness!
| Distribution | RE(Mean, Median) | Winner |
|---|---|---|
| Gaussian | π/2 ≈ 1.57 | Mean is 57% more efficient |
| Uniform | 3.0 | Mean (though both are beaten by the midrange) |
| Laplace (Double Exp) | 0.5 | Median is 2x more efficient |
| Cauchy | 0 | Median wins (mean has infinite variance!) |
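These relative efficiencies are easy to check by simulation; here is a minimal sketch (the distribution parameters and helper name are illustrative):

```python
import numpy as np

def mean_vs_median_efficiency(n=200, n_sims=20_000, seed=2):
    """Estimate RE(mean, median) = Var(median)/Var(mean) for several distributions."""
    rng = np.random.default_rng(seed)
    distributions = {
        "Gaussian": lambda size: rng.normal(0.0, 1.0, size),    # expect RE ≈ π/2 ≈ 1.57
        "Uniform": lambda size: rng.uniform(-1.0, 1.0, size),   # expect RE ≈ 3
        "Laplace": lambda size: rng.laplace(0.0, 1.0, size),    # expect RE ≈ 0.5
    }
    for name, sampler in distributions.items():
        data = sampler((n_sims, n))
        var_mean = data.mean(axis=1).var()
        var_median = np.median(data, axis=1).var()
        print(f"{name:>8}: RE(mean, median) ≈ {var_median / var_mean:.2f}")

mean_vs_median_efficiency()
```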
There's often a tradeoff between efficiency (optimal under the assumed model) and robustness (performance under model misspecification). The sample mean is efficient for Gaussian but terrible for Cauchy. The median sacrifices some efficiency for much better robustness. In practice, slight efficiency losses for improved robustness are often worthwhile.
Fisher Information has modern applications beyond classical statistics:
1. Natural Gradient Descent:
Instead of standard gradient descent: $$\theta \leftarrow \theta - \alpha \nabla_{\theta} L(\theta)$$
Natural gradient uses the Fisher Information Matrix (FIM): $$\theta \leftarrow \theta - \alpha F(\theta)^{-1} \nabla_{\theta} L(\theta)$$
The FIM accounts for the geometry of probability distributions, leading to faster, more stable convergence.
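A minimal, illustrative sketch for logistic regression (the data, model, and helper names are ours, not from any library): the natural-gradient update preconditions the ordinary gradient with the inverse of the empirical FIM.

```python
import numpy as np

def natural_gradient_step(w, X, y, lr=0.1):
    """One plain and one natural gradient step for logistic regression (illustrative)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))                  # predicted probabilities
    grad = X.T @ (p - y) / len(y)                     # gradient of average NLL
    # Empirical Fisher Information Matrix: mean of p(1-p) x xᵀ over the data
    F = (X * (p * (1 - p))[:, None]).T @ X / len(y)
    w_plain = w - lr * grad
    w_natural = w - lr * np.linalg.solve(F + 1e-6 * np.eye(len(w)), grad)
    return w_plain, w_natural

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_w)))

w = np.zeros(3)
for _ in range(50):
    _, w = natural_gradient_step(w, X, y, lr=0.5)     # iterate the natural-gradient update
print("true w:     ", true_w)
print("estimated w:", np.round(w, 3))
```

For logistic regression the FIM evaluated at the current weights equals the Hessian of the NLL (canonical link), so this natural-gradient iteration is a damped Fisher-scoring (Newton-like) update and should land near the data-generating weights.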
2. Elastic Weight Consolidation (EWC):
For continual learning (avoiding catastrophic forgetting), EWC uses Fisher Information to identify which weights are important for previous tasks:
$$L_{EWC} = L_{new} + \frac{\lambda}{2}\sum_i F_i (\theta_i - \theta_{old,i})^2$$
Weights with high Fisher Information (important for past tasks) are penalized more for changing.
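A compact sketch of the EWC pieces under the common diagonal-Fisher approximation (the numbers and helper names are purely illustrative; real implementations compute per-example gradients of the log-likelihood with respect to network weights):

```python
import numpy as np

def diagonal_fisher(grads_per_example):
    """Diagonal Fisher approximation: mean of squared per-example log-likelihood gradients."""
    return np.mean(np.square(grads_per_example), axis=0)

def ewc_penalty(theta, theta_old, fisher_diag, lam=1.0):
    """EWC regularizer: (λ/2) Σ_i F_i (θ_i - θ_old,i)²."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_old) ** 2)

# Toy illustration: parameter 0 has high Fisher info (important for the old task),
# parameter 1 has low Fisher info (safe to change).
grads = np.array([[2.0, 0.1], [-1.5, 0.05], [1.8, -0.1]])   # hypothetical per-example gradients
F = diagonal_fisher(grads)
theta_old = np.array([1.0, 1.0])
theta_new = np.array([1.5, 1.5])                            # both parameters moved by 0.5
print("Fisher diagonal:", np.round(F, 3))
print("EWC penalty:", round(ewc_penalty(theta_new, theta_old, F, lam=10.0), 3))
```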
3. Information Geometry:
The space of probability distributions can be viewed as a Riemannian manifold with the FIM as the metric tensor. This gives learning a geometric interpretation: the KL divergence between nearby distributions is approximately ½Δθᵀ F Δθ, and natural gradient descent is steepest descent with respect to this metric.
4. Neural Network Compression:
Fisher Information helps identify parameters whose values can be changed with minimal impact on predictions—candidates for pruning or quantization.
5. Uncertainty Quantification:
The inverse FIM provides approximate posterior uncertainty (Laplace approximation): $$P(\theta | D) \approx N(\hat{\theta}_{MLE}, F^{-1})$$
Useful for Bayesian neural networks and uncertainty estimation.
```python
import numpy as np
import matplotlib.pyplot as plt

def logistic_fisher_information():
    """
    Compute and visualize Fisher Information for logistic regression.
    """
    # Simple logistic regression: P(y=1|x) = σ(w·x)
    # For a single weight w and standard normal x:
    #   I(w) = E[x² · σ(wx)(1-σ(wx))]

    def sigmoid(z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fisher_info_logistic(w, n_samples=10000):
        """Monte Carlo estimate of the Fisher Information at weight w."""
        x = np.random.randn(n_samples)  # Standard normal features
        p = sigmoid(w * x)
        # Fisher info = E[x² · p(1-p)]
        return np.mean(x**2 * p * (1 - p))

    # Compute Fisher Information across weight values
    weights = np.linspace(-5, 5, 100)
    fisher_values = [fisher_info_logistic(w) for w in weights]

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: Fisher Information vs weight magnitude
    ax1 = axes[0]
    ax1.plot(weights, fisher_values, 'b-', lw=2)
    ax1.set_xlabel('Weight w')
    ax1.set_ylabel('Fisher Information I(w)')
    ax1.set_title('Fisher Information in Logistic Regression')
    ax1.grid(True, alpha=0.3)

    # Annotate the interpretation
    ax1.annotate('Max info near w=0\n(uncertain predictions)',
                 xy=(0, max(fisher_values)), xytext=(1.5, max(fisher_values) * 0.9),
                 arrowprops=dict(arrowstyle='->', color='red'), fontsize=10)
    ax1.annotate('Low info at |w|→∞\n(confident predictions)',
                 xy=(4, fisher_values[-10]), xytext=(2, 0.05),
                 arrowprops=dict(arrowstyle='->', color='red'), fontsize=10)

    # Right: what Fisher Information tells us about estimation uncertainty
    ax2 = axes[1]

    # Simulate a dataset and examine the negative log-likelihood surface
    np.random.seed(42)
    true_w = 2.0
    n_samples = 100
    x_data = np.random.randn(n_samples)
    y_data = (np.random.rand(n_samples) < sigmoid(true_w * x_data)).astype(float)

    def negative_log_likelihood(w):
        p = sigmoid(w * x_data)
        p = np.clip(p, 1e-10, 1 - 1e-10)
        return -np.mean(y_data * np.log(p) + (1 - y_data) * np.log(1 - p))

    # Evaluate the NLL over a grid of candidate weights
    weights_est = np.linspace(0.5, 3.5, 100)
    nll_values = [negative_log_likelihood(w) for w in weights_est]

    ax2.plot(weights_est, nll_values, 'b-', lw=2, label='NLL')
    ax2.axvline(x=true_w, color='red', linestyle='--', label=f'True w = {true_w}')

    # MLE (grid minimizer of the NLL)
    mle_w = weights_est[np.argmin(nll_values)]
    ax2.axvline(x=mle_w, color='green', linestyle=':', label=f'MLE ≈ {mle_w:.2f}')

    # Approximate CI using Fisher Information
    fisher_at_mle = fisher_info_logistic(mle_w)
    std_error = 1 / np.sqrt(n_samples * fisher_at_mle)
    ax2.axvspan(mle_w - 1.96 * std_error, mle_w + 1.96 * std_error,
                alpha=0.2, color='green', label='95% CI (via FIM)')

    ax2.set_xlabel('Weight w')
    ax2.set_ylabel('Negative Log-Likelihood')
    ax2.set_title(f'MLE and Uncertainty (n={n_samples})')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    print(f"MLE: {mle_w:.4f}")
    print(f"Fisher Information at MLE: {fisher_at_mle:.4f}")
    print(f"Asymptotic SE: {std_error:.4f}")
    print(f"95% CI: [{mle_w - 1.96*std_error:.4f}, {mle_w + 1.96*std_error:.4f}]")

logistic_fisher_information()
```

Can we ever do better than the Cramér-Rao bound? The answer is subtle.
Hodges' Superefficient Estimator:
Consider estimating μ from N(μ, 1). Define:
$$\hat{\mu}_H = \begin{cases} \bar{X} & \text{if } |\bar{X}| > n^{-1/4} \\ 0 & \text{otherwise} \end{cases}$$
This estimator coincides with X̄ with probability tending to 1 whenever μ ≠ 0, so it inherits the asymptotic variance 1/n there; but at μ = 0 it equals 0 with probability tending to 1, so its asymptotic variance at that point is 0, below the CRLB of 1/n.
This seems to violate our bound. What's happening?
The resolution:
The CRLB applies to estimators that are unbiased at all θ values. Hodges' estimator is biased in finite samples for μ near 0, and its "superefficiency" holds only at the single point μ = 0.
The Convolution Theorem (Hájek-Le Cam) shows that any regular estimator's limiting distribution is N(0, I(θ)⁻¹) convolved with extra noise, so no regular estimator can beat the MLE asymptotically; Le Cam further showed that superefficiency can occur only on a set of parameter values of Lebesgue measure zero, and it is paid for with poor risk in shrinking neighborhoods of those points (as happens for Hodges' estimator near μ = 0).
Practical implications:
Superefficient estimators are curiosities, not tools. You can't exploit them without knowing the true parameter in advance. MLE remains the practical choice.
The CRLB also fails when the regularity conditions fail. For Uniform(0, θ), the MLE X₍ₙ₎ (the maximum order statistic) has variance O(1/n²), beating the naive θ²/n bound. This isn't superefficiency; the standard bound simply doesn't apply, because the support of the likelihood depends on θ and differentiation under the integral sign breaks down.
Understanding efficiency guides practical statistical work:
1. Sample Size Planning:
Using the CRLB, we can determine required sample sizes:
"I need standard error ≤ 0.01 for estimating proportion p ≈ 0.3"
$$\text{SE} = \sqrt{\frac{p(1-p)}{n}} \leq 0.01$$ $$n \geq \frac{0.3 \times 0.7}{0.01^2} = 2100$$
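The same calculation as a small helper (a sketch; the function name is ours):

```python
import math

def required_n(p_guess: float, target_se: float) -> int:
    """Smallest n with sqrt(p(1-p)/n) <= target_se, from the Bernoulli CRLB."""
    return math.ceil(p_guess * (1 - p_guess) / target_se ** 2)

print(required_n(0.3, 0.01))   # -> 2100
```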
2. Experimental Design:
Fisher Information can guide data collection: in optimal experimental design, you choose design points, doses, or measurement settings to maximize the information about the parameters of interest, subject to cost constraints.
3. Model Selection:
Among models with similar predictive power, prefer those whose MLE achieves near-efficiency. This indicates the model is well-suited to the data structure.
4. Uncertainty Quantification:
For efficient estimators, an approximate 95% confidence interval is:
$$\hat{\theta} \pm 1.96 \times \frac{1}{\sqrt{nI(\hat{\theta})}}$$
The width is determined by Fisher Information—intrinsic to the problem.
5. Comparing Methods:
If method A has 80% efficiency compared to MLE, you need 25% more data to match MLE precision. This quantifies the cost of using simpler methods.
6. Regularization Tradeoffs:
Regularized estimators are biased, so CRLB doesn't directly apply. But the MSE framework applies: regularization typically increases bias, decreases variance, and can improve MSE despite losing efficiency in the classical sense.
Thinking about efficiency forces you to ask: Am I extracting all available information from my data? If using a method with known inefficiency (e.g., median instead of mean for Gaussian), you should have a reason (robustness). Otherwise, you're leaving precision on the table.
We've explored the ultimate limits of estimation precision. Let's consolidate: Fisher Information measures how much each observation tells us about the parameter; the Cramér-Rao Lower Bound turns that into a floor on the variance of unbiased estimators; efficient estimators achieve that floor; the MLE achieves it asymptotically under regularity conditions; and relative efficiency quantifies how much extra data an inefficient method costs.
Module Complete:
You've now mastered Estimation Theory, from the mechanics of MLE and MAP to the theoretical properties that guarantee their quality. You understand how estimators are derived, how bias, variance, and MSE trade off, why consistency and asymptotic normality matter, and how Fisher Information and the Cramér-Rao bound set the limits of attainable precision.
These concepts form the theoretical foundation for nearly all statistical and machine learning methods. Whether fitting a logistic regression or training a billion-parameter language model, you're applying these principles.
You've completed Estimation Theory! You now have deep understanding of how to estimate parameters optimally, what theoretical guarantees we can expect, and how these foundations connect to modern machine learning. This knowledge will inform every model you build and every statistical inference you make.