We've established that good estimators are consistent (they converge to the truth) and have low MSE (squared bias plus variance). But among all consistent, unbiased estimators, is there a best one? Is there a fundamental limit to how precise our estimates can be?
The answer is yes, and it leads to one of the most beautiful results in statistics: the Cramér-Rao Lower Bound. This bound sets a floor on the variance of any unbiased estimator, and estimators that achieve this floor are called efficient.
Efficiency tells us when we're extracting the maximum possible information from our data—when no other unbiased estimator could do better.
By the end of this page, you will understand the Fisher Information and its interpretation, derive and apply the Cramér-Rao Lower Bound, recognize efficient estimators (those achieving the bound), appreciate that MLE is asymptotically efficient, and understand relative efficiency for comparing estimators.
The Fisher Information measures how much information an observation carries about the parameter θ. It's the key quantity that determines the precision limits of estimation.
Definition 1 (Score function):
The score function is the gradient of the log-likelihood:
$$S(\theta) = \frac{\partial}{\partial\theta} \log P(X | \theta)$$
Definition 2 (Fisher Information):
The Fisher Information is the variance of the score:
$$I(\theta) = \text{Var}\left[\frac{\partial}{\partial\theta} \log P(X | \theta)\right] = E\left[\left(\frac{\partial}{\partial\theta} \log P(X | \theta)\right)^2\right]$$
Under regularity conditions, an equivalent form is:
$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \log P(X | \theta)\right]$$
The second form is often easier to compute.
Intuition:
The score function S(θ) measures how "sensitive" the log-likelihood is to changes in θ. High sensitivity means we can distinguish between nearby parameter values—the data is informative.
Geometric interpretation:
Fisher Information is the expected curvature (negative second derivative) of the log-likelihood at the true parameter. Sharp curves mean precise estimation; flat curves mean imprecise estimation.
Additivity of Fisher Information:
For n i.i.d. observations, the total Fisher Information is:
$$I_n(\theta) = n \cdot I_1(\theta)$$
Information adds linearly! This is why more data allows more precise estimation—we accumulate information.
Example: Bernoulli Fisher Information
For X ~ Bernoulli(p):
$$\log P(X | p) = X \log p + (1-X) \log(1-p)$$
$$\frac{\partial}{\partial p} \log P = \frac{X}{p} - \frac{1-X}{1-p}$$
$$\frac{\partial^2}{\partial p^2} \log P = -\frac{X}{p^2} - \frac{1-X}{(1-p)^2}$$
$$I(p) = -E\left[-\frac{X}{p^2} - \frac{1-X}{(1-p)^2}\right] = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p(1-p)}$$
For Bernoulli, I(p) = 1/(p(1-p)) is minimized at p = 0.5 and diverges as p → 0 or p → 1. Equivalently, the CRLB p(1-p)/n is largest at p = 0.5: a near-fair coin is the hardest parameter to pin down in absolute terms. When p is extreme (e.g., 0.01), the log-likelihood is very sensitive to small changes in p: a single success is four times as likely under p = 0.02 as under p = 0.005, so even rare successes sharply discriminate between nearby values. When p = 0.5, the likelihood changes slowly with p, and each observation carries less distinguishing power.
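As a quick sanity check, both forms of the Fisher Information from Definition 2 can be verified by Monte Carlo; this is a minimal sketch (the helper name is ours, not from a library):

```python
import numpy as np

def bernoulli_fisher_check(p, n_samples=200_000, seed=0):
    """Monte Carlo check that Var(score) = -E[d²/dp² log P] = 1/(p(1-p))."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, p, size=n_samples)

    # Score: d/dp log P(x|p) = x/p - (1-x)/(1-p)
    score = x / p - (1 - x) / (1 - p)

    # Second derivative: d²/dp² log P(x|p) = -x/p² - (1-x)/(1-p)²
    second_deriv = -x / p**2 - (1 - x) / (1 - p)**2

    return np.var(score), -np.mean(second_deriv), 1 / (p * (1 - p))

for p in [0.1, 0.3, 0.5]:
    var_score, neg_curvature, exact = bernoulli_fisher_check(p)
    print(f"p={p}: Var(score)={var_score:.3f}, "
          f"-E[d2 logP]={neg_curvature:.3f}, 1/(p(1-p))={exact:.3f}")
```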
The Cramér-Rao Lower Bound (CRLB) is one of the most important results in estimation theory. It sets a fundamental limit on estimator precision.
Theorem (Cramér-Rao Lower Bound):
For any unbiased estimator θ̂ of θ:
$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$
For n i.i.d. observations:
$$\text{Var}(\hat{\theta}) \geq \frac{1}{nI_1(\theta)}$$
Interpretation:
No unbiased estimator can have variance lower than 1/(nI). The Fisher Information sets a speed limit on estimation precision. This is a fundamental lower bound—not a constraint we choose, but one that's mathematically inescapable.
Proof sketch:
The proof uses the Cauchy-Schwarz inequality. Define the score S(θ) = ∂/∂θ log P(X|θ).
Key facts: under the regularity conditions, E[S(θ)] = 0, so Var(S) = I(θ); and for any unbiased estimator θ̂, Cov(θ̂, S) = 1 (both derived just below the proof).
By Cauchy-Schwarz: $$1 = |\text{Cov}(\hat{\theta}, S)|^2 \leq \text{Var}(\hat{\theta}) \cdot \text{Var}(S) = \text{Var}(\hat{\theta}) \cdot I(\theta)$$
Rearranging: Var(θ̂) ≥ 1/I(θ). ∎
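The two key facts follow by differentiating under the integral sign, which is exactly what the regularity conditions permit (the second line uses E[S] = 0 and unbiasedness, E[θ̂] = θ):

$$E[S(\theta)] = \int \frac{\partial_\theta P(x \mid \theta)}{P(x \mid \theta)}\, P(x \mid \theta)\, dx = \frac{\partial}{\partial\theta}\int P(x \mid \theta)\, dx = \frac{\partial}{\partial\theta} 1 = 0$$

$$\text{Cov}(\hat{\theta}, S) = E[\hat{\theta}\, S] = \int \hat{\theta}(x)\, \partial_\theta P(x \mid \theta)\, dx = \frac{\partial}{\partial\theta} E[\hat{\theta}] = \frac{\partial}{\partial\theta}\, \theta = 1$$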
| Distribution | Parameter | Fisher Information I₁(θ) (per observation) | CRLB: Var(θ̂) ≥ |
|---|---|---|---|
| Bernoulli(p) | p | 1/(p(1-p)) | p(1-p)/n |
| N(μ, σ²) [known σ²] | μ | 1/σ² | σ²/n |
| N(μ, σ²) [known μ] | σ² | 1/(2σ⁴) | 2σ⁴/n |
| Exponential(λ) | λ | 1/λ² | λ²/n |
| Poisson(λ) | λ | 1/λ | λ/n |
| Uniform(0, θ) | θ | 1/θ² (naive; regularity fails) | θ²/n (not a valid bound; see note below) |
The CRLB requires regularity conditions (smoothness of the likelihood, interchangeability of differentiation and integration). For some distributions (e.g., Uniform(0,θ)), these fail, and better-than-CRLB estimators exist! The order statistic X₍ₙ₎ for Uniform(0,θ) has variance O(1/n²), beating the naive CRLB.
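A short simulation makes this concrete (a sketch; the helper name is ours): the unbiased estimator (n+1)X₍ₙ₎/n has exact variance θ²/(n(n+2)), far below the naive θ²/n bound even for moderate n.

```python
import numpy as np

def uniform_max_vs_naive_crlb(theta=2.0, n_sims=10_000, seed=1):
    """Compare Var of the bias-corrected MLE (n+1)/n * X_(n) with the naive CRLB θ²/n."""
    rng = np.random.default_rng(seed)
    print(f"{'n':>6} {'empirical Var':>14} {'theta^2/(n(n+2))':>18} {'naive theta^2/n':>16}")
    for n in [10, 50, 250, 1000]:
        samples = rng.uniform(0, theta, size=(n_sims, n))
        est = (n + 1) / n * samples.max(axis=1)      # unbiased estimator of θ
        exact_var = theta**2 / (n * (n + 2))         # exact variance of this estimator
        naive_crlb = theta**2 / n                    # bound that would apply if regularity held
        print(f"{n:>6} {est.var():>14.6f} {exact_var:>18.6f} {naive_crlb:>16.6f}")

uniform_max_vs_naive_crlb()
```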
An estimator is efficient if it achieves the Cramér-Rao Lower Bound with equality:
$$\text{Var}(\hat{\theta}) = \frac{1}{nI(\theta)}$$
When does efficiency occur?
The CRLB is achieved when the Cauchy-Schwarz inequality becomes an equality, which happens iff θ̂ and S(θ) are perfectly correlated—specifically, when:
$$\hat{\theta} - \theta = \frac{S(\theta)}{I(\theta)}$$
This is equivalent to the score being a linear function of the estimator.
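For instance, with n Bernoulli(p) observations the condition holds exactly:

$$S_n(p) = \sum_{i=1}^n \left(\frac{X_i}{p} - \frac{1 - X_i}{1 - p}\right) = \frac{\sum_i X_i - np}{p(1-p)} = \frac{n}{p(1-p)}\,(\hat{p} - p) = I_n(p)\,(\hat{p} - p)$$

so p̂ - p = Sₙ(p)/Iₙ(p), which is precisely the attainment condition; this is why the sample proportion is efficient.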
The Exponential Family:
Efficient estimators exist precisely for the exponential family of distributions:
$$P(x | \theta) = h(x) \exp(\eta(\theta) T(x) - A(\theta))$$
Examples: Bernoulli, Gaussian, Exponential, Poisson, Gamma, Beta.
For these families, when the target of estimation is the mean of the sufficient statistic T(X) (for example, p for Bernoulli, μ for a Gaussian with known variance, λ for Poisson), the MLE is the sample average of T(X) and achieves the CRLB exactly.
Example: Sample Mean for Gaussian
For X₁, ..., Xₙ ~ N(μ, σ²) with known σ²:
The Fisher Information for μ is I(μ) = 1/σ², so the CRLB is σ²/n. The sample mean X̄ has Var(X̄) = σ²/n, so it achieves the bound exactly—it's efficient!
Example: MLE for Bernoulli
For X₁, ..., Xₙ ~ Bernoulli(p):
Here I(p) = 1/(p(1-p)), so the CRLB is p(1-p)/n. The sample proportion p̂ = (1/n)ΣXᵢ has Var(p̂) = p(1-p)/n, so it achieves the bound exactly—the sample proportion is efficient!
```python
import numpy as np
import matplotlib.pyplot as plt

def verify_efficiency():
    """
    Empirically verify that MLE achieves the Cramér-Rao bound.
    """
    np.random.seed(42)
    n_simulations = 10000
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Example 1: Bernoulli MLE vs CRLB
    ax1 = axes[0, 0]
    true_p = 0.3
    sample_sizes = np.arange(10, 501, 20)
    crlb_values = []
    empirical_vars = []
    for n in sample_sizes:
        # Cramér-Rao Lower Bound
        crlb = true_p * (1 - true_p) / n
        crlb_values.append(crlb)
        # Empirical variance of MLE
        mles = [np.mean(np.random.binomial(1, true_p, size=n))
                for _ in range(n_simulations)]
        empirical_vars.append(np.var(mles))
    ax1.plot(sample_sizes, crlb_values, 'r-', lw=2, label='CRLB: p(1-p)/n')
    ax1.plot(sample_sizes, empirical_vars, 'b.-', alpha=0.7, label='Empirical Var(MLE)')
    ax1.set_xlabel('Sample Size n')
    ax1.set_ylabel('Variance')
    ax1.set_title(f'Bernoulli MLE Achieves CRLB (p={true_p})')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_yscale('log')

    # Example 2: Gaussian mean MLE vs CRLB
    ax2 = axes[0, 1]
    true_mu = 5.0
    true_sigma2 = 4.0
    crlb_gauss = []
    empirical_vars_gauss = []
    for n in sample_sizes:
        crlb = true_sigma2 / n
        crlb_gauss.append(crlb)
        mles = [np.mean(np.random.normal(true_mu, np.sqrt(true_sigma2), n))
                for _ in range(n_simulations)]
        empirical_vars_gauss.append(np.var(mles))
    ax2.plot(sample_sizes, crlb_gauss, 'r-', lw=2, label='CRLB: σ²/n')
    ax2.plot(sample_sizes, empirical_vars_gauss, 'g.-', alpha=0.7, label='Empirical Var(MLE)')
    ax2.set_xlabel('Sample Size n')
    ax2.set_ylabel('Variance')
    ax2.set_title(f'Gaussian Mean MLE Achieves CRLB (σ²={true_sigma2})')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_yscale('log')

    # Example 3: Exponential rate MLE vs CRLB
    ax3 = axes[1, 0]
    true_lambda = 0.5
    crlb_exp = []
    empirical_vars_exp = []
    for n in sample_sizes:
        # For exponential rate λ, I(λ) = 1/λ²
        crlb = true_lambda**2 / n
        crlb_exp.append(crlb)
        mles = [1 / np.mean(np.random.exponential(1/true_lambda, n))
                for _ in range(n_simulations)]
        empirical_vars_exp.append(np.var(mles))
    ax3.plot(sample_sizes, crlb_exp, 'r-', lw=2, label='CRLB: λ²/n')
    ax3.plot(sample_sizes, empirical_vars_exp, 'm.-', alpha=0.7, label='Empirical Var(MLE)')
    ax3.set_xlabel('Sample Size n')
    ax3.set_ylabel('Variance')
    ax3.set_title(f'Exponential Rate MLE Achieves CRLB (λ={true_lambda})')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    ax3.set_yscale('log')

    # Example 4: Efficiency ratio
    ax4 = axes[1, 1]
    efficiency_bern = [crlb / emp for crlb, emp in zip(crlb_values, empirical_vars)]
    efficiency_gauss = [crlb / emp for crlb, emp in zip(crlb_gauss, empirical_vars_gauss)]
    efficiency_exp = [crlb / emp for crlb, emp in zip(crlb_exp, empirical_vars_exp)]
    ax4.plot(sample_sizes, efficiency_bern, 'b.-', label='Bernoulli')
    ax4.plot(sample_sizes, efficiency_gauss, 'g.-', label='Gaussian Mean')
    ax4.plot(sample_sizes, efficiency_exp, 'm.-', label='Exponential Rate')
    ax4.axhline(y=1.0, color='red', linestyle='--', lw=2, label='Perfect efficiency')
    ax4.set_xlabel('Sample Size n')
    ax4.set_ylabel('Efficiency Ratio (CRLB / Empirical Var)')
    ax4.set_title('Efficiency Ratio ≈ 1 Confirms Achievability')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    ax4.set_ylim([0.9, 1.1])

    plt.tight_layout()
    plt.show()

    print("Efficiency Summary:")
    print("=" * 50)
    print(f"Bernoulli MLE efficiency: {np.mean(efficiency_bern):.4f}")
    print(f"Gaussian Mean MLE efficiency: {np.mean(efficiency_gauss):.4f}")
    print(f"Exponential Rate MLE efficiency: {np.mean(efficiency_exp):.4f}")
    print("(Values close to 1.0 indicate CRLB is achieved)")

verify_efficiency()
```

One of the most important results in estimation theory:
Theorem (Asymptotic Efficiency of MLE):
Under regularity conditions, the MLE is asymptotically efficient:
$$\sqrt{n}(\hat{\theta}_{MLE} - \theta^*) \xrightarrow{d} N(0, I(\theta^*)^{-1})$$
The asymptotic variance of √n·(θ̂ - θ*) is exactly 1/I(θ*), the Cramér-Rao bound.
What this means:
MLE is the best you can do asymptotically (among regular estimators).
Why MLE achieves asymptotic efficiency:
Recall MLE maximizes:
$$\frac{1}{n}\sum_{i=1}^n \log P(X_i | \theta)$$
The first-order condition is:
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log P(X_i | \theta) = 0$$
This is a sample average of the score—by CLT, it's asymptotically normal. A Taylor expansion around θ* connects the MLE to this average, yielding the result.
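In symbols, writing ℓᵢ(θ) = log P(Xᵢ | θ) and Taylor-expanding the first-order condition around θ* (a sketch under the usual regularity assumptions):

$$0 = \frac{1}{n}\sum_{i=1}^n \ell_i'(\hat{\theta}) \approx \frac{1}{n}\sum_{i=1}^n \ell_i'(\theta^*) + (\hat{\theta} - \theta^*)\,\frac{1}{n}\sum_{i=1}^n \ell_i''(\theta^*)$$

$$\Rightarrow\quad \sqrt{n}\,(\hat{\theta} - \theta^*) \approx \left[-\frac{1}{n}\sum_{i=1}^n \ell_i''(\theta^*)\right]^{-1} \cdot \frac{1}{\sqrt{n}}\sum_{i=1}^n \ell_i'(\theta^*)$$

The bracketed term converges to I(θ*) by the law of large numbers (the second form of the Fisher Information), and the scaled score sum converges in distribution to N(0, I(θ*)) by the CLT, giving the limiting distribution N(0, I(θ*)⁻¹).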
The remarkable universality:
Regardless of the specific model (Bernoulli, Gaussian, or a complex neural network), if you use MLE (equivalently, minimize the negative log-likelihood) and the model is correctly specified, you get the most precise estimates possible for large samples.
MLE's asymptotic efficiency is why it's the go-to method in statistics and ML. When you minimize cross-entropy in classification or MSE in regression (with Gaussian noise), you're doing MLE—and thereby achieving optimal efficiency. This isn't just convenient; it's theoretically optimal.
When an estimator doesn't achieve the CRLB, we can still quantify how close it comes using relative efficiency.
Definition:
The relative efficiency of estimator θ̂₁ compared to θ̂₂ is:
$$\text{RE}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}$$
Absolute efficiency is the ratio of CRLB to actual variance:
$$e(\hat{\theta}) = \frac{1/(nI(\theta))}{\text{Var}(\hat{\theta})} = \frac{\text{CRLB}}{\text{Var}(\hat{\theta})}$$
Example: Mean vs. Median for Gaussian
For N(μ, σ²), both the sample mean and median are consistent estimators of μ.
Relative efficiency: $$\text{RE}(\bar{X}, \text{median}) = \frac{(\pi/2) \cdot \sigma^2/n}{\sigma^2/n} = \frac{\pi}{2} \approx 1.57$$
The mean is 57% more efficient than the median for Gaussian data.
Interpretation: You need 57% more data using the median to achieve the same precision as the mean.
But for heavy-tailed distributions, the median can be more efficient due to robustness!
| Distribution | RE(Mean, Median) | Winner |
|---|---|---|
| Gaussian | π/2 ≈ 1.57 | Mean is 57% more efficient |
| Uniform | 3.0 | Mean (though both are beaten by the midrange) |
| Laplace (Double Exp) | 0.5 | Median is 2x more efficient |
| Cauchy | 0 | Median wins (mean has infinite variance!) |
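These relative efficiencies are easy to check by simulation; here is a minimal sketch (the distribution parameters and helper name are illustrative):

```python
import numpy as np

def mean_vs_median_efficiency(n=200, n_sims=20_000, seed=2):
    """Estimate RE(mean, median) = Var(median)/Var(mean) for several distributions."""
    rng = np.random.default_rng(seed)
    distributions = {
        "Gaussian": lambda size: rng.normal(0.0, 1.0, size),    # expect RE ≈ π/2 ≈ 1.57
        "Uniform": lambda size: rng.uniform(-1.0, 1.0, size),   # expect RE ≈ 3
        "Laplace": lambda size: rng.laplace(0.0, 1.0, size),    # expect RE ≈ 0.5
    }
    for name, sampler in distributions.items():
        data = sampler((n_sims, n))
        var_mean = data.mean(axis=1).var()
        var_median = np.median(data, axis=1).var()
        print(f"{name:>8}: RE(mean, median) ≈ {var_median / var_mean:.2f}")

mean_vs_median_efficiency()
```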
There's often a tradeoff between efficiency (optimal under the assumed model) and robustness (performance under model misspecification). The sample mean is efficient for Gaussian but terrible for Cauchy. The median sacrifices some efficiency for much better robustness. In practice, slight efficiency losses for improved robustness are often worthwhile.
Fisher Information has modern applications beyond classical statistics:
1. Natural Gradient Descent:
Instead of standard gradient descent: $$\theta \leftarrow \theta - \alpha \nabla_{\theta} L(\theta)$$
Natural gradient uses the Fisher Information Matrix (FIM): $$\theta \leftarrow \theta - \alpha F(\theta)^{-1} \nabla_{\theta} L(\theta)$$
The FIM accounts for the geometry of probability distributions, leading to faster, more stable convergence.
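A minimal, illustrative sketch for logistic regression (the data, model, and helper names are ours, not from any library): the natural-gradient update preconditions the ordinary gradient with the inverse of the empirical FIM.

```python
import numpy as np

def natural_gradient_step(w, X, y, lr=0.1):
    """One plain and one natural gradient step for logistic regression (illustrative)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))                  # predicted probabilities
    grad = X.T @ (p - y) / len(y)                     # gradient of average NLL
    # Empirical Fisher Information Matrix: mean of p(1-p) x xᵀ over the data
    F = (X * (p * (1 - p))[:, None]).T @ X / len(y)
    w_plain = w - lr * grad
    w_natural = w - lr * np.linalg.solve(F + 1e-6 * np.eye(len(w)), grad)
    return w_plain, w_natural

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_w)))

w = np.zeros(3)
for _ in range(50):
    _, w = natural_gradient_step(w, X, y, lr=0.5)     # iterate the natural-gradient update
print("true w:     ", true_w)
print("estimated w:", np.round(w, 3))
```

For logistic regression the FIM evaluated at the current weights equals the Hessian of the NLL (canonical link), so this natural-gradient iteration is a damped Fisher-scoring (Newton-like) update and should land near the data-generating weights.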
2. Elastic Weight Consolidation (EWC):
For continual learning (avoiding catastrophic forgetting), EWC uses Fisher Information to identify which weights are important for previous tasks:
$$L_{EWC} = L_{new} + \frac{\lambda}{2}\sum_i F_i (\theta_i - \theta_{old,i})^2$$
Weights with high Fisher Information (important for past tasks) are penalized more for changing.
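A compact sketch of the EWC pieces under the common diagonal-Fisher approximation (the numbers and helper names are purely illustrative; real implementations compute per-example gradients of the log-likelihood with respect to network weights):

```python
import numpy as np

def diagonal_fisher(grads_per_example):
    """Diagonal Fisher approximation: mean of squared per-example log-likelihood gradients."""
    return np.mean(np.square(grads_per_example), axis=0)

def ewc_penalty(theta, theta_old, fisher_diag, lam=1.0):
    """EWC regularizer: (λ/2) Σ_i F_i (θ_i - θ_old,i)²."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_old) ** 2)

# Toy illustration: parameter 0 has high Fisher info (important for the old task),
# parameter 1 has low Fisher info (safe to change).
grads = np.array([[2.0, 0.1], [-1.5, 0.05], [1.8, -0.1]])   # hypothetical per-example gradients
F = diagonal_fisher(grads)
theta_old = np.array([1.0, 1.0])
theta_new = np.array([1.5, 1.5])                            # both parameters moved by 0.5
print("Fisher diagonal:", np.round(F, 3))
print("EWC penalty:", round(ewc_penalty(theta_new, theta_old, F, lam=10.0), 3))
```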
3. Information Geometry:
The space of probability distributions can be viewed as a Riemannian manifold with the FIM as the metric tensor. This gives learning a geometric interpretation: the KL divergence between nearby distributions is approximately ½Δθᵀ F Δθ, and natural gradient descent is steepest descent with respect to this metric.
4. Neural Network Compression:
Fisher Information helps identify parameters whose values can be changed with minimal impact on predictions—candidates for pruning or quantization.
5. Uncertainty Quantification:
The inverse FIM provides approximate posterior uncertainty (Laplace approximation): $$P(\theta | D) \approx N(\hat{\theta}_{MLE}, F^{-1})$$
Useful for Bayesian neural networks and uncertainty estimation.
```python
import numpy as np
import matplotlib.pyplot as plt

def logistic_fisher_information():
    """
    Compute and visualize Fisher Information for logistic regression.
    """
    # Simple logistic regression: P(y=1|x) = σ(w·x)
    # For a single weight w and standard normal x:
    #   I(w) = E[x² · σ(wx)(1-σ(wx))]

    def sigmoid(z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fisher_info_logistic(w, n_samples=10000):
        """Monte Carlo estimate of the Fisher Information at weight w."""
        x = np.random.randn(n_samples)  # Standard normal features
        p = sigmoid(w * x)
        # Fisher info = E[x² · p(1-p)]
        return np.mean(x**2 * p * (1 - p))

    # Compute Fisher Information across weight values
    weights = np.linspace(-5, 5, 100)
    fisher_values = [fisher_info_logistic(w) for w in weights]

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: Fisher Information vs weight magnitude
    ax1 = axes[0]
    ax1.plot(weights, fisher_values, 'b-', lw=2)
    ax1.set_xlabel('Weight w')
    ax1.set_ylabel('Fisher Information I(w)')
    ax1.set_title('Fisher Information in Logistic Regression')
    ax1.grid(True, alpha=0.3)

    # Annotate the interpretation
    ax1.annotate('Max info near w=0\n(uncertain predictions)',
                 xy=(0, max(fisher_values)), xytext=(1.5, max(fisher_values) * 0.9),
                 arrowprops=dict(arrowstyle='->', color='red'), fontsize=10)
    ax1.annotate('Low info at |w|→∞\n(confident predictions)',
                 xy=(4, fisher_values[-10]), xytext=(2, 0.05),
                 arrowprops=dict(arrowstyle='->', color='red'), fontsize=10)

    # Right: what Fisher Information tells us about estimation uncertainty
    ax2 = axes[1]

    # Simulate a dataset and examine the negative log-likelihood surface
    np.random.seed(42)
    true_w = 2.0
    n_samples = 100
    x_data = np.random.randn(n_samples)
    y_data = (np.random.rand(n_samples) < sigmoid(true_w * x_data)).astype(float)

    def negative_log_likelihood(w):
        p = sigmoid(w * x_data)
        p = np.clip(p, 1e-10, 1 - 1e-10)
        return -np.mean(y_data * np.log(p) + (1 - y_data) * np.log(1 - p))

    # Evaluate the NLL over a grid of candidate weights
    weights_est = np.linspace(0.5, 3.5, 100)
    nll_values = [negative_log_likelihood(w) for w in weights_est]

    ax2.plot(weights_est, nll_values, 'b-', lw=2, label='NLL')
    ax2.axvline(x=true_w, color='red', linestyle='--', label=f'True w = {true_w}')

    # MLE (grid minimizer of the NLL)
    mle_w = weights_est[np.argmin(nll_values)]
    ax2.axvline(x=mle_w, color='green', linestyle=':', label=f'MLE ≈ {mle_w:.2f}')

    # Approximate CI using Fisher Information
    fisher_at_mle = fisher_info_logistic(mle_w)
    std_error = 1 / np.sqrt(n_samples * fisher_at_mle)
    ax2.axvspan(mle_w - 1.96 * std_error, mle_w + 1.96 * std_error,
                alpha=0.2, color='green', label='95% CI (via FIM)')

    ax2.set_xlabel('Weight w')
    ax2.set_ylabel('Negative Log-Likelihood')
    ax2.set_title(f'MLE and Uncertainty (n={n_samples})')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    print(f"MLE: {mle_w:.4f}")
    print(f"Fisher Information at MLE: {fisher_at_mle:.4f}")
    print(f"Asymptotic SE: {std_error:.4f}")
    print(f"95% CI: [{mle_w - 1.96*std_error:.4f}, {mle_w + 1.96*std_error:.4f}]")

logistic_fisher_information()
```

Can we ever do better than the Cramér-Rao bound? The answer is subtle.
Hodges' Superefficient Estimator:
Consider estimating μ from N(μ, 1). Define:
$$\hat{\mu}_H = \begin{cases} \bar{X} & \text{if } |\bar{X}| > n^{-1/4} \\ 0 & \text{otherwise} \end{cases}$$
This estimator coincides with X̄ with probability tending to 1 whenever μ ≠ 0, so it inherits the asymptotic variance 1/n there; but at μ = 0 it equals 0 with probability tending to 1, so its asymptotic variance at that point is 0, below the CRLB of 1/n.
This seems to violate our bound. What's happening?
The resolution:
The CRLB applies to estimators that are unbiased at all θ values. Hodges' estimator is biased in finite samples for μ near 0, and its "superefficiency" holds only at the single point μ = 0.
The Convolution Theorem (Hájek-Le Cam) shows that any regular estimator's limiting distribution is N(0, I(θ)⁻¹) convolved with extra noise, so no regular estimator can beat the MLE asymptotically; Le Cam further showed that superefficiency can occur only on a set of parameter values of Lebesgue measure zero, and it is paid for with poor risk in shrinking neighborhoods of those points (as happens for Hodges' estimator near μ = 0).
Practical implications:
Superefficient estimators are curiosities, not tools. You can't exploit them without knowing the true parameter in advance. MLE remains the practical choice.
The CRLB also fails when the regularity conditions fail. For Uniform(0, θ), the MLE X₍ₙ₎ (the maximum order statistic) has variance O(1/n²), beating the naive θ²/n bound. This isn't superefficiency; the standard bound simply doesn't apply, because the support of the likelihood depends on θ and differentiation under the integral sign breaks down.
Understanding efficiency guides practical statistical work:
1. Sample Size Planning:
Using the CRLB, we can determine required sample sizes:
"I need standard error ≤ 0.01 for estimating proportion p ≈ 0.3"
$$\text{SE} = \sqrt{\frac{p(1-p)}{n}} \leq 0.01$$ $$n \geq \frac{0.3 \times 0.7}{0.01^2} = 2100$$
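The same calculation as a small helper (a sketch; the function name is ours):

```python
import math

def required_n(p_guess: float, target_se: float) -> int:
    """Smallest n with sqrt(p(1-p)/n) <= target_se, from the Bernoulli CRLB."""
    return math.ceil(p_guess * (1 - p_guess) / target_se ** 2)

print(required_n(0.3, 0.01))   # -> 2100
```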
2. Experimental Design:
Fisher Information can guide data collection: in optimal experimental design, you choose design points, doses, or measurement settings to maximize the information about the parameters of interest, subject to cost constraints.
3. Model Selection:
Among models with similar predictive power, prefer those whose MLE achieves near-efficiency. This indicates the model is well-suited to the data structure.
4. Uncertainty Quantification:
For efficient estimators, an approximate 95% confidence interval is:
$$\hat{\theta} \pm 1.96 \times \frac{1}{\sqrt{nI(\hat{\theta})}}$$
The width is determined by Fisher Information—intrinsic to the problem.
5. Comparing Methods:
If method A has 80% efficiency compared to MLE, you need 25% more data to match MLE precision. This quantifies the cost of using simpler methods.
6. Regularization Tradeoffs:
Regularized estimators are biased, so CRLB doesn't directly apply. But the MSE framework applies: regularization typically increases bias, decreases variance, and can improve MSE despite losing efficiency in the classical sense.
Thinking about efficiency forces you to ask: Am I extracting all available information from my data? If using a method with known inefficiency (e.g., median instead of mean for Gaussian), you should have a reason (robustness). Otherwise, you're leaving precision on the table.
We've explored the ultimate limits of estimation precision. Let's consolidate: Fisher Information measures how much each observation tells us about the parameter; the Cramér-Rao Lower Bound turns that into a floor on the variance of unbiased estimators; efficient estimators achieve that floor; the MLE achieves it asymptotically under regularity conditions; and relative efficiency quantifies how much extra data an inefficient method costs.
Module Complete:
You've now mastered Estimation Theory, from the mechanics of MLE and MAP to the theoretical properties that guarantee their quality. You understand how estimators are derived, how bias, variance, and MSE trade off, why consistency and asymptotic normality matter, and how Fisher Information and the Cramér-Rao bound set the limits of attainable precision.
These concepts form the theoretical foundation for nearly all statistical and machine learning methods. Whether fitting a logistic regression or training a billion-parameter language model, you're applying these principles.
You've completed Estimation Theory! You now have deep understanding of how to estimate parameters optimally, what theoretical guarantees we can expect, and how these foundations connect to modern machine learning. This knowledge will inform every model you build and every statistical inference you make.