Imagine you flip a coin just 3 times and observe 3 heads. What's your estimate for P(heads)? MLE says: p̂ = 1.0—the coin always lands heads. But does this seem reasonable? Surely you'd hesitate to bet your life savings that the next flip will be heads.
The problem is clear: MLE with limited data can produce extreme, implausible estimates. What we need is a way to incorporate our prior beliefs—like "most coins are roughly fair"—while still learning from data.
Maximum A Posteriori (MAP) estimation provides exactly this framework. It blends prior knowledge with observed evidence, producing estimates that are more robust when data is scarce and converge to MLE when data is plentiful.
By the end of this page, you will master Bayes' theorem for parameter estimation, understand how prior distributions encode beliefs, derive MAP estimates for Bernoulli and Gaussian models, see the profound connection between MAP and regularization, and understand conjugate priors and their computational advantages.
The fundamental philosophical difference between frequentist and Bayesian approaches lies in how we treat parameters:
Frequentist view (MLE): parameters are fixed but unknown constants; the data are random, and estimation means finding the single value of θ that best explains what we observed.
Bayesian view (MAP and full Bayesian): parameters are themselves random variables with probability distributions; we start from a prior belief about θ and update it as data arrives.
Bayes' Theorem for Parameter Estimation:
The machinery for updating beliefs is Bayes' theorem:
$$P(\theta | \mathbf{D}) = \frac{P(\mathbf{D} | \theta) \cdot P(\theta)}{P(\mathbf{D})}$$
Let's name each term:
| Term | Name | Meaning |
|---|---|---|
| P(θ \| D) | Posterior | Our updated belief about θ after seeing data |
| P(D \| θ) | Likelihood | How probable the data is given θ (same as MLE) |
| P(θ) | Prior | Our initial belief about θ before data |
| P(D) | Evidence/Marginal | Total probability of data (normalizing constant) |
The MAP Principle:
Rather than working with the full posterior distribution (which can be intractable), MAP finds the mode—the most probable parameter value:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta | \mathbf{D})$$
Since P(D) doesn't depend on θ, we can ignore it for optimization:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\mathbf{D} | \theta) \cdot P(\theta) = \arg\max_{\theta} [\text{Likelihood} \times \text{Prior}]$$
Taking logs:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} [\log P(\mathbf{D} | \theta) + \log P(\theta)]$$
This is MLE plus a penalty term from the prior!
MAP = MLE + Prior Regularization. The prior acts as a penalty that pulls the estimate toward values we believed likely before seeing data. With infinite data, the likelihood dominates and MAP → MLE. With little data, the prior dominates and prevents extreme estimates.
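To see this objective in action before the formal derivations below, here is a minimal sketch (illustrative only; the grid search and the Beta(5, 5) prior are choices made here, not part of the derivation that follows) that maximizes log-likelihood plus log-prior for the 3-heads coin:

```python
import numpy as np
from scipy import stats

# 3 flips, 3 heads -- the example from the introduction
k, n = 3, 3
p = np.linspace(0.001, 0.999, 999)  # grid over the parameter

log_lik = k * np.log(p) + (n - k) * np.log(1 - p)  # Bernoulli log-likelihood
log_prior = stats.beta.logpdf(p, 5, 5)             # log of a Beta(5, 5) prior

print(f"MLE: {p[np.argmax(log_lik)]:.3f}")              # ~1.0 (the grid's edge)
print(f"MAP: {p[np.argmax(log_lik + log_prior)]:.3f}")  # ~0.636, pulled toward 0.5
```

The prior term penalizes extreme values of p, so the maximizer moves off the boundary toward the prior's center at 0.5.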
The prior P(θ) encodes our beliefs before observing data. Choosing it is both an art and a science.
Types of Priors:
Informative Priors: Encode genuine domain knowledge (e.g., "most coins are roughly fair")
Weakly Informative Priors: Provide gentle regularization without strong commitments
Non-informative (Flat) Priors: P(θ) = constant, in which case MAP reduces to MLE
Conjugate Priors: Mathematically convenient; the posterior stays in the same family as the prior (covered in detail below)
| Prior Distribution | Parameters | Use Case | Properties |
|---|---|---|---|
| Beta(α, β) | α, β > 0 | Probability parameters (Bernoulli) | Conjugate to Bernoulli/Binomial |
| Normal(μ₀, σ₀²) | Mean μ₀, variance σ₀² | Real-valued parameters | Conjugate to Gaussian (known var) |
| Gamma(α, β) | Shape α, rate β | Positive parameters (rates, precisions) | Conjugate to Poisson, Exponential |
| Inverse-Gamma(α, β) | Shape α, scale β | Variance parameters | Conjugate to Gaussian (unknown var) |
| Dirichlet(α) | Concentration vector α | Probability vectors (multinomial) | Conjugate to Categorical/Multinomial |
| Laplace(0, b) | Location 0, scale b | Sparse parameters | Induces L1 regularization |
The Beta Distribution: Perfect for Probabilities
The Beta distribution is defined on [0, 1], making it ideal for probability parameters:
$$P(p | \alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)}$$
where B(α, β) is the Beta function (normalizing constant).
Interpreting α and β:
Think of α - 1 as "phantom" prior heads and β - 1 as phantom prior tails; larger values encode a stronger pull toward the prior mean α/(α+β).
A useful heuristic: the prior's effective sample size is α + β - 2 for the Beta distribution. If you set α = β = 10, you're expressing prior beliefs equivalent to having already observed 18 coin flips (9 heads, 9 tails). Your actual data needs to compete with these 'phantom observations'.
Let's derive the MAP estimate for a coin flip probability, using a Beta prior—the canonical example of Bayesian parameter estimation.
Problem Setup:
We observe n independent coin flips containing k heads, so the likelihood is P(D | p) ∝ p^k(1-p)^(n-k). We place a Beta(α, β) prior on p.
Step 1: Write the posterior (up to proportionality)
$$P(p | k, n) \propto P(k, n | p) \cdot P(p)$$
$$\propto p^k(1-p)^{n-k} \cdot p^{\alpha-1}(1-p)^{\beta-1}$$
$$= p^{k+\alpha-1}(1-p)^{n-k+\beta-1}$$
This is a Beta(k + α, n - k + β) distribution! The Beta prior is conjugate to the Bernoulli likelihood.
Step 2: Find the mode of the posterior
The mode of Beta(a, b) is (a-1)/(a+b-2) when a, b > 1:
$$\hat{p}_{MAP} = \frac{(k + \alpha - 1)}{(k + \alpha - 1) + (n - k + \beta - 1)} = \frac{k + \alpha - 1}{n + \alpha + \beta - 2}$$
Interpretation:
Compare to MLE: p̂_MLE = k/n
MAP adds "pseudo-observations" to both numerator and denominator: α - 1 phantom heads on top and α + β - 2 phantom flips on the bottom.
This is exactly Laplace smoothing with α = β = 2, giving:
$$\hat{p}_{Laplace} = \frac{k + 1}{n + 2}$$
Numerical Example:
3 flips, 3 heads (k=3, n=3):
| Method | Prior | Estimate |
|---|---|---|
| MLE | None | 3/3 = 1.00 |
| MAP | Beta(2, 2) | (3+1)/(3+2) = 0.80 |
| MAP | Beta(5, 5) | (3+4)/(3+8) ≈ 0.64 |
| MAP | Beta(10, 10) | (3+9)/(3+18) ≈ 0.57 |
100 flips, 70 heads (k=70, n=100):
| Method | Prior | Estimate |
|---|---|---|
| MLE | None | 70/100 = 0.70 |
| MAP | Beta(2, 2) | (70+1)/(100+2) ≈ 0.70 |
| MAP | Beta(10, 10) | (70+9)/(100+18) ≈ 0.67 |
Key insight: With more data, the prior's influence diminishes. MAP and MLE converge as n → ∞.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def bernoulli_mle(k, n):
    """MLE for Bernoulli parameter."""
    return k / n if n > 0 else 0.5

def bernoulli_map(k, n, alpha, beta):
    """
    MAP estimate for Bernoulli parameter with Beta prior.

    Parameters:
        k: number of successes (heads)
        n: number of trials
        alpha, beta: Beta prior parameters

    Returns:
        MAP estimate for p
    """
    # Mode of Beta(k + alpha, n - k + beta)
    a = k + alpha
    b = n - k + beta
    if a > 1 and b > 1:
        return (a - 1) / (a + b - 2)
    elif a <= 1:
        return 0  # Mode at 0
    else:
        return 1  # Mode at 1

# Compare MLE vs MAP for different data sizes
np.random.seed(42)
true_p = 0.7

# Simulate increasingly larger datasets
sample_sizes = [3, 10, 30, 100, 300, 1000]
prior_params = [(2, 2), (5, 5), (10, 10)]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Estimates vs sample size
ax1 = axes[0, 0]
for alpha, beta in prior_params:
    map_estimates = []
    for n in sample_sizes:
        data = np.random.binomial(1, true_p, size=n)
        k = np.sum(data)
        map_estimates.append(bernoulli_map(k, n, alpha, beta))
    ax1.plot(sample_sizes, map_estimates, 'o-', label=f'MAP Beta({alpha},{beta})')

mle_estimates = []
for n in sample_sizes:
    data = np.random.binomial(1, true_p, size=n)
    k = np.sum(data)
    mle_estimates.append(bernoulli_mle(k, n))
ax1.plot(sample_sizes, mle_estimates, 's--', label='MLE', color='black')
ax1.axhline(y=true_p, color='red', linestyle=':', label=f'True p={true_p}')
ax1.set_xlabel('Sample Size n')
ax1.set_ylabel('Estimate')
ax1.set_title('MLE vs MAP: Convergence with Sample Size')
ax1.legend()
ax1.set_xscale('log')

# Plot 2: Posterior distributions for n=3, k=3
ax2 = axes[0, 1]
p_range = np.linspace(0, 1, 200)
k, n = 3, 3

# Prior (Beta(5, 5))
ax2.plot(p_range, stats.beta.pdf(p_range, 5, 5), 'b-', lw=2, label='Prior: Beta(5,5)')
# Likelihood (unnormalized)
likelihood = p_range**k * (1-p_range)**(n-k)
likelihood = likelihood / np.max(likelihood) * 2  # Scale for visibility
ax2.plot(p_range, likelihood, 'g--', lw=2, label='Likelihood (scaled)')
# Posterior: Beta(k+5, n-k+5)
ax2.plot(p_range, stats.beta.pdf(p_range, k+5, n-k+5), 'r-', lw=2, label='Posterior: Beta(8,5)')
ax2.axvline(x=bernoulli_map(k, n, 5, 5), color='r', linestyle=':',
            label=f'MAP = {bernoulli_map(k, n, 5, 5):.3f}')
ax2.axvline(x=1.0, color='g', linestyle=':', label=f'MLE = {bernoulli_mle(k, n):.1f}')
ax2.set_xlabel('Parameter p')
ax2.set_ylabel('Density')
ax2.set_title(f'Bayesian Update: n={n}, k={k}')
ax2.legend()

# Plot 3: Posterior distributions for n=100, k=70
ax3 = axes[1, 0]
k, n = 70, 100

ax3.plot(p_range, stats.beta.pdf(p_range, 5, 5), 'b-', lw=2, label='Prior: Beta(5,5)')
likelihood = p_range**k * (1-p_range)**(n-k)
likelihood = likelihood / np.max(likelihood) * 15  # Scale for visibility
ax3.plot(p_range, likelihood, 'g--', lw=2, label='Likelihood (scaled)')
ax3.plot(p_range, stats.beta.pdf(p_range, k+5, n-k+5), 'r-', lw=2,
         label=f'Posterior: Beta({k+5},{n-k+5})')
map_val = bernoulli_map(k, n, 5, 5)
mle_val = bernoulli_mle(k, n)
ax3.axvline(x=map_val, color='r', linestyle=':', label=f'MAP = {map_val:.3f}')
ax3.axvline(x=mle_val, color='g', linestyle=':', label=f'MLE = {mle_val:.2f}')
ax3.set_xlabel('Parameter p')
ax3.set_ylabel('Density')
ax3.set_title(f'Bayesian Update: n={n}, k={k} (Data dominates)')
ax3.legend()
ax3.set_xlim([0.5, 0.9])

# Plot 4: Effect of prior strength
ax4 = axes[1, 1]
k, n = 7, 10  # 70% heads observed
prior_strengths = [(1.01, 1.01), (2, 2), (5, 5), (10, 10), (50, 50)]

estimates = []
labels = []
for alpha, beta in prior_strengths:
    map_est = bernoulli_map(k, n, alpha, beta)
    estimates.append(map_est)
    labels.append(f'Beta({alpha},{beta})')

ax4.barh(labels, estimates, color='steelblue', alpha=0.7)
ax4.axvline(x=bernoulli_mle(k, n), color='red', linestyle='--',
            label=f'MLE = {bernoulli_mle(k, n):.1f}')
ax4.axvline(x=0.5, color='gray', linestyle=':', label='Prior mean')
ax4.set_xlabel('MAP Estimate')
ax4.set_title(f'Effect of Prior Strength (n={n}, k={k})')
ax4.legend()
ax4.set_xlim([0.4, 0.8])

plt.tight_layout()
plt.show()
```

Let's derive MAP for the Gaussian mean with a Gaussian prior—another instance of conjugacy.
Problem Setup:
We observe x₁, …, xₙ drawn from N(μ, σ²) with the variance σ² known, and place a Gaussian prior μ ~ N(μ₀, σ₀²) on the unknown mean.
Step 1: Write the posterior
Log-posterior (up to constants):
$$\log P(\mu | \mathbf{x}) = \log P(\mathbf{x} | \mu) + \log P(\mu) + C$$
$$= -\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2 - \frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 + C$$
Step 2: Differentiate and solve
$$\frac{d}{d\mu}\log P(\mu | \mathbf{x}) = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \mu) - \frac{1}{\sigma_0^2}(\mu - \mu_0) = 0$$
$$\frac{n\bar{x}}{\sigma^2} - \frac{n\mu}{\sigma^2} = \frac{\mu}{\sigma_0^2} - \frac{\mu_0}{\sigma_0^2}$$
Rearranging:
$$\mu\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right) = \frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}$$
Result:
$$\boxed{\hat{\mu}_{MAP} = \frac{\frac{n}{\sigma^2}\bar{x} + \frac{1}{\sigma_0^2}\mu_0}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}}}$$
Interpretation as Precision-Weighted Average:
Define precision (inverse variance) as τ = 1/σ²:
$$\hat{\mu}_{MAP} = \frac{\tau_{data} \cdot \bar{x} + \tau_{prior} \cdot \mu_0}{\tau_{data} + \tau_{prior}}$$
where τ_data = n/σ² and τ_prior = 1/σ₀².
The MAP estimate is a precision-weighted average of the data mean and prior mean!
Key Properties:
- As n → ∞, the data precision τ_data = n/σ² dominates and μ̂_MAP → x̄, the MLE.
- As σ₀² → ∞ (an uninformative prior), τ_prior → 0 and MAP again reduces to MLE.
- The posterior variance is 1/(τ_data + τ_prior), smaller than both the prior variance and the variance of the sample mean: combining sources of information reduces uncertainty.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def gaussian_map(data, sigma2, mu0, sigma0_2):
    """
    MAP estimate for Gaussian mean with Gaussian prior.

    Parameters:
        data: observed data points
        sigma2: known data variance σ²
        mu0: prior mean
        sigma0_2: prior variance σ₀²

    Returns:
        MAP estimate for μ
    """
    n = len(data)
    x_bar = np.mean(data)
    # Precision formulation
    tau_data = n / sigma2
    tau_prior = 1 / sigma0_2
    return (tau_data * x_bar + tau_prior * mu0) / (tau_data + tau_prior)

def posterior_variance(n, sigma2, sigma0_2):
    """Variance of the posterior distribution."""
    tau_data = n / sigma2
    tau_prior = 1 / sigma0_2
    return 1 / (tau_data + tau_prior)

# Setup
np.random.seed(42)
true_mu = 5.0
sigma2 = 1.0     # Known data variance
mu0 = 3.0        # Prior mean (intentionally different from true)
sigma0_2 = 2.0   # Prior variance

# Generate small dataset
n = 5
data = np.random.normal(true_mu, np.sqrt(sigma2), size=n)
x_bar = np.mean(data)

# Compute estimates
mle = x_bar
map_estimate = gaussian_map(data, sigma2, mu0, sigma0_2)

print(f"True μ: {true_mu}")
print(f"Prior mean μ₀: {mu0}")
print(f"Sample mean (MLE): {mle:.4f}")
print(f"MAP estimate: {map_estimate:.4f}")
print(f"Posterior variance: {posterior_variance(n, sigma2, sigma0_2):.4f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Prior, Likelihood, Posterior
mu_range = np.linspace(0, 8, 200)

# Prior: N(mu0, sigma0_2)
prior = stats.norm.pdf(mu_range, mu0, np.sqrt(sigma0_2))

# Likelihood of observing this x_bar given μ
# p(x_bar | μ) ∝ exp(-n(x_bar - μ)²/(2σ²))
likelihood = stats.norm.pdf(mu_range, x_bar, np.sqrt(sigma2/n))

# Posterior
post_var = posterior_variance(n, sigma2, sigma0_2)
posterior = stats.norm.pdf(mu_range, map_estimate, np.sqrt(post_var))

axes[0].plot(mu_range, prior, 'b-', lw=2, label=f'Prior N({mu0}, {sigma0_2})')
axes[0].plot(mu_range, likelihood, 'g--', lw=2, label=f'Likelihood (centered at x̄={x_bar:.2f})')
axes[0].plot(mu_range, posterior, 'r-', lw=2, label=f'Posterior N({map_estimate:.2f}, {post_var:.2f})')
axes[0].axvline(x=mle, color='g', linestyle=':', alpha=0.7, label=f'MLE = {mle:.2f}')
axes[0].axvline(x=map_estimate, color='r', linestyle=':', alpha=0.7, label=f'MAP = {map_estimate:.2f}')
axes[0].axvline(x=true_mu, color='k', linestyle='--', alpha=0.5, label=f'True μ = {true_mu}')
axes[0].set_xlabel('Parameter μ')
axes[0].set_ylabel('Density')
axes[0].set_title(f'Bayesian Update (n={n} observations)')
axes[0].legend(loc='upper left')

# Right: MAP vs MLE as function of sample size
sample_sizes = np.arange(1, 101)
map_estimates = []
mle_estimates = []

for n in sample_sizes:
    data = np.random.normal(true_mu, np.sqrt(sigma2), size=n)
    mle_estimates.append(np.mean(data))
    map_estimates.append(gaussian_map(data, sigma2, mu0, sigma0_2))

axes[1].plot(sample_sizes, mle_estimates, 'g-', alpha=0.7, label='MLE (sample mean)')
axes[1].plot(sample_sizes, map_estimates, 'r-', alpha=0.7, label='MAP')
axes[1].axhline(y=true_mu, color='k', linestyle='--', label=f'True μ = {true_mu}')
axes[1].axhline(y=mu0, color='b', linestyle=':', label=f'Prior mean = {mu0}')
axes[1].fill_between(sample_sizes, true_mu - 0.2, true_mu + 0.2, alpha=0.2, color='gray')
axes[1].set_xlabel('Sample Size n')
axes[1].set_ylabel('Estimate')
axes[1].set_title('MAP Converges to MLE as n Increases')
axes[1].legend()

plt.tight_layout()
plt.show()
```

One of the most beautiful insights in machine learning is that regularization is equivalent to MAP estimation with specific priors. This connection unifies the frequentist and Bayesian perspectives.
The General Framework:
Recall the MAP objective:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} [\log P(\mathbf{D} | \theta) + \log P(\theta)]$$
Or equivalently, minimizing the negative:
$$\hat{\theta}_{MAP} = \arg\min_{\theta} [-\log P(\mathbf{D} | \theta) - \log P(\theta)]$$
$$= \arg\min_{\theta} [\text{Negative Log-Likelihood} + \text{Negative Log-Prior}]$$
$$= \arg\min_{\theta} [\text{Loss Function} + \text{Regularization}]$$
L2 Regularization (Ridge) = Gaussian Prior:
If θ ~ N(0, σ₀²I), then:
$$-\log P(\theta) = \frac{1}{2\sigma_0^2}||\theta||_2^2 + C = \lambda ||\theta||_2^2 + C$$
where λ = 1/(2σ₀²). Larger λ means smaller prior variance (stronger belief that θ is near zero).
Ridge regression is MAP estimation under a Gaussian prior on weights!
The L2 penalty ||w||² shrinks weights toward zero, just as the Gaussian prior N(0, σ₀²) concentrates probability around zero.
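As a quick numerical check of this equivalence, the sketch below (with made-up data; the noise variance σ² and prior variance σ₀² are assumed known) minimizes the negative log-posterior directly and compares the result to the ridge closed form with λ = σ²/σ₀²:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
sigma2, sigma0_2 = 1.0, 0.25  # assumed noise variance and prior variance
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=50)

# Negative log-posterior: Gaussian NLL plus Gaussian-prior penalty
def neg_log_posterior(w):
    return (np.sum((y - X @ w) ** 2) / (2 * sigma2)
            + np.sum(w ** 2) / (2 * sigma0_2))

w_map = minimize(neg_log_posterior, np.zeros(3)).x

# Ridge closed form with lambda = sigma2 / sigma0_2
lam = sigma2 / sigma0_2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.allclose(w_map, w_ridge, atol=1e-4))  # True: ridge == Gaussian-prior MAP
```

In this parameterization the effective λ is the ratio of noise variance to prior variance: trusting the prior more (smaller σ₀²) means stronger shrinkage.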
L1 Regularization (Lasso) = Laplace Prior:
If θ ~ Laplace(0, b), then:
$$P(\theta) = \frac{1}{2b}\exp\left(-\frac{|\theta|}{b}\right)$$
$$-\log P(\theta) = \frac{|\theta|}{b} + C = \lambda |\theta| + C$$
where λ = 1/b.
Lasso regression is MAP estimation under a Laplace prior!
The Laplace distribution's sharp peak at zero explains why L1 produces sparse solutions—weights near zero are pushed exactly to zero.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Visualize Gaussian vs Laplace priors
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Left: Prior distributions
theta = np.linspace(-5, 5, 200)

# Gaussian prior (L2 regularization)
sigma0 = 1.0
gaussian_prior = stats.norm.pdf(theta, 0, sigma0)

# Laplace prior (L1 regularization)
b = sigma0 / np.sqrt(2)  # Matched variance
laplace_prior = stats.laplace.pdf(theta, 0, b)

axes[0].plot(theta, gaussian_prior, 'b-', lw=2, label=f'Gaussian N(0, {sigma0}²)')
axes[0].plot(theta, laplace_prior, 'r-', lw=2, label=f'Laplace(0, {b:.2f})')
axes[0].set_xlabel('Parameter θ')
axes[0].set_ylabel('Prior Density')
axes[0].set_title('L2 vs L1 Prior Distributions')
axes[0].legend()
axes[0].set_xlim([-5, 5])

# Middle: -log(prior) = regularization penalty
gaussian_penalty = 0.5 * theta**2 / sigma0**2  # L2
laplace_penalty = np.abs(theta) / b            # L1

axes[1].plot(theta, gaussian_penalty, 'b-', lw=2, label='L2: λ||θ||²')
axes[1].plot(theta, laplace_penalty, 'r-', lw=2, label='L1: λ|θ|')
axes[1].set_xlabel('Parameter θ')
axes[1].set_ylabel('-log P(θ) + const')
axes[1].set_title('Regularization = -log(Prior)')
axes[1].legend()
axes[1].set_xlim([-5, 5])
axes[1].set_ylim([0, 6])

# Right: Effect on parameter estimates
# Simulate shrinkage behavior
np.random.seed(42)
n_params = 20
true_theta = np.random.randn(n_params)   # True parameters
noise = np.random.randn(n_params) * 0.3  # Observation noise
observed = true_theta + noise            # "Data" estimates (MLE)

# L2 shrinkage: pulls estimates toward zero proportionally
lambda_l2 = 0.5
l2_estimates = observed / (1 + lambda_l2)

# L1 shrinkage: soft thresholding
lambda_l1 = 0.3
l1_estimates = np.sign(observed) * np.maximum(np.abs(observed) - lambda_l1, 0)

x_pos = np.arange(n_params)
width = 0.25

axes[2].bar(x_pos - width, observed, width, label='MLE', alpha=0.7)
axes[2].bar(x_pos, l2_estimates, width, label='L2 (Ridge)', alpha=0.7)
axes[2].bar(x_pos + width, l1_estimates, width, label='L1 (Lasso)', alpha=0.7)
axes[2].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[2].set_xlabel('Parameter Index')
axes[2].set_ylabel('Estimate')
axes[2].set_title('Shrinkage: L2 vs L1')
axes[2].legend()

plt.tight_layout()
plt.show()

# Count zeros
print("\nSparsity comparison:")
print(f"  MLE zeros: {np.sum(np.abs(observed) < 1e-6)}")
print(f"  L2 zeros:  {np.sum(np.abs(l2_estimates) < 1e-6)}")
print(f"  L1 zeros:  {np.sum(np.abs(l1_estimates) < 1e-6)}")
```

This connection transforms how we think about regularization. It's not an arbitrary add-on to prevent overfitting—it's a principled expression of prior beliefs. Understanding this lets you derive custom regularizers for domain-specific priors, interpret regularization hyperparameters (they control prior strength!), and connect machine learning to Bayesian statistics seamlessly.
A conjugate prior is a prior distribution that, when combined with a specific likelihood, produces a posterior in the same family as the prior. This enables closed-form posterior updates (no intractable integrals), exact evaluation of the evidence P(D), and efficient sequential updating as new data arrives.
Why conjugacy matters computationally:
Without conjugacy: $$P(\theta | \mathbf{D}) = \frac{P(\mathbf{D}|\theta)P(\theta)}{\int P(\mathbf{D}|\theta')P(\theta')d\theta'}$$
The denominator integral is often intractable, requiring Monte Carlo methods or variational approximations.
With conjugacy, this integral has a known closed form!
| Likelihood | Conjugate Prior | Posterior | Prior Parameters as 'Pseudo-data' |
|---|---|---|---|
| Bernoulli(p) | Beta(α, β) | Beta(α+k, β+n-k) | α-1 prior heads, β-1 prior tails |
| Binomial(n, p) | Beta(α, β) | Beta(α+k, β+n-k) | Same interpretation |
| Poisson(λ) | Gamma(α, β) | Gamma(α+Σx, β+n) | α-1 prior events, β prior intervals |
| N(μ, σ²) [known σ²] | N(μ₀, σ₀²) | N(μ_post, σ²_post) | Precision-weighted combination |
| N(μ, σ²) [known μ] | Inv-Gamma(α, β) | Inv-Gamma(α+n/2, β+SS/2) | α prior observations |
| Multinomial(θ) | Dirichlet(α) | Dirichlet(α + counts) | αᵢ-1 prior counts for category i |
| Exponential(λ) | Gamma(α, β) | Gamma(α+n, β+Σx) | α-1 prior events observed |
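The table's closed forms are easy to verify numerically. Here is a sketch (arbitrary α, β, and simulated counts) checking the Poisson-Gamma row: multiplying the Poisson likelihood by the Gamma(α, β) prior and renormalizing on a grid should reproduce Gamma(α + Σx, β + n):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 3.0, 2.0                # Gamma prior (shape, rate), chosen arbitrarily
data = rng.poisson(lam=4.0, size=20)  # simulated Poisson counts

lam_grid = np.linspace(0.01, 10, 2000)
dlam = lam_grid[1] - lam_grid[0]

# Unnormalized log-posterior: Poisson log-likelihood + Gamma log-prior
log_post = (data.sum() * np.log(lam_grid) - len(data) * lam_grid
            + (alpha - 1) * np.log(lam_grid) - beta * lam_grid)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dlam             # normalize numerically on the grid

# Closed-form conjugate posterior (scipy's gamma uses scale = 1/rate)
conj = stats.gamma.pdf(lam_grid, a=alpha + data.sum(), scale=1 / (beta + len(data)))

print(np.max(np.abs(post - conj)))    # tiny: only grid-integration error remains
```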
Sequential Updating with Conjugacy:
One beautiful property is that the posterior from one batch of data becomes the prior for the next:
$$P(\theta | D_1, D_2) \propto P(D_2|\theta) \cdot P(\theta | D_1)$$
For Beta-Bernoulli: start with Beta(α, β); after observing k₁ heads in n₁ flips, the posterior Beta(α + k₁, β + n₁ - k₁) becomes the prior for the next batch, and repeating this batch by batch gives the same answer as processing all the data at once.
This enables true online learning—updating beliefs incrementally without storing all past data!
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def sequential_beta_update(prior_alpha, prior_beta, data_stream):
    """
    Perform sequential Bayesian updates for Bernoulli likelihood.

    Parameters:
        prior_alpha, prior_beta: Initial Beta prior parameters
        data_stream: List of observation batches (each batch is array of 0s and 1s)

    Returns:
        List of (alpha, beta, map_estimate) after each batch
    """
    alpha, beta = prior_alpha, prior_beta
    history = [(alpha, beta,
                (alpha - 1) / (alpha + beta - 2) if alpha > 1 and beta > 1 else 0.5)]

    for batch in data_stream:
        k = np.sum(batch)
        n = len(batch)
        alpha += k
        beta += (n - k)
        map_est = (alpha - 1) / (alpha + beta - 2) if alpha > 1 and beta > 1 else 0.5
        history.append((alpha, beta, map_est))

    return history

# Simulate online learning scenario
np.random.seed(42)
true_p = 0.7

# Initial weak prior (uniform-ish)
prior_alpha, prior_beta = 2, 2

# Data arrives in batches
batch_sizes = [5, 10, 20, 50, 100, 200]
data_stream = [np.random.binomial(1, true_p, size=n) for n in batch_sizes]

# Perform sequential updates
history = sequential_beta_update(prior_alpha, prior_beta, data_stream)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: MAP estimate evolution
ax1 = axes[0, 0]
total_samples = np.cumsum([0] + batch_sizes)
map_estimates = [h[2] for h in history]
ax1.step(total_samples, map_estimates, where='post', lw=2, color='blue')
ax1.axhline(y=true_p, color='red', linestyle='--', label=f'True p = {true_p}')
ax1.fill_between(total_samples, true_p - 0.05, true_p + 0.05, alpha=0.2, color='red')
ax1.set_xlabel('Cumulative Samples')
ax1.set_ylabel('MAP Estimate')
ax1.set_title('Online Learning: MAP Estimate Evolution')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Posterior uncertainty shrinks
ax2 = axes[0, 1]
p_range = np.linspace(0, 1, 200)

colors = plt.cm.viridis(np.linspace(0, 1, len(history)))
for i, (alpha, beta, _) in enumerate(history):
    pdf = stats.beta.pdf(p_range, alpha, beta)
    total_n = sum(batch_sizes[:i]) if i > 0 else 0
    ax2.plot(p_range, pdf, color=colors[i], lw=1.5,
             label=f'n={total_n}' if i < 5 else None)

ax2.axvline(x=true_p, color='red', linestyle='--', alpha=0.7)
ax2.set_xlabel('Parameter p')
ax2.set_ylabel('Posterior Density')
ax2.set_title('Posterior Evolution: Uncertainty Decreases')
ax2.legend()

# Plot 3: Posterior variance (uncertainty) over time
ax3 = axes[1, 0]
posterior_vars = []
for alpha, beta, _ in history:
    var = (alpha * beta) / ((alpha + beta)**2 * (alpha + beta + 1))
    posterior_vars.append(var)

ax3.plot(total_samples, posterior_vars, 'o-', color='purple', lw=2)
ax3.set_xlabel('Cumulative Samples')
ax3.set_ylabel('Posterior Variance')
ax3.set_title('Uncertainty Decreases with More Data')
ax3.set_yscale('log')
ax3.grid(True, alpha=0.3)

# Plot 4: 95% credible intervals
ax4 = axes[1, 1]
lower_bounds = []
upper_bounds = []
for alpha, beta, _ in history:
    lower = stats.beta.ppf(0.025, alpha, beta)
    upper = stats.beta.ppf(0.975, alpha, beta)
    lower_bounds.append(lower)
    upper_bounds.append(upper)

ax4.fill_between(total_samples, lower_bounds, upper_bounds, alpha=0.3,
                 color='blue', label='95% Credible Interval')
ax4.plot(total_samples, map_estimates, 'b-', lw=2, label='MAP Estimate')
ax4.axhline(y=true_p, color='red', linestyle='--', label=f'True p = {true_p}')
ax4.set_xlabel('Cumulative Samples')
ax4.set_ylabel('Parameter p')
ax4.set_title('95% Credible Interval Shrinks')
ax4.legend()
ax4.set_ylim([0.4, 1.0])
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print summary
print("Sequential Bayesian Update Summary:")
print("=" * 50)
for i, (alpha, beta, map_est) in enumerate(history):
    total_n = sum(batch_sizes[:i]) if i > 0 else 0
    var = (alpha * beta) / ((alpha + beta)**2 * (alpha + beta + 1))
    print(f"After {total_n:4d} samples: α={alpha:6.1f}, β={beta:6.1f}, "
          f"MAP={map_est:.4f}, Var={var:.6f}")
```

Now that we understand both methods deeply, let's systematically compare them to guide practical decisions.
| Dimension | MLE | MAP |
|---|---|---|
| Philosophy | Parameters are fixed constants | Parameters are random variables |
| Objective | Maximize P(D \| θ) | Maximize P(D \| θ) × P(θ) |
| Prior required? | No | Yes |
| Small data behavior | Prone to extreme estimates | Regularized by prior |
| Large data behavior | Optimal (efficient) | Converges to MLE |
| Uncertainty quantification | Requires separate analysis | Natural via posterior |
| Computation | Often simpler | Requires specifying a prior |
| Overfitting risk | Higher | Lower (regularized) |
| Interpretation | Point estimate | Mode of belief distribution |
Use MLE when: data is plentiful relative to model complexity, no reliable prior knowledge is available, or you want the estimate driven purely by the data.
Use MAP when: data is scarce, you have genuine prior knowledge worth encoding, or you need regularization to guard against extreme estimates.
In modern machine learning, the choice is often made implicitly. Using Ridge (L2) or Lasso (L1) regression? You're doing MAP. Training a neural network with weight decay? That's MAP with a Gaussian prior. The key insight is recognizing these connections, so you can tune regularization with Bayesian intuition: 'How strongly do I believe weights should be small?'
MAP finds a single "best" parameter value—the posterior mode. But the full Bayesian approach maintains the entire posterior distribution, enabling richer inference.
Why go beyond MAP?
MAP discards uncertainty information. Consider two scenarios: a posterior sharply peaked at p = 0.7 (say, after thousands of flips) and a broad, nearly flat posterior whose mode also sits at 0.7 (after just a handful of flips).
Both have the same MAP estimate, but our confidence differs enormously! Full Bayesian inference captures this.
Posterior Predictive Distribution:
Instead of using point estimate θ̂ for predictions, integrate over all possible parameters:
$$P(x_{new} | \mathbf{D}) = \int P(x_{new} | \theta) P(\theta | \mathbf{D}) d\theta$$
This automatically accounts for parameter uncertainty—predictions are hedged proportionally to our confidence in θ.
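For the Beta-Bernoulli model this integral has a closed form: with posterior Beta(a, b), the predictive probability of heads is the posterior mean a/(a+b), not the posterior mode that MAP plugs in. A small sketch, reusing the earlier 3-heads example with a Beta(5, 5) prior:

```python
from scipy import stats

k, n = 3, 3
alpha, beta = 5, 5
a, b = k + alpha, n - k + beta  # posterior: Beta(8, 5)

p_map = (a - 1) / (a + b - 2)   # plug-in prediction from the MAP mode
p_pred = a / (a + b)            # posterior predictive = posterior mean

# Monte Carlo check: average P(heads | p) over posterior samples of p
samples = stats.beta.rvs(a, b, size=200_000, random_state=0)

print(f"MAP plug-in:          {p_map:.4f}")           # 0.6364
print(f"Posterior predictive: {p_pred:.4f}")          # 0.6154
print(f"Monte Carlo estimate: {samples.mean():.4f}")  # ~0.615
```

The predictive probability is slightly more conservative than the plug-in value because it averages over the remaining uncertainty in p.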
Bayesian Model Comparison:
The evidence P(D) (which we ignored in MAP) becomes crucial:
$$P(D) = \int P(D|\theta) P(\theta) d\theta$$
This enables comparison between models with different complexity—Occam's razor emerges automatically!
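As a concrete instance (a sketch; the data are the 70-heads example from earlier, and the uniform Beta(1, 1) prior is an illustrative choice): for a Bernoulli sequence the evidence under a Beta(α, β) prior has the closed form P(D) = B(α + k, β + n - k) / B(α, β), so two models can be compared directly by their evidence:

```python
import numpy as np
from scipy.special import betaln

k, n = 70, 100  # 70 heads in 100 flips

# Model 1: the coin is exactly fair (p = 0.5, no free parameters)
log_ev_fair = n * np.log(0.5)

# Model 2: unknown bias with a uniform Beta(1, 1) prior
# log P(D) = log B(1 + k, 1 + n - k) - log B(1, 1)
log_ev_unknown = betaln(1 + k, 1 + n - k) - betaln(1, 1)

print(f"log evidence, fair coin:    {log_ev_fair:.2f}")
print(f"log evidence, unknown bias: {log_ev_unknown:.2f}")
# A positive log Bayes factor favors the unknown-bias model for this data
print(f"log Bayes factor:           {log_ev_unknown - log_ev_fair:.2f}")
```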
Credible Intervals:
Bayesian posteriors give natural uncertainty intervals. A 95% credible interval [a, b] means: "Given the data and prior, there's 95% probability that θ ∈ [a, b]." This is often what practitioners actually want (unlike frequentist confidence intervals).
Full Bayesian inference shines when uncertainty quantification is critical (medical diagnosis, autonomous vehicles), well-calibrated predictions matter, model comparison is needed, and datasets are small. The cost is computational: MCMC, variational inference, or other approximations are often required.
We've explored the Bayesian extension of MLE and its profound connections to regularization. Let's consolidate the key insights:
- MAP maximizes likelihood × prior; in log space it is MLE plus a penalty term contributed by the prior.
- Conjugate priors (Beta-Bernoulli, Gaussian-Gaussian, and the other pairs tabulated above) yield closed-form posteriors and enable sequential updating.
- L2 (Ridge) regularization is MAP with a Gaussian prior; L1 (Lasso) is MAP with a Laplace prior.
- With little data the prior dominates and prevents extreme estimates; as n → ∞, MAP converges to MLE.
Looking ahead:
We've established how to obtain point estimates via MLE and MAP. But how good are these estimates? What properties should we demand of estimators? The next page explores Bias and Variance of Estimators—the fundamental concepts for evaluating and comparing estimation procedures.
You've mastered MAP estimation—the Bayesian extension of MLE that unifies regularization theory with probabilistic inference. You can now derive MAP estimates with conjugate priors, interpret regularization hyperparameters as prior strengths, choose between MLE and MAP based on problem characteristics, and understand the full spectrum from MLE to MAP to full Bayesian inference.