Imagine you flip a coin just 3 times and observe 3 heads. What's your estimate for P(heads)? MLE says: p̂ = 1.0—the coin always lands heads. But does this seem reasonable? Surely you'd hesitate to bet your life savings that the next flip will be heads.
The problem is clear: MLE with limited data can produce extreme, implausible estimates. What we need is a way to incorporate our prior beliefs—like "most coins are roughly fair"—while still learning from data.
Maximum A Posteriori (MAP) estimation provides exactly this framework. It blends prior knowledge with observed evidence, producing estimates that are more robust when data is scarce and converge to MLE when data is plentiful.
By the end of this page, you will master Bayes' theorem for parameter estimation, understand how prior distributions encode beliefs, derive MAP estimates for Bernoulli and Gaussian models, see the profound connection between MAP and regularization, and understand conjugate priors and their computational advantages.
The fundamental philosophical difference between frequentist and Bayesian approaches lies in how we treat parameters:
Frequentist view (MLE): parameters are fixed but unknown constants; the data are random, and estimation means finding the single value of θ that best explains what we observed.
Bayesian view (MAP and full Bayesian): parameters are themselves random variables with probability distributions; we start from a prior belief about θ and update it as data arrives.
Bayes' Theorem for Parameter Estimation:
The machinery for updating beliefs is Bayes' theorem:
$$P(\theta | \mathbf{D}) = \frac{P(\mathbf{D} | \theta) \cdot P(\theta)}{P(\mathbf{D})}$$
Let's name each term:
| Term | Name | Meaning |
|---|---|---|
| P(θ \| D) | Posterior | Our updated belief about θ after seeing data |
| P(D \| θ) | Likelihood | How probable the data is given θ (same as MLE) |
| P(θ) | Prior | Our initial belief about θ before data |
| P(D) | Evidence/Marginal | Total probability of data (normalizing constant) |
The MAP Principle:
Rather than working with the full posterior distribution (which can be intractable), MAP finds the mode—the most probable parameter value:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta | \mathbf{D})$$
Since P(D) doesn't depend on θ, we can ignore it for optimization:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\mathbf{D} | \theta) \cdot P(\theta) = \arg\max_{\theta} [\text{Likelihood} \times \text{Prior}]$$
Taking logs:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} [\log P(\mathbf{D} | \theta) + \log P(\theta)]$$
This is MLE plus a penalty term from the prior!
MAP = MLE + Prior Regularization. The prior acts as a penalty that pulls the estimate toward values we believed likely before seeing data. With infinite data, the likelihood dominates and MAP → MLE. With little data, the prior dominates and prevents extreme estimates.
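To see this objective in action before the formal derivations below, here is a minimal sketch (illustrative only; the grid search and the Beta(5, 5) prior are choices made here, not part of the derivation that follows) that maximizes log-likelihood plus log-prior for the 3-heads coin:

```python
import numpy as np
from scipy import stats

# 3 flips, 3 heads -- the example from the introduction
k, n = 3, 3
p = np.linspace(0.001, 0.999, 999)  # grid over the parameter

log_lik = k * np.log(p) + (n - k) * np.log(1 - p)  # Bernoulli log-likelihood
log_prior = stats.beta.logpdf(p, 5, 5)             # log of a Beta(5, 5) prior

print(f"MLE: {p[np.argmax(log_lik)]:.3f}")              # ~1.0 (the grid's edge)
print(f"MAP: {p[np.argmax(log_lik + log_prior)]:.3f}")  # ~0.636, pulled toward 0.5
```

The prior term penalizes extreme values of p, so the maximizer moves off the boundary toward the prior's center at 0.5.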
The prior P(θ) encodes our beliefs before observing data. Choosing it is both an art and a science.
Types of Priors:
Informative Priors: Encode genuine domain knowledge (e.g., "most coins are roughly fair")
Weakly Informative Priors: Provide gentle regularization without strong commitments
Non-informative (Flat) Priors: P(θ) = constant, in which case MAP reduces to MLE
Conjugate Priors: Mathematically convenient; the posterior stays in the same family as the prior (covered in detail below)
| Prior Distribution | Parameters | Use Case | Properties |
|---|---|---|---|
| Beta(α, β) | α, β > 0 | Probability parameters (Bernoulli) | Conjugate to Bernoulli/Binomial |
| Normal(μ₀, σ₀²) | Mean μ₀, variance σ₀² | Real-valued parameters | Conjugate to Gaussian (known var) |
| Gamma(α, β) | Shape α, rate β | Positive parameters (rates, precisions) | Conjugate to Poisson, Exponential |
| Inverse-Gamma(α, β) | Shape α, scale β | Variance parameters | Conjugate to Gaussian (unknown var) |
| Dirichlet(α) | Concentration vector α | Probability vectors (multinomial) | Conjugate to Categorical/Multinomial |
| Laplace(0, b) | Location 0, scale b | Sparse parameters | Induces L1 regularization |
The Beta Distribution: Perfect for Probabilities
The Beta distribution is defined on [0, 1], making it ideal for probability parameters:
$$P(p | \alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)}$$
where B(α, β) is the Beta function (normalizing constant).
Interpreting α and β:
Think of α - 1 as "phantom" prior heads and β - 1 as phantom prior tails; larger values encode a stronger pull toward the prior mean α/(α+β).
A useful heuristic: the prior's effective sample size is α + β - 2 for the Beta distribution. If you set α = β = 10, you're expressing prior beliefs equivalent to having already observed 18 coin flips (9 heads, 9 tails). Your actual data needs to compete with these 'phantom observations'.
Let's derive the MAP estimate for a coin flip probability, using a Beta prior—the canonical example of Bayesian parameter estimation.
Problem Setup:
We observe n independent coin flips containing k heads, so the likelihood is P(D | p) ∝ p^k(1-p)^(n-k). We place a Beta(α, β) prior on p.
Step 1: Write the posterior (up to proportionality)
$$P(p | k, n) \propto P(k, n | p) \cdot P(p)$$
$$\propto p^k(1-p)^{n-k} \cdot p^{\alpha-1}(1-p)^{\beta-1}$$
$$= p^{k+\alpha-1}(1-p)^{n-k+\beta-1}$$
This is a Beta(k + α, n - k + β) distribution! The Beta prior is conjugate to the Bernoulli likelihood.
Step 2: Find the mode of the posterior
The mode of Beta(a, b) is (a-1)/(a+b-2) when a, b > 1:
$$\hat{p}_{MAP} = \frac{(k + \alpha - 1)}{(k + \alpha - 1) + (n - k + \beta - 1)} = \frac{k + \alpha - 1}{n + \alpha + \beta - 2}$$
Interpretation:
Compare to MLE: p̂_MLE = k/n
MAP adds "pseudo-observations" to both numerator and denominator: α - 1 phantom heads on top and α + β - 2 phantom flips on the bottom.
This is exactly Laplace smoothing with α = β = 2, giving:
$$\hat{p}_{Laplace} = \frac{k + 1}{n + 2}$$
Numerical Example:
3 flips, 3 heads (k=3, n=3):
| Method | Prior | Estimate |
|---|---|---|
| MLE | None | 3/3 = 1.00 |
| MAP | Beta(2, 2) | (3+1)/(3+2) = 0.80 |
| MAP | Beta(5, 5) | (3+4)/(3+8) ≈ 0.64 |
| MAP | Beta(10, 10) | (3+9)/(3+18) ≈ 0.57 |
100 flips, 70 heads (k=70, n=100):
| Method | Prior | Estimate |
|---|---|---|
| MLE | None | 70/100 = 0.70 |
| MAP | Beta(2, 2) | (70+1)/(100+2) ≈ 0.70 |
| MAP | Beta(10, 10) | (70+9)/(100+18) ≈ 0.67 |
Key insight: With more data, the prior's influence diminishes. MAP and MLE converge as n → ∞.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def bernoulli_mle(k, n):
    """MLE for Bernoulli parameter."""
    return k / n if n > 0 else 0.5

def bernoulli_map(k, n, alpha, beta):
    """
    MAP estimate for Bernoulli parameter with Beta prior.

    Parameters:
        k: number of successes (heads)
        n: number of trials
        alpha, beta: Beta prior parameters

    Returns:
        MAP estimate for p
    """
    # Mode of Beta(k + alpha, n - k + beta)
    a = k + alpha
    b = n - k + beta
    if a > 1 and b > 1:
        return (a - 1) / (a + b - 2)
    elif a <= 1:
        return 0  # Mode at 0
    else:
        return 1  # Mode at 1

# Compare MLE vs MAP for different data sizes
np.random.seed(42)
true_p = 0.7

# Simulate increasingly larger datasets
sample_sizes = [3, 10, 30, 100, 300, 1000]
prior_params = [(2, 2), (5, 5), (10, 10)]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Estimates vs sample size
ax1 = axes[0, 0]
for alpha, beta in prior_params:
    map_estimates = []
    for n in sample_sizes:
        data = np.random.binomial(1, true_p, size=n)
        k = np.sum(data)
        map_estimates.append(bernoulli_map(k, n, alpha, beta))
    ax1.plot(sample_sizes, map_estimates, 'o-', label=f'MAP Beta({alpha},{beta})')

mle_estimates = []
for n in sample_sizes:
    data = np.random.binomial(1, true_p, size=n)
    k = np.sum(data)
    mle_estimates.append(bernoulli_mle(k, n))
ax1.plot(sample_sizes, mle_estimates, 's--', label='MLE', color='black')
ax1.axhline(y=true_p, color='red', linestyle=':', label=f'True p={true_p}')
ax1.set_xlabel('Sample Size n')
ax1.set_ylabel('Estimate')
ax1.set_title('MLE vs MAP: Convergence with Sample Size')
ax1.legend()
ax1.set_xscale('log')

# Plot 2: Posterior distributions for n=3, k=3
ax2 = axes[0, 1]
p_range = np.linspace(0, 1, 200)
k, n = 3, 3

# Prior (Beta(5, 5))
ax2.plot(p_range, stats.beta.pdf(p_range, 5, 5), 'b-', lw=2, label='Prior: Beta(5,5)')
# Likelihood (unnormalized)
likelihood = p_range**k * (1-p_range)**(n-k)
likelihood = likelihood / np.max(likelihood) * 2  # Scale for visibility
ax2.plot(p_range, likelihood, 'g--', lw=2, label='Likelihood (scaled)')
# Posterior: Beta(k+5, n-k+5)
ax2.plot(p_range, stats.beta.pdf(p_range, k+5, n-k+5), 'r-', lw=2, label='Posterior: Beta(8,5)')
ax2.axvline(x=bernoulli_map(k, n, 5, 5), color='r', linestyle=':',
            label=f'MAP = {bernoulli_map(k, n, 5, 5):.3f}')
ax2.axvline(x=1.0, color='g', linestyle=':', label=f'MLE = {bernoulli_mle(k, n):.1f}')
ax2.set_xlabel('Parameter p')
ax2.set_ylabel('Density')
ax2.set_title(f'Bayesian Update: n={n}, k={k}')
ax2.legend()

# Plot 3: Posterior distributions for n=100, k=70
ax3 = axes[1, 0]
k, n = 70, 100

ax3.plot(p_range, stats.beta.pdf(p_range, 5, 5), 'b-', lw=2, label='Prior: Beta(5,5)')
likelihood = p_range**k * (1-p_range)**(n-k)
likelihood = likelihood / np.max(likelihood) * 15  # Scale for visibility
ax3.plot(p_range, likelihood, 'g--', lw=2, label='Likelihood (scaled)')
ax3.plot(p_range, stats.beta.pdf(p_range, k+5, n-k+5), 'r-', lw=2,
         label=f'Posterior: Beta({k+5},{n-k+5})')
map_val = bernoulli_map(k, n, 5, 5)
mle_val = bernoulli_mle(k, n)
ax3.axvline(x=map_val, color='r', linestyle=':', label=f'MAP = {map_val:.3f}')
ax3.axvline(x=mle_val, color='g', linestyle=':', label=f'MLE = {mle_val:.2f}')
ax3.set_xlabel('Parameter p')
ax3.set_ylabel('Density')
ax3.set_title(f'Bayesian Update: n={n}, k={k} (Data dominates)')
ax3.legend()
ax3.set_xlim([0.5, 0.9])

# Plot 4: Effect of prior strength
ax4 = axes[1, 1]
k, n = 7, 10  # 70% heads observed
prior_strengths = [(1.01, 1.01), (2, 2), (5, 5), (10, 10), (50, 50)]

estimates = []
labels = []
for alpha, beta in prior_strengths:
    map_est = bernoulli_map(k, n, alpha, beta)
    estimates.append(map_est)
    labels.append(f'Beta({alpha},{beta})')

ax4.barh(labels, estimates, color='steelblue', alpha=0.7)
ax4.axvline(x=bernoulli_mle(k, n), color='red', linestyle='--',
            label=f'MLE = {bernoulli_mle(k, n):.1f}')
ax4.axvline(x=0.5, color='gray', linestyle=':', label='Prior mean')
ax4.set_xlabel('MAP Estimate')
ax4.set_title(f'Effect of Prior Strength (n={n}, k={k})')
ax4.legend()
ax4.set_xlim([0.4, 0.8])

plt.tight_layout()
plt.show()
```

Let's derive MAP for the Gaussian mean with a Gaussian prior—another instance of conjugacy.
Problem Setup:
We observe x₁, …, xₙ drawn from N(μ, σ²) with the variance σ² known, and place a Gaussian prior μ ~ N(μ₀, σ₀²) on the unknown mean.
Step 1: Write the posterior
Log-posterior (up to constants):
$$\log P(\mu | \mathbf{x}) = \log P(\mathbf{x} | \mu) + \log P(\mu) + C$$
$$= -\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2 - \frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 + C$$
Step 2: Differentiate and solve
$$\frac{d}{d\mu}\log P(\mu | \mathbf{x}) = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \mu) - \frac{1}{\sigma_0^2}(\mu - \mu_0) = 0$$
$$\frac{n\bar{x}}{\sigma^2} - \frac{n\mu}{\sigma^2} = \frac{\mu}{\sigma_0^2} - \frac{\mu_0}{\sigma_0^2}$$
Rearranging:
$$\mu\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right) = \frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}$$
Result:
$$\boxed{\hat{\mu}_{MAP} = \frac{\frac{n}{\sigma^2}\bar{x} + \frac{1}{\sigma_0^2}\mu_0}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}}}$$
Interpretation as Precision-Weighted Average:
Define precision (inverse variance) as τ = 1/σ²:
$$\hat{\mu}_{MAP} = \frac{\tau_{data} \cdot \bar{x} + \tau_{prior} \cdot \mu_0}{\tau_{data} + \tau_{prior}}$$
where τ_data = n/σ² and τ_prior = 1/σ₀².
The MAP estimate is a precision-weighted average of the data mean and prior mean!
Key Properties:
- As n → ∞, the data precision τ_data = n/σ² dominates and μ̂_MAP → x̄, the MLE.
- As σ₀² → ∞ (an uninformative prior), τ_prior → 0 and MAP again reduces to MLE.
- The posterior variance is 1/(τ_data + τ_prior), smaller than both the prior variance and the variance of the sample mean: combining sources of information reduces uncertainty.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def gaussian_map(data, sigma2, mu0, sigma0_2):
    """
    MAP estimate for Gaussian mean with Gaussian prior.

    Parameters:
        data: observed data points
        sigma2: known data variance σ²
        mu0: prior mean
        sigma0_2: prior variance σ₀²

    Returns:
        MAP estimate for μ
    """
    n = len(data)
    x_bar = np.mean(data)
    # Precision formulation
    tau_data = n / sigma2
    tau_prior = 1 / sigma0_2
    return (tau_data * x_bar + tau_prior * mu0) / (tau_data + tau_prior)

def posterior_variance(n, sigma2, sigma0_2):
    """Variance of the posterior distribution."""
    tau_data = n / sigma2
    tau_prior = 1 / sigma0_2
    return 1 / (tau_data + tau_prior)

# Setup
np.random.seed(42)
true_mu = 5.0
sigma2 = 1.0     # Known data variance
mu0 = 3.0        # Prior mean (intentionally different from true)
sigma0_2 = 2.0   # Prior variance

# Generate small dataset
n = 5
data = np.random.normal(true_mu, np.sqrt(sigma2), size=n)
x_bar = np.mean(data)

# Compute estimates
mle = x_bar
map_estimate = gaussian_map(data, sigma2, mu0, sigma0_2)

print(f"True μ: {true_mu}")
print(f"Prior mean μ₀: {mu0}")
print(f"Sample mean (MLE): {mle:.4f}")
print(f"MAP estimate: {map_estimate:.4f}")
print(f"Posterior variance: {posterior_variance(n, sigma2, sigma0_2):.4f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Prior, Likelihood, Posterior
mu_range = np.linspace(0, 8, 200)

# Prior: N(mu0, sigma0_2)
prior = stats.norm.pdf(mu_range, mu0, np.sqrt(sigma0_2))

# Likelihood of observing this x_bar given μ
# p(x_bar | μ) ∝ exp(-n(x_bar - μ)²/(2σ²))
likelihood = stats.norm.pdf(mu_range, x_bar, np.sqrt(sigma2/n))

# Posterior
post_var = posterior_variance(n, sigma2, sigma0_2)
posterior = stats.norm.pdf(mu_range, map_estimate, np.sqrt(post_var))

axes[0].plot(mu_range, prior, 'b-', lw=2, label=f'Prior N({mu0}, {sigma0_2})')
axes[0].plot(mu_range, likelihood, 'g--', lw=2, label=f'Likelihood (centered at x̄={x_bar:.2f})')
axes[0].plot(mu_range, posterior, 'r-', lw=2, label=f'Posterior N({map_estimate:.2f}, {post_var:.2f})')
axes[0].axvline(x=mle, color='g', linestyle=':', alpha=0.7, label=f'MLE = {mle:.2f}')
axes[0].axvline(x=map_estimate, color='r', linestyle=':', alpha=0.7, label=f'MAP = {map_estimate:.2f}')
axes[0].axvline(x=true_mu, color='k', linestyle='--', alpha=0.5, label=f'True μ = {true_mu}')
axes[0].set_xlabel('Parameter μ')
axes[0].set_ylabel('Density')
axes[0].set_title(f'Bayesian Update (n={n} observations)')
axes[0].legend(loc='upper left')

# Right: MAP vs MLE as function of sample size
sample_sizes = np.arange(1, 101)
map_estimates = []
mle_estimates = []

for n in sample_sizes:
    data = np.random.normal(true_mu, np.sqrt(sigma2), size=n)
    mle_estimates.append(np.mean(data))
    map_estimates.append(gaussian_map(data, sigma2, mu0, sigma0_2))

axes[1].plot(sample_sizes, mle_estimates, 'g-', alpha=0.7, label='MLE (sample mean)')
axes[1].plot(sample_sizes, map_estimates, 'r-', alpha=0.7, label='MAP')
axes[1].axhline(y=true_mu, color='k', linestyle='--', label=f'True μ = {true_mu}')
axes[1].axhline(y=mu0, color='b', linestyle=':', label=f'Prior mean = {mu0}')
axes[1].fill_between(sample_sizes, true_mu - 0.2, true_mu + 0.2, alpha=0.2, color='gray')
axes[1].set_xlabel('Sample Size n')
axes[1].set_ylabel('Estimate')
axes[1].set_title('MAP Converges to MLE as n Increases')
axes[1].legend()

plt.tight_layout()
plt.show()
```

One of the most beautiful insights in machine learning is that regularization is equivalent to MAP estimation with specific priors. This connection unifies the frequentist and Bayesian perspectives.
The General Framework:
Recall the MAP objective:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} [\log P(\mathbf{D} | \theta) + \log P(\theta)]$$
Or equivalently, minimizing the negative:
$$\hat{\theta}_{MAP} = \arg\min_{\theta} [-\log P(\mathbf{D} | \theta) - \log P(\theta)]$$
$$= \arg\min_{\theta} [\text{Negative Log-Likelihood} + \text{Negative Log-Prior}]$$
$$= \arg\min_{\theta} [\text{Loss Function} + \text{Regularization}]$$
L2 Regularization (Ridge) = Gaussian Prior:
If θ ~ N(0, σ₀²I), then:
$$-\log P(\theta) = \frac{1}{2\sigma_0^2}||\theta||_2^2 + C = \lambda ||\theta||_2^2 + C$$
where λ = 1/(2σ₀²). Larger λ means smaller prior variance (stronger belief that θ is near zero).
Ridge regression is MAP estimation under a Gaussian prior on weights!
The L2 penalty ||w||² shrinks weights toward zero, just as the Gaussian prior N(0, σ₀²) concentrates probability around zero.
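As a quick numerical check of this equivalence, the sketch below (with made-up data; the noise variance σ² and prior variance σ₀² are assumed known) minimizes the negative log-posterior directly and compares the result to the ridge closed form with λ = σ²/σ₀²:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
sigma2, sigma0_2 = 1.0, 0.25  # assumed noise variance and prior variance
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=50)

# Negative log-posterior: Gaussian NLL plus Gaussian-prior penalty
def neg_log_posterior(w):
    return (np.sum((y - X @ w) ** 2) / (2 * sigma2)
            + np.sum(w ** 2) / (2 * sigma0_2))

w_map = minimize(neg_log_posterior, np.zeros(3)).x

# Ridge closed form with lambda = sigma2 / sigma0_2
lam = sigma2 / sigma0_2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.allclose(w_map, w_ridge, atol=1e-4))  # True: ridge == Gaussian-prior MAP
```

In this parameterization the effective λ is the ratio of noise variance to prior variance: trusting the prior more (smaller σ₀²) means stronger shrinkage.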
L1 Regularization (Lasso) = Laplace Prior:
If θ ~ Laplace(0, b), then:
$$P(\theta) = \frac{1}{2b}\exp\left(-\frac{|\theta|}{b}\right)$$
$$-\log P(\theta) = \frac{|\theta|}{b} + C = \lambda |\theta| + C$$
where λ = 1/b.
Lasso regression is MAP estimation under a Laplace prior!
The Laplace distribution's sharp peak at zero explains why L1 produces sparse solutions—weights near zero are pushed exactly to zero.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Visualize Gaussian vs Laplace priors
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Left: Prior distributions
theta = np.linspace(-5, 5, 200)

# Gaussian prior (L2 regularization)
sigma0 = 1.0
gaussian_prior = stats.norm.pdf(theta, 0, sigma0)

# Laplace prior (L1 regularization)
b = sigma0 / np.sqrt(2)  # Matched variance
laplace_prior = stats.laplace.pdf(theta, 0, b)

axes[0].plot(theta, gaussian_prior, 'b-', lw=2, label=f'Gaussian N(0, {sigma0}²)')
axes[0].plot(theta, laplace_prior, 'r-', lw=2, label=f'Laplace(0, {b:.2f})')
axes[0].set_xlabel('Parameter θ')
axes[0].set_ylabel('Prior Density')
axes[0].set_title('L2 vs L1 Prior Distributions')
axes[0].legend()
axes[0].set_xlim([-5, 5])

# Middle: -log(prior) = regularization penalty
gaussian_penalty = 0.5 * theta**2 / sigma0**2  # L2
laplace_penalty = np.abs(theta) / b            # L1

axes[1].plot(theta, gaussian_penalty, 'b-', lw=2, label='L2: λ||θ||²')
axes[1].plot(theta, laplace_penalty, 'r-', lw=2, label='L1: λ|θ|')
axes[1].set_xlabel('Parameter θ')
axes[1].set_ylabel('-log P(θ) + const')
axes[1].set_title('Regularization = -log(Prior)')
axes[1].legend()
axes[1].set_xlim([-5, 5])
axes[1].set_ylim([0, 6])

# Right: Effect on parameter estimates
# Simulate shrinkage behavior
np.random.seed(42)
n_params = 20
true_theta = np.random.randn(n_params)   # True parameters
noise = np.random.randn(n_params) * 0.3  # Observation noise
observed = true_theta + noise            # "Data" estimates (MLE)

# L2 shrinkage: pulls estimates toward zero proportionally
lambda_l2 = 0.5
l2_estimates = observed / (1 + lambda_l2)

# L1 shrinkage: soft thresholding
lambda_l1 = 0.3
l1_estimates = np.sign(observed) * np.maximum(np.abs(observed) - lambda_l1, 0)

x_pos = np.arange(n_params)
width = 0.25

axes[2].bar(x_pos - width, observed, width, label='MLE', alpha=0.7)
axes[2].bar(x_pos, l2_estimates, width, label='L2 (Ridge)', alpha=0.7)
axes[2].bar(x_pos + width, l1_estimates, width, label='L1 (Lasso)', alpha=0.7)
axes[2].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[2].set_xlabel('Parameter Index')
axes[2].set_ylabel('Estimate')
axes[2].set_title('Shrinkage: L2 vs L1')
axes[2].legend()

plt.tight_layout()
plt.show()

# Count zeros
print("\nSparsity comparison:")
print(f"  MLE zeros: {np.sum(np.abs(observed) < 1e-6)}")
print(f"  L2 zeros:  {np.sum(np.abs(l2_estimates) < 1e-6)}")
print(f"  L1 zeros:  {np.sum(np.abs(l1_estimates) < 1e-6)}")
```

This connection transforms how we think about regularization. It's not an arbitrary add-on to prevent overfitting—it's a principled expression of prior beliefs. Understanding this lets you derive custom regularizers for domain-specific priors, interpret regularization hyperparameters (they control prior strength!), and connect machine learning to Bayesian statistics seamlessly.
A conjugate prior is a prior distribution that, when combined with a specific likelihood, produces a posterior in the same family as the prior. This enables closed-form posterior updates (no intractable integrals), exact evaluation of the evidence P(D), and efficient sequential updating as new data arrives.
Why conjugacy matters computationally:
Without conjugacy: $$P(\theta | \mathbf{D}) = \frac{P(\mathbf{D}|\theta)P(\theta)}{\int P(\mathbf{D}|\theta')P(\theta')d\theta'}$$
The denominator integral is often intractable, requiring Monte Carlo methods or variational approximations.
With conjugacy, this integral has a known closed form!
| Likelihood | Conjugate Prior | Posterior | Prior Parameters as 'Pseudo-data' |
|---|---|---|---|
| Bernoulli(p) | Beta(α, β) | Beta(α+k, β+n-k) | α-1 prior heads, β-1 prior tails |
| Binomial(n, p) | Beta(α, β) | Beta(α+k, β+n-k) | Same interpretation |
| Poisson(λ) | Gamma(α, β) | Gamma(α+Σx, β+n) | α-1 prior events, β prior intervals |
| N(μ, σ²) [known σ²] | N(μ₀, σ₀²) | N(μ_post, σ²_post) | Precision-weighted combination |
| N(μ, σ²) [known μ] | Inv-Gamma(α, β) | Inv-Gamma(α+n/2, β+SS/2) | α prior observations |
| Multinomial(θ) | Dirichlet(α) | Dirichlet(α + counts) | αᵢ-1 prior counts for category i |
| Exponential(λ) | Gamma(α, β) | Gamma(α+n, β+Σx) | α-1 prior events observed |
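The table's closed forms are easy to verify numerically. Here is a sketch (arbitrary α, β, and simulated counts) checking the Poisson-Gamma row: multiplying the Poisson likelihood by the Gamma(α, β) prior and renormalizing on a grid should reproduce Gamma(α + Σx, β + n):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 3.0, 2.0                # Gamma prior (shape, rate), chosen arbitrarily
data = rng.poisson(lam=4.0, size=20)  # simulated Poisson counts

lam_grid = np.linspace(0.01, 10, 2000)
dlam = lam_grid[1] - lam_grid[0]

# Unnormalized log-posterior: Poisson log-likelihood + Gamma log-prior
log_post = (data.sum() * np.log(lam_grid) - len(data) * lam_grid
            + (alpha - 1) * np.log(lam_grid) - beta * lam_grid)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dlam             # normalize numerically on the grid

# Closed-form conjugate posterior (scipy's gamma uses scale = 1/rate)
conj = stats.gamma.pdf(lam_grid, a=alpha + data.sum(), scale=1 / (beta + len(data)))

print(np.max(np.abs(post - conj)))    # tiny: only grid-integration error remains
```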
Sequential Updating with Conjugacy:
One beautiful property is that the posterior from one batch of data becomes the prior for the next:
$$P(\theta | D_1, D_2) \propto P(D_2|\theta) \cdot P(\theta | D_1)$$
For Beta-Bernoulli: start with Beta(α, β); after observing k₁ heads in n₁ flips, the posterior Beta(α + k₁, β + n₁ - k₁) becomes the prior for the next batch, and repeating this batch by batch gives the same answer as processing all the data at once.
This enables true online learning—updating beliefs incrementally without storing all past data!
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def sequential_beta_update(prior_alpha, prior_beta, data_stream):
    """
    Perform sequential Bayesian updates for Bernoulli likelihood.

    Parameters:
        prior_alpha, prior_beta: Initial Beta prior parameters
        data_stream: List of observation batches (each batch is array of 0s and 1s)

    Returns:
        List of (alpha, beta, map_estimate) after each batch
    """
    alpha, beta = prior_alpha, prior_beta
    history = [(alpha, beta,
                (alpha - 1) / (alpha + beta - 2) if alpha > 1 and beta > 1 else 0.5)]

    for batch in data_stream:
        k = np.sum(batch)
        n = len(batch)
        alpha += k
        beta += (n - k)
        map_est = (alpha - 1) / (alpha + beta - 2) if alpha > 1 and beta > 1 else 0.5
        history.append((alpha, beta, map_est))

    return history

# Simulate online learning scenario
np.random.seed(42)
true_p = 0.7

# Initial weak prior (uniform-ish)
prior_alpha, prior_beta = 2, 2

# Data arrives in batches
batch_sizes = [5, 10, 20, 50, 100, 200]
data_stream = [np.random.binomial(1, true_p, size=n) for n in batch_sizes]

# Perform sequential updates
history = sequential_beta_update(prior_alpha, prior_beta, data_stream)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: MAP estimate evolution
ax1 = axes[0, 0]
total_samples = np.cumsum([0] + batch_sizes)
map_estimates = [h[2] for h in history]
ax1.step(total_samples, map_estimates, where='post', lw=2, color='blue')
ax1.axhline(y=true_p, color='red', linestyle='--', label=f'True p = {true_p}')
ax1.fill_between(total_samples, true_p - 0.05, true_p + 0.05, alpha=0.2, color='red')
ax1.set_xlabel('Cumulative Samples')
ax1.set_ylabel('MAP Estimate')
ax1.set_title('Online Learning: MAP Estimate Evolution')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Posterior uncertainty shrinks
ax2 = axes[0, 1]
p_range = np.linspace(0, 1, 200)

colors = plt.cm.viridis(np.linspace(0, 1, len(history)))
for i, (alpha, beta, _) in enumerate(history):
    pdf = stats.beta.pdf(p_range, alpha, beta)
    total_n = sum(batch_sizes[:i]) if i > 0 else 0
    ax2.plot(p_range, pdf, color=colors[i], lw=1.5,
             label=f'n={total_n}' if i < 5 else None)

ax2.axvline(x=true_p, color='red', linestyle='--', alpha=0.7)
ax2.set_xlabel('Parameter p')
ax2.set_ylabel('Posterior Density')
ax2.set_title('Posterior Evolution: Uncertainty Decreases')
ax2.legend()

# Plot 3: Posterior variance (uncertainty) over time
ax3 = axes[1, 0]
posterior_vars = []
for alpha, beta, _ in history:
    var = (alpha * beta) / ((alpha + beta)**2 * (alpha + beta + 1))
    posterior_vars.append(var)

ax3.plot(total_samples, posterior_vars, 'o-', color='purple', lw=2)
ax3.set_xlabel('Cumulative Samples')
ax3.set_ylabel('Posterior Variance')
ax3.set_title('Uncertainty Decreases with More Data')
ax3.set_yscale('log')
ax3.grid(True, alpha=0.3)

# Plot 4: 95% credible intervals
ax4 = axes[1, 1]
lower_bounds = []
upper_bounds = []
for alpha, beta, _ in history:
    lower = stats.beta.ppf(0.025, alpha, beta)
    upper = stats.beta.ppf(0.975, alpha, beta)
    lower_bounds.append(lower)
    upper_bounds.append(upper)

ax4.fill_between(total_samples, lower_bounds, upper_bounds, alpha=0.3,
                 color='blue', label='95% Credible Interval')
ax4.plot(total_samples, map_estimates, 'b-', lw=2, label='MAP Estimate')
ax4.axhline(y=true_p, color='red', linestyle='--', label=f'True p = {true_p}')
ax4.set_xlabel('Cumulative Samples')
ax4.set_ylabel('Parameter p')
ax4.set_title('95% Credible Interval Shrinks')
ax4.legend()
ax4.set_ylim([0.4, 1.0])
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print summary
print("Sequential Bayesian Update Summary:")
print("=" * 50)
for i, (alpha, beta, map_est) in enumerate(history):
    total_n = sum(batch_sizes[:i]) if i > 0 else 0
    var = (alpha * beta) / ((alpha + beta)**2 * (alpha + beta + 1))
    print(f"After {total_n:4d} samples: α={alpha:6.1f}, β={beta:6.1f}, "
          f"MAP={map_est:.4f}, Var={var:.6f}")
```

Now that we understand both methods deeply, let's systematically compare them to guide practical decisions.
| Dimension | MLE | MAP |
|---|---|---|
| Philosophy | Parameters are fixed constants | Parameters are random variables |
| Objective | Maximize P(D \| θ) | Maximize P(D \| θ) × P(θ) |
| Prior required? | No | Yes |
| Small data behavior | Prone to extreme estimates | Regularized by prior |
| Large data behavior | Optimal (efficient) | Converges to MLE |
| Uncertainty quantification | Requires separate analysis | Natural via posterior |
| Computation | Often simpler | Requires specifying a prior |
| Overfitting risk | Higher | Lower (regularized) |
| Interpretation | Point estimate | Mode of belief distribution |
Use MLE when: data is plentiful relative to model complexity, no reliable prior knowledge is available, or you want the estimate driven purely by the data.
Use MAP when: data is scarce, you have genuine prior knowledge worth encoding, or you need regularization to guard against extreme estimates.
In modern machine learning, the choice is often made implicitly. Using Ridge (L2) or Lasso (L1) regression? You're doing MAP. Training a neural network with weight decay? That's MAP with a Gaussian prior. The key insight is recognizing these connections, so you can tune regularization with Bayesian intuition: 'How strongly do I believe weights should be small?'
MAP finds a single "best" parameter value—the posterior mode. But the full Bayesian approach maintains the entire posterior distribution, enabling richer inference.
Why go beyond MAP?
MAP discards uncertainty information. Consider two scenarios: a posterior sharply peaked at p = 0.7 (say, after thousands of flips) and a broad, nearly flat posterior whose mode also sits at 0.7 (after just a handful of flips).
Both have the same MAP estimate, but our confidence differs enormously! Full Bayesian inference captures this.
Posterior Predictive Distribution:
Instead of using point estimate θ̂ for predictions, integrate over all possible parameters:
$$P(x_{new} | \mathbf{D}) = \int P(x_{new} | \theta) P(\theta | \mathbf{D}) d\theta$$
This automatically accounts for parameter uncertainty—predictions are hedged proportionally to our confidence in θ.
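For the Beta-Bernoulli model this integral has a closed form: with posterior Beta(a, b), the predictive probability of heads is the posterior mean a/(a+b), not the posterior mode that MAP plugs in. A small sketch, reusing the earlier 3-heads example with a Beta(5, 5) prior:

```python
from scipy import stats

k, n = 3, 3
alpha, beta = 5, 5
a, b = k + alpha, n - k + beta  # posterior: Beta(8, 5)

p_map = (a - 1) / (a + b - 2)   # plug-in prediction from the MAP mode
p_pred = a / (a + b)            # posterior predictive = posterior mean

# Monte Carlo check: average P(heads | p) over posterior samples of p
samples = stats.beta.rvs(a, b, size=200_000, random_state=0)

print(f"MAP plug-in:          {p_map:.4f}")           # 0.6364
print(f"Posterior predictive: {p_pred:.4f}")          # 0.6154
print(f"Monte Carlo estimate: {samples.mean():.4f}")  # ~0.615
```

The predictive probability is slightly more conservative than the plug-in value because it averages over the remaining uncertainty in p.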
Bayesian Model Comparison:
The evidence P(D) (which we ignored in MAP) becomes crucial:
$$P(D) = \int P(D|\theta) P(\theta) d\theta$$
This enables comparison between models with different complexity—Occam's razor emerges automatically!
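As a concrete instance (a sketch; the data are the 70-heads example from earlier, and the uniform Beta(1, 1) prior is an illustrative choice): for a Bernoulli sequence the evidence under a Beta(α, β) prior has the closed form P(D) = B(α + k, β + n - k) / B(α, β), so two models can be compared directly by their evidence:

```python
import numpy as np
from scipy.special import betaln

k, n = 70, 100  # 70 heads in 100 flips

# Model 1: the coin is exactly fair (p = 0.5, no free parameters)
log_ev_fair = n * np.log(0.5)

# Model 2: unknown bias with a uniform Beta(1, 1) prior
# log P(D) = log B(1 + k, 1 + n - k) - log B(1, 1)
log_ev_unknown = betaln(1 + k, 1 + n - k) - betaln(1, 1)

print(f"log evidence, fair coin:    {log_ev_fair:.2f}")
print(f"log evidence, unknown bias: {log_ev_unknown:.2f}")
# A positive log Bayes factor favors the unknown-bias model for this data
print(f"log Bayes factor:           {log_ev_unknown - log_ev_fair:.2f}")
```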
Credible Intervals:
Bayesian posteriors give natural uncertainty intervals. A 95% credible interval [a, b] means: "Given the data and prior, there's 95% probability that θ ∈ [a, b]." This is often what practitioners actually want (unlike frequentist confidence intervals).
Full Bayesian inference shines when uncertainty quantification is critical (medical diagnosis, autonomous vehicles), well-calibrated predictions matter, model comparison is needed, and datasets are small. The cost is computational: MCMC, variational inference, or other approximations are often required.
We've explored the Bayesian extension of MLE and its profound connections to regularization. Let's consolidate the key insights:
- MAP maximizes likelihood × prior; in log space it is MLE plus a penalty term contributed by the prior.
- Conjugate priors (Beta-Bernoulli, Gaussian-Gaussian, and the other pairs tabulated above) yield closed-form posteriors and enable sequential updating.
- L2 (Ridge) regularization is MAP with a Gaussian prior; L1 (Lasso) is MAP with a Laplace prior.
- With little data the prior dominates and prevents extreme estimates; as n → ∞, MAP converges to MLE.
Looking ahead:
We've established how to obtain point estimates via MLE and MAP. But how good are these estimates? What properties should we demand of estimators? The next page explores Bias and Variance of Estimators—the fundamental concepts for evaluating and comparing estimation procedures.
You've mastered MAP estimation—the Bayesian extension of MLE that unifies regularization theory with probabilistic inference. You can now derive MAP estimates with conjugate priors, interpret regularization hyperparameters as prior strengths, choose between MLE and MAP based on problem characteristics, and understand the full spectrum from MLE to MAP to full Bayesian inference.