If Beta-Binomial is the foundation for binary data, Gaussian-Gaussian conjugacy is its counterpart for continuous measurements. From estimating physical constants to modeling financial returns, from sensor calibration to Bayesian regression—Gaussian conjugacy underlies an enormous fraction of practical Bayesian inference.
The Gaussian (Normal) distribution occupies a privileged position in statistics: it is closed under sums and affine transformations, it arises as the limiting distribution in the central limit theorem, and it is the maximum-entropy distribution for a fixed mean and variance.
Gaussian-Gaussian conjugacy extends this analytical tractability to Bayesian inference, providing closed-form posteriors when we place Gaussian priors on Gaussian-distributed parameters.
By the end of this page, you will master Gaussian-Gaussian conjugacy for the known-variance case, understand the Normal-Inverse-Gamma conjugate for jointly unknown mean and variance, derive posterior distributions and their properties, compute predictive distributions for future observations, and connect Gaussian priors to ridge regression and L2 regularization.
Before developing conjugate analysis, we establish essential Gaussian properties and two parametrizations used in Bayesian inference.
Definition: A random variable $x$ follows a Gaussian (Normal) distribution with mean $\mu$ and variance $\sigma^2$, written $x \sim \mathcal{N}(\mu, \sigma^2)$, if its density is:
$$p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
The Precision Parametrization:
It is often more convenient to work with precision $\tau = 1/\sigma^2$ (inverse variance):
$$p(x | \mu, \tau) = \sqrt{\frac{\tau}{2\pi}} \exp\left( -\frac{\tau(x - \mu)^2}{2} \right)$$
In this parametrization, $\mathcal{N}(\mu, \tau^{-1})$ denotes a Gaussian with mean $\mu$ and precision $\tau$.
Why Precision? Precision arises naturally in conjugate analysis because information combines additively in precision: multiplying prior and likelihood sums the quadratic coefficients in the exponent, so the posterior precision is simply the prior precision plus the data precision, as the update formulas below make explicit.
| Property | Formula | Notes |
|---|---|---|
| Mean | $\mathbb{E}[x] = \mu$ | First moment equals the parameter |
| Variance | $\text{Var}[x] = \sigma^2 = 1/\tau$ | Second central moment |
| Mode | $\text{Mode}[x] = \mu$ | Symmetric: mean = mode = median |
| Sum of independent Gaussians | $\sum_i x_i \sim \mathcal{N}(\sum_i \mu_i, \sum_i \sigma_i^2)$ | Closed under addition (independent $x_i$) |
| Linear transform | $ax + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$ | Affine transformation |
| Sample mean | $\bar{x} \sim \mathcal{N}(\mu, \sigma^2/n)$ | Mean of $n$ i.i.d. samples |
For $n$ i.i.d. Gaussian observations $x_1, \ldots, x_n$ with unknown mean $\mu$ and known variance $\sigma^2$, the sufficient statistic is the sample sum $\sum_i x_i$ (or equivalently, the sample mean $\bar{x} = \frac{1}{n}\sum_i x_i$). When both mean and variance are unknown, the sufficient statistics are ($\sum_i x_i$, $\sum_i x_i^2$) or equivalently ($\bar{x}$, $s^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$).
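As a quick sanity check that $(n, \bar{x})$ really carries all the information about $\mu$, the sketch below (toy numbers, illustrative variable names) compares the closed-form posterior mean, which uses only $n$ and $\bar{x}$, against a brute-force grid posterior built from the full likelihood; the closed-form update used here is derived later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                               # known data variance
data = rng.normal(3.0, np.sqrt(sigma2), size=40)
mu0, var0 = 0.0, 10.0                      # prior N(mu0, var0) on the mean

# Closed form: depends on the data only through n and the sample mean
n, x_bar = len(data), data.mean()
post_prec = 1/var0 + n/sigma2
post_mean = (mu0/var0 + n*x_bar/sigma2) / post_prec

# Brute force: grid posterior built from the full likelihood
grid = np.linspace(-2, 8, 20001)
log_post = (-0.5*((data[:, None] - grid)**2).sum(axis=0)/sigma2
            - 0.5*(grid - mu0)**2/var0)
w = np.exp(log_post - log_post.max())
w /= w.sum()
print(post_mean, (grid * w).sum())         # the two means agree to grid accuracy
```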
The simplest and most instructive Gaussian conjugacy occurs when inferring the mean $\mu$ with known variance $\sigma^2$.
Setup: we observe $x_1, \ldots, x_n \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ i.i.d. with $\sigma^2$ known, and place the conjugate prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.
Conjugate Posterior Theorem: The posterior is Gaussian:
$$\mu | x_1, \ldots, x_n \sim \mathcal{N}(\mu_n, \sigma_n^2)$$
with posterior parameters:
$$\mu_n = \frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}} = \frac{\tau_0 \mu_0 + n\tau \bar{x}}{\tau_0 + n\tau}$$
$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \quad \text{(precisions add!)}$$
where $\tau_0 = 1/\sigma_0^2$ is the prior precision and $\tau = 1/\sigma^2$ is the data precision.
The posterior mean is a precision-weighted average of the prior mean and the sample mean. Information combines additively in precision space: more observations add precision, shrinking uncertainty. This is the Gaussian analogue of the Beta-Binomial's pseudo-count interpretation.
Derivation (Completing the Square):
We derive the posterior by combining prior and likelihood and recognizing the Gaussian kernel.
Step 1: Combine Prior and Likelihood
$$P(\mu | x_{1:n}) \propto P(x_{1:n} | \mu) \cdot P(\mu)$$
$$\propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right) \cdot \exp\left( -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right)$$
Step 2: Expand the Quadratics. Expand both squares; any term not involving $\mu$ is absorbed into the proportionality constant:
$$\propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i^2 - 2x_i\mu + \mu^2) - \frac{\mu^2 - 2\mu\mu_0 + \mu_0^2}{2\sigma_0^2} \right)$$
$$\propto \exp\left( -\frac{n\mu^2 - 2\mu \sum x_i}{2\sigma^2} - \frac{\mu^2 - 2\mu\mu_0}{2\sigma_0^2} \right)$$
Step 3: Collect Terms in $\mu$
$$\propto \exp\left( -\frac{1}{2}\left[ \mu^2 \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) - 2\mu \left( \frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \right] \right)$$
Step 4: Complete the Square. The exponent has the form $-\frac{1}{2}\left[A\mu^2 - 2B\mu\right] = -\frac{A}{2}(\mu - B/A)^2 + \text{const}$,
with $A = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} = \frac{1}{\sigma_n^2}$ and $B = \frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}$.
The posterior is therefore $\mathcal{N}(B/A, 1/A) = \mathcal{N}(\mu_n, \sigma_n^2)$. ∎
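A quick symbolic check of the completing-the-square identity, as a SymPy sketch (the symbols mirror the derivation above):

```python
import sympy as sp

mu, A, B = sp.symbols('mu A B', positive=True)

# -1/2 (A mu^2 - 2 B mu) and -A/2 (mu - B/A)^2 differ only by a constant in mu
lhs = -sp.Rational(1, 2) * (A*mu**2 - 2*B*mu)
rhs = -A/2 * (mu - B/A)**2
print(sp.simplify(lhs - rhs))   # B**2/(2*A): free of mu, absorbed into the constant
```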
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def gaussian_posterior_known_variance(
    prior_mean: float,
    prior_variance: float,
    data: np.ndarray,
    data_variance: float
) -> tuple:
    """
    Compute Gaussian posterior for mean with known variance.

    The posterior is Gaussian with:
    - mean = precision-weighted average of prior mean and sample mean
    - precision = prior precision + data precision × n

    This is the fundamental Gaussian conjugate update.
    """
    n = len(data)
    sample_mean = np.mean(data)

    # Work in precision space
    prior_precision = 1.0 / prior_variance
    data_precision = 1.0 / data_variance

    # Posterior precision: precisions add!
    posterior_precision = prior_precision + n * data_precision
    posterior_variance = 1.0 / posterior_precision

    # Posterior mean: precision-weighted average
    posterior_mean = (
        prior_precision * prior_mean + n * data_precision * sample_mean
    ) / posterior_precision

    return posterior_mean, posterior_variance


def visualize_gaussian_update(
    prior_mean: float,
    prior_variance: float,
    data: np.ndarray,
    data_variance: float
):
    """
    Visualize the Bayesian update from prior to posterior.
    """
    post_mean, post_var = gaussian_posterior_known_variance(
        prior_mean, prior_variance, data, data_variance
    )

    # Create x-axis range encompassing both prior and posterior
    x_min = min(prior_mean - 3*np.sqrt(prior_variance),
                post_mean - 3*np.sqrt(post_var))
    x_max = max(prior_mean + 3*np.sqrt(prior_variance),
                post_mean + 3*np.sqrt(post_var))
    x = np.linspace(x_min, x_max, 500)

    prior_pdf = stats.norm.pdf(x, prior_mean, np.sqrt(prior_variance))
    posterior_pdf = stats.norm.pdf(x, post_mean, np.sqrt(post_var))

    # Likelihood (as function of mu, normalized for visualization)
    sample_mean = np.mean(data)
    n = len(data)
    likelihood_var = data_variance / n
    likelihood_pdf = stats.norm.pdf(x, sample_mean, np.sqrt(likelihood_var))

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(x, prior_pdf, 'b--', lw=2,
            label=f'Prior: N({prior_mean:.2f}, {prior_variance:.2f})')
    ax.plot(x, likelihood_pdf, 'g:', lw=2,
            label=f'Likelihood: x̄={sample_mean:.2f}, n={n}')
    ax.plot(x, posterior_pdf, 'r-', lw=2,
            label=f'Posterior: N({post_mean:.2f}, {post_var:.4f})')
    ax.fill_between(x, posterior_pdf, alpha=0.2, color='red')
    ax.axvline(sample_mean, color='green', linestyle='--', alpha=0.5)
    ax.set_xlabel('μ', fontsize=12)
    ax.set_ylabel('Density', fontsize=12)
    ax.set_title('Gaussian-Gaussian Conjugate Update', fontsize=14)
    ax.legend(fontsize=10)
    ax.grid(alpha=0.3)

    return fig, post_mean, post_var


# Example: Sensor Calibration
# Prior: We expect the sensor reads around 100, uncertainty ±10
prior_mean = 100.0
prior_variance = 100.0   # σ₀ = 10

# Data: We take 5 measurements, knowing sensor noise has σ = 5
np.random.seed(42)
true_value = 102.0       # Unknown truth
data = np.random.normal(true_value, 5.0, size=5)
data_variance = 25.0     # σ² = 5² = 25 (known)

fig, post_mean, post_var = visualize_gaussian_update(
    prior_mean, prior_variance, data, data_variance
)

print(f"Prior: N({prior_mean}, {prior_variance})")
print(f"Data: {data}")
print(f"Sample mean: {np.mean(data):.3f}")
print(f"Posterior: N({post_mean:.3f}, {post_var:.4f})")
print(f"Posterior 95% CI: [{post_mean - 1.96*np.sqrt(post_var):.3f}, "
      f"{post_mean + 1.96*np.sqrt(post_var):.3f}]")
print(f"True value: {true_value}")
```

The Gaussian posterior mean reveals deep connections between Bayesian inference, shrinkage estimation, and regularization.
Shrinkage Interpretation:
The posterior mean can be written as:
$$\mu_n = \lambda \mu_0 + (1 - \lambda) \bar{x}$$
where the shrinkage factor is:
$$\lambda = \frac{\tau_0}{\tau_0 + n\tau} = \frac{\sigma^2 / n}{\sigma^2 / n + \sigma_0^2} = \frac{\text{data variance}}{\text{data variance} + \text{prior variance}}$$
The posterior mean shrinks the MLE ($\bar{x}$) toward the prior mean ($\mu_0$). The amount of shrinkage depends on the relative precision of data vs. prior:
The prior acts like $n_0 = \sigma^2 / \sigma_0^2$ additional observations centered at $\mu_0$. This is the prior's 'effective sample size.' If $n \gg n_0$, the data dominate; if $n \ll n_0$, the prior dominates. This interpretation guides prior elicitation: 'How many observations is my prior worth?'
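A tiny numeric illustration of this effective-sample-size reading, using sensor-style numbers (a sketch; the figures are only illustrative):

```python
# Data noise sigma^2 = 25, prior variance sigma_0^2 = 100
sigma2, var0 = 25.0, 100.0
n0 = sigma2 / var0                 # prior is "worth" n0 = 0.25 observations
for n in (1, 5, 25):
    lam = n0 / (n0 + n)            # weight placed on the prior mean
    print(f"n={n:2d}: shrinkage toward the prior = {lam:.3f}")
# Even 5 real observations leave the prior with only about 5% of the weight.
```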
Connection to Ridge Regression:
Consider the regularized least squares problem (ridge regression):
$$\hat{\mu}_{\text{ridge}} = \arg\min_{\mu} \left[ \sum_{i=1}^n (x_i - \mu)^2 + \lambda (\mu - \mu_0)^2 \right]$$
Taking the derivative and setting to zero:
$$\hat{\mu}_{\text{ridge}} = \frac{n\bar{x} + \lambda\mu_0}{n + \lambda}$$
This is exactly the Bayesian posterior mean with: $$\lambda = \frac{\sigma^2}{\sigma_0^2} = \frac{\text{data variance}}{\text{prior variance}}$$
Key Insight: Ridge regression is Bayesian inference with a Gaussian prior. The regularization parameter $\lambda$ corresponds to the ratio of data variance to prior variance. This connection extends to all Gaussian linear models:
| Frequentist | Bayesian |
|---|---|
| Ridge regression | Gaussian prior on weights |
| LASSO | Laplace (double-exponential) prior |
| Elastic net | Mixture of Gaussian and Laplace |
| L2 regularization | MAP estimation under a Gaussian prior |
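To make the correspondence concrete, here is a minimal numeric check (toy data; the one-parameter 'mean-only' case from the derivation above) that the ridge estimate with $\lambda = \sigma^2/\sigma_0^2$ coincides with the conjugate posterior mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
sigma2, mu0, var0 = 4.0, 0.0, 2.0
x = rng.normal(1.5, np.sqrt(sigma2), size=12)
n, x_bar = len(x), x.mean()

# Bayesian posterior mean: precision-weighted average
post_mean = (mu0/var0 + n*x_bar/sigma2) / (1/var0 + n/sigma2)

# Ridge objective with lambda = sigma^2 / sigma_0^2
lam = sigma2 / var0
ridge = minimize_scalar(lambda m: ((x - m)**2).sum() + lam*(m - mu0)**2).x

print(post_mean, ridge)   # identical up to optimizer tolerance
```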
| Data Regime | Posterior Mean | Posterior Variance | Interpretation |
|---|---|---|---|
| $n = 0$ (no data) | $\mu_0$ | $\sigma_0^2$ | Prior unchanged |
| $n \ll n_0$ | Near $\mu_0$ | Slightly smaller than $\sigma_0^2$ | Prior dominates |
| $n = n_0$ | Midpoint of $\mu_0$ and $\bar{x}$ | $\sigma_0^2 / 2$ | Equal weight |
| $n \gg n_0$ | Near $\bar{x}$ | $\approx \sigma^2 / n$ | Data dominates |
| $n \to \infty$ | $\bar{x}$ (MLE) | $\to 0$ | Point mass at truth |
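The limiting behavior in the table can be checked directly with the posterior formulas (a short sketch with illustrative numbers, holding the sample mean fixed as $n$ varies):

```python
sigma2, mu0, var0, x_bar = 25.0, 100.0, 100.0, 102.0
for n in (0, 1, 5, 1000):
    prec = 1/var0 + n/sigma2
    mean = (mu0/var0 + n*x_bar/sigma2) / prec
    print(f"n={n:4d}: posterior mean {mean:8.3f}, posterior variance {1/prec:8.4f}")
# n = 0 reproduces the prior; large n collapses onto the sample mean x_bar.
```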
In practice, the data variance $\sigma^2$ is rarely known. Bayesian inference can simultaneously estimate both mean and variance using the Normal-Inverse-Gamma conjugate prior.
Setup: we observe $x_1, \ldots, x_n \mid \mu, \sigma^2 \sim \mathcal{N}(\mu, \sigma^2)$ i.i.d., with both the mean $\mu$ and the variance $\sigma^2$ unknown.
The Normal-Inverse-Gamma Prior:
The conjugate prior for $(\mu, \sigma^2)$ is:
$$\sigma^2 \sim \text{Inv-Gamma}(\alpha, \beta)$$ $$\mu | \sigma^2 \sim \mathcal{N}\left(\mu_0, \frac{\sigma^2}{\kappa}\right)$$
This is written as $(\mu, \sigma^2) \sim \text{NIG}(\mu_0, \kappa, \alpha, \beta)$
Hyperparameter Interpretation: $\mu_0$ is the prior guess for the mean; $\kappa$ acts as a prior sample size for the mean (how many pseudo-observations stand behind $\mu_0$); $\alpha$ is half the prior degrees of freedom for the variance; and $\beta$ is half the prior sum of squares, so the implied prior guess for $\sigma^2$ is roughly $\beta/(\alpha - 1)$.
The Normal-Inverse-Gamma prior has a hierarchical structure: first generate $\sigma^2$, then generate $\mu$ conditional on $\sigma^2$. The uncertainty about $\mu$ scales with the unknown variance—if variance is large, our uncertainty about the mean is also large. This coupling is natural and realistic.
The Posterior:
After observing data $x_1, \ldots, x_n$ with sample mean $\bar{x}$ and sum of squared deviations $S = \sum_i (x_i - \bar{x})^2$:
$$(\mu, \sigma^2 | x_{1:n}) \sim \text{NIG}(\mu_n, \kappa_n, \alpha_n, \beta_n)$$
with updated hyperparameters:
$$\kappa_n = \kappa + n$$
$$\mu_n = \frac{\kappa\mu_0 + n\bar{x}}{\kappa + n}$$
$$\alpha_n = \alpha + \frac{n}{2}$$
$$\beta_n = \beta + \frac{S}{2} + \frac{\kappa n (\bar{x} - \mu_0)^2}{2(\kappa + n)}$$
Interpretation of Updates: $\kappa_n$ and $\alpha_n$ grow with the sample size ($n$ pseudo-observations added for the mean, $n/2$ added to the shape for the variance); $\mu_n$ is the familiar precision-weighted average of prior mean and sample mean; and $\beta_n$ accumulates half the within-sample sum of squares $S$ plus a term penalizing disagreement between $\bar{x}$ and the prior mean $\mu_0$.
Marginal Distributions:
The marginal distribution of $\mu$ (integrating out $\sigma^2$) is a Student-t distribution:
$$\mu | x_{1:n} \sim t_{2\alpha_n}\left(\mu_n, \frac{\beta_n}{\alpha_n \kappa_n}\right)$$
This has heavier tails than a Gaussian, reflecting uncertainty about $\sigma^2$. As $n \to \infty$, it approaches Gaussian.
```python
import numpy as np
from scipy import stats


class NormalInverseGamma:
    """
    Normal-Inverse-Gamma conjugate prior for Gaussian mean and variance.

    Joint prior: (μ, σ²) ~ NIG(μ_0, κ, α, β)
    - σ² ~ Inverse-Gamma(α, β)
    - μ | σ² ~ N(μ_0, σ²/κ)

    This is the natural conjugate prior when both mean and variance
    are unknown.
    """

    def __init__(self, mu_0: float = 0.0, kappa: float = 1.0,
                 alpha: float = 1.0, beta: float = 1.0):
        """
        Initialize NIG hyperparameters.

        Args:
            mu_0: Prior mean for μ
            kappa: Prior precision (effective prior sample size for μ)
            alpha: Shape for Inverse-Gamma on σ² (half prior df)
            beta: Scale for Inverse-Gamma on σ² (half prior SS)
        """
        self.mu_0 = mu_0
        self.kappa = kappa
        self.alpha = alpha
        self.beta = beta

    def update(self, data: np.ndarray) -> 'NormalInverseGamma':
        """
        Compute posterior NIG parameters given observed data.

        Returns a new NIG object with updated parameters.
        """
        n = len(data)
        if n == 0:
            return NormalInverseGamma(self.mu_0, self.kappa,
                                      self.alpha, self.beta)

        x_bar = np.mean(data)
        ss = np.sum((data - x_bar)**2)  # Sum of squared deviations

        # Update equations
        kappa_n = self.kappa + n
        mu_n = (self.kappa * self.mu_0 + n * x_bar) / kappa_n
        alpha_n = self.alpha + n / 2

        # Beta update includes term for mean shift
        mean_shift_term = (self.kappa * n * (x_bar - self.mu_0)**2) / (2 * kappa_n)
        beta_n = self.beta + ss/2 + mean_shift_term

        return NormalInverseGamma(mu_n, kappa_n, alpha_n, beta_n)

    def marginal_mu_distribution(self):
        """
        Return the marginal distribution of μ (Student-t).

        The marginal arises from integrating σ² out of the joint NIG.
        This has heavier tails than Gaussian, reflecting variance uncertainty.
        """
        df = 2 * self.alpha
        loc = self.mu_0
        scale = np.sqrt(self.beta / (self.alpha * self.kappa))
        return stats.t(df=df, loc=loc, scale=scale)

    def posterior_mean_mu(self) -> float:
        """Posterior mean of μ."""
        return self.mu_0  # NIG maintains μ_n as conditional mean

    def posterior_mean_sigma2(self) -> float:
        """Posterior mean of σ² (requires α > 1)."""
        if self.alpha <= 1:
            return np.inf
        return self.beta / (self.alpha - 1)

    def posterior_mode_sigma2(self) -> float:
        """Posterior mode of σ²."""
        return self.beta / (self.alpha + 1)

    def credible_interval_mu(self, confidence: float = 0.95) -> tuple:
        """
        Credible interval for μ using marginal t-distribution.
        """
        dist = self.marginal_mu_distribution()
        tail = (1 - confidence) / 2
        return (dist.ppf(tail), dist.ppf(1 - tail))

    def sample(self, n_samples: int = 10000) -> tuple:
        """
        Generate samples from the joint posterior.

        Returns (mu_samples, sigma2_samples)
        """
        # Sample σ² from Inverse-Gamma(α, β)
        # Inverse-Gamma(α, β) = 1 / Gamma(α, 1/β)
        sigma2_samples = 1.0 / np.random.gamma(self.alpha, 1.0/self.beta, n_samples)

        # Sample μ | σ² from N(μ_0, σ²/κ)
        mu_samples = np.random.normal(
            self.mu_0, np.sqrt(sigma2_samples / self.kappa)
        )

        return mu_samples, sigma2_samples

    def summary(self):
        """Print posterior summary."""
        print(f"NIG({self.mu_0:.4f}, {self.kappa:.2f}, {self.alpha:.2f}, {self.beta:.4f})")
        print(f"  E[μ] = {self.posterior_mean_mu():.4f}")
        print(f"  E[σ²] = {self.posterior_mean_sigma2():.4f}")
        ci = self.credible_interval_mu()
        print(f"  95% CI for μ: [{ci[0]:.4f}, {ci[1]:.4f}]")


# Example: Estimate height distribution
# Prior: Heights are around 170cm, with prior "sample size" of 5
# Variance prior: centered around 100 (σ ≈ 10)
prior = NormalInverseGamma(
    mu_0=170.0,   # Prior mean height
    kappa=5.0,    # Worth 5 observations
    alpha=3.0,    # df = 6
    beta=200.0    # Scale for variance prior
)

print("Prior:")
prior.summary()

# Observed heights (in cm)
heights = np.array([168, 175, 172, 169, 178, 171, 174, 167, 180, 173])

posterior = prior.update(heights)

print(f"\nObserved data: {heights}")
print(f"Sample mean: {np.mean(heights):.2f}")
print(f"Sample std: {np.std(heights):.2f}")

print("\nPosterior:")
posterior.summary()
```

The posterior predictive distribution for future observations is a key Bayesian output, incorporating both parameter uncertainty and data variability.
Known Variance Case:
For $\mu | x_{1:n} \sim \mathcal{N}(\mu_n, \sigma_n^2)$ and future $\tilde{x} | \mu \sim \mathcal{N}(\mu, \sigma^2)$, the posterior predictive is:
$$\tilde{x} | x_{1:n} \sim \mathcal{N}(\mu_n, \sigma^2 + \sigma_n^2)$$
The predictive variance has two components: the irreducible observation noise $\sigma^2$ (aleatoric uncertainty) and the remaining posterior uncertainty about the mean, $\sigma_n^2$ (epistemic uncertainty).
As $n \to \infty$, $\sigma_n^2 \to 0$, and the predictive variance approaches $\sigma^2$ (the irreducible noise).
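A brief sketch of this predictive in code (the posterior numbers below are illustrative stand-ins, not outputs of the earlier example):

```python
import numpy as np
from scipy import stats

sigma2 = 25.0                        # known observation noise
post_mean, post_var = 101.5, 4.0     # illustrative posterior N(mu_n, sigma_n^2)

# Predictive for a new observation: the two variance components add
pred = stats.norm(post_mean, np.sqrt(sigma2 + post_var))
print("95% prediction interval:", pred.interval(0.95))
# The credible interval for mu alone would use only post_var and be much narrower.
```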
Unknown Variance Case (NIG):
With $(\mu, \sigma^2) \sim \text{NIG}(\mu_n, \kappa_n, \alpha_n, \beta_n)$, the posterior predictive is:
$$\tilde{x} | x_{1:n} \sim t_{2\alpha_n}\left(\mu_n, \frac{\beta_n(\kappa_n + 1)}{\alpha_n \kappa_n}\right)$$
This is a Student-t distribution with heavier tails than Gaussian, reflecting uncertainty about both $\mu$ and $\sigma^2$.
The Student-t distribution naturally arises when averaging over uncertainty in variance. It provides wider predictive intervals than a Gaussian would, appropriately reflecting our epistemic uncertainty. As we observe more data, the degrees of freedom increase, and the t-distribution converges to Gaussian.
Prediction Intervals:
A $100(1-\alpha)$% prediction interval $[L, U]$ for a future observation satisfies:
$$P(L \leq \tilde{x} \leq U | x_{1:n}) = 1 - \alpha$$
Unlike credible intervals (for parameters), prediction intervals (for observables) are typically wider because they include data variability on top of parameter uncertainty.
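For the unknown-variance case, the predictive Student-t can be formed directly from the posterior NIG hyperparameters. A sketch, assuming the `NormalInverseGamma` class defined earlier (so that `mu_0`, `kappa`, `alpha`, `beta` on the posterior object hold $\mu_n, \kappa_n, \alpha_n, \beta_n$):

```python
import numpy as np
from scipy import stats

def posterior_predictive(nig):
    """Student-t predictive for one future observation under an NIG posterior."""
    df = 2 * nig.alpha
    scale = np.sqrt(nig.beta * (nig.kappa + 1) / (nig.alpha * nig.kappa))
    return stats.t(df=df, loc=nig.mu_0, scale=scale)

# e.g. with the height posterior from the earlier example:
# pred = posterior_predictive(posterior)
# print(pred.interval(0.95))   # 95% prediction interval for a new height
```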
Comparison of Interval Types:
| Interval Type | Captures | Width | Behavior as $n \to \infty$ |
|---|---|---|---|
| Credible interval for $\mu$ | Parameter uncertainty | Narrow | Shrinks to zero |
| Predictive interval for $\tilde{x}$ | Parameter + data uncertainty | Wide | Approaches $\mu \pm z_{\alpha/2} \sigma$ |
| Confidence interval for $\mu$ | Sampling variability | Narrow | Shrinks to zero |
| Tolerance interval | Proportion of population | Very wide | Bounded below |
Gaussian conjugacy naturally supports sequential updating, enabling online learning from streaming data.
Sequential Update Property:
For known variance, if current posterior is $\mu \sim \mathcal{N}(\mu_t, \sigma_t^2)$ and we observe a new datum $x_{t+1} \sim \mathcal{N}(\mu, \sigma^2)$:
$$\text{New precision: } \tau_{t+1} = \tau_t + \tau$$
$$\text{New mean: } \mu_{t+1} = \frac{\tau_t \mu_t + \tau x_{t+1}}{\tau_{t+1}}$$
where $\tau_t = 1/\sigma_t^2$ and $\tau = 1/\sigma^2$.
This update is incremental: we only need the current posterior parameters and new observation—not the full history.
The Gaussian sequential update is the simplest case of the Kalman filter. The full Kalman filter extends this to systems with time-varying dynamics, process noise, and multivariate observations. Understanding Gaussian conjugacy is the gateway to Kalman filtering and state-space models.
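Concretely, the same update can be rewritten in the innovation ('Kalman gain') form that the filter generalizes:
$$K_t = \frac{\sigma_t^2}{\sigma_t^2 + \sigma^2}, \qquad \mu_{t+1} = \mu_t + K_t\,(x_{t+1} - \mu_t), \qquad \sigma_{t+1}^2 = (1 - K_t)\,\sigma_t^2$$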
```python
import numpy as np
from scipy import stats


class GaussianOnlineEstimator:
    """
    Online Bayesian estimation of Gaussian mean with known variance.

    Updates incrementally with each observation, O(1) per update.
    Maintains complete posterior information in just 2 numbers.
    """

    def __init__(self, prior_mean: float, prior_variance: float,
                 data_variance: float):
        """
        Initialize with prior and known data variance.
        """
        self.mu = prior_mean
        self.tau = 1.0 / prior_variance       # Precision
        self.data_tau = 1.0 / data_variance   # Data precision
        self.n = 0

    def update(self, x: float) -> None:
        """
        Update posterior with single observation.

        O(1) time and space complexity.
        """
        # New precision
        new_tau = self.tau + self.data_tau

        # New mean (precision-weighted average)
        self.mu = (self.tau * self.mu + self.data_tau * x) / new_tau

        self.tau = new_tau
        self.n += 1

    def update_batch(self, data: np.ndarray) -> None:
        """Update with multiple observations at once."""
        for x in data:
            self.update(x)

    @property
    def variance(self) -> float:
        return 1.0 / self.tau

    @property
    def std(self) -> float:
        return np.sqrt(self.variance)

    def credible_interval(self, confidence: float = 0.95) -> tuple:
        z = stats.norm.ppf((1 + confidence) / 2)
        return (self.mu - z * self.std, self.mu + z * self.std)

    def __repr__(self):
        return f"N({self.mu:.4f}, {self.variance:.6f}) after {self.n} obs"


# Example: Online mean estimation with streaming data
true_mean = 5.0
data_variance = 4.0   # Known σ² = 4 (σ = 2)

# Prior: vague belief centered at 0
estimator = GaussianOnlineEstimator(
    prior_mean=0.0,
    prior_variance=100.0,   # Weak prior
    data_variance=data_variance
)

print("Initial:", estimator)

# Simulate streaming observations
np.random.seed(42)
for i in range(20):
    x = np.random.normal(true_mean, np.sqrt(data_variance))
    estimator.update(x)

    if (i + 1) % 5 == 0:
        ci = estimator.credible_interval()
        print(f"After {i+1:2d} obs: μ = {estimator.mu:.3f} ± {estimator.std:.3f}, "
              f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")

print(f"\nTrue mean: {true_mean}")
```

Gaussian conjugacy extends naturally to multivariate settings, essential for vector observations, regression, and modern machine learning.
Multivariate Gaussian with Known Covariance:
For $\mathbf{x}_1, \ldots, \mathbf{x}_n | \boldsymbol{\mu} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ i.i.d. (known $\boldsymbol{\Sigma}$), the conjugate prior is:
$$\boldsymbol{\mu} \sim \mathcal{N}_d(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$$
The posterior is:
$$\boldsymbol{\mu} | \mathbf{x}_{1:n} \sim \mathcal{N}_d(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$
with:
$$\boldsymbol{\Sigma}_n^{-1} = \boldsymbol{\Sigma}_0^{-1} + n\boldsymbol{\Sigma}^{-1}$$
$$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_n \left( \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 + n\boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} \right)$$
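A compact sketch of this multivariate update (toy two-dimensional numbers, known $\boldsymbol{\Sigma}$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # known observation covariance
mu0 = np.zeros(d)
Sigma0 = 10.0 * np.eye(d)                    # vague prior on the mean vector

X = rng.multivariate_normal([1.0, -2.0], Sigma, size=50)
n, x_bar = len(X), X.mean(axis=0)

Sigma0_inv = np.linalg.inv(Sigma0)
Sigma_inv = np.linalg.inv(Sigma)

post_cov = np.linalg.inv(Sigma0_inv + n * Sigma_inv)              # precisions add
post_mean = post_cov @ (Sigma0_inv @ mu0 + n * Sigma_inv @ x_bar)
print(post_mean)   # close to the sample mean: n = 50 with a vague prior
```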
Unknown Covariance (Normal-Inverse-Wishart):
For unknown covariance matrix $\boldsymbol{\Sigma}$, the conjugate prior is:
$$\boldsymbol{\Sigma} \sim \text{Inverse-Wishart}(\nu_0, \boldsymbol{\Psi}_0)$$ $$\boldsymbol{\mu} | \boldsymbol{\Sigma} \sim \mathcal{N}_d(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}/\kappa_0)$$
This is the multivariate extension of Normal-Inverse-Gamma, called Normal-Inverse-Wishart (NIW).
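A minimal sketch of drawing from an NIW prior with SciPy (the hyperparameter values are illustrative; `scipy.stats.invwishart` is parametrized by degrees of freedom `df` and a `scale` matrix):

```python
import numpy as np
from scipy import stats

d = 2
mu0 = np.zeros(d)
kappa0, nu0 = 1.0, d + 2
Psi0 = np.eye(d)

# Draw Sigma ~ Inverse-Wishart(nu0, Psi0), then mu | Sigma ~ N(mu0, Sigma / kappa0)
Sigma = stats.invwishart(df=nu0, scale=Psi0).rvs()
mu = stats.multivariate_normal(mean=mu0, cov=Sigma / kappa0).rvs()
print(mu)
print(Sigma)
```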
| Unknown Parameters | Conjugate Prior | Posterior | Hyperparameter count |
|---|---|---|---|
| $\mu$ (known $\sigma^2$) | Normal($\mu_0, \sigma_0^2$) | Normal($\mu_n, \sigma_n^2$) | 2 hyperparams |
| $\sigma^2$ (known $\mu$) | Inverse-Gamma($\alpha, \beta$) | Inverse-Gamma($\alpha_n, \beta_n$) | 2 hyperparams |
| $\mu, \sigma^2$ | Normal-Inverse-Gamma | Normal-Inverse-Gamma | 4 hyperparams |
| $\boldsymbol{\mu}$ (known $\boldsymbol{\Sigma}$) | Multivariate Normal | Multivariate Normal | $d + d(d+1)/2$ |
| $\boldsymbol{\Sigma}$ (known $\boldsymbol{\mu}$) | Inverse-Wishart | Inverse-Wishart | $1 + d(d+1)/2$ |
| $\boldsymbol{\mu}, \boldsymbol{\Sigma}$ | Normal-Inverse-Wishart | Normal-Inverse-Wishart | $d + 1 + 1 + d(d+1)/2$ |
Bayesian linear regression with Gaussian prior on weights is a direct application of multivariate Gaussian conjugacy. The posterior over weights is Gaussian, the predictive distribution is Gaussian (or Student-t with unknown variance), and regularization emerges naturally from the prior. This is the foundation of Gaussian Process regression.
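As a minimal sketch of that claim (a toy regression with made-up numbers, not a full treatment): with noise variance $\sigma^2$ and prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$, the posterior over weights is Gaussian and its mean coincides with the ridge solution at $\lambda = \alpha\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25
y = X @ w_true + rng.normal(0, np.sqrt(sigma2), size=n)

alpha = 2.0                                        # prior precision on the weights
# Posterior: N(w_n, S_n) with S_n^{-1} = alpha I + X^T X / sigma^2
S_n = np.linalg.inv(alpha * np.eye(d) + X.T @ X / sigma2)
w_n = S_n @ (X.T @ y) / sigma2

# Ridge solution with lambda = alpha * sigma^2 coincides with w_n
lam = alpha * sigma2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(w_n, w_ridge))                   # True
```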
We have comprehensively explored Gaussian-Gaussian conjugacy, the foundation for Bayesian inference on continuous data. The key insights: precisions add under conjugate updating; the posterior mean shrinks the MLE toward the prior mean, with the prior acting as $n_0 = \sigma^2/\sigma_0^2$ pseudo-observations; ridge regression and L2 regularization correspond to MAP estimation under a Gaussian prior; the Normal-Inverse-Gamma prior handles jointly unknown mean and variance, yielding Student-t marginals and predictives; sequential updating is the gateway to the Kalman filter; and the multivariate extension (Normal-Inverse-Wishart) underlies Bayesian linear regression.
What's Next:
Having mastered the univariate conjugates Beta-Binomial and Gaussian-Gaussian, we now tackle categorical data. The next page covers Dirichlet-Multinomial conjugacy—essential for text models, topic models, mixture models, and any setting with multiple discrete categories.
You now command Gaussian conjugacy—the workhorse for continuous data inference. You understand both the known-variance and unknown-variance cases, can perform sequential updates, recognize the connection to regularization, and appreciate the extension to multiple dimensions. Next, we explore Dirichlet-Multinomial conjugacy for categorical data.