If Beta-Binomial is the foundation for binary data, Gaussian-Gaussian conjugacy is its counterpart for continuous measurements. From estimating physical constants to modeling financial returns, from sensor calibration to Bayesian regression—Gaussian conjugacy underlies an enormous fraction of practical Bayesian inference.
The Gaussian (Normal) distribution occupies a privileged position in statistics: it is closed under sums and affine transformations, it arises as the limiting distribution in the central limit theorem, and it is the maximum-entropy distribution for a fixed mean and variance.
Gaussian-Gaussian conjugacy extends this analytical tractability to Bayesian inference, providing closed-form posteriors when we place Gaussian priors on Gaussian-distributed parameters.
By the end of this page, you will master Gaussian-Gaussian conjugacy for the known-variance case, understand the Normal-Inverse-Gamma conjugate for jointly unknown mean and variance, derive posterior distributions and their properties, compute predictive distributions for future observations, and connect Gaussian priors to ridge regression and L2 regularization.
Before developing conjugate analysis, we establish essential Gaussian properties and two parametrizations used in Bayesian inference.
Definition: A random variable $x$ follows a Gaussian (Normal) distribution with mean $\mu$ and variance $\sigma^2$, written $x \sim \mathcal{N}(\mu, \sigma^2)$, if its density is:
$$p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
The Precision Parametrization:
It is often more convenient to work with precision $\tau = 1/\sigma^2$ (inverse variance):
$$p(x | \mu, \tau) = \sqrt{\frac{\tau}{2\pi}} \exp\left( -\frac{\tau(x - \mu)^2}{2} \right)$$
In this parametrization, $\mathcal{N}(\mu, \tau^{-1})$ denotes a Gaussian with mean $\mu$ and precision $\tau$.
Why Precision? Precision arises naturally in conjugate analysis because information combines additively in precision: multiplying prior and likelihood sums the quadratic coefficients in the exponent, so the posterior precision is simply the prior precision plus the data precision, as the update formulas below make explicit.
| Property | Formula | Notes |
|---|---|---|
| Mean | $\mathbb{E}[x] = \mu$ | First moment equals the parameter |
| Variance | $\text{Var}[x] = \sigma^2 = 1/\tau$ | Second central moment |
| Mode | $\text{Mode}[x] = \mu$ | Symmetric: mean = mode = median |
| Sum of independent Gaussians | $\sum_i x_i \sim \mathcal{N}(\sum_i \mu_i, \sum_i \sigma_i^2)$ | Closed under addition (independent $x_i$) |
| Linear transform | $ax + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$ | Affine transformation |
| Sample mean | $\bar{x} \sim \mathcal{N}(\mu, \sigma^2/n)$ | Mean of $n$ i.i.d. samples |
For $n$ i.i.d. Gaussian observations $x_1, \ldots, x_n$ with unknown mean $\mu$ and known variance $\sigma^2$, the sufficient statistic is the sample sum $\sum_i x_i$ (or equivalently, the sample mean $\bar{x} = \frac{1}{n}\sum_i x_i$). When both mean and variance are unknown, the sufficient statistics are ($\sum_i x_i$, $\sum_i x_i^2$) or equivalently ($\bar{x}$, $s^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$).
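As a quick sanity check that $(n, \bar{x})$ really carries all the information about $\mu$, the sketch below (toy numbers, illustrative variable names) compares the closed-form posterior mean, which uses only $n$ and $\bar{x}$, against a brute-force grid posterior built from the full likelihood; the closed-form update used here is derived later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                               # known data variance
data = rng.normal(3.0, np.sqrt(sigma2), size=40)
mu0, var0 = 0.0, 10.0                      # prior N(mu0, var0) on the mean

# Closed form: depends on the data only through n and the sample mean
n, x_bar = len(data), data.mean()
post_prec = 1/var0 + n/sigma2
post_mean = (mu0/var0 + n*x_bar/sigma2) / post_prec

# Brute force: grid posterior built from the full likelihood
grid = np.linspace(-2, 8, 20001)
log_post = (-0.5*((data[:, None] - grid)**2).sum(axis=0)/sigma2
            - 0.5*(grid - mu0)**2/var0)
w = np.exp(log_post - log_post.max())
w /= w.sum()
print(post_mean, (grid * w).sum())         # the two means agree to grid accuracy
```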
The simplest and most instructive Gaussian conjugacy occurs when inferring the mean $\mu$ with known variance $\sigma^2$.
Setup: we observe $x_1, \ldots, x_n \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ i.i.d. with $\sigma^2$ known, and place the conjugate prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.
Conjugate Posterior Theorem: The posterior is Gaussian:
$$\mu | x_1, \ldots, x_n \sim \mathcal{N}(\mu_n, \sigma_n^2)$$
with posterior parameters:
$$\mu_n = \frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}} = \frac{\tau_0 \mu_0 + n\tau \bar{x}}{\tau_0 + n\tau}$$
$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \quad \text{(precisions add!)}$$
where $\tau_0 = 1/\sigma_0^2$ is the prior precision and $\tau = 1/\sigma^2$ is the data precision.
The posterior mean is a precision-weighted average of the prior mean and the sample mean. Information combines additively in precision space: more observations add precision, shrinking uncertainty. This is the Gaussian analogue of the Beta-Binomial's pseudo-count interpretation.
Derivation (Completing the Square):
We derive the posterior by combining prior and likelihood and recognizing the Gaussian kernel.
Step 1: Combine Prior and Likelihood
$$P(\mu | x_{1:n}) \propto P(x_{1:n} | \mu) \cdot P(\mu)$$
$$\propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right) \cdot \exp\left( -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right)$$
Step 2: Expand the Quadratics. Expand both squares; any term not involving $\mu$ is absorbed into the proportionality constant:
$$\propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i^2 - 2x_i\mu + \mu^2) - \frac{\mu^2 - 2\mu\mu_0 + \mu_0^2}{2\sigma_0^2} \right)$$
$$\propto \exp\left( -\frac{n\mu^2 - 2\mu \sum x_i}{2\sigma^2} - \frac{\mu^2 - 2\mu\mu_0}{2\sigma_0^2} \right)$$
Step 3: Collect Terms in $\mu$
$$\propto \exp\left( -\frac{1}{2}\left[ \mu^2 \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) - 2\mu \left( \frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \right] \right)$$
Step 4: Complete the Square. The exponent has the form $-\frac{1}{2}\left[A\mu^2 - 2B\mu\right] = -\frac{A}{2}(\mu - B/A)^2 + \text{const}$,
with $A = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} = \frac{1}{\sigma_n^2}$ and $B = \frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}$.
The posterior is therefore $\mathcal{N}(B/A, 1/A) = \mathcal{N}(\mu_n, \sigma_n^2)$. ∎
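A quick symbolic check of the completing-the-square identity, as a SymPy sketch (the symbols mirror the derivation above):

```python
import sympy as sp

mu, A, B = sp.symbols('mu A B', positive=True)

# -1/2 (A mu^2 - 2 B mu) and -A/2 (mu - B/A)^2 differ only by a constant in mu
lhs = -sp.Rational(1, 2) * (A*mu**2 - 2*B*mu)
rhs = -A/2 * (mu - B/A)**2
print(sp.simplify(lhs - rhs))   # B**2/(2*A): free of mu, absorbed into the constant
```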
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def gaussian_posterior_known_variance(
    prior_mean: float,
    prior_variance: float,
    data: np.ndarray,
    data_variance: float
) -> tuple:
    """
    Compute Gaussian posterior for mean with known variance.

    The posterior is Gaussian with:
    - mean = precision-weighted average of prior mean and sample mean
    - precision = prior precision + data precision × n

    This is the fundamental Gaussian conjugate update.
    """
    n = len(data)
    sample_mean = np.mean(data)

    # Work in precision space
    prior_precision = 1.0 / prior_variance
    data_precision = 1.0 / data_variance

    # Posterior precision: precisions add!
    posterior_precision = prior_precision + n * data_precision
    posterior_variance = 1.0 / posterior_precision

    # Posterior mean: precision-weighted average
    posterior_mean = (
        prior_precision * prior_mean + n * data_precision * sample_mean
    ) / posterior_precision

    return posterior_mean, posterior_variance


def visualize_gaussian_update(
    prior_mean: float,
    prior_variance: float,
    data: np.ndarray,
    data_variance: float
):
    """
    Visualize the Bayesian update from prior to posterior.
    """
    post_mean, post_var = gaussian_posterior_known_variance(
        prior_mean, prior_variance, data, data_variance
    )

    # Create x-axis range encompassing both prior and posterior
    x_min = min(prior_mean - 3*np.sqrt(prior_variance),
                post_mean - 3*np.sqrt(post_var))
    x_max = max(prior_mean + 3*np.sqrt(prior_variance),
                post_mean + 3*np.sqrt(post_var))
    x = np.linspace(x_min, x_max, 500)

    prior_pdf = stats.norm.pdf(x, prior_mean, np.sqrt(prior_variance))
    posterior_pdf = stats.norm.pdf(x, post_mean, np.sqrt(post_var))

    # Likelihood (as function of mu, normalized for visualization)
    sample_mean = np.mean(data)
    n = len(data)
    likelihood_var = data_variance / n
    likelihood_pdf = stats.norm.pdf(x, sample_mean, np.sqrt(likelihood_var))

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(x, prior_pdf, 'b--', lw=2,
            label=f'Prior: N({prior_mean:.2f}, {prior_variance:.2f})')
    ax.plot(x, likelihood_pdf, 'g:', lw=2,
            label=f'Likelihood: x̄={sample_mean:.2f}, n={n}')
    ax.plot(x, posterior_pdf, 'r-', lw=2,
            label=f'Posterior: N({post_mean:.2f}, {post_var:.4f})')
    ax.fill_between(x, posterior_pdf, alpha=0.2, color='red')
    ax.axvline(sample_mean, color='green', linestyle='--', alpha=0.5)
    ax.set_xlabel('μ', fontsize=12)
    ax.set_ylabel('Density', fontsize=12)
    ax.set_title('Gaussian-Gaussian Conjugate Update', fontsize=14)
    ax.legend(fontsize=10)
    ax.grid(alpha=0.3)

    return fig, post_mean, post_var


# Example: Sensor Calibration
# Prior: We expect the sensor reads around 100, uncertainty ±10
prior_mean = 100.0
prior_variance = 100.0   # σ₀ = 10

# Data: We take 5 measurements, knowing sensor noise has σ = 5
np.random.seed(42)
true_value = 102.0       # Unknown truth
data = np.random.normal(true_value, 5.0, size=5)
data_variance = 25.0     # σ² = 5² = 25 (known)

fig, post_mean, post_var = visualize_gaussian_update(
    prior_mean, prior_variance, data, data_variance
)

print(f"Prior: N({prior_mean}, {prior_variance})")
print(f"Data: {data}")
print(f"Sample mean: {np.mean(data):.3f}")
print(f"Posterior: N({post_mean:.3f}, {post_var:.4f})")
print(f"Posterior 95% CI: [{post_mean - 1.96*np.sqrt(post_var):.3f}, "
      f"{post_mean + 1.96*np.sqrt(post_var):.3f}]")
print(f"True value: {true_value}")
```

The Gaussian posterior mean reveals deep connections between Bayesian inference, shrinkage estimation, and regularization.
Shrinkage Interpretation:
The posterior mean can be written as:
$$\mu_n = \lambda \mu_0 + (1 - \lambda) \bar{x}$$
where the shrinkage factor is:
$$\lambda = \frac{\tau_0}{\tau_0 + n\tau} = \frac{\sigma^2 / n}{\sigma^2 / n + \sigma_0^2} = \frac{\text{data variance}}{\text{data variance} + \text{prior variance}}$$
The posterior mean shrinks the MLE ($\bar{x}$) toward the prior mean ($\mu_0$). The amount of shrinkage depends on the relative precision of data vs. prior:
The prior acts like $n_0 = \sigma^2 / \sigma_0^2$ additional observations centered at $\mu_0$. This is the prior's 'effective sample size.' If $n \gg n_0$, the data dominate; if $n \ll n_0$, the prior dominates. This interpretation guides prior elicitation: 'How many observations is my prior worth?'
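A tiny numeric illustration of this effective-sample-size reading, using sensor-style numbers (a sketch; the figures are only illustrative):

```python
# Data noise sigma^2 = 25, prior variance sigma_0^2 = 100
sigma2, var0 = 25.0, 100.0
n0 = sigma2 / var0                 # prior is "worth" n0 = 0.25 observations
for n in (1, 5, 25):
    lam = n0 / (n0 + n)            # weight placed on the prior mean
    print(f"n={n:2d}: shrinkage toward the prior = {lam:.3f}")
# Even 5 real observations leave the prior with only about 5% of the weight.
```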
Connection to Ridge Regression:
Consider the regularized least squares problem (ridge regression):
$$\hat{\mu}_{\text{ridge}} = \arg\min_{\mu} \left[ \sum_{i=1}^n (x_i - \mu)^2 + \lambda (\mu - \mu_0)^2 \right]$$
Taking the derivative and setting to zero:
$$\hat{\mu}_{\text{ridge}} = \frac{n\bar{x} + \lambda\mu_0}{n + \lambda}$$
This is exactly the Bayesian posterior mean with: $$\lambda = \frac{\sigma^2}{\sigma_0^2} = \frac{\text{data variance}}{\text{prior variance}}$$
Key Insight: Ridge regression is Bayesian inference with a Gaussian prior. The regularization parameter $\lambda$ corresponds to the ratio of data variance to prior variance. This connection extends to all Gaussian linear models:
| Frequentist | Bayesian |
|---|---|
| Ridge regression | Gaussian prior on weights |
| LASSO | Laplace (double-exponential) prior |
| Elastic net | Mixture of Gaussian and Laplace |
| L2 regularization | MAP estimation under a Gaussian prior |
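To make the correspondence concrete, here is a minimal numeric check (toy data; the one-parameter 'mean-only' case from the derivation above) that the ridge estimate with $\lambda = \sigma^2/\sigma_0^2$ coincides with the conjugate posterior mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
sigma2, mu0, var0 = 4.0, 0.0, 2.0
x = rng.normal(1.5, np.sqrt(sigma2), size=12)
n, x_bar = len(x), x.mean()

# Bayesian posterior mean: precision-weighted average
post_mean = (mu0/var0 + n*x_bar/sigma2) / (1/var0 + n/sigma2)

# Ridge objective with lambda = sigma^2 / sigma_0^2
lam = sigma2 / var0
ridge = minimize_scalar(lambda m: ((x - m)**2).sum() + lam*(m - mu0)**2).x

print(post_mean, ridge)   # identical up to optimizer tolerance
```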
| Data Regime | Posterior Mean | Posterior Variance | Interpretation |
|---|---|---|---|
| $n = 0$ (no data) | $\mu_0$ | $\sigma_0^2$ | Prior unchanged |
| $n \ll n_0$ | Near $\mu_0$ | Slightly smaller than $\sigma_0^2$ | Prior dominates |
| $n = n_0$ | Midpoint of $\mu_0$ and $\bar{x}$ | $\sigma_0^2 / 2$ | Equal weight |
| $n \gg n_0$ | Near $\bar{x}$ | $\approx \sigma^2 / n$ | Data dominates |
| $n \to \infty$ | $\bar{x}$ (MLE) | $\to 0$ | Point mass at truth |
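The limiting behavior in the table can be checked directly with the posterior formulas (a short sketch with illustrative numbers, holding the sample mean fixed as $n$ varies):

```python
sigma2, mu0, var0, x_bar = 25.0, 100.0, 100.0, 102.0
for n in (0, 1, 5, 1000):
    prec = 1/var0 + n/sigma2
    mean = (mu0/var0 + n*x_bar/sigma2) / prec
    print(f"n={n:4d}: posterior mean {mean:8.3f}, posterior variance {1/prec:8.4f}")
# n = 0 reproduces the prior; large n collapses onto the sample mean x_bar.
```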
In practice, the data variance $\sigma^2$ is rarely known. Bayesian inference can simultaneously estimate both mean and variance using the Normal-Inverse-Gamma conjugate prior.
Setup: we observe $x_1, \ldots, x_n \mid \mu, \sigma^2 \sim \mathcal{N}(\mu, \sigma^2)$ i.i.d., with both the mean $\mu$ and the variance $\sigma^2$ unknown.
The Normal-Inverse-Gamma Prior:
The conjugate prior for $(\mu, \sigma^2)$ is:
$$\sigma^2 \sim \text{Inv-Gamma}(\alpha, \beta)$$ $$\mu | \sigma^2 \sim \mathcal{N}\left(\mu_0, \frac{\sigma^2}{\kappa}\right)$$
This is written as $(\mu, \sigma^2) \sim \text{NIG}(\mu_0, \kappa, \alpha, \beta)$
Hyperparameter Interpretation: $\mu_0$ is the prior guess for the mean; $\kappa$ acts as a prior sample size for the mean (how many pseudo-observations stand behind $\mu_0$); $\alpha$ is half the prior degrees of freedom for the variance; and $\beta$ is half the prior sum of squares, so the implied prior guess for $\sigma^2$ is roughly $\beta/(\alpha - 1)$.
The Normal-Inverse-Gamma prior has a hierarchical structure: first generate $\sigma^2$, then generate $\mu$ conditional on $\sigma^2$. The uncertainty about $\mu$ scales with the unknown variance—if variance is large, our uncertainty about the mean is also large. This coupling is natural and realistic.
The Posterior:
After observing data $x_1, \ldots, x_n$ with sample mean $\bar{x}$ and sum of squared deviations $S = \sum_i (x_i - \bar{x})^2$:
$$(\mu, \sigma^2 | x_{1:n}) \sim \text{NIG}(\mu_n, \kappa_n, \alpha_n, \beta_n)$$
with updated hyperparameters:
$$\kappa_n = \kappa + n$$
$$\mu_n = \frac{\kappa\mu_0 + n\bar{x}}{\kappa + n}$$
$$\alpha_n = \alpha + \frac{n}{2}$$
$$\beta_n = \beta + \frac{S}{2} + \frac{\kappa n (\bar{x} - \mu_0)^2}{2(\kappa + n)}$$
Interpretation of Updates: $\kappa_n$ and $\alpha_n$ grow with the sample size ($n$ pseudo-observations added for the mean, $n/2$ added to the shape for the variance); $\mu_n$ is the familiar precision-weighted average of prior mean and sample mean; and $\beta_n$ accumulates half the within-sample sum of squares $S$ plus a term penalizing disagreement between $\bar{x}$ and the prior mean $\mu_0$.
Marginal Distributions:
The marginal distribution of $\mu$ (integrating out $\sigma^2$) is a Student-t distribution:
$$\mu | x_{1:n} \sim t_{2\alpha_n}\left(\mu_n, \frac{\beta_n}{\alpha_n \kappa_n}\right)$$
This has heavier tails than a Gaussian, reflecting uncertainty about $\sigma^2$. As $n \to \infty$, it approaches Gaussian.
```python
import numpy as np
from scipy import stats


class NormalInverseGamma:
    """
    Normal-Inverse-Gamma conjugate prior for Gaussian mean and variance.

    Joint prior: (μ, σ²) ~ NIG(μ_0, κ, α, β)
    - σ² ~ Inverse-Gamma(α, β)
    - μ | σ² ~ N(μ_0, σ²/κ)

    This is the natural conjugate prior when both mean and variance
    are unknown.
    """

    def __init__(self, mu_0: float = 0.0, kappa: float = 1.0,
                 alpha: float = 1.0, beta: float = 1.0):
        """
        Initialize NIG hyperparameters.

        Args:
            mu_0: Prior mean for μ
            kappa: Prior precision (effective prior sample size for μ)
            alpha: Shape for Inverse-Gamma on σ² (half prior df)
            beta: Scale for Inverse-Gamma on σ² (half prior SS)
        """
        self.mu_0 = mu_0
        self.kappa = kappa
        self.alpha = alpha
        self.beta = beta

    def update(self, data: np.ndarray) -> 'NormalInverseGamma':
        """
        Compute posterior NIG parameters given observed data.

        Returns a new NIG object with updated parameters.
        """
        n = len(data)
        if n == 0:
            return NormalInverseGamma(self.mu_0, self.kappa,
                                      self.alpha, self.beta)

        x_bar = np.mean(data)
        ss = np.sum((data - x_bar)**2)  # Sum of squared deviations

        # Update equations
        kappa_n = self.kappa + n
        mu_n = (self.kappa * self.mu_0 + n * x_bar) / kappa_n
        alpha_n = self.alpha + n / 2

        # Beta update includes term for mean shift
        mean_shift_term = (self.kappa * n * (x_bar - self.mu_0)**2) / (2 * kappa_n)
        beta_n = self.beta + ss/2 + mean_shift_term

        return NormalInverseGamma(mu_n, kappa_n, alpha_n, beta_n)

    def marginal_mu_distribution(self):
        """
        Return the marginal distribution of μ (Student-t).

        The marginal arises from integrating σ² out of the joint NIG.
        This has heavier tails than Gaussian, reflecting variance uncertainty.
        """
        df = 2 * self.alpha
        loc = self.mu_0
        scale = np.sqrt(self.beta / (self.alpha * self.kappa))
        return stats.t(df=df, loc=loc, scale=scale)

    def posterior_mean_mu(self) -> float:
        """Posterior mean of μ."""
        return self.mu_0  # NIG maintains μ_n as conditional mean

    def posterior_mean_sigma2(self) -> float:
        """Posterior mean of σ² (requires α > 1)."""
        if self.alpha <= 1:
            return np.inf
        return self.beta / (self.alpha - 1)

    def posterior_mode_sigma2(self) -> float:
        """Posterior mode of σ²."""
        return self.beta / (self.alpha + 1)

    def credible_interval_mu(self, confidence: float = 0.95) -> tuple:
        """
        Credible interval for μ using marginal t-distribution.
        """
        dist = self.marginal_mu_distribution()
        tail = (1 - confidence) / 2
        return (dist.ppf(tail), dist.ppf(1 - tail))

    def sample(self, n_samples: int = 10000) -> tuple:
        """
        Generate samples from the joint posterior.

        Returns (mu_samples, sigma2_samples)
        """
        # Sample σ² from Inverse-Gamma(α, β)
        # Inverse-Gamma(α, β) = 1 / Gamma(α, 1/β)
        sigma2_samples = 1.0 / np.random.gamma(self.alpha, 1.0/self.beta, n_samples)

        # Sample μ | σ² from N(μ_0, σ²/κ)
        mu_samples = np.random.normal(
            self.mu_0, np.sqrt(sigma2_samples / self.kappa)
        )

        return mu_samples, sigma2_samples

    def summary(self):
        """Print posterior summary."""
        print(f"NIG({self.mu_0:.4f}, {self.kappa:.2f}, {self.alpha:.2f}, {self.beta:.4f})")
        print(f"  E[μ] = {self.posterior_mean_mu():.4f}")
        print(f"  E[σ²] = {self.posterior_mean_sigma2():.4f}")
        ci = self.credible_interval_mu()
        print(f"  95% CI for μ: [{ci[0]:.4f}, {ci[1]:.4f}]")


# Example: Estimate height distribution
# Prior: Heights are around 170cm, with prior "sample size" of 5
# Variance prior: centered around 100 (σ ≈ 10)
prior = NormalInverseGamma(
    mu_0=170.0,   # Prior mean height
    kappa=5.0,    # Worth 5 observations
    alpha=3.0,    # df = 6
    beta=200.0    # Scale for variance prior
)

print("Prior:")
prior.summary()

# Observed heights (in cm)
heights = np.array([168, 175, 172, 169, 178, 171, 174, 167, 180, 173])

posterior = prior.update(heights)

print(f"\nObserved data: {heights}")
print(f"Sample mean: {np.mean(heights):.2f}")
print(f"Sample std: {np.std(heights):.2f}")

print("\nPosterior:")
posterior.summary()
```

The posterior predictive distribution for future observations is a key Bayesian output, incorporating both parameter uncertainty and data variability.
Known Variance Case:
For $\mu | x_{1:n} \sim \mathcal{N}(\mu_n, \sigma_n^2)$ and future $\tilde{x} | \mu \sim \mathcal{N}(\mu, \sigma^2)$, the posterior predictive is:
$$\tilde{x} | x_{1:n} \sim \mathcal{N}(\mu_n, \sigma^2 + \sigma_n^2)$$
The predictive variance has two components: the irreducible observation noise $\sigma^2$ (aleatoric uncertainty) and the remaining posterior uncertainty about the mean, $\sigma_n^2$ (epistemic uncertainty).
As $n \to \infty$, $\sigma_n^2 \to 0$, and the predictive variance approaches $\sigma^2$ (the irreducible noise).
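A brief sketch of this predictive in code (the posterior numbers below are illustrative stand-ins, not outputs of the earlier example):

```python
import numpy as np
from scipy import stats

sigma2 = 25.0                        # known observation noise
post_mean, post_var = 101.5, 4.0     # illustrative posterior N(mu_n, sigma_n^2)

# Predictive for a new observation: the two variance components add
pred = stats.norm(post_mean, np.sqrt(sigma2 + post_var))
print("95% prediction interval:", pred.interval(0.95))
# The credible interval for mu alone would use only post_var and be much narrower.
```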
Unknown Variance Case (NIG):
With $(\mu, \sigma^2) \sim \text{NIG}(\mu_n, \kappa_n, \alpha_n, \beta_n)$, the posterior predictive is:
$$\tilde{x} | x_{1:n} \sim t_{2\alpha_n}\left(\mu_n, \frac{\beta_n(\kappa_n + 1)}{\alpha_n \kappa_n}\right)$$
This is a Student-t distribution with heavier tails than Gaussian, reflecting uncertainty about both $\mu$ and $\sigma^2$.
The Student-t distribution naturally arises when averaging over uncertainty in variance. It provides wider predictive intervals than a Gaussian would, appropriately reflecting our epistemic uncertainty. As we observe more data, the degrees of freedom increase, and the t-distribution converges to Gaussian.
Prediction Intervals:
A $100(1-\alpha)$% prediction interval $[L, U]$ for a future observation satisfies:
$$P(L \leq \tilde{x} \leq U | x_{1:n}) = 1 - \alpha$$
Unlike credible intervals (for parameters), prediction intervals (for observables) are typically wider because they include data variability on top of parameter uncertainty.
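For the unknown-variance case, the predictive Student-t can be formed directly from the posterior NIG hyperparameters. A sketch, assuming the `NormalInverseGamma` class defined earlier (so that `mu_0`, `kappa`, `alpha`, `beta` on the posterior object hold $\mu_n, \kappa_n, \alpha_n, \beta_n$):

```python
import numpy as np
from scipy import stats

def posterior_predictive(nig):
    """Student-t predictive for one future observation under an NIG posterior."""
    df = 2 * nig.alpha
    scale = np.sqrt(nig.beta * (nig.kappa + 1) / (nig.alpha * nig.kappa))
    return stats.t(df=df, loc=nig.mu_0, scale=scale)

# e.g. with the height posterior from the earlier example:
# pred = posterior_predictive(posterior)
# print(pred.interval(0.95))   # 95% prediction interval for a new height
```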
Comparison of Interval Types:
| Interval Type | Captures | Width | Behavior as $n \to \infty$ |
|---|---|---|---|
| Credible interval for $\mu$ | Parameter uncertainty | Narrow | Shrinks to zero |
| Predictive interval for $\tilde{x}$ | Parameter + data uncertainty | Wide | Approaches $\mu \pm z_{\alpha/2} \sigma$ |
| Confidence interval for $\mu$ | Sampling variability | Narrow | Shrinks to zero |
| Tolerance interval | Proportion of population | Very wide | Bounded below |
Gaussian conjugacy naturally supports sequential updating, enabling online learning from streaming data.
Sequential Update Property:
For known variance, if current posterior is $\mu \sim \mathcal{N}(\mu_t, \sigma_t^2)$ and we observe a new datum $x_{t+1} \sim \mathcal{N}(\mu, \sigma^2)$:
$$\text{New precision: } \tau_{t+1} = \tau_t + \tau$$
$$\text{New mean: } \mu_{t+1} = \frac{\tau_t \mu_t + \tau x_{t+1}}{\tau_{t+1}}$$
where $\tau_t = 1/\sigma_t^2$ and $\tau = 1/\sigma^2$.
This update is incremental: we only need the current posterior parameters and new observation—not the full history.
The Gaussian sequential update is the simplest case of the Kalman filter. The full Kalman filter extends this to systems with time-varying dynamics, process noise, and multivariate observations. Understanding Gaussian conjugacy is the gateway to Kalman filtering and state-space models.
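Concretely, the same update can be rewritten in the innovation ('Kalman gain') form that the filter generalizes:
$$K_t = \frac{\sigma_t^2}{\sigma_t^2 + \sigma^2}, \qquad \mu_{t+1} = \mu_t + K_t\,(x_{t+1} - \mu_t), \qquad \sigma_{t+1}^2 = (1 - K_t)\,\sigma_t^2$$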
```python
import numpy as np
from scipy import stats


class GaussianOnlineEstimator:
    """
    Online Bayesian estimation of Gaussian mean with known variance.

    Updates incrementally with each observation, O(1) per update.
    Maintains complete posterior information in just 2 numbers.
    """

    def __init__(self, prior_mean: float, prior_variance: float,
                 data_variance: float):
        """
        Initialize with prior and known data variance.
        """
        self.mu = prior_mean
        self.tau = 1.0 / prior_variance       # Precision
        self.data_tau = 1.0 / data_variance   # Data precision
        self.n = 0

    def update(self, x: float) -> None:
        """
        Update posterior with single observation.

        O(1) time and space complexity.
        """
        # New precision
        new_tau = self.tau + self.data_tau

        # New mean (precision-weighted average)
        self.mu = (self.tau * self.mu + self.data_tau * x) / new_tau

        self.tau = new_tau
        self.n += 1

    def update_batch(self, data: np.ndarray) -> None:
        """Update with multiple observations at once."""
        for x in data:
            self.update(x)

    @property
    def variance(self) -> float:
        return 1.0 / self.tau

    @property
    def std(self) -> float:
        return np.sqrt(self.variance)

    def credible_interval(self, confidence: float = 0.95) -> tuple:
        z = stats.norm.ppf((1 + confidence) / 2)
        return (self.mu - z * self.std, self.mu + z * self.std)

    def __repr__(self):
        return f"N({self.mu:.4f}, {self.variance:.6f}) after {self.n} obs"


# Example: Online mean estimation with streaming data
true_mean = 5.0
data_variance = 4.0   # Known σ² = 4 (σ = 2)

# Prior: vague belief centered at 0
estimator = GaussianOnlineEstimator(
    prior_mean=0.0,
    prior_variance=100.0,   # Weak prior
    data_variance=data_variance
)

print("Initial:", estimator)

# Simulate streaming observations
np.random.seed(42)
for i in range(20):
    x = np.random.normal(true_mean, np.sqrt(data_variance))
    estimator.update(x)

    if (i + 1) % 5 == 0:
        ci = estimator.credible_interval()
        print(f"After {i+1:2d} obs: μ = {estimator.mu:.3f} ± {estimator.std:.3f}, "
              f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")

print(f"\nTrue mean: {true_mean}")
```

Gaussian conjugacy extends naturally to multivariate settings, essential for vector observations, regression, and modern machine learning.
Multivariate Gaussian with Known Covariance:
For $\mathbf{x}_1, \ldots, \mathbf{x}_n | \boldsymbol{\mu} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ i.i.d. (known $\boldsymbol{\Sigma}$), the conjugate prior is:
$$\boldsymbol{\mu} \sim \mathcal{N}_d(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$$
The posterior is:
$$\boldsymbol{\mu} | \mathbf{x}_{1:n} \sim \mathcal{N}_d(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$
with:
$$\boldsymbol{\Sigma}_n^{-1} = \boldsymbol{\Sigma}_0^{-1} + n\boldsymbol{\Sigma}^{-1}$$
$$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_n \left( \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 + n\boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} \right)$$
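A compact sketch of this multivariate update (toy two-dimensional numbers, known $\boldsymbol{\Sigma}$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # known observation covariance
mu0 = np.zeros(d)
Sigma0 = 10.0 * np.eye(d)                    # vague prior on the mean vector

X = rng.multivariate_normal([1.0, -2.0], Sigma, size=50)
n, x_bar = len(X), X.mean(axis=0)

Sigma0_inv = np.linalg.inv(Sigma0)
Sigma_inv = np.linalg.inv(Sigma)

post_cov = np.linalg.inv(Sigma0_inv + n * Sigma_inv)              # precisions add
post_mean = post_cov @ (Sigma0_inv @ mu0 + n * Sigma_inv @ x_bar)
print(post_mean)   # close to the sample mean: n = 50 with a vague prior
```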
Unknown Covariance (Normal-Inverse-Wishart):
For unknown covariance matrix $\boldsymbol{\Sigma}$, the conjugate prior is:
$$\boldsymbol{\Sigma} \sim \text{Inverse-Wishart}(\nu_0, \boldsymbol{\Psi}_0)$$ $$\boldsymbol{\mu} | \boldsymbol{\Sigma} \sim \mathcal{N}_d(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}/\kappa_0)$$
This is the multivariate extension of Normal-Inverse-Gamma, called Normal-Inverse-Wishart (NIW).
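A minimal sketch of drawing from an NIW prior with SciPy (the hyperparameter values are illustrative; `scipy.stats.invwishart` is parametrized by degrees of freedom `df` and a `scale` matrix):

```python
import numpy as np
from scipy import stats

d = 2
mu0 = np.zeros(d)
kappa0, nu0 = 1.0, d + 2
Psi0 = np.eye(d)

# Draw Sigma ~ Inverse-Wishart(nu0, Psi0), then mu | Sigma ~ N(mu0, Sigma / kappa0)
Sigma = stats.invwishart(df=nu0, scale=Psi0).rvs()
mu = stats.multivariate_normal(mean=mu0, cov=Sigma / kappa0).rvs()
print(mu)
print(Sigma)
```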
| Unknown Parameters | Conjugate Prior | Posterior | Hyperparameter count |
|---|---|---|---|
| $\mu$ (known $\sigma^2$) | Normal($\mu_0, \sigma_0^2$) | Normal($\mu_n, \sigma_n^2$) | 2 hyperparams |
| $\sigma^2$ (known $\mu$) | Inverse-Gamma($\alpha, \beta$) | Inverse-Gamma($\alpha_n, \beta_n$) | 2 hyperparams |
| $\mu, \sigma^2$ | Normal-Inverse-Gamma | Normal-Inverse-Gamma | 4 hyperparams |
| $\boldsymbol{\mu}$ (known $\boldsymbol{\Sigma}$) | Multivariate Normal | Multivariate Normal | $d + d(d+1)/2$ |
| $\boldsymbol{\Sigma}$ (known $\boldsymbol{\mu}$) | Inverse-Wishart | Inverse-Wishart | $1 + d(d+1)/2$ |
| $\boldsymbol{\mu}, \boldsymbol{\Sigma}$ | Normal-Inverse-Wishart | Normal-Inverse-Wishart | $d + 1 + 1 + d(d+1)/2$ |
Bayesian linear regression with Gaussian prior on weights is a direct application of multivariate Gaussian conjugacy. The posterior over weights is Gaussian, the predictive distribution is Gaussian (or Student-t with unknown variance), and regularization emerges naturally from the prior. This is the foundation of Gaussian Process regression.
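As a minimal sketch of that claim (a toy regression with made-up numbers, not a full treatment): with noise variance $\sigma^2$ and prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$, the posterior over weights is Gaussian and its mean coincides with the ridge solution at $\lambda = \alpha\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25
y = X @ w_true + rng.normal(0, np.sqrt(sigma2), size=n)

alpha = 2.0                                        # prior precision on the weights
# Posterior: N(w_n, S_n) with S_n^{-1} = alpha I + X^T X / sigma^2
S_n = np.linalg.inv(alpha * np.eye(d) + X.T @ X / sigma2)
w_n = S_n @ (X.T @ y) / sigma2

# Ridge solution with lambda = alpha * sigma^2 coincides with w_n
lam = alpha * sigma2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(w_n, w_ridge))                   # True
```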
We have comprehensively explored Gaussian-Gaussian conjugacy, the foundation for Bayesian inference on continuous data. The key insights: precisions add under conjugate updating; the posterior mean shrinks the MLE toward the prior mean, with the prior acting as $n_0 = \sigma^2/\sigma_0^2$ pseudo-observations; ridge regression and L2 regularization correspond to MAP estimation under a Gaussian prior; the Normal-Inverse-Gamma prior handles jointly unknown mean and variance, yielding Student-t marginals and predictives; sequential updating is the gateway to the Kalman filter; and the multivariate extension (Normal-Inverse-Wishart) underlies Bayesian linear regression.
What's Next:
Having mastered the univariate conjugates Beta-Binomial and Gaussian-Gaussian, we now tackle categorical data. The next page covers Dirichlet-Multinomial conjugacy—essential for text models, topic models, mixture models, and any setting with multiple discrete categories.
You now command Gaussian conjugacy—the workhorse for continuous data inference. You understand both the known-variance and unknown-variance cases, can perform sequential updates, recognize the connection to regularization, and appreciate the extension to multiple dimensions. Next, we explore Dirichlet-Multinomial conjugacy for categorical data.