In the previous pages, we focused on estimating GMM parameters given a fixed number of components $K$. But a fundamental question remains: How do we choose $K$ itself? Should we use 3 components or 5? What about 10?
This question cannot be answered by the ordinary likelihood alone—adding components always improves fit. We need a principled way to compare models of different complexity.
The marginal likelihood (also called model evidence or integrated likelihood) provides exactly this. It measures how well a model explains the data after averaging over all possible parameter values, naturally penalizing complexity. This page develops the concept rigorously and connects it to practical model selection.
By the end of this page, you will: (1) Define the marginal likelihood and understand its role in Bayesian inference, (2) Derive the marginal likelihood formula for mixture models, (3) Explain why marginal likelihood naturally balances fit and complexity, (4) Understand the computational challenges in evaluating marginal likelihood, and (5) Connect marginal likelihood to model selection criteria like BIC.
In Bayesian statistics, we treat parameters as random variables with probability distributions. This fundamentally changes how we think about models and model comparison.
The key quantities in Bayesian inference:
Prior $p(\boldsymbol{\theta} \mid \mathcal{M})$: Our belief about parameters before seeing data, under model $\mathcal{M}$.
Likelihood $p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M})$: Probability of data given specific parameter values.
Posterior $p(\boldsymbol{\theta} \mid \mathbf{X}, \mathcal{M})$: Updated belief about parameters after seeing data.
Marginal Likelihood $p(\mathbf{X} \mid \mathcal{M})$: Probability of data under the model, averaging over all parameters.
$$p(\boldsymbol{\theta} \mid \mathbf{X}, \mathcal{M}) = \frac{p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} \mid \mathcal{M})}{p(\mathbf{X} \mid \mathcal{M})}$$
The denominator $p(\mathbf{X} \mid \mathcal{M})$ is the marginal likelihood—it normalizes the posterior to be a valid probability distribution.
Definition of marginal likelihood:
The marginal likelihood is obtained by integrating (marginalizing) over all possible parameter values:
$$p(\mathbf{X} \mid \mathcal{M}) = \int p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} \mid \mathcal{M}) \, d\boldsymbol{\theta}$$
This is the expected likelihood under the prior: $$p(\mathbf{X} \mid \mathcal{M}) = \mathbb{E}_{\boldsymbol{\theta} \sim p(\boldsymbol{\theta} | \mathcal{M})}\left[ p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}) \right]$$
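The expectation form suggests a naive estimator: draw parameters from the prior and average the likelihoods. A minimal sketch for a toy model chosen here (a hypothetical 1D Gaussian with known variance and a conjugate prior, so the exact answer is available for comparison; the model and numbers are illustrative, not from the text):

```python
import numpy as np
from scipy.stats import norm

# Toy model: x_n ~ N(theta, 1) with prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=1.0, size=20)

# Monte Carlo: draw theta from the prior, average the likelihood
# (in log space for numerical stability).
S = 100_000
thetas = rng.normal(0.0, 1.0, size=S)
log_liks = norm.logpdf(X[None, :], thetas[:, None], 1.0).sum(axis=1)
log_ml_mc = np.logaddexp.reduce(log_liks) - np.log(S)

# Exact answer for this conjugate model: jointly, X ~ N(0, I + 11^T).
n = len(X)
cov = np.eye(n) + np.ones((n, n))
_, logdet = np.linalg.slogdet(cov)
log_ml_exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
                - 0.5 * X @ np.linalg.solve(cov, X))

print(log_ml_mc, log_ml_exact)  # the two estimates should agree closely
```

This naive estimator works only because the toy problem is low-dimensional; for mixtures it has hopeless variance, which is why the approximations discussed later are needed.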
Why "marginal"?
The term "marginal" indicates that we've marginalized (integrated out) the parameters. The joint distribution of data and parameters is: $$p(\mathbf{X}, \boldsymbol{\theta} \mid \mathcal{M}) = p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} \mid \mathcal{M})$$
Integrating over $\boldsymbol{\theta}$ gives the marginal distribution of data: $$p(\mathbf{X} \mid \mathcal{M}) = \int p(\mathbf{X}, \boldsymbol{\theta} \mid \mathcal{M}) \, d\boldsymbol{\theta}$$
| Term | Usage Context | Emphasis |
|---|---|---|
| Marginal Likelihood | General Bayesian statistics | Marginalization over parameters |
| Model Evidence | Bayesian model comparison | Evidence for model $\mathcal{M}$ |
| Integrated Likelihood | Statistical literature | Integration over parameter space |
| Normalizing Constant | Posterior computation | Normalizes posterior to integrate to 1 |
The marginal likelihood enables principled comparison between models of different complexity. For two models $\mathcal{M}_1$ and $\mathcal{M}_2$, we can apply Bayes' theorem at the model level:
$$\frac{P(\mathcal{M}_1 \mid \mathbf{X})}{P(\mathcal{M}_2 \mid \mathbf{X})} = \frac{p(\mathbf{X} \mid \mathcal{M}_1)}{p(\mathbf{X} \mid \mathcal{M}_2)} \cdot \frac{P(\mathcal{M}_1)}{P(\mathcal{M}_2)}$$
The ratio $\frac{p(\mathbf{X} \mid \mathcal{M}_1)}{p(\mathbf{X} \mid \mathcal{M}_2)}$ is the Bayes factor, measuring relative evidence from the data.
$$B_{12} = \frac{p(\mathbf{X} \mid \mathcal{M}_1)}{p(\mathbf{X} \mid \mathcal{M}_2)}$$
The Bayes factor quantifies how much the data updates our relative belief in $\mathcal{M}_1$ vs $\mathcal{M}_2$.
| $B_{12}$ | $\log_{10} B_{12}$ | Evidence for $\mathcal{M}_1$ |
|---|---|---|
| $< 1$ | $< 0$ | Evidence against $\mathcal{M}_1$ (supports $\mathcal{M}_2$) |
| $1$ to $3$ | $0$ to $0.5$ | Barely worth mentioning |
| $3$ to $20$ | $0.5$ to $1.3$ | Positive evidence |
| $20$ to $150$ | $1.3$ to $2.2$ | Strong evidence |
| $> 150$ | $> 2.2$ | Very strong evidence |
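Given two log marginal likelihoods, converting to a Bayes factor and a posterior model probability is a one-liner. A small sketch (the helper name and the input values are hypothetical):

```python
import numpy as np

def bayes_factor_report(log_ml_1, log_ml_2, prior_odds=1.0):
    """Compare two models given their log marginal likelihoods."""
    log_B12 = log_ml_1 - log_ml_2        # log Bayes factor
    log10_B12 = log_B12 / np.log(10)     # scale used in the table above
    posterior_odds = np.exp(log_B12) * prior_odds
    # Posterior probability of M1 in a two-model comparison
    p_m1 = posterior_odds / (1.0 + posterior_odds)
    return log10_B12, p_m1

# Hypothetical log marginal likelihoods for two candidate models
log10_B, p_m1 = bayes_factor_report(-410.3, -415.2)
print(f"log10 B12 = {log10_B:.2f}, P(M1 | X) = {p_m1:.4f}")
```

Working with differences of log marginal likelihoods (rather than ratios of raw likelihoods) avoids numerical underflow, since marginal likelihoods themselves are astronomically small numbers.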
Application to GMM selection:
For GMMs, we often compare models with different numbers of components. Let $\mathcal{M}_K$ denote a GMM with $K$ components. To compare $K=3$ vs $K=5$:
$$B_{3,5} = \frac{p(\mathbf{X} \mid \mathcal{M}_3)}{p(\mathbf{X} \mid \mathcal{M}_5)}$$
If $B_{3,5} > 1$, the data favors the simpler 3-component model despite the 5-component model potentially fitting better. This is the automatic Occam's razor effect.
The Occam's razor property:
More complex models spread their prior probability over a larger parameter space. This means the prior density at any particular parameter setting is lower, so unless the extra flexibility is genuinely needed to explain the data, the complex model's marginal likelihood is diluted relative to the simpler model's.
This is not an ad hoc penalty—it emerges naturally from probability theory!
Unlike criteria like AIC that add explicit penalty terms, the marginal likelihood's complexity penalty emerges from the mathematics. A complex model must "pay" for its extra parameters by spreading prior probability over a larger space. If those parameters aren't needed to explain the data, the simpler model wins.
For a GMM with $K$ components, the marginal likelihood requires integrating over all component parameters and the latent assignments. This is a two-level marginalization.
The complete parameter set: $$\boldsymbol{\theta} = \{\pi_1, \ldots, \pi_K, \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K\}$$
The latent variables: $$\mathbf{Z} = \{z_1, \ldots, z_N\}$$
$$p(\mathbf{X} \mid \mathcal{M}_K) = \int \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}, \mathcal{M}_K) \, p(\boldsymbol{\theta} \mid \mathcal{M}_K) \, d\boldsymbol{\theta}$$
Or equivalently, using the incomplete-data likelihood:
$$p(\mathbf{X} \mid \mathcal{M}_K) = \int p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}_K) \, p(\boldsymbol{\theta} \mid \mathcal{M}_K) \, d\boldsymbol{\theta}$$
where $p(\mathbf{X} \mid \boldsymbol{\theta}) = \prod_n \sum_k \pi_k \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.
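The incomplete-data likelihood above is the quantity inside the integral. A small sketch of evaluating it for fixed parameters, using the log-sum-exp trick for stability (the function name and example parameters are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """log p(X | theta) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pi)
    log_terms = np.empty((N, K))
    for k in range(K):
        log_terms[:, k] = (np.log(pi[k])
                           + multivariate_normal.logpdf(X, mus[k], Sigmas[k]))
    # logsumexp over components, then sum over data points
    return logsumexp(log_terms, axis=1).sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_log_likelihood(X, pi, mus, Sigmas))
```

With $K = 1$ this reduces to the log density of a single Gaussian, which is a handy sanity check.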
Specifying priors for GMM parameters:
To compute the marginal likelihood, we need priors on all parameters:
1. Mixing coefficients $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$:
The natural choice is a Dirichlet prior, which is conjugate to the categorical distribution: $$\boldsymbol{\pi} \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)$$
Symmetric choice: $\alpha_k = \alpha_0 / K$ for all $k$, where $\alpha_0$ is the concentration parameter.
2. Component means $\boldsymbol{\mu}_k$:
Typically Gaussian prior: $$\boldsymbol{\mu}_k \sim \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0)$$
where $\mathbf{m}_0$ and $\mathbf{S}_0$ are hyperparameters (e.g., sample mean and scaled sample covariance).
3. Component covariances $\boldsymbol{\Sigma}_k$:
The conjugate prior is inverse-Wishart: $$\boldsymbol{\Sigma}_k \sim \mathcal{W}^{-1}(\boldsymbol{\Psi}_0, \nu_0)$$
where $\boldsymbol{\Psi}_0$ is the scale matrix and $\nu_0$ is the degrees of freedom.
Normal-inverse-Wishart (NIW): For computational convenience, we often use the joint prior: $$(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \sim \text{NIW}(\mathbf{m}_0, \kappa_0, \boldsymbol{\Psi}_0, \nu_0)$$
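A useful way to understand these priors is to sample from them. A minimal sketch drawing one set of GMM parameters from the priors just described (the hyperparameter values are illustrative choices, not prescriptions):

```python
import numpy as np
from scipy.stats import dirichlet, invwishart, multivariate_normal

rng = np.random.default_rng(0)
K, D = 3, 2
alpha0, kappa0, nu0 = 1.0, 1.0, D + 2   # illustrative hyperparameters
m0, Psi0 = np.zeros(D), np.eye(D)

# Mixing weights: symmetric Dirichlet with alpha_k = alpha0 / K
pi = dirichlet.rvs([alpha0 / K] * K, random_state=rng)[0]

# (mu_k, Sigma_k) from a Normal-inverse-Wishart prior:
# Sigma_k ~ IW(Psi0, nu0), then mu_k | Sigma_k ~ N(m0, Sigma_k / kappa0)
Sigmas = [invwishart.rvs(df=nu0, scale=Psi0, random_state=rng)
          for _ in range(K)]
mus = [multivariate_normal.rvs(mean=m0, cov=S / kappa0, random_state=rng)
       for S in Sigmas]

print(np.round(pi, 3))
```

Sampling parameters like this and plotting the resulting mixture densities is a standard prior predictive check: it shows what kinds of data the model considers plausible before seeing any observations.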
| Parameter | Prior Distribution | Hyperparameters |
|---|---|---|
| $\boldsymbol{\pi}$ | Dirichlet$(\alpha_1, \ldots, \alpha_K)$ | $\alpha_k$ (concentration for each component) |
| $\boldsymbol{\mu}_k$ | Normal$(\mathbf{m}_0, \mathbf{S}_0)$ | $\mathbf{m}_0$ (prior mean), $\mathbf{S}_0$ (prior covariance) |
| $\boldsymbol{\Sigma}_k$ | Inverse-Wishart$(\boldsymbol{\Psi}_0, \nu_0)$ | $\boldsymbol{\Psi}_0$ (scale matrix), $\nu_0$ (degrees of freedom) |
| $(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | NIW$(\mathbf{m}_0, \kappa_0, \boldsymbol{\Psi}_0, \nu_0)$ | Combined hyperparameters |
The marginal likelihood integral for GMMs is analytically intractable—there is no closed-form solution. This is a fundamental obstacle that requires approximation methods.
Why is the integral intractable?
The integrand involves: $$p(\mathbf{X} \mid \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta}) = \left[ \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right] \cdot p(\boldsymbol{\theta})$$
The product of sums doesn't factor nicely. Expanding the product gives $K^N$ terms—each representing a different complete-data configuration. Integrating exactly would require computing $K^N$ separate integrals, which is exponentially expensive.
For $N = 100$ observations and $K = 3$ components, exact computation would require evaluating $3^{100} \approx 5 \times 10^{47}$ integrals. This is computationally impossible, necessitating approximation methods.
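The equivalence between the product of sums and the sum over all $K^N$ assignments can be verified directly for a tiny problem where brute force is feasible (the specific toy parameters below are illustrative):

```python
import numpy as np
from itertools import product
from scipy.stats import norm

# Tiny 1D example: N = 6, K = 2, so only 2^6 = 64 assignments.
rng = np.random.default_rng(0)
N, K = 6, 2
X = rng.normal(size=N)
pi = np.array([0.4, 0.6])
mus = np.array([-1.0, 1.0])

# Per-point weighted densities pi_k * N(x_n | mu_k, 1), shape (N, K)
dens = pi * norm.pdf(X[:, None], loc=mus[None, :], scale=1.0)

# Incomplete-data likelihood: product over n of sum over k
incomplete = np.prod(dens.sum(axis=1))

# Brute force: sum over all K^N complete-data configurations
brute = sum(np.prod([dens[n, z[n]] for n in range(N)])
            for z in product(range(K), repeat=N))

print(incomplete, brute)  # identical up to floating-point rounding
```

The brute-force sum already has 64 terms at $N = 6$; doubling $N$ squares the count, which is why exact marginalization is hopeless at realistic sample sizes.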
Contrast with simpler models:
For a single Gaussian (not a mixture), the marginal likelihood has a closed form when using conjugate priors. If the joint prior is Normal-inverse-Wishart, $(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \sim \text{NIW}(\mathbf{m}_0, \kappa_0, \boldsymbol{\Psi}_0, \nu_0)$:
Then the marginal likelihood is: $$p(\mathbf{X}) = \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)} \cdot \frac{|\boldsymbol{\Psi}_0|^{\nu_0/2}}{|\boldsymbol{\Psi}_N|^{\nu_N/2}} \cdot \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2} \cdot \pi^{-ND/2}$$
where $\Gamma_D$ is the multivariate gamma function and $(\mathbf{m}_N, \kappa_N, \boldsymbol{\Psi}_N, \nu_N)$ are the posterior hyperparameters.
This closed form exists because there are no latent variables to marginalize over—the key difference from mixtures.
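The closed form above can be evaluated directly in log space. A sketch assuming the standard conjugate posterior updates for the NIW hyperparameters (the updates are not derived in the text; they are the usual $\kappa_N = \kappa_0 + N$, $\nu_N = \nu_0 + N$, scatter-plus-shrinkage form):

```python
import numpy as np
from scipy.special import multigammaln

def log_ml_gaussian_niw(X, m0, kappa0, Psi0, nu0):
    """Log marginal likelihood of X under a single Gaussian with NIW prior."""
    N, D = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)            # centered scatter matrix
    kappaN = kappa0 + N
    nuN = nu0 + N
    diff = (xbar - m0)[:, None]
    PsiN = Psi0 + S + (kappa0 * N / kappaN) * (diff @ diff.T)

    _, logdet0 = np.linalg.slogdet(Psi0)
    _, logdetN = np.linalg.slogdet(PsiN)
    # Term-by-term log of the closed-form expression above
    return (multigammaln(nuN / 2, D) - multigammaln(nu0 / 2, D)
            + (nu0 / 2) * logdet0 - (nuN / 2) * logdetN
            + (D / 2) * (np.log(kappa0) - np.log(kappaN))
            - (N * D / 2) * np.log(np.pi))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
log_ml = log_ml_gaussian_niw(X, np.zeros(2), 1.0, np.eye(2), 4.0)
print(log_ml)
```

No optimization, sampling, or latent-variable bookkeeping is required; this is the baseline that the mixture case loses.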
Since exact computation is intractable, we use approximations. Several approaches have been developed, each with trade-offs between accuracy and computational cost.
The Laplace approximation uses a second-order Taylor expansion of the log posterior around the mode $\hat{\boldsymbol{\theta}}$:
$$\log p(\mathbf{X} \mid \mathcal{M}) \approx \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{D_{\theta}}{2} \log(2\pi) - \frac{1}{2} \log |\mathbf{H}|$$
where $\mathbf{H}$ is the Hessian of the negative log posterior at $\hat{\boldsymbol{\theta}}$, and $D_{\theta}$ is the number of parameters.
Laplace approximation in detail:
Let $f(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})$ be the log joint (unnormalized log posterior).
Expanding around the mode $\hat{\boldsymbol{\theta}}$ (where $\nabla f = 0$): $$f(\boldsymbol{\theta}) \approx f(\hat{\boldsymbol{\theta}}) - \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^\top \mathbf{H} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$$
where $\mathbf{H} = -\nabla^2 f(\hat{\boldsymbol{\theta}})$ is the observed information matrix.
The integral becomes: $$p(\mathbf{X}) \approx e^{f(\hat{\boldsymbol{\theta}})} \int \exp\left( -\frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^\top \mathbf{H} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) \right) d\boldsymbol{\theta}$$ $$= e^{f(\hat{\boldsymbol{\theta}})} \cdot (2\pi)^{D/2} |\mathbf{H}|^{-1/2}$$
Taking logarithms: $$\log p(\mathbf{X}) \approx f(\hat{\boldsymbol{\theta}}) + \frac{D}{2} \log(2\pi) - \frac{1}{2} \log|\mathbf{H}|$$
```python
import numpy as np
from scipy.optimize import minimize

def laplace_marginal_likelihood(X, K):
    """
    Approximate the log marginal likelihood of a GMM via the Laplace
    approximation. This is a simplified illustration - a full
    implementation would need proper GMM parameterization and
    constraint handling.

    Parameters
    ----------
    X : array, shape (N, D)
        Observed data.
    K : int
        Number of components.

    Returns
    -------
    log_ml : float
        Approximate log marginal likelihood.
    """
    N, D = X.shape

    def neg_log_joint(params):
        """Negative log joint (likelihood times prior)."""
        idx = 0
        # Mixing weights via softmax for unconstrained optimization
        logits = params[idx:idx + K]
        pi = np.exp(logits - np.max(logits))
        pi = pi / pi.sum()
        idx += K
        means = params[idx:idx + K * D].reshape(K, D)
        idx += K * D
        log_sigmas = params[idx:idx + K * D].reshape(K, D)
        sigmas = np.exp(log_sigmas)  # diagonal variances, kept positive

        # Log-likelihood via log-sum-exp over components
        log_lik = 0.0
        for n in range(N):
            component_logs = np.zeros(K)
            for k in range(K):
                diff = X[n] - means[k]
                log_p = (-0.5 * np.sum(diff**2 / sigmas[k])
                         - 0.5 * np.sum(np.log(sigmas[k]))
                         - 0.5 * D * np.log(2 * np.pi))
                component_logs[k] = np.log(pi[k]) + log_p
            m = np.max(component_logs)
            log_lik += m + np.log(np.sum(np.exp(component_logs - m)))

        # Weak Gaussian prior on the unconstrained parameters
        log_prior = -0.01 * np.sum(params**2)
        return -(log_lik + log_prior)

    # Find the MAP estimate in the unconstrained parameterization
    np.random.seed(42)
    n_free = K + K * D + K * D  # logits, means, log-variances
    init_params = np.random.randn(n_free) * 0.1
    result = minimize(neg_log_joint, init_params, method='L-BFGS-B')
    theta_hat = result.x

    # Numerical Hessian of the negative log joint at the mode
    eps = 1e-5
    H = np.zeros((n_free, n_free))
    for i in range(n_free):
        for j in range(i, n_free):
            t_pp = theta_hat.copy(); t_pp[i] += eps; t_pp[j] += eps
            t_pm = theta_hat.copy(); t_pm[i] += eps; t_pm[j] -= eps
            t_mp = theta_hat.copy(); t_mp[i] -= eps; t_mp[j] += eps
            t_mm = theta_hat.copy(); t_mm[i] -= eps; t_mm[j] -= eps
            H[i, j] = (neg_log_joint(t_pp) - neg_log_joint(t_pm)
                       - neg_log_joint(t_mp) + neg_log_joint(t_mm)) / (4 * eps**2)
            H[j, i] = H[i, j]

    # Laplace: log p(X) ~ f(theta_hat) + (d/2) log(2 pi) - (1/2) log |H|,
    # where d matches the dimension of the Hessian
    log_joint_at_map = -result.fun
    sign, log_det_H = np.linalg.slogdet(H)
    return log_joint_at_map + 0.5 * n_free * np.log(2 * np.pi) - 0.5 * log_det_H

# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.multivariate_normal([0, 0], np.eye(2) * 0.5, 50),
    np.random.multivariate_normal([3, 0], np.eye(2) * 0.5, 50),
])
for K in [1, 2, 3]:
    log_ml = laplace_marginal_likelihood(X, K)
    print(f"K={K}: Approximate log marginal likelihood = {log_ml:.2f}")
```

The Bayesian Information Criterion (BIC) is a widely used approximation to the log marginal likelihood. It is simpler to compute than the full Laplace approximation because it does not require the Hessian.
Derivation from Laplace:
Starting from Laplace: $$\log p(\mathbf{X}) \approx \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{D_{\theta}}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|$$
For large $N$, the log-likelihood dominates and the Hessian scales approximately as $N \cdot \mathbf{I}$ (Fisher information scaling). Taking leading-order terms: $$\log p(\mathbf{X}) \approx \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) - \frac{D_{\theta}}{2} \log N + O(1)$$
$$\text{BIC} = -2 \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) + D_{\theta} \log N$$
Or equivalently (for model comparison via approximating log marginal likelihood):
$$\log p(\mathbf{X} \mid \mathcal{M}) \approx \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) - \frac{D_{\theta}}{2} \log N$$
Lower BIC indicates a better model. The term $D_{\theta} \log N$ penalizes complexity.
Computing BIC for GMMs:
For a GMM with $K$ components in $D$ dimensions with full covariances, the free parameters are: $K - 1$ mixing coefficients (the constraint $\sum_k \pi_k = 1$ removes one), $KD$ mean entries, and $K \cdot D(D+1)/2$ entries for the symmetric covariance matrices.
Total: $D_{\theta} = (K-1) + KD + K \cdot D(D+1)/2$
$$\text{BIC} = -2 \sum_{n=1}^{N} \log\left( \sum_{k=1}^{K} \hat{\pi}_k \mathcal{N}(\mathbf{x}_n \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k) \right) + D_{\theta} \log N$$
where $\hat{\boldsymbol{\theta}} = (\hat{\pi}_k, \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)$ are the maximum likelihood estimates (typically obtained via EM).
```python
import numpy as np
from sklearn.mixture import GaussianMixture

def compute_bic_for_gmm(X, K_range):
    """
    Compute BIC for GMMs with different numbers of components.

    Parameters
    ----------
    X : array, shape (N, D)
        Observed data.
    K_range : iterable of int
        Values of K to try.

    Returns
    -------
    results : dict
        Maps K to a dict with BIC, log-likelihood, and parameter count.
    """
    N, D = X.shape
    results = {}
    for K in K_range:
        # Fit GMM by EM, with multiple restarts
        gmm = GaussianMixture(
            n_components=K,
            covariance_type='full',
            n_init=5,
            max_iter=200,
            random_state=42,
        )
        gmm.fit(X)

        # Log-likelihood at the MLE (sklearn's score() is the per-sample average)
        log_lik = gmm.score(X) * N

        # Parameter count: mixing weights, means, full covariances
        n_params = (K - 1) + K * D + K * D * (D + 1) // 2

        results[K] = {
            'bic': -2 * log_lik + n_params * np.log(N),
            'log_likelihood': log_lik,
            'n_params': n_params,
            'sklearn_bic': gmm.bic(X),  # cross-check with sklearn's implementation
        }
    return results

# Example: find the optimal K
np.random.seed(42)

# Generate data from a 3-component GMM
true_K = 3
true_means = [np.array([0, 0]), np.array([4, 0]), np.array([2, 3])]
true_covs = [0.5 * np.eye(2) for _ in range(true_K)]
true_weights = [0.3, 0.4, 0.3]
X = np.vstack([
    np.random.multivariate_normal(true_means[k], true_covs[k],
                                  int(200 * true_weights[k]))
    for k in range(true_K)
])

# Compute BIC for different K
results = compute_bic_for_gmm(X, range(1, 7))

print("BIC Model Selection Results:")
print("-" * 60)
print(f"{'K':>3} {'BIC':>12} {'Log-Lik':>12} {'# Params':>10}")
print("-" * 60)
for K, res in results.items():
    print(f"{K:3d} {res['bic']:12.2f} {res['log_likelihood']:12.2f} "
          f"{res['n_params']:10d}")
print("-" * 60)
best_K = min(results, key=lambda K: results[K]['bic'])
print(f"Best model: K = {best_K} (True K = {true_K})")
```

Variational Bayes provides another approach to approximating the marginal likelihood.
Instead of a point approximation (like Laplace), it finds a tractable distribution that approximates the posterior and provides a lower bound on the log marginal likelihood.
The Evidence Lower Bound (ELBO) is fundamental to variational inference and connects to the EM algorithm.
For any distribution $q(\boldsymbol{\theta}, \mathbf{Z})$ over parameters and latent variables:
$$\log p(\mathbf{X}) = \text{ELBO}(q) + \text{KL}(q \,\|\, p(\boldsymbol{\theta}, \mathbf{Z} \mid \mathbf{X}))$$
where: $$\text{ELBO}(q) = \mathbb{E}_q[\log p(\mathbf{X}, \boldsymbol{\theta}, \mathbf{Z})] - \mathbb{E}_q[\log q(\boldsymbol{\theta}, \mathbf{Z})]$$
Since KL divergence is non-negative, ELBO $\leq \log p(\mathbf{X})$.
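The bound is easy to verify numerically on a problem small enough to compute $\log p(\mathbf{X})$ exactly. A toy sketch with a single binary latent variable and no parameters (the probabilities chosen are illustrative):

```python
import numpy as np

# Toy model: latent z in {0, 1} with prior p(z), likelihood p(x | z)
# for one observed x. Then log p(x) = log sum_z p(x|z) p(z).
p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([0.9, 0.2])
log_evidence = np.log(np.sum(p_x_given_z * p_z))

def elbo(q):
    """E_q[log p(x, z)] - E_q[log q(z)] for a distribution q over z."""
    log_joint = np.log(p_x_given_z * p_z)
    return np.sum(q * (log_joint - np.log(q)))

# Any q gives a lower bound on the log evidence
for q0 in [0.1, 0.5, 0.9]:
    q = np.array([q0, 1 - q0])
    print(f"q(z=0)={q0}: ELBO = {elbo(q):.4f} <= log p(x) = {log_evidence:.4f}")

# The bound is tight exactly when q equals the posterior p(z | x)
posterior = p_x_given_z * p_z / np.sum(p_x_given_z * p_z)
print(f"ELBO at posterior: {elbo(posterior):.4f}")
```

The gap between the ELBO and the log evidence is precisely the KL divergence in the identity above, so maximizing the ELBO over $q$ is equivalent to minimizing that KL divergence.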
Understanding the ELBO:
Rewriting the ELBO: $$\text{ELBO} = \mathbb{E}_q[\log p(\mathbf{X} \mid \boldsymbol{\theta}, \mathbf{Z})] - \text{KL}(q(\boldsymbol{\theta}, \mathbf{Z}) \,\|\, p(\boldsymbol{\theta}, \mathbf{Z}))$$
This reveals two components: an expected log-likelihood term that rewards fitting the data, and a KL term that penalizes approximations straying far from the prior.
Connection to EM:
The EM algorithm can be viewed as maximizing a restricted ELBO where: $q$ is a distribution over the latent assignments $\mathbf{Z}$ only, and the parameters $\boldsymbol{\theta}$ are kept as a point estimate rather than given a distribution.
Variational Bayes extends this by: placing a (typically factorized) distribution over the parameters as well, so that both $\boldsymbol{\theta}$ and $\mathbf{Z}$ are approximately marginalized out.
Using ELBO for model selection:
Since ELBO $\leq \log p(\mathbf{X})$, the ELBO provides a lower bound on model evidence. Higher ELBO suggests better model fit (subject to approximation tightness).
However, ELBO comparisons across models require that approximations be similarly tight—this is not always guaranteed.
The ELBO is extensively used in modern machine learning, particularly for Variational Autoencoders (VAEs) and Bayesian neural networks. For GMMs, variational Bayes with a fully factorized approximation yields closed-form update equations similar to EM but with automatic relevance determination—components can be "turned off" if not needed.
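scikit-learn's `BayesianGaussianMixture` implements variational inference for GMMs, and the component-shrinkage behavior just described can be observed directly. A sketch (the data, `weight_concentration_prior` value, and the 0.05 weight threshold for counting "effective" components are illustrative choices):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two well-separated clusters, but allow up to 6 components
X = np.vstack([
    rng.normal([0, 0], 0.4, size=(100, 2)),
    rng.normal([4, 0], 0.4, size=(100, 2)),
])

# A small weight_concentration_prior encourages unused components
# to receive negligible weight ("turned off")
vb = BayesianGaussianMixture(
    n_components=6,
    weight_concentration_prior=0.01,
    covariance_type='full',
    max_iter=500,
    random_state=0,
)
vb.fit(X)

effective = int(np.sum(vb.weights_ > 0.05))
print("Fitted weights:", np.round(vb.weights_, 3))
print("Effective components:", effective)
```

In contrast to fitting a separate `GaussianMixture` for each $K$ and comparing BIC values, a single variational fit with generous `n_components` can prune the excess components automatically.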
This page has developed the theory and practice of marginal likelihood for Gaussian Mixture Model selection. Let's consolidate the key insights:
| Method | Computational Cost | Accuracy | Requirements |
|---|---|---|---|
| Exact Marginal Likelihood | Intractable | Exact | Conjugate priors (not possible for GMM) |
| Laplace Approximation | Moderate (needs Hessian) | Good for large $N$ | MAP + Hessian computation |
| BIC | Low (just MLE) | Asymptotically correct | Only MLE parameters |
| Variational ELBO | Moderate (iterative) | Lower bound | Tractable $q$ distribution |
| Monte Carlo | High | Converges to true value | Sampling scheme design |
Connections forward:
Page 3: Identifiability — The label switching symmetry affects marginal likelihood computation and poses challenges for Monte Carlo approaches.
Page 4: Model Selection — Combines marginal likelihood with practical criteria (AIC, cross-validation) for choosing $K$.
Module 4: EM Algorithm — The ELBO perspective illuminates why EM increases likelihood at each iteration.
You now understand the marginal likelihood for Gaussian Mixture Models: its theoretical foundation in Bayesian model comparison, why it automatically penalizes complexity, why exact computation is intractable, and how approximations like BIC and ELBO provide practical alternatives. This foundation is essential for principled model selection in mixture modeling.