In the previous pages, we focused on estimating GMM parameters given a fixed number of components $K$. But a fundamental question remains: How do we choose $K$ itself? Should we use 3 components or 5? What about 10?
This question cannot be answered by the ordinary likelihood alone—adding components always improves fit. We need a principled way to compare models of different complexity.
The marginal likelihood (also called model evidence or integrated likelihood) provides exactly this. It measures how well a model explains the data after averaging over all possible parameter values, naturally penalizing complexity. This page develops the concept rigorously and connects it to practical model selection.
By the end of this page, you will: (1) Define the marginal likelihood and understand its role in Bayesian inference, (2) Derive the marginal likelihood formula for mixture models, (3) Explain why marginal likelihood naturally balances fit and complexity, (4) Understand the computational challenges in evaluating marginal likelihood, and (5) Connect marginal likelihood to model selection criteria like BIC.
In Bayesian statistics, we treat parameters as random variables with probability distributions. This fundamentally changes how we think about models and model comparison.
The key quantities in Bayesian inference:
Prior $p(\boldsymbol{\theta} \mid \mathcal{M})$: Our belief about parameters before seeing data, under model $\mathcal{M}$.
Likelihood $p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M})$: Probability of data given specific parameter values.
Posterior $p(\boldsymbol{\theta} \mid \mathbf{X}, \mathcal{M})$: Updated belief about parameters after seeing data.
Marginal Likelihood $p(\mathbf{X} \mid \mathcal{M})$: Probability of data under the model, averaging over all parameters.
$$p(\boldsymbol{\theta} \mid \mathbf{X}, \mathcal{M}) = \frac{p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} \mid \mathcal{M})}{p(\mathbf{X} \mid \mathcal{M})}$$
The denominator $p(\mathbf{X} \mid \mathcal{M})$ is the marginal likelihood—it normalizes the posterior to be a valid probability distribution.
Definition of marginal likelihood:
The marginal likelihood is obtained by integrating (marginalizing) over all possible parameter values:
$$p(\mathbf{X} \mid \mathcal{M}) = \int p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} \mid \mathcal{M}) \, d\boldsymbol{\theta}$$
This is the expected likelihood under the prior: $$p(\mathbf{X} \mid \mathcal{M}) = \mathbb{E}_{\boldsymbol{\theta} \sim p(\boldsymbol{\theta} | \mathcal{M})}\left[ p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}) \right]$$
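The expectation form suggests a naive estimator: draw parameters from the prior and average the likelihoods. A minimal sketch for a toy model chosen here (a hypothetical 1D Gaussian with known variance and a conjugate prior, so the exact answer is available for comparison; the model and numbers are illustrative, not from the text):

```python
import numpy as np
from scipy.stats import norm

# Toy model: x_n ~ N(theta, 1) with prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=1.0, size=20)

# Monte Carlo: draw theta from the prior, average the likelihood
# (in log space for numerical stability).
S = 100_000
thetas = rng.normal(0.0, 1.0, size=S)
log_liks = norm.logpdf(X[None, :], thetas[:, None], 1.0).sum(axis=1)
log_ml_mc = np.logaddexp.reduce(log_liks) - np.log(S)

# Exact answer for this conjugate model: jointly, X ~ N(0, I + 11^T).
n = len(X)
cov = np.eye(n) + np.ones((n, n))
_, logdet = np.linalg.slogdet(cov)
log_ml_exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
                - 0.5 * X @ np.linalg.solve(cov, X))

print(log_ml_mc, log_ml_exact)  # the two estimates should agree closely
```

This naive estimator works only because the toy problem is low-dimensional; for mixtures it has hopeless variance, which is why the approximations discussed later are needed.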
Why "marginal"?
The term "marginal" indicates that we've marginalized (integrated out) the parameters. The joint distribution of data and parameters is: $$p(\mathbf{X}, \boldsymbol{\theta} \mid \mathcal{M}) = p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} \mid \mathcal{M})$$
Integrating over $\boldsymbol{\theta}$ gives the marginal distribution of data: $$p(\mathbf{X} \mid \mathcal{M}) = \int p(\mathbf{X}, \boldsymbol{\theta} \mid \mathcal{M}) \, d\boldsymbol{\theta}$$
| Term | Usage Context | Emphasis |
|---|---|---|
| Marginal Likelihood | General Bayesian statistics | Marginalization over parameters |
| Model Evidence | Bayesian model comparison | Evidence for model $\mathcal{M}$ |
| Integrated Likelihood | Statistical literature | Integration over parameter space |
| Normalizing Constant | Posterior computation | Normalizes posterior to integrate to 1 |
The marginal likelihood enables principled comparison between models of different complexity. For two models $\mathcal{M}_1$ and $\mathcal{M}_2$, we can apply Bayes' theorem at the model level:
$$\frac{P(\mathcal{M}_1 \mid \mathbf{X})}{P(\mathcal{M}_2 \mid \mathbf{X})} = \frac{p(\mathbf{X} \mid \mathcal{M}_1)}{p(\mathbf{X} \mid \mathcal{M}_2)} \cdot \frac{P(\mathcal{M}_1)}{P(\mathcal{M}_2)}$$
The ratio $\frac{p(\mathbf{X} \mid \mathcal{M}_1)}{p(\mathbf{X} \mid \mathcal{M}_2)}$ is the Bayes factor, measuring relative evidence from the data.
$$B_{12} = \frac{p(\mathbf{X} \mid \mathcal{M}_1)}{p(\mathbf{X} \mid \mathcal{M}_2)}$$
The Bayes factor quantifies how much the data updates our relative belief in $\mathcal{M}_1$ vs $\mathcal{M}_2$.
| $B_{12}$ | $\log_{10} B_{12}$ | Evidence for $\mathcal{M}_1$ |
|---|---|---|
| $< 1$ | $< 0$ | Evidence against $\mathcal{M}_1$ (supports $\mathcal{M}_2$) |
| $1$ to $3$ | $0$ to $0.5$ | Barely worth mentioning |
| $3$ to $20$ | $0.5$ to $1.3$ | Positive evidence |
| $20$ to $150$ | $1.3$ to $2.2$ | Strong evidence |
| $> 150$ | $> 2.2$ | Very strong evidence |
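Given two log marginal likelihoods, converting to a Bayes factor and a posterior model probability is a one-liner. A small sketch (the helper name and the input values are hypothetical):

```python
import numpy as np

def bayes_factor_report(log_ml_1, log_ml_2, prior_odds=1.0):
    """Compare two models given their log marginal likelihoods."""
    log_B12 = log_ml_1 - log_ml_2        # log Bayes factor
    log10_B12 = log_B12 / np.log(10)     # scale used in the table above
    posterior_odds = np.exp(log_B12) * prior_odds
    # Posterior probability of M1 in a two-model comparison
    p_m1 = posterior_odds / (1.0 + posterior_odds)
    return log10_B12, p_m1

# Hypothetical log marginal likelihoods for two candidate models
log10_B, p_m1 = bayes_factor_report(-410.3, -415.2)
print(f"log10 B12 = {log10_B:.2f}, P(M1 | X) = {p_m1:.4f}")
```

Working with differences of log marginal likelihoods (rather than ratios of raw likelihoods) avoids numerical underflow, since marginal likelihoods themselves are astronomically small numbers.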
Application to GMM selection:
For GMMs, we often compare models with different numbers of components. Let $\mathcal{M}_K$ denote a GMM with $K$ components. To compare $K=3$ vs $K=5$:
$$B_{3,5} = \frac{p(\mathbf{X} \mid \mathcal{M}_3)}{p(\mathbf{X} \mid \mathcal{M}_5)}$$
If $B_{3,5} > 1$, the data favors the simpler 3-component model despite the 5-component model potentially fitting better. This is the automatic Occam's razor effect.
The Occam's razor property:
More complex models spread their prior probability over a larger parameter space. This means the prior density at any particular parameter setting is lower, so unless the extra flexibility is genuinely needed to explain the data, the complex model's marginal likelihood is diluted relative to the simpler model's.
This is not an ad hoc penalty—it emerges naturally from probability theory!
Unlike criteria like AIC that add explicit penalty terms, the marginal likelihood's complexity penalty emerges from the mathematics. A complex model must "pay" for its extra parameters by spreading prior probability over a larger space. If those parameters aren't needed to explain the data, the simpler model wins.
For a GMM with $K$ components, the marginal likelihood requires integrating over all component parameters and the latent assignments. This is a two-level marginalization.
The complete parameter set: $$\boldsymbol{\theta} = \{\pi_1, \ldots, \pi_K, \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K\}$$
The latent variables: $$\mathbf{Z} = \{z_1, \ldots, z_N\}$$
$$p(\mathbf{X} \mid \mathcal{M}_K) = \int \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}, \mathcal{M}_K) \, p(\boldsymbol{\theta} \mid \mathcal{M}_K) \, d\boldsymbol{\theta}$$
Or equivalently, using the incomplete-data likelihood:
$$p(\mathbf{X} \mid \mathcal{M}_K) = \int p(\mathbf{X} \mid \boldsymbol{\theta}, \mathcal{M}_K) \, p(\boldsymbol{\theta} \mid \mathcal{M}_K) \, d\boldsymbol{\theta}$$
where $p(\mathbf{X} \mid \boldsymbol{\theta}) = \prod_n \sum_k \pi_k \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.
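The incomplete-data likelihood above is the quantity inside the integral. A small sketch of evaluating it for fixed parameters, using the log-sum-exp trick for stability (the function name and example parameters are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """log p(X | theta) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pi)
    log_terms = np.empty((N, K))
    for k in range(K):
        log_terms[:, k] = (np.log(pi[k])
                           + multivariate_normal.logpdf(X, mus[k], Sigmas[k]))
    # logsumexp over components, then sum over data points
    return logsumexp(log_terms, axis=1).sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_log_likelihood(X, pi, mus, Sigmas))
```

With $K = 1$ this reduces to the log density of a single Gaussian, which is a handy sanity check.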
Specifying priors for GMM parameters:
To compute the marginal likelihood, we need priors on all parameters:
1. Mixing coefficients $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$:
The natural choice is a Dirichlet prior, which is conjugate to the categorical distribution: $$\boldsymbol{\pi} \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)$$
Symmetric choice: $\alpha_k = \alpha_0 / K$ for all $k$, where $\alpha_0$ is the concentration parameter.
2. Component means $\boldsymbol{\mu}_k$:
Typically Gaussian prior: $$\boldsymbol{\mu}_k \sim \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0)$$
where $\mathbf{m}_0$ and $\mathbf{S}_0$ are hyperparameters (e.g., sample mean and scaled sample covariance).
3. Component covariances $\boldsymbol{\Sigma}_k$:
The conjugate prior is inverse-Wishart: $$\boldsymbol{\Sigma}_k \sim \mathcal{W}^{-1}(\boldsymbol{\Psi}_0, \nu_0)$$
where $\boldsymbol{\Psi}_0$ is the scale matrix and $\nu_0$ is the degrees of freedom.
Normal-inverse-Wishart (NIW): For computational convenience, we often use the joint prior: $$(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \sim \text{NIW}(\mathbf{m}_0, \kappa_0, \boldsymbol{\Psi}_0, \nu_0)$$
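A useful way to understand these priors is to sample from them. A minimal sketch drawing one set of GMM parameters from the priors just described (the hyperparameter values are illustrative choices, not prescriptions):

```python
import numpy as np
from scipy.stats import dirichlet, invwishart, multivariate_normal

rng = np.random.default_rng(0)
K, D = 3, 2
alpha0, kappa0, nu0 = 1.0, 1.0, D + 2   # illustrative hyperparameters
m0, Psi0 = np.zeros(D), np.eye(D)

# Mixing weights: symmetric Dirichlet with alpha_k = alpha0 / K
pi = dirichlet.rvs([alpha0 / K] * K, random_state=rng)[0]

# (mu_k, Sigma_k) from a Normal-inverse-Wishart prior:
# Sigma_k ~ IW(Psi0, nu0), then mu_k | Sigma_k ~ N(m0, Sigma_k / kappa0)
Sigmas = [invwishart.rvs(df=nu0, scale=Psi0, random_state=rng)
          for _ in range(K)]
mus = [multivariate_normal.rvs(mean=m0, cov=S / kappa0, random_state=rng)
       for S in Sigmas]

print(np.round(pi, 3))
```

Sampling parameters like this and plotting the resulting mixture densities is a standard prior predictive check: it shows what kinds of data the model considers plausible before seeing any observations.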
| Parameter | Prior Distribution | Hyperparameters |
|---|---|---|
| $\boldsymbol{\pi}$ | Dirichlet$(\alpha_1, \ldots, \alpha_K)$ | $\alpha_k$ (concentration for each component) |
| $\boldsymbol{\mu}_k$ | Normal$(\mathbf{m}_0, \mathbf{S}_0)$ | $\mathbf{m}_0$ (prior mean), $\mathbf{S}_0$ (prior covariance) |
| $\boldsymbol{\Sigma}_k$ | Inverse-Wishart$(\boldsymbol{\Psi}_0, \nu_0)$ | $\boldsymbol{\Psi}_0$ (scale matrix), $\nu_0$ (degrees of freedom) |
| $(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | NIW$(\mathbf{m}_0, \kappa_0, \boldsymbol{\Psi}_0, \nu_0)$ | Combined hyperparameters |
The marginal likelihood integral for GMMs is analytically intractable—there is no closed-form solution. This is a fundamental obstacle that requires approximation methods.
Why is the integral intractable?
The integrand involves: $$p(\mathbf{X} \mid \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta}) = \left[ \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right] \cdot p(\boldsymbol{\theta})$$
The product of sums doesn't factor nicely. Expanding the product gives $K^N$ terms—each representing a different complete-data configuration. Integrating exactly would require computing $K^N$ separate integrals, which is exponentially expensive.
For $N = 100$ observations and $K = 3$ components, exact computation would require evaluating $3^{100} \approx 5 \times 10^{47}$ integrals. This is computationally impossible, necessitating approximation methods.
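The equivalence between the product of sums and the sum over all $K^N$ assignments can be verified directly for a tiny problem where brute force is feasible (the specific toy parameters below are illustrative):

```python
import numpy as np
from itertools import product
from scipy.stats import norm

# Tiny 1D example: N = 6, K = 2, so only 2^6 = 64 assignments.
rng = np.random.default_rng(0)
N, K = 6, 2
X = rng.normal(size=N)
pi = np.array([0.4, 0.6])
mus = np.array([-1.0, 1.0])

# Per-point weighted densities pi_k * N(x_n | mu_k, 1), shape (N, K)
dens = pi * norm.pdf(X[:, None], loc=mus[None, :], scale=1.0)

# Incomplete-data likelihood: product over n of sum over k
incomplete = np.prod(dens.sum(axis=1))

# Brute force: sum over all K^N complete-data configurations
brute = sum(np.prod([dens[n, z[n]] for n in range(N)])
            for z in product(range(K), repeat=N))

print(incomplete, brute)  # identical up to floating-point rounding
```

The brute-force sum already has 64 terms at $N = 6$; doubling $N$ squares the count, which is why exact marginalization is hopeless at realistic sample sizes.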
Contrast with simpler models:
For a single Gaussian (not a mixture), the marginal likelihood has a closed form when using conjugate priors. If the joint prior is Normal-inverse-Wishart, $(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \sim \text{NIW}(\mathbf{m}_0, \kappa_0, \boldsymbol{\Psi}_0, \nu_0)$:
Then the marginal likelihood is: $$p(\mathbf{X}) = \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)} \cdot \frac{|\boldsymbol{\Psi}_0|^{\nu_0/2}}{|\boldsymbol{\Psi}_N|^{\nu_N/2}} \cdot \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2} \cdot \pi^{-ND/2}$$
where $\Gamma_D$ is the multivariate gamma function and $(\mathbf{m}_N, \kappa_N, \boldsymbol{\Psi}_N, \nu_N)$ are the posterior hyperparameters.
This closed form exists because there are no latent variables to marginalize over—the key difference from mixtures.
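The closed form above can be evaluated directly in log space. A sketch assuming the standard conjugate posterior updates for the NIW hyperparameters (the updates are not derived in the text; they are the usual $\kappa_N = \kappa_0 + N$, $\nu_N = \nu_0 + N$, scatter-plus-shrinkage form):

```python
import numpy as np
from scipy.special import multigammaln

def log_ml_gaussian_niw(X, m0, kappa0, Psi0, nu0):
    """Log marginal likelihood of X under a single Gaussian with NIW prior."""
    N, D = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)            # centered scatter matrix
    kappaN = kappa0 + N
    nuN = nu0 + N
    diff = (xbar - m0)[:, None]
    PsiN = Psi0 + S + (kappa0 * N / kappaN) * (diff @ diff.T)

    _, logdet0 = np.linalg.slogdet(Psi0)
    _, logdetN = np.linalg.slogdet(PsiN)
    # Term-by-term log of the closed-form expression above
    return (multigammaln(nuN / 2, D) - multigammaln(nu0 / 2, D)
            + (nu0 / 2) * logdet0 - (nuN / 2) * logdetN
            + (D / 2) * (np.log(kappa0) - np.log(kappaN))
            - (N * D / 2) * np.log(np.pi))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
log_ml = log_ml_gaussian_niw(X, np.zeros(2), 1.0, np.eye(2), 4.0)
print(log_ml)
```

No optimization, sampling, or latent-variable bookkeeping is required; this is the baseline that the mixture case loses.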
Since exact computation is intractable, we use approximations. Several approaches have been developed, each with trade-offs between accuracy and computational cost.
The Laplace approximation uses a second-order Taylor expansion of the log posterior around the mode $\hat{\boldsymbol{\theta}}$:
$$\log p(\mathbf{X} \mid \mathcal{M}) \approx \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{D_{\theta}}{2} \log(2\pi) - \frac{1}{2} \log |\mathbf{H}|$$
where $\mathbf{H}$ is the Hessian of the negative log posterior at $\hat{\boldsymbol{\theta}}$, and $D_{\theta}$ is the number of parameters.
Laplace approximation in detail:
Let $f(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})$ be the log joint (unnormalized log posterior).
Expanding around the mode $\hat{\boldsymbol{\theta}}$ (where $\nabla f = 0$): $$f(\boldsymbol{\theta}) \approx f(\hat{\boldsymbol{\theta}}) - \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^\top \mathbf{H} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$$
where $\mathbf{H} = -\nabla^2 f(\hat{\boldsymbol{\theta}})$ is the observed information matrix.
The integral becomes: $$p(\mathbf{X}) \approx e^{f(\hat{\boldsymbol{\theta}})} \int \exp\left( -\frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^\top \mathbf{H} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) \right) d\boldsymbol{\theta}$$ $$= e^{f(\hat{\boldsymbol{\theta}})} \cdot (2\pi)^{D/2} |\mathbf{H}|^{-1/2}$$
Taking logarithms: $$\log p(\mathbf{X}) \approx f(\hat{\boldsymbol{\theta}}) + \frac{D}{2} \log(2\pi) - \frac{1}{2} \log|\mathbf{H}|$$
```python
import numpy as np
from scipy.optimize import minimize

def laplace_marginal_likelihood(X, K):
    """
    Approximate the log marginal likelihood of a GMM via the Laplace
    approximation. This is a simplified illustration - a full
    implementation would need proper GMM parameterization and
    constraint handling.

    Parameters
    ----------
    X : array, shape (N, D)
        Observed data.
    K : int
        Number of components.

    Returns
    -------
    log_ml : float
        Approximate log marginal likelihood.
    """
    N, D = X.shape

    def neg_log_joint(params):
        """Negative log joint (likelihood times prior)."""
        idx = 0
        # Mixing weights via softmax for unconstrained optimization
        logits = params[idx:idx + K]
        pi = np.exp(logits - np.max(logits))
        pi = pi / pi.sum()
        idx += K
        means = params[idx:idx + K * D].reshape(K, D)
        idx += K * D
        log_sigmas = params[idx:idx + K * D].reshape(K, D)
        sigmas = np.exp(log_sigmas)  # diagonal variances, kept positive

        # Log-likelihood via log-sum-exp over components
        log_lik = 0.0
        for n in range(N):
            component_logs = np.zeros(K)
            for k in range(K):
                diff = X[n] - means[k]
                log_p = (-0.5 * np.sum(diff**2 / sigmas[k])
                         - 0.5 * np.sum(np.log(sigmas[k]))
                         - 0.5 * D * np.log(2 * np.pi))
                component_logs[k] = np.log(pi[k]) + log_p
            m = np.max(component_logs)
            log_lik += m + np.log(np.sum(np.exp(component_logs - m)))

        # Weak Gaussian prior on the unconstrained parameters
        log_prior = -0.01 * np.sum(params**2)
        return -(log_lik + log_prior)

    # Find the MAP estimate in the unconstrained parameterization
    np.random.seed(42)
    n_free = K + K * D + K * D  # logits, means, log-variances
    init_params = np.random.randn(n_free) * 0.1
    result = minimize(neg_log_joint, init_params, method='L-BFGS-B')
    theta_hat = result.x

    # Numerical Hessian of the negative log joint at the mode
    eps = 1e-5
    H = np.zeros((n_free, n_free))
    for i in range(n_free):
        for j in range(i, n_free):
            t_pp = theta_hat.copy(); t_pp[i] += eps; t_pp[j] += eps
            t_pm = theta_hat.copy(); t_pm[i] += eps; t_pm[j] -= eps
            t_mp = theta_hat.copy(); t_mp[i] -= eps; t_mp[j] += eps
            t_mm = theta_hat.copy(); t_mm[i] -= eps; t_mm[j] -= eps
            H[i, j] = (neg_log_joint(t_pp) - neg_log_joint(t_pm)
                       - neg_log_joint(t_mp) + neg_log_joint(t_mm)) / (4 * eps**2)
            H[j, i] = H[i, j]

    # Laplace: log p(X) ~ f(theta_hat) + (d/2) log(2 pi) - (1/2) log |H|,
    # where d matches the dimension of the Hessian
    log_joint_at_map = -result.fun
    sign, log_det_H = np.linalg.slogdet(H)
    return log_joint_at_map + 0.5 * n_free * np.log(2 * np.pi) - 0.5 * log_det_H

# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.multivariate_normal([0, 0], np.eye(2) * 0.5, 50),
    np.random.multivariate_normal([3, 0], np.eye(2) * 0.5, 50),
])
for K in [1, 2, 3]:
    log_ml = laplace_marginal_likelihood(X, K)
    print(f"K={K}: Approximate log marginal likelihood = {log_ml:.2f}")
```

The Bayesian Information Criterion (BIC) is a widely used approximation to the log marginal likelihood. It is simpler to compute than the full Laplace approximation because it does not require the Hessian.
Derivation from Laplace:
Starting from Laplace: $$\log p(\mathbf{X}) \approx \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{D_{\theta}}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|$$
For large $N$, the log-likelihood dominates and the Hessian scales approximately as $N \cdot \mathbf{I}$ (Fisher information scaling). Taking leading-order terms: $$\log p(\mathbf{X}) \approx \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) - \frac{D_{\theta}}{2} \log N + O(1)$$
$$\text{BIC} = -2 \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) + D_{\theta} \log N$$
Or equivalently (for model comparison via approximating log marginal likelihood):
$$\log p(\mathbf{X} \mid \mathcal{M}) \approx \log p(\mathbf{X} \mid \hat{\boldsymbol{\theta}}) - \frac{D_{\theta}}{2} \log N$$
Lower BIC indicates a better model. The term $D_{\theta} \log N$ penalizes complexity.
Computing BIC for GMMs:
For a GMM with $K$ components in $D$ dimensions with full covariances, the free parameters are: $K - 1$ mixing coefficients (the constraint $\sum_k \pi_k = 1$ removes one), $KD$ mean entries, and $K \cdot D(D+1)/2$ entries for the symmetric covariance matrices.
Total: $D_{\theta} = (K-1) + KD + K \cdot D(D+1)/2$
$$\text{BIC} = -2 \sum_{n=1}^{N} \log\left( \sum_{k=1}^{K} \hat{\pi}_k \mathcal{N}(\mathbf{x}_n \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k) \right) + D_{\theta} \log N$$
where $\hat{\boldsymbol{\theta}} = (\hat{\pi}_k, \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)$ are the maximum likelihood estimates (typically obtained via EM).
```python
import numpy as np
from sklearn.mixture import GaussianMixture

def compute_bic_for_gmm(X, K_range):
    """
    Compute BIC for GMMs with different numbers of components.

    Parameters
    ----------
    X : array, shape (N, D)
        Observed data.
    K_range : iterable of int
        Values of K to try.

    Returns
    -------
    results : dict
        Maps K to a dict with BIC, log-likelihood, and parameter count.
    """
    N, D = X.shape
    results = {}
    for K in K_range:
        # Fit GMM by EM, with multiple restarts
        gmm = GaussianMixture(
            n_components=K,
            covariance_type='full',
            n_init=5,
            max_iter=200,
            random_state=42,
        )
        gmm.fit(X)

        # Log-likelihood at the MLE (sklearn's score() is the per-sample average)
        log_lik = gmm.score(X) * N

        # Parameter count: mixing weights, means, full covariances
        n_params = (K - 1) + K * D + K * D * (D + 1) // 2

        results[K] = {
            'bic': -2 * log_lik + n_params * np.log(N),
            'log_likelihood': log_lik,
            'n_params': n_params,
            'sklearn_bic': gmm.bic(X),  # cross-check with sklearn's implementation
        }
    return results

# Example: find the optimal K
np.random.seed(42)

# Generate data from a 3-component GMM
true_K = 3
true_means = [np.array([0, 0]), np.array([4, 0]), np.array([2, 3])]
true_covs = [0.5 * np.eye(2) for _ in range(true_K)]
true_weights = [0.3, 0.4, 0.3]
X = np.vstack([
    np.random.multivariate_normal(true_means[k], true_covs[k],
                                  int(200 * true_weights[k]))
    for k in range(true_K)
])

# Compute BIC for different K
results = compute_bic_for_gmm(X, range(1, 7))

print("BIC Model Selection Results:")
print("-" * 60)
print(f"{'K':>3} {'BIC':>12} {'Log-Lik':>12} {'# Params':>10}")
print("-" * 60)
for K, res in results.items():
    print(f"{K:3d} {res['bic']:12.2f} {res['log_likelihood']:12.2f} "
          f"{res['n_params']:10d}")
print("-" * 60)
best_K = min(results, key=lambda K: results[K]['bic'])
print(f"Best model: K = {best_K} (True K = {true_K})")
```

Variational Bayes provides another approach to approximating the marginal likelihood.
Instead of a point approximation (like Laplace), it finds a tractable distribution that approximates the posterior and provides a lower bound on the log marginal likelihood.
The Evidence Lower Bound (ELBO) is fundamental to variational inference and connects to the EM algorithm.
For any distribution $q(\boldsymbol{\theta}, \mathbf{Z})$ over parameters and latent variables:
$$\log p(\mathbf{X}) = \text{ELBO}(q) + \text{KL}(q \,\|\, p(\boldsymbol{\theta}, \mathbf{Z} \mid \mathbf{X}))$$
where: $$\text{ELBO}(q) = \mathbb{E}_q[\log p(\mathbf{X}, \boldsymbol{\theta}, \mathbf{Z})] - \mathbb{E}_q[\log q(\boldsymbol{\theta}, \mathbf{Z})]$$
Since KL divergence is non-negative, ELBO $\leq \log p(\mathbf{X})$.
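The bound is easy to verify numerically on a problem small enough to compute $\log p(\mathbf{X})$ exactly. A toy sketch with a single binary latent variable and no parameters (the probabilities chosen are illustrative):

```python
import numpy as np

# Toy model: latent z in {0, 1} with prior p(z), likelihood p(x | z)
# for one observed x. Then log p(x) = log sum_z p(x|z) p(z).
p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([0.9, 0.2])
log_evidence = np.log(np.sum(p_x_given_z * p_z))

def elbo(q):
    """E_q[log p(x, z)] - E_q[log q(z)] for a distribution q over z."""
    log_joint = np.log(p_x_given_z * p_z)
    return np.sum(q * (log_joint - np.log(q)))

# Any q gives a lower bound on the log evidence
for q0 in [0.1, 0.5, 0.9]:
    q = np.array([q0, 1 - q0])
    print(f"q(z=0)={q0}: ELBO = {elbo(q):.4f} <= log p(x) = {log_evidence:.4f}")

# The bound is tight exactly when q equals the posterior p(z | x)
posterior = p_x_given_z * p_z / np.sum(p_x_given_z * p_z)
print(f"ELBO at posterior: {elbo(posterior):.4f}")
```

The gap between the ELBO and the log evidence is precisely the KL divergence in the identity above, so maximizing the ELBO over $q$ is equivalent to minimizing that KL divergence.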
Understanding the ELBO:
Rewriting the ELBO: $$\text{ELBO} = \mathbb{E}_q[\log p(\mathbf{X} \mid \boldsymbol{\theta}, \mathbf{Z})] - \text{KL}(q(\boldsymbol{\theta}, \mathbf{Z}) \,\|\, p(\boldsymbol{\theta}, \mathbf{Z}))$$
This reveals two components: an expected log-likelihood term that rewards fitting the data, and a KL term that penalizes approximations straying far from the prior.
Connection to EM:
The EM algorithm can be viewed as maximizing a restricted ELBO where: $q$ is a distribution over the latent assignments $\mathbf{Z}$ only, and the parameters $\boldsymbol{\theta}$ are kept as a point estimate rather than given a distribution.
Variational Bayes extends this by: placing a (typically factorized) distribution over the parameters as well, so that both $\boldsymbol{\theta}$ and $\mathbf{Z}$ are approximately marginalized out.
Using ELBO for model selection:
Since ELBO $\leq \log p(\mathbf{X})$, the ELBO provides a lower bound on model evidence. Higher ELBO suggests better model fit (subject to approximation tightness).
However, ELBO comparisons across models require that approximations be similarly tight—this is not always guaranteed.
The ELBO is extensively used in modern machine learning, particularly for Variational Autoencoders (VAEs) and Bayesian neural networks. For GMMs, variational Bayes with a fully factorized approximation yields closed-form update equations similar to EM but with automatic relevance determination—components can be "turned off" if not needed.
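scikit-learn's `BayesianGaussianMixture` implements variational inference for GMMs, and the component-shrinkage behavior just described can be observed directly. A sketch (the data, `weight_concentration_prior` value, and the 0.05 weight threshold for counting "effective" components are illustrative choices):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two well-separated clusters, but allow up to 6 components
X = np.vstack([
    rng.normal([0, 0], 0.4, size=(100, 2)),
    rng.normal([4, 0], 0.4, size=(100, 2)),
])

# A small weight_concentration_prior encourages unused components
# to receive negligible weight ("turned off")
vb = BayesianGaussianMixture(
    n_components=6,
    weight_concentration_prior=0.01,
    covariance_type='full',
    max_iter=500,
    random_state=0,
)
vb.fit(X)

effective = int(np.sum(vb.weights_ > 0.05))
print("Fitted weights:", np.round(vb.weights_, 3))
print("Effective components:", effective)
```

In contrast to fitting a separate `GaussianMixture` for each $K$ and comparing BIC values, a single variational fit with generous `n_components` can prune the excess components automatically.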
This page has developed the theory and practice of marginal likelihood for Gaussian Mixture Model selection. Let's consolidate the key insights:
| Method | Computational Cost | Accuracy | Requirements |
|---|---|---|---|
| Exact Marginal Likelihood | Intractable | Exact | Conjugate priors (not possible for GMM) |
| Laplace Approximation | Moderate (needs Hessian) | Good for large $N$ | MAP + Hessian computation |
| BIC | Low (just MLE) | Asymptotically correct | Only MLE parameters |
| Variational ELBO | Moderate (iterative) | Lower bound | Tractable $q$ distribution |
| Monte Carlo | High | Converges to true value | Sampling scheme design |
Connections forward:
Page 3: Identifiability — The label switching symmetry affects marginal likelihood computation and poses challenges for Monte Carlo approaches.
Page 4: Model Selection — Combines marginal likelihood with practical criteria (AIC, cross-validation) for choosing $K$.
Module 4: EM Algorithm — The ELBO perspective illuminates why EM increases likelihood at each iteration.
You now understand the marginal likelihood for Gaussian Mixture Models: its theoretical foundation in Bayesian model comparison, why it automatically penalizes complexity, why exact computation is intractable, and how approximations like BIC and ELBO provide practical alternatives. This foundation is essential for principled model selection in mixture modeling.