A fundamental question in statistical modeling is whether the parameters of a model can be uniquely determined from data—the question of identifiability. For Gaussian Mixture Models, identifiability takes on particular importance due to inherent symmetries in the model structure.
Consider a simple observation: if we have a 2-component GMM with components A and B, we could equivalently call them B and A. The likelihood is unchanged—we've just relabeled. But this means multiple parameter settings correspond to the same probability distribution.
This page rigorously examines identifiability in GMMs: what it means, when models are identifiable (up to label switching), when they fail to be identifiable, and why this matters for inference, optimization, and interpretation.
By the end of this page, you will: (1) Define identifiability formally and understand its importance, (2) Explain the label switching problem with precision, (3) Describe conditions under which GMMs are identifiable (up to permutation), (4) Identify scenarios that cause non-identifiability beyond label switching, and (5) Understand practical implications for EM optimization and Bayesian inference.
Identifiability is a fundamental concept in statistical theory that addresses whether distinct parameter values lead to distinguishable probability distributions.
Formal definition:
A parametric model $\{p(\mathbf{x} \mid \boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\}$ is identifiable if the mapping from parameters to distributions is one-to-one. Formally:
$$p(\mathbf{x} \mid \boldsymbol{\theta}_1) = p(\mathbf{x} \mid \boldsymbol{\theta}_2) \text{ for all } \mathbf{x} \implies \boldsymbol{\theta}_1 = \boldsymbol{\theta}_2$$
If this condition holds, knowing the true distribution uniquely determines the parameters.
For parameter estimation: If a model is not identifiable, there exist multiple parameter settings that are equally consistent with any dataset. The MLE or posterior is not unique.
For interpretation: Non-identifiable parameters cannot be given meaningful scientific interpretation—different settings explain data equally well.
For optimization: Non-identifiability creates flat regions or ridges in the likelihood surface where the objective doesn't change, causing numerical issues.
Identifiability up to equivalence:
In practice, we often accept identifiability up to an equivalence relation: the model is identifiable up to a known group of transformations $\mathcal{T}$ if $$p(\mathbf{x} \mid \boldsymbol{\theta}_1) = p(\mathbf{x} \mid \boldsymbol{\theta}_2) \implies \boldsymbol{\theta}_2 = T(\boldsymbol{\theta}_1) \text{ for some } T \in \mathcal{T}$$
Such a model can still be practically useful. For GMMs, the relevant group is the set of component permutations.
Examples of identifiability:
Single Gaussian (identifiable): If $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$ for all $\mathbf{x}$, then $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ and $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$. The Gaussian is identifiable.
2-component GMM (identifiable up to label switching): Swapping $(\pi_1, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \leftrightarrow (\pi_2, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$ gives the same mixture density.
| Model | Identifiability | Equivalent Parameterizations |
|---|---|---|
| Single Gaussian | Identifiable | Unique $\boldsymbol{\mu}, \boldsymbol{\Sigma}$ |
| Linear Regression | Identifiable (with full rank $\mathbf{X}$) | Unique $\boldsymbol{\beta}$ |
| GMM ($K$ components) | Identifiable up to $K!$ permutations | Any permutation of component labels |
| Factor Analysis | Not identifiable (rotation) | Any orthogonal rotation of loadings |
| ICA | Identifiable up to scaling/permutation | Scaling and ordering of components |
The most prominent identifiability issue in mixture models is label switching. Since the component labels (1, 2, ..., K) are arbitrary, any permutation of them yields the same probability distribution.
Formal statement:
For a GMM with parameters $\boldsymbol{\theta} = \{(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\}_{k=1}^K$ and any permutation $\sigma$ of $\{1, \ldots, K\}$, define the permuted parameters: $$\boldsymbol{\theta}^\sigma = \{(\pi_{\sigma(k)}, \boldsymbol{\mu}_{\sigma(k)}, \boldsymbol{\Sigma}_{\sigma(k)})\}_{k=1}^K$$
Then: $$p(\mathbf{x} \mid \boldsymbol{\theta}) = p(\mathbf{x} \mid \boldsymbol{\theta}^\sigma) \quad \text{for all } \mathbf{x}$$
This creates $K!$ equivalent parameter settings for any given GMM.
For $K = 5$ components, there are $5! = 120$ equivalent parameter settings. For $K = 10$, there are $10! = 3,628,800$ equivalent settings. This symmetry has significant implications for optimization and Bayesian inference.
```python
import numpy as np
from scipy.stats import multivariate_normal
from itertools import permutations
from math import factorial

def gmm_density(x, means, covs, weights):
    """Compute GMM density at point x."""
    density = 0.0
    for mean, cov, weight in zip(means, covs, weights):
        density += weight * multivariate_normal.pdf(x, mean, cov)
    return density

def permute_gmm(means, covs, weights, perm):
    """Apply a permutation to GMM parameters."""
    return (
        [means[p] for p in perm],
        [covs[p] for p in perm],
        [weights[p] for p in perm]
    )

# Define a 3-component GMM
original_means = [np.array([0, 0]), np.array([3, 0]), np.array([1.5, 2.5])]
original_covs = [0.5 * np.eye(2), 0.3 * np.eye(2), 0.4 * np.eye(2)]
original_weights = [0.3, 0.5, 0.2]

# Test point
x = np.array([1.0, 1.0])

# Compute density under original parameterization
original_density = gmm_density(x, original_means, original_covs, original_weights)

print("Label Switching Demonstration")
print("=" * 60)
print(f"Test point: x = {x}")
print(f"Original density: p(x) = {original_density:.6f}")
print(f"Density under all {factorial(3)} permutations:")

for perm in permutations([0, 1, 2]):
    perm_means, perm_covs, perm_weights = permute_gmm(
        original_means, original_covs, original_weights, perm
    )
    perm_density = gmm_density(x, perm_means, perm_covs, perm_weights)
    print(f"  Permutation {perm}: p(x) = {perm_density:.6f} "
          f"{'✓' if np.isclose(perm_density, original_density) else '✗'}")

# Verify for multiple test points
print("Verifying across 100 random test points...")
test_points = np.random.randn(100, 2)
all_equal = True

for x in test_points:
    d_orig = gmm_density(x, original_means, original_covs, original_weights)
    for perm in permutations([0, 1, 2]):
        pm, pc, pw = permute_gmm(original_means, original_covs, original_weights, perm)
        if not np.isclose(gmm_density(x, pm, pc, pw), d_orig):
            all_equal = False
            break

print(f"All permutations give identical densities: {all_equal}")
```

Consequences of label switching:
1. Multiple equivalent optima in likelihood:
The log-likelihood surface has $K!$ equivalent global maxima. From any solution, permuting components gives another solution with identical likelihood.
2. EM algorithm behavior:
Depending on initialization, EM can converge to any of the $K!$ equivalent solutions. Different random seeds may find different labelings, making results appear inconsistent (though they represent the same clustering).
3. Bayesian posterior multimodality:
The posterior $p(\boldsymbol{\theta} \mid \mathbf{X})$ has $K!$ symmetric modes. Standard MCMC samplers will explore all modes, jumping between them. This creates serious problems for posterior summarization—simply averaging samples mixes apples and oranges.
4. Interpretation challenges:
"Component 1" from one EM run may correspond to "Component 3" from another run. Comparing across runs or studies requires careful alignment.
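One way to perform this alignment — a sketch, not part of the original analysis — is to treat it as an assignment problem on the component means and solve it with `scipy.optimize.linear_sum_assignment` (the Hungarian algorithm). The function name `align_components` and the example means below are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_components(ref_means, run_means):
    """Find the permutation of run_means that best matches ref_means.

    Solves the assignment problem on pairwise Euclidean distances
    between component means (Hungarian algorithm).
    """
    # cost[i, j] = distance between reference component i and run component j
    cost = np.linalg.norm(ref_means[:, None, :] - run_means[None, :, :], axis=2)
    _, perm = linear_sum_assignment(cost)
    return perm  # run_means[perm] lines up with ref_means row by row

# Two runs that found the same clusters under different labels
ref = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
run = np.array([[2.1, 2.9], [0.1, -0.1], [3.9, 0.2]])  # permuted labeling

perm = align_components(ref, run)
print(perm)       # [1 2 0]: run's component 1 matches ref's component 0, etc.
print(run[perm])  # rows reordered so row k corresponds to ref row k
```

Unlike sorting by a single coordinate, the assignment approach remains reliable when components overlap along any one axis.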
A fundamental theorem establishes that finite Gaussian mixtures are identifiable up to component permutation under mild conditions.
Theorem (Teicher, 1963; Yakowitz & Spragins, 1968):
The family of finite Gaussian mixtures is identifiable in the sense that if: $$\sum_{k=1}^{K_1} \pi_k^{(1)} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k^{(1)}, \boldsymbol{\Sigma}_k^{(1)}) = \sum_{k=1}^{K_2} \pi_k^{(2)} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k^{(2)}, \boldsymbol{\Sigma}_k^{(2)})$$
for all $\mathbf{x}$, then $K_1 = K_2$ and there exists a permutation $\sigma$ such that: $$\pi_k^{(1)} = \pi_{\sigma(k)}^{(2)}, \quad \boldsymbol{\mu}_k^{(1)} = \boldsymbol{\mu}_{\sigma(k)}^{(2)}, \quad \boldsymbol{\Sigma}_k^{(1)} = \boldsymbol{\Sigma}_{\sigma(k)}^{(2)}$$
The proof relies on the fact that the Gaussian family is linearly independent in a function-theoretic sense: no finite linear combination of distinct Gaussians equals the zero function. Consequently, if two finite mixtures produce the same density, they must be built from the same component Gaussians with the same mixing weights, merely listed in a different order.
Proof sketch:
The proof uses properties of the characteristic function (Fourier transform) of the mixture. For Gaussians: $$\phi_k(\mathbf{t}) = \exp\left(i \mathbf{t}^\top \boldsymbol{\mu}_k - \frac{1}{2} \mathbf{t}^\top \boldsymbol{\Sigma}_k \mathbf{t}\right)$$
Step 1: The mixture's characteristic function is: $$\phi(\mathbf{t}) = \sum_{k=1}^K \pi_k \phi_k(\mathbf{t})$$
Step 2: If two mixtures have identical densities, they have identical characteristic functions.
Step 3: The characteristic functions $\{\phi_k\}$ are linearly independent over the complex numbers (this is the key technical step, proven via analytic continuation arguments).
Step 4: Linear independence implies that the same characteristic function can only arise from re-ordering the same component functions.
Conclusion: The mixtures have the same components with the same weights, up to permutation.
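The permutation invariance at the heart of this argument can be checked numerically. The sketch below (a 1-D illustration with made-up parameters) evaluates the mixture characteristic function $\phi(t) = \sum_k \pi_k \exp(i t \mu_k - \tfrac{1}{2}\sigma_k^2 t^2)$ on a grid and confirms it is unchanged by relabeling but changed by genuinely different parameters:

```python
import numpy as np

def mixture_cf(t, weights, means, variances):
    """Characteristic function of a 1-D Gaussian mixture, evaluated at points t."""
    t = np.asarray(t, dtype=float)
    phi = np.zeros_like(t, dtype=complex)
    for w, mu, var in zip(weights, means, variances):
        phi += w * np.exp(1j * t * mu - 0.5 * var * t**2)
    return phi

t = np.linspace(-5, 5, 201)
weights, means, variances = [0.3, 0.5, 0.2], [0.0, 3.0, 1.5], [0.5, 0.3, 0.4]

phi = mixture_cf(t, weights, means, variances)

# Permuting the components leaves the characteristic function unchanged
perm = [2, 0, 1]
phi_perm = mixture_cf(t, [weights[p] for p in perm],
                      [means[p] for p in perm],
                      [variances[p] for p in perm])
print(np.allclose(phi, phi_perm))   # True

# A genuinely different mixture has a different characteristic function
phi_other = mixture_cf(t, weights, [0.0, 3.0, 2.0], variances)
print(np.allclose(phi, phi_other))  # False
```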
Requirements for identifiability:
• All mixing weights are strictly positive: $\pi_k > 0$ for every $k$.
• The components are distinct: $(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \neq (\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ whenever $j \neq k$.
The identifiability theorem guarantees that with enough data, we can (in principle) uniquely recover the true mixture parameters up to labeling. This justifies using GMMs for density estimation and clustering—the parameters we estimate are meaningful, not arbitrary.
While GMMs are identifiable up to label permutation in the generic case, several special configurations lead to genuine non-identifiability where different parameter settings produce the same distribution and are not related by simple relabeling.
Example: Overfitting with extra components
Suppose the true density is a single Gaussian $\mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$. We fit a 2-component GMM.
Many solutions achieve the same likelihood:
• Set $\pi_2 = 0$: the second component's parameters $(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$ are then completely arbitrary.
• Set both components equal to the truth, $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}_0$: any split of weight between $\pi_1$ and $\pi_2$ gives the same density.
This is not label switching—these are genuinely different parameter configurations.
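A quick numerical check of this degeneracy, assuming a standard normal target and two hypothetical 2-component parameterizations of it:

```python
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture."""
    return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds))

x = np.linspace(-4, 4, 101)
target = norm.pdf(x, 0, 1)  # true density: a single standard Gaussian

# (a) Second component has zero weight -> its parameters are arbitrary
pdf_a = gmm_pdf(x, [1.0, 0.0], [0.0, 7.0], [1.0, 2.0])

# (b) Two identical components -> the split of weight is arbitrary
pdf_b = gmm_pdf(x, [0.4, 0.6], [0.0, 0.0], [1.0, 1.0])

print(np.allclose(pdf_a, target))  # True
print(np.allclose(pdf_b, target))  # True
# (a) and (b) are not permutations of each other, yet define the same density
```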
Example: Component splitting
A single Gaussian $\mathcal{N}(0, 1)$ can be approximated arbitrarily well by: $$\frac{1}{K} \sum_{k=1}^K \mathcal{N}(\mu_k, \sigma^2)$$ with $\sigma^2 < 1$ and the means $\mu_k$ spread according to $\mathcal{N}(0, 1 - \sigma^2)$ (since the convolution $\mathcal{N}(0, 1 - \sigma^2) * \mathcal{N}(0, \sigma^2) = \mathcal{N}(0, 1)$). As $K \to \infty$, the approximation becomes exact. But there's no unique assignment of means in the limit.
Signs of non-identifiability in practice:
• Very small mixing weights ($\pi_k \approx 0$) suggesting unnecessary components
• Near-identical component parameters suggesting redundant components
• EM not converging or cycling between configurations
• Large parameter uncertainty in Bayesian inference
• Sensitivity to initialization suggesting multiple equivalent optima
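A minimal diagnostic sketch for the first two symptoms — the function `redundancy_flags` and its thresholds are illustrative heuristics, not a standard API:

```python
import numpy as np

def redundancy_flags(weights, means, weight_tol=1e-2, dist_tol=1e-1):
    """Flag symptoms of an over-parameterized mixture.

    Returns indices of components with near-zero weight and pairs of
    components whose means nearly coincide. Thresholds are heuristic.
    """
    weights = np.asarray(weights)
    means = np.asarray(means)
    tiny = [k for k in range(len(weights)) if weights[k] < weight_tol]
    dists = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)
    dup = [(i, j) for i in range(len(means)) for j in range(i + 1, len(means))
           if dists[i, j] < dist_tol]
    return tiny, dup

# A hypothetical 4-component fit where only 2 components are really needed
weights = [0.485, 0.47, 0.04, 0.005]
means = [[0.0, 0.0], [4.0, 0.0], [0.05, 0.02], [9.0, 9.0]]

tiny, dup = redundancy_flags(weights, means)
print(tiny)  # [3]: component 3 carries almost no weight
print(dup)   # [(0, 2)]: components 0 and 2 nearly coincide
```

In practice these flags would be applied to the `weights_` and `means_` of a fitted model; components flagged this way are candidates for removal followed by a refit with smaller $K$.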
The label switching symmetry and potential non-identifiability have direct consequences for the EM algorithm.
Multiple equivalent optima:
The EM algorithm seeks a (local) maximum of the log-likelihood. With $K!$ equivalent global maxima, EM will converge to one of them depending on initialization. This is actually fine for density estimation—all equivalent solutions define the same density.
However, for clustering interpretation:
If we want to interpret "component 1" as corresponding to a specific subpopulation, the arbitrary labeling becomes problematic. Different runs assign different labels to the same cluster.
```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Generate data from a known GMM
np.random.seed(42)
true_means = np.array([[0, 0], [4, 0], [2, 3]])
true_covs = np.array([0.5 * np.eye(2) for _ in range(3)])
n_samples = [100, 150, 100]

X = np.vstack([
    np.random.multivariate_normal(true_means[k], true_covs[k], n_samples[k])
    for k in range(3)
])

print("Demonstrating Label Inconsistency Across EM Runs")
print("=" * 60)
print(f"True means: {true_means.tolist()}")
print(f"True proportions: {[n / sum(n_samples) for n in n_samples]}")

# Run EM multiple times with different random seeds
print("Fitted means from 5 independent EM runs:")
print("-" * 60)

for run in range(5):
    gmm = GaussianMixture(
        n_components=3,
        covariance_type='full',
        n_init=1,                # single initialization to show variability
        random_state=run * 123   # different seed each run
    )
    gmm.fit(X)
    print(f"Run {run + 1}:")
    for k in range(3):
        print(f"  Component {k}: mean = [{gmm.means_[k, 0]:.2f}, {gmm.means_[k, 1]:.2f}], "
              f"weight = {gmm.weights_[k]:.3f}")
    print()

# Alignment strategy: sort by first coordinate of mean
print("After sorting components by mean[0]:")
print("-" * 60)

for run in range(5):
    gmm = GaussianMixture(n_components=3, n_init=1, random_state=run * 123)
    gmm.fit(X)
    # Sort by first coordinate of mean
    order = np.argsort(gmm.means_[:, 0])
    sorted_means = gmm.means_[order]
    sorted_weights = gmm.weights_[order]
    print(f"Run {run + 1} (sorted):")
    for k in range(3):
        print(f"  Component {k}: mean = [{sorted_means[k, 0]:.2f}, {sorted_means[k, 1]:.2f}], "
              f"weight = {sorted_weights[k]:.3f}")
    print()
```

If your goal is density estimation (predicting $p(\mathbf{x})$ for new points), label switching doesn't matter—all permutations give the same density. If your goal is clustering (assigning points to interpretable groups), you need consistent labeling, typically through a strategy like the sorting shown above.
Label switching poses severe challenges for Bayesian inference via MCMC sampling.
The problem:
The posterior distribution $p(\boldsymbol{\theta} \mid \mathbf{X})$ inherits the $K!$-fold symmetry of the likelihood. It has $K!$ modes, each corresponding to a different labeling of the same underlying clustering.
If MCMC samples explore multiple modes (as they should for proper posterior exploration), naively averaging samples mixes parameters from different labelings:
$$\hat{\boldsymbol{\mu}}_1 = \frac{1}{T} \sum_{t=1}^T \boldsymbol{\mu}_1^{(t)}$$
This average is meaningless if $\boldsymbol{\mu}_1^{(t)}$ sometimes refers to true cluster 1 and sometimes to true cluster 2.
If MCMC samples switch between modes, the sample mean of $\boldsymbol{\mu}_k$ will tend toward the grand mean of all clusters. Variance estimates will be inflated. The posterior summary is completely wrong—not because of poor sampling, but because of label switching.
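A small simulation makes this concrete. It assumes idealized posterior draws that land in one of the two symmetric modes at random (pure simulation, not real MCMC output):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
true_means = np.array([-2.0, 2.0])

# Idealized posterior draws: each draw sits near the true means, but the
# labels of the two components are randomly permuted (mode switching).
mu1_draws = np.empty(T)
for t in range(T):
    mode = rng.permutation(2)                        # which labeling this draw uses
    draw = true_means[mode] + 0.05 * rng.standard_normal(2)
    mu1_draws[t] = draw[0]                           # what "component 1" means here

naive_mean = mu1_draws.mean()
print(abs(naive_mean) < 0.5)   # True: the naive average sits near the grand mean 0
print(mu1_draws.std() > 1.0)   # True: spread inflated far beyond the 0.05 noise
```

Neither $-2$ nor $+2$ appears in the naive summary, even though every individual draw is an excellent estimate of one of them.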
Solutions for Bayesian inference:
1. Identifiability constraints:
Impose constraints that break the symmetry, for example an ordering on the means (e.g., $\mu_{1,1} < \mu_{2,1} < \cdots$ on the first coordinate) or on the weights ($\pi_1 < \pi_2 < \cdots < \pi_K$).
These constraints restrict the prior to a single mode, forcing consistent labeling.
2. Relabeling algorithms:
After MCMC, relabel each posterior sample to match a reference configuration: for each sample, choose the permutation of its components that minimizes some loss against the reference (as in Stephens' relabeling algorithm), then summarize the relabeled samples.
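A minimal relabeling sketch, assuming posterior draws of the component means are available as an array; each draw is matched to the first draw by brute force over the $K!$ permutations (feasible only for small $K$):

```python
import numpy as np
from itertools import permutations

def relabel_samples(mean_samples):
    """Relabel each posterior draw to match the first draw.

    mean_samples: array of shape (T, K, D), component means per MCMC draw.
    For every draw, picks the permutation of its components closest
    (in Euclidean distance) to the reference draw.
    """
    ref = mean_samples[0]
    K = ref.shape[0]
    out = np.empty_like(mean_samples)
    for t, sample in enumerate(mean_samples):
        best = min(permutations(range(K)),
                   key=lambda p: np.linalg.norm(sample[list(p)] - ref))
        out[t] = sample[list(best)]
    return out

# Three draws of two 1-D component means; draw 1 has its labels swapped
draws = np.array([[[-2.0], [2.0]],
                  [[1.9], [-2.1]],
                  [[-1.8], [2.2]]])

fixed = relabel_samples(draws)
print(fixed[1].ravel())  # draw 1 reordered so component 0 stays near -2
print(fixed.mean(axis=0).ravel())  # per-component posterior means, roughly [-2, 2]
```

After relabeling, component-wise averages are meaningful again; for larger $K$, the brute-force search would be replaced by an assignment solver.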
3. Loss-based summarization:
Avoid parameter-level summaries. Instead, report label-invariant quantities, such as the posterior predictive density $p(\mathbf{x}_{\text{new}} \mid \mathbf{X})$ or the co-clustering probabilities $P(z_i = z_j \mid \mathbf{X})$.
| Approach | When Applied | Advantages | Disadvantages |
|---|---|---|---|
| Identifiability constraints | During sampling | Simple, prevents switching | May bias posterior, arbitrary choice of constraint |
| Relabeling algorithms | Post-processing | Uses all samples, principled | Computationally expensive, requires choice of loss |
| Loss-based summary | Interpretation | Avoids problem entirely | Doesn't give component-level parameters |
| Single-mode sampling | Algorithm design | Avoids switching | May not explore full posterior |
```python
import numpy as np

def compute_coclustering_matrix(responsibilities_samples):
    """
    Compute the posterior co-clustering probability matrix.

    This is a label-invariant summary: P(z_i = z_j | X) doesn't
    depend on which label the shared cluster has.

    Parameters
    ----------
    responsibilities_samples : list of arrays, each (N, K)
        Posterior samples of responsibilities from MCMC.

    Returns
    -------
    coclustering : array, shape (N, N)
        coclustering[i, j] = posterior probability that points i and j
        share the same cluster.
    """
    n_samples = len(responsibilities_samples)
    N = responsibilities_samples[0].shape[0]
    coclustering = np.zeros((N, N))

    for gamma in responsibilities_samples:
        # For each sample, the probability that i and j share a cluster:
        # P(z_i = z_j) = sum_k P(z_i = k) * P(z_j = k)
        coclustering += gamma @ gamma.T

    coclustering /= n_samples
    return coclustering

# Simulated example
np.random.seed(42)
N = 20          # small example
K = 3
n_samples = 100

# Simulate MCMC samples (random responsibilities with label switching)
samples = []
for _ in range(n_samples):
    perm = np.random.permutation(K)                      # simulate label switching
    gamma = np.random.dirichlet(np.ones(K) * 5, size=N)  # soft clustering
    gamma = gamma[:, perm]                               # apply the permutation
    samples.append(gamma)

# Compute the label-invariant summary
cocluster_probs = compute_coclustering_matrix(samples)

print("Co-clustering Probability Matrix (first 10x10):")
print("-" * 60)
print(np.round(cocluster_probs[:10, :10], 2))

print("This matrix is invariant to label switching!")
print("Entry (i,j) = posterior probability that points i and j")
print("belong to the same cluster (regardless of cluster label).")

# Note: with soft responsibilities, the diagonal equals sum_k gamma_ik^2,
# which is 1 only for hard (one-hot) assignments
print(f"Diagonal values: {np.diag(cocluster_probs)[:5]}")
```

Based on the theoretical foundations developed above, here are practical recommendations for working with GMMs:
Don't worry about label switching if you're computing $p(\mathbf{x}_{new})$ for new points—all labelings give the same answer.
Do worry if you need to (a) compare component parameters across studies, (b) average MCMC samples, (c) interpret component identities, or (d) track components over time in online learning.
This page has provided a comprehensive treatment of identifiability in Gaussian Mixture Models. Let's consolidate the key insights:
| Issue | Nature | Solution |
|---|---|---|
| Label switching | Inherent model symmetry | Accept (density estimation) or impose order (clustering) |
| Multiple EM solutions | Different labelings of same clustering | Post-hoc alignment or consistent initialization |
| MCMC mode mixing | Posterior symmetry | Relabeling algorithms or label-invariant summaries |
| Overfitting $K$ | Genuine non-identifiability | Model selection to choose appropriate $K$ |
Connections forward:
Page 4: Model Selection — Choosing $K$ appropriately avoids over-parameterization and the associated non-identifiability issues.
Module 4: EM Algorithm — Understanding label switching helps interpret EM results and design initialization strategies.
Module 5: Beyond Gaussian Mixtures — Identifiability considerations extend to other mixture families with similar (and sometimes additional) symmetries.
You now understand identifiability in GMMs: the formal definition, the label switching problem, conditions for identifiability (up to permutation), additional sources of non-identifiability, and practical strategies for handling these issues in EM optimization and Bayesian inference. This understanding is essential for correct interpretation of mixture model results.