Every mixture model we've studied—Gaussian mixtures, Student-t mixtures, mixtures of experts, HMMs—requires specifying the number of components $K$ upfront. This presents a fundamental dilemma: choose $K$ too small and the model underfits, merging genuinely distinct groups; choose $K$ too large and it overfits, splitting real groups and wasting components on noise.
What if the model could learn the appropriate complexity from data? What if we could specify a prior that says "use as many components as the data warrants, but no more"?
Dirichlet Process Mixture Models (DPMMs) provide exactly this capability. By placing a Dirichlet Process prior over the space of distributions, DPMMs allow for a theoretically infinite number of mixture components, with the posterior naturally concentrating on a finite (and appropriate) number based on the data.
This "infinite mixture" might sound paradoxical—how can we compute with infinitely many components? The key insight is that, almost surely, only finitely many components are ever used, and we can perform inference without explicitly representing the unused components.
By the end of this page, you will understand: (1) The Dirichlet distribution and its connection to the Dirichlet Process; (2) The stick-breaking construction that makes infinite mixtures concrete; (3) The Chinese Restaurant Process as an intuitive metaphor; (4) Gibbs sampling and variational inference for DPMM; (5) Practical considerations for applying DPMMs to real clustering problems.
Before introducing the Dirichlet Process, we need to thoroughly understand the Dirichlet distribution, which generalizes the Beta distribution to multiple dimensions.
The Dirichlet distribution $\text{Dir}(\boldsymbol{\alpha})$ with concentration parameters $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)$ is a distribution over the $(K-1)$-simplex—the set of vectors $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ satisfying $\pi_k \geq 0$ and $\sum_k \pi_k = 1$.
Density: $$p(\boldsymbol{\pi} | \boldsymbol{\alpha}) = \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \pi_k^{\alpha_k - 1}$$
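As a quick sanity check, both the sampler and the density are available off the shelf; a minimal sketch using NumPy and `scipy.stats.dirichlet`:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])

# A draw lies on the 2-simplex: nonnegative entries summing to 1
rng = np.random.default_rng(0)
pi = rng.dirichlet(alpha)

# Evaluate the density at a point on the simplex
p = dirichlet.pdf([0.2, 0.3, 0.5], alpha)

# The mean is alpha_k / sum(alpha)
mean = dirichlet.mean(alpha)   # [0.2, 0.3, 0.5]
```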
Key properties:

- Mean: $\mathbb{E}[\pi_k] = \alpha_k / \sum_j \alpha_j$ — the parameters $\alpha_k$ control the expected proportions.
- Concentration: the larger the total $\alpha_0 = \sum_k \alpha_k$, the more tightly draws cluster around the mean; small $\alpha_0$ gives high variability.

Symmetric case $\alpha_k = \alpha$ for all $k$:

- $\alpha = 1$: uniform distribution over the simplex.
- $\alpha > 1$: draws concentrate near the center (balanced proportions).
- $\alpha < 1$: draws concentrate near the corners and edges of the simplex (sparse proportions).
The Dirichlet is conjugate to the Categorical/Multinomial:
$$\boldsymbol{\pi} \sim \text{Dir}(\boldsymbol{\alpha})$$ $$x_1, \ldots, x_N | \boldsymbol{\pi} \sim \text{Categorical}(\boldsymbol{\pi})$$ $$\boldsymbol{\pi} | x_1, \ldots, x_N \sim \text{Dir}(\boldsymbol{\alpha} + \mathbf{n})$$
where $n_k = \sum_i \mathbb{1}[x_i = k]$ counts observations in category $k$.
This closed-form update is why the Dirichlet is foundational for Bayesian mixture models.
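The closed-form update is a one-liner in practice. A small sketch (the true proportions here are illustrative) that simulates categorical data and checks that the posterior mean approaches the truth:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 1.0, 1.0])        # symmetric Dirichlet prior
pi_true = np.array([0.6, 0.3, 0.1])      # illustrative true proportions

# Observe N categorical draws and count each category
x = rng.choice(3, size=1000, p=pi_true)
n = np.bincount(x, minlength=3)

# Conjugate update: posterior is Dir(alpha + n)
alpha_post = alpha + n
post_mean = alpha_post / alpha_post.sum()   # close to pi_true for large N
```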
For DPMMs, we'll use α < 1, which encourages sparsity. Draws from Dir(α, …, α) with small α tend to have most mass on just a few components. This connects to our goal: even with infinitely many potential components, we expect only a few to be used.
The Dirichlet Process (DP) is a distribution over probability distributions—a "distribution over distributions." It extends the Dirichlet distribution from finite to infinite dimensions.
A random probability measure $G$ follows a Dirichlet Process with base distribution $H$ and concentration parameter $\alpha$, written $G \sim \text{DP}(\alpha, H)$, if for any finite measurable partition $(A_1, \ldots, A_K)$ of the sample space:
$$(G(A_1), \ldots, G(A_K)) \sim \text{Dir}(\alpha H(A_1), \ldots, \alpha H(A_K))$$
Interpretation: the base distribution $H$ is the mean of the process, $\mathbb{E}[G(A)] = H(A)$, while the concentration parameter $\alpha$ controls how closely draws stay near $H$—large $\alpha$ gives $G \approx H$, small $\alpha$ gives draws that concentrate mass on a few atoms.
A crucial property of the DP is that draws $G \sim \text{DP}(\alpha, H)$ are almost surely discrete, even when $H$ is continuous. This means $G$ can be written as:
$$G = \sum_{k=1}^\infty \pi_k \delta_{\theta_k}$$
where:

- $\pi_k$ are random weights with $\sum_k \pi_k = 1$,
- $\theta_k \sim H$ are atoms drawn i.i.d. from the base distribution,
- $\delta_{\theta_k}$ denotes a point mass at $\theta_k$.
This decomposition is made explicit by the stick-breaking construction.
In a standard mixture model: $$p(x) = \sum_{k=1}^K \pi_k f(x | \theta_k)$$
With a DP prior on the mixing distribution: $$G \sim \text{DP}(\alpha, H)$$ $$\theta_i | G \sim G$$ $$x_i | \theta_i \sim f(x | \theta_i)$$
Since $G$ is discrete, multiple observations can share the same $\theta$ value—they belong to the same cluster. But the number of distinct $\theta$ values (clusters) is random and grows with data.
The DP specifies infinitely many potential clusters, but for N data points, at most N clusters can be occupied. The expected number of clusters grows as O(α log N), much slower than N. This logarithmic growth makes inference tractable—we only need to track used clusters.
The stick-breaking construction (Sethuraman, 1994) provides an explicit, constructive representation of draws from the Dirichlet Process.
To sample $G \sim \text{DP}(\alpha, H)$:
1. Generate stick-breaking weights: $$\beta_k \sim \text{Beta}(1, \alpha), \quad k = 1, 2, \ldots$$ $$\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j)$$
2. Generate atoms: $$\theta_k \sim H, \quad k = 1, 2, \ldots$$
3. The random measure: $$G = \sum_{k=1}^\infty \pi_k \delta_{\theta_k}$$
Imagine a stick of unit length representing total probability mass:

1. Break off a fraction $\beta_1$ of the stick; this is the first weight, $\pi_1 = \beta_1$.
2. Break off a fraction $\beta_2$ of what remains; $\pi_2 = \beta_2(1 - \beta_1)$.
3. Continue forever: $\pi_k$ is a fraction $\beta_k$ of the stick left after the first $k-1$ breaks.
Since each $\beta_k < 1$ (almost surely), the remaining stick shrinks exponentially, ensuring weights sum to 1 despite the infinite process.
Expected weight: $$\mathbb{E}[\pi_k] = \frac{1}{\alpha + 1} \left(\frac{\alpha}{\alpha + 1}\right)^{k-1}$$
Weights decay geometrically with rate $\alpha/(\alpha+1)$.
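A small Monte Carlo check of this formula (the helper `stick_weights` is our own naming), comparing empirical mean weights against the geometric expression:

```python
import numpy as np

def stick_weights(alpha, K, rng):
    """Draw the first K stick-breaking weights."""
    betas = rng.beta(1, alpha, size=K)
    # pi_k = beta_k * prod_{j<k} (1 - beta_j)
    return betas * np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))

rng = np.random.default_rng(0)
alpha, K, S = 2.0, 10, 20000
samples = np.array([stick_weights(alpha, K, rng) for _ in range(S)])

# E[pi_k] = (1/(alpha+1)) * (alpha/(alpha+1))^(k-1)
k = np.arange(K)
expected = (1 / (alpha + 1)) * (alpha / (alpha + 1)) ** k
err = np.abs(samples.mean(axis=0) - expected).max()   # small for large S
```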
Effect of $\alpha$:

- Small $\alpha$: each $\beta_k \sim \text{Beta}(1, \alpha)$ tends to be large, so the first few breaks consume most of the stick—few effective components.
- Large $\alpha$: each $\beta_k$ tends to be small, so mass spreads over many components.
For practical inference, truncate to $K$ components: $$G_K = \sum_{k=1}^{K-1} \pi_k \delta_{\theta_k} + \left(1 - \sum_{k=1}^{K-1} \pi_k\right) \delta_{\theta_K}$$
The last component absorbs all remaining mass. With $K$ sufficiently large, this approximation is excellent: $$\mathbb{E}\left[\sum_{k=K+1}^\infty \pi_k\right] = \left(\frac{\alpha}{\alpha + 1}\right)^K$$
For $\alpha = 1$ and $K = 50$, the truncation error is $\approx 10^{-15}$.
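The error bound is trivial to evaluate when choosing a truncation level:

```python
def truncation_error(alpha, K):
    """Expected mass beyond a K-component truncation: (alpha/(alpha+1))**K."""
    return (alpha / (alpha + 1)) ** K

print(truncation_error(1.0, 50))    # ~8.9e-16: negligible
print(truncation_error(10.0, 50))   # larger alpha needs a larger K
```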
```python
import numpy as np
import matplotlib.pyplot as plt

def stick_breaking_weights(alpha, K):
    """
    Generate stick-breaking weights from DP(alpha, H).

    Args:
        alpha: Concentration parameter
        K: Number of weights to generate

    Returns:
        weights: Array of length K summing to 1
    """
    betas = np.random.beta(1, alpha, size=K - 1)
    # Cumulative product of (1 - beta): stick remaining after each break
    remaining = np.cumprod(1 - betas)
    weights = np.zeros(K)
    weights[0] = betas[0]
    weights[1:-1] = betas[1:] * remaining[:-1]
    weights[-1] = remaining[-1]  # Last component gets the residual
    return weights

# Visualize for different alpha values
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
alphas = [0.1, 1.0, 10.0]

for ax, alpha in zip(axes, alphas):
    # Multiple samples to show variability
    for _ in range(5):
        weights = stick_breaking_weights(alpha, 20)
        ax.bar(range(20), weights, alpha=0.5)
    ax.set_title(f'α = {alpha}')
    ax.set_xlabel('Component k')
    ax.set_ylabel('Weight π_k')
    ax.set_ylim(0, 1)

plt.tight_layout()
# plt.savefig('stick_breaking_weights.png')
```

The Chinese Restaurant Process (CRP) provides an alternative, sequential characterization of the Dirichlet Process that's particularly useful for understanding clustering behavior and deriving inference algorithms.
Imagine a restaurant with infinitely many tables (clusters), each with unlimited capacity. Customers (data points) arrive sequentially and choose where to sit:
Customer 1: Sits at the first table.

Customer $n+1$: Given $n$ customers already seated:

- joins an existing table $k$ with probability proportional to $n_k$, the number of customers already seated there;
- starts a new table with probability proportional to $\alpha$.
Let $z_i$ be the table assignment for customer $i$. The conditional distribution is:
$$p(z_{n+1} = k | z_1, \ldots, z_n) = \begin{cases} \frac{n_k}{n + \alpha} & \text{if } k \text{ is an existing table} \\ \frac{\alpha}{n + \alpha} & \text{if } k \text{ is a new table} \end{cases}$$
where $n_k = \sum_{i=1}^n \mathbb{1}[z_i = k]$ is the number of customers at table $k$.
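The seating rule translates directly into code. A minimal CRP simulator (the function name `sample_crp` is our own):

```python
import numpy as np

def sample_crp(n, alpha, rng):
    """Sequentially seat n customers; return their table assignments."""
    z = np.zeros(n, dtype=int)
    counts = []                       # customers at each open table
    for i in range(n):
        # Existing tables with prob n_k/(i+alpha), new table with alpha/(i+alpha)
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)          # open a new table
        else:
            counts[k] += 1            # join an existing table
        z[i] = k
    return z

rng = np.random.default_rng(0)
z = sample_crp(1000, alpha=1.0, rng=rng)
n_tables = z.max() + 1                # typically near log(1000), not 1000
```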
1. Rich-get-richer: Popular tables attract more customers (preferential attachment). This leads to power-law table size distributions.
2. New tables always possible: Non-zero probability of a new table ensures cluster count grows with data.
3. Expected number of tables: For $n$ customers: $$\mathbb{E}[K_n] \approx \alpha \log(n/\alpha + 1)$$
Logarithmic growth—much slower than linear.
4. Exchangeability: The joint distribution $p(z_1, \ldots, z_n)$ is exchangeable—invariant to customer ordering. The sequential construction is for conceptual convenience.
An equivalent view uses urns and balls (the Blackwell–MacQueen urn scheme): draw a ball from the urn and return it together with a new ball of the same color (joining an existing cluster), or, with probability proportional to $\alpha$, add a ball of a brand-new color drawn from $H$ (starting a new cluster).
Expected number of tables $\mathbb{E}[K_n]$ by data size and concentration parameter:

| N (Data Size) | α = 0.1 | α = 1.0 | α = 5.0 | α = 10.0 |
|---|---|---|---|---|
| 10 | 1.2 | 2.9 | 6.5 | 9.0 |
| 100 | 1.5 | 5.2 | 15 | 25 |
| 1,000 | 1.7 | 7.5 | 24 | 41 |
| 10,000 | 2.0 | 10 | 33 | 58 |
| 100,000 | 2.3 | 12 | 42 | 74 |
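The table entries can be reproduced exactly: customer $i$ starts a new table with probability $\alpha/(\alpha + i - 1)$, so $\mathbb{E}[K_n] = \sum_{i=1}^n \alpha/(\alpha + i - 1)$. A quick check against the log approximation:

```python
import numpy as np

def expected_tables(n, alpha):
    """Exact E[K_n] = sum_i alpha / (alpha + i - 1)."""
    i = np.arange(1, n + 1)
    return np.sum(alpha / (alpha + i - 1))

# Matches the table above, e.g. alpha=1, N=100 -> ~5.2
print(expected_tables(100, 1.0))
# The approximation alpha * log(1 + n/alpha) is in the same ballpark
print(1.0 * np.log(1 + 100 / 1.0))
```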
The concentration parameter α controls the expected number of clusters. Small α (0.1-1) gives few clusters; large α (>5) gives many. You can place a prior on α and infer it from data, commonly using a Gamma prior: α ~ Gamma(a, b).
The DPMM combines the Dirichlet Process prior with a mixture model likelihood, enabling nonparametric clustering.
Stick-breaking formulation: $$\beta_k \sim \text{Beta}(1, \alpha), \quad \pi_k = \beta_k \prod_{j < k}(1 - \beta_j)$$ $$\theta_k \sim H \quad \text{(component parameters)}$$ $$z_i \sim \text{Categorical}(\boldsymbol{\pi}) \quad \text{(cluster assignment)}$$ $$x_i \sim F(\theta_{z_i}) \quad \text{(observation)}$$
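The stick-breaking formulation doubles as a sampler for synthetic data. A minimal sketch with 1-D Gaussian components—the base distribution $H = \mathcal{N}(0, 3^2)$, noise scale, and truncation level are illustrative choices:

```python
import numpy as np

def sample_dpmm_data(n, alpha, rng, K=50):
    """Draw n points from a truncated 1-D Gaussian DPMM."""
    # Stick-breaking weights, truncated at K (last weight absorbs the rest)
    betas = rng.beta(1, alpha, size=K)
    pis = betas * np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))
    pis[-1] = 1 - pis[:-1].sum()
    # Atoms (component means) from the base distribution H
    mus = rng.normal(0, 3, size=K)
    # Assignments and observations
    z = rng.choice(K, size=n, p=pis)
    x = rng.normal(mus[z], 0.5)
    return x, z

rng = np.random.default_rng(0)
x, z = sample_dpmm_data(500, alpha=2.0, rng=rng)
n_used = len(np.unique(z))   # far fewer than the K=50 available components
```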
CRP formulation: $$z_i | z_1, \ldots, z_{i-1} \sim \text{CRP}(\alpha)$$ $$\theta_k \sim H \quad \text{for each unique cluster } k$$ $$x_i | z_i, \{\theta_k\} \sim F(\theta_{z_i})$$
With Gaussian observations and Normal-Inverse-Wishart base: $$H = \text{NIW}(\boldsymbol{\mu}_0, \kappa_0, \nu_0, \boldsymbol{\Psi}_0)$$ $$\theta_k = (\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \sim H$$ $$x_i | z_i = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
The NIW is conjugate to the Gaussian likelihood, enabling closed-form updates.
A beautiful property of DPMMs: after observing data, the predictive distribution for a new point has a simple form.
Using the CRP, the predictive assignment for $x_{n+1}$:
$$p(z_{n+1} | z_{1:n}, x_{1:n+1}) \propto \begin{cases} n_k \cdot p(x_{n+1} | x_{\text{in cluster } k}) & \text{existing cluster } k \\ \alpha \cdot p(x_{n+1} | H) & \text{new cluster} \end{cases}$$
For conjugate models, $p(x_{n+1} | x_{\text{in cluster } k})$ integrates out $\theta_k$ analytically (the posterior predictive).
| Aspect | Finite Mixture | DPMM |
|---|---|---|
| Number of components | Fixed $K$ | Random, grows with data |
| Prior on weights | Dir($\alpha_1, \ldots, \alpha_K$) | Stick-breaking / CRP |
| Model selection | Separate step (BIC, CV) | Automatic (posterior) |
| Empty components | Possible (wasted) | Never allocated |
| New clusters | Fixed set | Always possible |
With conjugate priors, we can marginalize out θ analytically and work directly with cluster assignments z. This 'collapsed' representation is more efficient for Gibbs sampling: we sample z without needing to sample θ explicitly.
Posterior inference in DPMMs is intractable analytically, requiring sampling or variational approximations.
The most common approach marginalizes out component parameters $\theta$ and samples cluster assignments $z$ directly.
Algorithm (Neal's Algorithm 3, for conjugate models):

For each data point $i$:

1. Remove $x_i$ from its current cluster, deleting the cluster if it becomes empty.
2. Compute assignment probabilities: $$p(z_i = k | z_{-i}, \mathbf{x}) \propto \begin{cases} n_{-i,k} \, p(x_i | \mathbf{x}_{-i,k}) & \text{existing cluster } k \\ \alpha \, p(x_i | H) & \text{new cluster} \end{cases}$$
3. Sample a new value of $z_i$ from these probabilities.
Here $z_{-i}$ denotes assignments excluding $i$, and $\mathbf{x}_{-i,k}$ are observations in cluster $k$ excluding $x_i$.
Key: $p(x_i | \mathbf{x}_{-i,k})$ is the posterior predictive, which is analytical for conjugate models.
```python
import numpy as np
from scipy.special import gammaln

class CollapsedGibbsDPMM:
    """
    Collapsed Gibbs sampler for a Gaussian DPMM.

    Uses a Normal-Inverse-Wishart prior for conjugacy, so the component
    parameters (mu, Sigma) are integrated out analytically.
    """

    def __init__(self, alpha=1.0, prior_params=None):
        self.alpha = alpha
        self.prior = prior_params  # {'mu0', 'kappa0', 'nu0', 'Psi0'}

    def _log_predictive(self, x, x_cluster):
        """Log posterior predictive p(x | x_cluster) under the NIW-Normal model."""
        D = len(x)
        mu0, kappa0, nu0, Psi0 = (self.prior['mu0'], self.prior['kappa0'],
                                  self.prior['nu0'], self.prior['Psi0'])

        if len(x_cluster) == 0:
            # Prior predictive (new cluster)
            kappa_n, nu_n, mu_n, Psi_n = kappa0, nu0, mu0, Psi0
        else:
            # Posterior predictive (existing cluster)
            n = len(x_cluster)
            x_bar = x_cluster.mean(axis=0)
            S = np.zeros((D, D)) if n == 1 else np.cov(x_cluster.T) * (n - 1)
            kappa_n = kappa0 + n
            nu_n = nu0 + n
            mu_n = (kappa0 * mu0 + n * x_bar) / kappa_n
            diff = x_bar - mu0
            Psi_n = Psi0 + S + kappa0 * n / kappa_n * np.outer(diff, diff)

        # The predictive is a multivariate Student-t
        nu_pred = nu_n - D + 1
        Sigma_pred = Psi_n * (kappa_n + 1) / (kappa_n * nu_pred)

        diff = x - mu_n
        maha = diff @ np.linalg.solve(Sigma_pred, diff)
        return (gammaln((nu_pred + D) / 2) - gammaln(nu_pred / 2)
                - D / 2 * np.log(nu_pred * np.pi)
                - 0.5 * np.linalg.slogdet(Sigma_pred)[1]
                - (nu_pred + D) / 2 * np.log(1 + maha / nu_pred))

    def fit(self, X, n_iter=1000, burn_in=100):
        """Run collapsed Gibbs sampling over cluster assignments z."""
        N, D = X.shape
        z = np.zeros(N, dtype=int)  # initialize: all points in one cluster

        if self.prior is None:  # weak default prior centered on the data
            self.prior = {'mu0': X.mean(axis=0), 'kappa0': 0.01,
                          'nu0': D + 2, 'Psi0': np.eye(D) * 0.1}

        samples = []
        for iteration in range(n_iter):
            for i in range(N):
                # Remove point i from its cluster
                old_cluster = z[i]
                mask = z == old_cluster
                mask[i] = False
                unique_clusters = np.unique(z)
                if not mask.any():
                    # Old cluster is now empty: drop it
                    unique_clusters = unique_clusters[unique_clusters != old_cluster]

                # Relabel clusters to contiguous indices, marking i unassigned
                cluster_map = {c: idx for idx, c in enumerate(unique_clusters)}
                z_temp = np.array([cluster_map.get(c, -1) for c in z])
                z_temp[i] = -1
                K = len(unique_clusters)

                # Existing clusters: n_k times the posterior predictive
                log_probs = []
                for k in range(K):
                    x_k = X[z_temp == k]
                    log_probs.append(np.log(len(x_k))
                                     + self._log_predictive(X[i], x_k))
                # New cluster: alpha times the prior predictive
                log_probs.append(np.log(self.alpha)
                                 + self._log_predictive(X[i], np.empty((0, D))))

                # Normalize in log space and sample the new assignment
                log_probs = np.array(log_probs)
                log_probs -= log_probs.max()
                probs = np.exp(log_probs)
                probs /= probs.sum()
                new_k = np.random.choice(K + 1, p=probs)
                z[i] = unique_clusters[new_k] if new_k < K else z.max() + 1

            if iteration >= burn_in:
                samples.append(z.copy())

        self.samples = samples
        self.z = z
        return self
```

For large datasets, MCMC can be slow. Variational inference provides a faster alternative.
Truncated stick-breaking VI (Blei & Jordan, 2006): truncate the stick-breaking representation at level $K$, posit a mean-field family $q(\boldsymbol{\beta}, \boldsymbol{\theta}, \mathbf{z}) = \prod_k q(\beta_k)\,q(\theta_k) \prod_i q(z_i)$, and optimize the ELBO by coordinate ascent. With a conjugate base distribution, every update is closed-form.
Stochastic Variational Inference (SVI): process mini-batches, updating the local assignment factors $q(z_i)$ exactly while taking noisy natural-gradient steps on the global factors $q(\beta_k)$ and $q(\theta_k)$. This scales DPMM inference to datasets far beyond the reach of Gibbs sampling.
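In practice you rarely need to implement truncated VI yourself: scikit-learn's `BayesianGaussianMixture` supports a stick-breaking (`dirichlet_process`) weight prior. A sketch on synthetic data—the cluster locations and hyperparameter values are illustrative:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three well-separated 2-D clusters
X = np.vstack([rng.normal(c, 0.3, size=(100, 2))
               for c in [(-3, 0), (0, 3), (3, 0)]])

# Truncated stick-breaking VI: n_components is the truncation level K;
# the DP prior prunes components the data does not need
bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,   # alpha
    max_iter=500,
    random_state=0,
).fit(X)

# Components with non-negligible posterior weight
effective = np.sum(bgm.weights_ > 0.01)
```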
Applying DPMMs effectively requires careful attention to several practical issues.
Option 1: Fix $\alpha$ based on expected clusters. Invert $\mathbb{E}[K_n] \approx \alpha \log(1 + n/\alpha)$: if you expect roughly 10 clusters in 10,000 points, $\alpha \approx 1$ is a reasonable choice.

Option 2: Hierarchical prior. Place $\alpha \sim \text{Gamma}(a, b)$ and sample $\alpha$ alongside the other variables (Escobar & West, 1995), letting the data determine the concentration.

Option 3: Cross-validation. Fit with several values of $\alpha$ and select by held-out predictive likelihood.
The base distribution $H$ should:

- place mass on all plausible parameter values (e.g., cover the observed range of the data);
- be conjugate to the likelihood $F$ where possible, so predictive densities are closed-form;
- be only weakly informative unless genuine prior knowledge is available.

Common choices:

- Gaussian likelihood: Normal-Inverse-Wishart (full covariance) or Normal-Inverse-Gamma (univariate/diagonal);
- Categorical likelihood: Dirichlet;
- Poisson likelihood: Gamma.
Use DPMM when: (1) You genuinely don't know K and want uncertainty quantification; (2) K might grow as you collect more data; (3) You want a fully Bayesian approach. Use GMM with BIC/CV when: (1) Computational efficiency is critical; (2) You have domain knowledge about K; (3) Interpretability of fixed K is important.
This page has provided a comprehensive treatment of Dirichlet Process Mixture Models, the foundational nonparametric Bayesian approach to clustering.
What's Next: Semi-Parametric Methods
DPMMs represent fully nonparametric density estimation—flexible but computationally demanding. The final page of this module explores semi-parametric methods that strike a balance: using parametric models locally while allowing nonparametric flexibility globally. This includes mixture density networks, normalizing flows for mixture models, and copula-based approaches.
You now understand how Dirichlet Process Mixture Models extend finite mixtures to infinite components, automatically learning appropriate model complexity from data. The stick-breaking and CRP constructions make this abstract concept computationally tractable. Next, we'll explore semi-parametric methods that combine parametric and nonparametric ideas.