When data consists of categories—words in documents, choices among options, species in ecosystems, states in systems—we need probability distributions over discrete alternatives. The Dirichlet-Multinomial conjugate pair is the Bayesian solution for this fundamental problem.
The Dirichlet distribution generalizes the Beta distribution from 2 categories to $K$ categories, while the Multinomial generalizes the Binomial. Together they form the conjugate pair for Bayesian inference over category probabilities.
From analyzing customer segments to modeling gene expression, from natural language processing to recommendation systems—Dirichlet-Multinomial conjugacy is ubiquitous in machine learning.
By the end of this page, you will understand the Dirichlet distribution's properties and parametrization, derive the Dirichlet-Multinomial posterior, interpret hyperparameters as pseudo-counts, handle sparse data and concentration parameters, build hierarchical categorical models, and apply these tools to text analysis and topic modeling foundations.
The Dirichlet distribution is a multivariate probability distribution over the $(K-1)$-dimensional probability simplex—the set of all valid $K$-dimensional probability vectors.
Definition: A random vector $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ follows a Dirichlet distribution with concentration parameters $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)$ where all $\alpha_k > 0$, written $\boldsymbol{\theta} \sim \text{Dir}(\boldsymbol{\alpha})$, if its density is:
$$p(\boldsymbol{\theta} | \boldsymbol{\alpha}) = \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}$$
defined on the simplex $\{ \boldsymbol{\theta} : \theta_k \geq 0, \sum_k \theta_k = 1 \}$.
The normalizing constant $B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\sum_{k=1}^K \alpha_k)}$ is the multivariate Beta function.
The Beta distribution is a special case of Dirichlet with $K=2$: Beta($\alpha, \beta$) = Dir($\alpha, \beta$). Everything you learned about Beta extends to Dirichlet. The pseudo-count interpretation, the shrinkage toward prior means, the sequential updating—all generalize naturally.
Key Properties:
Marginal Distributions: $$\theta_k \sim \text{Beta}(\alpha_k, \alpha_0 - \alpha_k)$$
where $\alpha_0 = \sum_{j=1}^K \alpha_j$ is the concentration or precision parameter.
Mean: $$\mathbb{E}[\theta_k] = \frac{\alpha_k}{\alpha_0}$$
Each category's expected probability equals its share of the concentration.
Mode (for $\alpha_k > 1$): $$\text{Mode}[\theta_k] = \frac{\alpha_k - 1}{\alpha_0 - K}$$
Variance: $$\text{Var}[\theta_k] = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)}$$
As $\alpha_0$ increases, variance decreases—the distribution concentrates.
Covariance: $$\text{Cov}[\theta_j, \theta_k] = \frac{-\alpha_j \alpha_k}{\alpha_0^2(\alpha_0 + 1)} \quad (j \neq k)$$
Categories are negatively correlated—if one probability increases, others must decrease (they sum to 1).
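To make these moment formulas concrete, here is a minimal Monte Carlo check—a sketch assuming NumPy is available; the choice $\boldsymbol{\alpha} = (10, 2, 2)$ is purely illustrative (it matches a row of the table below):

```python
import numpy as np

# Monte Carlo check of the Dirichlet mean, variance, and covariance formulas
alpha = np.array([10.0, 2.0, 2.0])
a0 = alpha.sum()

samples = np.random.dirichlet(alpha, size=200_000)

print("empirical mean:    ", samples.mean(axis=0))
print("analytic mean:     ", alpha / a0)

print("empirical var:     ", samples.var(axis=0))
print("analytic var:      ", alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))

print("empirical cov(0,1): ", np.cov(samples[:, 0], samples[:, 1])[0, 1])
print("analytic cov(0,1):  ", -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))
```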
| Parameters | Concentration $\alpha_0$ | Shape | Interpretation |
|---|---|---|---|
| Dir(1, 1, 1) | 3 | Uniform over simplex | Maximum ignorance; all distributions equally likely |
| Dir(0.5, 0.5, 0.5) | 1.5 | Favors sparse (concentrated at corners) | Jeffreys prior; expects sparsity |
| Dir(10, 10, 10) | 30 | Concentrated at center (1/3, 1/3, 1/3) | Strong belief in uniformity |
| Dir(10, 2, 2) | 14 | Concentrated near (0.71, 0.14, 0.14) | Strong belief first category dominates |
| Dir(0.1, 0.1, 0.1) | 0.3 | Strongly favors sparse distributions | Expect few categories to dominate |
The Multinomial distribution models the counts of $K$ mutually exclusive categories in $n$ independent trials—the natural generalization of the Binomial.
Definition: Given $n$ independent trials where each trial results in one of $K$ categories with probabilities $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$, the count vector $\mathbf{n} = (n_1, \ldots, n_K)$ follows a Multinomial:
$$P(\mathbf{n} | n, \boldsymbol{\theta}) = \frac{n!}{\prod_{k=1}^K n_k!} \prod_{k=1}^K \theta_k^{n_k}$$
where $\sum_k n_k = n$.
As a Likelihood: $$\mathcal{L}(\boldsymbol{\theta} | \mathbf{n}) \propto \prod_{k=1}^K \theta_k^{n_k}$$
The multinomial coefficient $\frac{n!}{\prod_k n_k!}$ is constant with respect to $\boldsymbol{\theta}$.
Why Conjugate to Dirichlet: The Dirichlet kernel $\prod_k \theta_k^{\alpha_k - 1}$ and the Multinomial likelihood kernel $\prod_k \theta_k^{n_k}$ are both products of powers of the $\theta_k$. Their product:
$$\prod_k \theta_k^{\alpha_k - 1} \cdot \prod_k \theta_k^{n_k} = \prod_k \theta_k^{\alpha_k + n_k - 1}$$
is again a Dirichlet kernel with updated parameters $\alpha_k + n_k$.
A single draw from $K$ categories follows the Categorical distribution (Multinomial with $n=1$). Observing multiple independent draws gives counts that follow Multinomial. The sufficient statistic is the count vector—we don't need to know the order of observations, only the total in each category.
The conjugate update is elegantly simple—each category's concentration parameter increases by its observed count.
Setup: We place a Dirichlet prior $\boldsymbol{\theta} \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$ on the category probabilities and observe counts $\mathbf{n} = (n_1, \ldots, n_K)$ from $n$ trials, $\mathbf{n} \sim \text{Multinomial}(n, \boldsymbol{\theta})$.
Conjugate Posterior:
$$\boldsymbol{\theta} | \mathbf{n} \sim \text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$$
Derivation:
$$P(\boldsymbol{\theta} | \mathbf{n}) \propto P(\mathbf{n} | \boldsymbol{\theta}) \cdot P(\boldsymbol{\theta})$$
$$\propto \prod_{k=1}^K \theta_k^{n_k} \cdot \prod_{k=1}^K \theta_k^{\alpha_k - 1}$$
$$= \prod_{k=1}^K \theta_k^{\alpha_k + n_k - 1}$$
This is the kernel of Dir$(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$. ∎
Posterior = Dir($\alpha_1 + n_1, ..., \alpha_K + n_K$). Each category's concentration increases by its observed count. This is exactly analogous to Beta-Binomial: $\alpha_k$ acts as a pseudo-count for category $k$, and we add real counts $n_k$ to get the posterior.
Posterior Summaries:
Posterior Mean: $$\mathbb{E}[\theta_k | \mathbf{n}] = \frac{\alpha_k + n_k}{\alpha_0 + n}$$
This is a weighted average of the prior mean ($\alpha_k / \alpha_0$) and the MLE ($n_k / n$):
$$\mathbb{E}[\theta_k | \mathbf{n}] = \frac{\alpha_0}{\alpha_0 + n} \cdot \frac{\alpha_k}{\alpha_0} + \frac{n}{\alpha_0 + n} \cdot \frac{n_k}{n}$$
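As a concrete (purely illustrative) shrinkage example: with prior Dir$(2, 2, 2)$ (so $\alpha_0 = 6$, prior mean $1/3$ per category) and counts $\mathbf{n} = (9, 3, 0)$ from $n = 12$ trials, the posterior is Dir$(11, 5, 2)$ and

$$\mathbb{E}[\theta_1 \mid \mathbf{n}] = \frac{2 + 9}{6 + 12} = \frac{11}{18} \approx 0.611 = \frac{6}{18}\cdot\frac{1}{3} + \frac{12}{18}\cdot\frac{9}{12},$$

which sits between the prior mean $0.333$ and the MLE $0.75$, pulled toward the data because $n > \alpha_0$.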
Credible Intervals: Each marginal $\theta_k | \mathbf{n} \sim \text{Beta}(\alpha_k + n_k, \alpha_0 + n - \alpha_k - n_k)$, so individual credible intervals are straightforward.
Maximum A Posteriori (MAP): $$\hat{\theta}_k^{\text{MAP}} = \frac{\alpha_k + n_k - 1}{\alpha_0 + n - K} \quad (\text{for } \alpha_k + n_k > 1)$$
```python
import numpy as np
from scipy import stats
from scipy.special import gammaln


class DirichletMultinomial:
    """
    Dirichlet-Multinomial conjugate model for categorical inference.

    Maintains a Dirichlet posterior over category probabilities,
    updated with multinomial observations.
    """

    def __init__(self, alpha: np.ndarray):
        """
        Initialize with Dirichlet concentration parameters.

        Args:
            alpha: Array of K concentration parameters (pseudo-counts)
        """
        self.alpha = np.asarray(alpha, dtype=float)
        self.K = len(alpha)
        self._validate()

    def _validate(self):
        assert np.all(self.alpha > 0), "All concentrations must be positive"

    @property
    def concentration(self) -> float:
        """Total concentration α_0 = Σα_k."""
        return np.sum(self.alpha)

    def update(self, counts: np.ndarray) -> 'DirichletMultinomial':
        """
        Update posterior with observed counts.

        Returns new DirichletMultinomial with updated parameters.
        """
        counts = np.asarray(counts, dtype=float)
        assert len(counts) == self.K, "Counts must match number of categories"
        return DirichletMultinomial(self.alpha + counts)

    def posterior_mean(self) -> np.ndarray:
        """Expected probability for each category."""
        return self.alpha / self.concentration

    def posterior_mode(self) -> np.ndarray:
        """
        Mode of posterior (MAP estimate).
        Only valid if all alpha_k > 1.
        """
        if np.all(self.alpha > 1):
            return (self.alpha - 1) / (self.concentration - self.K)
        else:
            # Mode is on boundary of simplex, at the corner with largest alpha
            mode = np.zeros(self.K)
            mode[np.argmax(self.alpha)] = 1.0
            return mode

    def posterior_variance(self) -> np.ndarray:
        """Variance for each category's probability."""
        a0 = self.concentration
        return (self.alpha * (a0 - self.alpha)) / (a0**2 * (a0 + 1))

    def marginal_credible_interval(self, k: int, confidence: float = 0.95) -> tuple:
        """
        Credible interval for θ_k using its marginal Beta distribution.
        """
        a_k = self.alpha[k]
        b_k = self.concentration - a_k
        tail = (1 - confidence) / 2
        lower = stats.beta.ppf(tail, a_k, b_k)
        upper = stats.beta.ppf(1 - tail, a_k, b_k)
        return (lower, upper)

    def sample(self, n_samples: int = 10000) -> np.ndarray:
        """
        Sample from the Dirichlet posterior.
        Returns array of shape (n_samples, K).
        """
        return np.random.dirichlet(self.alpha, size=n_samples)

    def log_marginal_likelihood(self, counts: np.ndarray) -> float:
        """
        Compute log P(counts | alpha) = log ∫ P(counts|θ) P(θ|alpha) dθ.

        This is the Dirichlet-Multinomial distribution, useful for
        model comparison.
        """
        counts = np.asarray(counts)
        n = np.sum(counts)
        a0 = self.concentration

        # log multinomial coefficient
        log_mult = gammaln(n + 1) - np.sum(gammaln(counts + 1))

        # log Beta(alpha + counts) / Beta(alpha)
        log_ratio = (gammaln(a0) - gammaln(a0 + n)
                     + np.sum(gammaln(self.alpha + counts) - gammaln(self.alpha)))

        return log_mult + log_ratio

    def summary(self, category_names: list = None):
        """Print posterior summary."""
        if category_names is None:
            category_names = [f"Cat {k}" for k in range(self.K)]
        print(f"Dirichlet({', '.join(f'{a:.2f}' for a in self.alpha)})")
        print(f"Concentration: {self.concentration:.2f}")
        print("-" * 50)
        means = self.posterior_mean()
        for k, name in enumerate(category_names):
            ci = self.marginal_credible_interval(k)
            print(f"  {name}: {means[k]:.4f} (95% CI: [{ci[0]:.4f}, {ci[1]:.4f}])")


# Example: Document Classification
# Prior: Slight preference for topic A (news), uniform otherwise
topic_names = ['News', 'Sports', 'Tech', 'Entertainment']
prior = DirichletMultinomial(np.array([2.0, 1.0, 1.0, 1.0]))

print("Prior:")
prior.summary(topic_names)

# Observed document classifications
# Suppose we classified 100 documents: 40 news, 25 sports, 20 tech, 15 entertainment
counts = np.array([40, 25, 20, 15])

posterior = prior.update(counts)

print(f"\nObserved counts: {dict(zip(topic_names, counts))}")
print(f"Total documents: {sum(counts)}")

print("\nPosterior:")
posterior.summary(topic_names)

# Probability that News is the most common topic
samples = posterior.sample(100000)
prob_news_most_common = np.mean(np.argmax(samples, axis=1) == 0)
print(f"\nP(News is most common) = {prob_news_most_common:.4f}")
```

The concentration parameter $\alpha_0 = \sum_k \alpha_k$ controls how "peaked" or "spread" the Dirichlet distribution is. Understanding this parameter is crucial for appropriate prior specification.
Symmetric Dirichlet: A common choice is the symmetric Dirichlet with $\alpha_k = \alpha/K$ for all $k$: $$\boldsymbol{\theta} \sim \text{Dir}(\alpha/K, \ldots, \alpha/K)$$
Here $\alpha$ directly controls concentration:

$\alpha > K$ (Concentrated): every $\alpha_k > 1$, so the density peaks at the uniform vector $(1/K, \ldots, 1/K)$ and samples cluster around it.

$\alpha = K$ (Uniform): every $\alpha_k = 1$, giving the flat Dirichlet—all points on the simplex are equally likely.

$\alpha < K$ (Sparse): every $\alpha_k < 1$, so mass concentrates near the corners and edges of the simplex—samples tend to put most of their probability on a few categories.
In many applications (topic models, language models, user preferences), true probability distributions are sparse—a few outcomes dominate. Setting $\alpha < 1$ encodes this belief, leading to sparser posterior estimates. This acts as a regularizer, preventing the model from spreading probability mass too thinly.
| $\alpha_k$ (per category, symmetric Dirichlet; e.g., $K = 10$) | Behavior | Expected Sparsity | Example Use Case |
|---|---|---|---|
| 0.01 | Extremely sparse | ~99% mass on 1-2 categories | One-hot categorical |
| 0.1 | Very sparse | ~90% mass on 2-3 categories | Dominant category + noise |
| 0.5 | Moderately sparse | Jeffreys prior; balanced sparsity | Default non-informative |
| 1.0 | Uniform | All distributions equally likely | Maximum entropy prior |
| 5.0 | Concentrated | Close to uniform (10% each) | Expect equal category weights |
| 50.0 | Very concentrated | Very close to 10% each | Strong uniformity belief |
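A quick empirical illustration of the sparsity behavior in the table above—a sketch assuming a symmetric Dir$(\alpha_k, \ldots, \alpha_k)$ with $K = 10$ categories, using the average size of the largest component as a rough sparsity proxy:

```python
import numpy as np

# For each per-category alpha, draw many probability vectors and look at the
# average largest component (near 1 => sparse, near 1/K => close to uniform).
K = 10
for a in [0.01, 0.1, 0.5, 1.0, 5.0, 50.0]:
    samples = np.random.dirichlet(np.full(K, a), size=50_000)
    print(f"alpha = {a:5.2f}: mean max component = {samples.max(axis=1).mean():.3f}")
```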
Asymmetric Dirichlet:
When categories have different prior expectations, use asymmetric concentrations: $$\alpha_k = \alpha_0 \cdot \pi_k$$
where $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ is the prior expected distribution and $\alpha_0$ controls concentration around that expectation.
Example: If prior belief is 50% category A, 30% B, 20% C, set $\boldsymbol{\pi} = (0.5, 0.3, 0.2)$ and choose $\alpha_0$ to reflect confidence—for instance, $\alpha_0 = 10$ gives Dir$(5, 3, 2)$, while $\alpha_0 = 100$ gives the much more concentrated Dir$(50, 30, 20)$.
The Stick-Breaking Perspective:
An alternative view of Dirichlet samples uses stick-breaking: start with a stick of length 1; break off a fraction $V_1 \sim \text{Beta}(\alpha_1, \sum_{j>1}\alpha_j)$ as $\theta_1$; from the remaining stick break off a fraction $V_2 \sim \text{Beta}(\alpha_2, \sum_{j>2}\alpha_j)$ as $\theta_2$; and continue until the last category receives whatever is left.
This view connects to the Dirichlet Process (infinite-dimensional generalization) and explains why early categories in ordered settings may receive more mass.
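Here is a minimal sketch of that construction (the helper `dirichlet_via_stick_breaking` is ours, not a library function), checking that it reproduces the Dirichlet mean:

```python
import numpy as np

# Stick-breaking construction of a Dirichlet(alpha) sample: break off a
# Beta-distributed fraction of the remaining stick for each category in turn;
# the last category gets whatever is left.
def dirichlet_via_stick_breaking(alpha: np.ndarray, rng=np.random) -> np.ndarray:
    alpha = np.asarray(alpha, dtype=float)
    K = len(alpha)
    theta = np.zeros(K)
    remaining = 1.0
    for k in range(K - 1):
        v = rng.beta(alpha[k], alpha[k + 1:].sum())  # fraction of the remaining stick
        theta[k] = v * remaining
        remaining *= (1 - v)
    theta[-1] = remaining
    return theta

alpha = np.array([3.0, 2.0, 1.0, 0.5])
samples = np.array([dirichlet_via_stick_breaking(alpha) for _ in range(100_000)])
print("stick-breaking mean:", samples.mean(axis=0))
print("analytic mean:      ", alpha / alpha.sum())
```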
The posterior predictive distribution for the next observation is crucial for decision-making and model evaluation.
Single Next Observation:
Given posterior Dir$(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$, the probability that the next observation falls in category $k$ is:
$$P(\tilde{x} = k | \mathbf{n}) = \int P(\tilde{x} = k | \boldsymbol{\theta}) P(\boldsymbol{\theta} | \mathbf{n}) d\boldsymbol{\theta}$$
$$= \mathbb{E}[\theta_k | \mathbf{n}] = \frac{\alpha_k + n_k}{\alpha_0 + n}$$
This is exactly the posterior mean for category $k$—intuitive and elegant.
Pólya Urn Interpretation:
The predictive distribution has a beautiful urn model interpretation:
Imagine an urn that starts with $\alpha_k$ balls of color $k$ (fractional counts allowed). Draw a ball at random, note its color, then return it together with one additional ball of the same color. The sequence of colors drawn has exactly the Dirichlet-Multinomial predictive distribution.
This "rich get richer" dynamic explains:
The Pólya urn is closely related to the Chinese Restaurant Process (CRP), which extends to infinite categories via the Dirichlet Process. In the CRP, customers (observations) join tables (categories), with probability proportional to table occupancy plus a term for new tables. This foundation underpins all Bayesian nonparametric clustering methods.
Multiple Future Observations:
For $m$ future observations, the posterior predictive is the Dirichlet-Multinomial (or Multivariate Pólya) distribution:
$$P(\tilde{\mathbf{n}} | \mathbf{n}) = \frac{m!}{\prod_k \tilde{n}_k!} \cdot \frac{B(\boldsymbol{\alpha} + \mathbf{n} + \tilde{\mathbf{n}})}{B(\boldsymbol{\alpha} + \mathbf{n})}$$
where $\tilde{\mathbf{n}} = (\tilde{n}_1, \ldots, \tilde{n}_K)$ are counts of future observations and $\sum_k \tilde{n}_k = m$.
This distribution accounts for uncertainty in $\boldsymbol{\theta}$ and is overdispersed relative to the Multinomial—future counts have higher variance than if we plugged in a point estimate.
Properties: writing $\boldsymbol{\alpha}' = \boldsymbol{\alpha} + \mathbf{n}$ and $\alpha_0' = \sum_k \alpha_k'$, the Dirichlet-Multinomial for $m$ future observations has:

- Mean: $\mathbb{E}[\tilde{n}_k] = m \, \alpha_k' / \alpha_0'$
- Variance: $\text{Var}[\tilde{n}_k] = m \, \frac{\alpha_k'}{\alpha_0'}\left(1 - \frac{\alpha_k'}{\alpha_0'}\right)\frac{\alpha_0' + m}{\alpha_0' + 1}$, larger than the plug-in Multinomial variance by the factor $(\alpha_0' + m)/(\alpha_0' + 1)$
- Exchangeability: any ordering of the $m$ future observations has the same probability
```python
import numpy as np
from scipy.special import gammaln


def dirichlet_multinomial_pmf(counts: np.ndarray, alpha: np.ndarray) -> float:
    """
    Dirichlet-Multinomial PMF: P(counts | alpha).

    This is the posterior predictive distribution for new counts
    given a Dirichlet prior/posterior.

    Args:
        counts: Array of K category counts
        alpha: Array of K Dirichlet concentration parameters

    Returns:
        Probability P(counts | alpha)
    """
    counts = np.asarray(counts)
    alpha = np.asarray(alpha)
    n = np.sum(counts)
    a0 = np.sum(alpha)

    # Log multinomial coefficient
    log_mult = gammaln(n + 1) - np.sum(gammaln(counts + 1))

    # Log ratio of Beta functions
    log_ratio = (gammaln(a0) - gammaln(a0 + n)
                 + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

    return np.exp(log_mult + log_ratio)


def predictive_next_category(alpha_posterior: np.ndarray) -> np.ndarray:
    """
    Probability distribution over next observation's category.

    P(x_new = k | data) = E[θ_k | data] = α_k / α_0

    This is much simpler than the general Dirichlet-Multinomial
    because we're predicting a single observation.
    """
    return alpha_posterior / np.sum(alpha_posterior)


def polya_urn_simulation(alpha: np.ndarray, n_draws: int) -> np.ndarray:
    """
    Simulate the Pólya urn process.

    Start with α_k balls of color k (conceptually).
    After each draw, add one ball of the drawn color.
    This generates samples from the Dirichlet-Multinomial.
    """
    K = len(alpha)
    current_alpha = alpha.copy()
    draws = np.zeros(n_draws, dtype=int)

    for i in range(n_draws):
        # Probability of each category
        p = current_alpha / current_alpha.sum()
        # Draw category
        draws[i] = np.random.choice(K, p=p)
        # Add one ball to drawn category
        current_alpha[draws[i]] += 1

    return draws


# Example: Word prediction in language model
# Suppose we have a vocabulary of 1000 words
# Prior: Symmetric Dirichlet with sparsity (most words are rare)
K = 1000
alpha_prior = np.ones(K) * 0.1  # Sparse prior

# Observed word counts from a document
# Most entries are 0 (words not seen), a few are > 0
np.random.seed(42)
observed_counts = np.zeros(K)
# Simulate 500 words, with only ~50 unique words appearing
for _ in range(500):
    word_id = int(np.random.exponential(20))  # Zipf-like
    if word_id < K:
        observed_counts[word_id] += 1

# Posterior
alpha_posterior = alpha_prior + observed_counts

# Predictive probability for next word
next_word_probs = predictive_next_category(alpha_posterior)

# Top 10 most likely words
top_10 = np.argsort(next_word_probs)[-10:][::-1]
print("Top 10 predicted words:")
for idx in top_10:
    print(f"  Word {idx}: {next_word_probs[idx]:.4f} "
          f"(observed {int(observed_counts[idx])} times)")

# Demonstrate overdispersion
# Compare variance of predictive vs plug-in Multinomial
n_future = 100
n_simulations = 10000

# Posterior predictive: sample θ then sample counts
theta_samples = np.random.dirichlet(alpha_posterior, size=n_simulations)
pp_samples = np.array([np.random.multinomial(n_future, theta)
                       for theta in theta_samples])
pp_variance = pp_samples[:, 0].var()

# Plug-in Multinomial: use posterior mean
plugin_p = alpha_posterior / alpha_posterior.sum()
plugin_samples = np.random.multinomial(n_future, plugin_p, size=n_simulations)
plugin_variance = plugin_samples[:, 0].var()

print(f"\nPredicting {n_future} future words for word 0:")
print(f"  Posterior predictive variance: {pp_variance:.2f}")
print(f"  Plug-in Multinomial variance: {plugin_variance:.2f}")
print(f"  Overdispersion ratio: {pp_variance / plugin_variance:.2f}x")
```

Dirichlet-Multinomial conjugacy is fundamental to many machine learning models. Let's examine key applications.
Application 1: Naive Bayes Text Classification
In multinomial Naive Bayes, each class $c$ has a word distribution $\boldsymbol{\theta}_c$:
$$P(\text{document} | c) = \text{Multinomial}(\mathbf{n} | \boldsymbol{\theta}_c)$$
Bayesian Training: Place a Dir$(\boldsymbol{\alpha})$ prior on each class's word distribution $\boldsymbol{\theta}_c$, accumulate the word counts $n_{c,w}$ from the training documents of class $c$, and use the posterior mean $\hat{\theta}_{c,w} = (\alpha_w + n_{c,w}) / (\alpha_0 + n_c)$ as the smoothed word probability.
Laplace Smoothing is Bayesian: The common practice of adding 1 to all word counts (Laplace smoothing) corresponds to using a Dir$(1, \ldots, 1)$ prior. More generally, adding $\alpha$ corresponds to using a symmetric Dir$(\alpha, \ldots, \alpha)$ prior.
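A toy sketch of this equivalence—the word counts below are made up; the check simply confirms that the Dir$(1, \ldots, 1)$ posterior mean coincides with add-one smoothing:

```python
import numpy as np

# word_counts[c, w] = number of times word w appears in training docs of class c
word_counts = np.array([[12, 0, 3, 5],    # class 0
                        [ 1, 7, 0, 2]])   # class 1
alpha = 1.0  # symmetric Dir(1, ..., 1) prior == "add-one" Laplace smoothing
V = word_counts.shape[1]

# Dirichlet posterior mean per class
posterior_mean = (word_counts + alpha) / (word_counts.sum(axis=1, keepdims=True) + alpha * V)

# Classical Laplace smoothing
laplace = (word_counts + 1) / (word_counts.sum(axis=1, keepdims=True) + V)

print(np.allclose(posterior_mean, laplace))  # True: identical estimates
print(posterior_mean)
```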
Application 2: Latent Dirichlet Allocation (LDA)
LDA is a hierarchical Dirichlet model for discovering topics in text:
$$\boldsymbol{\theta}_d \sim \text{Dir}(\boldsymbol{\alpha}) \quad \text{(topic distribution for document } d \text{)}$$
$$\boldsymbol{\phi}_k \sim \text{Dir}(\boldsymbol{\beta}) \quad \text{(word distribution for topic } k \text{)}$$
$$z_{d,n} \mid \boldsymbol{\theta}_d \sim \text{Categorical}(\boldsymbol{\theta}_d) \quad \text{(topic for word } n \text{ in doc } d \text{)}$$
$$w_{d,n} \mid z_{d,n}, \boldsymbol{\Phi} \sim \text{Categorical}(\boldsymbol{\phi}_{z_{d,n}}) \quad \text{(observed word)}$$
The Dirichlet-Multinomial conjugacy makes Gibbs sampling tractable—conditional posteriors for $\boldsymbol{\theta}_d$ and $\boldsymbol{\phi}_k$ are simply updated Dirichlets.
In LDA, setting $\alpha < 1$ encourages documents to focus on few topics (topic sparsity), while $\beta < 1$ encourages topics to focus on few words (word sparsity). These hyperparameters significantly affect discovered topics—lower values yield more interpretable, sparse topics; higher values yield more blended topics.
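A minimal sketch of LDA's generative process with toy sizes and assumed hyperparameters ($\alpha = 0.3$, $\beta = 0.1$ are illustrative), just to show where the two Dirichlets enter:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, D, doc_len = 50, 3, 5, 20       # vocab size, topics, documents, words per doc
alpha, beta = 0.3, 0.1                # sparse document-topic and topic-word priors

phi = rng.dirichlet(np.full(V, beta), size=T)       # topic-word distributions
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(T, alpha))      # document-topic distribution
    z = rng.choice(T, size=doc_len, p=theta_d)      # topic for each word slot
    words = [rng.choice(V, p=phi[t]) for t in z]    # observed words
    docs.append(words)

print(docs[0])
```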
Application 3: Mixture Model Priors
For a Gaussian mixture model with $K$ components, the mixture weights $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ form a probability vector:
$$\boldsymbol{\pi} \sim \text{Dir}(\boldsymbol{\alpha})$$
After observing cluster assignments $z_1, \ldots, z_n$ (or in E-step, expected assignments):
$$\boldsymbol{\pi} | z_{1:n} \sim \text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$$
where $n_k = \sum_i \mathbb{1}[z_i = k]$ is the count assigned to cluster $k$.
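A small sketch of this weight update with illustrative assignments:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])          # Dir prior on the K=3 mixture weights
z = np.array([0, 0, 1, 0, 2, 1, 0, 0])     # cluster assignments from E-step / sampler

counts = np.bincount(z, minlength=len(alpha))   # n_k = number assigned to cluster k
alpha_post = alpha + counts                      # Dir(alpha_1+n_1, ..., alpha_K+n_K)
print("posterior:", alpha_post, "mean:", alpha_post / alpha_post.sum())
```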
Application 4: Hierarchical Dirichlet Process (HDP)
For nonparametric topic models with unknown number of topics, HDP uses nested Dirichlet processes:
$$G_0 \sim \text{DP}(\gamma, H) \quad \text{(global measure over topics)}$$ $$G_d \sim \text{DP}(\alpha, G_0) \quad \text{(document-specific measure)}$$
The Dirichlet-Multinomial conjugacy extends to these infinite-dimensional settings, enabling tractable inference via Chinese Restaurant Franchise sampling.
When $K$ is large (vocabulary size in text, number of categories in high-cardinality features), computational efficiency becomes important.
Numerical Stability:
For large $K$ or extreme $\alpha$ values, direct computation of the Dirichlet density can overflow/underflow. Always work in log-space:
$$\log p(\boldsymbol{\theta} | \boldsymbol{\alpha}) = \log\Gamma(\alpha_0) - \sum_k \log\Gamma(\alpha_k) + \sum_k (\alpha_k - 1) \log\theta_k$$
Use gammaln (log-gamma) functions rather than gamma followed by log.
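A minimal log-space sketch (the helper `dirichlet_logpdf` is ours, not a library function), cross-checked against SciPy's implementation:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def dirichlet_logpdf(theta: np.ndarray, alpha: np.ndarray) -> float:
    """Log-density of Dir(alpha) at theta, computed entirely in log-space."""
    theta, alpha = np.asarray(theta), np.asarray(alpha)
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + np.sum((alpha - 1) * np.log(theta)))

alpha = np.full(1000, 0.01)                  # large K with extreme concentrations
theta = np.random.dirichlet(np.ones(1000))   # a point strictly inside the simplex
print(dirichlet_logpdf(theta, alpha))
print(stats.dirichlet.logpdf(theta, alpha))  # agrees with SciPy
```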
Sampling Efficiency:
Standard Dirichlet sampling via $K$ independent Gamma draws (normalize $G_k \sim \text{Gamma}(\alpha_k, 1)$) is $O(K)$, as sketched below; for sparse symmetric Dirichlets, where most $\alpha_k$ are identical and small, more specialized methods can reduce this cost.
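For reference, here is the Gamma-based sampler (a sketch; the helper name is ours, not a library function):

```python
import numpy as np

def sample_dirichlet_via_gamma(alpha: np.ndarray, size: int, rng=None) -> np.ndarray:
    """Draw `size` Dirichlet(alpha) vectors by normalizing Gamma(alpha_k, 1) draws."""
    rng = rng or np.random.default_rng()
    g = rng.gamma(shape=alpha, scale=1.0, size=(size, len(alpha)))  # G_k ~ Gamma(alpha_k, 1)
    return g / g.sum(axis=1, keepdims=True)                          # theta_k = G_k / sum_j G_j

alpha = np.array([0.1, 0.5, 2.0, 5.0])
print(sample_dirichlet_via_gamma(alpha, size=3))
```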
Storage for High-Dimensional Posteriors:
With vocabulary size $V = 100{,}000$ and $T = 100$ topics, the topic-word matrix alone has $T \times V = 10^7$ entries (roughly 80 MB in 64-bit floats), plus a $D \times T$ document-topic matrix that grows with the corpus—so in practice the sufficient statistics (counts) are typically kept in sparse data structures rather than dense posterior parameter arrays.
Collapsed Gibbs Sampling:
In models like LDA, instead of sampling $\boldsymbol{\theta}_d$ and $\boldsymbol{\phi}_k$, we can marginalize (integrate them out) thanks to conjugacy:
$$P(z_{d,n} = k \mid \mathbf{z}_{\neg(d,n)}, \mathbf{w}) \propto \frac{n_{d,k}^{\neg(d,n)} + \alpha_k}{n_d - 1 + \alpha_0} \cdot \frac{n_{k,w_{d,n}}^{\neg(d,n)} + \beta}{n_k^{\neg(d,n)} + V\beta}$$

where $n_{d,k}^{\neg(d,n)}$ counts tokens in document $d$ assigned to topic $k$, $n_{k,w}^{\neg(d,n)}$ counts assignments of word $w$ to topic $k$, and $n_k^{\neg(d,n)}$ is the total count for topic $k$—all excluding the current token $(d,n)$.
This collapsed sampler is more efficient (fewer variables) and mixes faster (integrating out parameters reduces variance).
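Below is a compact, illustrative sketch of this collapsed sampler on a toy corpus (the documents, hyperparameters, and variable names are all made up); the per-token factor $1/(n_d - 1 + \alpha_0)$ is constant across topics, so it is absorbed into the normalization:

```python
import numpy as np

rng = np.random.default_rng(1)
docs = [[0, 1, 1, 3], [2, 2, 3, 0], [1, 1, 0, 2]]   # word ids per document
V, T = 4, 2
alpha, beta = 0.5, 0.1

# Initialize topic assignments and count tables
z = [[rng.integers(T) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), T))          # topic counts per document
n_kw = np.zeros((T, V))                  # word counts per topic
n_k = np.zeros(T)                        # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1

for sweep in range(100):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the current token from all counts
            n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
            # Collapsed conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            t = rng.choice(T, p=p / p.sum())
            # Add the token back with its new topic
            z[d][i] = t
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1

print(n_kw)   # topic-word counts after sampling
```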
Products of many probabilities (common in multinomial and Dirichlet calculations) quickly underflow to zero. Always compute log-probabilities and use log-sum-exp tricks. Functions like scipy.special.logsumexp and scipy.special.gammaln are essential for numerical stability with Dirichlet-Multinomial models.
We have comprehensively explored Dirichlet-Multinomial conjugacy—the foundation for Bayesian categorical inference—covering the Dirichlet's geometry and moments, the count-based posterior update, concentration and sparsity control, the Pólya-urn predictive distribution, and applications from Naive Bayes to LDA.
What's Next:
Having mastered the three fundamental conjugate families—Beta-Binomial, Gaussian-Gaussian, and Dirichlet-Multinomial—we now turn to practical synthesis. The next page covers Practical Implications: how to choose priors, conduct sensitivity analysis, and deploy conjugate models in real-world systems.
You now command the Dirichlet-Multinomial conjugate family—essential for any categorical inference problem. You understand the distribution properties, posterior updates, sparsity control, predictive distributions, and applications spanning topic models to Naive Bayes. Next, we synthesize practical guidance for deploying conjugate priors in production systems.