When data consists of categories—words in documents, choices among options, species in ecosystems, states in systems—we need probability distributions over discrete alternatives. The Dirichlet-Multinomial conjugate pair is the Bayesian solution for this fundamental problem.
The Dirichlet distribution generalizes the Beta distribution from 2 categories to $K$ categories, while the Multinomial generalizes the Binomial. Together they form the conjugate pair for Bayesian inference over category probabilities.
From analyzing customer segments to modeling gene expression, from natural language processing to recommendation systems—Dirichlet-Multinomial conjugacy is ubiquitous in machine learning.
By the end of this page, you will understand the Dirichlet distribution's properties and parametrization, derive the Dirichlet-Multinomial posterior, interpret hyperparameters as pseudo-counts, handle sparse data and concentration parameters, build hierarchical categorical models, and apply these tools to text analysis and topic modeling foundations.
The Dirichlet distribution is a multivariate probability distribution over the $(K-1)$-dimensional probability simplex—the set of all valid $K$-dimensional probability vectors.
Definition: A random vector $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ follows a Dirichlet distribution with concentration parameters $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)$ where all $\alpha_k > 0$, written $\boldsymbol{\theta} \sim \text{Dir}(\boldsymbol{\alpha})$, if its density is:
$$p(\boldsymbol{\theta} | \boldsymbol{\alpha}) = \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}$$
defined on the simplex $\{ \boldsymbol{\theta} : \theta_k \geq 0, \sum_k \theta_k = 1 \}$.
The normalizing constant $B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\sum_{k=1}^K \alpha_k)}$ is the multivariate Beta function.
The Beta distribution is a special case of Dirichlet with $K=2$: Beta($\alpha, \beta$) = Dir($\alpha, \beta$). Everything you learned about Beta extends to Dirichlet. The pseudo-count interpretation, the shrinkage toward prior means, the sequential updating—all generalize naturally.
Key Properties:
Marginal Distributions: $$\theta_k \sim \text{Beta}(\alpha_k, \alpha_0 - \alpha_k)$$
where $\alpha_0 = \sum_{j=1}^K \alpha_j$ is the concentration or precision parameter.
Mean: $$\mathbb{E}[\theta_k] = \frac{\alpha_k}{\alpha_0}$$
Each category's expected probability equals its share of the concentration.
Mode (for $\alpha_k > 1$): $$\text{Mode}[\theta_k] = \frac{\alpha_k - 1}{\alpha_0 - K}$$
Variance: $$\text{Var}[\theta_k] = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)}$$
As $\alpha_0$ increases, variance decreases—the distribution concentrates.
Covariance: $$\text{Cov}[\theta_j, \theta_k] = \frac{-\alpha_j \alpha_k}{\alpha_0^2(\alpha_0 + 1)} \quad (j \neq k)$$
Categories are negatively correlated—if one probability increases, others must decrease (they sum to 1).
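To make these moment formulas concrete, here is a minimal Monte Carlo check—a sketch assuming NumPy is available; the choice $\boldsymbol{\alpha} = (10, 2, 2)$ is purely illustrative (it matches a row of the table below):

```python
import numpy as np

# Monte Carlo check of the Dirichlet mean, variance, and covariance formulas
alpha = np.array([10.0, 2.0, 2.0])
a0 = alpha.sum()

samples = np.random.dirichlet(alpha, size=200_000)

print("empirical mean:    ", samples.mean(axis=0))
print("analytic mean:     ", alpha / a0)

print("empirical var:     ", samples.var(axis=0))
print("analytic var:      ", alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))

print("empirical cov(0,1): ", np.cov(samples[:, 0], samples[:, 1])[0, 1])
print("analytic cov(0,1):  ", -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))
```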
| Parameters | Concentration $\alpha_0$ | Shape | Interpretation |
|---|---|---|---|
| Dir(1, 1, 1) | 3 | Uniform over simplex | Maximum ignorance; all distributions equally likely |
| Dir(0.5, 0.5, 0.5) | 1.5 | Favors sparse (concentrated at corners) | Jeffreys prior; expects sparsity |
| Dir(10, 10, 10) | 30 | Concentrated at center (1/3, 1/3, 1/3) | Strong belief in uniformity |
| Dir(10, 2, 2) | 14 | Concentrated near (0.71, 0.14, 0.14) | Strong belief first category dominates |
| Dir(0.1, 0.1, 0.1) | 0.3 | Strongly favors sparse distributions | Expect few categories to dominate |
The Multinomial distribution models the counts of $K$ mutually exclusive categories in $n$ independent trials—the natural generalization of the Binomial.
Definition: Given $n$ independent trials where each trial results in one of $K$ categories with probabilities $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$, the count vector $\mathbf{n} = (n_1, \ldots, n_K)$ follows a Multinomial:
$$P(\mathbf{n} | n, \boldsymbol{\theta}) = \frac{n!}{\prod_{k=1}^K n_k!} \prod_{k=1}^K \theta_k^{n_k}$$
where $\sum_k n_k = n$.
As a Likelihood: $$\mathcal{L}(\boldsymbol{\theta} | \mathbf{n}) \propto \prod_{k=1}^K \theta_k^{n_k}$$
The multinomial coefficient $\frac{n!}{\prod_k n_k!}$ is constant with respect to $\boldsymbol{\theta}$.
Why Conjugate to Dirichlet: The Dirichlet kernel $\prod_k \theta_k^{\alpha_k - 1}$ and the Multinomial likelihood kernel $\prod_k \theta_k^{n_k}$ are both products of powers of the $\theta_k$. Their product:
$$\prod_k \theta_k^{\alpha_k - 1} \cdot \prod_k \theta_k^{n_k} = \prod_k \theta_k^{\alpha_k + n_k - 1}$$
is again a Dirichlet kernel with updated parameters $\alpha_k + n_k$.
A single draw from $K$ categories follows the Categorical distribution (Multinomial with $n=1$). Observing multiple independent draws gives counts that follow Multinomial. The sufficient statistic is the count vector—we don't need to know the order of observations, only the total in each category.
The conjugate update is elegantly simple—each category's concentration parameter increases by its observed count.
Setup: We place a Dirichlet prior $\boldsymbol{\theta} \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$ on the category probabilities and observe counts $\mathbf{n} = (n_1, \ldots, n_K)$ from $n$ trials, $\mathbf{n} \sim \text{Multinomial}(n, \boldsymbol{\theta})$.
Conjugate Posterior:
$$\boldsymbol{\theta} | \mathbf{n} \sim \text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$$
Derivation:
$$P(\boldsymbol{\theta} | \mathbf{n}) \propto P(\mathbf{n} | \boldsymbol{\theta}) \cdot P(\boldsymbol{\theta})$$
$$\propto \prod_{k=1}^K \theta_k^{n_k} \cdot \prod_{k=1}^K \theta_k^{\alpha_k - 1}$$
$$= \prod_{k=1}^K \theta_k^{\alpha_k + n_k - 1}$$
This is the kernel of Dir$(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$. ∎
Posterior = Dir($\alpha_1 + n_1, ..., \alpha_K + n_K$). Each category's concentration increases by its observed count. This is exactly analogous to Beta-Binomial: $\alpha_k$ acts as a pseudo-count for category $k$, and we add real counts $n_k$ to get the posterior.
Posterior Summaries:
Posterior Mean: $$\mathbb{E}[\theta_k | \mathbf{n}] = \frac{\alpha_k + n_k}{\alpha_0 + n}$$
This is a weighted average of the prior mean ($\alpha_k / \alpha_0$) and the MLE ($n_k / n$):
$$\mathbb{E}[\theta_k | \mathbf{n}] = \frac{\alpha_0}{\alpha_0 + n} \cdot \frac{\alpha_k}{\alpha_0} + \frac{n}{\alpha_0 + n} \cdot \frac{n_k}{n}$$
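As a concrete (purely illustrative) shrinkage example: with prior Dir$(2, 2, 2)$ (so $\alpha_0 = 6$, prior mean $1/3$ per category) and counts $\mathbf{n} = (9, 3, 0)$ from $n = 12$ trials, the posterior is Dir$(11, 5, 2)$ and

$$\mathbb{E}[\theta_1 \mid \mathbf{n}] = \frac{2 + 9}{6 + 12} = \frac{11}{18} \approx 0.611 = \frac{6}{18}\cdot\frac{1}{3} + \frac{12}{18}\cdot\frac{9}{12},$$

which sits between the prior mean $0.333$ and the MLE $0.75$, pulled toward the data because $n > \alpha_0$.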
Credible Intervals: Each marginal $\theta_k | \mathbf{n} \sim \text{Beta}(\alpha_k + n_k, \alpha_0 + n - \alpha_k - n_k)$, so individual credible intervals are straightforward.
Maximum A Posteriori (MAP): $$\hat{\theta}_k^{\text{MAP}} = \frac{\alpha_k + n_k - 1}{\alpha_0 + n - K} \quad (\text{for } \alpha_k + n_k > 1)$$
```python
import numpy as np
from scipy import stats
from scipy.special import gammaln


class DirichletMultinomial:
    """
    Dirichlet-Multinomial conjugate model for categorical inference.

    Maintains a Dirichlet posterior over category probabilities,
    updated with multinomial observations.
    """

    def __init__(self, alpha: np.ndarray):
        """
        Initialize with Dirichlet concentration parameters.

        Args:
            alpha: Array of K concentration parameters (pseudo-counts)
        """
        self.alpha = np.asarray(alpha, dtype=float)
        self.K = len(alpha)
        self._validate()

    def _validate(self):
        assert np.all(self.alpha > 0), "All concentrations must be positive"

    @property
    def concentration(self) -> float:
        """Total concentration α_0 = Σα_k."""
        return np.sum(self.alpha)

    def update(self, counts: np.ndarray) -> 'DirichletMultinomial':
        """
        Update posterior with observed counts.

        Returns new DirichletMultinomial with updated parameters.
        """
        counts = np.asarray(counts, dtype=float)
        assert len(counts) == self.K, "Counts must match number of categories"
        return DirichletMultinomial(self.alpha + counts)

    def posterior_mean(self) -> np.ndarray:
        """Expected probability for each category."""
        return self.alpha / self.concentration

    def posterior_mode(self) -> np.ndarray:
        """
        Mode of posterior (MAP estimate).
        Only valid if all alpha_k > 1.
        """
        if np.all(self.alpha > 1):
            return (self.alpha - 1) / (self.concentration - self.K)
        else:
            # Mode is on boundary of simplex, at the corner with largest alpha
            mode = np.zeros(self.K)
            mode[np.argmax(self.alpha)] = 1.0
            return mode

    def posterior_variance(self) -> np.ndarray:
        """Variance for each category's probability."""
        a0 = self.concentration
        return (self.alpha * (a0 - self.alpha)) / (a0**2 * (a0 + 1))

    def marginal_credible_interval(self, k: int, confidence: float = 0.95) -> tuple:
        """
        Credible interval for θ_k using its marginal Beta distribution.
        """
        a_k = self.alpha[k]
        b_k = self.concentration - a_k
        tail = (1 - confidence) / 2
        lower = stats.beta.ppf(tail, a_k, b_k)
        upper = stats.beta.ppf(1 - tail, a_k, b_k)
        return (lower, upper)

    def sample(self, n_samples: int = 10000) -> np.ndarray:
        """
        Sample from the Dirichlet posterior.
        Returns array of shape (n_samples, K).
        """
        return np.random.dirichlet(self.alpha, size=n_samples)

    def log_marginal_likelihood(self, counts: np.ndarray) -> float:
        """
        Compute log P(counts | alpha) = log ∫ P(counts|θ) P(θ|alpha) dθ.

        This is the Dirichlet-Multinomial distribution, useful for
        model comparison.
        """
        counts = np.asarray(counts)
        n = np.sum(counts)
        a0 = self.concentration

        # log multinomial coefficient
        log_mult = gammaln(n + 1) - np.sum(gammaln(counts + 1))

        # log Beta(alpha + counts) / Beta(alpha)
        log_ratio = (gammaln(a0) - gammaln(a0 + n)
                     + np.sum(gammaln(self.alpha + counts) - gammaln(self.alpha)))

        return log_mult + log_ratio

    def summary(self, category_names: list = None):
        """Print posterior summary."""
        if category_names is None:
            category_names = [f"Cat {k}" for k in range(self.K)]
        print(f"Dirichlet({', '.join(f'{a:.2f}' for a in self.alpha)})")
        print(f"Concentration: {self.concentration:.2f}")
        print("-" * 50)
        means = self.posterior_mean()
        for k, name in enumerate(category_names):
            ci = self.marginal_credible_interval(k)
            print(f"  {name}: {means[k]:.4f} (95% CI: [{ci[0]:.4f}, {ci[1]:.4f}])")


# Example: Document Classification
# Prior: Slight preference for topic A (news), uniform otherwise
topic_names = ['News', 'Sports', 'Tech', 'Entertainment']
prior = DirichletMultinomial(np.array([2.0, 1.0, 1.0, 1.0]))

print("Prior:")
prior.summary(topic_names)

# Observed document classifications
# Suppose we classified 100 documents: 40 news, 25 sports, 20 tech, 15 entertainment
counts = np.array([40, 25, 20, 15])

posterior = prior.update(counts)

print(f"\nObserved counts: {dict(zip(topic_names, counts))}")
print(f"Total documents: {sum(counts)}")

print("\nPosterior:")
posterior.summary(topic_names)

# Probability that News is the most common topic
samples = posterior.sample(100000)
prob_news_most_common = np.mean(np.argmax(samples, axis=1) == 0)
print(f"\nP(News is most common) = {prob_news_most_common:.4f}")
```

The concentration parameter $\alpha_0 = \sum_k \alpha_k$ controls how "peaked" or "spread" the Dirichlet distribution is. Understanding this parameter is crucial for appropriate prior specification.
Symmetric Dirichlet: A common choice is the symmetric Dirichlet with $\alpha_k = \alpha/K$ for all $k$: $$\boldsymbol{\theta} \sim \text{Dir}(\alpha/K, \ldots, \alpha/K)$$
Here $\alpha$ directly controls concentration:

$\alpha > K$ (Concentrated): every $\alpha_k > 1$, so the density peaks at the uniform vector $(1/K, \ldots, 1/K)$ and samples cluster around it.

$\alpha = K$ (Uniform): every $\alpha_k = 1$, giving the flat Dirichlet—all points on the simplex are equally likely.

$\alpha < K$ (Sparse): every $\alpha_k < 1$, so mass concentrates near the corners and edges of the simplex—samples tend to put most of their probability on a few categories.
In many applications (topic models, language models, user preferences), true probability distributions are sparse—a few outcomes dominate. Setting $\alpha < 1$ encodes this belief, leading to sparser posterior estimates. This acts as a regularizer, preventing the model from spreading probability mass too thinly.
| $\alpha_k$ (per category, symmetric Dirichlet; e.g., $K = 10$) | Behavior | Expected Sparsity | Example Use Case |
|---|---|---|---|
| 0.01 | Extremely sparse | ~99% mass on 1-2 categories | One-hot categorical |
| 0.1 | Very sparse | ~90% mass on 2-3 categories | Dominant category + noise |
| 0.5 | Moderately sparse | Jeffreys prior; balanced sparsity | Default non-informative |
| 1.0 | Uniform | All distributions equally likely | Maximum entropy prior |
| 5.0 | Concentrated | Close to uniform (10% each) | Expect equal category weights |
| 50.0 | Very concentrated | Very close to 10% each | Strong uniformity belief |
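A quick empirical illustration of the sparsity behavior in the table above—a sketch assuming a symmetric Dir$(\alpha_k, \ldots, \alpha_k)$ with $K = 10$ categories, using the average size of the largest component as a rough sparsity proxy:

```python
import numpy as np

# For each per-category alpha, draw many probability vectors and look at the
# average largest component (near 1 => sparse, near 1/K => close to uniform).
K = 10
for a in [0.01, 0.1, 0.5, 1.0, 5.0, 50.0]:
    samples = np.random.dirichlet(np.full(K, a), size=50_000)
    print(f"alpha = {a:5.2f}: mean max component = {samples.max(axis=1).mean():.3f}")
```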
Asymmetric Dirichlet:
When categories have different prior expectations, use asymmetric concentrations: $$\alpha_k = \alpha_0 \cdot \pi_k$$
where $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ is the prior expected distribution and $\alpha_0$ controls concentration around that expectation.
Example: If prior belief is 50% category A, 30% B, 20% C, set $\boldsymbol{\pi} = (0.5, 0.3, 0.2)$ and choose $\alpha_0$ to reflect confidence—for instance, $\alpha_0 = 10$ gives Dir$(5, 3, 2)$, while $\alpha_0 = 100$ gives the much more concentrated Dir$(50, 30, 20)$.
The Stick-Breaking Perspective:
An alternative view of Dirichlet samples uses stick-breaking: start with a stick of length 1; break off a fraction $V_1 \sim \text{Beta}(\alpha_1, \sum_{j>1}\alpha_j)$ as $\theta_1$; from the remaining stick break off a fraction $V_2 \sim \text{Beta}(\alpha_2, \sum_{j>2}\alpha_j)$ as $\theta_2$; and continue until the last category receives whatever is left.
This view connects to the Dirichlet Process (infinite-dimensional generalization) and explains why early categories in ordered settings may receive more mass.
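Here is a minimal sketch of that construction (the helper `dirichlet_via_stick_breaking` is ours, not a library function), checking that it reproduces the Dirichlet mean:

```python
import numpy as np

# Stick-breaking construction of a Dirichlet(alpha) sample: break off a
# Beta-distributed fraction of the remaining stick for each category in turn;
# the last category gets whatever is left.
def dirichlet_via_stick_breaking(alpha: np.ndarray, rng=np.random) -> np.ndarray:
    alpha = np.asarray(alpha, dtype=float)
    K = len(alpha)
    theta = np.zeros(K)
    remaining = 1.0
    for k in range(K - 1):
        v = rng.beta(alpha[k], alpha[k + 1:].sum())  # fraction of the remaining stick
        theta[k] = v * remaining
        remaining *= (1 - v)
    theta[-1] = remaining
    return theta

alpha = np.array([3.0, 2.0, 1.0, 0.5])
samples = np.array([dirichlet_via_stick_breaking(alpha) for _ in range(100_000)])
print("stick-breaking mean:", samples.mean(axis=0))
print("analytic mean:      ", alpha / alpha.sum())
```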
The posterior predictive distribution for the next observation is crucial for decision-making and model evaluation.
Single Next Observation:
Given posterior Dir$(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$, the probability that the next observation falls in category $k$ is:
$$P(\tilde{x} = k | \mathbf{n}) = \int P(\tilde{x} = k | \boldsymbol{\theta}) P(\boldsymbol{\theta} | \mathbf{n}) d\boldsymbol{\theta}$$
$$= \mathbb{E}[\theta_k | \mathbf{n}] = \frac{\alpha_k + n_k}{\alpha_0 + n}$$
This is exactly the posterior mean for category $k$—intuitive and elegant.
Pólya Urn Interpretation:
The predictive distribution has a beautiful urn model interpretation:
Imagine an urn that starts with $\alpha_k$ balls of color $k$ (fractional counts allowed). Draw a ball at random, note its color, then return it together with one additional ball of the same color. The sequence of colors drawn has exactly the Dirichlet-Multinomial predictive distribution.
This "rich get richer" dynamic explains:
The Pólya urn is closely related to the Chinese Restaurant Process (CRP), which extends to infinite categories via the Dirichlet Process. In the CRP, customers (observations) join tables (categories), with probability proportional to table occupancy plus a term for new tables. This foundation underpins all Bayesian nonparametric clustering methods.
Multiple Future Observations:
For $m$ future observations, the posterior predictive is the Dirichlet-Multinomial (or Multivariate Pólya) distribution:
$$P(\tilde{\mathbf{n}} | \mathbf{n}) = \frac{m!}{\prod_k \tilde{n}_k!} \cdot \frac{B(\boldsymbol{\alpha} + \mathbf{n} + \tilde{\mathbf{n}})}{B(\boldsymbol{\alpha} + \mathbf{n})}$$
where $\tilde{\mathbf{n}} = (\tilde{n}_1, \ldots, \tilde{n}_K)$ are counts of future observations and $\sum_k \tilde{n}_k = m$.
This distribution accounts for uncertainty in $\boldsymbol{\theta}$ and is overdispersed relative to the Multinomial—future counts have higher variance than if we plugged in a point estimate.
Properties: writing $\boldsymbol{\alpha}' = \boldsymbol{\alpha} + \mathbf{n}$ and $\alpha_0' = \sum_k \alpha_k'$, the Dirichlet-Multinomial for $m$ future observations has:

- Mean: $\mathbb{E}[\tilde{n}_k] = m \, \alpha_k' / \alpha_0'$
- Variance: $\text{Var}[\tilde{n}_k] = m \, \frac{\alpha_k'}{\alpha_0'}\left(1 - \frac{\alpha_k'}{\alpha_0'}\right)\frac{\alpha_0' + m}{\alpha_0' + 1}$, larger than the plug-in Multinomial variance by the factor $(\alpha_0' + m)/(\alpha_0' + 1)$
- Exchangeability: any ordering of the $m$ future observations has the same probability
```python
import numpy as np
from scipy.special import gammaln


def dirichlet_multinomial_pmf(counts: np.ndarray, alpha: np.ndarray) -> float:
    """
    Dirichlet-Multinomial PMF: P(counts | alpha).

    This is the posterior predictive distribution for new counts
    given a Dirichlet prior/posterior.

    Args:
        counts: Array of K category counts
        alpha: Array of K Dirichlet concentration parameters

    Returns:
        Probability P(counts | alpha)
    """
    counts = np.asarray(counts)
    alpha = np.asarray(alpha)
    n = np.sum(counts)
    a0 = np.sum(alpha)

    # Log multinomial coefficient
    log_mult = gammaln(n + 1) - np.sum(gammaln(counts + 1))

    # Log ratio of Beta functions
    log_ratio = (gammaln(a0) - gammaln(a0 + n)
                 + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

    return np.exp(log_mult + log_ratio)


def predictive_next_category(alpha_posterior: np.ndarray) -> np.ndarray:
    """
    Probability distribution over next observation's category.

    P(x_new = k | data) = E[θ_k | data] = α_k / α_0

    This is much simpler than the general Dirichlet-Multinomial
    because we're predicting a single observation.
    """
    return alpha_posterior / np.sum(alpha_posterior)


def polya_urn_simulation(alpha: np.ndarray, n_draws: int) -> np.ndarray:
    """
    Simulate the Pólya urn process.

    Start with α_k balls of color k (conceptually).
    After each draw, add one ball of the drawn color.
    This generates samples from the Dirichlet-Multinomial.
    """
    K = len(alpha)
    current_alpha = alpha.copy()
    draws = np.zeros(n_draws, dtype=int)

    for i in range(n_draws):
        # Probability of each category
        p = current_alpha / current_alpha.sum()
        # Draw category
        draws[i] = np.random.choice(K, p=p)
        # Add one ball to drawn category
        current_alpha[draws[i]] += 1

    return draws


# Example: Word prediction in language model
# Suppose we have a vocabulary of 1000 words
# Prior: Symmetric Dirichlet with sparsity (most words are rare)
K = 1000
alpha_prior = np.ones(K) * 0.1  # Sparse prior

# Observed word counts from a document
# Most entries are 0 (words not seen), a few are > 0
np.random.seed(42)
observed_counts = np.zeros(K)
# Simulate 500 words, with only ~50 unique words appearing
for _ in range(500):
    word_id = int(np.random.exponential(20))  # Zipf-like
    if word_id < K:
        observed_counts[word_id] += 1

# Posterior
alpha_posterior = alpha_prior + observed_counts

# Predictive probability for next word
next_word_probs = predictive_next_category(alpha_posterior)

# Top 10 most likely words
top_10 = np.argsort(next_word_probs)[-10:][::-1]
print("Top 10 predicted words:")
for idx in top_10:
    print(f"  Word {idx}: {next_word_probs[idx]:.4f} "
          f"(observed {int(observed_counts[idx])} times)")

# Demonstrate overdispersion
# Compare variance of predictive vs plug-in Multinomial
n_future = 100
n_simulations = 10000

# Posterior predictive: sample θ then sample counts
theta_samples = np.random.dirichlet(alpha_posterior, size=n_simulations)
pp_samples = np.array([np.random.multinomial(n_future, theta)
                       for theta in theta_samples])
pp_variance = pp_samples[:, 0].var()

# Plug-in Multinomial: use posterior mean
plugin_p = alpha_posterior / alpha_posterior.sum()
plugin_samples = np.random.multinomial(n_future, plugin_p, size=n_simulations)
plugin_variance = plugin_samples[:, 0].var()

print(f"\nPredicting {n_future} future words for word 0:")
print(f"  Posterior predictive variance: {pp_variance:.2f}")
print(f"  Plug-in Multinomial variance: {plugin_variance:.2f}")
print(f"  Overdispersion ratio: {pp_variance / plugin_variance:.2f}x")
```

Dirichlet-Multinomial conjugacy is fundamental to many machine learning models. Let's examine key applications.
Application 1: Naive Bayes Text Classification
In multinomial Naive Bayes, each class $c$ has a word distribution $\boldsymbol{\theta}_c$:
$$P(\text{document} | c) = \text{Multinomial}(\mathbf{n} | \boldsymbol{\theta}_c)$$
Bayesian Training: Place a Dir$(\boldsymbol{\alpha})$ prior on each class's word distribution $\boldsymbol{\theta}_c$, accumulate the word counts $n_{c,w}$ from the training documents of class $c$, and use the posterior mean $\hat{\theta}_{c,w} = (\alpha_w + n_{c,w}) / (\alpha_0 + n_c)$ as the smoothed word probability.
Laplace Smoothing is Bayesian: The common practice of adding 1 to all word counts (Laplace smoothing) corresponds to using a Dir$(1, \ldots, 1)$ prior. More generally, adding $\alpha$ corresponds to using a symmetric Dir$(\alpha, \ldots, \alpha)$ prior.
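A toy sketch of this equivalence—the word counts below are made up; the check simply confirms that the Dir$(1, \ldots, 1)$ posterior mean coincides with add-one smoothing:

```python
import numpy as np

# word_counts[c, w] = number of times word w appears in training docs of class c
word_counts = np.array([[12, 0, 3, 5],    # class 0
                        [ 1, 7, 0, 2]])   # class 1
alpha = 1.0  # symmetric Dir(1, ..., 1) prior == "add-one" Laplace smoothing
V = word_counts.shape[1]

# Dirichlet posterior mean per class
posterior_mean = (word_counts + alpha) / (word_counts.sum(axis=1, keepdims=True) + alpha * V)

# Classical Laplace smoothing
laplace = (word_counts + 1) / (word_counts.sum(axis=1, keepdims=True) + V)

print(np.allclose(posterior_mean, laplace))  # True: identical estimates
print(posterior_mean)
```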
Application 2: Latent Dirichlet Allocation (LDA)
LDA is a hierarchical Dirichlet model for discovering topics in text:
$$\boldsymbol{\theta}_d \sim \text{Dir}(\boldsymbol{\alpha}) \quad \text{(topic distribution for document } d \text{)}$$
$$\boldsymbol{\phi}_k \sim \text{Dir}(\boldsymbol{\beta}) \quad \text{(word distribution for topic } k \text{)}$$
$$z_{d,n} \mid \boldsymbol{\theta}_d \sim \text{Categorical}(\boldsymbol{\theta}_d) \quad \text{(topic for word } n \text{ in doc } d \text{)}$$
$$w_{d,n} \mid z_{d,n}, \boldsymbol{\Phi} \sim \text{Categorical}(\boldsymbol{\phi}_{z_{d,n}}) \quad \text{(observed word)}$$
The Dirichlet-Multinomial conjugacy makes Gibbs sampling tractable—conditional posteriors for $\boldsymbol{\theta}_d$ and $\boldsymbol{\phi}_k$ are simply updated Dirichlets.
In LDA, setting $\alpha < 1$ encourages documents to focus on few topics (topic sparsity), while $\beta < 1$ encourages topics to focus on few words (word sparsity). These hyperparameters significantly affect discovered topics—lower values yield more interpretable, sparse topics; higher values yield more blended topics.
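A minimal sketch of LDA's generative process with toy sizes and assumed hyperparameters ($\alpha = 0.3$, $\beta = 0.1$ are illustrative), just to show where the two Dirichlets enter:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, D, doc_len = 50, 3, 5, 20       # vocab size, topics, documents, words per doc
alpha, beta = 0.3, 0.1                # sparse document-topic and topic-word priors

phi = rng.dirichlet(np.full(V, beta), size=T)       # topic-word distributions
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(T, alpha))      # document-topic distribution
    z = rng.choice(T, size=doc_len, p=theta_d)      # topic for each word slot
    words = [rng.choice(V, p=phi[t]) for t in z]    # observed words
    docs.append(words)

print(docs[0])
```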
Application 3: Mixture Model Priors
For a Gaussian mixture model with $K$ components, the mixture weights $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ form a probability vector:
$$\boldsymbol{\pi} \sim \text{Dir}(\boldsymbol{\alpha})$$
After observing cluster assignments $z_1, \ldots, z_n$ (or in E-step, expected assignments):
$$\boldsymbol{\pi} | z_{1:n} \sim \text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$$
where $n_k = \sum_i \mathbb{1}[z_i = k]$ is the count assigned to cluster $k$.
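A small sketch of this weight update with illustrative assignments:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])          # Dir prior on the K=3 mixture weights
z = np.array([0, 0, 1, 0, 2, 1, 0, 0])     # cluster assignments from E-step / sampler

counts = np.bincount(z, minlength=len(alpha))   # n_k = number assigned to cluster k
alpha_post = alpha + counts                      # Dir(alpha_1+n_1, ..., alpha_K+n_K)
print("posterior:", alpha_post, "mean:", alpha_post / alpha_post.sum())
```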
Application 4: Hierarchical Dirichlet Process (HDP)
For nonparametric topic models with unknown number of topics, HDP uses nested Dirichlet processes:
$$G_0 \sim \text{DP}(\gamma, H) \quad \text{(global measure over topics)}$$ $$G_d \sim \text{DP}(\alpha, G_0) \quad \text{(document-specific measure)}$$
The Dirichlet-Multinomial conjugacy extends to these infinite-dimensional settings, enabling tractable inference via Chinese Restaurant Franchise sampling.
When $K$ is large (vocabulary size in text, number of categories in high-cardinality features), computational efficiency becomes important.
Numerical Stability:
For large $K$ or extreme $\alpha$ values, direct computation of the Dirichlet density can overflow/underflow. Always work in log-space:
$$\log p(\boldsymbol{\theta} | \boldsymbol{\alpha}) = \log\Gamma(\alpha_0) - \sum_k \log\Gamma(\alpha_k) + \sum_k (\alpha_k - 1) \log\theta_k$$
Use gammaln (log-gamma) functions rather than gamma followed by log.
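A minimal log-space sketch (the helper `dirichlet_logpdf` is ours, not a library function), cross-checked against SciPy's implementation:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def dirichlet_logpdf(theta: np.ndarray, alpha: np.ndarray) -> float:
    """Log-density of Dir(alpha) at theta, computed entirely in log-space."""
    theta, alpha = np.asarray(theta), np.asarray(alpha)
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + np.sum((alpha - 1) * np.log(theta)))

alpha = np.full(1000, 0.01)                  # large K with extreme concentrations
theta = np.random.dirichlet(np.ones(1000))   # a point strictly inside the simplex
print(dirichlet_logpdf(theta, alpha))
print(stats.dirichlet.logpdf(theta, alpha))  # agrees with SciPy
```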
Sampling Efficiency:
Standard Dirichlet sampling via $K$ independent Gamma draws (normalize $G_k \sim \text{Gamma}(\alpha_k, 1)$) is $O(K)$, as sketched below; for sparse symmetric Dirichlets, where most $\alpha_k$ are identical and small, more specialized methods can reduce this cost.
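For reference, here is the Gamma-based sampler (a sketch; the helper name is ours, not a library function):

```python
import numpy as np

def sample_dirichlet_via_gamma(alpha: np.ndarray, size: int, rng=None) -> np.ndarray:
    """Draw `size` Dirichlet(alpha) vectors by normalizing Gamma(alpha_k, 1) draws."""
    rng = rng or np.random.default_rng()
    g = rng.gamma(shape=alpha, scale=1.0, size=(size, len(alpha)))  # G_k ~ Gamma(alpha_k, 1)
    return g / g.sum(axis=1, keepdims=True)                          # theta_k = G_k / sum_j G_j

alpha = np.array([0.1, 0.5, 2.0, 5.0])
print(sample_dirichlet_via_gamma(alpha, size=3))
```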
Storage for High-Dimensional Posteriors:
With vocabulary size $V = 100{,}000$ and $T = 100$ topics, the topic-word matrix alone has $T \times V = 10^7$ entries (roughly 80 MB in 64-bit floats), plus a $D \times T$ document-topic matrix that grows with the corpus—so in practice the sufficient statistics (counts) are typically kept in sparse data structures rather than dense posterior parameter arrays.
Collapsed Gibbs Sampling:
In models like LDA, instead of sampling $\boldsymbol{\theta}_d$ and $\boldsymbol{\phi}_k$, we can marginalize (integrate them out) thanks to conjugacy:
$$P(z_{d,n} = k \mid \mathbf{z}_{\neg(d,n)}, \mathbf{w}) \propto \frac{n_{d,k}^{\neg(d,n)} + \alpha_k}{n_d - 1 + \alpha_0} \cdot \frac{n_{k,w_{d,n}}^{\neg(d,n)} + \beta}{n_k^{\neg(d,n)} + V\beta}$$

where $n_{d,k}^{\neg(d,n)}$ counts tokens in document $d$ assigned to topic $k$, $n_{k,w}^{\neg(d,n)}$ counts assignments of word $w$ to topic $k$, and $n_k^{\neg(d,n)}$ is the total count for topic $k$—all excluding the current token $(d,n)$.
This collapsed sampler is more efficient (fewer variables) and mixes faster (integrating out parameters reduces variance).
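Below is a compact, illustrative sketch of this collapsed sampler on a toy corpus (the documents, hyperparameters, and variable names are all made up); the per-token factor $1/(n_d - 1 + \alpha_0)$ is constant across topics, so it is absorbed into the normalization:

```python
import numpy as np

rng = np.random.default_rng(1)
docs = [[0, 1, 1, 3], [2, 2, 3, 0], [1, 1, 0, 2]]   # word ids per document
V, T = 4, 2
alpha, beta = 0.5, 0.1

# Initialize topic assignments and count tables
z = [[rng.integers(T) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), T))          # topic counts per document
n_kw = np.zeros((T, V))                  # word counts per topic
n_k = np.zeros(T)                        # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1

for sweep in range(100):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the current token from all counts
            n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
            # Collapsed conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            t = rng.choice(T, p=p / p.sum())
            # Add the token back with its new topic
            z[d][i] = t
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1

print(n_kw)   # topic-word counts after sampling
```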
Products of many probabilities (common in multinomial and Dirichlet calculations) quickly underflow to zero. Always compute log-probabilities and use log-sum-exp tricks. Functions like scipy.special.logsumexp and scipy.special.gammaln are essential for numerical stability with Dirichlet-Multinomial models.
We have comprehensively explored Dirichlet-Multinomial conjugacy—the foundation for Bayesian categorical inference—covering the Dirichlet's geometry and moments, the count-based posterior update, concentration and sparsity control, the Pólya-urn predictive distribution, and applications from Naive Bayes to LDA.
What's Next:
Having mastered the three fundamental conjugate families—Beta-Binomial, Gaussian-Gaussian, and Dirichlet-Multinomial—we now turn to practical synthesis. The next page covers Practical Implications: how to choose priors, conduct sensitivity analysis, and deploy conjugate models in real-world systems.
You now command the Dirichlet-Multinomial conjugate family—essential for any categorical inference problem. You understand the distribution properties, posterior updates, sparsity control, predictive distributions, and applications spanning topic models to Naive Bayes. Next, we synthesize practical guidance for deploying conjugate priors in production systems.