The Beta-Binomial model is the most fundamental and widely used conjugate family in Bayesian statistics. Whenever we need to estimate a probability, proportion, or rate from binary data—success/failure, yes/no, click/no-click, convert/bounce—the Beta-Binomial model provides the canonical Bayesian solution.
Its ubiquity stems from a confluence of desirable properties: the conjugate update is closed-form (simple addition), the hyperparameters have a direct pseudo-count interpretation, the posterior predictive is available analytically, and the Beta family is flexible enough to encode a wide range of prior beliefs.
From A/B testing in tech companies to clinical trial analysis in medicine, from election forecasting in politics to quality control in manufacturing—the Beta-Binomial model underlies a vast array of practical applications. Mastering it provides the template for understanding all other conjugate systems.
By the end of this page, you will be able to derive the Beta-Binomial posterior from first principles, interpret Beta hyperparameters as pseudo-counts, compute posterior point estimates (mean, mode, MAP), construct credible intervals, calculate the posterior predictive distribution, and apply these tools to real-world inference problems like A/B testing.
The Beta distribution is a continuous probability distribution defined on the interval $[0, 1]$, making it the natural choice for modeling probabilities, proportions, and rates.
Definition: A random variable $\theta$ follows a Beta distribution with parameters $\alpha > 0$ and $\beta > 0$, written $\theta \sim \text{Beta}(\alpha, \beta)$, if its probability density function is:
$$p(\theta | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
where $\Gamma(\cdot)$ is the Gamma function, the continuous extension of the factorial: $\Gamma(n) = (n-1)!$ for positive integers.
The normalizing constant $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is called the Beta function.
For Bayesian calculations, we often focus on the kernel—the part of the density that depends on the parameter of interest: $p(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$. The normalizing constant is determined by the requirement that the density integrates to 1. This simplifies proofs: we need only show that the posterior kernel matches the Beta kernel.
Key Properties of the Beta Distribution:
Mean (Expected Value): $$\mathbb{E}[\theta] = \frac{\alpha}{\alpha + \beta}$$
This is an intuitive weighted average: $\alpha$ "pulls" toward 1, $\beta$ "pulls" toward 0, and their relative magnitudes determine the balance.
Mode (Most Probable Value): $$\text{Mode}[\theta] = \frac{\alpha - 1}{\alpha + \beta - 2} \quad \text{(for } \alpha, \beta > 1\text{)}$$
The mode exists only when both parameters exceed 1; otherwise the distribution is U-shaped or monotonic.
Variance: $$\text{Var}[\theta] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$
The variance decreases as $\alpha + \beta$ (the effective sample size) increases, reflecting greater certainty.
Concentration Parameter: The sum $\kappa = \alpha + \beta$ controls how concentrated the distribution is around its mean. Larger $\kappa$ means lower variance and more peaked distributions.
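These summary formulas are easy to sanity-check numerically. A minimal sketch (the Beta(5, 2) choice is just illustrative), cross-checked against `scipy.stats.beta`:

```python
import numpy as np
from scipy import stats

def beta_summary(alpha: float, beta: float) -> dict:
    """Closed-form mean, mode, and variance of Beta(alpha, beta)."""
    kappa = alpha + beta   # concentration parameter
    return {
        'mean': alpha / kappa,
        'variance': alpha * beta / (kappa**2 * (kappa + 1)),
        'mode': (alpha - 1) / (kappa - 2) if alpha > 1 and beta > 1 else None,
    }

s = beta_summary(5, 2)
print(s)
# Cross-check mean and variance against scipy's implementation
print(stats.beta.mean(5, 2), stats.beta.var(5, 2))
```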
| Parameters | Shape | Mean | Interpretation |
|---|---|---|---|
| Beta(1, 1) | Uniform | 0.5 | Complete ignorance; all probabilities equally likely |
| Beta(0.5, 0.5) | U-shaped | 0.5 | Jeffreys prior; non-informative, favors extremes |
| Beta(2, 2) | Symmetric, unimodal | 0.5 | Weak belief centered at 0.5 |
| Beta(5, 5) | Symmetric, peaked | 0.5 | Moderate belief centered at 0.5 |
| Beta(20, 20) | Highly concentrated | 0.5 | Strong belief centered at 0.5 |
| Beta(10, 2) | Right-skewed | 0.833 | Belief that probability is high (~80%) |
| Beta(2, 10) | Left-skewed | 0.167 | Belief that probability is low (~17%) |
| Beta(100, 1) | Highly concentrated at 1 | 0.99 | Strong belief probability is nearly 1 |
Visual Intuition for Beta Shapes:
The Beta distribution's shape is entirely determined by $\alpha$ and $\beta$: when $\alpha > \beta$ the mass shifts toward 1, when $\beta > \alpha$ it shifts toward 0, equal parameters give symmetry about 0.5, values below 1 push mass toward the endpoints, and larger $\alpha + \beta$ concentrates the density. This flexibility allows Beta priors to encode an enormous variety of prior beliefs about probability values.
The Binomial distribution models the number of successes in a sequence of independent binary trials. It is the natural likelihood when data consists of counts of positive outcomes out of total trials.
Definition: Let $k$ be the number of successes in $n$ independent trials, each with success probability $\theta$. Then $k$ follows a Binomial distribution:
$$P(k | n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}$$
where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient.
As a function of $\theta$ (the likelihood function): $$\mathcal{L}(\theta | k, n) = \binom{n}{k} \theta^k (1 - \theta)^{n-k} \propto \theta^k (1 - \theta)^{n-k}$$
The binomial coefficient $\binom{n}{k}$ is constant with respect to $\theta$, so for inference purposes, we work with the likelihood kernel: $\mathcal{L}(\theta) \propto \theta^k (1 - \theta)^{n-k}$.
The Bernoulli distribution is Binomial($n=1, \theta$). For a sequence of $n$ independent Bernoulli trials, the likelihood factors into the product $\prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i} (1-\theta)^{n - \sum x_i}$, which reduces to the Binomial form with $k = \sum x_i$. The sum of successes is the sufficient statistic.
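Because the sum of successes is sufficient, updating one Bernoulli observation at a time must land on the same posterior as a single batch Binomial update. A minimal check (the Beta(2, 2) prior and the 1, 0, 1, 1, 0 sequence are arbitrary):

```python
def bernoulli_update(alpha, beta, success):
    """One Bernoulli observation: add 1 to alpha on success, else to beta."""
    return (alpha + 1, beta) if success else (alpha, beta + 1)

# Stream the sequence 1, 0, 1, 1, 0 one trial at a time
alpha, beta = 2.0, 2.0   # illustrative Beta(2, 2) prior
for x in [1, 0, 1, 1, 0]:
    alpha, beta = bernoulli_update(alpha, beta, x)

# Batch update via the sufficient statistic k = 3, n = 5 gives the same posterior
assert (alpha, beta) == (2.0 + 3, 2.0 + (5 - 3))   # Beta(5, 4) either way
print(alpha, beta)
```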
Why Binomial Likelihood and Beta Prior are Conjugate:
The key observation is that the Binomial likelihood kernel has the same functional form as the Beta density kernel. Compare:

$$\text{Beta prior kernel: } \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \qquad \text{Binomial likelihood kernel: } \theta^k (1 - \theta)^{n-k}$$

Both are products of powers of $\theta$ and $(1 - \theta)$. When we multiply them (as Bayes' theorem requires), the result is another product of powers:
$$\text{Posterior kernel} = \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \cdot \theta^k (1 - \theta)^{n-k} = \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1}$$
This is exactly a Beta kernel with updated parameters. The posterior is $\text{Beta}(\alpha + k, \beta + n - k)$.
Let us derive the Beta-Binomial posterior carefully, establishing the reasoning pattern used throughout conjugate analysis.
Setup: the prior is $\theta \sim \text{Beta}(\alpha, \beta)$, and the data are $k$ successes in $n$ independent trials, each with success probability $\theta$.
Step 1: Write Prior Density $$P(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
Step 2: Write Likelihood $$P(k | \theta, n) = \binom{n}{k} \theta^k (1 - \theta)^{n - k}$$
Step 3: Apply Bayes' Theorem $$P(\theta | k, n) = \frac{P(k | \theta, n) \cdot P(\theta)}{P(k | n)}$$
Step 4: Focus on the Kernel Since $P(k | n)$ is independent of $\theta$, we work with the numerator: $$P(\theta | k, n) \propto P(k | \theta, n) \cdot P(\theta)$$
$$\propto \theta^k (1 - \theta)^{n-k} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
$$= \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1}$$
Step 5: Recognize the Posterior Family This kernel has the form $\theta^{a-1}(1-\theta)^{b-1}$ with $a = \alpha + k$ and $b = \beta + n - k$. This is the Beta$(a, b)$ kernel.
Step 6: Conclude $$\theta | k, n \sim \text{Beta}(\alpha + k, \beta + n - k)$$
The posterior is a Beta distribution with first parameter $\alpha + k$ (prior pseudo-successes plus observed successes) and second parameter $\beta + n - k$ (prior pseudo-failures plus observed failures).
Prior: Beta($\alpha$, $\beta$) + Data: $k$ successes, $n-k$ failures → Posterior: Beta($\alpha + k$, $\beta + n - k$). This simple additive update is the computational core of Beta-Binomial inference. No integrals. No sampling. Just addition.
The Marginal Likelihood (Evidence):
For completeness, the marginal likelihood $P(k | n)$ that we factored out can be computed:
$$P(k | n) = \int_0^1 P(k | \theta, n) P(\theta) d\theta = \binom{n}{k} \cdot \frac{B(\alpha + k, \beta + n - k)}{B(\alpha, \beta)}$$
This quantity is useful for model comparison via Bayes factors, for prior predictive checks, and as the normalizing constant behind the Beta-Binomial predictive distribution derived below.
The fact that this integral has a closed form is another manifestation of conjugacy's computational elegance.
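As a sketch of that elegance, the closed form can be checked against brute-force numerical integration (the values $k = 7$, $n = 20$ and the Beta(2, 2) prior are arbitrary choices):

```python
import numpy as np
from scipy import stats
from scipy.special import comb, betaln
from scipy.integrate import quad

def log_marginal_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """log P(k | n) under a Beta(alpha, beta) prior (betaln avoids overflow)."""
    return np.log(comb(n, k)) + betaln(alpha + k, beta + n - k) - betaln(alpha, beta)

# Cross-check the closed form against brute-force integration of
# the integral of P(k | theta, n) * p(theta) over [0, 1]
k, n, a, b = 7, 20, 2.0, 2.0
closed_form = np.exp(log_marginal_likelihood(k, n, a, b))
numeric, _ = quad(lambda t: stats.binom.pmf(k, n, t) * stats.beta.pdf(t, a, b), 0, 1)
print(closed_form, numeric)
```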
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def beta_binomial_posterior(
    alpha_prior: float,
    beta_prior: float,
    successes: int,
    trials: int
) -> tuple:
    """
    Compute Beta-Binomial posterior parameters.

    The posterior update is trivially simple:
    - alpha_posterior = alpha_prior + successes
    - beta_posterior = beta_prior + failures

    Returns (alpha_posterior, beta_posterior)
    """
    alpha_posterior = alpha_prior + successes
    beta_posterior = beta_prior + (trials - successes)
    return alpha_posterior, beta_posterior

def visualize_update(alpha_prior: float, beta_prior: float,
                     successes: int, trials: int):
    """Visualize how data updates prior to posterior."""
    alpha_post, beta_post = beta_binomial_posterior(
        alpha_prior, beta_prior, successes, trials
    )

    theta = np.linspace(0, 1, 1000)
    prior_pdf = stats.beta.pdf(theta, alpha_prior, beta_prior)
    posterior_pdf = stats.beta.pdf(theta, alpha_post, beta_post)

    # Normalized likelihood for visualization
    likelihood = theta**successes * (1 - theta)**(trials - successes)
    likelihood = likelihood / likelihood.max() * posterior_pdf.max()

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(theta, prior_pdf, 'b--', lw=2,
            label=f'Prior: Beta({alpha_prior}, {beta_prior})')
    ax.plot(theta, likelihood, 'g:', lw=2,
            label=f'Likelihood: {successes}/{trials} successes')
    ax.plot(theta, posterior_pdf, 'r-', lw=2,
            label=f'Posterior: Beta({alpha_post}, {beta_post})')
    ax.fill_between(theta, posterior_pdf, alpha=0.2, color='red')
    ax.set_xlabel('θ (Success Probability)', fontsize=12)
    ax.set_ylabel('Density', fontsize=12)
    ax.set_title('Beta-Binomial Posterior Update', fontsize=14)
    ax.legend(fontsize=10)
    ax.set_xlim(0, 1)
    ax.grid(alpha=0.3)
    return fig

# Example: A/B Testing Scenario
# Prior: Weak belief that conversion is around 10%
alpha_prior, beta_prior = 2, 18  # Mean = 2/20 = 0.10

# Data: Observed 35 conversions in 300 visitors
successes, trials = 35, 300

alpha_post, beta_post = beta_binomial_posterior(
    alpha_prior, beta_prior, successes, trials
)

print(f"Prior: Beta({alpha_prior}, {beta_prior})")
print(f"  Prior mean: {alpha_prior / (alpha_prior + beta_prior):.4f}")
print(f"  Prior std: {np.sqrt(alpha_prior*beta_prior/((alpha_prior+beta_prior)**2*(alpha_prior+beta_prior+1))):.4f}")

print(f"Data: {successes} successes in {trials} trials")
print(f"  MLE: {successes/trials:.4f}")

print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f"  Posterior mean: {alpha_post / (alpha_post + beta_post):.4f}")
print(f"  Posterior std: {np.sqrt(alpha_post*beta_post/((alpha_post+beta_post)**2*(alpha_post+beta_post+1))):.4f}")

# 95% Credible Interval
ci_lower = stats.beta.ppf(0.025, alpha_post, beta_post)
ci_upper = stats.beta.ppf(0.975, alpha_post, beta_post)
print(f"  95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
```

While the full posterior distribution is the complete Bayesian answer, we often need to summarize it with point estimates and intervals for decision-making and reporting.
Posterior Mean: $$\mathbb{E}[\theta | k, n] = \frac{\alpha + k}{\alpha + \beta + n}$$
The posterior mean is a weighted average of the prior mean and the maximum likelihood estimate (MLE):
$$\mathbb{E}[\theta | k, n] = \underbrace{\frac{\alpha + \beta}{\alpha + \beta + n}}_{\text{prior weight}} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}} + \underbrace{\frac{n}{\alpha + \beta + n}}_{\text{data weight}} \cdot \underbrace{\frac{k}{n}}_{\text{MLE}}$$
This formula reveals the Bayesian compromise: the posterior blends prior belief and observed evidence, with weights proportional to their respective "sample sizes" ($\alpha + \beta$ for the prior, $n$ for the data).
The posterior mean 'shrinks' the MLE toward the prior mean. With little data, the posterior stays close to the prior. With abundant data, the posterior approaches the MLE. This shrinkage regularizes extreme sample proportions, preventing overconfidence from small samples—a key advantage of Bayesian inference.
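A small sketch of this shrinkage, reusing the illustrative Beta(2, 18) prior from earlier (prior mean 0.10) against a fixed 20% empirical rate:

```python
def posterior_mean_decomposition(alpha, beta, k, n):
    """Posterior mean as a weighted average of prior mean and MLE."""
    prior_weight = (alpha + beta) / (alpha + beta + n)
    data_weight = n / (alpha + beta + n)
    blended = prior_weight * (alpha / (alpha + beta)) + data_weight * (k / n)
    assert abs(blended - (alpha + k) / (alpha + beta + n)) < 1e-12  # same number
    return prior_weight, blended

# Beta(2, 18) prior (mean 0.10) against a fixed 20% empirical rate
for n in [10, 100, 1000]:
    w, m = posterior_mean_decomposition(2, 18, k=n // 5, n=n)
    print(f"n={n:5d}  prior weight={w:.3f}  posterior mean={m:.4f}")
# As n grows, the prior weight fades and the mean approaches the MLE 0.20
```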
Posterior Mode (MAP Estimate): $$\text{Mode}[\theta | k, n] = \frac{\alpha + k - 1}{\alpha + \beta + n - 2} \quad \text{(for } \alpha + k > 1, \beta + n - k > 1 \text{)}$$
The Maximum A Posteriori (MAP) estimate is the most probable value under the posterior. It differs from the posterior mean when the distribution is asymmetric.
Posterior Median: The median satisfies $P(\theta \leq \tilde{\theta} | k, n) = 0.5$. It equals the 0.5 quantile of Beta($\alpha + k, \beta + n - k$). While there's no closed-form expression, it's easily computed numerically.
Credible Intervals: Bayesian credible intervals directly quantify parameter uncertainty. Writing the credible level as $1 - \gamma$ (to avoid a clash with the Beta parameter $\alpha$), the $100(1-\gamma)$% credible interval $[\theta_L, \theta_U]$ satisfies:

$$P(\theta_L \leq \theta \leq \theta_U | k, n) = 1 - \gamma$$
Two common choices: the equal-tailed interval, which cuts equal probability from each tail using posterior quantiles, and the highest posterior density (HPD) interval, the shortest interval containing the required posterior probability.
| Estimate | Formula | Properties | When to Use |
|---|---|---|---|
| Posterior Mean | $\frac{\alpha + k}{\alpha + \beta + n}$ | Minimizes squared error; smooth function of data | Default choice; risk-averse decisions |
| Posterior Mode (MAP) | $\frac{\alpha + k - 1}{\alpha + \beta + n - 2}$ | Most probable value; equals MLE with uniform prior | Point predictions; when mode is meaningful |
| Posterior Median | 0.5 quantile of posterior | Robust to asymmetry; minimizes absolute error | Skewed posteriors; robust estimation |
| MLE | $k/n$ | No prior influence; can be 0 or 1 | Large samples; when prior is controversial |
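A quick sketch computing all four estimates for a skewed small-sample posterior (the 2-successes-in-10-trials data is invented for illustration):

```python
from scipy import stats

def point_estimates(alpha_prior: float, beta_prior: float, k: int, n: int) -> dict:
    """Posterior mean, MAP, median, and MLE for Beta-Binomial inference."""
    a, b = alpha_prior + k, beta_prior + n - k
    return {
        'mean': a / (a + b),
        'map': (a - 1) / (a + b - 2) if a > 1 and b > 1 else None,
        'median': stats.beta.ppf(0.5, a, b),   # no closed form; use the quantile
        'mle': k / n,
    }

# Skewed small-sample case: uniform prior, 2 successes in 10 trials
est = point_estimates(1, 1, k=2, n=10)
for name, value in est.items():
    print(f"{name:>6}: {value:.4f}")
# Note: with a uniform prior the MAP equals the MLE, and for this
# right-skewed posterior the ordering is MAP < median < mean.
```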
The Role of Sample Size:
As $n \to \infty$, the posterior concentrates around the true parameter: the posterior mean converges to the MLE $k/n$ (and hence to the true $\theta$), and the posterior variance shrinks at rate $O(1/n)$.
This is Bayesian consistency: with sufficient data, the posterior converges to the truth regardless of the prior (assuming the true parameter is in the prior's support). The prior matters most with small samples—precisely when subjective knowledge is most valuable.
Practical Rule of Thumb: The prior's influence is roughly equivalent to $\alpha + \beta - 2$ observations. If $n \gg \alpha + \beta$, the prior barely matters. If $n \ll \alpha + \beta$, the prior dominates.
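A short simulation sketch of this prior washout, assuming a true rate of 0.3: two very different priors converge to nearly the same posterior mean as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3  # assumed truth for the simulation

results = {}
for n in [10, 100, 10000]:
    k = rng.binomial(n, theta_true)
    for a0, b0 in [(20, 2), (1, 1)]:  # skeptical vs uniform prior
        mean = (a0 + k) / (a0 + b0 + n)
        results[(n, a0, b0)] = mean
        print(f"n={n:6d}  prior=Beta({a0:2d},{b0})  posterior mean={mean:.3f}")
# Both priors are washed out by n = 10000; at n = 10 they disagree sharply.
```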
A crucial Bayesian quantity is the posterior predictive distribution—the probability distribution of future observations, accounting for parameter uncertainty. Instead of plugging in a point estimate for $\theta$, we average predictions over all plausible $\theta$ values, weighted by the posterior.
Definition: Given observed data $D = (k, n)$ and a future trial, the posterior predictive probability of success is:
$$P(\tilde{x} = 1 | D) = \int_0^1 P(\tilde{x} = 1 | \theta) P(\theta | D) d\theta = \int_0^1 \theta \cdot \text{Beta}(\theta | \alpha + k, \beta + n - k) d\theta$$
$$= \mathbb{E}[\theta | D] = \frac{\alpha + k}{\alpha + \beta + n}$$
The probability of the next trial being a success equals the posterior mean of $\theta$.
If we used the MLE $\hat{\theta} = k/n$ for prediction, we'd underestimate uncertainty. With $k=0$ (no successes), the MLE predicts zero probability of future success—clearly overconfident. The posterior predictive, using Beta(α, β+n), correctly maintains positive probability for success. Bayesian prediction automatically incorporates parameter uncertainty.
The Beta-Binomial Distribution:
For predictions about multiple future trials, the posterior predictive distribution has a name: the Beta-Binomial distribution.
If $\tilde{k}$ is the number of successes in $m$ future trials, given posterior Beta($\alpha + k, \beta + n - k$):
$$P(\tilde{k} | k, n, m) = \binom{m}{\tilde{k}} \frac{B(\alpha + k + \tilde{k}, \beta + n - k + m - \tilde{k})}{B(\alpha + k, \beta + n - k)}$$
This distribution accounts for both sampling variability (the Binomial randomness of the future trials) and parameter uncertainty (the posterior spread in $\theta$).
Properties of the Beta-Binomial:
Mean: $$\mathbb{E}[\tilde{k}] = m \cdot \frac{\alpha + k}{\alpha + \beta + n}$$
Variance: $$\text{Var}[\tilde{k}] = m \cdot \frac{(\alpha + k)(\beta + n - k)}{(\alpha + \beta + n)^2} \cdot \frac{\alpha + \beta + n + m}{\alpha + \beta + n + 1}$$
The variance is larger than the Binomial variance $mp(1-p)$ because it includes parameter uncertainty. The extra factor $\frac{\alpha + \beta + n + m}{\alpha + \beta + n + 1}$ captures this additional variation.
```python
import numpy as np
from scipy import stats
from scipy.special import beta as beta_func
from scipy.special import comb

def beta_binomial_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """
    Beta-Binomial PMF: probability of k successes in n trials given
    Beta(alpha, beta) prior/posterior on success probability.

    This is the posterior predictive distribution for future observations.
    """
    numerator = beta_func(alpha + k, beta + n - k)
    denominator = beta_func(alpha, beta)
    return comb(n, k, exact=True) * numerator / denominator

def posterior_predictive_comparison(
    alpha_prior: float, beta_prior: float,
    observed_successes: int, observed_trials: int,
    future_trials: int
):
    """
    Compare posterior predictive (accounting for parameter uncertainty)
    vs plug-in prediction (using a point estimate).
    """
    # Posterior parameters
    alpha_post = alpha_prior + observed_successes
    beta_post = beta_prior + observed_trials - observed_successes

    # Posterior mean (success probability)
    theta_mean = alpha_post / (alpha_post + beta_post)

    # MLE
    theta_mle = observed_successes / observed_trials if observed_trials > 0 else 0.5

    k_values = np.arange(future_trials + 1)

    # Posterior predictive (Beta-Binomial)
    pp_probs = np.array([
        beta_binomial_pmf(k, future_trials, alpha_post, beta_post)
        for k in k_values
    ])

    # Plug-in Binomial using posterior mean
    plugin_probs = stats.binom.pmf(k_values, future_trials, theta_mean)

    # Using MLE directly
    mle_probs = stats.binom.pmf(k_values, future_trials, theta_mle)

    # Compare variances
    pp_mean = future_trials * theta_mean
    pp_var = np.sum((k_values - pp_mean)**2 * pp_probs)
    plugin_var = future_trials * theta_mean * (1 - theta_mean)

    print(f"Observed: {observed_successes}/{observed_trials} successes")
    print(f"Posterior: Beta({alpha_post}, {beta_post})")
    print(f"Posterior mean θ: {theta_mean:.4f}")
    print(f"MLE θ: {theta_mle:.4f}")
    print(f"Predicting {future_trials} future trials:")
    print(f"  Expected successes: {pp_mean:.2f}")
    print(f"  Posterior predictive variance: {pp_var:.2f}")
    print(f"  Plug-in variance (ignoring parameter uncertainty): {plugin_var:.2f}")
    print(f"  Variance ratio: {pp_var / plugin_var:.2f}x")

    return k_values, pp_probs, plugin_probs, mle_probs

# Example: Small sample prediction
# Very few observations - parameter uncertainty is high
k_vals, pp, plugin, mle = posterior_predictive_comparison(
    alpha_prior=1, beta_prior=1,              # Uniform prior
    observed_successes=3, observed_trials=5,  # 60% success rate
    future_trials=100                         # Predicting many future trials
)

# With small observed sample, posterior predictive shows much more uncertainty
# than plug-in prediction that ignores parameter uncertainty
```

The Beta-Binomial model underpins countless real-world inference systems. Let's examine several important applications in depth.
Application 1: A/B Testing
A/B testing compares two variants (A and B) to determine which performs better. The Beta-Binomial model provides a fully Bayesian framework.
Setup: variant A receives $n_A$ visitors with $k_A$ conversions; variant B receives $n_B$ visitors with $k_B$ conversions. Each variant gets a Beta($\alpha$, $\beta$) prior on its conversion rate.

Posteriors: $\theta_A \sim \text{Beta}(\alpha + k_A, \beta + n_A - k_A)$ and $\theta_B \sim \text{Beta}(\alpha + k_B, \beta + n_B - k_B)$.
Key Quantities:
Probability B beats A: $P(\theta_B > \theta_A | \text{data})$
Expected lift: $\mathbb{E}[\theta_B - \theta_A | \text{data}] = \mathbb{E}[\theta_B] - \mathbb{E}[\theta_A]$
Credible interval for difference: Distribution of $\theta_B - \theta_A$
```python
import numpy as np
from scipy import stats

class BayesianABTest:
    """
    Complete Bayesian A/B testing using Beta-Binomial conjugacy.

    Advantages over frequentist tests:
    1. Direct probability statements: "P(B > A) = 0.95"
    2. Valid at any sample size (no "peeking" problems)
    3. Incorporates prior knowledge naturally
    4. Decision-theoretic framework for stopping rules
    """

    def __init__(self, alpha_prior: float = 1.0, beta_prior: float = 1.0):
        """
        Initialize with shared prior (can be customized per variant).
        Default: Uniform prior Beta(1,1) = maximum entropy.
        """
        self.variants = {}
        self.alpha_prior = alpha_prior
        self.beta_prior = beta_prior

    def add_variant(self, name: str, successes: int = 0, trials: int = 0):
        """Add or update a variant with observed data."""
        self.variants[name] = {
            'successes': successes,
            'trials': trials,
            'alpha': self.alpha_prior + successes,
            'beta': self.beta_prior + trials - successes
        }

    def update(self, name: str, success: bool):
        """Update variant with single observation (for streaming)."""
        v = self.variants[name]
        v['trials'] += 1
        if success:
            v['successes'] += 1
            v['alpha'] += 1
        else:
            v['beta'] += 1

    def posterior_mean(self, name: str) -> float:
        """Expected conversion rate for variant."""
        v = self.variants[name]
        return v['alpha'] / (v['alpha'] + v['beta'])

    def credible_interval(self, name: str, confidence: float = 0.95) -> tuple:
        """Equal-tailed credible interval for conversion rate."""
        v = self.variants[name]
        tail = (1 - confidence) / 2
        lower = stats.beta.ppf(tail, v['alpha'], v['beta'])
        upper = stats.beta.ppf(1 - tail, v['alpha'], v['beta'])
        return (lower, upper)

    def prob_best(self, name: str, n_samples: int = 100000) -> float:
        """
        Probability that variant 'name' is the best among all variants.
        Uses Monte Carlo sampling for multiple variant comparison.
        """
        samples = {}
        for vname, v in self.variants.items():
            samples[vname] = stats.beta.rvs(v['alpha'], v['beta'],
                                            size=n_samples)
        # Count how often this variant has the highest sample
        is_best = np.ones(n_samples, dtype=bool)
        for other_name in self.variants:
            if other_name != name:
                is_best &= (samples[name] > samples[other_name])
        return np.mean(is_best)

    def prob_beats(self, name_a: str, name_b: str,
                   n_samples: int = 100000) -> float:
        """P(θ_a > θ_b | data) by Monte Carlo."""
        va, vb = self.variants[name_a], self.variants[name_b]
        samples_a = stats.beta.rvs(va['alpha'], va['beta'], size=n_samples)
        samples_b = stats.beta.rvs(vb['alpha'], vb['beta'], size=n_samples)
        return np.mean(samples_a > samples_b)

    def expected_loss(self, choose: str, best: str,
                      n_samples: int = 100000) -> float:
        """
        Expected loss from choosing 'choose' if 'best' is actually best.
        Loss = max(0, θ_best - θ_choose) for each sample.
        """
        v_choose = self.variants[choose]
        v_best = self.variants[best]
        samples_choose = stats.beta.rvs(v_choose['alpha'], v_choose['beta'],
                                        size=n_samples)
        samples_best = stats.beta.rvs(v_best['alpha'], v_best['beta'],
                                      size=n_samples)
        return np.mean(np.maximum(0, samples_best - samples_choose))

    def summary(self):
        """Print comprehensive test summary."""
        print("=" * 60)
        print("BAYESIAN A/B TEST SUMMARY")
        print("=" * 60)
        for name, v in self.variants.items():
            mean = self.posterior_mean(name)
            ci = self.credible_interval(name)
            pb = self.prob_best(name)
            print(f"{name}:")
            print(f"  Data: {v['successes']}/{v['trials']} "
                  f"({100*v['successes']/max(1, v['trials']):.1f}%)")
            print(f"  Posterior: Beta({v['alpha']}, {v['beta']})")
            print(f"  Mean: {mean:.4f}")
            print(f"  95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
            print(f"  P(best): {pb:.4f}")

# Example usage
test = BayesianABTest(alpha_prior=1, beta_prior=1)
test.add_variant('Control', successes=120, trials=1000)    # 12%
test.add_variant('Treatment', successes=145, trials=1000)  # 14.5%

test.summary()
print(f"P(Treatment > Control) = {test.prob_beats('Treatment', 'Control'):.4f}")
```

Application 2: Thompson Sampling for Multi-Armed Bandits
Thompson Sampling is a Bayesian algorithm for the exploration-exploitation tradeoff in multi-armed bandit problems. With Beta-Binomial models, it's remarkably simple:
Algorithm: maintain a Beta posterior for each arm. On each round: (1) draw one sample $\tilde{\theta}_i$ from each arm's current Beta posterior; (2) play the arm with the largest sampled value; (3) observe the binary reward and update that arm's posterior (add 1 to $\alpha$ on success, to $\beta$ on failure).
This simple algorithm achieves asymptotically optimal regret for Bernoulli bandits and naturally balances exploration (uncertain arms produce high-variance samples that occasionally win the draw) and exploitation (arms with high posterior means win most draws).
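The steps above can be sketched in a few lines; the per-arm success rates here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
true_rates = [0.05, 0.10, 0.20]    # hidden per-arm success probabilities (invented)
n_arms = len(true_rates)
alphas = np.ones(n_arms)           # Beta(1, 1) prior on every arm
betas = np.ones(n_arms)
pulls = np.zeros(n_arms, dtype=int)

for _ in range(5000):
    # 1. Draw one theta per arm from its current Beta posterior
    theta_samples = rng.beta(alphas, betas)
    # 2. Play the arm with the largest sampled theta
    arm = int(np.argmax(theta_samples))
    # 3. Observe the reward and apply the conjugate update
    if rng.random() < true_rates[arm]:
        alphas[arm] += 1
    else:
        betas[arm] += 1
    pulls[arm] += 1

print("pulls per arm:", pulls)  # the best arm should accumulate most pulls
```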
Application 3: Clinical Trial Analysis
In clinical trials, binary outcomes (cure/no cure, adverse event/no adverse event) are common. Bayesian analysis provides direct probability statements about efficacy, principled incorporation of historical data through the prior, and valid interim monitoring without multiplicity corrections.
Major tech companies (Netflix, Google, Microsoft) have adopted Bayesian A/B testing because: (1) Results are interpretable—'95% probability that B is better' not 'p < 0.05'. (2) No peeking penalty—can check results anytime. (3) Prior incorporation handles low-traffic tests. (4) Expected loss provides principled stopping rules. (5) Multi-variant tests are natural extensions.
While Beta-Binomial inference is generally well-behaved, practitioners should be aware of edge cases and potential issues.
Edge Case 1: Zero Observations ($k = 0$ or $k = n$)
When all trials are successes or all are failures, the MLE is exactly 1 or 0—an estimate that assigns zero probability to ever seeing the other outcome. The Bayesian posterior avoids this boundary behavior.

With prior Beta(1, 1) and 0 successes in 10 trials, the posterior is Beta(1, 11), with mean $1/12 \approx 0.083$.

The Bayesian estimate is non-zero, reflecting that we shouldn't be certain the rate is exactly zero after just 10 trials.
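The arithmetic of this example in code:

```python
from scipy import stats

# 0 successes in 10 trials with a uniform Beta(1, 1) prior
alpha_post, beta_post = 1 + 0, 1 + 10   # posterior: Beta(1, 11)
bayes_mean = alpha_post / (alpha_post + beta_post)

print(f"MLE:            {0 / 10:.4f}")          # exactly 0
print(f"Posterior mean: {bayes_mean:.4f}")      # 1/12 ≈ 0.0833
print(f"P(θ > 0.2 | data): {stats.beta.sf(0.2, alpha_post, beta_post):.4f}")
```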
Edge Case 2: Non-Informative Priors ($\alpha \leq 1$ or $\beta \leq 1$)

The Jeffreys prior Beta(0.5, 0.5) is proper (it integrates to 1, though its density is unbounded at 0 and 1), while the Haldane prior Beta(0, 0) is improper—it doesn't integrate to 1. Even so, the Haldane posterior Beta($k$, $n - k$) is proper whenever $0 < k < n$, so inference remains valid except when the data are all successes or all failures.
Edge Case 3: Very Strong Priors
If $\alpha + \beta \gg n$, the prior dominates. This can be problematic if the prior was poorly calibrated, based on data from a different population, or chosen for convenience rather than genuine prior knowledge—the data will be unable to correct it.
Solution: Always check the prior predictive distribution. If the prior predicts data very different from what's observed, reconsider the prior.
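A sketch of such a prior predictive check (the Beta(2, 98) prior and the 150/300 observation are invented to illustrate a prior-data conflict):

```python
import numpy as np

rng = np.random.default_rng(1)

def prior_predictive_samples(alpha, beta, n, size=10_000):
    """Draw k ~ Binomial(n, theta) with theta ~ Beta(alpha, beta): the prior predictive."""
    thetas = rng.beta(alpha, beta, size=size)
    return rng.binomial(n, thetas)

# Strong prior saying theta ≈ 2%, versus invented data showing 150/300 = 50%
observed_k, n = 150, 300
ks = prior_predictive_samples(2, 98, n)
lo, hi = np.quantile(ks, [0.005, 0.995])
print(f"prior predictive 99% range: [{lo:.0f}, {hi:.0f}]  observed: {observed_k}")
# Observed count far outside the range -> the prior conflicts with the data
```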
Overdispersion and Beta-Binomial as Likelihood:
Interestingly, the Beta-Binomial distribution (our posterior predictive) can also serve as a likelihood for overdispersed count data:
If counts show more variance than Binomial (due to hidden heterogeneity across trials), the Beta-Binomial likelihood models this: $$k \sim \text{BetaBinomial}(n, \alpha, \beta)$$
Here $\alpha$ and $\beta$ parameterize the heterogeneity distribution of success probabilities across trials, not prior uncertainty. This is a subtle but important distinction in usage.
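A simulation sketch of this distinction: drawing a fresh $\theta$ per batch (heterogeneity) inflates the count variance well beyond the fixed-$\theta$ Binomial. The parameter choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, a, b = 50, 2.0, 3.0   # illustrative heterogeneity parameters

# Heterogeneous batches: each draws its own theta from Beta(a, b)
thetas = rng.beta(a, b, size=200_000)
k_overdispersed = rng.binomial(n, thetas)

# Homogeneous batches: a single fixed theta at the Beta mean
p = a / (a + b)
k_binomial = rng.binomial(n, p, size=200_000)

print(f"Binomial variance:       {k_binomial.var():.1f}")
print(f"Beta-Binomial variance:  {k_overdispersed.var():.1f}")
# Closed form: n p (1-p) (a + b + n) / (a + b + 1)
print(f"theoretical BB variance: {n * p * (1 - p) * (a + b + n) / (a + b + 1):.1f}")
```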
We have thoroughly explored Beta-Binomial conjugacy—the foundation of Bayesian proportion inference. The essentials: the Beta($\alpha, \beta$) prior updates to a Beta($\alpha + k, \beta + n - k$) posterior by simple addition; hyperparameters act as pseudo-counts; the posterior mean shrinks the MLE toward the prior mean with weight $(\alpha+\beta)/(\alpha+\beta+n)$; and the posterior predictive is the Beta-Binomial distribution, which carries parameter uncertainty into predictions.
What's Next:
Having mastered Beta-Binomial conjugacy, we now extend to continuous data. The next page covers Gaussian-Gaussian conjugacy—the foundation for Bayesian inference on continuous measurements, regression, and countless extensions in modern machine learning.
You now command the Beta-Binomial conjugate family—the most widely applied conjugate model in practice. You can perform complete Bayesian inference for binary outcomes, from prior specification through posterior summaries and predictive inference. Next, we tackle Gaussian-Gaussian conjugacy for continuous data.