The Beta-Binomial model is the most fundamental and widely used conjugate family in Bayesian statistics. Whenever we need to estimate a probability, proportion, or rate from binary data—success/failure, yes/no, click/no-click, convert/bounce—the Beta-Binomial model provides the canonical Bayesian solution.
Its ubiquity stems from a confluence of desirable properties: the conjugate update is closed-form (simple addition), the hyperparameters have a direct pseudo-count interpretation, the posterior predictive is available analytically, and the Beta family is flexible enough to encode a wide range of prior beliefs.
From A/B testing in tech companies to clinical trial analysis in medicine, from election forecasting in politics to quality control in manufacturing—the Beta-Binomial model underlies a vast array of practical applications. Mastering it provides the template for understanding all other conjugate systems.
By the end of this page, you will be able to derive the Beta-Binomial posterior from first principles, interpret Beta hyperparameters as pseudo-counts, compute posterior point estimates (mean, mode, MAP), construct credible intervals, calculate the posterior predictive distribution, and apply these tools to real-world inference problems like A/B testing.
The Beta distribution is a continuous probability distribution defined on the interval $[0, 1]$, making it the natural choice for modeling probabilities, proportions, and rates.
Definition: A random variable $\theta$ follows a Beta distribution with parameters $\alpha > 0$ and $\beta > 0$, written $\theta \sim \text{Beta}(\alpha, \beta)$, if its probability density function is:
$$p(\theta | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
where $\Gamma(\cdot)$ is the Gamma function, the continuous extension of the factorial: $\Gamma(n) = (n-1)!$ for positive integers.
The normalizing constant $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is called the Beta function.
For Bayesian calculations, we often focus on the kernel—the part of the density that depends on the parameter of interest: $p(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$. The normalizing constant is determined by the requirement that the density integrates to 1. This simplifies proofs: we need only show that the posterior kernel matches the Beta kernel.
Key Properties of the Beta Distribution:
Mean (Expected Value): $$\mathbb{E}[\theta] = \frac{\alpha}{\alpha + \beta}$$
This is an intuitive weighted average: $\alpha$ "pulls" toward 1, $\beta$ "pulls" toward 0, and their relative magnitudes determine the balance.
Mode (Most Probable Value): $$\text{Mode}[\theta] = \frac{\alpha - 1}{\alpha + \beta - 2} \quad \text{(for } \alpha, \beta > 1\text{)}$$
The mode exists only when both parameters exceed 1; otherwise the distribution is U-shaped or monotonic.
Variance: $$\text{Var}[\theta] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$
The variance decreases as $\alpha + \beta$ (the effective sample size) increases, reflecting greater certainty.
Concentration Parameter: The sum $\kappa = \alpha + \beta$ controls how concentrated the distribution is around its mean. Larger $\kappa$ means lower variance and more peaked distributions.
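These summary formulas are easy to sanity-check numerically. A minimal sketch (the Beta(5, 2) choice is just illustrative), cross-checked against `scipy.stats.beta`:

```python
import numpy as np
from scipy import stats

def beta_summary(alpha: float, beta: float) -> dict:
    """Closed-form mean, mode, and variance of Beta(alpha, beta)."""
    kappa = alpha + beta   # concentration parameter
    return {
        'mean': alpha / kappa,
        'variance': alpha * beta / (kappa**2 * (kappa + 1)),
        'mode': (alpha - 1) / (kappa - 2) if alpha > 1 and beta > 1 else None,
    }

s = beta_summary(5, 2)
print(s)
# Cross-check mean and variance against scipy's implementation
print(stats.beta.mean(5, 2), stats.beta.var(5, 2))
```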
| Parameters | Shape | Mean | Interpretation |
|---|---|---|---|
| Beta(1, 1) | Uniform | 0.5 | Complete ignorance; all probabilities equally likely |
| Beta(0.5, 0.5) | U-shaped | 0.5 | Jeffreys prior; non-informative, favors extremes |
| Beta(2, 2) | Symmetric, unimodal | 0.5 | Weak belief centered at 0.5 |
| Beta(5, 5) | Symmetric, peaked | 0.5 | Moderate belief centered at 0.5 |
| Beta(20, 20) | Highly concentrated | 0.5 | Strong belief centered at 0.5 |
| Beta(10, 2) | Right-skewed | 0.833 | Belief that probability is high (~80%) |
| Beta(2, 10) | Left-skewed | 0.167 | Belief that probability is low (~17%) |
| Beta(100, 1) | Highly concentrated at 1 | 0.99 | Strong belief probability is nearly 1 |
Visual Intuition for Beta Shapes:
The Beta distribution's shape is entirely determined by $\alpha$ and $\beta$: when $\alpha > \beta$ the mass shifts toward 1, when $\beta > \alpha$ it shifts toward 0, equal parameters give symmetry about 0.5, values below 1 push mass toward the endpoints, and larger $\alpha + \beta$ concentrates the density. This flexibility allows Beta priors to encode an enormous variety of prior beliefs about probability values.
The Binomial distribution models the number of successes in a sequence of independent binary trials. It is the natural likelihood when data consists of counts of positive outcomes out of total trials.
Definition: Let $k$ be the number of successes in $n$ independent trials, each with success probability $\theta$. Then $k$ follows a Binomial distribution:
$$P(k | n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}$$
where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient.
As a function of $\theta$ (the likelihood function): $$\mathcal{L}(\theta | k, n) = \binom{n}{k} \theta^k (1 - \theta)^{n-k} \propto \theta^k (1 - \theta)^{n-k}$$
The binomial coefficient $\binom{n}{k}$ is constant with respect to $\theta$, so for inference purposes, we work with the likelihood kernel: $\mathcal{L}(\theta) \propto \theta^k (1 - \theta)^{n-k}$.
The Bernoulli distribution is Binomial($n=1, \theta$). For a sequence of $n$ independent Bernoulli trials, the likelihood factors into the product $\prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i} (1-\theta)^{n - \sum x_i}$, which reduces to the Binomial form with $k = \sum x_i$. The sum of successes is the sufficient statistic.
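Because the sum of successes is sufficient, updating one Bernoulli observation at a time must land on the same posterior as a single batch Binomial update. A minimal check (the Beta(2, 2) prior and the 1, 0, 1, 1, 0 sequence are arbitrary):

```python
def bernoulli_update(alpha, beta, success):
    """One Bernoulli observation: add 1 to alpha on success, else to beta."""
    return (alpha + 1, beta) if success else (alpha, beta + 1)

# Stream the sequence 1, 0, 1, 1, 0 one trial at a time
alpha, beta = 2.0, 2.0   # illustrative Beta(2, 2) prior
for x in [1, 0, 1, 1, 0]:
    alpha, beta = bernoulli_update(alpha, beta, x)

# Batch update via the sufficient statistic k = 3, n = 5 gives the same posterior
assert (alpha, beta) == (2.0 + 3, 2.0 + (5 - 3))   # Beta(5, 4) either way
print(alpha, beta)
```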
Why Binomial Likelihood and Beta Prior are Conjugate:
The key observation is that the Binomial likelihood kernel has the same functional form as the Beta density kernel. Compare:

$$\text{Beta prior kernel: } \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \qquad \text{Binomial likelihood kernel: } \theta^k (1 - \theta)^{n-k}$$

Both are products of powers of $\theta$ and $(1 - \theta)$. When we multiply them (as Bayes' theorem requires), the result is another product of powers:
$$\text{Posterior kernel} = \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \cdot \theta^k (1 - \theta)^{n-k} = \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1}$$
This is exactly a Beta kernel with updated parameters. The posterior is $\text{Beta}(\alpha + k, \beta + n - k)$.
Let us derive the Beta-Binomial posterior carefully, establishing the reasoning pattern used throughout conjugate analysis.
Setup: the prior is $\theta \sim \text{Beta}(\alpha, \beta)$, and the data are $k$ successes in $n$ independent trials, each with success probability $\theta$.
Step 1: Write Prior Density $$P(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
Step 2: Write Likelihood $$P(k | \theta, n) = \binom{n}{k} \theta^k (1 - \theta)^{n - k}$$
Step 3: Apply Bayes' Theorem $$P(\theta | k, n) = \frac{P(k | \theta, n) \cdot P(\theta)}{P(k | n)}$$
Step 4: Focus on the Kernel Since $P(k | n)$ is independent of $\theta$, we work with the numerator: $$P(\theta | k, n) \propto P(k | \theta, n) \cdot P(\theta)$$
$$\propto \theta^k (1 - \theta)^{n-k} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
$$= \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1}$$
Step 5: Recognize the Posterior Family This kernel has the form $\theta^{a-1}(1-\theta)^{b-1}$ with $a = \alpha + k$ and $b = \beta + n - k$. This is the Beta$(a, b)$ kernel.
Step 6: Conclude $$\theta | k, n \sim \text{Beta}(\alpha + k, \beta + n - k)$$
The posterior is a Beta distribution with first parameter $\alpha + k$ (prior pseudo-successes plus observed successes) and second parameter $\beta + n - k$ (prior pseudo-failures plus observed failures).
Prior: Beta($\alpha$, $\beta$) + Data: $k$ successes, $n-k$ failures → Posterior: Beta($\alpha + k$, $\beta + n - k$). This simple additive update is the computational core of Beta-Binomial inference. No integrals. No sampling. Just addition.
The Marginal Likelihood (Evidence):
For completeness, the marginal likelihood $P(k | n)$ that we factored out can be computed:
$$P(k | n) = \int_0^1 P(k | \theta, n) P(\theta) d\theta = \binom{n}{k} \cdot \frac{B(\alpha + k, \beta + n - k)}{B(\alpha, \beta)}$$
This quantity is useful for model comparison via Bayes factors, for prior predictive checks, and as the normalizing constant behind the Beta-Binomial predictive distribution derived below.
The fact that this integral has a closed form is another manifestation of conjugacy's computational elegance.
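As a sketch of that elegance, the closed form can be checked against brute-force numerical integration (the values $k = 7$, $n = 20$ and the Beta(2, 2) prior are arbitrary choices):

```python
import numpy as np
from scipy import stats
from scipy.special import comb, betaln
from scipy.integrate import quad

def log_marginal_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """log P(k | n) under a Beta(alpha, beta) prior (betaln avoids overflow)."""
    return np.log(comb(n, k)) + betaln(alpha + k, beta + n - k) - betaln(alpha, beta)

# Cross-check the closed form against brute-force integration of
# the integral of P(k | theta, n) * p(theta) over [0, 1]
k, n, a, b = 7, 20, 2.0, 2.0
closed_form = np.exp(log_marginal_likelihood(k, n, a, b))
numeric, _ = quad(lambda t: stats.binom.pmf(k, n, t) * stats.beta.pdf(t, a, b), 0, 1)
print(closed_form, numeric)
```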
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def beta_binomial_posterior(
    alpha_prior: float,
    beta_prior: float,
    successes: int,
    trials: int
) -> tuple:
    """
    Compute Beta-Binomial posterior parameters.

    The posterior update is trivially simple:
    - alpha_posterior = alpha_prior + successes
    - beta_posterior = beta_prior + failures

    Returns (alpha_posterior, beta_posterior)
    """
    alpha_posterior = alpha_prior + successes
    beta_posterior = beta_prior + (trials - successes)
    return alpha_posterior, beta_posterior

def visualize_update(alpha_prior: float, beta_prior: float,
                     successes: int, trials: int):
    """Visualize how data updates prior to posterior."""
    alpha_post, beta_post = beta_binomial_posterior(
        alpha_prior, beta_prior, successes, trials
    )

    theta = np.linspace(0, 1, 1000)
    prior_pdf = stats.beta.pdf(theta, alpha_prior, beta_prior)
    posterior_pdf = stats.beta.pdf(theta, alpha_post, beta_post)

    # Normalized likelihood for visualization
    likelihood = theta**successes * (1 - theta)**(trials - successes)
    likelihood = likelihood / likelihood.max() * posterior_pdf.max()

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(theta, prior_pdf, 'b--', lw=2,
            label=f'Prior: Beta({alpha_prior}, {beta_prior})')
    ax.plot(theta, likelihood, 'g:', lw=2,
            label=f'Likelihood: {successes}/{trials} successes')
    ax.plot(theta, posterior_pdf, 'r-', lw=2,
            label=f'Posterior: Beta({alpha_post}, {beta_post})')
    ax.fill_between(theta, posterior_pdf, alpha=0.2, color='red')
    ax.set_xlabel('θ (Success Probability)', fontsize=12)
    ax.set_ylabel('Density', fontsize=12)
    ax.set_title('Beta-Binomial Posterior Update', fontsize=14)
    ax.legend(fontsize=10)
    ax.set_xlim(0, 1)
    ax.grid(alpha=0.3)
    return fig

# Example: A/B Testing Scenario
# Prior: Weak belief that conversion is around 10%
alpha_prior, beta_prior = 2, 18  # Mean = 2/20 = 0.10

# Data: Observed 35 conversions in 300 visitors
successes, trials = 35, 300

alpha_post, beta_post = beta_binomial_posterior(
    alpha_prior, beta_prior, successes, trials
)

print(f"Prior: Beta({alpha_prior}, {beta_prior})")
print(f"  Prior mean: {alpha_prior / (alpha_prior + beta_prior):.4f}")
print(f"  Prior std: {np.sqrt(alpha_prior*beta_prior/((alpha_prior+beta_prior)**2*(alpha_prior+beta_prior+1))):.4f}")

print(f"Data: {successes} successes in {trials} trials")
print(f"  MLE: {successes/trials:.4f}")

print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f"  Posterior mean: {alpha_post / (alpha_post + beta_post):.4f}")
print(f"  Posterior std: {np.sqrt(alpha_post*beta_post/((alpha_post+beta_post)**2*(alpha_post+beta_post+1))):.4f}")

# 95% Credible Interval
ci_lower = stats.beta.ppf(0.025, alpha_post, beta_post)
ci_upper = stats.beta.ppf(0.975, alpha_post, beta_post)
print(f"  95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
```

While the full posterior distribution is the complete Bayesian answer, we often need to summarize it with point estimates and intervals for decision-making and reporting.
Posterior Mean: $$\mathbb{E}[\theta | k, n] = \frac{\alpha + k}{\alpha + \beta + n}$$
The posterior mean is a weighted average of the prior mean and the maximum likelihood estimate (MLE):
$$\mathbb{E}[\theta | k, n] = \underbrace{\frac{\alpha + \beta}{\alpha + \beta + n}}_{\text{prior weight}} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}} + \underbrace{\frac{n}{\alpha + \beta + n}}_{\text{data weight}} \cdot \underbrace{\frac{k}{n}}_{\text{MLE}}$$
This formula reveals the Bayesian compromise: the posterior blends prior belief and observed evidence, with weights proportional to their respective "sample sizes" ($\alpha + \beta$ for the prior, $n$ for the data).
The posterior mean 'shrinks' the MLE toward the prior mean. With little data, the posterior stays close to the prior. With abundant data, the posterior approaches the MLE. This shrinkage regularizes extreme sample proportions, preventing overconfidence from small samples—a key advantage of Bayesian inference.
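A small sketch of this shrinkage, reusing the illustrative Beta(2, 18) prior from earlier (prior mean 0.10) against a fixed 20% empirical rate:

```python
def posterior_mean_decomposition(alpha, beta, k, n):
    """Posterior mean as a weighted average of prior mean and MLE."""
    prior_weight = (alpha + beta) / (alpha + beta + n)
    data_weight = n / (alpha + beta + n)
    blended = prior_weight * (alpha / (alpha + beta)) + data_weight * (k / n)
    assert abs(blended - (alpha + k) / (alpha + beta + n)) < 1e-12  # same number
    return prior_weight, blended

# Beta(2, 18) prior (mean 0.10) against a fixed 20% empirical rate
for n in [10, 100, 1000]:
    w, m = posterior_mean_decomposition(2, 18, k=n // 5, n=n)
    print(f"n={n:5d}  prior weight={w:.3f}  posterior mean={m:.4f}")
# As n grows, the prior weight fades and the mean approaches the MLE 0.20
```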
Posterior Mode (MAP Estimate): $$\text{Mode}[\theta | k, n] = \frac{\alpha + k - 1}{\alpha + \beta + n - 2} \quad \text{(for } \alpha + k > 1, \beta + n - k > 1 \text{)}$$
The Maximum A Posteriori (MAP) estimate is the most probable value under the posterior. It differs from the posterior mean when the distribution is asymmetric.
Posterior Median: The median satisfies $P(\theta \leq \tilde{\theta} | k, n) = 0.5$. It equals the 0.5 quantile of Beta($\alpha + k, \beta + n - k$). While there's no closed-form expression, it's easily computed numerically.
Credible Intervals: Bayesian credible intervals directly quantify parameter uncertainty. Writing the credible level as $1 - \gamma$ (to avoid a clash with the Beta parameter $\alpha$), the $100(1-\gamma)$% credible interval $[\theta_L, \theta_U]$ satisfies:

$$P(\theta_L \leq \theta \leq \theta_U | k, n) = 1 - \gamma$$
Two common choices: the equal-tailed interval, which cuts equal probability from each tail using posterior quantiles, and the highest posterior density (HPD) interval, the shortest interval containing the required posterior probability.
| Estimate | Formula | Properties | When to Use |
|---|---|---|---|
| Posterior Mean | $\frac{\alpha + k}{\alpha + \beta + n}$ | Minimizes squared error; smooth function of data | Default choice; risk-averse decisions |
| Posterior Mode (MAP) | $\frac{\alpha + k - 1}{\alpha + \beta + n - 2}$ | Most probable value; equals MLE with uniform prior | Point predictions; when mode is meaningful |
| Posterior Median | 0.5 quantile of posterior | Robust to asymmetry; minimizes absolute error | Skewed posteriors; robust estimation |
| MLE | $k/n$ | No prior influence; can be 0 or 1 | Large samples; when prior is controversial |
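A quick sketch computing all four estimates for a skewed small-sample posterior (the 2-successes-in-10-trials data is invented for illustration):

```python
from scipy import stats

def point_estimates(alpha_prior: float, beta_prior: float, k: int, n: int) -> dict:
    """Posterior mean, MAP, median, and MLE for Beta-Binomial inference."""
    a, b = alpha_prior + k, beta_prior + n - k
    return {
        'mean': a / (a + b),
        'map': (a - 1) / (a + b - 2) if a > 1 and b > 1 else None,
        'median': stats.beta.ppf(0.5, a, b),   # no closed form; use the quantile
        'mle': k / n,
    }

# Skewed small-sample case: uniform prior, 2 successes in 10 trials
est = point_estimates(1, 1, k=2, n=10)
for name, value in est.items():
    print(f"{name:>6}: {value:.4f}")
# Note: with a uniform prior the MAP equals the MLE, and for this
# right-skewed posterior the ordering is MAP < median < mean.
```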
The Role of Sample Size:
As $n \to \infty$, the posterior concentrates around the true parameter: the posterior mean converges to the MLE $k/n$ (and hence to the true $\theta$), and the posterior variance shrinks at rate $O(1/n)$.
This is Bayesian consistency: with sufficient data, the posterior converges to the truth regardless of the prior (assuming the true parameter is in the prior's support). The prior matters most with small samples—precisely when subjective knowledge is most valuable.
Practical Rule of Thumb: The prior's influence is roughly equivalent to $\alpha + \beta - 2$ observations. If $n \gg \alpha + \beta$, the prior barely matters. If $n \ll \alpha + \beta$, the prior dominates.
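A short simulation sketch of this prior washout, assuming a true rate of 0.3: two very different priors converge to nearly the same posterior mean as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3  # assumed truth for the simulation

results = {}
for n in [10, 100, 10000]:
    k = rng.binomial(n, theta_true)
    for a0, b0 in [(20, 2), (1, 1)]:  # skeptical vs uniform prior
        mean = (a0 + k) / (a0 + b0 + n)
        results[(n, a0, b0)] = mean
        print(f"n={n:6d}  prior=Beta({a0:2d},{b0})  posterior mean={mean:.3f}")
# Both priors are washed out by n = 10000; at n = 10 they disagree sharply.
```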
A crucial Bayesian quantity is the posterior predictive distribution—the probability distribution of future observations, accounting for parameter uncertainty. Instead of plugging in a point estimate for $\theta$, we average predictions over all plausible $\theta$ values, weighted by the posterior.
Definition: Given observed data $D = (k, n)$ and a future trial, the posterior predictive probability of success is:
$$P(\tilde{x} = 1 | D) = \int_0^1 P(\tilde{x} = 1 | \theta) P(\theta | D) d\theta = \int_0^1 \theta \cdot \text{Beta}(\theta | \alpha + k, \beta + n - k) d\theta$$
$$= \mathbb{E}[\theta | D] = \frac{\alpha + k}{\alpha + \beta + n}$$
The probability of the next trial being a success equals the posterior mean of $\theta$.
If we used the MLE $\hat{\theta} = k/n$ for prediction, we'd underestimate uncertainty. With $k=0$ (no successes), the MLE predicts zero probability of future success—clearly overconfident. The posterior predictive, using Beta(α, β+n), correctly maintains positive probability for success. Bayesian prediction automatically incorporates parameter uncertainty.
The Beta-Binomial Distribution:
For predictions about multiple future trials, the posterior predictive distribution has a name: the Beta-Binomial distribution.
If $\tilde{k}$ is the number of successes in $m$ future trials, given posterior Beta($\alpha + k, \beta + n - k$):
$$P(\tilde{k} | k, n, m) = \binom{m}{\tilde{k}} \frac{B(\alpha + k + \tilde{k}, \beta + n - k + m - \tilde{k})}{B(\alpha + k, \beta + n - k)}$$
This distribution accounts for both sampling variability (the Binomial randomness of the future trials) and parameter uncertainty (the posterior spread in $\theta$).
Properties of the Beta-Binomial:
Mean: $$\mathbb{E}[\tilde{k}] = m \cdot \frac{\alpha + k}{\alpha + \beta + n}$$
Variance: $$\text{Var}[\tilde{k}] = m \cdot \frac{(\alpha + k)(\beta + n - k)}{(\alpha + \beta + n)^2} \cdot \frac{\alpha + \beta + n + m}{\alpha + \beta + n + 1}$$
The variance is larger than the Binomial variance $mp(1-p)$ because it includes parameter uncertainty. The extra factor $\frac{\alpha + \beta + n + m}{\alpha + \beta + n + 1}$ captures this additional variation.
```python
import numpy as np
from scipy import stats
from scipy.special import beta as beta_func
from scipy.special import comb

def beta_binomial_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """
    Beta-Binomial PMF: probability of k successes in n trials given
    Beta(alpha, beta) prior/posterior on success probability.

    This is the posterior predictive distribution for future observations.
    """
    numerator = beta_func(alpha + k, beta + n - k)
    denominator = beta_func(alpha, beta)
    return comb(n, k, exact=True) * numerator / denominator

def posterior_predictive_comparison(
    alpha_prior: float, beta_prior: float,
    observed_successes: int, observed_trials: int,
    future_trials: int
):
    """
    Compare posterior predictive (accounting for parameter uncertainty)
    vs plug-in prediction (using a point estimate).
    """
    # Posterior parameters
    alpha_post = alpha_prior + observed_successes
    beta_post = beta_prior + observed_trials - observed_successes

    # Posterior mean (success probability)
    theta_mean = alpha_post / (alpha_post + beta_post)

    # MLE
    theta_mle = observed_successes / observed_trials if observed_trials > 0 else 0.5

    k_values = np.arange(future_trials + 1)

    # Posterior predictive (Beta-Binomial)
    pp_probs = np.array([
        beta_binomial_pmf(k, future_trials, alpha_post, beta_post)
        for k in k_values
    ])

    # Plug-in Binomial using posterior mean
    plugin_probs = stats.binom.pmf(k_values, future_trials, theta_mean)

    # Using MLE directly
    mle_probs = stats.binom.pmf(k_values, future_trials, theta_mle)

    # Compare variances
    pp_mean = future_trials * theta_mean
    pp_var = np.sum((k_values - pp_mean)**2 * pp_probs)
    plugin_var = future_trials * theta_mean * (1 - theta_mean)

    print(f"Observed: {observed_successes}/{observed_trials} successes")
    print(f"Posterior: Beta({alpha_post}, {beta_post})")
    print(f"Posterior mean θ: {theta_mean:.4f}")
    print(f"MLE θ: {theta_mle:.4f}")
    print(f"Predicting {future_trials} future trials:")
    print(f"  Expected successes: {pp_mean:.2f}")
    print(f"  Posterior predictive variance: {pp_var:.2f}")
    print(f"  Plug-in variance (ignoring parameter uncertainty): {plugin_var:.2f}")
    print(f"  Variance ratio: {pp_var / plugin_var:.2f}x")

    return k_values, pp_probs, plugin_probs, mle_probs

# Example: Small sample prediction
# Very few observations - parameter uncertainty is high
k_vals, pp, plugin, mle = posterior_predictive_comparison(
    alpha_prior=1, beta_prior=1,              # Uniform prior
    observed_successes=3, observed_trials=5,  # 60% success rate
    future_trials=100                         # Predicting many future trials
)

# With small observed sample, posterior predictive shows much more uncertainty
# than plug-in prediction that ignores parameter uncertainty
```

The Beta-Binomial model underpins countless real-world inference systems. Let's examine several important applications in depth.
Application 1: A/B Testing
A/B testing compares two variants (A and B) to determine which performs better. The Beta-Binomial model provides a fully Bayesian framework.
Setup: variant A receives $n_A$ visitors with $k_A$ conversions; variant B receives $n_B$ visitors with $k_B$ conversions. Each variant gets a Beta($\alpha$, $\beta$) prior on its conversion rate.

Posteriors: $\theta_A \sim \text{Beta}(\alpha + k_A, \beta + n_A - k_A)$ and $\theta_B \sim \text{Beta}(\alpha + k_B, \beta + n_B - k_B)$.
Key Quantities:
Probability B beats A: $P(\theta_B > \theta_A | \text{data})$
Expected lift: $\mathbb{E}[\theta_B - \theta_A | \text{data}] = \mathbb{E}[\theta_B] - \mathbb{E}[\theta_A]$
Credible interval for difference: Distribution of $\theta_B - \theta_A$
```python
import numpy as np
from scipy import stats

class BayesianABTest:
    """
    Complete Bayesian A/B testing using Beta-Binomial conjugacy.

    Advantages over frequentist tests:
    1. Direct probability statements: "P(B > A) = 0.95"
    2. Valid at any sample size (no "peeking" problems)
    3. Incorporates prior knowledge naturally
    4. Decision-theoretic framework for stopping rules
    """

    def __init__(self, alpha_prior: float = 1.0, beta_prior: float = 1.0):
        """
        Initialize with shared prior (can be customized per variant).
        Default: Uniform prior Beta(1,1) = maximum entropy.
        """
        self.variants = {}
        self.alpha_prior = alpha_prior
        self.beta_prior = beta_prior

    def add_variant(self, name: str, successes: int = 0, trials: int = 0):
        """Add or update a variant with observed data."""
        self.variants[name] = {
            'successes': successes,
            'trials': trials,
            'alpha': self.alpha_prior + successes,
            'beta': self.beta_prior + trials - successes
        }

    def update(self, name: str, success: bool):
        """Update variant with single observation (for streaming)."""
        v = self.variants[name]
        v['trials'] += 1
        if success:
            v['successes'] += 1
            v['alpha'] += 1
        else:
            v['beta'] += 1

    def posterior_mean(self, name: str) -> float:
        """Expected conversion rate for variant."""
        v = self.variants[name]
        return v['alpha'] / (v['alpha'] + v['beta'])

    def credible_interval(self, name: str, confidence: float = 0.95) -> tuple:
        """Equal-tailed credible interval for conversion rate."""
        v = self.variants[name]
        tail = (1 - confidence) / 2
        lower = stats.beta.ppf(tail, v['alpha'], v['beta'])
        upper = stats.beta.ppf(1 - tail, v['alpha'], v['beta'])
        return (lower, upper)

    def prob_best(self, name: str, n_samples: int = 100000) -> float:
        """
        Probability that variant 'name' is the best among all variants.
        Uses Monte Carlo sampling for multiple variant comparison.
        """
        samples = {}
        for vname, v in self.variants.items():
            samples[vname] = stats.beta.rvs(v['alpha'], v['beta'],
                                            size=n_samples)
        # Count how often this variant has the highest sample
        is_best = np.ones(n_samples, dtype=bool)
        for other_name in self.variants:
            if other_name != name:
                is_best &= (samples[name] > samples[other_name])
        return np.mean(is_best)

    def prob_beats(self, name_a: str, name_b: str,
                   n_samples: int = 100000) -> float:
        """P(θ_a > θ_b | data) by Monte Carlo."""
        va, vb = self.variants[name_a], self.variants[name_b]
        samples_a = stats.beta.rvs(va['alpha'], va['beta'], size=n_samples)
        samples_b = stats.beta.rvs(vb['alpha'], vb['beta'], size=n_samples)
        return np.mean(samples_a > samples_b)

    def expected_loss(self, choose: str, best: str,
                      n_samples: int = 100000) -> float:
        """
        Expected loss from choosing 'choose' if 'best' is actually best.
        Loss = max(0, θ_best - θ_choose) for each sample.
        """
        v_choose = self.variants[choose]
        v_best = self.variants[best]
        samples_choose = stats.beta.rvs(v_choose['alpha'], v_choose['beta'],
                                        size=n_samples)
        samples_best = stats.beta.rvs(v_best['alpha'], v_best['beta'],
                                      size=n_samples)
        return np.mean(np.maximum(0, samples_best - samples_choose))

    def summary(self):
        """Print comprehensive test summary."""
        print("=" * 60)
        print("BAYESIAN A/B TEST SUMMARY")
        print("=" * 60)
        for name, v in self.variants.items():
            mean = self.posterior_mean(name)
            ci = self.credible_interval(name)
            pb = self.prob_best(name)
            print(f"{name}:")
            print(f"  Data: {v['successes']}/{v['trials']} "
                  f"({100*v['successes']/max(1, v['trials']):.1f}%)")
            print(f"  Posterior: Beta({v['alpha']}, {v['beta']})")
            print(f"  Mean: {mean:.4f}")
            print(f"  95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
            print(f"  P(best): {pb:.4f}")

# Example usage
test = BayesianABTest(alpha_prior=1, beta_prior=1)
test.add_variant('Control', successes=120, trials=1000)    # 12%
test.add_variant('Treatment', successes=145, trials=1000)  # 14.5%

test.summary()
print(f"P(Treatment > Control) = {test.prob_beats('Treatment', 'Control'):.4f}")
```

Application 2: Thompson Sampling for Multi-Armed Bandits
Thompson Sampling is a Bayesian algorithm for the exploration-exploitation tradeoff in multi-armed bandit problems. With Beta-Binomial models, it's remarkably simple:
Algorithm: maintain a Beta posterior for each arm. On each round: (1) draw one sample $\tilde{\theta}_i$ from each arm's current Beta posterior; (2) play the arm with the largest sampled value; (3) observe the binary reward and update that arm's posterior (add 1 to $\alpha$ on success, to $\beta$ on failure).
This simple algorithm achieves asymptotically optimal regret for Bernoulli bandits and naturally balances exploration (uncertain arms produce high-variance samples that occasionally win the draw) and exploitation (arms with high posterior means win most draws).
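The steps above can be sketched in a few lines; the per-arm success rates here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
true_rates = [0.05, 0.10, 0.20]    # hidden per-arm success probabilities (invented)
n_arms = len(true_rates)
alphas = np.ones(n_arms)           # Beta(1, 1) prior on every arm
betas = np.ones(n_arms)
pulls = np.zeros(n_arms, dtype=int)

for _ in range(5000):
    # 1. Draw one theta per arm from its current Beta posterior
    theta_samples = rng.beta(alphas, betas)
    # 2. Play the arm with the largest sampled theta
    arm = int(np.argmax(theta_samples))
    # 3. Observe the reward and apply the conjugate update
    if rng.random() < true_rates[arm]:
        alphas[arm] += 1
    else:
        betas[arm] += 1
    pulls[arm] += 1

print("pulls per arm:", pulls)  # the best arm should accumulate most pulls
```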
Application 3: Clinical Trial Analysis
In clinical trials, binary outcomes (cure/no cure, adverse event/no adverse event) are common. Bayesian analysis provides direct probability statements about efficacy, principled incorporation of historical data through the prior, and valid interim monitoring without multiplicity corrections.
Major tech companies (Netflix, Google, Microsoft) have adopted Bayesian A/B testing because: (1) Results are interpretable—'95% probability that B is better' not 'p < 0.05'. (2) No peeking penalty—can check results anytime. (3) Prior incorporation handles low-traffic tests. (4) Expected loss provides principled stopping rules. (5) Multi-variant tests are natural extensions.
While Beta-Binomial inference is generally well-behaved, practitioners should be aware of edge cases and potential issues.
Edge Case 1: Zero Observations ($k = 0$ or $k = n$)
When all trials are successes or all are failures, the MLE is exactly 1 or 0—an estimate that assigns zero probability to ever seeing the other outcome. The Bayesian posterior avoids this boundary behavior.

With prior Beta(1, 1) and 0 successes in 10 trials, the posterior is Beta(1, 11), with mean $1/12 \approx 0.083$.

The Bayesian estimate is non-zero, reflecting that we shouldn't be certain the rate is exactly zero after just 10 trials.
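The arithmetic of this example in code:

```python
from scipy import stats

# 0 successes in 10 trials with a uniform Beta(1, 1) prior
alpha_post, beta_post = 1 + 0, 1 + 10   # posterior: Beta(1, 11)
bayes_mean = alpha_post / (alpha_post + beta_post)

print(f"MLE:            {0 / 10:.4f}")          # exactly 0
print(f"Posterior mean: {bayes_mean:.4f}")      # 1/12 ≈ 0.0833
print(f"P(θ > 0.2 | data): {stats.beta.sf(0.2, alpha_post, beta_post):.4f}")
```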
Edge Case 2: Non-Informative Priors ($\alpha \leq 1$ or $\beta \leq 1$)

The Jeffreys prior Beta(0.5, 0.5) is proper (it integrates to 1, though its density is unbounded at 0 and 1), while the Haldane prior Beta(0, 0) is improper—it doesn't integrate to 1. Even so, the Haldane posterior Beta($k$, $n - k$) is proper whenever $0 < k < n$, so inference remains valid except when the data are all successes or all failures.
Edge Case 3: Very Strong Priors
If $\alpha + \beta \gg n$, the prior dominates. This can be problematic if the prior was poorly calibrated, based on data from a different population, or chosen for convenience rather than genuine prior knowledge—the data will be unable to correct it.
Solution: Always check the prior predictive distribution. If the prior predicts data very different from what's observed, reconsider the prior.
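A sketch of such a prior predictive check (the Beta(2, 98) prior and the 150/300 observation are invented to illustrate a prior-data conflict):

```python
import numpy as np

rng = np.random.default_rng(1)

def prior_predictive_samples(alpha, beta, n, size=10_000):
    """Draw k ~ Binomial(n, theta) with theta ~ Beta(alpha, beta): the prior predictive."""
    thetas = rng.beta(alpha, beta, size=size)
    return rng.binomial(n, thetas)

# Strong prior saying theta ≈ 2%, versus invented data showing 150/300 = 50%
observed_k, n = 150, 300
ks = prior_predictive_samples(2, 98, n)
lo, hi = np.quantile(ks, [0.005, 0.995])
print(f"prior predictive 99% range: [{lo:.0f}, {hi:.0f}]  observed: {observed_k}")
# Observed count far outside the range -> the prior conflicts with the data
```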
Overdispersion and Beta-Binomial as Likelihood:
Interestingly, the Beta-Binomial distribution (our posterior predictive) can also serve as a likelihood for overdispersed count data:
If counts show more variance than Binomial (due to hidden heterogeneity across trials), the Beta-Binomial likelihood models this: $$k \sim \text{BetaBinomial}(n, \alpha, \beta)$$
Here $\alpha$ and $\beta$ parameterize the heterogeneity distribution of success probabilities across trials, not prior uncertainty. This is a subtle but important distinction in usage.
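A simulation sketch of this distinction: drawing a fresh $\theta$ per batch (heterogeneity) inflates the count variance well beyond the fixed-$\theta$ Binomial. The parameter choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, a, b = 50, 2.0, 3.0   # illustrative heterogeneity parameters

# Heterogeneous batches: each draws its own theta from Beta(a, b)
thetas = rng.beta(a, b, size=200_000)
k_overdispersed = rng.binomial(n, thetas)

# Homogeneous batches: a single fixed theta at the Beta mean
p = a / (a + b)
k_binomial = rng.binomial(n, p, size=200_000)

print(f"Binomial variance:       {k_binomial.var():.1f}")
print(f"Beta-Binomial variance:  {k_overdispersed.var():.1f}")
# Closed form: n p (1-p) (a + b + n) / (a + b + 1)
print(f"theoretical BB variance: {n * p * (1 - p) * (a + b + n) / (a + b + 1):.1f}")
```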
We have thoroughly explored Beta-Binomial conjugacy—the foundation of Bayesian proportion inference. The essentials: the Beta($\alpha, \beta$) prior updates to a Beta($\alpha + k, \beta + n - k$) posterior by simple addition; hyperparameters act as pseudo-counts; the posterior mean shrinks the MLE toward the prior mean with weight $(\alpha+\beta)/(\alpha+\beta+n)$; and the posterior predictive is the Beta-Binomial distribution, which carries parameter uncertainty into predictions.
What's Next:
Having mastered Beta-Binomial conjugacy, we now extend to continuous data. The next page covers Gaussian-Gaussian conjugacy—the foundation for Bayesian inference on continuous measurements, regression, and countless extensions in modern machine learning.
You now command the Beta-Binomial conjugate family—the most widely applied conjugate model in practice. You can perform complete Bayesian inference for binary outcomes, from prior specification through posterior summaries and predictive inference. Next, we tackle Gaussian-Gaussian conjugacy for continuous data.