Every classification problem in machine learning—spam detection, disease diagnosis, fraud identification, sentiment analysis—fundamentally reduces to predicting binary outcomes. Will the email be spam or not? Is the tumor malignant or benign? Is this transaction fraudulent or legitimate?

To model these binary outcomes probabilistically, we need a mathematical framework that captures the inherent randomness of such yes/no decisions. This is precisely what the Bernoulli and Binomial distributions provide. They are the simplest yet most fundamental probability distributions, forming the conceptual bedrock upon which more sophisticated models—logistic regression, naive Bayes classifiers, and neural network classification heads—are built.

Understanding these distributions is not merely an academic exercise; it is essential for grasping how machine learning models quantify uncertainty, how we evaluate classifier performance, and how we make principled decisions under probabilistic predictions.
By the end of this page, you will master the Bernoulli and Binomial distributions: their mathematical definitions, key properties, moment calculations, connections to other distributions, and their pervasive applications in machine learning. You will understand not just the formulas, but the deep intuition behind why these distributions arise naturally in binary outcome modeling.
The Bernoulli distribution is the simplest discrete probability distribution, modeling a single trial with exactly two possible outcomes: success (typically encoded as 1) and failure (encoded as 0). Named after Swiss mathematician Jacob Bernoulli (1654–1705), who made foundational contributions to probability theory, this distribution captures the essence of random binary events.

### Formal Definition

A random variable X follows a Bernoulli distribution with parameter p, written X ~ Bernoulli(p), if:

Probability Mass Function (PMF):

$$P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}$$

Equivalently, we can write this as:

$$P(X = 1) = p \quad \text{and} \quad P(X = 0) = 1 - p = q$$

where:
- p ∈ [0, 1] is the probability of success
- q = 1 - p is the probability of failure
- X takes only two values: 0 or 1

The parameter p completely characterizes the distribution. Once you know p, you know everything about the randomness of a single binary trial.
The formula p^x(1-p)^(1-x) elegantly unifies both cases in a single expression. When x=1, it becomes p¹(1-p)⁰ = p. When x=0, it becomes p⁰(1-p)¹ = 1-p. This compact notation becomes invaluable when writing likelihood functions for parameter estimation.
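A two-line check in Python (a trivial sketch; `bernoulli_pmf` is just an illustrative name) confirms that the unified form recovers both cases:

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """Unified Bernoulli PMF: p^x * (1-p)^(1-x) for x in {0, 1}."""
    return (p ** x) * ((1 - p) ** (1 - x))

p = 0.3
print(bernoulli_pmf(1, p))  # p     -> 0.3
print(bernoulli_pmf(0, p))  # 1 - p -> ≈ 0.7
```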
| Property | Formula | Interpretation |
|---|---|---|
| Support | {0, 1} | Only two possible outcomes |
| Parameter | p ∈ [0, 1] | Probability of success |
| Mean | μ = p | Average outcome equals success probability |
| Variance | σ² = p(1-p) | Maximum uncertainty at p = 0.5 |
| Skewness | (1-2p)/√(pq) | Symmetric only when p = 0.5 |
| Kurtosis | (1-6pq)/(pq) | Measures tail behavior |
| Entropy | -p·log(p) - (1-p)·log(1-p) | Information content |
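To connect these formulas to code, here is a brief numerical check (a sketch assuming NumPy and SciPy are available; the variable names are illustrative) that compares the table's closed-form expressions against `scipy.stats.bernoulli`:

```python
import numpy as np
from scipy.stats import bernoulli

p = 0.3
q = 1 - p

# Closed-form properties from the table above
mean_formula = p
var_formula = p * q
skew_formula = (1 - 2 * p) / np.sqrt(p * q)
entropy_formula = -p * np.log(p) - q * np.log(q)  # natural log (nats)

# The same quantities from scipy
mean_s, var_s, skew_s = bernoulli.stats(p, moments='mvs')
entropy_s = bernoulli.entropy(p)  # also in nats

print(mean_formula, float(mean_s))                            # both ≈ 0.3
print(var_formula, float(var_s))                              # both ≈ 0.21
print(round(skew_formula, 4), round(float(skew_s), 4))        # both ≈ 0.8729
print(round(entropy_formula, 4), round(float(entropy_s), 4))  # both ≈ 0.6109
```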
While the Bernoulli distribution models a single binary trial, the Binomial distribution extends this to model the total number of successes in n independent, identical Bernoulli trials. This is the natural progression: from one coin flip to many flips, from one classification to many classifications.

### Formal Definition

A random variable X follows a Binomial distribution with parameters n and p, written X ~ Binomial(n, p), if X represents the count of successes in n independent Bernoulli(p) trials.

Probability Mass Function (PMF):

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, 2, \ldots, n\}$$

where the binomial coefficient is:

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

This coefficient counts the number of ways to choose which k trials are successes among n trials.
The binomial PMF has three interpretable components: (1) p^k is the probability that exactly k specific trials are successes, (2) (1-p)^(n-k) is the probability that the remaining n-k trials are failures, and (3) C(n,k) counts all possible ways to arrange which trials are the successes.
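The three factors can be computed separately and multiplied back together; the short sketch below (assuming scipy is available for the cross-check) makes the decomposition concrete:

```python
from math import comb
from scipy.stats import binom

n, k, p = 10, 3, 0.4

arrangements = comb(n, k)           # C(n, k): ways to place the k successes
prob_successes = p ** k             # p^k: those k specific trials succeed
prob_failures = (1 - p) ** (n - k)  # (1-p)^(n-k): the remaining trials fail

pmf = arrangements * prob_successes * prob_failures
print(f"manual product : {pmf:.4f}")                  # ≈ 0.2150
print(f"scipy binom.pmf: {binom.pmf(k, n, p):.4f}")   # matches
```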
| Property | Formula | Notes |
|---|---|---|
| Support | {0, 1, 2, ..., n} | n+1 possible values |
| Parameters | n ∈ ℕ, p ∈ [0, 1] | Number of trials, success probability |
| Mean | μ = np | Linear in n |
| Variance | σ² = np(1-p) | Maximum at p = 0.5 for fixed n |
| Skewness | (1-2p)/√(npq) | Approaches 0 as n → ∞ |
| Kurtosis | (1-6pq)/(npq) | Approaches 0 as n → ∞ |
| MGF | ((1-p) + pe^t)^n | Product of n Bernoulli MGFs |
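The binomial moments can be checked the same way, against `scipy.stats.binom` and a quick Monte Carlo simulation (a sketch under the assumption that scipy is available; the 100,000-sample size is an arbitrary choice):

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.3
q = 1 - p

# Closed-form moments from the table above: mean, variance, skewness
print("formula  :", n * p, n * p * q, (1 - 2 * p) / np.sqrt(n * p * q))

# scipy's mean, variance, skewness
mean_s, var_s, skew_s = binom.stats(n, p, moments='mvs')
print("scipy    :", float(mean_s), float(var_s), float(skew_s))

# Monte Carlo estimate from simulated draws
rng = np.random.default_rng(0)
samples = rng.binomial(n, p, size=100_000)
print("simulated:", samples.mean(), samples.var())
```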
Understanding probability distributions through visualization deepens intuition beyond formulas. Let's examine how Bernoulli and Binomial distributions behave under different parameter settings.

### Bernoulli PMF Visualization

The Bernoulli PMF is strikingly simple: two bars at x = 0 and x = 1, with heights (1-p) and p respectively.

| p value | P(X=0) | P(X=1) | Visual Pattern |
|---------|--------|--------|----------------|
| 0.1 | 0.9 | 0.1 | Heavily favors 0 (failure) |
| 0.5 | 0.5 | 0.5 | Symmetric (fair coin) |
| 0.8 | 0.2 | 0.8 | Heavily favors 1 (success) |

### Binomial PMF: Effect of n and p

The binomial PMF becomes richer as n increases, displaying a bell-shaped curve for moderate p values.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
from math import comb

def plot_binomial_distributions():
    """
    Visualize Binomial distributions for various parameter combinations.
    Demonstrates how shape changes with n and p.
    """
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))

    # Different parameter combinations to illustrate behavior
    params = [
        (10, 0.2, "n=10, p=0.2 (right-skewed)"),
        (10, 0.5, "n=10, p=0.5 (symmetric)"),
        (10, 0.8, "n=10, p=0.8 (left-skewed)"),
        (50, 0.2, "n=50, p=0.2 (approaches normal)"),
        (50, 0.5, "n=50, p=0.5 (nearly normal)"),
        (100, 0.3, "n=100, p=0.3 (well-approximated by normal)")
    ]

    for ax, (n, p, title) in zip(axes.flatten(), params):
        k_values = np.arange(0, n + 1)
        probabilities = binom.pmf(k_values, n, p)

        ax.bar(k_values, probabilities, color='steelblue', alpha=0.7, edgecolor='navy')
        ax.set_xlabel('Number of Successes (k)')
        ax.set_ylabel('P(X = k)')
        ax.set_title(title)

        # Add mean line
        mean = n * p
        ax.axvline(mean, color='red', linestyle='--', linewidth=2, label=f'μ = {mean:.1f}')
        ax.legend()

        # Show ±1 standard deviation
        std = np.sqrt(n * p * (1 - p))
        ax.axvspan(mean - std, mean + std, alpha=0.2, color='red')

    plt.tight_layout()
    plt.savefig('binomial_distributions.png', dpi=150)
    plt.show()

# Manual PMF calculation (understanding the formula)
def binomial_pmf(k: int, n: int, p: float) -> float:
    """
    Calculate P(X = k) for X ~ Binomial(n, p).

    P(X = k) = C(n,k) * p^k * (1-p)^(n-k)
    """
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

# Example: Probability of exactly 3 heads in 5 coin flips (fair coin)
print(f"P(X=3 | n=5, p=0.5) = {binomial_pmf(3, 5, 0.5):.4f}")  # 0.3125
```

In machine learning, we rarely know the true parameter p; instead, we must estimate it from observed data. This section develops the theory of parameter estimation for Bernoulli and Binomial distributions using Maximum Likelihood Estimation (MLE).

### MLE for Bernoulli Parameter

Suppose we observe n independent samples x₁, x₂, ..., xₙ from a Bernoulli(p) distribution. The likelihood function is the probability of observing this specific data as a function of p:

$$L(p \mid x_1, \ldots, x_n) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i} = p^{\sum x_i} (1-p)^{n - \sum x_i}$$

Let k = Σxᵢ be the total number of successes. Then:

$$L(p) = p^k (1-p)^{n-k}$$

### The Log-Likelihood

Maximizing the likelihood is equivalent to maximizing its logarithm (since log is monotonic):

$$\ell(p) = \log L(p) = k \log p + (n-k) \log(1-p)$$

To find the maximum, we take the derivative and set it to zero:

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0$$

Solving for p:

$$\frac{k}{p} = \frac{n-k}{1-p}$$
$$k(1-p) = (n-k)p$$
$$k - kp = np - kp$$
$$k = np$$
$$\hat{p}_{\text{MLE}} = \frac{k}{n} = \frac{\sum_{i=1}^n x_i}{n}$$

The MLE estimate is simply the sample proportion—the number of successes divided by total trials.
The MLE for p equals the observed frequency of successes. If you flip a coin 100 times and see 58 heads, the MLE estimate is p̂ = 0.58. This matches our intuition and is mathematically optimal in the sense of maximizing the probability of the observed data.
```python
import numpy as np

def estimate_bernoulli_parameter(samples: np.ndarray) -> dict:
    """
    Estimate Bernoulli parameter p from binary samples using MLE.

    Args:
        samples: Array of 0s and 1s

    Returns:
        Dictionary with point estimate and confidence intervals
    """
    n = len(samples)
    k = np.sum(samples)  # Number of successes

    # MLE estimate
    p_hat = k / n

    # Standard error
    se = np.sqrt(p_hat * (1 - p_hat) / n)

    # 95% confidence interval (using normal approximation)
    z = 1.96  # z-score for 95% CI
    ci_lower = max(0, p_hat - z * se)
    ci_upper = min(1, p_hat + z * se)

    # Wilson score interval (better for extreme p or small n)
    z_sq = z ** 2
    wilson_center = (k + z_sq / 2) / (n + z_sq)
    wilson_margin = z * np.sqrt(k * (n - k) / n + z_sq / 4) / (n + z_sq)
    wilson_lower = wilson_center - wilson_margin
    wilson_upper = wilson_center + wilson_margin

    return {
        'n': n,
        'successes': k,
        'p_hat': p_hat,
        'standard_error': se,
        'ci_95_normal': (ci_lower, ci_upper),
        'ci_95_wilson': (wilson_lower, wilson_upper)
    }

# Example: Estimate click-through rate from data
np.random.seed(42)
true_p = 0.12  # True click-through rate
sample_size = 500
data = np.random.binomial(1, true_p, sample_size)

results = estimate_bernoulli_parameter(data)
print(f"True p: {true_p}")
print(f"MLE estimate: {results['p_hat']:.4f}")
print(f"Standard error: {results['standard_error']:.4f}")
print(f"95% CI (normal): [{results['ci_95_normal'][0]:.4f}, {results['ci_95_normal'][1]:.4f}]")
print(f"95% CI (Wilson): [{results['ci_95_wilson'][0]:.4f}, {results['ci_95_wilson'][1]:.4f}]")
```

For small sample sizes or extreme probabilities (p near 0 or 1), the normal approximation for confidence intervals can produce invalid intervals (outside [0, 1]). The Wilson score interval provides better coverage in these cases and is preferred in practice.
The Bernoulli and Binomial distributions are not isolated; they connect to a rich web of other probability distributions. Understanding these relationships deepens comprehension and reveals when different distributions are appropriate.

### From Bernoulli to Binomial

The fundamental relationship: if X₁, X₂, ..., Xₙ are independent Bernoulli(p) random variables, then:

$$X = \sum_{i=1}^n X_i \sim \text{Binomial}(n, p)$$

Conversely, a Binomial(1, p) is simply a Bernoulli(p).

### Binomial and Poisson (Poisson Limit Theorem)

When n is large, p is small, and λ = np is moderate, the binomial approximates the Poisson distribution:

$$\text{Binomial}(n, p) \approx \text{Poisson}(\lambda) \quad \text{where } \lambda = np$$

Precisely, as n → ∞ and p → 0 with np → λ:

$$\binom{n}{k} p^k (1-p)^{n-k} \rightarrow \frac{\lambda^k e^{-\lambda}}{k!}$$

Rule of thumb: Use the Poisson approximation when n ≥ 20 and p ≤ 0.05.

Example: If 1 in 1000 emails is spam (p = 0.001) and you receive 500 emails (n = 500), the number of spam emails approximates Poisson(λ = 0.5).
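The spam example above can be verified directly; this sketch (assuming scipy) compares Binomial(n=500, p=0.001) with its Poisson(λ=0.5) approximation over the first few counts:

```python
from scipy.stats import binom, poisson

n, p = 500, 0.001
lam = n * p  # 0.5

for k in range(4):
    b = binom.pmf(k, n, p)
    po = poisson.pmf(k, lam)
    print(f"k={k}: binomial={b:.6f}  poisson={po:.6f}")
# The two columns agree to roughly four decimal places,
# as the Poisson limit theorem predicts for small p and large n.
```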
| Source | Relationship | Target Distribution | Use Case |
|---|---|---|---|
| Bernoulli(p) | Sum of n iid copies | Binomial(n, p) | Counting successes in n trials |
| Binomial(n, p) | n→∞, p→0, np→λ | Poisson(λ) | Rare events in many trials |
| Binomial(n, p) | n→∞, np≥5, n(1-p)≥5 | Normal(np, np(1-p)) | Large sample approximation |
| Bernoulli(p) | Count trials until the r-th success | Negative Binomial(r, p) | Waiting time problems |
| Bernoulli(p) | Count trials until the first success | Geometric(p) | Waiting time to first success |
Use the Poisson approximation when events are rare (small p, large n). Use the Normal approximation when both np and n(1-p) are at least 5. When in doubt about n being 'large enough,' the exact binomial calculation is preferable with modern computing.
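As a quick illustration of the Normal rule of thumb, the sketch below (assuming scipy) compares the exact binomial CDF with its Normal approximation; the +0.5 continuity correction in the last line is a standard refinement not discussed above:

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 50, 0.3                  # np = 15 and n(1-p) = 35, both >= 5
mu = n * p
sigma = np.sqrt(n * p * (1 - p))

k = 18
exact = binom.cdf(k, n, p)
approx = norm.cdf(k, mu, sigma)            # plain Normal approximation
approx_cc = norm.cdf(k + 0.5, mu, sigma)   # with continuity correction

print(f"P(X <= {k}) exact               : {exact:.4f}")
print(f"Normal approximation            : {approx:.4f}")
print(f"Normal approx. + 0.5 correction : {approx_cc:.4f}")
```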
Bernoulli and Binomial distributions permeate machine learning, underpinning fundamental algorithms and evaluation methodologies. Let's explore their key applications in depth.

### Binary Classification and Logistic Regression

Logistic regression models the probability of class membership using a Bernoulli distribution:

$$P(Y = 1 | \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}$$

Given the predicted probability p = σ(w^T x + b), the target Y follows:

$$Y | \mathbf{x} \sim \text{Bernoulli}(\sigma(\mathbf{w}^T \mathbf{x} + b))$$

The negative log-likelihood loss (binary cross-entropy) is:

$$\mathcal{L} = -\sum_{i=1}^n \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]$$

This is precisely the negative log-likelihood of Bernoulli observations—training logistic regression is MLE for Bernoulli parameters!
```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    """Numerically stable sigmoid function."""
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def bernoulli_nll(params, X, y):
    """
    Negative log-likelihood for Bernoulli observations.
    This is the binary cross-entropy loss.

    L(w, b) = -sum[y_i * log(p_i) + (1-y_i) * log(1-p_i)]
    where p_i = sigmoid(w^T x_i + b)
    """
    w, b = params[:-1], params[-1]
    z = X @ w + b
    p = sigmoid(z)

    # Clip probabilities for numerical stability
    eps = 1e-15
    p = np.clip(p, eps, 1 - eps)

    # Bernoulli negative log-likelihood
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll

def fit_logistic_regression(X, y):
    """
    Fit logistic regression by maximizing the Bernoulli likelihood.
    """
    n_features = X.shape[1]
    initial_params = np.zeros(n_features + 1)  # weights + bias

    result = minimize(bernoulli_nll, initial_params, args=(X, y), method='BFGS')

    w_hat = result.x[:-1]
    b_hat = result.x[-1]
    return w_hat, b_hat

# Example: Simple binary classification
np.random.seed(42)
n_samples = 200
X = np.random.randn(n_samples, 2)
true_w = np.array([1.5, -1.0])
true_b = 0.5
true_p = sigmoid(X @ true_w + true_b)
y = np.random.binomial(1, true_p)  # Bernoulli samples!

w_est, b_est = fit_logistic_regression(X, y)
print(f"True weights: {true_w}, bias: {true_b}")
print(f"Estimated weights: {w_est.round(3)}, bias: {b_est:.3f}")
```

While the Bernoulli distribution is computationally trivial—just a coin flip—the binomial involves factorials that can overflow for large n. Understanding computational approaches is essential for practical implementation.

### Computing Binomial Coefficients

Naive calculation of C(n,k) = n!/(k!(n-k)!) breaks down for large n: the individual factorials exceed floating-point range (1000! ≈ 4×10^2567), and even exact arbitrary-precision integers become unwieldy. Better approaches:

1. Iterative calculation (avoids forming any factorial):
$$\binom{n}{k} = \prod_{i=0}^{k-1} \frac{n-i}{i+1}$$

2. Log-space computation:
$$\log \binom{n}{k} = \sum_{i=0}^{k-1} [\log(n-i) - \log(i+1)]$$

or using the log-gamma function:
$$\log \binom{n}{k} = \log \Gamma(n+1) - \log \Gamma(k+1) - \log \Gamma(n-k+1)$$
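Before the full log-space implementation below, here is a minimal sketch of approach 1, the iterative product (the helper name `binomial_coeff_iterative` is illustrative); it never forms a full factorial and stays in exact integer arithmetic:

```python
def binomial_coeff_iterative(n: int, k: int) -> int:
    """Compute C(n, k) as a running product: prod_{i=0}^{k-1} (n-i)/(i+1)."""
    k = min(k, n - k)  # exploit symmetry: C(n, k) = C(n, n-k)
    result = 1
    for i in range(k):
        # Multiply before dividing; the intermediate value is always an
        # exact multiple of (i + 1), so integer division is lossless.
        result = result * (n - i) // (i + 1)
    return result

print(binomial_coeff_iterative(10, 3))  # 120
print(binomial_coeff_iterative(52, 5))  # 2598960
```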
```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import binom

def log_binomial_coeff(n: int, k: int) -> float:
    """
    Compute log(C(n,k)) in a numerically stable way using log-gamma.

    log(C(n,k)) = log(n!) - log(k!) - log((n-k)!)
                = lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1)
    """
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def log_binomial_pmf(k: int, n: int, p: float) -> float:
    """
    Compute log P(X = k) for X ~ Binomial(n, p).

    log P(X=k) = log(C(n,k)) + k*log(p) + (n-k)*log(1-p)
    """
    if p == 0:
        return 0.0 if k == 0 else float('-inf')
    if p == 1:
        return 0.0 if k == n else float('-inf')

    log_coeff = log_binomial_coeff(n, k)
    return log_coeff + k * np.log(p) + (n - k) * np.log(1 - p)

def binomial_pmf_stable(k: int, n: int, p: float) -> float:
    """
    Compute P(X = k) from the log-PMF for numerical stability.
    """
    return np.exp(log_binomial_pmf(k, n, p))

# Example: Large n computation
n = 1000
k = 500
p = 0.5

# The naive factorial route is impractical here (1000! has 2,568 digits):
# naive = math.factorial(n) / (math.factorial(k) * math.factorial(n - k))

# But log-space works fine:
log_prob = log_binomial_pmf(k, n, p)
prob = np.exp(log_prob)

print(f"P(X = 500 | n=1000, p=0.5) = {prob:.6e}")
print(f"Log probability: {log_prob:.4f}")

# Verify against scipy
scipy_prob = binom.pmf(k, n, p)
print(f"Scipy verification: {scipy_prob:.6e}")

# For very large n, use log-sum-exp for cumulative probabilities
def log_binomial_cdf(k: int, n: int, p: float) -> float:
    """
    Compute log P(X <= k) using the log-sum-exp trick for stability.
    """
    log_probs = [log_binomial_pmf(i, n, p) for i in range(k + 1)]
    max_log = max(log_probs)
    return max_log + np.log(sum(np.exp(lp - max_log) for lp in log_probs))
```

Practical guidelines for numerical stability:

1. Work in log-space (log-PMF, log-likelihood) whenever n is large
2. Use scipy.special.logsumexp for summing probabilities
3. Use specialized functions like scipy.stats.binom which handle edge cases
4. For ML training, use numerically stable loss implementations (e.g., torch.nn.BCEWithLogitsLoss)

The Bernoulli and Binomial distributions form the foundation for modeling binary outcomes in machine learning. To consolidate the essential concepts:

- Bernoulli(p) models a single binary trial: P(X=1) = p, with mean p and variance p(1-p).
- Binomial(n, p) counts successes over n independent Bernoulli(p) trials, with mean np and variance np(1-p).
- The MLE for p is the sample proportion k/n; binary cross-entropy is exactly the Bernoulli negative log-likelihood, so training logistic regression is MLE for Bernoulli parameters.
- Poisson and Normal distributions arise as limits of the binomial, and log-space computation keeps large-n calculations numerically stable.
What's Next:

The Bernoulli and Binomial handle discrete binary outcomes. In the next page, we explore the Gaussian (Normal) distribution—the cornerstone of continuous probability modeling. We'll see how it arises as the limit of binomials (Central Limit Theorem), why it's ubiquitous in nature and ML, and how it underpins regression, neural networks, and much more.
You now have a deep understanding of Bernoulli and Binomial distributions—from their mathematical foundations through their pervasive applications in machine learning. These are the simplest probability distributions, yet they underpin binary classification, hypothesis testing, and probabilistic modeling throughout the field.