Every classification problem in machine learning—spam detection, disease diagnosis, fraud identification, sentiment analysis—fundamentally reduces to predicting binary outcomes. Will the email be spam or not? Is the tumor malignant or benign? Is this transaction fraudulent or legitimate?

To model these binary outcomes probabilistically, we need a mathematical framework that captures the inherent randomness of such yes/no decisions. This is precisely what the Bernoulli and Binomial distributions provide. They are the simplest yet most fundamental probability distributions, forming the conceptual bedrock upon which more sophisticated models—logistic regression, naive Bayes classifiers, and neural network classification heads—are built.

Understanding these distributions is not merely an academic exercise; it is essential for grasping how machine learning models quantify uncertainty, how we evaluate classifier performance, and how we make principled decisions under probabilistic predictions.
By the end of this page, you will master the Bernoulli and Binomial distributions: their mathematical definitions, key properties, moment calculations, connections to other distributions, and their pervasive applications in machine learning. You will understand not just the formulas, but the deep intuition behind why these distributions arise naturally in binary outcome modeling.
The Bernoulli distribution is the simplest discrete probability distribution, modeling a single trial with exactly two possible outcomes: success (typically encoded as 1) and failure (encoded as 0). Named after Swiss mathematician Jacob Bernoulli (1654–1705), who made foundational contributions to probability theory, this distribution captures the essence of random binary events.

### Formal Definition

A random variable X follows a Bernoulli distribution with parameter p, written X ~ Bernoulli(p), if:

Probability Mass Function (PMF):

$$P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}$$

Equivalently, we can write this as:

$$P(X = 1) = p \quad \text{and} \quad P(X = 0) = 1 - p = q$$

where:
- p ∈ [0, 1] is the probability of success
- q = 1 - p is the probability of failure
- X takes only two values: 0 or 1

The parameter p completely characterizes the distribution. Once you know p, you know everything about the randomness of a single binary trial.
The formula p^x(1-p)^(1-x) elegantly unifies both cases in a single expression. When x=1, it becomes p¹(1-p)⁰ = p. When x=0, it becomes p⁰(1-p)¹ = 1-p. This compact notation becomes invaluable when writing likelihood functions for parameter estimation.
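A two-line check in Python (a trivial sketch; `bernoulli_pmf` is just an illustrative name) confirms that the unified form recovers both cases:

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """Unified Bernoulli PMF: p^x * (1-p)^(1-x) for x in {0, 1}."""
    return (p ** x) * ((1 - p) ** (1 - x))

p = 0.3
print(bernoulli_pmf(1, p))  # p     -> 0.3
print(bernoulli_pmf(0, p))  # 1 - p -> ≈ 0.7
```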
| Property | Formula | Interpretation |
|---|---|---|
| Support | {0, 1} | Only two possible outcomes |
| Parameter | p ∈ [0, 1] | Probability of success |
| Mean | μ = p | Average outcome equals success probability |
| Variance | σ² = p(1-p) | Maximum uncertainty at p = 0.5 |
| Skewness | (1-2p)/√(pq) | Symmetric only when p = 0.5 |
| Kurtosis | (1-6pq)/(pq) | Measures tail behavior |
| Entropy | -p·log(p) - (1-p)·log(1-p) | Information content |
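To connect these formulas to code, here is a brief numerical check (a sketch assuming NumPy and SciPy are available; the variable names are illustrative) that compares the table's closed-form expressions against `scipy.stats.bernoulli`:

```python
import numpy as np
from scipy.stats import bernoulli

p = 0.3
q = 1 - p

# Closed-form properties from the table above
mean_formula = p
var_formula = p * q
skew_formula = (1 - 2 * p) / np.sqrt(p * q)
entropy_formula = -p * np.log(p) - q * np.log(q)  # natural log (nats)

# The same quantities from scipy
mean_s, var_s, skew_s = bernoulli.stats(p, moments='mvs')
entropy_s = bernoulli.entropy(p)  # also in nats

print(mean_formula, float(mean_s))                            # both ≈ 0.3
print(var_formula, float(var_s))                              # both ≈ 0.21
print(round(skew_formula, 4), round(float(skew_s), 4))        # both ≈ 0.8729
print(round(entropy_formula, 4), round(float(entropy_s), 4))  # both ≈ 0.6109
```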
While the Bernoulli distribution models a single binary trial, the Binomial distribution extends this to model the total number of successes in n independent, identical Bernoulli trials. This is the natural progression: from one coin flip to many flips, from one classification to many classifications.

### Formal Definition

A random variable X follows a Binomial distribution with parameters n and p, written X ~ Binomial(n, p), if X represents the count of successes in n independent Bernoulli(p) trials.

Probability Mass Function (PMF):

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, 2, \ldots, n\}$$

where the binomial coefficient is:

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

This coefficient counts the number of ways to choose which k trials are successes among n trials.
The binomial PMF has three interpretable components: (1) p^k is the probability that exactly k specific trials are successes, (2) (1-p)^(n-k) is the probability that the remaining n-k trials are failures, and (3) C(n,k) counts all possible ways to arrange which trials are the successes.
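The three factors can be computed separately and multiplied back together; the short sketch below (assuming scipy is available for the cross-check) makes the decomposition concrete:

```python
from math import comb
from scipy.stats import binom

n, k, p = 10, 3, 0.4

arrangements = comb(n, k)           # C(n, k): ways to place the k successes
prob_successes = p ** k             # p^k: those k specific trials succeed
prob_failures = (1 - p) ** (n - k)  # (1-p)^(n-k): the remaining trials fail

pmf = arrangements * prob_successes * prob_failures
print(f"manual product : {pmf:.4f}")                  # ≈ 0.2150
print(f"scipy binom.pmf: {binom.pmf(k, n, p):.4f}")   # matches
```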
| Property | Formula | Notes |
|---|---|---|
| Support | {0, 1, 2, ..., n} | n+1 possible values |
| Parameters | n ∈ ℕ, p ∈ [0, 1] | Number of trials, success probability |
| Mean | μ = np | Linear in n |
| Variance | σ² = np(1-p) | Maximum at p = 0.5 for fixed n |
| Skewness | (1-2p)/√(npq) | Approaches 0 as n → ∞ |
| Kurtosis | (1-6pq)/(npq) | Approaches 0 as n → ∞ |
| MGF | ((1-p) + pe^t)^n | Product of n Bernoulli MGFs |
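The binomial moments can be checked the same way, against `scipy.stats.binom` and a quick Monte Carlo simulation (a sketch under the assumption that scipy is available; the 100,000-sample size is an arbitrary choice):

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.3
q = 1 - p

# Closed-form moments from the table above: mean, variance, skewness
print("formula  :", n * p, n * p * q, (1 - 2 * p) / np.sqrt(n * p * q))

# scipy's mean, variance, skewness
mean_s, var_s, skew_s = binom.stats(n, p, moments='mvs')
print("scipy    :", float(mean_s), float(var_s), float(skew_s))

# Monte Carlo estimate from simulated draws
rng = np.random.default_rng(0)
samples = rng.binomial(n, p, size=100_000)
print("simulated:", samples.mean(), samples.var())
```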
Understanding probability distributions through visualization deepens intuition beyond formulas. Let's examine how Bernoulli and Binomial distributions behave under different parameter settings.

### Bernoulli PMF Visualization

The Bernoulli PMF is strikingly simple: two bars at x = 0 and x = 1, with heights (1-p) and p respectively.

| p value | P(X=0) | P(X=1) | Visual Pattern |
|---------|--------|--------|----------------|
| 0.1 | 0.9 | 0.1 | Heavily favors 0 (failure) |
| 0.5 | 0.5 | 0.5 | Symmetric (fair coin) |
| 0.8 | 0.2 | 0.8 | Heavily favors 1 (success) |

### Binomial PMF: Effect of n and p

The binomial PMF becomes richer as n increases, displaying a bell-shaped curve for moderate p values.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
from math import comb

def plot_binomial_distributions():
    """
    Visualize Binomial distributions for various parameter combinations.
    Demonstrates how shape changes with n and p.
    """
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))

    # Different parameter combinations to illustrate behavior
    params = [
        (10, 0.2, "n=10, p=0.2 (right-skewed)"),
        (10, 0.5, "n=10, p=0.5 (symmetric)"),
        (10, 0.8, "n=10, p=0.8 (left-skewed)"),
        (50, 0.2, "n=50, p=0.2 (approaches normal)"),
        (50, 0.5, "n=50, p=0.5 (nearly normal)"),
        (100, 0.3, "n=100, p=0.3 (well-approximated by normal)")
    ]

    for ax, (n, p, title) in zip(axes.flatten(), params):
        k_values = np.arange(0, n + 1)
        probabilities = binom.pmf(k_values, n, p)

        ax.bar(k_values, probabilities, color='steelblue', alpha=0.7, edgecolor='navy')
        ax.set_xlabel('Number of Successes (k)')
        ax.set_ylabel('P(X = k)')
        ax.set_title(title)

        # Add mean line
        mean = n * p
        ax.axvline(mean, color='red', linestyle='--', linewidth=2, label=f'μ = {mean:.1f}')
        ax.legend()

        # Show ±1 standard deviation
        std = np.sqrt(n * p * (1 - p))
        ax.axvspan(mean - std, mean + std, alpha=0.2, color='red')

    plt.tight_layout()
    plt.savefig('binomial_distributions.png', dpi=150)
    plt.show()

# Manual PMF calculation (understanding the formula)
def binomial_pmf(k: int, n: int, p: float) -> float:
    """
    Calculate P(X = k) for X ~ Binomial(n, p).

    P(X = k) = C(n,k) * p^k * (1-p)^(n-k)
    """
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

# Example: Probability of exactly 3 heads in 5 coin flips (fair coin)
print(f"P(X=3 | n=5, p=0.5) = {binomial_pmf(3, 5, 0.5):.4f}")  # 0.3125
```

In machine learning, we rarely know the true parameter p; instead, we must estimate it from observed data. This section develops the theory of parameter estimation for Bernoulli and Binomial distributions using Maximum Likelihood Estimation (MLE).

### MLE for Bernoulli Parameter

Suppose we observe n independent samples x₁, x₂, ..., xₙ from a Bernoulli(p) distribution. The likelihood function is the probability of observing this specific data as a function of p:

$$L(p \mid x_1, \ldots, x_n) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i} = p^{\sum x_i} (1-p)^{n - \sum x_i}$$

Let k = Σxᵢ be the total number of successes. Then:

$$L(p) = p^k (1-p)^{n-k}$$

### The Log-Likelihood

Maximizing the likelihood is equivalent to maximizing its logarithm (since log is monotonic):

$$\ell(p) = \log L(p) = k \log p + (n-k) \log(1-p)$$

To find the maximum, we take the derivative and set it to zero:

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0$$

Solving for p:

$$\frac{k}{p} = \frac{n-k}{1-p}$$
$$k(1-p) = (n-k)p$$
$$k - kp = np - kp$$
$$k = np$$
$$\hat{p}_{\text{MLE}} = \frac{k}{n} = \frac{\sum_{i=1}^n x_i}{n}$$

The MLE estimate is simply the sample proportion—the number of successes divided by total trials.
The MLE for p equals the observed frequency of successes. If you flip a coin 100 times and see 58 heads, the MLE estimate is p̂ = 0.58. This matches our intuition and is mathematically optimal in the sense of maximizing the probability of the observed data.
```python
import numpy as np

def estimate_bernoulli_parameter(samples: np.ndarray) -> dict:
    """
    Estimate Bernoulli parameter p from binary samples using MLE.

    Args:
        samples: Array of 0s and 1s

    Returns:
        Dictionary with point estimate and confidence intervals
    """
    n = len(samples)
    k = np.sum(samples)  # Number of successes

    # MLE estimate
    p_hat = k / n

    # Standard error
    se = np.sqrt(p_hat * (1 - p_hat) / n)

    # 95% confidence interval (using normal approximation)
    z = 1.96  # z-score for 95% CI
    ci_lower = max(0, p_hat - z * se)
    ci_upper = min(1, p_hat + z * se)

    # Wilson score interval (better for extreme p or small n)
    z_sq = z ** 2
    wilson_center = (k + z_sq / 2) / (n + z_sq)
    wilson_margin = z * np.sqrt(k * (n - k) / n + z_sq / 4) / (n + z_sq)
    wilson_lower = wilson_center - wilson_margin
    wilson_upper = wilson_center + wilson_margin

    return {
        'n': n,
        'successes': k,
        'p_hat': p_hat,
        'standard_error': se,
        'ci_95_normal': (ci_lower, ci_upper),
        'ci_95_wilson': (wilson_lower, wilson_upper)
    }

# Example: Estimate click-through rate from data
np.random.seed(42)
true_p = 0.12  # True click-through rate
sample_size = 500
data = np.random.binomial(1, true_p, sample_size)

results = estimate_bernoulli_parameter(data)
print(f"True p: {true_p}")
print(f"MLE estimate: {results['p_hat']:.4f}")
print(f"Standard error: {results['standard_error']:.4f}")
print(f"95% CI (normal): [{results['ci_95_normal'][0]:.4f}, {results['ci_95_normal'][1]:.4f}]")
print(f"95% CI (Wilson): [{results['ci_95_wilson'][0]:.4f}, {results['ci_95_wilson'][1]:.4f}]")
```

For small sample sizes or extreme probabilities (p near 0 or 1), the normal approximation for confidence intervals can produce invalid intervals (outside [0, 1]). The Wilson score interval provides better coverage in these cases and is preferred in practice.
The Bernoulli and Binomial distributions are not isolated; they connect to a rich web of other probability distributions. Understanding these relationships deepens comprehension and reveals when different distributions are appropriate.

### From Bernoulli to Binomial

The fundamental relationship: if X₁, X₂, ..., Xₙ are independent Bernoulli(p) random variables, then:

$$X = \sum_{i=1}^n X_i \sim \text{Binomial}(n, p)$$

Conversely, a Binomial(1, p) is simply a Bernoulli(p).

### Binomial and Poisson (Poisson Limit Theorem)

When n is large, p is small, and λ = np is moderate, the binomial approximates the Poisson distribution:

$$\text{Binomial}(n, p) \approx \text{Poisson}(\lambda) \quad \text{where } \lambda = np$$

Precisely, as n → ∞ and p → 0 with np → λ:

$$\binom{n}{k} p^k (1-p)^{n-k} \rightarrow \frac{\lambda^k e^{-\lambda}}{k!}$$

Rule of thumb: Use the Poisson approximation when n ≥ 20 and p ≤ 0.05.

Example: If 1 in 1000 emails is spam (p = 0.001) and you receive 500 emails (n = 500), the number of spam emails approximates Poisson(λ = 0.5).
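The spam example above can be verified directly; this sketch (assuming scipy) compares Binomial(n=500, p=0.001) with its Poisson(λ=0.5) approximation over the first few counts:

```python
from scipy.stats import binom, poisson

n, p = 500, 0.001
lam = n * p  # 0.5

for k in range(4):
    b = binom.pmf(k, n, p)
    po = poisson.pmf(k, lam)
    print(f"k={k}: binomial={b:.6f}  poisson={po:.6f}")
# The two columns agree to roughly four decimal places,
# as the Poisson limit theorem predicts for small p and large n.
```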
| Source | Relationship | Target Distribution | Use Case |
|---|---|---|---|
| Bernoulli(p) | Sum of n iid copies | Binomial(n, p) | Counting successes in n trials |
| Binomial(n, p) | n→∞, p→0, np→λ | Poisson(λ) | Rare events in many trials |
| Binomial(n, p) | n→∞, np≥5, n(1-p)≥5 | Normal(np, np(1-p)) | Large sample approximation |
| Bernoulli(p) | Count trials until the r-th success | Negative Binomial(r, p) | Waiting time problems |
| Bernoulli(p) | Count trials until the first success | Geometric(p) | Waiting time to first success |
Use the Poisson approximation when events are rare (small p, large n). Use the Normal approximation when both np and n(1-p) are at least 5. When in doubt about n being 'large enough,' the exact binomial calculation is preferable with modern computing.
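As a quick illustration of the Normal rule of thumb, the sketch below (assuming scipy) compares the exact binomial CDF with its Normal approximation; the +0.5 continuity correction in the last line is a standard refinement not discussed above:

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 50, 0.3                  # np = 15 and n(1-p) = 35, both >= 5
mu = n * p
sigma = np.sqrt(n * p * (1 - p))

k = 18
exact = binom.cdf(k, n, p)
approx = norm.cdf(k, mu, sigma)            # plain Normal approximation
approx_cc = norm.cdf(k + 0.5, mu, sigma)   # with continuity correction

print(f"P(X <= {k}) exact               : {exact:.4f}")
print(f"Normal approximation            : {approx:.4f}")
print(f"Normal approx. + 0.5 correction : {approx_cc:.4f}")
```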
Bernoulli and Binomial distributions permeate machine learning, underpinning fundamental algorithms and evaluation methodologies. Let's explore their key applications in depth.

### Binary Classification and Logistic Regression

Logistic regression models the probability of class membership using a Bernoulli distribution:

$$P(Y = 1 | \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}$$

Given the predicted probability p = σ(w^T x + b), the target Y follows:

$$Y | \mathbf{x} \sim \text{Bernoulli}(\sigma(\mathbf{w}^T \mathbf{x} + b))$$

The negative log-likelihood loss (binary cross-entropy) is:

$$\mathcal{L} = -\sum_{i=1}^n \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]$$

This is precisely the negative log-likelihood of Bernoulli observations—training logistic regression is MLE for Bernoulli parameters!
```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    """Numerically stable sigmoid function."""
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def bernoulli_nll(params, X, y):
    """
    Negative log-likelihood for Bernoulli observations.
    This is the binary cross-entropy loss.

    L(w, b) = -sum[y_i * log(p_i) + (1-y_i) * log(1-p_i)]
    where p_i = sigmoid(w^T x_i + b)
    """
    w, b = params[:-1], params[-1]
    z = X @ w + b
    p = sigmoid(z)

    # Clip probabilities for numerical stability
    eps = 1e-15
    p = np.clip(p, eps, 1 - eps)

    # Bernoulli negative log-likelihood
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll

def fit_logistic_regression(X, y):
    """
    Fit logistic regression by maximizing the Bernoulli likelihood.
    """
    n_features = X.shape[1]
    initial_params = np.zeros(n_features + 1)  # weights + bias

    result = minimize(bernoulli_nll, initial_params, args=(X, y), method='BFGS')

    w_hat = result.x[:-1]
    b_hat = result.x[-1]
    return w_hat, b_hat

# Example: Simple binary classification
np.random.seed(42)
n_samples = 200
X = np.random.randn(n_samples, 2)
true_w = np.array([1.5, -1.0])
true_b = 0.5
true_p = sigmoid(X @ true_w + true_b)
y = np.random.binomial(1, true_p)  # Bernoulli samples!

w_est, b_est = fit_logistic_regression(X, y)
print(f"True weights: {true_w}, bias: {true_b}")
print(f"Estimated weights: {w_est.round(3)}, bias: {b_est:.3f}")
```

While the Bernoulli distribution is computationally trivial—just a coin flip—the binomial involves factorials that can overflow for large n. Understanding computational approaches is essential for practical implementation.

### Computing Binomial Coefficients

Naive calculation of C(n,k) = n!/(k!(n-k)!) breaks down for large n: the individual factorials exceed floating-point range (1000! ≈ 4×10^2567), and even exact arbitrary-precision integers become unwieldy. Better approaches:

1. Iterative calculation (avoids forming any factorial):
$$\binom{n}{k} = \prod_{i=0}^{k-1} \frac{n-i}{i+1}$$

2. Log-space computation:
$$\log \binom{n}{k} = \sum_{i=0}^{k-1} [\log(n-i) - \log(i+1)]$$

or using the log-gamma function:
$$\log \binom{n}{k} = \log \Gamma(n+1) - \log \Gamma(k+1) - \log \Gamma(n-k+1)$$
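Before the full log-space implementation below, here is a minimal sketch of approach 1, the iterative product (the helper name `binomial_coeff_iterative` is illustrative); it never forms a full factorial and stays in exact integer arithmetic:

```python
def binomial_coeff_iterative(n: int, k: int) -> int:
    """Compute C(n, k) as a running product: prod_{i=0}^{k-1} (n-i)/(i+1)."""
    k = min(k, n - k)  # exploit symmetry: C(n, k) = C(n, n-k)
    result = 1
    for i in range(k):
        # Multiply before dividing; the intermediate value is always an
        # exact multiple of (i + 1), so integer division is lossless.
        result = result * (n - i) // (i + 1)
    return result

print(binomial_coeff_iterative(10, 3))  # 120
print(binomial_coeff_iterative(52, 5))  # 2598960
```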
```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import binom

def log_binomial_coeff(n: int, k: int) -> float:
    """
    Compute log(C(n,k)) in a numerically stable way using log-gamma.

    log(C(n,k)) = log(n!) - log(k!) - log((n-k)!)
                = lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1)
    """
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def log_binomial_pmf(k: int, n: int, p: float) -> float:
    """
    Compute log P(X = k) for X ~ Binomial(n, p).

    log P(X=k) = log(C(n,k)) + k*log(p) + (n-k)*log(1-p)
    """
    if p == 0:
        return 0.0 if k == 0 else float('-inf')
    if p == 1:
        return 0.0 if k == n else float('-inf')

    log_coeff = log_binomial_coeff(n, k)
    return log_coeff + k * np.log(p) + (n - k) * np.log(1 - p)

def binomial_pmf_stable(k: int, n: int, p: float) -> float:
    """
    Compute P(X = k) from the log-PMF for numerical stability.
    """
    return np.exp(log_binomial_pmf(k, n, p))

# Example: Large n computation
n = 1000
k = 500
p = 0.5

# The naive factorial route is impractical here (1000! has 2,568 digits):
# naive = math.factorial(n) / (math.factorial(k) * math.factorial(n - k))

# But log-space works fine:
log_prob = log_binomial_pmf(k, n, p)
prob = np.exp(log_prob)

print(f"P(X = 500 | n=1000, p=0.5) = {prob:.6e}")
print(f"Log probability: {log_prob:.4f}")

# Verify against scipy
scipy_prob = binom.pmf(k, n, p)
print(f"Scipy verification: {scipy_prob:.6e}")

# For very large n, use log-sum-exp for cumulative probabilities
def log_binomial_cdf(k: int, n: int, p: float) -> float:
    """
    Compute log P(X <= k) using the log-sum-exp trick for stability.
    """
    log_probs = [log_binomial_pmf(i, n, p) for i in range(k + 1)]
    max_log = max(log_probs)
    return max_log + np.log(sum(np.exp(lp - max_log) for lp in log_probs))
```

Practical guidelines for numerical stability:

1. Work in log-space (log-PMF, log-likelihood) whenever n is large
2. Use scipy.special.logsumexp for summing probabilities
3. Use specialized functions like scipy.stats.binom which handle edge cases
4. For ML training, use numerically stable loss implementations (e.g., torch.nn.BCEWithLogitsLoss)

The Bernoulli and Binomial distributions form the foundation for modeling binary outcomes in machine learning. To consolidate the essential concepts:

- Bernoulli(p) models a single binary trial: P(X=1) = p, with mean p and variance p(1-p).
- Binomial(n, p) counts successes over n independent Bernoulli(p) trials, with mean np and variance np(1-p).
- The MLE for p is the sample proportion k/n; binary cross-entropy is exactly the Bernoulli negative log-likelihood, so training logistic regression is MLE for Bernoulli parameters.
- Poisson and Normal distributions arise as limits of the binomial, and log-space computation keeps large-n calculations numerically stable.
What's Next:

The Bernoulli and Binomial handle discrete binary outcomes. In the next page, we explore the Gaussian (Normal) distribution—the cornerstone of continuous probability modeling. We'll see how it arises as the limit of binomials (Central Limit Theorem), why it's ubiquitous in nature and ML, and how it underpins regression, neural networks, and much more.
You now have a deep understanding of Bernoulli and Binomial distributions—from their mathematical foundations through their pervasive applications in machine learning. These are the simplest probability distributions, yet they underpin binary classification, hypothesis testing, and probabilistic modeling throughout the field.