The distributions we've studied so far—Bernoulli, Binomial, Gaussian, Poisson—might seem like independent mathematical objects. But beneath the surface lies a remarkable unification: these distributions, along with many others, are all members of a single family called the Exponential Family.

The exponential family isn't just a taxonomic classification; it's a mathematical framework with deep theoretical and practical implications:

- Sufficient statistics exist and are finite-dimensional
- Maximum likelihood estimators have elegant closed-form solutions
- Conjugate priors exist for Bayesian inference
- Generalized Linear Models (GLMs) naturally arise from this framework
- Gradient-based optimization is well-behaved

Understanding exponential families transforms your view of probability from a collection of separate distributions into a coherent mathematical structure. It explains why certain distributions are "nice" to work with and provides a template for constructing new distributions with desirable properties.
By the end of this page, you will understand the exponential family: its canonical form, natural parameters, sufficient statistics, how common distributions fit this framework, the connection to GLMs, and why this unifying perspective is powerful for machine learning.
A probability distribution belongs to the exponential family if its probability density (or mass) function can be written in the following canonical form:

$$p(x | \boldsymbol{\eta}) = h(x) \exp\left(\boldsymbol{\eta}^T \mathbf{T}(x) - A(\boldsymbol{\eta})\right)$$

or equivalently:

$$p(x | \boldsymbol{\eta}) = \frac{h(x) \exp\left(\boldsymbol{\eta}^T \mathbf{T}(x)\right)}{\exp(A(\boldsymbol{\eta}))}$$

where:

- x is the observation (a scalar, a vector, or something more complex)
- η (eta) is the natural parameter (or canonical parameter) vector
- T(x) is the sufficient statistic vector
- h(x) is the base measure or carrier measure
- A(η) is the log-partition function (also called the log-normalizer or cumulant function)
- η (natural parameter): the 'true' parameter in this representation; determines the distribution's shape.
- T(x) (sufficient statistic): all the information about η contained in x.
- h(x) (base measure): the 'reference' distribution; doesn't depend on η.
- A(η) (log-partition): ensures the distribution integrates/sums to 1; contains all moment information.
| Component | Symbol | Role | Property |
|---|---|---|---|
| Natural parameter | η | Determines distribution shape | Often different from 'intuitive' parameters |
| Sufficient statistic | T(x) | Data summary for inference | ∇A(η) = E[T(x)] |
| Base measure | h(x) | Reference/carrier distribution | Independent of η |
| Log-partition | A(η) | Normalizing constant (log) | Always convex; encodes moments |
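The moment property in the last row of the table follows from a one-line calculation. Since $A(\boldsymbol{\eta}) = \log \int h(x) \exp(\boldsymbol{\eta}^T \mathbf{T}(x))\, dx$ (with a sum in place of the integral for discrete distributions), differentiating under the integral sign gives

$$\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \frac{\int \mathbf{T}(x)\, h(x) \exp(\boldsymbol{\eta}^T \mathbf{T}(x))\, dx}{\int h(x) \exp(\boldsymbol{\eta}^T \mathbf{T}(x))\, dx} = E_{\boldsymbol{\eta}}[\mathbf{T}(X)]$$

Differentiating once more yields $\nabla^2 A(\boldsymbol{\eta}) = \mathrm{Cov}_{\boldsymbol{\eta}}[\mathbf{T}(X)]$, a positive semi-definite matrix, which is exactly why $A(\boldsymbol{\eta})$ is always convex.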
Let's express the distributions we've studied in exponential family form. This exercise reveals their underlying structure and connections.

### Bernoulli Distribution

Standard form: P(x | p) = p^x (1-p)^{1-x} for x ∈ {0, 1}

To rewrite it, exponentiate the log of the pmf and collect the terms that involve x:

$$P(x | p) = \exp\big(x \log p + (1-x)\log(1-p)\big) = \exp\left(x \log\frac{p}{1-p} + \log(1-p)\right)$$

Reading off the pieces of the canonical form:

- Natural parameter: η = log(p/(1-p)) = logit(p)
- Sufficient statistic: T(x) = x
- Base measure: h(x) = 1
- Log-partition: A(η) = log(1 + e^η) = -log(1-p)

Inverse mapping: p = σ(η) = 1/(1 + e^{-η}) (the sigmoid function!)

This explains why logistic regression uses the sigmoid—it's the inverse of the natural parameterization!
```python
import numpy as np
from scipy.special import gammaln


class ExponentialFamily:
    """
    Base class for exponential family distributions.

    Provides common functionality: log-pdf, sufficient statistics,
    MLE from sufficient statistics, etc.
    """

    def log_h(self, x):
        """Log of base measure h(x)."""
        raise NotImplementedError

    def T(self, x):
        """Sufficient statistic T(x)."""
        raise NotImplementedError

    def A(self, eta):
        """Log-partition function A(η)."""
        raise NotImplementedError

    def log_prob(self, x, eta):
        """Compute log p(x | η) = log h(x) + η'T(x) - A(η)."""
        return self.log_h(x) + np.dot(eta, self.T(x)) - self.A(eta)

    def expected_T(self, eta):
        """E[T(x)] = ∇A(η); implemented per distribution."""
        raise NotImplementedError


class BernoulliExpFam(ExponentialFamily):
    """Bernoulli in exponential family form."""

    def log_h(self, x):
        return 0.0  # h(x) = 1

    def T(self, x):
        return np.array([x])  # Sufficient stat is just x

    def A(self, eta):
        # A(η) = log(1 + exp(η)), stable for moderate η
        return np.log1p(np.exp(eta[0]))

    def eta_from_p(self, p):
        """Convert probability p to natural parameter η = logit(p)."""
        return np.array([np.log(p / (1 - p))])

    def p_from_eta(self, eta):
        """Convert natural parameter to probability p = sigmoid(η)."""
        return 1 / (1 + np.exp(-eta[0]))

    def expected_T(self, eta):
        """E[T(x)] = E[x] = p = sigmoid(η)."""
        return np.array([self.p_from_eta(eta)])


class PoissonExpFam(ExponentialFamily):
    """Poisson in exponential family form."""

    def log_h(self, x):
        # h(x) = 1/x!, so log h(x) = -log(x!)
        return -gammaln(x + 1)

    def T(self, x):
        return np.array([x])

    def A(self, eta):
        # A(η) = exp(η) = λ
        return np.exp(eta[0])

    def eta_from_lambda(self, lam):
        """Convert λ to natural parameter η = log(λ)."""
        return np.array([np.log(lam)])

    def lambda_from_eta(self, eta):
        """Convert natural parameter to λ = exp(η)."""
        return np.exp(eta[0])

    def expected_T(self, eta):
        """E[T(x)] = E[x] = λ = exp(η)."""
        return np.array([self.lambda_from_eta(eta)])


# Demonstration
print("Exponential Family Representations")
print("=" * 60)

# Bernoulli example
bern = BernoulliExpFam()
p = 0.7
eta_bern = bern.eta_from_p(p)
print(f"\nBernoulli(p={p}):")
print(f"  Natural parameter η = logit({p}) = {eta_bern[0]:.4f}")
print(f"  A(η) = log(1+e^η) = {bern.A(eta_bern):.4f}")
print(f"  E[T(x)] = ∇A(η) = p = {bern.expected_T(eta_bern)[0]:.4f}")

# Verify log-probabilities
print(f"  P(x=1): exp(η - A(η)) = {np.exp(bern.log_prob(1, eta_bern)):.4f}")
print(f"  P(x=0): exp(0 - A(η)) = {np.exp(bern.log_prob(0, eta_bern)):.4f}")

# Poisson example
pois = PoissonExpFam()
lam = 5.0
eta_pois = pois.eta_from_lambda(lam)
print(f"\nPoisson(λ={lam}):")
print(f"  Natural parameter η = log({lam}) = {eta_pois[0]:.4f}")
print(f"  A(η) = exp(η) = λ = {pois.A(eta_pois):.4f}")
print(f"  E[T(x)] = ∇A(η) = λ = {pois.expected_T(eta_pois)[0]:.4f}")
```

| Distribution | η (Natural Param) | T(x) (Suff. Stat) | A(η) | Mean-Param Link |
|---|---|---|---|---|
| Bernoulli(p) | log(p/(1-p)) | x | log(1+e^η) | p = sigmoid(η) |
| Poisson(λ) | log(λ) | x | e^η | λ = e^η |
| Gaussian(μ, σ² known) | μ/σ² | x | η²σ²/2 | μ = ησ² |
| Exponential(λ) | -λ | x | -log(-η) | λ = -η |
| Gamma(α, β) | (α-1, -β) | (log x, x) | log Γ(α) - α log β | See formulas |
| Beta(α, β) | (α-1, β-1) | (log x, log(1-x)) | log B(α,β) | See formulas |
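As a quick numerical sanity check on the table, here is a minimal sketch (the parameter values are arbitrary) that verifies the Exponential(λ) row by comparing the exponential-family log-density η·T(x) − A(η) against scipy's standard parameterization:

```python
import numpy as np
from scipy import stats

# Exponential(λ) row of the table: η = -λ, T(x) = x, h(x) = 1, A(η) = -log(-η)
lam = 2.5
eta = -lam
x = np.linspace(0.1, 4.0, 5)

log_p_expfam = eta * x - (-np.log(-eta))             # η·T(x) - A(η), with log h(x) = 0
log_p_scipy = stats.expon(scale=1 / lam).logpdf(x)   # log(λ e^{-λx})

print(np.allclose(log_p_expfam, log_p_scipy))  # True: the two forms agree
```

The same pattern (write down η, T, h, and A, then compare against a reference implementation) is a useful way to check any row of the table.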
A sufficient statistic T(x) is a function of the data that captures all information about the parameter η. Formally, the conditional distribution of X given T(X) does not depend on η.

### The Fisher-Neyman Factorization

A statistic T(X) is sufficient for η if and only if the density can be factored as:

$$p(x | \eta) = g(T(x), \eta) \cdot h(x)$$

where g depends on x only through T(x), and h doesn't depend on η.

For exponential families, this factorization is immediate from the canonical form!

### Why Sufficient Statistics Matter

1. Data reduction: You can summarize arbitrarily large datasets by computing T(X), losing no information about η.

Example: For n i.i.d. Gaussian observations, the sufficient statistics are Σxᵢ and Σxᵢ². No matter how large n is, these two numbers contain all information about μ and σ².

2. MLE has closed form: For exponential families, MLE just involves setting the observed sufficient statistics equal to their expected values:

$$E_{\hat{\eta}}[\mathbf{T}(X)] = \frac{1}{n}\sum_{i=1}^n \mathbf{T}(x_i)$$
Exponential family sufficient statistics are typically minimal—there's no further reduction possible without losing information. For a k-parameter exponential family, the minimal sufficient statistic is k-dimensional, providing maximum compression of arbitrarily large datasets.
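A short sketch of this data-reduction idea in practice: for a stream of Gaussian observations, accumulating the running sums (Σxᵢ, Σxᵢ², n) is all that is ever needed; the data below is simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Accumulate sufficient statistics over 100 incoming batches without storing raw data
sum_x, sum_x2, n = 0.0, 0.0, 0
for _ in range(100):
    batch = rng.normal(5.0, 2.0, size=500)
    sum_x += batch.sum()
    sum_x2 += (batch ** 2).sum()
    n += batch.size

# The same MLE we would get from the full 50,000 raw observations
mu_hat = sum_x / n
sigma2_hat = sum_x2 / n - mu_hat ** 2
print(f"n = {n}, μ̂ = {mu_hat:.4f}, σ̂² = {sigma2_hat:.4f}")
```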
MLE for exponential families has a beautiful structure arising from the properties of the log-partition function.

### The Log-Likelihood

For n i.i.d. observations, the log-likelihood is:

$$\ell(\eta) = \sum_{i=1}^n \log h(x_i) + \eta^T \sum_{i=1}^n T(x_i) - nA(\eta)$$

The gradient with respect to η is:

$$\nabla_\eta \ell(\eta) = \sum_{i=1}^n T(x_i) - n \nabla A(\eta) = \sum_{i=1}^n T(x_i) - n E_\eta[T(X)]$$

Setting the gradient to zero:

$$E_{\hat{\eta}}[T(X)] = \frac{1}{n}\sum_{i=1}^n T(x_i) = \bar{T}$$

The MLE is found by matching expected sufficient statistics to observed sufficient statistics!
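When the moment-matching equation has no closed-form solution, it becomes a (usually low-dimensional) root-finding problem. Here is a minimal sketch for the Gamma distribution, whose shape parameter requires a numerical solve; the simulated data and the bracketing interval are illustrative assumptions:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

# Gamma(α, β) has T(x) = (log x, x), so the MLE must satisfy
#   ψ(α̂) - log β̂ = mean(log x)   and   α̂ / β̂ = mean(x)
rng = np.random.default_rng(0)
data = rng.gamma(shape=3.0, scale=1 / 2.0, size=5000)  # true α = 3, rate β = 2

mean_x = data.mean()
mean_log_x = np.log(data).mean()

# Eliminate β̂ = α̂ / mean(x), then solve a 1-D equation for α̂
f = lambda a: digamma(a) - np.log(a) - (mean_log_x - np.log(mean_x))
alpha_hat = brentq(f, 1e-3, 1e3)
beta_hat = alpha_hat / mean_x

print(f"α̂ = {alpha_hat:.3f}, β̂ = {beta_hat:.3f}  (true values: 3.0, 2.0)")
```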
```python
import numpy as np


def gaussian_exponential_family_mle(data: np.ndarray) -> dict:
    """
    MLE for Gaussian using the exponential family perspective.

    For Gaussian(μ, σ²), the sufficient statistics are (Σxᵢ, Σxᵢ²).
    Natural parameters are η₁ = μ/σ², η₂ = -1/(2σ²).

    MLE: Match E[T(X)] to observed T̄.
    """
    n = len(data)

    # Compute sufficient statistics
    sum_x = np.sum(data)
    sum_x2 = np.sum(data ** 2)

    # Sample averages of sufficient statistics
    T_bar = np.array([sum_x / n, sum_x2 / n])

    # For Gaussian, E[T(X)] = (μ, μ² + σ²)
    # Solving: μ = T̄₁, σ² = T̄₂ - T̄₁²
    mu_mle = T_bar[0]
    sigma2_mle = T_bar[1] - T_bar[0] ** 2

    # Convert to natural parameters
    eta1 = mu_mle / sigma2_mle
    eta2 = -1 / (2 * sigma2_mle)

    return {
        'sufficient_stats': {'sum_x': sum_x, 'sum_x2': sum_x2},
        'T_bar': T_bar,
        'mu_mle': mu_mle,
        'sigma2_mle': sigma2_mle,
        'natural_params': (eta1, eta2)
    }


def exponential_family_fisher_information(A_hessian_func, eta):
    """
    Fisher information for an exponential family = ∇²A(η).
    This equals the covariance of the sufficient statistics.
    """
    return A_hessian_func(eta)


# Example: Gaussian MLE via sufficient statistics
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
n = 1000
data = np.random.normal(true_mu, true_sigma, n)

results = gaussian_exponential_family_mle(data)

print("Gaussian MLE via Exponential Family")
print("=" * 50)
print(f"True parameters: μ = {true_mu}, σ² = {true_sigma**2}")
print("\nSufficient statistics:")
print(f"  Σxᵢ = {results['sufficient_stats']['sum_x']:.4f}")
print(f"  Σxᵢ² = {results['sufficient_stats']['sum_x2']:.4f}")
print("\nMLE (from moment matching):")
print(f"  μ̂ = {results['mu_mle']:.4f}")
print(f"  σ̂² = {results['sigma2_mle']:.4f}")
print("\nNatural parameters:")
print(f"  η₁ = μ/σ² = {results['natural_params'][0]:.4f}")
print(f"  η₂ = -1/(2σ²) = {results['natural_params'][1]:.4f}")

# Demonstrate moment matching for Poisson
print("\n" + "=" * 50)
print("Poisson MLE via Exponential Family")
print("=" * 50)

true_lambda = 7.5
poisson_data = np.random.poisson(true_lambda, n)

# Sufficient statistic is Σxᵢ
# E[T(X)] = λ, so the MLE is T̄ = sample mean
sum_x = np.sum(poisson_data)
T_bar = sum_x / n  # = sample mean = λ̂

print(f"True λ = {true_lambda}")
print(f"Sufficient statistic Σxᵢ = {sum_x}")
print(f"T̄ = Σxᵢ/n = {T_bar:.4f}")
print(f"λ̂ (MLE) = {T_bar:.4f}")
print(f"Natural parameter η = log(λ̂) = {np.log(T_bar):.4f}")
```

For exponential families, there exists a systematic way to construct conjugate priors—priors where the posterior has the same functional form as the prior, just with updated parameters.

### The General Form

For a likelihood from an exponential family:

$$p(x | \eta) = h(x) \exp(\eta^T T(x) - A(\eta))$$

The conjugate prior for η has the form:

$$p(\eta | \chi, \nu) \propto \exp(\eta^T \chi - \nu A(\eta))$$

where χ and ν are hyperparameters of the prior.

### Posterior Update

After observing x₁, ..., xₙ, the posterior is:

$$p(\eta | x_1, \ldots, x_n) \propto p(\eta | \chi, \nu) \prod_{i=1}^n p(x_i | \eta)$$

$$\propto \exp\left(\eta^T \left(\chi + \sum_{i=1}^n T(x_i)\right) - (\nu + n) A(\eta)\right)$$

This is the same form with updated hyperparameters:

$$\chi' = \chi + \sum_{i=1}^n T(x_i), \quad \nu' = \nu + n$$

The prior 'pretends' to have seen ν observations with sufficient statistics χ.
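The update rule itself is tiny. Here is a minimal generic sketch (the function name and array layout are illustrative choices) that tracks (χ, ν) for any exponential family likelihood; the table below lists the familiar named versions of this update.

```python
import numpy as np

def conjugate_update(chi, nu, sufficient_stats):
    """Generic conjugate update: χ' = χ + Σᵢ T(xᵢ), ν' = ν + n.

    `sufficient_stats` has shape (n, k): one sufficient statistic vector per observation.
    """
    sufficient_stats = np.atleast_2d(sufficient_stats)
    chi_post = np.asarray(chi, dtype=float) + sufficient_stats.sum(axis=0)
    nu_post = nu + sufficient_stats.shape[0]
    return chi_post, nu_post

# Bernoulli likelihood, T(x) = x: χ accumulates successes, ν counts observations
data = np.array([[1], [0], [1], [1], [0]])
chi_post, nu_post = conjugate_update(chi=[0.0], nu=0.0, sufficient_stats=data)
print(chi_post, nu_post)  # [3.] 5.0
```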
| Likelihood | Conjugate Prior | Prior Parameters | Posterior Parameters |
|---|---|---|---|
| Bernoulli(p) | Beta(α, β) | α, β (pseudo-counts) | α + Σxᵢ, β + n - Σxᵢ |
| Poisson(λ) | Gamma(α, β) | α (shape), β (rate) | α + Σxᵢ, β + n |
| Gaussian(μ, σ² known) | Gaussian(μ₀, σ₀²) | μ₀, σ₀² | Precision-weighted mean |
| Exponential(λ) | Gamma(α, β) | α, β | α + n, β + Σxᵢ |
| Multinomial(p) | Dirichlet(α) | α₁, ..., αₖ | αⱼ + count_j |
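As a companion to the table, here is a hedged sketch of the Poisson row (the class name and prior values are illustrative); the larger worked example that follows treats the Bernoulli row in more detail.

```python
import numpy as np
from scipy import stats

class GammaPoissonConjugate:
    """Gamma(α, β) prior on a Poisson rate λ; posterior is Gamma(α + Σxᵢ, β + n)."""

    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        self.alpha = alpha  # shape: behaves like a prior sum of counts
        self.beta = beta    # rate: behaves like a prior number of observations

    def update(self, counts: np.ndarray):
        self.alpha += np.sum(counts)
        self.beta += len(counts)
        return self

    def posterior_mean(self) -> float:
        return self.alpha / self.beta

    def credible_interval(self, level: float = 0.95) -> tuple:
        post = stats.gamma(a=self.alpha, scale=1 / self.beta)
        tail = (1 - level) / 2
        return post.ppf(tail), post.ppf(1 - tail)

# Inferring a rate from simulated counts
rng = np.random.default_rng(7)
counts = rng.poisson(4.2, size=60)
model = GammaPoissonConjugate(alpha=2.0, beta=1.0).update(counts)
lo, hi = model.credible_interval()
print(f"Posterior mean λ = {model.posterior_mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```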
```python
import numpy as np
from scipy import stats


class BetaBernoulliConjugate:
    """
    Bayesian inference for Bernoulli likelihood with Beta prior.

    Prior:      p ~ Beta(α, β)
    Likelihood: x | p ~ Bernoulli(p)
    Posterior:  p | data ~ Beta(α + successes, β + failures)
    """

    def __init__(self, alpha_prior: float = 1.0, beta_prior: float = 1.0):
        """
        Args:
            alpha_prior: Prior pseudo-count for successes
            beta_prior: Prior pseudo-count for failures
        """
        self.alpha = alpha_prior
        self.beta = beta_prior
        self.n_obs = 0
        self.n_successes = 0

    def update(self, observations: np.ndarray):
        """Update posterior with new observations (0s and 1s)."""
        n = len(observations)
        k = np.sum(observations)
        self.alpha += k
        self.beta += (n - k)
        self.n_obs += n
        self.n_successes += k
        return self

    def posterior(self) -> stats.rv_continuous:
        """Return the posterior distribution."""
        return stats.beta(self.alpha, self.beta)

    def posterior_mean(self) -> float:
        """E[p | data] = α / (α + β)."""
        return self.alpha / (self.alpha + self.beta)

    def posterior_mode(self) -> float:
        """Mode of posterior (for α, β > 1)."""
        if self.alpha > 1 and self.beta > 1:
            return (self.alpha - 1) / (self.alpha + self.beta - 2)
        return self.posterior_mean()  # Fallback

    def credible_interval(self, level: float = 0.95) -> tuple:
        """Compute a credible interval for p."""
        post = self.posterior()
        lower = (1 - level) / 2
        upper = 1 - lower
        return (post.ppf(lower), post.ppf(upper))

    def posterior_predictive(self) -> float:
        """P(next observation = 1 | data) = posterior mean."""
        return self.posterior_mean()


# Example: A/B test analysis
np.random.seed(42)
true_rate = 0.12  # True conversion rate
n_observations = 50

# Simulate data
conversions = np.random.binomial(1, true_rate, n_observations)

# Bayesian inference with weakly informative prior
model = BetaBernoulliConjugate(alpha_prior=1, beta_prior=1)  # Uniform prior
model.update(conversions)

print("Beta-Bernoulli Conjugate Inference")
print("=" * 50)
print(f"True conversion rate: {true_rate}")
print(f"Observed: {model.n_successes}/{model.n_obs} conversions")
print(f"MLE estimate: {model.n_successes / model.n_obs:.4f}")
print(f"\nPosterior: Beta({model.alpha:.0f}, {model.beta:.0f})")
print(f"Posterior mean: {model.posterior_mean():.4f}")
print(f"Posterior mode: {model.posterior_mode():.4f}")
print(f"95% credible interval: [{model.credible_interval()[0]:.4f}, "
      f"{model.credible_interval()[1]:.4f}]")
print(f"P(next conversion) = {model.posterior_predictive():.4f}")

# Sequential updating demonstration
print("\n" + "=" * 50)
print("Sequential Bayesian Updating")
print("=" * 50)

model2 = BetaBernoulliConjugate(alpha_prior=2, beta_prior=8)  # Prior: ~20% rate
print(f"Prior: Beta(2, 8) — effective rate {2/10:.1%}")

for batch in range(5):
    new_data = np.random.binomial(1, true_rate, 20)
    model2.update(new_data)
    ci = model2.credible_interval()
    print(f"After batch {batch+1} (n={model2.n_obs:3d}): "
          f"posterior mean = {model2.posterior_mean():.4f}, "
          f"95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Generalized Linear Models (GLMs) extend linear regression to response variables from any exponential family distribution. The exponential family framework provides the theoretical foundation.

### GLM Structure

A GLM has three components:

1. Random Component: The response Y follows an exponential family distribution:
$$Y | X \sim \text{ExpFam}(\eta)$$

2. Systematic Component: A linear predictor:
$$\eta_{\text{linear}} = \mathbf{w}^T \mathbf{x} + b$$
3. Link Function: Connects the linear predictor to the mean of the response:
$$\eta_{\text{linear}} = g(E[Y]) \quad \text{or equivalently} \quad E[Y] = g^{-1}(\eta_{\text{linear}})$$

### The Canonical Link

The canonical link function is the one where the linear predictor equals the natural parameter:
$$\eta_{\text{linear}} = \eta_{\text{natural}}$$

This gives the simplest, most numerically stable GLMs.
| Model Name | Response Distribution | Canonical Link | Use Case |
|---|---|---|---|
| Linear Regression | Gaussian(μ, σ²) | Identity: η = μ | Continuous outcomes |
| Logistic Regression | Bernoulli(p) | Logit: η = log(p/(1-p)) | Binary classification |
| Poisson Regression | Poisson(λ) | Log: η = log(λ) | Count data |
| Gamma Regression | Gamma(α, β) | Inverse: η = 1/μ | Positive continuous |
| Multinomial Logit | Multinomial(p) | Log odds | Multi-class classification |
Understanding exponential families reveals why logistic regression uses the sigmoid (it's the inverse canonical link for Bernoulli) and why Poisson regression uses exp() to transform linear predictions (it's the inverse canonical link for Poisson). These aren't arbitrary choices—they arise naturally from the mathematical structure!
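To make the connection concrete, here is a minimal sketch of Poisson regression fitted by gradient ascent; with the canonical log link, the gradient of the log-likelihood collapses to Xᵀ(y − μ). The dataset, learning rate, and iteration count are illustrative assumptions, not a production recipe.

```python
import numpy as np

def fit_poisson_glm(X, y, lr=0.05, n_iters=5000):
    """Poisson regression with the canonical log link, fitted by gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = np.exp(X @ w)              # inverse canonical link: E[y | x] = exp(xᵀw)
        grad = X.T @ (y - mu) / len(y)  # ∇ℓ(w) ∝ Σᵢ xᵢ (yᵢ - μᵢ)
        w += lr * grad
    return w

# Simulated count data with an intercept and one feature
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_w = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ true_w))

w_hat = fit_poisson_glm(X, y)
print(f"True weights:      {true_w}")
print(f"Estimated weights: {np.round(w_hat, 3)}")
```

The same template gives logistic regression by swapping exp for the sigmoid, which is exactly the inverse canonical link for the Bernoulli likelihood.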
The exponential family is a unifying framework that connects most of the distributions used in machine learning. The key insights to consolidate:

- A single canonical form, p(x | η) = h(x) exp(ηᵀT(x) - A(η)), covers Bernoulli, Poisson, Gaussian, Gamma, Beta, and many more.
- Sufficient statistics compress arbitrarily large datasets without losing information about η.
- MLE reduces to matching expected sufficient statistics to their observed averages.
- Conjugate priors follow a systematic recipe, making Bayesian updating a simple parameter update.
- GLMs and their canonical links (identity, logit, log) fall out of the natural parameterization.
What's Next:

We've explored the exponential family as a unifying framework. Next, we advance to the Multivariate Gaussian distribution—extending the univariate Gaussian to multiple dimensions. This distribution is fundamental to multivariate statistics, Gaussian processes, variational autoencoders, and countless other machine learning methods. We'll explore its geometry, conditioning, marginalization, and the central role of the covariance matrix.
You now understand the exponential family—a powerful unifying framework for probability distributions. This perspective reveals deep connections between distributions, explains why certain formulas arise in machine learning (sigmoid, softmax, log-link), and provides tools for principled model building through GLMs and conjugate Bayesian inference.