The distributions we've studied so far—Bernoulli, Binomial, Gaussian, Poisson—might seem like independent mathematical objects. But beneath the surface lies a remarkable unification: these distributions, along with many others, are all members of a single family called the Exponential Family.
The exponential family isn't just a taxonomic classification; it's a mathematical framework with deep theoretical and practical implications.
Understanding exponential families transforms your view of probability from a collection of separate distributions into a coherent mathematical structure. It explains why certain distributions are "nice" to work with and provides a template for constructing new distributions with desirable properties.
By the end of this page, you will understand the exponential family: its canonical form, natural parameters, sufficient statistics, how common distributions fit this framework, the connection to GLMs, and why this unifying perspective is powerful for machine learning.
A probability distribution belongs to the exponential family if its probability density (or mass) function can be written in the following canonical form:
$$p(x | \boldsymbol{\eta}) = h(x) \exp\left(\boldsymbol{\eta}^T \mathbf{T}(x) - A(\boldsymbol{\eta})\right)$$
or equivalently:
$$p(x | \boldsymbol{\eta}) = \frac{h(x) \exp\left(\boldsymbol{\eta}^T \mathbf{T}(x)\right)}{\exp(A(\boldsymbol{\eta}))}$$
where:
- **η (natural parameter):** The 'true' parameter in this representation; determines the distribution's shape.
- **T(x) (sufficient statistic):** All the information about η contained in x.
- **h(x) (base measure):** The 'reference' distribution; doesn't depend on η.
- **A(η) (log-partition):** Ensures the distribution integrates/sums to 1; contains all moment information.
The log-partition function A(η) is defined by the normalization requirement:
$$\int h(x) \exp(\boldsymbol{\eta}^T \mathbf{T}(x)) \, dx = \exp(A(\boldsymbol{\eta}))$$
Thus:
$$A(\boldsymbol{\eta}) = \log \int h(x) \exp(\boldsymbol{\eta}^T \mathbf{T}(x)) \, dx$$
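For a discrete distribution the integral becomes a sum over the support. A quick sketch checking this for the Bernoulli case (h(x) = 1, T(x) = x, support {0, 1}), where the sum collapses to the closed form A(η) = log(1 + e^η):

```python
import numpy as np

# Bernoulli: h(x) = 1, T(x) = x, support {0, 1}
eta = 0.8473  # an arbitrary natural parameter (≈ logit(0.7))

# A(η) = log Σ_x h(x) exp(η·T(x)) = log(e^0 + e^η)
A_by_sum = np.log(sum(np.exp(eta * x) for x in (0, 1)))
A_closed_form = np.log1p(np.exp(eta))
print(A_by_sum, A_closed_form)  # identical

# By construction, dividing by exp(A(η)) normalizes the distribution
probs = [np.exp(eta * x - A_by_sum) for x in (0, 1)]
print(sum(probs))  # 1.0
```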
Critical insight: The log-partition function is the "key" to the distribution—it encodes all the moments. Specifically:
$$\nabla A(\boldsymbol{\eta}) = E[\mathbf{T}(x)]$$ $$\nabla^2 A(\boldsymbol{\eta}) = \text{Cov}[\mathbf{T}(x)]$$
The first derivative gives the expected sufficient statistics, and the second derivative (Hessian) gives their covariance matrix.
A remarkable property: A(η) is always convex (and often strictly convex). This is because the Hessian ∇²A(η) = Cov[T(x)] is a covariance matrix, which is always positive semi-definite.
This convexity ensures that maximum likelihood estimation for exponential families has globally optimal solutions with no local minima.
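These identities are easy to check numerically. A minimal sketch for the Poisson case, where A(η) = e^η and both the mean and the variance of T(x) = x equal λ, using central finite differences:

```python
import numpy as np

def A(eta):
    """Poisson log-partition function: A(η) = exp(η)."""
    return np.exp(eta)

eta = np.log(5.0)  # natural parameter for λ = 5
h = 1e-4

# ∇A(η) ≈ E[T(x)] = λ  (central difference)
dA = (A(eta + h) - A(eta - h)) / (2 * h)

# ∇²A(η) ≈ Var[T(x)] = λ  (second central difference)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h ** 2

print(dA, d2A)  # both near 5, since the Poisson mean and variance are λ
```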
| Component | Symbol | Role | Property |
|---|---|---|---|
| Natural parameter | η | Determines distribution shape | Often different from 'intuitive' parameters |
| Sufficient statistic | T(x) | Data summary for inference | ∇A(η) = E[T(x)] |
| Base measure | h(x) | Reference/carrier distribution | Independent of η |
| Log-partition | A(η) | Normalizing constant (log) | Always convex; encodes moments |
Let's express the distributions we've studied in exponential family form. This exercise reveals their underlying structure and connections.
Standard form: P(x | p) = p^x (1-p)^{1-x} for x ∈ {0, 1}
Exponential family form:
$$P(x | p) = \exp\left(x \log\frac{p}{1-p} + \log(1-p)\right)$$
Inverse mapping: p = σ(η) = 1/(1 + e^{-η}) (sigmoid function!)
This explains why logistic regression uses the sigmoid—it's the inverse of the natural parameterization!
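A two-line numerical check of this inverse relationship (logit maps p to the natural parameter; sigmoid maps it back):

```python
import numpy as np

def logit(p):
    # Natural parameter for Bernoulli: η = log(p / (1 - p))
    return np.log(p / (1 - p))

def sigmoid(eta):
    # Inverse mapping: p = 1 / (1 + e^{-η})
    return 1 / (1 + np.exp(-eta))

# Round-trips recover the original value in both directions
for p in (0.1, 0.25, 0.5, 0.7, 0.99):
    assert abs(sigmoid(logit(p)) - p) < 1e-12
print("sigmoid is the inverse of the natural parameterization")
```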
Standard form: p(x | μ, σ²) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
With σ² known, the exponential family form is:
$$p(x | \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(\frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{x^2}{2\sigma^2}\right)$$
With both μ and σ² unknown, we have a 2-parameter exponential family with

$$\boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right), \quad \mathbf{T}(x) = (x, x^2)$$
Notice: T(x) now includes both x and x²—we need the second moment to estimate variance!
```python
import numpy as np
from scipy.special import gammaln


class ExponentialFamily:
    """
    Base class for exponential family distributions.

    Provides common functionality: log-pdf, sufficient statistics,
    MLE from sufficient statistics, etc.
    """

    def log_h(self, x):
        """Log of base measure h(x)."""
        raise NotImplementedError

    def T(self, x):
        """Sufficient statistic T(x)."""
        raise NotImplementedError

    def A(self, eta):
        """Log-partition function A(η)."""
        raise NotImplementedError

    def log_prob(self, x, eta):
        """Compute log p(x | η) = log h(x) + η'T(x) - A(η)."""
        return self.log_h(x) + np.dot(eta, self.T(x)) - self.A(eta)

    def expected_T(self, eta):
        """E[T(x)] = ∇A(η); implemented per subclass."""
        raise NotImplementedError


class BernoulliExpFam(ExponentialFamily):
    """Bernoulli in exponential family form."""

    def log_h(self, x):
        return 0.0  # h(x) = 1

    def T(self, x):
        return np.array([x])  # Sufficient stat is just x

    def A(self, eta):
        # A(η) = log(1 + exp(η))
        return np.log1p(np.exp(eta[0]))  # Numerically stable

    def eta_from_p(self, p):
        """Convert probability p to natural parameter η = logit(p)."""
        return np.array([np.log(p / (1 - p))])

    def p_from_eta(self, eta):
        """Convert natural parameter to probability p = sigmoid(η)."""
        return 1 / (1 + np.exp(-eta[0]))

    def expected_T(self, eta):
        """E[T(x)] = E[x] = p = sigmoid(η)."""
        return np.array([self.p_from_eta(eta)])


class PoissonExpFam(ExponentialFamily):
    """Poisson in exponential family form."""

    def log_h(self, x):
        # h(x) = 1/x!, so log h(x) = -log(x!)
        return -gammaln(x + 1)

    def T(self, x):
        return np.array([x])

    def A(self, eta):
        # A(η) = exp(η) = λ
        return np.exp(eta[0])

    def eta_from_lambda(self, lam):
        """Convert λ to natural parameter η = log(λ)."""
        return np.array([np.log(lam)])

    def lambda_from_eta(self, eta):
        """Convert natural parameter to λ = exp(η)."""
        return np.exp(eta[0])

    def expected_T(self, eta):
        """E[T(x)] = E[x] = λ = exp(η)."""
        return np.array([self.lambda_from_eta(eta)])


# Demonstration
print("Exponential Family Representations")
print("=" * 60)

# Bernoulli example
bern = BernoulliExpFam()
p = 0.7
eta_bern = bern.eta_from_p(p)
print(f"Bernoulli(p={p}):")
print(f"  Natural parameter η = logit({p}) = {eta_bern[0]:.4f}")
print(f"  A(η) = log(1+e^η) = {bern.A(eta_bern):.4f}")
print(f"  E[T(x)] = ∇A(η) = p = {bern.expected_T(eta_bern)[0]:.4f}")

# Verify log-probabilities
print(f"  P(x=1): exp(η - A(η)) = {np.exp(bern.log_prob(1, eta_bern)):.4f}")
print(f"  P(x=0): exp(0 - A(η)) = {np.exp(bern.log_prob(0, eta_bern)):.4f}")

# Poisson example
pois = PoissonExpFam()
lam = 5.0
eta_pois = pois.eta_from_lambda(lam)
print(f"Poisson(λ={lam}):")
print(f"  Natural parameter η = log({lam}) = {eta_pois[0]:.4f}")
print(f"  A(η) = exp(η) = λ = {pois.A(eta_pois):.4f}")
print(f"  E[T(x)] = ∇A(η) = λ = {pois.expected_T(eta_pois)[0]:.4f}")
```

| Distribution | η (Natural Param) | T(x) (Suff. Stat) | A(η) | Mean-Param Link |
|---|---|---|---|---|
| Bernoulli(p) | log(p/(1-p)) | x | log(1+e^η) | p = sigmoid(η) |
| Poisson(λ) | log(λ) | x | e^η | λ = e^η |
| Gaussian(μ, σ² known) | μ/σ² | x | η²σ²/2 | μ = ησ² |
| Exponential(λ) | -λ | x | -log(-η) | λ = -η |
| Gamma(α, β) | (α-1, -β) | (log x, x) | log Γ(α) - α log β | See formulas |
| Beta(α, β) | (α-1, β-1) | (log x, log(1-x)) | log B(α,β) | See formulas |
A sufficient statistic T(x) is a function of the data that captures all information about the parameter η. Formally, the conditional distribution of X given T(X) does not depend on η.
A statistic T(X) is sufficient for η if and only if the density can be factored as:
$$p(x | \eta) = g(T(x), \eta) \cdot h(x)$$
where g depends on x only through T(x), and h doesn't depend on η.
For exponential families, this factorization is immediate from the canonical form!
1. Data reduction: You can summarize arbitrarily large datasets by computing T(X), losing no information about η.
Example: For n i.i.d. Gaussian observations, the sufficient statistics are Σxᵢ and Σxᵢ². No matter how large n is, these two numbers contain all information about μ and σ².
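A small sketch of this data reduction: two batches of Gaussian data are each condensed to (n, Σxᵢ, Σxᵢ²), and merging just the summaries reproduces the pooled MLE exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
batch1 = rng.normal(5.0, 2.0, 400)
batch2 = rng.normal(5.0, 2.0, 600)

def summarize(x):
    """Condense a batch to its sufficient statistics (n, Σxᵢ, Σxᵢ²)."""
    return len(x), x.sum(), (x ** 2).sum()

n1, s1, q1 = summarize(batch1)
n2, s2, q2 = summarize(batch2)

# Merging batches = adding sufficient statistics
n, s, q = n1 + n2, s1 + s2, q1 + q2
mu_hat = s / n
var_hat = q / n - mu_hat ** 2

# Identical to the MLE computed from the pooled raw data
pooled = np.concatenate([batch1, batch2])
print(np.isclose(mu_hat, pooled.mean()), np.isclose(var_hat, pooled.var()))
```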
2. MLE has closed form: For exponential families, MLE just involves setting the observed sufficient statistics equal to their expected values:
$$E_{\hat{\eta}}[\mathbf{T}(X)] = \frac{1}{n}\sum_{i=1}^n \mathbf{T}(x_i)$$
For i.i.d. data x₁, ..., xₙ from an exponential family, the joint distribution is:
$$p(x_1, \ldots, x_n | \eta) = \prod_{i=1}^n h(x_i) \exp\left(\eta^T \sum_{i=1}^n T(x_i) - nA(\eta)\right)$$
This shows that Σᵢ T(xᵢ) is sufficient for η. The dimensionality of the sufficient statistic is fixed, regardless of sample size!
| Distribution | Sufficient Statistic for n samples |
|---|---|
| Bernoulli(p) | Σxᵢ (count of 1s) |
| Poisson(λ) | Σxᵢ (total count) |
| Gaussian(μ, σ²) | (Σxᵢ, Σxᵢ²) |
| Exponential(λ) | Σxᵢ |
| Gamma(α known, β) | Σxᵢ |
This fixed-dimensional sufficiency is unique to exponential families and is one of their key practical advantages.
Exponential family sufficient statistics are typically minimal—there's no further reduction possible without losing information. For a k-parameter exponential family, the minimal sufficient statistic is k-dimensional, providing maximum compression of arbitrarily large datasets.
MLE for exponential families has a beautiful structure arising from the properties of the log-partition function.
For n i.i.d. observations, the log-likelihood is:
$$\ell(\eta) = \sum_{i=1}^n \log h(x_i) + \eta^T \sum_{i=1}^n T(x_i) - nA(\eta)$$
The gradient with respect to η is:
$$\nabla_\eta \ell(\eta) = \sum_{i=1}^n T(x_i) - n\nabla A(\eta) = \sum_{i=1}^n T(x_i) - n E_\eta[T(X)]$$
Setting the gradient to zero:
$$E_{\hat{\eta}}[T(X)] = \frac{1}{n}\sum_{i=1}^n T(x_i) = \bar{T}$$
The MLE is found by matching expected sufficient statistics to observed sufficient statistics!
The Hessian of the log-likelihood is:
$$\nabla^2_\eta \ell(\eta) = -n\nabla^2 A(\eta) = -n \cdot \text{Cov}_\eta[T(X)]$$
Since covariance matrices are positive semi-definite, the Hessian is negative semi-definite, confirming that the log-likelihood is concave (convex optimization problem when minimizing negative log-likelihood).
Implications: MLE can be viewed as finding the distribution in the exponential family whose expected sufficient statistics match the sample averages. For many distributions this matching yields closed-form solutions; for the rest, it is a convex optimization problem with a unique global optimum.
```python
import numpy as np


def gaussian_exponential_family_mle(data: np.ndarray) -> dict:
    """
    MLE for Gaussian using exponential family perspective.

    For Gaussian(μ, σ²), the sufficient statistics are (Σxᵢ, Σxᵢ²).
    Natural parameters are η₁ = μ/σ², η₂ = -1/(2σ²).

    MLE: Match E[T(X)] to observed T̄.
    """
    n = len(data)

    # Compute sufficient statistics
    sum_x = np.sum(data)
    sum_x2 = np.sum(data ** 2)

    # Sample averages of sufficient statistics
    T_bar = np.array([sum_x / n, sum_x2 / n])

    # For Gaussian, E[T(X)] = (μ, μ² + σ²)
    # Solving: μ = T̄₁, σ² = T̄₂ - T̄₁²
    mu_mle = T_bar[0]
    sigma2_mle = T_bar[1] - T_bar[0] ** 2

    # Convert to natural parameters
    eta1 = mu_mle / sigma2_mle
    eta2 = -1 / (2 * sigma2_mle)

    return {
        'sufficient_stats': {'sum_x': sum_x, 'sum_x2': sum_x2},
        'T_bar': T_bar,
        'mu_mle': mu_mle,
        'sigma2_mle': sigma2_mle,
        'natural_params': (eta1, eta2),
    }


def exponential_family_fisher_information(A_hessian_func, eta):
    """
    Fisher information for exponential family = ∇²A(η).

    This equals the covariance of the sufficient statistics.
    """
    return A_hessian_func(eta)


# Example: Gaussian MLE via sufficient statistics
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
n = 1000
data = np.random.normal(true_mu, true_sigma, n)

results = gaussian_exponential_family_mle(data)

print("Gaussian MLE via Exponential Family")
print("=" * 50)
print(f"True parameters: μ = {true_mu}, σ² = {true_sigma**2}")
print("Sufficient statistics:")
print(f"  Σxᵢ  = {results['sufficient_stats']['sum_x']:.4f}")
print(f"  Σxᵢ² = {results['sufficient_stats']['sum_x2']:.4f}")
print("MLE (from moment matching):")
print(f"  μ̂  = {results['mu_mle']:.4f}")
print(f"  σ̂² = {results['sigma2_mle']:.4f}")
print("Natural parameters:")
print(f"  η₁ = μ/σ²     = {results['natural_params'][0]:.4f}")
print(f"  η₂ = -1/(2σ²) = {results['natural_params'][1]:.4f}")

# Demonstrate moment matching for Poisson
print("\n" + "=" * 50)
print("Poisson MLE via Exponential Family")
print("=" * 50)

true_lambda = 7.5
poisson_data = np.random.poisson(true_lambda, n)

# Sufficient statistic is Σxᵢ
# E[T(X)] = λ, so the MLE is T̄ = sample mean
sum_x = np.sum(poisson_data)
T_bar = sum_x / n  # = sample mean = λ̂

print(f"True λ = {true_lambda}")
print(f"Sufficient statistic Σxᵢ = {sum_x}")
print(f"T̄ = Σxᵢ/n = {T_bar:.4f}")
print(f"λ̂_MLE = {T_bar:.4f}")
print(f"Natural parameter η = log(λ̂) = {np.log(T_bar):.4f}")
```

For exponential families, there exists a systematic way to construct conjugate priors—priors where the posterior has the same functional form as the prior, just with updated parameters.
For a likelihood from an exponential family:
$$p(x | \eta) = h(x) \exp(\eta^T T(x) - A(\eta))$$
The conjugate prior for η has the form:
$$p(\eta | \chi, \nu) \propto \exp(\eta^T \chi - \nu A(\eta))$$
where χ and ν are hyperparameters of the prior.
After observing x₁, ..., xₙ, the posterior is:
$$p(\eta | x_1, \ldots, x_n) \propto p(\eta | \chi, \nu) \prod_{i=1}^n p(x_i | \eta)$$
$$\propto \exp\left(\eta^T \left(\chi + \sum_{i=1}^n T(x_i)\right) - (\nu + n) A(\eta)\right)$$
This is the same form with updated hyperparameters:
$$\chi' = \chi + \sum_{i=1}^n T(x_i), \quad \nu' = \nu + n$$
The prior 'pretends' to have seen ν observations with sufficient statistics χ.
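A numerical sanity check of this update rule for the Poisson–Gamma pair: since T(x) = x, the update reproduces the closed-form posterior Gamma(α + Σxᵢ, β + n), which we compare against a brute-force prior × likelihood computation on a grid of λ values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, beta = 2.0, 1.0        # Gamma(shape, rate) prior hyperparameters
data = rng.poisson(4.0, 30)   # observed Poisson counts
n, sum_x = len(data), data.sum()

# Exponential-family update: χ' = χ + Σ T(xᵢ), ν' = ν + n
# For Poisson (T(x) = x) this gives the Gamma(α + Σxᵢ, β + n) posterior
alpha_post, beta_post = alpha + sum_x, beta + n

# Brute-force check: prior × likelihood, normalized on a grid of λ
lam = np.linspace(1e-3, 12, 4000)
log_post = (stats.gamma.logpdf(lam, a=alpha, scale=1 / beta)
            + stats.poisson.logpmf(data[:, None], lam[None, :]).sum(axis=0))
log_post -= log_post.max()
numeric = np.exp(log_post)
numeric /= numeric.sum() * (lam[1] - lam[0])   # normalize (Riemann sum)

closed_form = stats.gamma.pdf(lam, a=alpha_post, scale=1 / beta_post)
print(np.max(np.abs(numeric - closed_form)))   # should be tiny
```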
| Likelihood | Conjugate Prior | Prior Parameters | Posterior Parameters |
|---|---|---|---|
| Bernoulli(p) | Beta(α, β) | α, β (pseudo-counts) | α + Σxᵢ, β + n - Σxᵢ |
| Poisson(λ) | Gamma(α, β) | α (shape), β (rate) | α + Σxᵢ, β + n |
| Gaussian(μ, σ² known) | Gaussian(μ₀, σ₀²) | μ₀, σ₀² | Precision-weighted mean |
| Exponential(λ) | Gamma(α, β) | α, β | α + n, β + Σxᵢ |
| Multinomial(p) | Dirichlet(α) | α₁, ..., αₖ | αⱼ + count_j |
Prior: p ~ Beta(α, β)
This means p has prior PDF: p(p | α, β) ∝ p^{α-1}(1-p)^{β-1}
After observing n Bernoulli trials with k successes:
Posterior: p | data ~ Beta(α + k, β + n - k)
Interpretation: the prior acts like α pseudo-successes and β pseudo-failures seen before the data; the posterior simply adds the real counts k and n - k to these pseudo-counts.
1. Computational convenience: Closed-form posteriors without MCMC
2. Interpretability: Prior parameters have intuitive meanings
3. Sequential updating: Easy to incorporate new data incrementally
4. Posterior predictive: Often has closed form
```python
import numpy as np
from scipy import stats


class BetaBernoulliConjugate:
    """
    Bayesian inference for Bernoulli likelihood with Beta prior.

    Prior:      p ~ Beta(α, β)
    Likelihood: x | p ~ Bernoulli(p)
    Posterior:  p | data ~ Beta(α + successes, β + failures)
    """

    def __init__(self, alpha_prior: float = 1.0, beta_prior: float = 1.0):
        """
        Args:
            alpha_prior: Prior pseudo-count for successes
            beta_prior: Prior pseudo-count for failures
        """
        self.alpha = alpha_prior
        self.beta = beta_prior
        self.n_obs = 0
        self.n_successes = 0

    def update(self, observations: np.ndarray):
        """Update posterior with new observations (0s and 1s)."""
        n = len(observations)
        k = np.sum(observations)
        self.alpha += k
        self.beta += (n - k)
        self.n_obs += n
        self.n_successes += k
        return self

    def posterior(self):
        """Return the posterior distribution (frozen scipy object)."""
        return stats.beta(self.alpha, self.beta)

    def posterior_mean(self) -> float:
        """E[p | data] = α / (α + β)."""
        return self.alpha / (self.alpha + self.beta)

    def posterior_mode(self) -> float:
        """Mode of posterior (for α, β > 1)."""
        if self.alpha > 1 and self.beta > 1:
            return (self.alpha - 1) / (self.alpha + self.beta - 2)
        return self.posterior_mean()  # Fallback

    def credible_interval(self, level: float = 0.95) -> tuple:
        """Compute a credible interval for p."""
        post = self.posterior()
        lower = (1 - level) / 2
        upper = 1 - lower
        return (post.ppf(lower), post.ppf(upper))

    def posterior_predictive(self) -> float:
        """P(next observation = 1 | data) = posterior mean."""
        return self.posterior_mean()


# Example: A/B test analysis
np.random.seed(42)
true_rate = 0.12  # True conversion rate
n_observations = 50

# Simulate data
conversions = np.random.binomial(1, true_rate, n_observations)

# Bayesian inference with weakly informative prior
model = BetaBernoulliConjugate(alpha_prior=1, beta_prior=1)  # Uniform prior
model.update(conversions)

print("Beta-Bernoulli Conjugate Inference")
print("=" * 50)
print(f"True conversion rate: {true_rate}")
print(f"Observed: {model.n_successes}/{model.n_obs} conversions")
print(f"MLE estimate: {model.n_successes / model.n_obs:.4f}")
print(f"Posterior: Beta({model.alpha:.0f}, {model.beta:.0f})")
print(f"Posterior mean: {model.posterior_mean():.4f}")
print(f"Posterior mode: {model.posterior_mode():.4f}")
print(f"95% credible interval: [{model.credible_interval()[0]:.4f}, "
      f"{model.credible_interval()[1]:.4f}]")
print(f"P(next conversion) = {model.posterior_predictive():.4f}")

# Sequential updating demonstration
print("\n" + "=" * 50)
print("Sequential Bayesian Updating")
print("=" * 50)

model2 = BetaBernoulliConjugate(alpha_prior=2, beta_prior=8)  # Prior: ~20% rate
print(f"Prior: Beta(2, 8) — effective rate {2/10:.1%}")

for batch in range(5):
    new_data = np.random.binomial(1, true_rate, 20)
    model2.update(new_data)
    ci = model2.credible_interval()
    print(f"After batch {batch+1} (n={model2.n_obs:3d}): "
          f"posterior mean = {model2.posterior_mean():.4f}, "
          f"95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Generalized Linear Models (GLMs) extend linear regression to response variables from any exponential family distribution. The exponential family framework provides the theoretical foundation.
A GLM has three components:
1. Random Component: The response Y follows an exponential family distribution: $$Y | X \sim \text{ExpFam}(\eta)$$
2. Systematic Component: A linear predictor: $$\eta_{\text{linear}} = \mathbf{w}^T \mathbf{x} + b$$
3. Link Function: Connects the linear predictor to the mean of the response: $$\eta_{\text{linear}} = g(E[Y]) \quad \text{or equivalently} \quad E[Y] = g^{-1}(\eta_{\text{linear}})$$
The canonical link function is the one where the linear predictor equals the natural parameter: $$\eta_{\text{linear}} = \eta_{\text{natural}}$$
This gives the simplest, most numerically stable GLMs.
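To make this concrete, here is a minimal sketch (not a production implementation) of Poisson regression with the canonical log link, fit by gradient descent on the convex negative log-likelihood; the data and parameter values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic count data: log E[y] = w·x + b with w = 1.5, b = 0.5
X = rng.uniform(-1, 1, size=(200, 1))
true_w, true_b = np.array([1.5]), 0.5
y = rng.poisson(np.exp(X @ true_w + true_b))

# Canonical link: the linear predictor η = w·x + b IS the natural
# parameter, so E[y|x] = exp(η) and the per-sample NLL is
# exp(ηᵢ) - yᵢηᵢ (dropping the log(yᵢ!) constant) — convex in (w, b).
w, b = np.zeros(1), 0.0
lr = 0.05
for _ in range(2000):
    eta = X @ w + b
    mu = np.exp(eta)                 # predicted mean
    grad_eta = (mu - y) / len(y)     # ∂(avg NLL)/∂ηᵢ
    w -= lr * (X.T @ grad_eta)
    b -= lr * grad_eta.sum()

print(w, b)  # should land near the true (1.5, 0.5), up to sampling noise
```

Note how the gradient has the moment-matching form (μ - y): exactly the "observed minus expected sufficient statistics" structure derived earlier.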
| Model Name | Response Distribution | Canonical Link | Use Case |
|---|---|---|---|
| Linear Regression | Gaussian(μ, σ²) | Identity: η = μ | Continuous outcomes |
| Logistic Regression | Bernoulli(p) | Logit: η = log(p/(1-p)) | Binary classification |
| Poisson Regression | Poisson(λ) | Log: η = log(λ) | Count data |
| Gamma Regression | Gamma(α, β) | Inverse: η = 1/μ | Positive continuous |
| Multinomial Logit | Multinomial(p) | Log odds | Multi-class classification |
1. Sufficient statistics align: The sufficient statistic T(Y) = Y (or simple function of Y) combines directly with features.
2. Convexity guaranteed: The negative log-likelihood is convex in the parameters.
3. Fisher scoring = Newton-Raphson: Optimization algorithms simplify.
4. Variance function determines all: The relationship Var(Y) = V(μ) × φ characterizes the family.
All GLMs can be written as:
$$\log p(y | \mathbf{x}, \mathbf{w}) = \frac{y \cdot \theta(\mathbf{x}, \mathbf{w}) - b(\theta(\mathbf{x}, \mathbf{w}))}{a(\phi)} + c(y, \phi)$$
where θ = g⁻¹(w^T x) maps features to the natural parameter.
This unified view shows that logistic regression, Poisson regression, and linear regression are all instances of the same framework—just with different exponential family distributions!
Understanding exponential families reveals why logistic regression uses the sigmoid (it's the inverse canonical link for Bernoulli) and why Poisson regression uses exp() to transform linear predictions (it's the inverse canonical link for Poisson). These aren't arbitrary choices—they arise naturally from the mathematical structure!
The exponential family is a unifying framework that connects most distributions used in machine learning. Let's consolidate the key insights:
What's Next:
We've explored the exponential family as a unifying framework. Next, we advance to the Multivariate Gaussian distribution—extending the univariate Gaussian to multiple dimensions. This distribution is fundamental to multivariate statistics, Gaussian processes, variational autoencoders, and countless other machine learning methods. We'll explore its geometry, conditioning, marginalization, and the central role of the covariance matrix.
You now understand the exponential family—a powerful unifying framework for probability distributions. This perspective reveals deep connections between distributions, explains why certain formulas arise in machine learning (sigmoid, softmax, log-link), and provides tools for principled model building through GLMs and conjugate Bayesian inference.