The distributions we've studied so far—Bernoulli, Binomial, Gaussian, Poisson—might seem like independent mathematical objects. But beneath the surface lies a remarkable unification: these distributions, along with many others, are all members of a single family called the Exponential Family.
The exponential family isn't just a taxonomic classification; it's a mathematical framework with deep theoretical and practical implications.
Understanding exponential families transforms your view of probability from a collection of separate distributions into a coherent mathematical structure. It explains why certain distributions are "nice" to work with and provides a template for constructing new distributions with desirable properties.
By the end of this page, you will understand the exponential family: its canonical form, natural parameters, sufficient statistics, how common distributions fit this framework, the connection to GLMs, and why this unifying perspective is powerful for machine learning.
A probability distribution belongs to the exponential family if its probability density (or mass) function can be written in the following canonical form:
$$p(x | \boldsymbol{\eta}) = h(x) \exp\left(\boldsymbol{\eta}^T \mathbf{T}(x) - A(\boldsymbol{\eta})\right)$$
or equivalently:
$$p(x | \boldsymbol{\eta}) = \frac{h(x) \exp\left(\boldsymbol{\eta}^T \mathbf{T}(x)\right)}{\exp(A(\boldsymbol{\eta}))}$$
where:
- **η (natural parameter):** The 'true' parameter in this representation; determines the distribution's shape.
- **T(x) (sufficient statistic):** All the information about η contained in x.
- **h(x) (base measure):** The 'reference' distribution; doesn't depend on η.
- **A(η) (log-partition):** Ensures the distribution integrates/sums to 1; contains all moment information.
The log-partition function A(η) is defined by the normalization requirement:
$$\int h(x) \exp(\boldsymbol{\eta}^T \mathbf{T}(x)) \, dx = \exp(A(\boldsymbol{\eta}))$$
Thus:
$$A(\boldsymbol{\eta}) = \log \int h(x) \exp(\boldsymbol{\eta}^T \mathbf{T}(x)) \, dx$$
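For a discrete distribution the integral becomes a sum over the support. A quick sketch checking this for the Bernoulli case (h(x) = 1, T(x) = x, support {0, 1}), where the sum collapses to the closed form A(η) = log(1 + e^η):

```python
import numpy as np

# Bernoulli: h(x) = 1, T(x) = x, support {0, 1}
eta = 0.8473  # an arbitrary natural parameter (≈ logit(0.7))

# A(η) = log Σ_x h(x) exp(η·T(x)) = log(e^0 + e^η)
A_by_sum = np.log(sum(np.exp(eta * x) for x in (0, 1)))
A_closed_form = np.log1p(np.exp(eta))
print(A_by_sum, A_closed_form)  # identical

# By construction, dividing by exp(A(η)) normalizes the distribution
probs = [np.exp(eta * x - A_by_sum) for x in (0, 1)]
print(sum(probs))  # 1.0
```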
Critical insight: The log-partition function is the "key" to the distribution—it encodes all the moments. Specifically:
$$\nabla A(\boldsymbol{\eta}) = E[\mathbf{T}(x)]$$ $$\nabla^2 A(\boldsymbol{\eta}) = \text{Cov}[\mathbf{T}(x)]$$
The first derivative gives the expected sufficient statistics, and the second derivative (Hessian) gives their covariance matrix.
A remarkable property: A(η) is always convex (and often strictly convex). This is because the Hessian ∇²A(η) = Cov[T(x)] is a covariance matrix, which is always positive semi-definite.
This convexity ensures that maximum likelihood estimation for exponential families has globally optimal solutions with no local minima.
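These identities are easy to check numerically. A minimal sketch for the Poisson case, where A(η) = e^η and both the mean and the variance of T(x) = x equal λ, using central finite differences:

```python
import numpy as np

def A(eta):
    """Poisson log-partition function: A(η) = exp(η)."""
    return np.exp(eta)

eta = np.log(5.0)  # natural parameter for λ = 5
h = 1e-4

# ∇A(η) ≈ E[T(x)] = λ  (central difference)
dA = (A(eta + h) - A(eta - h)) / (2 * h)

# ∇²A(η) ≈ Var[T(x)] = λ  (second central difference)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h ** 2

print(dA, d2A)  # both near 5, since the Poisson mean and variance are λ
```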
| Component | Symbol | Role | Property |
|---|---|---|---|
| Natural parameter | η | Determines distribution shape | Often different from 'intuitive' parameters |
| Sufficient statistic | T(x) | Data summary for inference | ∇A(η) = E[T(x)] |
| Base measure | h(x) | Reference/carrier distribution | Independent of η |
| Log-partition | A(η) | Normalizing constant (log) | Always convex; encodes moments |
Let's express the distributions we've studied in exponential family form. This exercise reveals their underlying structure and connections.
Standard form: P(x | p) = p^x (1-p)^{1-x} for x ∈ {0, 1}
Exponential family form:
$$P(x | p) = \exp\left(x \log\frac{p}{1-p} + \log(1-p)\right)$$
Inverse mapping: p = σ(η) = 1/(1 + e^{-η}) (sigmoid function!)
This explains why logistic regression uses the sigmoid—it's the inverse of the natural parameterization!
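A two-line numerical check of this inverse relationship (logit maps p to the natural parameter; sigmoid maps it back):

```python
import numpy as np

def logit(p):
    # Natural parameter for Bernoulli: η = log(p / (1 - p))
    return np.log(p / (1 - p))

def sigmoid(eta):
    # Inverse mapping: p = 1 / (1 + e^{-η})
    return 1 / (1 + np.exp(-eta))

# Round-trips recover the original value in both directions
for p in (0.1, 0.25, 0.5, 0.7, 0.99):
    assert abs(sigmoid(logit(p)) - p) < 1e-12
print("sigmoid is the inverse of the natural parameterization")
```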
Standard form: p(x | μ, σ²) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
With σ² known, the exponential family form is:
$$p(x | \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(\frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{x^2}{2\sigma^2}\right)$$
With both μ and σ² unknown, we have a 2-parameter exponential family with

$$\boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right), \quad \mathbf{T}(x) = (x, x^2)$$
Notice: T(x) now includes both x and x²—we need the second moment to estimate variance!
```python
import numpy as np
from scipy.special import gammaln


class ExponentialFamily:
    """
    Base class for exponential family distributions.

    Provides common functionality: log-pdf, sufficient statistics,
    MLE from sufficient statistics, etc.
    """

    def log_h(self, x):
        """Log of base measure h(x)."""
        raise NotImplementedError

    def T(self, x):
        """Sufficient statistic T(x)."""
        raise NotImplementedError

    def A(self, eta):
        """Log-partition function A(η)."""
        raise NotImplementedError

    def log_prob(self, x, eta):
        """Compute log p(x | η) = log h(x) + η'T(x) - A(η)."""
        return self.log_h(x) + np.dot(eta, self.T(x)) - self.A(eta)

    def expected_T(self, eta):
        """E[T(x)] = ∇A(η); implemented per subclass."""
        raise NotImplementedError


class BernoulliExpFam(ExponentialFamily):
    """Bernoulli in exponential family form."""

    def log_h(self, x):
        return 0.0  # h(x) = 1

    def T(self, x):
        return np.array([x])  # Sufficient stat is just x

    def A(self, eta):
        # A(η) = log(1 + exp(η))
        return np.log1p(np.exp(eta[0]))  # Numerically stable

    def eta_from_p(self, p):
        """Convert probability p to natural parameter η = logit(p)."""
        return np.array([np.log(p / (1 - p))])

    def p_from_eta(self, eta):
        """Convert natural parameter to probability p = sigmoid(η)."""
        return 1 / (1 + np.exp(-eta[0]))

    def expected_T(self, eta):
        """E[T(x)] = E[x] = p = sigmoid(η)."""
        return np.array([self.p_from_eta(eta)])


class PoissonExpFam(ExponentialFamily):
    """Poisson in exponential family form."""

    def log_h(self, x):
        # h(x) = 1/x!, so log h(x) = -log(x!)
        return -gammaln(x + 1)

    def T(self, x):
        return np.array([x])

    def A(self, eta):
        # A(η) = exp(η) = λ
        return np.exp(eta[0])

    def eta_from_lambda(self, lam):
        """Convert λ to natural parameter η = log(λ)."""
        return np.array([np.log(lam)])

    def lambda_from_eta(self, eta):
        """Convert natural parameter to λ = exp(η)."""
        return np.exp(eta[0])

    def expected_T(self, eta):
        """E[T(x)] = E[x] = λ = exp(η)."""
        return np.array([self.lambda_from_eta(eta)])


# Demonstration
print("Exponential Family Representations")
print("=" * 60)

# Bernoulli example
bern = BernoulliExpFam()
p = 0.7
eta_bern = bern.eta_from_p(p)
print(f"Bernoulli(p={p}):")
print(f"  Natural parameter η = logit({p}) = {eta_bern[0]:.4f}")
print(f"  A(η) = log(1+e^η) = {bern.A(eta_bern):.4f}")
print(f"  E[T(x)] = ∇A(η) = p = {bern.expected_T(eta_bern)[0]:.4f}")

# Verify log-probabilities
print(f"  P(x=1): exp(η - A(η)) = {np.exp(bern.log_prob(1, eta_bern)):.4f}")
print(f"  P(x=0): exp(0 - A(η)) = {np.exp(bern.log_prob(0, eta_bern)):.4f}")

# Poisson example
pois = PoissonExpFam()
lam = 5.0
eta_pois = pois.eta_from_lambda(lam)
print(f"Poisson(λ={lam}):")
print(f"  Natural parameter η = log({lam}) = {eta_pois[0]:.4f}")
print(f"  A(η) = exp(η) = λ = {pois.A(eta_pois):.4f}")
print(f"  E[T(x)] = ∇A(η) = λ = {pois.expected_T(eta_pois)[0]:.4f}")
```

| Distribution | η (Natural Param) | T(x) (Suff. Stat) | A(η) | Mean-Param Link |
|---|---|---|---|---|
| Bernoulli(p) | log(p/(1-p)) | x | log(1+e^η) | p = sigmoid(η) |
| Poisson(λ) | log(λ) | x | e^η | λ = e^η |
| Gaussian(μ, σ² known) | μ/σ² | x | η²σ²/2 | μ = ησ² |
| Exponential(λ) | -λ | x | -log(-η) | λ = -η |
| Gamma(α, β) | (α-1, -β) | (log x, x) | log Γ(α) - α log β | See formulas |
| Beta(α, β) | (α-1, β-1) | (log x, log(1-x)) | log B(α,β) | See formulas |
A sufficient statistic T(x) is a function of the data that captures all information about the parameter η. Formally, the conditional distribution of X given T(X) does not depend on η.
A statistic T(X) is sufficient for η if and only if the density can be factored as:
$$p(x | \eta) = g(T(x), \eta) \cdot h(x)$$
where g depends on x only through T(x), and h doesn't depend on η.
For exponential families, this factorization is immediate from the canonical form!
1. Data reduction: You can summarize arbitrarily large datasets by computing T(X), losing no information about η.
Example: For n i.i.d. Gaussian observations, the sufficient statistics are Σxᵢ and Σxᵢ². No matter how large n is, these two numbers contain all information about μ and σ².
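A small sketch of this data reduction: two batches of Gaussian data are each condensed to (n, Σxᵢ, Σxᵢ²), and merging just the summaries reproduces the pooled MLE exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
batch1 = rng.normal(5.0, 2.0, 400)
batch2 = rng.normal(5.0, 2.0, 600)

def summarize(x):
    """Condense a batch to its sufficient statistics (n, Σxᵢ, Σxᵢ²)."""
    return len(x), x.sum(), (x ** 2).sum()

n1, s1, q1 = summarize(batch1)
n2, s2, q2 = summarize(batch2)

# Merging batches = adding sufficient statistics
n, s, q = n1 + n2, s1 + s2, q1 + q2
mu_hat = s / n
var_hat = q / n - mu_hat ** 2

# Identical to the MLE computed from the pooled raw data
pooled = np.concatenate([batch1, batch2])
print(np.isclose(mu_hat, pooled.mean()), np.isclose(var_hat, pooled.var()))
```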
2. MLE has closed form: For exponential families, MLE just involves setting the observed sufficient statistics equal to their expected values:
$$E_{\hat{\eta}}[\mathbf{T}(X)] = \frac{1}{n}\sum_{i=1}^n \mathbf{T}(x_i)$$
For i.i.d. data x₁, ..., xₙ from an exponential family, the joint distribution is:
$$p(x_1, \ldots, x_n | \eta) = \prod_{i=1}^n h(x_i) \exp\left(\eta^T \sum_{i=1}^n T(x_i) - nA(\eta)\right)$$
This shows that Σᵢ T(xᵢ) is sufficient for η. The dimensionality of the sufficient statistic is fixed, regardless of sample size!
| Distribution | Sufficient Statistic for n samples |
|---|---|
| Bernoulli(p) | Σxᵢ (count of 1s) |
| Poisson(λ) | Σxᵢ (total count) |
| Gaussian(μ, σ²) | (Σxᵢ, Σxᵢ²) |
| Exponential(λ) | Σxᵢ |
| Gamma(α known, β) | Σxᵢ |
This fixed-dimensional sufficiency is unique to exponential families and is one of their key practical advantages.
Exponential family sufficient statistics are typically minimal—there's no further reduction possible without losing information. For a k-parameter exponential family, the minimal sufficient statistic is k-dimensional, providing maximum compression of arbitrarily large datasets.
MLE for exponential families has a beautiful structure arising from the properties of the log-partition function.
For n i.i.d. observations, the log-likelihood is:
$$\ell(\eta) = \sum_{i=1}^n \log h(x_i) + \eta^T \sum_{i=1}^n T(x_i) - nA(\eta)$$
The gradient with respect to η is:
$$\nabla_\eta \ell(\eta) = \sum_{i=1}^n T(x_i) - n\nabla A(\eta) = \sum_{i=1}^n T(x_i) - n E_\eta[T(X)]$$
Setting the gradient to zero:
$$E_{\hat{\eta}}[T(X)] = \frac{1}{n}\sum_{i=1}^n T(x_i) = \bar{T}$$
The MLE is found by matching expected sufficient statistics to observed sufficient statistics!
The Hessian of the log-likelihood is:
$$\nabla^2_\eta \ell(\eta) = -n\nabla^2 A(\eta) = -n \cdot \text{Cov}_\eta[T(X)]$$
Since covariance matrices are positive semi-definite, the Hessian is negative semi-definite, confirming that the log-likelihood is concave (convex optimization problem when minimizing negative log-likelihood).
Implications: MLE can be viewed as finding the distribution in the exponential family whose expected sufficient statistics match the sample averages. For many distributions this matching yields closed-form solutions; for the rest, it is a convex optimization problem with a unique global optimum.
```python
import numpy as np


def gaussian_exponential_family_mle(data: np.ndarray) -> dict:
    """
    MLE for Gaussian using exponential family perspective.

    For Gaussian(μ, σ²), the sufficient statistics are (Σxᵢ, Σxᵢ²).
    Natural parameters are η₁ = μ/σ², η₂ = -1/(2σ²).

    MLE: Match E[T(X)] to observed T̄.
    """
    n = len(data)

    # Compute sufficient statistics
    sum_x = np.sum(data)
    sum_x2 = np.sum(data ** 2)

    # Sample averages of sufficient statistics
    T_bar = np.array([sum_x / n, sum_x2 / n])

    # For Gaussian, E[T(X)] = (μ, μ² + σ²)
    # Solving: μ = T̄₁, σ² = T̄₂ - T̄₁²
    mu_mle = T_bar[0]
    sigma2_mle = T_bar[1] - T_bar[0] ** 2

    # Convert to natural parameters
    eta1 = mu_mle / sigma2_mle
    eta2 = -1 / (2 * sigma2_mle)

    return {
        'sufficient_stats': {'sum_x': sum_x, 'sum_x2': sum_x2},
        'T_bar': T_bar,
        'mu_mle': mu_mle,
        'sigma2_mle': sigma2_mle,
        'natural_params': (eta1, eta2),
    }


def exponential_family_fisher_information(A_hessian_func, eta):
    """
    Fisher information for exponential family = ∇²A(η).

    This equals the covariance of the sufficient statistics.
    """
    return A_hessian_func(eta)


# Example: Gaussian MLE via sufficient statistics
np.random.seed(42)
true_mu, true_sigma = 5.0, 2.0
n = 1000
data = np.random.normal(true_mu, true_sigma, n)

results = gaussian_exponential_family_mle(data)

print("Gaussian MLE via Exponential Family")
print("=" * 50)
print(f"True parameters: μ = {true_mu}, σ² = {true_sigma**2}")
print("Sufficient statistics:")
print(f"  Σxᵢ  = {results['sufficient_stats']['sum_x']:.4f}")
print(f"  Σxᵢ² = {results['sufficient_stats']['sum_x2']:.4f}")
print("MLE (from moment matching):")
print(f"  μ̂  = {results['mu_mle']:.4f}")
print(f"  σ̂² = {results['sigma2_mle']:.4f}")
print("Natural parameters:")
print(f"  η₁ = μ/σ²     = {results['natural_params'][0]:.4f}")
print(f"  η₂ = -1/(2σ²) = {results['natural_params'][1]:.4f}")

# Demonstrate moment matching for Poisson
print("\n" + "=" * 50)
print("Poisson MLE via Exponential Family")
print("=" * 50)

true_lambda = 7.5
poisson_data = np.random.poisson(true_lambda, n)

# Sufficient statistic is Σxᵢ
# E[T(X)] = λ, so the MLE is T̄ = sample mean
sum_x = np.sum(poisson_data)
T_bar = sum_x / n  # = sample mean = λ̂

print(f"True λ = {true_lambda}")
print(f"Sufficient statistic Σxᵢ = {sum_x}")
print(f"T̄ = Σxᵢ/n = {T_bar:.4f}")
print(f"λ̂_MLE = {T_bar:.4f}")
print(f"Natural parameter η = log(λ̂) = {np.log(T_bar):.4f}")
```

For exponential families, there exists a systematic way to construct conjugate priors—priors where the posterior has the same functional form as the prior, just with updated parameters.
For a likelihood from an exponential family:
$$p(x | \eta) = h(x) \exp(\eta^T T(x) - A(\eta))$$
The conjugate prior for η has the form:
$$p(\eta | \chi, \nu) \propto \exp(\eta^T \chi - \nu A(\eta))$$
where χ and ν are hyperparameters of the prior.
After observing x₁, ..., xₙ, the posterior is:
$$p(\eta | x_1, \ldots, x_n) \propto p(\eta | \chi, \nu) \prod_{i=1}^n p(x_i | \eta)$$
$$\propto \exp\left(\eta^T \left(\chi + \sum_{i=1}^n T(x_i)\right) - (\nu + n) A(\eta)\right)$$
This is the same form with updated hyperparameters:
$$\chi' = \chi + \sum_{i=1}^n T(x_i), \quad \nu' = \nu + n$$
The prior 'pretends' to have seen ν observations with sufficient statistics χ.
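A numerical sanity check of this update rule for the Poisson–Gamma pair: since T(x) = x, the update reproduces the closed-form posterior Gamma(α + Σxᵢ, β + n), which we compare against a brute-force prior × likelihood computation on a grid of λ values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, beta = 2.0, 1.0        # Gamma(shape, rate) prior hyperparameters
data = rng.poisson(4.0, 30)   # observed Poisson counts
n, sum_x = len(data), data.sum()

# Exponential-family update: χ' = χ + Σ T(xᵢ), ν' = ν + n
# For Poisson (T(x) = x) this gives the Gamma(α + Σxᵢ, β + n) posterior
alpha_post, beta_post = alpha + sum_x, beta + n

# Brute-force check: prior × likelihood, normalized on a grid of λ
lam = np.linspace(1e-3, 12, 4000)
log_post = (stats.gamma.logpdf(lam, a=alpha, scale=1 / beta)
            + stats.poisson.logpmf(data[:, None], lam[None, :]).sum(axis=0))
log_post -= log_post.max()
numeric = np.exp(log_post)
numeric /= numeric.sum() * (lam[1] - lam[0])   # normalize (Riemann sum)

closed_form = stats.gamma.pdf(lam, a=alpha_post, scale=1 / beta_post)
print(np.max(np.abs(numeric - closed_form)))   # should be tiny
```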
| Likelihood | Conjugate Prior | Prior Parameters | Posterior Parameters |
|---|---|---|---|
| Bernoulli(p) | Beta(α, β) | α, β (pseudo-counts) | α + Σxᵢ, β + n - Σxᵢ |
| Poisson(λ) | Gamma(α, β) | α (shape), β (rate) | α + Σxᵢ, β + n |
| Gaussian(μ, σ² known) | Gaussian(μ₀, σ₀²) | μ₀, σ₀² | Precision-weighted mean |
| Exponential(λ) | Gamma(α, β) | α, β | α + n, β + Σxᵢ |
| Multinomial(p) | Dirichlet(α) | α₁, ..., αₖ | αⱼ + count_j |
Prior: p ~ Beta(α, β)
This means p has prior PDF: p(p | α, β) ∝ p^{α-1}(1-p)^{β-1}
After observing n Bernoulli trials with k successes:
Posterior: p | data ~ Beta(α + k, β + n - k)
Interpretation: the prior acts like α pseudo-successes and β pseudo-failures seen before the data; the posterior simply adds the real counts k and n - k to these pseudo-counts.
1. Computational convenience: Closed-form posteriors without MCMC
2. Interpretability: Prior parameters have intuitive meanings
3. Sequential updating: Easy to incorporate new data incrementally
4. Posterior predictive: Often has closed form
```python
import numpy as np
from scipy import stats


class BetaBernoulliConjugate:
    """
    Bayesian inference for Bernoulli likelihood with Beta prior.

    Prior:      p ~ Beta(α, β)
    Likelihood: x | p ~ Bernoulli(p)
    Posterior:  p | data ~ Beta(α + successes, β + failures)
    """

    def __init__(self, alpha_prior: float = 1.0, beta_prior: float = 1.0):
        """
        Args:
            alpha_prior: Prior pseudo-count for successes
            beta_prior: Prior pseudo-count for failures
        """
        self.alpha = alpha_prior
        self.beta = beta_prior
        self.n_obs = 0
        self.n_successes = 0

    def update(self, observations: np.ndarray):
        """Update posterior with new observations (0s and 1s)."""
        n = len(observations)
        k = np.sum(observations)
        self.alpha += k
        self.beta += (n - k)
        self.n_obs += n
        self.n_successes += k
        return self

    def posterior(self):
        """Return the posterior distribution (frozen scipy object)."""
        return stats.beta(self.alpha, self.beta)

    def posterior_mean(self) -> float:
        """E[p | data] = α / (α + β)."""
        return self.alpha / (self.alpha + self.beta)

    def posterior_mode(self) -> float:
        """Mode of posterior (for α, β > 1)."""
        if self.alpha > 1 and self.beta > 1:
            return (self.alpha - 1) / (self.alpha + self.beta - 2)
        return self.posterior_mean()  # Fallback

    def credible_interval(self, level: float = 0.95) -> tuple:
        """Compute a credible interval for p."""
        post = self.posterior()
        lower = (1 - level) / 2
        upper = 1 - lower
        return (post.ppf(lower), post.ppf(upper))

    def posterior_predictive(self) -> float:
        """P(next observation = 1 | data) = posterior mean."""
        return self.posterior_mean()


# Example: A/B test analysis
np.random.seed(42)
true_rate = 0.12  # True conversion rate
n_observations = 50

# Simulate data
conversions = np.random.binomial(1, true_rate, n_observations)

# Bayesian inference with weakly informative prior
model = BetaBernoulliConjugate(alpha_prior=1, beta_prior=1)  # Uniform prior
model.update(conversions)

print("Beta-Bernoulli Conjugate Inference")
print("=" * 50)
print(f"True conversion rate: {true_rate}")
print(f"Observed: {model.n_successes}/{model.n_obs} conversions")
print(f"MLE estimate: {model.n_successes / model.n_obs:.4f}")
print(f"Posterior: Beta({model.alpha:.0f}, {model.beta:.0f})")
print(f"Posterior mean: {model.posterior_mean():.4f}")
print(f"Posterior mode: {model.posterior_mode():.4f}")
print(f"95% credible interval: [{model.credible_interval()[0]:.4f}, "
      f"{model.credible_interval()[1]:.4f}]")
print(f"P(next conversion) = {model.posterior_predictive():.4f}")

# Sequential updating demonstration
print("\n" + "=" * 50)
print("Sequential Bayesian Updating")
print("=" * 50)

model2 = BetaBernoulliConjugate(alpha_prior=2, beta_prior=8)  # Prior: ~20% rate
print(f"Prior: Beta(2, 8) — effective rate {2/10:.1%}")

for batch in range(5):
    new_data = np.random.binomial(1, true_rate, 20)
    model2.update(new_data)
    ci = model2.credible_interval()
    print(f"After batch {batch+1} (n={model2.n_obs:3d}): "
          f"posterior mean = {model2.posterior_mean():.4f}, "
          f"95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Generalized Linear Models (GLMs) extend linear regression to response variables from any exponential family distribution. The exponential family framework provides the theoretical foundation.
A GLM has three components:
1. Random Component: The response Y follows an exponential family distribution: $$Y | X \sim \text{ExpFam}(\eta)$$
2. Systematic Component: A linear predictor: $$\eta_{\text{linear}} = \mathbf{w}^T \mathbf{x} + b$$
3. Link Function: Connects the linear predictor to the mean of the response: $$\eta_{\text{linear}} = g(E[Y]) \quad \text{or equivalently} \quad E[Y] = g^{-1}(\eta_{\text{linear}})$$
The canonical link function is the one where the linear predictor equals the natural parameter: $$\eta_{\text{linear}} = \eta_{\text{natural}}$$
This gives the simplest, most numerically stable GLMs.
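To make this concrete, here is a minimal sketch (not a production implementation) of Poisson regression with the canonical log link, fit by gradient descent on the convex negative log-likelihood; the data and parameter values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic count data: log E[y] = w·x + b with w = 1.5, b = 0.5
X = rng.uniform(-1, 1, size=(200, 1))
true_w, true_b = np.array([1.5]), 0.5
y = rng.poisson(np.exp(X @ true_w + true_b))

# Canonical link: the linear predictor η = w·x + b IS the natural
# parameter, so E[y|x] = exp(η) and the per-sample NLL is
# exp(ηᵢ) - yᵢηᵢ (dropping the log(yᵢ!) constant) — convex in (w, b).
w, b = np.zeros(1), 0.0
lr = 0.05
for _ in range(2000):
    eta = X @ w + b
    mu = np.exp(eta)                 # predicted mean
    grad_eta = (mu - y) / len(y)     # ∂(avg NLL)/∂ηᵢ
    w -= lr * (X.T @ grad_eta)
    b -= lr * grad_eta.sum()

print(w, b)  # should land near the true (1.5, 0.5), up to sampling noise
```

Note how the gradient has the moment-matching form (μ - y): exactly the "observed minus expected sufficient statistics" structure derived earlier.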
| Model Name | Response Distribution | Canonical Link | Use Case |
|---|---|---|---|
| Linear Regression | Gaussian(μ, σ²) | Identity: η = μ | Continuous outcomes |
| Logistic Regression | Bernoulli(p) | Logit: η = log(p/(1-p)) | Binary classification |
| Poisson Regression | Poisson(λ) | Log: η = log(λ) | Count data |
| Gamma Regression | Gamma(α, β) | Inverse: η = 1/μ | Positive continuous |
| Multinomial Logit | Multinomial(p) | Log odds | Multi-class classification |
1. Sufficient statistics align: The sufficient statistic T(Y) = Y (or simple function of Y) combines directly with features.
2. Convexity guaranteed: The negative log-likelihood is convex in the parameters.
3. Fisher scoring = Newton-Raphson: Optimization algorithms simplify.
4. Variance function determines all: The relationship Var(Y) = V(μ) × φ characterizes the family.
All GLMs can be written as:
$$\log p(y | \mathbf{x}, \mathbf{w}) = \frac{y \cdot \theta(\mathbf{x}, \mathbf{w}) - b(\theta(\mathbf{x}, \mathbf{w}))}{a(\phi)} + c(y, \phi)$$
where θ = g⁻¹(w^T x) maps features to the natural parameter.
This unified view shows that logistic regression, Poisson regression, and linear regression are all instances of the same framework—just with different exponential family distributions!
Understanding exponential families reveals why logistic regression uses the sigmoid (it's the inverse canonical link for Bernoulli) and why Poisson regression uses exp() to transform linear predictions (it's the inverse canonical link for Poisson). These aren't arbitrary choices—they arise naturally from the mathematical structure!
The exponential family is a unifying framework that connects most distributions used in machine learning. Let's consolidate the key insights:
What's Next:
We've explored the exponential family as a unifying framework. Next, we advance to the Multivariate Gaussian distribution—extending the univariate Gaussian to multiple dimensions. This distribution is fundamental to multivariate statistics, Gaussian processes, variational autoencoders, and countless other machine learning methods. We'll explore its geometry, conditioning, marginalization, and the central role of the covariance matrix.
You now understand the exponential family—a powerful unifying framework for probability distributions. This perspective reveals deep connections between distributions, explains why certain formulas arise in machine learning (sigmoid, softmax, log-link), and provides tools for principled model building through GLMs and conjugate Bayesian inference.