What do the Normal, Binomial, Poisson, Gamma, and Inverse Gaussian distributions have in common? At first glance, they seem quite different—continuous vs. discrete, bounded vs. unbounded, symmetric vs. skewed. Yet they all share a profound mathematical structure: they are members of the exponential family of distributions.
This unification is not merely taxonomic. The exponential family structure is precisely what enables the GLM framework to exist. It guarantees:
- a simple sufficient statistic and score equations with a common form,
- a well-behaved (concave) log-likelihood that a single fitting algorithm (IRLS) can maximize,
- a systematic mean-variance relationship through the variance function $V(\mu)$, and
- a natural (canonical) link function for each distribution.
Understanding the exponential family transforms GLMs from a collection of techniques into a coherent theoretical framework. It reveals why certain distribution-link combinations work well and provides the tools for extending GLMs to new problems.
By the end of this page, you will: (1) recognize the exponential family form and identify its components, (2) understand how mean, variance, and canonical parameters are related through the cumulant function, (3) see how common distributions fit the exponential family framework, and (4) appreciate why exponential family membership enables the GLM machinery.
A probability distribution belongs to the exponential family if its density (or probability mass function) can be written in the form:
$$f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$
This is the canonical form of the exponential family. Let's understand each component:
1. The Canonical (Natural) Parameter: $\theta$
This is the primary parameter of interest, directly linked to the mean of the distribution. Different values of $\theta$ correspond to different distributions within the family. The canonical parameter lives on the "natural" scale where the sufficient statistic $Y$ appears linearly.
2. The Cumulant Function: $b(\theta)$
This is the normalizing function that ensures the density integrates to 1. It is also called the log-partition function because:
$$b(\theta) = a(\phi) \log \int \exp(y\theta / a(\phi)) \cdot \exp(c(y,\phi)) \, dy$$
Remarkably, $b(\theta)$ encodes all the information about the mean and variance of $Y$: $E[Y] = b'(\theta)$ and $\text{Var}(Y) = a(\phi)\, b''(\theta)$, as derived below.
3. The Dispersion Function: $a(\phi)$
The dispersion parameter $\phi$ controls the spread of the distribution. The function $a(\phi)$ scales this effect. For many common distributions, $a(\phi) = \phi/w$, where $w$ is a known prior weight (often simply $a(\phi) = \phi$).
When $\phi$ is known (as in Poisson and Binomial), we have a one-parameter exponential family. When $\phi$ is unknown (as in Normal and Gamma), we have a two-parameter family.
4. The Carrier Measure: $c(y, \phi)$
This is everything that doesn't depend on $\theta$. It includes normalizing terms that involve only $y$ and $\phi$, such as $-\log(y!)$ for the Poisson or $-\frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$ for the Normal.
For parameter estimation, $c(y, \phi)$ can often be ignored when $\phi$ is known or estimated separately.
The exponential family form guarantees that Y is a sufficient statistic for θ. This means all the information in the data about θ is captured by the sum Σyᵢ. This is why GLMs can be fit efficiently and why the score equations have simple forms.
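To make sufficiency concrete, here is a minimal sketch (the data are made up for illustration): two Poisson samples with the same total have log-likelihoods that differ only by a constant in $\lambda$, so they carry exactly the same information about the rate.

```python
import numpy as np
from scipy.special import gammaln

# Two made-up Poisson samples with the same total (15) but different values
y1 = np.array([1, 2, 3, 4, 5])
y2 = np.array([3, 3, 3, 3, 3])

def poisson_loglik(y, lam):
    # sum_i [y_i*log(lam) - lam - log(y_i!)] -- depends on the data only via sum(y) and n
    return np.sum(y) * np.log(lam) - len(y) * lam - np.sum(gammaln(y + 1))

lam_grid = np.linspace(0.5, 10, 200)
ll1 = np.array([poisson_loglik(y1, lam) for lam in lam_grid])
ll2 = np.array([poisson_loglik(y2, lam) for lam in lam_grid])

# The two curves differ only by a constant (the c(y, phi) terms), not in shape over lambda,
# so both samples yield the same MLE and the same inference about lambda.
print("spread of (ll1 - ll2):", np.ptp(ll1 - ll2))                                   # ~0: constant difference
print("MLE from each sample:", lam_grid[np.argmax(ll1)], lam_grid[np.argmax(ll2)])   # both ≈ 3
```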
One of the most elegant properties of the exponential family is that the cumulant function $b(\theta)$ completely determines the mean and variance of the distribution.
For any distribution in the exponential family:
$$E[Y] = \mu = b'(\theta) = \frac{db(\theta)}{d\theta}$$
Proof sketch: The log-likelihood is $\ell(\theta) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)$. Taking the expectation of the score:
$$E\left[\frac{\partial \ell}{\partial \theta}\right] = \frac{1}{a(\phi)}E[Y - b'(\theta)] = 0$$
Since the expected score is zero at the true parameter value, it follows that $E[Y] = b'(\theta)$.
Similarly, using the second Bartlett identity $E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right] + E\left[\left(\frac{\partial \ell}{\partial \theta}\right)^2\right] = 0$:
$$\text{Var}(Y) = a(\phi) \cdot b''(\theta) = a(\phi) \cdot V(\mu)$$
where $V(\mu) = b''(\theta)$ evaluated at the $\theta$ corresponding to $\mu$.
The function $V(\mu)$ is called the variance function. It characterizes how variance changes with the mean and is a defining feature of each exponential family distribution.
| Distribution | b(θ) | b'(θ) = μ | b''(θ) = V(μ) | Var(Y) |
|---|---|---|---|---|
| Normal | θ²/2 | θ (so θ=μ) | 1 | σ² (constant) |
| Bernoulli | log(1 + e^θ) | e^θ/(1+e^θ) = p | p(1-p) | p(1-p) |
| Poisson | e^θ | e^θ = μ | μ | μ (equals mean) |
| Gamma | -log(-θ) | -1/θ = μ | μ² | φμ² |
| Inverse Gaussian | -√(-2θ) | 1/√(-2θ) = μ | μ³ | φμ³ |
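As a quick numerical check of the table (a sketch with illustrative parameter values), we can verify the Gamma row by simulation: the sample mean should match $b'(\theta)$ and the sample variance should match $a(\phi)\, b''(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gamma with mean mu and shape nu: theta = -1/mu, b(theta) = -log(-theta), a(phi) = 1/nu
mu, nu = 4.0, 3.0
theta = -1.0 / mu
a_phi = 1.0 / nu

# Analytic results from the table
mean_theory = -1.0 / theta               # b'(theta) = -1/theta = mu
var_theory = a_phi * (1.0 / theta**2)    # a(phi) * b''(theta) = mu^2 / nu

# Simulate: numpy's gamma takes shape and scale; mean = shape*scale, so scale = mu/nu
y = rng.gamma(shape=nu, scale=mu / nu, size=200_000)

print(mean_theory, y.mean())   # both close to 4.0
print(var_theory, y.var())     # both close to 16/3 ≈ 5.33
```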
Remarkably, the variance function $V(\mu)$ uniquely identifies the exponential family distribution (up to the dispersion parameter). This is why $V(\mu) = 1$ points to the Normal, $V(\mu) = \mu$ to the Poisson, $V(\mu) = \mu(1-\mu)$ to the Bernoulli/Binomial, $V(\mu) = \mu^2$ to the Gamma, and $V(\mu) = \mu^3$ to the Inverse Gaussian.
Apart from the Binomial's $V(\mu) = \mu(1-\mu)$, these are the power variance functions $V(\mu) = \mu^p$ for $p = 0, 1, 2, 3$. The Tweedie family generalizes this to other values of $p$.
When you choose a GLM distribution, you're choosing a mean-variance relationship. Poisson assumes variance equals mean; Gamma assumes variance proportional to mean squared. If your data's variance pattern doesn't match the assumed V(μ), standard errors will be wrong and inference will be invalid—a problem addressed by quasi-likelihood methods.
Let's work through the exponential family representation of the most important distributions for GLMs.
The normal density is: $$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}$$
Rewriting: $$f(y; \mu, \sigma^2) = \exp\left\{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right\}$$
Identifying components:
- Canonical parameter: $\theta = \mu$
- Cumulant function: $b(\theta) = \theta^2/2$
- Dispersion: $a(\phi) = \phi = \sigma^2$
- Carrier: $c(y, \phi) = -\frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$
Verification: $b'(\theta) = \theta = \mu$ ✓; $b''(\theta) = 1$, so $V(\mu) = 1$ ✓
For a single Bernoulli trial: $$f(y; p) = p^y (1-p)^{1-y} = \exp\{y\log(p) + (1-y)\log(1-p)\}$$
$$= \exp\left\{y\log\left(\frac{p}{1-p}\right) + \log(1-p)\right\}$$
Identifying components:
- Canonical parameter: $\theta = \log\left(\frac{p}{1-p}\right)$ (the logit)
- Cumulant function: $b(\theta) = \log(1 + e^\theta)$, since $\log(1-p) = -\log(1 + e^\theta)$
- Dispersion: $a(\phi) = 1$
- Carrier: $c(y, \phi) = 0$
Verification: $b'(\theta) = \frac{e^\theta}{1+e^\theta} = p$ ✓; $b''(\theta) = p(1-p)$ ✓
This explains why the logit is the canonical link: it makes $\eta = \theta$ directly!
The Poisson PMF: $$f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = \exp\{y\log\lambda - \lambda - \log(y!)\}$$
Identifying components:
- Canonical parameter: $\theta = \log\lambda$
- Cumulant function: $b(\theta) = e^\theta = \lambda$
- Dispersion: $a(\phi) = 1$
- Carrier: $c(y, \phi) = -\log(y!)$
Verification: $b'(\theta) = e^\theta = \lambda = \mu$ ✓; $b''(\theta) = e^\theta = \mu$ ✓ (variance equals mean)
This explains why the log is the canonical link for Poisson!
The Gamma density (parameterized by mean $\mu$ and shape $\nu$): $$f(y; \mu, \nu) = \frac{1}{\Gamma(\nu)}\left(\frac{\nu}{\mu}\right)^\nu y^{\nu-1} \exp\left\{-\frac{\nu y}{\mu}\right\}$$
Rewriting in exponential family form: $$f(y; \mu, \nu) = \exp\left\{\frac{-y/\mu - \log\mu}{1/\nu} + (\nu-1)\log y + \nu\log\nu - \log\Gamma(\nu)\right\}$$
Identifying components:
- Canonical parameter: $\theta = -1/\mu$
- Cumulant function: $b(\theta) = -\log(-\theta) = \log\mu$
- Dispersion: $a(\phi) = \phi = 1/\nu$
- Carrier: $c(y, \phi) = (\nu - 1)\log y + \nu\log\nu - \log\Gamma(\nu)$
Verification: $b'(\theta) = -1/\theta = \mu$ ✓; $b''(\theta) = 1/\theta^2 = \mu^2$ ✓
The canonical link is the inverse: $g(\mu) = -1/\mu = \theta$.
Notice how each distribution's canonical link is simply the function that transforms the mean to the canonical parameter: identity for Normal (θ=μ), logit for Bernoulli (θ=logit(p)), log for Poisson (θ=log(μ)), inverse for Gamma (θ=-1/μ). This is not coincidence—it's the definition of canonical link!
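The following sketch (illustrative values only) makes this mapping explicit: for each family it applies the canonical link $g(\mu) = \theta$ and then checks that $b'(\theta)$ recovers the mean.

```python
import numpy as np

# Canonical link g(mu) = theta and its inverse b'(theta) = mu for four families
families = {
    "Normal":    (lambda mu: mu,                    lambda th: th),                     # identity
    "Bernoulli": (lambda mu: np.log(mu / (1 - mu)), lambda th: 1 / (1 + np.exp(-th))),  # logit
    "Poisson":   (lambda mu: np.log(mu),            lambda th: np.exp(th)),             # log
    "Gamma":     (lambda mu: -1 / mu,               lambda th: -1 / th),                # negative inverse
}

mu_test = {"Normal": 2.5, "Bernoulli": 0.3, "Poisson": 4.0, "Gamma": 4.0}

for name, (link, b_prime) in families.items():
    mu = mu_test[name]
    theta = link(mu)
    # b'(theta) should recover the mean, confirming the canonical link maps mu to theta
    print(f"{name:10s} theta = {theta:+.4f}, b'(theta) = {b_prime(theta):.4f}  (should equal mu = {mu})")
```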
Exponential family membership confers several crucial properties that make GLMs tractable.
For a sample $Y_1, \ldots, Y_n$ from an exponential family:
$$\sum_{i=1}^n Y_i \text{ is a sufficient statistic for } \theta$$
This means all information about $\theta$ contained in the sample is captured by this single summary. This is extremely powerful: estimation depends on the data only through $\sum_i y_i$ (or $\mathbf{X}^\top\mathbf{y}$ in the regression setting), which is why the score equations take such simple forms and why GLMs can be fit efficiently.
Under mild regularity conditions, the log-likelihood for exponential family distributions is concave in the canonical parameter $\theta$, because the cumulant function $b(\theta)$ is convex.
For GLMs with canonical links, the log-likelihood is concave in $\beta$ as well, guaranteeing that Newton-Raphson and related algorithms converge to the global MLE.
The MGF of an exponential family distribution has a simple form:
$$M(t) = E[e^{tY}] = \exp\left\{\frac{b(\theta + ta(\phi)) - b(\theta)}{a(\phi)}\right\}$$
This immediately gives the moments derived earlier: $E[Y] = M'(0) = b'(\theta)$ and $\text{Var}(Y) = M''(0) - M'(0)^2 = a(\phi)\, b''(\theta)$.
Every exponential family distribution has a conjugate prior in the Bayesian framework. The conjugate prior for $\theta$ is:
$$\pi(\theta) \propto \exp\{\theta \cdot (\text{prior sum}) - b(\theta) \cdot (\text{prior } n)\}$$
This makes Bayesian inference tractable: the posterior has the same functional form as the prior, with updated hyperparameters.
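As a concrete, hedged example of conjugacy: for the Poisson rate, a Gamma prior is the standard conjugate choice, and the posterior update only needs the sufficient statistics $\sum_i y_i$ and $n$ (the prior values below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)

# Gamma(alpha, beta) prior on the Poisson rate lambda (the conjugate pair for Poisson counts)
alpha_prior, beta_prior = 2.0, 1.0

# Observe some hypothetical counts
y = rng.poisson(lam=4.0, size=25)

# Conjugate update: posterior is Gamma(alpha + sum(y), beta + n)
alpha_post = alpha_prior + y.sum()
beta_post = beta_prior + len(y)

print("posterior mean of lambda:", alpha_post / beta_post)  # shrinks the sample mean toward the prior
print("sample mean:", y.mean())
```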
The cumulant generating function $K(t) = \log M(t)$ is:
$$K(t) = \frac{b(\theta + ta(\phi)) - b(\theta)}{a(\phi)}$$
The cumulants (mean, variance, skewness, kurtosis) can be read off from successive derivatives of $K(t)$ evaluated at $t=0$.
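A small sketch (parameter values are illustrative) evaluates $K(t)$ for the Poisson directly from the formula above and recovers the first two cumulants by finite differences.

```python
import numpy as np

# Poisson: b(theta) = exp(theta), a(phi) = 1, theta = log(lambda)
lam = 3.5
theta = np.log(lam)

def K(t):
    # K(t) = [b(theta + t*a(phi)) - b(theta)] / a(phi), with a(phi) = 1
    return np.exp(theta + t) - np.exp(theta)

# Numerical derivatives of K at t = 0 (central differences)
h = 1e-4
K1 = (K(h) - K(-h)) / (2 * h)           # first cumulant  = mean
K2 = (K(h) - 2 * K(0) + K(-h)) / h**2   # second cumulant = variance

print(K1, K2)  # both ≈ lambda = 3.5, as expected since the Poisson mean equals its variance
```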
These properties are what make GLMs a unified framework rather than a collection of ad hoc methods. Sufficient statistics mean efficient estimation; concave likelihoods mean reliable optimization; conjugate priors mean tractable Bayesian extensions. Without exponential family structure, none of this would hold.
The exponential family structure yields elegant expressions for the score function and Fisher information—the workhorses of maximum likelihood estimation.
For a single observation $Y$ from the exponential family, the log-likelihood is:
$$\ell(\theta) = \frac{Y\theta - b(\theta)}{a(\phi)} + c(Y, \phi)$$
The score function is the derivative with respect to $\theta$:
$$U(\theta) = \frac{\partial \ell}{\partial \theta} = \frac{Y - b'(\theta)}{a(\phi)} = \frac{Y - \mu}{a(\phi)}$$
Note: The score is proportional to $(Y - \mu)$—the deviation of the observation from its expected value!
$$E[U(\theta)] = 0$$
The score has mean zero at the true parameter value. This is the basis for the score equations.
$$\text{Var}(U(\theta)) = E[U(\theta)^2] = \frac{\text{Var}(Y)}{a(\phi)^2} = \frac{b''(\theta)}{a(\phi)}$$
The Fisher information is the variance of the score, or equivalently, the negative expected second derivative of the log-likelihood:
$$\mathcal{I}(\theta) = E\left[-\frac{\partial^2 \ell}{\partial \theta^2}\right] = \frac{b''(\theta)}{a(\phi)} = \frac{V(\mu)}{a(\phi)}$$
where $V(\mu)$ is the variance function.
For a GLM, we're interested in the score and information in terms of $\beta$. Using the chain rule:
$$U(\beta) = \frac{\partial \ell}{\partial \beta} = \sum_i \frac{(Y_i - \mu_i)}{a(\phi) V(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot \mathbf{x}_i$$
With the canonical link (where $\theta = \eta$), this simplifies beautifully:
$$U(\beta) = \frac{1}{a(\phi)} \mathbf{X}^\top (\mathbf{Y} - \boldsymbol{\mu})$$
The score is just the cross-product of the design matrix with the residuals—the same form as in ordinary least squares!
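The sketch below (simulated data, arbitrary coefficients) verifies this for logistic regression with the canonical logit link: the analytic score $\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu})$ matches a numerical gradient of the Bernoulli log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small hypothetical logistic-regression setup (canonical logit link)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, -2.0])
prob = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob)

def score(beta):
    # Canonical-link score: U(beta) = X^T (y - mu), with a(phi) = 1 for Bernoulli
    mu = 1 / (1 + np.exp(-X @ beta))
    return X.T @ (y - mu)

def loglik(beta):
    # Bernoulli log-likelihood: sum_i [y_i * eta_i - log(1 + exp(eta_i))]
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# Check against a numerical gradient at an arbitrary beta
beta0 = np.array([0.2, -0.3, 0.5])
h = 1e-6
num_grad = np.array([(loglik(beta0 + h * e) - loglik(beta0 - h * e)) / (2 * h) for e in np.eye(p)])
print(np.allclose(score(beta0), num_grad, atol=1e-4))  # True
```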
```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrate score function for Poisson GLM

# True parameter
lambda_true = 5.0
y_obs = 7  # A single observation

# Range of lambda values to evaluate
lambda_range = np.linspace(0.5, 15, 200)

# Log-likelihood for Poisson: y*log(λ) - λ (the constant -log(y!) is dropped)
loglik = y_obs * np.log(lambda_range) - lambda_range

# Score function: ∂ℓ/∂λ = y/λ - 1
score = y_obs / lambda_range - 1

# Fisher information: 1/λ (for a single observation)
fisher_info = 1 / lambda_range

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Log-likelihood
axes[0].plot(lambda_range, loglik, 'b-', linewidth=2)
axes[0].axvline(x=y_obs, color='r', linestyle='--', alpha=0.7, label=f'MLE = y = {y_obs}')
axes[0].axvline(x=lambda_true, color='g', linestyle=':', alpha=0.7, label=f'True λ = {lambda_true}')
axes[0].set_xlabel('λ', fontsize=12)
axes[0].set_ylabel('Log-Likelihood', fontsize=12)
axes[0].set_title('Poisson Log-Likelihood\n(concave in λ)', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Score function
axes[1].plot(lambda_range, score, 'r-', linewidth=2)
axes[1].axhline(y=0, color='gray', linestyle='-', alpha=0.5)
axes[1].axvline(x=y_obs, color='b', linestyle='--', alpha=0.7, label=f'Score = 0 at λ = {y_obs}')
axes[1].set_xlabel('λ', fontsize=12)
axes[1].set_ylabel('Score U(λ)', fontsize=12)
axes[1].set_title('Score Function\nU(λ) = y/λ - 1', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Fisher information
axes[2].plot(lambda_range, fisher_info, 'g-', linewidth=2)
axes[2].set_xlabel('λ', fontsize=12)
axes[2].set_ylabel('Fisher Information I(λ)', fontsize=12)
axes[2].set_title('Fisher Information\nI(λ) = 1/λ', fontsize=14)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('score_information.png', dpi=150, bbox_inches='tight')
plt.show()
```

The Fisher information matrix determines the weights in Iteratively Reweighted Least Squares (IRLS), the standard algorithm for fitting GLMs. At each iteration, we solve a weighted least squares problem with weights w_i = 1/(V(μ_i) · [g'(μ_i)]²). This is Fisher scoring: Newton-Raphson with the observed information replaced by its expectation, the Fisher information.
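To make the IRLS weights concrete, here is a minimal Fisher-scoring sketch for a Poisson GLM with log link (illustrative, not production code); for the log link, $w_i = 1/(V(\mu_i)[g'(\mu_i)]^2) = \mu_i$.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Fit a Poisson GLM with log link by IRLS (Fisher scoring). Minimal sketch."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        # IRLS weights: w_i = 1 / (V(mu_i) * g'(mu_i)^2) = mu_i for the log-link Poisson
        w = mu
        # Working response: z_i = eta_i + (y_i - mu_i) * g'(mu_i)
        z = eta + (y - mu) / mu
        # Weighted least squares step: solve (X^T W X) beta = X^T W z
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Quick check on simulated data (hypothetical example)
rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5])
y = rng.poisson(np.exp(X @ beta_true))
print(irls_poisson(X, y))  # should be close to [1.0, 0.5]
```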
The deviance is a central concept in GLM theory, providing both a measure of model fit and a basis for comparing nested models.
The deviance is defined as:
$$D = 2\phi \left[\ell(\text{saturated model}) - \ell(\text{fitted model})\right]$$
where the saturated model has one parameter per observation (so $\hat{\mu}_i = y_i$), achieving the highest log-likelihood possible for the data, and the fitted model is the GLM under consideration with $p$ parameters.
Equivalently, writing both log-likelihoods in exponential family form (taking $a(\phi) = \phi$):
$$D = 2\sum_{i=1}^n \left[y_i(\tilde{\theta}_i - \hat{\theta}_i) - b(\tilde{\theta}_i) + b(\hat{\theta}_i)\right]$$
where $\tilde{\theta}_i$ is the canonical parameter in the saturated model and $\hat{\theta}_i$ is in the fitted model.
The deviance decomposes as: $$D = \sum_{i=1}^n d_i^2$$
where $d_i^2$ is observation $i$'s contribution to the deviance (tabulated below) and $d_i = \operatorname{sign}(y_i - \hat{\mu}_i)\sqrt{d_i^2}$ is the deviance residual for observation $i$.
For specific distributions:
| Distribution | Individual Deviance Contribution |
|---|---|
| Normal | $(y_i - \hat{\mu}_i)^2$ |
| Poisson | $2[y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]$ |
| Binomial | $2[y_i \log(y_i/\hat{\mu}_i) + (n_i-y_i)\log((n_i-y_i)/(n_i-\hat{\mu}_i))]$ |
| Gamma | $2[-\log(y_i/\hat{\mu}_i) + (y_i - \hat{\mu}_i)/\hat{\mu}_i]$ |
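A short check (made-up counts, intercept-only fit so that $\hat{\mu}_i = \bar{y}$) confirms that the Poisson row of the table agrees with the definition $D = 2\phi[\ell(\text{saturated}) - \ell(\text{fitted})]$ with $\phi = 1$.

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical counts; intercept-only Poisson fit, so mu_hat = ybar for every observation
y = np.array([2, 0, 3, 5, 1, 4, 2, 6, 3, 2])
mu_hat = np.full_like(y, y.mean(), dtype=float)

def poisson_loglik(y, mu):
    return np.sum(y * np.log(mu) - mu - gammaln(y + 1))

# Definition: D = 2 * [loglik(saturated) - loglik(fitted)]; the saturated model sets mu = y
# (using the convention 0*log(0) = 0 for zero counts)
loglik_sat = np.sum(np.where(y > 0, y * np.log(np.where(y > 0, y, 1)) - y, 0.0) - gammaln(y + 1))
D_def = 2 * (loglik_sat - poisson_loglik(y, mu_hat))

# Table formula: sum of 2*[y*log(y/mu_hat) - (y - mu_hat)], with y*log(y/mu_hat) = 0 when y = 0
d_i = 2 * (np.where(y > 0, y * np.log(np.where(y > 0, y, 1) / mu_hat), 0.0) - (y - mu_hat))
print(D_def, d_i.sum())  # identical
```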
1. Goodness of Fit
For some GLMs (notably Poisson and Binomial with grouped data), the deviance has an approximate $\chi^2$ distribution under the null hypothesis that the model is correct:
$$D \stackrel{\text{approx}}{\sim} \chi^2_{n-p}$$
A large deviance relative to its degrees of freedom suggests lack of fit.
2. Model Comparison (Likelihood Ratio Tests)
For nested models (Model 1 ⊂ Model 2):
$$\Delta D = D_1 - D_2 \;\stackrel{\text{approx}}{\sim}\; \chi^2_{p_2 - p_1}$$
under the null hypothesis that the simpler model is adequate (when $\phi \neq 1$, use the scaled difference $\Delta D / \hat{\phi}$ or an F-test).
3. Pseudo-R²
The McFadden pseudo-R² is: $$R^2_{\text{McFadden}} = 1 - \frac{\ell(\text{fitted})}{\ell(\text{null})}$$
Alternatively, using deviances: $$R^2_{\text{deviance}} = 1 - \frac{D}{D_{\text{null}}}$$
where $D_{\text{null}}$ is the deviance of the null model (intercept only).
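A sketch of both uses on simulated count data, assuming statsmodels is available (the data and coefficients are illustrative):

```python
import numpy as np
from scipy.stats import chi2
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Simulated counts depending on one covariate (hypothetical example)
n = 300
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.8 * x))

X_null = np.ones((n, 1))                    # intercept only
X_full = np.column_stack([np.ones(n), x])   # intercept + covariate

fit_null = sm.GLM(y, X_null, family=sm.families.Poisson()).fit()
fit_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()

# Likelihood ratio test via the deviance difference
delta_D = fit_null.deviance - fit_full.deviance
df_diff = fit_null.df_resid - fit_full.df_resid
print("Delta D:", delta_D, " p-value:", chi2.sf(delta_D, df_diff))

# Deviance-based pseudo-R^2
print("R2_deviance:", 1 - fit_full.deviance / fit_null.deviance)
```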
The chi-squared approximation for deviance holds only asymptotically and requires "large" counts for Poisson/Binomial. With sparse data (many cells with small expected counts), the approximation breaks down. Always examine deviance residuals graphically rather than relying solely on the chi-squared test.
What if your data doesn't exactly follow an exponential family distribution? The quasi-likelihood approach extends GLM methodology to situations where only the mean-variance relationship is specified.
In practice, real data often exhibit overdispersion—variance greater than that assumed by the model. For example, counts are frequently more variable than the Poisson's $\text{Var}(Y) = \mu$ allows, and proportions often vary more than the Binomial's $np(1-p)$.
Causes include unobserved heterogeneity between units, clustering or correlation among observations, omitted covariates, and excess zeros.
Quasi-likelihood specifies only the mean model, $g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}$, and the variance relationship, $\text{Var}(Y_i) = \phi\, V(\mu_i)$.
No full distributional assumption is needed. We estimate $\beta$ by solving the quasi-score equations:
$$\sum_i \frac{(Y_i - \mu_i)}{V(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \beta_j} = 0$$
These are identical to the GLM score equations, so point estimates are unchanged. However, we estimate $\phi$ from the data:
$$\hat{\phi} = \frac{1}{n-p} \sum_i \frac{(Y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}$$
Standard errors are then inflated by $\sqrt{\hat{\phi}}$.
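A sketch of the quasi-Poisson workflow on simulated overdispersed counts, assuming statsmodels is available (data-generating values are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Overdispersed counts: Poisson rates with extra Gamma heterogeneity (hypothetical example)
n = 400
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.6 * x)
y = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=n))  # Var(Y) > E[Y]

X = np.column_stack([np.ones(n), x])
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Dispersion estimate from Pearson residuals: phi_hat = sum((y - mu_hat)^2 / V(mu_hat)) / (n - p)
mu_hat = fit.fittedvalues
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - X.shape[1])
print("phi_hat:", phi_hat)  # noticeably above 1 for overdispersed data

# Quasi-Poisson standard errors: inflate the Poisson SEs by sqrt(phi_hat)
print("Poisson SEs:      ", fit.bse)
print("Quasi-Poisson SEs:", fit.bse * np.sqrt(phi_hat))
```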
When you observe overdispersion (deviance >> residual df), options include: (1) quasi-likelihood (quickest fix), (2) negative binomial model for counts (next page), (3) beta-binomial for proportions, (4) mixed-effects models that explicitly model heterogeneity.
We've explored the exponential family of distributions—the mathematical foundation that makes GLMs possible. Let's consolidate the key concepts:
- The canonical form $f(y; \theta, \phi) = \exp\{[y\theta - b(\theta)]/a(\phi) + c(y,\phi)\}$, with canonical parameter $\theta$, cumulant function $b(\theta)$, and dispersion $a(\phi)$.
- The cumulant function determines the moments: $E[Y] = b'(\theta)$ and $\text{Var}(Y) = a(\phi)\, b''(\theta) = a(\phi)\, V(\mu)$.
- The variance function $V(\mu)$ characterizes each family's mean-variance relationship, and the canonical link maps the mean to the canonical parameter.
- The score $U(\theta) = (Y - \mu)/a(\phi)$ and Fisher information $\mathcal{I}(\theta) = V(\mu)/a(\phi)$ drive maximum likelihood estimation via IRLS.
- The deviance measures fit and compares nested models; quasi-likelihood extends the machinery when only the mean-variance relationship can be trusted.
What's Next:
In the next page, we'll apply the GLM framework to a specific and important case: Poisson regression for modeling count data. We'll see how the exponential family structure manifests in this concrete application and develop intuition for when and how to use Poisson regression.
You now understand the exponential family—the mathematical foundation that unifies all GLMs. This theoretical understanding will deepen your grasp of specific models like Poisson and negative binomial regression, which we'll explore next.