What do the Normal, Binomial, Poisson, Gamma, and Inverse Gaussian distributions have in common? At first glance, they seem quite different—continuous vs. discrete, bounded vs. unbounded, symmetric vs. skewed. Yet they all share a profound mathematical structure: they are members of the exponential family of distributions.
This unification is not merely taxonomic. The exponential family structure is precisely what enables the GLM framework to exist. It guarantees:
- a simple sufficient statistic and score equations with a common form,
- a well-behaved (concave) log-likelihood that a single fitting algorithm (IRLS) can maximize,
- a systematic mean-variance relationship through the variance function $V(\mu)$, and
- a natural (canonical) link function for each distribution.
Understanding the exponential family transforms GLMs from a collection of techniques into a coherent theoretical framework. It reveals why certain distribution-link combinations work well and provides the tools for extending GLMs to new problems.
By the end of this page, you will: (1) recognize the exponential family form and identify its components, (2) understand how mean, variance, and canonical parameters are related through the cumulant function, (3) see how common distributions fit the exponential family framework, and (4) appreciate why exponential family membership enables the GLM machinery.
A probability distribution belongs to the exponential family if its density (or probability mass function) can be written in the form:
$$f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$
This is the canonical form of the exponential family. Let's understand each component:
1. The Canonical (Natural) Parameter: $\theta$
This is the primary parameter of interest, directly linked to the mean of the distribution. Different values of $\theta$ correspond to different distributions within the family. The canonical parameter lives on the "natural" scale where the sufficient statistic $Y$ appears linearly.
2. The Cumulant Function: $b(\theta)$
This is the normalizing function that ensures the density integrates to 1. It is also called the log-partition function because:
$$b(\theta) = a(\phi) \log \int \exp(y\theta / a(\phi)) \cdot \exp(c(y,\phi)) \, dy$$
Remarkably, $b(\theta)$ encodes all the information about the mean and variance of $Y$: $E[Y] = b'(\theta)$ and $\text{Var}(Y) = a(\phi)\, b''(\theta)$, as derived below.
3. The Dispersion Function: $a(\phi)$
The dispersion parameter $\phi$ controls the spread of the distribution. The function $a(\phi)$ scales this effect. For many common distributions, $a(\phi) = \phi/w$, where $w$ is a known prior weight (often simply $a(\phi) = \phi$).
When $\phi$ is known (as in Poisson and Binomial), we have a one-parameter exponential family. When $\phi$ is unknown (as in Normal and Gamma), we have a two-parameter family.
4. The Carrier Measure: $c(y, \phi)$
This is everything that doesn't depend on $\theta$. It includes normalizing terms that involve only $y$ and $\phi$, such as $-\log(y!)$ for the Poisson or $-\frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$ for the Normal.
For parameter estimation, $c(y, \phi)$ can often be ignored when $\phi$ is known or estimated separately.
The exponential family form guarantees that Y is a sufficient statistic for θ. This means all the information in the data about θ is captured by the sum Σyᵢ. This is why GLMs can be fit efficiently and why the score equations have simple forms.
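To make sufficiency concrete, here is a minimal sketch (the data are made up for illustration): two Poisson samples with the same total have log-likelihoods that differ only by a constant in $\lambda$, so they carry exactly the same information about the rate.

```python
import numpy as np
from scipy.special import gammaln

# Two made-up Poisson samples with the same total (15) but different values
y1 = np.array([1, 2, 3, 4, 5])
y2 = np.array([3, 3, 3, 3, 3])

def poisson_loglik(y, lam):
    # sum_i [y_i*log(lam) - lam - log(y_i!)] -- depends on the data only via sum(y) and n
    return np.sum(y) * np.log(lam) - len(y) * lam - np.sum(gammaln(y + 1))

lam_grid = np.linspace(0.5, 10, 200)
ll1 = np.array([poisson_loglik(y1, lam) for lam in lam_grid])
ll2 = np.array([poisson_loglik(y2, lam) for lam in lam_grid])

# The two curves differ only by a constant (the c(y, phi) terms), not in shape over lambda,
# so both samples yield the same MLE and the same inference about lambda.
print("spread of (ll1 - ll2):", np.ptp(ll1 - ll2))                                   # ~0: constant difference
print("MLE from each sample:", lam_grid[np.argmax(ll1)], lam_grid[np.argmax(ll2)])   # both ≈ 3
```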
One of the most elegant properties of the exponential family is that the cumulant function $b(\theta)$ completely determines the mean and variance of the distribution.
For any distribution in the exponential family:
$$E[Y] = \mu = b'(\theta) = \frac{db(\theta)}{d\theta}$$
Proof sketch: The log-likelihood is $\ell(\theta) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)$. Taking the expectation of the score:
$$E\left[\frac{\partial \ell}{\partial \theta}\right] = \frac{1}{a(\phi)}E[Y - b'(\theta)] = 0$$
Since the expected score is zero at the true parameter value, it follows that $E[Y] = b'(\theta)$.
Similarly, using the second Bartlett identity $E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right] + E\left[\left(\frac{\partial \ell}{\partial \theta}\right)^2\right] = 0$:
$$\text{Var}(Y) = a(\phi) \cdot b''(\theta) = a(\phi) \cdot V(\mu)$$
where $V(\mu) = b''(\theta)$ evaluated at the $\theta$ corresponding to $\mu$.
The function $V(\mu)$ is called the variance function. It characterizes how variance changes with the mean and is a defining feature of each exponential family distribution.
| Distribution | b(θ) | b'(θ) = μ | b''(θ) = V(μ) | Var(Y) |
|---|---|---|---|---|
| Normal | θ²/2 | θ (so θ=μ) | 1 | σ² (constant) |
| Bernoulli | log(1 + e^θ) | e^θ/(1+e^θ) = p | p(1-p) | p(1-p) |
| Poisson | e^θ | e^θ = μ | μ | μ (equals mean) |
| Gamma | -log(-θ) | -1/θ = μ | μ² | φμ² |
| Inverse Gaussian | -√(-2θ) | 1/√(-2θ) = μ | μ³ | φμ³ |
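As a quick numerical check of the table (a sketch with illustrative parameter values), we can verify the Gamma row by simulation: the sample mean should match $b'(\theta)$ and the sample variance should match $a(\phi)\, b''(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gamma with mean mu and shape nu: theta = -1/mu, b(theta) = -log(-theta), a(phi) = 1/nu
mu, nu = 4.0, 3.0
theta = -1.0 / mu
a_phi = 1.0 / nu

# Analytic results from the table
mean_theory = -1.0 / theta               # b'(theta) = -1/theta = mu
var_theory = a_phi * (1.0 / theta**2)    # a(phi) * b''(theta) = mu^2 / nu

# Simulate: numpy's gamma takes shape and scale; mean = shape*scale, so scale = mu/nu
y = rng.gamma(shape=nu, scale=mu / nu, size=200_000)

print(mean_theory, y.mean())   # both close to 4.0
print(var_theory, y.var())     # both close to 16/3 ≈ 5.33
```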
Remarkably, the variance function $V(\mu)$ uniquely identifies the exponential family distribution (up to the dispersion parameter). This is why $V(\mu) = 1$ points to the Normal, $V(\mu) = \mu$ to the Poisson, $V(\mu) = \mu(1-\mu)$ to the Bernoulli/Binomial, $V(\mu) = \mu^2$ to the Gamma, and $V(\mu) = \mu^3$ to the Inverse Gaussian.
Apart from the Binomial's $V(\mu) = \mu(1-\mu)$, these are the power variance functions $V(\mu) = \mu^p$ for $p = 0, 1, 2, 3$. The Tweedie family generalizes this to other values of $p$.
When you choose a GLM distribution, you're choosing a mean-variance relationship. Poisson assumes variance equals mean; Gamma assumes variance proportional to mean squared. If your data's variance pattern doesn't match the assumed V(μ), standard errors will be wrong and inference will be invalid—a problem addressed by quasi-likelihood methods.
Let's work through the exponential family representation of the most important distributions for GLMs.
The normal density is: $$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}$$
Rewriting: $$f(y; \mu, \sigma^2) = \exp\left\{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right\}$$
Identifying components:
- Canonical parameter: $\theta = \mu$
- Cumulant function: $b(\theta) = \theta^2/2$
- Dispersion: $a(\phi) = \phi = \sigma^2$
- Carrier: $c(y, \phi) = -\frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$
Verification: $b'(\theta) = \theta = \mu$ ✓; $b''(\theta) = 1$, so $V(\mu) = 1$ ✓
For a single Bernoulli trial: $$f(y; p) = p^y (1-p)^{1-y} = \exp\{y\log(p) + (1-y)\log(1-p)\}$$
$$= \exp\left\{y\log\left(\frac{p}{1-p}\right) + \log(1-p)\right\}$$
Identifying components:
- Canonical parameter: $\theta = \log\left(\frac{p}{1-p}\right)$ (the logit)
- Cumulant function: $b(\theta) = \log(1 + e^\theta)$, since $\log(1-p) = -\log(1 + e^\theta)$
- Dispersion: $a(\phi) = 1$
- Carrier: $c(y, \phi) = 0$
Verification: $b'(\theta) = \frac{e^\theta}{1+e^\theta} = p$ ✓; $b''(\theta) = p(1-p)$ ✓
This explains why the logit is the canonical link: it makes $\eta = \theta$ directly!
The Poisson PMF: $$f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = \exp\{y\log\lambda - \lambda - \log(y!)\}$$
Identifying components:
- Canonical parameter: $\theta = \log\lambda$
- Cumulant function: $b(\theta) = e^\theta = \lambda$
- Dispersion: $a(\phi) = 1$
- Carrier: $c(y, \phi) = -\log(y!)$
Verification: $b'(\theta) = e^\theta = \lambda = \mu$ ✓; $b''(\theta) = e^\theta = \mu$ ✓ (variance equals mean)
This explains why the log is the canonical link for Poisson!
The Gamma density (parameterized by mean $\mu$ and shape $\nu$): $$f(y; \mu, \nu) = \frac{1}{\Gamma(\nu)}\left(\frac{\nu}{\mu}\right)^\nu y^{\nu-1} \exp\left\{-\frac{\nu y}{\mu}\right\}$$
Rewriting in exponential family form: $$f(y; \mu, \nu) = \exp\left\{\frac{-y/\mu - \log\mu}{1/\nu} + (\nu-1)\log y + \nu\log\nu - \log\Gamma(\nu)\right\}$$
Identifying components:
- Canonical parameter: $\theta = -1/\mu$
- Cumulant function: $b(\theta) = -\log(-\theta) = \log\mu$
- Dispersion: $a(\phi) = \phi = 1/\nu$
- Carrier: $c(y, \phi) = (\nu - 1)\log y + \nu\log\nu - \log\Gamma(\nu)$
Verification: $b'(\theta) = -1/\theta = \mu$ ✓; $b''(\theta) = 1/\theta^2 = \mu^2$ ✓
The canonical link is the inverse: $g(\mu) = -1/\mu = \theta$.
Notice how each distribution's canonical link is simply the function that transforms the mean to the canonical parameter: identity for Normal (θ=μ), logit for Bernoulli (θ=logit(p)), log for Poisson (θ=log(μ)), inverse for Gamma (θ=-1/μ). This is not coincidence—it's the definition of canonical link!
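The following sketch (illustrative values only) makes this mapping explicit: for each family it applies the canonical link $g(\mu) = \theta$ and then checks that $b'(\theta)$ recovers the mean.

```python
import numpy as np

# Canonical link g(mu) = theta and its inverse b'(theta) = mu for four families
families = {
    "Normal":    (lambda mu: mu,                    lambda th: th),                     # identity
    "Bernoulli": (lambda mu: np.log(mu / (1 - mu)), lambda th: 1 / (1 + np.exp(-th))),  # logit
    "Poisson":   (lambda mu: np.log(mu),            lambda th: np.exp(th)),             # log
    "Gamma":     (lambda mu: -1 / mu,               lambda th: -1 / th),                # negative inverse
}

mu_test = {"Normal": 2.5, "Bernoulli": 0.3, "Poisson": 4.0, "Gamma": 4.0}

for name, (link, b_prime) in families.items():
    mu = mu_test[name]
    theta = link(mu)
    # b'(theta) should recover the mean, confirming the canonical link maps mu to theta
    print(f"{name:10s} theta = {theta:+.4f}, b'(theta) = {b_prime(theta):.4f}  (should equal mu = {mu})")
```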
Exponential family membership confers several crucial properties that make GLMs tractable.
For a sample $Y_1, \ldots, Y_n$ from an exponential family:
$$\sum_{i=1}^n Y_i \text{ is a sufficient statistic for } \theta$$
This means all information about $\theta$ contained in the sample is captured by this single summary. This is extremely powerful: estimation depends on the data only through $\sum_i y_i$ (or $\mathbf{X}^\top\mathbf{y}$ in the regression setting), which is why the score equations take such simple forms and why GLMs can be fit efficiently.
Under mild regularity conditions, the log-likelihood for exponential family distributions is concave in the canonical parameter $\theta$, because the cumulant function $b(\theta)$ is convex.
For GLMs with canonical links, the log-likelihood is concave in $\beta$ as well, guaranteeing that Newton-Raphson and related algorithms converge to the global MLE.
The MGF of an exponential family distribution has a simple form:
$$M(t) = E[e^{tY}] = \exp\left\{\frac{b(\theta + ta(\phi)) - b(\theta)}{a(\phi)}\right\}$$
This immediately gives the moments derived earlier: $E[Y] = M'(0) = b'(\theta)$ and $\text{Var}(Y) = M''(0) - M'(0)^2 = a(\phi)\, b''(\theta)$.
Every exponential family distribution has a conjugate prior in the Bayesian framework. The conjugate prior for $\theta$ is:
$$\pi(\theta) \propto \exp\{\theta \cdot (\text{prior sum}) - b(\theta) \cdot (\text{prior } n)\}$$
This makes Bayesian inference tractable: the posterior has the same functional form as the prior, with updated hyperparameters.
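As a concrete, hedged example of conjugacy: for the Poisson rate, a Gamma prior is the standard conjugate choice, and the posterior update only needs the sufficient statistics $\sum_i y_i$ and $n$ (the prior values below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)

# Gamma(alpha, beta) prior on the Poisson rate lambda (the conjugate pair for Poisson counts)
alpha_prior, beta_prior = 2.0, 1.0

# Observe some hypothetical counts
y = rng.poisson(lam=4.0, size=25)

# Conjugate update: posterior is Gamma(alpha + sum(y), beta + n)
alpha_post = alpha_prior + y.sum()
beta_post = beta_prior + len(y)

print("posterior mean of lambda:", alpha_post / beta_post)  # shrinks the sample mean toward the prior
print("sample mean:", y.mean())
```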
The cumulant generating function $K(t) = \log M(t)$ is:
$$K(t) = \frac{b(\theta + ta(\phi)) - b(\theta)}{a(\phi)}$$
The cumulants (mean, variance, skewness, kurtosis) can be read off from successive derivatives of $K(t)$ evaluated at $t=0$.
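A small sketch (parameter values are illustrative) evaluates $K(t)$ for the Poisson directly from the formula above and recovers the first two cumulants by finite differences.

```python
import numpy as np

# Poisson: b(theta) = exp(theta), a(phi) = 1, theta = log(lambda)
lam = 3.5
theta = np.log(lam)

def K(t):
    # K(t) = [b(theta + t*a(phi)) - b(theta)] / a(phi), with a(phi) = 1
    return np.exp(theta + t) - np.exp(theta)

# Numerical derivatives of K at t = 0 (central differences)
h = 1e-4
K1 = (K(h) - K(-h)) / (2 * h)           # first cumulant  = mean
K2 = (K(h) - 2 * K(0) + K(-h)) / h**2   # second cumulant = variance

print(K1, K2)  # both ≈ lambda = 3.5, as expected since the Poisson mean equals its variance
```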
These properties are what make GLMs a unified framework rather than a collection of ad hoc methods. Sufficient statistics mean efficient estimation; concave likelihoods mean reliable optimization; conjugate priors mean tractable Bayesian extensions. Without exponential family structure, none of this would hold.
The exponential family structure yields elegant expressions for the score function and Fisher information—the workhorses of maximum likelihood estimation.
For a single observation $Y$ from the exponential family, the log-likelihood is:
$$\ell(\theta) = \frac{Y\theta - b(\theta)}{a(\phi)} + c(Y, \phi)$$
The score function is the derivative with respect to $\theta$:
$$U(\theta) = \frac{\partial \ell}{\partial \theta} = \frac{Y - b'(\theta)}{a(\phi)} = \frac{Y - \mu}{a(\phi)}$$
Note: The score is proportional to $(Y - \mu)$—the deviation of the observation from its expected value!
$$E[U(\theta)] = 0$$
The score has mean zero at the true parameter value. This is the basis for the score equations.
$$\text{Var}(U(\theta)) = E[U(\theta)^2] = \frac{\text{Var}(Y)}{a(\phi)^2} = \frac{b''(\theta)}{a(\phi)}$$
The Fisher information is the variance of the score, or equivalently, the negative expected second derivative of the log-likelihood:
$$\mathcal{I}(\theta) = E\left[-\frac{\partial^2 \ell}{\partial \theta^2}\right] = \frac{b''(\theta)}{a(\phi)} = \frac{V(\mu)}{a(\phi)}$$
where $V(\mu)$ is the variance function.
For a GLM, we're interested in the score and information in terms of $\beta$. Using the chain rule:
$$U(\beta) = \frac{\partial \ell}{\partial \beta} = \sum_i \frac{(Y_i - \mu_i)}{a(\phi) V(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot \mathbf{x}_i$$
With the canonical link (where $\theta = \eta$), this simplifies beautifully:
$$U(\beta) = \frac{1}{a(\phi)} \mathbf{X}^\top (\mathbf{Y} - \boldsymbol{\mu})$$
The score is just the cross-product of the design matrix with the residuals—the same form as in ordinary least squares!
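The sketch below (simulated data, arbitrary coefficients) verifies this for logistic regression with the canonical logit link: the analytic score $\mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu})$ matches a numerical gradient of the Bernoulli log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small hypothetical logistic-regression setup (canonical logit link)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, -2.0])
prob = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob)

def score(beta):
    # Canonical-link score: U(beta) = X^T (y - mu), with a(phi) = 1 for Bernoulli
    mu = 1 / (1 + np.exp(-X @ beta))
    return X.T @ (y - mu)

def loglik(beta):
    # Bernoulli log-likelihood: sum_i [y_i * eta_i - log(1 + exp(eta_i))]
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# Check against a numerical gradient at an arbitrary beta
beta0 = np.array([0.2, -0.3, 0.5])
h = 1e-6
num_grad = np.array([(loglik(beta0 + h * e) - loglik(beta0 - h * e)) / (2 * h) for e in np.eye(p)])
print(np.allclose(score(beta0), num_grad, atol=1e-4))  # True
```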
```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrate score function for Poisson GLM

# True parameter
lambda_true = 5.0
y_obs = 7  # A single observation

# Range of lambda values to evaluate
lambda_range = np.linspace(0.5, 15, 200)

# Log-likelihood for Poisson: y*log(λ) - λ (the constant -log(y!) is dropped)
loglik = y_obs * np.log(lambda_range) - lambda_range

# Score function: ∂ℓ/∂λ = y/λ - 1
score = y_obs / lambda_range - 1

# Fisher information: 1/λ (for a single observation)
fisher_info = 1 / lambda_range

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Log-likelihood
axes[0].plot(lambda_range, loglik, 'b-', linewidth=2)
axes[0].axvline(x=y_obs, color='r', linestyle='--', alpha=0.7, label=f'MLE = y = {y_obs}')
axes[0].axvline(x=lambda_true, color='g', linestyle=':', alpha=0.7, label=f'True λ = {lambda_true}')
axes[0].set_xlabel('λ', fontsize=12)
axes[0].set_ylabel('Log-Likelihood', fontsize=12)
axes[0].set_title('Poisson Log-Likelihood\n(concave in λ)', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Score function
axes[1].plot(lambda_range, score, 'r-', linewidth=2)
axes[1].axhline(y=0, color='gray', linestyle='-', alpha=0.5)
axes[1].axvline(x=y_obs, color='b', linestyle='--', alpha=0.7, label=f'Score = 0 at λ = {y_obs}')
axes[1].set_xlabel('λ', fontsize=12)
axes[1].set_ylabel('Score U(λ)', fontsize=12)
axes[1].set_title('Score Function\nU(λ) = y/λ - 1', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Fisher information
axes[2].plot(lambda_range, fisher_info, 'g-', linewidth=2)
axes[2].set_xlabel('λ', fontsize=12)
axes[2].set_ylabel('Fisher Information I(λ)', fontsize=12)
axes[2].set_title('Fisher Information\nI(λ) = 1/λ', fontsize=14)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('score_information.png', dpi=150, bbox_inches='tight')
plt.show()
```

The Fisher information matrix determines the weights in Iteratively Reweighted Least Squares (IRLS), the standard algorithm for fitting GLMs. At each iteration, we solve a weighted least squares problem with weights w_i = 1/(V(μ_i) · [g'(μ_i)]²). This is Fisher scoring: Newton-Raphson with the observed information replaced by its expectation, the Fisher information.
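To make the IRLS weights concrete, here is a minimal Fisher-scoring sketch for a Poisson GLM with log link (illustrative, not production code); for the log link, $w_i = 1/(V(\mu_i)[g'(\mu_i)]^2) = \mu_i$.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Fit a Poisson GLM with log link by IRLS (Fisher scoring). Minimal sketch."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        # IRLS weights: w_i = 1 / (V(mu_i) * g'(mu_i)^2) = mu_i for the log-link Poisson
        w = mu
        # Working response: z_i = eta_i + (y_i - mu_i) * g'(mu_i)
        z = eta + (y - mu) / mu
        # Weighted least squares step: solve (X^T W X) beta = X^T W z
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Quick check on simulated data (hypothetical example)
rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5])
y = rng.poisson(np.exp(X @ beta_true))
print(irls_poisson(X, y))  # should be close to [1.0, 0.5]
```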
The deviance is a central concept in GLM theory, providing both a measure of model fit and a basis for comparing nested models.
The deviance is defined as:
$$D = 2\phi \left[\ell(\text{saturated model}) - \ell(\text{fitted model})\right]$$
where the saturated model has one parameter per observation (so $\hat{\mu}_i = y_i$), achieving the highest log-likelihood possible for the data, and the fitted model is the GLM under consideration with $p$ parameters.
Equivalently, writing both log-likelihoods in exponential family form (taking $a(\phi) = \phi$):
$$D = 2\sum_{i=1}^n \left[y_i(\tilde{\theta}_i - \hat{\theta}_i) - b(\tilde{\theta}_i) + b(\hat{\theta}_i)\right]$$
where $\tilde{\theta}_i$ is the canonical parameter in the saturated model and $\hat{\theta}_i$ is in the fitted model.
The deviance decomposes as: $$D = \sum_{i=1}^n d_i^2$$
where $d_i^2$ is observation $i$'s contribution to the deviance (tabulated below) and $d_i = \operatorname{sign}(y_i - \hat{\mu}_i)\sqrt{d_i^2}$ is the deviance residual for observation $i$.
For specific distributions:
| Distribution | Individual Deviance Contribution |
|---|---|
| Normal | $(y_i - \hat{\mu}_i)^2$ |
| Poisson | $2[y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]$ |
| Binomial | $2[y_i \log(y_i/\hat{\mu}_i) + (n_i-y_i)\log((n_i-y_i)/(n_i-\hat{\mu}_i))]$ |
| Gamma | $2[-\log(y_i/\hat{\mu}_i) + (y_i - \hat{\mu}_i)/\hat{\mu}_i]$ |
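A short check (made-up counts, intercept-only fit so that $\hat{\mu}_i = \bar{y}$) confirms that the Poisson row of the table agrees with the definition $D = 2\phi[\ell(\text{saturated}) - \ell(\text{fitted})]$ with $\phi = 1$.

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical counts; intercept-only Poisson fit, so mu_hat = ybar for every observation
y = np.array([2, 0, 3, 5, 1, 4, 2, 6, 3, 2])
mu_hat = np.full_like(y, y.mean(), dtype=float)

def poisson_loglik(y, mu):
    return np.sum(y * np.log(mu) - mu - gammaln(y + 1))

# Definition: D = 2 * [loglik(saturated) - loglik(fitted)]; the saturated model sets mu = y
# (using the convention 0*log(0) = 0 for zero counts)
loglik_sat = np.sum(np.where(y > 0, y * np.log(np.where(y > 0, y, 1)) - y, 0.0) - gammaln(y + 1))
D_def = 2 * (loglik_sat - poisson_loglik(y, mu_hat))

# Table formula: sum of 2*[y*log(y/mu_hat) - (y - mu_hat)], with y*log(y/mu_hat) = 0 when y = 0
d_i = 2 * (np.where(y > 0, y * np.log(np.where(y > 0, y, 1) / mu_hat), 0.0) - (y - mu_hat))
print(D_def, d_i.sum())  # identical
```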
1. Goodness of Fit
For some GLMs (notably Poisson and Binomial with grouped data), the deviance has an approximate $\chi^2$ distribution under the null hypothesis that the model is correct:
$$D \stackrel{\text{approx}}{\sim} \chi^2_{n-p}$$
A large deviance relative to its degrees of freedom suggests lack of fit.
2. Model Comparison (Likelihood Ratio Tests)
For nested models (Model 1 ⊂ Model 2):
$$\Delta D = D_1 - D_2 \;\stackrel{\text{approx}}{\sim}\; \chi^2_{p_2 - p_1}$$
under the null hypothesis that the simpler model is adequate (when $\phi \neq 1$, use the scaled difference $\Delta D / \hat{\phi}$ or an F-test).
3. Pseudo-R²
The McFadden pseudo-R² is: $$R^2_{\text{McFadden}} = 1 - \frac{\ell(\text{fitted})}{\ell(\text{null})}$$
Alternatively, using deviances: $$R^2_{\text{deviance}} = 1 - \frac{D}{D_{\text{null}}}$$
where $D_{\text{null}}$ is the deviance of the null model (intercept only).
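A sketch of both uses on simulated count data, assuming statsmodels is available (the data and coefficients are illustrative):

```python
import numpy as np
from scipy.stats import chi2
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Simulated counts depending on one covariate (hypothetical example)
n = 300
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.8 * x))

X_null = np.ones((n, 1))                    # intercept only
X_full = np.column_stack([np.ones(n), x])   # intercept + covariate

fit_null = sm.GLM(y, X_null, family=sm.families.Poisson()).fit()
fit_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()

# Likelihood ratio test via the deviance difference
delta_D = fit_null.deviance - fit_full.deviance
df_diff = fit_null.df_resid - fit_full.df_resid
print("Delta D:", delta_D, " p-value:", chi2.sf(delta_D, df_diff))

# Deviance-based pseudo-R^2
print("R2_deviance:", 1 - fit_full.deviance / fit_null.deviance)
```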
The chi-squared approximation for deviance holds only asymptotically and requires "large" counts for Poisson/Binomial. With sparse data (many cells with small expected counts), the approximation breaks down. Always examine deviance residuals graphically rather than relying solely on the chi-squared test.
What if your data doesn't exactly follow an exponential family distribution? The quasi-likelihood approach extends GLM methodology to situations where only the mean-variance relationship is specified.
In practice, real data often exhibit overdispersion—variance greater than that assumed by the model. For example, counts are frequently more variable than the Poisson's $\text{Var}(Y) = \mu$ allows, and proportions often vary more than the Binomial's $np(1-p)$.
Causes include unobserved heterogeneity between units, clustering or correlation among observations, omitted covariates, and excess zeros.
Quasi-likelihood specifies only the mean model, $g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}$, and the variance relationship, $\text{Var}(Y_i) = \phi\, V(\mu_i)$.
No full distributional assumption is needed. We estimate $\beta$ by solving the quasi-score equations:
$$\sum_i \frac{(Y_i - \mu_i)}{V(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \beta_j} = 0$$
These are identical to the GLM score equations, so point estimates are unchanged. However, we estimate $\phi$ from the data:
$$\hat{\phi} = \frac{1}{n-p} \sum_i \frac{(Y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}$$
Standard errors are then inflated by $\sqrt{\hat{\phi}}$.
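A sketch of the quasi-Poisson workflow on simulated overdispersed counts, assuming statsmodels is available (data-generating values are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Overdispersed counts: Poisson rates with extra Gamma heterogeneity (hypothetical example)
n = 400
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.6 * x)
y = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=n))  # Var(Y) > E[Y]

X = np.column_stack([np.ones(n), x])
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Dispersion estimate from Pearson residuals: phi_hat = sum((y - mu_hat)^2 / V(mu_hat)) / (n - p)
mu_hat = fit.fittedvalues
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - X.shape[1])
print("phi_hat:", phi_hat)  # noticeably above 1 for overdispersed data

# Quasi-Poisson standard errors: inflate the Poisson SEs by sqrt(phi_hat)
print("Poisson SEs:      ", fit.bse)
print("Quasi-Poisson SEs:", fit.bse * np.sqrt(phi_hat))
```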
When you observe overdispersion (deviance >> residual df), options include: (1) quasi-likelihood (quickest fix), (2) negative binomial model for counts (next page), (3) beta-binomial for proportions, (4) mixed-effects models that explicitly model heterogeneity.
We've explored the exponential family of distributions—the mathematical foundation that makes GLMs possible. Let's consolidate the key concepts:
- The canonical form $f(y; \theta, \phi) = \exp\{[y\theta - b(\theta)]/a(\phi) + c(y,\phi)\}$, with canonical parameter $\theta$, cumulant function $b(\theta)$, and dispersion $a(\phi)$.
- The cumulant function determines the moments: $E[Y] = b'(\theta)$ and $\text{Var}(Y) = a(\phi)\, b''(\theta) = a(\phi)\, V(\mu)$.
- The variance function $V(\mu)$ characterizes each family's mean-variance relationship, and the canonical link maps the mean to the canonical parameter.
- The score $U(\theta) = (Y - \mu)/a(\phi)$ and Fisher information $\mathcal{I}(\theta) = V(\mu)/a(\phi)$ drive maximum likelihood estimation via IRLS.
- The deviance measures fit and compares nested models; quasi-likelihood extends the machinery when only the mean-variance relationship can be trusted.
What's Next:
In the next page, we'll apply the GLM framework to a specific and important case: Poisson regression for modeling count data. We'll see how the exponential family structure manifests in this concrete application and develop intuition for when and how to use Poisson regression.
You now understand the exponential family—the mathematical foundation that unifies all GLMs. This theoretical understanding will deepen your grasp of specific models like Poisson and negative binomial regression, which we'll explore next.