Throughout your journey in machine learning, you've encountered a fundamental tension: linear regression assumes a continuous, unbounded response variable with normally distributed errors—yet real-world data often violates these assumptions dramatically. What do you do when your response is a count of events? A probability? A strictly positive amount? A categorical outcome?
For decades, statisticians developed specialized techniques for each case: logistic regression for binary outcomes, Poisson regression for counts, gamma regression for positive continuous values. These methods seemed unrelated—each with its own derivation, assumptions, and estimation procedures.
Then came the Generalized Linear Model (GLM) framework, introduced by John Nelder and Robert Wedderburn in their landmark 1972 paper. GLMs revealed that these seemingly disparate methods are special cases of a single, elegant mathematical structure. This unification wasn't merely taxonomic—it provided deep insights into when and why each method works, how to extend them, and how to diagnose their failures.
By the end of this page, you will understand the three fundamental components of any GLM: the random component (distribution of Y), the systematic component (linear predictor), and the link function that connects them. You'll see how this framework unifies logistic regression, Poisson regression, and ordinary linear regression as special cases—and you'll develop the conceptual foundation needed to derive new GLMs for novel problems.
Before introducing the GLM framework, we must clearly understand why ordinary linear regression (OLS) fails for many real-world problems. This understanding motivates the need for a more general approach.
The standard linear regression model assumes:
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i$$
where the errors $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed (iid) normal random variables. Equivalently:
$$Y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
This formulation embeds three critical assumptions that severely limit applicability: the response can take any real value (unbounded support), the error variance $\sigma^2$ is constant regardless of the mean (homoscedasticity), and the errors are normally distributed.
Concrete examples of failure:
Example 1: Binary Classification. Suppose you want to predict whether a patient has a disease (Y = 1) or not (Y = 0) based on clinical features. Fitting linear regression would produce predictions $\hat{Y}$ that can be negative or greater than 1—nonsensical as probabilities. Moreover, the variance of Y when Y ∈ {0,1} is p(1-p), which depends on the probability p, violating homoscedasticity.
Example 2: Count Data. You want to model the number of customer complaints per day. Complaints cannot be negative, and the variance often increases with the expected count (more complaints means more variability). Linear regression might predict -3 complaints on a slow day.
Example 3: Positive Continuous Data. You want to predict insurance claim amounts, which must be strictly positive and often have right-skewed distributions. Linear regression could predict negative claim amounts and assumes symmetric errors.
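To make Example 1 concrete, here is a minimal sketch (simulated data and hypothetical coefficients, not from any real study) showing ordinary least squares producing "probabilities" outside [0, 1] when fit to a binary response:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a binary outcome whose probability rises with x
n = 200
x = rng.uniform(-4, 4, n)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))   # true P(Y=1 | x), hypothetical coefficients
y = rng.binomial(1, p)                    # observed 0/1 labels

# Fit ordinary least squares: y ≈ b0 + b1 * x
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# The OLS "probabilities" escape the [0, 1] interval at the extremes
print(f"min prediction: {y_hat.min():.3f}")   # typically below 0
print(f"max prediction: {y_hat.max():.3f}")   # typically above 1
```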
Ordinary linear regression forces a square peg into a round hole. The real question isn't 'How do we fix linear regression for each case?' but rather 'What is the general mathematical framework that naturally handles diverse response types?' That framework is the Generalized Linear Model.
A Generalized Linear Model consists of three essential components that work together to model the relationship between predictors and response. Understanding these components is the key to mastering the entire framework.
The random component specifies the probability distribution of the response variable $Y_i$ conditional on the predictors. In a GLM, this distribution must belong to the exponential family—a rich class of distributions including normal, binomial, Poisson, gamma, inverse Gaussian, and many others.
Formally, we specify that:
$$Y_i \sim \text{ExponentialFamily}(\theta_i, \phi)$$
where $\theta_i$ is the canonical (natural) parameter that depends on the predictors, and $\phi$ is a dispersion parameter that controls the variance.
The choice of distribution should reflect your prior knowledge about the data-generating process: Bernoulli or binomial for binary outcomes and proportions, Poisson for counts, gamma or inverse Gaussian for positive, right-skewed continuous values, and normal for unbounded continuous responses.
The systematic component is the linear predictor $\eta_i$, which captures how predictors combine to influence the response:
$$\eta_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} = \mathbf{x}_i^\top \boldsymbol{\beta}$$
This is the only place where predictors appear in the model. The linear predictor is always a linear combination of parameters $\boldsymbol{\beta}$, though the predictors $X_{ij}$ themselves can be nonlinear transformations of raw features (polynomials, interactions, basis expansions).
Critical insight: The systematic component is identical across all GLMs. Whether you're fitting logistic regression, Poisson regression, or any other GLM, the linear predictor has the same form. What differs is how this linear predictor relates to the response distribution.
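To illustrate that the linear predictor is always linear in the coefficients, even when the design matrix contains nonlinear transformations of the raw features, here is a brief sketch (feature names and coefficient values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(20, 70, 5)
dose = rng.uniform(0, 10, 5)

# Design matrix: intercept, raw features, a squared term, and an interaction.
# Each column may be a nonlinear transformation of the raw inputs,
# but eta is still a linear combination of the coefficients beta.
X = np.column_stack([np.ones_like(age), age, dose, dose**2, age * dose])
beta = np.array([-1.0, 0.02, 0.5, -0.03, 0.001])  # hypothetical coefficients

eta = X @ beta   # the systematic component, identical in form for every GLM
print(eta)
```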
The link function $g(\cdot)$ is the critical connection between the random and systematic components. It relates the expected value of Y to the linear predictor:
$$g(\mu_i) = \eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
where $\mu_i = E[Y_i | \mathbf{x}_i]$ is the conditional mean of the response.
Equivalently, the mean is expressed as the inverse of the link function applied to the linear predictor:
$$\mu_i = g^{-1}(\eta_i) = g^{-1}(\mathbf{x}_i^\top \boldsymbol{\beta})$$
The link function serves two essential purposes: it maps the (possibly restricted) mean space, such as $(0,1)$ or $(0,\infty)$, onto the entire real line so that no value of the linear predictor produces an invalid mean, and it defines the scale on which predictor effects are additive.
| Component | Symbol | Purpose | Example (Logistic) |
|---|---|---|---|
| Random Component | Y ~ F(θ, φ) | Distribution of response | Y ~ Bernoulli(p) |
| Systematic Component | η = X^T β | How predictors combine | η = β₀ + β₁X₁ + ... |
| Link Function | g(μ) = η | Connects mean to linear predictor | logit(p) = η |
The genius of the GLM framework is that by choosing the appropriate combination of distribution and link function, you can construct models tailored to virtually any response type—while the estimation, inference, and diagnostic procedures remain essentially the same across all cases.
Now that we've introduced the three components conceptually, let's formalize the complete GLM specification. This mathematical precision is essential for understanding estimation, inference, and the relationships between different models.
Complete GLM Specification:
For independent observations $(Y_i, \mathbf{x}_i)$, $i = 1, \ldots, n$, a Generalized Linear Model is defined by:
1. Random Component: $$Y_i \mid \mathbf{x}_i \sim F(\mu_i, \phi)$$
where $F$ is from the exponential family with mean $\mu_i$ and dispersion $\phi$.
2. Systematic Component: $$\eta_i = \mathbf{x}_i^\top \boldsymbol{\beta} = \sum_{j=0}^{p} \beta_j x_{ij}$$
3. Link Function: $$g(\mu_i) = \eta_i$$
The link function $g: \mathcal{M} \to \mathbb{R}$ is a smooth, monotonic function that maps the mean space $\mathcal{M}$ (which may be restricted, like $(0,1)$ or $(0, \infty)$) to the entire real line.
The Mean-Variance Relationship:
For exponential family distributions, the variance is a function of the mean:
$$\text{Var}(Y_i) = \phi \cdot V(\mu_i)$$
where $V(\mu)$ is the variance function characteristic of the distribution. This relationship is fundamental:
| Distribution | Variance Function V(μ) | Var(Y) |
|---|---|---|
| Normal | 1 | σ² (constant) |
| Poisson | μ | μ |
| Binomial (proportion) | μ(1-μ) | μ(1-μ)/n |
| Gamma | μ² | φμ² |
| Inverse Gaussian | μ³ | φμ³ |
This variance function encodes a key property: the variance changes with the mean in a distribution-specific way. GLMs automatically account for this heteroscedasticity, unlike OLS.
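A quick simulation (a sketch, not part of the original derivation) makes the mean-variance relationship tangible: for Poisson data the sample variance tracks the mean, while for normal data it stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(42)
means = np.array([1.0, 5.0, 20.0, 50.0])

for m in means:
    pois = rng.poisson(m, 100_000)        # Poisson: Var(Y) = mu
    norm = rng.normal(m, 2.0, 100_000)    # Normal: Var(Y) = sigma^2 = 4, constant
    print(f"mu={m:5.1f}  Poisson var={pois.var():7.2f}  Normal var={norm.var():5.2f}")
```

The visualization below then walks through all three components for the logistic case.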
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit  # Logistic sigmoid

# Illustrate the three GLM components for logistic regression

# Generate predictor values
x = np.linspace(-3, 3, 100)

# Systematic component: linear predictor
beta_0, beta_1 = 0.5, 1.2
eta = beta_0 + beta_1 * x  # Linear predictor can range from -∞ to +∞

# Link function: logit link (and its inverse, the sigmoid)
# For logistic regression: g(μ) = log(μ / (1-μ)) = logit(μ)
# Inverse link: g^{-1}(η) = exp(η) / (1 + exp(η)) = sigmoid(η)
mu = expit(eta)  # Mean lies in (0, 1)

# Variance function for Bernoulli: V(μ) = μ(1-μ)
variance = mu * (1 - mu)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Panel 1: Linear predictor
axes[0].plot(x, eta, 'b-', linewidth=2)
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Predictor X', fontsize=12)
axes[0].set_ylabel('Linear Predictor η', fontsize=12)
axes[0].set_title('Systematic Component\nη = β₀ + β₁X', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Panel 2: Mean response (after inverse link)
axes[1].plot(x, mu, 'r-', linewidth=2)
axes[1].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Predictor X', fontsize=12)
axes[1].set_ylabel('Mean μ = P(Y=1)', fontsize=12)
axes[1].set_title('After Inverse Link\nμ = sigmoid(η)', fontsize=14)
axes[1].set_ylim(-0.05, 1.05)
axes[1].grid(True, alpha=0.3)

# Panel 3: Variance function
axes[2].plot(mu, variance, 'g-', linewidth=2)
axes[2].set_xlabel('Mean μ', fontsize=12)
axes[2].set_ylabel('Variance V(μ)', fontsize=12)
axes[2].set_title('Variance Function\nV(μ) = μ(1-μ)', fontsize=14)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('glm_components.png', dpi=150, bbox_inches='tight')
plt.show()
```

In OLS, we model E[Y] directly as a linear function of X. In GLMs, we model g(E[Y]) as a linear function of X. This simple transformation—modeling a function of the mean rather than the mean itself—is what enables GLMs to handle diverse response types while maintaining the tractability of linear models.
The power of the GLM framework becomes evident when we see how familiar models emerge as special cases. Each choice of distribution and link function yields a different member of the GLM family.
Distribution: $Y_i \sim \mathcal{N}(\mu_i, \sigma^2)$ (Normal)
Link Function: $g(\mu) = \mu$ (Identity link)
Inverse Link: $\mu = \eta$
Variance Function: $V(\mu) = 1$
$$\mu_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
With the identity link and normal distribution, GLM reduces exactly to ordinary least squares. The mean equals the linear predictor, and variance is constant.
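As a sanity check, a Gaussian GLM with the identity link should reproduce OLS coefficients exactly. Here is a minimal sketch using statsmodels on simulated data (any GLM-capable library would do):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)  # hypothetical linear relationship
X = sm.add_constant(x)

ols_fit = sm.OLS(y, X).fit()
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()

print(ols_fit.params)   # intercept and slope from least squares
print(glm_fit.params)   # identical estimates from the Gaussian GLM
```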
Distribution: $Y_i \sim \text{Bernoulli}(p_i)$ (Binomial with n=1)
Link Function: $g(p) = \log\left(\frac{p}{1-p}\right)$ (Logit link)
Inverse Link: $p = \frac{e^\eta}{1 + e^\eta} = \frac{1}{1 + e^{-\eta}}$
Variance Function: $V(p) = p(1-p)$
$$\log\left(\frac{p_i}{1-p_i}\right) = \mathbf{x}_i^\top \boldsymbol{\beta}$$
The logit link maps probabilities in (0,1) to the real line. The log-odds (logit) has a linear relationship with predictors. The variance is maximal at p=0.5 and decreases toward 0 and 1.
Distribution: $Y_i \sim \text{Poisson}(\lambda_i)$
Link Function: $g(\mu) = \log(\mu)$ (Log link)
Inverse Link: $\mu = e^\eta$
Variance Function: $V(\mu) = \mu$
$$\log(\lambda_i) = \mathbf{x}_i^\top \boldsymbol{\beta}$$
The log link ensures the mean count is always positive ($e^\eta > 0$ for any $\eta$). Predictors have multiplicative effects on the mean: a unit increase in $X_j$ multiplies the expected count by $e^{\beta_j}$.
Distribution: $Y_i \sim \text{Gamma}(\alpha, \beta_i)$ (parameterized with mean $\mu_i$)
Link Function: $g(\mu) = -1/\mu$ (Inverse link) or $g(\mu) = \log(\mu)$ (Log link)
Variance Function: $V(\mu) = \mu^2$
Gamma regression is ideal for positive continuous data where variance increases with the square of the mean—common in financial data, insurance claims, and survival times.
| Model Name | Response Type | Distribution | Canonical Link | Common Alternative Links |
|---|---|---|---|---|
| Linear Regression | Continuous, unbounded | Normal | Identity | — |
| Logistic Regression | Binary (0/1) | Bernoulli | Logit | Probit, Complementary log-log |
| Binomial Regression | Proportions (k/n) | Binomial | Logit | Probit, Complementary log-log |
| Poisson Regression | Counts (0,1,2,...) | Poisson | Log | Identity, Square root |
| Gamma Regression | Positive continuous | Gamma | Inverse | Log, Identity |
| Inverse Gaussian | Positive continuous | Inverse Gaussian | Inverse squared | Log, Identity |
What appears to be a diverse zoo of regression techniques is revealed to be a single parametric family. This unification means that once you understand GLMs, you can derive new models for novel problems, understand the relationships between existing models, and apply the same diagnostic and inference tools across all cases.
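The table above translates directly into code: in statsmodels, for example, each row is simply a different `family` argument to the same `GLM` call. A hedged sketch with simulated data (coefficients are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)

# Simulate responses consistent with each model's assumptions
y_binary = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.8 * x))))  # logistic mechanism
y_counts = rng.poisson(np.exp(0.5 + 0.4 * x))                    # Poisson mechanism

logit_fit = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()
pois_fit = sm.GLM(y_counts, X, family=sm.families.Poisson()).fit()

print(logit_fit.params)   # estimates on the log-odds scale
print(pois_fit.params)    # estimates on the log-count scale
```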
For each exponential family distribution, there exists a special link function called the canonical link. This link has deep mathematical significance and practical advantages.
Recall that the density of an exponential family distribution can be written as:
$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$$
where $\theta$ is the canonical (natural) parameter, $b(\theta)$ is the cumulant function, $a(\phi)$ is a dispersion function (commonly $a(\phi) = \phi$), and $c(y, \phi)$ is a normalizing term that does not involve $\theta$.
The mean is related to the canonical parameter by: $$\mu = b'(\theta) \quad \Rightarrow \quad \theta = (b')^{-1}(\mu)$$
The canonical link is defined as: $$g(\mu) = \theta = (b')^{-1}(\mu)$$
In words: the canonical link is the function that maps the mean to the canonical parameter.
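As a worked instance of this definition, take the Poisson case (using the cumulant function listed in the table below): since $b(\theta) = e^{\theta}$, the mean is $\mu = b'(\theta) = e^{\theta}$, so inverting gives $\theta = \log(\mu)$, and therefore
$$g(\mu) = (b')^{-1}(\mu) = \log(\mu)$$
which is exactly the log link used in Poisson regression.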
| Distribution | Canonical Parameter θ | Cumulant b(θ) | Mean μ = b'(θ) | Canonical Link g(μ) |
|---|---|---|---|---|
| Normal | μ | θ²/2 | θ | Identity: g(μ) = μ |
| Bernoulli | log(p/(1-p)) | log(1 + e^θ) | e^θ/(1+e^θ) | Logit: g(p) = log(p/(1-p)) |
| Poisson | log(μ) | e^θ | e^θ | Log: g(μ) = log(μ) |
| Gamma | -1/μ | -log(-θ) | -1/θ | Inverse: g(μ) = -1/μ |
| Inverse Gaussian | -1/(2μ²) | -√(-2θ) | 1/√(-2θ) | Inverse squared: g(μ) = 1/μ² |
While canonical links have mathematical advantages, they are NOT always the best practical choice. You may use non-canonical links when: (1) the canonical link produces predictions in an inconvenient scale, (2) subject-matter knowledge suggests a different relationship, or (3) you want comparability with established conventions. For example, log links are often used for gamma regression despite the canonical inverse link, because the log scale is more interpretable.
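In practice, swapping the canonical link for an alternative is usually a one-line change. A sketch using statsmodels with simulated gamma data (the specific link classes and parameter values here are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x = rng.normal(size=300)
X = sm.add_constant(x)
mu = np.exp(1.0 + 0.4 * x)                        # positive mean, hypothetical coefficients
y = rng.gamma(shape=2.0, scale=mu / 2.0)          # Gamma response with mean mu

# Canonical (inverse) link vs. the more interpretable log link
fit_inverse = sm.GLM(y, X, family=sm.families.Gamma()).fit()
fit_log = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print(fit_inverse.params)   # coefficients on the 1/mu scale
print(fit_log.params)       # coefficients on the log(mu) scale
```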
One of the most important skills in applied GLM modeling is correctly interpreting the estimated parameters. The interpretation depends critically on the link function used.
In a GLM with link function $g$:
$$g(\mu) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
The coefficient $\beta_j$ represents the change in $g(\mu)$ associated with a one-unit increase in $X_j$, holding all other predictors constant.
To interpret in terms of the mean $\mu$, we must apply $g^{-1}$—which gives different interpretations for different links.
$$\mu = \beta_0 + \beta_1 X_1$$
Interpretation: A one-unit increase in $X_1$ is associated with a $\beta_1$-unit additive change in the expected response.
If $\beta_1 = 2.5$, with income measured in thousands of dollars:
'Each additional year of education is associated with $2,500 higher annual income, on average.'
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1$$
Interpretation: A one-unit increase in $X_1$ is associated with a $\beta_1$ change in the log-odds. Exponentiating: the odds are multiplied by $e^{\beta_1}$.
If $\beta_1 = 0.7$:
'Each additional year of age multiplies the odds of disease by $e^{0.7} \approx 2.01$—roughly doubling the odds per year.'
$$\log(\mu) = \beta_0 + \beta_1 X_1$$
Interpretation: A one-unit increase in $X_1$ is associated with a $\beta_1$ change in the log-mean. Exponentiating: the expected count is multiplied by $e^{\beta_1}$.
If $\beta_1 = 0.3$:
'Each additional advertisement exposure multiplies the expected number of purchases by $e^{0.3} \approx 1.35$—a 35% increase.'
$$\frac{1}{\mu} = \beta_0 + \beta_1 X_1$$
Interpretation: A one-unit increase in $X_1$ is associated with a $\beta_1$ change in the inverse of the mean.
This is less intuitive; practitioners often switch to a log link for interpretability. For example, with a log link and $\beta_1 = 0.4$:
'Each additional complication multiplies expected hospital cost by $e^{0.4} \approx 1.49$—a 49% increase.'
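The examples above all reduce to the same arithmetic: exponentiate a coefficient to obtain a multiplicative effect. A small sketch using the coefficients quoted in this section:

```python
import numpy as np

# Logit link: beta = 0.7 -> odds ratio per one-unit increase
print(np.exp(0.7))   # ≈ 2.01: the odds roughly double

# Log link (Poisson): beta = 0.3 -> rate ratio per one-unit increase
print(np.exp(0.3))   # ≈ 1.35: a 35% increase in the expected count

# Log link (gamma cost example): beta = 0.4
print(np.exp(0.4))   # ≈ 1.49: a 49% increase in expected cost
```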
Think of the link function as asking: 'In what scale do predictors have additive effects?' For logistic regression, it's the log-odds scale. For Poisson, it's the log-count scale. The link function defines the ruler by which we measure effects—and that ruler's units are what β coefficients represent.
Having established the mathematical structure of GLMs, let's step back and understand why this unified framework is so important for practical machine learning and statistics.
Connection to Modern Machine Learning:
GLMs might seem like classical statistics, but they form the backbone of modern machine learning:
Neural Network Output Layers: The final layer of a neural network for classification or regression is essentially a GLM—the softmax for multi-class classification is a multinomial logit GLM; the linear output for regression is a Gaussian GLM.
Gradient Boosting: XGBoost, LightGBM, and CatBoost can all be configured with different loss functions that correspond to different GLM distributions.
Bayesian Methods: GLMs have natural Bayesian extensions with conjugate priors, forming the foundation for modern probabilistic programming.
Transfer Learning: Understanding GLMs helps you design appropriate loss functions and output transformations when adapting pretrained models to new tasks.
The 1972 Nelder-Wedderburn paper introducing GLMs is among the most cited in statistics. The framework has proven remarkably durable—50+ years later, it remains the foundation for understanding regression models. Mastering GLMs gives you a conceptual toolkit that transcends any specific software or algorithm.
We've covered the foundational architecture of Generalized Linear Models: the three components (random, systematic, and link), the exponential family and its mean-variance relationship, how linear, logistic, Poisson, and gamma regression arise as special cases, the role of canonical links, and how the choice of link governs coefficient interpretation.
What's Next:
In the next page, we'll dive deep into link functions—the critical component that connects the distribution mean to the linear predictor. We'll explore the properties of common link functions, understand how to choose between them, and see how the choice of link affects model behavior and interpretation.
You now understand the GLM framework—the elegant mathematical architecture that unifies diverse regression models. This foundation will serve you throughout your study of logistic regression, Poisson regression, and beyond. Next, we'll explore link functions in detail.