Throughout your journey in machine learning, you've encountered a fundamental tension: linear regression assumes a continuous, unbounded response variable with normally distributed errors—yet real-world data often violates these assumptions dramatically. What do you do when your response is a count of events? A probability? A strictly positive amount? A categorical outcome?
For decades, statisticians developed specialized techniques for each case: logistic regression for binary outcomes, Poisson regression for counts, gamma regression for positive continuous values. These methods seemed unrelated—each with its own derivation, assumptions, and estimation procedures.
Then came the Generalized Linear Model (GLM) framework, introduced by John Nelder and Robert Wedderburn in their landmark 1972 paper. GLMs revealed that these seemingly disparate methods are special cases of a single, elegant mathematical structure. This unification wasn't merely taxonomic—it provided deep insights into when and why each method works, how to extend them, and how to diagnose their failures.
By the end of this page, you will understand the three fundamental components of any GLM: the random component (distribution of Y), the systematic component (linear predictor), and the link function that connects them. You'll see how this framework unifies logistic regression, Poisson regression, and ordinary linear regression as special cases—and you'll develop the conceptual foundation needed to derive new GLMs for novel problems.
Before introducing the GLM framework, we must clearly understand why ordinary linear regression (OLS) fails for many real-world problems. This understanding motivates the need for a more general approach.
The standard linear regression model assumes:
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i$$
where the errors $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed (iid) normal random variables. Equivalently:
$$Y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
This formulation embeds three critical assumptions that severely limit applicability: the response can take any real value (unbounded support), the error variance $\sigma^2$ is constant regardless of the mean (homoscedasticity), and the errors are normally distributed.
Concrete examples of failure:
Example 1: Binary Classification. Suppose you want to predict whether a patient has a disease (Y = 1) or not (Y = 0) based on clinical features. Fitting linear regression would produce predictions $\hat{Y}$ that can be negative or greater than 1—nonsensical as probabilities. Moreover, the variance of Y when Y ∈ {0,1} is p(1-p), which depends on the probability p, violating homoscedasticity.
Example 2: Count Data. You want to model the number of customer complaints per day. Complaints cannot be negative, and the variance often increases with the expected count (more complaints means more variability). Linear regression might predict -3 complaints on a slow day.
Example 3: Positive Continuous Data. You want to predict insurance claim amounts, which must be strictly positive and often have right-skewed distributions. Linear regression could predict negative claim amounts and assumes symmetric errors.
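To make Example 1 concrete, here is a minimal sketch (simulated data and hypothetical coefficients, not from any real study) showing ordinary least squares producing "probabilities" outside [0, 1] when fit to a binary response:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a binary outcome whose probability rises with x
n = 200
x = rng.uniform(-4, 4, n)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))   # true P(Y=1 | x), hypothetical coefficients
y = rng.binomial(1, p)                    # observed 0/1 labels

# Fit ordinary least squares: y ≈ b0 + b1 * x
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# The OLS "probabilities" escape the [0, 1] interval at the extremes
print(f"min prediction: {y_hat.min():.3f}")   # typically below 0
print(f"max prediction: {y_hat.max():.3f}")   # typically above 1
```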
Ordinary linear regression forces a square peg into a round hole. The real question isn't 'How do we fix linear regression for each case?' but rather 'What is the general mathematical framework that naturally handles diverse response types?' That framework is the Generalized Linear Model.
A Generalized Linear Model consists of three essential components that work together to model the relationship between predictors and response. Understanding these components is the key to mastering the entire framework.
The random component specifies the probability distribution of the response variable $Y_i$ conditional on the predictors. In a GLM, this distribution must belong to the exponential family—a rich class of distributions including normal, binomial, Poisson, gamma, inverse Gaussian, and many others.
Formally, we specify that:
$$Y_i \sim \text{ExponentialFamily}(\theta_i, \phi)$$
where $\theta_i$ is the canonical (natural) parameter that depends on the predictors, and $\phi$ is a dispersion parameter that controls the variance.
The choice of distribution should reflect your prior knowledge about the data-generating process: Bernoulli or binomial for binary outcomes and proportions, Poisson for counts, gamma or inverse Gaussian for positive, right-skewed continuous values, and normal for unbounded continuous responses.
The systematic component is the linear predictor $\eta_i$, which captures how predictors combine to influence the response:
$$\eta_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} = \mathbf{x}_i^\top \boldsymbol{\beta}$$
This is the only place where predictors appear in the model. The linear predictor is always a linear combination of parameters $\boldsymbol{\beta}$, though the predictors $X_{ij}$ themselves can be nonlinear transformations of raw features (polynomials, interactions, basis expansions).
Critical insight: The systematic component is identical across all GLMs. Whether you're fitting logistic regression, Poisson regression, or any other GLM, the linear predictor has the same form. What differs is how this linear predictor relates to the response distribution.
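To illustrate that the linear predictor is always linear in the coefficients, even when the design matrix contains nonlinear transformations of the raw features, here is a brief sketch (feature names and coefficient values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(20, 70, 5)
dose = rng.uniform(0, 10, 5)

# Design matrix: intercept, raw features, a squared term, and an interaction.
# Each column may be a nonlinear transformation of the raw inputs,
# but eta is still a linear combination of the coefficients beta.
X = np.column_stack([np.ones_like(age), age, dose, dose**2, age * dose])
beta = np.array([-1.0, 0.02, 0.5, -0.03, 0.001])  # hypothetical coefficients

eta = X @ beta   # the systematic component, identical in form for every GLM
print(eta)
```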
The link function $g(\cdot)$ is the critical connection between the random and systematic components. It relates the expected value of Y to the linear predictor:
$$g(\mu_i) = \eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
where $\mu_i = E[Y_i | \mathbf{x}_i]$ is the conditional mean of the response.
Equivalently, the mean is expressed as the inverse of the link function applied to the linear predictor:
$$\mu_i = g^{-1}(\eta_i) = g^{-1}(\mathbf{x}_i^\top \boldsymbol{\beta})$$
The link function serves two essential purposes: it maps the (possibly restricted) mean space, such as $(0,1)$ or $(0,\infty)$, onto the entire real line so that no value of the linear predictor produces an invalid mean, and it defines the scale on which predictor effects are additive.
| Component | Symbol | Purpose | Example (Logistic) |
|---|---|---|---|
| Random Component | Y ~ F(θ, φ) | Distribution of response | Y ~ Bernoulli(p) |
| Systematic Component | η = X^T β | How predictors combine | η = β₀ + β₁X₁ + ... |
| Link Function | g(μ) = η | Connects mean to linear predictor | logit(p) = η |
The genius of the GLM framework is that by choosing the appropriate combination of distribution and link function, you can construct models tailored to virtually any response type—while the estimation, inference, and diagnostic procedures remain essentially the same across all cases.
Now that we've introduced the three components conceptually, let's formalize the complete GLM specification. This mathematical precision is essential for understanding estimation, inference, and the relationships between different models.
Complete GLM Specification:
For independent observations $(Y_i, \mathbf{x}_i)$, $i = 1, \ldots, n$, a Generalized Linear Model is defined by:
1. Random Component: $$Y_i \mid \mathbf{x}_i \sim F(\mu_i, \phi)$$
where $F$ is from the exponential family with mean $\mu_i$ and dispersion $\phi$.
2. Systematic Component: $$\eta_i = \mathbf{x}_i^\top \boldsymbol{\beta} = \sum_{j=0}^{p} \beta_j x_{ij}$$
3. Link Function: $$g(\mu_i) = \eta_i$$
The link function $g: \mathcal{M} \to \mathbb{R}$ is a smooth, monotonic function that maps the mean space $\mathcal{M}$ (which may be restricted, like $(0,1)$ or $(0, \infty)$) to the entire real line.
The Mean-Variance Relationship:
For exponential family distributions, the variance is a function of the mean:
$$\text{Var}(Y_i) = \phi \cdot V(\mu_i)$$
where $V(\mu)$ is the variance function characteristic of the distribution. This relationship is fundamental:
| Distribution | Variance Function V(μ) | Var(Y) |
|---|---|---|
| Normal | 1 | σ² (constant) |
| Poisson | μ | μ |
| Binomial (proportion) | μ(1-μ) | μ(1-μ)/n |
| Gamma | μ² | φμ² |
| Inverse Gaussian | μ³ | φμ³ |
This variance function encodes a key property: the variance changes with the mean in a distribution-specific way. GLMs automatically account for this heteroscedasticity, unlike OLS.
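A quick simulation (a sketch, not part of the original derivation) makes the mean-variance relationship tangible: for Poisson data the sample variance tracks the mean, while for normal data it stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(42)
means = np.array([1.0, 5.0, 20.0, 50.0])

for m in means:
    pois = rng.poisson(m, 100_000)        # Poisson: Var(Y) = mu
    norm = rng.normal(m, 2.0, 100_000)    # Normal: Var(Y) = sigma^2 = 4, constant
    print(f"mu={m:5.1f}  Poisson var={pois.var():7.2f}  Normal var={norm.var():5.2f}")
```

The visualization below then walks through all three components for the logistic case.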
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit  # Logistic sigmoid

# Illustrate the three GLM components for logistic regression

# Generate predictor values
x = np.linspace(-3, 3, 100)

# Systematic component: linear predictor
beta_0, beta_1 = 0.5, 1.2
eta = beta_0 + beta_1 * x  # Linear predictor can range from -∞ to +∞

# Link function: logit link (and its inverse, the sigmoid)
# For logistic regression: g(μ) = log(μ / (1-μ)) = logit(μ)
# Inverse link: g^{-1}(η) = exp(η) / (1 + exp(η)) = sigmoid(η)
mu = expit(eta)  # Mean lies in (0, 1)

# Variance function for Bernoulli: V(μ) = μ(1-μ)
variance = mu * (1 - mu)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Panel 1: Linear predictor
axes[0].plot(x, eta, 'b-', linewidth=2)
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Predictor X', fontsize=12)
axes[0].set_ylabel('Linear Predictor η', fontsize=12)
axes[0].set_title('Systematic Component\nη = β₀ + β₁X', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Panel 2: Mean response (after inverse link)
axes[1].plot(x, mu, 'r-', linewidth=2)
axes[1].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Predictor X', fontsize=12)
axes[1].set_ylabel('Mean μ = P(Y=1)', fontsize=12)
axes[1].set_title('After Inverse Link\nμ = sigmoid(η)', fontsize=14)
axes[1].set_ylim(-0.05, 1.05)
axes[1].grid(True, alpha=0.3)

# Panel 3: Variance function
axes[2].plot(mu, variance, 'g-', linewidth=2)
axes[2].set_xlabel('Mean μ', fontsize=12)
axes[2].set_ylabel('Variance V(μ)', fontsize=12)
axes[2].set_title('Variance Function\nV(μ) = μ(1-μ)', fontsize=14)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('glm_components.png', dpi=150, bbox_inches='tight')
plt.show()
```

In OLS, we model E[Y] directly as a linear function of X. In GLMs, we model g(E[Y]) as a linear function of X. This simple transformation—modeling a function of the mean rather than the mean itself—is what enables GLMs to handle diverse response types while maintaining the tractability of linear models.
The power of the GLM framework becomes evident when we see how familiar models emerge as special cases. Each choice of distribution and link function yields a different member of the GLM family.
Distribution: $Y_i \sim \mathcal{N}(\mu_i, \sigma^2)$ (Normal)
Link Function: $g(\mu) = \mu$ (Identity link)
Inverse Link: $\mu = \eta$
Variance Function: $V(\mu) = 1$
$$\mu_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
With the identity link and normal distribution, GLM reduces exactly to ordinary least squares. The mean equals the linear predictor, and variance is constant.
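As a sanity check, a Gaussian GLM with the identity link should reproduce OLS coefficients exactly. Here is a minimal sketch using statsmodels on simulated data (any GLM-capable library would do):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)  # hypothetical linear relationship
X = sm.add_constant(x)

ols_fit = sm.OLS(y, X).fit()
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()

print(ols_fit.params)   # intercept and slope from least squares
print(glm_fit.params)   # identical estimates from the Gaussian GLM
```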
Distribution: $Y_i \sim \text{Bernoulli}(p_i)$ (Binomial with n=1)
Link Function: $g(p) = \log\left(\frac{p}{1-p}\right)$ (Logit link)
Inverse Link: $p = \frac{e^\eta}{1 + e^\eta} = \frac{1}{1 + e^{-\eta}}$
Variance Function: $V(p) = p(1-p)$
$$\log\left(\frac{p_i}{1-p_i}\right) = \mathbf{x}_i^\top \boldsymbol{\beta}$$
The logit link maps probabilities in (0,1) to the real line. The log-odds (logit) has a linear relationship with predictors. The variance is maximal at p=0.5 and decreases toward 0 and 1.
Distribution: $Y_i \sim \text{Poisson}(\lambda_i)$
Link Function: $g(\mu) = \log(\mu)$ (Log link)
Inverse Link: $\mu = e^\eta$
Variance Function: $V(\mu) = \mu$
$$\log(\lambda_i) = \mathbf{x}_i^\top \boldsymbol{\beta}$$
The log link ensures the mean count is always positive ($e^\eta > 0$ for any $\eta$). Predictors have multiplicative effects on the mean: a unit increase in $X_j$ multiplies the expected count by $e^{\beta_j}$.
Distribution: $Y_i \sim \text{Gamma}(\alpha, \beta_i)$ (parameterized with mean $\mu_i$)
Link Function: $g(\mu) = -1/\mu$ (Inverse link) or $g(\mu) = \log(\mu)$ (Log link)
Variance Function: $V(\mu) = \mu^2$
Gamma regression is ideal for positive continuous data where variance increases with the square of the mean—common in financial data, insurance claims, and survival times.
| Model Name | Response Type | Distribution | Canonical Link | Common Alternative Links |
|---|---|---|---|---|
| Linear Regression | Continuous, unbounded | Normal | Identity | — |
| Logistic Regression | Binary (0/1) | Bernoulli | Logit | Probit, Complementary log-log |
| Binomial Regression | Proportions (k/n) | Binomial | Logit | Probit, Complementary log-log |
| Poisson Regression | Counts (0,1,2,...) | Poisson | Log | Identity, Square root |
| Gamma Regression | Positive continuous | Gamma | Inverse | Log, Identity |
| Inverse Gaussian | Positive continuous | Inverse Gaussian | Inverse squared | Log, Identity |
What appears to be a diverse zoo of regression techniques is revealed to be a single parametric family. This unification means that once you understand GLMs, you can derive new models for novel problems, understand the relationships between existing models, and apply the same diagnostic and inference tools across all cases.
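The table above translates directly into code: in statsmodels, for example, each row is simply a different `family` argument to the same `GLM` call. A hedged sketch with simulated data (coefficients are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)

# Simulate responses consistent with each model's assumptions
y_binary = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.8 * x))))  # logistic mechanism
y_counts = rng.poisson(np.exp(0.5 + 0.4 * x))                    # Poisson mechanism

logit_fit = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()
pois_fit = sm.GLM(y_counts, X, family=sm.families.Poisson()).fit()

print(logit_fit.params)   # estimates on the log-odds scale
print(pois_fit.params)    # estimates on the log-count scale
```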
For each exponential family distribution, there exists a special link function called the canonical link. This link has deep mathematical significance and practical advantages.
Recall that the density of an exponential family distribution can be written as:
$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$$
where $\theta$ is the canonical (natural) parameter, $b(\theta)$ is the cumulant function, $a(\phi)$ is a dispersion function (commonly $a(\phi) = \phi$), and $c(y, \phi)$ is a normalizing term that does not involve $\theta$.
The mean is related to the canonical parameter by: $$\mu = b'(\theta) \quad \Rightarrow \quad \theta = (b')^{-1}(\mu)$$
The canonical link is defined as: $$g(\mu) = \theta = (b')^{-1}(\mu)$$
In words: the canonical link is the function that maps the mean to the canonical parameter.
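As a worked instance of this definition, take the Poisson case (using the cumulant function listed in the table below): since $b(\theta) = e^{\theta}$, the mean is $\mu = b'(\theta) = e^{\theta}$, so inverting gives $\theta = \log(\mu)$, and therefore
$$g(\mu) = (b')^{-1}(\mu) = \log(\mu)$$
which is exactly the log link used in Poisson regression.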
| Distribution | Canonical Parameter θ | Cumulant b(θ) | Mean μ = b'(θ) | Canonical Link g(μ) |
|---|---|---|---|---|
| Normal | μ | θ²/2 | θ | Identity: g(μ) = μ |
| Bernoulli | log(p/(1-p)) | log(1 + e^θ) | e^θ/(1+e^θ) | Logit: g(p) = log(p/(1-p)) |
| Poisson | log(μ) | e^θ | e^θ | Log: g(μ) = log(μ) |
| Gamma | -1/μ | -log(-θ) | -1/θ | Inverse: g(μ) = -1/μ |
| Inverse Gaussian | -1/(2μ²) | -√(-2θ) | 1/√(-2θ) | Inverse squared: g(μ) = 1/μ² |
While canonical links have mathematical advantages, they are NOT always the best practical choice. You may use non-canonical links when: (1) the canonical link produces predictions in an inconvenient scale, (2) subject-matter knowledge suggests a different relationship, or (3) you want comparability with established conventions. For example, log links are often used for gamma regression despite the canonical inverse link, because the log scale is more interpretable.
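In practice, swapping the canonical link for an alternative is usually a one-line change. A sketch using statsmodels with simulated gamma data (the specific link classes and parameter values here are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x = rng.normal(size=300)
X = sm.add_constant(x)
mu = np.exp(1.0 + 0.4 * x)                        # positive mean, hypothetical coefficients
y = rng.gamma(shape=2.0, scale=mu / 2.0)          # Gamma response with mean mu

# Canonical (inverse) link vs. the more interpretable log link
fit_inverse = sm.GLM(y, X, family=sm.families.Gamma()).fit()
fit_log = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print(fit_inverse.params)   # coefficients on the 1/mu scale
print(fit_log.params)       # coefficients on the log(mu) scale
```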
One of the most important skills in applied GLM modeling is correctly interpreting the estimated parameters. The interpretation depends critically on the link function used.
In a GLM with link function $g$:
$$g(\mu) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
The coefficient $\beta_j$ represents the change in $g(\mu)$ associated with a one-unit increase in $X_j$, holding all other predictors constant.
To interpret in terms of the mean $\mu$, we must apply $g^{-1}$—which gives different interpretations for different links.
$$\mu = \beta_0 + \beta_1 X_1$$
Interpretation: A one-unit increase in $X_1$ is associated with a $\beta_1$-unit additive change in the expected response.
If $\beta_1 = 2.5$, with income measured in thousands of dollars:
'Each additional year of education is associated with $2,500 higher annual income, on average.'
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1$$
Interpretation: A one-unit increase in $X_1$ is associated with a $\beta_1$ change in the log-odds. Exponentiating: the odds are multiplied by $e^{\beta_1}$.
If $\beta_1 = 0.7$:
'Each additional year of age multiplies the odds of disease by $e^{0.7} \approx 2.01$—roughly doubling the odds per year.'
$$\log(\mu) = \beta_0 + \beta_1 X_1$$
Interpretation: A one-unit increase in $X_1$ is associated with a $\beta_1$ change in the log-mean. Exponentiating: the expected count is multiplied by $e^{\beta_1}$.
If $\beta_1 = 0.3$:
'Each additional advertisement exposure multiplies the expected number of purchases by $e^{0.3} \approx 1.35$—a 35% increase.'
$$\frac{1}{\mu} = \beta_0 + \beta_1 X_1$$
Interpretation: A one-unit increase in $X_1$ is associated with a $\beta_1$ change in the inverse of the mean.
This is less intuitive; practitioners often switch to a log link for interpretability. For example, with a log link and $\beta_1 = 0.4$:
'Each additional complication multiplies expected hospital cost by $e^{0.4} \approx 1.49$—a 49% increase.'
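The examples above all reduce to the same arithmetic: exponentiate a coefficient to obtain a multiplicative effect. A small sketch using the coefficients quoted in this section:

```python
import numpy as np

# Logit link: beta = 0.7 -> odds ratio per one-unit increase
print(np.exp(0.7))   # ≈ 2.01: the odds roughly double

# Log link (Poisson): beta = 0.3 -> rate ratio per one-unit increase
print(np.exp(0.3))   # ≈ 1.35: a 35% increase in the expected count

# Log link (gamma cost example): beta = 0.4
print(np.exp(0.4))   # ≈ 1.49: a 49% increase in expected cost
```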
Think of the link function as asking: 'In what scale do predictors have additive effects?' For logistic regression, it's the log-odds scale. For Poisson, it's the log-count scale. The link function defines the ruler by which we measure effects—and that ruler's units are what β coefficients represent.
Having established the mathematical structure of GLMs, let's step back and understand why this unified framework is so important for practical machine learning and statistics.
Connection to Modern Machine Learning:
GLMs might seem like classical statistics, but they form the backbone of modern machine learning:
Neural Network Output Layers: The final layer of a neural network for classification or regression is essentially a GLM—the softmax for multi-class classification is a multinomial logit GLM; the linear output for regression is a Gaussian GLM.
Gradient Boosting: XGBoost, LightGBM, and CatBoost can all be configured with different loss functions that correspond to different GLM distributions.
Bayesian Methods: GLMs have natural Bayesian extensions with conjugate priors, forming the foundation for modern probabilistic programming.
Transfer Learning: Understanding GLMs helps you design appropriate loss functions and output transformations when adapting pretrained models to new tasks.
The 1972 Nelder-Wedderburn paper introducing GLMs is among the most cited in statistics. The framework has proven remarkably durable—50+ years later, it remains the foundation for understanding regression models. Mastering GLMs gives you a conceptual toolkit that transcends any specific software or algorithm.
We've covered the foundational architecture of Generalized Linear Models: the three components (random, systematic, and link), the exponential family and its mean-variance relationship, how linear, logistic, Poisson, and gamma regression arise as special cases, the role of canonical links, and how the choice of link governs coefficient interpretation.
What's Next:
In the next page, we'll dive deep into link functions—the critical component that connects the distribution mean to the linear predictor. We'll explore the properties of common link functions, understand how to choose between them, and see how the choice of link affects model behavior and interpretation.
You now understand the GLM framework—the elegant mathematical architecture that unifies diverse regression models. This foundation will serve you throughout your study of logistic regression, Poisson regression, and beyond. Next, we'll explore link functions in detail.