In the GLM framework, the link function plays a deceptively simple but profoundly important role: it transforms the expected value of the response into a quantity that can be modeled as a linear combination of predictors.
Consider the challenge: when modeling probabilities, the mean must lie in (0,1); when modeling counts, the mean must be non-negative; when modeling durations, the mean must be strictly positive. Yet our linear predictor $\eta = \mathbf{x}^\top \boldsymbol{\beta}$ can take any real value. How do we reconcile these incompatible domains?
The link function $g(\cdot)$ is the mathematical bridge. It maps the constrained mean space to the entire real line, allowing us to say $g(\mu) = \eta$ for any $\eta \in \mathbb{R}$. The properties of this bridge—its shape, its derivatives, its interpretability—profoundly affect the behavior and meaning of our model.
By the end of this page, you will deeply understand: (1) the mathematical requirements for valid link functions, (2) the properties and interpretations of common links (identity, logit, probit, log, etc.), (3) how to choose an appropriate link for your problem, and (4) the subtle tradeoffs between canonical and non-canonical links.
Not every function can serve as a link. For a function $g: \mathcal{M} \to \mathbb{R}$ to be a valid link function, it must satisfy several mathematical requirements.
The link function must be defined on the set $\mathcal{M}$ of possible mean values for the chosen distribution:
| Distribution | Mean Space $\mathcal{M}$ |
|---|---|
| Normal | $(-\infty, \infty)$ |
| Bernoulli/Binomial | $(0, 1)$ |
| Poisson | $(0, \infty)$ |
| Gamma | $(0, \infty)$ |
| Inverse Gaussian | $(0, \infty)$ |
For example, the log link $g(\mu) = \log(\mu)$ is appropriate when $\mathcal{M} = (0, \infty)$ but not when negative means are possible.
The link function must be strictly monotonic (either strictly increasing or strictly decreasing throughout $\mathcal{M}$). This ensures a one-to-one correspondence between $\mu$ and $\eta$.
Monotonicity guarantees that the inverse link $g^{-1}$ exists and is itself monotonic, so every linear predictor value maps back to exactly one mean and the model remains identifiable.
The link function should be twice differentiable on the interior of $\mathcal{M}$. Differentiability is needed for likelihood-based estimation: the score equations, the Fisher information, and the IRLS algorithm all involve derivatives of the link.
The first derivative $g'(\mu)$ appears in the weight matrix of iteratively reweighted least squares: $$w_i = \frac{1}{V(\mu_i) [g'(\mu_i)]^2}$$
The link function must map $\mathcal{M}$ onto the entire real line $\mathbb{R}$: $$g: \mathcal{M} \to \mathbb{R} \text{ is surjective}$$
This ensures that for any possible linear predictor value, there exists a valid mean. Without this, certain predictor combinations might produce undefined predictions.
A valid link function is a smooth, invertible transformation that takes the constrained mean (like a probability in (0,1)) and stretches it out to cover the whole real line (where the linear predictor lives). The stretching must be done smoothly and without any folds or kinks.
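To make these requirements concrete, here is a minimal numerical sanity check, a sketch using the logit link as the example (the grid and tolerances are arbitrary). It verifies strict monotonicity, the unbounded range near the boundaries of (0, 1), and the IRLS weight formula quoted above, which collapses to p(1-p) for the logit link.

```python
import numpy as np

# Requirements check for a candidate link: logit on the mean space M = (0, 1).
def g(p):            # link: logit
    return np.log(p / (1 - p))

def g_prime(p):      # derivative g'(p) = 1 / (p(1-p))
    return 1.0 / (p * (1 - p))

def V(p):            # Bernoulli variance function
    return p * (1 - p)

p = np.linspace(1e-6, 1 - 1e-6, 100_000)

# 1) Strict monotonicity: g'(p) > 0 everywhere on (0, 1)
assert np.all(g_prime(p) > 0)

# 2) Range: |g(p)| grows without bound as p approaches 0 or 1
print("logit values on grid span", g(p).min(), "to", g(p).max())   # ≈ -13.8 to 13.8

# 3) IRLS weight w = 1 / (V(p) * g'(p)^2) simplifies to p(1-p) for the logit link
w = 1.0 / (V(p) * g_prime(p) ** 2)
print("weights match p(1-p):", np.allclose(w, p * (1 - p)))
```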
The identity link is the simplest possible link function:
$$g(\mu) = \mu \qquad g^{-1}(\eta) = \eta$$
With the identity link, the mean equals the linear predictor directly: $$\mu_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
Derivative: $g'(\mu) = 1$ (constant)
Interpretation: Parameters have direct, additive effects on the mean: $$\frac{\partial \mu}{\partial X_j} = \beta_j$$
A one-unit increase in $X_j$ increases the expected response by exactly $\beta_j$ units, regardless of the current value of $X_j$ or other predictors.
The identity link is the canonical link for the Normal distribution. It's also appropriate when the response is genuinely unbounded, when effects are believed to act additively on the original scale, or when direct interpretation of coefficients in response units is the priority.
Using the identity link with a constrained distribution (like Poisson or binomial) can produce predictions outside the valid range. For example, if the fitted model gives xᵀβ = -2 for some covariate pattern, the predicted mean is μ = -2, which is impossible for a count. Most software will fit the model but produce warnings or errors at prediction time for such cases.
```python
import numpy as np
import matplotlib.pyplot as plt

# Identity link: g(μ) = μ, so μ = η directly

# Linear predictor values
eta = np.linspace(-3, 3, 100)

# With identity link, mean equals linear predictor
mu = eta  # g^{-1}(η) = η

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: Link function
axes[0].plot(mu, eta, 'b-', linewidth=2)
axes[0].plot(mu, mu, 'k--', alpha=0.3, label='45° line')
axes[0].set_xlabel('Mean μ', fontsize=12)
axes[0].set_ylabel('Linear Predictor η = g(μ)', fontsize=12)
axes[0].set_title('Identity Link Function\ng(μ) = μ', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right: Response function (inverse link)
axes[1].plot(eta, mu, 'r-', linewidth=2)
axes[1].set_xlabel('Linear Predictor η', fontsize=12)
axes[1].set_ylabel('Mean μ = g⁻¹(η)', fontsize=12)
axes[1].set_title('Response Function (Inverse Link)\nμ = η', fontsize=14)
axes[1].grid(True, alpha=0.3)

# Annotation showing linearity
axes[1].annotate('Slope = 1\n(constant marginal effect)',
                 xy=(0, 0), xytext=(1.5, -1.5), fontsize=10, ha='left',
                 arrowprops=dict(arrowstyle='->', color='gray'))

plt.tight_layout()
plt.savefig('identity_link.png', dpi=150, bbox_inches='tight')
plt.show()
```

The logit link (also called the log-odds link) is the canonical link for the Bernoulli and binomial distributions:
$$g(p) = \log\left(\frac{p}{1-p}\right) = \text{logit}(p) \qquad g^{-1}(\eta) = \frac{e^\eta}{1 + e^\eta} = \frac{1}{1 + e^{-\eta}}$$
The inverse function $g^{-1}$ is the logistic (or sigmoid) function, which maps the entire real line to (0,1).
Domain: $p \in (0, 1)$ Range: $\eta \in (-\infty, +\infty)$
Derivative: $$g'(p) = \frac{1}{p(1-p)}$$
Note that $g'(p)$ is large near 0 and 1 (where small changes in probability correspond to large changes in log-odds) and smallest at $p = 0.5$.
Symmetry: $$\text{logit}(p) = -\text{logit}(1-p)$$
This symmetry means that the effect of predictors on the probability of success equals (in magnitude but opposite sign) the effect on the probability of failure.
The logit link gives rise to one of the most important quantities in epidemiology and social science: the odds ratio.
Recall that for a binary outcome, the odds of success are: $$\text{odds} = \frac{p}{1-p}$$
With the logit link: $$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
Exponentiating both sides: $$\frac{p}{1-p} = e^{\beta_0} \cdot e^{\beta_1 X_1} \cdot \ldots \cdot e^{\beta_p X_p}$$
Now consider what happens when $X_j$ increases by 1 unit (holding others constant):
$$\frac{\text{odds after}}{\text{odds before}} = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_j(X_j+1) + \cdots}}{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_j X_j + \cdots}} = e^{\beta_j}$$
Thus $e^{\beta_j}$ is the multiplicative change in odds per unit increase in $X_j$—the odds ratio.
| Coefficient β_j | Odds Ratio e^(β_j) | Interpretation |
|---|---|---|
| β_j = 0 | OR = 1.00 | No effect on odds |
| β_j = 0.1 | OR ≈ 1.11 | 11% increase in odds per unit X_j |
| β_j = 0.5 | OR ≈ 1.65 | 65% increase in odds per unit X_j |
| β_j = 1.0 | OR ≈ 2.72 | Odds multiply by ~2.7 per unit X_j |
| β_j = -0.5 | OR ≈ 0.61 | 39% decrease in odds per unit X_j |
| β_j = -1.0 | OR ≈ 0.37 | Odds reduced to ~1/3 per unit X_j |
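To see the odds-ratio interpretation in action, the sketch below fits a logistic regression to simulated data with statsmodels (the data, seed, and true coefficient of 0.5 are made up for illustration) and exponentiates the coefficients and their confidence limits.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data: one predictor with true coefficient 0.5, so the true OR ≈ e^{0.5} ≈ 1.65
n = 5000
x = rng.normal(size=n)
eta = -0.3 + 0.5 * x
p = 1 / (1 + np.exp(-eta))
y = rng.binomial(1, p)

X = sm.add_constant(x)
fit = sm.Logit(y, X).fit(disp=0)

# Odds ratios: exponentiate coefficients (and the confidence interval limits)
odds_ratios = np.exp(fit.params)
or_ci = np.exp(fit.conf_int())
print("Odds ratio for x:", odds_ratios[1])   # ≈ 1.65
print("95% CI:", or_ci[1])
```

The plotting example that follows visualizes the logit link, its sigmoid inverse, and the sigmoid's derivative.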
```python
import numpy as np
import matplotlib.pyplot as plt

# Logit link: g(p) = log(p/(1-p))
# Inverse (sigmoid): g^{-1}(η) = 1 / (1 + exp(-η))

p = np.linspace(0.001, 0.999, 1000)
eta = np.linspace(-6, 6, 1000)

# Logit function
logit_p = np.log(p / (1 - p))

# Sigmoid (inverse logit)
sigmoid_eta = 1 / (1 + np.exp(-eta))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Left: Logit function
axes[0].plot(p, logit_p, 'b-', linewidth=2)
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Probability p', fontsize=12)
axes[0].set_ylabel('Log-Odds = logit(p)', fontsize=12)
axes[0].set_title('Logit Link Function\ng(p) = log(p/(1-p))', fontsize=14)
axes[0].set_xlim(0, 1)
axes[0].set_ylim(-6, 6)
axes[0].grid(True, alpha=0.3)

# Middle: Sigmoid function
axes[1].plot(eta, sigmoid_eta, 'r-', linewidth=2)
axes[1].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[1].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Linear Predictor η', fontsize=12)
axes[1].set_ylabel('Probability p = σ(η)', fontsize=12)
axes[1].set_title('Sigmoid (Inverse Logit)\np = 1/(1 + e^{-η})', fontsize=14)
axes[1].set_ylim(0, 1)
axes[1].grid(True, alpha=0.3)

# Right: Derivative of sigmoid (sensitivity)
sigmoid_deriv = sigmoid_eta * (1 - sigmoid_eta)
axes[2].plot(eta, sigmoid_deriv, 'g-', linewidth=2)
axes[2].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[2].set_xlabel('Linear Predictor η', fontsize=12)
axes[2].set_ylabel("Sensitivity dp/dη", fontsize=12)
axes[2].set_title("Sigmoid Derivative\n∂p/∂η = p(1-p)", fontsize=14)
axes[2].annotate('Maximum sensitivity\nat η=0 (p=0.5)', xy=(0, 0.25), xytext=(2, 0.2),
                 fontsize=10, arrowprops=dict(arrowstyle='->', color='gray'))
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('logit_link.png', dpi=150, bbox_inches='tight')
plt.show()
```

The sigmoid's S-shape captures a crucial phenomenon: marginal effects diminish at extremes. When p is near 0.5, a small change in the linear predictor has the largest effect on probability. But when p is near 0 or 1, the same change in η has a much smaller effect on p. This 'saturation' behavior is often realistic for binary outcomes.
The probit link is an alternative to the logit for binary response data:
$$g(p) = \Phi^{-1}(p) \qquad g^{-1}(\eta) = \Phi(\eta)$$
where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution:
$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2} \, dt$$
Probit regression has an elegant interpretation through latent variables. Suppose there exists an unobserved continuous variable $Y^*_i$ such that:
$$Y^*_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1)$$
We observe $Y_i = 1$ if $Y^*_i > 0$ and $Y_i = 0$ otherwise. Then:
$$P(Y_i = 1) = P(Y^*_i > 0) = P(\mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i > 0) = P(\varepsilon_i > -\mathbf{x}_i^\top \boldsymbol{\beta})$$
Since $\varepsilon_i \sim \mathcal{N}(0,1)$ is symmetric: $$P(Y_i = 1) = \Phi(\mathbf{x}_i^\top \boldsymbol{\beta})$$
This latent variable interpretation is widely used in economics (discrete choice theory) and psychometrics.
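A small Monte Carlo sketch (the coefficients and simulation settings are hypothetical) confirms the latent-variable story: thresholding $Y^* = \eta + \varepsilon$ at zero reproduces success probabilities equal to $\Phi(\eta)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical linear predictor values to check
beta0, beta1 = 0.2, 0.8
x = np.array([-1.0, 0.0, 1.0])
eta = beta0 + beta1 * x

# Latent-variable simulation: Y = 1 exactly when Y* = η + ε > 0 with ε ~ N(0, 1)
n_sim = 200_000
eps = rng.normal(size=(n_sim, len(x)))
y = (eta + eps > 0).astype(int)

print("Simulated P(Y=1):", y.mean(axis=0))   # Monte Carlo estimate
print("Φ(η):            ", norm.cdf(eta))    # probit inverse link
```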
In practice, logit and probit give nearly identical predictions for most datasets. The relationship between them is approximately:
$$\text{logit}(p) \approx \frac{\pi}{\sqrt{3}} \cdot \Phi^{-1}(p) \approx 1.81 \cdot \Phi^{-1}(p)$$
Thus, for the same data, logit coefficients are roughly 1.8 times larger in magnitude than the corresponding probit coefficients.
The main difference is in the tails: the logistic distribution has heavier tails than the standard normal, so logit probabilities approach 0 and 1 more gradually, while probit probabilities move toward the extremes more quickly as $|\eta|$ grows.
When to choose one over the other:
For most applications, the choice between logit and probit is a matter of discipline convention and interpretive preference—not model fit. If you're unsure, use logit: it's more common, computationally faster, and provides the interpretable odds ratio. Only switch to probit if your field expects it or the latent variable story is central to your analysis.
The log link is the canonical link for the Poisson distribution and commonly used for Gamma regression:
$$g(\mu) = \log(\mu) \qquad g^{-1}(\eta) = e^\eta$$
Domain: $\mu \in (0, \infty)$ Range: $\eta \in (-\infty, +\infty)$
Derivative: $g'(\mu) = 1/\mu$
Key Feature: The inverse link $\mu = e^\eta$ is always positive, regardless of $\eta$. This automatically satisfies the positivity constraint for counts and positive continuous responses.
With the log link: $$\log(\mu) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
Exponentiating: $$\mu = e^{\beta_0} \cdot e^{\beta_1 X_1} \cdot \ldots \cdot e^{\beta_p X_p}$$
Predictor effects are multiplicative on the mean. When $X_j$ increases by 1:
$$\frac{\mu_{\text{after}}}{\mu_{\text{before}}} = e^{\beta_j}$$
The quantity $e^{\beta_j}$ is the rate ratio or incidence rate ratio (IRR) in epidemiology.
| Coefficient β_j | Rate Ratio e^(β_j) | Interpretation |
|---|---|---|
| β_j = 0 | RR = 1.00 | No effect on expected count |
| β_j = 0.1 | RR ≈ 1.11 | 11% increase in expected count per unit X_j |
| β_j = 0.5 | RR ≈ 1.65 | 65% increase in expected count per unit X_j |
| β_j = -0.3 | RR ≈ 0.74 | 26% decrease in expected count per unit X_j |
| β_j = ln(2) ≈ 0.69 | RR = 2.00 | Expected count doubles per unit X_j |
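As an illustration of reading rate ratios from a fitted model, the sketch below simulates Poisson counts (the coefficients and sample size are made up) and fits a log-link Poisson GLM with statsmodels; exponentiated coefficients are the incidence rate ratios.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Simulated counts: log(μ) = 0.5 + 0.7 x, so the true rate ratio for x is e^{0.7} ≈ 2.01
n = 4000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.7 * x)
y = rng.poisson(mu)

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # canonical log link by default

rate_ratios = np.exp(fit.params)
print("Rate ratio (IRR) for x:", rate_ratios[1])          # ≈ 2.0
print("95% CI:", np.exp(fit.conf_int()[1]))
```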
The log link is appropriate when effects combine multiplicatively rather than additively. Many phenomena exhibit this pattern: disease risks that scale with exposure, populations and prices that grow by a constant percentage per period, and counts that rise proportionally with a rate covariate.
In these contexts, the log link provides coefficients with a natural percentage-change interpretation and fitted means that are guaranteed to stay positive.
```python
import numpy as np
import matplotlib.pyplot as plt

# Log link: g(μ) = log(μ)
# Inverse: g^{-1}(η) = exp(η)

mu = np.linspace(0.01, 10, 1000)
eta = np.linspace(-3, 3, 1000)

# Log function
log_mu = np.log(mu)

# Exponential (inverse)
exp_eta = np.exp(eta)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: Log link function
axes[0].plot(mu, log_mu, 'b-', linewidth=2)
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=1, color='gray', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Mean μ (must be > 0)', fontsize=12)
axes[0].set_ylabel('Linear Predictor η = log(μ)', fontsize=12)
axes[0].set_title('Log Link Function\ng(μ) = log(μ)', fontsize=14)
axes[0].set_xlim(0, 10)
axes[0].set_ylim(-5, 3)
axes[0].grid(True, alpha=0.3)

# Right: Exponential (inverse log)
axes[1].plot(eta, exp_eta, 'r-', linewidth=2)
axes[1].axhline(y=1, color='gray', linestyle='--', alpha=0.5)
axes[1].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Linear Predictor η', fontsize=12)
axes[1].set_ylabel('Mean μ = exp(η)', fontsize=12)
axes[1].set_title('Exponential (Inverse Log)\nμ = e^η (always positive!)', fontsize=14)
axes[1].set_ylim(0, 15)
axes[1].grid(True, alpha=0.3)

# Add annotation showing multiplicative interpretation
axes[1].annotate('η goes from 1 to 2:\nμ increases by factor e ≈ 2.72',
                 xy=(1.5, np.exp(1.5)), xytext=(0, 10), fontsize=10,
                 arrowprops=dict(arrowstyle='->', color='gray'))

plt.tight_layout()
plt.savefig('log_link.png', dpi=150, bbox_inches='tight')
plt.show()
```

Using a log link is NOT the same as fitting linear regression to log(Y). With a log link: (1) we model E[Y], not E[log(Y)]; (2) we don't need Y > 0 for every observation (zeros are handled via the distribution); (3) we properly account for heteroscedasticity. Log-transforming Y changes the question being asked and biases predictions when back-transformed.
The complementary log-log (cloglog) link is an asymmetric alternative to logit and probit for binary data:
$$g(p) = \log(-\log(1-p)) \qquad g^{-1}(\eta) = 1 - \exp(-\exp(\eta))$$
Unlike the symmetric logit and probit, the cloglog link is asymmetric: its inverse $1 - \exp(-\exp(\eta))$ approaches 1 very sharply as $\eta$ increases but approaches 0 much more slowly as $\eta$ decreases, so it is not symmetric around $p = 0.5$.
The cloglog link arises naturally in survival analysis and extreme value theory. If we have a binary outcome arising from whether an event occurs before a fixed time point, and the underlying hazard is constant (exponential survival), the appropriate link is cloglog.
Specifically, if $T \sim \text{Exponential}(\lambda)$ and we observe $Y = I(T \leq t_0)$:
$$P(Y = 1) = P(T \leq t_0) = 1 - e^{-\lambda t_0}$$
If $\log(\lambda t_0) = \eta$, then $P(Y=1) = 1 - e^{-e^\eta}$, which is the cloglog inverse link.
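The sketch below checks this correspondence numerically by simulating exponential survival times (the hazard and follow-up time are chosen arbitrarily) and comparing the empirical event probability to the cloglog inverse link evaluated at $\eta = \log(\lambda t_0)$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical hazard and follow-up time
lam, t0 = 0.4, 2.0
eta = np.log(lam * t0)

# Simulate exponential survival times and the binary "event by t0" indicator
T = rng.exponential(scale=1 / lam, size=500_000)
y = (T <= t0).astype(int)

# Compare: empirical P(Y=1), exact 1 - exp(-λ t0), and the cloglog inverse link at η
print("empirical:       ", y.mean())
print("1 - exp(-λ t0):  ", 1 - np.exp(-lam * t0))
print("cloglog inverse: ", 1 - np.exp(-np.exp(eta)))
```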
| Property | Logit | Probit | Cloglog |
|---|---|---|---|
| Symmetry | Symmetric around p=0.5 | Symmetric around p=0.5 | Asymmetric |
| Tail behavior | Heavy tails (logistic) | Moderate tails (normal) | Asymmetric tails |
| Canonical for | Binomial | None | None |
| Interpretation | Log-odds (odds ratio) | Latent normal threshold | Hazard model |
| Common in | Medicine, ML, general | Economics, psychometrics | Survival, epidemiology |
In practice, logit, probit, and cloglog often give similar predictions in the middle range (p between 0.2 and 0.8). They differ mainly in the tails and in interpretation. Choose based on: (1) interpretive needs (odds ratios → logit), (2) field conventions, (3) theoretical model (survival → cloglog, latent normal → probit).
The inverse link is the canonical link for the Gamma distribution:
$$g(\mu) = \frac{1}{\mu} \qquad g^{-1}(\eta) = \frac{1}{\eta}$$
Properties: the link is strictly decreasing on $(0, \infty)$, and its inverse $\mu = 1/\eta$ yields a valid positive mean only when $\eta > 0$, so the inverse link does not map all of $\mathbb{R}$ into the mean space and the linear predictor must be kept positive during fitting.
Interpretation: Not intuitive. A unit increase in $X_j$ changes $1/\mu$ by $\beta_j$. This makes the inverse link unpopular despite being canonical—practitioners often prefer the log link for Gamma regression.
The inverse squared link is canonical for the Inverse Gaussian distribution:
$$g(\mu) = \frac{1}{\mu^2} \qquad g^{-1}(\eta) = \frac{1}{\sqrt{\eta}}$$
The square root link is sometimes used for Poisson data:
$$g(\mu) = \sqrt{\mu} \qquad g^{-1}(\eta) = \eta^2$$
This link maps $[0, \infty)$ to $[0, \infty)$ rather than onto all of $\mathbb{R}$, so in practice the linear predictor is restricted to non-negative values. Its main appeal is variance stabilization for Poisson data: the square-root transformation makes the variance approximately constant.
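A quick numerical check of the variance-stabilizing claim (a sketch; the means and simulation size are arbitrary): for Poisson draws, the variance of $\sqrt{Y}$ settles near 1/4 once the mean is not too small, while the variance of $Y$ itself grows with the mean.

```python
import numpy as np

rng = np.random.default_rng(4)

# Var(Y) grows with λ, but Var(√Y) approaches ~1/4 as λ grows
for lam in [2, 5, 20, 100]:
    y = rng.poisson(lam, size=200_000)
    print(f"λ={lam:>3}: Var(Y) ≈ {y.var():7.2f},  Var(√Y) ≈ {np.sqrt(y).var():.3f}")
```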
| Link Name | g(μ) | g⁻¹(η) | Mean Space | Typical Use |
|---|---|---|---|---|
| Identity | μ | η | ℝ | Normal regression |
| Log | log(μ) | e^η | (0,∞) | Poisson, Gamma |
| Logit | log(p/(1-p)) | 1/(1+e^(-η)) | (0,1) | Binomial |
| Probit | Φ⁻¹(p) | Φ(η) | (0,1) | Binomial (economics) |
| Cloglog | log(-log(1-p)) | 1-exp(-exp(η)) | (0,1) | Binomial (survival) |
| Inverse | 1/μ | 1/η | (0,∞) | Gamma (canonical) |
| Inverse squared | 1/μ² | 1/√η | (0,∞) | Inverse Gaussian |
| Square root | √μ | η² | [0,∞) | Poisson (var-stab) |
You can define custom link functions for special applications. As long as your function is monotonic, differentiable, and maps the mean space to ℝ, standard GLM estimation applies. Some software (like R's glm) allows user-specified link functions.
Selecting an appropriate link function involves balancing several considerations:
Step 1: Ensure Validity The link must map the mean space to ℝ. For binary data, you need a link that maps (0,1) to ℝ; for count data, one that maps (0,∞) to ℝ.
Step 2: Consider the Canonical Link The canonical link has nice mathematical properties (sufficient statistics, concavity). Start with the canonical unless you have reasons to deviate.
Step 4: Prioritize Interpretability Coefficients should be meaningful for your application: the logit link yields odds ratios, the log link yields rate ratios and percentage changes, and the identity link yields direct additive effects in response units.
Step 4: Consider Domain Knowledge Does theory or prior research suggest a particular relationship? In pharmacokinetics, log-linear relationships are standard; in psychometrics, probit has theoretical justification.
When unsure, you can compare models with different links using information criteria such as AIC or BIC, deviance, and residual diagnostics, as sketched in the example below.
However, often different links give very similar fits, and the choice comes down to interpretation and convention.
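The sketch below illustrates such a comparison on simulated binary data, refitting the same binomial GLM under logit, probit, and cloglog links and ranking the fits by AIC. It assumes a recent statsmodels release in which the link classes are named Logit, Probit, and CLogLog (older versions expose lowercase aliases such as probit).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Simulated binary data generated from a logistic model
n = 3000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.4 + 1.2 * x)))
y = rng.binomial(1, p)
X = sm.add_constant(x)

# Refit the same model under three different links and compare AIC
# (link classes Logit / Probit / CLogLog; older statsmodels uses lowercase aliases)
candidate_links = [
    ("logit", sm.families.links.Logit()),
    ("probit", sm.families.links.Probit()),
    ("cloglog", sm.families.links.CLogLog()),
]
for name, link in candidate_links:
    fit = sm.GLM(y, X, family=sm.families.Binomial(link=link)).fit()
    print(f"{name:8s} AIC = {fit.aic:.1f}   coefficients = {np.round(fit.params, 3)}")
```

Because the data here were generated from a logistic model, the logit fit will usually have the lowest AIC, and the probit coefficients come out noticeably smaller in magnitude than the logit ones, consistent with the scaling discussed earlier.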
In practice, for most applications: use logit for binary outcomes, log for counts and positive continuous, and identity for unbounded continuous. This covers 95% of cases. Only deviate when you have specific theoretical or interpretive reasons.
We've explored link functions—the critical bridge between constrained means and unbounded linear predictors. Let's consolidate the key concepts:
- A valid link must be defined on the mean space, strictly monotonic, twice differentiable, and map the mean space onto the whole real line.
- The identity, logit, probit, log, cloglog, inverse, and square-root links each pair naturally with particular response types and mean spaces.
- Link choice drives interpretation: identity gives additive effects, logit gives odds ratios, log gives rate ratios, probit ties to a latent normal threshold, and cloglog ties to hazard models.
- Canonical links are a sensible default, but interpretability, domain knowledge, and field convention can justify non-canonical choices.
What's Next:
In the next page, we'll explore the exponential family of distributions—the mathematical foundation that makes the GLM framework possible. Understanding the exponential family reveals why certain distribution-link combinations work well and provides the tools for deriving new GLMs.
You now have a deep understanding of link functions—their requirements, properties, and interpretations. You can choose appropriate links for different response types and interpret the resulting coefficients correctly. Next, we'll see how the exponential family provides the theoretical foundation for GLMs.