Count data is everywhere in the real world: doctor visits per year, insurance claims filed, traffic accidents on a road segment, calls arriving at a support line.
These outcomes share common characteristics: they are non-negative integers, typically right-skewed, with variance that grows with the mean. Linear regression is inappropriate for count data because it can predict negative counts, assumes constant variance, and ignores the discrete nature of the response.
Poisson regression is the natural GLM for count data. It assumes the response follows a Poisson distribution, uses a log link to ensure positive predictions, and automatically accounts for mean-dependent variance. Mastering Poisson regression opens the door to analyzing a vast array of real-world counting phenomena.
By the end of this page, you will: (1) understand the Poisson distribution and when it applies, (2) know how to set up, estimate, and interpret Poisson regression models, (3) be able to diagnose overdispersion and model fit, and (4) understand rate models with exposure/offset terms.
Before diving into regression, let's thoroughly understand the Poisson distribution itself.
A random variable $Y$ follows a Poisson distribution with parameter $\lambda > 0$ if:
$$P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$
Key properties:
- Mean: $E[Y] = \lambda$
- Variance: $\text{Var}(Y) = \lambda$ (equidispersion: the variance equals the mean)
- Additivity: if $Y_1 \sim \text{Poisson}(\lambda_1)$ and $Y_2 \sim \text{Poisson}(\lambda_2)$ are independent, then $Y_1 + Y_2 \sim \text{Poisson}(\lambda_1 + \lambda_2)$
- As $\lambda$ grows, the distribution becomes approximately normal
The Poisson distribution emerges from the Poisson process, which models random events occurring in time or space under these conditions:
- Events occur independently of one another
- The average rate of occurrence is constant
- Two events cannot occur at exactly the same instant
If events occur at rate $r$ per unit time, then the count in an interval of length $t$ is $Y \sim \text{Poisson}(rt)$.
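For instance, events arriving at $r = 3$ per hour and observed for $t = 2$ hours yield a $\text{Poisson}(6)$ count. A quick scipy check (the rate and interval values here are illustrative):

```python
from scipy.stats import poisson

r, t = 3.0, 2.0                  # 3 events/hour, observed for 2 hours
lam = r * t                      # count over the interval is Poisson(6)
print(poisson.pmf(0, lam))       # P(no events) = e^{-6} ≈ 0.0025
print(poisson.mean(lam), poisson.var(lam))  # mean = variance = 6.0
```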
| Mean λ | P(Y=0) | P(Y=1) | P(Y=2) | P(Y=3) | P(Y≥4) | Shape |
|---|---|---|---|---|---|---|
| 0.5 | 60.7% | 30.3% | 7.6% | 1.3% | 0.2% | Highly right-skewed |
| 1.0 | 36.8% | 36.8% | 18.4% | 6.1% | 1.9% | Right-skewed |
| 2.0 | 13.5% | 27.1% | 27.1% | 18.0% | 14.3% | Moderately skewed |
| 5.0 | 0.7% | 3.4% | 8.4% | 14.0% | 73.5% | Nearly symmetric |
| 10.0 | 0.0% | 0.0% | 0.2% | 0.8% | 99.0% | Approximately normal |
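These probabilities are straightforward to reproduce with scipy; this short sketch regenerates the table (printed values match the table up to rounding):

```python
from scipy.stats import poisson

# Reproduce the probability table above for each mean
for lam in [0.5, 1.0, 2.0, 5.0, 10.0]:
    p = poisson.pmf([0, 1, 2, 3], lam)          # P(Y=0)..P(Y=3)
    tail = 1 - poisson.cdf(3, lam)              # P(Y>=4)
    cells = "  ".join(f"P({k})={v:.1%}" for k, v in enumerate(p))
    print(f"λ={lam:>4}: {cells}  P(Y≥4)={tail:.1%}")
```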
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Different lambda values
lambdas = [1, 5, 15]
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

for ax, lam, color in zip(axes, lambdas, colors):
    # X values for PMF
    x = np.arange(0, lam * 3)
    # Poisson PMF
    pmf = poisson.pmf(x, lam)
    # Bar plot
    ax.bar(x, pmf, color=color, alpha=0.7, edgecolor='black', linewidth=0.5)
    ax.axvline(x=lam, color='red', linestyle='--', linewidth=2,
               label=f'Mean = Var = {lam}')
    ax.set_xlabel('Count y', fontsize=12)
    ax.set_ylabel('P(Y = y)', fontsize=12)
    ax.set_title(f'Poisson(λ = {lam})', fontsize=14)
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3, axis='y')

# Add super title
fig.suptitle('Poisson Distribution: Mean = Variance', fontsize=14, y=1.02)

plt.tight_layout()
plt.savefig('poisson_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
```

The Poisson's equidispersion (Var = Mean) is a strong assumption. Real count data often has variance GREATER than the mean (overdispersion) due to clustering, heterogeneity, or excess zeros. We'll address this limitation and its solutions throughout this page and the next.
Poisson regression is a GLM that models count responses using the Poisson distribution with a log link.
Random Component: $$Y_i \sim \text{Poisson}(\lambda_i)$$
Each observation has its own mean $\lambda_i$ that depends on the predictors.
Systematic Component: $$\eta_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$$
Link Function (Log): $$\log(\lambda_i) = \eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
Equivalently: $$\lambda_i = \exp(\mathbf{x}_i^\top \boldsymbol{\beta}) = e^{\beta_0} \cdot e^{\beta_1 X_{i1}} \cdot \ldots \cdot e^{\beta_p X_{ip}}$$
For independent Poisson observations $Y_1, \ldots, Y_n$:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ Y_i \log(\lambda_i) - \lambda_i - \log(Y_i!) \right]$$
Substituting $\log(\lambda_i) = \mathbf{x}_i^\top \boldsymbol{\beta}$:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ Y_i (\mathbf{x}_i^\top \boldsymbol{\beta}) - e^{\mathbf{x}_i^\top \boldsymbol{\beta}} \right] + \text{const}$$
Taking the derivative with respect to $\beta_j$:
$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^n (Y_i - \lambda_i) X_{ij} = 0$$
In matrix form: $$\mathbf{X}^\top (\mathbf{Y} - \boldsymbol{\lambda}) = \mathbf{0}$$
This says: the residuals $(Y_i - \hat{\lambda}_i)$ must be orthogonal to each predictor—the same condition as in OLS, but with different residuals!
Unlike linear regression, these equations are nonlinear in $\boldsymbol{\beta}$ (because $\lambda_i = e^{\mathbf{x}_i^\top \boldsymbol{\beta}}$). We must use iterative methods—typically Newton-Raphson or equivalently Iteratively Reweighted Least Squares (IRLS).
At each iteration, IRLS solves a weighted least squares problem with weights $w_i = \lambda_i$ and working response $z_i = \eta_i + (Y_i - \lambda_i)/\lambda_i$ (sketched below). The algorithm converges quickly (typically 5-10 iterations) because the Poisson log-likelihood is strictly concave.
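To make the update concrete, here is a minimal NumPy sketch of Poisson IRLS; `poisson_irls` is a hypothetical helper for illustration, not a library routine:

```python
import numpy as np

def poisson_irls(X, y, tol=1e-8, max_iter=25):
    """Fit a Poisson GLM with log link via IRLS (minimal sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                    # linear predictor
        lam = np.exp(eta)                 # fitted means
        z = eta + (y - lam) / lam         # working response
        XtW = X.T * lam                   # weights w_i = lambda_i
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Usage on synthetic data: recovers roughly [0.5, 0.8]
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.uniform(0, 2, 500)])
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 1]))
print(poisson_irls(X, y))
```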
The log link means that Poisson coefficients have multiplicative interpretations—they describe proportional changes in the expected count.
Consider a single predictor: $$\log(\lambda) = \beta_0 + \beta_1 X$$
When $X$ increases by 1 unit:
- the log mean increases by $\beta_1$,
- so the mean is multiplied by $e^{\beta_1}$: $\lambda(X+1) = e^{\beta_1} \, \lambda(X)$.
The quantity $e^{\beta_1}$ is called the Incidence Rate Ratio (IRR) or Rate Ratio (RR):
$$\text{IRR} = \frac{\lambda(X+1)}{\lambda(X)} = \frac{e^{\beta_0 + \beta_1(X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_1}$$
Suppose we model doctor visits per year: $$\log(\text{visits}) = 0.5 + 0.3 \cdot \text{chronic_conditions} - 0.1 \cdot \text{income_10k}$$
| Coefficient | Value | IRR = exp(β) | Interpretation |
|---|---|---|---|
| Intercept | 0.5 | exp(0.5) = 1.65 | Baseline: 1.65 expected visits/year when both predictors are 0 |
| chronic_conditions | 0.3 | exp(0.3) = 1.35 | Each additional chronic condition increases visits by 35% |
| income_10k | -0.1 | exp(-0.1) = 0.90 | Each $10k increase in income reduces visits by 10% |
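To see these numbers in action, here is a small sketch that evaluates the fitted equation; `predicted_visits` is a hypothetical helper wrapping the stated coefficients:

```python
import numpy as np

# Evaluate the fitted doctor-visits equation from the table above
def predicted_visits(chronic_conditions, income_10k):
    return np.exp(0.5 + 0.3 * chronic_conditions - 0.1 * income_10k)

print(predicted_visits(0, 0))                           # baseline ≈ 1.65 visits/year
print(predicted_visits(3, 5) / predicted_visits(2, 5))  # e^0.3 ≈ 1.35, the IRR
```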
To get a confidence interval for the IRR: (1) form the usual Wald interval for the coefficient, $\hat{\beta}_1 \pm 1.96 \cdot \text{SE}(\hat{\beta}_1)$, then (2) exponentiate the endpoints to obtain $\left(e^{\hat{\beta}_1 - 1.96\,\text{SE}}, \; e^{\hat{\beta}_1 + 1.96\,\text{SE}}\right)$.
The CI for IRR is asymmetric because of the exponential transformation. An IRR of 1.0 means no effect; CIs not containing 1.0 indicate statistical significance.
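A minimal sketch of this calculation, using illustrative values for the coefficient and its standard error:

```python
import numpy as np

# beta_hat and se are illustrative, not fitted results
beta_hat, se = 0.3, 0.08
irr = np.exp(beta_hat)
ci = np.exp([beta_hat - 1.96 * se, beta_hat + 1.96 * se])
print(f"IRR = {irr:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# The interval is wider above the point estimate than below it
```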
For small effects, the percentage change is approximately $100 \cdot \beta_j$%: for example, $\beta_j = 0.05$ gives $e^{0.05} = 1.051$, roughly a 5% increase.
This approximation breaks down for large $|\beta_j|$: $\beta_j = 1$ gives $e^{1} = 2.72$, a 172% increase rather than 100%.
When reporting Poisson regression results: (1) present IRRs with 95% CIs rather than raw coefficients, (2) interpret as 'X% increase/decrease in expected count', (3) note that effects are multiplicative and thus context-dependent (a 20% increase from 5 to 6 is different from 100 to 120).
Often we want to model rates rather than raw counts. For example: accidents per mile of road per year, disease cases per 100,000 person-years, or insurance claims per policy-year.
The key is that we have different exposures (time at risk, population size, etc.) for different observations.
If $Y_i$ is the count and $t_i$ is the exposure (time, population, etc.), we model the rate $r_i = Y_i / t_i$ by:
$$Y_i \sim \text{Poisson}(\lambda_i = t_i \cdot r_i)$$
where $r_i$ is the rate per unit exposure. Taking logs:
$$\log(\lambda_i) = \log(t_i) + \log(r_i) = \log(t_i) + \mathbf{x}_i^\top \boldsymbol{\beta}$$
The term $\log(t_i)$ is called the offset—a predictor with coefficient fixed at 1.
$$\log(\lambda_i) = \text{offset}_i + \beta_0 + \beta_1 X_{i1} + \cdots$$
With an offset, coefficients represent effects on the rate (count per unit exposure):
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Example: Modeling accident rates across road segments
np.random.seed(42)
n = 100

# Generate data
traffic_volume = np.random.uniform(1000, 50000, n)   # vehicles per day
road_length = np.random.uniform(0.5, 5.0, n)         # miles
has_signal = np.random.binomial(1, 0.3, n)           # traffic signal present
speed_limit = np.random.choice([25, 35, 45, 55], n)  # mph

# True model: accidents per mile-year
# Higher traffic, higher speed, no signal -> more accidents
true_rate = np.exp(-5 + 0.00005*traffic_volume + 0.03*speed_limit - 0.5*has_signal)
exposure = road_length * 1  # mile-years (1 year observation)
expected_count = exposure * true_rate
accidents = np.random.poisson(expected_count)

# Create dataframe
df = pd.DataFrame({
    'accidents': accidents,
    'traffic_volume': traffic_volume,
    'speed_limit': speed_limit,
    'has_signal': has_signal,
    'road_length': exposure,
    'log_exposure': np.log(exposure)
})

# Fit Poisson regression WITH offset
X = sm.add_constant(df[['traffic_volume', 'speed_limit', 'has_signal']])
model = sm.GLM(df['accidents'], X, family=sm.families.Poisson(),
               offset=df['log_exposure'])
results = model.fit()

print("Poisson Regression with Offset (Rate Model)")
print("=" * 50)
print(results.summary())

# Interpret as Rate Ratios
print("\nIncidence Rate Ratios (IRRs):")
print("-" * 30)
irr = np.exp(results.params)
for name, val in irr.items():
    print(f"{name}: {val:.4f}")
```

Why use an offset instead of including log(exposure) as a regular predictor? Two reasons: (1) it enforces the proportionality constraint—doubling exposure should double expected count, (2) it simplifies interpretation—coefficients directly represent rate effects. If you suspect non-proportionality, you CAN include log(exposure) as a regular predictor and test if its coefficient differs from 1.
Assessing whether a Poisson model fits your data is crucial. Poor fit can lead to invalid inference. Here are key diagnostic tools.
Deviance Residuals: $$d_i = \text{sign}(Y_i - \hat{\lambda}_i) \sqrt{2\left[Y_i \log\left(\frac{Y_i}{\hat{\lambda}_i}\right) - (Y_i - \hat{\lambda}_i)\right]}$$
When $Y_i = 0$, $d_i = -\sqrt{2\hat{\lambda}_i}$.
Pearson Residuals: $$r_i^P = \frac{Y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}$$
Under correct model specification, both should be approximately standard normal in large samples, provided the fitted means are not too small.
Residuals vs. Fitted: Plot deviance or Pearson residuals against $\hat{\lambda}_i$. Look for patterns suggesting:
- Nonlinearity (systematic curvature)
- Overdispersion (spread growing faster than the mean implies)
- Outliers (isolated large residuals)
Residuals vs. Predictors: Check for nonlinear relationships not captured by the model
Q-Q Plot: Compare residual distribution to standard normal
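As a sketch of the first and last of these plots on synthetic data, using the standard statsmodels residual attributes (`resid_deviance`, `fittedvalues`) and `sm.qqplot`:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Fit a Poisson GLM to synthetic data, then plot diagnostics
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 200)
y = rng.poisson(np.exp(0.3 + 0.7 * x))
res = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(res.fittedvalues, res.resid_deviance, alpha=0.5)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted mean')
axes[0].set_ylabel('Deviance residual')
sm.qqplot(res.resid_deviance, line='45', ax=axes[1])  # Q-Q against normal
plt.tight_layout()
plt.show()
```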
The most common problem in Poisson regression is overdispersion—variance exceeding the mean.
Detection Methods:
- Pearson dispersion statistic: $$\hat{\phi} = \frac{X^2_{\text{Pearson}}}{n - p} = \frac{1}{n - p} \sum_{i=1}^n \frac{(Y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i}$$
For a well-fitting Poisson model, $\hat{\phi} \approx 1$. Values significantly > 1 indicate overdispersion.
- Deviance check: the residual deviance divided by its degrees of freedom should likewise be near 1 if the model fits. Much greater than 1 suggests overdispersion.
Consequences of Ignoring Overdispersion:
- Standard errors are underestimated
- Confidence intervals are too narrow
- p-values are too small, inflating Type I error rates
Point estimates of $\boldsymbol{\beta}$ remain consistent, but the inference around them is invalid.
| Dispersion φ̂ | Interpretation | Action |
|---|---|---|
| 0.8 - 1.2 | Adequate fit (equidispersion) | Proceed with Poisson |
| 1.2 - 2.0 | Mild overdispersion | Consider quasi-Poisson or NB |
| 2.0 - 5.0 | Moderate overdispersion | Use negative binomial |
| > 5.0 | Severe overdispersion | Investigate causes; consider zero-inflated models |
| < 0.8 | Underdispersion | Rare; check for bounded counts or near-deterministic processes |
Common causes of overdispersion: (1) Omitted predictors creating unobserved heterogeneity, (2) Positive correlation between events (clustering), (3) Excess zeros beyond Poisson expectations, (4) Outliers inflating variance. Each cause suggests different remedies.
When Poisson regression shows overdispersion, several approaches can address the problem.
The simplest fix is quasi-Poisson regression, which assumes: $$\text{Var}(Y_i) = \phi \cdot \lambda_i$$
where $\phi > 1$ is the dispersion parameter estimated from the data.
Effect:
- Point estimates of $\boldsymbol{\beta}$ are identical to the Poisson fit
- Standard errors are inflated by $\sqrt{\hat{\phi}}$
- No full likelihood is specified, so AIC/BIC are unavailable
A brief sketch follows.
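In statsmodels, the quasi-Poisson adjustment amounts to rescaling a Poisson fit by the Pearson-based dispersion estimate via `scale='X2'`. A minimal sketch on synthetic, deliberately overdispersed (negative binomial) counts:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic overdispersed counts: mean lam, variance lam + lam^2/2
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 300)
lam = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + lam))

X = sm.add_constant(x)
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
quasi = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale='X2')

print(np.allclose(pois.params, quasi.params))  # True: identical estimates
print(quasi.bse / pois.bse)                    # SEs inflated by sqrt(phi-hat)
```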
Sandwich (robust) standard errors don't assume a specific variance structure. They estimate the variance of $\hat{\beta}$ directly from the data:
$$\hat{V}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \hat{\boldsymbol{\Omega}} \mathbf{X} (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1}$$
where $\hat{\mathbf{\Omega}}$ is diagonal with $(Y_i - \hat{\lambda}_i)^2$ on the diagonal.
Effect:
- Coefficients are unchanged; only the standard errors are adjusted
- Valid under quite general variance misspecification, not just $\text{Var}(Y_i) = \phi \lambda_i$
A short sketch follows.
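A short sketch of the same idea via statsmodels' `cov_type='HC0'` option, again on synthetic overdispersed counts:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 300)
lam = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + lam))  # overdispersed counts

X = sm.add_constant(x)
naive = sm.GLM(y, X, family=sm.families.Poisson()).fit()
robust = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type='HC0')

print(np.allclose(naive.params, robust.params))  # same coefficients
print(robust.bse / naive.bse)                    # sandwich SEs are larger here
```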
The negative binomial distribution explicitly models overdispersion through an additional parameter:
$$\text{Var}(Y_i) = \lambda_i + \frac{\lambda_i^2}{\theta}$$
where $\theta > 0$ is the overdispersion parameter. As $\theta \to \infty$, this reduces to Poisson.
Advantages over quasi-Poisson:
- A genuine likelihood, so AIC/BIC and likelihood-ratio tests are available
- A full predictive distribution for the counts, not just adjusted standard errors
- A variance function ($\lambda_i + \lambda_i^2/\theta$) that often tracks real count data better than the constant inflation $\phi \lambda_i$
We'll cover negative binomial regression in detail on the next page.
If excess zeros drive overdispersion, zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB) models are appropriate. These model:
$$P(Y = k) = \begin{cases} \pi + (1-\pi) e^{-\lambda} & k = 0 \\ (1-\pi) \dfrac{\lambda^k e^{-\lambda}}{k!} & k > 0 \end{cases}$$
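A tiny numpy sketch of this pmf; `zip_pmf` is a hypothetical helper, not a library function:

```python
import numpy as np
from scipy.stats import poisson

def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson pmf, per the formula above."""
    base = (1 - pi) * poisson.pmf(k, lam)
    return np.where(np.asarray(k) == 0, pi + base, base)

k = np.arange(6)
print(zip_pmf(k, lam=2.0, pi=0.3))  # note the inflated mass at k = 0
```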
Let's work through a complete Poisson regression analysis, from data exploration to model diagnostics.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

# Example: Modeling number of insurance claims
np.random.seed(123)
n = 500

# Generate predictors
age = np.random.uniform(18, 70, n)
vehicle_value = np.random.exponential(20, n)  # in $1000s
prior_claims = np.random.poisson(0.5, n)
is_urban = np.random.binomial(1, 0.6, n)

# True model with log link
log_rate = -2 + 0.01*age + 0.02*vehicle_value + 0.3*prior_claims + 0.2*is_urban
true_lambda = np.exp(log_rate)
claims = np.random.poisson(true_lambda)

df = pd.DataFrame({
    'claims': claims,
    'age': age,
    'vehicle_value': vehicle_value,
    'prior_claims': prior_claims,
    'is_urban': is_urban
})

# Step 1: Exploratory analysis
print("="*60)
print("STEP 1: EXPLORATORY ANALYSIS")
print("="*60)
print(f"\nResponse variable (claims) summary:")
print(df['claims'].describe())
print(f"\nMean: {df['claims'].mean():.3f}")
print(f"Variance: {df['claims'].var():.3f}")
print(f"Variance/Mean ratio: {df['claims'].var()/df['claims'].mean():.3f}")
print("(Ratio near 1 suggests Poisson is appropriate)")

# Step 2: Fit Poisson model
print("\n" + "="*60)
print("STEP 2: FIT POISSON REGRESSION")
print("="*60)

X = sm.add_constant(df[['age', 'vehicle_value', 'prior_claims', 'is_urban']])
poisson_model = sm.GLM(df['claims'], X, family=sm.families.Poisson())
poisson_results = poisson_model.fit()

print(poisson_results.summary())

# Step 3: Model diagnostics
print("\n" + "="*60)
print("STEP 3: MODEL DIAGNOSTICS")
print("="*60)

# Deviance and Pearson chi-square
deviance = poisson_results.deviance
pearson = poisson_results.pearson_chi2
df_resid = poisson_results.df_resid

print(f"\nDeviance: {deviance:.2f}")
print(f"Deviance/df: {deviance/df_resid:.3f}")
print(f"Pearson Chi-square: {pearson:.2f}")
print(f"Dispersion estimate (Pearson X²/df): {pearson/df_resid:.3f}")

if pearson/df_resid > 1.5:
    print("\n⚠️ Warning: Evidence of overdispersion detected!")
else:
    print("\n✓ Dispersion appears reasonable for Poisson model")

# Step 4: Interpret coefficients as IRRs
print("\n" + "="*60)
print("STEP 4: INCIDENCE RATE RATIOS")
print("="*60)

irr = np.exp(poisson_results.params)
irr_ci = np.exp(poisson_results.conf_int())
irr_ci.columns = ['2.5%', '97.5%']

irr_df = pd.DataFrame({
    'IRR': irr,
    '95% CI Lower': irr_ci['2.5%'],
    '95% CI Upper': irr_ci['97.5%']
})
print("\nIncidence Rate Ratios with 95% CIs:")
print(irr_df.round(4))

# Interpretation
print("\nInterpretation of key findings:")
print("-" * 40)
print(f"• Each additional prior claim increases current claims by {(irr['prior_claims']-1)*100:.1f}%")
print(f"• Urban drivers have {(irr['is_urban']-1)*100:.1f}% more claims than rural drivers")
print(f"• Each $1000 increase in vehicle value increases claims by {(irr['vehicle_value']-1)*100:.1f}%")
```

A complete Poisson analysis includes: (1) Check mean ≈ variance for initial plausibility, (2) Fit the model and examine coefficient significance, (3) Check deviance/df and Pearson χ²/df for overdispersion, (4) Create residual plots, (5) Present results as IRRs with confidence intervals.
We've comprehensively covered Poisson regression—the canonical GLM for count data. Let's consolidate the key concepts: the Poisson distribution assumes equidispersion (mean = variance); the log link makes effects multiplicative, summarized as incidence rate ratios; offsets turn count models into rate models; and the dispersion statistics (Pearson $X^2$/df, deviance/df) guard against overdispersion, with quasi-Poisson, robust standard errors, negative binomial, and zero-inflated models as remedies.
What's Next:
In the next page, we'll explore negative binomial regression—the natural extension of Poisson regression that explicitly models overdispersion. We'll see how the negative binomial distribution arises as a Poisson-gamma mixture and when to prefer it over Poisson.
You now have a thorough understanding of Poisson regression—from the underlying distribution through model specification, interpretation, and diagnostics. You can confidently analyze count data while checking for and addressing overdispersion.