Count data is everywhere in the real world: doctor visits per year, insurance claims filed, traffic accidents on a road segment, calls arriving at a support line.
These outcomes share common characteristics: they are non-negative integers, typically right-skewed, with variance that grows with the mean. Linear regression is inappropriate for count data because it can predict negative counts, assumes constant variance, and ignores the discrete nature of the response.
Poisson regression is the natural GLM for count data. It assumes the response follows a Poisson distribution, uses a log link to ensure positive predictions, and automatically accounts for mean-dependent variance. Mastering Poisson regression opens the door to analyzing a vast array of real-world counting phenomena.
By the end of this page, you will: (1) understand the Poisson distribution and when it applies, (2) know how to set up, estimate, and interpret Poisson regression models, (3) be able to diagnose overdispersion and model fit, and (4) understand rate models with exposure/offset terms.
Before diving into regression, let's thoroughly understand the Poisson distribution itself.
A random variable $Y$ follows a Poisson distribution with parameter $\lambda > 0$ if:
$$P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$
Key properties:
- Mean: $E[Y] = \lambda$
- Variance: $\text{Var}(Y) = \lambda$ (equidispersion: the variance equals the mean)
- Additivity: if $Y_1 \sim \text{Poisson}(\lambda_1)$ and $Y_2 \sim \text{Poisson}(\lambda_2)$ are independent, then $Y_1 + Y_2 \sim \text{Poisson}(\lambda_1 + \lambda_2)$
- As $\lambda$ grows, the distribution becomes approximately normal
The Poisson distribution emerges from the Poisson process, which models random events occurring in time or space under these conditions:
- Events occur independently of one another
- The average rate of occurrence is constant
- Two events cannot occur at exactly the same instant
If events occur at rate $r$ per unit time, then the count in an interval of length $t$ is $Y \sim \text{Poisson}(rt)$.
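For instance, events arriving at $r = 3$ per hour and observed for $t = 2$ hours yield a $\text{Poisson}(6)$ count. A quick scipy check (the rate and interval values here are illustrative):

```python
from scipy.stats import poisson

r, t = 3.0, 2.0                  # 3 events/hour, observed for 2 hours
lam = r * t                      # count over the interval is Poisson(6)
print(poisson.pmf(0, lam))       # P(no events) = e^{-6} ≈ 0.0025
print(poisson.mean(lam), poisson.var(lam))  # mean = variance = 6.0
```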
| Mean λ | P(Y=0) | P(Y=1) | P(Y=2) | P(Y=3) | P(Y≥4) | Shape |
|---|---|---|---|---|---|---|
| 0.5 | 60.7% | 30.3% | 7.6% | 1.3% | 0.2% | Highly right-skewed |
| 1.0 | 36.8% | 36.8% | 18.4% | 6.1% | 1.9% | Right-skewed |
| 2.0 | 13.5% | 27.1% | 27.1% | 18.0% | 14.3% | Moderately skewed |
| 5.0 | 0.7% | 3.4% | 8.4% | 14.0% | 73.5% | Nearly symmetric |
| 10.0 | 0.0% | 0.0% | 0.2% | 0.8% | 99.0% | Approximately normal |
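These probabilities are straightforward to reproduce with scipy; this short sketch regenerates the table (printed values match the table up to rounding):

```python
from scipy.stats import poisson

# Reproduce the probability table above for each mean
for lam in [0.5, 1.0, 2.0, 5.0, 10.0]:
    p = poisson.pmf([0, 1, 2, 3], lam)          # P(Y=0)..P(Y=3)
    tail = 1 - poisson.cdf(3, lam)              # P(Y>=4)
    cells = "  ".join(f"P({k})={v:.1%}" for k, v in enumerate(p))
    print(f"λ={lam:>4}: {cells}  P(Y≥4)={tail:.1%}")
```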
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Different lambda values
lambdas = [1, 5, 15]
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

for ax, lam, color in zip(axes, lambdas, colors):
    # X values for PMF
    x = np.arange(0, lam * 3)
    # Poisson PMF
    pmf = poisson.pmf(x, lam)
    # Bar plot
    ax.bar(x, pmf, color=color, alpha=0.7, edgecolor='black', linewidth=0.5)
    ax.axvline(x=lam, color='red', linestyle='--', linewidth=2,
               label=f'Mean = Var = {lam}')
    ax.set_xlabel('Count y', fontsize=12)
    ax.set_ylabel('P(Y = y)', fontsize=12)
    ax.set_title(f'Poisson(λ = {lam})', fontsize=14)
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3, axis='y')

# Add super title
fig.suptitle('Poisson Distribution: Mean = Variance', fontsize=14, y=1.02)

plt.tight_layout()
plt.savefig('poisson_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
```

The Poisson's equidispersion (Var = Mean) is a strong assumption. Real count data often has variance GREATER than the mean (overdispersion) due to clustering, heterogeneity, or excess zeros. We'll address this limitation and its solutions throughout this page and the next.
Poisson regression is a GLM that models count responses using the Poisson distribution with a log link.
Random Component: $$Y_i \sim \text{Poisson}(\lambda_i)$$
Each observation has its own mean $\lambda_i$ that depends on the predictors.
Systematic Component: $$\eta_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$$
Link Function (Log): $$\log(\lambda_i) = \eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
Equivalently: $$\lambda_i = \exp(\mathbf{x}_i^\top \boldsymbol{\beta}) = e^{\beta_0} \cdot e^{\beta_1 X_{i1}} \cdot \ldots \cdot e^{\beta_p X_{ip}}$$
For independent Poisson observations $Y_1, \ldots, Y_n$:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ Y_i \log(\lambda_i) - \lambda_i - \log(Y_i!) \right]$$
Substituting $\log(\lambda_i) = \mathbf{x}_i^\top \boldsymbol{\beta}$:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ Y_i (\mathbf{x}_i^\top \boldsymbol{\beta}) - e^{\mathbf{x}_i^\top \boldsymbol{\beta}} \right] + \text{const}$$
Taking the derivative with respect to $\beta_j$:
$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^n (Y_i - \lambda_i) X_{ij} = 0$$
In matrix form: $$\mathbf{X}^\top (\mathbf{Y} - \boldsymbol{\lambda}) = \mathbf{0}$$
This says: the residuals $(Y_i - \hat{\lambda}_i)$ must be orthogonal to each predictor—the same condition as in OLS, but with different residuals!
Unlike linear regression, these equations are nonlinear in $\boldsymbol{\beta}$ (because $\lambda_i = e^{\mathbf{x}_i^\top \boldsymbol{\beta}}$). We must use iterative methods—typically Newton-Raphson or equivalently Iteratively Reweighted Least Squares (IRLS).
At each iteration, IRLS solves a weighted least squares problem with weights $w_i = \lambda_i$ and working response $z_i = \eta_i + (Y_i - \lambda_i)/\lambda_i$ (sketched below). The algorithm converges quickly (typically 5-10 iterations) because the Poisson log-likelihood is strictly concave.
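To make the update concrete, here is a minimal NumPy sketch of Poisson IRLS; `poisson_irls` is a hypothetical helper for illustration, not a library routine:

```python
import numpy as np

def poisson_irls(X, y, tol=1e-8, max_iter=25):
    """Fit a Poisson GLM with log link via IRLS (minimal sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                    # linear predictor
        lam = np.exp(eta)                 # fitted means
        z = eta + (y - lam) / lam         # working response
        XtW = X.T * lam                   # weights w_i = lambda_i
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Usage on synthetic data: recovers roughly [0.5, 0.8]
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.uniform(0, 2, 500)])
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 1]))
print(poisson_irls(X, y))
```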
The log link means that Poisson coefficients have multiplicative interpretations—they describe proportional changes in the expected count.
Consider a single predictor: $$\log(\lambda) = \beta_0 + \beta_1 X$$
When $X$ increases by 1 unit:
- the log mean increases by $\beta_1$,
- so the mean is multiplied by $e^{\beta_1}$: $\lambda(X+1) = e^{\beta_1} \, \lambda(X)$.
The quantity $e^{\beta_1}$ is called the Incidence Rate Ratio (IRR) or Rate Ratio (RR):
$$\text{IRR} = \frac{\lambda(X+1)}{\lambda(X)} = \frac{e^{\beta_0 + \beta_1(X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_1}$$
Suppose we model doctor visits per year: $$\log(\text{visits}) = 0.5 + 0.3 \cdot \text{chronic_conditions} - 0.1 \cdot \text{income_10k}$$
| Coefficient | Value | IRR = exp(β) | Interpretation |
|---|---|---|---|
| Intercept | 0.5 | exp(0.5) = 1.65 | Baseline: 1.65 expected visits/year when both predictors are 0 |
| chronic_conditions | 0.3 | exp(0.3) = 1.35 | Each additional chronic condition increases visits by 35% |
| income_10k | -0.1 | exp(-0.1) = 0.90 | Each $10k increase in income reduces visits by 10% |
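To see these numbers in action, here is a small sketch that evaluates the fitted equation; `predicted_visits` is a hypothetical helper wrapping the stated coefficients:

```python
import numpy as np

# Evaluate the fitted doctor-visits equation from the table above
def predicted_visits(chronic_conditions, income_10k):
    return np.exp(0.5 + 0.3 * chronic_conditions - 0.1 * income_10k)

print(predicted_visits(0, 0))                           # baseline ≈ 1.65 visits/year
print(predicted_visits(3, 5) / predicted_visits(2, 5))  # e^0.3 ≈ 1.35, the IRR
```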
To get a confidence interval for the IRR: (1) form the usual Wald interval for the coefficient, $\hat{\beta}_1 \pm 1.96 \cdot \text{SE}(\hat{\beta}_1)$, then (2) exponentiate the endpoints to obtain $\left(e^{\hat{\beta}_1 - 1.96\,\text{SE}}, \; e^{\hat{\beta}_1 + 1.96\,\text{SE}}\right)$.
The CI for IRR is asymmetric because of the exponential transformation. An IRR of 1.0 means no effect; CIs not containing 1.0 indicate statistical significance.
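A minimal sketch of this calculation, using illustrative values for the coefficient and its standard error:

```python
import numpy as np

# beta_hat and se are illustrative, not fitted results
beta_hat, se = 0.3, 0.08
irr = np.exp(beta_hat)
ci = np.exp([beta_hat - 1.96 * se, beta_hat + 1.96 * se])
print(f"IRR = {irr:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# The interval is wider above the point estimate than below it
```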
For small effects, the percentage change is approximately $100 \cdot \beta_j$%: for example, $\beta_j = 0.05$ gives $e^{0.05} = 1.051$, roughly a 5% increase.
This approximation breaks down for large $|\beta_j|$: $\beta_j = 1$ gives $e^{1} = 2.72$, a 172% increase rather than 100%.
When reporting Poisson regression results: (1) present IRRs with 95% CIs rather than raw coefficients, (2) interpret as 'X% increase/decrease in expected count', (3) note that effects are multiplicative and thus context-dependent (a 20% increase from 5 to 6 is different from 100 to 120).
Often we want to model rates rather than raw counts. For example: accidents per mile of road per year, disease cases per 100,000 person-years, or insurance claims per policy-year.
The key is that we have different exposures (time at risk, population size, etc.) for different observations.
If $Y_i$ is the count and $t_i$ is the exposure (time, population, etc.), we model the rate $r_i = Y_i / t_i$ by:
$$Y_i \sim \text{Poisson}(\lambda_i = t_i \cdot r_i)$$
where $r_i$ is the rate per unit exposure. Taking logs:
$$\log(\lambda_i) = \log(t_i) + \log(r_i) = \log(t_i) + \mathbf{x}_i^\top \boldsymbol{\beta}$$
The term $\log(t_i)$ is called the offset—a predictor with coefficient fixed at 1.
$$\log(\lambda_i) = \text{offset}_i + \beta_0 + \beta_1 X_{i1} + \cdots$$
With an offset, coefficients represent effects on the rate (count per unit exposure):
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Example: Modeling accident rates across road segments
np.random.seed(42)
n = 100

# Generate data
traffic_volume = np.random.uniform(1000, 50000, n)   # vehicles per day
road_length = np.random.uniform(0.5, 5.0, n)         # miles
has_signal = np.random.binomial(1, 0.3, n)           # traffic signal present
speed_limit = np.random.choice([25, 35, 45, 55], n)  # mph

# True model: accidents per mile-year
# Higher traffic, higher speed, no signal -> more accidents
true_rate = np.exp(-5 + 0.00005*traffic_volume + 0.03*speed_limit - 0.5*has_signal)
exposure = road_length * 1  # mile-years (1 year observation)
expected_count = exposure * true_rate
accidents = np.random.poisson(expected_count)

# Create dataframe
df = pd.DataFrame({
    'accidents': accidents,
    'traffic_volume': traffic_volume,
    'speed_limit': speed_limit,
    'has_signal': has_signal,
    'road_length': exposure,
    'log_exposure': np.log(exposure)
})

# Fit Poisson regression WITH offset
X = sm.add_constant(df[['traffic_volume', 'speed_limit', 'has_signal']])
model = sm.GLM(df['accidents'], X, family=sm.families.Poisson(),
               offset=df['log_exposure'])
results = model.fit()

print("Poisson Regression with Offset (Rate Model)")
print("=" * 50)
print(results.summary())

# Interpret as Rate Ratios
print("\nIncidence Rate Ratios (IRRs):")
print("-" * 30)
irr = np.exp(results.params)
for name, val in irr.items():
    print(f"{name}: {val:.4f}")
```

Why use an offset instead of including log(exposure) as a regular predictor? Two reasons: (1) it enforces the proportionality constraint—doubling exposure should double expected count, (2) it simplifies interpretation—coefficients directly represent rate effects. If you suspect non-proportionality, you CAN include log(exposure) as a regular predictor and test if its coefficient differs from 1.
Assessing whether a Poisson model fits your data is crucial. Poor fit can lead to invalid inference. Here are key diagnostic tools.
Deviance Residuals: $$d_i = \text{sign}(Y_i - \hat{\lambda}_i) \sqrt{2\left[Y_i \log\left(\frac{Y_i}{\hat{\lambda}_i}\right) - (Y_i - \hat{\lambda}_i)\right]}$$
When $Y_i = 0$, $d_i = -\sqrt{2\hat{\lambda}_i}$.
Pearson Residuals: $$r_i^P = \frac{Y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}$$
Under correct model specification, both should be approximately standard normal in large samples, provided the fitted means are not too small.
Residuals vs. Fitted: Plot deviance or Pearson residuals against $\hat{\lambda}_i$. Look for patterns suggesting:
- Nonlinearity (systematic curvature)
- Overdispersion (spread growing faster than the mean implies)
- Outliers (isolated large residuals)
Residuals vs. Predictors: Check for nonlinear relationships not captured by the model
Q-Q Plot: Compare residual distribution to standard normal
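As a sketch of the first and last of these plots on synthetic data, using the standard statsmodels residual attributes (`resid_deviance`, `fittedvalues`) and `sm.qqplot`:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Fit a Poisson GLM to synthetic data, then plot diagnostics
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 200)
y = rng.poisson(np.exp(0.3 + 0.7 * x))
res = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(res.fittedvalues, res.resid_deviance, alpha=0.5)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted mean')
axes[0].set_ylabel('Deviance residual')
sm.qqplot(res.resid_deviance, line='45', ax=axes[1])  # Q-Q against normal
plt.tight_layout()
plt.show()
```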
The most common problem in Poisson regression is overdispersion—variance exceeding the mean.
Detection Methods:
- Pearson dispersion statistic: $$\hat{\phi} = \frac{X^2_{\text{Pearson}}}{n - p} = \frac{1}{n - p} \sum_{i=1}^n \frac{(Y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i}$$
For a well-fitting Poisson model, $\hat{\phi} \approx 1$. Values significantly > 1 indicate overdispersion.
- Deviance check: the residual deviance divided by its degrees of freedom should likewise be near 1 if the model fits. Much greater than 1 suggests overdispersion.
Consequences of Ignoring Overdispersion:
- Standard errors are underestimated
- Confidence intervals are too narrow
- p-values are too small, inflating Type I error rates
Point estimates of $\boldsymbol{\beta}$ remain consistent, but the inference around them is invalid.
| Dispersion φ̂ | Interpretation | Action |
|---|---|---|
| 0.8 - 1.2 | Adequate fit (equidispersion) | Proceed with Poisson |
| 1.2 - 2.0 | Mild overdispersion | Consider quasi-Poisson or NB |
| 2.0 - 5.0 | Moderate overdispersion | Use negative binomial |
| > 5.0 | Severe overdispersion | Investigate causes; consider zero-inflated models |
| < 0.8 | Underdispersion | Rare; check for bounded counts or near-deterministic processes |
Common causes of overdispersion: (1) Omitted predictors creating unobserved heterogeneity, (2) Positive correlation between events (clustering), (3) Excess zeros beyond Poisson expectations, (4) Outliers inflating variance. Each cause suggests different remedies.
When Poisson regression shows overdispersion, several approaches can address the problem.
The simplest fix is quasi-Poisson regression, which assumes: $$\text{Var}(Y_i) = \phi \cdot \lambda_i$$
where $\phi > 1$ is the dispersion parameter estimated from the data.
Effect:
- Point estimates of $\boldsymbol{\beta}$ are identical to the Poisson fit
- Standard errors are inflated by $\sqrt{\hat{\phi}}$
- No full likelihood is specified, so AIC/BIC are unavailable
A brief sketch follows.
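In statsmodels, the quasi-Poisson adjustment amounts to rescaling a Poisson fit by the Pearson-based dispersion estimate via `scale='X2'`. A minimal sketch on synthetic, deliberately overdispersed (negative binomial) counts:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic overdispersed counts: mean lam, variance lam + lam^2/2
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 300)
lam = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + lam))

X = sm.add_constant(x)
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
quasi = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale='X2')

print(np.allclose(pois.params, quasi.params))  # True: identical estimates
print(quasi.bse / pois.bse)                    # SEs inflated by sqrt(phi-hat)
```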
Sandwich (robust) standard errors don't assume a specific variance structure. They estimate the variance of $\hat{\beta}$ directly from the data:
$$\hat{V}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \hat{\boldsymbol{\Omega}} \mathbf{X} (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1}$$
where $\hat{\mathbf{\Omega}}$ is diagonal with $(Y_i - \hat{\lambda}_i)^2$ on the diagonal.
Effect:
- Coefficients are unchanged; only the standard errors are adjusted
- Valid under quite general variance misspecification, not just $\text{Var}(Y_i) = \phi \lambda_i$
A short sketch follows.
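A short sketch of the same idea via statsmodels' `cov_type='HC0'` option, again on synthetic overdispersed counts:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 300)
lam = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + lam))  # overdispersed counts

X = sm.add_constant(x)
naive = sm.GLM(y, X, family=sm.families.Poisson()).fit()
robust = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type='HC0')

print(np.allclose(naive.params, robust.params))  # same coefficients
print(robust.bse / naive.bse)                    # sandwich SEs are larger here
```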
The negative binomial distribution explicitly models overdispersion through an additional parameter:
$$\text{Var}(Y_i) = \lambda_i + \frac{\lambda_i^2}{\theta}$$
where $\theta > 0$ is the overdispersion parameter. As $\theta \to \infty$, this reduces to Poisson.
Advantages over quasi-Poisson:
- A genuine likelihood, so AIC/BIC and likelihood-ratio tests are available
- A full predictive distribution for the counts, not just adjusted standard errors
- A variance function ($\lambda_i + \lambda_i^2/\theta$) that often tracks real count data better than the constant inflation $\phi \lambda_i$
We'll cover negative binomial regression in detail on the next page.
If excess zeros drive overdispersion, zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB) models are appropriate. These model:
$$P(Y = k) = \begin{cases} \pi + (1-\pi) e^{-\lambda} & k = 0 \\ (1-\pi) \dfrac{\lambda^k e^{-\lambda}}{k!} & k > 0 \end{cases}$$
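A tiny numpy sketch of this pmf; `zip_pmf` is a hypothetical helper, not a library function:

```python
import numpy as np
from scipy.stats import poisson

def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson pmf, per the formula above."""
    base = (1 - pi) * poisson.pmf(k, lam)
    return np.where(np.asarray(k) == 0, pi + base, base)

k = np.arange(6)
print(zip_pmf(k, lam=2.0, pi=0.3))  # note the inflated mass at k = 0
```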
Let's work through a complete Poisson regression analysis, from data exploration to model diagnostics.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

# Example: Modeling number of insurance claims
np.random.seed(123)
n = 500

# Generate predictors
age = np.random.uniform(18, 70, n)
vehicle_value = np.random.exponential(20, n)  # in $1000s
prior_claims = np.random.poisson(0.5, n)
is_urban = np.random.binomial(1, 0.6, n)

# True model with log link
log_rate = -2 + 0.01*age + 0.02*vehicle_value + 0.3*prior_claims + 0.2*is_urban
true_lambda = np.exp(log_rate)
claims = np.random.poisson(true_lambda)

df = pd.DataFrame({
    'claims': claims,
    'age': age,
    'vehicle_value': vehicle_value,
    'prior_claims': prior_claims,
    'is_urban': is_urban
})

# Step 1: Exploratory analysis
print("="*60)
print("STEP 1: EXPLORATORY ANALYSIS")
print("="*60)
print(f"\nResponse variable (claims) summary:")
print(df['claims'].describe())
print(f"\nMean: {df['claims'].mean():.3f}")
print(f"Variance: {df['claims'].var():.3f}")
print(f"Variance/Mean ratio: {df['claims'].var()/df['claims'].mean():.3f}")
print("(Ratio near 1 suggests Poisson is appropriate)")

# Step 2: Fit Poisson model
print("\n" + "="*60)
print("STEP 2: FIT POISSON REGRESSION")
print("="*60)

X = sm.add_constant(df[['age', 'vehicle_value', 'prior_claims', 'is_urban']])
poisson_model = sm.GLM(df['claims'], X, family=sm.families.Poisson())
poisson_results = poisson_model.fit()

print(poisson_results.summary())

# Step 3: Model diagnostics
print("\n" + "="*60)
print("STEP 3: MODEL DIAGNOSTICS")
print("="*60)

# Deviance and Pearson chi-square
deviance = poisson_results.deviance
pearson = poisson_results.pearson_chi2
df_resid = poisson_results.df_resid

print(f"\nDeviance: {deviance:.2f}")
print(f"Deviance/df: {deviance/df_resid:.3f}")
print(f"Pearson Chi-square: {pearson:.2f}")
print(f"Dispersion estimate (Pearson X²/df): {pearson/df_resid:.3f}")

if pearson/df_resid > 1.5:
    print("\n⚠️ Warning: Evidence of overdispersion detected!")
else:
    print("\n✓ Dispersion appears reasonable for Poisson model")

# Step 4: Interpret coefficients as IRRs
print("\n" + "="*60)
print("STEP 4: INCIDENCE RATE RATIOS")
print("="*60)

irr = np.exp(poisson_results.params)
irr_ci = np.exp(poisson_results.conf_int())
irr_ci.columns = ['2.5%', '97.5%']

irr_df = pd.DataFrame({
    'IRR': irr,
    '95% CI Lower': irr_ci['2.5%'],
    '95% CI Upper': irr_ci['97.5%']
})
print("\nIncidence Rate Ratios with 95% CIs:")
print(irr_df.round(4))

# Interpretation
print("\nInterpretation of key findings:")
print("-" * 40)
print(f"• Each additional prior claim increases current claims by {(irr['prior_claims']-1)*100:.1f}%")
print(f"• Urban drivers have {(irr['is_urban']-1)*100:.1f}% more claims than rural drivers")
print(f"• Each $1000 increase in vehicle value increases claims by {(irr['vehicle_value']-1)*100:.1f}%")
```

A complete Poisson analysis includes: (1) Check mean ≈ variance for initial plausibility, (2) Fit the model and examine coefficient significance, (3) Check deviance/df and Pearson χ²/df for overdispersion, (4) Create residual plots, (5) Present results as IRRs with confidence intervals.
We've comprehensively covered Poisson regression—the canonical GLM for count data. Let's consolidate the key concepts: the Poisson distribution assumes equidispersion (mean = variance); the log link makes effects multiplicative, summarized as incidence rate ratios; offsets turn count models into rate models; and the dispersion statistics (Pearson $X^2$/df, deviance/df) guard against overdispersion, with quasi-Poisson, robust standard errors, negative binomial, and zero-inflated models as remedies.
What's Next:
In the next page, we'll explore negative binomial regression—the natural extension of Poisson regression that explicitly models overdispersion. We'll see how the negative binomial distribution arises as a Poisson-gamma mixture and when to prefer it over Poisson.
You now have a thorough understanding of Poisson regression—from the underlying distribution through model specification, interpretation, and diagnostics. You can confidently analyze count data while checking for and addressing overdispersion.