In the previous page, we encountered a persistent challenge: real-world count data often violate the Poisson assumption that the variance equals the mean. This overdispersion can arise from unobserved heterogeneity across subjects, clustering or correlation among observations, and more zeros than the Poisson predicts.
When overdispersion is substantial, we need a distribution that can accommodate variance greater than the mean. The negative binomial distribution provides exactly this flexibility, with an extra parameter that controls the degree of overdispersion.
Negative binomial regression is the natural choice when the outcome is a count, the observed variance clearly exceeds the mean, and the extra variation reflects unobserved heterogeneity in the underlying rates rather than a separate zero-generating process.
By the end of this page, you will: (1) understand the negative binomial distribution and its Poisson-gamma mixture interpretation, (2) know how to specify, fit, and interpret negative binomial regression, (3) be able to compare Poisson and NB models and choose between them, and (4) understand the relationship between NB and other overdispersion solutions.
Before modeling, let's understand the negative binomial distribution itself.
The negative binomial distribution can be parameterized in terms of the mean $\mu$ and a dispersion parameter $\theta > 0$ (also denoted $\alpha$ or $r$ in different texts):
$$P(Y = k) = \binom{k + \theta - 1}{k} \left(\frac{\theta}{\theta + \mu}\right)^\theta \left(\frac{\mu}{\theta + \mu}\right)^k, \quad k = 0, 1, 2, \ldots$$
The mean is $E(Y) = \mu$ and the variance is $\text{Var}(Y) = \mu + \mu^2/\theta$, so the variance exceeds the mean by $\mu^2/\theta$. This captures overdispersion: smaller values of $\theta$ imply more extra-Poisson variation, and as $\theta \to \infty$ the distribution approaches the Poisson.
Some software uses $\alpha = 1/\theta$, giving: $$\text{Var}(Y) = \mu + \alpha \mu^2 = \mu(1 + \alpha\mu)$$
Here $\alpha \geq 0$ is the overdispersion parameter; $\alpha = 0$ gives Poisson.
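As a quick check of this parameterization, SciPy's `nbinom` distribution (parameterized by $n$ and $p$) represents the same distribution via $n = \theta$, $p = \theta/(\theta + \mu)$. The sketch below, with illustrative values of $\mu$ and $\theta$, verifies the mean and variance formulas and the Poisson limit as $\theta \to \infty$.

```python
# Minimal sketch: map the (mu, theta) parameterization onto scipy's nbinom(n, p),
# with n = theta and p = theta / (theta + mu). The values of mu and theta are illustrative.
import numpy as np
from scipy import stats

mu, theta = 4.0, 1.8
dist = stats.nbinom(n=theta, p=theta / (theta + mu))

print(dist.mean())   # -> 4.0, equals mu
print(dist.var())    # -> mu + mu**2 / theta = 4 + 16/1.8 ≈ 12.89
print(dist.pmf(0))   # P(Y = 0) under this NB

# As theta grows, the NB pmf approaches the Poisson pmf with the same mean
big_theta = 1e6
nb_limit = stats.nbinom(n=big_theta, p=big_theta / (big_theta + mu))
ks = np.arange(30)
print(np.max(np.abs(nb_limit.pmf(ks) - stats.poisson(mu).pmf(ks))))  # ≈ 0
```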
The most intuitive way to understand the negative binomial is as a Poisson distribution with a random rate: $Y \mid \lambda \sim \text{Poisson}(\lambda)$, where $\lambda \sim \text{Gamma}(\text{shape} = \theta,\ \text{mean} = \mu)$.
Integrating out the latent rate recovers the negative binomial pmf given above: $$P(Y = k) = \int_0^\infty P(Y=k \mid \lambda)\, f(\lambda)\, d\lambda = \text{NB}(k; \mu, \theta)$$
Interpretation: The negative binomial models unobserved heterogeneity in the Poisson rate. Different subjects have different underlying propensities, even after controlling for observed predictors. This heterogeneity creates the extra variance.
Example: Consider hospital emergency visits. Two patients with the same age, insurance, and medical history may have different unobserved factors (health behaviors, genetics, local environment) affecting their visit rates. The Gamma distribution captures this latent heterogeneity.
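A short simulation makes the mixture interpretation concrete; the values $\mu = 4$ and $\theta = 1.8$ below are illustrative, not taken from any data set.

```python
# Simulate a Poisson with Gamma-distributed rates and check that the resulting
# counts have variance mu + mu**2 / theta, as the NB predicts.
import numpy as np

rng = np.random.default_rng(0)
mu, theta, n = 4.0, 1.8, 1_000_000

# lambda_i ~ Gamma(shape=theta, scale=mu/theta)  =>  E[lambda] = mu, Var[lambda] = mu**2/theta
rates = rng.gamma(shape=theta, scale=mu / theta, size=n)
y = rng.poisson(rates)

print(y.mean())             # ≈ 4.0
print(y.var())              # ≈ 12.89
print(mu + mu**2 / theta)   # theoretical NB variance: 12.89
```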
| Property | Poisson | Negative Binomial |
|---|---|---|
| Mean | μ | μ |
| Variance | μ (equidispersion) | μ + μ²/θ (overdispersed) |
| Dispersion parameter | None (fixed at 1) | θ (estimated from data) |
| Interpretation | Fixed rate for all | Random rates across subjects |
| When to use | Var ≈ Mean | Var >> Mean |
| As θ → ∞ | — | Reduces to Poisson |
The parameterization above is called NB2. An alternative, NB1, has Var(Y) = μ(1 + α), where variance is linear in the mean rather than quadratic. NB2 is more common because it arises from the Poisson-Gamma mixture and is the default in most software.
Negative binomial regression follows the GLM framework, with the added complexity of estimating the dispersion parameter.
Random Component: $$Y_i \sim \text{NegBin}(\mu_i, \theta)$$
with $\text{Var}(Y_i) = \mu_i + \mu_i^2/\theta$.
Systematic Component: $$\eta_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$$
Link Function (Log): $$\log(\mu_i) = \eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
The log link is used for the same reasons as Poisson: it ensures positive means and gives multiplicative interpretations.
For independent observations:
$$\ell(\boldsymbol{\beta}, \theta) = \sum_{i=1}^n \Bigg[ Y_i \log\left(\frac{\mu_i}{\mu_i + \theta}\right) + \theta \log\left(\frac{\theta}{\mu_i + \theta}\right) + \log\binom{Y_i + \theta - 1}{Y_i} \Bigg]$$
where $\mu_i = \exp(\mathbf{x}_i^\top \boldsymbol{\beta})$.
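For reference, this log-likelihood can be evaluated directly with SciPy's `nbinom.logpmf` under the same mapping as before ($n = \theta$, $p = \theta/(\theta+\mu_i)$). The function below is a minimal sketch assuming a design matrix `X` (with an intercept column) and a count vector `y`; these names are placeholders, not outputs of any fitted model.

```python
# Minimal sketch of the NB log-likelihood l(beta, theta) given X and y.
import numpy as np
from scipy import stats

def nb_loglik(beta, theta, X, y):
    mu = np.exp(X @ beta)  # mu_i = exp(x_i' beta)
    return stats.nbinom.logpmf(y, n=theta, p=theta / (theta + mu)).sum()
```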
We need to estimate both $\boldsymbol{\beta}$ (regression coefficients) and $\theta$ (dispersion). Two approaches:
Approach 1: Joint Maximum Likelihood Maximize $\ell(\boldsymbol{\beta}, \theta)$ simultaneously over both parameters using Newton-Raphson or Fisher scoring.
Approach 2: Profile Likelihood (common in practice). Alternate between (1) fitting $\boldsymbol{\beta}$ by IRLS with $\theta$ held fixed, exactly as in a standard GLM, and (2) updating $\theta$ by maximizing the likelihood with $\boldsymbol{\beta}$ (and hence the $\mu_i$) held fixed, repeating until both converge.
This iterative approach is implemented in most software. Convergence is typically fast.
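A rough sketch of this alternation is shown below, assuming count data `y` and a design matrix `X` as in the examples on this page; the function name and iteration count are made up for illustration.

```python
# Sketch of the profile-likelihood alternation for NB regression:
# beta given alpha via GLM/IRLS, then alpha = 1/theta given beta via 1-D optimization.
import numpy as np
import statsmodels.api as sm
from scipy import stats, optimize

def fit_nb_profile(y, X, alpha=1.0, n_iter=10):
    for _ in range(n_iter):
        # Step 1: beta given alpha (standard IRLS fit with fixed dispersion)
        res = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=alpha)).fit()
        mu = res.fittedvalues

        # Step 2: alpha given beta, maximizing the NB log-likelihood in alpha
        def negll(a):
            theta = 1.0 / a
            return -stats.nbinom.logpmf(y, n=theta, p=theta / (theta + mu)).sum()

        alpha = optimize.minimize_scalar(negll, bounds=(1e-4, 100), method='bounded').x
    return res, alpha
```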
Because the log link is used, coefficient interpretation is identical to Poisson regression: $\exp(\beta_j)$ is the incidence rate ratio (IRR), the multiplicative change in the expected count for a one-unit increase in $X_j$, holding the other predictors constant.
The difference from Poisson is in the standard errors and model fit, not the point estimates (which are often similar but not identical).
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Generate overdispersed count data
np.random.seed(42)
n = 500

# Predictors
age = np.random.uniform(20, 65, n)
income = np.random.exponential(50, n)              # in $1000s
education = np.random.choice([12, 14, 16, 18], n)  # years

# True model: doctor visits per year
log_mu = -1.5 + 0.02*age - 0.01*income + 0.1*education
mu = np.exp(log_mu)

# Generate with overdispersion via Poisson-Gamma mixture
theta_true = 1.5  # overdispersion parameter
# Random rates from Gamma distribution
random_rates = np.random.gamma(shape=theta_true, scale=mu/theta_true)
visits = np.random.poisson(random_rates)

df = pd.DataFrame({
    'visits': visits,
    'age': age,
    'income': income,
    'education': education
})

# Check for overdispersion
print("="*60)
print("DATA SUMMARY")
print("="*60)
print(f"Mean visits: {df['visits'].mean():.2f}")
print(f"Variance: {df['visits'].var():.2f}")
print(f"Variance/Mean ratio: {df['visits'].var()/df['visits'].mean():.2f}")
print("(Ratio >> 1 indicates overdispersion)")

# Fit Poisson model first
X = sm.add_constant(df[['age', 'income', 'education']])
poisson_model = sm.GLM(df['visits'], X, family=sm.families.Poisson())
poisson_results = poisson_model.fit()

print("\n" + "="*60)
print("POISSON MODEL")
print("="*60)
print(f"Deviance: {poisson_results.deviance:.1f}")
print(f"Deviance/df: {poisson_results.deviance/poisson_results.df_resid:.2f}")
print("(Should be ~1 for good fit)")

# Fit Negative Binomial model
# Note: the GLM NegativeBinomial family treats the dispersion parameter alpha
# as fixed (default 1.0) rather than estimating it from the data.
nb_model = sm.GLM(df['visits'], X, family=sm.families.NegativeBinomial())
nb_results = nb_model.fit()

print("\n" + "="*60)
print("NEGATIVE BINOMIAL MODEL")
print("="*60)
print(nb_results.summary())

# Compare coefficients
print("\n" + "="*60)
print("COEFFICIENT COMPARISON (IRRs)")
print("="*60)
comparison = pd.DataFrame({
    'Poisson IRR': np.exp(poisson_results.params),
    'NB IRR': np.exp(nb_results.params),
    'Poisson SE': poisson_results.bse,
    'NB SE': nb_results.bse
})
print(comparison.round(4))
print("\nNote: NB standard errors are larger, reflecting proper uncertainty quantification")
```

With NB regression, standard errors are typically LARGER than Poisson because we're accounting for extra-Poisson variation. Point estimates may shift slightly but are often similar. The big wins are: (1) valid inference (correct SEs), (2) better prediction intervals, (3) proper model comparison via AIC.
Given count data, how do we decide between Poisson and negative binomial regression? Several tools are available.
Since Poisson is a special case of NB (when $\theta \to \infty$, i.e., $\alpha = 1/\theta \to 0$), we can use a likelihood ratio test:
$$\text{LR} = 2[\ell_{\text{NB}} - \ell_{\text{Poisson}}]$$
However, this tests $H_0: \alpha = 0$ on the boundary of the parameter space (since $\alpha \geq 0$), so the usual $\chi^2_1$ distribution isn't quite right. The correct null distribution is a 50:50 mixture of a point mass at 0 and $\chi^2_1$.
In practice, a simple adjustment is to halve the p-value from a standard $\chi^2_1$ test; equivalently, if the standard $\chi^2_1$ p-value is below 0.10, the mixture-adjusted p-value is below 0.05 and overdispersion is likely present.
AIC and BIC can compare models: $$\text{AIC} = -2\ell + 2p, \quad \text{BIC} = -2\ell + p \log n$$
Lower is better. NB has one more parameter than Poisson, so it must improve the likelihood sufficiently to compensate.
An auxiliary regression test (in the spirit of Cameron and Trivedi) regresses a function of the squared residuals, such as $((y_i - \hat\mu_i)^2 - y_i)/\hat\mu_i$, on the fitted values $\hat\mu_i$. A significantly positive slope indicates overdispersion of the NB2 form, as sketched below.
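As a sketch, assuming `poisson_results` is a fitted Poisson GLM like the one in the doctor-visits example above, the auxiliary regression takes only a few lines:

```python
# Auxiliary (Cameron-Trivedi style) overdispersion regression on a fitted Poisson GLM.
import numpy as np
import statsmodels.api as sm

mu_hat = poisson_results.fittedvalues
y = poisson_results.model.endog
z = ((y - mu_hat) ** 2 - y) / mu_hat  # roughly alpha * mu under NB2 overdispersion

aux = sm.OLS(z, mu_hat).fit()          # no intercept: the slope estimates alpha
print(aux.params, aux.tvalues)         # positive, significant slope => overdispersion
```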
| Criterion | Choose Poisson | Choose Negative Binomial |
|---|---|---|
| Deviance/df | ≈ 1 | > 1 (often > 1.5) |
| Observed Var/Mean | ≈ 1 | > 1 |
| LR test p-value | ≥ 0.10 | < 0.10 (prefer NB) |
| AIC comparison | Poisson AIC lower | NB AIC lower |
| Residual plot | No pattern | Increasing spread with fitted |
Important Caveat: Neither Poisson nor NB may be adequate if there are more zeros than even the NB allows (consider the zero-inflated or hurdle models discussed below), if observations are correlated or clustered (consider mixed models), or if the mean model itself is misspecified.
Negative binomial is 'safe' in the sense that if the data are truly Poisson, NB will estimate θ → ∞ and replicate the Poisson results. But if there is undetected overdispersion, Poisson will give incorrect standard errors. NB is more robust to this misspecification.
The dispersion parameter $\theta$ (or $\alpha = 1/\theta$) deserves careful attention—it quantifies the degree of overdispersion and has substantive interpretation.
Recall: $\text{Var}(Y) = \mu + \mu^2/\theta$
In the Poisson-Gamma mixture interpretation, $\theta$ is the shape parameter of the Gamma distribution of latent rates: larger $\theta$ means the rates are tightly concentrated around $\mu$ (little heterogeneity), while smaller $\theta$ means substantial heterogeneity.
With $\alpha = 1/\theta$: $\text{Var}(Y) = \mu(1 + \alpha\mu)$
When reporting NB regression results, report the estimated dispersion parameter $\theta$ (or $\alpha$) with a confidence interval alongside the incidence rate ratios, and translate it into a variance statement at a typical count. Example interpretation:
'The estimated dispersion parameter θ = 1.8 (95% CI: 1.3, 2.6) indicates substantial overdispersion. The variance at the mean count of 4 is Var = 4 + 16/1.8 = 12.9, compared to 4 under Poisson—about 3 times larger.'
Software handles the dispersion parameter differently: some routines estimate $\theta$ (or $\alpha$) jointly with the regression coefficients by maximum likelihood, while others require you to fix it at a chosen value, as illustrated below.
The choice affects standard errors—fixing θ at a wrong value leads to incorrect inference.
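As a concrete illustration: in statsmodels, the GLM `NegativeBinomial` family keeps $\alpha$ fixed (default 1.0), while the discrete-model `NegativeBinomial` class estimates $\alpha$ by maximum likelihood. The sketch below assumes the `df` and `X` objects from the doctor-visits example above.

```python
# Fixing alpha vs. estimating it by ML in statsmodels (assumes df and X from earlier).
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import NegativeBinomial

# alpha fixed at a chosen value
glm_fixed = sm.GLM(df['visits'], X,
                   family=sm.families.NegativeBinomial(alpha=0.5)).fit()

# alpha estimated jointly with beta (reported as 'alpha' in the summary)
nb_ml = NegativeBinomial(df['visits'], X).fit(disp=False)
print(nb_ml.params['alpha'], 1.0 / nb_ml.params['alpha'])  # alpha-hat and theta-hat = 1/alpha
```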
In the Poisson-Gamma mixture, the coefficient of variation of the latent rates is CV = 1/√θ. So θ = 1 corresponds to 100% CV in rates; θ = 4 corresponds to 50% CV. This gives an intuitive sense of how much heterogeneity exists.
Just like Poisson regression, negative binomial models can incorporate offset terms to model rates with different exposures.
If $t_i$ is the exposure for observation $i$:
$$Y_i \sim \text{NegBin}(\mu_i = t_i \cdot r_i, \theta)$$
where $r_i$ is the rate per unit exposure. The model becomes:
$$\log(\mu_i) = \log(t_i) + \mathbf{x}_i^\top \boldsymbol{\beta}$$
The offset $\log(t_i)$ accounts for varying exposure while modeling the rate.
Suppose we model cancer cases in different counties: the outcome is the number of cases in each county, the exposure is person-years at risk (population times years of observation), and the offset is the log of that exposure.
The model estimates effects on the rate (cases per person-year), adjusted for population differences.
The negative binomial with offset is ideal when counts are observed over varying exposures (person-time, area, opportunities), the scientific question concerns rates rather than raw counts, and the data show more variation than a Poisson rate model allows.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Example: Workplace injury rates across companies

np.random.seed(123)
n_companies = 200

# Company characteristics
employees = np.random.exponential(500, n_companies).astype(int) + 50  # roughly 50+ employees
safety_score = np.random.uniform(1, 10, n_companies)                  # 1-10 rating
is_manufacturing = np.random.binomial(1, 0.3, n_companies)
years_observed = np.random.uniform(1, 5, n_companies)

# Exposure: employee-years
exposure = employees * years_observed

# True rate model: injuries per employee-year
log_rate = -4 - 0.2*safety_score + 0.8*is_manufacturing
true_rate = np.exp(log_rate)
true_mean = exposure * true_rate

# Generate overdispersed counts
theta = 2.0
random_rates = np.random.gamma(shape=theta, scale=true_mean/theta)
injuries = np.random.poisson(np.maximum(random_rates, 0.1))

df = pd.DataFrame({
    'injuries': injuries,
    'employees': employees,
    'safety_score': safety_score,
    'is_manufacturing': is_manufacturing,
    'years_observed': years_observed,
    'exposure': exposure,
    'log_exposure': np.log(exposure)
})

# Fit NB model with offset
X = sm.add_constant(df[['safety_score', 'is_manufacturing']])

nb_rate_model = sm.GLM(df['injuries'], X,
                       family=sm.families.NegativeBinomial(),
                       offset=df['log_exposure'])
nb_results = nb_rate_model.fit()

print("="*60)
print("NEGATIVE BINOMIAL RATE MODEL")
print("(Modeling injuries per employee-year)")
print("="*60)
print(nb_results.summary())

# Interpret as Rate Ratios
print("\nIncidence Rate Ratios:")
print("-"*40)
irr = np.exp(nb_results.params)
for name, val in irr.items():
    print(f"{name}: {val:.4f}")

print("\nInterpretation:")
print(f"• Each 1-point increase in safety score reduces injury rate by {(1-irr['safety_score'])*100:.1f}%")
print(f"• Manufacturing companies have {(irr['is_manufacturing']-1)*100:.0f}% higher injury rate")
```

Overdispersion is especially common in rate models because exposure varies. Large-exposure observations contribute more variance than small-exposure ones. The NB model handles this naturally through the μ²/θ term, which scales with the square of the expected count.
The negative binomial family has several extensions and related models for different data-generating processes.
When there are more zeros than the NB distribution predicts, consider ZINB:
$$P(Y = 0) = \pi + (1-\pi) \cdot \text{NB}(0; \mu, \theta)$$ $$P(Y = k) = (1-\pi) \cdot \text{NB}(k; \mu, \theta), \quad k > 0$$
where $\pi$ is the probability of a 'structural zero' from a separate process.
Example: Modeling cigarettes smoked per day. Some people are never-smokers (structural zeros), while smokers have variable consumption following NB.
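A small simulation, with made-up values of $\pi$, $\mu$, and $\theta$, shows how structural zeros inflate the zero probability exactly as the ZINB formula predicts:

```python
# Generate ZINB-style data: with probability pi the count is a structural zero,
# otherwise it is drawn from NB(mu, theta). Values are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, pi, mu, theta = 100_000, 0.3, 2.0, 1.5

nb_dist = stats.nbinom(n=theta, p=theta / (theta + mu))
nb_counts = nb_dist.rvs(size=n, random_state=rng)
structural_zero = rng.random(n) < pi
y = np.where(structural_zero, 0, nb_counts)

print((y == 0).mean())                      # observed zero fraction
print(pi + (1 - pi) * nb_dist.pmf(0))       # ZINB formula: pi + (1 - pi) * NB(0; mu, theta)
```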
A hurdle model treats zeros and positive counts as coming from separate processes: a binary model (e.g., logistic) determines whether the count is zero or positive, and a zero-truncated count model (e.g., truncated NB) determines how large the positive counts are.
This is appropriate when the process for 'any' differs from 'how many'.
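A minimal sketch of the hurdle pmf, with illustrative parameter values, shows how the binary "any" part and the zero-truncated NB "how many" part combine:

```python
# Hurdle-NB pmf: P(Y=0) = 1 - p_pos; for k >= 1, positives follow a zero-truncated NB.
import numpy as np
from scipy import stats

def hurdle_nb_pmf(k, p_pos, mu, theta):
    """P(Y = k) under a hurdle model with crossing probability p_pos."""
    nb = stats.nbinom(n=theta, p=theta / (theta + mu))
    k = np.asarray(k)
    pos_part = p_pos * nb.pmf(k) / (1.0 - nb.pmf(0))  # renormalize the NB over k >= 1
    return np.where(k == 0, 1.0 - p_pos, pos_part)

print(hurdle_nb_pmf(np.arange(50), p_pos=0.6, mu=3.0, theta=2.0).sum())  # ≈ 1 (tiny tail truncated)
```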
Two variance specifications are in common use: NB1, with $\text{Var}(Y) = \mu(1 + \alpha)$ (variance linear in the mean), and NB2, with $\text{Var}(Y) = \mu(1 + \alpha\mu)$ (variance quadratic in the mean).
NB2 is more common; NB1 may fit better when overdispersion doesn't increase with the mean.
Some implementations allow observation-specific dispersion: $$\text{Var}(Y_i) = \mu_i + \mu_i^2 / \theta_i$$
where $\theta_i$ can depend on covariates. This relaxes the assumption of constant overdispersion across observations.
For clustered data (e.g., repeated measures, hierarchical structure), a generalized linear mixed model with NB distribution:
$$\log(\mu_{ij}) = \mathbf{x}_{ij}^\top \boldsymbol{\beta} + u_i$$
where $u_i$ is a random intercept for cluster $i$. This handles both overdispersion and correlation within clusters.
Start simple: Poisson → NB → ZINB/Hurdle → GLMM. Add complexity only when diagnostics demand it. Compare models using AIC/BIC, residual plots, and Vuong tests (for non-nested models like NB vs. ZIP).
Let's work through a complete analysis comparing Poisson and NB models.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Comprehensive NB Analysis: Publications by Scientists

np.random.seed(456)
n = 400

# Predictors
years_since_phd = np.random.uniform(0, 30, n)
grants = np.random.poisson(2, n)
is_tenured = (years_since_phd > 7) & (np.random.uniform(0, 1, n) > 0.3)
log_citations = np.random.normal(5, 2, n)

# Generate overdispersed publication counts
log_mu = 0.5 + 0.05*years_since_phd + 0.3*grants + 0.1*is_tenured + 0.05*log_citations
mu = np.exp(log_mu)

# Overdispersion with theta = 1.5
theta_true = 1.5
random_rates = np.random.gamma(shape=theta_true, scale=mu/theta_true)
publications = np.random.poisson(random_rates)

df = pd.DataFrame({
    'publications': publications,
    'years_since_phd': years_since_phd,
    'grants': grants,
    'is_tenured': is_tenured.astype(int),
    'log_citations': log_citations
})

print("="*70)
print("COMPREHENSIVE NEGATIVE BINOMIAL REGRESSION ANALYSIS")
print("="*70)

# Step 1: Explore the data
print("\n📊 STEP 1: EXPLORATORY DATA ANALYSIS")
print("-"*50)
print(f"Sample size: {n}")
print(f"Mean publications: {df['publications'].mean():.2f}")
print(f"Variance: {df['publications'].var():.2f}")
print(f"Variance/Mean ratio: {df['publications'].var()/df['publications'].mean():.2f}")
print(f"Proportion of zeros: {(df['publications']==0).mean()*100:.1f}%")

# Step 2: Fit Poisson model
print("\n📈 STEP 2: POISSON MODEL (Baseline)")
print("-"*50)
X = sm.add_constant(df[['years_since_phd', 'grants', 'is_tenured', 'log_citations']])
poisson_model = sm.GLM(df['publications'], X, family=sm.families.Poisson())
poisson_results = poisson_model.fit()

print(f"Log-likelihood: {poisson_results.llf:.1f}")
print(f"AIC: {poisson_results.aic:.1f}")
print(f"Deviance: {poisson_results.deviance:.1f}")
print(f"Deviance/df: {poisson_results.deviance/poisson_results.df_resid:.2f}")
print(f"Pearson χ²/df: {poisson_results.pearson_chi2/poisson_results.df_resid:.2f}")

if poisson_results.pearson_chi2/poisson_results.df_resid > 1.5:
    print("⚠️ Clear evidence of overdispersion!")

# Step 3: Fit Negative Binomial model
# (GLM's NegativeBinomial family uses a fixed alpha, default 1.0)
print("\n📈 STEP 3: NEGATIVE BINOMIAL MODEL")
print("-"*50)
nb_model = sm.GLM(df['publications'], X, family=sm.families.NegativeBinomial())
nb_results = nb_model.fit()

print(f"Log-likelihood: {nb_results.llf:.1f}")
print(f"AIC: {nb_results.aic:.1f}")
print(f"Deviance: {nb_results.deviance:.1f}")
print(f"Deviance/df: {nb_results.deviance/nb_results.df_resid:.2f}")

# Step 4: Compare models
print("\n⚖️ STEP 4: MODEL COMPARISON")
print("-"*50)
lr_stat = 2 * (nb_results.llf - poisson_results.llf)
# Under the 50:50 mixture null (point mass at 0 and chi2(1)),
# the p-value is half the chi2(1) tail probability
p_value = 0.5 * (1 - stats.chi2.cdf(lr_stat, df=1))

print(f"Likelihood Ratio Statistic: {lr_stat:.1f}")
print(f"p-value (mixture null): {p_value:.4f}")
print(f"AIC improvement: {poisson_results.aic - nb_results.aic:.1f} points")

if nb_results.aic < poisson_results.aic:
    print("✅ NB model is preferred by AIC")
else:
    print("Poisson model is sufficient")

# Step 5: Final results
print("\n📋 STEP 5: FINAL RESULTS (Negative Binomial)")
print("-"*50)
print(nb_results.summary())

# IRRs with CIs
print("\n🎯 INCIDENCE RATE RATIOS WITH 95% CIs:")
print("-"*50)
irr = np.exp(nb_results.params)
irr_ci = np.exp(nb_results.conf_int())
for name in irr.index:
    low, high = irr_ci.loc[name]
    print(f"{name:20s}: IRR = {irr[name]:.3f} (95% CI: {low:.3f}, {high:.3f})")

print("\n💡 KEY FINDINGS:")
print("-"*50)
print(f"• Each additional grant increases publications by {(irr['grants']-1)*100:.0f}%")
print(f"• Tenured researchers have {(irr['is_tenured']-1)*100:.0f}% more publications")
# Per-decade effect: exp(10*beta) = IRR**10
print(f"• Each decade post-PhD adds {(irr['years_since_phd']**10 - 1)*100:.0f}% more publications")
```

We've comprehensively covered negative binomial regression, the essential tool for overdispersed count data. Let's consolidate the key concepts:
Module Complete:
You've now mastered the Generalized Linear Model framework—from its theoretical foundations (exponential family, link functions) through its most important applications (logistic, Poisson, and negative binomial regression). This knowledge forms the core of modern statistical modeling and provides the conceptual foundation for many machine learning techniques.
Key skills developed: recognizing exponential-family distributions and choosing link functions; fitting and interpreting logistic, Poisson, and negative binomial regressions; diagnosing overdispersion; and comparing competing models using deviance, AIC/BIC, and likelihood ratio tests.
Congratulations! You've completed the Generalized Linear Models module. You now have a deep understanding of the unified GLM framework that subsumes linear regression, logistic regression, Poisson regression, and negative binomial regression. This knowledge is fundamental for any advanced work in statistical modeling and machine learning.