In the previous page, we encountered a persistent challenge: real-world count data often violate the Poisson assumption that the variance equals the mean. This overdispersion can arise from unobserved heterogeneity across subjects, clustering or correlation among observations, and more zeros than the Poisson predicts.
When overdispersion is substantial, we need a distribution that can accommodate variance greater than the mean. The negative binomial distribution provides exactly this flexibility, with an extra parameter that controls the degree of overdispersion.
Negative binomial regression is the natural choice when the outcome is a count, the observed variance clearly exceeds the mean, and the extra variation reflects unobserved heterogeneity in the underlying rates rather than a separate zero-generating process.
By the end of this page, you will: (1) understand the negative binomial distribution and its Poisson-gamma mixture interpretation, (2) know how to specify, fit, and interpret negative binomial regression, (3) be able to compare Poisson and NB models and choose between them, and (4) understand the relationship between NB and other overdispersion solutions.
Before modeling, let's understand the negative binomial distribution itself.
The negative binomial distribution can be parameterized in terms of the mean $\mu$ and a dispersion parameter $\theta > 0$ (also denoted $\alpha$ or $r$ in different texts):
$$P(Y = k) = \binom{k + \theta - 1}{k} \left(\frac{\theta}{\theta + \mu}\right)^\theta \left(\frac{\mu}{\theta + \mu}\right)^k, \quad k = 0, 1, 2, \ldots$$
The mean is $E(Y) = \mu$ and the variance is $\text{Var}(Y) = \mu + \mu^2/\theta$, so the variance exceeds the mean by $\mu^2/\theta$. This captures overdispersion: smaller values of $\theta$ imply more extra-Poisson variation, and as $\theta \to \infty$ the distribution approaches the Poisson.
Some software uses $\alpha = 1/\theta$, giving: $$\text{Var}(Y) = \mu + \alpha \mu^2 = \mu(1 + \alpha\mu)$$
Here $\alpha \geq 0$ is the overdispersion parameter; $\alpha = 0$ gives Poisson.
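As a quick check of this parameterization, SciPy's `nbinom` distribution (parameterized by $n$ and $p$) represents the same distribution via $n = \theta$, $p = \theta/(\theta + \mu)$. The sketch below, with illustrative values of $\mu$ and $\theta$, verifies the mean and variance formulas and the Poisson limit as $\theta \to \infty$.

```python
# Minimal sketch: map the (mu, theta) parameterization onto scipy's nbinom(n, p),
# with n = theta and p = theta / (theta + mu). The values of mu and theta are illustrative.
import numpy as np
from scipy import stats

mu, theta = 4.0, 1.8
dist = stats.nbinom(n=theta, p=theta / (theta + mu))

print(dist.mean())   # -> 4.0, equals mu
print(dist.var())    # -> mu + mu**2 / theta = 4 + 16/1.8 ≈ 12.89
print(dist.pmf(0))   # P(Y = 0) under this NB

# As theta grows, the NB pmf approaches the Poisson pmf with the same mean
big_theta = 1e6
nb_limit = stats.nbinom(n=big_theta, p=big_theta / (big_theta + mu))
ks = np.arange(30)
print(np.max(np.abs(nb_limit.pmf(ks) - stats.poisson(mu).pmf(ks))))  # ≈ 0
```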
The most intuitive way to understand the negative binomial is as a Poisson distribution with a random rate: $Y \mid \lambda \sim \text{Poisson}(\lambda)$, where $\lambda \sim \text{Gamma}(\text{shape} = \theta,\ \text{mean} = \mu)$.
Integrating out the latent rate recovers the negative binomial pmf given above: $$P(Y = k) = \int_0^\infty P(Y=k \mid \lambda)\, f(\lambda)\, d\lambda = \text{NB}(k; \mu, \theta)$$
Interpretation: The negative binomial models unobserved heterogeneity in the Poisson rate. Different subjects have different underlying propensities, even after controlling for observed predictors. This heterogeneity creates the extra variance.
Example: Consider hospital emergency visits. Two patients with the same age, insurance, and medical history may have different unobserved factors (health behaviors, genetics, local environment) affecting their visit rates. The Gamma distribution captures this latent heterogeneity.
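A short simulation makes the mixture interpretation concrete; the values $\mu = 4$ and $\theta = 1.8$ below are illustrative, not taken from any data set.

```python
# Simulate a Poisson with Gamma-distributed rates and check that the resulting
# counts have variance mu + mu**2 / theta, as the NB predicts.
import numpy as np

rng = np.random.default_rng(0)
mu, theta, n = 4.0, 1.8, 1_000_000

# lambda_i ~ Gamma(shape=theta, scale=mu/theta)  =>  E[lambda] = mu, Var[lambda] = mu**2/theta
rates = rng.gamma(shape=theta, scale=mu / theta, size=n)
y = rng.poisson(rates)

print(y.mean())             # ≈ 4.0
print(y.var())              # ≈ 12.89
print(mu + mu**2 / theta)   # theoretical NB variance: 12.89
```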
| Property | Poisson | Negative Binomial |
|---|---|---|
| Mean | μ | μ |
| Variance | μ (equidispersion) | μ + μ²/θ (overdispersed) |
| Dispersion parameter | None (fixed at 1) | θ (estimated from data) |
| Interpretation | Fixed rate for all | Random rates across subjects |
| When to use | Var ≈ Mean | Var >> Mean |
| As θ → ∞ | — | Reduces to Poisson |
The parameterization above is called NB2. An alternative, NB1, has Var(Y) = μ(1 + α), where variance is linear in the mean rather than quadratic. NB2 is more common because it arises from the Poisson-Gamma mixture and is the default in most software.
Negative binomial regression follows the GLM framework, with the added complexity of estimating the dispersion parameter.
Random Component: $$Y_i \sim \text{NegBin}(\mu_i, \theta)$$
with $\text{Var}(Y_i) = \mu_i + \mu_i^2/\theta$.
Systematic Component: $$\eta_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$$
Link Function (Log): $$\log(\mu_i) = \eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}$$
The log link is used for the same reasons as Poisson: it ensures positive means and gives multiplicative interpretations.
For independent observations:
$$\ell(\boldsymbol{\beta}, \theta) = \sum_{i=1}^n \Bigg[ Y_i \log\left(\frac{\mu_i}{\mu_i + \theta}\right) + \theta \log\left(\frac{\theta}{\mu_i + \theta}\right) + \log\binom{Y_i + \theta - 1}{Y_i} \Bigg]$$
where $\mu_i = \exp(\mathbf{x}_i^\top \boldsymbol{\beta})$.
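For reference, this log-likelihood can be evaluated directly with SciPy's `nbinom.logpmf` under the same mapping as before ($n = \theta$, $p = \theta/(\theta+\mu_i)$). The function below is a minimal sketch assuming a design matrix `X` (with an intercept column) and a count vector `y`; these names are placeholders, not outputs of any fitted model.

```python
# Minimal sketch of the NB log-likelihood l(beta, theta) given X and y.
import numpy as np
from scipy import stats

def nb_loglik(beta, theta, X, y):
    mu = np.exp(X @ beta)  # mu_i = exp(x_i' beta)
    return stats.nbinom.logpmf(y, n=theta, p=theta / (theta + mu)).sum()
```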
We need to estimate both $\boldsymbol{\beta}$ (regression coefficients) and $\theta$ (dispersion). Two approaches:
Approach 1: Joint Maximum Likelihood Maximize $\ell(\boldsymbol{\beta}, \theta)$ simultaneously over both parameters using Newton-Raphson or Fisher scoring.
Approach 2: Profile Likelihood (common in practice). Alternate between (1) fitting $\boldsymbol{\beta}$ by IRLS with $\theta$ held fixed, exactly as in a standard GLM, and (2) updating $\theta$ by maximizing the likelihood with $\boldsymbol{\beta}$ (and hence the $\mu_i$) held fixed, repeating until both converge.
This iterative approach is implemented in most software. Convergence is typically fast.
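A rough sketch of this alternation is shown below, assuming count data `y` and a design matrix `X` as in the examples on this page; the function name and iteration count are made up for illustration.

```python
# Sketch of the profile-likelihood alternation for NB regression:
# beta given alpha via GLM/IRLS, then alpha = 1/theta given beta via 1-D optimization.
import numpy as np
import statsmodels.api as sm
from scipy import stats, optimize

def fit_nb_profile(y, X, alpha=1.0, n_iter=10):
    for _ in range(n_iter):
        # Step 1: beta given alpha (standard IRLS fit with fixed dispersion)
        res = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=alpha)).fit()
        mu = res.fittedvalues

        # Step 2: alpha given beta, maximizing the NB log-likelihood in alpha
        def negll(a):
            theta = 1.0 / a
            return -stats.nbinom.logpmf(y, n=theta, p=theta / (theta + mu)).sum()

        alpha = optimize.minimize_scalar(negll, bounds=(1e-4, 100), method='bounded').x
    return res, alpha
```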
Because the log link is used, coefficient interpretation is identical to Poisson regression: $\exp(\beta_j)$ is the incidence rate ratio (IRR), the multiplicative change in the expected count for a one-unit increase in $X_j$, holding the other predictors constant.
The difference from Poisson is in the standard errors and model fit, not the point estimates (which are often similar but not identical).
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Generate overdispersed count data
np.random.seed(42)
n = 500

# Predictors
age = np.random.uniform(20, 65, n)
income = np.random.exponential(50, n)              # in $1000s
education = np.random.choice([12, 14, 16, 18], n)  # years

# True model: doctor visits per year
log_mu = -1.5 + 0.02*age - 0.01*income + 0.1*education
mu = np.exp(log_mu)

# Generate with overdispersion via Poisson-Gamma mixture
theta_true = 1.5  # overdispersion parameter
# Random rates from Gamma distribution
random_rates = np.random.gamma(shape=theta_true, scale=mu/theta_true)
visits = np.random.poisson(random_rates)

df = pd.DataFrame({
    'visits': visits,
    'age': age,
    'income': income,
    'education': education
})

# Check for overdispersion
print("="*60)
print("DATA SUMMARY")
print("="*60)
print(f"Mean visits: {df['visits'].mean():.2f}")
print(f"Variance: {df['visits'].var():.2f}")
print(f"Variance/Mean ratio: {df['visits'].var()/df['visits'].mean():.2f}")
print("(Ratio >> 1 indicates overdispersion)")

# Fit Poisson model first
X = sm.add_constant(df[['age', 'income', 'education']])
poisson_model = sm.GLM(df['visits'], X, family=sm.families.Poisson())
poisson_results = poisson_model.fit()

print("\n" + "="*60)
print("POISSON MODEL")
print("="*60)
print(f"Deviance: {poisson_results.deviance:.1f}")
print(f"Deviance/df: {poisson_results.deviance/poisson_results.df_resid:.2f}")
print("(Should be ~1 for good fit)")

# Fit Negative Binomial model
# Note: the GLM NegativeBinomial family treats the dispersion parameter alpha
# as fixed (default 1.0) rather than estimating it from the data.
nb_model = sm.GLM(df['visits'], X, family=sm.families.NegativeBinomial())
nb_results = nb_model.fit()

print("\n" + "="*60)
print("NEGATIVE BINOMIAL MODEL")
print("="*60)
print(nb_results.summary())

# Compare coefficients
print("\n" + "="*60)
print("COEFFICIENT COMPARISON (IRRs)")
print("="*60)
comparison = pd.DataFrame({
    'Poisson IRR': np.exp(poisson_results.params),
    'NB IRR': np.exp(nb_results.params),
    'Poisson SE': poisson_results.bse,
    'NB SE': nb_results.bse
})
print(comparison.round(4))
print("\nNote: NB standard errors are larger, reflecting proper uncertainty quantification")
```

With NB regression, standard errors are typically LARGER than Poisson because we're accounting for extra-Poisson variation. Point estimates may shift slightly but are often similar. The big wins are: (1) valid inference (correct SEs), (2) better prediction intervals, (3) proper model comparison via AIC.
Given count data, how do we decide between Poisson and negative binomial regression? Several tools are available.
Since Poisson is a special case of NB (when $\theta \to \infty$, i.e., $\alpha = 1/\theta \to 0$), we can use a likelihood ratio test:
$$\text{LR} = 2[\ell_{\text{NB}} - \ell_{\text{Poisson}}]$$
However, this tests $H_0: \alpha = 0$ on the boundary of the parameter space (since $\alpha \geq 0$), so the usual $\chi^2_1$ distribution isn't quite right. The correct null distribution is a 50:50 mixture of a point mass at 0 and $\chi^2_1$.
In practice, a simple adjustment is to halve the p-value from a standard $\chi^2_1$ test; equivalently, if the standard $\chi^2_1$ p-value is below 0.10, the mixture-adjusted p-value is below 0.05 and overdispersion is likely present.
AIC and BIC can compare models: $$\text{AIC} = -2\ell + 2p, \quad \text{BIC} = -2\ell + p \log n$$
Lower is better. NB has one more parameter than Poisson, so it must improve the likelihood sufficiently to compensate.
An auxiliary regression test (in the spirit of Cameron and Trivedi) regresses a function of the squared residuals, such as $((y_i - \hat\mu_i)^2 - y_i)/\hat\mu_i$, on the fitted values $\hat\mu_i$. A significantly positive slope indicates overdispersion of the NB2 form, as sketched below.
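As a sketch, assuming `poisson_results` is a fitted Poisson GLM like the one in the doctor-visits example above, the auxiliary regression takes only a few lines:

```python
# Auxiliary (Cameron-Trivedi style) overdispersion regression on a fitted Poisson GLM.
import numpy as np
import statsmodels.api as sm

mu_hat = poisson_results.fittedvalues
y = poisson_results.model.endog
z = ((y - mu_hat) ** 2 - y) / mu_hat  # roughly alpha * mu under NB2 overdispersion

aux = sm.OLS(z, mu_hat).fit()          # no intercept: the slope estimates alpha
print(aux.params, aux.tvalues)         # positive, significant slope => overdispersion
```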
| Criterion | Choose Poisson | Choose Negative Binomial |
|---|---|---|
| Deviance/df | ≈ 1 | > 1 (often > 1.5) |
| Observed Var/Mean | ≈ 1 | > 1 |
| LR test p-value | ≥ 0.10 | < 0.10 (prefer NB) |
| AIC comparison | Poisson AIC lower | NB AIC lower |
| Residual plot | No pattern | Increasing spread with fitted |
Important Caveat: Neither Poisson nor NB may be adequate if there are more zeros than even the NB allows (consider the zero-inflated or hurdle models discussed below), if observations are correlated or clustered (consider mixed models), or if the mean model itself is misspecified.
Negative binomial is 'safe' in the sense that if the data are truly Poisson, NB will estimate θ → ∞ and replicate the Poisson results. But if there is undetected overdispersion, Poisson will give incorrect standard errors. NB is more robust to this misspecification.
The dispersion parameter $\theta$ (or $\alpha = 1/\theta$) deserves careful attention—it quantifies the degree of overdispersion and has substantive interpretation.
Recall: $\text{Var}(Y) = \mu + \mu^2/\theta$
In the Poisson-Gamma mixture interpretation, $\theta$ is the shape parameter of the Gamma distribution of latent rates: larger $\theta$ means the rates are tightly concentrated around $\mu$ (little heterogeneity), while smaller $\theta$ means substantial heterogeneity.
With $\alpha = 1/\theta$: $\text{Var}(Y) = \mu(1 + \alpha\mu)$
When reporting NB regression results, report the estimated dispersion parameter $\theta$ (or $\alpha$) with a confidence interval alongside the incidence rate ratios, and translate it into a variance statement at a typical count. Example interpretation:
'The estimated dispersion parameter θ = 1.8 (95% CI: 1.3, 2.6) indicates substantial overdispersion. The variance at the mean count of 4 is Var = 4 + 16/1.8 = 12.9, compared to 4 under Poisson—about 3 times larger.'
Software handles the dispersion parameter differently: some routines estimate $\theta$ (or $\alpha$) jointly with the regression coefficients by maximum likelihood, while others require you to fix it at a chosen value, as illustrated below.
The choice affects standard errors—fixing θ at a wrong value leads to incorrect inference.
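As a concrete illustration: in statsmodels, the GLM `NegativeBinomial` family keeps $\alpha$ fixed (default 1.0), while the discrete-model `NegativeBinomial` class estimates $\alpha$ by maximum likelihood. The sketch below assumes the `df` and `X` objects from the doctor-visits example above.

```python
# Fixing alpha vs. estimating it by ML in statsmodels (assumes df and X from earlier).
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import NegativeBinomial

# alpha fixed at a chosen value
glm_fixed = sm.GLM(df['visits'], X,
                   family=sm.families.NegativeBinomial(alpha=0.5)).fit()

# alpha estimated jointly with beta (reported as 'alpha' in the summary)
nb_ml = NegativeBinomial(df['visits'], X).fit(disp=False)
print(nb_ml.params['alpha'], 1.0 / nb_ml.params['alpha'])  # alpha-hat and theta-hat = 1/alpha
```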
In the Poisson-Gamma mixture, the coefficient of variation of the latent rates is CV = 1/√θ. So θ = 1 corresponds to 100% CV in rates; θ = 4 corresponds to 50% CV. This gives an intuitive sense of how much heterogeneity exists.
Just like Poisson regression, negative binomial models can incorporate offset terms to model rates with different exposures.
If $t_i$ is the exposure for observation $i$:
$$Y_i \sim \text{NegBin}(\mu_i = t_i \cdot r_i, \theta)$$
where $r_i$ is the rate per unit exposure. The model becomes:
$$\log(\mu_i) = \log(t_i) + \mathbf{x}_i^\top \boldsymbol{\beta}$$
The offset $\log(t_i)$ accounts for varying exposure while modeling the rate.
Suppose we model cancer cases in different counties: the outcome is the number of cases in each county, the exposure is person-years at risk (population times years of observation), and the offset is the log of that exposure.
The model estimates effects on the rate (cases per person-year), adjusted for population differences.
The negative binomial with offset is ideal when counts are observed over varying exposures (person-time, area, opportunities), the scientific question concerns rates rather than raw counts, and the data show more variation than a Poisson rate model allows.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Example: Workplace injury rates across companies

np.random.seed(123)
n_companies = 200

# Company characteristics
employees = np.random.exponential(500, n_companies).astype(int) + 50  # roughly 50+ employees
safety_score = np.random.uniform(1, 10, n_companies)                  # 1-10 rating
is_manufacturing = np.random.binomial(1, 0.3, n_companies)
years_observed = np.random.uniform(1, 5, n_companies)

# Exposure: employee-years
exposure = employees * years_observed

# True rate model: injuries per employee-year
log_rate = -4 - 0.2*safety_score + 0.8*is_manufacturing
true_rate = np.exp(log_rate)
true_mean = exposure * true_rate

# Generate overdispersed counts
theta = 2.0
random_rates = np.random.gamma(shape=theta, scale=true_mean/theta)
injuries = np.random.poisson(np.maximum(random_rates, 0.1))

df = pd.DataFrame({
    'injuries': injuries,
    'employees': employees,
    'safety_score': safety_score,
    'is_manufacturing': is_manufacturing,
    'years_observed': years_observed,
    'exposure': exposure,
    'log_exposure': np.log(exposure)
})

# Fit NB model with offset
X = sm.add_constant(df[['safety_score', 'is_manufacturing']])

nb_rate_model = sm.GLM(df['injuries'], X,
                       family=sm.families.NegativeBinomial(),
                       offset=df['log_exposure'])
nb_results = nb_rate_model.fit()

print("="*60)
print("NEGATIVE BINOMIAL RATE MODEL")
print("(Modeling injuries per employee-year)")
print("="*60)
print(nb_results.summary())

# Interpret as Rate Ratios
print("\nIncidence Rate Ratios:")
print("-"*40)
irr = np.exp(nb_results.params)
for name, val in irr.items():
    print(f"{name}: {val:.4f}")

print("\nInterpretation:")
print(f"• Each 1-point increase in safety score reduces injury rate by {(1-irr['safety_score'])*100:.1f}%")
print(f"• Manufacturing companies have {(irr['is_manufacturing']-1)*100:.0f}% higher injury rate")
```

Overdispersion is especially common in rate models because exposure varies. Large-exposure observations contribute more variance than small-exposure ones. The NB model handles this naturally through the μ²/θ term, which scales with the square of the expected count.
The negative binomial family has several extensions and related models for different data-generating processes.
When there are more zeros than the NB distribution predicts, consider ZINB:
$$P(Y = 0) = \pi + (1-\pi) \cdot \text{NB}(0; \mu, \theta)$$ $$P(Y = k) = (1-\pi) \cdot \text{NB}(k; \mu, \theta), \quad k > 0$$
where $\pi$ is the probability of a 'structural zero' from a separate process.
Example: Modeling cigarettes smoked per day. Some people are never-smokers (structural zeros), while smokers have variable consumption following NB.
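A small simulation, with made-up values of $\pi$, $\mu$, and $\theta$, shows how structural zeros inflate the zero probability exactly as the ZINB formula predicts:

```python
# Generate ZINB-style data: with probability pi the count is a structural zero,
# otherwise it is drawn from NB(mu, theta). Values are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, pi, mu, theta = 100_000, 0.3, 2.0, 1.5

nb_dist = stats.nbinom(n=theta, p=theta / (theta + mu))
nb_counts = nb_dist.rvs(size=n, random_state=rng)
structural_zero = rng.random(n) < pi
y = np.where(structural_zero, 0, nb_counts)

print((y == 0).mean())                      # observed zero fraction
print(pi + (1 - pi) * nb_dist.pmf(0))       # ZINB formula: pi + (1 - pi) * NB(0; mu, theta)
```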
A hurdle model treats zeros and positive counts as coming from separate processes: a binary model (e.g., logistic) determines whether the count is zero or positive, and a zero-truncated count model (e.g., truncated NB) determines how large the positive counts are.
This is appropriate when the process for 'any' differs from 'how many'.
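A minimal sketch of the hurdle pmf, with illustrative parameter values, shows how the binary "any" part and the zero-truncated NB "how many" part combine:

```python
# Hurdle-NB pmf: P(Y=0) = 1 - p_pos; for k >= 1, positives follow a zero-truncated NB.
import numpy as np
from scipy import stats

def hurdle_nb_pmf(k, p_pos, mu, theta):
    """P(Y = k) under a hurdle model with crossing probability p_pos."""
    nb = stats.nbinom(n=theta, p=theta / (theta + mu))
    k = np.asarray(k)
    pos_part = p_pos * nb.pmf(k) / (1.0 - nb.pmf(0))  # renormalize the NB over k >= 1
    return np.where(k == 0, 1.0 - p_pos, pos_part)

print(hurdle_nb_pmf(np.arange(50), p_pos=0.6, mu=3.0, theta=2.0).sum())  # ≈ 1 (tiny tail truncated)
```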
Two variance specifications are in common use: NB1, with $\text{Var}(Y) = \mu(1 + \alpha)$ (variance linear in the mean), and NB2, with $\text{Var}(Y) = \mu(1 + \alpha\mu)$ (variance quadratic in the mean).
NB2 is more common; NB1 may fit better when overdispersion doesn't increase with the mean.
Some implementations allow observation-specific dispersion: $$\text{Var}(Y_i) = \mu_i + \mu_i^2 / \theta_i$$
where $\theta_i$ can depend on covariates. This relaxes the assumption of constant overdispersion across observations.
For clustered data (e.g., repeated measures, hierarchical structure), a generalized linear mixed model with NB distribution:
$$\log(\mu_{ij}) = \mathbf{x}_{ij}^\top \boldsymbol{\beta} + u_i$$
where $u_i$ is a random intercept for cluster $i$. This handles both overdispersion and correlation within clusters.
Start simple: Poisson → NB → ZINB/Hurdle → GLMM. Add complexity only when diagnostics demand it. Compare models using AIC/BIC, residual plots, and Vuong tests (for non-nested models like NB vs. ZIP).
Let's work through a complete analysis comparing Poisson and NB models.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Comprehensive NB Analysis: Publications by Scientists

np.random.seed(456)
n = 400

# Predictors
years_since_phd = np.random.uniform(0, 30, n)
grants = np.random.poisson(2, n)
is_tenured = (years_since_phd > 7) & (np.random.uniform(0, 1, n) > 0.3)
log_citations = np.random.normal(5, 2, n)

# Generate overdispersed publication counts
log_mu = 0.5 + 0.05*years_since_phd + 0.3*grants + 0.1*is_tenured + 0.05*log_citations
mu = np.exp(log_mu)

# Overdispersion with theta = 1.5
theta_true = 1.5
random_rates = np.random.gamma(shape=theta_true, scale=mu/theta_true)
publications = np.random.poisson(random_rates)

df = pd.DataFrame({
    'publications': publications,
    'years_since_phd': years_since_phd,
    'grants': grants,
    'is_tenured': is_tenured.astype(int),
    'log_citations': log_citations
})

print("="*70)
print("COMPREHENSIVE NEGATIVE BINOMIAL REGRESSION ANALYSIS")
print("="*70)

# Step 1: Explore the data
print("\n📊 STEP 1: EXPLORATORY DATA ANALYSIS")
print("-"*50)
print(f"Sample size: {n}")
print(f"Mean publications: {df['publications'].mean():.2f}")
print(f"Variance: {df['publications'].var():.2f}")
print(f"Variance/Mean ratio: {df['publications'].var()/df['publications'].mean():.2f}")
print(f"Proportion of zeros: {(df['publications']==0).mean()*100:.1f}%")

# Step 2: Fit Poisson model
print("\n📈 STEP 2: POISSON MODEL (Baseline)")
print("-"*50)
X = sm.add_constant(df[['years_since_phd', 'grants', 'is_tenured', 'log_citations']])
poisson_model = sm.GLM(df['publications'], X, family=sm.families.Poisson())
poisson_results = poisson_model.fit()

print(f"Log-likelihood: {poisson_results.llf:.1f}")
print(f"AIC: {poisson_results.aic:.1f}")
print(f"Deviance: {poisson_results.deviance:.1f}")
print(f"Deviance/df: {poisson_results.deviance/poisson_results.df_resid:.2f}")
print(f"Pearson χ²/df: {poisson_results.pearson_chi2/poisson_results.df_resid:.2f}")

if poisson_results.pearson_chi2/poisson_results.df_resid > 1.5:
    print("⚠️ Clear evidence of overdispersion!")

# Step 3: Fit Negative Binomial model
# (GLM's NegativeBinomial family uses a fixed alpha, default 1.0)
print("\n📈 STEP 3: NEGATIVE BINOMIAL MODEL")
print("-"*50)
nb_model = sm.GLM(df['publications'], X, family=sm.families.NegativeBinomial())
nb_results = nb_model.fit()

print(f"Log-likelihood: {nb_results.llf:.1f}")
print(f"AIC: {nb_results.aic:.1f}")
print(f"Deviance: {nb_results.deviance:.1f}")
print(f"Deviance/df: {nb_results.deviance/nb_results.df_resid:.2f}")

# Step 4: Compare models
print("\n⚖️ STEP 4: MODEL COMPARISON")
print("-"*50)
lr_stat = 2 * (nb_results.llf - poisson_results.llf)
# Under the 50:50 mixture null (point mass at 0 and chi2(1)),
# the p-value is half the chi2(1) tail probability
p_value = 0.5 * (1 - stats.chi2.cdf(lr_stat, df=1))

print(f"Likelihood Ratio Statistic: {lr_stat:.1f}")
print(f"p-value (mixture null): {p_value:.4f}")
print(f"AIC improvement: {poisson_results.aic - nb_results.aic:.1f} points")

if nb_results.aic < poisson_results.aic:
    print("✅ NB model is preferred by AIC")
else:
    print("Poisson model is sufficient")

# Step 5: Final results
print("\n📋 STEP 5: FINAL RESULTS (Negative Binomial)")
print("-"*50)
print(nb_results.summary())

# IRRs with CIs
print("\n🎯 INCIDENCE RATE RATIOS WITH 95% CIs:")
print("-"*50)
irr = np.exp(nb_results.params)
irr_ci = np.exp(nb_results.conf_int())
for name in irr.index:
    low, high = irr_ci.loc[name]
    print(f"{name:20s}: IRR = {irr[name]:.3f} (95% CI: {low:.3f}, {high:.3f})")

print("\n💡 KEY FINDINGS:")
print("-"*50)
print(f"• Each additional grant increases publications by {(irr['grants']-1)*100:.0f}%")
print(f"• Tenured researchers have {(irr['is_tenured']-1)*100:.0f}% more publications")
# Per-decade effect: exp(10*beta) = IRR**10
print(f"• Each decade post-PhD adds {(irr['years_since_phd']**10 - 1)*100:.0f}% more publications")
```

We've comprehensively covered negative binomial regression, the essential tool for overdispersed count data. Let's consolidate the key concepts:
Module Complete:
You've now mastered the Generalized Linear Model framework—from its theoretical foundations (exponential family, link functions) through its most important applications (logistic, Poisson, and negative binomial regression). This knowledge forms the core of modern statistical modeling and provides the conceptual foundation for many machine learning techniques.
Key skills developed: recognizing exponential-family distributions and choosing link functions; fitting and interpreting logistic, Poisson, and negative binomial regressions; diagnosing overdispersion; and comparing competing models using deviance, AIC/BIC, and likelihood ratio tests.
Congratulations! You've completed the Generalized Linear Models module. You now have a deep understanding of the unified GLM framework that subsumes linear regression, logistic regression, Poisson regression, and negative binomial regression. This knowledge is fundamental for any advanced work in statistical modeling and machine learning.