How many customers will visit a website in the next hour? How many emails will arrive in your inbox today? How many mutations will occur in a DNA sequence? How many server failures will happen this month?
These questions all involve counting random events that occur over time or space. The Poisson distribution, named after French mathematician Siméon Denis Poisson (1781–1840), provides the mathematical framework for modeling such counts.
The Poisson distribution captures a specific type of randomness: events that occur independently, at a constant average rate, with no two events occurring at exactly the same instant. These conditions describe an astonishing variety of real-world phenomena, from radioactive decay to website traffic to typos in manuscripts.
In machine learning, the Poisson appears in diverse contexts: modeling count data, text analysis (word frequencies), rate prediction, anomaly detection, and as the foundation for Poisson regression. Understanding this distribution deepens your ability to model discrete, non-negative quantities that arise naturally in many domains.
By the end of this page, you will master the Poisson distribution: its derivation from the Binomial limit, mathematical properties, relationship to the Exponential distribution (waiting times), parameter estimation, and its applications in machine learning and real-world modeling.
A discrete random variable X follows a Poisson distribution with rate parameter λ > 0, written X ~ Poisson(λ), if its probability mass function (PMF) is:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, 3, \ldots$$
where λ is the average number of events in the interval (the rate), k is the observed number of events, and e ≈ 2.71828 is Euler's number.
The support of the Poisson is all non-negative integers: {0, 1, 2, 3, ...}. Unlike the Binomial, there is no upper bound on k.
The Poisson PMF has an elegant structure: λᵏ/k! comes from the Taylor series of e^λ, while e^{-λ} normalizes the distribution. Together, Σₖ (λᵏ/k!) · e^{-λ} = e^{-λ} · e^λ = 1, confirming it's a valid probability distribution.
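The normalization argument can be checked numerically; a minimal sketch (λ = 4.2 is an arbitrary choice, and 100 terms suffice because the tail decays factorially):

```python
import math

lam = 4.2  # arbitrary rate for illustration

# Sum λ^k e^{-λ} / k! over enough terms that the remaining tail is negligible
total = sum(lam**k * math.exp(-lam) / math.factorial(k) for k in range(100))
print(total)  # ≈ 1.0
```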
The Poisson has remarkably elegant moment properties.
Expected Value:
$$E[X] = \sum_{k=0}^{\infty} k \cdot \frac{\lambda^k e^{-\lambda}}{k!} = \lambda$$
Variance:
$$Var(X) = E[X^2] - (E[X])^2 = \lambda$$
Crucial insight: For the Poisson, the mean equals the variance (both equal λ).
This property is often used as a diagnostic: if count data has variance significantly different from its mean, the Poisson may not be appropriate (overdispersion or underdispersion).
Standard Deviation: $$\sigma = \sqrt{\lambda}$$
Coefficient of Variation: $$CV = \frac{\sigma}{\mu} = \frac{\sqrt{\lambda}}{\lambda} = \frac{1}{\sqrt{\lambda}}$$
As λ increases, the relative variability decreases.
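A quick sampling check of the mean-equals-variance property (λ = 6 and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 6.0

samples = rng.poisson(lam, size=200_000)
print(samples.mean())  # ≈ 6.0
print(samples.var())   # ≈ 6.0 — mean and variance agree, as the theory predicts
```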
| Property | Formula | Interpretation |
|---|---|---|
| Support | {0, 1, 2, ...} | All non-negative integers (no upper bound) |
| Parameter | λ > 0 | Rate = expected count in the interval |
| Mean | E[X] = λ | Average number of events |
| Variance | Var(X) = λ | Variance equals mean (key diagnostic) |
| Skewness | 1/√λ | Always right-skewed; approaches 0 as λ → ∞ |
| Mode | ⌊λ⌋; both λ and λ−1 if λ is a positive integer | Most likely count(s) |
| MGF | exp(λ(e^t - 1)) | Moment generating function |
| Entropy | λ(1 − ln λ) + e^{−λ} Σₖ λᵏ ln(k!)/k! | No elementary closed form |
The MGF of the Poisson is:
$$M_X(t) = E[e^{tX}] = \sum_{k=0}^{\infty} e^{tk} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda(e^t - 1)}$$
This elegant form enables derivation of all moments and plays a key role in proving the sum of independent Poissons is Poisson.
The PGF (useful for count distributions) is:
$$G_X(s) = E[s^X] = \sum_{k=0}^{\infty} s^k \frac{\lambda^k e^{-\lambda}}{k!} = e^{\lambda(s-1)}$$
The Poisson distribution arises naturally as the limit of the Binomial distribution when events are rare. This deep connection is known as the Poisson Limit Theorem (or Law of Rare Events).
Consider a Binomial(n, p) distribution where:
As n → ∞ and p → 0 with np → λ, the Binomial approaches the Poisson:
$$\lim_{n \to \infty} \binom{n}{k} p^k (1-p)^{n-k} = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{where } p = \lambda/n$$
Let p = λ/n. The Binomial PMF is:
$$P(X=k) = \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}$$
Step 1: Expand the binomial coefficient
$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k!}$$
Step 2: Rewrite
$$P(X=k) = \frac{n(n-1)\cdots(n-k+1)}{k!} \cdot \frac{\lambda^k}{n^k} \cdot \left(1 - \frac{\lambda}{n}\right)^{n-k}$$
Step 3: Take the limit as n → ∞
$$\frac{n(n-1)\cdots(n-k+1)}{n^k} = \frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-k+1}{n} \to 1 \cdot 1 \cdots 1 = 1$$
$$\left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda} \quad \text{(definition of } e \text{)}$$
$$\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$$
Result:
$$\lim_{n \to \infty} P(X=k) = \frac{\lambda^k}{k!} \cdot e^{-\lambda} \quad \square$$
The Poisson approximation to the Binomial is accurate when n ≥ 20 and p ≤ 0.05, or when n ≥ 100 and np ≤ 10. The approximation improves as n increases and p decreases with np held constant.
```python
import numpy as np
from scipy.stats import binom, poisson

def demonstrate_poisson_limit():
    """
    Demonstrate how Binomial converges to Poisson as n→∞, p→0, np=λ.
    """
    lambda_fixed = 5  # Fixed expected value

    # Different values of n (with p = λ/n)
    n_values = [10, 20, 50, 100, 500, 1000]

    # Range of k values to compare
    k_max = 15
    k_values = np.arange(0, k_max + 1)

    # True Poisson probabilities
    poisson_probs = poisson.pmf(k_values, lambda_fixed)

    print(f"Poisson Limit Theorem Demonstration (λ = {lambda_fixed})")
    print("=" * 60)

    for n in n_values:
        p = lambda_fixed / n

        # Binomial probabilities
        binom_probs = binom.pmf(k_values, n, p)

        # Maximum absolute difference
        max_diff = np.max(np.abs(binom_probs - poisson_probs))

        # Total variation distance
        tv_distance = 0.5 * np.sum(np.abs(binom_probs - poisson_probs))

        print(f"n = {n:4d}, p = {p:.4f}: Max diff = {max_diff:.6f}, "
              f"TV distance = {tv_distance:.6f}")

    # Theoretical bound: TV ≤ min(np², λ²/n) ≈ λ²/n for small p
    print("Theoretical bound: Total Variation ≤ λ²/n")
    for n in n_values:
        bound = lambda_fixed**2 / n
        print(f"n = {n:4d}: Bound = {bound:.6f}")

demonstrate_poisson_limit()

# Practical example: Rare event modeling
print("\n" + "=" * 60)
print("Example: Defective Items in Manufacturing")
print("=" * 60)

defect_rate = 0.001  # 0.1% defect rate
batch_size = 10000   # 10,000 items per batch
expected_defects = defect_rate * batch_size  # λ = 10

# Exact Binomial
binom_dist = binom(batch_size, defect_rate)

# Poisson approximation
poisson_dist = poisson(expected_defects)

# Compare P(X ≤ 15)
k = 15
binom_cdf = binom_dist.cdf(k)
poisson_cdf = poisson_dist.cdf(k)

print(f"Defect rate: {defect_rate:.4f}, Batch size: {batch_size}")
print(f"Expected defects: λ = {expected_defects}")
print(f"P(defects ≤ {k}):")
print(f"  Binomial:   {binom_cdf:.6f}")
print(f"  Poisson:    {poisson_cdf:.6f}")
print(f"  Difference: {abs(binom_cdf - poisson_cdf):.6f}")
```

The Poisson distribution naturally arises from the Poisson process—a stochastic process that models events occurring randomly over time (or space). Understanding this connection provides deep intuition for when the Poisson is appropriate.
A Poisson process with rate λ > 0 is characterized by three assumptions:
1. Stationarity: The probability of an event in any small interval depends only on the length of the interval, not when it occurs.
$$P(\text{1 event in } (t, t+h)) = \lambda h + o(h)$$
2. Independent Increments: The numbers of events in non-overlapping intervals are independent.
3. No Simultaneous Events: At most one event can occur at any instant.
$$P(\text{≥ 2 events in } (t, t+h)) = o(h)$$
where o(h) means terms that go to zero faster than h as h → 0.
From these assumptions, it follows that the number of events N(t) in an interval of length t follows:
$$N(t) \sim \text{Poisson}(\lambda t)$$
The waiting time T until the first event follows:
$$T \sim \text{Exponential}(\lambda)$$
with PDF f(t) = λe^{-λt} for t ≥ 0.
The Poisson and Exponential distributions are two sides of the same coin. Poisson counts events in a fixed interval; Exponential measures time between events. If inter-arrival times are Exponential(λ), then counts are Poisson(λt). This duality is fundamental to queuing theory and event modeling.
Let N(t) be the number of events in [0, t]. We can derive the Poisson distribution by considering what happens as we discretize time.
Divide [0, t] into n small subintervals of length h = t/n. For very small h:
$$P(\text{event in subinterval}) \approx \lambda h = \frac{\lambda t}{n}$$
With n subintervals and independent events, the total count is approximately:
$$N(t) \approx \text{Binomial}\left(n, \frac{\lambda t}{n}\right)$$
As n → ∞ (infinitesimal subintervals), by the Poisson Limit Theorem:
$$N(t) \sim \text{Poisson}(\lambda t)$$
The Exponential distribution (waiting time) has the memoryless property:
$$P(T > s + t | T > s) = P(T > t)$$
The probability of waiting at least t more units, given you've already waited s units, is the same as the probability of waiting t units from the start. This reflects the "fresh start" nature of Poisson processes—past events don't affect future waiting times.
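A simulation sketch of the memoryless property, using arbitrary values λ = 2, s = 0.5, t = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
waits = rng.exponential(1 / lam, size=1_000_000)

s, t = 0.5, 1.0
cond = np.mean(waits[waits > s] > s + t)  # P(T > s+t | T > s)
uncond = np.mean(waits > t)               # P(T > t)
print(cond, uncond)  # both ≈ e^{-λt} ≈ 0.135
```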
```python
import numpy as np
from scipy.stats import poisson

class PoissonProcess:
    """
    Simulate and analyze a Poisson process.
    """

    def __init__(self, rate: float):
        """
        Args:
            rate: λ, the average number of events per unit time
        """
        self.rate = rate

    def simulate_events(self, duration: float) -> np.ndarray:
        """
        Simulate event arrival times up to the given duration.
        Uses the inter-arrival time approach (Exponential waiting times).

        Returns:
            Array of event arrival times
        """
        arrival_times = []
        current_time = 0.0
        while current_time < duration:
            # Wait time until next event ~ Exponential(λ)
            wait_time = np.random.exponential(1 / self.rate)
            current_time += wait_time
            if current_time < duration:
                arrival_times.append(current_time)
        return np.array(arrival_times)

    def count_events(self, events: np.ndarray, interval_end: float) -> int:
        """Count events in [0, interval_end]."""
        return np.sum(events <= interval_end)

    def theoretical_distribution(self, t: float):
        """Return the Poisson distribution for the count in [0, t]."""
        return poisson(self.rate * t)

# Demonstration
np.random.seed(42)

# Website traffic: average 10 visitors per minute
rate = 10  # λ = 10 per minute
process = PoissonProcess(rate)

# Simulate 10,000 one-minute intervals and check the count distribution
print("Poisson Process Simulation: Website Traffic")
print(f"Rate λ = {rate} visitors/minute")
print("=" * 50)

n_simulations = 10000
counts = []
for _ in range(n_simulations):
    events = process.simulate_events(duration=1.0)  # 1 minute
    counts.append(len(events))
counts = np.array(counts)

# Compare empirical distribution to theoretical Poisson(λ)
theoretical = poisson(rate)

print(f"Empirical vs Theoretical (Poisson({rate})):")
print(f"Mean:     Empirical = {counts.mean():.3f}, Theoretical = {rate}")
print(f"Variance: Empirical = {counts.var():.3f}, Theoretical = {rate}")
print(f"Std Dev:  Empirical = {counts.std():.3f}, Theoretical = {np.sqrt(rate):.3f}")

# Check some specific probabilities
print("\nProbability comparisons:")
for k in [5, 10, 15]:
    emp_prob = np.mean(counts == k)
    theo_prob = theoretical.pmf(k)
    print(f"P(X = {k:2d}): Empirical = {emp_prob:.4f}, Theoretical = {theo_prob:.4f}")

# Inter-arrival times should be Exponential(λ)
print("\n" + "=" * 50)
print("Inter-arrival Time Analysis")
print("=" * 50)

# Longer simulation to get many inter-arrival times
events = process.simulate_events(duration=100)  # 100 minutes
inter_arrivals = np.diff(events)  # Time between consecutive events

print(f"Number of events: {len(events)}")
print(f"Number of inter-arrival times: {len(inter_arrivals)}")
print(f"Inter-arrival times (should be Exponential(λ={rate})):")
print(f"Mean:    Empirical = {inter_arrivals.mean():.4f}, Theoretical = {1/rate:.4f}")
print(f"Std Dev: Empirical = {inter_arrivals.std():.4f}, Theoretical = {1/rate:.4f}")
```

Estimating the Poisson rate parameter λ from observed count data is straightforward via Maximum Likelihood Estimation.
Given n independent observations x₁, x₂, ..., xₙ from a Poisson(λ) distribution, the likelihood is:
$$L(\lambda) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{\sum x_i} e^{-n\lambda}}{\prod_{i=1}^n x_i!}$$
The log-likelihood is:
$$\ell(\lambda) = \left(\sum_{i=1}^n x_i\right) \ln \lambda - n\lambda - \sum_{i=1}^n \ln(x_i!)$$
Taking the derivative and setting to zero:
$$\frac{d\ell}{d\lambda} = \frac{\sum_{i=1}^n x_i}{\lambda} - n = 0$$
$$\hat{\lambda}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$
The MLE for λ is simply the sample mean—the average count.
Unbiasedness: $$E[\hat{\lambda}] = E[\bar{X}] = \lambda$$
Variance: $$Var(\hat{\lambda}) = Var(\bar{X}) = \frac{\lambda}{n}$$
Standard Error: $$SE(\hat{\lambda}) = \sqrt{\frac{\lambda}{n}} \approx \sqrt{\frac{\hat{\lambda}}{n}}$$
Consistency: As n → ∞, λ̂ → λ in probability.
Asymptotic Normality: $$\sqrt{n}(\hat{\lambda} - \lambda) \stackrel{d}{\rightarrow} N(0, \lambda)$$
For large n: $$\hat{\lambda} \approx N\left(\lambda, \frac{\lambda}{n}\right)$$
1. Normal Approximation (large n): $$\hat{\lambda} \pm z_{\alpha/2} \sqrt{\frac{\hat{\lambda}}{n}}$$
2. Exact (via the Chi-squared relationship): With S = Σxᵢ ~ Poisson(nλ), an exact 95% interval for nλ is [½χ²₀.₀₂₅(2S), ½χ²₀.₉₇₅(2S + 2)]; dividing by n gives the interval for λ.
```python
import numpy as np
from scipy import stats

def poisson_mle(counts: np.ndarray) -> dict:
    """
    Maximum Likelihood Estimation for the Poisson parameter λ.
    Returns point estimate and confidence intervals.
    """
    n = len(counts)
    total = np.sum(counts)

    # MLE estimate (sample mean)
    lambda_hat = np.mean(counts)

    # Standard error
    se = np.sqrt(lambda_hat / n)

    # 95% CI using the normal approximation
    z = 1.96
    ci_normal = (lambda_hat - z * se, lambda_hat + z * se)

    # 95% CI using the exact chi-squared method:
    # for S = Σx_i ~ Poisson(nλ), a CI for nλ yields a CI for λ
    chi2_lower = stats.chi2.ppf(0.025, 2 * total) / (2 * n)
    chi2_upper = stats.chi2.ppf(0.975, 2 * (total + 1)) / (2 * n)
    ci_exact = (chi2_lower, chi2_upper)

    return {
        'n': n,
        'total_count': total,
        'lambda_hat': lambda_hat,
        'standard_error': se,
        'ci_95_normal': ci_normal,
        'ci_95_exact': ci_exact,
    }

def test_overdispersion(counts: np.ndarray) -> dict:
    """
    Test if data is overdispersed relative to Poisson (variance > mean).
    """
    n = len(counts)
    mean = np.mean(counts)
    var = np.var(counts, ddof=1)

    # Dispersion statistic
    dispersion = var / mean

    # Chi-squared test for overdispersion
    # Under Poisson: Σ(x_i - x̄)² / x̄ ~ χ²(n-1)
    chi2_stat = (n - 1) * dispersion
    p_value = 1 - stats.chi2.cdf(chi2_stat, df=n - 1)

    return {
        'mean': mean,
        'variance': var,
        'dispersion': dispersion,
        'chi2_statistic': chi2_stat,
        'p_value': p_value,
        'overdispersed': p_value < 0.05,
    }

# Example: Customer arrivals per hour
np.random.seed(42)
true_lambda = 7.5
n_hours = 50
arrivals = np.random.poisson(true_lambda, n_hours)

print("Poisson Parameter Estimation: Customer Arrivals")
print("=" * 50)
print(f"True λ = {true_lambda}")
print(f"Observed data: {arrivals[:10]}... (first 10 of {n_hours})")

results = poisson_mle(arrivals)
print("\nMLE Results:")
print(f"  λ̂ = {results['lambda_hat']:.4f}")
print(f"  Standard Error = {results['standard_error']:.4f}")
print(f"  95% CI (Normal): [{results['ci_95_normal'][0]:.4f}, {results['ci_95_normal'][1]:.4f}]")
print(f"  95% CI (Exact):  [{results['ci_95_exact'][0]:.4f}, {results['ci_95_exact'][1]:.4f}]")

# Test for overdispersion
dispersion_test = test_overdispersion(arrivals)
print("\nOverdispersion Test:")
print(f"  Mean = {dispersion_test['mean']:.4f}")
print(f"  Variance = {dispersion_test['variance']:.4f}")
print(f"  Dispersion ratio = {dispersion_test['dispersion']:.4f}")
print(f"  p-value = {dispersion_test['p_value']:.4f}")
print(f"  Conclusion: {'Overdispersed' if dispersion_test['overdispersed'] else 'Consistent with Poisson'}")
```

Real count data often exhibits overdispersion (variance > mean), violating the Poisson assumption. Common causes include unobserved heterogeneity and clustering. Alternatives include the Negative Binomial distribution or quasi-Poisson models that explicitly model overdispersion.
The Poisson distribution has several elegant properties that make it mathematically tractable.
If X₁ ~ Poisson(λ₁) and X₂ ~ Poisson(λ₂) are independent, then:
$$X_1 + X_2 \sim \text{Poisson}(\lambda_1 + \lambda_2)$$
More generally, for independent Xᵢ ~ Poisson(λᵢ):
$$\sum_{i=1}^n X_i \sim \text{Poisson}\left(\sum_{i=1}^n \lambda_i\right)$$
Proof via MGF: $$M_{X+Y}(t) = M_X(t) M_Y(t) = e^{\lambda_1(e^t-1)} e^{\lambda_2(e^t-1)} = e^{(\lambda_1+\lambda_2)(e^t-1)}$$
This is the MGF of Poisson(λ₁ + λ₂). ∎
Application: Total website traffic = traffic from channel A + traffic from channel B follows Poisson if each channel is Poisson.
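The additivity property can be checked by simulation; a sketch with arbitrary rates λ₁ = 3 and λ₂ = 4:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
lam1, lam2 = 3.0, 4.0
n = 200_000

sums = rng.poisson(lam1, n) + rng.poisson(lam2, n)

# Empirical PMF of the sum against Poisson(λ1 + λ2) = Poisson(7)
for k in (5, 7, 9):
    print(k, np.mean(sums == k), poisson.pmf(k, lam1 + lam2))
```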
If X ~ Poisson(λ) represents total events, and each event is independently classified as type A with probability p or type B with probability (1-p), then the type counts satisfy X_A ~ Poisson(λp) and X_B ~ Poisson(λ(1-p)), and remarkably, X_A and X_B are independent.
Application: Customer arrivals (Poisson) with some fraction p making purchases → purchases and non-purchases are independent Poisson processes.
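A simulation sketch of thinning, with arbitrary λ = 10 and p = 0.3:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, p = 10.0, 0.3
n = 200_000

totals = rng.poisson(lam, n)
type_a = rng.binomial(totals, p)  # each event independently type A with prob p
type_b = totals - type_a

print(type_a.mean(), type_a.var())        # both ≈ λp = 3
print(type_b.mean(), type_b.var())        # both ≈ λ(1-p) = 7
print(np.corrcoef(type_a, type_b)[0, 1])  # ≈ 0: the two streams are independent
```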
If N ~ Poisson(λ) and Y₁, Y₂, ... are i.i.d. with PGF G_Y(s), then:
$$S = \sum_{i=1}^N Y_i$$
has PGF:
$$G_S(s) = e^{\lambda(G_Y(s) - 1)} = \exp(\lambda(G_Y(s) - 1))$$
This is the compound Poisson distribution, important in insurance and finance.
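A simulation sketch of a compound Poisson sum; the Normal(2, 0.5) severity distribution is a hypothetical choice, and the moment identities E[S] = λE[Y] and Var(S) = λE[Y²] are the standard compound Poisson formulas:

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 5.0              # event (claim) rate
mu_y, sd_y = 2.0, 0.5  # hypothetical per-event severity distribution

n_sim = 100_000
counts = rng.poisson(lam, n_sim)
# S = Y_1 + ... + Y_N with N ~ Poisson(λ) and i.i.d. severities Y_i
totals = np.array([rng.normal(mu_y, sd_y, c).sum() for c in counts])

print(totals.mean())  # ≈ λ·E[Y] = 10
print(totals.var())   # ≈ λ·E[Y²] = 5·(4 + 0.25) = 21.25
```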
| Property | Statement | Application |
|---|---|---|
| Additivity | Sum of indep. Poissons is Poisson | Aggregating event sources |
| Thinning/Splitting | Random classification gives independent Poissons | Customer segmentation |
| Rate scaling | X(t) ~ Poisson(λt) | Adjusting observation window |
| Mean = Variance | E[X] = Var(X) = λ | Overdispersion diagnostic |
| Asymptotic normality | Poisson(λ) → N(λ, λ) for large λ | Approximate inference |
For large λ (typically λ ≥ 10), the Poisson is well-approximated by a Gaussian:
$$\text{Poisson}(\lambda) \approx N(\lambda, \lambda)$$
Or after standardization:
$$\frac{X - \lambda}{\sqrt{\lambda}} \stackrel{d}{\rightarrow} N(0, 1)$$
This enables using z-tests and z-intervals for Poisson data with large counts.
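A quick check of the approximation (λ = 50 and k = 60 are arbitrary; the 0.5 is a continuity correction):

```python
import numpy as np
from scipy.stats import norm, poisson

lam = 50.0
k = 60

exact = poisson.cdf(k, lam)
# Normal approximation with a continuity correction
approx = norm.cdf((k + 0.5 - lam) / np.sqrt(lam))
print(exact, approx)  # close for λ this large
```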
The Poisson distribution appears throughout machine learning whenever we model counts or rates.
Poisson regression models count data as a function of features:
$$Y_i | \mathbf{x}_i \sim \text{Poisson}(\lambda_i) \quad \text{where} \quad \log \lambda_i = \mathbf{w}^T \mathbf{x}_i + b$$
The log link function ensures λᵢ > 0. The negative log-likelihood loss is:
$$\mathcal{L} = \sum_{i=1}^n \left[ \lambda_i - y_i \log \lambda_i + \log(y_i!) \right]$$
Use cases: predicting insurance claim counts, purchases or visits per customer, defect and failure counts, and disease incidence rates.
```python
import numpy as np
from scipy.optimize import minimize

def poisson_nll(params, X, y):
    """
    Negative log-likelihood for Poisson regression.
    Model: y ~ Poisson(λ), where log(λ) = X @ w + b
    NLL = Σ [λ_i - y_i * log(λ_i) + log(y_i!)]
    """
    w, b = params[:-1], params[-1]
    log_lambda = X @ w + b
    lambda_ = np.exp(log_lambda)
    # Negative log-likelihood (ignoring the constant log(y!),
    # which does not affect optimization)
    nll = np.sum(lambda_ - y * log_lambda)
    return nll

def fit_poisson_regression(X, y):
    """
    Fit Poisson regression via MLE.
    """
    n_features = X.shape[1]
    initial_params = np.zeros(n_features + 1)
    result = minimize(poisson_nll, initial_params, args=(X, y),
                      method='L-BFGS-B')
    w = result.x[:-1]
    b = result.x[-1]
    return w, b

def predict_poisson(X, w, b):
    """Predict expected counts."""
    log_lambda = X @ w + b
    return np.exp(log_lambda)

# Example: Predicting customer purchases based on features
np.random.seed(42)
n = 500

# Features: age (normalized), income (normalized), website visits
X = np.random.randn(n, 3)
true_w = np.array([0.5, 0.8, 0.3])
true_b = 1.0

# True λ = exp(w'x + b)
true_lambda = np.exp(X @ true_w + true_b)

# Generate counts
y = np.random.poisson(true_lambda)

# Fit model
w_hat, b_hat = fit_poisson_regression(X, y)

print("Poisson Regression Results")
print("=" * 50)
print(f"True weights:      {true_w}")
print(f"Estimated weights: {w_hat.round(4)}")
print(f"True bias:      {true_b}")
print(f"Estimated bias: {b_hat:.4f}")

# Predictions
y_pred = predict_poisson(X, w_hat, b_hat)
print(f"Mean actual count: {y.mean():.2f}")
print(f"Mean predicted λ:  {y_pred.mean():.2f}")

# Deviance (measure of fit)
deviance = 2 * np.sum(y * np.log(np.maximum(y, 1) / y_pred) - (y - y_pred))
print(f"Deviance: {deviance:.2f}")
```

In text analysis, word counts often follow Poisson-like distributions:
Poisson assumption in LDA: Latent Dirichlet Allocation can be derived assuming Poisson word counts.
Sparse Additive Generative Models (SAGE): Model word counts as Poisson with log-linear rates.
If baseline event counts are Poisson(λ), we can detect anomalies:
Example: Server normally has 100 errors/hour (λ = 100). If we observe 150 errors:
$$P(X \geq 150) = 1 - F_{\text{Poisson}(100)}(149) \approx 2 \times 10^{-6}$$
This is highly anomalous!
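This tail probability can be computed directly with the survival function `scipy.stats.poisson.sf`:

```python
from scipy.stats import poisson

baseline = 100   # λ: normal errors per hour
observed = 150

# P(X >= 150) = 1 - F(149); sf is the survival function
p_tail = poisson.sf(observed - 1, baseline)
print(p_tail)
```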
When predicting non-negative integers, Poisson loss is sometimes used:
$$\mathcal{L}_{\text{Poisson}} = \hat{y} - y \log(\hat{y})$$
where ŷ is the predicted rate (must be positive, often from softplus or exp activation).
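A minimal NumPy sketch of this loss with a softplus link to keep the rate positive (the inputs below are illustrative, and the constant log(y!) term is dropped as in the regression loss above):

```python
import numpy as np

def softplus(z):
    """Numerically stable log(1 + e^z); output is always positive."""
    return np.logaddexp(0.0, z)

def poisson_loss(z, y):
    """Mean Poisson loss over a batch, dropping the constant log(y!) term."""
    y_hat = softplus(z)  # predicted rate λ̂ > 0
    return np.mean(y_hat - y * np.log(y_hat))

z = np.array([0.5, 1.2, -0.3])  # raw model outputs (illustrative)
y = np.array([1.0, 2.0, 0.0])
print(poisson_loss(z, y))
```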
The Poisson distribution is the fundamental model for counting random events. Let's consolidate the key concepts:

- PMF: P(X = k) = λᵏe^{−λ}/k!, with mean and variance both equal to λ.
- The Poisson arises as the limit of Binomial(n, p) as n → ∞, p → 0, with np → λ.
- In a Poisson process with rate λ, counts over an interval of length t are Poisson(λt) and inter-arrival times are Exponential(λ) (memoryless).
- The MLE of λ is the sample mean, with standard error √(λ/n).
- Sums of independent Poissons are Poisson; random thinning yields independent Poissons.
- In machine learning, the Poisson underlies Poisson regression, count-based text models, anomaly detection, and the Poisson loss.
What's Next:
We've covered the fundamental discrete (Bernoulli/Binomial, Poisson) and continuous (Gaussian) distributions. Next, we explore the Exponential family—a unifying framework that encompasses these and many other distributions. Understanding exponential families reveals deep connections between distributions and provides powerful tools for generalized linear models and Bayesian inference.
You now have a comprehensive understanding of the Poisson distribution—from its derivation as the Binomial limit to its role in the Poisson process and its applications in machine learning. The Poisson is the natural choice whenever you need to model counts of events occurring randomly in time or space.