We've mastered expectation (the first moment) and variance (derived from the second moment). But some distributions with identical means and variances can look completely different—one might be symmetric while another is heavily skewed; one might have thin tails, another thick tails.
To capture these differences, we need higher moments. The complete sequence of moments—first, second, third, and beyond—provides increasingly detailed information about a distribution's shape. Under certain conditions, the moments uniquely determine the entire distribution.
In machine learning, higher moments matter for checking distributional assumptions (such as Gaussian errors), deciding when skewed features need a log or Box-Cox transform, detecting outliers and heavy-tailed behavior, and assessing the risk of rare extreme events.
By the end of this page, you will understand raw and central moments, interpret skewness (asymmetry) and kurtosis (tail heaviness), use moment generating functions to derive distributions and moments, and apply these concepts to analyze data and validate modeling assumptions.
The most basic moments are the raw moments (or moments about the origin)—expectations of powers of $X$.
Definition (Raw Moment):
The $k$-th raw moment of $X$ is: $$\mu'_k = E[X^k]$$
For discrete $X$: $\mu'_k = \sum_x x^k \cdot p(x)$
For continuous $X$: $\mu'_k = \int_{-\infty}^{\infty} x^k \cdot f(x) \, dx$
The Moment Hierarchy:
| Moment | Symbol | Name | Meaning |
|---|---|---|---|
| 1st | $\mu'_1 = E[X]$ | Mean | Center/location |
| 2nd | $\mu'_2 = E[X^2]$ | — | Related to spread via $\text{Var} = \mu'_2 - (\mu'_1)^2$ |
| 3rd | $\mu'_3 = E[X^3]$ | — | Related to asymmetry |
| 4th | $\mu'_4 = E[X^4]$ | — | Related to tail behavior |
| $k$-th | $\mu'_k = E[X^k]$ | — | Higher-order shape information |
Key Observation: Raw moments depend on both the shape and the location of the distribution. For interpretable shape measures, we typically use central moments (about the mean) or standardized moments.
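To make this location dependence concrete, here is a small illustrative sketch (an addition, not part of the lesson's code; the shift of 100 is arbitrary) showing that the second raw moment moves with the data while the variance does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # centered data
y = x + 100.0                                      # same shape, shifted location

# The second raw moment E[X^2] depends on where the data sit...
print(f"E[X^2] ≈ {np.mean(x**2):.2f},   E[Y^2] ≈ {np.mean(y**2):.2f}")

# ...while the second central moment (the variance) is unchanged.
print(f"Var(X) ≈ {np.var(x):.2f},   Var(Y) ≈ {np.var(y):.2f}")
```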
```python
import math

import numpy as np
from scipy import stats, integrate


def compute_raw_moments():
    """Compute raw moments for various distributions."""
    print("Raw Moments: E[X^k]")
    print("=" * 60)

    # Standard Normal N(0, 1)
    print("\nStandard Normal N(0, 1):")
    normal = stats.norm(0, 1)
    print(f"  {'k':>3} {'E[X^k] (analytical)':>22} {'E[X^k] (numerical)':>22}")
    print("  " + "-" * 50)

    # For N(0,1): E[X^k] = 0 if k odd, = (k-1)!! if k even
    # (k-1)!! = (k-1)(k-3)(k-5)...1
    def double_factorial(n):
        if n <= 0:
            return 1
        return n * double_factorial(n - 2)

    for k in range(1, 9):
        # Numerical integration
        numerical, _ = integrate.quad(lambda x: x**k * normal.pdf(x), -10, 10)

        # Analytical for N(0,1)
        if k % 2 == 1:
            analytical = 0
        else:
            analytical = double_factorial(k - 1)

        print(f"  {k:>3} {analytical:>22.6f} {numerical:>22.6f}")

    # Exponential(λ=1)
    print("\nExponential(λ=1):")
    exp_dist = stats.expon(scale=1)
    print(f"  {'k':>3} {'E[X^k] = k!':>15} {'Numerical':>15}")
    print("  " + "-" * 35)

    for k in range(1, 7):
        analytical = math.factorial(k)
        numerical = exp_dist.moment(k)
        print(f"  {k:>3} {analytical:>15.4f} {numerical:>15.4f}")


def moment_existence():
    """Discuss when moments exist."""
    print("\n\nMoment Existence")
    print("=" * 60)

    print("\nNot all distributions have all moments!")
    print("\nFor E[X^k] to exist: ∫ |x|^k f(x) dx < ∞")

    print("\nExample: Cauchy distribution")
    print("  PDF: f(x) = 1 / (π(1 + x²))")
    print("  E[X] does not exist (integral diverges)")
    print("  No moments exist!")

    # Pareto distribution - moments depend on shape parameter
    print("\nExample: Pareto distribution")
    print("  PDF: f(x) = α x_m^α / x^(α+1) for x ≥ x_m")
    print("  E[X^k] exists iff α > k")
    print("\n  α = 2: E[X] exists, E[X²] doesn't → Var infinite")
    print("  α = 3: E[X], E[X²] exist → finite variance")
    print("  α = 5: Moments up to 4th exist")

    # Demonstrate with sampling
    print("\nDemonstration with Pareto(α=1.5):")
    print("  Theory: E[X] exists, E[X²] doesn't")

    np.random.seed(42)
    pareto = stats.pareto(b=1.5)  # α = 1.5

    for n in [1000, 10000, 100000]:
        samples = pareto.rvs(n)
        sample_mean = np.mean(samples)
        sample_var = np.var(samples)
        print(f"  n = {n:>6}: mean = {sample_mean:>10.2f}, var = {sample_var:>15.2f}")

    print("\n  Notice: Sample variance keeps growing (doesn't converge)")


compute_raw_moments()
moment_existence()
```

Central moments measure the expected powers of deviations from the mean, providing location-invariant shape information.
Definition (Central Moment):
The $k$-th central moment of $X$ is: $$\mu_k = E[(X - \mu)^k]$$
where $\mu = E[X]$ is the mean.
The Central Moment Hierarchy:
| Moment | Formula | Value | Meaning |
|---|---|---|---|
| 1st | $E[X - \mu]$ | Always 0 | (trivial) |
| 2nd | $E[(X - \mu)^2]$ | Variance | Spread |
| 3rd | $E[(X - \mu)^3]$ | (see skewness) | Asymmetry |
| 4th | $E[(X - \mu)^4]$ | (see kurtosis) | Tail weight |
Relationship to Raw Moments:
Central moments can be expressed in terms of raw moments via the binomial theorem: $$\mu_k = E[(X - \mu)^k] = \sum_{j=0}^{k} \binom{k}{j} (-\mu)^{k-j} E[X^j]$$
For example: $\mu_2 = \mu'_2 - \mu^2$ (recovering the variance shortcut) and $\mu_3 = \mu'_3 - 3\mu'_2 \mu + 2\mu^3$.
Central moments are location-invariant: if Y = X + c, then μₖ(Y) = μₖ(X) for k ≥ 2. This makes them pure measures of shape, unaffected by where the distribution is centered. Raw moments, in contrast, change when you shift the distribution.
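As a quick numerical check of the conversion formula (an added sketch; the helper name `third_central_from_raw` is ours, not from the lesson), the snippet below verifies $\mu_3 = \mu'_3 - 3\mu'_2\mu + 2\mu^3$ for an Exponential(1) distribution, whose raw moments are $E[X^k] = k!$.

```python
import numpy as np
from scipy import stats

def third_central_from_raw(dist):
    """Third central moment from raw moments via the binomial-theorem identity."""
    m1, m2, m3 = dist.moment(1), dist.moment(2), dist.moment(3)
    return m3 - 3 * m2 * m1 + 2 * m1**3

exp_dist = stats.expon(scale=1)                  # Exponential(1): E[X^k] = k!
mu = exp_dist.mean()

# Direct Monte Carlo estimate of E[(X - mu)^3] for comparison
samples = exp_dist.rvs(size=200_000, random_state=0)
print(f"From raw moments: {third_central_from_raw(exp_dist):.4f}")   # theory: 2
print(f"Monte Carlo:      {np.mean((samples - mu) ** 3):.4f}")
```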
```python
import numpy as np
from scipy import stats


def compute_central_moments():
    """Compute central moments for various distributions."""
    print("Central Moments: E[(X - μ)^k]")
    print("=" * 60)

    # Exponential(λ=1): μ = 1
    print("\nExponential(λ=1), μ = 1:")
    exp_dist = stats.expon(scale=1)
    mu = exp_dist.mean()
    print(f"  Mean μ = {mu}")

    # Central moments
    print(f"\n  {'k':>3} {'E[(X-μ)^k]':>18} {'From scipy':>15}")
    print("  " + "-" * 40)

    np.random.seed(42)
    samples = exp_dist.rvs(100000)

    for k in range(1, 7):
        # Compute from samples
        empirical = np.mean((samples - mu) ** k)

        # k=1 should be ~0, k=2 is variance
        scipy_val = None
        if k == 1:
            scipy_val = 0
        elif k == 2:
            scipy_val = exp_dist.var()

        if scipy_val is not None:
            print(f"  {k:>3} {empirical:>18.6f} {scipy_val:>15.6f}")
        else:
            print(f"  {k:>3} {empirical:>18.6f}")

    # Show location invariance
    print("\n\nLocation Invariance of Central Moments")
    print("-" * 60)

    X = stats.norm(0, 2)    # N(0, 4)
    Y = stats.norm(100, 2)  # N(100, 4) - shifted by 100

    print("\nX ~ N(0, 4), Y ~ N(100, 4)")
    print("Y is just X shifted by 100")

    print(f"\n  {'Moment':>10} {'μₖ(X)':>15} {'μₖ(Y)':>15}")
    print("  " + "-" * 45)

    samples_X = X.rvs(100000)
    samples_Y = Y.rvs(100000)

    for k in [2, 3, 4]:
        mu_k_X = np.mean((samples_X - X.mean()) ** k)
        mu_k_Y = np.mean((samples_Y - Y.mean()) ** k)
        print(f"  {k:>10} {mu_k_X:>15.4f} {mu_k_Y:>15.4f}")

    print("\n  Central moments are (approximately) equal!")
    print("  They capture shape, independent of location.")


compute_central_moments()
```

Skewness is the standardized third central moment, measuring asymmetry of the distribution.
Definition (Skewness):
$$\gamma_1 = \frac{E[(X - \mu)^3]}{\sigma^3} = \frac{\mu_3}{\sigma^3}$$
Dividing by $\sigma^3$ makes skewness dimensionless and scale-invariant.
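As a concrete check of this formula (a worked example added here), consider the Exponential($\lambda$) distribution, whose raw moments are $E[X^k] = k!/\lambda^k$:

$$\mu_3 = E[X^3] - 3\mu E[X^2] + 2\mu^3 = \frac{6}{\lambda^3} - \frac{6}{\lambda^3} + \frac{2}{\lambda^3} = \frac{2}{\lambda^3}, \qquad \gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{2/\lambda^3}{1/\lambda^3} = 2$$

independent of $\lambda$, matching the theoretical value used for Exponential(1) in the code below.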
Interpretation:
| Skewness | Distribution Shape | Examples |
|---|---|---|
| $\gamma_1 = 0$ | Symmetric | Normal, Uniform, t |
| $\gamma_1 > 0$ | Right-skewed (long right tail) | Log-normal, Exponential, Chi-squared |
| $\gamma_1 < 0$ | Left-skewed (long left tail) | Negative log-normal, some Beta |
Why Skewness Matters:
Mean ≠ Median ≠ Mode: For skewed distributions, these differ. Mean is pulled toward the tail.
Model Assumptions: Many models assume symmetric (often Gaussian) errors. Skewness checks this.
Transformation Need: Highly skewed data often benefits from log or Box-Cox transformation.
Risk Assessment: In finance/insurance, right-skewed loss distributions mean occasional catastrophic losses.
What's 'High' Skewness?
There is no universal cutoff, but a common rule of thumb (mirrored in the code below) is that $|\gamma_1| < 0.5$ is approximately symmetric, $0.5$ to $1$ is moderately skewed, and $|\gamma_1| > 1$ is highly skewed and often worth transforming.
```python
import numpy as np
from scipy import stats


def skewness_examples():
    """Compute and interpret skewness for various distributions."""
    print("Skewness: Measuring Asymmetry")
    print("=" * 60)

    distributions = [
        ("Normal(0, 1)", stats.norm(0, 1), 0),
        ("Uniform(0, 1)", stats.uniform(0, 1), 0),
        ("Exponential(1)", stats.expon(scale=1), 2),
        ("Chi-squared(3)", stats.chi2(df=3), np.sqrt(8/3)),
        ("Log-Normal(0, 0.5)", stats.lognorm(s=0.5), None),
        ("Beta(2, 5)", stats.beta(2, 5), None),
        ("Beta(5, 2)", stats.beta(5, 2), None),
    ]

    print(f"\n{'Distribution':<25} {'Skewness':>12} {'Interpretation':>20}")
    print("-" * 60)

    for name, dist, theoretical in distributions:
        empirical = dist.stats(moments='s')  # 's' for skewness
        skew = float(empirical)

        if skew > 0.5:
            interp = "Right-skewed"
        elif skew < -0.5:
            interp = "Left-skewed"
        else:
            interp = "Approx. symmetric"

        print(f"{name:<25} {skew:>12.4f} {interp:>20}")

    # Relationship between mean, median, mode
    print("\n\nMean vs Median in Skewed Distributions")
    print("-" * 60)

    print("\nFor right-skewed distribution (Exponential):")
    exp_dist = stats.expon(scale=1)
    print(f"  Mean   = {exp_dist.mean():.4f}")
    print(f"  Median = {exp_dist.median():.4f}")
    print("  Mode   = 0")
    print("  Order: Mode < Median < Mean (mean pulled toward tail)")

    print("\nFor left-skewed distribution (Beta(5, 2)):")
    beta_left = stats.beta(5, 2)
    # Numerical mode via grid search over the PDF (cross-checks the formula below)
    grid = np.linspace(0, 1, 1000)
    mode_approx = grid[np.argmax(beta_left.pdf(grid))]
    print(f"  Mean   = {beta_left.mean():.4f}")
    print(f"  Median = {beta_left.median():.4f}")
    print(f"  Mode   ≈ {(5-1)/(5+2-2):.4f} (grid search: {mode_approx:.4f})")  # (α-1)/(α+β-2) for Beta
    print("  Order: Mean < Median < Mode")


def detecting_skewness_in_data():
    """Show how to detect and interpret skewness in real data."""
    print("\n\nDetecting Skewness in Data")
    print("=" * 60)

    np.random.seed(42)

    # Simulate income data (typically right-skewed)
    n = 10000
    income = np.random.lognormal(mean=10, sigma=1, size=n)

    # Sample statistics
    mean = np.mean(income)
    median = np.median(income)
    skewness = stats.skew(income)

    print("\nSimulated income data (log-normal):")
    print(f"  n        = {n}")
    print(f"  Mean     = ${mean:,.0f}")
    print(f"  Median   = ${median:,.0f}")
    print(f"  Skewness = {skewness:.4f}")

    print("\nDiagnosis:")
    print("  Mean > Median suggests right skew")
    print(f"  Skewness = {skewness:.2f} > 1: Highly right-skewed")

    # Effect of log transform
    log_income = np.log(income)
    log_skewness = stats.skew(log_income)

    print("\nAfter log transform:")
    print(f"  Skewness = {log_skewness:.4f}")
    print("  Much closer to symmetric!")

    # Test for normality
    _, p_original = stats.normaltest(income)
    _, p_log = stats.normaltest(log_income)
    print("\nNormality test p-values:")
    print(f"  Original:        p = {p_original:.2e} (reject normality)")
    print(f"  Log-transformed: p = {p_log:.4f}")


skewness_examples()
detecting_skewness_in_data()
```

Kurtosis is the standardized fourth central moment, often interpreted as measuring 'tail heaviness' relative to a normal distribution.
Definition (Kurtosis):
$$\gamma_2 = \frac{E[(X - \mu)^4]}{\sigma^4} = \frac{\mu_4}{\sigma^4}$$
Excess Kurtosis subtracts 3 (the kurtosis of a standard normal): $$\text{Excess Kurtosis} = \gamma_2 - 3$$
This makes the normal distribution have excess kurtosis of 0.
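For reference (a short derivation added here), the baseline of 3 comes from the even moments of the standard normal, $E[Z^{2m}] = (2m-1)!!$, the same double-factorial formula used in the raw-moments code earlier:

$$\gamma_2 = \frac{E[(X-\mu)^4]}{\sigma^4} = E\!\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] = E[Z^4] = 3!! = 3 \cdot 1 = 3$$

so subtracting 3 centers the scale at the normal distribution.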
Interpretation of Excess Kurtosis:
| Excess Kurtosis | Name | Tail Behavior | Examples |
|---|---|---|---|
| = 0 | Mesokurtic | Normal tails | Normal |
| > 0 | Leptokurtic | Heavy tails, more outliers | t-distribution, Laplace |
| < 0 | Platykurtic | Light tails, fewer outliers | Uniform, bounded dists |
Common Misconception:
Kurtosis is often described as 'peakedness,' but this is misleading: the height of the peak and the weight of the tails can vary somewhat independently. A distribution can be sharply peaked yet have modest kurtosis, and a flat-topped distribution like the uniform still has well-defined (negative excess) kurtosis. Kurtosis is primarily about tail weight, that is, the probability of extreme values.
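As a worked platykurtic example (added for concreteness), the Uniform$(a, b)$ distribution has central moments $\mu_2 = (b-a)^2/12$ and $\mu_4 = (b-a)^4/80$, so

$$\gamma_2 = \frac{\mu_4}{\mu_2^2} = \frac{(b-a)^4/80}{(b-a)^4/144} = \frac{144}{80} = 1.8, \qquad \text{excess kurtosis} = 1.8 - 3 = -1.2$$

which matches the value reported for Uniform(0, 1) in the code below.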
What's 'High' Kurtosis?
There is no strict cutoff. The code below treats excess kurtosis within roughly ±0.5 of zero as normal-like (mesokurtic); clearly positive values indicate heavy tails and a higher chance of extreme observations, as the tail-probability comparison demonstrates.
```python
import numpy as np
from scipy import stats


def kurtosis_examples():
    """Compute and interpret kurtosis for various distributions."""
    print("Kurtosis: Tail Heaviness")
    print("=" * 60)

    distributions = [
        ("Uniform(0, 1)", stats.uniform(0, 1)),
        ("Normal(0, 1)", stats.norm(0, 1)),
        ("Laplace(0, 1)", stats.laplace(0, 1)),
        ("t(df=10)", stats.t(df=10)),
        ("t(df=5)", stats.t(df=5)),
        ("Exponential(1)", stats.expon(scale=1)),
        ("Beta(2, 2)", stats.beta(2, 2)),
    ]

    print(f"\n{'Distribution':<20} {'Kurtosis':>12} {'Excess Kurt':>15} {'Type':>15}")
    print("-" * 65)

    for name, dist in distributions:
        try:
            # scipy.stats returns excess kurtosis by default
            excess_kurt = float(dist.stats(moments='k'))
            kurtosis = excess_kurt + 3

            if excess_kurt > 0.5:
                kurt_type = "Leptokurtic"
            elif excess_kurt < -0.5:
                kurt_type = "Platykurtic"
            else:
                kurt_type = "Mesokurtic"

            print(f"{name:<20} {kurtosis:>12.4f} {excess_kurt:>15.4f} {kurt_type:>15}")
        except Exception:
            print(f"{name:<20} {'N/A':>12} {'N/A':>15} {'(undefined)':>15}")

    print("\nNote: t(df≤4) has undefined kurtosis (4th moment doesn't exist)")


def tail_behavior_demo():
    """Demonstrate how kurtosis relates to tail probability."""
    print("\n\nKurtosis and Tail Probability")
    print("=" * 60)

    print("\nComparing P(|X| > k) for different kurtosis levels")
    print("(All standardized to mean 0, variance 1)")

    # Distributions with the same mean (0) and variance (1)
    normal = stats.norm(0, 1)
    uniform_std = stats.uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3))  # Var = 1
    laplace_std = stats.laplace(0, 1 / np.sqrt(2))                      # Var = 1
    t5_std = stats.t(df=5, scale=np.sqrt(3 / 5))                        # Adjusted for Var = 1

    dists = [
        ("Uniform (platy)", uniform_std),
        ("Normal (meso)", normal),
        ("Laplace (lepto)", laplace_std),
        ("t(5) (lepto)", t5_std),
    ]

    print(f"\n{'Distribution':<20}", end="")
    for k in [2, 3, 4, 5]:
        print(f"{'P(|X|>' + str(k) + ')':>12}", end="")
    print()
    print("-" * 70)

    for name, dist in dists:
        print(f"{name:<20}", end="")
        for k in [2, 3, 4, 5]:
            prob = 2 * (1 - dist.cdf(k))
            print(f"{prob:>12.6f}", end="")
        print()

    print("\n  Higher kurtosis → more probability in extreme tails")
    print("  t(5) has MUCH more tail probability than Normal")


def kurtosis_in_ml():
    """Show ML applications of kurtosis analysis."""
    print("\n\nKurtosis in ML Applications")
    print("=" * 60)

    print("\n1. Checking Regression Residuals")
    print("-" * 40)

    np.random.seed(42)

    # Simulate regression with normal errors
    n = 1000
    X = np.random.uniform(0, 10, n)
    y_normal = 2 * X + 1 + np.random.normal(0, 1, n)
    residuals_normal = y_normal - (2 * X + 1)

    # Simulate regression with heavy-tailed errors
    y_heavy = 2 * X + 1 + np.random.standard_t(df=3, size=n)
    residuals_heavy = y_heavy - (2 * X + 1)

    print("\nNormal errors:")
    print(f"  Excess kurtosis = {stats.kurtosis(residuals_normal):.4f}")
    print(f"  Diagnosis: {'Normal' if abs(stats.kurtosis(residuals_normal)) < 1 else 'Non-normal'}")

    print("\nHeavy-tailed errors:")
    print(f"  Excess kurtosis = {stats.kurtosis(residuals_heavy):.4f}")
    print("  Diagnosis: High kurtosis → heavy tails, outliers likely")

    print("\n2. Feature Engineering for Anomaly Detection")
    print("-" * 40)

    # Normal data vs data with anomalies
    normal_data = np.random.normal(0, 1, 1000)
    anomaly_data = np.concatenate([
        np.random.normal(0, 1, 980),
        np.random.normal(0, 5, 20)  # 2% anomalies with larger variance
    ])

    print(f"\nNo anomalies: excess kurtosis = {stats.kurtosis(normal_data):.4f}")
    print(f"With 2% anomalies: excess kurtosis = {stats.kurtosis(anomaly_data):.4f}")
    print("\n  Elevated kurtosis can signal contamination or outliers")


kurtosis_examples()
tail_behavior_demo()
kurtosis_in_ml()
```

The Moment Generating Function (MGF) is a powerful tool that encodes all moments of a distribution in a single function.
Definition (Moment Generating Function):
$$M_X(t) = E[e^{tX}]$$
For discrete $X$: $M_X(t) = \sum_x e^{tx} \cdot p(x)$
For continuous $X$: $M_X(t) = \int_{-\infty}^{\infty} e^{tx} \cdot f(x) \, dx$
The MGF exists if the expectation is finite for $t$ in some neighborhood of 0.
Why 'Moment Generating'?:
The $k$-th moment is the $k$-th derivative of $M_X(t)$ evaluated at $t = 0$: $$E[X^k] = M_X^{(k)}(0) = \left. \frac{d^k}{dt^k} M_X(t) \right|_{t=0}$$
Proof Intuition: Expanding $e^{tX} = \sum_{k=0}^{\infty} \frac{(tX)^k}{k!}$: $$M_X(t) = E\left[\sum_{k=0}^{\infty} \frac{(tX)^k}{k!}\right] = \sum_{k=0}^{\infty} \frac{t^k}{k!} E[X^k]$$
The coefficient of $t^k/k!$ is $E[X^k]$, recoverable by differentiation.
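To complement the numerical differentiation in the code below, here is a small symbolic sketch (an added illustration; it uses SymPy, which the original lesson code does not require) that differentiates the Normal MGF and reads off the first two moments.

```python
import sympy as sp

t, mu = sp.symbols('t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# MGF of Normal(mu, sigma^2)
M = sp.exp(mu * t + sigma**2 * t**2 / 2)

# k-th raw moment = k-th derivative of the MGF evaluated at t = 0
EX = sp.diff(M, t, 1).subs(t, 0)    # E[X]
EX2 = sp.diff(M, t, 2).subs(t, 0)   # E[X^2]

print(sp.simplify(EX))              # mu
print(sp.simplify(EX2))             # mu**2 + sigma**2
print(sp.simplify(EX2 - EX**2))     # sigma**2, i.e. the variance
```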
Key Properties:
Uniqueness: if the MGF exists on an open interval around 0, it uniquely determines the distribution.
Moments by differentiation: $E[X^k] = M_X^{(k)}(0)$, as derived above.
Independent sums: $M_{X+Y}(t) = M_X(t) \, M_Y(t)$ when $X$ and $Y$ are independent.
Linear transformations: $M_{aX+b}(t) = e^{bt} M_X(at)$.
```python
import numpy as np
from scipy import stats


def numerical_derivative(f, x0, n=1, dx=1e-4):
    """Central-difference approximation of the n-th derivative at x0 (n = 1 or 2).

    (scipy.misc.derivative has been removed from recent SciPy releases,
    so we use a small finite-difference helper instead.)
    """
    if n == 1:
        return (f(x0 + dx) - f(x0 - dx)) / (2 * dx)
    if n == 2:
        return (f(x0 + dx) - 2 * f(x0) + f(x0 - dx)) / dx**2
    raise ValueError("Only n = 1 or n = 2 supported")


def mgf_examples():
    """Compute MGFs and derive moments."""
    print("Moment Generating Functions")
    print("=" * 60)

    # Normal MGF: M(t) = exp(μt + σ²t²/2)
    print("\nNormal(μ=2, σ²=1):")
    print("  M(t) = exp(μt + σ²t²/2) = exp(2t + t²/2)")

    def normal_mgf(t, mu=2, sigma=1):
        return np.exp(mu * t + (sigma ** 2) * t ** 2 / 2)

    # Derive E[X] = M'(0) and E[X²] = M''(0)
    first_moment = numerical_derivative(normal_mgf, 0, n=1)
    second_raw = numerical_derivative(normal_mgf, 0, n=2)

    print(f"  M'(0)  = E[X]  = {first_moment:.6f} (theory: 2)")
    print(f"  M''(0) = E[X²] = {second_raw:.6f} (theory: μ² + σ² = 5)")

    # Exponential MGF: M(t) = λ / (λ - t) for t < λ
    print("\nExponential(λ=2):")
    print("  M(t) = λ/(λ-t) = 2/(2-t) for t < 2")

    lam = 2

    def exp_mgf(t):
        return lam / (lam - t)

    first_exp = numerical_derivative(exp_mgf, 0, n=1)
    second_exp = numerical_derivative(exp_mgf, 0, n=2)

    print(f"  M'(0)  = E[X]  = {first_exp:.6f} (theory: 1/λ = 0.5)")
    print(f"  M''(0) = E[X²] = {second_exp:.6f} (theory: 2/λ² = 0.5)")
    print(f"  Var(X) = E[X²] - E[X]² = {second_exp - first_exp**2:.6f} (theory: 1/λ² = 0.25)")


def mgf_for_sums():
    """Show how MGFs simplify analysis of sums of independent RVs."""
    print("\n\nMGFs for Sums of Independent Random Variables")
    print("=" * 60)

    print("\nKey property: M_{X+Y}(t) = M_X(t) × M_Y(t) when X ⊥ Y")

    # Example: Sum of independent Normals is Normal
    print("\nExample 1: X ~ N(μ₁, σ₁²), Y ~ N(μ₂, σ₂²) independent")
    print("  M_X(t) = exp(μ₁t + σ₁²t²/2)")
    print("  M_Y(t) = exp(μ₂t + σ₂²t²/2)")
    print("  M_{X+Y}(t) = exp((μ₁+μ₂)t + (σ₁²+σ₂²)t²/2)")
    print("  → X + Y ~ N(μ₁ + μ₂, σ₁² + σ₂²)")

    # Verify numerically
    np.random.seed(42)
    n = 100000
    X = np.random.normal(1, 2, n)  # N(1, 4)
    Y = np.random.normal(3, 1, n)  # N(3, 1)
    Z = X + Y

    print(f"\n  Empirical: Z has mean {np.mean(Z):.4f}, var {np.var(Z):.4f}")
    print(f"  Theory: Z ~ N({1+3}, {4+1}) → mean 4, var 5")

    # Example: Sum of independent Poissons
    print("\nExample 2: X ~ Poisson(λ₁), Y ~ Poisson(λ₂) independent")
    print("  M_X(t) = exp(λ₁(e^t - 1))")
    print("  M_Y(t) = exp(λ₂(e^t - 1))")
    print("  M_{X+Y}(t) = exp((λ₁ + λ₂)(e^t - 1))")
    print("  → X + Y ~ Poisson(λ₁ + λ₂)")

    X_pois = np.random.poisson(3, n)
    Y_pois = np.random.poisson(5, n)
    Z_pois = X_pois + Y_pois

    print(f"\n  Empirical: Z has mean {np.mean(Z_pois):.4f}")
    print("  Theory: Poisson(3 + 5 = 8) has mean 8")


def characteristic_functions():
    """Briefly introduce characteristic functions as a generalization."""
    print("\n\nCharacteristic Functions (Brief Note)")
    print("=" * 60)

    print("\nThe MGF doesn't always exist (e.g., Cauchy distribution).")
    print("The Characteristic Function (CF) always exists:")
    print("  φ_X(t) = E[e^{itX}] where i = √(-1)")
    print("\nProperties:")
    print("  - Always exists for any distribution")
    print("  - Still uniquely determines the distribution")
    print("  - φ_X(t) = M_X(it) when MGF exists")
    print("  - Used in proofs of Central Limit Theorem")
    print("\nFor practical ML work, MGFs are usually sufficient")
    print("since we often assume light-tailed distributions.")


mgf_examples()
mgf_for_sums()
characteristic_functions()
```

In applications, we estimate moments from finite samples. Understanding the behavior of sample moments is essential for reliable inference.
```python
import numpy as np
from scipy import stats


def sample_moment_behavior():
    """Demonstrate behavior of sample moments."""
    print("Sample Moment Estimation")
    print("=" * 60)

    # Show bias in variance estimation
    print("\n1. Bias in Variance Estimation")
    print("-" * 40)

    np.random.seed(42)
    true_var = 4
    n_trials = 10000
    n_samples = 10

    biased_vars = []
    unbiased_vars = []

    for _ in range(n_trials):
        samples = np.random.normal(0, 2, n_samples)
        biased_vars.append(np.var(samples))             # n divisor
        unbiased_vars.append(np.var(samples, ddof=1))   # n - 1 divisor

    print(f"True variance: {true_var}")
    print(f"Mean of biased estimates (n divisor): {np.mean(biased_vars):.4f}")
    print(f"Mean of unbiased estimates (n-1 divisor): {np.mean(unbiased_vars):.4f}")
    print(f"Bias in biased estimator: {np.mean(biased_vars) - true_var:.4f}")


def outlier_sensitivity():
    """Show how higher moments are sensitive to outliers."""
    print("\n\n2. Outlier Sensitivity of Higher Moments")
    print("=" * 60)

    np.random.seed(42)

    # Clean data from N(0, 1)
    clean_data = np.random.normal(0, 1, 100)

    # Contaminated: one outlier
    contaminated = np.append(clean_data, [10])  # Add single outlier

    print("\nClean data (100 samples from N(0,1)):")
    print(f"  Mean: {np.mean(clean_data):.4f}")
    print(f"  Std: {np.std(clean_data):.4f}")
    print(f"  Skewness: {stats.skew(clean_data):.4f}")
    print(f"  Kurtosis: {stats.kurtosis(clean_data):.4f}")

    print("\nWith single outlier (x = 10) added:")
    print(f"  Mean: {np.mean(contaminated):.4f} (shifted)")
    print(f"  Std: {np.std(contaminated):.4f} (inflated)")
    print(f"  Skewness: {stats.skew(contaminated):.4f} (heavily affected)")
    print(f"  Kurtosis: {stats.kurtosis(contaminated):.4f} (dramatically affected)")

    print("\n  One outlier in 101 samples completely distorts kurtosis!")


def sample_size_requirements():
    """Show sample size requirements for moment estimation."""
    print("\n\n3. Sample Size Requirements")
    print("=" * 60)

    np.random.seed(42)

    # True values for N(0, 1): skewness 0, excess kurtosis 0
    print("\nEstimating skewness and kurtosis of N(0, 1):")
    print("(True skewness = 0, True excess kurtosis = 0)")

    print(f"\n{'n':>8} {'Skew Est':>12} {'Skew SE':>10} {'Kurt Est':>12} {'Kurt SE':>10}")
    print("-" * 55)

    for n in [20, 50, 100, 500, 1000, 5000]:
        skew_estimates = []
        kurt_estimates = []
        for _ in range(1000):
            samples = np.random.normal(0, 1, n)
            skew_estimates.append(stats.skew(samples))
            kurt_estimates.append(stats.kurtosis(samples))

        print(f"{n:>8} {np.mean(skew_estimates):>12.4f} {np.std(skew_estimates):>10.4f} "
              f"{np.mean(kurt_estimates):>12.4f} {np.std(kurt_estimates):>10.4f}")

    print("\n  Standard error decreases slowly with n")
    print("  Need n ≥ 100 for reasonably precise kurtosis estimates")


def robust_alternatives():
    """Mention robust alternatives to moments."""
    print("\n\n4. Robust Alternatives")
    print("=" * 60)

    np.random.seed(42)

    # Data with outliers
    data = np.concatenate([
        np.random.normal(0, 1, 95),
        np.random.normal(10, 1, 5)  # 5% contamination
    ])

    print("\nData: 95% from N(0,1), 5% from N(10,1)")

    print("\nLocation measures:")
    print(f"  Mean: {np.mean(data):.4f}")
    print(f"  Median: {np.median(data):.4f}")
    print(f"  Trimmed mean (10%): {stats.trim_mean(data, 0.1):.4f}")

    print("\nScale measures:")
    print(f"  Standard deviation: {np.std(data):.4f}")
    print(f"  MAD (Median Absolute Deviation): {stats.median_abs_deviation(data):.4f}")
    print(f"  IQR: {stats.iqr(data):.4f}")

    print("\nSkewness alternatives:")
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    bowley_skew = (q3 + q1 - 2 * q2) / (q3 - q1)  # Bowley's skewness
    print(f"  Sample skewness: {stats.skew(data):.4f}")
    print(f"  Bowley's skewness (quartile-based): {bowley_skew:.4f}")

    print("\n  Robust measures are less affected by outliers")


sample_moment_behavior()
outlier_sensitivity()
sample_size_requirements()
robust_alternatives()
```

We've completed our tour of moments—from basic expectations through the full hierarchy of shape descriptors.
| Moment | Formula | Meaning | ML Relevance |
|---|---|---|---|
| Mean (1st) | E[X] | Center/location | Predictions, loss minimizers |
| Variance (2nd central) | E[(X-μ)²] | Spread | Uncertainty, regularization |
| Skewness (3rd standardized) | E[(X-μ)³]/σ³ | Asymmetry | Normality checks, transforms |
| Excess kurtosis (4th standardized) | E[(X-μ)⁴]/σ⁴ - 3 | Tail weight | Outlier detection, risk |
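To apply the whole table at once, here is a compact illustrative helper (an addition; the name `moment_summary` is ours) that reports all four shape descriptors for a 1-D sample, using SciPy's Fisher convention for kurtosis (normal → 0).

```python
import numpy as np
from scipy import stats

def moment_summary(x):
    """Mean, variance, skewness, and excess kurtosis of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": float(np.mean(x)),
        "variance": float(np.var(x, ddof=1)),          # unbiased sample variance
        "skewness": float(stats.skew(x)),
        "excess_kurtosis": float(stats.kurtosis(x)),   # Fisher: normal → 0
    }

# Example: right-skewed, heavy-tailed synthetic data
rng = np.random.default_rng(0)
print(moment_summary(rng.lognormal(mean=0.0, sigma=1.0, size=10_000)))
```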
Module Complete:
You have now mastered the foundational material on random variables and distributions: how random variables formalize uncertainty, how their distributions are described, and how expectation, variance, and the higher moments covered here summarize location, spread, and shape.
This forms the probabilistic language for all of machine learning. The next module will explore specific probability distributions in depth—the Bernoulli, Binomial, Gaussian, and other workhorses of ML modeling.
Congratulations! You have completed Module 2: Random Variables and Distributions. You now possess the mathematical vocabulary to describe uncertainty, compute risks, and reason about the probabilistic foundations of machine learning algorithms.