We've mastered expectation (the first moment) and variance (derived from the second moment). But some distributions with identical means and variances can look completely different—one might be symmetric while another is heavily skewed; one might have thin tails, another thick tails.
To capture these differences, we need higher moments. The complete sequence of moments—first, second, third, and beyond—provides increasingly detailed information about a distribution's shape. Under certain conditions, the moments uniquely determine the entire distribution.
In machine learning, higher moments matter for checking distributional assumptions (such as Gaussian errors), deciding when skewed features need a log or Box-Cox transform, detecting outliers and heavy-tailed behavior, and assessing the risk of rare extreme events.
By the end of this page, you will understand raw and central moments, interpret skewness (asymmetry) and kurtosis (tail heaviness), use moment generating functions to derive distributions and moments, and apply these concepts to analyze data and validate modeling assumptions.
The most basic moments are the raw moments (or moments about the origin)—expectations of powers of $X$.
Definition (Raw Moment):
The $k$-th raw moment of $X$ is: $$\mu'_k = E[X^k]$$
For discrete $X$: $\mu'_k = \sum_x x^k \cdot p(x)$
For continuous $X$: $\mu'_k = \int_{-\infty}^{\infty} x^k \cdot f(x) \, dx$
The Moment Hierarchy:
| Moment | Symbol | Name | Meaning |
|---|---|---|---|
| 1st | $\mu'_1 = E[X]$ | Mean | Center/location |
| 2nd | $\mu'_2 = E[X^2]$ | — | Related to spread via $\text{Var} = \mu'_2 - (\mu'_1)^2$ |
| 3rd | $\mu'_3 = E[X^3]$ | — | Related to asymmetry |
| 4th | $\mu'_4 = E[X^4]$ | — | Related to tail behavior |
| $k$-th | $\mu'_k = E[X^k]$ | — | Higher-order shape information |
Key Observation: Raw moments depend on both the shape and the location of the distribution. For interpretable shape measures, we typically use central moments (about the mean) or standardized moments.
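To make this location dependence concrete, here is a small illustrative sketch (an addition, not part of the lesson's code; the shift of 100 is arbitrary) showing that the second raw moment moves with the data while the variance does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # centered data
y = x + 100.0                                      # same shape, shifted location

# The second raw moment E[X^2] depends on where the data sit...
print(f"E[X^2] ≈ {np.mean(x**2):.2f},   E[Y^2] ≈ {np.mean(y**2):.2f}")

# ...while the second central moment (the variance) is unchanged.
print(f"Var(X) ≈ {np.var(x):.2f},   Var(Y) ≈ {np.var(y):.2f}")
```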
```python
import math

import numpy as np
from scipy import stats, integrate


def compute_raw_moments():
    """Compute raw moments for various distributions."""
    print("Raw Moments: E[X^k]")
    print("=" * 60)

    # Standard Normal N(0, 1)
    print("\nStandard Normal N(0, 1):")
    normal = stats.norm(0, 1)
    print(f"  {'k':>3} {'E[X^k] (analytical)':>22} {'E[X^k] (numerical)':>22}")
    print("  " + "-" * 50)

    # For N(0,1): E[X^k] = 0 if k odd, = (k-1)!! if k even
    # (k-1)!! = (k-1)(k-3)(k-5)...1
    def double_factorial(n):
        if n <= 0:
            return 1
        return n * double_factorial(n - 2)

    for k in range(1, 9):
        # Numerical integration
        numerical, _ = integrate.quad(lambda x: x**k * normal.pdf(x), -10, 10)

        # Analytical for N(0,1)
        if k % 2 == 1:
            analytical = 0
        else:
            analytical = double_factorial(k - 1)

        print(f"  {k:>3} {analytical:>22.6f} {numerical:>22.6f}")

    # Exponential(λ=1)
    print("\nExponential(λ=1):")
    exp_dist = stats.expon(scale=1)
    print(f"  {'k':>3} {'E[X^k] = k!':>15} {'Numerical':>15}")
    print("  " + "-" * 35)

    for k in range(1, 7):
        analytical = math.factorial(k)
        numerical = exp_dist.moment(k)
        print(f"  {k:>3} {analytical:>15.4f} {numerical:>15.4f}")


def moment_existence():
    """Discuss when moments exist."""
    print("\n\nMoment Existence")
    print("=" * 60)

    print("\nNot all distributions have all moments!")
    print("\nFor E[X^k] to exist: ∫ |x|^k f(x) dx < ∞")

    print("\nExample: Cauchy distribution")
    print("  PDF: f(x) = 1 / (π(1 + x²))")
    print("  E[X] does not exist (integral diverges)")
    print("  No moments exist!")

    # Pareto distribution - moments depend on shape parameter
    print("\nExample: Pareto distribution")
    print("  PDF: f(x) = α x_m^α / x^(α+1) for x ≥ x_m")
    print("  E[X^k] exists iff α > k")
    print("\n  α = 2: E[X] exists, E[X²] doesn't → Var infinite")
    print("  α = 3: E[X], E[X²] exist → finite variance")
    print("  α = 5: Moments up to 4th exist")

    # Demonstrate with sampling
    print("\nDemonstration with Pareto(α=1.5):")
    print("  Theory: E[X] exists, E[X²] doesn't")

    np.random.seed(42)
    pareto = stats.pareto(b=1.5)  # α = 1.5

    for n in [1000, 10000, 100000]:
        samples = pareto.rvs(n)
        sample_mean = np.mean(samples)
        sample_var = np.var(samples)
        print(f"  n = {n:>6}: mean = {sample_mean:>10.2f}, var = {sample_var:>15.2f}")

    print("\n  Notice: Sample variance keeps growing (doesn't converge)")


compute_raw_moments()
moment_existence()
```

Central moments measure the expected powers of deviations from the mean, providing location-invariant shape information.
Definition (Central Moment):
The $k$-th central moment of $X$ is: $$\mu_k = E[(X - \mu)^k]$$
where $\mu = E[X]$ is the mean.
The Central Moment Hierarchy:
| Moment | Formula | Value | Meaning |
|---|---|---|---|
| 1st | $E[X - \mu]$ | Always 0 | (trivial) |
| 2nd | $E[(X - \mu)^2]$ | Variance | Spread |
| 3rd | $E[(X - \mu)^3]$ | (see skewness) | Asymmetry |
| 4th | $E[(X - \mu)^4]$ | (see kurtosis) | Tail weight |
Relationship to Raw Moments:
Central moments can be expressed in terms of raw moments via the binomial theorem: $$\mu_k = E[(X - \mu)^k] = \sum_{j=0}^{k} \binom{k}{j} (-\mu)^{k-j} E[X^j]$$
For example: $\mu_2 = \mu'_2 - \mu^2$ (recovering the variance shortcut) and $\mu_3 = \mu'_3 - 3\mu'_2 \mu + 2\mu^3$.
Central moments are location-invariant: if Y = X + c, then μₖ(Y) = μₖ(X) for k ≥ 2. This makes them pure measures of shape, unaffected by where the distribution is centered. Raw moments, in contrast, change when you shift the distribution.
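As a quick numerical check of the conversion formula (an added sketch; the helper name `third_central_from_raw` is ours, not from the lesson), the snippet below verifies $\mu_3 = \mu'_3 - 3\mu'_2\mu + 2\mu^3$ for an Exponential(1) distribution, whose raw moments are $E[X^k] = k!$.

```python
import numpy as np
from scipy import stats

def third_central_from_raw(dist):
    """Third central moment from raw moments via the binomial-theorem identity."""
    m1, m2, m3 = dist.moment(1), dist.moment(2), dist.moment(3)
    return m3 - 3 * m2 * m1 + 2 * m1**3

exp_dist = stats.expon(scale=1)                  # Exponential(1): E[X^k] = k!
mu = exp_dist.mean()

# Direct Monte Carlo estimate of E[(X - mu)^3] for comparison
samples = exp_dist.rvs(size=200_000, random_state=0)
print(f"From raw moments: {third_central_from_raw(exp_dist):.4f}")   # theory: 2
print(f"Monte Carlo:      {np.mean((samples - mu) ** 3):.4f}")
```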
```python
import numpy as np
from scipy import stats


def compute_central_moments():
    """Compute central moments for various distributions."""
    print("Central Moments: E[(X - μ)^k]")
    print("=" * 60)

    # Exponential(λ=1): μ = 1
    print("\nExponential(λ=1), μ = 1:")
    exp_dist = stats.expon(scale=1)
    mu = exp_dist.mean()
    print(f"  Mean μ = {mu}")

    # Central moments
    print(f"\n  {'k':>3} {'E[(X-μ)^k]':>18} {'From scipy':>15}")
    print("  " + "-" * 40)

    np.random.seed(42)
    samples = exp_dist.rvs(100000)

    for k in range(1, 7):
        # Compute from samples
        empirical = np.mean((samples - mu) ** k)

        # k=1 should be ~0, k=2 is variance
        scipy_val = None
        if k == 1:
            scipy_val = 0
        elif k == 2:
            scipy_val = exp_dist.var()

        if scipy_val is not None:
            print(f"  {k:>3} {empirical:>18.6f} {scipy_val:>15.6f}")
        else:
            print(f"  {k:>3} {empirical:>18.6f}")

    # Show location invariance
    print("\n\nLocation Invariance of Central Moments")
    print("-" * 60)

    X = stats.norm(0, 2)    # N(0, 4)
    Y = stats.norm(100, 2)  # N(100, 4) - shifted by 100

    print("\nX ~ N(0, 4), Y ~ N(100, 4)")
    print("Y is just X shifted by 100")

    print(f"\n  {'Moment':>10} {'μₖ(X)':>15} {'μₖ(Y)':>15}")
    print("  " + "-" * 45)

    samples_X = X.rvs(100000)
    samples_Y = Y.rvs(100000)

    for k in [2, 3, 4]:
        mu_k_X = np.mean((samples_X - X.mean()) ** k)
        mu_k_Y = np.mean((samples_Y - Y.mean()) ** k)
        print(f"  {k:>10} {mu_k_X:>15.4f} {mu_k_Y:>15.4f}")

    print("\n  Central moments are (approximately) equal!")
    print("  They capture shape, independent of location.")


compute_central_moments()
```

Skewness is the standardized third central moment, measuring asymmetry of the distribution.
Definition (Skewness):
$$\gamma_1 = \frac{E[(X - \mu)^3]}{\sigma^3} = \frac{\mu_3}{\sigma^3}$$
Dividing by $\sigma^3$ makes skewness dimensionless and scale-invariant.
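As a concrete check of this formula (a worked example added here), consider the Exponential($\lambda$) distribution, whose raw moments are $E[X^k] = k!/\lambda^k$:

$$\mu_3 = E[X^3] - 3\mu E[X^2] + 2\mu^3 = \frac{6}{\lambda^3} - \frac{6}{\lambda^3} + \frac{2}{\lambda^3} = \frac{2}{\lambda^3}, \qquad \gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{2/\lambda^3}{1/\lambda^3} = 2$$

independent of $\lambda$, matching the theoretical value used for Exponential(1) in the code below.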
Interpretation:
| Skewness | Distribution Shape | Examples |
|---|---|---|
| $\gamma_1 = 0$ | Symmetric | Normal, Uniform, t |
| $\gamma_1 > 0$ | Right-skewed (long right tail) | Log-normal, Exponential, Chi-squared |
| $\gamma_1 < 0$ | Left-skewed (long left tail) | Negative log-normal, some Beta |
Why Skewness Matters:
Mean ≠ Median ≠ Mode: For skewed distributions, these differ. Mean is pulled toward the tail.
Model Assumptions: Many models assume symmetric (often Gaussian) errors. Skewness checks this.
Transformation Need: Highly skewed data often benefits from log or Box-Cox transformation.
Risk Assessment: In finance/insurance, right-skewed loss distributions mean occasional catastrophic losses.
What's 'High' Skewness?
There is no universal cutoff, but a common rule of thumb (mirrored in the code below) is that $|\gamma_1| < 0.5$ is approximately symmetric, $0.5$ to $1$ is moderately skewed, and $|\gamma_1| > 1$ is highly skewed and often worth transforming.
```python
import numpy as np
from scipy import stats


def skewness_examples():
    """Compute and interpret skewness for various distributions."""
    print("Skewness: Measuring Asymmetry")
    print("=" * 60)

    distributions = [
        ("Normal(0, 1)", stats.norm(0, 1), 0),
        ("Uniform(0, 1)", stats.uniform(0, 1), 0),
        ("Exponential(1)", stats.expon(scale=1), 2),
        ("Chi-squared(3)", stats.chi2(df=3), np.sqrt(8/3)),
        ("Log-Normal(0, 0.5)", stats.lognorm(s=0.5), None),
        ("Beta(2, 5)", stats.beta(2, 5), None),
        ("Beta(5, 2)", stats.beta(5, 2), None),
    ]

    print(f"\n{'Distribution':<25} {'Skewness':>12} {'Interpretation':>20}")
    print("-" * 60)

    for name, dist, theoretical in distributions:
        empirical = dist.stats(moments='s')  # 's' for skewness
        skew = float(empirical)

        if skew > 0.5:
            interp = "Right-skewed"
        elif skew < -0.5:
            interp = "Left-skewed"
        else:
            interp = "Approx. symmetric"

        print(f"{name:<25} {skew:>12.4f} {interp:>20}")

    # Relationship between mean, median, mode
    print("\n\nMean vs Median in Skewed Distributions")
    print("-" * 60)

    print("\nFor right-skewed distribution (Exponential):")
    exp_dist = stats.expon(scale=1)
    print(f"  Mean   = {exp_dist.mean():.4f}")
    print(f"  Median = {exp_dist.median():.4f}")
    print("  Mode   = 0")
    print("  Order: Mode < Median < Mean (mean pulled toward tail)")

    print("\nFor left-skewed distribution (Beta(5, 2)):")
    beta_left = stats.beta(5, 2)
    # Numerical mode via grid search over the PDF (cross-checks the formula below)
    grid = np.linspace(0, 1, 1000)
    mode_approx = grid[np.argmax(beta_left.pdf(grid))]
    print(f"  Mean   = {beta_left.mean():.4f}")
    print(f"  Median = {beta_left.median():.4f}")
    print(f"  Mode   ≈ {(5-1)/(5+2-2):.4f} (grid search: {mode_approx:.4f})")  # (α-1)/(α+β-2) for Beta
    print("  Order: Mean < Median < Mode")


def detecting_skewness_in_data():
    """Show how to detect and interpret skewness in real data."""
    print("\n\nDetecting Skewness in Data")
    print("=" * 60)

    np.random.seed(42)

    # Simulate income data (typically right-skewed)
    n = 10000
    income = np.random.lognormal(mean=10, sigma=1, size=n)

    # Sample statistics
    mean = np.mean(income)
    median = np.median(income)
    skewness = stats.skew(income)

    print("\nSimulated income data (log-normal):")
    print(f"  n        = {n}")
    print(f"  Mean     = ${mean:,.0f}")
    print(f"  Median   = ${median:,.0f}")
    print(f"  Skewness = {skewness:.4f}")

    print("\nDiagnosis:")
    print("  Mean > Median suggests right skew")
    print(f"  Skewness = {skewness:.2f} > 1: Highly right-skewed")

    # Effect of log transform
    log_income = np.log(income)
    log_skewness = stats.skew(log_income)

    print("\nAfter log transform:")
    print(f"  Skewness = {log_skewness:.4f}")
    print("  Much closer to symmetric!")

    # Test for normality
    _, p_original = stats.normaltest(income)
    _, p_log = stats.normaltest(log_income)
    print("\nNormality test p-values:")
    print(f"  Original:        p = {p_original:.2e} (reject normality)")
    print(f"  Log-transformed: p = {p_log:.4f}")


skewness_examples()
detecting_skewness_in_data()
```

Kurtosis is the standardized fourth central moment, often interpreted as measuring 'tail heaviness' relative to a normal distribution.
Definition (Kurtosis):
$$\gamma_2 = \frac{E[(X - \mu)^4]}{\sigma^4} = \frac{\mu_4}{\sigma^4}$$
Excess Kurtosis subtracts 3 (the kurtosis of a standard normal): $$\text{Excess Kurtosis} = \gamma_2 - 3$$
This makes the normal distribution have excess kurtosis of 0.
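For reference (a short derivation added here), the baseline of 3 comes from the even moments of the standard normal, $E[Z^{2m}] = (2m-1)!!$, the same double-factorial formula used in the raw-moments code earlier:

$$\gamma_2 = \frac{E[(X-\mu)^4]}{\sigma^4} = E\!\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] = E[Z^4] = 3!! = 3 \cdot 1 = 3$$

so subtracting 3 centers the scale at the normal distribution.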
Interpretation of Excess Kurtosis:
| Excess Kurtosis | Name | Tail Behavior | Examples |
|---|---|---|---|
| = 0 | Mesokurtic | Normal tails | Normal |
| > 0 | Leptokurtic | Heavy tails, more outliers | t-distribution, Laplace |
| < 0 | Platykurtic | Light tails, fewer outliers | Uniform, bounded dists |
Common Misconception:
Kurtosis is often described as 'peakedness,' but this is misleading: the height of the peak and the weight of the tails can vary somewhat independently. A distribution can be sharply peaked yet have modest kurtosis, and a flat-topped distribution like the uniform still has well-defined (negative excess) kurtosis. Kurtosis is primarily about tail weight, that is, the probability of extreme values.
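As a worked platykurtic example (added for concreteness), the Uniform$(a, b)$ distribution has central moments $\mu_2 = (b-a)^2/12$ and $\mu_4 = (b-a)^4/80$, so

$$\gamma_2 = \frac{\mu_4}{\mu_2^2} = \frac{(b-a)^4/80}{(b-a)^4/144} = \frac{144}{80} = 1.8, \qquad \text{excess kurtosis} = 1.8 - 3 = -1.2$$

which matches the value reported for Uniform(0, 1) in the code below.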
What's 'High' Kurtosis?
There is no strict cutoff. The code below treats excess kurtosis within roughly ±0.5 of zero as normal-like (mesokurtic); clearly positive values indicate heavy tails and a higher chance of extreme observations, as the tail-probability comparison demonstrates.
```python
import numpy as np
from scipy import stats


def kurtosis_examples():
    """Compute and interpret kurtosis for various distributions."""
    print("Kurtosis: Tail Heaviness")
    print("=" * 60)

    distributions = [
        ("Uniform(0, 1)", stats.uniform(0, 1)),
        ("Normal(0, 1)", stats.norm(0, 1)),
        ("Laplace(0, 1)", stats.laplace(0, 1)),
        ("t(df=10)", stats.t(df=10)),
        ("t(df=5)", stats.t(df=5)),
        ("Exponential(1)", stats.expon(scale=1)),
        ("Beta(2, 2)", stats.beta(2, 2)),
    ]

    print(f"\n{'Distribution':<20} {'Kurtosis':>12} {'Excess Kurt':>15} {'Type':>15}")
    print("-" * 65)

    for name, dist in distributions:
        try:
            # scipy.stats returns excess kurtosis by default
            excess_kurt = float(dist.stats(moments='k'))
            kurtosis = excess_kurt + 3

            if excess_kurt > 0.5:
                kurt_type = "Leptokurtic"
            elif excess_kurt < -0.5:
                kurt_type = "Platykurtic"
            else:
                kurt_type = "Mesokurtic"

            print(f"{name:<20} {kurtosis:>12.4f} {excess_kurt:>15.4f} {kurt_type:>15}")
        except Exception:
            print(f"{name:<20} {'N/A':>12} {'N/A':>15} {'(undefined)':>15}")

    print("\nNote: t(df≤4) has undefined kurtosis (4th moment doesn't exist)")


def tail_behavior_demo():
    """Demonstrate how kurtosis relates to tail probability."""
    print("\n\nKurtosis and Tail Probability")
    print("=" * 60)

    print("\nComparing P(|X| > k) for different kurtosis levels")
    print("(All standardized to mean 0, variance 1)")

    # Distributions with the same mean (0) and variance (1)
    normal = stats.norm(0, 1)
    uniform_std = stats.uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3))  # Var = 1
    laplace_std = stats.laplace(0, 1 / np.sqrt(2))                      # Var = 1
    t5_std = stats.t(df=5, scale=np.sqrt(3 / 5))                        # Adjusted for Var = 1

    dists = [
        ("Uniform (platy)", uniform_std),
        ("Normal (meso)", normal),
        ("Laplace (lepto)", laplace_std),
        ("t(5) (lepto)", t5_std),
    ]

    print(f"\n{'Distribution':<20}", end="")
    for k in [2, 3, 4, 5]:
        print(f"{'P(|X|>' + str(k) + ')':>12}", end="")
    print()
    print("-" * 70)

    for name, dist in dists:
        print(f"{name:<20}", end="")
        for k in [2, 3, 4, 5]:
            prob = 2 * (1 - dist.cdf(k))
            print(f"{prob:>12.6f}", end="")
        print()

    print("\n  Higher kurtosis → more probability in extreme tails")
    print("  t(5) has MUCH more tail probability than Normal")


def kurtosis_in_ml():
    """Show ML applications of kurtosis analysis."""
    print("\n\nKurtosis in ML Applications")
    print("=" * 60)

    print("\n1. Checking Regression Residuals")
    print("-" * 40)

    np.random.seed(42)

    # Simulate regression with normal errors
    n = 1000
    X = np.random.uniform(0, 10, n)
    y_normal = 2 * X + 1 + np.random.normal(0, 1, n)
    residuals_normal = y_normal - (2 * X + 1)

    # Simulate regression with heavy-tailed errors
    y_heavy = 2 * X + 1 + np.random.standard_t(df=3, size=n)
    residuals_heavy = y_heavy - (2 * X + 1)

    print("\nNormal errors:")
    print(f"  Excess kurtosis = {stats.kurtosis(residuals_normal):.4f}")
    print(f"  Diagnosis: {'Normal' if abs(stats.kurtosis(residuals_normal)) < 1 else 'Non-normal'}")

    print("\nHeavy-tailed errors:")
    print(f"  Excess kurtosis = {stats.kurtosis(residuals_heavy):.4f}")
    print("  Diagnosis: High kurtosis → heavy tails, outliers likely")

    print("\n2. Feature Engineering for Anomaly Detection")
    print("-" * 40)

    # Normal data vs data with anomalies
    normal_data = np.random.normal(0, 1, 1000)
    anomaly_data = np.concatenate([
        np.random.normal(0, 1, 980),
        np.random.normal(0, 5, 20)  # 2% anomalies with larger variance
    ])

    print(f"\nNo anomalies: excess kurtosis = {stats.kurtosis(normal_data):.4f}")
    print(f"With 2% anomalies: excess kurtosis = {stats.kurtosis(anomaly_data):.4f}")
    print("\n  Elevated kurtosis can signal contamination or outliers")


kurtosis_examples()
tail_behavior_demo()
kurtosis_in_ml()
```

The Moment Generating Function (MGF) is a powerful tool that encodes all moments of a distribution in a single function.
Definition (Moment Generating Function):
$$M_X(t) = E[e^{tX}]$$
For discrete $X$: $M_X(t) = \sum_x e^{tx} \cdot p(x)$
For continuous $X$: $M_X(t) = \int_{-\infty}^{\infty} e^{tx} \cdot f(x) \, dx$
The MGF exists if the expectation is finite for $t$ in some neighborhood of 0.
Why 'Moment Generating'?:
The $k$-th moment is the $k$-th derivative of $M_X(t)$ evaluated at $t = 0$: $$E[X^k] = M_X^{(k)}(0) = \left. \frac{d^k}{dt^k} M_X(t) \right|_{t=0}$$
Proof Intuition: Expanding $e^{tX} = \sum_{k=0}^{\infty} \frac{(tX)^k}{k!}$: $$M_X(t) = E\left[\sum_{k=0}^{\infty} \frac{(tX)^k}{k!}\right] = \sum_{k=0}^{\infty} \frac{t^k}{k!} E[X^k]$$
The coefficient of $t^k/k!$ is $E[X^k]$, recoverable by differentiation.
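To complement the numerical differentiation in the code below, here is a small symbolic sketch (an added illustration; it uses SymPy, which the original lesson code does not require) that differentiates the Normal MGF and reads off the first two moments.

```python
import sympy as sp

t, mu = sp.symbols('t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# MGF of Normal(mu, sigma^2)
M = sp.exp(mu * t + sigma**2 * t**2 / 2)

# k-th raw moment = k-th derivative of the MGF evaluated at t = 0
EX = sp.diff(M, t, 1).subs(t, 0)    # E[X]
EX2 = sp.diff(M, t, 2).subs(t, 0)   # E[X^2]

print(sp.simplify(EX))              # mu
print(sp.simplify(EX2))             # mu**2 + sigma**2
print(sp.simplify(EX2 - EX**2))     # sigma**2, i.e. the variance
```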
Key Properties:
Uniqueness: if the MGF exists on an open interval around 0, it uniquely determines the distribution.
Moments by differentiation: $E[X^k] = M_X^{(k)}(0)$, as derived above.
Independent sums: $M_{X+Y}(t) = M_X(t) \, M_Y(t)$ when $X$ and $Y$ are independent.
Linear transformations: $M_{aX+b}(t) = e^{bt} M_X(at)$.
```python
import numpy as np
from scipy import stats


def numerical_derivative(f, x0, n=1, dx=1e-4):
    """Central-difference approximation of the n-th derivative at x0 (n = 1 or 2).

    (scipy.misc.derivative has been removed from recent SciPy releases,
    so we use a small finite-difference helper instead.)
    """
    if n == 1:
        return (f(x0 + dx) - f(x0 - dx)) / (2 * dx)
    if n == 2:
        return (f(x0 + dx) - 2 * f(x0) + f(x0 - dx)) / dx**2
    raise ValueError("Only n = 1 or n = 2 supported")


def mgf_examples():
    """Compute MGFs and derive moments."""
    print("Moment Generating Functions")
    print("=" * 60)

    # Normal MGF: M(t) = exp(μt + σ²t²/2)
    print("\nNormal(μ=2, σ²=1):")
    print("  M(t) = exp(μt + σ²t²/2) = exp(2t + t²/2)")

    def normal_mgf(t, mu=2, sigma=1):
        return np.exp(mu * t + (sigma ** 2) * t ** 2 / 2)

    # Derive E[X] = M'(0) and E[X²] = M''(0)
    first_moment = numerical_derivative(normal_mgf, 0, n=1)
    second_raw = numerical_derivative(normal_mgf, 0, n=2)

    print(f"  M'(0)  = E[X]  = {first_moment:.6f} (theory: 2)")
    print(f"  M''(0) = E[X²] = {second_raw:.6f} (theory: μ² + σ² = 5)")

    # Exponential MGF: M(t) = λ / (λ - t) for t < λ
    print("\nExponential(λ=2):")
    print("  M(t) = λ/(λ-t) = 2/(2-t) for t < 2")

    lam = 2

    def exp_mgf(t):
        return lam / (lam - t)

    first_exp = numerical_derivative(exp_mgf, 0, n=1)
    second_exp = numerical_derivative(exp_mgf, 0, n=2)

    print(f"  M'(0)  = E[X]  = {first_exp:.6f} (theory: 1/λ = 0.5)")
    print(f"  M''(0) = E[X²] = {second_exp:.6f} (theory: 2/λ² = 0.5)")
    print(f"  Var(X) = E[X²] - E[X]² = {second_exp - first_exp**2:.6f} (theory: 1/λ² = 0.25)")


def mgf_for_sums():
    """Show how MGFs simplify analysis of sums of independent RVs."""
    print("\n\nMGFs for Sums of Independent Random Variables")
    print("=" * 60)

    print("\nKey property: M_{X+Y}(t) = M_X(t) × M_Y(t) when X ⊥ Y")

    # Example: Sum of independent Normals is Normal
    print("\nExample 1: X ~ N(μ₁, σ₁²), Y ~ N(μ₂, σ₂²) independent")
    print("  M_X(t) = exp(μ₁t + σ₁²t²/2)")
    print("  M_Y(t) = exp(μ₂t + σ₂²t²/2)")
    print("  M_{X+Y}(t) = exp((μ₁+μ₂)t + (σ₁²+σ₂²)t²/2)")
    print("  → X + Y ~ N(μ₁ + μ₂, σ₁² + σ₂²)")

    # Verify numerically
    np.random.seed(42)
    n = 100000
    X = np.random.normal(1, 2, n)  # N(1, 4)
    Y = np.random.normal(3, 1, n)  # N(3, 1)
    Z = X + Y

    print(f"\n  Empirical: Z has mean {np.mean(Z):.4f}, var {np.var(Z):.4f}")
    print(f"  Theory: Z ~ N({1+3}, {4+1}) → mean 4, var 5")

    # Example: Sum of independent Poissons
    print("\nExample 2: X ~ Poisson(λ₁), Y ~ Poisson(λ₂) independent")
    print("  M_X(t) = exp(λ₁(e^t - 1))")
    print("  M_Y(t) = exp(λ₂(e^t - 1))")
    print("  M_{X+Y}(t) = exp((λ₁ + λ₂)(e^t - 1))")
    print("  → X + Y ~ Poisson(λ₁ + λ₂)")

    X_pois = np.random.poisson(3, n)
    Y_pois = np.random.poisson(5, n)
    Z_pois = X_pois + Y_pois

    print(f"\n  Empirical: Z has mean {np.mean(Z_pois):.4f}")
    print("  Theory: Poisson(3 + 5 = 8) has mean 8")


def characteristic_functions():
    """Briefly introduce characteristic functions as a generalization."""
    print("\n\nCharacteristic Functions (Brief Note)")
    print("=" * 60)

    print("\nThe MGF doesn't always exist (e.g., Cauchy distribution).")
    print("The Characteristic Function (CF) always exists:")
    print("  φ_X(t) = E[e^{itX}] where i = √(-1)")
    print("\nProperties:")
    print("  - Always exists for any distribution")
    print("  - Still uniquely determines the distribution")
    print("  - φ_X(t) = M_X(it) when MGF exists")
    print("  - Used in proofs of Central Limit Theorem")
    print("\nFor practical ML work, MGFs are usually sufficient")
    print("since we often assume light-tailed distributions.")


mgf_examples()
mgf_for_sums()
characteristic_functions()
```

In applications, we estimate moments from finite samples. Understanding the behavior of sample moments is essential for reliable inference.
```python
import numpy as np
from scipy import stats


def sample_moment_behavior():
    """Demonstrate behavior of sample moments."""
    print("Sample Moment Estimation")
    print("=" * 60)

    # Show bias in variance estimation
    print("\n1. Bias in Variance Estimation")
    print("-" * 40)

    np.random.seed(42)
    true_var = 4
    n_trials = 10000
    n_samples = 10

    biased_vars = []
    unbiased_vars = []

    for _ in range(n_trials):
        samples = np.random.normal(0, 2, n_samples)
        biased_vars.append(np.var(samples))             # n divisor
        unbiased_vars.append(np.var(samples, ddof=1))   # n - 1 divisor

    print(f"True variance: {true_var}")
    print(f"Mean of biased estimates (n divisor): {np.mean(biased_vars):.4f}")
    print(f"Mean of unbiased estimates (n-1 divisor): {np.mean(unbiased_vars):.4f}")
    print(f"Bias in biased estimator: {np.mean(biased_vars) - true_var:.4f}")


def outlier_sensitivity():
    """Show how higher moments are sensitive to outliers."""
    print("\n\n2. Outlier Sensitivity of Higher Moments")
    print("=" * 60)

    np.random.seed(42)

    # Clean data from N(0, 1)
    clean_data = np.random.normal(0, 1, 100)

    # Contaminated: one outlier
    contaminated = np.append(clean_data, [10])  # Add single outlier

    print("\nClean data (100 samples from N(0,1)):")
    print(f"  Mean: {np.mean(clean_data):.4f}")
    print(f"  Std: {np.std(clean_data):.4f}")
    print(f"  Skewness: {stats.skew(clean_data):.4f}")
    print(f"  Kurtosis: {stats.kurtosis(clean_data):.4f}")

    print("\nWith single outlier (x = 10) added:")
    print(f"  Mean: {np.mean(contaminated):.4f} (shifted)")
    print(f"  Std: {np.std(contaminated):.4f} (inflated)")
    print(f"  Skewness: {stats.skew(contaminated):.4f} (heavily affected)")
    print(f"  Kurtosis: {stats.kurtosis(contaminated):.4f} (dramatically affected)")

    print("\n  One outlier in 101 samples completely distorts kurtosis!")


def sample_size_requirements():
    """Show sample size requirements for moment estimation."""
    print("\n\n3. Sample Size Requirements")
    print("=" * 60)

    np.random.seed(42)

    # True values for N(0, 1): skewness 0, excess kurtosis 0
    print("\nEstimating skewness and kurtosis of N(0, 1):")
    print("(True skewness = 0, True excess kurtosis = 0)")

    print(f"\n{'n':>8} {'Skew Est':>12} {'Skew SE':>10} {'Kurt Est':>12} {'Kurt SE':>10}")
    print("-" * 55)

    for n in [20, 50, 100, 500, 1000, 5000]:
        skew_estimates = []
        kurt_estimates = []
        for _ in range(1000):
            samples = np.random.normal(0, 1, n)
            skew_estimates.append(stats.skew(samples))
            kurt_estimates.append(stats.kurtosis(samples))

        print(f"{n:>8} {np.mean(skew_estimates):>12.4f} {np.std(skew_estimates):>10.4f} "
              f"{np.mean(kurt_estimates):>12.4f} {np.std(kurt_estimates):>10.4f}")

    print("\n  Standard error decreases slowly with n")
    print("  Need n ≥ 100 for reasonably precise kurtosis estimates")


def robust_alternatives():
    """Mention robust alternatives to moments."""
    print("\n\n4. Robust Alternatives")
    print("=" * 60)

    np.random.seed(42)

    # Data with outliers
    data = np.concatenate([
        np.random.normal(0, 1, 95),
        np.random.normal(10, 1, 5)  # 5% contamination
    ])

    print("\nData: 95% from N(0,1), 5% from N(10,1)")

    print("\nLocation measures:")
    print(f"  Mean: {np.mean(data):.4f}")
    print(f"  Median: {np.median(data):.4f}")
    print(f"  Trimmed mean (10%): {stats.trim_mean(data, 0.1):.4f}")

    print("\nScale measures:")
    print(f"  Standard deviation: {np.std(data):.4f}")
    print(f"  MAD (Median Absolute Deviation): {stats.median_abs_deviation(data):.4f}")
    print(f"  IQR: {stats.iqr(data):.4f}")

    print("\nSkewness alternatives:")
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    bowley_skew = (q3 + q1 - 2 * q2) / (q3 - q1)  # Bowley's skewness
    print(f"  Sample skewness: {stats.skew(data):.4f}")
    print(f"  Bowley's skewness (quartile-based): {bowley_skew:.4f}")

    print("\n  Robust measures are less affected by outliers")


sample_moment_behavior()
outlier_sensitivity()
sample_size_requirements()
robust_alternatives()
```

We've completed our tour of moments—from basic expectations through the full hierarchy of shape descriptors.
| Moment | Formula | Meaning | ML Relevance |
|---|---|---|---|
| Mean (1st) | E[X] | Center/location | Predictions, loss minimizers |
| Variance (2nd central) | E[(X-μ)²] | Spread | Uncertainty, regularization |
| Skewness (3rd standardized) | E[(X-μ)³]/σ³ | Asymmetry | Normality checks, transforms |
| Excess kurtosis (4th standardized) | E[(X-μ)⁴]/σ⁴ - 3 | Tail weight | Outlier detection, risk |
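To apply the whole table at once, here is a compact illustrative helper (an addition; the name `moment_summary` is ours) that reports all four shape descriptors for a 1-D sample, using SciPy's Fisher convention for kurtosis (normal → 0).

```python
import numpy as np
from scipy import stats

def moment_summary(x):
    """Mean, variance, skewness, and excess kurtosis of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": float(np.mean(x)),
        "variance": float(np.var(x, ddof=1)),          # unbiased sample variance
        "skewness": float(stats.skew(x)),
        "excess_kurtosis": float(stats.kurtosis(x)),   # Fisher: normal → 0
    }

# Example: right-skewed, heavy-tailed synthetic data
rng = np.random.default_rng(0)
print(moment_summary(rng.lognormal(mean=0.0, sigma=1.0, size=10_000)))
```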
Module Complete:
You have now mastered the foundational material on random variables and distributions: how random variables formalize uncertainty, how their distributions are described, and how expectation, variance, and the higher moments covered here summarize location, spread, and shape.
This forms the probabilistic language for all of machine learning. The next module will explore specific probability distributions in depth—the Bernoulli, Binomial, Gaussian, and other workhorses of ML modeling.
Congratulations! You have completed Module 2: Random Variables and Distributions. You now possess the mathematical vocabulary to describe uncertainty, compute risks, and reason about the probabilistic foundations of machine learning algorithms.