While PMFs, PDFs, and CDFs completely characterize a distribution, they're often more detail than we need. Much of probability theory and machine learning relies on summary statistics—single numbers that capture key aspects of a distribution.
The two most important summaries are the expectation, which measures a distribution's center, and the variance, which measures its spread around that center.
These concepts are far from abstract. The expectation underlies virtually every loss function in ML—mean squared error, cross-entropy loss, and expected risk are all expectations. Variance underlies model uncertainty, the bias-variance tradeoff, and regularization theory. Mastering these concepts is mastering the mathematical language of machine learning.
By the end of this page, you will understand expectation and variance formally, know their key properties, compute them for discrete and continuous distributions, understand their role in ML loss functions and model analysis, and appreciate the computational perspective essential for practical applications.
The expectation (or expected value or mean) of a random variable is its long-run average value—the center of mass of its probability distribution.
Definition (Expectation):
For a discrete random variable $X$ with PMF $p$: $$E[X] = \sum_{x \in \mathcal{X}} x \cdot p(x)$$
For a continuous random variable $X$ with PDF $f$: $$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$
The expectation is often denoted $\mu$, $\mu_X$, or $\bar{X}$ (the last typically for sample means).
Intuition: Weighted Average
The expectation is a weighted average where each value is weighted by its probability. Values that are more likely contribute more to the mean.
| Scenario | Weighted Average | Probability Weight |
|---|---|---|
| Die roll | (1+2+3+4+5+6)/6 = 3.5 | Each face has prob 1/6 |
| Loaded die (6 favored) | Higher than 3.5 | 6 has higher weight |
| Binary outcome {0, 1} | P(X = 1) | Mean equals probability of 1 |
Physical Interpretation:
If you cut out the PDF curve from a piece of cardboard, the expectation is the $x$-coordinate of the balance point (center of mass). The distribution 'balances' on its mean.
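As a quick numerical check of this picture, the short sketch below (assuming SciPy is available; the Gamma distribution is an arbitrary choice to make the PDF skewed) computes the balance point $\int x \cdot f(x)\,dx$ and confirms it equals the distribution's mean, which for a skewed PDF is not the same as the peak.

```python
import numpy as np
from scipy import stats, integrate

# A skewed PDF, so the balance point is not at the peak (arbitrary illustrative choice)
dist = stats.gamma(a=2, scale=1.5)  # mean = a * scale = 3.0, mode = (a - 1) * scale = 1.5

# Balance point (center of mass) of the PDF: ∫ x · f(x) dx
balance_point, _ = integrate.quad(lambda x: x * dist.pdf(x), 0, np.inf)

print(f"Balance point ∫x·f(x)dx : {balance_point:.4f}")
print(f"Distribution mean E[X]  : {dist.mean():.4f}")
print(f"Mode (peak of the PDF)  : {1.5:.4f}  # mean ≠ mode for a skewed PDF")
```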
```python
import numpy as np
from scipy import stats, integrate

def compute_expectations():
    """
    Compute expectations for various distributions.
    """
    print("Computing Expectations")
    print("=" * 60)

    # Discrete: Fair die
    print("\nDiscrete: Fair 6-sided die")
    support = np.arange(1, 7)
    pmf = np.ones(6) / 6
    expected = np.sum(support * pmf)
    print(f"  E[X] = Σ x·p(x) = (1+2+3+4+5+6)/6 = {expected:.4f}")

    # Discrete: Loaded die (6 has prob 0.5, others equal)
    print("\nDiscrete: Loaded die (P(6)=0.5)")
    loaded_pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
    expected_loaded = np.sum(support * loaded_pmf)
    print(f"  E[X] = {expected_loaded:.4f} (higher due to loaded 6)")

    # Discrete: Binomial
    print("\nDiscrete: Binomial(n=10, p=0.3)")
    n, p = 10, 0.3
    binomial = stats.binom(n, p)
    print(f"  E[X] = n·p = {n}·{p} = {n*p:.4f}")
    print(f"  Computed: {binomial.mean():.4f}")

    # Continuous: Normal
    print("\nContinuous: Normal(μ=5, σ²=4)")
    mu, sigma = 5, 2
    normal = stats.norm(mu, sigma)
    print(f"  E[X] = μ = {mu}")
    print(f"  Computed via integration: {normal.mean():.4f}")

    # Continuous: Exponential
    print("\nContinuous: Exponential(λ=2)")
    lam = 2
    exponential = stats.expon(scale=1/lam)
    print(f"  E[X] = 1/λ = {1/lam:.4f}")
    print(f"  Computed: {exponential.mean():.4f}")

    # Verify by numerical integration
    integral, _ = integrate.quad(lambda x: x * exponential.pdf(x), 0, np.inf)
    print(f"  Via ∫x·f(x)dx: {integral:.4f}")

def law_of_large_numbers_demo():
    """
    Demonstrate the Law of Large Numbers.
    """
    print("\n\nLaw of Large Numbers")
    print("=" * 60)
    print("\nThe sample mean converges to E[X] as n → ∞")
    print("  X̄_n = (1/n) Σ Xᵢ → E[X]")

    # True expectation for Exp(λ=2)
    true_mean = 0.5
    np.random.seed(42)
    print(f"\nExponential(λ=2), true E[X] = {true_mean}")
    print(f"\n{'n':>10} {'Sample Mean':>15} {'|Error|':>12}")
    print("-" * 40)

    for n in [10, 100, 1000, 10000, 100000]:
        samples = np.random.exponential(scale=true_mean, size=n)
        sample_mean = np.mean(samples)
        error = abs(sample_mean - true_mean)
        print(f"{n:>10} {sample_mean:>15.6f} {error:>12.6f}")

    print("\nThe sample mean concentrates around E[X] = 0.5")

compute_expectations()
law_of_large_numbers_demo()
```

Some distributions have undefined or infinite expectations. The Cauchy distribution (fat tails) has no mean—its integral diverges. In practice, check that your modeling assumptions imply finite expectations, especially for heavy-tailed phenomena like financial returns or claim sizes.
Often we need the expected value not of $X$ itself, but of some function $g(X)$. The Law of the Unconscious Statistician (LOTUS) tells us how to compute this without first finding the distribution of $g(X)$.
LOTUS (Law of the Unconscious Statistician):
For discrete $X$: $$E[g(X)] = \sum_{x \in \mathcal{X}} g(x) \cdot p(x)$$
For continuous $X$: $$E[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f(x) \, dx$$
We simply weight $g(x)$ by the probability of $x$—no need to derive the distribution of $g(X)$ first.
Why 'Unconscious Statistician'?
The name is mildly pejorative—early statisticians would apply this formula 'unconsciously' without proving it. But it is correct! The proof follows from the definition of expectation for the random variable $Y = g(X)$.
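For the discrete case, the argument is one line: group the outcomes of $Y = g(X)$ by their preimages,
$$E[g(X)] = \sum_{y} y \, P(g(X) = y) = \sum_{y} y \sum_{x:\, g(x) = y} p(x) = \sum_{x \in \mathcal{X}} g(x) \, p(x).$$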
Common Applications:
| Function $g(X)$ | Meaning | ML Use |
|---|---|---|
| $X^2$ | Second moment | Variance computation |
| $(X - \mu)^2$ | Squared deviation | Variance, MSE |
| $\lvert X - y\rvert^2$ | Squared error | Expected loss |
| $\log p(X)$ | Log-probability | Entropy, KL divergence |
| $\mathbf{1}_{X \in A}$ | Indicator | Probability computation |
```python
import numpy as np
from scipy import stats, integrate

def lotus_demonstrations():
    """
    Demonstrate LOTUS for computing E[g(X)].
    """
    print("Law of the Unconscious Statistician (LOTUS)")
    print("=" * 60)

    # Example 1: E[X²] for a fair die
    print("\nExample 1: E[X²] for fair die")
    support = np.arange(1, 7)
    pmf = np.ones(6) / 6

    # LOTUS: E[X²] = Σ x² · p(x)
    e_x_squared = np.sum(support**2 * pmf)
    print(f"  E[X²] = Σ x²·p(x) = (1+4+9+16+25+36)/6 = {e_x_squared:.4f}")

    # Compare to (E[X])² to see they differ
    e_x = np.sum(support * pmf)
    print(f"  E[X] = {e_x:.4f}")
    print(f"  (E[X])² = {e_x**2:.4f}")
    print(f"  Note: E[X²] ≠ (E[X])², difference = {e_x_squared - e_x**2:.4f}")

    # Example 2: E[e^X] for Normal (moment generating function idea)
    print("\nExample 2: E[exp(X)] for X ~ N(0, 1)")
    normal = stats.norm(0, 1)

    # LOTUS via integration
    e_exp_x, _ = integrate.quad(lambda x: np.exp(x) * normal.pdf(x), -20, 20)

    # Analytical: For N(μ,σ²), E[exp(X)] = exp(μ + σ²/2)
    analytical = np.exp(0 + 1/2)
    print(f"  Via LOTUS (numerical): {e_exp_x:.6f}")
    print(f"  Analytical formula: {analytical:.6f}")

    # Example 3: E[max(X, 0)] - the ReLU-like function
    print("\nExample 3: E[max(X, 0)] for X ~ N(0, 1)")

    # LOTUS: E[max(X, 0)] = ∫₀^∞ x · f(x) dx
    relu_mean, _ = integrate.quad(lambda x: max(x, 0) * normal.pdf(x), -10, 10)

    # Analytical: For standard normal, E[max(X,0)] = 1/√(2π) ≈ 0.3989
    analytical_relu = 1 / np.sqrt(2 * np.pi)
    print(f"  Via LOTUS (numerical): {relu_mean:.6f}")
    print(f"  Analytical: {analytical_relu:.6f}")
    print("  This is relevant for ReLU activation analysis!")

def indicator_function_expectations():
    """
    Show how E[1{X ∈ A}] = P(X ∈ A).
    """
    print("\n\nExpectation of Indicator Functions")
    print("=" * 60)
    print("\nKey result: E[1{X ∈ A}] = P(X ∈ A)")
    print("Because: 1{X ∈ A} is 1 with prob P(X ∈ A), 0 otherwise")
    print("  E[1{X ∈ A}] = 1·P(A) + 0·P(Aᶜ) = P(A)")

    # Demonstration
    normal = stats.norm(0, 1)
    A = (-1, 1)  # Event: X ∈ [-1, 1]

    # Using CDF
    prob_A_cdf = normal.cdf(1) - normal.cdf(-1)

    # Using LOTUS with indicator
    indicator = lambda x: 1.0 if A[0] <= x <= A[1] else 0.0
    prob_A_lotus, _ = integrate.quad(lambda x: indicator(x) * normal.pdf(x), -10, 10)

    print(f"\n  P(X ∈ [-1, 1]) for N(0,1):")
    print(f"    Via CDF: {prob_A_cdf:.6f}")
    print(f"    Via E[indicator]: {prob_A_lotus:.6f}")
    print("\n  Both methods give the same answer!")

def ml_loss_functions_as_expectations():
    """
    Cast ML loss functions as expectations.
    """
    print("\n\nML Loss Functions as Expectations")
    print("=" * 60)
    print("\nMSE, MAE, Cross-Entropy are all expectations!")

    # Setup: predict μ̂ for Y ~ N(μ, σ²)
    true_mu = 3.0
    sigma = 1.0
    mu_hat = 2.5  # Our prediction

    print(f"\nScenario: True Y ~ N({true_mu}, {sigma}²), predict μ̂ = {mu_hat}")

    # Mean Squared Error as expectation
    # MSE = E[(Y - μ̂)²]
    normal = stats.norm(true_mu, sigma)
    mse_analytical = sigma**2 + (true_mu - mu_hat)**2  # Variance + Bias²
    mse_numerical, _ = integrate.quad(
        lambda y: (y - mu_hat)**2 * normal.pdf(y), -10, 10
    )
    print(f"\n  MSE = E[(Y - μ̂)²]")
    print(f"      = Var(Y) + (E[Y] - μ̂)²")
    print(f"      = {sigma**2} + ({true_mu} - {mu_hat})²")
    print(f"      = {mse_analytical:.4f}")
    print(f"  (Numerical: {mse_numerical:.4f})")

    # Mean Absolute Error
    mae_numerical, _ = integrate.quad(
        lambda y: np.abs(y - mu_hat) * normal.pdf(y), -10, 10
    )
    print(f"\n  MAE = E[|Y - μ̂|] = {mae_numerical:.4f}")

    print("\n  Key insight: Loss functions are expectations over data distribution")
    print("  Minimizing expected loss = Minimizing E[L(Y, μ̂)]")

lotus_demonstrations()
indicator_function_expectations()
ml_loss_functions_as_expectations()
```

Expectation has powerful algebraic properties that make it tractable for analysis. These properties are used constantly in ML theory.
Fundamental Properties:
1. Linearity (The most important property): $$E[aX + bY] = aE[X] + bE[Y]$$
This holds for any random variables $X$ and $Y$, regardless of their dependence structure. Linearity extends to finite sums: $$E\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i E[X_i]$$
2. Monotonicity: If $X \leq Y$ with probability 1, then $E[X] \leq E[Y]$.
3. Constant Expectation: $E[c] = c$ for any constant $c$.
4. Non-negative Expectation: If $X \geq 0$ with probability 1, then $E[X] \geq 0$.
Linearity of expectation is simpler and more powerful than it first appears. It holds even when X and Y are dependent—their covariance doesn't affect the sum of expectations. Many seemingly difficult counting/probability problems become trivial once you express them as E[Σ indicators] = Σ P(event_i).
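Why dependence is irrelevant is visible in a one-line derivation (discrete case, joint PMF $p(x, y)$): the joint distribution enters only through its marginals,
$$E[aX + bY] = \sum_{x, y} (a x + b y)\, p(x, y) = a \sum_{x} x \sum_{y} p(x, y) + b \sum_{y} y \sum_{x} p(x, y) = a\,E[X] + b\,E[Y].$$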
```python
import numpy as np
from scipy import stats

def linearity_demonstration():
    """
    Demonstrate linearity of expectation.
    """
    print("Linearity of Expectation")
    print("=" * 60)

    # Linearity for independent RVs
    print("\nProperty: E[aX + bY] = a·E[X] + b·E[Y]")
    print("(This works even when X and Y are DEPENDENT!)")

    np.random.seed(42)
    n = 100000

    # Independent example
    X = np.random.normal(2, 1, n)    # E[X] = 2
    Y = np.random.exponential(3, n)  # E[Y] = 3
    a, b = 2, -1
    Z = a * X + b * Y

    print(f"\nIndependent X ~ N(2,1), Y ~ Exp(1/3)")
    print(f"  a = {a}, b = {b}")
    print(f"  E[{a}X + ({b})Y] = {a}·E[X] + ({b})·E[Y] = {a}·2 + ({b})·3 = {a*2 + b*3}")
    print(f"  Sample mean of Z: {np.mean(Z):.4f}")

    # Dependent example
    print("\nDependent example: Y = X² (clearly dependent on X)")
    Y_dep = X ** 2
    Z_dep = a * X + b * Y_dep
    e_x = 2          # E[X]
    e_y_dep = 4 + 1  # E[X²] = Var(X) + E[X]² = 1 + 4 = 5
    print(f"  E[X] = {e_x}, E[X²] = E[Y] = {e_y_dep}")
    print(f"  E[{a}X + ({b})X²] = {a}·{e_x} + ({b})·{e_y_dep} = {a*e_x + b*e_y_dep}")
    print(f"  Sample mean of Z: {np.mean(Z_dep):.4f}")
    print("  Linearity still works despite dependence!")

def counting_with_indicators():
    """
    Show the power of linearity with indicator functions.
    """
    print("\n\nCounting via Linearity + Indicators")
    print("=" * 60)
    print("\nProblem: Expected number of fixed points in random permutation")
    print("(Fixed point: element that stays in its original position)")
    print("\nNaive approach: Sum over all n! permutations... intractable")
    print("Smart approach: X = Σᵢ 1{element i is fixed}")
    print("  E[X] = Σᵢ E[1{i fixed}] = Σᵢ P(i fixed)")
    print("  P(i fixed) = 1/n for each i")
    print("  E[X] = n × (1/n) = 1")

    # Verify empirically
    def count_fixed_points(n, trials=10000):
        count = 0
        for _ in range(trials):
            perm = np.random.permutation(n)
            fixed = np.sum(perm == np.arange(n))
            count += fixed
        return count / trials

    print("\nEmpirical verification:")
    for n in [5, 10, 50, 100]:
        empirical = count_fixed_points(n)
        print(f"  n = {n:3d}: E[fixed points] ≈ {empirical:.4f} (theory: 1.00)")

def expectation_bounds():
    """
    Demonstrate bounds on expectations.
    """
    print("\n\nExpectation Bounds")
    print("=" * 60)
    print("\n1. Jensen's Inequality:")
    print("   For convex g: E[g(X)] ≥ g(E[X])")
    print("   For concave g: E[g(X)] ≤ g(E[X])")

    # Example: g(x) = x² is convex
    # So E[X²] ≥ (E[X])²
    np.random.seed(42)
    X = np.random.exponential(2, 10000)  # E[X] = 2
    e_x = np.mean(X)
    e_x_sq = np.mean(X**2)
    print(f"\n  Example: X ~ Exp(1/2), g(x) = x² (convex)")
    print(f"    E[X²] = {e_x_sq:.4f} ≥ (E[X])² = {e_x**2:.4f}?")
    print(f"    Yes: {e_x_sq:.4f} ≥ {e_x**2:.4f} ✓")

    # Example: g(x) = log(x) is concave
    print(f"\n  Example: g(x) = log(x) (concave)")
    e_log_x = np.mean(np.log(X))
    log_e_x = np.log(e_x)
    print(f"    E[log(X)] = {e_log_x:.4f} ≤ log(E[X]) = {log_e_x:.4f}?")
    print(f"    Yes: {e_log_x:.4f} ≤ {log_e_x:.4f} ✓")

    print("\n  Jensen is crucial for understanding:")
    print("   - Why geometric mean ≤ arithmetic mean")
    print("   - Why KL divergence ≥ 0")
    print("   - Variational bounds in ML")

linearity_demonstration()
counting_with_indicators()
expectation_bounds()
```

While expectation captures the center of a distribution, variance captures its spread—how much values deviate from the mean.
Definition (Variance):
$$\text{Var}(X) = E[(X - E[X])^2] = E[(X - \mu)^2]$$
Variance is the expected squared deviation from the mean. The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$, which has the same units as $X$.
Computational Formula:
Using linearity of expectation: $$\text{Var}(X) = E[X^2] - (E[X])^2$$
This formula is often more convenient for calculation.
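The identity follows directly from linearity by expanding the square inside the expectation (writing $\mu = E[X]$):
$$\text{Var}(X) = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2.$$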
Why not E[|X - μ|] (mean absolute deviation)? Squaring has several advantages: (1) it's differentiable everywhere, (2) it connects to inner products and geometry, (3) it leads to closed-form solutions in optimization, (4) it connects naturally to Gaussian distributions. The absolute deviation gives the median, not mean, as the optimal point—a different kind of summary.
```python
import numpy as np
from scipy import stats

def compute_variances():
    """
    Compute variances for various distributions.
    """
    print("Computing Variances")
    print("=" * 60)

    # Fair die
    print("\nDiscrete: Fair 6-sided die")
    support = np.arange(1, 7)
    pmf = np.ones(6) / 6
    mu = np.sum(support * pmf)
    e_x_sq = np.sum(support**2 * pmf)
    var = e_x_sq - mu**2
    print(f"  E[X] = {mu:.4f}")
    print(f"  E[X²] = {e_x_sq:.4f}")
    print(f"  Var(X) = E[X²] - (E[X])² = {e_x_sq:.4f} - {mu**2:.4f} = {var:.4f}")
    print(f"  Std(X) = √Var = {np.sqrt(var):.4f}")

    # Binomial
    print("\nDiscrete: Binomial(n=10, p=0.3)")
    n, p = 10, 0.3
    binomial = stats.binom(n, p)
    print(f"  Var(X) = n·p·(1-p) = {n}·{p}·{1-p} = {n*p*(1-p):.4f}")
    print(f"  From scipy: {binomial.var():.4f}")

    # Normal
    print("\nContinuous: Normal(μ=0, σ²=4)")
    normal = stats.norm(0, 2)  # σ = 2
    print(f"  Var(X) = σ² = 4")
    print(f"  From scipy: {normal.var():.4f}")

    # Exponential
    print("\nContinuous: Exponential(λ=2)")
    lam = 2
    exponential = stats.expon(scale=1/lam)
    print(f"  Var(X) = 1/λ² = {1/lam**2:.4f}")
    print(f"  From scipy: {exponential.var():.4f}")

def variance_interpretation():
    """
    Interpret variance geometrically and statistically.
    """
    print("\n\nInterpreting Variance")
    print("=" * 60)

    print("\n1. Chebyshev's Inequality:")
    print("   P(|X - μ| ≥ k·σ) ≤ 1/k²")
    print("\n   This gives universal bounds on tail probabilities!")
    print("\n   For any distribution:")
    print(f"   {'k':>4} {'P(|X-μ| ≥ kσ) ≤':>20}")
    for k in [1, 2, 3, 4, 5]:
        print(f"   {k:>4} {1/k**2:>20.4f}")

    print("\n2. Comparison: Chebyshev vs Normal")
    print("   The Normal is 'better behaved' than Chebyshev's worst case")
    normal = stats.norm(0, 1)
    print(f"\n   {'k':>4} {'Chebyshev bound':>18} {'Normal actual':>15}")
    for k in [1, 2, 3]:
        cheb_bound = 1 / k**2
        normal_actual = 2 * (1 - normal.cdf(k))
        print(f"   {k:>4} {cheb_bound:>18.4f} {normal_actual:>15.6f}")
    print("\n   Normal tails decay much faster than Chebyshev allows!")

def variance_vs_mean():
    """
    Illustrate that same mean but different variance
    gives very different distributions.
    """
    print("\n\nSame Mean, Different Variance")
    print("=" * 60)
    print("\nAll have mean 0, but very different spreads:")

    distributions = [
        ("N(0, 0.1)", stats.norm(0, np.sqrt(0.1))),
        ("N(0, 1.0)", stats.norm(0, 1)),
        ("N(0, 10.)", stats.norm(0, np.sqrt(10))),
    ]

    print(f"\n{'Distribution':<15} {'Var':>8} {'P(|X|>1)':>12} {'P(|X|>3)':>12}")
    print("-" * 50)
    for name, dist in distributions:
        var = dist.var()
        p_outside_1 = 2 * (1 - dist.cdf(1))
        p_outside_3 = 2 * (1 - dist.cdf(3))
        print(f"{name:<15} {var:>8.2f} {p_outside_1:>12.4f} {p_outside_3:>12.4f}")

    print("\n  Higher variance → more probability in tails")
    print("  This affects: outliers, risk, uncertainty quantification")

compute_variances()
variance_interpretation()
variance_vs_mean()
```

Unlike expectation, variance is not linear. Its properties are more subtle and depend on the dependence structure between random variables.
Key Properties:
1. Scaling: $$\text{Var}(aX) = a^2 \text{Var}(X)$$
Constants square when pulled out of variance.
2. Shift Invariance: $$\text{Var}(X + b) = \text{Var}(X)$$
Adding a constant doesn't change spread.
3. Sum of Independent RVs: $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \quad \text{(if } X \perp Y \text{)}$$
4. General Sum (with Covariance): $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$
Covariance:
The covariance measures how two random variables vary together:
$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$$
Properties:
1. Symmetry: $\text{Cov}(X, Y) = \text{Cov}(Y, X)$.
2. Relation to variance: $\text{Cov}(X, X) = \text{Var}(X)$.
3. Bilinearity: $\text{Cov}(aX + b, cY + d) = ac\,\text{Cov}(X, Y)$.
4. Independence: If $X \perp Y$, then $\text{Cov}(X, Y) = 0$; the converse does not hold in general.
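As a minimal numerical sanity check of the identity $\text{Cov}(X, Y) = E[XY] - E[X]E[Y]$, the sketch below (assuming NumPy; the correlated pair $Y = 2X + \varepsilon$ is an arbitrary construction for illustration) compares the identity to `np.cov` and to the theoretical value $2\,\text{Var}(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A dependent pair: Y = 2X + noise (arbitrary illustrative construction)
X = rng.normal(loc=1.0, scale=2.0, size=n)          # Var(X) = 4
Y = 2 * X + rng.normal(loc=0.0, scale=1.0, size=n)  # Cov(X, Y) = 2·Var(X) = 8

# Covariance via the identity E[XY] - E[X]E[Y]
cov_identity = np.mean(X * Y) - np.mean(X) * np.mean(Y)
cov_numpy = np.cov(X, Y)[0, 1]

print(f"E[XY] - E[X]E[Y] : {cov_identity:.4f}")
print(f"np.cov           : {cov_numpy:.4f}")
print(f"Theory 2·Var(X)  : {2 * 4:.4f}")
```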
```python
import numpy as np
from scipy import stats

def variance_scaling_and_shift():
    """
    Demonstrate Var(aX + b) = a² Var(X).
    """
    print("Variance Properties: Scaling and Shifting")
    print("=" * 60)

    np.random.seed(42)
    X = np.random.normal(3, 2, 100000)  # E[X]=3, Var[X]=4

    print(f"\nOriginal: X ~ N(3, 4)")
    print(f"  Sample Var(X) = {np.var(X):.4f}")

    # Scaling
    a = 3
    Y = a * X
    print(f"\nScaling: Y = {a}X")
    print(f"  Var({a}X) = {a}² · Var(X) = {a**2} × 4 = {a**2 * 4}")
    print(f"  Sample Var(Y) = {np.var(Y):.4f}")

    # Shifting
    b = 100
    Z = X + b
    print(f"\nShifting: Z = X + {b}")
    print(f"  Var(X + {b}) = Var(X) = 4")
    print(f"  Sample Var(Z) = {np.var(Z):.4f}")

    # Both
    W = a * X + b
    print(f"\nBoth: W = {a}X + {b}")
    print(f"  Var({a}X + {b}) = {a}² · Var(X) = {a**2 * 4}")
    print(f"  Sample Var(W) = {np.var(W):.4f}")

def variance_of_sums():
    """
    Demonstrate Var(X+Y) with and without independence.
    """
    print("\n\nVariance of Sums: Independence Matters")
    print("=" * 60)

    np.random.seed(42)
    n = 100000

    # Independent case
    print("\n1. Independent X, Y")
    X = np.random.normal(0, 2, n)  # Var = 4
    Y = np.random.normal(0, 3, n)  # Var = 9
    S_indep = X + Y
    print(f"   Var(X) = 4, Var(Y) = 9")
    print(f"   Theory: Var(X+Y) = Var(X) + Var(Y) = 4 + 9 = 13")
    print(f"   Sample Var(X+Y) = {np.var(S_indep):.4f}")

    # Positive correlation
    print("\n2. Positively correlated: Y = X + ε")
    epsilon = np.random.normal(0, 1, n)  # Var = 1
    Y_pos = X + epsilon  # Var(Y) = Var(X) + Var(ε) = 4 + 1 = 5
    cov_pos = np.cov(X, Y_pos)[0, 1]
    S_pos = X + Y_pos
    print(f"   Var(X) = 4, Var(Y) = {np.var(Y_pos):.4f}")
    print(f"   Cov(X, Y) = {cov_pos:.4f}")
    print(f"   Theory: Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)")
    print(f"           = 4 + 5 + 2×{cov_pos:.2f} ≈ {4 + 5 + 2*cov_pos:.4f}")
    print(f"   Sample Var(X+Y) = {np.var(S_pos):.4f}")

    # Negative correlation
    print("\n3. Negatively correlated: Y = -X + ε")
    Y_neg = -X + epsilon  # Cov(X, Y) = -Var(X) = -4
    cov_neg = np.cov(X, Y_neg)[0, 1]
    S_neg = X + Y_neg
    print(f"   Var(X) = 4, Var(Y) = {np.var(Y_neg):.4f}")
    print(f"   Cov(X, Y) = {cov_neg:.4f}")
    print(f"   Theory: Var(X+Y) ≈ 4 + 5 + 2×({cov_neg:.2f}) ≈ {4 + 5 + 2*cov_neg:.4f}")
    print(f"   Sample Var(X+Y) = {np.var(S_neg):.4f}")

    print("\n   Key insight: Negative correlation reduces variance!")
    print("   This is the basis for hedging and portfolio diversification.")

def sample_mean_variance():
    """
    Derive the variance of the sample mean.
    """
    print("\n\nVariance of the Sample Mean")
    print("=" * 60)
    print("\nFor i.i.d. samples X₁, ..., Xₙ with Var(Xᵢ) = σ²:")
    print("  X̄ = (1/n) Σ Xᵢ")
    print("  Var(X̄) = Var((1/n) Σ Xᵢ)")
    print("         = (1/n²) Var(Σ Xᵢ)")
    print("         = (1/n²) × n × σ²   [independence]")
    print("         = σ²/n")
    print("\n⟹ Std(X̄) = σ/√n (standard error)")

    # Demonstrate
    np.random.seed(42)
    true_var = 9  # σ² = 9, σ = 3
    print(f"\nDemonstration: Xᵢ ~ N(0, {true_var})")
    print(f"{'n':>8} {'Theory Var(X̄)':>15} {'Sample Var(X̄)':>16}")
    print("-" * 42)

    for n in [10, 100, 1000]:
        # Compute many sample means
        sample_means = []
        for _ in range(10000):
            samples = np.random.normal(0, 3, n)
            sample_means.append(np.mean(samples))
        theory_var = true_var / n
        sample_var = np.var(sample_means)
        print(f"{n:>8} {theory_var:>15.6f} {sample_var:>16.6f}")

    print("\n  This is why averaging reduces uncertainty!")

variance_scaling_and_shift()
variance_of_sums()
sample_mean_variance()
```

Expectation and variance are not abstract concepts in ML—they are the mathematical fabric of learning algorithms.
```python
import numpy as np
from scipy import stats

def bias_variance_decomposition():
    """
    Demonstrate the bias-variance decomposition.
    """
    print("Bias-Variance Decomposition")
    print("=" * 60)
    print("\nFor squared error loss:")
    print("  E[(Ŷ - Y)²] = (Bias)² + Variance + σ²")
    print("\nwhere:")
    print("  Bias = E[Ŷ] - f(x) (systematic error)")
    print("  Variance = Var(Ŷ) (sensitivity to training data)")
    print("  σ² = Var(Y|X) (irreducible noise)")

    # Simulation: True function f(x) = sin(2x), noisy observations
    np.random.seed(42)

    def true_function(x):
        return np.sin(2 * x)

    noise_var = 0.1  # σ² = 0.1
    n_train = 20
    n_fits = 200     # Generate many training sets and fit polynomial models

    x_test = 0.5     # Test at this point
    y_true = true_function(x_test)

    print(f"\nSimulation: f(x) = sin(2x), x_test = {x_test}")
    print(f"  noise σ² = {noise_var}, n_train = {n_train}")

    for degree in [1, 3, 10]:
        predictions = []
        for _ in range(n_fits):
            # Generate training data
            x_train = np.random.uniform(0, 1, n_train)
            y_train = true_function(x_train) + np.random.normal(0, np.sqrt(noise_var), n_train)

            # Fit polynomial
            coeffs = np.polyfit(x_train, y_train, degree)
            y_pred = np.polyval(coeffs, x_test)
            predictions.append(y_pred)

        predictions = np.array(predictions)

        # Compute components
        bias = np.mean(predictions) - y_true
        variance = np.var(predictions)
        mse = np.mean((predictions - y_true) ** 2)

        print(f"\n  Degree {degree:2d} polynomial:")
        print(f"    Bias² = ({bias:.4f})² = {bias**2:.4f}")
        print(f"    Variance = {variance:.4f}")
        print(f"    Decomposition: {bias**2:.4f} + {variance:.4f} = {bias**2 + variance:.4f}")
        print(f"    Actual MSE = {mse:.4f}")

    print("\n  Low degree → high bias, low variance (underfitting)")
    print("  High degree → low bias, high variance (overfitting)")

def monte_carlo_expectation():
    """
    Demonstrate Monte Carlo estimation of expectations.
    """
    print("\n\nMonte Carlo Estimation of Expectations")
    print("=" * 60)
    print("\nAny expectation E[g(X)] can be approximated by sampling:")
    print("  E[g(X)] ≈ (1/n) Σᵢ g(Xᵢ) where Xᵢ ~ distribution of X")

    # Example: Compute E[exp(X)] where X ~ N(0, 1)
    # Analytical: E[exp(X)] = exp(μ + σ²/2) = exp(0.5) ≈ 1.6487
    print("\nExample: E[exp(X)] for X ~ N(0, 1)")
    analytical = np.exp(0.5)
    print(f"  Analytical result: {analytical:.6f}")

    np.random.seed(42)
    print(f"\n  {'n':>8} {'Monte Carlo':>15} {'|Error|':>12} {'Std Error':>12}")
    print("  " + "-" * 50)
    for n in [100, 1000, 10000, 100000]:
        samples = np.random.normal(0, 1, n)
        mc_estimate = np.mean(np.exp(samples))
        error = abs(mc_estimate - analytical)
        # Standard error of MC estimate
        std_error = np.std(np.exp(samples)) / np.sqrt(n)
        print(f"  {n:>8} {mc_estimate:>15.6f} {error:>12.6f} {std_error:>12.6f}")

    print("\n  Error decreases as O(1/√n) - the Monte Carlo rate")

def expectation_in_loss_functions():
    """
    Show how loss functions are expectations and why this matters.
    """
    print("\n\nLoss Functions as Expectations")
    print("=" * 60)
    print("\nRisk = E[L(Y, Ŷ)] = Expected Loss")
    print("We minimize risk, not individual losses.")

    # Illustration: Optimal prediction for squared vs absolute loss
    print("\nOptimal prediction under different losses:")

    # Simulate Y from a skewed distribution (log-normal)
    np.random.seed(42)
    n = 100000
    Y = np.random.lognormal(0, 1, n)

    print(f"\nY ~ LogNormal(0, 1)")
    print(f"  E[Y] (mean) = {np.mean(Y):.4f}")
    print(f"  Median = {np.median(Y):.4f}")

    # Squared loss: optimal is mean
    # Absolute loss: optimal is median
    predictions = np.linspace(0.5, 3.0, 100)
    min_mse = np.inf
    min_mae = np.inf
    opt_mse_pred = None
    opt_mae_pred = None
    for pred in predictions:
        mse = np.mean((Y - pred) ** 2)
        mae = np.mean(np.abs(Y - pred))
        if mse < min_mse:
            min_mse = mse
            opt_mse_pred = pred
        if mae < min_mae:
            min_mae = mae
            opt_mae_pred = pred

    print(f"\n  Optimal for MSE: Ŷ = {opt_mse_pred:.4f} (≈ mean)")
    print(f"  Optimal for MAE: Ŷ = {opt_mae_pred:.4f} (≈ median)")

    print("\n  Key insight: The loss function determines the optimal summary!")
    print("   - Squared loss → predict the mean")
    print("   - Absolute loss → predict the median")
    print("   - 0-1 loss (classification) → predict the mode")

bias_variance_decomposition()
monte_carlo_expectation()
expectation_in_loss_functions()
```

Expectation and variance are the primary tools for summarizing distributions and analyzing learning algorithms.
What's Next:
Expectation and variance are the first two moments of a distribution. The next page explores higher moments—skewness, kurtosis, and the full moment hierarchy—which capture asymmetry and tail behavior and, via moment generating functions, encode the distribution completely.
You now command the two most important summary statistics in probability and ML. Expectation and variance are not merely descriptive—they are the mathematical foundations of risk, optimization, and the bias-variance tradeoff that governs all of machine learning.