In the previous module, we established the axiomatic foundations of probability theory—sample spaces, events, and the fundamental rules governing probabilities. Now we take a critical conceptual leap: we learn to assign numerical values to random outcomes, transforming abstract events into mathematical objects we can compute with.
This transformation is the essence of random variables. Rather than reasoning about outcomes like 'heads' or 'a customer churns,' we assign numbers—0 or 1, revenue amounts, click counts—enabling the full power of mathematical analysis to be brought to bear on uncertain phenomena.
In machine learning, random variables are not merely theoretical constructs. They are the bridge between data and models. Every feature in your dataset, every prediction your model makes, every loss value computed during training—all of these are realizations of random variables. Understanding their properties is not optional; it is the foundation upon which all of statistical machine learning is built.
By the end of this page, you will understand the formal definition of discrete random variables, master their mathematical properties including probability mass functions and support, and see how these concepts underpin classification, counting models, and probabilistic inference in machine learning.
Before we can work with random variables mathematically, we need a precise definition. The intuitive notion—'a variable whose value depends on chance'—is insufficient for rigorous analysis. Let's build the formal machinery.
Definition (Random Variable):
Given a probability space $(\Omega, \mathcal{F}, P)$, a random variable $X$ is a function:
$$X: \Omega \rightarrow \mathbb{R}$$
that maps each outcome $\omega \in \Omega$ to a real number $X(\omega)$, subject to a measurability condition: for every Borel set $B \subseteq \mathbb{R}$, the preimage $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$ belongs to the σ-algebra $\mathcal{F}$.
This measurability condition ensures that we can assign probabilities to statements like '$X \leq 5$' or '$X \in [2, 7]$'. Without it, we couldn't meaningfully ask 'What is the probability that $X$ takes a value in some range?'
The measurability condition might seem like mathematical pedantry, but it's actually protective. It prevents us from constructing pathological 'random variables' for which probability questions are unanswerable. In practice, any function you'd naturally construct from data satisfies this condition, but the formal requirement is what makes probability theory mathematically coherent.
Unpacking the Definition:
Let's make this concrete. Consider rolling a fair six-sided die, and define $X$ to be the number showing, so that $X(\omega_i) = i$.
Now $X$ is a function from outcomes to numbers. We can ask questions such as $P(X = 3)$, $P(X \geq 4)$, or $P(X \text{ is even})$, each of which is really a question about the set of outcomes in the corresponding preimage.
The random variable $X$ lets us translate probability questions about outcomes into algebraic statements about numbers.
| Component | Symbol | Description | Example (Die Roll) |
|---|---|---|---|
| Sample Space | Ω | Set of all possible outcomes | {ω₁, ω₂, ω₃, ω₄, ω₅, ω₆} |
| Outcome | ω | A single element of Ω | ω₃ = 'die shows 3' |
| Random Variable | X | Function from Ω to ℝ | X(ωᵢ) = i |
| Realization | x | A specific value X takes | x = 3 |
| Event via X | {X = x} | Preimage {ω : X(ω) = x} | {ω₃} |
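To see this mapping in code, here is a minimal sketch (plain Python; the outcome labels and helper names are chosen purely for illustration) that treats $X$ literally as a function on $\Omega$ and computes event probabilities by summing over preimages:

```python
# Sample space for a fair die: outcomes are labels, not numbers
omega = ["shows 1", "shows 2", "shows 3", "shows 4", "shows 5", "shows 6"]
P = {w: 1/6 for w in omega}  # each outcome equally likely


def X(w: str) -> int:
    """The random variable: a function from outcomes to real numbers."""
    return int(w.split()[-1])


# P(X = 3): sum P over the preimage {w : X(w) = 3}
p_equals_3 = sum(P[w] for w in omega if X(w) == 3)

# P(X >= 4): sum P over the preimage {w : X(w) >= 4}
p_geq_4 = sum(P[w] for w in omega if X(w) >= 4)

print(f"P(X = 3)  = {p_equals_3:.4f}")  # 0.1667
print(f"P(X >= 4) = {p_geq_4:.4f}")     # 0.5000
```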
Random variables come in two fundamental flavors, distinguished by the nature of their possible values.
Definition (Discrete Random Variable):
A random variable $X$ is called discrete if its range (the set of values it can take) is countable—either finite or countably infinite.
Formally, $X$ is discrete if there exists a countable set $\mathcal{X} \subseteq \mathbb{R}$ such that: $$P(X \in \mathcal{X}) = 1$$
The set $\mathcal{X}$ is called the support of $X$.
Why 'Countable' Matters:
The distinction between countable and uncountable is mathematically profound. A set is countable if its elements can be put in one-to-one correspondence with the natural numbers—you can list them as a sequence, even if infinite.
For discrete random variables, the countability of the support allows us to sum probabilities over all possible values: $$\sum_{x \in \mathcal{X}} P(X = x) = 1$$
This sum is well-defined precisely because the support is countable. For continuous random variables (next page), we'll need integration instead of summation.
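To make the role of countability concrete, the short sketch below (plain Python; the Geometric($p$) PMF appears later in the distribution table, and the choice $p = 0.3$ is illustrative) shows partial sums of a PMF over a countably infinite support converging to 1:

```python
# Geometric(p) on support {1, 2, 3, ...}: P(X = k) = (1 - p)**(k - 1) * p
p = 0.3

for n_terms in (5, 20, 100):
    partial_sum = sum((1 - p) ** (k - 1) * p for k in range(1, n_terms + 1))
    print(f"sum of first {n_terms:3d} terms: {partial_sum:.8f}")

# As n_terms grows the partial sums approach 1 -- summation over a
# countable support is well-defined even when the support is infinite.
```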
In machine learning practice, discrete random variables typically model: (1) categorical outcomes like class labels, (2) counts like click frequencies or word occurrences, (3) ordinal quantities like ratings or rankings, and (4) any quantity that takes distinct, separated values rather than a continuum.
The Probability Mass Function (PMF) is the fundamental tool for describing how probability is distributed across the values of a discrete random variable. It answers the question: 'What is the probability that $X$ takes exactly this value?'
Definition (Probability Mass Function):
For a discrete random variable $X$ with support $\mathcal{X}$, the PMF is the function $p_X: \mathbb{R} \rightarrow [0, 1]$ defined by: $$p_X(x) = P(X = x)$$
We often write $p(x)$ when the random variable is clear from context.
Properties of a Valid PMF:
A function $p: \mathbb{R} \rightarrow [0, 1]$ is a valid PMF if and only if:
Non-negativity: $p(x) \geq 0$ for all $x \in \mathbb{R}$
Normalization: $\displaystyle\sum_{x \in \mathcal{X}} p(x) = 1$
Zero outside support: $p(x) = 0$ for all $x \notin \mathcal{X}$
These properties aren't arbitrary—they follow directly from the axioms of probability. Property 1 reflects that probabilities can't be negative. Property 2 reflects that the total probability over all outcomes must be 1 (something must happen). Property 3 reflects that impossible events have probability zero.
```python
import numpy as np


def create_pmf_example():
    """
    Example: PMF of a loaded die favoring 6.

    This demonstrates the key PMF properties:
    - Non-negativity: all probabilities ≥ 0
    - Normalization: probabilities sum to 1
    - Support: only values 1-6 have non-zero probability
    """
    # Support of the random variable
    support = np.array([1, 2, 3, 4, 5, 6])

    # PMF: loaded die with P(X=6) = 1/3, others equal
    # For normalization: 5p + 1/3 = 1, so p = 2/15
    pmf = np.array([2/15, 2/15, 2/15, 2/15, 2/15, 1/3])

    # Verify PMF properties
    print("PMF Properties Verification:")
    print(f"  All non-negative: {np.all(pmf >= 0)}")
    print(f"  Sum equals 1: {np.isclose(np.sum(pmf), 1.0)}")
    print(f"  Sum = {np.sum(pmf):.6f}")

    # Compute probabilities of events
    p_even = pmf[1] + pmf[3] + pmf[5]            # X ∈ {2, 4, 6}
    p_greater_than_3 = pmf[3] + pmf[4] + pmf[5]  # X ∈ {4, 5, 6}

    print("\nEvent Probabilities:")
    print(f"  P(X is even) = {p_even:.4f}")
    print(f"  P(X > 3) = {p_greater_than_3:.4f}")

    return support, pmf


# Execute example
support, pmf = create_pmf_example()
```

A common confusion: the PMF value p(x) is a probability, but the PMF itself is a function. We speak of 'the PMF of X' (the entire function) vs 'the probability that X equals x' (a single value). This distinction becomes crucial when comparing distributions or defining families of distributions with parameters.
The PMF gives us the probability of each individual value. But we often need probabilities of events—sets of values. The key insight is that for discrete random variables, event probabilities are computed by summing PMF values.
Fundamental Computation Rule:
For any event $A \subseteq \mathbb{R}$: $$P(X \in A) = \sum_{x \in A \cap \mathcal{X}} p_X(x)$$
We only sum over values in the support because $p_X(x) = 0$ outside the support.
Common Event Types:
| Event | Mathematical Form | Computation |
|---|---|---|
| Exactly $k$ | $P(X = k)$ | $p_X(k)$ |
| At most $k$ | $P(X \leq k)$ | $\sum_{x \leq k} p_X(x)$ |
| At least $k$ | $P(X \geq k)$ | $\sum_{x \geq k} p_X(x)$ |
| In range | $P(a \leq X \leq b)$ | $\sum_{a \leq x \leq b} p_X(x)$ |
| In a set | $P(X \in S)$ | $\sum_{x \in S} p_X(x)$ |
Complement Rule:
Often it's easier to compute $P(X \notin A) = 1 - P(X \in A)$. If you want $P(X \geq 5)$ and the support is $\{0, 1, 2, \ldots, 100\}$, computing $P(X \leq 4)$ and subtracting from 1 requires summing only five terms instead of ninety-six.
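As a worked sketch of the event types above and the complement rule (assuming SciPy is available; the parameters $n = 100$, $p = 0.03$ are chosen only for illustration), here is how these computations look for a Binomial random variable. A fuller, from-scratch implementation follows below.

```python
from scipy.stats import binom

n, p = 100, 0.03  # X ~ Binomial(n, p): number of successes in n trials

print(f"P(X = 2)       = {binom.pmf(2, n, p):.4f}")      # exactly k
print(f"P(X <= 4)      = {binom.cdf(4, n, p):.4f}")      # at most k
print(f"P(X >= 5)      = {1 - binom.cdf(4, n, p):.4f}")  # complement rule
print(f"P(2 <= X <= 6) = {binom.cdf(6, n, p) - binom.cdf(1, n, p):.4f}")  # range
```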
```python
import numpy as np
from typing import Callable


class DiscretePMF:
    """
    A class representing a discrete probability mass function.
    Demonstrates proper PMF operations and probability computations.
    """

    def __init__(self, support: np.ndarray, probabilities: np.ndarray):
        """
        Initialize a discrete PMF.

        Args:
            support: Array of possible values (the support)
            probabilities: Corresponding probabilities (must sum to 1)
        """
        # Validate inputs
        if len(support) != len(probabilities):
            raise ValueError("Support and probabilities must have same length")
        if not np.allclose(np.sum(probabilities), 1.0):
            raise ValueError(f"Probabilities must sum to 1, got {np.sum(probabilities)}")
        if np.any(probabilities < 0):
            raise ValueError("Probabilities must be non-negative")

        self.support = np.array(support)
        self.probs = np.array(probabilities)
        self._prob_dict = dict(zip(support, probabilities))

    def pmf(self, x: float) -> float:
        """Compute P(X = x)."""
        return self._prob_dict.get(x, 0.0)

    def prob_event(self, event_set: set) -> float:
        """Compute P(X ∈ event_set)."""
        return sum(self.pmf(x) for x in event_set)

    def prob_leq(self, k: float) -> float:
        """Compute P(X ≤ k) - the CDF at k."""
        return sum(p for x, p in self._prob_dict.items() if x <= k)

    def prob_geq(self, k: float) -> float:
        """Compute P(X ≥ k) using complement."""
        return 1.0 - self.prob_leq(k - 1e-10)  # Avoid boundary issues

    def prob_between(self, a: float, b: float) -> float:
        """Compute P(a ≤ X ≤ b)."""
        return sum(p for x, p in self._prob_dict.items() if a <= x <= b)

    def prob_condition(self, condition: Callable[[float], bool]) -> float:
        """Compute P(condition(X) is True)."""
        return sum(p for x, p in self._prob_dict.items() if condition(x))


def demonstrate_pmf_operations():
    """
    Demonstrates PMF operations with a practical ML example:
    discrete confidence scores from an ensemble classifier.
    """
    # Support: confidence scores {0.0, 0.1, 0.2, ..., 1.0}
    # (round to avoid floating-point keys like 0.7000000000000001,
    # which would make exact lookups such as pmf.pmf(0.7) fail)
    support = np.round(np.arange(0, 1.1, 0.1), 1)

    # Probabilities: hypothetical distribution from ensemble votes,
    # bell-shaped around 0.7 (model is fairly confident)
    raw_probs = np.array([0.01, 0.02, 0.03, 0.05, 0.08, 0.12,
                          0.18, 0.22, 0.16, 0.09, 0.04])
    probabilities = raw_probs / raw_probs.sum()  # Normalize

    pmf = DiscretePMF(support, probabilities)

    print("Discrete Confidence Score PMF")
    print("=" * 50)

    # Various probability computations
    print(f"P(confidence = 0.7) = {pmf.pmf(0.7):.4f}")
    print(f"P(confidence ≤ 0.5) = {pmf.prob_leq(0.5):.4f}")
    print(f"P(confidence ≥ 0.8) = {pmf.prob_geq(0.8):.4f}")
    print(f"P(0.6 ≤ confidence ≤ 0.8) = {pmf.prob_between(0.6, 0.8):.4f}")

    # Custom condition: "highly uncertain" = confidence in [0.4, 0.6]
    p_uncertain = pmf.prob_condition(lambda x: 0.4 <= x <= 0.6)
    print(f"P(highly uncertain) = {p_uncertain:.4f}")

    return pmf


demonstrate_pmf_operations()
```

The support of a random variable is a deceptively simple concept with significant implications.
Definition (Support):
The support of a discrete random variable $X$, denoted $\text{supp}(X)$ or $\mathcal{X}$, is the set of values $x$ for which $P(X = x) > 0$:
$$\text{supp}(X) = \{x \in \mathbb{R} : p_X(x) > 0\}$$
Equivalently, for a discrete random variable the support is the smallest set $\mathcal{X} \subseteq \mathbb{R}$ such that $P(X \in \mathcal{X}) = 1$.
Why Support Matters in ML:
Understanding support is critical for several reasons:
1. Avoiding Impossible Events: If $x \notin \text{supp}(X)$, then $P(X = x) = 0$ and $\log P(X = x) = -\infty$. In log-likelihood computations, this causes catastrophic failures:
```python
# Dangerous: if x = 7 but the support is {0, 1, 2, 3, 4, 5, 6},
# then pmf(7) == 0 and the log-likelihood is -inf, breaking optimization
log_likelihood = np.log(pmf(7))  # -inf
```
2. Model Mismatch Detection: If your training data contains values outside your model's support, the model fundamentally cannot explain that data. This indicates model misspecification.
3. Generalization Scope: A model's support defines what predictions are possible. A classifier over classes $\{0, 1, 2\}$ cannot predict class 3—this must be handled architecturally.
| Distribution | Parameters | Support | ML Use Case |
|---|---|---|---|
| Bernoulli | p ∈ [0,1] | {0, 1} | Binary classification outputs |
| Binomial | n ∈ ℕ, p ∈ [0,1] | {0, 1, ..., n} | Count of successes in n trials |
| Categorical | p₁, ..., pₖ | {1, 2, ..., K} | Multi-class classification |
| Poisson | λ > 0 | {0, 1, 2, ...} = ℕ₀ | Event counts, rare occurrences |
| Geometric | p ∈ (0,1] | {1, 2, 3, ...} = ℕ | Trials until first success |
| Negative Binomial | r ∈ ℕ, p ∈ (0,1] | {0, 1, 2, ...} = ℕ₀ | Overdispersed counts |
Distributions with unbounded support (like Poisson or Geometric) can technically take arbitrarily large values, but in practice, the probability of very large values becomes negligible. For computational purposes, we often truncate to a practical range while ensuring probabilities still sum to (approximately) 1.
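A minimal sketch of that truncation idea (assuming SciPy is available; the rate $\lambda = 3$ and cutoff of 20 are illustrative choices, not prescribed values):

```python
import numpy as np
from scipy.stats import poisson

lam = 3.0
cutoff = 20  # practical truncation point for an unbounded support

support = np.arange(0, cutoff + 1)
probs = poisson.pmf(support, lam)

# Nearly all probability mass lies below the cutoff
print(f"Mass captured up to {cutoff}: {probs.sum():.10f}")

# If exact normalization is required after truncation, renormalize
probs_renorm = probs / probs.sum()
print(f"Renormalized sum: {probs_renorm.sum():.10f}")
```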
In practice, we rarely work with random variables in isolation. We transform them, combine them, and derive new random variables from existing ones. Understanding how these operations affect the PMF is essential.
Functions of a Single Random Variable:
If $X$ is a discrete random variable and $g: \mathbb{R} \rightarrow \mathbb{R}$ is any function, then $Y = g(X)$ is also a random variable. The PMF of $Y$ is:
$$p_Y(y) = P(Y = y) = P(g(X) = y) = \sum_{x: g(x) = y} p_X(x)$$
We sum over all $x$ values that map to $y$ under $g$.
```python
import numpy as np
from collections import defaultdict


def transform_pmf(support_x: np.ndarray, pmf_x: np.ndarray, g: callable) -> tuple:
    """
    Compute the PMF of Y = g(X) given the PMF of X.

    This is a fundamental operation: many ML quantities are
    functions of random variables (e.g., loss, accuracy).

    Args:
        support_x: Support of X
        pmf_x: PMF values for X
        g: Transformation function

    Returns:
        (support_y, pmf_y): Support and PMF of Y = g(X)
    """
    # Accumulate probabilities for each y value
    y_probs = defaultdict(float)
    for x, p_x in zip(support_x, pmf_x):
        y = g(x)
        y_probs[y] += p_x

    # Sort by y values for clean output
    support_y = np.array(sorted(y_probs.keys()))
    pmf_y = np.array([y_probs[y] for y in support_y])

    return support_y, pmf_y


# Example: Die roll squared
print("Example: Y = X² where X is a fair die roll")
print("=" * 50)

support_x = np.array([1, 2, 3, 4, 5, 6])
pmf_x = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])

support_y, pmf_y = transform_pmf(support_x, pmf_x, lambda x: x**2)

print("X:    | " + " | ".join(f"{x:4d}" for x in support_x))
print("P:    | " + " | ".join(f"{p:.3f}" for p in pmf_x))
print()
print("Y=X²: | " + " | ".join(f"{y:4.0f}" for y in support_y))
print("P(Y): | " + " | ".join(f"{p:.3f}" for p in pmf_y))

# Example: Classification indicator (non-injective function)
print("\n\nExample: Binary threshold Y = 1{X ≥ 4}")
print("=" * 50)

support_y2, pmf_y2 = transform_pmf(
    support_x, pmf_x, lambda x: 1 if x >= 4 else 0
)

print("Y (threshold indicator):")
for y, p in zip(support_y2, pmf_y2):
    print(f"  P(Y = {int(y)}) = {p:.4f}")

# Verify: P(Y=1) should equal P(X>=4) = 3/6 = 0.5
print(f"\nVerification: P(X ≥ 4) = 3/6 = {3/6:.4f}")
```

Non-Injective Transformations:
When $g$ is not one-to-one (multiple $x$ values map to the same $y$), probabilities aggregate. This is common in ML: thresholding a score into a binary label, binning a continuous feature, and taking the argmax of a probability vector all collapse many inputs into a single output.
Each of these 'loses information' by mapping many inputs to fewer outputs, but the probability rules ensure the resulting PMF is valid.
Discrete random variables appear throughout machine learning, often in ways that aren't immediately obvious. Understanding their role clarifies why probabilistic modeling is so powerful.
```python
import numpy as np
from typing import Dict, List


def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert logits to probabilities (a valid PMF)."""
    exp_logits = np.exp(logits - np.max(logits))  # Numerical stability
    return exp_logits / np.sum(exp_logits)


class CategoricalClassifier:
    """
    Demonstrates how classification is fundamentally about
    estimating a discrete PMF over classes.
    """

    def __init__(self, class_names: List[str]):
        self.class_names = class_names
        self.K = len(class_names)
        self.support = np.arange(self.K)

    def predict_pmf(self, logits: np.ndarray) -> Dict[str, float]:
        """
        Convert model logits to a proper PMF.

        In a trained model, logits come from the final layer.
        Softmax transforms them into a valid probability distribution:
        - All values in [0, 1]
        - Sum equals 1
        """
        pmf = softmax(logits)

        # Verify PMF properties
        assert np.all(pmf >= 0), "PMF must be non-negative"
        assert np.isclose(np.sum(pmf), 1.0), "PMF must sum to 1"

        return {name: prob for name, prob in zip(self.class_names, pmf)}

    def predict_class(self, logits: np.ndarray) -> str:
        """Predict the most likely class (mode of the PMF)."""
        pmf = softmax(logits)
        return self.class_names[np.argmax(pmf)]

    def predict_top_k(self, logits: np.ndarray, k: int = 3) -> List[tuple]:
        """Return top-k predictions with probabilities."""
        pmf = softmax(logits)
        top_indices = np.argsort(pmf)[::-1][:k]
        return [(self.class_names[i], pmf[i]) for i in top_indices]


# Demonstration
classifier = CategoricalClassifier(["cat", "dog", "bird", "fish"])

# Simulated logits from a neural network
logits = np.array([2.1, 0.8, -0.3, -1.5])

print("Classification as PMF Estimation")
print("=" * 50)
print(f"Raw logits: {logits}")
print("\nPredicted PMF:")

pmf = classifier.predict_pmf(logits)
for class_name, prob in pmf.items():
    bar = "█" * int(prob * 40)
    print(f"  P(Y = {class_name:5s}) = {prob:.4f} {bar}")

print(f"\nPredicted class: {classifier.predict_class(logits)}")
print(f"Top-2 predictions: {classifier.predict_top_k(logits, k=2)}")

# Cross-entropy loss is negative log of true class probability
true_class = 0  # "cat"
cross_entropy = -np.log(softmax(logits)[true_class])
print(f"\nCross-entropy loss (true class 'cat'): {cross_entropy:.4f}")
```

The softmax function isn't arbitrary—it's precisely the function that transforms unconstrained real numbers (logits) into a valid PMF. It guarantees non-negativity and normalization, making the output interpretable as conditional class probabilities P(Y=k|X).
Discrete random variables have a natural connection to information theory, which provides powerful tools for analyzing and designing ML systems.
Entropy of a Discrete Random Variable:
The entropy $H(X)$ measures the 'uncertainty' or 'information content' of a discrete random variable:
$$H(X) = -\sum_{x \in \mathcal{X}} p_X(x) \log_2 p_X(x)$$
(Using $\log_2$ gives entropy in bits; $\ln$ gives nats.)
Intuition: Entropy is highest when all outcomes are equally likely (maximum uncertainty), and lowest (zero) when one outcome has probability 1 (no uncertainty).
Entropy Bounds:
For a discrete random variable with $|\mathcal{X}|$ possible values:
$$0 \leq H(X) \leq \log_2 |\mathcal{X}|$$
Why This Matters for ML:
Entropy appears throughout machine learning: cross-entropy loss compares a model's predicted PMF against the true labels, information gain (a reduction in entropy) drives decision-tree splits, and the entropy of a classifier's output PMF quantifies how uncertain a prediction is. The example below explores this last use.
```python
import numpy as np


def entropy(pmf: np.ndarray, base: float = 2) -> float:
    """
    Compute entropy of a discrete distribution.

    Uses the convention that 0 * log(0) = 0 (justified by
    the limit as p -> 0 of p * log(p) = 0).

    Args:
        pmf: Probability mass function (must sum to 1)
        base: Logarithm base (2 for bits, e for nats)

    Returns:
        Entropy value
    """
    # Filter out zero probabilities to avoid log(0)
    pmf = pmf[pmf > 0]
    return -np.sum(pmf * np.log(pmf)) / np.log(base)


def analyze_distribution_entropy():
    """Compare entropy across different distributions."""
    print("Entropy Analysis of Discrete Distributions")
    print("=" * 55)

    distributions = [
        ("Deterministic (certain)", np.array([1.0, 0.0, 0.0, 0.0])),
        ("Highly skewed", np.array([0.9, 0.05, 0.03, 0.02])),
        ("Moderately uncertain", np.array([0.5, 0.3, 0.15, 0.05])),
        ("Near uniform", np.array([0.28, 0.26, 0.24, 0.22])),
        ("Uniform (maximum entropy)", np.array([0.25, 0.25, 0.25, 0.25])),
    ]

    max_entropy = np.log2(4)  # For 4 outcomes

    for name, pmf in distributions:
        H = entropy(pmf)
        efficiency = H / max_entropy * 100
        print(f"\n{name}:")
        print(f"  PMF: {pmf}")
        print(f"  Entropy: {H:.4f} bits")
        print(f"  Efficiency: {efficiency:.1f}% of maximum ({max_entropy:.4f} bits)")

    # ML Application: Classifier confidence
    print("\n" + "=" * 55)
    print("ML Application: Classifier Entropy as Uncertainty")
    print("=" * 55)

    predictions = [
        ("Confident prediction", np.array([0.95, 0.03, 0.02])),
        ("Moderate confidence", np.array([0.6, 0.3, 0.1])),
        ("Uncertain prediction", np.array([0.4, 0.35, 0.25])),
    ]

    for name, pmf in predictions:
        H = entropy(pmf)
        print(f"\n{name}:")
        print(f"  Predicted PMF: {pmf}")
        print(f"  Entropy: {H:.4f} bits (higher = more uncertain)")


analyze_distribution_entropy()
```

When your classifier outputs high-entropy predictions frequently, it signals systemic uncertainty—perhaps the model needs more training data, features are insufficient, or the task is inherently ambiguous. Tracking prediction entropy over time can reveal model reliability issues before accuracy metrics show problems.
We've established the mathematical machinery of discrete random variables—the foundation for understanding how machine learning models reason about countable outcomes.
What's Next:
In the next page, we extend to continuous random variables—where outcomes form uncountable continua like ℝ or [0, 1]. The key conceptual shift: we can no longer assign nonzero probability to individual points. Instead, we use probability density functions (PDFs), and sums become integrals. This unlocks modeling of continuous quantities like regression targets, neural network weights, and latent representations.
You now have a rigorous understanding of discrete random variables and probability mass functions. This foundation extends naturally to continuous random variables in the next page, and together they form the complete language for describing probability distributions in machine learning.