In the previous module, we established the axiomatic foundations of probability theory—sample spaces, events, and the fundamental rules governing probabilities. Now we take a critical conceptual leap: we learn to assign numerical values to random outcomes, transforming abstract events into mathematical objects we can compute with.
This transformation is the essence of random variables. Rather than reasoning about outcomes like 'heads' or 'a customer churns,' we assign numbers—0 or 1, revenue amounts, click counts—enabling the full power of mathematical analysis to be brought to bear on uncertain phenomena.
In machine learning, random variables are not merely theoretical constructs. They are the bridge between data and models. Every feature in your dataset, every prediction your model makes, every loss value computed during training—all of these are realizations of random variables. Understanding their properties is not optional; it is the foundation upon which all of statistical machine learning is built.
By the end of this page, you will understand the formal definition of discrete random variables, master their mathematical properties including probability mass functions and support, and see how these concepts underpin classification, counting models, and probabilistic inference in machine learning.
Before we can work with random variables mathematically, we need a precise definition. The intuitive notion—'a variable whose value depends on chance'—is insufficient for rigorous analysis. Let's build the formal machinery.
Definition (Random Variable):
Given a probability space $(\Omega, \mathcal{F}, P)$, a random variable $X$ is a function:
$$X: \Omega \rightarrow \mathbb{R}$$
that maps each outcome $\omega \in \Omega$ to a real number $X(\omega)$, subject to a measurability condition: for every Borel set $B \subseteq \mathbb{R}$, the preimage $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$ belongs to the σ-algebra $\mathcal{F}$.
This measurability condition ensures that we can assign probabilities to statements like '$X \leq 5$' or '$X \in [2, 7]$'. Without it, we couldn't meaningfully ask 'What is the probability that $X$ takes a value in some range?'
The measurability condition might seem like mathematical pedantry, but it's actually protective. It prevents us from constructing pathological 'random variables' for which probability questions are unanswerable. In practice, any function you'd naturally construct from data satisfies this condition, but the formal requirement is what makes probability theory mathematically coherent.
Unpacking the Definition:
Let's make this concrete. Consider rolling a fair six-sided die, and define $X$ to be the number showing, so that $X(\omega_i) = i$.
Now $X$ is a function from outcomes to numbers. We can ask questions such as $P(X = 3)$, $P(X \geq 4)$, or $P(X \text{ is even})$, each of which is really a question about the set of outcomes in the corresponding preimage.
The random variable $X$ lets us translate probability questions about outcomes into algebraic statements about numbers.
| Component | Symbol | Description | Example (Die Roll) |
|---|---|---|---|
| Sample Space | Ω | Set of all possible outcomes | {ω₁, ω₂, ω₃, ω₄, ω₅, ω₆} |
| Outcome | ω | A single element of Ω | ω₃ = 'die shows 3' |
| Random Variable | X | Function from Ω to ℝ | X(ωᵢ) = i |
| Realization | x | A specific value X takes | x = 3 |
| Event via X | {X = x} | Preimage {ω : X(ω) = x} | {ω₃} |
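To see this mapping in code, here is a minimal sketch (plain Python; the outcome labels and helper names are chosen purely for illustration) that treats $X$ literally as a function on $\Omega$ and computes event probabilities by summing over preimages:

```python
# Sample space for a fair die: outcomes are labels, not numbers
omega = ["shows 1", "shows 2", "shows 3", "shows 4", "shows 5", "shows 6"]
P = {w: 1/6 for w in omega}  # each outcome equally likely


def X(w: str) -> int:
    """The random variable: a function from outcomes to real numbers."""
    return int(w.split()[-1])


# P(X = 3): sum P over the preimage {w : X(w) = 3}
p_equals_3 = sum(P[w] for w in omega if X(w) == 3)

# P(X >= 4): sum P over the preimage {w : X(w) >= 4}
p_geq_4 = sum(P[w] for w in omega if X(w) >= 4)

print(f"P(X = 3)  = {p_equals_3:.4f}")  # 0.1667
print(f"P(X >= 4) = {p_geq_4:.4f}")     # 0.5000
```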
Random variables come in two fundamental flavors, distinguished by the nature of their possible values.
Definition (Discrete Random Variable):
A random variable $X$ is called discrete if its range (the set of values it can take) is countable—either finite or countably infinite.
Formally, $X$ is discrete if there exists a countable set $\mathcal{X} \subseteq \mathbb{R}$ such that: $$P(X \in \mathcal{X}) = 1$$
The set $\mathcal{X}$ is called the support of $X$.
Why 'Countable' Matters:
The distinction between countable and uncountable is mathematically profound. A set is countable if its elements can be put in one-to-one correspondence with the natural numbers—you can list them as a sequence, even if infinite.
For discrete random variables, the countability of the support allows us to sum probabilities over all possible values: $$\sum_{x \in \mathcal{X}} P(X = x) = 1$$
This sum is well-defined precisely because the support is countable. For continuous random variables (next page), we'll need integration instead of summation.
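To make the role of countability concrete, the short sketch below (plain Python; the Geometric($p$) PMF appears later in the distribution table, and the choice $p = 0.3$ is illustrative) shows partial sums of a PMF over a countably infinite support converging to 1:

```python
# Geometric(p) on support {1, 2, 3, ...}: P(X = k) = (1 - p)**(k - 1) * p
p = 0.3

for n_terms in (5, 20, 100):
    partial_sum = sum((1 - p) ** (k - 1) * p for k in range(1, n_terms + 1))
    print(f"sum of first {n_terms:3d} terms: {partial_sum:.8f}")

# As n_terms grows the partial sums approach 1 -- summation over a
# countable support is well-defined even when the support is infinite.
```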
In machine learning practice, discrete random variables typically model: (1) categorical outcomes like class labels, (2) counts like click frequencies or word occurrences, (3) ordinal quantities like ratings or rankings, and (4) any quantity that takes distinct, separated values rather than a continuum.
The Probability Mass Function (PMF) is the fundamental tool for describing how probability is distributed across the values of a discrete random variable. It answers the question: 'What is the probability that $X$ takes exactly this value?'
Definition (Probability Mass Function):
For a discrete random variable $X$ with support $\mathcal{X}$, the PMF is the function $p_X: \mathbb{R} \rightarrow [0, 1]$ defined by: $$p_X(x) = P(X = x)$$
We often write $p(x)$ when the random variable is clear from context.
Properties of a Valid PMF:
A function $p: \mathbb{R} \rightarrow [0, 1]$ is a valid PMF if and only if:
Non-negativity: $p(x) \geq 0$ for all $x \in \mathbb{R}$
Normalization: $\displaystyle\sum_{x \in \mathcal{X}} p(x) = 1$
Zero outside support: $p(x) = 0$ for all $x \notin \mathcal{X}$
These properties aren't arbitrary—they follow directly from the axioms of probability. Property 1 reflects that probabilities can't be negative. Property 2 reflects that the total probability over all outcomes must be 1 (something must happen). Property 3 reflects that impossible events have probability zero.
```python
import numpy as np


def create_pmf_example():
    """
    Example: PMF of a loaded die favoring 6.

    This demonstrates the key PMF properties:
    - Non-negativity: all probabilities ≥ 0
    - Normalization: probabilities sum to 1
    - Support: only values 1-6 have non-zero probability
    """
    # Support of the random variable
    support = np.array([1, 2, 3, 4, 5, 6])

    # PMF: loaded die with P(X=6) = 1/3, others equal
    # For normalization: 5p + 1/3 = 1, so p = 2/15
    pmf = np.array([2/15, 2/15, 2/15, 2/15, 2/15, 1/3])

    # Verify PMF properties
    print("PMF Properties Verification:")
    print(f"  All non-negative: {np.all(pmf >= 0)}")
    print(f"  Sum equals 1: {np.isclose(np.sum(pmf), 1.0)}")
    print(f"  Sum = {np.sum(pmf):.6f}")

    # Compute probabilities of events
    p_even = pmf[1] + pmf[3] + pmf[5]            # X ∈ {2, 4, 6}
    p_greater_than_3 = pmf[3] + pmf[4] + pmf[5]  # X ∈ {4, 5, 6}

    print("\nEvent Probabilities:")
    print(f"  P(X is even) = {p_even:.4f}")
    print(f"  P(X > 3) = {p_greater_than_3:.4f}")

    return support, pmf


# Execute example
support, pmf = create_pmf_example()
```

A common confusion: the PMF value p(x) is a probability, but the PMF itself is a function. We speak of 'the PMF of X' (the entire function) vs 'the probability that X equals x' (a single value). This distinction becomes crucial when comparing distributions or defining families of distributions with parameters.
The PMF gives us the probability of each individual value. But we often need probabilities of events—sets of values. The key insight is that for discrete random variables, event probabilities are computed by summing PMF values.
Fundamental Computation Rule:
For any event $A \subseteq \mathbb{R}$: $$P(X \in A) = \sum_{x \in A \cap \mathcal{X}} p_X(x)$$
We only sum over values in the support because $p_X(x) = 0$ outside the support.
Common Event Types:
| Event | Mathematical Form | Computation |
|---|---|---|
| Exactly $k$ | $P(X = k)$ | $p_X(k)$ |
| At most $k$ | $P(X \leq k)$ | $\sum_{x \leq k} p_X(x)$ |
| At least $k$ | $P(X \geq k)$ | $\sum_{x \geq k} p_X(x)$ |
| In range | $P(a \leq X \leq b)$ | $\sum_{a \leq x \leq b} p_X(x)$ |
| In a set | $P(X \in S)$ | $\sum_{x \in S} p_X(x)$ |
Complement Rule:
Often it's easier to compute $P(X \notin A) = 1 - P(X \in A)$. If you want $P(X \geq 5)$ and the support is $\{0, 1, 2, \ldots, 100\}$, computing $P(X \leq 4)$ and subtracting from 1 requires summing only five terms instead of ninety-six.
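As a worked sketch of the event types above and the complement rule (assuming SciPy is available; the parameters $n = 100$, $p = 0.03$ are chosen only for illustration), here is how these computations look for a Binomial random variable. A fuller, from-scratch implementation follows below.

```python
from scipy.stats import binom

n, p = 100, 0.03  # X ~ Binomial(n, p): number of successes in n trials

print(f"P(X = 2)       = {binom.pmf(2, n, p):.4f}")      # exactly k
print(f"P(X <= 4)      = {binom.cdf(4, n, p):.4f}")      # at most k
print(f"P(X >= 5)      = {1 - binom.cdf(4, n, p):.4f}")  # complement rule
print(f"P(2 <= X <= 6) = {binom.cdf(6, n, p) - binom.cdf(1, n, p):.4f}")  # range
```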
```python
import numpy as np
from typing import Callable


class DiscretePMF:
    """
    A class representing a discrete probability mass function.
    Demonstrates proper PMF operations and probability computations.
    """

    def __init__(self, support: np.ndarray, probabilities: np.ndarray):
        """
        Initialize a discrete PMF.

        Args:
            support: Array of possible values (the support)
            probabilities: Corresponding probabilities (must sum to 1)
        """
        # Validate inputs
        if len(support) != len(probabilities):
            raise ValueError("Support and probabilities must have same length")
        if not np.allclose(np.sum(probabilities), 1.0):
            raise ValueError(f"Probabilities must sum to 1, got {np.sum(probabilities)}")
        if np.any(probabilities < 0):
            raise ValueError("Probabilities must be non-negative")

        self.support = np.array(support)
        self.probs = np.array(probabilities)
        self._prob_dict = dict(zip(support, probabilities))

    def pmf(self, x: float) -> float:
        """Compute P(X = x)."""
        return self._prob_dict.get(x, 0.0)

    def prob_event(self, event_set: set) -> float:
        """Compute P(X ∈ event_set)."""
        return sum(self.pmf(x) for x in event_set)

    def prob_leq(self, k: float) -> float:
        """Compute P(X ≤ k) - the CDF at k."""
        return sum(p for x, p in self._prob_dict.items() if x <= k)

    def prob_geq(self, k: float) -> float:
        """Compute P(X ≥ k) using complement."""
        return 1.0 - self.prob_leq(k - 1e-10)  # Avoid boundary issues

    def prob_between(self, a: float, b: float) -> float:
        """Compute P(a ≤ X ≤ b)."""
        return sum(p for x, p in self._prob_dict.items() if a <= x <= b)

    def prob_condition(self, condition: Callable[[float], bool]) -> float:
        """Compute P(condition(X) is True)."""
        return sum(p for x, p in self._prob_dict.items() if condition(x))


def demonstrate_pmf_operations():
    """
    Demonstrates PMF operations with a practical ML example:
    discrete confidence scores from an ensemble classifier.
    """
    # Support: confidence scores {0.0, 0.1, 0.2, ..., 1.0}
    # (round to avoid floating-point keys like 0.7000000000000001,
    # which would make exact lookups such as pmf.pmf(0.7) fail)
    support = np.round(np.arange(0, 1.1, 0.1), 1)

    # Probabilities: hypothetical distribution from ensemble votes,
    # bell-shaped around 0.7 (model is fairly confident)
    raw_probs = np.array([0.01, 0.02, 0.03, 0.05, 0.08, 0.12,
                          0.18, 0.22, 0.16, 0.09, 0.04])
    probabilities = raw_probs / raw_probs.sum()  # Normalize

    pmf = DiscretePMF(support, probabilities)

    print("Discrete Confidence Score PMF")
    print("=" * 50)

    # Various probability computations
    print(f"P(confidence = 0.7) = {pmf.pmf(0.7):.4f}")
    print(f"P(confidence ≤ 0.5) = {pmf.prob_leq(0.5):.4f}")
    print(f"P(confidence ≥ 0.8) = {pmf.prob_geq(0.8):.4f}")
    print(f"P(0.6 ≤ confidence ≤ 0.8) = {pmf.prob_between(0.6, 0.8):.4f}")

    # Custom condition: "highly uncertain" = confidence in [0.4, 0.6]
    p_uncertain = pmf.prob_condition(lambda x: 0.4 <= x <= 0.6)
    print(f"P(highly uncertain) = {p_uncertain:.4f}")

    return pmf


demonstrate_pmf_operations()
```

The support of a random variable is a deceptively simple concept with significant implications.
Definition (Support):
The support of a discrete random variable $X$, denoted $\text{supp}(X)$ or $\mathcal{X}$, is the set of values $x$ for which $P(X = x) > 0$:
$$\text{supp}(X) = \{x \in \mathbb{R} : p_X(x) > 0\}$$
Equivalently, for a discrete random variable the support is the smallest set $\mathcal{X} \subseteq \mathbb{R}$ such that $P(X \in \mathcal{X}) = 1$.
Why Support Matters in ML:
Understanding support is critical for several reasons:
1. Avoiding Impossible Events: If $x \notin \text{supp}(X)$, then $P(X = x) = 0$ and $\log P(X = x) = -\infty$. In log-likelihood computations, this causes catastrophic failures:
```python
# Dangerous: if x = 7 but the support is {0, 1, 2, 3, 4, 5, 6},
# then pmf(7) == 0 and the log-likelihood is -inf, breaking optimization
log_likelihood = np.log(pmf(7))  # -inf
```
2. Model Mismatch Detection: If your training data contains values outside your model's support, the model fundamentally cannot explain that data. This indicates model misspecification.
3. Generalization Scope: A model's support defines what predictions are possible. A classifier over classes $\{0, 1, 2\}$ cannot predict class 3—this must be handled architecturally.
| Distribution | Parameters | Support | ML Use Case |
|---|---|---|---|
| Bernoulli | p ∈ [0,1] | {0, 1} | Binary classification outputs |
| Binomial | n ∈ ℕ, p ∈ [0,1] | {0, 1, ..., n} | Count of successes in n trials |
| Categorical | p₁, ..., pₖ | {1, 2, ..., K} | Multi-class classification |
| Poisson | λ > 0 | {0, 1, 2, ...} = ℕ₀ | Event counts, rare occurrences |
| Geometric | p ∈ (0,1] | {1, 2, 3, ...} = ℕ | Trials until first success |
| Negative Binomial | r ∈ ℕ, p ∈ (0,1] | {0, 1, 2, ...} = ℕ₀ | Overdispersed counts |
Distributions with unbounded support (like Poisson or Geometric) can technically take arbitrarily large values, but in practice, the probability of very large values becomes negligible. For computational purposes, we often truncate to a practical range while ensuring probabilities still sum to (approximately) 1.
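A minimal sketch of that truncation idea (assuming SciPy is available; the rate $\lambda = 3$ and cutoff of 20 are illustrative choices, not prescribed values):

```python
import numpy as np
from scipy.stats import poisson

lam = 3.0
cutoff = 20  # practical truncation point for an unbounded support

support = np.arange(0, cutoff + 1)
probs = poisson.pmf(support, lam)

# Nearly all probability mass lies below the cutoff
print(f"Mass captured up to {cutoff}: {probs.sum():.10f}")

# If exact normalization is required after truncation, renormalize
probs_renorm = probs / probs.sum()
print(f"Renormalized sum: {probs_renorm.sum():.10f}")
```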
In practice, we rarely work with random variables in isolation. We transform them, combine them, and derive new random variables from existing ones. Understanding how these operations affect the PMF is essential.
Functions of a Single Random Variable:
If $X$ is a discrete random variable and $g: \mathbb{R} \rightarrow \mathbb{R}$ is any function, then $Y = g(X)$ is also a random variable. The PMF of $Y$ is:
$$p_Y(y) = P(Y = y) = P(g(X) = y) = \sum_{x: g(x) = y} p_X(x)$$
We sum over all $x$ values that map to $y$ under $g$.
```python
import numpy as np
from collections import defaultdict


def transform_pmf(support_x: np.ndarray, pmf_x: np.ndarray, g: callable) -> tuple:
    """
    Compute the PMF of Y = g(X) given the PMF of X.

    This is a fundamental operation: many ML quantities are
    functions of random variables (e.g., loss, accuracy).

    Args:
        support_x: Support of X
        pmf_x: PMF values for X
        g: Transformation function

    Returns:
        (support_y, pmf_y): Support and PMF of Y = g(X)
    """
    # Accumulate probabilities for each y value
    y_probs = defaultdict(float)
    for x, p_x in zip(support_x, pmf_x):
        y = g(x)
        y_probs[y] += p_x

    # Sort by y values for clean output
    support_y = np.array(sorted(y_probs.keys()))
    pmf_y = np.array([y_probs[y] for y in support_y])

    return support_y, pmf_y


# Example: Die roll squared
print("Example: Y = X² where X is a fair die roll")
print("=" * 50)

support_x = np.array([1, 2, 3, 4, 5, 6])
pmf_x = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])

support_y, pmf_y = transform_pmf(support_x, pmf_x, lambda x: x**2)

print("X:    | " + " | ".join(f"{x:4d}" for x in support_x))
print("P:    | " + " | ".join(f"{p:.3f}" for p in pmf_x))
print()
print("Y=X²: | " + " | ".join(f"{y:4.0f}" for y in support_y))
print("P(Y): | " + " | ".join(f"{p:.3f}" for p in pmf_y))

# Example: Classification indicator (non-injective function)
print("\n\nExample: Binary threshold Y = 1{X ≥ 4}")
print("=" * 50)

support_y2, pmf_y2 = transform_pmf(
    support_x, pmf_x, lambda x: 1 if x >= 4 else 0
)

print("Y (threshold indicator):")
for y, p in zip(support_y2, pmf_y2):
    print(f"  P(Y = {int(y)}) = {p:.4f}")

# Verify: P(Y=1) should equal P(X>=4) = 3/6 = 0.5
print(f"\nVerification: P(X ≥ 4) = 3/6 = {3/6:.4f}")
```

Non-Injective Transformations:
When $g$ is not one-to-one (multiple $x$ values map to the same $y$), probabilities aggregate. This is common in ML: thresholding a score into a binary label, binning a continuous feature, and taking the argmax of a probability vector all collapse many inputs into a single output.
Each of these 'loses information' by mapping many inputs to fewer outputs, but the probability rules ensure the resulting PMF is valid.
Discrete random variables appear throughout machine learning, often in ways that aren't immediately obvious. Understanding their role clarifies why probabilistic modeling is so powerful.
```python
import numpy as np
from typing import Dict, List


def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert logits to probabilities (a valid PMF)."""
    exp_logits = np.exp(logits - np.max(logits))  # Numerical stability
    return exp_logits / np.sum(exp_logits)


class CategoricalClassifier:
    """
    Demonstrates how classification is fundamentally about
    estimating a discrete PMF over classes.
    """

    def __init__(self, class_names: List[str]):
        self.class_names = class_names
        self.K = len(class_names)
        self.support = np.arange(self.K)

    def predict_pmf(self, logits: np.ndarray) -> Dict[str, float]:
        """
        Convert model logits to a proper PMF.

        In a trained model, logits come from the final layer.
        Softmax transforms them into a valid probability distribution:
        - All values in [0, 1]
        - Sum equals 1
        """
        pmf = softmax(logits)

        # Verify PMF properties
        assert np.all(pmf >= 0), "PMF must be non-negative"
        assert np.isclose(np.sum(pmf), 1.0), "PMF must sum to 1"

        return {name: prob for name, prob in zip(self.class_names, pmf)}

    def predict_class(self, logits: np.ndarray) -> str:
        """Predict the most likely class (mode of the PMF)."""
        pmf = softmax(logits)
        return self.class_names[np.argmax(pmf)]

    def predict_top_k(self, logits: np.ndarray, k: int = 3) -> List[tuple]:
        """Return top-k predictions with probabilities."""
        pmf = softmax(logits)
        top_indices = np.argsort(pmf)[::-1][:k]
        return [(self.class_names[i], pmf[i]) for i in top_indices]


# Demonstration
classifier = CategoricalClassifier(["cat", "dog", "bird", "fish"])

# Simulated logits from a neural network
logits = np.array([2.1, 0.8, -0.3, -1.5])

print("Classification as PMF Estimation")
print("=" * 50)
print(f"Raw logits: {logits}")
print("\nPredicted PMF:")

pmf = classifier.predict_pmf(logits)
for class_name, prob in pmf.items():
    bar = "█" * int(prob * 40)
    print(f"  P(Y = {class_name:5s}) = {prob:.4f} {bar}")

print(f"\nPredicted class: {classifier.predict_class(logits)}")
print(f"Top-2 predictions: {classifier.predict_top_k(logits, k=2)}")

# Cross-entropy loss is negative log of true class probability
true_class = 0  # "cat"
cross_entropy = -np.log(softmax(logits)[true_class])
print(f"\nCross-entropy loss (true class 'cat'): {cross_entropy:.4f}")
```

The softmax function isn't arbitrary—it's precisely the function that transforms unconstrained real numbers (logits) into a valid PMF. It guarantees non-negativity and normalization, making the output interpretable as conditional class probabilities P(Y=k|X).
Discrete random variables have a natural connection to information theory, which provides powerful tools for analyzing and designing ML systems.
Entropy of a Discrete Random Variable:
The entropy $H(X)$ measures the 'uncertainty' or 'information content' of a discrete random variable:
$$H(X) = -\sum_{x \in \mathcal{X}} p_X(x) \log_2 p_X(x)$$
(Using $\log_2$ gives entropy in bits; $\ln$ gives nats.)
Intuition: Entropy is highest when all outcomes are equally likely (maximum uncertainty), and lowest (zero) when one outcome has probability 1 (no uncertainty).
Entropy Bounds:
For a discrete random variable with $|\mathcal{X}|$ possible values:
$$0 \leq H(X) \leq \log_2 |\mathcal{X}|$$
Why This Matters for ML:
Entropy appears throughout machine learning: cross-entropy loss compares a model's predicted PMF against the true labels, information gain (a reduction in entropy) drives decision-tree splits, and the entropy of a classifier's output PMF quantifies how uncertain a prediction is. The example below explores this last use.
```python
import numpy as np


def entropy(pmf: np.ndarray, base: float = 2) -> float:
    """
    Compute entropy of a discrete distribution.

    Uses the convention that 0 * log(0) = 0 (justified by
    the limit as p -> 0 of p * log(p) = 0).

    Args:
        pmf: Probability mass function (must sum to 1)
        base: Logarithm base (2 for bits, e for nats)

    Returns:
        Entropy value
    """
    # Filter out zero probabilities to avoid log(0)
    pmf = pmf[pmf > 0]
    return -np.sum(pmf * np.log(pmf)) / np.log(base)


def analyze_distribution_entropy():
    """Compare entropy across different distributions."""
    print("Entropy Analysis of Discrete Distributions")
    print("=" * 55)

    distributions = [
        ("Deterministic (certain)", np.array([1.0, 0.0, 0.0, 0.0])),
        ("Highly skewed", np.array([0.9, 0.05, 0.03, 0.02])),
        ("Moderately uncertain", np.array([0.5, 0.3, 0.15, 0.05])),
        ("Near uniform", np.array([0.28, 0.26, 0.24, 0.22])),
        ("Uniform (maximum entropy)", np.array([0.25, 0.25, 0.25, 0.25])),
    ]

    max_entropy = np.log2(4)  # For 4 outcomes

    for name, pmf in distributions:
        H = entropy(pmf)
        efficiency = H / max_entropy * 100
        print(f"\n{name}:")
        print(f"  PMF: {pmf}")
        print(f"  Entropy: {H:.4f} bits")
        print(f"  Efficiency: {efficiency:.1f}% of maximum ({max_entropy:.4f} bits)")

    # ML Application: Classifier confidence
    print("\n" + "=" * 55)
    print("ML Application: Classifier Entropy as Uncertainty")
    print("=" * 55)

    predictions = [
        ("Confident prediction", np.array([0.95, 0.03, 0.02])),
        ("Moderate confidence", np.array([0.6, 0.3, 0.1])),
        ("Uncertain prediction", np.array([0.4, 0.35, 0.25])),
    ]

    for name, pmf in predictions:
        H = entropy(pmf)
        print(f"\n{name}:")
        print(f"  Predicted PMF: {pmf}")
        print(f"  Entropy: {H:.4f} bits (higher = more uncertain)")


analyze_distribution_entropy()
```

When your classifier outputs high-entropy predictions frequently, it signals systemic uncertainty—perhaps the model needs more training data, features are insufficient, or the task is inherently ambiguous. Tracking prediction entropy over time can reveal model reliability issues before accuracy metrics show problems.
We've established the mathematical machinery of discrete random variables—the foundation for understanding how machine learning models reason about countable outcomes.
What's Next:
In the next page, we extend to continuous random variables—where outcomes form uncountable continua like ℝ or [0, 1]. The key conceptual shift: we can no longer assign nonzero probability to individual points. Instead, we use probability density functions (PDFs), and sums become integrals. This unlocks modeling of continuous quantities like regression targets, neural network weights, and latent representations.
You now have a rigorous understanding of discrete random variables and probability mass functions. This foundation extends naturally to continuous random variables in the next page, and together they form the complete language for describing probability distributions in machine learning.