Having established sample spaces and events as the vocabulary of probability, we now face a fundamental question: How do we assign numbers to events in a consistent, meaningful way?
Intuitively, we might say 'probability is the frequency with which an event occurs if we repeat the experiment many times.' This frequentist interpretation is useful but immediately runs into problems: how many repetitions count as 'many'? And what probability should we assign to events that cannot be repeated at all, like tomorrow's weather?
Alternatively, we might interpret probability as 'degree of belief'—a Bayesian view. But then whose beliefs count, and what keeps those degrees of belief internally consistent?
The solution, developed by Andrey Kolmogorov in 1933, was to axiomatize probability—to specify minimal rules that any valid probability assignment must satisfy. These axioms don't tell us what probability means; they tell us how probability must behave.
By the end of this page, you will understand Kolmogorov's axioms of probability, derive fundamental probability rules from these axioms, and recognize how these abstract principles underpin concrete ML computations like loss functions, model calibration, and probabilistic predictions.
Let (Ω, ℱ) be a measurable space where Ω is the sample space and ℱ is a σ-algebra of events. A probability measure P is a function P: ℱ → [0, 1] that assigns a number to each event, satisfying three axioms:
For any event A ∈ ℱ:
P(A) ≥ 0
Probabilities cannot be negative. This seems obvious but is essential for interpretation—a negative probability has no meaning.
P(Ω) = 1
The probability that something happens is 1. The sample space, containing all possible outcomes, has probability 1 because exactly one outcome must occur.
For any countable sequence of mutually exclusive events A₁, A₂, A₃, ... ∈ ℱ (i.e., Aᵢ ∩ Aⱼ = ∅ for i ≠ j):
P(∪ᵢ₌₁^∞ Aᵢ) = Σᵢ₌₁^∞ P(Aᵢ)
If events cannot overlap, the probability of 'at least one occurring' equals the sum of their individual probabilities. This extends to infinitely many events, which is crucial for continuous probability.
The triple (Ω, ℱ, P) is called a probability space.
These three axioms appear deceptively simple, yet they are sufficient to derive ALL of probability theory. Every theorem about expectations, variances, conditional probabilities, limit theorems, and more follows from these three statements. This is the power of axiomatic mathematics: from minimal assumptions, vast structures emerge.
| Axiom | Statement | Intuition | Why Essential |
|---|---|---|---|
| Non-Negativity | P(A) ≥ 0 for all events A | Probability measures 'how likely'—can't be negative | Enables interpretation as proportion/frequency |
| Normalization | P(Ω) = 1 | Something must happen | Provides reference scale for all probabilities |
| Countable Additivity | P(∪ Aᵢ) = Σ P(Aᵢ) for disjoint Aᵢ | Non-overlapping events add up | Allows infinite sums; essential for continuous distributions |
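The three axioms in the table can be checked mechanically on a tiny example. Below is a minimal sketch, using exact `Fraction` arithmetic, that verifies them for the uniform measure on a fair die; the helper `P` is illustrative, not a library function:

```python
from fractions import Fraction

# Uniform measure on a fair die: P(A) = |A| / |Ω| (illustrative helper)
omega = {1, 2, 3, 4, 5, 6}

def P(event: set) -> Fraction:
    return Fraction(len(event & omega), len(omega))

# Axiom 1 (non-negativity): every event has P ≥ 0
assert all(P({w}) >= 0 for w in omega)
# Axiom 2 (normalization): P(Ω) = 1
assert P(omega) == 1
# Axiom 3 (additivity, finite case): disjoint events add
evens, odds = {2, 4, 6}, {1, 3, 5}
assert evens & odds == set()
assert P(evens | odds) == P(evens) + P(odds)
```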
From Kolmogorov's three axioms, we can derive many fundamental properties of probability. These derivations demonstrate the power of axiomatic reasoning—each property is a logical consequence, not an additional assumption.
Theorem: P(∅) = 0
Proof: Write Ω = Ω ∪ ∅ ∪ ∅ ∪ ..., a countable union of pairwise disjoint events. By countable additivity, P(Ω) = P(Ω) + Σᵢ P(∅), so Σᵢ P(∅) = 0. Since P(∅) ≥ 0 by Axiom 1, we must have P(∅) = 0. ∎
Interpretation: An impossible event has zero probability—nothing can occur outside all possibilities.
Theorem: P(A^c) = 1 - P(A)
Proof: A and A^c are disjoint, and A ∪ A^c = Ω. By additivity, P(A) + P(A^c) = P(Ω) = 1 (Axiom 2), so P(A^c) = 1 − P(A). ∎
Interpretation: The probability of 'not A' equals one minus the probability of A. If there's a 70% chance of rain, there's a 30% chance of no rain.
ML Application: If a classifier predicts class 1 with probability 0.8, it implicitly predicts class 0 (in binary classification) with probability 0.2. This is the foundation of probabilistic predictions.
Theorem: For any event A, 0 ≤ P(A) ≤ 1
Proof: P(A) ≥ 0 is exactly Axiom 1. For the upper bound, the complement rule gives P(A) = 1 − P(A^c), and P(A^c) ≥ 0 by Axiom 1, so P(A) ≤ 1. ∎
Interpretation: Probabilities are always between 0 and 1. This seems obvious but is a derived property, not an axiom.
ML Application: This is why neural network outputs for classification use softmax or sigmoid—to constrain outputs to [0, 1] and be interpretable as probabilities.
Theorem: If A ⊆ B, then P(A) ≤ P(B)
Proof: Write B = A ∪ (B ∩ A^c), a disjoint union. By additivity, P(B) = P(A) + P(B ∩ A^c) ≥ P(A), since P(B ∩ A^c) ≥ 0 by Axiom 1. ∎
Interpretation: If event A is contained in event B (A implies B), then A is at most as likely as B. 'Drawing an ace' is less likely than 'drawing a face card or ace.'
ML Application: If your model's correct predictions are a subset of another model's correct predictions, the second model's accuracy is at least as high.
Notice how each proof uses only the three axioms and previously proven properties. This chain of logical deduction is the essence of mathematics. Every probability theorem you'll ever use—from Bayes' theorem to the central limit theorem—ultimately traces back to these three axioms.
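To see these derived rules in action, here is a small sketch checking the complement rule, monotonicity, and the bounds numerically for the fair-die measure (the helper `P` is again illustrative, not from any library):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event: set) -> Fraction:
    # Classical measure on a fair die (illustrative helper)
    return Fraction(len(event & omega), len(omega))

low = {1, 2, 3}
aces = {1}

# Complement rule: P(A^c) = 1 - P(A)
assert P(omega - low) == 1 - P(low)
# Monotonicity: A ⊆ B implies P(A) ≤ P(B)
assert aces <= low and P(aces) <= P(low)
# Bounds: 0 ≤ P(A) ≤ 1
assert 0 <= P(low) <= 1
```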
Axiom 3 tells us how to compute P(A ∪ B) when A and B are disjoint. But what if A and B overlap? We need the General Addition Rule (also called the Inclusion-Exclusion Principle for two sets).
Theorem: For any events A and B:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Proof: Decompose A ∪ B into the disjoint pieces A and B ∩ A^c. By additivity:
P(A ∪ B) = P(A) + P(B ∩ A^c)
Also, B = (A ∩ B) ∪ (B ∩ A^c) is a disjoint union, so:
P(B ∩ A^c) = P(B) − P(A ∩ B)
Substituting yields P(A ∪ B) = P(A) + P(B) − P(A ∩ B). ∎
When we add P(A) + P(B), outcomes in the intersection A ∩ B get counted twice—once when counting A, once when counting B. Subtracting P(A ∩ B) corrects this double-counting. This is the intuition behind inclusion-exclusion.
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(A ∩ C) - P(B ∩ C) + P(A ∩ B ∩ C)
The pattern: add singles, subtract pairs, add triples. This extends to n events:
P(∪ᵢ₌₁ⁿ Aᵢ) = Σᵢ P(Aᵢ) - Σᵢ<ⱼ P(Aᵢ ∩ Aⱼ) + Σᵢ<ⱼ<ₖ P(Aᵢ ∩ Aⱼ ∩ Aₖ) - ... + (-1)ⁿ⁺¹ P(∩ᵢ₌₁ⁿ Aᵢ)
```python
from typing import List, Set, Any
from itertools import combinations


def probability_of_union(
    events: List[Set[Any]],
    sample_space: Set[Any],
) -> float:
    """
    Compute P(A₁ ∪ A₂ ∪ ... ∪ Aₙ) using inclusion-exclusion.

    For finite sample spaces, this computes exact probabilities
    by working with the sets directly.

    Parameters:
        events: List of event sets
        sample_space: The complete sample space Ω

    Returns:
        Probability of at least one event occurring
    """
    n = len(events)
    if n == 0:
        return 0.0

    omega_size = len(sample_space)
    total_prob = 0.0

    for k in range(1, n + 1):
        # Sum over all k-subsets of events
        sign = (-1) ** (k + 1)  # Alternating signs
        for event_subset in combinations(events, k):
            # Compute intersection of all events in subset
            intersection = event_subset[0]
            for event in event_subset[1:]:
                intersection = intersection & event
            # Add (or subtract) probability of this intersection
            total_prob += sign * len(intersection) / omega_size

    return total_prob


def probability_of_at_least_one_simple(
    events: List[Set[Any]],
    sample_space: Set[Any],
) -> float:
    """
    Alternative: compute P(at least one) by finding the actual union.

    Simpler but less illustrative of inclusion-exclusion.
    """
    union = set()
    for event in events:
        union = union | event
    return len(union) / len(sample_space)


# Example: Two dice, multiple events
if __name__ == "__main__":
    # Sample space for two dice
    Omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}

    # Define events
    A = {(i, j) for (i, j) in Omega if i + j == 7}  # Sum is 7
    B = {(i, j) for (i, j) in Omega if i == j}      # Doubles
    C = {(i, j) for (i, j) in Omega if i >= 5}      # First die ≥ 5

    print("Event A (sum = 7):", len(A), "outcomes")
    print("Event B (doubles):", len(B), "outcomes")
    print("Event C (first ≥ 5):", len(C), "outcomes")

    # Two-event inclusion-exclusion
    p_a = len(A) / len(Omega)
    p_b = len(B) / len(Omega)
    p_a_and_b = len(A & B) / len(Omega)

    print(f"\nP(A) = {p_a:.4f}")
    print(f"P(B) = {p_b:.4f}")
    print(f"P(A ∩ B) = {p_a_and_b:.4f}")
    print(f"P(A ∪ B) via inclusion-exclusion = {p_a + p_b - p_a_and_b:.4f}")
    print(f"P(A ∪ B) via direct computation = {len(A | B) / len(Omega):.4f}")

    # Three-event inclusion-exclusion
    print(f"\nP(A ∪ B ∪ C) via function = {probability_of_union([A, B, C], Omega):.4f}")
    print(f"P(A ∪ B ∪ C) via direct = {probability_of_at_least_one_simple([A, B, C], Omega):.4f}")
```

A powerful consequence of countable additivity (Axiom 3) is that probability behaves 'continuously' with respect to limits of events. This is crucial for working with infinite sequences and continuous distributions.
Theorem (Continuity from Below): If A₁ ⊆ A₂ ⊆ A₃ ⊆ ... is an increasing sequence of events, and A = ∪ᵢ₌₁^∞ Aᵢ, then:
P(A) = lim_{n→∞} P(Aₙ)
Theorem (Continuity from Above): If A₁ ⊇ A₂ ⊇ A₃ ⊇ ... is a decreasing sequence of events, and A = ∩ᵢ₌₁^∞ Aᵢ, then:
P(A) = lim_{n→∞} P(Aₙ)
If events get progressively larger (A₁ ⊆ A₂ ⊆ ...) and converge to some limit event A, the probabilities also converge: P(Aₙ) → P(A). There are no 'jumps' at infinity. This mirrors how continuous functions behave at limits and is essential for defining probability on continuous spaces.
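A discrete sketch of continuity from below, under the assumed mass function p(k) = 2⁻ᵏ on Ω = {1, 2, 3, ...}: the increasing events Aₙ = {1, ..., n} have P(Aₙ) = 1 − 2⁻ⁿ, which climbs monotonically toward P(Ω) = 1 with no jump at infinity.

```python
# Probability mass p(k) = 2^{-k} on Ω = {1, 2, 3, ...}; total mass is 1.
# The increasing events A_n = {1, ..., n} converge upward to Ω.
def P_An(n: int) -> float:
    """P(A_n) = p(1) + ... + p(n), which equals 1 - 2^{-n}."""
    return sum(2.0 ** -k for k in range(1, n + 1))

probs = [P_An(n) for n in (1, 5, 10, 30)]
# P(A_n) increases monotonically toward P(Ω) = 1
assert all(a < b for a, b in zip(probs, probs[1:]))
assert abs(P_An(30) - 1.0) < 1e-8
```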
The distinction between finite additivity (works for any finite collection of disjoint events) and countable additivity (works for countable collections) is subtle but crucial.
Finite additivity alone would not guarantee the continuity properties above, nor would it pin down probabilities for limits of events—for instance, it could not tell us that P(∪ᵢ₌₁^∞ Aᵢ) equals the limit of the partial sums for a disjoint sequence. Both are essential for continuous distributions and limit theorems.
Kolmogorov's choice of countable additivity as the third axiom was deliberate—it provides just enough structure to build all of probability theory without being so restrictive as to exclude useful probability spaces.
The axioms tell us what properties a probability measure must have. But how do we actually construct one? Different constructions lead to different probability spaces.
For finite sample spaces with equally likely outcomes:
P(A) = |A| / |Ω|
This satisfies all three axioms: |A|/|Ω| ≥ 0 because cardinalities are non-negative; P(Ω) = |Ω|/|Ω| = 1; and for disjoint A and B, |A ∪ B| = |A| + |B|, so P(A ∪ B) = P(A) + P(B).
| Experiment | Sample Space | Example Event | Probability |
|---|---|---|---|
| Fair coin flip | Ω = {H, T} | A = {H} | P(A) = 1/2 |
| Fair die roll | Ω = {1,2,3,4,5,6} | A = {even} = {2,4,6} | P(A) = 3/6 = 1/2 |
| Drawing a card | Ω = 52 cards | A = {aces} = 4 cards | P(A) = 4/52 = 1/13 |
| Two fair dice | Ω = 36 pairs | A = {sum = 7} = 6 pairs | P(A) = 6/36 = 1/6 |
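The table entries can be reproduced with a few lines of counting; `classical` below is a hypothetical one-line helper for |A|/|Ω|, used here only to check the arithmetic:

```python
from fractions import Fraction

# Classical probability: P(A) = |A| / |Ω| (hypothetical helper)
def classical(event_size: int, omega_size: int) -> Fraction:
    return Fraction(event_size, omega_size)

assert classical(1, 2) == Fraction(1, 2)    # fair coin, A = {H}
assert classical(3, 6) == Fraction(1, 2)    # fair die, A = {2, 4, 6}
assert classical(4, 52) == Fraction(1, 13)  # drawing a card, A = {aces}
# Two dice: count the pairs summing to 7 directly
pairs = [(i, j) for i in range(1, 7) for j in range(1, 7) if i + j == 7]
assert classical(len(pairs), 36) == Fraction(1, 6)
```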
For discrete sample spaces, we can assign probabilities directly to each outcome:
p(ω) = probability of outcome ω
Constraints (from axioms): p(ω) ≥ 0 for every ω ∈ Ω, and Σ_{ω ∈ Ω} p(ω) = 1.
Then for any event A:
P(A) = Σ_{ω ∈ A} p(ω)
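As a quick sketch, here is a made-up loaded die expressed as a mass function p(ω), with the axiom constraints checked and P(A) computed by summing over the outcomes in A:

```python
# A (fictional) loaded die as a discrete probability mass function p(ω)
p = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

# Constraints from the axioms: non-negative, summing to 1
assert all(v >= 0 for v in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12

# P(A) = Σ_{ω ∈ A} p(ω), e.g. A = {even outcome}
A = {2, 4, 6}
P_A = sum(p[w] for w in A)
assert abs(P_A - 0.7) < 1e-12
```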
For continuous sample spaces like Ω = ℝ, we cannot assign positive probability to individual points (otherwise we'd exceed 1 by summing uncountably many). Instead, we use a probability density function (PDF) f(x):
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Constraints: f(x) ≥ 0 for all x, and ∫_{−∞}^{∞} f(x) dx = 1.
The PDF gives probability per unit length, not probability of points.
Example: Uniform distribution on [0, 1]
f(x) = 1 for x ∈ [0, 1], and f(x) = 0 otherwise
P(0.2 ≤ X ≤ 0.5) = ∫_{0.2}^{0.5} 1 dx = 0.3
The value f(x) is NOT a probability—it's a density. For continuous random variables:
• P(X = 0.5) = 0 (probability of any single point is zero)
• f(0.5) can be any non-negative number
• f(x) can exceed 1 (e.g., uniform on [0, 0.1] has f(x) = 10)
Only integrals of f(x) yield probabilities, and those must be ≤ 1.
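A minimal numeric sketch of these points, using the uniform density on [0, 0.1] mentioned above; the `integrate` helper is a crude midpoint rule written for illustration only:

```python
def f(x: float) -> float:
    """Uniform density on [0, 0.1]: f(x) = 10 inside, 0 outside."""
    return 10.0 if 0.0 <= x <= 0.1 else 0.0

def integrate(a: float, b: float, n: int = 100_000) -> float:
    """Crude midpoint-rule approximation of the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) * dx for i in range(n))

assert f(0.05) == 10.0                         # a density value above 1
assert abs(integrate(0.0, 0.1) - 1.0) < 1e-6   # total probability is 1
assert abs(integrate(0.0, 0.05) - 0.5) < 1e-6  # P(0 ≤ X ≤ 0.05) = 0.5
```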
Machine learning models that output 'probabilities' must satisfy Kolmogorov's axioms to be valid probability distributions. Let's examine how this manifests in practice.
For multi-class classification with K classes, a neural network outputs logits z₁, z₂, ..., zₖ. To convert these to valid probabilities, we apply softmax:
P(class i) = exp(zᵢ) / Σⱼ exp(zⱼ)
Let's verify the axioms: every exp(zᵢ) > 0, so each P(class i) > 0 (non-negativity); the probabilities sum to Σᵢ exp(zᵢ) / Σⱼ exp(zⱼ) = 1 (normalization); and additivity holds because the probability of any set of classes is the sum of the corresponding terms.
For binary classification, sigmoid maps a single logit z to a probability:
P(class 1) = σ(z) = 1 / (1 + exp(-z))
P(class 0) = 1 - σ(z) = σ(-z)
Verification: σ(z) ∈ (0, 1) for every real z, and σ(z) + σ(−z) = 1, so the two class probabilities are non-negative and sum to 1.
Poorly designed models can produce invalid 'probabilities': raw logits may be negative or exceed 1, unnormalized scores need not sum to 1, and independently thresholded per-class sigmoid outputs rarely sum to 1 across classes.
Conversely, even outputs that technically satisfy the axioms (they're in [0, 1] and sum to 1) may fail to estimate the true conditional probabilities accurately—a problem known as miscalibration.
A classifier is well-calibrated if, among all test examples where it predicts P(class 1) = 0.7, approximately 70% actually belong to class 1. The axioms ensure outputs are valid probabilities; calibration ensures they're accurate probabilities. Both matter for trustworthy ML systems.
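One common diagnostic is to bin predictions by confidence and compare each bin's mean predicted probability with its observed positive rate (a reliability diagram). The sketch below uses synthetic data that is calibrated by construction; `calibration_by_bins` is a hypothetical helper, not a library API:

```python
import numpy as np

def calibration_by_bins(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10):
    """For each confidence bin, pair the mean predicted P(class 1)
    with the observed fraction of class-1 examples."""
    edges = np.linspace(0, 1, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            rows.append((probs[mask].mean(), labels[mask].mean(), int(mask.sum())))
    return rows

# Synthetic data, calibrated by construction:
# each label is 1 with probability exactly equal to the prediction.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 50_000)
y = (rng.uniform(0, 1, 50_000) < p).astype(int)

for mean_pred, frac_pos, count in calibration_by_bins(p, y):
    assert abs(mean_pred - frac_pos) < 0.05  # bins agree on calibrated data
```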
Let's consolidate the probability rules derived from the axioms into a practical reference.
| Rule | Formula | When to Use |
|---|---|---|
| Complement | P(A^c) = 1 - P(A) | When 'not A' is easier to compute |
| Addition (disjoint) | P(A ∪ B) = P(A) + P(B) | Events cannot co-occur |
| Addition (general) | P(A ∪ B) = P(A) + P(B) - P(A ∩ B) | Events may overlap |
| Bounds | 0 ≤ P(A) ≤ 1 | Sanity check |
| Monotonicity | If A ⊆ B, then P(A) ≤ P(B) | A implies B |
| Union bound | P(∪ᵢ Aᵢ) ≤ Σᵢ P(Aᵢ) | Upper bound for any union |
A tremendously useful inequality:
P(A₁ ∪ A₂ ∪ ... ∪ Aₙ) ≤ P(A₁) + P(A₂) + ... + P(Aₙ)
For two events this follows from the general addition rule by dropping the subtracted intersection term; the general case, known as Boole's inequality, follows by induction.
Why it matters in ML:
Suppose you want multiple statistical tests to simultaneously succeed with high probability. If each test fails with probability at most δ and you run n tests, the union bound gives P(at least one failure) ≤ nδ. Choosing δ = α/n therefore ensures all n tests succeed with probability at least 1 − α—the Bonferroni correction.
This union bound argument appears constantly in generalization bounds (controlling error uniformly over a finite hypothesis class), PAC learning, multiple hypothesis testing, and high-probability guarantees for randomized algorithms.
```python
import numpy as np

# =============================================================
# Probability Rules Implementation
# =============================================================

def complement_probability(p_event: float) -> float:
    """P(A^c) = 1 - P(A)"""
    return 1 - p_event


def union_probability_disjoint(*probs: float) -> float:
    """
    P(A₁ ∪ A₂ ∪ ... ∪ Aₙ) for mutually exclusive events.
    Simply sums the probabilities.
    """
    return sum(probs)


def union_probability_two(p_a: float, p_b: float, p_intersection: float) -> float:
    """
    P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
    General addition rule for two events.
    """
    return p_a + p_b - p_intersection


def union_bound(*probs: float) -> float:
    """
    P(∪ᵢ Aᵢ) ≤ Σᵢ P(Aᵢ)
    Returns the union bound (upper bound on union probability).
    Actual probability may be lower due to overlaps.
    """
    bound = sum(probs)
    # Cap at 1 since probability cannot exceed 1
    return min(bound, 1.0)


def bonferroni_correction(desired_overall_prob: float, num_tests: int) -> float:
    """
    Compute per-test significance level to achieve overall level.
    If we want P(any Type I error) ≤ α with n tests,
    each test should use significance level α/n.
    """
    return desired_overall_prob / num_tests


# =============================================================
# Softmax: Converting logits to valid probabilities
# =============================================================

def softmax(logits: np.ndarray) -> np.ndarray:
    """
    Convert raw logits to a probability distribution satisfying the axioms.

    Uses a numerically stable implementation (subtracts the max).

    Parameters:
        logits: Array of shape (K,) or (batch_size, K)

    Returns:
        Probabilities satisfying:
        - All values in [0, 1] (non-negativity + bounds)
        - Sum to 1 along the last axis (normalization)
    """
    # Numerical stability: subtract max
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_logits = np.exp(shifted)
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)


def sigmoid(z: np.ndarray) -> np.ndarray:
    """
    Binary classification probability: P(class=1) = sigmoid(z)
    Satisfies P(class=0) + P(class=1) = 1
    """
    return 1 / (1 + np.exp(-z))


def verify_probability_axioms(probs: np.ndarray, tol: float = 1e-6) -> dict:
    """
    Check if an array represents a valid probability distribution.
    Returns a dict with verification results for each axiom.
    """
    results = {
        "non_negative": np.all(probs >= -tol),
        "upper_bounded": np.all(probs <= 1 + tol),
        "normalized": np.abs(np.sum(probs) - 1) < tol,
    }
    results["valid"] = all(results.values())
    return results


if __name__ == "__main__":
    # Example 1: Complement rule
    p_rain = 0.7
    p_no_rain = complement_probability(p_rain)
    print(f"P(rain) = {p_rain}, P(no rain) = {p_no_rain}")

    # Example 2: Union of overlapping events
    p_a = 0.3
    p_b = 0.4
    p_a_and_b = 0.1
    p_a_or_b = union_probability_two(p_a, p_b, p_a_and_b)
    print(f"\nP(A ∪ B) = {p_a_or_b}")

    # Example 3: Union bound
    failure_probs = [0.05, 0.03, 0.02, 0.01]
    bound = union_bound(*failure_probs)
    print(f"\nUnion bound on P(any failure): {bound}")

    # Example 4: Softmax
    logits = np.array([2.0, 1.0, 0.5, 0.1])
    probs = softmax(logits)
    print(f"\nLogits: {logits}")
    print(f"Softmax probabilities: {probs}")
    print(f"Sum: {probs.sum()}")
    print(f"Axiom verification: {verify_probability_axioms(probs)}")

    # Example 5: Bonferroni correction
    n_tests = 20
    overall_alpha = 0.05
    per_test_alpha = bonferroni_correction(overall_alpha, n_tests)
    print(f"\nFor {n_tests} tests at α={overall_alpha}:")
    print(f"Per-test threshold: {per_test_alpha}")
```

We've established the rigorous mathematical foundation that underlies all probabilistic reasoning in machine learning.
What's Next:
With probability measures established, we can now study conditional probability—how to update our beliefs when we learn new information. This is the key to Bayesian reasoning, chain rules, and ultimately Bayes' theorem, which powers everything from spam filters to medical diagnosis to large language models.
You now understand probability at its deepest level—not as vague intuition about 'chances,' but as a precisely defined mathematical structure. Every formula you use in ML, from cross-entropy loss to Bayesian posteriors, traces back to these three axioms. This foundation makes you a more rigorous and confident practitioner.