The power of probabilistic reasoning lies not in static probability assignments, but in our ability to update beliefs when new information arrives. This is precisely what machine learning models do: they take in data (evidence) and adjust their predictions accordingly.
Consider a few familiar scenarios: a medical test comes back positive, a user clicks on a recommended product, a language model has produced the first few words of a sentence. Each scenario involves the same fundamental question: given that event B has occurred, what is the probability of event A? This is conditional probability—the mathematical language for reasoning under partial information.
By the end of this page, you will master the definition and intuition of conditional probability, apply the multiplication rule to compute joint probabilities, understand the chain rule for sequences of events, and recognize how conditional probability enables ML models to make informed predictions.
Let A and B be events in a probability space, with P(B) > 0. The conditional probability of A given B, denoted P(A|B) (read 'probability of A given B'), is defined as:

P(A|B) = P(A ∩ B) / P(B)
This formula captures a simple intuition: once we know B has occurred, we restrict our attention to outcomes within B. The conditional probability P(A|B) asks: of all outcomes in B, what fraction also lie in A?
Imagine the sample space Ω as the entire area of a rectangle. Event B is some region within this rectangle. When we condition on B, we're effectively saying: 'The outcome is somewhere in B. Given that, what's the chance it's also in A?'
The answer is the fraction of B's area that overlaps with A:
P(A|B) = Area(A ∩ B) / Area(B)
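To make the area picture concrete, here is a minimal numeric check using a fair six-sided die (the events A and B below are my own illustration):

```python
from fractions import Fraction

# Sample space for a fair six-sided die
omega = {1, 2, 3, 4, 5, 6}
A = {4, 5, 6}   # "greater than 3"
B = {2, 4, 6}   # "even"

def prob(event, space):
    """Uniform probability: |event| / |space|."""
    return Fraction(len(event), len(space))

# P(A|B) = P(A ∩ B) / P(B) -- the fraction of B that overlaps A
p_a_given_b = prob(A & B, omega) / prob(B, omega)
print(p_a_given_b)  # 2/3: of the evens {2, 4, 6}, two of them ({4, 6}) exceed 3
```

Using `Fraction` keeps the arithmetic exact, which makes the "fraction of B's area" intuition visible in the result.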
Conditional probability P(A|B) is only defined when P(B) > 0. Conditioning on an impossible event (P(B) = 0) makes no sense—we cannot learn anything from observing something that cannot happen. In practice, this means we should never condition on events we haven't actually observed or that have measure zero.
Conditional probability P(·|B) is itself a valid probability measure on the reduced sample space B. It satisfies all of Kolmogorov's axioms:
1. Non-negativity: P(A|B) ≥ 0 for all events A
2. Normalization: P(B|B) = P(B ∩ B) / P(B) = P(B) / P(B) = 1
3. Countable Additivity: For disjoint events A₁, A₂, ...: P(∪ᵢ Aᵢ | B) = Σᵢ P(Aᵢ | B)
Because P(·|B) is a valid probability measure, all theorems derived from the axioms also hold for conditional probabilities.
When we condition on B, we're effectively working in a new probability space where B is the new sample space and probabilities are rescaled by 1/P(B). All probability rules apply, just within this restricted universe. This perspective is powerful: it means we can apply everything we know about probability to conditional settings.
Case 1: A and B are disjoint (A ∩ B = ∅)
P(A|B) = P(∅) / P(B) = 0 / P(B) = 0
If A and B cannot co-occur, then given B happened, A definitely did not.
Case 2: B ⊆ A (B implies A)
P(A|B) = P(B) / P(B) = 1
If every outcome in B is also in A, then knowing B tells us A occurred with certainty.
Case 3: A ⊆ B (A implies B)
P(A|B) = P(A) / P(B) ≥ P(A)
Since P(B) ≤ 1, dividing by P(B) increases (or maintains) the probability.
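These three special cases can be verified directly; a small sketch, again using a fair die with illustrative events of my own choosing:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # fair die

def p(event):
    return Fraction(len(event), len(omega))

def p_given(a, b):
    """P(A|B) = P(A ∩ B) / P(B)."""
    return p(a & b) / p(b)

# Case 1: disjoint events -- knowing B occurred rules A out entirely
assert p_given({1, 2}, {3, 4}) == 0

# Case 2: B ⊆ A -- B guarantees A
assert p_given({1, 2, 3, 4}, {1, 2}) == 1

# Case 3: A ⊆ B -- conditioning on B can only raise (or keep) A's probability
A, B = {2}, {2, 4, 6}
assert p_given(A, B) == Fraction(1, 3)
assert p_given(A, B) >= p(A)  # 1/3 ≥ 1/6
```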
Rearranging the definition of conditional probability yields the multiplication rule—a formula for computing joint probabilities:

P(A ∩ B) = P(B) · P(A|B) = P(A) · P(B|A)
Both forms are equivalent and immensely useful. The multiplication rule says: the probability that both A and B occur equals the probability that B occurs, times the probability that A occurs given B.
Think of two sequential selections: first, does B occur (probability P(B))? Then, given that B occurred, does A occur (probability P(A|B))?
The probability of both happening is the product of these sequential probabilities.
Consider a recommendation system predicting whether a user will view a product (V), add it to their cart (C), and purchase it (P). The probability of a complete purchase journey:
P(V ∩ C ∩ P) = P(V) · P(C|V) · P(P|V ∩ C)
This factorization mirrors how the user actually behaves: they first view (probability P(V)), then given they viewed, they add to cart (probability P(C|V)), then given they viewed and added to cart, they purchase (probability P(P|V ∩ C)).
The model can estimate each conditional probability separately and multiply them for the full journey probability.
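A minimal sketch of this factorization, with made-up funnel probabilities (the numbers below are illustrative, not from real data):

```python
# Hypothetical funnel probabilities (illustrative values only)
p_view = 0.30                  # P(V): user views the product
p_cart_given_view = 0.20       # P(C|V): adds to cart, given a view
p_buy_given_view_cart = 0.50   # P(P|V ∩ C): purchases, given view and cart

# Multiplication rule applied twice:
# P(V ∩ C ∩ P) = P(V) · P(C|V) · P(P|V ∩ C)
p_full_journey = p_view * p_cart_given_view * p_buy_given_view_cart
print(f"P(view ∩ cart ∩ purchase) = {p_full_journey:.3f}")  # 0.030
```

Each factor can be estimated from a different slice of the data (all users, viewers only, cart-adders only), which is exactly why the factorized form is convenient.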
The multiplication rule extends to any number of events through the chain rule (also called the general product rule):

P(A₁ ∩ A₂ ∩ ... ∩ Aₙ) = P(A₁) · P(A₂|A₁) · P(A₃|A₁ ∩ A₂) · ... · P(Aₙ|A₁ ∩ ... ∩ Aₙ₋₁)
Each term conditions on all preceding events. This rule is fundamental to sequence modeling in machine learning.
Modern language models like GPT use the chain rule to model text probability:
P(w₁, w₂, ..., wₙ) = P(w₁) · P(w₂|w₁) · P(w₃|w₁,w₂) · ... · P(wₙ|w₁,...,wₙ₋₁)
Each word's probability is conditioned on all previous words. The model learns these conditional distributions from massive text corpora. This is exactly the chain rule applied to sequences.
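In practice, multiplying many small conditionals underflows floating point, which is why implementations sum log-probabilities instead. A small sketch with hypothetical per-token conditionals:

```python
import math

# Hypothetical per-token conditionals P(wᵢ | w₁...wᵢ₋₁) for a short sentence
token_conditionals = [0.1, 0.05, 0.2, 0.08]

# Direct product (fine here, but underflows for long sequences)
p_direct = math.prod(token_conditionals)

# Log-space chain rule: log P(w₁...wₙ) = Σᵢ log P(wᵢ | w₁...wᵢ₋₁)
log_p = sum(math.log(p) for p in token_conditionals)

# Both routes agree; the log-space sum is numerically stable
assert math.isclose(p_direct, math.exp(log_p))
print(f"P(sentence) = {p_direct:.2e}, log P = {log_p:.3f}")
```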
By induction:
Base case (n=2): P(A₁ ∩ A₂) = P(A₁) · P(A₂|A₁) ✓ (the multiplication rule)
Inductive step: Assume the chain rule holds for n-1 events. Then:
P(A₁ ∩ ... ∩ Aₙ)
= P((A₁ ∩ ... ∩ Aₙ₋₁) ∩ Aₙ)
= P(A₁ ∩ ... ∩ Aₙ₋₁) · P(Aₙ | A₁ ∩ ... ∩ Aₙ₋₁)   [multiplication rule]
= [P(A₁) · P(A₂|A₁) · ... · P(Aₙ₋₁|A₁ ∩ ... ∩ Aₙ₋₂)] · P(Aₙ | A₁ ∩ ... ∩ Aₙ₋₁)   [inductive hypothesis]
This matches the chain rule for n events. ∎
```python
import numpy as np
from typing import List, Tuple, Dict

# =============================================================
# Conditional Probability Fundamentals
# =============================================================

def conditional_probability(
    p_intersection: float,
    p_condition: float
) -> float:
    """
    Compute P(A|B) = P(A ∩ B) / P(B)

    Parameters:
        p_intersection: P(A ∩ B), probability of both events
        p_condition: P(B), probability of conditioning event

    Returns:
        P(A|B), conditional probability

    Raises:
        ValueError: If P(B) = 0
    """
    if p_condition == 0:
        raise ValueError("Cannot condition on event with probability 0")
    return p_intersection / p_condition


def joint_from_conditional(
    p_condition: float,
    p_given_condition: float
) -> float:
    """
    Multiplication rule: P(A ∩ B) = P(B) · P(A|B)

    Parameters:
        p_condition: P(B)
        p_given_condition: P(A|B)

    Returns:
        P(A ∩ B)
    """
    return p_condition * p_given_condition


def chain_rule(conditional_probs: List[float]) -> float:
    """
    Apply chain rule to compute joint probability.

    P(A₁ ∩ A₂ ∩ ... ∩ Aₙ) = P(A₁) · P(A₂|A₁) · P(A₃|A₁,A₂) · ...

    Parameters:
        conditional_probs: List [P(A₁), P(A₂|A₁), P(A₃|A₁,A₂), ...]

    Returns:
        Joint probability of all events
    """
    result = 1.0
    for p in conditional_probs:
        result *= p
    return result


# =============================================================
# Conditional Probability from Samples (Empirical)
# =============================================================

def empirical_conditional(
    joint_count: int,
    condition_count: int
) -> float:
    """
    Compute empirical conditional probability from counts.

    P̂(A|B) = Count(A ∩ B) / Count(B)

    Parameters:
        joint_count: Number of samples where both A and B occurred
        condition_count: Number of samples where B occurred

    Returns:
        Estimated conditional probability
    """
    if condition_count == 0:
        raise ValueError("No samples satisfy the condition")
    return joint_count / condition_count


def compute_conditional_from_data(
    data: List[Dict],
    target_key: str,
    target_value,
    condition_key: str,
    condition_value
) -> Tuple[float, int, int]:
    """
    Compute conditional probability from a dataset.

    Parameters:
        data: List of dictionaries (samples)
        target_key, target_value: Define event A
        condition_key, condition_value: Define event B

    Returns:
        (P(A|B), count of A∩B, count of B)
    """
    condition_samples = [d for d in data if d.get(condition_key) == condition_value]
    joint_samples = [d for d in condition_samples if d.get(target_key) == target_value]
    n_condition = len(condition_samples)
    n_joint = len(joint_samples)
    if n_condition == 0:
        return 0.0, 0, 0
    return n_joint / n_condition, n_joint, n_condition


# =============================================================
# Language Model Chain Rule Example
# =============================================================

class SimpleLanguageModel:
    """
    Simple n-gram language model demonstrating the chain rule.

    Models P(w₁, w₂, ..., wₙ) = ∏ᵢ P(wᵢ | w₁, ..., wᵢ₋₁)

    For simplicity, uses bigrams: P(wᵢ | wᵢ₋₁) instead of full history.
    """

    def __init__(self):
        self.unigram_counts: Dict[str, int] = {}
        self.bigram_counts: Dict[Tuple[str, str], int] = {}
        self.total_words = 0

    def train(self, sentences: List[List[str]]):
        """Train on tokenized sentences."""
        for sentence in sentences:
            # Add start token
            tokens = ["<START>"] + sentence
            for i, word in enumerate(tokens):
                # Unigram counts
                self.unigram_counts[word] = self.unigram_counts.get(word, 0) + 1
                self.total_words += 1
                # Bigram counts
                if i > 0:
                    bigram = (tokens[i-1], word)
                    self.bigram_counts[bigram] = self.bigram_counts.get(bigram, 0) + 1

    def p_word_given_previous(self, word: str, previous: str) -> float:
        """
        Compute P(word | previous) using MLE with Laplace smoothing.
        """
        bigram_count = self.bigram_counts.get((previous, word), 0)
        previous_count = self.unigram_counts.get(previous, 0)
        vocab_size = len(self.unigram_counts)
        # Laplace smoothing to avoid zero probabilities
        return (bigram_count + 1) / (previous_count + vocab_size)

    def sentence_probability(self, sentence: List[str]) -> float:
        """
        Compute P(w₁, ..., wₙ) using the chain rule.

        P(sentence) = P(w₁|<START>) · P(w₂|w₁) · ... · P(wₙ|wₙ₋₁)
        """
        tokens = ["<START>"] + sentence
        log_prob = 0.0  # Use log to avoid underflow
        for i in range(1, len(tokens)):
            p = self.p_word_given_previous(tokens[i], tokens[i-1])
            log_prob += np.log(p)
        return np.exp(log_prob)


if __name__ == "__main__":
    # Example 1: Die roll conditional probability
    print("=" * 60)
    print("Example 1: Die Roll")
    print("=" * 60)
    # P(greater than 3 | even)
    p_intersection = 2/6  # P({4,6})
    p_condition = 3/6     # P({2,4,6})
    p_conditional = conditional_probability(p_intersection, p_condition)
    print(f"P(>3 | even) = {p_conditional:.4f}")

    # Example 2: Chain rule - card drawing
    print("\n" + "=" * 60)
    print("Example 2: Drawing Two Aces (Chain Rule)")
    print("=" * 60)
    probs = [4/52, 3/51]  # P(A₁), P(A₂|A₁)
    p_two_aces = chain_rule(probs)
    print(f"P(two aces) = {p_two_aces:.6f} = 1/{round(1/p_two_aces)}")

    # Example 3: Empirical conditional from data
    print("\n" + "=" * 60)
    print("Example 3: Empirical Conditional (Click Data)")
    print("=" * 60)
    # Simulated click data
    data = [
        {"page_view": True, "clicked": True},
        {"page_view": True, "clicked": False},
        {"page_view": True, "clicked": True},
        {"page_view": True, "clicked": False},
        {"page_view": True, "clicked": False},
        {"page_view": False, "clicked": False},
    ]
    p_click_given_view, n_joint, n_view = compute_conditional_from_data(
        data, "clicked", True, "page_view", True
    )
    print(f"P(clicked | page_view) = {n_joint}/{n_view} = {p_click_given_view:.2f}")

    # Example 4: Language model
    print("\n" + "=" * 60)
    print("Example 4: Language Model (Chain Rule)")
    print("=" * 60)
    lm = SimpleLanguageModel()
    training_data = [
        ["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran"],
        ["a", "cat", "sat"],
    ]
    lm.train(training_data)
    test_sentence = ["the", "cat", "sat"]
    prob = lm.sentence_probability(test_sentence)
    print(f"P('{' '.join(test_sentence)}') = {prob:.6f}")
    # Show chain rule decomposition
    tokens = ["<START>"] + test_sentence
    print("Chain rule decomposition:")
    for i in range(1, len(tokens)):
        p = lm.p_word_given_previous(tokens[i], tokens[i-1])
        print(f"  P({tokens[i]} | {tokens[i-1]}) = {p:.4f}")
```

Conditional probability is deceptively subtle, and even experienced practitioners make errors. Let's examine common pitfalls.
P(A|B) and P(B|A) are generally not equal! This error, called the confusion of the inverse, leads to serious reasoning failures.
When computing P(A|B), we must account for the base rate—the unconditional probability P(A).
Example: A medical test is 99% accurate (detects disease 99% of the time if present, correctly clears 99% if absent). A patient tests positive. What's P(disease | positive)?
If the disease affects only 1 in 10,000 people, P(disease | positive) ≈ 1% despite the 'highly accurate' test! Most positives are false positives because the disease is so rare.
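Working the numbers from this example through the law of total probability and Bayes' theorem confirms the surprising result:

```python
# Base-rate computation for the medical test example
p_disease = 1 / 10_000   # prior: 1 in 10,000 people have the disease
sensitivity = 0.99       # P(positive | disease)
specificity = 0.99       # P(negative | no disease)

# Law of total probability: P(positive) over the disease / no-disease partition
p_positive = (sensitivity * p_disease
              + (1 - specificity) * (1 - p_disease))

# Bayes' theorem: P(disease | positive)
p_disease_given_positive = sensitivity * p_disease / p_positive
print(f"P(disease | positive) = {p_disease_given_positive:.4f}")  # ≈ 0.0098
```

Despite the 99% accurate test, the posterior is about 1%: the rare true positives are swamped by the 1% false-positive rate applied to the vast healthy majority.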
After observing multiple pieces of evidence, we must update sequentially and correctly:
Wrong: P(A | B ∩ C) = P(A|B) × P(A|C) ❌
Correct: Use Bayes' theorem and proper conditioning.
P(A|B) = P(A) only if A and B are independent. In most real scenarios, conditioning changes probabilities. Never assume independence without justification.
In ML classification, accuracy on rare classes can be misleading. A model that predicts 'not fraud' for everything achieves 99.9% accuracy if only 0.1% of transactions are fraudulent—but it's useless for detecting fraud. Conditional thinking (P(fraud | prediction)) matters more than overall accuracy.
The Law of Total Probability allows us to compute P(A) by breaking it down across a partition of the sample space.
Since the Bᵢ partition Ω, event A can be decomposed:
A = (A ∩ B₁) ∪ (A ∩ B₂) ∪ ... ∪ (A ∩ Bₙ)
These pieces are disjoint, so:
P(A) = Σᵢ P(A ∩ Bᵢ) = Σᵢ P(A|Bᵢ) · P(Bᵢ)
The law of total probability is the foundation of mixture models:
P(x) = Σₖ P(x | component k) · P(component k) = Σₖ πₖ · p(x | θₖ)
where πₖ is the mixing weight for component k and p(x | θₖ) is the component distribution.
Examples include Gaussian mixture models, where each component p(x | θₖ) is a Gaussian density, and mixture-of-experts architectures, where a gating network supplies the mixing weights.
When facing a complex probability calculation, ask: 'Can I partition the space into simpler cases?' If P(A) is hard to compute directly but P(A|Bᵢ) is easy for each piece of a partition, use total probability. This divide-and-conquer strategy is ubiquitous in probability.
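A small sketch of this divide-and-conquer strategy, using a hypothetical factory whose three machines partition production (the rates below are made up):

```python
# Hypothetical example: overall defect rate across three machines.
# The machines partition production: every item comes from exactly one machine.
p_machine = [0.50, 0.30, 0.20]       # P(Bᵢ): each machine's share of output
p_defect_given = [0.01, 0.02, 0.05]  # P(A|Bᵢ): defect rate per machine

# Law of total probability: P(A) = Σᵢ P(A|Bᵢ) · P(Bᵢ)
p_defect = sum(pa * pb for pa, pb in zip(p_defect_given, p_machine))
print(f"P(defect) = {p_defect:.3f}")  # 0.021
```

P(defect) is hard to state directly but trivial once split by machine: each easy conditional is weighted by how often its case occurs.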
Conditional probability isn't just a theoretical concept—it's the operational definition of what ML models compute.
A classifier learns P(Y | X), the conditional distribution of labels given features.
Every prediction is fundamentally a conditional probability.
Discriminative models learn P(Y | X) directly. Generative models learn P(X | Y) and P(Y), then use Bayes' theorem:
P(Y | X) = P(X | Y) · P(Y) / P(X)
Cross-entropy loss for classification:
L = -log P(Y_true | X)
Minimizing cross-entropy is equivalent to maximizing the conditional log-likelihood.
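A minimal sketch of this equivalence, using an illustrative predicted distribution over three classes:

```python
import math

# A classifier's predicted conditional distribution P(Y | X) over 3 classes
# (illustrative values, e.g. the output of a softmax layer)
p_y_given_x = [0.7, 0.2, 0.1]
true_class = 0

# Cross-entropy loss is the negative conditional log-likelihood of the label
loss = -math.log(p_y_given_x[true_class])
print(f"L = -log P(Y_true | X) = {loss:.4f}")  # -log 0.7 ≈ 0.3567
```

Driving the loss toward 0 forces P(Y_true | X) toward 1, so minimizing cross-entropy and maximizing the conditional log-likelihood are the same optimization.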
| Model | What It Learns | Conditional Form |
|---|---|---|
| Logistic Regression | Binary classification | P(Y=1 \| X) = σ(w^T X) |
| Softmax Classifier | Multi-class classification | P(Y=k \| X) = softmaxₖ(Wx) |
| Language Model (GPT) | Next token prediction | P(wₜ \| w₁...wₜ₋₁) |
| Conditional VAE | Conditional generation | P(X \| Z, Y) |
| Conditional GAN | Conditional generation | P(X \| Z, Y) |
| Sequence-to-Sequence | Sequence transduction | P(Y₁...Yₘ \| X₁...Xₙ) |
Nearly every ML prediction is a conditional probability: P(output | input). Whether classifying images, generating text, or predicting stock prices, we're asking 'What is the probability of this output, given this input?' Understanding conditional probability deeply means understanding what ML models actually compute.
Conditional probability transforms probability from a static description of uncertainty into a dynamic framework for learning from evidence.
What's Next:
Conditional probability has one critical limitation: it tells us P(A|B) if we know P(A ∩ B) and P(B). But what if we only know P(B|A)? The answer is Bayes' Theorem—the cornerstone of Bayesian reasoning, which inverts conditional probabilities and enables learning from data. This is the subject of our next page.
You now command conditional probability—the mechanism by which evidence updates beliefs. You can compute conditional probabilities, avoid common fallacies, apply the chain rule to sequential events, and recognize that ML models fundamentally output conditional probabilities. With the law of total probability, you can marginalize over unknown variables. Next: Bayes' theorem will complete your toolkit for probabilistic reasoning.