We've seen that P(A|B) and P(B|A) are generally different quantities. But what if we know P(B|A) and need P(A|B)? This is one of the most common situations in practice: a medical test tells us P(positive | disease), but the patient cares about P(disease | positive).
Bayes' theorem provides the bridge from one conditional direction to the other. It is arguably the most important theorem in probability, and the foundation of Bayesian statistics and machine learning.
Named after Reverend Thomas Bayes (1702-1761), who first formulated a special case, the theorem in its general form is a simple algebraic consequence of the definition of conditional probability—yet its implications are profound and far-reaching.
By the end of this page, you will derive Bayes' theorem from first principles, understand its components (prior, likelihood, evidence, posterior), apply it to practical classification problems, and recognize its central role throughout probabilistic machine learning.
Bayes' theorem follows directly from the definition of conditional probability and the multiplication rule.
From the definition P(A|B) = P(A ∩ B) / P(B), we get:
P(A ∩ B) = P(A|B) · P(B)
But we can also write:
P(A ∩ B) = P(B|A) · P(A)
Since both expressions equal P(A ∩ B):
P(A|B) · P(B) = P(B|A) · P(A)
Solving for P(A|B):

P(A|B) = P(B|A) · P(A) / P(B)

This is Bayes' Theorem in its basic form.
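The two-line derivation can be checked numerically. Here is a minimal sketch; the joint and marginal probabilities are invented for illustration:

```python
# Hypothetical probabilities for two events A and B (illustrative numbers)
p_a_and_b = 0.12   # P(A ∩ B)
p_a = 0.3          # P(A)
p_b = 0.4          # P(B)

# Both conditionals come from the same joint probability
p_a_given_b = p_a_and_b / p_b   # P(A|B)
p_b_given_a = p_a_and_b / p_a   # P(B|A)

# Bayes' theorem recovers P(A|B) from P(B|A) and the marginals
bayes = p_b_given_a * p_a / p_b
print(bayes, p_a_given_b)   # the two values agree
```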
The proof is just two lines of algebra! Yet this simple result enables sophisticated reasoning about uncertain hypotheses, parameter estimation, medical diagnosis, spam filtering, and much of modern machine learning. Sometimes the most profound ideas have the simplest derivations.
We can expand P(B) using the law of total probability:
P(B) = P(B|A) · P(A) + P(B|A^c) · P(A^c)
Substituting:

P(A|B) = P(B|A) · P(A) / [P(B|A) · P(A) + P(B|A^c) · P(A^c)]

This form is useful when P(B) isn't given directly but can be computed from the components.
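A quick numeric check of the expanded form; the prevalence, sensitivity, and false-positive rate below are assumed for illustration:

```python
p_a = 0.01              # P(A): prior, e.g. disease prevalence (illustrative)
p_b_given_a = 0.99      # P(B|A): sensitivity
p_b_given_not_a = 0.05  # P(B|A^c): false-positive rate

# Law of total probability gives the denominator P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = p_b_given_a * p_a / p_b
print(f"P(B) = {p_b:.4f}, P(A|B) = {posterior:.3f}")  # posterior ≈ 0.167
```

Despite a 99% sensitive test, the posterior is only about 1/6, because false positives from the large healthy population dominate.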
If we have mutually exclusive, exhaustive hypotheses H₁, H₂, ..., Hₙ:
P(Hᵢ | E) = P(E | Hᵢ) · P(Hᵢ) / Σⱼ P(E | Hⱼ) · P(Hⱼ)
This is the form used in classification: given evidence E (features), which hypothesis Hᵢ (class) is most probable?
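A sketch with three made-up hypotheses, showing how the shared denominator normalizes the posteriors so they sum to 1:

```python
import numpy as np

# Illustrative priors and likelihoods for hypotheses H1, H2, H3
priors = np.array([0.5, 0.3, 0.2])       # P(H_i), must sum to 1
likelihoods = np.array([0.1, 0.4, 0.8])  # P(E | H_i)

numerators = likelihoods * priors
posteriors = numerators / numerators.sum()  # divide by P(E) = Σⱼ P(E|Hⱼ)P(Hⱼ)

print(posteriors)        # H3 is most probable despite the smallest prior
print(posteriors.sum())  # 1.0
```

Note that H3, with the smallest prior, wins because its likelihood is high enough to overcome the prior.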
Each component of Bayes' theorem has a name and a specific interpretation:
Posterior = Likelihood × Prior / Evidence
Let's break down each term:
| Term | Name | Interpretation | Where It Comes From |
|---|---|---|---|
| P(H) | Prior | Probability of hypothesis BEFORE seeing evidence | Background knowledge, historical rates, initial beliefs |
| P(E|H) | Likelihood | Probability of evidence IF hypothesis is true | Model of how evidence is generated |
| P(E) | Evidence (Marginal Likelihood) | Total probability of observing this evidence | Normalization: ensures posterior sums to 1 |
| P(H|E) | Posterior | Probability of hypothesis AFTER seeing evidence | What we're trying to compute! |
Prior P(H): What we believed before seeing any data. If 1% of patients have a disease, P(disease) = 0.01.
Likelihood P(E|H): How well the evidence fits the hypothesis. If the disease causes a positive test 99% of the time, P(positive | disease) = 0.99.
Evidence P(E): How common is this evidence overall? This accounts for both true positives (disease + positive) and false positives (no disease + positive).
Posterior P(H|E): Our updated belief after seeing evidence. This is what we want for making decisions.
Bayes' theorem describes a belief update: start with a prior P(H), observe evidence E, and compute the posterior P(H|E). The posterior then serves as the prior for the next piece of evidence.
This iterative process is the essence of Bayesian learning.
P(H|E) ≠ P(E|H) × P(H) in general. The evidence P(E) in the denominator is crucial—it normalizes the product into a valid probability. Without it, the result might exceed 1 or not sum to 1 across all hypotheses.
Ignoring the prior (base rate) leads to wildly incorrect conclusions. This error is common in:
- Medical diagnosis (overconfidence in test results)
- Criminal justice (DNA matching with large suspect pools)
- Fraud detection (rare fraud + many transactions = many false alarms)
- ML classification (imbalanced classes)
Always consider the prior!
Bayes' theorem appears throughout machine learning in multiple forms and at multiple levels.
Classifiers compute P(Y | X), which can be written using Bayes:
P(Y = k | X) = P(X | Y = k) × P(Y = k) / P(X)
Generative models (like Naive Bayes, Gaussian Discriminant Analysis) learn the likelihood P(X | Y) and prior P(Y), then apply Bayes.
Discriminative models (like logistic regression, neural nets) learn P(Y | X) directly, implicitly combining the Bayesian pieces.
Bayesian learning treats model parameters θ as random variables:
P(θ | Data) = P(Data | θ) × P(θ) / P(Data)
This is fundamentally different from frequentist approaches where θ is a fixed (unknown) value.
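To make the parameter-level formula concrete, here is a hedged grid-approximation sketch for estimating a coin's bias θ. The flat prior and the 7-heads-in-10-flips data are invented for illustration:

```python
import numpy as np

# Discretize θ on a fine grid (grid approximation to the posterior)
theta = np.linspace(0.001, 0.999, 999)

prior = np.ones_like(theta)   # flat prior P(θ) (assumed)
prior /= prior.sum()

heads, flips = 7, 10          # illustrative data D
likelihood = theta**heads * (1 - theta)**(flips - heads)  # P(D | θ)

unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()   # divide by P(D)

map_estimate = theta[np.argmax(posterior)]      # with a flat prior, MAP = MLE
posterior_mean = (theta * posterior).sum()
print(f"MAP ≈ {map_estimate:.3f}, posterior mean ≈ {posterior_mean:.3f}")
```

The posterior is a full distribution over θ, not a single number; the MAP estimate and posterior mean are just two summaries of it.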
| Level | What's the 'Hypothesis'? | What's the 'Evidence'? | Application |
|---|---|---|---|
| Classification | Class label Y | Features X | P(Y|X) for prediction |
| Parameter Learning | Model parameters θ | Training data D | P(θ|D) for Bayesian learning |
| Model Selection | Model M | Data D | P(M|D) for model comparison |
| Causal Inference | Causal structure G | Observational data | P(G|data) for structure learning |
Modern Bayesian deep learning applies Bayes at the parameter level: predictions average over the posterior, P(y | x, Data) = ∫ P(y | x, θ) P(θ | Data) dθ.
The integral (marginalizing over posterior) is intractable for neural networks, so approximations (variational inference, Monte Carlo dropout, ensembles) are used.
L2 regularization (weight decay) is equivalent to a Gaussian prior on weights:
P(W) = N(0, σ²I)
L1 regularization corresponds to a Laplace prior. The regularization strength λ relates to the prior variance. Bayesian thinking reveals that 'regularization' is really 'prior belief that weights should be small.'
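One way to see the correspondence is to take the negative log of the parameter-level Bayes formula under the Gaussian prior (a sketch of the standard argument; σ² is the prior variance):

```latex
-\log P(W \mid \text{Data})
  = -\log P(\text{Data} \mid W) \;-\; \log P(W) \;+\; \text{const}
  = \underbrace{-\log P(\text{Data} \mid W)}_{\text{training loss}}
    \;+\; \frac{1}{2\sigma^2}\lVert W \rVert^2 \;+\; \text{const}
```

Minimizing this negative log-posterior is exactly minimizing the training loss plus λ‖W‖² with λ = 1/(2σ²): the MAP estimate coincides with the L2-regularized solution, and a smaller prior variance means stronger regularization.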
Bayes' theorem has an elegant form when expressed in terms of odds rather than probabilities.
The odds of event A is:
O(A) = P(A) / P(A^c) = P(A) / (1 - P(A))
For example, if P(A) = 0.75, then O(A) = 0.75 / 0.25 = 3 (or '3 to 1').
Posterior Odds = Prior Odds × Likelihood Ratio
O(H|E) = O(H) × P(E|H) / P(E|H^c)
The term P(E|H) / P(E|H^c) is called the likelihood ratio or Bayes factor.
Multiplicative updates: Each piece of evidence multiplies the odds by its likelihood ratio. Clean, composable.
Evidence aggregation: With independent evidence E₁, E₂, ...: O(H | E₁, E₂, ...) = O(H) × LR₁ × LR₂ × ...
Natural in log-space: Taking logs gives additive updates: log O(H|E) = log O(H) + log LR
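These properties can be sketched in a few lines; the prior and the likelihood ratios below are invented for illustration:

```python
import math

def logit(p: float) -> float:
    """Log-odds: log(p / (1-p))."""
    return math.log(p / (1 - p))

def sigmoid(x: float) -> float:
    """Inverse logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

prior = 0.01                  # illustrative prior probability
log_odds = logit(prior)

# Two independent pieces of evidence with assumed likelihood ratios
for lr in [10.0, 5.0]:
    log_odds += math.log(lr)  # each piece of evidence adds its log-LR

posterior = sigmoid(log_odds)
print(f"posterior = {posterior:.3f}")
```

Multiplying the odds by 10 and then by 5 is the same as adding log 10 + log 5 in log-odds space; converting back with the sigmoid gives the posterior probability.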
The log-odds (or logit) is:
logit(P) = log[P / (1-P)]
This maps [0, 1] to (-∞, +∞) and is the basis of logistic regression.
Logistic regression models the log-odds as a linear function of features:
log[P(Y=1|X) / P(Y=0|X)] = w^T X + b
This is the log-odds form of Bayes with a linear log-likelihood ratio!
```python
import numpy as np
from typing import Tuple, Dict, List

# =============================================================
# Bayes' Theorem Implementations
# =============================================================

def bayes_theorem(
    prior: float,
    likelihood: float,
    false_positive_rate: float
) -> float:
    """
    Compute posterior probability using Bayes' theorem.

    P(H|E) = P(E|H) * P(H) / P(E)
    where P(E) = P(E|H)P(H) + P(E|¬H)P(¬H)

    Parameters:
        prior: P(H) - prior probability of hypothesis
        likelihood: P(E|H) - probability of evidence if H is true
        false_positive_rate: P(E|¬H) - probability of evidence if H is false

    Returns:
        P(H|E) - posterior probability
    """
    # Compute P(E) using total probability
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    if evidence == 0:
        return 0.0
    posterior = (likelihood * prior) / evidence
    return posterior


def bayes_multiple_hypotheses(
    priors: np.ndarray,
    likelihoods: np.ndarray
) -> np.ndarray:
    """
    Bayes' theorem for multiple mutually exclusive hypotheses.

    P(Hᵢ|E) = P(E|Hᵢ) * P(Hᵢ) / Σⱼ P(E|Hⱼ) * P(Hⱼ)

    Parameters:
        priors: Array of prior probabilities [P(H₁), P(H₂), ...]
        likelihoods: Array of likelihoods [P(E|H₁), P(E|H₂), ...]

    Returns:
        Array of posterior probabilities
    """
    assert len(priors) == len(likelihoods)
    assert np.isclose(priors.sum(), 1.0), "Priors must sum to 1"
    numerators = likelihoods * priors
    evidence = numerators.sum()
    posteriors = numerators / evidence
    return posteriors


def odds(probability: float) -> float:
    """Convert probability to odds: O(A) = P(A) / (1 - P(A))"""
    if probability >= 1:
        return float('inf')
    return probability / (1 - probability)


def probability_from_odds(o: float) -> float:
    """Convert odds to probability: P = O / (1 + O)"""
    if np.isinf(o):
        return 1.0
    return o / (1 + o)


def logit(p: float) -> float:
    """Log-odds (logit function): log(p / (1-p))"""
    if p <= 0:
        return float('-inf')
    if p >= 1:
        return float('inf')
    return np.log(p / (1 - p))


def sigmoid(x: float) -> float:
    """Inverse logit (sigmoid): 1 / (1 + exp(-x))"""
    return 1 / (1 + np.exp(-x))


def bayes_odds_form(
    prior_odds: float,
    likelihood_ratio: float
) -> Tuple[float, float]:
    """
    Bayes' theorem in odds form.

    Posterior Odds = Prior Odds × Likelihood Ratio

    Returns:
        (posterior_odds, posterior_probability)
    """
    posterior_odds = prior_odds * likelihood_ratio
    posterior_prob = probability_from_odds(posterior_odds)
    return posterior_odds, posterior_prob


def sequential_bayes(
    prior: float,
    observations: List[Tuple[float, float]]
) -> List[float]:
    """
    Apply Bayes' theorem sequentially for multiple observations.

    Parameters:
        prior: Initial prior P(H)
        observations: List of (likelihood_if_H, likelihood_if_not_H) tuples

    Returns:
        List of posterior probabilities after each observation
    """
    posteriors = []
    current = prior
    for likelihood_h, likelihood_not_h in observations:
        current = bayes_theorem(current, likelihood_h, likelihood_not_h)
        posteriors.append(current)
    return posteriors


# =============================================================
# Naive Bayes Classifier
# =============================================================

class NaiveBayesClassifier:
    """
    Simple Naive Bayes classifier demonstrating Bayes' theorem.

    Assumes binary features and binary classification.
    """

    def __init__(self, alpha: float = 1.0):
        """
        Parameters:
            alpha: Laplace smoothing parameter
        """
        self.alpha = alpha
        self.class_prior: Dict[int, float] = {}
        self.feature_likelihood: Dict[Tuple[int, int, int], float] = {}

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Train the classifier on binary feature matrix X and labels y."""
        n_samples, n_features = X.shape
        classes = np.unique(y)

        # Compute class priors
        for c in classes:
            self.class_prior[c] = (np.sum(y == c) + self.alpha) / (
                n_samples + self.alpha * len(classes)
            )

        # Compute feature likelihoods P(Xⱼ=v | Y=c)
        for c in classes:
            X_c = X[y == c]
            n_c = len(X_c)
            for j in range(n_features):
                for v in [0, 1]:
                    count = np.sum(X_c[:, j] == v)
                    self.feature_likelihood[(j, v, c)] = (count + self.alpha) / (
                        n_c + 2 * self.alpha
                    )

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict class probabilities using Bayes' theorem."""
        n_samples, n_features = X.shape
        classes = list(self.class_prior.keys())
        log_probs = np.zeros((n_samples, len(classes)))

        for i, c in enumerate(classes):
            # Start with log prior
            log_probs[:, i] = np.log(self.class_prior[c])
            # Add log likelihoods for each feature
            for j in range(n_features):
                for sample_idx in range(n_samples):
                    v = int(X[sample_idx, j])
                    log_probs[sample_idx, i] += np.log(self.feature_likelihood[(j, v, c)])

        # Normalize to get proper probabilities (softmax)
        log_probs -= np.max(log_probs, axis=1, keepdims=True)  # Numerical stability
        probs = np.exp(log_probs)
        probs /= probs.sum(axis=1, keepdims=True)
        return probs

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict most likely class."""
        probs = self.predict_proba(X)
        classes = list(self.class_prior.keys())
        return np.array([classes[i] for i in probs.argmax(axis=1)])


if __name__ == "__main__":
    # Example 1: Medical diagnosis
    print("=" * 60)
    print("Example 1: Medical Diagnosis")
    print("=" * 60)

    prior = 0.001          # Disease prevalence
    sensitivity = 0.99     # P(positive | disease)
    false_positive = 0.01  # P(positive | no disease)

    posterior = bayes_theorem(prior, sensitivity, false_positive)
    print(f"Prior P(disease): {prior}")
    print(f"P(positive | disease): {sensitivity}")
    print(f"P(positive | no disease): {false_positive}")
    print(f"Posterior P(disease | positive): {posterior:.4f} = {posterior*100:.1f}%")

    # Example 2: Odds form
    print("\n" + "=" * 60)
    print("Example 2: Odds Form")
    print("=" * 60)

    prior_odds = odds(prior)
    LR = sensitivity / false_positive
    post_odds, post_prob = bayes_odds_form(prior_odds, LR)
    print(f"Prior odds: {prior_odds:.4f}")
    print(f"Likelihood ratio: {LR:.1f}")
    print(f"Posterior odds: {post_odds:.4f}")
    print(f"Posterior probability: {post_prob:.4f}")

    # Example 3: Sequential updates
    print("\n" + "=" * 60)
    print("Example 3: Sequential Bayes Updates")
    print("=" * 60)

    # Multiple positive tests (assuming independence)
    observations = [(0.99, 0.01)] * 3  # Three positive tests
    posteriors = sequential_bayes(0.001, observations)
    print(f"Prior: {0.001}")
    for i, p in enumerate(posteriors, 1):
        print(f"After test {i}: P(disease) = {p:.4f}")

    # Example 4: Naive Bayes on synthetic data
    print("\n" + "=" * 60)
    print("Example 4: Naive Bayes Classifier")
    print("=" * 60)

    # Create simple dataset
    np.random.seed(42)
    n_samples = 200
    X = np.random.binomial(1, 0.5, (n_samples, 4))
    # Class 1 if features 0 and 1 are both 1
    y = ((X[:, 0] == 1) & (X[:, 1] == 1)).astype(int)

    clf = NaiveBayesClassifier()
    clf.fit(X, y)

    # Test predictions
    test_X = np.array([
        [1, 1, 0, 0],  # Should predict 1
        [0, 0, 1, 1],  # Should predict 0
        [1, 0, 1, 0],  # Uncertain
    ])
    probs = clf.predict_proba(test_X)
    preds = clf.predict(test_X)
    print("Test predictions:")
    for x, prob, pred in zip(test_X, probs, preds):
        print(f"  {x} -> P(class=1) = {prob[1]:.3f}, predicted: {pred}")
```

Bayes' theorem is simple to state but subtle to apply correctly. Here are common errors and their corrections.
Error: Treating the likelihood P(E|H) as if it were the posterior P(H|E).
Example: 'The test is 99% accurate, so if you test positive, there's a 99% chance you have the disease.'
Reality: This ignores the disease's base rate. With a rare disease, most positives are false positives despite the 'accurate' test.
Error: Using a prior that doesn't match the relevant population.
Example: Using the general population's disease rate when the patient already has symptoms that pre-select a higher-risk group.
Solution: The prior should reflect everything known BEFORE the current evidence, including any pre-screening.
Error: Computing P(E|H) when you need P(H|E), or vice versa.
Example: 'Most spam emails contain the word "free". Therefore, an email with "free" is probably spam.'

Reality: P("free" | spam) being high doesn't mean P(spam | "free") is high. Many legitimate emails also contain "free".
Error: Treating a positive test as proof of the hypothesis.
Bayesian Reality: Evidence shifts probabilities; it rarely proves or disproves. Even with strong evidence (high likelihood ratio), a very low prior keeps the posterior modest.
The fundamental lesson of Bayes' theorem is that these are different quantities:
- P(wet streets | rain) ≈ 1 (rain makes streets wet)
- P(rain | wet streets) << 1 (streets could be wet from many causes)
Bayes' theorem is the ONLY valid way to convert one into the other.
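The asymmetry can be checked directly from a joint table; the numbers below are invented for illustration:

```python
# Hypothetical joint probabilities for rain and street wetness
p = {
    ("rain", "wet"): 0.10,
    ("rain", "dry"): 0.00,     # rain essentially always wets the streets
    ("no_rain", "wet"): 0.15,  # sprinklers, street cleaning, ...
    ("no_rain", "dry"): 0.75,
}

p_wet_given_rain = p[("rain", "wet")] / (p[("rain", "wet")] + p[("rain", "dry")])
p_rain_given_wet = p[("rain", "wet")] / (p[("rain", "wet")] + p[("no_rain", "wet")])

print(p_wet_given_rain)  # 1.0
print(p_rain_given_wet)  # 0.4
```

The same joint distribution yields P(wet | rain) = 1 but P(rain | wet) = 0.4: conditioning in the two directions divides the same joint probability by different marginals.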
Bayes' theorem is mathematical fact, but its interpretation divides statisticians into 'Bayesian' and 'frequentist' camps.
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Point estimation | Maximum Likelihood Estimate (MLE) | Maximum A Posteriori (MAP) or Posterior Mean |
| Interval estimation | Confidence interval | Credible interval |
| Model comparison | Likelihood ratio tests, AIC | Bayes factors, marginal likelihood |
| Regularization view | Penalty term | Prior distribution |
| Handling of parameters | Fixed unknown constants | Random variables with distributions |
| Sequential updates | Requires full re-analysis | Natural: posterior becomes new prior |
Most practitioners today use both Bayesian and frequentist tools depending on the problem. Deep learning is mostly frequentist (MLE via SGD), but Bayesian neural networks, Gaussian processes, and probabilistic programming bring Bayesian methods to modern ML. Understanding both perspectives enriches your toolkit.
Bayes' theorem is the mathematical formalization of learning from evidence—the essence of what machine learning systems do.
Module Complete!
You have now mastered the foundations of probability theory essential for machine learning.
These concepts form the bedrock upon which all probabilistic machine learning is built—from simple Naive Bayes classifiers to sophisticated Bayesian neural networks and generative models.
Congratulations! You now possess a rigorous understanding of probability theory's foundations. You can reason precisely about uncertainty, update beliefs correctly with Bayes' theorem, recognize independence structures, and understand why probabilistic thinking is at the heart of machine learning. You're prepared for the next module: Random Variables and Distributions.