A classifier that outputs only "spam" or "not spam" provides limited information. But one that says "92% probability of spam" enables nuanced decision-making. Probabilistic classification produces calibrated probability estimates, not just class labels.
This shift from hard to soft predictions unlocks cost-sensitive decision-making, tunable classification thresholds, and explicit uncertainty quantification, including the option to abstain on hard cases.
By the end of this page, you will understand posterior probability estimation and its interpretation, what makes probability estimates 'calibrated', how to evaluate and improve calibration, and why probabilistic outputs enable better decision-making than hard labels.
The Central Object of Probabilistic Classification
The posterior probability $P(Y = 1 | \mathbf{X} = \mathbf{x})$ represents our belief that input $\mathbf{x}$ belongs to class 1, given all evidence in the features.
Using Bayes' theorem: $$P(Y = 1 | \mathbf{X}) = \frac{P(\mathbf{X} | Y = 1) \cdot P(Y = 1)}{P(\mathbf{X})}$$
where $P(\mathbf{X} \mid Y = 1)$ is the class-conditional likelihood, $P(Y = 1)$ is the prior probability of class 1, and $P(\mathbf{X})$ is the evidence (the marginal probability of the features).
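As a quick numeric illustration (all numbers hypothetical), the posterior follows directly from the likelihoods and the prior:

```python
# Hypothetical numbers for illustration: prior spam rate and class-conditional
# likelihoods of observing a particular feature vector x.
prior_spam = 0.3    # P(Y=1)
lik_spam = 0.012    # P(x | Y=1), assumed
lik_ham = 0.002     # P(x | Y=0), assumed

# Evidence P(x) via the law of total probability
evidence = lik_spam * prior_spam + lik_ham * (1 - prior_spam)

posterior_spam = lik_spam * prior_spam / evidence
print(f"P(Y=1 | x) = {posterior_spam:.3f}")  # ≈ 0.720
```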
Two Approaches to Posterior Estimation
| Approach | Models Directly | Examples | Strengths |
|---|---|---|---|
| Discriminative | $P(Y|\mathbf{X})$ | Logistic regression, Neural networks | Often more accurate, requires fewer modeling assumptions |
| Generative | $P(\mathbf{X}|Y)$ and $P(Y)$ | Naive Bayes, LDA, GMM | Handles missing data, works with small samples, interpretable |
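The table above contrasts the two routes to the posterior. A minimal sketch of both in scikit-learn, using logistic regression (discriminative) and Gaussian naive Bayes (generative) on synthetic data that is purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data for illustration only
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Discriminative: models P(Y|X) directly
disc = LogisticRegression().fit(X, y)

# Generative: models P(X|Y) and P(Y), then applies Bayes' theorem
gen = GaussianNB().fit(X, y)

# Both expose posterior estimates through predict_proba
print(disc.predict_proba(X[:3])[:, 1])
print(gen.predict_proba(X[:3])[:, 1])
```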
Interpreting Posterior Probabilities
A well-calibrated posterior $\hat{p} = P(Y=1|\mathbf{x})$ has a precise meaning:
Among all instances where the model predicts probability $\hat{p}$, approximately $\hat{p}$ fraction truly belong to class 1.
For example, if the model says 70% probability of rain for 100 different days, we expect rain on approximately 70 of those days.
With calibrated probabilities and known costs, the Bayes-optimal decision rule is: predict class 1 if $P(Y=1|\mathbf{x}) \cdot C_{10} > P(Y=0|\mathbf{x}) \cdot C_{01}$, where $C_{ij}$ is the cost of predicting $j$ when the truth is $i$ (so $C_{10}$ penalizes missed positives and $C_{01}$ penalizes false alarms). Hard classifiers, which output only labels, cannot support such cost-sensitive decisions.
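Rearranging, this rule is equivalent to thresholding the posterior at $\tau = C_{01} / (C_{01} + C_{10})$. A minimal sketch, with illustrative costs for a hypothetical spam filter:

```python
def cost_sensitive_decision(p_pos, cost_fp, cost_fn):
    """Predict 1 when the expected cost of predicting 1 is lower than predicting 0.

    cost_fp: cost of predicting 1 when the truth is 0 (C_01)
    cost_fn: cost of predicting 0 when the truth is 1 (C_10)
    """
    threshold = cost_fp / (cost_fp + cost_fn)
    return int(p_pos >= threshold), threshold

# Illustrative costs: a missed spam message (false negative) is mildly annoying,
# while a legitimate message sent to spam (false positive) is 10x worse.
decision, tau = cost_sensitive_decision(p_pos=0.92, cost_fp=10.0, cost_fn=1.0)
print(f"threshold = {tau:.3f}, decision = {decision}")  # threshold ≈ 0.909
```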
Definition of Calibration
A classifier is perfectly calibrated if: $$P(Y = 1 | \hat{p}(\mathbf{X}) = p) = p \quad \forall p \in [0, 1]$$
In words: when the model predicts probability $p$, the true proportion of positives is exactly $p$.
Common Calibration Problems
In practice, models are often overconfident (predicted probabilities are more extreme than the observed frequencies, a common failure mode of modern neural networks) or underconfident (probabilities are pulled toward 0.5 even when the evidence is strong).
Reliability Diagrams
The standard tool for assessing calibration:
```python
import numpy as np
from sklearn.calibration import calibration_curve

def compute_calibration_metrics(y_true, y_prob, n_bins=10):
    """Compute calibration metrics and reliability data."""
    # Reliability curve: mean predicted probability vs. observed frequency per bin
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)

    # Expected Calibration Error (ECE): bin-size-weighted gap between
    # average confidence and observed accuracy in each bin
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        # Include probabilities equal to 1.0 in the last bin
        upper_ok = (y_prob <= bin_edges[i + 1]) if i == n_bins - 1 else (y_prob < bin_edges[i + 1])
        mask = (y_prob >= bin_edges[i]) & upper_ok
        if mask.sum() > 0:
            bin_acc = y_true[mask].mean()
            bin_conf = y_prob[mask].mean()
            ece += mask.sum() * abs(bin_acc - bin_conf)
    ece /= len(y_true)

    return {"prob_true": prob_true, "prob_pred": prob_pred, "ece": ece}

# Example usage
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6, 0.4, 0.85, 0.75])

metrics = compute_calibration_metrics(y_true, y_prob, n_bins=5)
print(f"Expected Calibration Error: {metrics['ece']:.4f}")
```

When classifiers produce poorly calibrated probabilities, post-hoc calibration can improve them.
Platt Scaling
Fit a logistic regression on the classifier's outputs: $$P(Y=1|s) = \frac{1}{1 + \exp(As + B)}$$
where $s$ is the classifier's score. Parameters $A$ and $B$ are learned on a held-out calibration set.
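A minimal sketch of Platt scaling, assuming `scores_cal` and `y_cal` come from a hypothetical held-out calibration set; scikit-learn's `LogisticRegression` fits the sigmoid for us (its coefficient and intercept play the roles of $-A$ and $-B$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out calibration set: raw classifier scores and true labels
scores_cal = np.array([-2.1, -1.3, -0.4, 0.2, 0.9, 1.5, 2.3, 3.0])
y_cal = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Fit the sigmoid mapping from score to probability
platt = LogisticRegression().fit(scores_cal.reshape(-1, 1), y_cal)

# Calibrated probabilities for new scores
new_scores = np.array([[-1.0], [0.5], [2.0]])
print(platt.predict_proba(new_scores)[:, 1])
```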
Isotonic Regression
Fit a non-decreasing step function mapping scores to probabilities. More flexible than Platt scaling but requires more data.
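A corresponding sketch with scikit-learn's `IsotonicRegression`, again on a hypothetical held-out calibration set:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical held-out calibration set: model scores (or raw probabilities) and labels
scores_cal = np.array([0.05, 0.2, 0.3, 0.45, 0.55, 0.6, 0.8, 0.95])
y_cal = np.array([0, 0, 1, 0, 1, 1, 1, 1])

# Fit a non-decreasing step function from scores to probabilities
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores_cal, y_cal)

print(iso.predict([0.1, 0.5, 0.9]))
```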
Temperature Scaling
For neural networks, divide logits by a learned temperature $T$: $$P(Y=1) = \sigma(z/T)$$
$T > 1$ softens probabilities (reduces overconfidence).
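A sketch of temperature scaling in NumPy; in practice $T$ is chosen by minimizing the negative log-likelihood on a validation set, approximated here by a crude grid search over hypothetical validation logits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(y, p, eps=1e-12):
    # Negative log-likelihood (log loss) with clipping for numerical safety
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical validation logits from an overconfident network, with true labels
logits_val = np.array([4.0, 3.2, -3.5, 2.8, -4.1, 3.9, -2.5, 3.1])
y_val = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Pick the temperature that minimizes validation NLL
temps = np.linspace(0.5, 5.0, 46)
best_T = min(temps, key=lambda T: nll(y_val, sigmoid(logits_val / T)))
print(f"Chosen T = {best_T:.2f}")

# Calibrated probability for a new logit
print(sigmoid(2.7 / best_T))
```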
| Method | Flexibility | Data Required | Best For |
|---|---|---|---|
| Platt Scaling | Low (2 params) | Moderate | SVMs, small datasets |
| Isotonic Regression | High | Large | General purpose with enough data |
| Temperature Scaling | Very Low (1 param) | Moderate | Neural networks |
| Beta Calibration | Medium (3 params) | Moderate | Handling boundary effects |
Never calibrate on training data—this leads to overfitting and optimistic estimates. Always use a separate calibration set or cross-validation. The calibration set should be representative of the deployment distribution.
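One way to respect this in scikit-learn is `CalibratedClassifierCV`, which handles the split internally via cross-validation so each fold's calibrator is fit on data the base model did not see. A sketch on synthetic data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data for illustration; hold out a test set for evaluation
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Platt-style ("sigmoid") calibration of an SVM, with 5-fold internal CV
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

print(calibrated.predict_proba(X_test[:3])[:, 1])
```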
Probability estimates must be converted to decisions. The threshold $\tau$ determines the cutoff:
$$\hat{y} = \begin{cases} 1 & \text{if } P(Y=1|\mathbf{x}) \geq \tau \\ 0 & \text{otherwise} \end{cases}$$
Choosing the Threshold
```python
import numpy as np
from sklearn.metrics import f1_score

def find_optimal_threshold(y_true, y_prob, metric='f1'):
    """Find the threshold that optimizes the specified metric."""
    thresholds = np.linspace(0.01, 0.99, 99)
    scores = []
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        if metric == 'f1':
            scores.append(f1_score(y_true, y_pred, zero_division=0))
    best_idx = np.argmax(scores)
    return thresholds[best_idx], scores[best_idx]

# Example
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6, 0.4, 0.85, 0.75])

best_t, best_f1 = find_optimal_threshold(y_true, y_prob)
print(f"Optimal threshold: {best_t:.2f}, F1: {best_f1:.4f}")
```

Probabilistic classifiers naturally express uncertainty. Understanding the different types of uncertainty is crucial:
Aleatoric Uncertainty
Irreducible uncertainty from inherent randomness in the data. Even with infinite data and a perfect model, some inputs genuinely could be either class. Reflected in posterior probabilities near 0.5.
Epistemic Uncertainty
Reducible uncertainty from limited knowledge—insufficient data or model limitations. Bayesian methods can estimate this by maintaining distributions over model parameters.
| Type | Source | Reducible? | How to Address |
|---|---|---|---|
| Aleatoric | Inherent class overlap | No | Accept as irreducible error |
| Epistemic (data) | Limited training data | Yes | Collect more data |
| Epistemic (model) | Model misspecification | Yes | Use better model class |
| Distribution shift | Train/test mismatch | Partially | Domain adaptation, monitoring |
High uncertainty predictions (probabilities near 0.5) are prime candidates for human review. In high-stakes applications, define an 'abstention region' where the classifier defers to human judgment rather than making unreliable predictions.
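A minimal sketch of such an abstention rule, assuming an illustrative band of width 0.2 around 0.5 inside which predictions are deferred to a human:

```python
import numpy as np

def predict_with_abstention(y_prob, lower=0.4, upper=0.6):
    """Return 1/0 for confident predictions and -1 (abstain) inside the band."""
    y_prob = np.asarray(y_prob)
    return np.where(y_prob >= upper, 1, np.where(y_prob <= lower, 0, -1))

y_prob = np.array([0.05, 0.45, 0.55, 0.92, 0.61, 0.38])
print(predict_with_abstention(y_prob))  # [ 0 -1 -1  1  1  0]
```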
How do we evaluate probability estimates? Proper scoring rules are loss functions that are minimized when the predicted probabilities match the true probabilities.
Log Loss (Cross-Entropy)
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n [y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]$$
The most common choice. Heavily penalizes confident wrong predictions.
Brier Score
$$\text{BS} = \frac{1}{n}\sum_{i=1}^n (\hat{p}_i - y_i)^2$$
Mean squared error for probabilities. Less sensitive to extreme errors than log loss.
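Both scores are available in scikit-learn; a quick sketch on the toy predictions used earlier:

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6, 0.4, 0.85, 0.75])

# Lower is better for both proper scoring rules
print(f"Log loss:    {log_loss(y_true, y_prob):.4f}")
print(f"Brier score: {brier_score_loss(y_true, y_prob):.4f}")
```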
You now understand classification as probability estimation. This perspective is essential for logistic regression and beyond. Next, we explore linear classifiers—the workhorses of probabilistic classification.