A classifier that outputs only "spam" or "not spam" provides limited information. But one that says "92% probability of spam" enables nuanced decision-making. Probabilistic classification produces calibrated probability estimates, not just class labels.
This shift from hard to soft predictions unlocks cost-sensitive decision-making, tunable classification thresholds, and explicit uncertainty quantification, including the option to abstain on hard cases.
By the end of this page, you will understand posterior probability estimation and its interpretation, what makes probability estimates 'calibrated', how to evaluate and improve calibration, and why probabilistic outputs enable better decision-making than hard labels.
The Central Object of Probabilistic Classification
The posterior probability $P(Y = 1 | \mathbf{X} = \mathbf{x})$ represents our belief that input $\mathbf{x}$ belongs to class 1, given all evidence in the features.
Using Bayes' theorem: $$P(Y = 1 | \mathbf{X}) = \frac{P(\mathbf{X} | Y = 1) \cdot P(Y = 1)}{P(\mathbf{X})}$$
where $P(\mathbf{X} \mid Y = 1)$ is the class-conditional likelihood, $P(Y = 1)$ is the prior probability of class 1, and $P(\mathbf{X})$ is the evidence (the marginal probability of the features).
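As a quick numeric illustration (all numbers hypothetical), the posterior follows directly from the likelihoods and the prior:

```python
# Hypothetical numbers for illustration: prior spam rate and class-conditional
# likelihoods of observing a particular feature vector x.
prior_spam = 0.3    # P(Y=1)
lik_spam = 0.012    # P(x | Y=1), assumed
lik_ham = 0.002     # P(x | Y=0), assumed

# Evidence P(x) via the law of total probability
evidence = lik_spam * prior_spam + lik_ham * (1 - prior_spam)

posterior_spam = lik_spam * prior_spam / evidence
print(f"P(Y=1 | x) = {posterior_spam:.3f}")  # ≈ 0.720
```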
Two Approaches to Posterior Estimation
| Approach | Models Directly | Examples | Strengths |
|---|---|---|---|
| Discriminative | $P(Y|\mathbf{X})$ | Logistic regression, Neural networks | Often more accurate, requires fewer modeling assumptions |
| Generative | $P(\mathbf{X}|Y)$ and $P(Y)$ | Naive Bayes, LDA, GMM | Handles missing data, works with small samples, interpretable |
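The table above contrasts the two routes to the posterior. A minimal sketch of both in scikit-learn, using logistic regression (discriminative) and Gaussian naive Bayes (generative) on synthetic data that is purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data for illustration only
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Discriminative: models P(Y|X) directly
disc = LogisticRegression().fit(X, y)

# Generative: models P(X|Y) and P(Y), then applies Bayes' theorem
gen = GaussianNB().fit(X, y)

# Both expose posterior estimates through predict_proba
print(disc.predict_proba(X[:3])[:, 1])
print(gen.predict_proba(X[:3])[:, 1])
```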
Interpreting Posterior Probabilities
A well-calibrated posterior $\hat{p} = P(Y=1|\mathbf{x})$ has a precise meaning:
Among all instances where the model predicts probability $\hat{p}$, approximately $\hat{p}$ fraction truly belong to class 1.
For example, if the model says 70% probability of rain for 100 different days, we expect rain on approximately 70 of those days.
With calibrated probabilities and known costs, the Bayes-optimal decision rule is: predict class 1 if $P(Y=1|\mathbf{x}) \cdot C_{10} > P(Y=0|\mathbf{x}) \cdot C_{01}$, where $C_{ij}$ is the cost of predicting $j$ when the truth is $i$ (so $C_{10}$ penalizes missed positives and $C_{01}$ penalizes false alarms). Hard classifiers, which output only labels, cannot support such cost-sensitive decisions.
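Rearranging, this rule is equivalent to thresholding the posterior at $\tau = C_{01} / (C_{01} + C_{10})$. A minimal sketch, with illustrative costs for a hypothetical spam filter:

```python
def cost_sensitive_decision(p_pos, cost_fp, cost_fn):
    """Predict 1 when the expected cost of predicting 1 is lower than predicting 0.

    cost_fp: cost of predicting 1 when the truth is 0 (C_01)
    cost_fn: cost of predicting 0 when the truth is 1 (C_10)
    """
    threshold = cost_fp / (cost_fp + cost_fn)
    return int(p_pos >= threshold), threshold

# Illustrative costs: a missed spam message (false negative) is mildly annoying,
# while a legitimate message sent to spam (false positive) is 10x worse.
decision, tau = cost_sensitive_decision(p_pos=0.92, cost_fp=10.0, cost_fn=1.0)
print(f"threshold = {tau:.3f}, decision = {decision}")  # threshold ≈ 0.909
```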
Definition of Calibration
A classifier is perfectly calibrated if: $$P(Y = 1 | \hat{p}(\mathbf{X}) = p) = p \quad \forall p \in [0, 1]$$
In words: when the model predicts probability $p$, the true proportion of positives is exactly $p$.
Common Calibration Problems
In practice, models are often overconfident (predicted probabilities are more extreme than the observed frequencies, a common failure mode of modern neural networks) or underconfident (probabilities are pulled toward 0.5 even when the evidence is strong).
Reliability Diagrams
The standard tool for assessing calibration:
```python
import numpy as np
from sklearn.calibration import calibration_curve

def compute_calibration_metrics(y_true, y_prob, n_bins=10):
    """Compute calibration metrics and reliability data."""
    # Reliability curve: mean predicted probability vs. observed frequency per bin
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)

    # Expected Calibration Error (ECE): bin-size-weighted gap between
    # average confidence and observed accuracy in each bin
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        # Include probabilities equal to 1.0 in the last bin
        upper_ok = (y_prob <= bin_edges[i + 1]) if i == n_bins - 1 else (y_prob < bin_edges[i + 1])
        mask = (y_prob >= bin_edges[i]) & upper_ok
        if mask.sum() > 0:
            bin_acc = y_true[mask].mean()
            bin_conf = y_prob[mask].mean()
            ece += mask.sum() * abs(bin_acc - bin_conf)
    ece /= len(y_true)

    return {"prob_true": prob_true, "prob_pred": prob_pred, "ece": ece}

# Example usage
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6, 0.4, 0.85, 0.75])

metrics = compute_calibration_metrics(y_true, y_prob, n_bins=5)
print(f"Expected Calibration Error: {metrics['ece']:.4f}")
```

When classifiers produce poorly calibrated probabilities, post-hoc calibration can improve them.
Platt Scaling
Fit a logistic regression on the classifier's outputs: $$P(Y=1|s) = \frac{1}{1 + \exp(As + B)}$$
where $s$ is the classifier's score. Parameters $A$ and $B$ are learned on a held-out calibration set.
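A minimal sketch of Platt scaling, assuming `scores_cal` and `y_cal` come from a hypothetical held-out calibration set; scikit-learn's `LogisticRegression` fits the sigmoid for us (its coefficient and intercept play the roles of $-A$ and $-B$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out calibration set: raw classifier scores and true labels
scores_cal = np.array([-2.1, -1.3, -0.4, 0.2, 0.9, 1.5, 2.3, 3.0])
y_cal = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Fit the sigmoid mapping from score to probability
platt = LogisticRegression().fit(scores_cal.reshape(-1, 1), y_cal)

# Calibrated probabilities for new scores
new_scores = np.array([[-1.0], [0.5], [2.0]])
print(platt.predict_proba(new_scores)[:, 1])
```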
Isotonic Regression
Fit a non-decreasing step function mapping scores to probabilities. More flexible than Platt scaling but requires more data.
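A corresponding sketch with scikit-learn's `IsotonicRegression`, again on a hypothetical held-out calibration set:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical held-out calibration set: model scores (or raw probabilities) and labels
scores_cal = np.array([0.05, 0.2, 0.3, 0.45, 0.55, 0.6, 0.8, 0.95])
y_cal = np.array([0, 0, 1, 0, 1, 1, 1, 1])

# Fit a non-decreasing step function from scores to probabilities
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores_cal, y_cal)

print(iso.predict([0.1, 0.5, 0.9]))
```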
Temperature Scaling
For neural networks, divide logits by a learned temperature $T$: $$P(Y=1) = \sigma(z/T)$$
$T > 1$ softens probabilities (reduces overconfidence).
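A sketch of temperature scaling in NumPy; in practice $T$ is chosen by minimizing the negative log-likelihood on a validation set, approximated here by a crude grid search over hypothetical validation logits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(y, p, eps=1e-12):
    # Negative log-likelihood (log loss) with clipping for numerical safety
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical validation logits from an overconfident network, with true labels
logits_val = np.array([4.0, 3.2, -3.5, 2.8, -4.1, 3.9, -2.5, 3.1])
y_val = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Pick the temperature that minimizes validation NLL
temps = np.linspace(0.5, 5.0, 46)
best_T = min(temps, key=lambda T: nll(y_val, sigmoid(logits_val / T)))
print(f"Chosen T = {best_T:.2f}")

# Calibrated probability for a new logit
print(sigmoid(2.7 / best_T))
```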
| Method | Flexibility | Data Required | Best For |
|---|---|---|---|
| Platt Scaling | Low (2 params) | Moderate | SVMs, small datasets |
| Isotonic Regression | High | Large | General purpose with enough data |
| Temperature Scaling | Very Low (1 param) | Moderate | Neural networks |
| Beta Calibration | Medium (3 params) | Moderate | Handling boundary effects |
Never calibrate on training data—this leads to overfitting and optimistic estimates. Always use a separate calibration set or cross-validation. The calibration set should be representative of the deployment distribution.
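One way to respect this in scikit-learn is `CalibratedClassifierCV`, which handles the split internally via cross-validation so each fold's calibrator is fit on data the base model did not see. A sketch on synthetic data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data for illustration; hold out a test set for evaluation
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Platt-style ("sigmoid") calibration of an SVM, with 5-fold internal CV
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

print(calibrated.predict_proba(X_test[:3])[:, 1])
```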
Probability estimates must be converted to decisions. The threshold $\tau$ determines the cutoff:
$$\hat{y} = \begin{cases} 1 & \text{if } P(Y=1|\mathbf{x}) \geq \tau \\ 0 & \text{otherwise} \end{cases}$$
Choosing the Threshold
```python
import numpy as np
from sklearn.metrics import f1_score

def find_optimal_threshold(y_true, y_prob, metric='f1'):
    """Find the threshold that optimizes the specified metric."""
    thresholds = np.linspace(0.01, 0.99, 99)
    scores = []
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        if metric == 'f1':
            scores.append(f1_score(y_true, y_pred, zero_division=0))
    best_idx = np.argmax(scores)
    return thresholds[best_idx], scores[best_idx]

# Example
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6, 0.4, 0.85, 0.75])

best_t, best_f1 = find_optimal_threshold(y_true, y_prob)
print(f"Optimal threshold: {best_t:.2f}, F1: {best_f1:.4f}")
```

Probabilistic classifiers naturally express uncertainty. Understanding the different types of uncertainty is crucial:
Aleatoric Uncertainty
Irreducible uncertainty from inherent randomness in the data. Even with infinite data and a perfect model, some inputs genuinely could be either class. Reflected in posterior probabilities near 0.5.
Epistemic Uncertainty
Reducible uncertainty from limited knowledge—insufficient data or model limitations. Bayesian methods can estimate this by maintaining distributions over model parameters.
| Type | Source | Reducible? | How to Address |
|---|---|---|---|
| Aleatoric | Inherent class overlap | No | Accept as irreducible error |
| Epistemic (data) | Limited training data | Yes | Collect more data |
| Epistemic (model) | Model misspecification | Yes | Use better model class |
| Distribution shift | Train/test mismatch | Partially | Domain adaptation, monitoring |
High uncertainty predictions (probabilities near 0.5) are prime candidates for human review. In high-stakes applications, define an 'abstention region' where the classifier defers to human judgment rather than making unreliable predictions.
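A minimal sketch of such an abstention rule, assuming an illustrative band of width 0.2 around 0.5 inside which predictions are deferred to a human:

```python
import numpy as np

def predict_with_abstention(y_prob, lower=0.4, upper=0.6):
    """Return 1/0 for confident predictions and -1 (abstain) inside the band."""
    y_prob = np.asarray(y_prob)
    return np.where(y_prob >= upper, 1, np.where(y_prob <= lower, 0, -1))

y_prob = np.array([0.05, 0.45, 0.55, 0.92, 0.61, 0.38])
print(predict_with_abstention(y_prob))  # [ 0 -1 -1  1  1  0]
```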
How do we evaluate probability estimates? Proper scoring rules are loss functions that are minimized when the predicted probabilities match the true probabilities.
Log Loss (Cross-Entropy)
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n [y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]$$
The most common choice. Heavily penalizes confident wrong predictions.
Brier Score
$$\text{BS} = \frac{1}{n}\sum_{i=1}^n (\hat{p}_i - y_i)^2$$
Mean squared error for probabilities. Less sensitive to extreme errors than log loss.
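Both scores are available in scikit-learn; a quick sketch on the toy predictions used earlier:

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6, 0.4, 0.85, 0.75])

# Lower is better for both proper scoring rules
print(f"Log loss:    {log_loss(y_true, y_prob):.4f}")
print(f"Brier score: {brier_score_loss(y_true, y_prob):.4f}")
```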
You now understand classification as probability estimation. This perspective is essential for logistic regression and beyond. Next, we explore linear classifiers—the workhorses of probabilistic classification.