What makes logistic regression special among classifiers isn't just that it classifies—many algorithms do that. What makes it special is that it produces genuine probabilities.
When logistic regression says "70% chance of class 1," this isn't just a confidence score—it's a true probability estimate. If you collect all predictions where the model said 70%, roughly 70% of them should actually be class 1. This property is called calibration, and it's far rarer and more valuable than you might expect.
Probabilistic outputs enable richer decision-making: cost-sensitive classification, uncertainty quantification, risk assessment, and proper combination with other information sources. Understanding this probabilistic interpretation transforms how you think about and deploy logistic regression.
By the end of this page, you will understand: (1) why logistic regression outputs are true probabilities (not just scores), (2) what calibration means and how to measure it, (3) comparison with other classifiers' probability estimates, (4) how to use probabilities for cost-sensitive decision-making, and (5) practical considerations for deploying probabilistic classifiers.
Many classifiers produce numbers between 0 and 1, but not all of these are true probabilities. Understanding this distinction is fundamental.
What Is a True Probability?
A model output $\hat{p}$ is a true (calibrated) probability if:
$$P(Y = 1 | \hat{p} = p) = p$$
In words: among all examples where the model predicts probability $p$, the fraction that actually belong to class 1 is $p$.
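This definition can be checked directly on held-out data: gather the predictions near some value $p$ and compare the empirical positive rate to $p$. Below is a minimal sketch of that check on synthetic data; the dataset and variable names are illustrative, not part of any specific example from this page.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X, y = make_classification(n_samples=20000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_hat = model.predict_proba(X_te)[:, 1]

# Among examples where the model says roughly 70%, how many are actually class 1?
near_70 = (p_hat > 0.65) & (p_hat < 0.75)
print(f"n = {near_70.sum()}, mean prediction = {p_hat[near_70].mean():.3f}, "
      f"empirical positive rate = {y_te[near_70].mean():.3f}")
```

For a well-calibrated model, the last two numbers should be close.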
What Is a Score?
A score is just a number that ranks predictions by likelihood. Higher scores indicate greater confidence in class 1, but the numerical values don't have direct probabilistic meaning.
Example: An SVM might output "margin = 2.5". This tells you the example is far on the class-1 side of the boundary, but it doesn't mean 92.5% probability or any specific probability.
Why Logistic Regression Produces True Probabilities
Logistic regression is fit by maximum likelihood estimation under a Bernoulli model:
$$P(Y=y | \mathbf{x}) = \hat{p}^y (1-\hat{p})^{1-y}$$
The optimization explicitly finds parameters that make the predicted probabilities match the observed frequencies in training data. This is fundamentally different from optimization objectives like hinge loss (SVM) or squared error that don't directly model probabilities.
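To make that objective concrete, the sketch below computes the Bernoulli negative log-likelihood of a fitted model's predictions by hand and compares it with sklearn's log_loss. The dataset is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Illustrative data; the point is the objective, not the dataset
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
p_hat = model.predict_proba(X)[:, 1]

# Negative log-likelihood under the Bernoulli model, written out by hand
eps = 1e-15
p_clip = np.clip(p_hat, eps, 1 - eps)
nll = -np.mean(y * np.log(p_clip) + (1 - y) * np.log(1 - p_clip))

print(f"Hand-computed mean NLL: {nll:.4f}")
print(f"sklearn log_loss:       {log_loss(y, p_hat):.4f}")  # same quantity
```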
| Classifier | Output Type | Naturally Calibrated? | Notes |
|---|---|---|---|
| Logistic Regression | Probability | Yes (if well-specified) | MLE under Bernoulli model |
| Decision Trees | Class frequencies | Often poorly calibrated | Can be miscalibrated in leaves |
| Random Forests | Averaged frequencies | Better than trees, still imperfect | Averaging helps |
| SVM | Distance to margin | No (requires Platt scaling) | Not trained for probabilities |
| Naive Bayes | Posterior probability | Often poorly calibrated | Wrong independence assumption |
| Neural Networks | Softmax output | Often miscalibrated | Modern DNNs tend to be overconfident |
Just because a model outputs values between 0 and 1 doesn't make them probabilities. Many methods (sigmoid-squashed margins, softmax outputs) produce values in this range that are NOT calibrated. Always verify calibration before treating outputs as true probabilities.
Calibration is the alignment between predicted probabilities and actual outcomes. A perfectly calibrated model produces predictions where the predicted confidence matches the empirical frequency of the positive class.
Formal Definition
A classifier is perfectly calibrated if for all $p \in [0, 1]$:
$$\mathbb{E}[Y | \hat{P} = p] = p$$
That is, among all predictions with $\hat{p} \approx p$, the proportion of actual positives equals $p$.
Reliability Diagrams (Calibration Curves)
The standard visualization for calibration is the reliability diagram: predictions are grouped into bins by predicted probability, and the empirical fraction of positives in each bin is plotted against the bin's mean predicted probability.
A perfectly calibrated model produces a diagonal line from (0, 0) to (1, 1).
Common Miscalibration Patterns
Overconfidence appears as a curve flatter than the diagonal (predicted probabilities are more extreme than the observed frequencies), while underconfidence appears as a curve steeper than the diagonal (predictions cluster too close to 0.5). The code below compares several classifiers on the same data.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.calibration import calibration_curve
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_redundant=5, class_sep=0.8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit different classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (Platt Scaling)': SVC(probability=True, random_state=42),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

print("Calibration Analysis")
print("=" * 60)

for i, (name, clf) in enumerate(classifiers.items()):
    clf.fit(X_train, y_train)

    if hasattr(clf, 'predict_proba'):
        prob_pos = clf.predict_proba(X_test)[:, 1]
    else:
        prob_pos = clf.decision_function(X_test)

    # Compute calibration curve
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_test, prob_pos, n_bins=10, strategy='uniform'
    )

    # Calculate Expected Calibration Error (ECE).
    # calibration_curve returns only non-empty bins, so drop empty bins from the counts
    # to keep the arrays aligned.
    bin_counts = np.histogram(prob_pos, bins=10, range=(0, 1))[0]
    bin_counts = bin_counts[bin_counts > 0]
    ece = np.sum(bin_counts * np.abs(fraction_of_positives - mean_predicted_value)) / len(prob_pos)

    print(f"\n{name}:")
    print(f"  Expected Calibration Error (ECE): {ece:.4f}")

    # Plot
    ax = axes[i]
    ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
    ax.plot(mean_predicted_value, fraction_of_positives, 's-',
            label=f'{name}\nECE={ece:.3f}', markersize=8)
    ax.fill_between([0, 1], [0, 0], [1, 1], alpha=0.1, color='gray')
    ax.set_xlabel('Mean Predicted Probability')
    ax.set_ylabel('Fraction of Positives')
    ax.set_title(f'{name}\nCalibration Curve')
    ax.legend(loc='lower right')
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('calibration_analysis.png', dpi=150)
plt.show()

# Show detailed bin analysis for logistic regression
print("\n" + "=" * 60)
print("Logistic Regression: Detailed Bin Analysis")
print("=" * 60)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
probs = lr.predict_proba(X_test)[:, 1]

for lo, hi in [(0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]:
    mask = (probs >= lo) & (probs < hi)
    if mask.sum() > 0:
        actual_rate = y_test[mask].mean()
        mean_pred = probs[mask].mean()
        print(f"  Bin [{lo:.1f}, {hi:.1f}): n={mask.sum():>4}, predicted={mean_pred:.3f}, actual={actual_rate:.3f}")
```

The Expected Calibration Error summarizes calibration in a single number. Lower ECE means better calibration. ECE is the weighted average of |predicted - actual| across bins, where weights are proportional to bin sizes. ECE = 0 means perfect calibration.
The calibration of logistic regression isn't accidental—it follows directly from the mathematical framework.
Maximum Likelihood and Calibration
Logistic regression minimizes the negative log-likelihood:
$$\mathcal{L} = -\sum_{i=1}^n \left[ y_i \log \hat{p}_i + (1-y_i) \log(1-\hat{p}_i) \right]$$
At the optimum, the first-order conditions (gradient equals zero) require:
$$\sum_{i=1}^n (\hat{p}_i - y_i) = 0$$
This means the average predicted probability equals the average actual outcome (the overall class proportion). This is a global calibration constraint.
More strongly, for each feature $j$:
$$\sum_{i=1}^n (\hat{p}_i - y_i) x_{ij} = 0$$
This ensures that predictions are calibrated conditional on each feature's value—a much stronger form of calibration.
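These first-order conditions can be verified numerically. The sketch below fits an essentially unregularized logistic regression (sklearn regularizes by default, so a very large C is used) on synthetic data and checks both score equations; names and data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Essentially unregularized fit (very large C) so the score equations hold up to solver tolerance
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=0)
model = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
p_hat = model.predict_proba(X)[:, 1]

residual = p_hat - y
print(f"sum(p_hat - y)              = {residual.sum():.6f}")              # ≈ 0: global calibration
print(f"max_j |sum((p_hat - y)x_j)| = {np.abs(X.T @ residual).max():.6f}")  # ≈ 0: per-feature calibration
```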
When Calibration Can Fail
Despite the theoretical guarantee, logistic regression can be miscalibrated when the model is misspecified (important nonlinearities or interactions are missing), when strong regularization shrinks coefficients and pulls predictions toward the base rate, when the classes are nearly separable so predictions saturate at 0 or 1, or when the deployment distribution differs from the training distribution.
Comparison with Other Methods
Decision Trees: Predict the class proportion in each leaf. With small leaves, these frequencies can be noisy (e.g., 1/3 or 2/5), leading to poor calibration. Ensemble methods like Random Forests help by averaging.
SVMs: Trained to maximize margin, not to produce probabilities. Platt scaling fits a sigmoid to convert margins to probabilities, but this is a post-hoc fix, not native calibration.
Neural Networks: Despite using cross-entropy loss (like logistic regression), modern deep networks are often overconfident. This is an active research area; hypotheses include high capacity, learning dynamics, and batch normalization effects.
In domains where probability estimates matter (medicine, finance, risk assessment), logistic regression's natural calibration is a significant advantage. While other models might have slightly higher accuracy, their probability estimates often require additional calibration steps that logistic regression doesn't need.
When a classifier produces miscalibrated probabilities, we can apply post-hoc recalibration to fix them. These methods transform the model's outputs to improve calibration without retraining the base model.
Platt Scaling
Fit a logistic regression on the model's outputs:
$$\hat{p}_{\text{calibrated}} = \sigma(A \cdot f(\mathbf{x}) + B)$$
where $f(\mathbf{x})$ is the original model's score (or log-odds), and $A$, $B$ are learned from a held-out calibration set.
This is equivalent to fitting a logistic regression with the original score as the only feature.
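As a concrete illustration of that equivalence, here is a sketch in which an SVM's decision_function scores on a held-out split are fed into a one-feature logistic regression. Library implementations (such as sklearn's) add small refinements, so treat this as a sketch rather than the exact production recipe; the data and names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative setup: an SVM produces uncalibrated margin scores
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(random_state=0).fit(X_tr, y_tr)
scores_cal = svm.decision_function(X_cal).reshape(-1, 1)

# Platt scaling: fit sigma(A*score + B) on the held-out calibration split
platt = LogisticRegression().fit(scores_cal, y_cal)
A, B = platt.coef_[0, 0], platt.intercept_[0]
print(f"Learned A = {A:.3f}, B = {B:.3f}")

# Calibrated probabilities for some scores
new_scores = svm.decision_function(X_cal[:5]).reshape(-1, 1)
print(platt.predict_proba(new_scores)[:, 1])
```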
Isotonic Regression
Fit a non-decreasing piecewise constant function that maps scores to calibrated probabilities:
$$\hat{p}_{\text{calibrated}} = g(f(\mathbf{x}))$$
where $g$ is learned to minimize squared error while maintaining monotonicity.
Isotonic regression is more flexible than Platt scaling but requires more data and can overfit.
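A minimal sketch of isotonic recalibration using sklearn's IsotonicRegression on a held-out split; the data and names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

# Illustrative setup: recalibrate a random forest's probabilities
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.4, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
scores_cal = rf.predict_proba(X_cal)[:, 1]

# Monotone (non-decreasing) map from score to calibrated probability
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip')
iso.fit(scores_cal, y_cal)

print(iso.predict(np.array([0.1, 0.3, 0.5, 0.7, 0.9])))  # calibrated probabilities
```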
Temperature Scaling (for neural networks)
Divide logits by a learned temperature $T$ before softmax:
$$\hat{p}_{\text{calibrated}} = \text{softmax}(\mathbf{z} / T)$$
A temperature $T > 1$ softens probabilities (reducing overconfidence), while $T < 1$ sharpens them.
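Below is a tiny numpy sketch of the transformation itself; the logits are made up, and in practice $T$ is typically chosen by minimizing the negative log-likelihood on a validation set.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical logits for three classes from an overconfident network
logits = np.array([[4.0, 1.0, 0.5]])

for T in [1.0, 2.0, 0.5]:
    print(f"T = {T}: {np.round(softmax(logits / T), 3)}")
# T=1 is the original prediction; T=2 softens it; T=0.5 sharpens it.
```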
When to Apply Recalibration
Recalibration is worth applying when reliability diagrams or ECE on held-out data show systematic miscalibration and you have enough separate data to fit the recalibration map without overfitting; a natively calibrated model such as logistic regression usually does not need it.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           class_sep=0.7, random_state=42)

# Split into train, calibration, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Recalibration Methods Comparison")
print("=" * 60)
print(f"Training set: {len(X_train)}, Calibration set: {len(X_cal)}, Test set: {len(X_test)}")

# Train a Random Forest (typically overconfident)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Apply different calibration methods
# cv='prefit' tells CalibratedClassifierCV to calibrate the already-fitted rf
calibrated_models = {
    'Uncalibrated': rf,
    'Platt Scaling': CalibratedClassifierCV(rf, method='sigmoid', cv='prefit'),
    'Isotonic Regression': CalibratedClassifierCV(rf, method='isotonic', cv='prefit'),
}

# Fit calibrated versions on calibration set
calibrated_models['Platt Scaling'].fit(X_cal, y_cal)
calibrated_models['Isotonic Regression'].fit(X_cal, y_cal)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for i, (name, model) in enumerate(calibrated_models.items()):
    probs = model.predict_proba(X_test)[:, 1]

    # Calibration curve
    fraction_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)

    # ECE (calibration_curve returns only non-empty bins, so align the counts)
    bin_counts = np.histogram(probs, bins=10, range=(0, 1))[0]
    bin_counts = bin_counts[bin_counts > 0]
    ece = np.sum(bin_counts * np.abs(fraction_pos - mean_pred)) / len(probs)

    print(f"\n{name}: ECE = {ece:.4f}")

    ax = axes[i]
    ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
    ax.plot(mean_pred, fraction_pos, 's-', markersize=8, label=f'ECE={ece:.3f}')
    ax.set_xlabel('Mean Predicted Probability')
    ax.set_ylabel('Fraction of Positives')
    ax.set_title(name)
    ax.legend(loc='lower right')
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)

plt.suptitle('Effect of Recalibration on Random Forest', fontsize=14)
plt.tight_layout()
plt.savefig('recalibration_demo.png', dpi=150)
plt.show()

# Compare logistic regression (already calibrated) vs RF
print("\n" + "=" * 60)
print("Logistic Regression vs Random Forest Calibration")
print("=" * 60)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
lr_probs = lr.predict_proba(X_test)[:, 1]
rf_probs = rf.predict_proba(X_test)[:, 1]

for name, probs in [('Logistic Regression', lr_probs), ('Random Forest', rf_probs)]:
    fraction_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    bin_counts = np.histogram(probs, bins=10, range=(0, 1))[0]
    bin_counts = bin_counts[bin_counts > 0]
    ece = np.sum(bin_counts * np.abs(fraction_pos - mean_pred)) / len(probs)
    print(f"{name}: ECE = {ece:.4f}")
```

Never use training data for recalibration—this leads to overfitting on training set calibration rather than true calibration. Always use a separate calibration set or cross-validation. With limited data, cross-validated calibration (e.g., CalibratedClassifierCV with cv=5 rather than a single held-out split with cv='prefit') is preferred.
Calibrated probabilities enable optimal decision-making when misclassification costs are unequal. This is one of the most powerful practical applications of probabilistic classifiers.
The Optimal Decision Rule
With costs $C_{01}$ for a false negative (predicting 0 when the true class is 1) and $C_{10}$ for a false positive (predicting 1 when the true class is 0), the Bayes-optimal decision is:
$$\hat{y} = \begin{cases} 1 & \text{if } \hat{p} > \frac{C_{10}}{C_{01} + C_{10}} \\ 0 & \text{otherwise} \end{cases}$$
The optimal threshold is $\tau^* = \frac{C_{10}}{C_{01} + C_{10}}$.
Examples
Equal costs ($C_{01} = C_{10}$): $\tau^* = 0.5$ (standard threshold)
False negatives 10× worse ($C_{01} = 10, C_{10} = 1$): $\tau^* = \frac{1}{11} \approx 0.09$. Classify as positive unless very confident it's negative.
False positives 10× worse ($C_{01} = 1, C_{10} = 10$): $\tau^* = \frac{10}{11} \approx 0.91$. Classify as positive only when very confident.
Expected Cost Minimization
The expected cost of each decision, given that the true probability of class 1 is $p$: predicting 0 incurs expected cost $p \cdot C_{01}$ (a false negative occurs with probability $p$), while predicting 1 incurs expected cost $(1-p) \cdot C_{10}$ (a false positive occurs with probability $1-p$).
Predict 1 when $(1-p) \cdot C_{10} < p \cdot C_{01}$; solving for $p$ gives the threshold above.
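Here is a small numeric check of that crossover, using the same 10:1 screening costs ($C_{01} = 10$, $C_{10} = 1$) that appear in the example code below.

```python
import numpy as np

C_01, C_10 = 10.0, 1.0  # cost of a false negative, cost of a false positive

p = np.linspace(0, 1, 11)
cost_predict_0 = p * C_01          # expected cost of predicting 0
cost_predict_1 = (1 - p) * C_10    # expected cost of predicting 1

tau_star = C_10 / (C_01 + C_10)
print(f"Optimal threshold: {tau_star:.3f}")
for pi, c0, c1 in zip(p, cost_predict_0, cost_predict_1):
    choice = 1 if c1 < c0 else 0
    print(f"p = {pi:.1f}: cost(0) = {c0:4.1f}, cost(1) = {c1:4.1f} -> predict {choice}")
```

The decision flips from 0 to 1 exactly where the two expected-cost lines cross, at $p = \tau^* \approx 0.091$.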
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate data (medical diagnosis scenario)
np.random.seed(42)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           class_sep=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)[:, 1]

def evaluate_at_threshold(y_true, probs, threshold, C_01, C_10):
    """Evaluate predictions at given threshold with costs."""
    predictions = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, predictions).ravel()

    total_cost = fn * C_01 + fp * C_10
    accuracy = (tp + tn) / len(y_true)

    return {
        'threshold': threshold,
        'TP': tp, 'FP': fp, 'TN': tn, 'FN': fn,
        'accuracy': accuracy,
        'total_cost': total_cost,
        'avg_cost': total_cost / len(y_true)
    }

# Scenario 1: Disease screening (false negatives are bad - miss sick patient)
C_01_medical = 10   # Cost of missing disease
C_10_medical = 1    # Cost of unnecessary follow-up tests

optimal_threshold_medical = C_10_medical / (C_01_medical + C_10_medical)

print("Cost-Sensitive Decision Making")
print("=" * 70)
print("\nScenario 1: Medical Screening")
print(f"  Cost of false negative (miss disease): {C_01_medical}")
print(f"  Cost of false positive (unnecessary test): {C_10_medical}")
print(f"  Optimal threshold: {optimal_threshold_medical:.4f}")

thresholds_to_test = [0.1, 0.3, optimal_threshold_medical, 0.5, 0.7, 0.9]

print(f"\n{'Threshold':>10} | {'TP':>5} | {'FP':>5} | {'TN':>5} | {'FN':>5} | {'Accuracy':>8} | {'AvgCost':>8}")
print("-" * 70)

for t in thresholds_to_test:
    result = evaluate_at_threshold(y_test, probabilities, t, C_01_medical, C_10_medical)
    # Flag only the exact optimal threshold (tight tolerance avoids marking nearby values too)
    marker = " ← optimal" if abs(t - optimal_threshold_medical) < 0.001 else ""
    print(f"{t:>10.3f} | {result['TP']:>5} | {result['FP']:>5} | {result['TN']:>5} | {result['FN']:>5} | "
          f"{result['accuracy']:>7.1%} | {result['avg_cost']:>8.3f}{marker}")

# Scenario 2: Spam filter (false positives are bad - lose important email)
C_01_spam = 1    # Cost of spam in inbox
C_10_spam = 20   # Cost of losing legitimate email

optimal_threshold_spam = C_10_spam / (C_01_spam + C_10_spam)

print("\n" + "=" * 70)
print("Scenario 2: Spam Filtering")
print(f"  Cost of false negative (spam in inbox): {C_01_spam}")
print(f"  Cost of false positive (lose real email): {C_10_spam}")
print(f"  Optimal threshold: {optimal_threshold_spam:.4f}")

print(f"\n{'Threshold':>10} | {'TP':>5} | {'FP':>5} | {'TN':>5} | {'FN':>5} | {'Accuracy':>8} | {'AvgCost':>8}")
print("-" * 70)

for t in [0.3, 0.5, 0.7, 0.9, optimal_threshold_spam, 0.95]:
    result = evaluate_at_threshold(y_test, probabilities, t, C_01_spam, C_10_spam)
    marker = " ← optimal" if abs(t - optimal_threshold_spam) < 0.001 else ""
    print(f"{t:>10.3f} | {result['TP']:>5} | {result['FP']:>5} | {result['TN']:>5} | {result['FN']:>5} | "
          f"{result['accuracy']:>7.1%} | {result['avg_cost']:>8.3f}{marker}")
```

Costs can represent any relative harm: utility loss, risk, inconvenience, or even raw counts if you want to "miss no positives." The ratio C₁₀/C₀₁ is what matters. If missing a positive is 10× worse than a false alarm, set C₀₁ = 10, C₁₀ = 1 (or any 10:1 ratio).
Probabilistic outputs provide uncertainty estimates that reveal how confident the model is in its predictions. This is crucial for high-stakes applications where knowing what we don't know matters.
Types of Uncertainty
Logistic regression captures aleatoric uncertainty (irreducible noise in the data-generating process) but not epistemic uncertainty (model uncertainty due to limited data or model choice).
Aleatoric Uncertainty: Intrinsic noise. Even with perfect knowledge, outcomes have randomness. Logistic regression captures this through probabilities near 0.5—the model says "this could go either way."
Epistemic Uncertainty: Knowledge gaps. With more data, we could resolve this uncertainty. Standard logistic regression doesn't quantify this, but Bayesian logistic regression does (through posterior uncertainty on parameters).
Using Uncertainty in Practice
Flagging Low-Confidence Predictions: Route predictions with $0.4 < \hat{p} < 0.6$ to human review.
Abstaining from Prediction: Refuse to classify when $|\hat{p} - 0.5| < \epsilon$.
Confidence Intervals: With enough data, bootstrap resampling can give confidence intervals on predictions (see the sketch after this list).
Prediction Entropy: $H = -p \log p - (1-p) \log(1-p)$ measures prediction uncertainty. High entropy → uncertain.
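Below is a sketch of the bootstrap idea from the list above: refit the model on resampled training sets and read off percentile intervals for each test prediction. The data is synthetic, the names are illustrative, and the number of resamples is kept modest for speed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_boot = 200
boot_preds = np.empty((n_boot, len(X_te)))

for b in range(n_boot):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))   # resample training set with replacement
    m = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    boot_preds[b] = m.predict_proba(X_te)[:, 1]

lower = np.percentile(boot_preds, 2.5, axis=0)
upper = np.percentile(boot_preds, 97.5, axis=0)
point = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for i in range(5):
    print(f"Example {i}: p = {point[i]:.2f}, 95% bootstrap interval = [{lower[i]:.2f}, {upper[i]:.2f}]")
```

Wide intervals indicate predictions that are sensitive to which training examples were observed, a rough proxy for epistemic uncertainty.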
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           class_sep=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)[:, 1]

def entropy(p):
    """Binary entropy function."""
    eps = 1e-15
    p = np.clip(p, eps, 1 - eps)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Analyze predictions by confidence level
entropies = entropy(probabilities)
confidences = np.abs(probabilities - 0.5) * 2  # 0 to 1 scale (0=uncertain, 1=confident)

print("Uncertainty Analysis")
print("=" * 60)

# Group by confidence
bins = [(0, 0.3), (0.3, 0.6), (0.6, 0.8), (0.8, 1.0)]
labels = ['Very Uncertain', 'Uncertain', 'Confident', 'Very Confident']

print(f"\n{'Confidence Level':<20} | {'Count':>6} | {'Accuracy':>10} | {'Mean Entropy':>12}")
print("-" * 60)

for (lo, hi), label in zip(bins, labels):
    mask = (confidences >= lo) & (confidences < hi)
    if mask.sum() > 0:
        predictions = (probabilities[mask] > 0.5).astype(int)
        acc = (predictions == y_test[mask]).mean()
        mean_ent = entropies[mask].mean()
        print(f"{label:<20} | {mask.sum():>6} | {acc:>10.2%} | {mean_ent:>12.4f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Probability distribution
ax1 = axes[0]
ax1.hist(probabilities[y_test == 0], bins=20, alpha=0.7, label='Class 0', color='blue')
ax1.hist(probabilities[y_test == 1], bins=20, alpha=0.7, label='Class 1', color='red')
ax1.set_xlabel('Predicted Probability')
ax1.set_ylabel('Count')
ax1.set_title('Distribution of Predicted Probabilities')
ax1.axvline(x=0.5, color='black', linestyle='--')
ax1.legend()

# Plot 2: Accuracy vs Confidence
ax2 = axes[1]
conf_bins = np.linspace(0, 1, 11)
conf_accs = []
conf_centers = []
for i in range(len(conf_bins) - 1):
    mask = (confidences >= conf_bins[i]) & (confidences < conf_bins[i+1])
    if mask.sum() >= 10:
        preds = (probabilities[mask] > 0.5).astype(int)
        conf_accs.append((preds == y_test[mask]).mean())
        conf_centers.append((conf_bins[i] + conf_bins[i+1]) / 2)

ax2.plot(conf_centers, conf_accs, 'o-', markersize=8)
ax2.set_xlabel('Confidence (|p - 0.5| × 2)')
ax2.set_ylabel('Accuracy')
ax2.set_title('Accuracy vs Confidence')
ax2.set_xlim(0, 1)
ax2.set_ylim(0.5, 1.0)
ax2.grid(True, alpha=0.3)

# Plot 3: Entropy distribution
ax3 = axes[2]
ax3.hist(entropies[y_test == 0], bins=20, alpha=0.7, label='Class 0', color='blue')
ax3.hist(entropies[y_test == 1], bins=20, alpha=0.7, label='Class 1', color='red')
ax3.set_xlabel('Entropy (bits)')
ax3.set_ylabel('Count')
ax3.set_title('Distribution of Prediction Entropy')
ax3.legend()

plt.tight_layout()
plt.savefig('uncertainty_analysis.png', dpi=150)
plt.show()

# Selective prediction: abstain on uncertain cases
print("\n" + "=" * 60)
print("Selective Prediction (Abstaining on Uncertain Cases)")
print("=" * 60)

for abstain_threshold in [0.0, 0.1, 0.2, 0.3]:
    confident_mask = confidences >= abstain_threshold
    n_predict = confident_mask.sum()
    if n_predict > 0:
        preds = (probabilities[confident_mask] > 0.5).astype(int)
        acc = (preds == y_test[confident_mask]).mean()
        coverage = n_predict / len(y_test)
        print(f"Abstain if confidence < {abstain_threshold:.1f}: "
              f"Coverage = {coverage:.1%}, Accuracy = {acc:.1%}")
```

For a well-calibrated model, predictions with higher confidence should have higher accuracy. If this relationship doesn't hold, the model may be miscalibrated or there's a fundamental issue with the data quality at certain prediction levels. This accuracy-confidence alignment is a sanity check for probabilistic predictions.
Deploying a probabilistic classifier like logistic regression requires careful attention to how probabilities will be used and how well they'll hold up in production.
Calibration Monitoring
Calibration can degrade over time due to shifts in the input distribution, changes in the base rate of the positive class, or changes in the relationship between features and outcome (concept drift).
Regularly plot calibration curves on recent predictions to detect drift.
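As a sketch of what such monitoring could look like, the code below computes ECE over successive windows of logged predictions and outcomes. The logged data here is simulated (well calibrated by construction) and the helper function is our own, not a library API.

```python
import numpy as np

def expected_calibration_error(y_true, p_hat, n_bins=10):
    """ECE: bin-count-weighted average of |empirical rate - mean prediction| per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(p_hat, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.sum() * abs(y_true[mask].mean() - p_hat[mask].mean())
    return ece / len(p_hat)

# Hypothetical logged data: predicted probabilities and observed outcomes over time
rng = np.random.default_rng(0)
p_logged = rng.uniform(0, 1, size=5000)
y_logged = (rng.uniform(0, 1, size=5000) < p_logged).astype(int)  # calibrated by construction

window = 1000
for start in range(0, len(p_logged), window):
    sl = slice(start, start + window)
    print(f"Window {start // window}: ECE = {expected_calibration_error(y_logged[sl], p_logged[sl]):.3f}")
```

A sustained rise in windowed ECE is a signal to investigate drift or retrain.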
Probability Reporting Guidelines
Report probabilities at full precision and make clear what they represent (a calibrated estimate, not a guarantee); avoid baking a decision threshold into the reported value itself.
Threshold Selection in Production
The optimal threshold depends on the relative costs of false positives and false negatives, which typically differ across use cases and can change over time.
Often, you'll expose the probability and let the business layer apply a threshold based on use-case-specific costs.
Logging for Accountability
Always log the predicted probability (not just the thresholded decision), the threshold that was applied, the model version, and the eventual outcome once it is known.
This enables post-hoc calibration analysis, debugging, and regulatory compliance.
Keep full probability precision through your decision pipeline. A policy like 'treat >70% as definite yes' should be applied at the final decision point, not encoded in the probability output. Different downstream uses may need different thresholds.
We've explored the probabilistic interpretation of logistic regression in depth—from the nature of calibrated probabilities to their practical applications in decision-making and deployment. The essential insights: (1) logistic regression's outputs are genuine probability estimates because maximum likelihood under a Bernoulli model enforces calibration, (2) calibration can be measured with reliability diagrams and Expected Calibration Error, (3) many other classifiers need post-hoc recalibration (Platt scaling, isotonic regression, temperature scaling) to produce comparable probabilities, (4) calibrated probabilities enable cost-sensitive thresholds and uncertainty-aware decisions, and (5) in production, calibration should be monitored and probabilities logged at full precision.
Module Complete:
You've now completed a comprehensive exploration of the Logistic Regression Model. From the sigmoid function through log-odds interpretation, model parameters, decision boundaries, and probabilistic interpretation—you have a deep, principled understanding of this foundational classifier.
The next module, Maximum Likelihood Estimation, will dive deeper into the optimization process: deriving the likelihood function, understanding why there's no closed-form solution, and examining efficient algorithms for finding optimal parameters.
You now have mastery of the logistic regression model from every angle—mathematical, geometric, and probabilistic. This understanding forms the foundation for everything from advanced classification methods to neural networks for classification.