Binary classification is one of the most fundamental prediction tasks in machine learning: given an input, determine which of two classes it belongs to. Spam vs. not spam, fraudulent vs. legitimate, positive vs. negative sentiment, click vs. no click—countless real-world problems reduce to this binary choice.
The output layer for binary classification must produce a probability that the input belongs to the positive class. This is fundamentally different from regression: we're not predicting a continuous value, but rather expressing confidence in a discrete outcome. The architecture requires a specific activation function (sigmoid) and loss function (binary cross-entropy) that work together to produce well-calibrated probabilities.
This page provides a rigorous treatment of binary classification outputs, from mathematical foundations to production considerations. By the end, you will understand not just what to use, but why these specific choices emerge from principled statistical reasoning.
This page covers: the sigmoid activation function and its properties, the Bernoulli distribution interpretation, binary cross-entropy loss derivation, numerical stability techniques, class imbalance handling, threshold selection and ROC analysis, and decision-making under probabilistic outputs.
The sigmoid function (also called the logistic function) maps any real number to the interval $(0, 1)$:
$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$$
where $z = \mathbf{w}^T\mathbf{x} + b$ is the pre-activation (logit) from the output layer.
Key properties of the sigmoid:
Range: Output is always in $(0, 1)$, naturally interpretable as a probability
Monotonicity: Strictly increasing—larger logits mean higher probability
Symmetry: $\sigma(-z) = 1 - \sigma(z)$, useful for symmetric classes
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, maximum at $z = 0$, vanishes at extremes
Limits: $\lim_{z \to \infty} \sigma(z) = 1$, $\lim_{z \to -\infty} \sigma(z) = 0$
Inverse (logit): $\sigma^{-1}(p) = \log\frac{p}{1-p}$, transforms probability to log-odds
```python
import numpy as np
import torch

def sigmoid_numpy(z):
    """Naive sigmoid implementation (overflows for large negative z)."""
    return 1 / (1 + np.exp(-z))

def sigmoid_stable(z):
    """
    Numerically stable sigmoid.

    For z >= 0: 1 / (1 + exp(-z))
    For z < 0:  exp(z) / (1 + exp(z))

    This avoids overflow in exp(-z) for large negative z.
    """
    positive_mask = z >= 0
    result = np.zeros_like(z, dtype=np.float64)
    # For positive z
    result[positive_mask] = 1 / (1 + np.exp(-z[positive_mask]))
    # For negative z
    exp_z = np.exp(z[~positive_mask])
    result[~positive_mask] = exp_z / (1 + exp_z)
    return result

def sigmoid_derivative(z):
    """Derivative of sigmoid: σ(z) * (1 - σ(z))"""
    sig = sigmoid_stable(z)
    return sig * (1 - sig)

# Symmetry: σ(-z) = 1 - σ(z)
print("Symmetry check:")
print(f"  σ(2)  = {sigmoid_stable(np.array([2.0]))[0]:.6f}")
print(f"  σ(-2) = {sigmoid_stable(np.array([-2.0]))[0]:.6f}")
print(f"  Sum   = {sigmoid_stable(np.array([2.0]))[0] + sigmoid_stable(np.array([-2.0]))[0]:.6f}")

# Derivative maximum at z=0, vanishing at the extremes
print(f"Derivative at z=0: {sigmoid_derivative(np.array([0.0]))[0]:.6f}")
print(f"Derivative at z=5: {sigmoid_derivative(np.array([5.0]))[0]:.6f}")

# PyTorch built-in (already numerically stable)
z_torch = torch.tensor([0.0, 2.0, -2.0])
print(f"PyTorch sigmoid: {torch.sigmoid(z_torch)}")
```

The logit function $\text{logit}(p) = \log\frac{p}{1-p}$ transforms probabilities to log-odds. This is the linear quantity your network actually predicts. A logit of 0 corresponds to probability 0.5; a logit of 2.2 corresponds to probability ≈ 0.9. Understanding log-odds helps interpret what the network learns.
Why sigmoid for binary classification?
The sigmoid function arises naturally from maximum likelihood estimation under a Bernoulli model. If we assume each data point is drawn from a Bernoulli distribution with parameter $p = \sigma(\mathbf{w}^T\mathbf{x} + b)$:
$$p(y|x) = \sigma(z)^y \cdot (1 - \sigma(z))^{1-y}$$
Taking the negative log-likelihood of this distribution gives us binary cross-entropy. The sigmoid emerges not as an arbitrary choice, but as the canonical link function for the Bernoulli distribution in the generalized linear model framework.
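This equivalence is easy to check numerically. The sketch below (plain NumPy, a single sample) evaluates the negative log of the Bernoulli pmf directly and compares it to the binary cross-entropy term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sample: logit z, label y
z, y = 1.5, 1.0
p = sigmoid(z)

# Negative log-likelihood of the Bernoulli pmf p^y * (1-p)^(1-y)
nll = -np.log(p**y * (1 - p)**(1 - y))

# Binary cross-entropy term for the same sample
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(nll, bce)  # identical up to floating-point rounding
```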
The saturation problem:
While sigmoid works well for output layers, its derivatives approach zero for large $|z|$ (saturation). This caused problems when sigmoid was used in hidden layers (vanishing gradients), leading to ReLU's popularity. However, at the output layer, saturation is actually desirable—confident predictions should have small gradients because they need less adjustment.
Binary classification is fundamentally about modeling a Bernoulli distribution over two outcomes. Given input $\mathbf{x}$, we model the probability that the label $y = 1$:
$$p(y = 1|\mathbf{x}) = \hat{p} = \sigma(\mathbf{w}^T\mathbf{x} + b)$$
$$p(y = 0|\mathbf{x}) = 1 - \hat{p}$$
This can be written compactly as:
$$p(y|\mathbf{x}) = \hat{p}^y (1 - \hat{p})^{1-y}$$
The network's output $\hat{p}$ is the Bernoulli parameter—the probability of success in a single trial. Our loss function should encourage the network to produce $\hat{p}$ that accurately reflects the true conditional probability.
Key insight: calibration
A well-trained binary classifier produces calibrated probabilities. If the model outputs $\hat{p} = 0.7$ for many examples, approximately 70% of those examples should actually have $y = 1$. Calibration is a crucial property for reliable decision-making and is directly encouraged by the cross-entropy loss (which we derive next).
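A reliability check makes this concrete. The helper below is an illustrative sketch (the function name and synthetic data are assumptions, not from a library): it bins predictions by predicted probability and compares each bin's mean prediction to its observed positive rate. For labels drawn from the predicted probabilities themselves, the two columns should match closely:

```python
import numpy as np

def calibration_bins(y_true, y_prob, n_bins=10):
    """Group predictions into probability bins and compare the mean
    predicted probability to the observed positive rate in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.sum() == 0:
            continue
        rows.append((y_prob[mask].mean(), y_true[mask].mean(), mask.sum()))
    return rows

# Synthetic, perfectly calibrated predictions: labels are drawn
# from the predicted probabilities themselves.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 100_000)
y = rng.binomial(1, p)

for mean_pred, frac_pos, count in calibration_bins(y, p):
    print(f"pred≈{mean_pred:.2f}  observed={frac_pos:.2f}  n={count}")
```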
| Property | Formula | Interpretation |
|---|---|---|
| Mean | $E[Y] = p$ | Expected value equals probability parameter |
| Variance | $\text{Var}[Y] = p(1-p)$ | Maximum variance at $p=0.5$ (maximum uncertainty) |
| Entropy | $H = -p\log p - (1-p)\log(1-p)$ | Uncertainty in the outcome; maximum at $p=0.5$ |
| Log-likelihood | $y\log p + (1-y)\log(1-p)$ | Proper scoring rule for probability estimation |
You might wonder: why not use MSE between $\hat{p}$ and $y \in \{0, 1\}$? Theoretically, maximizing Bernoulli likelihood (cross-entropy) is statistically consistent and produces calibrated probabilities. MSE on probabilities has inferior gradient properties near 0 and 1, leading to slower learning when predictions are confident but wrong.
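The gradient claim can be made precise. For a single sample, $\partial \mathcal{L}_{\text{BCE}}/\partial z = \hat{p} - y$, while $\partial \mathcal{L}_{\text{MSE}}/\partial z = 2(\hat{p} - y)\,\hat{p}(1-\hat{p})$; the extra $\hat{p}(1-\hat{p})$ factor vanishes whenever the sigmoid saturates. A small sketch for a confidently wrong prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Confidently wrong prediction: true label y=1, large negative logit
z, y = -8.0, 1.0
p = sigmoid(z)  # ~0.000335

# BCE gradient w.r.t. the logit: p - y (stays large when wrong)
grad_bce = p - y

# MSE gradient w.r.t. the logit: 2*(p - y) * p*(1 - p)
# The sigma'(z) = p*(1-p) factor collapses when the sigmoid saturates.
grad_mse = 2 * (p - y) * p * (1 - p)

print(f"BCE gradient: {grad_bce:.6f}")  # ≈ -1: strong learning signal
print(f"MSE gradient: {grad_mse:.6f}")  # ≈ 0: almost no signal
```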
Binary Cross-Entropy (BCE), also known as log loss, is derived directly from the negative log-likelihood of the Bernoulli distribution:
$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i) \right]$$
where $y_i \in \{0, 1\}$ is the true label, $\hat{p}_i = \sigma(z_i)$ is the predicted probability, and $n$ is the number of samples.
Understanding the loss:
When $y_i = 1$, only the $-\log(\hat{p}_i)$ term is active: the loss is near zero if $\hat{p}_i \to 1$ but grows without bound as $\hat{p}_i \to 0$ (and symmetrically for $y_i = 0$). This asymmetric penalty heavily punishes confident wrong predictions, which is exactly what we want—the model should be penalized severely for saying "definitely yes" when the answer is no.
The gradient of BCE with respect to logits:
One of the beautiful properties of the sigmoid + BCE combination is that the gradient simplifies elegantly:
$$\frac{\partial \mathcal{L}}{\partial z} = \hat{p} - y$$
This is the residual—how far the prediction is from the target. No matter how confident (saturated) the sigmoid becomes, the gradient is simply the difference between prediction and truth. This avoids vanishing gradient issues that plague sigmoid in hidden layers.
```python
import torch
import torch.nn.functional as F

def bce_loss_manual(logits, targets, eps=1e-7):
    """
    Binary Cross-Entropy computed manually.
    This illustrates the formula but is NOT numerically stable.
    """
    probs = torch.sigmoid(logits)
    probs = torch.clamp(probs, eps, 1 - eps)  # Avoid log(0)
    loss = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs))
    return loss.mean()

def bce_with_logits_manual(logits, targets):
    """
    Numerically stable BCE using the log-sum-exp trick.
    This is what torch.nn.BCEWithLogitsLoss does internally.

    Key insight: we can rewrite the loss to avoid computing sigmoid:

        -y*log(σ(z)) - (1-y)*log(1-σ(z))
          = y*log(1+e^-z) + (1-y)*(z + log(1+e^-z))
          = (1-y)*z + log(1+e^-z)

    which, for numerical stability at either sign of z, equals

          = max(z, 0) - y*z + log(1 + e^-|z|)
    """
    # Stable formula: max(z,0) - y*z + log(1 + exp(-|z|))
    loss = (torch.clamp(logits, min=0) - logits * targets
            + torch.log(1 + torch.exp(-torch.abs(logits))))
    return loss.mean()

# Compare implementations
logits = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
targets = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0])

print("Loss comparison:")
print(f"  Manual (naive):  {bce_loss_manual(logits, targets):.6f}")
print(f"  Manual (stable): {bce_with_logits_manual(logits, targets):.6f}")
print(f"  PyTorch BCEWithLogitsLoss: "
      f"{F.binary_cross_entropy_with_logits(logits, targets):.6f}")

# Gradient computation
logits_grad = logits.clone().requires_grad_(True)
loss = F.binary_cross_entropy_with_logits(logits_grad, targets)
loss.backward()

print(f"Gradients: {logits_grad.grad}")
print(f"Residuals: {torch.sigmoid(logits) - targets}")
print("Note: gradient = (predicted_prob - target) / n")
```

Never apply sigmoid then BCELoss separately in PyTorch. Always use BCEWithLogitsLoss, which takes raw logits. It's numerically stable and faster. Applying sigmoid first can cause log(0) errors and loss of precision for extreme predictions.
Numerical stability is critical for binary classification because we're computing logarithms of values that can be arbitrarily close to 0 or 1. Without care, floating-point arithmetic can produce NaN or Inf values, crashing training.
Problem 1: Log of zero
If $\hat{p} = \sigma(z) \approx 0$ and $y = 1$, we compute $\log(\hat{p}) \to -\infty$. Similarly for $\hat{p} \approx 1$ and $y = 0$. In 32-bit floats, the naive formula yields $\sigma(z) = 0.0$ exactly for $z < -88$, because $e^{-z}$ overflows.
Solution: The log-sum-exp trick
Rather than computing $\sigma(z)$ then taking $\log$, we use algebraic manipulation:
$$-\log \sigma(z) = \log(1 + e^{-z})$$
This can be computed stably using torch.nn.functional.softplus(-z) or the closed-form:
$$\text{softplus}(x) = \log(1 + e^x) = \max(0, x) + \log(1 + e^{-|x|})$$
The second form is numerically stable for all $x$.
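A quick NumPy sketch shows why the second form matters: the naive `log(1 + exp(x))` overflows for large positive `x`, while the rearranged form stays finite everywhere:

```python
import numpy as np

def softplus_naive(x):
    # Overflows for large positive x: exp(1000) -> inf
    return np.log(1 + np.exp(x))

def softplus_stable(x):
    # max(0, x) + log(1 + exp(-|x|)) is safe for all x
    return np.maximum(0, x) + np.log1p(np.exp(-np.abs(x)))

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
with np.errstate(over='ignore'):
    naive_out = softplus_naive(x)
stable_out = softplus_stable(x)

print("naive: ", naive_out)   # inf at x=1000
print("stable:", stable_out)  # finite everywhere
```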
Problem 2: Sigmoid overflow
For large positive $z$, $e^{-z} \approx 0$ (fine). But for large negative $z$, $e^{-z}$ overflows. We use the equivalent form:
$$\sigma(z) = \begin{cases} \frac{1}{1 + e^{-z}} & z \geq 0 \\ \frac{e^z}{1 + e^z} & z < 0 \end{cases}$$
```python
import torch

def demonstrate_numerical_issues():
    """Show what goes wrong without numerical stability."""
    # Extreme logit values
    extreme_logits = torch.tensor([100.0, -100.0, 500.0, -500.0])

    # Naive computation (DON'T DO THIS)
    # torch.exp silently produces inf rather than raising an error
    print("Naive sigmoid computation:")
    naive_sigmoid = 1 / (1 + torch.exp(-extreme_logits))
    print(f"  Sigmoid results: {naive_sigmoid}")

    # Compute log of sigmoid naively (DON'T DO THIS)
    print("Naive log-sigmoid:")
    sigmoid_vals = torch.sigmoid(extreme_logits)
    print(f"  Sigmoid: {sigmoid_vals}")
    # This produces -inf for probabilities that underflow to 0
    log_sigmoid = torch.log(sigmoid_vals)
    print(f"  Log-sigmoid (naive): {log_sigmoid}")

    # Stable computation (DO THIS)
    print("Stable log-sigmoid (using F.logsigmoid):")
    stable_logsigmoid = torch.nn.functional.logsigmoid(extreme_logits)
    print(f"  Log-sigmoid (stable): {stable_logsigmoid}")

    # Full BCE comparison
    print("BCE loss comparison for extreme values:")
    targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

    # Naive (saved from inf only by clamping)
    probs = torch.sigmoid(extreme_logits)
    eps = 1e-7
    probs_clipped = torch.clamp(probs, eps, 1 - eps)
    naive_bce = -(targets * torch.log(probs_clipped)
                  + (1 - targets) * torch.log(1 - probs_clipped))
    print(f"  Naive BCE:  {naive_bce}")

    # Stable
    stable_bce = torch.nn.functional.binary_cross_entropy_with_logits(
        extreme_logits, targets, reduction='none'
    )
    print(f"  Stable BCE: {stable_bce}")

demonstrate_numerical_issues()

# Best practices summary
print("\n" + "=" * 50)
print("BEST PRACTICES:")
print("1. Never compute sigmoid then log—use logsigmoid")
print("2. Use BCEWithLogitsLoss, not sigmoid + BCELoss")
print("3. Clamp logits to [-20, 20] if you must use naive formulas")
print("4. Use float64 for loss computation if precision matters")
```

Modern frameworks (PyTorch, TensorFlow, JAX) implement numerically stable versions by default. But if you're implementing custom losses or working at lower levels, always use the stable formulations. Production models can encounter extreme values that toy examples don't.
Many real-world binary classification problems are heavily imbalanced: fraud detection (0.1% fraud), disease screening (1% positive), click prediction (2% click rate). Standard BCE treats all samples equally, which can cause the model to predict the majority class almost always—achieving high accuracy but poor detection of the minority class.
Strategies for class imbalance:
1. Weighted loss
Assign higher weight to minority class samples:
$$\mathcal{L}_{\text{weighted}} = -\frac{1}{n}\sum_i \left[ w_1 \cdot y_i \log(\hat{p}_i) + w_0 \cdot (1 - y_i)\log(1 - \hat{p}_i) \right]$$
Typical choice: $w_1/w_0 = n_0/n_1$ (inverse class frequency).
2. Focal loss
Down-weight easy (well-classified) examples to focus on hard cases:
$$\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
where $p_t = \hat{p}$ if $y = 1$ else $1 - \hat{p}$, $\alpha_t$ is class weight, and $\gamma$ is the focusing parameter (typically 2).
3. Resampling
Oversample the minority class or undersample the majority class so that training batches see a more balanced mix of the two classes.
4. Threshold adjustment
At inference, don't use 0.5 threshold. Choose threshold based on desired precision-recall tradeoff.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """
    Focal Loss for imbalanced classification.
    From: Lin et al., "Focal Loss for Dense Object Detection" (2017)

    FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)

    γ = 0 recovers standard cross-entropy
    γ > 0 reduces loss for well-classified examples
    """
    def __init__(self, alpha=0.25, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha  # Balance factor (often 0.25)
        self.gamma = gamma  # Focusing parameter (often 2.0)
        self.reduction = reduction

    def forward(self, logits, targets):
        # Compute probabilities
        probs = torch.sigmoid(logits)
        # p_t = p if y=1, 1-p if y=0
        p_t = probs * targets + (1 - probs) * (1 - targets)
        # Alpha weighting
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        # Focal weight: (1 - p_t)^gamma
        focal_weight = (1 - p_t) ** self.gamma
        # Cross-entropy part (stable computation)
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        # Combine
        focal_loss = alpha_t * focal_weight * bce
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss

class WeightedBCELoss(nn.Module):
    """BCE with per-class weighting for imbalanced datasets."""
    def __init__(self, pos_weight=None):
        super().__init__()
        # pos_weight = n_negative / n_positive for balanced loss
        self.pos_weight = pos_weight

    def forward(self, logits, targets):
        if self.pos_weight is not None:
            pos_weight = torch.tensor([self.pos_weight], device=logits.device)
            return F.binary_cross_entropy_with_logits(
                logits, targets, pos_weight=pos_weight
            )
        return F.binary_cross_entropy_with_logits(logits, targets)

# Sample imbalanced data
n_samples = 1000
n_positive = 50  # 5% positive rate
n_negative = n_samples - n_positive

logits = torch.randn(n_samples)  # Random predictions
targets = torch.cat([torch.ones(n_positive), torch.zeros(n_negative)])

# Calculate appropriate weight
pos_weight = n_negative / n_positive
print(f"Class imbalance: {n_positive}/{n_samples} = {n_positive/n_samples:.1%}")
print(f"Positive weight: {pos_weight:.1f}")

# Compare losses
bce_standard = F.binary_cross_entropy_with_logits(logits, targets)
bce_weighted = WeightedBCELoss(pos_weight=pos_weight)(logits, targets)
focal = FocalLoss(alpha=0.25, gamma=2.0)(logits, targets)

print(f"Standard BCE: {bce_standard:.4f}")
print(f"Weighted BCE: {bce_weighted:.4f}")
print(f"Focal Loss:   {focal:.4f}")
```

Weighted loss is best for moderate imbalance (10:1 to 100:1). Focal loss excels in extreme imbalance with many easy negatives (e.g., object detection). Resampling works well when the dataset is large. Often, the best approach is combining moderate class weights with careful threshold selection at inference.
The model outputs a probability $\hat{p}$, but applications often need a binary decision. The decision threshold $\tau$ determines when we predict positive:
$$\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq \tau \\ 0 & \text{otherwise} \end{cases}$$
The 0.5 threshold is NOT always correct.
The optimal threshold depends on: the relative costs of false positives versus false negatives, the class prior at deployment time, and the precision/recall requirements of the application.
Threshold selection methods:
ROC Analysis: Plot True Positive Rate vs. False Positive Rate at all thresholds. Choose point on curve that matches your needs.
Precision-Recall Curve: For imbalanced data, PR curves are more informative. Find threshold that gives desired precision or recall.
Cost-sensitive selection: Given costs $C_{FP}$ and $C_{FN}$, optimal threshold is approximately: $$\tau^* \approx \frac{C_{FP}}{C_{FP} + C_{FN}}$$
F-beta optimization: Find threshold maximizing $F_\beta = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 P + R}$
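As a sketch of this last method (function name and synthetic data are illustrative assumptions), the F1 search generalizes directly to $F_\beta$ by weighting recall with $\beta^2$:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def fbeta_optimal_threshold(y_true, y_probs, beta=2.0):
    """Threshold maximizing F-beta; beta > 1 weights recall more heavily."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_probs)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall + 1e-10)
    best = np.argmax(fbeta[:-1])  # last PR point has no threshold
    return thresholds[best], fbeta[best]

# Synthetic imbalanced data: ~10% positives, separable scores
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, 2000)
probs = np.clip(0.4 * y + rng.normal(0.3, 0.15, 2000), 0.001, 0.999)

tau, score = fbeta_optimal_threshold(y, probs, beta=2.0)
print(f"F2-optimal threshold: {tau:.3f} (F2={score:.3f})")
```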
```python
import numpy as np
from sklearn.metrics import (
    precision_recall_curve, roc_curve, roc_auc_score, f1_score
)

def find_optimal_threshold(y_true, y_probs, method='f1'):
    """
    Find optimal classification threshold.

    Args:
        y_true: Ground truth labels
        y_probs: Predicted probabilities
        method: 'f1', 'youden' (ROC), or 'cost'

    Returns:
        Optimal threshold
    """
    if method == 'f1':
        # Find threshold that maximizes F1 score
        precisions, recalls, thresholds = precision_recall_curve(y_true, y_probs)
        # F1 = 2 * (precision * recall) / (precision + recall)
        f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-10)
        # Best F1 (excluding last element, which has no threshold)
        best_idx = np.argmax(f1_scores[:-1])
        return thresholds[best_idx]
    elif method == 'youden':
        # Youden's J statistic: maximize TPR - FPR
        fpr, tpr, thresholds = roc_curve(y_true, y_probs)
        j_scores = tpr - fpr
        best_idx = np.argmax(j_scores)
        return thresholds[best_idx]
    elif method == 'cost':
        # Cost-sensitive threshold
        # Assuming cost_fp = 1, cost_fn = 10 (FN is 10x worse)
        cost_fp = 1
        cost_fn = 10
        # Optimal threshold approximation
        return cost_fp / (cost_fp + cost_fn)
    else:
        return 0.5

def analyze_thresholds(y_true, y_probs):
    """Comprehensive threshold analysis."""
    print("Threshold Analysis")
    print("=" * 50)

    # AUC - threshold-agnostic metric
    auc = roc_auc_score(y_true, y_probs)
    print(f"AUC-ROC: {auc:.4f}")

    # Find thresholds by different methods
    thresh_f1 = find_optimal_threshold(y_true, y_probs, method='f1')
    thresh_youden = find_optimal_threshold(y_true, y_probs, method='youden')
    print("Optimal thresholds:")
    print(f"  F1-optimal:     {thresh_f1:.4f}")
    print(f"  Youden-optimal: {thresh_youden:.4f}")

    # Performance at different thresholds
    print("Performance at different thresholds:")
    for thresh in [0.1, 0.3, 0.5, thresh_f1, 0.7, 0.9]:
        preds = (y_probs >= thresh).astype(int)
        f1 = f1_score(y_true, preds, zero_division=0)
        tp = ((preds == 1) & (y_true == 1)).sum()
        fp = ((preds == 1) & (y_true == 0)).sum()
        fn = ((preds == 0) & (y_true == 1)).sum()
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        print(f"  τ={thresh:.2f}: F1={f1:.3f}, P={precision:.3f}, R={recall:.3f}")

# Example with synthetic data
np.random.seed(42)
n = 1000
# Imbalanced: 10% positive
y_true = np.random.binomial(1, 0.1, n)
# Model with reasonable AUC
y_probs = np.clip(y_true + np.random.randn(n) * 0.5, 0.01, 0.99)

analyze_thresholds(y_true, y_probs)
```

Threshold selection assumes probabilities are calibrated. If your model is overconfident or underconfident, first apply calibration (Platt scaling, isotonic regression) before finding the optimal threshold. Uncalibrated probabilities will lead to suboptimal threshold choices.
Let's consolidate the architectural decisions for binary classification into a complete, production-ready pattern:
Standard architecture:
```
Input → Hidden Layers → Linear(d, 1) → [No Activation] → BCEWithLogitsLoss
                                     ↓
                          Sigmoid at inference only
```
Key points:
Single logit output: no activation at the output layer; apply sigmoid only at inference
Stable loss: train with BCEWithLogitsLoss on raw logits
Imbalance: use the pos_weight argument in the loss for imbalanced data

```python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    """
    Complete binary classification model with best practices.

    Design decisions:
    1. Output is a single logit (no activation)
    2. Use BCEWithLogitsLoss during training
    3. Apply sigmoid only at inference for probabilities
    4. Support class imbalance via pos_weight
    """
    def __init__(
        self,
        input_dim: int,
        hidden_dims: list = [128, 64],
        dropout: float = 0.1
    ):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
            ])
            prev_dim = hidden_dim
        self.features = nn.Sequential(*layers)

        # Single output unit, NO activation
        self.classifier = nn.Linear(prev_dim, 1)

        # Initialize output bias to log-odds of prior
        # (will be set based on training data)
        self._init_output_bias()

    def _init_output_bias(self, prior_prob: float = 0.5):
        """
        Initialize output bias to reflect class prior.
        For imbalanced data, this helps training start
        with reasonable predictions.
        """
        # logit(prior) = log(prior / (1 - prior))
        eps = 1e-7
        prior_prob = max(eps, min(1 - eps, prior_prob))
        bias_init = torch.log(torch.tensor(prior_prob / (1 - prior_prob)))
        self.classifier.bias.data.fill_(bias_init)

    def set_prior(self, positive_fraction: float):
        """Set output bias based on observed class prior."""
        self._init_output_bias(positive_fraction)

    def forward(self, x):
        """
        Forward pass - returns LOGITS, not probabilities.
        This is important for numerically stable loss computation.
        """
        features = self.features(x)
        logits = self.classifier(features)
        return logits.squeeze(-1)  # Shape: [batch_size]

    def predict_proba(self, x):
        """Get probabilities (for inference)."""
        with torch.no_grad():
            logits = self.forward(x)
            return torch.sigmoid(logits)

    def predict(self, x, threshold: float = 0.5):
        """Get binary predictions."""
        probs = self.predict_proba(x)
        return (probs >= threshold).long()

class BinaryClassificationTrainer:
    """Training wrapper with imbalance handling."""
    def __init__(self, model, pos_weight=None, learning_rate=1e-3):
        self.model = model
        # Compute pos_weight tensor for imbalanced data
        if pos_weight is not None:
            pos_weight_tensor = torch.tensor([pos_weight])
        else:
            pos_weight_tensor = None
        self.criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight_tensor)
        self.optimizer = torch.optim.AdamW(
            model.parameters(), lr=learning_rate, weight_decay=1e-4
        )

    def train_step(self, x, y):
        self.model.train()
        self.optimizer.zero_grad()
        logits = self.model(x)
        loss = self.criterion(logits, y.float())
        loss.backward()
        self.optimizer.step()
        return loss.item()

    @torch.no_grad()
    def evaluate(self, x, y, threshold=0.5):
        self.model.eval()
        probs = self.model.predict_proba(x)
        preds = (probs >= threshold).long()
        accuracy = (preds == y).float().mean().item()
        return accuracy

# Usage example
model = BinaryClassifier(input_dim=20, hidden_dims=[64, 32])

# For imbalanced data with 5% positive rate
pos_rate = 0.05
pos_weight = (1 - pos_rate) / pos_rate  # = 19
model.set_prior(pos_rate)

trainer = BinaryClassificationTrainer(model, pos_weight=pos_weight)

# Training loop (simplified)
x = torch.randn(64, 20)
y = torch.randint(0, 2, (64,))
loss = trainer.train_step(x, y)
print(f"Training loss: {loss:.4f}")
```

Some practitioners use two output units with softmax + cross-entropy instead of one unit with sigmoid + BCE. Mathematically equivalent for binary classification, but the single-unit approach is more parameter-efficient and slightly faster. The two-unit approach is only necessary when integrating with multi-class frameworks.
Binary classification is a fundamental prediction task with a well-understood theory connecting output design to probabilistic modeling. The sigmoid activation and BCE loss are not arbitrary choices—they emerge from maximum likelihood estimation under the Bernoulli distribution.
You now understand binary classification output design from mathematical foundations to production implementation. Next, we'll extend these concepts to multi-class classification, where the output layer must produce a probability distribution over more than two mutually exclusive classes using softmax activation.