Binary classification is one of the most fundamental prediction tasks in machine learning: given an input, determine which of two classes it belongs to. Spam vs. not spam, fraudulent vs. legitimate, positive vs. negative sentiment, click vs. no click—countless real-world problems reduce to this binary choice.
The output layer for binary classification must produce a probability that the input belongs to the positive class. This is fundamentally different from regression: we're not predicting a continuous value, but rather expressing confidence in a discrete outcome. The architecture requires a specific activation function (sigmoid) and loss function (binary cross-entropy) that work together to produce well-calibrated probabilities.
This page provides a rigorous treatment of binary classification outputs, from mathematical foundations to production considerations. By the end, you will understand not just what to use, but why these specific choices emerge from principled statistical reasoning.
This page covers: the sigmoid activation function and its properties, the Bernoulli distribution interpretation, binary cross-entropy loss derivation, numerical stability techniques, class imbalance handling, threshold selection and ROC analysis, and decision-making under probabilistic outputs.
The sigmoid function (also called the logistic function) maps any real number to the interval $(0, 1)$:
$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$$
where $z = \mathbf{w}^T\mathbf{x} + b$ is the pre-activation (logit) from the output layer.
Key properties of the sigmoid:
Range: Output is always in $(0, 1)$, naturally interpretable as a probability
Monotonicity: Strictly increasing—larger logits mean higher probability
Symmetry: $\sigma(-z) = 1 - \sigma(z)$, useful for symmetric classes
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, maximum at $z = 0$, vanishes at extremes
Limits: $\lim_{z \to \infty} \sigma(z) = 1$, $\lim_{z \to -\infty} \sigma(z) = 0$
Inverse (logit): $\sigma^{-1}(p) = \log\frac{p}{1-p}$, transforms probability to log-odds
```python
import numpy as np
import torch

def sigmoid_numpy(z):
    """Naive sigmoid implementation (overflows for large negative z)."""
    return 1 / (1 + np.exp(-z))

def sigmoid_stable(z):
    """
    Numerically stable sigmoid.

    For z >= 0: 1 / (1 + exp(-z))
    For z < 0:  exp(z) / (1 + exp(z))

    This avoids overflow in exp(-z) for large negative z.
    """
    positive_mask = z >= 0
    result = np.zeros_like(z, dtype=np.float64)
    # For positive z
    result[positive_mask] = 1 / (1 + np.exp(-z[positive_mask]))
    # For negative z
    exp_z = np.exp(z[~positive_mask])
    result[~positive_mask] = exp_z / (1 + exp_z)
    return result

def sigmoid_derivative(z):
    """Derivative of sigmoid: σ(z) * (1 - σ(z))"""
    sig = sigmoid_stable(z)
    return sig * (1 - sig)

# Symmetry: σ(-z) = 1 - σ(z)
print("Symmetry check:")
print(f"  σ(2)  = {sigmoid_stable(np.array([2.0]))[0]:.6f}")
print(f"  σ(-2) = {sigmoid_stable(np.array([-2.0]))[0]:.6f}")
print(f"  Sum   = {sigmoid_stable(np.array([2.0]))[0] + sigmoid_stable(np.array([-2.0]))[0]:.6f}")

# Derivative maximum at z=0, vanishing at the extremes
print(f"Derivative at z=0: {sigmoid_derivative(np.array([0.0]))[0]:.6f}")
print(f"Derivative at z=5: {sigmoid_derivative(np.array([5.0]))[0]:.6f}")

# PyTorch built-in (already numerically stable)
z_torch = torch.tensor([0.0, 2.0, -2.0])
print(f"PyTorch sigmoid: {torch.sigmoid(z_torch)}")
```

The logit function $\text{logit}(p) = \log\frac{p}{1-p}$ transforms probabilities to log-odds. This is the linear quantity your network actually predicts. A logit of 0 corresponds to probability 0.5; a logit of 2.2 corresponds to probability ≈ 0.9. Understanding log-odds helps interpret what the network learns.
Why sigmoid for binary classification?
The sigmoid function arises naturally from maximum likelihood estimation under a Bernoulli model. If we assume each data point is drawn from a Bernoulli distribution with parameter $p = \sigma(\mathbf{w}^T\mathbf{x} + b)$:
$$p(y|x) = \sigma(z)^y \cdot (1 - \sigma(z))^{1-y}$$
Taking the negative log-likelihood of this distribution gives us binary cross-entropy. The sigmoid emerges not as an arbitrary choice, but as the canonical link function for the Bernoulli distribution in the generalized linear model framework.
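This equivalence is easy to check numerically. The sketch below (plain NumPy, a single sample) evaluates the negative log of the Bernoulli pmf directly and compares it to the binary cross-entropy term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sample: logit z, label y
z, y = 1.5, 1.0
p = sigmoid(z)

# Negative log-likelihood of the Bernoulli pmf p^y * (1-p)^(1-y)
nll = -np.log(p**y * (1 - p)**(1 - y))

# Binary cross-entropy term for the same sample
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(nll, bce)  # identical up to floating-point rounding
```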
The saturation problem:
While sigmoid works well for output layers, its derivatives approach zero for large $|z|$ (saturation). This caused problems when sigmoid was used in hidden layers (vanishing gradients), leading to ReLU's popularity. However, at the output layer, saturation is actually desirable—confident predictions should have small gradients because they need less adjustment.
Binary classification is fundamentally about modeling a Bernoulli distribution over two outcomes. Given input $\mathbf{x}$, we model the probability that the label $y = 1$:
$$p(y = 1|\mathbf{x}) = \hat{p} = \sigma(\mathbf{w}^T\mathbf{x} + b)$$
$$p(y = 0|\mathbf{x}) = 1 - \hat{p}$$
This can be written compactly as:
$$p(y|\mathbf{x}) = \hat{p}^y (1 - \hat{p})^{1-y}$$
The network's output $\hat{p}$ is the Bernoulli parameter—the probability of success in a single trial. Our loss function should encourage the network to produce $\hat{p}$ that accurately reflects the true conditional probability.
Key insight: calibration
A well-trained binary classifier produces calibrated probabilities. If the model outputs $\hat{p} = 0.7$ for many examples, approximately 70% of those examples should actually have $y = 1$. Calibration is a crucial property for reliable decision-making and is directly encouraged by the cross-entropy loss (which we derive next).
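A reliability check makes this concrete. The helper below is an illustrative sketch (the function name and synthetic data are assumptions, not from a library): it bins predictions by predicted probability and compares each bin's mean prediction to its observed positive rate. For labels drawn from the predicted probabilities themselves, the two columns should match closely:

```python
import numpy as np

def calibration_bins(y_true, y_prob, n_bins=10):
    """Group predictions into probability bins and compare the mean
    predicted probability to the observed positive rate in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.sum() == 0:
            continue
        rows.append((y_prob[mask].mean(), y_true[mask].mean(), mask.sum()))
    return rows

# Synthetic, perfectly calibrated predictions: labels are drawn
# from the predicted probabilities themselves.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 100_000)
y = rng.binomial(1, p)

for mean_pred, frac_pos, count in calibration_bins(y, p):
    print(f"pred≈{mean_pred:.2f}  observed={frac_pos:.2f}  n={count}")
```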
| Property | Formula | Interpretation |
|---|---|---|
| Mean | $E[Y] = p$ | Expected value equals probability parameter |
| Variance | $\text{Var}[Y] = p(1-p)$ | Maximum variance at $p=0.5$ (maximum uncertainty) |
| Entropy | $H = -p\log p - (1-p)\log(1-p)$ | Uncertainty in the outcome; maximum at $p=0.5$ |
| Log-likelihood | $y\log p + (1-y)\log(1-p)$ | Proper scoring rule for probability estimation |
You might wonder: why not use MSE between $\hat{p}$ and $y \in \{0, 1\}$? Theoretically, maximizing Bernoulli likelihood (cross-entropy) is statistically consistent and produces calibrated probabilities. MSE on probabilities has inferior gradient properties near 0 and 1, leading to slower learning when predictions are confident but wrong.
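The gradient claim can be made precise. For a single sample, $\partial \mathcal{L}_{\text{BCE}}/\partial z = \hat{p} - y$, while $\partial \mathcal{L}_{\text{MSE}}/\partial z = 2(\hat{p} - y)\,\hat{p}(1-\hat{p})$; the extra $\hat{p}(1-\hat{p})$ factor vanishes whenever the sigmoid saturates. A small sketch for a confidently wrong prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Confidently wrong prediction: true label y=1, large negative logit
z, y = -8.0, 1.0
p = sigmoid(z)  # ~0.000335

# BCE gradient w.r.t. the logit: p - y (stays large when wrong)
grad_bce = p - y

# MSE gradient w.r.t. the logit: 2*(p - y) * p*(1 - p)
# The sigma'(z) = p*(1-p) factor collapses when the sigmoid saturates.
grad_mse = 2 * (p - y) * p * (1 - p)

print(f"BCE gradient: {grad_bce:.6f}")  # ≈ -1: strong learning signal
print(f"MSE gradient: {grad_mse:.6f}")  # ≈ 0: almost no signal
```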
Binary Cross-Entropy (BCE), also known as log loss, is derived directly from the negative log-likelihood of the Bernoulli distribution:
$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i) \right]$$
where $y_i \in \{0, 1\}$ is the true label, $\hat{p}_i = \sigma(z_i)$ is the predicted probability, and $n$ is the number of samples.
Understanding the loss:
When $y_i = 1$, only the $-\log(\hat{p}_i)$ term is active: the loss is near zero if $\hat{p}_i \to 1$ but grows without bound as $\hat{p}_i \to 0$ (and symmetrically for $y_i = 0$). This asymmetric penalty heavily punishes confident wrong predictions, which is exactly what we want—the model should be penalized severely for saying "definitely yes" when the answer is no.
The gradient of BCE with respect to logits:
One of the beautiful properties of the sigmoid + BCE combination is that the gradient simplifies elegantly:
$$\frac{\partial \mathcal{L}}{\partial z} = \hat{p} - y$$
This is the residual—how far the prediction is from the target. No matter how confident (saturated) the sigmoid becomes, the gradient is simply the difference between prediction and truth. This avoids vanishing gradient issues that plague sigmoid in hidden layers.
```python
import torch
import torch.nn.functional as F

def bce_loss_manual(logits, targets, eps=1e-7):
    """
    Binary Cross-Entropy computed manually.
    This illustrates the formula but is NOT numerically stable.
    """
    probs = torch.sigmoid(logits)
    probs = torch.clamp(probs, eps, 1 - eps)  # Avoid log(0)
    loss = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs))
    return loss.mean()

def bce_with_logits_manual(logits, targets):
    """
    Numerically stable BCE using the log-sum-exp trick.
    This is what torch.nn.BCEWithLogitsLoss does internally.

    Key insight: we can rewrite the loss to avoid computing sigmoid:

        -y*log(σ(z)) - (1-y)*log(1-σ(z))
          = y*log(1+e^-z) + (1-y)*(z + log(1+e^-z))
          = (1-y)*z + log(1+e^-z)

    which, for numerical stability at either sign of z, equals

          = max(z, 0) - y*z + log(1 + e^-|z|)
    """
    # Stable formula: max(z,0) - y*z + log(1 + exp(-|z|))
    loss = (torch.clamp(logits, min=0) - logits * targets
            + torch.log(1 + torch.exp(-torch.abs(logits))))
    return loss.mean()

# Compare implementations
logits = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
targets = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0])

print("Loss comparison:")
print(f"  Manual (naive):  {bce_loss_manual(logits, targets):.6f}")
print(f"  Manual (stable): {bce_with_logits_manual(logits, targets):.6f}")
print(f"  PyTorch BCEWithLogitsLoss: "
      f"{F.binary_cross_entropy_with_logits(logits, targets):.6f}")

# Gradient computation
logits_grad = logits.clone().requires_grad_(True)
loss = F.binary_cross_entropy_with_logits(logits_grad, targets)
loss.backward()

print(f"Gradients: {logits_grad.grad}")
print(f"Residuals: {torch.sigmoid(logits) - targets}")
print("Note: gradient = (predicted_prob - target) / n")
```

Never apply sigmoid then BCELoss separately in PyTorch. Always use BCEWithLogitsLoss, which takes raw logits. It's numerically stable and faster. Applying sigmoid first can cause log(0) errors and loss of precision for extreme predictions.
Numerical stability is critical for binary classification because we're computing logarithms of values that can be arbitrarily close to 0 or 1. Without care, floating-point arithmetic can produce NaN or Inf values, crashing training.
Problem 1: Log of zero
If $\hat{p} = \sigma(z) \approx 0$ and $y = 1$, we compute $\log(\hat{p}) \to -\infty$. Similarly for $\hat{p} \approx 1$ and $y = 0$. In 32-bit floats, the naive formula yields $\sigma(z) = 0.0$ exactly for $z < -88$, because $e^{-z}$ overflows.
Solution: The log-sum-exp trick
Rather than computing $\sigma(z)$ then taking $\log$, we use algebraic manipulation:
$$-\log \sigma(z) = \log(1 + e^{-z})$$
This can be computed stably using torch.nn.functional.softplus(-z) or the closed-form:
$$\text{softplus}(x) = \log(1 + e^x) = \max(0, x) + \log(1 + e^{-|x|})$$
The second form is numerically stable for all $x$.
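A quick NumPy sketch shows why the second form matters: the naive `log(1 + exp(x))` overflows for large positive `x`, while the rearranged form stays finite everywhere:

```python
import numpy as np

def softplus_naive(x):
    # Overflows for large positive x: exp(1000) -> inf
    return np.log(1 + np.exp(x))

def softplus_stable(x):
    # max(0, x) + log(1 + exp(-|x|)) is safe for all x
    return np.maximum(0, x) + np.log1p(np.exp(-np.abs(x)))

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
with np.errstate(over='ignore'):
    naive_out = softplus_naive(x)
stable_out = softplus_stable(x)

print("naive: ", naive_out)   # inf at x=1000
print("stable:", stable_out)  # finite everywhere
```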
Problem 2: Sigmoid overflow
For large positive $z$, $e^{-z} \approx 0$ (fine). But for large negative $z$, $e^{-z}$ overflows. We use the equivalent form:
$$\sigma(z) = \begin{cases} \frac{1}{1 + e^{-z}} & z \geq 0 \\ \frac{e^z}{1 + e^z} & z < 0 \end{cases}$$
```python
import torch

def demonstrate_numerical_issues():
    """Show what goes wrong without numerical stability."""
    # Extreme logit values
    extreme_logits = torch.tensor([100.0, -100.0, 500.0, -500.0])

    # Naive computation (DON'T DO THIS)
    # torch.exp silently produces inf rather than raising an error
    print("Naive sigmoid computation:")
    naive_sigmoid = 1 / (1 + torch.exp(-extreme_logits))
    print(f"  Sigmoid results: {naive_sigmoid}")

    # Compute log of sigmoid naively (DON'T DO THIS)
    print("Naive log-sigmoid:")
    sigmoid_vals = torch.sigmoid(extreme_logits)
    print(f"  Sigmoid: {sigmoid_vals}")
    # This produces -inf for probabilities that underflow to 0
    log_sigmoid = torch.log(sigmoid_vals)
    print(f"  Log-sigmoid (naive): {log_sigmoid}")

    # Stable computation (DO THIS)
    print("Stable log-sigmoid (using F.logsigmoid):")
    stable_logsigmoid = torch.nn.functional.logsigmoid(extreme_logits)
    print(f"  Log-sigmoid (stable): {stable_logsigmoid}")

    # Full BCE comparison
    print("BCE loss comparison for extreme values:")
    targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

    # Naive (saved from inf only by clamping)
    probs = torch.sigmoid(extreme_logits)
    eps = 1e-7
    probs_clipped = torch.clamp(probs, eps, 1 - eps)
    naive_bce = -(targets * torch.log(probs_clipped)
                  + (1 - targets) * torch.log(1 - probs_clipped))
    print(f"  Naive BCE:  {naive_bce}")

    # Stable
    stable_bce = torch.nn.functional.binary_cross_entropy_with_logits(
        extreme_logits, targets, reduction='none'
    )
    print(f"  Stable BCE: {stable_bce}")

demonstrate_numerical_issues()

# Best practices summary
print("\n" + "=" * 50)
print("BEST PRACTICES:")
print("1. Never compute sigmoid then log—use logsigmoid")
print("2. Use BCEWithLogitsLoss, not sigmoid + BCELoss")
print("3. Clamp logits to [-20, 20] if you must use naive formulas")
print("4. Use float64 for loss computation if precision matters")
```

Modern frameworks (PyTorch, TensorFlow, JAX) implement numerically stable versions by default. But if you're implementing custom losses or working at lower levels, always use the stable formulations. Production models can encounter extreme values that toy examples don't.
Many real-world binary classification problems are heavily imbalanced: fraud detection (0.1% fraud), disease screening (1% positive), click prediction (2% click rate). Standard BCE treats all samples equally, which can cause the model to predict the majority class almost always—achieving high accuracy but poor detection of the minority class.
Strategies for class imbalance:
1. Weighted loss
Assign higher weight to minority class samples:
$$\mathcal{L}_{\text{weighted}} = -\frac{1}{n}\sum_i \left[ w_1 \cdot y_i \log(\hat{p}_i) + w_0 \cdot (1 - y_i)\log(1 - \hat{p}_i) \right]$$
Typical choice: $w_1/w_0 = n_0/n_1$ (inverse class frequency).
2. Focal loss
Down-weight easy (well-classified) examples to focus on hard cases:
$$\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
where $p_t = \hat{p}$ if $y = 1$ else $1 - \hat{p}$, $\alpha_t$ is class weight, and $\gamma$ is the focusing parameter (typically 2).
3. Resampling
Oversample the minority class or undersample the majority class so that training batches see a more balanced mix of the two classes.
4. Threshold adjustment
At inference, don't use 0.5 threshold. Choose threshold based on desired precision-recall tradeoff.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """
    Focal Loss for imbalanced classification.
    From: Lin et al., "Focal Loss for Dense Object Detection" (2017)

    FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)

    γ = 0 recovers standard cross-entropy
    γ > 0 reduces loss for well-classified examples
    """
    def __init__(self, alpha=0.25, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha  # Balance factor (often 0.25)
        self.gamma = gamma  # Focusing parameter (often 2.0)
        self.reduction = reduction

    def forward(self, logits, targets):
        # Compute probabilities
        probs = torch.sigmoid(logits)
        # p_t = p if y=1, 1-p if y=0
        p_t = probs * targets + (1 - probs) * (1 - targets)
        # Alpha weighting
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        # Focal weight: (1 - p_t)^gamma
        focal_weight = (1 - p_t) ** self.gamma
        # Cross-entropy part (stable computation)
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        # Combine
        focal_loss = alpha_t * focal_weight * bce
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss

class WeightedBCELoss(nn.Module):
    """BCE with per-class weighting for imbalanced datasets."""
    def __init__(self, pos_weight=None):
        super().__init__()
        # pos_weight = n_negative / n_positive for balanced loss
        self.pos_weight = pos_weight

    def forward(self, logits, targets):
        if self.pos_weight is not None:
            pos_weight = torch.tensor([self.pos_weight], device=logits.device)
            return F.binary_cross_entropy_with_logits(
                logits, targets, pos_weight=pos_weight
            )
        return F.binary_cross_entropy_with_logits(logits, targets)

# Sample imbalanced data
n_samples = 1000
n_positive = 50  # 5% positive rate
n_negative = n_samples - n_positive

logits = torch.randn(n_samples)  # Random predictions
targets = torch.cat([torch.ones(n_positive), torch.zeros(n_negative)])

# Calculate appropriate weight
pos_weight = n_negative / n_positive
print(f"Class imbalance: {n_positive}/{n_samples} = {n_positive/n_samples:.1%}")
print(f"Positive weight: {pos_weight:.1f}")

# Compare losses
bce_standard = F.binary_cross_entropy_with_logits(logits, targets)
bce_weighted = WeightedBCELoss(pos_weight=pos_weight)(logits, targets)
focal = FocalLoss(alpha=0.25, gamma=2.0)(logits, targets)

print(f"Standard BCE: {bce_standard:.4f}")
print(f"Weighted BCE: {bce_weighted:.4f}")
print(f"Focal Loss:   {focal:.4f}")
```

Weighted loss is best for moderate imbalance (10:1 to 100:1). Focal loss excels in extreme imbalance with many easy negatives (e.g., object detection). Resampling works well when the dataset is large. Often, the best approach is combining moderate class weights with careful threshold selection at inference.
The model outputs a probability $\hat{p}$, but applications often need a binary decision. The decision threshold $\tau$ determines when we predict positive:
$$\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq \tau \\ 0 & \text{otherwise} \end{cases}$$
The 0.5 threshold is NOT always correct.
The optimal threshold depends on: the relative costs of false positives versus false negatives, the class prior at deployment time, and the precision/recall requirements of the application.
Threshold selection methods:
ROC Analysis: Plot True Positive Rate vs. False Positive Rate at all thresholds. Choose point on curve that matches your needs.
Precision-Recall Curve: For imbalanced data, PR curves are more informative. Find threshold that gives desired precision or recall.
Cost-sensitive selection: Given costs $C_{FP}$ and $C_{FN}$, optimal threshold is approximately: $$\tau^* \approx \frac{C_{FP}}{C_{FP} + C_{FN}}$$
F-beta optimization: Find threshold maximizing $F_\beta = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 P + R}$
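As a sketch of this last method (function name and synthetic data are illustrative assumptions), the F1 search generalizes directly to $F_\beta$ by weighting recall with $\beta^2$:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def fbeta_optimal_threshold(y_true, y_probs, beta=2.0):
    """Threshold maximizing F-beta; beta > 1 weights recall more heavily."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_probs)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall + 1e-10)
    best = np.argmax(fbeta[:-1])  # last PR point has no threshold
    return thresholds[best], fbeta[best]

# Synthetic imbalanced data: ~10% positives, separable scores
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, 2000)
probs = np.clip(0.4 * y + rng.normal(0.3, 0.15, 2000), 0.001, 0.999)

tau, score = fbeta_optimal_threshold(y, probs, beta=2.0)
print(f"F2-optimal threshold: {tau:.3f} (F2={score:.3f})")
```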
```python
import numpy as np
from sklearn.metrics import (
    precision_recall_curve, roc_curve, roc_auc_score, f1_score
)

def find_optimal_threshold(y_true, y_probs, method='f1'):
    """
    Find optimal classification threshold.

    Args:
        y_true: Ground truth labels
        y_probs: Predicted probabilities
        method: 'f1', 'youden' (ROC), or 'cost'

    Returns:
        Optimal threshold
    """
    if method == 'f1':
        # Find threshold that maximizes F1 score
        precisions, recalls, thresholds = precision_recall_curve(y_true, y_probs)
        # F1 = 2 * (precision * recall) / (precision + recall)
        f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-10)
        # Best F1 (excluding last element, which has no threshold)
        best_idx = np.argmax(f1_scores[:-1])
        return thresholds[best_idx]
    elif method == 'youden':
        # Youden's J statistic: maximize TPR - FPR
        fpr, tpr, thresholds = roc_curve(y_true, y_probs)
        j_scores = tpr - fpr
        best_idx = np.argmax(j_scores)
        return thresholds[best_idx]
    elif method == 'cost':
        # Cost-sensitive threshold
        # Assuming cost_fp = 1, cost_fn = 10 (FN is 10x worse)
        cost_fp = 1
        cost_fn = 10
        # Optimal threshold approximation
        return cost_fp / (cost_fp + cost_fn)
    else:
        return 0.5

def analyze_thresholds(y_true, y_probs):
    """Comprehensive threshold analysis."""
    print("Threshold Analysis")
    print("=" * 50)

    # AUC - threshold-agnostic metric
    auc = roc_auc_score(y_true, y_probs)
    print(f"AUC-ROC: {auc:.4f}")

    # Find thresholds by different methods
    thresh_f1 = find_optimal_threshold(y_true, y_probs, method='f1')
    thresh_youden = find_optimal_threshold(y_true, y_probs, method='youden')
    print("Optimal thresholds:")
    print(f"  F1-optimal:     {thresh_f1:.4f}")
    print(f"  Youden-optimal: {thresh_youden:.4f}")

    # Performance at different thresholds
    print("Performance at different thresholds:")
    for thresh in [0.1, 0.3, 0.5, thresh_f1, 0.7, 0.9]:
        preds = (y_probs >= thresh).astype(int)
        f1 = f1_score(y_true, preds, zero_division=0)
        tp = ((preds == 1) & (y_true == 1)).sum()
        fp = ((preds == 1) & (y_true == 0)).sum()
        fn = ((preds == 0) & (y_true == 1)).sum()
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        print(f"  τ={thresh:.2f}: F1={f1:.3f}, P={precision:.3f}, R={recall:.3f}")

# Example with synthetic data
np.random.seed(42)
n = 1000
# Imbalanced: 10% positive
y_true = np.random.binomial(1, 0.1, n)
# Model with reasonable AUC
y_probs = np.clip(y_true + np.random.randn(n) * 0.5, 0.01, 0.99)

analyze_thresholds(y_true, y_probs)
```

Threshold selection assumes probabilities are calibrated. If your model is overconfident or underconfident, first apply calibration (Platt scaling, isotonic regression) before finding the optimal threshold. Uncalibrated probabilities will lead to suboptimal threshold choices.
Let's consolidate the architectural decisions for binary classification into a complete, production-ready pattern:
Standard architecture:
```
Input → Hidden Layers → Linear(d, 1) → [No Activation] → BCEWithLogitsLoss
                                     ↓
                          Sigmoid at inference only
```
Key points:
Single logit output: no activation at the output layer; apply sigmoid only at inference
Stable loss: train with BCEWithLogitsLoss on raw logits
Imbalance: use the pos_weight argument in the loss for imbalanced data

```python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    """
    Complete binary classification model with best practices.

    Design decisions:
    1. Output is a single logit (no activation)
    2. Use BCEWithLogitsLoss during training
    3. Apply sigmoid only at inference for probabilities
    4. Support class imbalance via pos_weight
    """
    def __init__(
        self,
        input_dim: int,
        hidden_dims: list = [128, 64],
        dropout: float = 0.1
    ):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
            ])
            prev_dim = hidden_dim
        self.features = nn.Sequential(*layers)

        # Single output unit, NO activation
        self.classifier = nn.Linear(prev_dim, 1)

        # Initialize output bias to log-odds of prior
        # (will be set based on training data)
        self._init_output_bias()

    def _init_output_bias(self, prior_prob: float = 0.5):
        """
        Initialize output bias to reflect class prior.
        For imbalanced data, this helps training start
        with reasonable predictions.
        """
        # logit(prior) = log(prior / (1 - prior))
        eps = 1e-7
        prior_prob = max(eps, min(1 - eps, prior_prob))
        bias_init = torch.log(torch.tensor(prior_prob / (1 - prior_prob)))
        self.classifier.bias.data.fill_(bias_init)

    def set_prior(self, positive_fraction: float):
        """Set output bias based on observed class prior."""
        self._init_output_bias(positive_fraction)

    def forward(self, x):
        """
        Forward pass - returns LOGITS, not probabilities.
        This is important for numerically stable loss computation.
        """
        features = self.features(x)
        logits = self.classifier(features)
        return logits.squeeze(-1)  # Shape: [batch_size]

    def predict_proba(self, x):
        """Get probabilities (for inference)."""
        with torch.no_grad():
            logits = self.forward(x)
            return torch.sigmoid(logits)

    def predict(self, x, threshold: float = 0.5):
        """Get binary predictions."""
        probs = self.predict_proba(x)
        return (probs >= threshold).long()

class BinaryClassificationTrainer:
    """Training wrapper with imbalance handling."""
    def __init__(self, model, pos_weight=None, learning_rate=1e-3):
        self.model = model
        # Compute pos_weight tensor for imbalanced data
        if pos_weight is not None:
            pos_weight_tensor = torch.tensor([pos_weight])
        else:
            pos_weight_tensor = None
        self.criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight_tensor)
        self.optimizer = torch.optim.AdamW(
            model.parameters(), lr=learning_rate, weight_decay=1e-4
        )

    def train_step(self, x, y):
        self.model.train()
        self.optimizer.zero_grad()
        logits = self.model(x)
        loss = self.criterion(logits, y.float())
        loss.backward()
        self.optimizer.step()
        return loss.item()

    @torch.no_grad()
    def evaluate(self, x, y, threshold=0.5):
        self.model.eval()
        probs = self.model.predict_proba(x)
        preds = (probs >= threshold).long()
        accuracy = (preds == y).float().mean().item()
        return accuracy

# Usage example
model = BinaryClassifier(input_dim=20, hidden_dims=[64, 32])

# For imbalanced data with 5% positive rate
pos_rate = 0.05
pos_weight = (1 - pos_rate) / pos_rate  # = 19
model.set_prior(pos_rate)

trainer = BinaryClassificationTrainer(model, pos_weight=pos_weight)

# Training loop (simplified)
x = torch.randn(64, 20)
y = torch.randint(0, 2, (64,))
loss = trainer.train_step(x, y)
print(f"Training loss: {loss:.4f}")
```

Some practitioners use two output units with softmax + cross-entropy instead of one unit with sigmoid + BCE. Mathematically equivalent for binary classification, but the single-unit approach is more parameter-efficient and slightly faster. The two-unit approach is only necessary when integrating with multi-class frameworks.
Binary classification is a fundamental prediction task with a well-understood theory connecting output design to probabilistic modeling. The sigmoid activation and BCE loss are not arbitrary choices—they emerge from maximum likelihood estimation under the Bernoulli distribution.
You now understand binary classification output design from mathematical foundations to production implementation. Next, we'll extend these concepts to multi-class classification, where the output layer must produce a probability distribution over more than two mutually exclusive classes using softmax activation.