When neural networks learn to classify images, detect spam, or predict the next word in a sentence, they are fundamentally learning to assign probabilities to discrete outcomes. But here lies a crucial question: how do we measure the quality of a probability distribution?
Enter cross-entropy loss—arguably the most important loss function in modern machine learning. It doesn't merely measure whether predictions are 'right' or 'wrong'; it quantifies the information-theoretic cost of using predicted probabilities instead of the true distribution. This elegant perspective, rooted in Claude Shannon's foundational work on information theory, provides deep insight into why cross-entropy is so effective for training classifiers.
This page will take you on a rigorous journey through cross-entropy loss, from its information-theoretic origins to its role as the cornerstone of neural network training for classification tasks.
By the end of this page, you will understand: (1) the information-theoretic foundations of cross-entropy, (2) its derivation from maximum likelihood estimation, (3) why it produces superior gradients compared to alternatives, (4) numerical stability techniques essential for implementation, and (5) its deep connections to KL divergence and the broader landscape of probabilistic learning.
To truly understand cross-entropy, we must first grasp the foundational concepts of information theory. Claude Shannon's 1948 paper 'A Mathematical Theory of Communication' introduced a revolutionary way to quantify information, uncertainty, and the cost of communication.
Entropy quantifies the average uncertainty or 'information content' in a probability distribution. For a discrete random variable $X$ with possible outcomes $x_1, x_2, \ldots, x_n$ and associated probabilities $p(x_i)$, the entropy is defined as:
$$H(P) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
The choice of logarithm base determines the unit: base 2 gives bits, base $e$ gives nats. In machine learning, we typically use natural logarithms (nats) for computational convenience.
Intuition: Entropy measures the average number of bits (or nats) needed to encode samples from a distribution using an optimal coding scheme. Events that occur rarely carry more information when they happen—a rare event is 'surprising'.
Key properties of entropy: it is non-negative, it equals zero only for a deterministic outcome, and for a fixed number of outcomes it is maximized by the uniform distribution. The table below illustrates these properties:
| Distribution | Probabilities | Entropy (bits) | Interpretation |
|---|---|---|---|
| Deterministic | [1.0, 0.0] | 0.00 | No uncertainty—outcome is certain |
| Fair coin | [0.5, 0.5] | 1.00 | Maximum uncertainty for 2 outcomes |
| Biased coin | [0.9, 0.1] | 0.47 | Low uncertainty—one outcome dominates |
| Fair die (6-sided) | [1/6, ...] | 2.58 | Maximum uncertainty for 6 outcomes |
| Loaded die | [0.5, 0.1, 0.1, 0.1, 0.1, 0.1] | 2.16 | Reduced uncertainty due to bias |
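These values are easy to verify numerically. Below is a minimal NumPy sketch (the `entropy` helper is ours, not a library function):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution (0 * log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(f"{entropy([1.0, 0.0]):.2f}")                      # 0.00  deterministic
print(f"{entropy([0.5, 0.5]):.2f}")                      # 1.00  fair coin
print(f"{entropy([0.9, 0.1]):.2f}")                      # 0.47  biased coin
print(f"{entropy([1/6] * 6):.2f}")                       # 2.58  fair die
print(f"{entropy([0.5, 0.1, 0.1, 0.1, 0.1, 0.1]):.2f}")  # 2.16  loaded die
```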
Now consider a fundamental problem: you have a true distribution $P$ over outcomes, but you're using a different distribution $Q$ to design your encoding scheme. The cross-entropy between $P$ and $Q$ measures the expected number of bits needed to encode samples from $P$ when using a code optimized for $Q$:
$$H(P, Q) = -\sum_{i=1}^{n} p(x_i) \log q(x_i)$$
Critical insight: Cross-entropy is always at least as large as entropy: $H(P, Q) \geq H(P)$, with equality only when $P = Q$. This extra cost is precisely the KL divergence:
$$D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)}$$
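A quick numeric check of this decomposition, using an arbitrary pair of distributions:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])   # "true" distribution
Q = np.array([0.4, 0.4, 0.2])   # model / coding distribution

H_P  = -np.sum(P * np.log(P))       # entropy of P (nats)
H_PQ = -np.sum(P * np.log(Q))       # cross-entropy H(P, Q)
KL   = np.sum(P * np.log(P / Q))    # KL divergence D_KL(P || Q)

print(f"H(P)       = {H_P:.4f}")
print(f"H(P, Q)    = {H_PQ:.4f}")
print(f"D_KL(P||Q) = {KL:.4f}")
print(f"H(P) + KL  = {H_P + KL:.4f}  (equals H(P, Q))")
assert H_PQ >= H_P  # cross-entropy is never smaller than entropy
```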
In the machine learning context:
Cross-entropy penalizes predictions that assign low probability to the true outcome. If the true label is class 3 but your model predicts p(class 3) = 0.01, the loss is -log(0.01) ≈ 4.6 nats—a heavy penalty. If p(class 3) = 0.99, the loss is only -log(0.99) ≈ 0.01 nats. This logarithmic scaling is precisely what makes cross-entropy so effective: it creates strong pressure to assign high probability to correct classes.
For binary classification problems—where we predict between two classes (often labeled 0 and 1)—we use binary cross-entropy (BCE), also known as log loss. Let's derive it rigorously.
Given a true label $y \in \{0, 1\}$ and a predicted probability $\hat{y} \in (0, 1)$:
The true distribution $P$ is: $$P(Y=1) = y, \quad P(Y=0) = 1 - y$$
The predicted distribution $Q$ is: $$Q(Y=1) = \hat{y}, \quad Q(Y=0) = 1 - \hat{y}$$
Applying the cross-entropy formula:
$$\mathcal{L}_{BCE} = H(P, Q) = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$
Understanding the formula: when $y = 1$, only the $-\log \hat{y}$ term is active, penalizing low probability on the positive class; when $y = 0$, only $-\log(1 - \hat{y})$ is active, penalizing probability mass wrongly assigned to the positive class.
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid function."""
    # np.where evaluates both branches, so a harmless overflow warning may
    # appear for extreme inputs, but the selected result is always finite.
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )

def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Compute binary cross-entropy loss.

    Args:
        y_true: Ground truth labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
        epsilon: Small constant for numerical stability

    Returns:
        BCE loss value
    """
    # Clip predictions to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Compute BCE
    loss = -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )
    return loss

def binary_cross_entropy_with_logits(y_true, logits):
    """
    Numerically stable BCE directly from logits.

    Equivalent to -y*log(sigmoid(z)) - (1-y)*log(1 - sigmoid(z)),
    but computed without ever forming the sigmoid explicitly.
    """
    # Stable identity: BCE = max(z, 0) - z*y + log(1 + exp(-|z|))
    return np.mean(
        np.maximum(logits, 0)
        - logits * y_true
        + np.log1p(np.exp(-np.abs(logits)))
    )

# Demonstration
print("Binary Cross-Entropy Examples:")
print("=" * 50)

# Perfect prediction
y, p = 1.0, 0.99
loss = binary_cross_entropy(np.array([y]), np.array([p]))
print(f"y=1, pred=0.99: BCE = {loss:.4f}")

# Confident wrong prediction
y, p = 1.0, 0.01
loss = binary_cross_entropy(np.array([y]), np.array([p]))
print(f"y=1, pred=0.01: BCE = {loss:.4f} (heavily penalized)")

# Uncertain prediction
y, p = 1.0, 0.5
loss = binary_cross_entropy(np.array([y]), np.array([p]))
print(f"y=1, pred=0.50: BCE = {loss:.4f} (moderate penalty)")
```

Binary cross-entropy has a beautiful probabilistic interpretation as negative log-likelihood. Consider a Bernoulli likelihood for each sample:
$$p(y | \hat{y}) = \hat{y}^y (1 - \hat{y})^{1-y}$$
For a dataset of $N$ independent samples, the likelihood is:
$$\mathcal{L}(\theta) = \prod_{i=1}^{N} \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1-y_i}$$
Taking the negative log and dividing by $N$:
$$-\frac{1}{N} \log \mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$$
This is exactly the binary cross-entropy loss! Minimizing BCE is equivalent to maximum likelihood estimation under a Bernoulli model.
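A small sketch confirming the equivalence on a toy batch (the array values are arbitrary):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0], dtype=float)
y_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Average negative log-likelihood under a per-sample Bernoulli model
nll = -np.mean(np.log(y_pred**y_true * (1 - y_pred)**(1 - y_true)))

# Binary cross-entropy
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(nll, bce)             # identical up to floating-point error
assert np.isclose(nll, bce)
```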
The gradient of BCE with respect to the logit $z$ has a remarkably elegant form:
$$\frac{\partial \mathcal{L}_{BCE}}{\partial z} = \hat{y} - y = \sigma(z) - y$$
This is extraordinarily important. The gradient is simply the difference between prediction and target, bounded between -1 and 1. This provides a strong, well-scaled learning signal: confidently wrong predictions receive a gradient near $\pm 1$ rather than a vanishing one, and the gradient shrinks smoothly to zero only as the prediction approaches the target.
Let $\mathcal{L} = -[y \log \sigma(z) + (1-y) \log(1-\sigma(z))]$. Using the chain rule and $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial z} &= -y \cdot \frac{1}{\sigma(z)} \cdot \sigma(z)(1-\sigma(z)) - (1-y) \cdot \frac{1}{1-\sigma(z)} \cdot \bigl(-\sigma(z)(1-\sigma(z))\bigr) \\
&= -y(1-\sigma(z)) + (1-y)\sigma(z) \\
&= -y + y\sigma(z) + \sigma(z) - y\sigma(z) \\
&= \sigma(z) - y
\end{aligned}$$
This elegant result is why sigmoid + BCE is such a powerful combination.
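A finite-difference check makes this concrete (a self-contained sketch; `bce_from_logit` is our own helper):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_from_logit(y, z):
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, z, eps = 1.0, 0.3, 1e-6
numeric = (bce_from_logit(y, z + eps) - bce_from_logit(y, z - eps)) / (2 * eps)
analytic = sigmoid(z) - y

print(numeric, analytic)  # both ≈ -0.4256
assert np.isclose(numeric, analytic, atol=1e-6)
```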
When we have $K > 2$ mutually exclusive classes, we generalize to categorical cross-entropy (CCE), also called softmax cross-entropy or simply cross-entropy loss in many frameworks.
Given a one-hot label vector $\mathbf{y} \in \{0, 1\}^K$ with $y_c = 1$ for the true class $c$, and predicted probabilities $\hat{\mathbf{y}} = \text{softmax}(\mathbf{z})$ computed from the logits $\mathbf{z}$:
$$\mathcal{L}_{CCE} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
Since $\mathbf{y}$ is one-hot with the true class at index $c$, this simplifies to:
$$\mathcal{L}_{CCE} = -\log \hat{y}_c = -\log \frac{e^{z_c}}{\sum_{j=1}^{K} e^{z_j}} = -z_c + \log \sum_{j=1}^{K} e^{z_j}$$
This formulation is known as the softmax cross-entropy or log-softmax loss.
```python
import numpy as np

def softmax(logits, axis=-1):
    """
    Numerically stable softmax.
    Subtracts max to prevent exp overflow.
    """
    # Shift for numerical stability
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=axis, keepdims=True)

def categorical_cross_entropy(y_true_onehot, y_pred_probs, epsilon=1e-15):
    """
    Compute categorical cross-entropy loss.

    Args:
        y_true_onehot: Ground truth as one-hot vectors (N, K)
        y_pred_probs: Predicted probabilities (N, K)
        epsilon: Small constant for numerical stability

    Returns:
        Average CCE loss
    """
    y_pred_probs = np.clip(y_pred_probs, epsilon, 1 - epsilon)
    loss = -np.sum(y_true_onehot * np.log(y_pred_probs), axis=-1)
    return np.mean(loss)

def softmax_cross_entropy_with_logits(y_true_onehot, logits):
    """
    Numerically stable CCE directly from logits.

    Avoids computing softmax explicitly.
    Uses the log-sum-exp trick for stability.
    """
    # Compute log(softmax(z)) in a stable way:
    #   log_softmax = z - log(sum(exp(z)))
    # using log-sum-exp with max subtraction for stability.
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = max_logits + np.log(
        np.sum(np.exp(logits - max_logits), axis=-1, keepdims=True)
    )
    log_softmax = logits - log_sum_exp
    # CCE = -sum(y * log_softmax)
    loss = -np.sum(y_true_onehot * log_softmax, axis=-1)
    return np.mean(loss)

def sparse_categorical_cross_entropy(y_true_indices, logits):
    """
    CCE when labels are given as class indices, not one-hot.
    More memory efficient for large K.

    Args:
        y_true_indices: Ground truth class indices (N,)
        logits: Raw model outputs (N, K)
    """
    n_samples = logits.shape[0]
    # Compute log-sum-exp for stability
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = max_logits.squeeze() + np.log(
        np.sum(np.exp(logits - max_logits), axis=-1)
    )
    # Get logits at true class indices
    true_class_logits = logits[np.arange(n_samples), y_true_indices]
    # Loss = -log(softmax) = -z_c + log_sum_exp
    loss = -true_class_logits + log_sum_exp
    return np.mean(loss)

# Demonstration
print("Categorical Cross-Entropy Examples:")
print("=" * 50)

# 3-class problem
logits = np.array([[2.0, 1.0, 0.1]])  # Raw model output
y_true = np.array([[1, 0, 0]])        # Class 0 is correct

probs = softmax(logits)
print(f"Logits: {logits[0]}")
print(f"Softmax probabilities: {probs[0]}")
print(f"True class probability: {probs[0, 0]:.4f}")

loss1 = categorical_cross_entropy(y_true, probs)
loss2 = softmax_cross_entropy_with_logits(y_true, logits)
loss3 = sparse_categorical_cross_entropy(np.array([0]), logits)

print(f"\nCCE (from probs): {loss1:.4f}")
print(f"CCE (from logits): {loss2:.4f}")
print(f"Sparse CCE: {loss3:.4f}")
```

Remarkably, the gradient of CCE with respect to logits has the same elegant form as BCE:
$$\frac{\partial \mathcal{L}_{CCE}}{\partial z_k} = \hat{y}_k - y_k$$
For the true class $c$: $\frac{\partial \mathcal{L}}{\partial z_c} = \hat{y}_c - 1$ (pushes logit up)
For other classes $j \neq c$: $\frac{\partial \mathcal{L}}{\partial z_j} = \hat{y}_j$ (pushes logits down)
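This identity can be verified numerically with finite differences (a self-contained sketch; the helper names are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cce_from_logits(z, c):
    # -log softmax(z)[c], via the stable identity -z_c + logsumexp(z)
    m = np.max(z)
    return -z[c] + m + np.log(np.sum(np.exp(z - m)))

z = np.array([2.0, 1.0, 0.1])
c = 0                      # true class index
y = np.eye(3)[c]           # one-hot target
eps = 1e-6

numeric = np.array([
    (cce_from_logits(z + eps * np.eye(3)[k], c)
     - cce_from_logits(z - eps * np.eye(3)[k], c)) / (2 * eps)
    for k in range(3)
])
analytic = softmax(z) - y

print(numeric)   # ≈ [-0.341, 0.242, 0.099]
print(analytic)
assert np.allclose(numeric, analytic, atol=1e-5)
```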
This beautiful symmetry is no coincidence. Both binary and categorical cross-entropy are derived from minimizing KL divergence between the true distribution and the model's predictions. The gradient being simply prediction minus target is a consequence of the exponential family structure of the softmax/sigmoid functions.
In practice, one-hot labels can cause overconfidence. Label smoothing softens targets:
$$y_k^{smooth} = (1 - \alpha) \cdot y_k + \frac{\alpha}{K}$$
where $\alpha$ is the smoothing parameter (typically 0.1). This regularizes the model by preventing it from driving the true-class logit arbitrarily far above the rest, which curbs overconfidence and tends to improve calibration and generalization.
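For concreteness, here is what the smoothed targets look like for $K = 5$ classes and $\alpha = 0.1$ (a toy sketch):

```python
import numpy as np

K, alpha = 5, 0.1
true_class = 2

hard = np.eye(K)[true_class]
smooth = (1 - alpha) * hard + alpha / K

print(hard)    # [0.   0.   1.   0.   0.  ]
print(smooth)  # [0.02 0.02 0.92 0.02 0.02]  -- still sums to 1
```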
For problems with many classes (e.g., ImageNet with 1000 classes), storing one-hot vectors wastes memory. 'Sparse' categorical cross-entropy uses integer class indices instead, computing the same loss more efficiently. Most deep learning frameworks support both formats.
Cross-entropy loss involves logarithms and exponentials—operations notorious for numerical issues. Understanding and implementing stable versions is crucial for reliable training.
Problem 1 (log of zero): when the predicted probability $\hat{y} \to 0$, we get $\log(\hat{y}) \to -\infty$. This happens when the model is very confident about the wrong class.
Problem 2 (exponential overflow): for large logits, $e^{z}$ can overflow the floating-point representation (above roughly 1e308 for float64, 1e38 for float32).
Problem 3 (exponential underflow): for large negative logits, $e^{z} \to 0$, which causes division by zero in a naive softmax when every term underflows.
The naive sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$ is problematic for large negative $z$, where $e^{-z}$ overflows. The stable version branches on the sign of $z$:
$$\sigma(z) = \begin{cases} \frac{1}{1 + e^{-z}} & \text{if } z \geq 0 \\ \frac{e^z}{1 + e^z} & \text{if } z < 0 \end{cases}$$
The key insight: softmax is invariant to adding a constant to all logits. We subtract the maximum:
$$\hat{y}_k = \frac{e^{z_k - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}}$$
This ensures all exponents are $\leq 0$, preventing overflow.
For cross-entropy, we need $\log(\text{softmax})$. Computing this directly risks underflow. Instead:
$$\log \hat{y}_k = z_k - \log \sum_j e^{z_j}$$
The log-sum-exp is computed stably as:
$$\log \sum_j e^{z_j} = m + \log \sum_j e^{z_j - m}$$
where $m = \max(\mathbf{z})$.
```python
import numpy as np

def log_sum_exp_stable(logits, axis=-1):
    """
    Compute log(sum(exp(logits))) in a numerically stable way.
    This is the key primitive for stable softmax cross-entropy.
    """
    max_val = np.max(logits, axis=axis, keepdims=True)
    return max_val.squeeze(axis) + np.log(
        np.sum(np.exp(logits - max_val), axis=axis)
    )

def stable_softmax_cross_entropy(y_true_idx, logits):
    """
    Production-quality softmax cross-entropy.

    Computes: -logit[true_class] + log(sum(exp(logits)))
    All operations are numerically stable.
    """
    n_samples = len(y_true_idx)
    # Stable log-sum-exp
    lse = log_sum_exp_stable(logits, axis=-1)
    # Get logit at true class
    true_logits = logits[np.arange(n_samples), y_true_idx]
    # Cross-entropy = -z_c + LSE
    loss = -true_logits + lse
    return np.mean(loss)

def stable_binary_cross_entropy(y_true, logits):
    """
    Production-quality binary cross-entropy from logits.

    Uses the identity:
        -y*log(σ(z)) - (1-y)*log(1-σ(z)) = max(z,0) - z*y + log(1+exp(-|z|))
    This formulation avoids computing sigmoid explicitly
    and is stable for all values of z.
    """
    # For large positive z: max(z,0)=z, exp(-|z|)≈0, so loss ≈ z - z*y = z(1-y)
    # For large negative z: max(z,0)=0, exp(-|z|)≈0, so loss ≈ -z*y
    # Both limits are correct and finite.
    return np.mean(
        np.maximum(logits, 0)
        - logits * y_true
        + np.log1p(np.exp(-np.abs(logits)))
    )

# Demonstration: Why stability matters
print("Demonstrating Numerical Stability")
print("=" * 50)

# Extreme logits that break a naive implementation
extreme_logits = np.array([[1000.0, 0.0, -1000.0]])  # exp(1000) overflows to inf
y_true = np.array([0])  # True class is 0

# Naive approach: the overflow turns the softmax into nan (inf / inf)
with np.errstate(over='ignore', invalid='ignore'):
    naive_exp = np.exp(extreme_logits)
    naive_softmax = naive_exp / np.sum(naive_exp, axis=-1, keepdims=True)
print(f"Naive softmax: {naive_softmax}")  # contains nan -- useless as a loss input

# Stable approach
loss = stable_softmax_cross_entropy(y_true, extreme_logits)
print(f"\nStable CCE with extreme logits: {loss:.6f}")
print("(Loss is finite and correct!)")

# Binary case with extreme logit
print("\n" + "=" * 50)
y_binary = np.array([1.0])
extreme_z = np.array([50.0])  # Very confident prediction

loss = stable_binary_cross_entropy(y_binary, extreme_z)
print(f"Stable BCE with logit=50: {loss:.10f}")
print("(Correctly close to 0 since prediction is confident and correct)")
```

Never implement cross-entropy loss by first computing softmax/sigmoid and then taking the log. Use combined functions like `tf.nn.softmax_cross_entropy_with_logits` (TensorFlow), `F.cross_entropy` (PyTorch), or `jax.nn.log_softmax` (JAX). These implement the stable formulations internally and are also more efficient (fewer passes through memory).
Cross-entropy and KL divergence are intimately connected, and understanding this relationship illuminates why cross-entropy is the 'right' loss for classification.
The Kullback-Leibler divergence from distribution $Q$ to $P$ measures the information lost when $Q$ is used to approximate $P$:
$$D_{KL}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(P, Q) - H(P)$$
Key properties: $D_{KL}(P \| Q) \geq 0$, with equality if and only if $P = Q$; it is asymmetric, so $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general; and it is not a true distance metric.
In supervised learning, $P_{data}$ is the true distribution (one-hot labels) and $P_{model}$ is our neural network's output. We minimize:
$$D_{KL}(P_{data} \| P_{model}) = H(P_{data}, P_{model}) - H(P_{data})$$
Since $H(P_{data})$ is constant (determined by the data), minimizing KL divergence is equivalent to minimizing cross-entropy $H(P_{data}, P_{model})$.
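A tiny example with a one-hot label makes this concrete: since $H(P_{data}) = 0$ for hard labels, cross-entropy and KL divergence coincide (the values below are arbitrary):

```python
import numpy as np

p_data  = np.array([0.0, 1.0, 0.0])   # one-hot label: entropy is 0
p_model = np.array([0.2, 0.7, 0.1])   # model prediction

cross_entropy = -np.sum(p_data * np.log(p_model))   # = -log(0.7)
mask = p_data > 0
kl = np.sum(p_data[mask] * np.log(p_data[mask] / p_model[mask]))

print(cross_entropy, kl)  # both ≈ 0.3567: with one-hot targets, CE and KL agree
```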
The direction of KL divergence matters:
$D_{KL}(P \| Q)$ (forward KL): the expectation is taken under the true distribution $P$, so the model $Q$ is penalized heavily wherever it assigns low probability to outcomes that $P$ makes likely (mass-covering behavior).
$D_{KL}(Q \| P)$ (reverse KL): the expectation is taken under the model $Q$, so $Q$ is penalized for placing probability where $P$ has little mass (mode-seeking behavior).
In classification, we use forward KL because we want the model to assign high probability wherever the true label says so.
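The asymmetry is easy to see numerically (the distributions below are arbitrary and have full support):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

P = np.array([0.8, 0.1, 0.1])
Q = np.array([0.4, 0.3, 0.3])

print(kl(P, Q))  # forward KL  D_KL(P || Q) ≈ 0.335
print(kl(Q, P))  # reverse KL  D_KL(Q || P) ≈ 0.382  -- generally different
```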
Minimizing cross-entropy has a beautiful interpretation: we are finding the model distribution $Q$ that requires the fewest extra bits (beyond the entropy of the true distribution) to encode samples from $P$.
$$\underbrace{H(P, Q)}_{\text{bits to encode}} = \underbrace{H(P)}_{\text{min possible}} + \underbrace{D_{KL}(P \| Q)}_{\text{wasted bits}}$$
Training a classifier is literally finding the most efficient encoding of the labels given the inputs.
This perspective explains why cross-entropy generalizes well: a model that compresses data well must have learned the true underlying patterns, not memorized spurious correlations.
A natural question arises: why use cross-entropy for classification instead of simpler alternatives? Let's compare.
For binary classification with target $y \in \{0, 1\}$ and prediction $\hat{y} = \sigma(z)$:
MSE Loss: $\mathcal{L}_{MSE} = (y - \hat{y})^2$
Gradient of MSE w.r.t. logit $z$: $$\frac{\partial \mathcal{L}_{MSE}}{\partial z} = 2(\hat{y} - y) \cdot \sigma(z)(1 - \sigma(z))$$
Problem: The gradient includes $\sigma(z)(1-\sigma(z))$, which vanishes when $\sigma(z) \to 0$ or $\sigma(z) \to 1$. This means that a saturated but wrong prediction (e.g., $\sigma(z) \approx 0$ when $y = 1$) receives almost no gradient, so learning stalls exactly when the largest correction is needed.
Cross-Entropy Gradient: $\frac{\partial \mathcal{L}_{BCE}}{\partial z} = \hat{y} - y$
For a true label $y = 1$, compare the gradient magnitudes:

| Prediction (ŷ) | MSE Gradient 2(ŷ−y)·σ′(z) | BCE Gradient ŷ−y | Learning Signal |
|---|---|---|---|
| 0.01 (wrong) | 2(0.01−1) × 0.0099 ≈ −0.02 | −0.99 | BCE ≈ 50× stronger |
| 0.10 (wrong) | 2(0.10−1) × 0.09 ≈ −0.16 | −0.90 | BCE ≈ 5.6× stronger |
| 0.50 (uncertain) | 2(0.50−1) × 0.25 = −0.25 | −0.50 | BCE 2× stronger |
| 0.90 (correct) | 2(0.90−1) × 0.09 ≈ −0.02 | −0.10 | Both small |
| 0.99 (correct) | 2(0.99−1) × 0.0099 ≈ −0.0002 | −0.01 | Both near zero |
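These entries can be reproduced directly (assuming a true label $y = 1$ and a sigmoid output, so $\hat{y} = \sigma(z)$):

```python
import numpy as np

y = 1.0
for p in [0.01, 0.10, 0.50, 0.90, 0.99]:
    mse_grad = 2 * (p - y) * p * (1 - p)  # d/dz of (y - sigma(z))^2
    bce_grad = p - y                      # d/dz of binary cross-entropy
    print(f"p={p:.2f}  MSE grad={mse_grad:+.4f}  BCE grad={bce_grad:+.2f}  "
          f"ratio={abs(bce_grad / mse_grad):5.1f}x")
```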
Cross-entropy is the natural loss for classification because:
It respects the geometry of probability space: Probabilities live on the simplex, and cross-entropy is a Bregman divergence that respects this geometry.
It matches the natural gradient: The gradient of cross-entropy corresponds to the "natural gradient" in the space of distributions, leading to more efficient optimization.
It's derived from first principles: Unlike MSE, which is chosen for mathematical convenience in regression, cross-entropy emerges from maximum likelihood—the principled approach to fitting probability distributions.
Despite its gradient issues, MSE can still work in specific settings, but for standard classification, cross-entropy remains the gold standard.
Cross-entropy's theoretical justification goes beyond just 'good gradients'. By the principle of maximum entropy, among all distributions satisfying known constraints, the one with highest entropy is the least biased. Cross-entropy loss encourages the model to find this least biased distribution that still explains the data—a form of regularization built into the loss itself.
Armed with theoretical understanding, let's discuss practical considerations for using cross-entropy in real systems.
When classes are severely imbalanced (e.g., 99% negative, 1% positive), standard cross-entropy can lead to models that simply predict the majority class. Solutions:
Weighted Cross-Entropy: $$\mathcal{L} = -\sum_k w_k \cdot y_k \log \hat{y}_k$$
where $w_k$ is inversely proportional to class frequency. Common choice: $w_k = \frac{N}{K \cdot n_k}$ where $n_k$ is the count of class $k$.
Focal Loss (covered in a later page): Down-weights easy examples to focus on hard ones.
When samples can belong to multiple classes simultaneously, use binary cross-entropy per class:
$$\mathcal{L} = -\frac{1}{K}\sum_{k=1}^{K} [y_k \log \sigma(z_k) + (1-y_k) \log(1 - \sigma(z_k))]$$
Each output is now an independent binary prediction.
```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, class_weights, epsilon=1e-15):
    """
    Cross-entropy with class weights for imbalanced data.

    Args:
        y_true: One-hot encoded labels (N, K)
        y_pred: Predicted probabilities (N, K)
        class_weights: Weight for each class (K,)
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Weight each sample by its true class weight.
    # For one-hot y_true, this picks out the weight of the true class.
    sample_weights = np.sum(y_true * class_weights, axis=-1)
    # Standard cross-entropy per sample
    ce = -np.sum(y_true * np.log(y_pred), axis=-1)
    # Weighted average
    return np.mean(sample_weights * ce)

def multilabel_binary_cross_entropy(y_true, logits):
    """
    BCE for multi-label classification.
    Each class is treated as an independent binary prediction.

    Args:
        y_true: Binary labels per class (N, K)
        logits: Raw model outputs (N, K)
    """
    # Stable BCE for each class independently
    bce_per_class = (
        np.maximum(logits, 0)
        - logits * y_true
        + np.log1p(np.exp(-np.abs(logits)))
    )
    # Average over classes, then over samples
    return np.mean(bce_per_class)

def label_smoothing_cross_entropy(y_true_indices, logits, alpha=0.1):
    """
    Cross-entropy with label smoothing regularization.

    Soft labels: (1 - alpha + alpha/K) for the true class, alpha/K for the others.
    Prevents overconfidence and improves generalization.

    Args:
        y_true_indices: True class indices (N,)
        logits: Raw outputs (N, K)
        alpha: Smoothing parameter (0.1 typical)
    """
    n_samples, n_classes = logits.shape
    # Create smoothed labels
    smooth_labels = np.full((n_samples, n_classes), alpha / n_classes)
    smooth_labels[np.arange(n_samples), y_true_indices] = 1 - alpha + alpha / n_classes
    # Stable log-softmax
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_softmax = logits - max_logits - np.log(
        np.sum(np.exp(logits - max_logits), axis=-1, keepdims=True)
    )
    # Cross-entropy with soft labels
    loss = -np.sum(smooth_labels * log_softmax, axis=-1)
    return np.mean(loss)

# Demonstration
# (softmax and sparse_categorical_cross_entropy are reused from the earlier listing)
print("Practical Cross-Entropy Variants")
print("=" * 50)

# Setup
np.random.seed(42)
logits = np.random.randn(4, 5)  # 4 samples, 5 classes
y_true = np.array([0, 1, 2, 0])

# Standard CCE
loss_standard = sparse_categorical_cross_entropy(y_true, logits)
print(f"Standard CCE: {loss_standard:.4f}")

# With label smoothing
loss_smooth = label_smoothing_cross_entropy(y_true, logits, alpha=0.1)
print(f"Label-smoothed CCE (α=0.1): {loss_smooth:.4f}")

# Class-weighted (pretend class 2 is rare)
weights = np.array([1.0, 1.0, 5.0, 1.0, 1.0])  # 5× weight for class 2
y_onehot = np.zeros((4, 5))
y_onehot[np.arange(4), y_true] = 1
probs = softmax(logits)
loss_weighted = weighted_cross_entropy(y_onehot, probs, weights)
print(f"Class-weighted CCE: {loss_weighted:.4f}")
```

We've taken a comprehensive journey through cross-entropy loss, from information theory to practical implementation. Let's consolidate the key insights:

- Cross-entropy $H(P, Q)$ measures the expected coding cost of using the model distribution $Q$ in place of the true distribution $P$; the excess over $H(P)$ is exactly the KL divergence.
- Minimizing cross-entropy is equivalent to maximum likelihood estimation under a Bernoulli (binary) or categorical (multi-class) model.
- Paired with sigmoid or softmax, the gradient with respect to the logits is simply $\hat{y} - y$, avoiding the saturation that plagues MSE on probabilities.
- Stable implementations work directly from logits using the log-sum-exp trick; never compute softmax or sigmoid and then take the log.
- Practical variants (class weighting, label smoothing, per-class BCE for multi-label problems) adapt the same core loss to real-world data.
Looking Forward
Cross-entropy is the foundation for classification losses, but it's not the only option. The following pages explore alternatives and refinements, including focal loss for handling hard examples and class imbalance.
You now possess deep understanding of cross-entropy loss—its theoretical foundations, gradient properties, numerical implementation, and practical usage. This knowledge will serve as the benchmark against which you evaluate all other loss functions.