A model is only as good as the signal guiding its learning. For classification tasks, that signal comes from the cross-entropy loss—a function so fundamental that it appears in virtually every neural network classifier, from simple logistic regression to state-of-the-art transformers.
Cross-entropy isn't an arbitrary choice. It emerges naturally from maximum likelihood estimation, has deep roots in information theory, and possesses mathematical properties that make gradient-based optimization particularly effective. Understanding cross-entropy deeply is essential for anyone serious about machine learning.
By the end of this page, you will understand: how cross-entropy emerges from maximum likelihood estimation; its information-theoretic interpretation as KL divergence; the mathematical properties that make it ideal for optimization; numerical stability considerations; and the relationship between different cross-entropy variants (binary, categorical, sparse).
The cross-entropy loss emerges naturally when we apply maximum likelihood estimation to the multinomial logistic regression model. Let's trace this derivation carefully.
The Setup
We have $n$ training examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ where $\mathbf{x}_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \{1, \ldots, K\}$ is its class label.
Our model predicts: $$P(y = k | \mathbf{x}; \boldsymbol{\theta}) = \hat{p}_k = \text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
where $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$ and $\boldsymbol{\theta}$ comprises all parameters.
Writing the Likelihood
Assuming training examples are i.i.d., the likelihood of observing the data given parameters is:
$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} P(y_i | \mathbf{x}_i; \boldsymbol{\theta}) = \prod_{i=1}^{n} \hat{p}_{i, y_i}$$
where $\hat{p}_{i,k}$ denotes the predicted probability of class $k$ for sample $i$.
Taking the Log-Likelihood
Maximizing likelihood is equivalent to maximizing log-likelihood (log is monotonic):
$$\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log \hat{p}_{i, y_i}$$
To use one-hot encoding, let $\mathbf{y}_i \in \{0,1\}^K$ be the one-hot representation of label $y_i$ (i.e., $y_{ik} = 1$ if $y_i = k$, else 0). Then:
$$\log \hat{p}_{i, y_i} = \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$
(Only the true class contributes since $y_{ik} = 0$ for incorrect classes.)
Thus: $$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$
Converting to Loss (Negative Log-Likelihood)
Optimization frameworks minimize losses, so we negate:
$$\mathcal{L}_{\text{NLL}}(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$
This is exactly the categorical cross-entropy loss (also called softmax loss or log loss).
The categorical cross-entropy loss for a dataset is:
$$\mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$
For a single sample with true label $y$ and predicted probabilities $\hat{\mathbf{p}}$:
$$\mathcal{L}_{\text{CE}}^{(i)} = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_y$$
Cross-entropy equals the negative log-probability assigned to the true class.
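This reduction is easy to verify numerically. A minimal NumPy sketch (the distribution and label below are made up for illustration) computes the loss both ways:

```python
import numpy as np

# Single sample: predicted distribution over 3 classes, true class = 0
p_hat = np.array([0.7, 0.2, 0.1])
y_true = 0

# Full sum form: -sum_k y_k * log(p_hat_k), with y one-hot
y_onehot = np.eye(3)[y_true]
loss_sum = -np.sum(y_onehot * np.log(p_hat))

# Shortcut form: just the negative log-probability of the true class
loss_pick = -np.log(p_hat[y_true])

print(loss_sum, loss_pick)  # both ≈ 0.3567
```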
Why Not Squared Error?
One might ask: why not simply minimize $(y_k - \hat{p}_k)^2$? Several reasons:
Probabilistic foundation: Cross-entropy is the MLE objective; squared error assumes Gaussian noise, inappropriate for categorical outcomes.
Gradient behavior: For softmax + squared error, gradients can saturate (become tiny) when predictions are very wrong. Cross-entropy gradients remain informative.
Convexity: Cross-entropy with softmax is convex in the logits. Squared error with softmax is not.
Information-theoretic meaning: Cross-entropy has a principled interpretation (next section); squared error for probabilities does not.
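The gradient-behavior point is easiest to see in the binary case, where the algebra is simplest. This sketch (illustrative numbers only) compares the gradient of squared error vs. cross-entropy through a sigmoid when the model is confidently wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# True label is 1, but the logit is very negative (confidently wrong)
y, z = 1.0, -8.0
p = sigmoid(z)  # ~0.000335

# MSE through sigmoid: dL/dz = (p - y) * p * (1 - p)  -> saturates
grad_mse = (p - y) * p * (1 - p)

# Cross-entropy through sigmoid: dL/dz = p - y  -> stays near -1
grad_ce = p - y

print(f"p = {p:.6f}")
print(f"MSE gradient: {grad_mse:.2e}")  # tiny: learning stalls
print(f"CE  gradient: {grad_ce:.4f}")   # close to -1: strong signal
```

The extra $p(1-p)$ factor in the MSE gradient vanishes precisely when the prediction is most wrong, which is the saturation problem cross-entropy avoids.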
Cross-entropy has deep roots in information theory, providing a principled interpretation beyond its role as a likelihood-derived loss.
Entropy Review
For a discrete probability distribution $\mathbf{p} = (p_1, \ldots, p_K)$, the entropy is:
$$H(\mathbf{p}) = -\sum_{k=1}^{K} p_k \log p_k$$
Entropy measures the average 'surprise' or uncertainty about outcomes sampled from $\mathbf{p}$. It's measured in nats (natural log) or bits (log base 2).
Key properties: $H(\mathbf{p}) \geq 0$; $H(\mathbf{p}) = 0$ exactly when the distribution is deterministic (all mass on one outcome); and $H(\mathbf{p})$ is maximized, at $\log K$, by the uniform distribution.
Cross-Entropy Definition
The cross-entropy between distributions $\mathbf{p}$ (true) and $\mathbf{q}$ (predicted) is:
$$H(\mathbf{p}, \mathbf{q}) = -\sum_{k=1}^{K} p_k \log q_k$$
Interpretation: The average number of nats/bits needed to encode samples from $\mathbf{p}$ using a code optimized for $\mathbf{q}$.
KL Divergence Connection
The Kullback-Leibler divergence (relative entropy) from $\mathbf{q}$ to $\mathbf{p}$ is:
$$D_{KL}(\mathbf{p} | \mathbf{q}) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} = -H(\mathbf{p}) + H(\mathbf{p}, \mathbf{q})$$
Rearranging: $$H(\mathbf{p}, \mathbf{q}) = H(\mathbf{p}) + D_{KL}(\mathbf{p} | \mathbf{q})$$
Key insight: Cross-entropy = Entropy + KL Divergence.
Since $H(\mathbf{p})$ is fixed (depends only on true labels), minimizing cross-entropy is equivalent to minimizing KL divergence between true and predicted distributions.
$$\text{Minimizing } H(\mathbf{p}, \hat{\mathbf{p}}) \iff \text{Minimizing } D_{KL}(\mathbf{p} \| \hat{\mathbf{p}})$$
Properties of KL Divergence: it is non-negative ($D_{KL} \geq 0$, by Gibbs' inequality); it equals zero if and only if $\mathbf{p} = \mathbf{q}$; and it is asymmetric ($D_{KL}(\mathbf{p} \| \mathbf{q}) \neq D_{KL}(\mathbf{q} \| \mathbf{p})$ in general), so it is not a true distance metric.
The KL divergence interpretation means: Training minimizes the 'distance' between our predicted distribution and the true label distribution. When the model perfectly predicts the true class (one-hot), KL divergence becomes zero (cross-entropy equals entropy of one-hot = 0).
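The decomposition is easy to confirm numerically. In this sketch (the "true" distribution $\mathbf{p}$ is arbitrary), cross-entropy always equals entropy plus KL divergence, and is smallest when $\mathbf{q} = \mathbf{p}$:

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])  # "true" distribution

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

entropy_p = -np.sum(p * np.log(p))

for q in [np.array([0.6, 0.3, 0.1]),    # q = p: KL is 0
          np.array([1/3, 1/3, 1/3]),    # uniform
          np.array([0.1, 0.3, 0.6])]:   # far from p
    print(f"H(p,q) = {cross_entropy(p, q):.4f}   "
          f"H(p) + KL = {entropy_p + kl(p, q):.4f}   "
          f"KL = {kl(p, q):.4f}")
```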
Cross-Entropy for One-Hot Labels
In classification, true labels are one-hot: $\mathbf{y} = (0, \ldots, 0, 1, 0, \ldots, 0)$ with a single 1 at the true class $c$.
The entropy of a one-hot distribution is zero: $$H(\mathbf{y}) = -1 \cdot \log(1) - 0 \cdot \log(0) = 0$$
(Using convention $0 \cdot \log(0) = 0$.)
So: $$H(\mathbf{y}, \hat{\mathbf{p}}) = D_{KL}(\mathbf{y} \| \hat{\mathbf{p}}) = -\log \hat{p}_c$$
Cross-entropy reduces to the negative log-probability of the true class.
Intuition: a confident correct prediction costs almost nothing ($-\log 0.9 \approx 0.105$), an unsure one costs more ($-\log 0.5 \approx 0.693$), and a confident wrong prediction is punished severely ($-\log 0.01 \approx 4.6$, with $-\log \hat{p}_c \to \infty$ as $\hat{p}_c \to 0$).
The logarithm creates an asymmetric, sharp penalty for confident wrong predictions.
Cross-entropy's mathematical properties make it ideal for optimization. Understanding these properties explains why it's the universal choice for classification.
Property 1: Non-Negativity
$$\mathcal{L}_{CE} = -\log \hat{p}_y \geq 0$$
since $\hat{p}_y \in (0, 1)$ implies $\log \hat{p}_y \leq 0$.
The minimum occurs when $\hat{p}_y = 1$, giving $\mathcal{L}_{CE} = 0$. However, softmax never outputs exactly 1 (only approaches it as logits $\to \pm\infty$), so the practical minimum is approached asymptotically.
Property 2: Convexity in Logits
For multinomial logistic regression, the cross-entropy loss is convex as a function of the logits $\mathbf{z}$ (and hence the parameters $\boldsymbol{\theta}$).
Proof sketch: The negative log-likelihood can be written as: $$\mathcal{L} = -z_y + \log\left(\sum_{j=1}^{K} e^{z_j}\right)$$
The first term $-z_y$ is linear (hence convex) and log-sum-exp is convex; a sum of convex functions is convex. Since $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$ is affine in the parameters, composing the convex loss with an affine map yields a convex optimization problem.
Implication: Gradient descent finds the global optimum (if one exists). No local minima trap the optimizer.
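Convexity can be spot-checked numerically via the midpoint inequality $f\!\left(\tfrac{z_1 + z_2}{2}\right) \le \tfrac{1}{2}\left(f(z_1) + f(z_2)\right)$. A minimal sketch over random logit pairs:

```python
import numpy as np

def loss(z, y):
    # Per-sample cross-entropy as a function of the logits: -z_y + LSE(z)
    m = np.max(z)
    lse = m + np.log(np.sum(np.exp(z - m)))
    return -z[y] + lse

rng = np.random.default_rng(0)
y = 1
for _ in range(1000):
    z1, z2 = rng.normal(size=3), rng.normal(size=3)
    mid = 0.5 * (z1 + z2)
    # Convexity: value at midpoint <= average of endpoint values
    assert loss(mid, y) <= 0.5 * (loss(z1, y) + loss(z2, y)) + 1e-12
print("midpoint convexity holds on 1000 random logit pairs")
```

This is evidence, not a proof; the proof is the convexity of log-sum-exp sketched above.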
Property 3: Elegant Gradient Structure
The gradient of cross-entropy with respect to logits has a remarkably simple form:
$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k$$
where $y_k = 1$ if $k$ is the true class, else $0$.
Derivation:
For true class $c$: $$\mathcal{L} = -z_c + \log\left(\sum_j e^{z_j}\right)$$
$$\frac{\partial \mathcal{L}}{\partial z_k} = -\mathbf{1}_{k=c} + \frac{e^{z_k}}{\sum_j e^{z_j}} = -y_k + \hat{p}_k = \hat{p}_k - y_k$$
Interpretation:
This $(\text{prediction} - \text{target})$ structure is the hallmark of exponential family likelihoods with canonical link functions.
Unlike sigmoid + MSE, softmax + cross-entropy gradients don't vanish for wrong predictions. When $\hat{p}_{\text{wrong}} \to 1$, the gradient $\hat{p}_{\text{wrong}} - 0 = \hat{p}_{\text{wrong}}$ remains sizable. This property accelerates learning from mistakes.
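The gradient formula $\partial \mathcal{L} / \partial z_k = \hat{p}_k - y_k$ can be verified against central finite differences. A minimal sketch with arbitrary logits:

```python
import numpy as np

def ce_loss(z, y):
    # Stable per-sample cross-entropy via log-softmax
    m = np.max(z)
    log_probs = z - m - np.log(np.sum(np.exp(z - m)))
    return -log_probs[y]

z = np.array([2.0, -1.0, 0.5])
y = 0
probs = np.exp(z - np.max(z))
probs /= probs.sum()
analytic = probs - np.eye(3)[y]  # p_hat - y

# Central finite differences
eps = 1e-6
numeric = np.zeros(3)
for k in range(3):
    e = np.zeros(3)
    e[k] = eps
    numeric[k] = (ce_loss(z + e, y) - ce_loss(z - e, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # ~0 (finite-difference error)
```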
Property 4: Calibration Under Correct Specification
When the model is correctly specified and trained to convergence, cross-entropy training produces well-calibrated probability estimates: predicted probabilities match observed frequencies.
Formally, among all test samples where the model predicts $P(y=k) \approx p$, approximately a fraction $p$ actually belong to class $k$.
This occurs because cross-entropy is a proper scoring rule: its expected value is minimized by reporting the true conditional probabilities $P(y = k \mid \mathbf{x})$, so a converged, correctly specified model has no incentive to distort its probability estimates.
However, deep neural networks often exhibit miscalibration (overconfidence). Post-hoc calibration methods (temperature scaling, Platt scaling) can address this.
Property 5: Information-Theoretic Optimality
Cross-entropy is a proper scoring rule: the expected score is uniquely minimized when the predicted distribution equals the true distribution. This prevents 'gaming' the loss with non-informative predictions.
| Property | Description | Practical Benefit |
|---|---|---|
| Non-negative | $\mathcal{L} \geq 0$, $= 0$ iff perfect | Clear optimal target |
| Convex in logits | No local minima for linear model | Global optimum guaranteed |
| Simple gradients | $\nabla = \hat{\mathbf{p}} - \mathbf{y}$ | Efficient computation, intuitive |
| Non-saturating | Large gradient for large errors | Fast learning from mistakes |
| Proper scoring rule | Incentivizes true probabilities | Well-calibrated outputs |
| Information-optimal | Minimizes KL divergence | Principled probabilistic training |
Computing cross-entropy naively leads to numerical disasters. Understanding and preventing these issues is essential for robust implementations.
Issue 1: Log of Softmax Underflow
Softmax outputs can be very small. For logits $\mathbf{z} = (10, 0, 0)$:
$$\hat{p}_2 = \frac{e^0}{e^{10} + e^0 + e^0} \approx \frac{1}{22028} \approx 4.5 \times 10^{-5}$$
For more extreme logits $\mathbf{z} = (100, 0, 0)$:
$$\hat{p}_2 \approx e^{-100} \approx 3.7 \times 10^{-44}$$
This value is still representable in float64, but once the logit gap exceeds roughly 745, $e^{-\text{gap}}$ underflows to exactly 0. If $\hat{p}_2$ underflows to 0 and the true class is 2, then $\log(0) = -\infty$.
Issue 2: Overflow in Softmax
As discussed in the softmax page, large logits cause $e^z$ to overflow before the ratio is computed.
Computing loss = -log(softmax(z)) in two separate steps is numerically dangerous. Even if softmax doesn't overflow (using the max-subtraction trick), the resulting small probabilities can underflow, causing log(0) = -inf.
The Solution: Log-Softmax
Compute log-probabilities directly without computing probabilities first:
$$\log \hat{p}_k = \log \frac{e^{z_k}}{\sum_j e^{z_j}} = z_k - \log\left(\sum_j e^{z_j}\right) = z_k - \text{LSE}(\mathbf{z})$$
where $\text{LSE}(\mathbf{z}) = \log\sum_j e^{z_j}$ is computed stably:
$$\text{LSE}(\mathbf{z}) = m + \log\left(\sum_j e^{z_j - m}\right) \quad \text{where } m = \max(\mathbf{z})$$
Now cross-entropy is: $$\mathcal{L} = -\log \hat{p}_y = -z_y + \text{LSE}(\mathbf{z})$$
No intermediate probability computation—no underflow!
```python
import numpy as np

def log_sum_exp_stable(z):
    """
    Compute log(sum(exp(z))) in a numerically stable way.

    Args:
        z: Array of logits, shape (K,) or (N, K)
    Returns:
        Log-sum-exp value
    """
    z = np.asarray(z)
    axis = -1 if z.ndim > 1 else None
    z_max = np.max(z, axis=axis, keepdims=True)
    return z_max.squeeze() + np.log(np.sum(np.exp(z - z_max), axis=axis))

def log_softmax_stable(z):
    """
    Compute log(softmax(z)) stably.

    Args:
        z: Logits, shape (K,) or (N, K)
    Returns:
        Log-probabilities, same shape as z
    """
    if z.ndim == 1:
        return z - log_sum_exp_stable(z)
    lse = log_sum_exp_stable(z)
    return z - lse[:, np.newaxis]

def cross_entropy_naive(logits, labels):
    """
    UNSTABLE: Cross-entropy via softmax then log.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
    """
    # Compute softmax (even with the max-subtraction trick, small values underflow)
    z_max = np.max(logits, axis=1, keepdims=True)
    exp_z = np.exp(logits - z_max)
    probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)
    # Select true class probabilities
    n = logits.shape[0]
    true_probs = probs[np.arange(n), labels]
    # Log can underflow if true_probs is tiny!
    return -np.mean(np.log(true_probs))

def cross_entropy_stable(logits, labels):
    """
    STABLE: Cross-entropy via log-softmax directly.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
    Returns:
        Mean cross-entropy loss
    """
    # Compute log-softmax stably, then index the true classes
    log_probs = log_softmax_stable(logits)
    n = logits.shape[0]
    true_log_probs = log_probs[np.arange(n), labels]
    return -np.mean(true_log_probs)

def cross_entropy_with_gradient(logits, labels):
    """
    Compute cross-entropy and gradient simultaneously.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
    Returns:
        loss: Scalar loss value
        grad: (N, K) gradient w.r.t. logits
    """
    n, K = logits.shape
    # Stable softmax for the gradient
    z_max = np.max(logits, axis=1, keepdims=True)
    exp_z = np.exp(logits - z_max)
    probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)
    # Stable loss via log-softmax
    log_probs = log_softmax_stable(logits)
    true_log_probs = log_probs[np.arange(n), labels]
    loss = -np.mean(true_log_probs)
    # Gradient: (p_k - y_k) / n
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1
    grad /= n
    return loss, grad

# Demonstration
print("=== Cross-Entropy Numerical Stability ===")

# Normal case: both methods work
logits_normal = np.array([[1.0, 2.0, 3.0],
                          [2.0, 1.0, 3.0]])
labels = np.array([2, 0])

print("Normal logits:")
print(f"  Naive:  {cross_entropy_naive(logits_normal, labels):.6f}")
print(f"  Stable: {cross_entropy_stable(logits_normal, labels):.6f}")

# Extreme case: naive fails (e^-1000 underflows to exactly 0 in float64)
logits_extreme = np.array([[1000.0, 0.0, 0.0],
                           [0.0, 1000.0, 0.0]])
labels_hard = np.array([1, 0])  # True classes have tiny probability

print("Extreme logits (true class has tiny probability):")
print(f"  Naive:  {cross_entropy_naive(logits_extreme, labels_hard)}")  # inf
print(f"  Stable: {cross_entropy_stable(logits_extreme, labels_hard):.6f}")

# Gradient computation
print("=== Gradient Computation ===")
loss, grad = cross_entropy_with_gradient(logits_normal, labels)
print(f"Loss: {loss:.6f}")
print(f"Gradient:\n{grad}")

# Each gradient row sums to ~0: softmax probs sum to 1, one-hot sums to 1
print(f"Gradient row sums (should be ~0): {grad.sum(axis=1)}")
```

Use fused softmax + cross-entropy functions:
• PyTorch: F.cross_entropy(logits, labels) (takes raw logits, not softmax output)
• TensorFlow: tf.nn.softmax_cross_entropy_with_logits(labels, logits)
• JAX: optax.softmax_cross_entropy(logits, labels)
These compute stable log-softmax internally.
The cross-entropy family includes several related loss functions. Understanding their relationships prevents confusion and errors.
Binary Cross-Entropy (BCE)
For binary classification with $y \in \{0, 1\}$ and predicted probability $\hat{p} = P(y=1)$:
$$\mathcal{L}_{BCE} = -[y \log \hat{p} + (1-y) \log(1 - \hat{p})]$$
Expanded: if $y = 1$ the loss is $-\log \hat{p}$ (penalizing low $\hat{p}$); if $y = 0$ it is $-\log(1 - \hat{p})$ (penalizing high $\hat{p}$).
Categorical Cross-Entropy (CCE)
For multi-class with one-hot $\mathbf{y}$ and predicted distribution $\hat{\mathbf{p}}$:
$$\mathcal{L}_{CCE} = -\sum_{k=1}^{K} y_k \log \hat{p}_k$$
Special case: For $K=2$ with one-hot encoding, the sum is $-[y_1 \log \hat{p}_1 + y_2 \log \hat{p}_2]$ with $y_2 = 1 - y_1$ and $\hat{p}_2 = 1 - \hat{p}_1$.
This is exactly BCE! The two are equivalent for $K=2$.
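The equivalence is easy to confirm numerically. A minimal sketch (label and probability chosen arbitrarily):

```python
import numpy as np

y = 1      # true class (binary)
p = 0.73   # predicted P(y = 1)

# Binary cross-entropy
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Categorical cross-entropy with one-hot y and distribution (1-p, p)
y_onehot = np.array([0.0, 1.0])
p_vec = np.array([1 - p, p])
cce = -np.sum(y_onehot * np.log(p_vec))

print(bce, cce)  # identical
```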
| Variant | Formula | Use Case | Label Format |
|---|---|---|---|
| Binary CE | $-[y\log \hat{p} + (1-y)\log(1-\hat{p})]$ | Binary classification | Scalar $y \in \{0,1\}$ |
| Categorical CE | $-\sum_k y_k \log \hat{p}_k$ | Multi-class (mutually exclusive) | One-hot $\mathbf{y}$ |
| Sparse Categorical CE | $-\log \hat{p}_y$ | Multi-class (efficiency) | Integer label $y$ |
| Binary CE with Logits | $-[y \cdot z - \log(1+e^z)]$ | Binary with sigmoid | Scalar $y$, logit $z$ |
Sparse Categorical Cross-Entropy
For large $K$, storing one-hot vectors is memory-wasteful. Sparse categorical CE takes integer labels directly:
$$\mathcal{L}_{\text{sparse}} = -\log \hat{p}_y$$
where $y \in \{0, 1, \ldots, K-1\}$ is the integer label.
Mathematically identical to categorical CE, but computationally more efficient: it indexes the true-class log-probability directly instead of materializing an $n \times K$ one-hot matrix.
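This identity can be checked directly. A minimal NumPy sketch with made-up logits:

```python
import numpy as np

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])
labels = np.array([0, 2])  # integer (sparse) labels
n, K = logits.shape

# Stable log-softmax
z_max = logits.max(axis=1, keepdims=True)
log_probs = logits - z_max - np.log(
    np.exp(logits - z_max).sum(axis=1, keepdims=True))

# Categorical CE with an explicit one-hot matrix
y_onehot = np.eye(K)[labels]
cce = -np.mean(np.sum(y_onehot * log_probs, axis=1))

# Sparse categorical CE: index the true-class log-probs directly
sparse = -np.mean(log_probs[np.arange(n), labels])

print(cce, sparse)  # identical values
```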
Modern frameworks prefer sparse format:
• PyTorch: F.cross_entropy expects integer labels
• Keras/TensorFlow: CategoricalCrossentropy (one-hot) and SparseCategoricalCrossentropy (integers)
Multi-Label BCE
For multi-label (non-mutually exclusive) classification, apply BCE independently to each label:
$$\mathcal{L}_{\text{multi-label}} = -\frac{1}{K}\sum_{k=1}^{K} [y_k \log \hat{p}_k + (1-y_k) \log(1-\hat{p}_k)]$$
Each $\hat{p}_k$ comes from an independent sigmoid, not softmax: $$\hat{p}_k = \sigma(z_k)$$
Probabilities don't sum to 1—each class is a separate binary decision.
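A minimal sketch of the multi-label case (the logits and label vector are invented for illustration); note the sigmoid outputs need not sum to 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sample, 4 independent labels (e.g. tags), logits from 4 sigmoid heads
z = np.array([2.0, -1.0, 0.5, -3.0])
y = np.array([1.0, 0.0, 1.0, 0.0])  # multiple labels can be 1 at once

p = sigmoid(z)  # NOT softmax: the entries don't sum to 1
bce_per_label = -(y * np.log(p) + (1 - y) * np.log(1 - p))
loss = bce_per_label.mean()

print(f"probabilities: {p.round(3)}  (sum = {p.sum():.3f})")
print(f"multi-label BCE: {loss:.4f}")
```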
Wrong: Using categorical CE with sigmoid outputs, or BCE with softmax outputs.
Correct pairings: • Softmax output → Categorical/Sparse Categorical CE • Sigmoid output → Binary CE (single class) or Multi-label BCE (multiple classes)
Using the wrong pairing typically doesn't raise runtime errors, but the loss no longer matches the model's output distribution, producing poor training dynamics.
Standard cross-entropy treats all samples and classes equally. In many real-world scenarios, this is suboptimal. Several variants address specific challenges.
Weighted Cross-Entropy
Assign different weights to different classes:
$$\mathcal{L}_{\text{weighted}} = -\sum_{k=1}^{K} w_k \cdot y_k \log \hat{p}_k$$
For integer labels: $$\mathcal{L}_{\text{weighted}} = -w_y \log \hat{p}_y$$
Use cases: class imbalance (upweight rare classes) and asymmetric misclassification costs (e.g., penalizing a missed positive more heavily than a false alarm).
Weight selection: a common heuristic is inverse class frequency, $w_k = n / (K \cdot n_k)$ where $n_k$ is the count of class $k$; weights can also be set from domain-specific costs.
Focal Loss
Introduced for object detection (RetinaNet), focal loss down-weights easy examples to focus learning on hard ones:
$$\mathcal{L}_{\text{focal}} = -\alpha_y (1 - \hat{p}_y)^\gamma \log \hat{p}_y$$
where $\alpha_y$ is an optional class-balancing weight and $\gamma \geq 0$ is the focusing parameter ($\gamma = 0$ recovers weighted cross-entropy; $\gamma = 2$ is a common default).
How it works: for easy examples ($\hat{p}_y$ close to 1), the modulating factor $(1 - \hat{p}_y)^\gamma$ is near 0, shrinking their loss; for hard examples ($\hat{p}_y$ small), the factor stays near 1, leaving their loss nearly untouched.
Effect: The model doesn't drown in easy examples that flood gradients; it focuses on informative hard cases.
Gradient analysis:
For cross-entropy: $|\nabla| \propto |\hat{p}_y - 1|$. For focal loss: $|\nabla| \propto |\hat{p}_y - 1|^{1+\gamma}$.
The extra $(1-\hat{p}_y)^\gamma$ factor suppresses gradients from easy examples (where $1 - \hat{p}_y$ is small).
```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """
    Weighted cross-entropy loss.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
        class_weights: (K,) weight per class
    Returns:
        Weighted mean loss
    """
    n, K = logits.shape
    # Stable log-softmax
    z_max = np.max(logits, axis=1, keepdims=True)
    log_probs = logits - z_max - np.log(
        np.sum(np.exp(logits - z_max), axis=1, keepdims=True))
    # Get log-probs of true classes
    true_log_probs = log_probs[np.arange(n), labels]
    # Apply class weights
    sample_weights = class_weights[labels]
    return -np.mean(sample_weights * true_log_probs)

def focal_loss(logits, labels, gamma=2.0, alpha=None):
    """
    Focal loss for handling class imbalance and hard examples.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
        gamma: Focusing parameter (default 2)
        alpha: Optional class weights (K,)
    Returns:
        Mean focal loss
    """
    n, K = logits.shape
    # Stable softmax and log-softmax
    z_max = np.max(logits, axis=1, keepdims=True)
    exp_z = np.exp(logits - z_max)
    probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)
    log_probs = logits - z_max - np.log(
        np.sum(np.exp(logits - z_max), axis=1, keepdims=True))
    # Probs and log-probs of the true classes
    p_y = probs[np.arange(n), labels]
    log_p_y = log_probs[np.arange(n), labels]
    # Focal modulation: (1 - p_y)^gamma
    focal_weight = (1 - p_y) ** gamma
    # Optional class weights
    if alpha is not None:
        focal_weight *= alpha[labels]
    return np.mean(-focal_weight * log_p_y)

def label_smoothing_cross_entropy(logits, labels, epsilon=0.1):
    """
    Cross-entropy with label smoothing.

    Softens one-hot targets to [eps/K, ..., 1 - eps + eps/K, ...].
    Helps prevent overconfidence.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
        epsilon: Smoothing factor (0 = no smoothing)
    Returns:
        Mean smoothed loss
    """
    n, K = logits.shape
    # Stable log-softmax
    z_max = np.max(logits, axis=1, keepdims=True)
    log_probs = logits - z_max - np.log(
        np.sum(np.exp(logits - z_max), axis=1, keepdims=True))
    # Smoothed targets: true class gets 1 - eps + eps/K, others eps/K
    smooth_targets = np.full((n, K), epsilon / K)
    smooth_targets[np.arange(n), labels] = 1 - epsilon + epsilon / K
    # Cross-entropy with soft targets
    loss = -np.sum(smooth_targets * log_probs, axis=1)
    return np.mean(loss)

# Demonstration
print("=== Loss Variants Comparison ===")

logits = np.array([
    [3.0, 0.5, 0.1],  # Easy correct (class 0)
    [0.1, 3.0, 0.2],  # Easy correct (class 1)
    [0.5, 0.4, 0.3],  # Hard example (class 2)
    [0.8, 1.0, 0.9],  # Hard example (class 0)
])
labels = np.array([0, 1, 2, 0])

# Class weights (class 2 is rare, upweight it)
class_weights = np.array([1.0, 1.0, 3.0])

print("Standard CE per-sample losses:")
print([round(-np.log(np.exp(logits[i, labels[i]]) / np.sum(np.exp(logits[i]))), 4)
       for i in range(4)])

print(f"Weighted CE (class 2 = 3x): {weighted_cross_entropy(logits, labels, class_weights):.4f}")
print(f"Focal Loss (gamma=2):       {focal_loss(logits, labels, gamma=2.0):.4f}")
print(f"Label Smoothing (eps=0.1):  {label_smoothing_cross_entropy(logits, labels, epsilon=0.1):.4f}")

# Show focal loss effect on easy vs hard examples
print("=== Focal Loss: Easy vs Hard ===")
print("Easy example (logits=[3, 0, 0], true=0):")
easy_p = np.exp(3) / (np.exp(3) + 2)
print(f"  p_true = {easy_p:.4f}")
print(f"  CE loss = {-np.log(easy_p):.4f}")
print(f"  Focal (γ=2) = {(1 - easy_p)**2 * (-np.log(easy_p)):.4f}")

print("Hard example (logits=[0.5, 0.5, 0.5], true=0):")
hard_p = 1 / 3
print(f"  p_true = {hard_p:.4f}")
print(f"  CE loss = {-np.log(hard_p):.4f}")
print(f"  Focal (γ=2) = {(1 - hard_p)**2 * (-np.log(hard_p)):.4f}")
```

Label Smoothing
Instead of one-hot targets, use 'soft' targets:
$$\tilde{y}_k = \begin{cases} 1 - \epsilon + \frac{\epsilon}{K} & k = y \\ \frac{\epsilon}{K} & k \neq y \end{cases}$$
Benefits: discourages overconfident logits, improves calibration, and often yields a small generalization boost (the technique was introduced with Inception-v3 and is now standard in transformer training).
Loss formula: $$\mathcal{L}_{\text{smooth}} = (1-\epsilon) \cdot (-\log \hat{p}_y) + \epsilon \cdot H(\text{uniform}, \hat{\mathbf{p}})$$
The second term encourages output entropy, preventing collapse to extreme predictions.
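The decomposition above can be verified numerically; this sketch (random logits, arbitrary true class) computes the smoothed loss both directly and via the two-term form:

```python
import numpy as np

K, eps, y = 4, 0.1, 2
rng = np.random.default_rng(1)
z = rng.normal(size=K)

# Stable log-softmax
m = z.max()
log_p = z - m - np.log(np.exp(z - m).sum())

# Direct: cross-entropy against smoothed targets
targets = np.full(K, eps / K)
targets[y] = 1 - eps + eps / K
direct = -np.sum(targets * log_p)

# Decomposition: (1 - eps) * (-log p_y) + eps * H(uniform, p_hat)
h_uniform = -np.mean(log_p)  # cross-entropy of uniform vs p_hat
decomposed = (1 - eps) * (-log_p[y]) + eps * h_uniform

print(direct, decomposed)  # equal
```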
Understanding how gradients flow through softmax + cross-entropy is essential for debugging, implementing custom layers, and theoretical understanding.
Setup
For a single sample with:
Goal: Compute $\frac{\partial \mathcal{L}}{\partial z_k}$ for all $k$.
Step 1: Expand the loss
$$\mathcal{L} = -\log \hat{p}_c = -\log \frac{e^{z_c}}{\sum_j e^{z_j}} = -z_c + \log\left(\sum_j e^{z_j}\right)$$
Step 2: Differentiate w.r.t. $z_k$
$$\frac{\partial \mathcal{L}}{\partial z_k} = -\mathbf{1}_{k=c} + \frac{\partial}{\partial z_k}\left[\log\left(\sum_j e^{z_j}\right)\right]$$
The second term: $$\frac{\partial}{\partial z_k}\left[\log\left(\sum_j e^{z_j}\right)\right] = \frac{e^{z_k}}{\sum_j e^{z_j}} = \hat{p}_k$$
Step 3: Combine
$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - \mathbf{1}_{k=c}$$
Using one-hot notation where $y_k = \mathbf{1}_{k=c}$:
$$\boxed{\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k}$$
Vector form: $$\nabla_{\mathbf{z}} \mathcal{L} = \hat{\mathbf{p}} - \mathbf{y}$$
This is the predicted distribution minus the true distribution.
Properties of this gradient:
Sums to zero: $\sum_k (\hat{p}_k - y_k) = 1 - 1 = 0$
Interpretation: The gradient has no component in the 'all equal' direction—it's purely about rebalancing probability mass.
Bounded: Each component $|\hat{p}_k - y_k| \leq 1$
Interpretable: Positive gradient pushes logit down; negative pushes it up. The true class always gets pushed up (since $\hat{p}_c - 1 < 0$ for $\hat{p}_c < 1$).
Backpropagating to Parameters
For logits $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = \frac{\partial \mathcal{L}}{\partial z_k} \cdot \frac{\partial z_k}{\partial \mathbf{w}_k} = (\hat{p}_k - y_k) \mathbf{x}$$
$$\frac{\partial \mathcal{L}}{\partial b_k} = \hat{p}_k - y_k$$
For a batch of $n$ samples:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = \frac{1}{n}\sum_{i=1}^{n} (\hat{p}_{ik} - y_{ik}) \mathbf{x}_i$$
In matrix form with $X \in \mathbb{R}^{n \times d}$ (rows are samples), $P \in \mathbb{R}^{n \times K}$ (predicted), $Y \in \mathbb{R}^{n \times K}$ (one-hot):
$$\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{n}(P - Y)^T X$$
where $W \in \mathbb{R}^{K \times d}$ and the gradient has the same shape.
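The matrix form can be checked against a per-sample accumulation. A minimal sketch with random data (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 5, 3, 4
X = rng.normal(size=(n, d))          # rows are samples
W = rng.normal(size=(K, d))
b = rng.normal(size=K)
labels = rng.integers(0, K, size=n)

# Forward pass: logits and softmax probabilities
Z = X @ W.T + b
Z -= Z.max(axis=1, keepdims=True)    # stability shift
P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
Y = np.eye(K)[labels]                # one-hot labels

# Matrix-form gradient: (1/n) (P - Y)^T X
grad_W = (P - Y).T @ X / n

# Per-sample accumulation of (p_i - y_i) x_i^T for comparison
grad_loop = np.zeros_like(W)
for i in range(n):
    grad_loop += np.outer(P[i] - Y[i], X[i]) / n

print(np.max(np.abs(grad_W - grad_loop)))  # ~0 (floating-point noise)
```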
The gradient computation is remarkably simple: just subtract the one-hot labels from the softmax probabilities. No complex Jacobian computation needed! This simplicity is why softmax + cross-entropy is so computationally efficient compared to alternatives like softmax + squared error.
We have thoroughly explored the cross-entropy loss—the engine that drives classification learning. Let's consolidate the key insights:
• Cross-entropy is the negative log-likelihood of the softmax model, derived directly from maximum likelihood.
• Minimizing it is equivalent to minimizing the KL divergence between the true and predicted distributions.
• Its gradient with respect to the logits is simply $\hat{\mathbf{p}} - \mathbf{y}$.
• For numerical stability, never compute log(softmax(z)) in two steps; use a fused log-softmax.
What's Next:
With the softmax model and cross-entropy loss established, we now explore an alternative approach to multi-class classification: one-vs-all (OvA) strategies. This decomposition method trains $K$ binary classifiers and offers different tradeoffs compared to multinomial logistic regression.
You now possess a deep understanding of cross-entropy loss—from its probabilistic derivation through its information-theoretic meaning to numerical implementation details. This knowledge is essential for understanding and debugging classification systems at any scale.