A model is only as good as the signal guiding its learning. For classification tasks, that signal comes from the cross-entropy loss—a function so fundamental that it appears in virtually every neural network classifier, from simple logistic regression to state-of-the-art transformers.
Cross-entropy isn't an arbitrary choice. It emerges naturally from maximum likelihood estimation, has deep roots in information theory, and possesses mathematical properties that make gradient-based optimization particularly effective. Understanding cross-entropy deeply is essential for anyone serious about machine learning.
By the end of this page, you will understand: how cross-entropy emerges from maximum likelihood estimation; its information-theoretic interpretation as KL divergence; the mathematical properties that make it ideal for optimization; numerical stability considerations; and the relationship between different cross-entropy variants (binary, categorical, sparse).
The cross-entropy loss emerges naturally when we apply maximum likelihood estimation to the multinomial logistic regression model. Let's trace this derivation carefully.
The Setup
We have $n$ training examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ where $\mathbf{x}_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \{1, \ldots, K\}$ is its class label.
Our model predicts: $$P(y = k | \mathbf{x}; \boldsymbol{\theta}) = \hat{p}_k = \text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
where $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$ and $\boldsymbol{\theta}$ comprises all parameters.
Writing the Likelihood
Assuming training examples are i.i.d., the likelihood of observing the data given parameters is:
$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} P(y_i | \mathbf{x}_i; \boldsymbol{\theta}) = \prod_{i=1}^{n} \hat{p}_{i, y_i}$$
where $\hat{p}_{i,k}$ denotes the predicted probability of class $k$ for sample $i$.
Taking the Log-Likelihood
Maximizing likelihood is equivalent to maximizing log-likelihood (log is monotonic):
$$\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log \hat{p}_{i, y_i}$$
To use one-hot encoding, let $\mathbf{y}_i \in \{0,1\}^K$ be the one-hot representation of label $y_i$ (i.e., $y_{ik} = 1$ if $y_i = k$, else 0). Then:
$$\log \hat{p}_{i, y_i} = \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$
(Only the true class contributes since $y_{ik} = 0$ for incorrect classes.)
Thus: $$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$
Converting to Loss (Negative Log-Likelihood)
Optimization frameworks minimize losses, so we negate:
$$\mathcal{L}_{\text{NLL}}(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$
This is exactly the categorical cross-entropy loss (also called softmax loss or log loss).
The categorical cross-entropy loss for a dataset is:
$$\mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$
For a single sample with true label $y$ and predicted probabilities $\hat{\mathbf{p}}$:
$$\mathcal{L}_{\text{CE}}^{(i)} = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_y$$
Cross-entropy equals the negative log-probability assigned to the true class.
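This reduction is easy to verify numerically. A minimal NumPy sketch (the distribution and label below are made up for illustration) computes the loss both ways:

```python
import numpy as np

# Single sample: predicted distribution over 3 classes, true class = 0
p_hat = np.array([0.7, 0.2, 0.1])
y_true = 0

# Full sum form: -sum_k y_k * log(p_hat_k), with y one-hot
y_onehot = np.eye(3)[y_true]
loss_sum = -np.sum(y_onehot * np.log(p_hat))

# Shortcut form: just the negative log-probability of the true class
loss_pick = -np.log(p_hat[y_true])

print(loss_sum, loss_pick)  # both ≈ 0.3567
```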
Why Not Squared Error?
One might ask: why not simply minimize $(y_k - \hat{p}_k)^2$? Several reasons:
Probabilistic foundation: Cross-entropy is the MLE objective; squared error assumes Gaussian noise, inappropriate for categorical outcomes.
Gradient behavior: For softmax + squared error, gradients can saturate (become tiny) when predictions are very wrong. Cross-entropy gradients remain informative.
Convexity: Cross-entropy with softmax is convex in the logits. Squared error with softmax is not.
Information-theoretic meaning: Cross-entropy has a principled interpretation (next section); squared error for probabilities does not.
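The gradient-behavior point is easiest to see in the binary case, where the algebra is simplest. This sketch (illustrative numbers only) compares the gradient of squared error vs. cross-entropy through a sigmoid when the model is confidently wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# True label is 1, but the logit is very negative (confidently wrong)
y, z = 1.0, -8.0
p = sigmoid(z)  # ~0.000335

# MSE through sigmoid: dL/dz = (p - y) * p * (1 - p)  -> saturates
grad_mse = (p - y) * p * (1 - p)

# Cross-entropy through sigmoid: dL/dz = p - y  -> stays near -1
grad_ce = p - y

print(f"p = {p:.6f}")
print(f"MSE gradient: {grad_mse:.2e}")  # tiny: learning stalls
print(f"CE  gradient: {grad_ce:.4f}")   # close to -1: strong signal
```

The extra $p(1-p)$ factor in the MSE gradient vanishes precisely when the prediction is most wrong, which is the saturation problem cross-entropy avoids.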
Cross-entropy has deep roots in information theory, providing a principled interpretation beyond its role as a likelihood-derived loss.
Entropy Review
For a discrete probability distribution $\mathbf{p} = (p_1, \ldots, p_K)$, the entropy is:
$$H(\mathbf{p}) = -\sum_{k=1}^{K} p_k \log p_k$$
Entropy measures the average 'surprise' or uncertainty about outcomes sampled from $\mathbf{p}$. It's measured in nats (natural log) or bits (log base 2).
Key properties: $H(\mathbf{p}) \geq 0$; $H(\mathbf{p}) = 0$ exactly when the distribution is deterministic (all mass on one outcome); and $H(\mathbf{p})$ is maximized, at $\log K$, by the uniform distribution.
Cross-Entropy Definition
The cross-entropy between distributions $\mathbf{p}$ (true) and $\mathbf{q}$ (predicted) is:
$$H(\mathbf{p}, \mathbf{q}) = -\sum_{k=1}^{K} p_k \log q_k$$
Interpretation: The average number of nats/bits needed to encode samples from $\mathbf{p}$ using a code optimized for $\mathbf{q}$.
KL Divergence Connection
The Kullback-Leibler divergence (relative entropy) from $\mathbf{q}$ to $\mathbf{p}$ is:
$$D_{KL}(\mathbf{p} | \mathbf{q}) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} = -H(\mathbf{p}) + H(\mathbf{p}, \mathbf{q})$$
Rearranging: $$H(\mathbf{p}, \mathbf{q}) = H(\mathbf{p}) + D_{KL}(\mathbf{p} | \mathbf{q})$$
Key insight: Cross-entropy = Entropy + KL Divergence.
Since $H(\mathbf{p})$ is fixed (depends only on true labels), minimizing cross-entropy is equivalent to minimizing KL divergence between true and predicted distributions.
$$\text{Minimizing } H(\mathbf{p}, \hat{\mathbf{p}}) \iff \text{Minimizing } D_{KL}(\mathbf{p} \| \hat{\mathbf{p}})$$
Properties of KL Divergence: it is non-negative ($D_{KL} \geq 0$, by Gibbs' inequality); it equals zero if and only if $\mathbf{p} = \mathbf{q}$; and it is asymmetric ($D_{KL}(\mathbf{p} \| \mathbf{q}) \neq D_{KL}(\mathbf{q} \| \mathbf{p})$ in general), so it is not a true distance metric.
The KL divergence interpretation means: Training minimizes the 'distance' between our predicted distribution and the true label distribution. When the model perfectly predicts the true class (one-hot), KL divergence becomes zero (cross-entropy equals entropy of one-hot = 0).
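The decomposition is easy to confirm numerically. In this sketch (the "true" distribution $\mathbf{p}$ is arbitrary), cross-entropy always equals entropy plus KL divergence, and is smallest when $\mathbf{q} = \mathbf{p}$:

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])  # "true" distribution

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

entropy_p = -np.sum(p * np.log(p))

for q in [np.array([0.6, 0.3, 0.1]),    # q = p: KL is 0
          np.array([1/3, 1/3, 1/3]),    # uniform
          np.array([0.1, 0.3, 0.6])]:   # far from p
    print(f"H(p,q) = {cross_entropy(p, q):.4f}   "
          f"H(p) + KL = {entropy_p + kl(p, q):.4f}   "
          f"KL = {kl(p, q):.4f}")
```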
Cross-Entropy for One-Hot Labels
In classification, true labels are one-hot: $\mathbf{y} = (0, \ldots, 0, 1, 0, \ldots, 0)$ with a single 1 at the true class $c$.
The entropy of a one-hot distribution is zero: $$H(\mathbf{y}) = -1 \cdot \log(1) - 0 \cdot \log(0) = 0$$
(Using convention $0 \cdot \log(0) = 0$.)
So: $$H(\mathbf{y}, \hat{\mathbf{p}}) = D_{KL}(\mathbf{y} \| \hat{\mathbf{p}}) = -\log \hat{p}_c$$
Cross-entropy reduces to the negative log-probability of the true class.
Intuition: a confident correct prediction costs almost nothing ($-\log 0.9 \approx 0.105$), an unsure one costs more ($-\log 0.5 \approx 0.693$), and a confident wrong prediction is punished severely ($-\log 0.01 \approx 4.6$, with $-\log \hat{p}_c \to \infty$ as $\hat{p}_c \to 0$).
The logarithm creates an asymmetric, sharp penalty for confident wrong predictions.
Cross-entropy's mathematical properties make it ideal for optimization. Understanding these properties explains why it's the universal choice for classification.
Property 1: Non-Negativity
$$\mathcal{L}_{CE} = -\log \hat{p}_y \geq 0$$
since $\hat{p}_y \in (0, 1)$ implies $\log \hat{p}_y \leq 0$.
The minimum occurs when $\hat{p}_y = 1$, giving $\mathcal{L}_{CE} = 0$. However, softmax never outputs exactly 1 (only approaches it as logits $\to \pm\infty$), so the practical minimum is approached asymptotically.
Property 2: Convexity in Logits
For multinomial logistic regression, the cross-entropy loss is convex as a function of the logits $\mathbf{z}$ (and hence the parameters $\boldsymbol{\theta}$).
Proof sketch: The negative log-likelihood can be written as: $$\mathcal{L} = -z_y + \log\left(\sum_{j=1}^{K} e^{z_j}\right)$$
The first term $-z_y$ is linear (hence convex) and log-sum-exp is convex; a sum of convex functions is convex. Since $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$ is affine in the parameters, composing the convex loss with an affine map yields a convex optimization problem.
Implication: Gradient descent finds the global optimum (if one exists). No local minima trap the optimizer.
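Convexity can be spot-checked numerically via the midpoint inequality $f\!\left(\tfrac{z_1 + z_2}{2}\right) \le \tfrac{1}{2}\left(f(z_1) + f(z_2)\right)$. A minimal sketch over random logit pairs:

```python
import numpy as np

def loss(z, y):
    # Per-sample cross-entropy as a function of the logits: -z_y + LSE(z)
    m = np.max(z)
    lse = m + np.log(np.sum(np.exp(z - m)))
    return -z[y] + lse

rng = np.random.default_rng(0)
y = 1
for _ in range(1000):
    z1, z2 = rng.normal(size=3), rng.normal(size=3)
    mid = 0.5 * (z1 + z2)
    # Convexity: value at midpoint <= average of endpoint values
    assert loss(mid, y) <= 0.5 * (loss(z1, y) + loss(z2, y)) + 1e-12
print("midpoint convexity holds on 1000 random logit pairs")
```

This is evidence, not a proof; the proof is the convexity of log-sum-exp sketched above.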
Property 3: Elegant Gradient Structure
The gradient of cross-entropy with respect to logits has a remarkably simple form:
$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k$$
where $y_k = 1$ if $k$ is the true class, else $0$.
Derivation:
For true class $c$: $$\mathcal{L} = -z_c + \log\left(\sum_j e^{z_j}\right)$$
$$\frac{\partial \mathcal{L}}{\partial z_k} = -\mathbf{1}_{k=c} + \frac{e^{z_k}}{\sum_j e^{z_j}} = -y_k + \hat{p}_k = \hat{p}_k - y_k$$
Interpretation:
This $(\text{prediction} - \text{target})$ structure is the hallmark of exponential family likelihoods with canonical link functions.
Unlike sigmoid + MSE, softmax + cross-entropy gradients don't vanish for wrong predictions. When $\hat{p}_{\text{wrong}} \to 1$, the gradient $\hat{p}_{\text{wrong}} - 0 = \hat{p}_{\text{wrong}}$ remains sizable. This property accelerates learning from mistakes.
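The gradient formula $\partial \mathcal{L} / \partial z_k = \hat{p}_k - y_k$ can be verified against central finite differences. A minimal sketch with arbitrary logits:

```python
import numpy as np

def ce_loss(z, y):
    # Stable per-sample cross-entropy via log-softmax
    m = np.max(z)
    log_probs = z - m - np.log(np.sum(np.exp(z - m)))
    return -log_probs[y]

z = np.array([2.0, -1.0, 0.5])
y = 0
probs = np.exp(z - np.max(z))
probs /= probs.sum()
analytic = probs - np.eye(3)[y]  # p_hat - y

# Central finite differences
eps = 1e-6
numeric = np.zeros(3)
for k in range(3):
    e = np.zeros(3)
    e[k] = eps
    numeric[k] = (ce_loss(z + e, y) - ce_loss(z - e, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # ~0 (finite-difference error)
```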
Property 4: Calibration Under Correct Specification
When the model is correctly specified and trained to convergence, cross-entropy training produces well-calibrated probability estimates: predicted probabilities match observed frequencies.
Formally, among all test samples where the model predicts $P(y=k) \approx p$, approximately a fraction $p$ actually belong to class $k$.
This occurs because cross-entropy is a proper scoring rule: its expected value is minimized by reporting the true conditional probabilities $P(y = k \mid \mathbf{x})$, so a converged, correctly specified model has no incentive to distort its probability estimates.
However, deep neural networks often exhibit miscalibration (overconfidence). Post-hoc calibration methods (temperature scaling, Platt scaling) can address this.
Property 5: Information-Theoretic Optimality
Cross-entropy is a proper scoring rule: the expected score is uniquely minimized when the predicted distribution equals the true distribution. This prevents 'gaming' the loss with non-informative predictions.
| Property | Description | Practical Benefit |
|---|---|---|
| Non-negative | $\mathcal{L} \geq 0$, $= 0$ iff perfect | Clear optimal target |
| Convex in logits | No local minima for linear model | Global optimum guaranteed |
| Simple gradients | $\nabla = \hat{\mathbf{p}} - \mathbf{y}$ | Efficient computation, intuitive |
| Non-saturating | Large gradient for large errors | Fast learning from mistakes |
| Proper scoring rule | Incentivizes true probabilities | Well-calibrated outputs |
| Information-optimal | Minimizes KL divergence | Principled probabilistic training |
Computing cross-entropy naively leads to numerical disasters. Understanding and preventing these issues is essential for robust implementations.
Issue 1: Log of Softmax Underflow
Softmax outputs can be very small. For logits $\mathbf{z} = (10, 0, 0)$:
$$\hat{p}_2 = \frac{e^0}{e^{10} + e^0 + e^0} \approx \frac{1}{22028} \approx 4.5 \times 10^{-5}$$
For more extreme logits $\mathbf{z} = (100, 0, 0)$:
$$\hat{p}_2 \approx e^{-100} \approx 3.7 \times 10^{-44}$$
This value is still representable in float64, but once the logit gap exceeds roughly 745, $e^{-\text{gap}}$ underflows to exactly 0. If $\hat{p}_2$ underflows to 0 and the true class is 2, then $\log(0) = -\infty$.
Issue 2: Overflow in Softmax
As discussed in the softmax page, large logits cause $e^z$ to overflow before the ratio is computed.
Computing loss = -log(softmax(z)) in two separate steps is numerically dangerous. Even if softmax doesn't overflow (using the max-subtraction trick), the resulting small probabilities can underflow, causing log(0) = -inf.
The Solution: Log-Softmax
Compute log-probabilities directly without computing probabilities first:
$$\log \hat{p}_k = \log \frac{e^{z_k}}{\sum_j e^{z_j}} = z_k - \log\left(\sum_j e^{z_j}\right) = z_k - \text{LSE}(\mathbf{z})$$
where $\text{LSE}(\mathbf{z}) = \log\sum_j e^{z_j}$ is computed stably:
$$\text{LSE}(\mathbf{z}) = m + \log\left(\sum_j e^{z_j - m}\right) \quad \text{where } m = \max(\mathbf{z})$$
Now cross-entropy is: $$\mathcal{L} = -\log \hat{p}_y = -z_y + \text{LSE}(\mathbf{z})$$
No intermediate probability computation—no underflow!
```python
import numpy as np

def log_sum_exp_stable(z):
    """
    Compute log(sum(exp(z))) in a numerically stable way.

    Args:
        z: Array of logits, shape (K,) or (N, K)
    Returns:
        Log-sum-exp value
    """
    z = np.asarray(z)
    axis = -1 if z.ndim > 1 else None
    z_max = np.max(z, axis=axis, keepdims=True)
    return z_max.squeeze() + np.log(np.sum(np.exp(z - z_max), axis=axis))

def log_softmax_stable(z):
    """
    Compute log(softmax(z)) stably.

    Args:
        z: Logits, shape (K,) or (N, K)
    Returns:
        Log-probabilities, same shape as z
    """
    if z.ndim == 1:
        return z - log_sum_exp_stable(z)
    lse = log_sum_exp_stable(z)
    return z - lse[:, np.newaxis]

def cross_entropy_naive(logits, labels):
    """
    UNSTABLE: Cross-entropy via softmax then log.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
    """
    # Compute softmax (even with the max-subtraction trick, small values underflow)
    z_max = np.max(logits, axis=1, keepdims=True)
    exp_z = np.exp(logits - z_max)
    probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)
    # Select true class probabilities
    n = logits.shape[0]
    true_probs = probs[np.arange(n), labels]
    # Log can underflow if true_probs is tiny!
    return -np.mean(np.log(true_probs))

def cross_entropy_stable(logits, labels):
    """
    STABLE: Cross-entropy via log-softmax directly.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
    Returns:
        Mean cross-entropy loss
    """
    # Compute log-softmax stably, then index the true classes
    log_probs = log_softmax_stable(logits)
    n = logits.shape[0]
    true_log_probs = log_probs[np.arange(n), labels]
    return -np.mean(true_log_probs)

def cross_entropy_with_gradient(logits, labels):
    """
    Compute cross-entropy and gradient simultaneously.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
    Returns:
        loss: Scalar loss value
        grad: (N, K) gradient w.r.t. logits
    """
    n, K = logits.shape
    # Stable softmax for the gradient
    z_max = np.max(logits, axis=1, keepdims=True)
    exp_z = np.exp(logits - z_max)
    probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)
    # Stable loss via log-softmax
    log_probs = log_softmax_stable(logits)
    true_log_probs = log_probs[np.arange(n), labels]
    loss = -np.mean(true_log_probs)
    # Gradient: (p_k - y_k) / n
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1
    grad /= n
    return loss, grad

# Demonstration
print("=== Cross-Entropy Numerical Stability ===")

# Normal case: both methods work
logits_normal = np.array([[1.0, 2.0, 3.0],
                          [2.0, 1.0, 3.0]])
labels = np.array([2, 0])

print("Normal logits:")
print(f"  Naive:  {cross_entropy_naive(logits_normal, labels):.6f}")
print(f"  Stable: {cross_entropy_stable(logits_normal, labels):.6f}")

# Extreme case: naive fails (e^-1000 underflows to exactly 0 in float64)
logits_extreme = np.array([[1000.0, 0.0, 0.0],
                           [0.0, 1000.0, 0.0]])
labels_hard = np.array([1, 0])  # True classes have tiny probability

print("Extreme logits (true class has tiny probability):")
print(f"  Naive:  {cross_entropy_naive(logits_extreme, labels_hard)}")  # inf
print(f"  Stable: {cross_entropy_stable(logits_extreme, labels_hard):.6f}")

# Gradient computation
print("=== Gradient Computation ===")
loss, grad = cross_entropy_with_gradient(logits_normal, labels)
print(f"Loss: {loss:.6f}")
print(f"Gradient:\n{grad}")

# Each gradient row sums to ~0: softmax probs sum to 1, one-hot sums to 1
print(f"Gradient row sums (should be ~0): {grad.sum(axis=1)}")
```

Use fused softmax + cross-entropy functions:
• PyTorch: F.cross_entropy(logits, labels) (takes raw logits, not softmax output)
• TensorFlow: tf.nn.softmax_cross_entropy_with_logits(labels, logits)
• JAX: optax.softmax_cross_entropy(logits, labels)
These compute stable log-softmax internally.
The cross-entropy family includes several related loss functions. Understanding their relationships prevents confusion and errors.
Binary Cross-Entropy (BCE)
For binary classification with $y \in \{0, 1\}$ and predicted probability $\hat{p} = P(y=1)$:
$$\mathcal{L}_{BCE} = -[y \log \hat{p} + (1-y) \log(1 - \hat{p})]$$
Expanded: if $y = 1$ the loss is $-\log \hat{p}$ (penalizing low $\hat{p}$); if $y = 0$ it is $-\log(1 - \hat{p})$ (penalizing high $\hat{p}$).
Categorical Cross-Entropy (CCE)
For multi-class with one-hot $\mathbf{y}$ and predicted distribution $\hat{\mathbf{p}}$:
$$\mathcal{L}_{CCE} = -\sum_{k=1}^{K} y_k \log \hat{p}_k$$
Special case: For $K=2$ with one-hot encoding, the sum is $-[y_1 \log \hat{p}_1 + y_2 \log \hat{p}_2]$ with $y_2 = 1 - y_1$ and $\hat{p}_2 = 1 - \hat{p}_1$.
This is exactly BCE! The two are equivalent for $K=2$.
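The equivalence is easy to confirm numerically. A minimal sketch (label and probability chosen arbitrarily):

```python
import numpy as np

y = 1      # true class (binary)
p = 0.73   # predicted P(y = 1)

# Binary cross-entropy
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Categorical cross-entropy with one-hot y and distribution (1-p, p)
y_onehot = np.array([0.0, 1.0])
p_vec = np.array([1 - p, p])
cce = -np.sum(y_onehot * np.log(p_vec))

print(bce, cce)  # identical
```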
| Variant | Formula | Use Case | Label Format |
|---|---|---|---|
| Binary CE | $-[y\log \hat{p} + (1-y)\log(1-\hat{p})]$ | Binary classification | Scalar $y \in \{0,1\}$ |
| Categorical CE | $-\sum_k y_k \log \hat{p}_k$ | Multi-class (mutually exclusive) | One-hot $\mathbf{y}$ |
| Sparse Categorical CE | $-\log \hat{p}_y$ | Multi-class (efficiency) | Integer label $y$ |
| Binary CE with Logits | $-[y \cdot z - \log(1+e^z)]$ | Binary with sigmoid | Scalar $y$, logit $z$ |
Sparse Categorical Cross-Entropy
For large $K$, storing one-hot vectors is memory-wasteful. Sparse categorical CE takes integer labels directly:
$$\mathcal{L}_{\text{sparse}} = -\log \hat{p}_y$$
where $y \in \{0, 1, \ldots, K-1\}$ is the integer label.
Mathematically identical to categorical CE, but computationally more efficient: it indexes the true-class log-probability directly instead of materializing an $n \times K$ one-hot matrix.
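This identity can be checked directly. A minimal NumPy sketch with made-up logits:

```python
import numpy as np

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])
labels = np.array([0, 2])  # integer (sparse) labels
n, K = logits.shape

# Stable log-softmax
z_max = logits.max(axis=1, keepdims=True)
log_probs = logits - z_max - np.log(
    np.exp(logits - z_max).sum(axis=1, keepdims=True))

# Categorical CE with an explicit one-hot matrix
y_onehot = np.eye(K)[labels]
cce = -np.mean(np.sum(y_onehot * log_probs, axis=1))

# Sparse categorical CE: index the true-class log-probs directly
sparse = -np.mean(log_probs[np.arange(n), labels])

print(cce, sparse)  # identical values
```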
Modern frameworks prefer sparse format:
• PyTorch: F.cross_entropy expects integer labels
• Keras/TensorFlow: CategoricalCrossentropy (one-hot) and SparseCategoricalCrossentropy (integers)
Multi-Label BCE
For multi-label (non-mutually exclusive) classification, apply BCE independently to each label:
$$\mathcal{L}_{\text{multi-label}} = -\frac{1}{K}\sum_{k=1}^{K} [y_k \log \hat{p}_k + (1-y_k) \log(1-\hat{p}_k)]$$
Each $\hat{p}_k$ comes from an independent sigmoid, not softmax: $$\hat{p}_k = \sigma(z_k)$$
Probabilities don't sum to 1—each class is a separate binary decision.
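A minimal sketch of the multi-label case (the logits and label vector are invented for illustration); note the sigmoid outputs need not sum to 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sample, 4 independent labels (e.g. tags), logits from 4 sigmoid heads
z = np.array([2.0, -1.0, 0.5, -3.0])
y = np.array([1.0, 0.0, 1.0, 0.0])  # multiple labels can be 1 at once

p = sigmoid(z)  # NOT softmax: the entries don't sum to 1
bce_per_label = -(y * np.log(p) + (1 - y) * np.log(1 - p))
loss = bce_per_label.mean()

print(f"probabilities: {p.round(3)}  (sum = {p.sum():.3f})")
print(f"multi-label BCE: {loss:.4f}")
```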
Wrong: Using categorical CE with sigmoid outputs, or BCE with softmax outputs.
Correct pairings: • Softmax output → Categorical/Sparse Categorical CE • Sigmoid output → Binary CE (single class) or Multi-label BCE (multiple classes)
Using the wrong pairing typically doesn't raise runtime errors, but the loss no longer matches the model's output distribution, producing poor training dynamics.
Standard cross-entropy treats all samples and classes equally. In many real-world scenarios, this is suboptimal. Several variants address specific challenges.
Weighted Cross-Entropy
Assign different weights to different classes:
$$\mathcal{L}_{\text{weighted}} = -\sum_{k=1}^{K} w_k \cdot y_k \log \hat{p}_k$$
For integer labels: $$\mathcal{L}_{\text{weighted}} = -w_y \log \hat{p}_y$$
Use cases: class imbalance (upweight rare classes) and asymmetric misclassification costs (e.g., penalizing a missed positive more heavily than a false alarm).
Weight selection: a common heuristic is inverse class frequency, $w_k = n / (K \cdot n_k)$ where $n_k$ is the count of class $k$; weights can also be set from domain-specific costs.
Focal Loss
Introduced for object detection (RetinaNet), focal loss down-weights easy examples to focus learning on hard ones:
$$\mathcal{L}_{\text{focal}} = -\alpha_y (1 - \hat{p}_y)^\gamma \log \hat{p}_y$$
where $\alpha_y$ is an optional class-balancing weight and $\gamma \geq 0$ is the focusing parameter ($\gamma = 0$ recovers weighted cross-entropy; $\gamma = 2$ is a common default).
How it works: for easy examples ($\hat{p}_y$ close to 1), the modulating factor $(1 - \hat{p}_y)^\gamma$ is near 0, shrinking their loss; for hard examples ($\hat{p}_y$ small), the factor stays near 1, leaving their loss nearly untouched.
Effect: The model doesn't drown in easy examples that flood gradients; it focuses on informative hard cases.
Gradient analysis:
For cross-entropy: $|\nabla| \propto |\hat{p}_y - 1|$. For focal loss: $|\nabla| \propto |\hat{p}_y - 1|^{1+\gamma}$.
The extra $(1-\hat{p}_y)^\gamma$ factor suppresses gradients from easy examples (where $1 - \hat{p}_y$ is small).
```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """
    Weighted cross-entropy loss.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
        class_weights: (K,) weight per class
    Returns:
        Weighted mean loss
    """
    n, K = logits.shape
    # Stable log-softmax
    z_max = np.max(logits, axis=1, keepdims=True)
    log_probs = logits - z_max - np.log(
        np.sum(np.exp(logits - z_max), axis=1, keepdims=True))
    # Get log-probs of true classes
    true_log_probs = log_probs[np.arange(n), labels]
    # Apply class weights
    sample_weights = class_weights[labels]
    return -np.mean(sample_weights * true_log_probs)

def focal_loss(logits, labels, gamma=2.0, alpha=None):
    """
    Focal loss for handling class imbalance and hard examples.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
        gamma: Focusing parameter (default 2)
        alpha: Optional class weights (K,)
    Returns:
        Mean focal loss
    """
    n, K = logits.shape
    # Stable softmax and log-softmax
    z_max = np.max(logits, axis=1, keepdims=True)
    exp_z = np.exp(logits - z_max)
    probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)
    log_probs = logits - z_max - np.log(
        np.sum(np.exp(logits - z_max), axis=1, keepdims=True))
    # Probs and log-probs of the true classes
    p_y = probs[np.arange(n), labels]
    log_p_y = log_probs[np.arange(n), labels]
    # Focal modulation: (1 - p_y)^gamma
    focal_weight = (1 - p_y) ** gamma
    # Optional class weights
    if alpha is not None:
        focal_weight *= alpha[labels]
    return np.mean(-focal_weight * log_p_y)

def label_smoothing_cross_entropy(logits, labels, epsilon=0.1):
    """
    Cross-entropy with label smoothing.

    Softens one-hot targets to [eps/K, ..., 1 - eps + eps/K, ...].
    Helps prevent overconfidence.

    Args:
        logits: (N, K) logit matrix
        labels: (N,) integer labels
        epsilon: Smoothing factor (0 = no smoothing)
    Returns:
        Mean smoothed loss
    """
    n, K = logits.shape
    # Stable log-softmax
    z_max = np.max(logits, axis=1, keepdims=True)
    log_probs = logits - z_max - np.log(
        np.sum(np.exp(logits - z_max), axis=1, keepdims=True))
    # Smoothed targets: true class gets 1 - eps + eps/K, others eps/K
    smooth_targets = np.full((n, K), epsilon / K)
    smooth_targets[np.arange(n), labels] = 1 - epsilon + epsilon / K
    # Cross-entropy with soft targets
    loss = -np.sum(smooth_targets * log_probs, axis=1)
    return np.mean(loss)

# Demonstration
print("=== Loss Variants Comparison ===")

logits = np.array([
    [3.0, 0.5, 0.1],  # Easy correct (class 0)
    [0.1, 3.0, 0.2],  # Easy correct (class 1)
    [0.5, 0.4, 0.3],  # Hard example (class 2)
    [0.8, 1.0, 0.9],  # Hard example (class 0)
])
labels = np.array([0, 1, 2, 0])

# Class weights (class 2 is rare, upweight it)
class_weights = np.array([1.0, 1.0, 3.0])

print("Standard CE per-sample losses:")
print([round(-np.log(np.exp(logits[i, labels[i]]) / np.sum(np.exp(logits[i]))), 4)
       for i in range(4)])

print(f"Weighted CE (class 2 = 3x): {weighted_cross_entropy(logits, labels, class_weights):.4f}")
print(f"Focal Loss (gamma=2):       {focal_loss(logits, labels, gamma=2.0):.4f}")
print(f"Label Smoothing (eps=0.1):  {label_smoothing_cross_entropy(logits, labels, epsilon=0.1):.4f}")

# Show focal loss effect on easy vs hard examples
print("=== Focal Loss: Easy vs Hard ===")
print("Easy example (logits=[3, 0, 0], true=0):")
easy_p = np.exp(3) / (np.exp(3) + 2)
print(f"  p_true = {easy_p:.4f}")
print(f"  CE loss = {-np.log(easy_p):.4f}")
print(f"  Focal (γ=2) = {(1 - easy_p)**2 * (-np.log(easy_p)):.4f}")

print("Hard example (logits=[0.5, 0.5, 0.5], true=0):")
hard_p = 1 / 3
print(f"  p_true = {hard_p:.4f}")
print(f"  CE loss = {-np.log(hard_p):.4f}")
print(f"  Focal (γ=2) = {(1 - hard_p)**2 * (-np.log(hard_p)):.4f}")
```

Label Smoothing
Instead of one-hot targets, use 'soft' targets:
$$\tilde{y}_k = \begin{cases} 1 - \epsilon + \frac{\epsilon}{K} & k = y \\ \frac{\epsilon}{K} & k \neq y \end{cases}$$
Benefits: discourages overconfident logits, improves calibration, and often yields a small generalization boost (the technique was introduced with Inception-v3 and is now standard in transformer training).
Loss formula: $$\mathcal{L}_{\text{smooth}} = (1-\epsilon) \cdot (-\log \hat{p}_y) + \epsilon \cdot H(\text{uniform}, \hat{\mathbf{p}})$$
The second term encourages output entropy, preventing collapse to extreme predictions.
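The decomposition above can be verified numerically; this sketch (random logits, arbitrary true class) computes the smoothed loss both directly and via the two-term form:

```python
import numpy as np

K, eps, y = 4, 0.1, 2
rng = np.random.default_rng(1)
z = rng.normal(size=K)

# Stable log-softmax
m = z.max()
log_p = z - m - np.log(np.exp(z - m).sum())

# Direct: cross-entropy against smoothed targets
targets = np.full(K, eps / K)
targets[y] = 1 - eps + eps / K
direct = -np.sum(targets * log_p)

# Decomposition: (1 - eps) * (-log p_y) + eps * H(uniform, p_hat)
h_uniform = -np.mean(log_p)  # cross-entropy of uniform vs p_hat
decomposed = (1 - eps) * (-log_p[y]) + eps * h_uniform

print(direct, decomposed)  # equal
```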
Understanding how gradients flow through softmax + cross-entropy is essential for debugging, implementing custom layers, and theoretical understanding.
Setup
For a single sample with:
Goal: Compute $\frac{\partial \mathcal{L}}{\partial z_k}$ for all $k$.
Step 1: Expand the loss
$$\mathcal{L} = -\log \hat{p}_c = -\log \frac{e^{z_c}}{\sum_j e^{z_j}} = -z_c + \log\left(\sum_j e^{z_j}\right)$$
Step 2: Differentiate w.r.t. $z_k$
$$\frac{\partial \mathcal{L}}{\partial z_k} = -\mathbf{1}_{k=c} + \frac{\partial}{\partial z_k}\left[\log\left(\sum_j e^{z_j}\right)\right]$$
The second term: $$\frac{\partial}{\partial z_k}\left[\log\left(\sum_j e^{z_j}\right)\right] = \frac{e^{z_k}}{\sum_j e^{z_j}} = \hat{p}_k$$
Step 3: Combine
$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - \mathbf{1}_{k=c}$$
Using one-hot notation where $y_k = \mathbf{1}_{k=c}$:
$$\boxed{\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k}$$
Vector form: $$\nabla_{\mathbf{z}} \mathcal{L} = \hat{\mathbf{p}} - \mathbf{y}$$
This is the predicted distribution minus the true distribution.
Properties of this gradient:
Sums to zero: $\sum_k (\hat{p}_k - y_k) = 1 - 1 = 0$
Interpretation: The gradient has no component in the 'all equal' direction—it's purely about rebalancing probability mass.
Bounded: Each component $|\hat{p}_k - y_k| \leq 1$
Interpretable: Positive gradient pushes logit down; negative pushes it up. The true class always gets pushed up (since $\hat{p}_c - 1 < 0$ for $\hat{p}_c < 1$).
Backpropagating to Parameters
For logits $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = \frac{\partial \mathcal{L}}{\partial z_k} \cdot \frac{\partial z_k}{\partial \mathbf{w}_k} = (\hat{p}_k - y_k) \mathbf{x}$$
$$\frac{\partial \mathcal{L}}{\partial b_k} = \hat{p}_k - y_k$$
For a batch of $n$ samples:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = \frac{1}{n}\sum_{i=1}^{n} (\hat{p}_{ik} - y_{ik}) \mathbf{x}_i$$
In matrix form with $X \in \mathbb{R}^{n \times d}$ (rows are samples), $P \in \mathbb{R}^{n \times K}$ (predicted), $Y \in \mathbb{R}^{n \times K}$ (one-hot):
$$\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{n}(P - Y)^T X$$
where $W \in \mathbb{R}^{K \times d}$ and the gradient has the same shape.
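The matrix form can be checked against a per-sample accumulation. A minimal sketch with random data (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 5, 3, 4
X = rng.normal(size=(n, d))          # rows are samples
W = rng.normal(size=(K, d))
b = rng.normal(size=K)
labels = rng.integers(0, K, size=n)

# Forward pass: logits and softmax probabilities
Z = X @ W.T + b
Z -= Z.max(axis=1, keepdims=True)    # stability shift
P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
Y = np.eye(K)[labels]                # one-hot labels

# Matrix-form gradient: (1/n) (P - Y)^T X
grad_W = (P - Y).T @ X / n

# Per-sample accumulation of (p_i - y_i) x_i^T for comparison
grad_loop = np.zeros_like(W)
for i in range(n):
    grad_loop += np.outer(P[i] - Y[i], X[i]) / n

print(np.max(np.abs(grad_W - grad_loop)))  # ~0 (floating-point noise)
```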
The gradient computation is remarkably simple: just subtract the one-hot labels from the softmax probabilities. No complex Jacobian computation needed! This simplicity is why softmax + cross-entropy is so computationally efficient compared to alternatives like softmax + squared error.
We have thoroughly explored the cross-entropy loss—the engine that drives classification learning. Let's consolidate the key insights:
• Cross-entropy is the negative log-likelihood of the softmax model, derived directly from maximum likelihood.
• Minimizing it is equivalent to minimizing the KL divergence between the true and predicted distributions.
• Its gradient with respect to the logits is simply $\hat{\mathbf{p}} - \mathbf{y}$.
• For numerical stability, never compute log(softmax(z)) in two steps; use a fused log-softmax.
What's Next:
With the softmax model and cross-entropy loss established, we now explore an alternative approach to multi-class classification: one-vs-all (OvA) strategies. This decomposition method trains $K$ binary classifiers and offers different tradeoffs compared to multinomial logistic regression.
You now possess a deep understanding of cross-entropy loss—from its probabilistic derivation through its information-theoretic meaning to numerical implementation details. This knowledge is essential for understanding and debugging classification systems at any scale.