One of the most critical computations in training neural network classifiers is calculating the gradient of the cross-entropy loss with respect to the pre-activation outputs (logits). This gradient formula is remarkably elegant and computationally efficient, making it the foundation of backpropagation in classification networks.
In multi-class classification, neural networks typically output raw scores called logits—unnormalized values for each class. These logits are transformed into a valid probability distribution using the softmax function:
$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
where $z$ is the vector of logits and $K$ is the number of classes.
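As a sketch, the softmax above can be computed in plain Python. Subtracting the maximum logit before exponentiating is the standard stability trick (an implementation detail not visible in the formula):

```python
import math

def softmax(z):
    """Numerically stable softmax: shift by max(z) before exponentiating."""
    m = max(z)                               # avoids overflow in exp for large logits
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])  # → [0.09, 0.2447, 0.6652]
```

The shift by `max(z)` leaves the result unchanged, since it cancels in the ratio.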
The cross-entropy loss then measures the discrepancy between the predicted probability distribution p and the true target distribution (typically a one-hot encoded vector y):
$$L = -\sum_{i=1}^{K} y_i \log(p_i)$$
For a single correct class c (one-hot encoding), this simplifies to:
$$L = -\log(p_c)$$
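A small illustration of this simplification, using hypothetical softmax probabilities for three classes: with a one-hot target, only the correct-class probability contributes to the loss.

```python
import math

probs = [0.09, 0.2447, 0.6652]  # hypothetical softmax output for 3 classes
target = 2                      # index of the true class c

# The full sum -sum(y_i * log(p_i)) reduces to -log(p_c), since y_i = 0 for i != c
loss = -math.log(probs[target])
print(round(loss, 4))
```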
The elegant result from calculus is that the gradient of the combined softmax cross-entropy loss with respect to the logits has an extraordinarily simple closed-form solution:
$$\frac{\partial L}{\partial z_i} = p_i - y_i$$
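The result follows in two lines from the chain rule. Writing the loss for correct class $c$ directly in terms of the logits:

$$L = -\log(p_c) = \log\left(\sum_{j=1}^{K} e^{z_j}\right) - z_c$$

Differentiating with respect to $z_i$:

$$\frac{\partial L}{\partial z_i} = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} - \delta_{ic} = p_i - y_i$$

where $\delta_{ic}$ is 1 if $i = c$ and 0 otherwise, which is exactly the one-hot entry $y_i$.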
This means the gradient for each logit is simply the predicted probability minus the target: for the correct class it is $p_c - 1$ (negative, pushing that logit up), and for every other class it is just $p_i$ (positive, pushing that logit down). This simplicity is why softmax cross-entropy is the preferred loss function for classification—the gradient computation is both numerically stable and computationally efficient.
Write a Python function that computes the gradient vector of the softmax cross-entropy loss with respect to the input logits. Given a vector of logits and a target class index, return the gradient for each logit position, rounded to 4 decimal places.
Example 1
Input: logits = [1.0, 2.0, 3.0], target = 0
Output: [-0.91, 0.2447, 0.6652]
Step 1: Compute Softmax Probabilities
Apply softmax to convert logits to probabilities:
• Compute exponentials: e^1.0 ≈ 2.718, e^2.0 ≈ 7.389, e^3.0 ≈ 20.086
• Sum of exponentials: 2.718 + 7.389 + 20.086 ≈ 30.193
• Probabilities: p = [2.718/30.193, 7.389/30.193, 20.086/30.193] ≈ [0.09, 0.2447, 0.6652]
Step 2: Construct One-Hot Target Vector
Since target = 0, the one-hot vector is: y = [1, 0, 0]
Step 3: Compute Gradient (p − y)
• Gradient[0] = 0.09 − 1 = −0.91 (negative gradient pushes logit up)
• Gradient[1] = 0.2447 − 0 = 0.2447 (positive gradient pushes logit down)
• Gradient[2] = 0.6652 − 0 = 0.6652 (positive gradient pushes logit down)
The negative gradient for the target class indicates we should increase that logit to reduce loss, while positive gradients for non-target classes indicate those logits should decrease.
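The arithmetic above can be checked in a couple of lines, using the rounded probabilities from Step 1:

```python
probs = [0.09, 0.2447, 0.6652]  # softmax output from Step 1 (rounded)
y = [1, 0, 0]                   # one-hot vector for target = 0

grad = [round(p - t, 4) for p, t in zip(probs, y)]
print(grad)  # → [-0.91, 0.2447, 0.6652]
```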
Example 2
Input: logits = [0.0, 0.0], target = 1
Output: [0.5, -0.5]
Step 1: Compute Softmax Probabilities
With equal logits, softmax produces uniform probabilities:
• Exponentials: e^0.0 = 1, e^0.0 = 1
• Sum: 1 + 1 = 2
• Probabilities: p = [0.5, 0.5]
Step 2: Construct One-Hot Target Vector
Since target = 1, the one-hot vector is: y = [0, 1]
Step 3: Compute Gradient (p − y)
• Gradient[0] = 0.5 − 0 = 0.5 (reduce this logit)
• Gradient[1] = 0.5 − 1 = −0.5 (increase this logit)
The symmetric result shows that with equal starting logits and 50/50 predictions, the network needs to equally push one class up and the other down.
Example 3
Input: logits = [1.0, 2.0, 3.0, 4.0], target = 2
Output: [0.0321, 0.0871, -0.7631, 0.6439]
Step 1: Compute Softmax Probabilities
Apply softmax to the 4-class logit vector:
• Exponentials: e^1 ≈ 2.718, e^2 ≈ 7.389, e^3 ≈ 20.086, e^4 ≈ 54.598
• Sum ≈ 84.791
• Probabilities: p ≈ [0.0321, 0.0871, 0.2369, 0.6439]
Step 2: Construct One-Hot Target Vector
Since target = 2 (third class, 0-indexed), the one-hot vector is: y = [0, 0, 1, 0]
Step 3: Compute Gradient (p − y)
• Gradient[0] = 0.0321 − 0 = 0.0321
• Gradient[1] = 0.0871 − 0 = 0.0871
• Gradient[2] = 0.2369 − 1 = −0.7631 (large negative gradient)
• Gradient[3] = 0.6439 − 0 = 0.6439
Note: the class at index 2 is the target, but the model currently assigns it only ~24% probability, while the class at index 3 gets ~64%. The large negative gradient at index 2 will strongly push that logit upward, while the substantial positive gradient at index 3 will push that logit downward.
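Putting the three steps together, one minimal sketch of the requested function (the name `softmax_ce_grad` is a placeholder, not prescribed by the problem) could look like:

```python
import math

def softmax_ce_grad(logits, target):
    """Gradient of softmax cross-entropy w.r.t. the logits: p - y."""
    # Step 1: numerically stable softmax (shift by the max logit)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Steps 2-3: subtract the one-hot target and round to 4 decimals
    return [round(p - (1 if i == target else 0), 4)
            for i, p in enumerate(probs)]

print(softmax_ce_grad([0.0, 0.0], 1))            # → [0.5, -0.5]
print(softmax_ce_grad([1.0, 2.0, 3.0, 4.0], 2))  # → [0.0321, 0.0871, -0.7631, 0.6439]
```

The one-hot vector is never materialized: subtracting 1 at the target index has the same effect.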
Constraints