The softmax function is one of the most ubiquitous operations in machine learning, transforming a vector of arbitrary real numbers (often called logits) into a valid probability distribution. It is the final activation function in virtually every classification neural network, converting raw network outputs into interpretable class probabilities.
For an input vector x of length n, the softmax function computes:
$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}$$
To train neural networks using gradient descent, we need to compute how the softmax outputs change with respect to the inputs—this is where the Jacobian matrix comes in. The Jacobian is a matrix of all first-order partial derivatives, where each element J[i][j] represents:
$$J_{ij} = \frac{\partial\, \text{softmax}(x)_i}{\partial\, x_j}$$
Mathematical Derivation:
Let s = softmax(x) denote the softmax output vector. The Jacobian elements follow an elegant closed-form expression:
Diagonal elements (when i = j): $$J_{ii} = s_i \cdot (1 - s_i)$$
Off-diagonal elements (when i ≠ j): $$J_{ij} = -s_i \cdot s_j$$
This can be written compactly as: $$J_{ij} = s_i \cdot (\delta_{ij} - s_j)$$
where δᵢⱼ is the Kronecker delta (1 when i = j, 0 otherwise).
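In matrix form, the compact expression becomes J = diag(s) − s sᵀ, which translates directly into a few lines of NumPy. This is only a sketch: the function name `softmax_jacobian` and the max-subtraction stabilization are illustrative choices, not part of the problem statement.

```python
import numpy as np

def softmax_jacobian(x):
    x = np.asarray(x, dtype=float)
    # Softmax is invariant to shifting x, so subtract the max for stability.
    e = np.exp(x - x.max())
    s = e / e.sum()
    # J[i][j] = s_i * (delta_ij - s_j)  ==  diag(s) - outer(s, s)
    return np.diag(s) - np.outer(s, s)
```

Rounding to 4 decimal places, as the task requires, is then a matter of `np.round(softmax_jacobian(x), 4)`.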
Intuitive Understanding:
The Jacobian captures the sensitivity of each probability output to changes in each input logit: diagonal entries are positive (raising a logit raises its own probability), while off-diagonal entries are negative (probability mass is pulled away from the other classes).
Your Task: Write a Python function that computes the Jacobian matrix of the softmax function for a given input vector. Return all values rounded to 4 decimal places for numerical consistency.
Example 1: x = [1.0, 2.0, 3.0]
Expected output: [[0.0819, -0.022, -0.0599], [-0.022, 0.1848, -0.1628], [-0.0599, -0.1628, 0.2227]]

Step 1: Compute the softmax probabilities
First, calculate the exponentials and normalize:
• e¹·⁰ ≈ 2.718, e²·⁰ ≈ 7.389, e³·⁰ ≈ 20.086
• Sum = 2.718 + 7.389 + 20.086 ≈ 30.193
Softmax outputs: s = [0.0900, 0.2447, 0.6652]
Step 2: Compute diagonal elements J[i][i] = sᵢ × (1 - sᵢ)
• J[0][0] = 0.0900 × (1 - 0.0900) = 0.0900 × 0.9100 = 0.0819
• J[1][1] = 0.2447 × (1 - 0.2447) = 0.2447 × 0.7553 = 0.1848
• J[2][2] = 0.6652 × (1 - 0.6652) = 0.6652 × 0.3348 = 0.2227
Step 3: Compute off-diagonal elements J[i][j] = -sᵢ × sⱼ
• J[0][1] = J[1][0] = -0.0900 × 0.2447 = -0.0220
• J[0][2] = J[2][0] = -0.0900 × 0.6652 = -0.0599
• J[1][2] = J[2][1] = -0.2447 × 0.6652 = -0.1628
The resulting Jacobian is symmetric (since J[i][j] = J[j][i] = -sᵢsⱼ) and each row sums to zero.
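Both properties follow from the formula (each row sums to sᵢ(1 − Σⱼ sⱼ) = 0), and can be verified with a standalone NumPy check, shown here for the example above rather than as part of the required solution:

```python
import numpy as np

# Standalone check of the structural claims on the example x = [1, 2, 3].
x = np.array([1.0, 2.0, 3.0])
e = np.exp(x)
s = e / e.sum()                     # softmax probabilities
J = np.diag(s) - np.outer(s, s)    # J[i][j] = s_i * (delta_ij - s_j)

assert np.allclose(J, J.T)              # symmetric
assert np.allclose(J.sum(axis=1), 0.0)  # every row sums to zero
```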
Example 2: x = [0.0, 1.0]
Expected output: [[0.1966, -0.1966], [-0.1966, 0.1966]]

Step 1: Compute softmax probabilities
• e⁰·⁰ = 1.0, e¹·⁰ ≈ 2.718
• Sum = 1.0 + 2.718 ≈ 3.718
Softmax outputs: s = [0.2689, 0.7311]
Step 2: Compute the 2×2 Jacobian matrix
Diagonal elements:
• J[0][0] = 0.2689 × (1 - 0.2689) = 0.2689 × 0.7311 = 0.1966
• J[1][1] = 0.7311 × (1 - 0.7311) = 0.7311 × 0.2689 = 0.1966
Off-diagonal elements:
• J[0][1] = J[1][0] = -0.2689 × 0.7311 = -0.1966
Notice the elegant structure: in the 2-class case, the Jacobian has identical diagonal elements and identical (negative) off-diagonal elements. The absolute values are all equal, reflecting the binary trade-off between two probabilities.
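This binary structure is easy to confirm numerically with a standalone snippet (plain NumPy, independent of any helper function):

```python
import numpy as np

# In the 2-class case, all four Jacobian entries share the magnitude s0 * s1.
x = np.array([0.0, 1.0])
e = np.exp(x)
s = e / e.sum()                     # approximately [0.2689, 0.7311]
J = np.diag(s) - np.outer(s, s)

assert np.allclose(np.abs(J), s[0] * s[1])  # every |entry| equals s0 * s1
```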
Example 3: x = [1.0, -1.0, 2.0, -2.0]
Expected output: [[0.1906, -0.0089, -0.1784, -0.0033], [-0.0089, 0.0335, -0.0241, -0.0004], [-0.1784, -0.0241, 0.2114, -0.0089], [-0.0033, -0.0004, -0.0089, 0.0126]]

Step 1: Compute softmax probabilities
• e¹·⁰ ≈ 2.718, e⁻¹·⁰ ≈ 0.368, e²·⁰ ≈ 7.389, e⁻²·⁰ ≈ 0.135
• Sum ≈ 10.610
Softmax outputs: s ≈ [0.2563, 0.0347, 0.6965, 0.0127]
Step 2: Jacobian structure analysis
The largest probability (s₂ = 0.6965) has:
• The largest diagonal element: J[2][2] = 0.2114 (highest sensitivity)
• The most negative off-diagonal impact on others
The smallest probability (s₃ = 0.0127) has:
• Very small diagonal element: J[3][3] = 0.0126
• Negligible influence on other classes
This 4×4 Jacobian demonstrates how probability mass flows: changes to the dominant class (index 2) propagate strongly throughout, while changes to low-probability classes have minimal effect.
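A standalone check of the dominance pattern for this example; note that sᵢ(1 − sᵢ) peaks at sᵢ = 0.5 and is not monotone in sᵢ, so this ordering is a property of these particular values rather than a universal law:

```python
import numpy as np

# Check that the dominant class (index 2) has the largest diagonal entry
# for x = [1, -1, 2, -2].
x = np.array([1.0, -1.0, 2.0, -2.0])
e = np.exp(x)
s = e / e.sum()
J = np.diag(s) - np.outer(s, s)

assert int(np.argmax(np.diag(J))) == int(np.argmax(s)) == 2
```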
Constraints