The softmax function occupies a unique position in neural networks. Unlike ReLU or Swish, which operate element-wise on individual values, softmax operates on entire vectors, transforming them into probability distributions.
Softmax is ubiquitous in modern deep learning: it converts logits into class probabilities in classification output layers, normalizes attention weights in transformers, defines sampling distributions in language model decoding, and parameterizes policies in reinforcement learning.
Understanding softmax deeply—its mathematical properties, gradient behavior, numerical pitfalls, and variants—is essential for any serious practitioner of deep learning.
By completing this page, you will understand:
- the mathematical definition and probabilistic interpretation of softmax,
- the temperature parameter and its effects on distribution sharpness,
- numerical stability issues and the log-sum-exp trick,
- the Jacobian matrix for backpropagation,
- the connection between softmax and cross-entropy loss, and
- specialized variants like sparsemax and Gumbel-softmax.
Given a vector z = (z₁, z₂, ..., zₖ) of K real-valued logits (unnormalized log-probabilities), the softmax function produces a probability distribution:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
The output is a vector p = (p₁, p₂, ..., pₖ) where each pᵢ represents the probability of class i.
1. Valid Probability Distribution: every output satisfies 0 < pᵢ < 1, and the outputs sum to 1.
2. Monotonic in Logits: if zᵢ > zⱼ then pᵢ > pⱼ, so the ranking of classes is preserved.
3. Translation Invariance: adding a constant c to every logit leaves the output unchanged: softmax(z + c) = softmax(z).
4. Scale Sensitivity: multiplying the logits by a constant does change the output; larger scales concentrate probability on the largest logit.
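Properties 3 and 4 are easy to confirm numerically. The short sketch below (variable names are illustrative) shows that adding a constant leaves the output unchanged while rescaling the logits sharpens it:

```python
import numpy as np

def softmax(z):
    # Stable softmax: subtract the max before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])

# Property 3: adding a constant leaves the output unchanged
shift_diff = np.abs(softmax(z) - softmax(z + 100.0)).max()

# Property 4: rescaling the logits sharpens the distribution
p1, p2 = softmax(z), softmax(2.0 * z)

print(f"max |softmax(z) - softmax(z+100)| = {shift_diff:.2e}")
print(f"top-class probability at scale 1: {p1.max():.4f}, at scale 2: {p2.max():.4f}")
```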
```python
import numpy as np

def softmax_naive(z):
    """
    Naive softmax implementation.
    WARNING: Numerically unstable for large values!
    """
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def softmax_stable(z):
    """
    Numerically stable softmax using max subtraction.
    Exploits translation invariance: softmax(z) = softmax(z - max(z))
    """
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Demonstrate properties
def verify_properties():
    z = np.array([2.0, 1.0, 0.1])
    p = softmax_stable(z)
    print("Softmax Properties Verification:")
    print("-" * 50)
    print(f"Input logits z: {z}")
    print(f"Output probs p: {p}")
    print(f"Sum of probabilities: {p.sum():.10f}")  # Should be 1.0
    print(f"All positive: {(p > 0).all()}")  # Should be True
    # Translation invariance
    z_shifted = z + 100
    p_shifted = softmax_stable(z_shifted)
    print("Translation invariance (z + 100):")
    print(f"Max difference: {np.abs(p - p_shifted).max():.10e}")  # Should be ~0
    # Ordering preserved
    print("Ordering preserved:")
    print(f"z order: {np.argsort(z)[::-1]}")
    print(f"p order: {np.argsort(p)[::-1]}")  # Same order

verify_properties()

# Numerical instability demonstration
def demonstrate_instability():
    """Show why naive softmax fails for large values."""
    z_large = np.array([1000.0, 1000.1, 1000.2])
    print("Numerical Stability Demonstration:")
    print("-" * 50)
    print(f"Input: {z_large}")
    try:
        naive_result = softmax_naive(z_large)
        print(f"Naive result: {naive_result}")  # Will likely be nan
    except Exception:
        print("Naive result: FAILED (overflow)")
    stable_result = softmax_stable(z_large)
    print(f"Stable result: {stable_result}")  # Works correctly

demonstrate_instability()
```

The naive softmax exp(zᵢ)/Σexp(zⱼ) will produce NaN or Inf for logits larger than ~709 (where exp overflows float64). Always subtract the maximum: exp(zᵢ - max(z))/Σexp(zⱼ - max(z)). This is mathematically equivalent due to translation invariance but numerically stable.
The temperature T scales the logits before softmax:
$$\text{softmax}_T(z_i) = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}}$$
Temperature controls the sharpness of the resulting distribution:
T → 0 (Low Temperature / 'Cold'): the distribution sharpens toward a one-hot vector on the largest logit (argmax).
T = 1 (Standard): recovers the ordinary softmax.
T → ∞ (High Temperature / 'Hot'): the distribution flattens toward uniform, with every class approaching probability 1/K.
```python
import numpy as np

def softmax_with_temperature(z, temperature=1.0):
    """
    Temperature-scaled softmax.
    T → 0: sharp (argmax)   T = 1: standard   T → ∞: uniform
    """
    z_scaled = z / temperature
    z_shifted = z_scaled - np.max(z_scaled, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def entropy(p):
    """Compute the entropy of a probability distribution."""
    # Add small epsilon to avoid log(0)
    p_safe = np.clip(p, 1e-10, 1.0)
    return -np.sum(p_safe * np.log(p_safe))

# Demonstrate temperature effects
def temperature_analysis():
    # Logits for a 5-class problem
    z = np.array([2.0, 1.5, 1.0, 0.5, 0.0])
    temperatures = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
    print("Temperature Effects on Softmax:")
    print("=" * 70)
    print(f"Input logits: {z}")
    print("-" * 70)
    print(f"{'T':>6} | {'p[0]':>8} {'p[1]':>8} {'p[2]':>8} {'p[3]':>8} {'p[4]':>8} | {'Entropy':>8}")
    print("-" * 70)
    max_entropy = np.log(len(z))  # Maximum possible entropy
    for T in temperatures:
        p = softmax_with_temperature(z, T)
        H = entropy(p)
        print(f"{T:6.1f} | {p[0]:8.4f} {p[1]:8.4f} {p[2]:8.4f} {p[3]:8.4f} {p[4]:8.4f} | {H:8.4f}")
    print("-" * 70)
    print(f"Max possible entropy: {max_entropy:.4f}")

temperature_analysis()

# Application: Top-p (nucleus) sampling
def top_p_sampling(logits, temperature=1.0, top_p=0.9):
    """
    Top-p (nucleus) sampling: sample from the smallest set of tokens
    whose cumulative probability exceeds p.
    Used in GPT text generation for better diversity.
    """
    probs = softmax_with_temperature(logits, temperature)
    # Sort by probability (descending)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    # Find cutoff where cumulative probability exceeds top_p
    cumulative_probs = np.cumsum(sorted_probs)
    cutoff_index = np.searchsorted(cumulative_probs, top_p) + 1
    # Keep only the top tokens
    kept_indices = sorted_indices[:cutoff_index]
    kept_probs = probs[kept_indices]
    # Renormalize
    kept_probs = kept_probs / kept_probs.sum()
    # Sample
    sampled_idx = np.random.choice(kept_indices, p=kept_probs)
    return sampled_idx, kept_indices, kept_probs

# Demonstrate top-p sampling
def top_p_demo():
    np.random.seed(42)
    # Simulate language model logits over a small vocabulary
    vocab_size = 10
    logits = np.random.randn(vocab_size) * 2  # Some variation
    logits[0] = 5.0  # Make one token much more likely
    print("Top-p Sampling Demonstration:")
    print("-" * 50)
    print(f"Original logits: {logits}")
    print(f"Full probs: {softmax_with_temperature(logits, 1.0)}")
    sampled, kept, new_probs = top_p_sampling(logits, temperature=0.8, top_p=0.9)
    print(f"Kept indices (top_p=0.9): {kept}")
    print(f"Renormalized probs: {new_probs}")
    print(f"Sampled index: {sampled}")

top_p_demo()
```

Training: Always use T=1 (standard softmax). Temperature scaling during training can harm learning by making gradients too sharp or too diffuse.
Inference (text generation): Common to use T=0.7-1.0. Lower temperature produces more focused, repetitive text; higher temperature produces more diverse, sometimes incoherent text.
Knowledge distillation: Use high temperature (T=2-20) to transfer 'dark knowledge' from teacher to student network.
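The distillation effect is easy to see numerically. In this sketch (the teacher logits are made up for illustration), raising T exposes the teacher's relative preferences among the incorrect classes, the 'dark knowledge' a student can learn from:

```python
import numpy as np

def softmax_t(z, T=1.0):
    # Temperature-scaled, numerically stable softmax
    s = z / T
    e = np.exp(s - np.max(s))
    return e / e.sum()

# Hypothetical teacher logits: class 0 is correct, class 1 is a near-miss
teacher_logits = np.array([8.0, 5.0, 1.0, 0.5])

for T in [1.0, 4.0, 10.0]:
    print(f"T={T:4.1f}: {np.round(softmax_t(teacher_logits, T), 4)}")
# At T=1 nearly all mass sits on class 0; at higher T the
# teacher's ranking of the wrong classes becomes visible.
```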
Unlike element-wise activation functions, the gradient of softmax involves all inputs simultaneously. The Jacobian matrix ∂p/∂z captures how each output pᵢ changes with each input zⱼ.
Derivation:
For the softmax output pᵢ = exp(zᵢ)/Σₖexp(zₖ), we compute:
Case 1: i = j (diagonal elements) $$\frac{\partial p_i}{\partial z_i} = p_i(1 - p_i)$$
This is identical to the sigmoid derivative! Not coincidental—softmax for K=2 reduces to sigmoid.
Case 2: i ≠ j (off-diagonal elements) $$\frac{\partial p_i}{\partial z_j} = -p_i \cdot p_j$$
Increasing zⱼ decreases pᵢ for all other classes (competition).
The full Jacobian is a K × K matrix:
$$J_{ij} = \frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$$
where δᵢⱼ is the Kronecker delta (1 if i=j, 0 otherwise).
In matrix form:
$$J = \text{diag}(p) - p p^T$$
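As noted above, softmax with K = 2 reduces to the sigmoid: softmax([z, 0])₀ = eᶻ/(eᶻ + 1) = 1/(1 + e⁻ᶻ) = σ(z). A quick numerical check (a small sketch; function names are illustrative):

```python
import numpy as np

def softmax(v):
    # Stable softmax over a vector
    e = np.exp(v - np.max(v))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for z in [-3.0, -0.5, 0.0, 2.0]:
    p = softmax(np.array([z, 0.0]))[0]  # two-class softmax, second logit fixed at 0
    print(f"z={z:5.1f}: softmax={p:.6f}, sigmoid={sigmoid(z):.6f}")
# The two columns match: binary softmax with one logit pinned to 0 is the sigmoid.
```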
```python
import numpy as np

def softmax(z):
    """Stable softmax."""
    z_shifted = z - np.max(z)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z)

def softmax_jacobian(p):
    """
    Compute the Jacobian matrix of softmax.
    J[i,j] = p[i] * (delta_ij - p[j])
    J = diag(p) - p @ p.T
    """
    return np.diag(p) - np.outer(p, p)

def numerical_jacobian(z, epsilon=1e-7):
    """Compute the Jacobian numerically for verification."""
    K = len(z)
    jacobian = np.zeros((K, K))
    for j in range(K):
        z_plus = z.copy()
        z_plus[j] += epsilon
        z_minus = z.copy()
        z_minus[j] -= epsilon
        p_plus = softmax(z_plus)
        p_minus = softmax(z_minus)
        jacobian[:, j] = (p_plus - p_minus) / (2 * epsilon)
    return jacobian

# Verify the analytical Jacobian
def verify_jacobian():
    z = np.array([2.0, 1.0, 0.5, 0.0])
    p = softmax(z)
    analytical_J = softmax_jacobian(p)
    numerical_J = numerical_jacobian(z)
    print("Softmax Jacobian Verification:")
    print("-" * 50)
    print(f"Input logits: {z}")
    print(f"Probabilities: {p}")
    print(f"Analytical Jacobian:\n{analytical_J}")
    print(f"Numerical Jacobian:\n{numerical_J}")
    print(f"Max difference: {np.abs(analytical_J - numerical_J).max():.2e}")
    # Properties
    print("Jacobian Properties:")
    print(f"  Row sums (should be 0): {analytical_J.sum(axis=1)}")  # Each row sums to 0
    print(f"  Symmetric: {np.allclose(analytical_J, analytical_J.T)}")  # Should be True

verify_jacobian()

def softmax_backward(grad_output, p):
    """
    Backward pass for softmax.
    Given upstream gradient dL/dp, compute dL/dz.
    dL/dz = J^T @ dL/dp
    But there's a more efficient formulation!
    """
    # Method 1: Explicit Jacobian (O(K²) space)
    J = softmax_jacobian(p)
    grad_z_explicit = J.T @ grad_output
    # Method 2: Efficient O(K) formulation
    # dL/dz_i = p_i * (dL/dp_i - sum_j(p_j * dL/dp_j))
    dot_product = np.dot(p, grad_output)
    grad_z_efficient = p * (grad_output - dot_product)
    print("Softmax Backward Comparison:")
    print("-" * 50)
    print(f"Upstream gradient: {grad_output}")
    print(f"Explicit (Jacobian): {grad_z_explicit}")
    print(f"Efficient O(K): {grad_z_efficient}")
    print(f"Match: {np.allclose(grad_z_explicit, grad_z_efficient)}")
    return grad_z_efficient

# Demo
p = softmax(np.array([2.0, 1.0, 0.0]))
grad_output = np.array([1.0, -0.5, 0.0])  # Example upstream gradient
softmax_backward(grad_output, p)
```

The efficient backward pass formula dL/dzᵢ = pᵢ · (dL/dpᵢ - Σⱼ pⱼ · dL/dpⱼ) computes the gradient in O(K) time and space, avoiding the O(K²) Jacobian matrix. This is crucial for large vocabularies in language models where K > 50,000.
Softmax is almost always paired with the cross-entropy loss for classification:
$$\mathcal{L} = -\sum_{i=1}^{K} y_i \log(p_i) = -\log(p_c)$$
where y is the one-hot target (y_c = 1 for the correct class c).
The remarkable mathematical fact is that the gradient of cross-entropy loss with respect to the logits z (not the probabilities p) simplifies dramatically:
$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i$$
This is simply the difference between predictions and targets!
Derivation:
$$\frac{\partial \mathcal{L}}{\partial z_i} = \sum_j \frac{\partial \mathcal{L}}{\partial p_j} \cdot \frac{\partial p_j}{\partial z_i}$$
Since $\mathcal{L} = -\log(p_c)$, we have $\frac{\partial \mathcal{L}}{\partial p_j} = -\frac{y_j}{p_j}$.
Using the Jacobian $\frac{\partial p_j}{\partial z_i} = p_j(\delta_{ij} - p_i)$:
$$\frac{\partial \mathcal{L}}{\partial z_i} = -\sum_j \frac{y_j}{p_j} \cdot p_j(\delta_{ij} - p_i) = -\sum_j y_j(\delta_{ij} - p_i) = -y_i + p_i \sum_j y_j = p_i - y_i$$
```python
import numpy as np

def softmax(z):
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def cross_entropy_loss(p, y):
    """
    Cross-entropy loss.
    p: predicted probabilities (after softmax)
    y: one-hot targets
    """
    # Clip for numerical stability
    p_safe = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.sum(y * np.log(p_safe), axis=-1)

def softmax_cross_entropy_forward(z, y):
    """
    Combined forward pass: more numerically stable.
    Uses the log-sum-exp trick.
    """
    # log(softmax(z)) = z - log(sum(exp(z)))
    log_sum_exp = np.log(np.sum(np.exp(z - np.max(z)), axis=-1)) + np.max(z)
    log_probs = z - log_sum_exp
    return -np.sum(y * log_probs, axis=-1)

def softmax_cross_entropy_backward(z, y):
    """
    Gradient of cross-entropy w.r.t. logits.
    The beautiful result: grad = p - y
    """
    p = softmax(z)
    return p - y

# Verify the gradient
def verify_combined_gradient():
    # Logits and a one-hot target
    z = np.array([2.0, 1.0, 0.5, -0.5])
    y = np.array([0.0, 1.0, 0.0, 0.0])  # Class 1 is correct
    # Analytical gradient
    analytical_grad = softmax_cross_entropy_backward(z, y)
    # Numerical gradient
    epsilon = 1e-7
    numerical_grad = np.zeros_like(z)
    for i in range(len(z)):
        z_plus = z.copy()
        z_plus[i] += epsilon
        z_minus = z.copy()
        z_minus[i] -= epsilon
        loss_plus = softmax_cross_entropy_forward(z_plus, y)
        loss_minus = softmax_cross_entropy_forward(z_minus, y)
        numerical_grad[i] = (loss_plus - loss_minus) / (2 * epsilon)
    print("Softmax + Cross-Entropy Gradient Verification:")
    print("-" * 50)
    print(f"Logits z: {z}")
    print(f"Target y: {y}")
    print(f"Probs p: {softmax(z)}")
    print(f"Analytical gradient (p - y): {analytical_grad}")
    print(f"Numerical gradient: {numerical_grad}")
    print(f"Match: {np.allclose(analytical_grad, numerical_grad)}")
    # Show the beauty: the gradient is just p - y!
    p = softmax(z)
    print(f"Verify p - y = {p - y}")

verify_combined_gradient()

def log_softmax_stable(z):
    """
    Log-softmax: more stable than log(softmax(z)) for large logits.
    log_softmax(z) = z - log(sum(exp(z)))
    """
    max_z = np.max(z, axis=-1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(z - max_z), axis=-1, keepdims=True)) + max_z
    return z - log_sum_exp

# Numerical stability comparison
def stability_comparison():
    """Show that log-softmax is more stable than log(softmax)."""
    z = np.array([1000.0, 1000.1, 999.9])
    print("Numerical Stability: log-softmax vs log(softmax):")
    print("-" * 50)
    # Direct log(softmax): two passes, can lose precision
    p = softmax(z)
    print(f"log(softmax(z)): {np.log(p)}")
    # Log-softmax (stable, single pass)
    print(f"log_softmax(z): {log_softmax_stable(z)}")

stability_comparison()
```

The simple gradient p - y is the foundation of efficient classification training. It is also intuitive: if the model predicts p = [0.9, 0.1] but the target is y = [0, 1], the gradient is [0.9, -0.9], so gradient descent pushes logit₀ down and logit₁ up. The size of the push is proportional to how wrong the prediction is.
The scaled dot-product attention mechanism uses softmax as its core normalization:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Why softmax here? It converts raw similarity scores into non-negative weights that sum to 1, so each output is a convex combination of the value vectors, and it is fully differentiable, allowing gradients to flow through the attention weights.
The scaling factor 1/√dₖ:
Without scaling, dot products QKᵀ grow with dimension dₖ, pushing softmax into saturation. With dₖ = 512, unscaled dot products can easily exceed 20-30, causing softmax to produce near-one-hot distributions with vanishing gradients.
Scaling keeps dot products moderate, maintaining a balanced distribution.
```python
import numpy as np

def softmax(z, axis=-1):
    """Stable softmax along the specified axis."""
    z_shifted = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention.
    Q: [batch, seq_len, d_k] - Queries
    K: [batch, seq_len, d_k] - Keys
    V: [batch, seq_len, d_v] - Values
    mask: Optional attention mask
    """
    d_k = Q.shape[-1]
    # Compute attention scores
    scores = Q @ K.transpose(0, 2, 1)  # [batch, seq_len, seq_len]
    # Scale by sqrt(d_k)
    scores_scaled = scores / np.sqrt(d_k)
    # Apply mask (for causal attention or padding)
    if mask is not None:
        scores_scaled = np.where(mask, scores_scaled, -1e9)
    # Softmax over the keys dimension
    attention_weights = softmax(scores_scaled, axis=-1)
    # Weighted sum of values
    output = attention_weights @ V
    return output, attention_weights

# Demonstrate attention
def attention_demo():
    np.random.seed(42)
    batch_size, seq_len, d_k, d_v = 1, 5, 64, 64
    # Random Q, K, V
    Q = np.random.randn(batch_size, seq_len, d_k) * 0.1
    K = np.random.randn(batch_size, seq_len, d_k) * 0.1
    V = np.random.randn(batch_size, seq_len, d_v) * 0.1
    output, attn_weights = scaled_dot_product_attention(Q, K, V)
    print("Scaled Dot-Product Attention:")
    print("-" * 50)
    print(f"Q, K, V shape: {Q.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Attention weights shape: {attn_weights.shape}")
    print("Attention weights (softmax probabilities):")
    print(f"{attn_weights[0]}")
    print(f"Row sums (should be 1.0): {attn_weights[0].sum(axis=-1)}")

attention_demo()

# Causal mask for autoregressive models
def causal_mask_demo():
    """Demonstrate causal (autoregressive) masking."""
    seq_len = 5
    # Causal mask: position i can only attend to positions <= i
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    print("Causal Attention Mask:")
    print(causal_mask.astype(int))
    # With random scores
    np.random.seed(42)
    scores = np.random.randn(seq_len, seq_len)
    # Apply mask
    scores_masked = np.where(causal_mask, scores, -1e9)
    # Softmax
    attn_weights = softmax(scores_masked, axis=-1)
    print("Resulting attention (causal):")
    print(attn_weights.round(3))
    print("Note: the upper triangle is effectively zero (masked out)")

causal_mask_demo()

# Why scaling matters
def scaling_importance():
    """Show why the sqrt(d_k) scaling is necessary."""
    np.random.seed(42)
    print("Why Scaling Matters:")
    print("-" * 50)
    for d_k in [8, 64, 512, 2048]:
        # Random vectors for q, k
        q = np.random.randn(d_k)
        k = np.random.randn(d_k)
        # E[q·k] = 0, Var[q·k] = d_k (for independent standard normals)
        dot_product = np.dot(q, k)
        scaled_dot = dot_product / np.sqrt(d_k)
        print(f"d_k={d_k:4d}: dot product = {dot_product:7.2f}, scaled = {scaled_dot:6.2f}")

scaling_importance()
```

Flash Attention and similar algorithms fuse the attention computation (including softmax) into optimized GPU kernels. They compute attention in tiles, tracking softmax normalization incrementally to avoid materializing the full N×N attention matrix. This requires careful handling of the running sum for softmax denominators.
Sparsemax produces sparse probability distributions—most entries are exactly zero:
$$\text{sparsemax}(z) = \arg\min_{p} \|p - z\|^2 \quad \text{subject to } p \in \Delta^{K-1}$$
where Δᴷ⁻¹ is the probability simplex.
Properties: sparsemax can assign exactly zero probability to low-scoring classes, acts as the identity on the interior of the simplex, and is differentiable almost everywhere (with well-defined subgradients at the boundary), which makes it useful for interpretable attention.
Gumbel-softmax (or concrete distribution) enables differentiable sampling from categorical distributions:
$$y_i = \frac{\exp((z_i + g_i)/\tau)}{\sum_j \exp((z_j + g_j)/\tau)}$$
where gᵢ ~ Gumbel(0, 1) are i.i.d. Gumbel noise samples.
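The reason adding Gumbel noise works is the Gumbel-max trick: argmaxᵢ(zᵢ + gᵢ) is an exact sample from Categorical(softmax(z)); Gumbel-softmax then replaces the non-differentiable argmax with a temperature-scaled softmax. The Monte Carlo sketch below (sample count chosen arbitrarily) checks that the empirical argmax frequencies match the softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([2.0, 1.0, 0.0])

# Target categorical distribution
e = np.exp(z - z.max())
p = e / e.sum()

# Gumbel-max trick: perturb logits with Gumbel(0, 1) noise and take argmax
n = 200_000
u = rng.uniform(size=(n, 3))
g = -np.log(-np.log(np.clip(u, 1e-12, 1 - 1e-12)))  # inverse-CDF Gumbel samples
counts = np.bincount(np.argmax(z + g, axis=1), minlength=3)
freq = counts / n

print(f"softmax probs:        {np.round(p, 4)}")
print(f"empirical argmax frq: {np.round(freq, 4)}")
# The empirical frequencies converge to softmax(z) as n grows.
```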
Use cases: training discrete latent variable models (e.g., categorical VAEs), neural architecture search, and any setting that requires backpropagating through a sampling step.
```python
import numpy as np

def softmax_with_temperature(z, temperature):
    z_scaled = z / temperature
    z_shifted = z_scaled - np.max(z_scaled)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z)

def sparsemax(z):
    """
    Sparsemax: Euclidean projection onto the probability simplex.
    Returns a sparse probability distribution.
    """
    z_sorted = np.sort(z)[::-1]  # Sort descending
    K = len(z)
    # Find the threshold
    cumsum = np.cumsum(z_sorted)
    k_range = np.arange(1, K + 1)
    threshold_candidates = (cumsum - 1) / k_range
    # Largest k where the sorted logit exceeds its threshold candidate
    support = z_sorted > threshold_candidates
    k_star = np.sum(support)
    threshold = threshold_candidates[k_star - 1]
    # Project
    return np.maximum(z - threshold, 0)

def gumbel_softmax(logits, temperature=1.0):
    """
    Gumbel-Softmax (Concrete) distribution.
    Differentiable approximation to categorical sampling.
    """
    # Sample Gumbel(0, 1) noise via the inverse CDF
    U = np.random.uniform(0, 1, logits.shape)
    gumbel_noise = -np.log(-np.log(U + 1e-10) + 1e-10)
    # Add noise to logits and apply temperature-scaled softmax
    return softmax_with_temperature(logits + gumbel_noise, temperature)

def gumbel_softmax_straight_through(logits, temperature=1.0):
    """
    Straight-through Gumbel-Softmax.
    Forward: argmax (discrete). Backward: softmax gradient (continuous).
    """
    soft = gumbel_softmax(logits, temperature)
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0
    # A real implementation uses stop-gradient tricks to pass the
    # soft gradient through the hard forward value.
    return hard  # For inference, use hard

# Demonstrate sparsemax
def sparsemax_demo():
    z = np.array([2.0, 1.5, 1.0, 0.5, 0.0])
    soft = softmax_with_temperature(z, 1.0)
    sparse = sparsemax(z)
    print("Sparsemax vs Softmax:")
    print("-" * 50)
    print(f"Logits:   {z}")
    print(f"Softmax:  {soft} (all non-zero)")
    print(f"Sparsemax:{sparse} (sparse!)")
    print(f"Non-zero entries: Softmax={np.sum(soft > 0)}, Sparsemax={np.sum(sparse > 0)}")

sparsemax_demo()

# Demonstrate Gumbel-Softmax
def gumbel_softmax_demo():
    np.random.seed(42)
    logits = np.array([2.0, 1.0, 0.0])
    print("Gumbel-Softmax Sampling:")
    print("-" * 50)
    print(f"Logits: {logits}")
    print(f"Softmax (no noise): {softmax_with_temperature(logits, 1.0)}")
    print("Samples at different temperatures:")
    for temp in [0.1, 0.5, 1.0, 2.0]:
        samples = [gumbel_softmax(logits, temp) for _ in range(4)]
        print(f"T={temp}: ", end="")
        for s in samples:
            print(f"[{s[0]:.2f},{s[1]:.2f},{s[2]:.2f}] ", end="")
        print()
    print("Note: Lower T → sharper (closer to one-hot)")
    print("      Higher T → softer (closer to softmax)")

gumbel_softmax_demo()
```

| Variant | Outputs | Use Case | Differentiable |
|---|---|---|---|
| Softmax | Dense probabilities | Standard classification, attention | Yes |
| Temperature softmax | Sharp/flat probabilities | Knowledge distillation, sampling | Yes |
| Sparsemax | Sparse probabilities | Interpretable attention, hard selection | Yes (subgradient) |
| Gumbel-softmax | Noisy soft probabilities | Discrete latent variables, NAS | Yes (approximate) |
| Straight-through | Hard one-hot (forward) | Discrete actions with gradient | Approximate |
We have comprehensively analyzed the softmax function—the essential operation that transforms logits into probability distributions for classification and attention.
Looking Ahead:
With sigmoid, tanh, ReLU variants, Swish, GELU, and softmax fully understood, the next page provides practical selection guidelines—helping you choose the right activation function for any architecture, task, and deployment context.
You now have complete mastery of the softmax function. You understand its mathematical properties, temperature effects, numerical stability requirements, Jacobian structure, and role in both classification and attention. This knowledge is essential for working with any modern neural network.