The softmax function occupies a unique position in neural networks. Unlike ReLU or Swish, which operate element-wise on individual values, softmax operates on entire vectors, transforming them into probability distributions.
Softmax is ubiquitous in modern deep learning: it converts logits into class probabilities in classification output layers, normalizes attention weights in transformers, defines sampling distributions in language model decoding, and parameterizes policies in reinforcement learning.
Understanding softmax deeply—its mathematical properties, gradient behavior, numerical pitfalls, and variants—is essential for any serious practitioner of deep learning.
By completing this page, you will understand:
- the mathematical definition and probabilistic interpretation of softmax,
- the temperature parameter and its effects on distribution sharpness,
- numerical stability issues and the log-sum-exp trick,
- the Jacobian matrix for backpropagation,
- the connection between softmax and cross-entropy loss, and
- specialized variants like sparsemax and Gumbel-softmax.
Given a vector z = (z₁, z₂, ..., zₖ) of K real-valued logits (unnormalized log-probabilities), the softmax function produces a probability distribution:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
The output is a vector p = (p₁, p₂, ..., pₖ) where each pᵢ represents the probability of class i.
1. Valid Probability Distribution: every output satisfies 0 < pᵢ < 1, and the outputs sum to 1.
2. Monotonic in Logits: if zᵢ > zⱼ then pᵢ > pⱼ, so the ranking of classes is preserved.
3. Translation Invariance: adding a constant c to every logit leaves the output unchanged: softmax(z + c) = softmax(z).
4. Scale Sensitivity: multiplying the logits by a constant does change the output; larger scales concentrate probability on the largest logit.
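Properties 3 and 4 are easy to confirm numerically. The short sketch below (variable names are illustrative) shows that adding a constant leaves the output unchanged while rescaling the logits sharpens it:

```python
import numpy as np

def softmax(z):
    # Stable softmax: subtract the max before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])

# Property 3: adding a constant leaves the output unchanged
shift_diff = np.abs(softmax(z) - softmax(z + 100.0)).max()

# Property 4: rescaling the logits sharpens the distribution
p1, p2 = softmax(z), softmax(2.0 * z)

print(f"max |softmax(z) - softmax(z+100)| = {shift_diff:.2e}")
print(f"top-class probability at scale 1: {p1.max():.4f}, at scale 2: {p2.max():.4f}")
```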
```python
import numpy as np

def softmax_naive(z):
    """
    Naive softmax implementation.
    WARNING: Numerically unstable for large values!
    """
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def softmax_stable(z):
    """
    Numerically stable softmax using max subtraction.
    Exploits translation invariance: softmax(z) = softmax(z - max(z))
    """
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Demonstrate properties
def verify_properties():
    z = np.array([2.0, 1.0, 0.1])
    p = softmax_stable(z)
    print("Softmax Properties Verification:")
    print("-" * 50)
    print(f"Input logits z: {z}")
    print(f"Output probs p: {p}")
    print(f"Sum of probabilities: {p.sum():.10f}")  # Should be 1.0
    print(f"All positive: {(p > 0).all()}")  # Should be True
    # Translation invariance
    z_shifted = z + 100
    p_shifted = softmax_stable(z_shifted)
    print("Translation invariance (z + 100):")
    print(f"Max difference: {np.abs(p - p_shifted).max():.10e}")  # Should be ~0
    # Ordering preserved
    print("Ordering preserved:")
    print(f"z order: {np.argsort(z)[::-1]}")
    print(f"p order: {np.argsort(p)[::-1]}")  # Same order

verify_properties()

# Numerical instability demonstration
def demonstrate_instability():
    """Show why naive softmax fails for large values."""
    z_large = np.array([1000.0, 1000.1, 1000.2])
    print("Numerical Stability Demonstration:")
    print("-" * 50)
    print(f"Input: {z_large}")
    try:
        naive_result = softmax_naive(z_large)
        print(f"Naive result: {naive_result}")  # Will likely be nan
    except Exception:
        print("Naive result: FAILED (overflow)")
    stable_result = softmax_stable(z_large)
    print(f"Stable result: {stable_result}")  # Works correctly

demonstrate_instability()
```

The naive softmax exp(zᵢ)/Σexp(zⱼ) will produce NaN or Inf for logits larger than ~709 (where exp overflows float64). Always subtract the maximum: exp(zᵢ - max(z))/Σexp(zⱼ - max(z)). This is mathematically equivalent due to translation invariance but numerically stable.
The temperature T scales the logits before softmax:
$$\text{softmax}_T(z_i) = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}}$$
Temperature controls the sharpness of the resulting distribution:
T → 0 (Low Temperature / 'Cold'): the distribution sharpens toward a one-hot vector on the largest logit (argmax).
T = 1 (Standard): recovers the ordinary softmax.
T → ∞ (High Temperature / 'Hot'): the distribution flattens toward uniform, with every class approaching probability 1/K.
```python
import numpy as np

def softmax_with_temperature(z, temperature=1.0):
    """
    Temperature-scaled softmax.
    T → 0: sharp (argmax)   T = 1: standard   T → ∞: uniform
    """
    z_scaled = z / temperature
    z_shifted = z_scaled - np.max(z_scaled, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def entropy(p):
    """Compute the entropy of a probability distribution."""
    # Add small epsilon to avoid log(0)
    p_safe = np.clip(p, 1e-10, 1.0)
    return -np.sum(p_safe * np.log(p_safe))

# Demonstrate temperature effects
def temperature_analysis():
    # Logits for a 5-class problem
    z = np.array([2.0, 1.5, 1.0, 0.5, 0.0])
    temperatures = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
    print("Temperature Effects on Softmax:")
    print("=" * 70)
    print(f"Input logits: {z}")
    print("-" * 70)
    print(f"{'T':>6} | {'p[0]':>8} {'p[1]':>8} {'p[2]':>8} {'p[3]':>8} {'p[4]':>8} | {'Entropy':>8}")
    print("-" * 70)
    max_entropy = np.log(len(z))  # Maximum possible entropy
    for T in temperatures:
        p = softmax_with_temperature(z, T)
        H = entropy(p)
        print(f"{T:6.1f} | {p[0]:8.4f} {p[1]:8.4f} {p[2]:8.4f} {p[3]:8.4f} {p[4]:8.4f} | {H:8.4f}")
    print("-" * 70)
    print(f"Max possible entropy: {max_entropy:.4f}")

temperature_analysis()

# Application: Top-p (nucleus) sampling
def top_p_sampling(logits, temperature=1.0, top_p=0.9):
    """
    Top-p (nucleus) sampling: sample from the smallest set of tokens
    whose cumulative probability exceeds p.
    Used in GPT text generation for better diversity.
    """
    probs = softmax_with_temperature(logits, temperature)
    # Sort by probability (descending)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    # Find cutoff where cumulative probability exceeds top_p
    cumulative_probs = np.cumsum(sorted_probs)
    cutoff_index = np.searchsorted(cumulative_probs, top_p) + 1
    # Keep only the top tokens
    kept_indices = sorted_indices[:cutoff_index]
    kept_probs = probs[kept_indices]
    # Renormalize
    kept_probs = kept_probs / kept_probs.sum()
    # Sample
    sampled_idx = np.random.choice(kept_indices, p=kept_probs)
    return sampled_idx, kept_indices, kept_probs

# Demonstrate top-p sampling
def top_p_demo():
    np.random.seed(42)
    # Simulate language model logits over a small vocabulary
    vocab_size = 10
    logits = np.random.randn(vocab_size) * 2  # Some variation
    logits[0] = 5.0  # Make one token much more likely
    print("Top-p Sampling Demonstration:")
    print("-" * 50)
    print(f"Original logits: {logits}")
    print(f"Full probs: {softmax_with_temperature(logits, 1.0)}")
    sampled, kept, new_probs = top_p_sampling(logits, temperature=0.8, top_p=0.9)
    print(f"Kept indices (top_p=0.9): {kept}")
    print(f"Renormalized probs: {new_probs}")
    print(f"Sampled index: {sampled}")

top_p_demo()
```

Training: Always use T=1 (standard softmax). Temperature scaling during training can harm learning by making gradients too sharp or too diffuse.
Inference (text generation): Common to use T=0.7-1.0. Lower temperature produces more focused, repetitive text; higher temperature produces more diverse, sometimes incoherent text.
Knowledge distillation: Use high temperature (T=2-20) to transfer 'dark knowledge' from teacher to student network.
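The distillation effect is easy to see numerically. In this sketch (the teacher logits are made up for illustration), raising T exposes the teacher's relative preferences among the incorrect classes, the 'dark knowledge' a student can learn from:

```python
import numpy as np

def softmax_t(z, T=1.0):
    # Temperature-scaled, numerically stable softmax
    s = z / T
    e = np.exp(s - np.max(s))
    return e / e.sum()

# Hypothetical teacher logits: class 0 is correct, class 1 is a near-miss
teacher_logits = np.array([8.0, 5.0, 1.0, 0.5])

for T in [1.0, 4.0, 10.0]:
    print(f"T={T:4.1f}: {np.round(softmax_t(teacher_logits, T), 4)}")
# At T=1 nearly all mass sits on class 0; at higher T the
# teacher's ranking of the wrong classes becomes visible.
```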
Unlike element-wise activation functions, the gradient of softmax involves all inputs simultaneously. The Jacobian matrix ∂p/∂z captures how each output pᵢ changes with each input zⱼ.
Derivation:
For the softmax output pᵢ = exp(zᵢ)/Σₖexp(zₖ), we compute:
Case 1: i = j (diagonal elements) $$\frac{\partial p_i}{\partial z_i} = p_i(1 - p_i)$$
This is identical to the sigmoid derivative! Not coincidental—softmax for K=2 reduces to sigmoid.
Case 2: i ≠ j (off-diagonal elements) $$\frac{\partial p_i}{\partial z_j} = -p_i \cdot p_j$$
Increasing zⱼ decreases pᵢ for all other classes (competition).
The full Jacobian is a K × K matrix:
$$J_{ij} = \frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$$
where δᵢⱼ is the Kronecker delta (1 if i=j, 0 otherwise).
In matrix form:
$$J = \text{diag}(p) - p p^T$$
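As noted above, softmax with K = 2 reduces to the sigmoid: softmax([z, 0])₀ = eᶻ/(eᶻ + 1) = 1/(1 + e⁻ᶻ) = σ(z). A quick numerical check (a small sketch; function names are illustrative):

```python
import numpy as np

def softmax(v):
    # Stable softmax over a vector
    e = np.exp(v - np.max(v))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for z in [-3.0, -0.5, 0.0, 2.0]:
    p = softmax(np.array([z, 0.0]))[0]  # two-class softmax, second logit fixed at 0
    print(f"z={z:5.1f}: softmax={p:.6f}, sigmoid={sigmoid(z):.6f}")
# The two columns match: binary softmax with one logit pinned to 0 is the sigmoid.
```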
```python
import numpy as np

def softmax(z):
    """Stable softmax."""
    z_shifted = z - np.max(z)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z)

def softmax_jacobian(p):
    """
    Compute the Jacobian matrix of softmax.
    J[i,j] = p[i] * (delta_ij - p[j])
    J = diag(p) - p @ p.T
    """
    return np.diag(p) - np.outer(p, p)

def numerical_jacobian(z, epsilon=1e-7):
    """Compute the Jacobian numerically for verification."""
    K = len(z)
    jacobian = np.zeros((K, K))
    for j in range(K):
        z_plus = z.copy()
        z_plus[j] += epsilon
        z_minus = z.copy()
        z_minus[j] -= epsilon
        p_plus = softmax(z_plus)
        p_minus = softmax(z_minus)
        jacobian[:, j] = (p_plus - p_minus) / (2 * epsilon)
    return jacobian

# Verify the analytical Jacobian
def verify_jacobian():
    z = np.array([2.0, 1.0, 0.5, 0.0])
    p = softmax(z)
    analytical_J = softmax_jacobian(p)
    numerical_J = numerical_jacobian(z)
    print("Softmax Jacobian Verification:")
    print("-" * 50)
    print(f"Input logits: {z}")
    print(f"Probabilities: {p}")
    print(f"Analytical Jacobian:\n{analytical_J}")
    print(f"Numerical Jacobian:\n{numerical_J}")
    print(f"Max difference: {np.abs(analytical_J - numerical_J).max():.2e}")
    # Properties
    print("Jacobian Properties:")
    print(f"  Row sums (should be 0): {analytical_J.sum(axis=1)}")  # Each row sums to 0
    print(f"  Symmetric: {np.allclose(analytical_J, analytical_J.T)}")  # Should be True

verify_jacobian()

def softmax_backward(grad_output, p):
    """
    Backward pass for softmax.
    Given upstream gradient dL/dp, compute dL/dz.
    dL/dz = J^T @ dL/dp
    But there's a more efficient formulation!
    """
    # Method 1: Explicit Jacobian (O(K²) space)
    J = softmax_jacobian(p)
    grad_z_explicit = J.T @ grad_output
    # Method 2: Efficient O(K) formulation
    # dL/dz_i = p_i * (dL/dp_i - sum_j(p_j * dL/dp_j))
    dot_product = np.dot(p, grad_output)
    grad_z_efficient = p * (grad_output - dot_product)
    print("Softmax Backward Comparison:")
    print("-" * 50)
    print(f"Upstream gradient: {grad_output}")
    print(f"Explicit (Jacobian): {grad_z_explicit}")
    print(f"Efficient O(K): {grad_z_efficient}")
    print(f"Match: {np.allclose(grad_z_explicit, grad_z_efficient)}")
    return grad_z_efficient

# Demo
p = softmax(np.array([2.0, 1.0, 0.0]))
grad_output = np.array([1.0, -0.5, 0.0])  # Example upstream gradient
softmax_backward(grad_output, p)
```

The efficient backward pass formula dL/dzᵢ = pᵢ · (dL/dpᵢ - Σⱼ pⱼ · dL/dpⱼ) computes the gradient in O(K) time and space, avoiding the O(K²) Jacobian matrix. This is crucial for large vocabularies in language models where K > 50,000.
Softmax is almost always paired with the cross-entropy loss for classification:
$$\mathcal{L} = -\sum_{i=1}^{K} y_i \log(p_i) = -\log(p_c)$$
where y is the one-hot target (y_c = 1 for the correct class c).
The remarkable mathematical fact is that the gradient of cross-entropy loss with respect to the logits z (not the probabilities p) simplifies dramatically:
$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i$$
This is simply the difference between predictions and targets!
Derivation:
$$\frac{\partial \mathcal{L}}{\partial z_i} = \sum_j \frac{\partial \mathcal{L}}{\partial p_j} \cdot \frac{\partial p_j}{\partial z_i}$$
Since $\mathcal{L} = -\log(p_c)$, we have $\frac{\partial \mathcal{L}}{\partial p_j} = -\frac{y_j}{p_j}$.
Using the Jacobian $\frac{\partial p_j}{\partial z_i} = p_j(\delta_{ij} - p_i)$:
$$\frac{\partial \mathcal{L}}{\partial z_i} = -\sum_j \frac{y_j}{p_j} \cdot p_j(\delta_{ij} - p_i) = -\sum_j y_j(\delta_{ij} - p_i) = -y_i + p_i \sum_j y_j = p_i - y_i$$
```python
import numpy as np

def softmax(z):
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def cross_entropy_loss(p, y):
    """
    Cross-entropy loss.
    p: predicted probabilities (after softmax)
    y: one-hot targets
    """
    # Clip for numerical stability
    p_safe = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.sum(y * np.log(p_safe), axis=-1)

def softmax_cross_entropy_forward(z, y):
    """
    Combined forward pass: more numerically stable.
    Uses the log-sum-exp trick.
    """
    # log(softmax(z)) = z - log(sum(exp(z)))
    log_sum_exp = np.log(np.sum(np.exp(z - np.max(z)), axis=-1)) + np.max(z)
    log_probs = z - log_sum_exp
    return -np.sum(y * log_probs, axis=-1)

def softmax_cross_entropy_backward(z, y):
    """
    Gradient of cross-entropy w.r.t. logits.
    The beautiful result: grad = p - y
    """
    p = softmax(z)
    return p - y

# Verify the gradient
def verify_combined_gradient():
    # Logits and a one-hot target
    z = np.array([2.0, 1.0, 0.5, -0.5])
    y = np.array([0.0, 1.0, 0.0, 0.0])  # Class 1 is correct
    # Analytical gradient
    analytical_grad = softmax_cross_entropy_backward(z, y)
    # Numerical gradient
    epsilon = 1e-7
    numerical_grad = np.zeros_like(z)
    for i in range(len(z)):
        z_plus = z.copy()
        z_plus[i] += epsilon
        z_minus = z.copy()
        z_minus[i] -= epsilon
        loss_plus = softmax_cross_entropy_forward(z_plus, y)
        loss_minus = softmax_cross_entropy_forward(z_minus, y)
        numerical_grad[i] = (loss_plus - loss_minus) / (2 * epsilon)
    print("Softmax + Cross-Entropy Gradient Verification:")
    print("-" * 50)
    print(f"Logits z: {z}")
    print(f"Target y: {y}")
    print(f"Probs p: {softmax(z)}")
    print(f"Analytical gradient (p - y): {analytical_grad}")
    print(f"Numerical gradient: {numerical_grad}")
    print(f"Match: {np.allclose(analytical_grad, numerical_grad)}")
    # Show the beauty: the gradient is just p - y!
    p = softmax(z)
    print(f"Verify p - y = {p - y}")

verify_combined_gradient()

def log_softmax_stable(z):
    """
    Log-softmax: more stable than log(softmax(z)) for large logits.
    log_softmax(z) = z - log(sum(exp(z)))
    """
    max_z = np.max(z, axis=-1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(z - max_z), axis=-1, keepdims=True)) + max_z
    return z - log_sum_exp

# Numerical stability comparison
def stability_comparison():
    """Show that log-softmax is more stable than log(softmax)."""
    z = np.array([1000.0, 1000.1, 999.9])
    print("Numerical Stability: log-softmax vs log(softmax):")
    print("-" * 50)
    # Direct log(softmax): two passes, can lose precision
    p = softmax(z)
    print(f"log(softmax(z)): {np.log(p)}")
    # Log-softmax (stable, single pass)
    print(f"log_softmax(z): {log_softmax_stable(z)}")

stability_comparison()
```

The simple gradient p - y is the foundation of efficient classification training. It is also intuitive: if the model predicts p = [0.9, 0.1] but the target is y = [0, 1], the gradient is [0.9, -0.9], so gradient descent pushes logit₀ down and logit₁ up. The size of the push is proportional to how wrong the prediction is.
The scaled dot-product attention mechanism uses softmax as its core normalization:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Why softmax here? It converts raw similarity scores into non-negative weights that sum to 1, so each output is a convex combination of the value vectors, and it is fully differentiable, allowing gradients to flow through the attention weights.
The scaling factor 1/√dₖ:
Without scaling, dot products QKᵀ grow with dimension dₖ, pushing softmax into saturation. With dₖ = 512, unscaled dot products can easily exceed 20-30, causing softmax to produce near-one-hot distributions with vanishing gradients.
Scaling keeps dot products moderate, maintaining a balanced distribution.
```python
import numpy as np

def softmax(z, axis=-1):
    """Stable softmax along the specified axis."""
    z_shifted = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention.
    Q: [batch, seq_len, d_k] - Queries
    K: [batch, seq_len, d_k] - Keys
    V: [batch, seq_len, d_v] - Values
    mask: Optional attention mask
    """
    d_k = Q.shape[-1]
    # Compute attention scores
    scores = Q @ K.transpose(0, 2, 1)  # [batch, seq_len, seq_len]
    # Scale by sqrt(d_k)
    scores_scaled = scores / np.sqrt(d_k)
    # Apply mask (for causal attention or padding)
    if mask is not None:
        scores_scaled = np.where(mask, scores_scaled, -1e9)
    # Softmax over the keys dimension
    attention_weights = softmax(scores_scaled, axis=-1)
    # Weighted sum of values
    output = attention_weights @ V
    return output, attention_weights

# Demonstrate attention
def attention_demo():
    np.random.seed(42)
    batch_size, seq_len, d_k, d_v = 1, 5, 64, 64
    # Random Q, K, V
    Q = np.random.randn(batch_size, seq_len, d_k) * 0.1
    K = np.random.randn(batch_size, seq_len, d_k) * 0.1
    V = np.random.randn(batch_size, seq_len, d_v) * 0.1
    output, attn_weights = scaled_dot_product_attention(Q, K, V)
    print("Scaled Dot-Product Attention:")
    print("-" * 50)
    print(f"Q, K, V shape: {Q.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Attention weights shape: {attn_weights.shape}")
    print("Attention weights (softmax probabilities):")
    print(f"{attn_weights[0]}")
    print(f"Row sums (should be 1.0): {attn_weights[0].sum(axis=-1)}")

attention_demo()

# Causal mask for autoregressive models
def causal_mask_demo():
    """Demonstrate causal (autoregressive) masking."""
    seq_len = 5
    # Causal mask: position i can only attend to positions <= i
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    print("Causal Attention Mask:")
    print(causal_mask.astype(int))
    # With random scores
    np.random.seed(42)
    scores = np.random.randn(seq_len, seq_len)
    # Apply mask
    scores_masked = np.where(causal_mask, scores, -1e9)
    # Softmax
    attn_weights = softmax(scores_masked, axis=-1)
    print("Resulting attention (causal):")
    print(attn_weights.round(3))
    print("Note: the upper triangle is effectively zero (masked out)")

causal_mask_demo()

# Why scaling matters
def scaling_importance():
    """Show why the sqrt(d_k) scaling is necessary."""
    np.random.seed(42)
    print("Why Scaling Matters:")
    print("-" * 50)
    for d_k in [8, 64, 512, 2048]:
        # Random vectors for q, k
        q = np.random.randn(d_k)
        k = np.random.randn(d_k)
        # E[q·k] = 0, Var[q·k] = d_k (for independent standard normals)
        dot_product = np.dot(q, k)
        scaled_dot = dot_product / np.sqrt(d_k)
        print(f"d_k={d_k:4d}: dot product = {dot_product:7.2f}, scaled = {scaled_dot:6.2f}")

scaling_importance()
```

Flash Attention and similar algorithms fuse the attention computation (including softmax) into optimized GPU kernels. They compute attention in tiles, tracking softmax normalization incrementally to avoid materializing the full N×N attention matrix. This requires careful handling of the running sum for softmax denominators.
Sparsemax produces sparse probability distributions—most entries are exactly zero:
$$\text{sparsemax}(z) = \arg\min_{p} \|p - z\|^2 \quad \text{subject to } p \in \Delta^{K-1}$$
where Δᴷ⁻¹ is the probability simplex.
Properties: sparsemax can assign exactly zero probability to low-scoring classes, acts as the identity on the interior of the simplex, and is differentiable almost everywhere (with well-defined subgradients at the boundary), which makes it useful for interpretable attention.
Gumbel-softmax (or concrete distribution) enables differentiable sampling from categorical distributions:
$$y_i = \frac{\exp((z_i + g_i)/\tau)}{\sum_j \exp((z_j + g_j)/\tau)}$$
where gᵢ ~ Gumbel(0, 1) are i.i.d. Gumbel noise samples.
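The reason adding Gumbel noise works is the Gumbel-max trick: argmaxᵢ(zᵢ + gᵢ) is an exact sample from Categorical(softmax(z)); Gumbel-softmax then replaces the non-differentiable argmax with a temperature-scaled softmax. The Monte Carlo sketch below (sample count chosen arbitrarily) checks that the empirical argmax frequencies match the softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([2.0, 1.0, 0.0])

# Target categorical distribution
e = np.exp(z - z.max())
p = e / e.sum()

# Gumbel-max trick: perturb logits with Gumbel(0, 1) noise and take argmax
n = 200_000
u = rng.uniform(size=(n, 3))
g = -np.log(-np.log(np.clip(u, 1e-12, 1 - 1e-12)))  # inverse-CDF Gumbel samples
counts = np.bincount(np.argmax(z + g, axis=1), minlength=3)
freq = counts / n

print(f"softmax probs:        {np.round(p, 4)}")
print(f"empirical argmax frq: {np.round(freq, 4)}")
# The empirical frequencies converge to softmax(z) as n grows.
```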
Use cases: training discrete latent variable models (e.g., categorical VAEs), neural architecture search, and any setting that requires backpropagating through a sampling step.
```python
import numpy as np

def softmax_with_temperature(z, temperature):
    z_scaled = z / temperature
    z_shifted = z_scaled - np.max(z_scaled)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z)

def sparsemax(z):
    """
    Sparsemax: Euclidean projection onto the probability simplex.
    Returns a sparse probability distribution.
    """
    z_sorted = np.sort(z)[::-1]  # Sort descending
    K = len(z)
    # Find the threshold
    cumsum = np.cumsum(z_sorted)
    k_range = np.arange(1, K + 1)
    threshold_candidates = (cumsum - 1) / k_range
    # Largest k where the sorted logit exceeds its threshold candidate
    support = z_sorted > threshold_candidates
    k_star = np.sum(support)
    threshold = threshold_candidates[k_star - 1]
    # Project
    return np.maximum(z - threshold, 0)

def gumbel_softmax(logits, temperature=1.0):
    """
    Gumbel-Softmax (Concrete) distribution.
    Differentiable approximation to categorical sampling.
    """
    # Sample Gumbel(0, 1) noise via the inverse CDF
    U = np.random.uniform(0, 1, logits.shape)
    gumbel_noise = -np.log(-np.log(U + 1e-10) + 1e-10)
    # Add noise to logits and apply temperature-scaled softmax
    return softmax_with_temperature(logits + gumbel_noise, temperature)

def gumbel_softmax_straight_through(logits, temperature=1.0):
    """
    Straight-through Gumbel-Softmax.
    Forward: argmax (discrete). Backward: softmax gradient (continuous).
    """
    soft = gumbel_softmax(logits, temperature)
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0
    # A real implementation uses stop-gradient tricks to pass the
    # soft gradient through the hard forward value.
    return hard  # For inference, use hard

# Demonstrate sparsemax
def sparsemax_demo():
    z = np.array([2.0, 1.5, 1.0, 0.5, 0.0])
    soft = softmax_with_temperature(z, 1.0)
    sparse = sparsemax(z)
    print("Sparsemax vs Softmax:")
    print("-" * 50)
    print(f"Logits:   {z}")
    print(f"Softmax:  {soft} (all non-zero)")
    print(f"Sparsemax:{sparse} (sparse!)")
    print(f"Non-zero entries: Softmax={np.sum(soft > 0)}, Sparsemax={np.sum(sparse > 0)}")

sparsemax_demo()

# Demonstrate Gumbel-Softmax
def gumbel_softmax_demo():
    np.random.seed(42)
    logits = np.array([2.0, 1.0, 0.0])
    print("Gumbel-Softmax Sampling:")
    print("-" * 50)
    print(f"Logits: {logits}")
    print(f"Softmax (no noise): {softmax_with_temperature(logits, 1.0)}")
    print("Samples at different temperatures:")
    for temp in [0.1, 0.5, 1.0, 2.0]:
        samples = [gumbel_softmax(logits, temp) for _ in range(4)]
        print(f"T={temp}: ", end="")
        for s in samples:
            print(f"[{s[0]:.2f},{s[1]:.2f},{s[2]:.2f}] ", end="")
        print()
    print("Note: Lower T → sharper (closer to one-hot)")
    print("      Higher T → softer (closer to softmax)")

gumbel_softmax_demo()
```

| Variant | Outputs | Use Case | Differentiable |
|---|---|---|---|
| Softmax | Dense probabilities | Standard classification, attention | Yes |
| Temperature softmax | Sharp/flat probabilities | Knowledge distillation, sampling | Yes |
| Sparsemax | Sparse probabilities | Interpretable attention, hard selection | Yes (subgradient) |
| Gumbel-softmax | Noisy soft probabilities | Discrete latent variables, NAS | Yes (approximate) |
| Straight-through | Hard one-hot (forward) | Discrete actions with gradient | Approximate |
We have comprehensively analyzed the softmax function—the essential operation that transforms logits into probability distributions for classification and attention.
Looking Ahead:
With sigmoid, tanh, ReLU variants, Swish, GELU, and softmax fully understood, the next page provides practical selection guidelines—helping you choose the right activation function for any architecture, task, and deployment context.
You now have complete mastery of the softmax function. You understand its mathematical properties, temperature effects, numerical stability requirements, Jacobian structure, and role in both classification and attention. This knowledge is essential for working with any modern neural network.