On the previous page, we established the self-attention formulation: queries take dot products with keys to produce scores, which are then transformed into weights used to aggregate values. The critical transformation in this process is converting raw scores into a probability distribution.
This page examines the attention weight computation in detail—the mathematical operation that turns unbounded real numbers into meaningful attention probabilities. While seemingly simple, this computation has profound implications for transformer behavior.
By the end of this page, you will understand the complete pipeline from attention scores to attention weights, including softmax mechanics, numerical stability techniques, gradient properties, and the interpretation of resulting attention distributions.
Raw attention scores $S_{ij} = Q_i \cdot K_j$ are unbounded real numbers—they can be arbitrarily positive, negative, or close to zero. To use these scores for weighted aggregation, we need a normalization that satisfies several requirements.

Requirements for Attention Weights:

- Non-negativity: every weight must be positive, since negative mixing coefficients have no meaning for aggregation.
- Normalization: the weights for each query must sum to 1, so the output is a convex combination of the values.
- Order preservation: a higher score should always receive a higher weight.
- Differentiability: the operation must be smooth so gradients can flow during training.
Why Not Simple Normalization?
A naive approach might be: $$\alpha_{ij} = \frac{S_{ij}}{\sum_k S_{ik}}$$
This fails because:

- Negative scores produce negative "weights," which are meaningless as mixing coefficients.
- The denominator $\sum_k S_{ik}$ can be zero or negative, making the result undefined or flipping its sign.
- Nothing constrains the result to be a valid probability distribution.
```python
import numpy as np

def naive_normalize(scores: np.ndarray) -> np.ndarray:
    """Naive normalization - PROBLEMATIC."""
    return scores / scores.sum(axis=-1, keepdims=True)

def softmax_normalize(scores: np.ndarray) -> np.ndarray:
    """Softmax normalization - CORRECT."""
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Example: Score vector with negative values
scores = np.array([2.0, -1.0, 0.5, -0.5])

naive = naive_normalize(scores)
softmax = softmax_normalize(scores)

print("Scores:", scores)
print(f"Naive normalization: {naive}")
print(f"  Sum: {naive.sum():.4f}, Min: {naive.min():.4f}")  # Negative weights!
print(f"\nSoftmax normalization: {np.round(softmax, 4)}")
print(f"  Sum: {softmax.sum():.4f}, Min: {softmax.min():.4f}")  # All positive!

# Verify softmax properties
print(f"\nSoftmax properties:")
print(f"  All positive: {(softmax > 0).all()}")
print(f"  Sums to 1: {np.isclose(softmax.sum(), 1.0)}")
print(f"  Highest score → highest weight: {scores.argmax() == softmax.argmax()}")
```

Softmax exponentiates scores before normalizing. This guarantees positivity (exponentials are always positive) while preserving relative ordering (exp is monotonic). The exponential also accentuates differences—larger scores get disproportionately more weight.
The softmax function is the core mathematical operation that transforms attention scores into weights:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}$$
For a score vector $z = [z_1, ..., z_n]$, softmax produces a probability distribution where each element is positive and the total sums to 1.
Key Properties of Softmax:

- Range: every output lies strictly between 0 and 1.
- Normalization: the outputs sum to 1.
- Translation invariance: adding the same constant to every input leaves the output unchanged.
- Monotonicity: the ordering of inputs is preserved in the outputs.
- Exponential amplification: larger scores receive disproportionately more weight.
```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    z_shifted = z - z.max(axis=axis, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum(axis=axis, keepdims=True)

# Property 1: Output range (0, 1)
z = np.array([1.0, 2.0, 3.0])
s = softmax(z)
print(f"Outputs: {s}")
print(f"All in (0,1): {((s > 0) & (s < 1)).all()}")

# Property 2: Sums to 1
print(f"Sum: {s.sum()}")

# Property 3: Translation invariance
z_shifted = z + 1000
s_shifted = softmax(z_shifted)
print(f"\nOriginal softmax: {s}")
print(f"Shifted (+1000) softmax: {s_shifted}")
print(f"Difference: {np.abs(s - s_shifted).max()}")  # Essentially 0

# Property 4: Monotonicity preserved
z_ordered = np.array([1.0, 3.0, 2.0])  # Order: 1 < 2 < 3
s_ordered = softmax(z_ordered)
print(f"\nInput order: {np.argsort(z_ordered)}")
print(f"Output order: {np.argsort(s_ordered)}")  # Same ordering

# Property 5: Exponential amplification (temperature effect)
z = np.array([1.0, 2.0, 3.0])
print("\nExponential amplification:")
for temp in [2.0, 1.0, 0.5, 0.1]:
    s = softmax(z / temp)
    print(f"  T={temp}: {np.round(s, 4)}")
# Lower temperature → sharper distribution
```

Temperature and Sharpness:
A key insight is that softmax sharpness can be controlled by scaling input scores:
$$\text{softmax}(z/T)_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$
Where $T$ is the "temperature":

- $T = 1$: the standard softmax.
- $T > 1$: a flatter, more uniform distribution.
- $T < 1$: a sharper, more peaked distribution.
- $T \to 0$: approaches a one-hot argmax; $T \to \infty$ approaches the uniform distribution.
This temperature concept will become important when we discuss the scaling factor in scaled dot-product attention.
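To see the limiting behavior numerically, here is a brief sketch (using `scipy.special.softmax`, which later examples on this page also rely on; the score values are illustrative): a large $T$ flattens the weights toward uniform, while a small $T$ concentrates nearly all weight on the largest score.

```python
import numpy as np
from scipy.special import softmax

z = np.array([1.0, 2.0, 3.0])

for T in [100.0, 1.0, 0.01]:
    w = softmax(z / T)
    print(f"T={T:>6}: {np.round(w, 4)}")
# T=100  → roughly uniform, e.g. [0.3322 0.3333 0.3344]
# T=1    → the standard softmax
# T=0.01 → essentially one-hot on the largest score
```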
Softmax has deep connections to the maximum entropy principle and logistic regression. It's the natural output for categorical distributions when we want to maximize entropy subject to expected value constraints. This makes it theoretically principled, not just a convenient choice.
The naive softmax implementation can fail catastrophically due to numerical overflow:
$$e^{z_i} \rightarrow \infty \text{ when } z_i \text{ is large}$$
For example, $e^{1000}$ exceeds the float64 range, producing inf. Conversely, if every score is very negative, each exponential underflows to 0 and the denominator becomes zero, producing NaN.
The Stable Softmax Trick:
Since softmax is translation-invariant, we can subtract the maximum value:
$$\text{softmax}(z)_i = \frac{e^{z_i - \max(z)}}{\sum_j e^{z_j - \max(z)}}$$
Now:

- The largest shifted score is 0, so the largest exponential is $e^0 = 1$—no overflow.
- Every other exponential lies in $(0, 1]$, and the denominator is at least 1—no division by zero.
- Because softmax is translation-invariant, the result is mathematically identical to the unshifted version.
```python
import numpy as np
import warnings

def softmax_naive(z: np.ndarray) -> np.ndarray:
    """Naive softmax - numerically UNSTABLE."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_stable(z: np.ndarray) -> np.ndarray:
    """Stable softmax - subtract max before exp."""
    z_shifted = z - z.max()
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum()

# Test with moderate values
z_moderate = np.array([1.0, 2.0, 3.0])
print("Moderate values:")
print(f"  Naive:  {softmax_naive(z_moderate)}")
print(f"  Stable: {softmax_stable(z_moderate)}")

# Test with large values (np.exp overflows float64 for arguments above ~709)
z_large = np.array([600.0, 700.0, 800.0])
print("\nLarge values:")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    naive_result = softmax_naive(z_large)
    print(f"  Naive:  {naive_result}")  # [0. 0. nan] - BROKEN
print(f"  Stable: {softmax_stable(z_large)}")  # Correct

# Test with very large values (every exponential overflows float64)
z_huge = np.array([1000.0, 1001.0, 1002.0])
print("\nVery large values (would overflow float64):")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    naive_result = softmax_naive(z_huge)
    print(f"  Naive:  {naive_result}")  # [nan nan nan]
print(f"  Stable: {softmax_stable(z_huge)}")  # Still correct!

# Demonstration of why this works
print("\nWhy stable version works:")
print(f"  z_huge: {z_huge}")
print(f"  z_huge - max(z_huge): {z_huge - z_huge.max()}")  # [-2, -1, 0]
print(f"  exp of shifted: {np.exp(z_huge - z_huge.max())}")
```

Log-Sum-Exp for Log-Probabilities:
When working with log-probabilities (common in practice), we need log-softmax:
$$\log\text{softmax}(z)_i = z_i - \log\sum_j e^{z_j}$$
The log-sum-exp (LSE) requires special handling:
$$\text{LSE}(z) = \max(z) + \log\sum_j e^{z_j - \max(z)}$$
```python
import numpy as np

def logsumexp(z: np.ndarray) -> float:
    """Numerically stable log-sum-exp."""
    z_max = z.max()
    return z_max + np.log(np.exp(z - z_max).sum())

def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax."""
    return z - logsumexp(z)

# Example
z = np.array([1000.0, 1001.0, 1002.0])

# Naive log-softmax would fail:
# log_prob = z - np.log(np.exp(z).sum())  # np.exp(z) overflows to inf, collapsing the log-probs

# Stable version works
log_prob = log_softmax(z)
print(f"Log probabilities: {log_prob}")
print(f"Probabilities (exp): {np.exp(log_prob)}")
print(f"Sum: {np.exp(log_prob).sum()}")  # Should be ~1
```

Deep learning frameworks (PyTorch, TensorFlow) implement numerically stable softmax internally. However, if you ever implement attention from scratch (e.g., in CUDA kernels), numerical stability is critical. Always use the subtract-max trick.
For training transformers via backpropagation, we need the gradient of softmax. The softmax Jacobian has a beautiful structure:
$$\frac{\partial\, \text{softmax}(z)_i}{\partial z_j} = \text{softmax}(z)_i \left( \delta_{ij} - \text{softmax}(z)_j \right)$$
Where $\delta_{ij}$ is the Kronecker delta (1 if $i=j$, else 0).
Intuition:

- Diagonal entries $s_i(1 - s_i)$ are positive: increasing a score increases its own weight.
- Off-diagonal entries $-s_i s_j$ are negative: increasing one score necessarily decreases every other weight.
- Each row of the Jacobian sums to zero, reflecting the constraint that the outputs always sum to 1.
```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Stable softmax."""
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def softmax_jacobian(z: np.ndarray) -> np.ndarray:
    """
    Compute the Jacobian matrix of softmax.

    Returns:
        J[i,j] = d(softmax(z)_i) / d(z_j)
    """
    s = softmax(z)
    # J[i,j] = s[i] * (delta[i,j] - s[j])
    #        = s[i] * delta[i,j] - s[i] * s[j]
    #        = diag(s) - outer(s, s)
    jacobian = np.diag(s) - np.outer(s, s)
    return jacobian

def softmax_backward(z: np.ndarray, grad_output: np.ndarray) -> np.ndarray:
    """
    Efficient backward pass without computing the full Jacobian.

    Given the gradient w.r.t. output s, compute the gradient w.r.t. input z.
    """
    s = softmax(z)
    # grad_z = s * (grad_output - sum(grad_output * s))
    sum_term = np.sum(grad_output * s)
    grad_z = s * (grad_output - sum_term)
    return grad_z

# Demonstration
z = np.array([1.0, 2.0, 3.0])
s = softmax(z)
J = softmax_jacobian(z)

print("Softmax Jacobian Analysis")
print("=" * 50)
print(f"Input z: {z}")
print(f"Softmax s: {np.round(s, 4)}")
print(f"\nJacobian matrix:")
print(np.round(J, 4))

# Verify Jacobian properties
print(f"\nJacobian properties:")
print(f"  Row sums (should be 0): {J.sum(axis=1)}")  # Sum to 0
print(f"  Symmetric: {np.allclose(J, J.T)}")  # Jacobian is symmetric

# Verify backward pass matches Jacobian product
grad_output = np.array([1.0, 0.0, 0.0])  # Gradient from loss
grad_z_via_jacobian = J.T @ grad_output
grad_z_efficient = softmax_backward(z, grad_output)
print(f"\nBackward pass verification:")
print(f"  Via Jacobian: {grad_z_via_jacobian}")
print(f"  Efficient:    {grad_z_efficient}")
print(f"  Match: {np.allclose(grad_z_via_jacobian, grad_z_efficient)}")
```

Gradient Implications for Attention:
The softmax gradient structure has important implications for attention-based learning:
Gradient Competition: When one attention weight increases, others must decrease (zero-sum). This creates competition between positions for attention.
Saturation: When softmax output is near 0 or 1, gradients become small. Very confident attention patterns are hard to change. This can cause:

- Vanishing gradients through the attention weights, which slows learning.
- Attention patterns that lock in early in training and become difficult to revise.

A short numeric sketch of this effect follows the list below.
Credit Assignment: The gradient flows back to all positions proportionally to their attention weight. Highly-attended positions receive more gradient signal.
Scaling Matters: The magnitude of scores before softmax affects gradient magnitude. This is another reason why scaled dot-product attention is important.
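To make the saturation point concrete, here is a minimal sketch (reusing the stable softmax and Jacobian helpers defined above; the specific score vectors are illustrative) comparing the largest Jacobian entry for a moderately spread distribution and for a nearly one-hot one.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def softmax_jacobian(z: np.ndarray) -> np.ndarray:
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z_moderate = np.array([1.0, 2.0, 3.0])    # spread-out attention
z_saturated = np.array([1.0, 2.0, 30.0])  # nearly one-hot attention

for name, z in [("moderate", z_moderate), ("saturated", z_saturated)]:
    J = softmax_jacobian(z)
    print(f"{name:>9}: max |J| = {np.abs(J).max():.2e}, "
          f"peak weight = {softmax(z).max():.4f}")
# moderate:  max |J| on the order of 1e-1  — gradients flow
# saturated: max |J| on the order of 1e-12 — gradients essentially vanish
```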
Computing the full n×n Jacobian is O(n²). The efficient backward formula grad_z = s * (grad_output - sum(grad_output * s)) is O(n), which matters because in attention this backward pass must be applied to every row of the n×n attention matrix.
Now let's trace through the complete pipeline from input to attention weights:
Step-by-Step Pipeline:
```python
import numpy as np
from scipy.special import softmax

def attention_weights_pipeline(
    X: np.ndarray,
    W_q: np.ndarray,
    W_k: np.ndarray,
    mask: np.ndarray = None,
    return_intermediates: bool = False,
) -> dict:
    """
    Complete pipeline for computing attention weights with all intermediates.

    Args:
        X: Input sequence, shape (n, d_model)
        W_q: Query projection, shape (d_model, d_k)
        W_k: Key projection, shape (d_model, d_k)
        mask: Optional mask, shape (n, n), with -inf for masked positions
        return_intermediates: Whether to return all intermediate values

    Returns:
        dict with 'weights' and optionally intermediate values
    """
    d_k = W_q.shape[1]

    # Step 1: Project to Q and K
    Q = X @ W_q  # (n, d_k)
    K = X @ W_k  # (n, d_k)

    # Step 2: Compute raw scores
    scores_raw = Q @ K.T  # (n, n)

    # Step 3: Scale by sqrt(d_k)
    scale = np.sqrt(d_k)
    scores_scaled = scores_raw / scale

    # Step 4: Apply mask (if provided)
    if mask is not None:
        scores_masked = scores_scaled + mask
    else:
        scores_masked = scores_scaled

    # Step 5: Softmax over keys (axis=-1)
    weights = softmax(scores_masked, axis=-1)

    if return_intermediates:
        return {
            'Q': Q,
            'K': K,
            'scores_raw': scores_raw,
            'scale': scale,
            'scores_scaled': scores_scaled,
            'scores_masked': scores_masked,
            'weights': weights,
        }
    return {'weights': weights}

# Example: Process a small sequence
np.random.seed(42)
n, d_model, d_k = 4, 16, 8

X = np.random.randn(n, d_model)
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1

# Compute with intermediates
result = attention_weights_pipeline(X, W_q, W_k, return_intermediates=True)

print("Attention Weight Computation Pipeline")
print("=" * 50)
print(f"\nStep 1 - Q shape: {result['Q'].shape}")
print(f"Step 1 - K shape: {result['K'].shape}")
print(f"\nStep 2 - Raw scores ({result['scores_raw'].shape}):")
print(np.round(result['scores_raw'], 3))
print(f"\nStep 3 - Scale factor: {result['scale']:.3f}")
print(f"Step 3 - Scaled scores:")
print(np.round(result['scores_scaled'], 3))
print(f"\nStep 5 - Attention weights:")
print(np.round(result['weights'], 3))
print(f"\nWeight row sums: {result['weights'].sum(axis=1)}")
```

Understanding Each Step:
| Step | Operation | Effect | Shape |
|---|---|---|---|
| Project | Linear transformation | Create query/key representations | (n, d_k) |
| Dot Product | Score computation | Measure query-key compatibility | (n, n) |
| Scale | Divide by √d_k | Control softmax temperature | (n, n) |
| Mask | Add −∞ to invalid positions | Prevent attending to certain positions | (n, n) |
| Softmax | Normalize rows | Convert to probability distribution | (n, n) |
Each ROW of the attention matrix corresponds to one query position. Row i contains the attention weights from position i to all key positions. Thus, softmax is applied row-wise (axis=-1), and each row sums to 1.
The attention weight matrix reveals which positions the model considers important for each query. Analyzing these patterns provides insight into model behavior.
Common Attention Patterns:

- Peaked vs. diffuse: a few positions dominate the weights vs. attention spread broadly (measured by entropy and the peak weight).
- Self-attention: strong weight on the diagonal, i.e., a position attending to itself.
- Local vs. long-range: weight concentrated near the query position vs. on distant positions.
- Sparse: only a handful of positions receive significant weight (the "effective positions" count below).
```python
import numpy as np
from scipy.special import softmax

def analyze_attention_pattern(weights: np.ndarray, tokens: list = None) -> dict:
    """
    Analyze attention weight matrix for common patterns.
    """
    n = weights.shape[0]
    analysis = {}

    # Entropy per query (high = diffuse, low = peaked)
    # H = -sum(p * log(p))
    log_weights = np.log(weights + 1e-10)
    entropy = -np.sum(weights * log_weights, axis=-1)
    max_entropy = np.log(n)  # Maximum entropy for uniform distribution
    analysis['entropy'] = entropy
    analysis['normalized_entropy'] = entropy / max_entropy

    # Peakedness: max attention weight per row
    analysis['peak_weights'] = weights.max(axis=-1)

    # Self-attention strength: diagonal elements
    diagonal = np.diag(weights)
    analysis['self_attention'] = diagonal

    # Locality: average distance to attended positions
    positions = np.arange(n)
    expected_position = np.sum(weights * positions[None, :], axis=-1)
    locality = np.abs(expected_position - positions)
    analysis['locality_shift'] = locality

    # Sparsity: how many positions get significant attention
    significant_threshold = 1.0 / (2 * n)  # Weights above half the uniform value 1/n count as significant
    significant_count = (weights > significant_threshold).sum(axis=-1)
    analysis['effective_positions'] = significant_count

    return analysis

def visualize_pattern_stats(analysis: dict, tokens: list = None):
    """Print pattern analysis statistics."""
    n = len(analysis['entropy'])

    print("Attention Pattern Analysis")
    print("=" * 60)
    print(f"\n{'Position':<10} {'Token':<10} {'Entropy':<10} {'Peak':<10} "
          f"{'Self-Attn':<10} {'Eff.Pos':<10}")
    print("-" * 60)

    for i in range(n):
        token = tokens[i] if tokens else f"pos_{i}"
        print(f"{i:<10} {token:<10} "
              f"{analysis['normalized_entropy'][i]:<10.3f} "
              f"{analysis['peak_weights'][i]:<10.3f} "
              f"{analysis['self_attention'][i]:<10.3f} "
              f"{analysis['effective_positions'][i]:<10}")

    print(f"\nSummary:")
    print(f"  Average normalized entropy: {analysis['normalized_entropy'].mean():.3f}")
    print(f"  Average peak weight: {analysis['peak_weights'].mean():.3f}")
    print(f"  Average self-attention: {analysis['self_attention'].mean():.3f}")

# Example: Simulate different patterns
np.random.seed(42)
n = 6
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Simulate learned attention weights (peaked pattern)
raw_scores = np.random.randn(n, n)
# Make "sat" attend strongly to "cat" and "mat"
raw_scores[2, 1] = 3.0  # sat → cat
raw_scores[2, 5] = 2.5  # sat → mat

weights = softmax(raw_scores, axis=-1)
analysis = analyze_attention_pattern(weights, tokens)
visualize_pattern_stats(analysis, tokens)

print(f"\nAttention weights for 'sat' (position 2):")
for j, (token, weight) in enumerate(zip(tokens, weights[2])):
    bar = "█" * int(weight * 30)
    print(f"  → {token}: {weight:.3f} {bar}")
```

Entropy as a Diagnostic:
Attention entropy measures how spread out the attention is: $H_i = -\sum_j \alpha_{ij} \log \alpha_{ij}$. It is maximal ($\log n$) for uniform attention and zero when all weight falls on a single position.

During training, attention typically:

- Starts close to uniform (high entropy), since randomly initialized projections produce small, undifferentiated scores.
- Sharpens as training progresses (entropy drops) as useful query–key patterns emerge.

Analyzing entropy across layers often reveals:

- Heads with very different roles—some stay diffuse (high entropy), others become sharply peaked.
- Degenerate heads whose patterns collapse to near-uniform or to a single position, a useful debugging signal.
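As a quick numerical illustration (a sketch using the same entropy definition as the analysis code above; the three example rows are hand-picked, not learned weights), compare the normalized entropy of a uniform, a moderately peaked, and a near-one-hot attention row:

```python
import numpy as np

def normalized_entropy(row: np.ndarray) -> float:
    """Entropy of one attention row, normalized so uniform = 1.0."""
    h = -np.sum(row * np.log(row + 1e-10))
    return h / np.log(len(row))

n = 8
uniform = np.full(n, 1.0 / n)                                       # maximally diffuse
peaked = np.array([0.5, 0.2, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02])    # a few positions dominate
near_onehot = np.array([0.97] + [0.03 / (n - 1)] * (n - 1))         # almost all weight on one key

for name, row in [("uniform", uniform), ("peaked", peaked), ("near one-hot", near_onehot)]:
    print(f"{name:>13}: normalized entropy = {normalized_entropy(row):.3f}")
# uniform ≈ 1.0, peaked in between (~0.7), near one-hot close to 0 (~0.1)
```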
When transformers fail on specific examples, visualizing attention patterns can reveal issues: Is the model attending to relevant tokens? Are attention patterns degenerate (all uniform or all to one position)? These diagnostics guide debugging and model improvement.
In practice, attention weights are computed for batches of sequences. This requires careful handling of dimensions and efficient tensor operations.
Batch Dimensions:
For a batch of $B$ sequences, each of length $n$:

- $Q$ and $K$ have shape $(B, n, d_k)$.
- The score and attention-weight matrices have shape $(B, n, n)$.
- Softmax is still applied along the last (key) axis, independently for every batch element and query.
The batch dimension is independent—each sequence has its own attention matrix.
```python
import numpy as np
from scipy.special import softmax

def batched_attention_weights(
    Q: np.ndarray,            # (B, n, d_k)
    K: np.ndarray,            # (B, n, d_k)
    mask: np.ndarray = None,  # (B, n, n) or (1, n, n) or (n, n)
) -> np.ndarray:
    """
    Compute attention weights for a batch of sequences.

    Returns:
        A: Attention weights, shape (B, n, n)
    """
    d_k = Q.shape[-1]

    # Batched matrix multiplication: (B, n, d_k) @ (B, d_k, n) → (B, n, n)
    # Note: K.transpose swaps last two dims, keeping batch dim
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)

    # Apply mask if provided (broadcasting handles shape differences)
    if mask is not None:
        scores = scores + mask

    # Softmax along key dimension (last axis)
    weights = softmax(scores, axis=-1)

    return weights

# Example: Batch of 3 sequences, length 5, d_k=8
B, n, d_k = 3, 5, 8

Q = np.random.randn(B, n, d_k)
K = np.random.randn(B, n, d_k)

# Compute batched attention
weights = batched_attention_weights(Q, K)

print(f"Batched Attention Computation")
print(f"=" * 50)
print(f"Q shape: {Q.shape}")
print(f"K shape: {K.shape}")
print(f"Weights shape: {weights.shape}")

# Verify each batch element
print(f"\nVerifying batch independence:")
for b in range(B):
    row_sums = weights[b].sum(axis=-1)
    print(f"  Batch {b}: Row sums = {np.allclose(row_sums, 1.0)}, "
          f"All positive = {(weights[b] > 0).all()}")

# Broadcasting example: Single mask applied to all batches
causal_mask = np.triu(np.ones((n, n)), k=1) * -1e9  # (n, n)
weights_masked = batched_attention_weights(Q, K, mask=causal_mask)

print(f"\nWith causal mask (same for all batches):")
print(f"  Mask shape: {causal_mask.shape}")
print(f"  Broadcast to: {weights_masked.shape}")
print(f"  Sample masked weights (batch 0):")
print(np.round(weights_masked[0], 3))
```

Memory Considerations:
The attention matrix is $O(n^2)$ per sequence, so for a batch the weights alone occupy $B \times h \times n \times n$ entries, where $h$ is the number of attention heads—and this must be stored per layer when activations are kept for backpropagation.
For long sequences (n = 4096 or more), this becomes the memory bottleneck. Various efficient attention mechanisms address this (covered in Module 6: Transformer Variants).
Gradient Checkpointing:
During backpropagation, attention weights must be stored or recomputed. Gradient checkpointing trades compute for memory by recomputing activations during backward pass instead of storing them.
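A rough sketch of how this looks in practice, assuming PyTorch is available (`torch.utils.checkpoint.checkpoint` is PyTorch's built-in utility for this; the `use_reentrant=False` flag assumes a reasonably recent PyTorch version): the attention block's intermediates, including the (B, n, n) weight matrix, are recomputed during the backward pass rather than stored.

```python
import torch
from torch.utils.checkpoint import checkpoint

def attention_block(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = torch.softmax(scores, dim=-1)  # (B, n, n) — recomputed, not stored
    return weights @ v

B, n, d_k = 2, 1024, 64
q = torch.randn(B, n, d_k, requires_grad=True)
k = torch.randn(B, n, d_k, requires_grad=True)
v = torch.randn(B, n, d_k, requires_grad=True)

# Checkpointed forward: activations inside attention_block are dropped after
# the forward pass and recomputed during backward, trading compute for memory.
out = checkpoint(attention_block, q, k, v, use_reentrant=False)
out.sum().backward()
```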
For sequences of length 4096, each attention matrix has 16M entries (4096²). With float16 precision, that's 32MB per layer per head. For a 12-layer, 12-head model, storing all attention matrices requires ~4.6GB just for attention weights—for a single sequence! This motivates efficient attention variants.
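The arithmetic behind those numbers is easy to reproduce. A minimal sketch, assuming float16 (2 bytes per entry) and a single sequence:

```python
n_layers, n_heads, seq_len = 12, 12, 4096
bytes_per_entry = 2  # float16

entries_per_matrix = seq_len * seq_len          # 4096² ≈ 16.8M entries
bytes_per_matrix = entries_per_matrix * bytes_per_entry
total_bytes = bytes_per_matrix * n_layers * n_heads

print(f"Per attention matrix:  {bytes_per_matrix / 2**20:.0f} MiB")  # ~32 MiB
print(f"All layers and heads:  {total_bytes / 2**30:.1f} GiB")       # ~4.5 GiB
```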
Implementing attention weight computation correctly requires avoiding several common mistakes:
- Wrong softmax axis: calling softmax(scores, axis=-2) normalizes each column to 1, which is meaningless—each query's weights over the keys (each row) must sum to 1.
- Missing scale factor: skipping the division by √d_k produces overly peaked weights and poorer gradients (the next page covers this in detail).
- Weak mask values: using -1e4 instead of -1e9 or -inf may still allow gradient flow to masked positions, especially with float16. Use -inf or very large negative values.
- Numerical instability: exponentiating large scores without subtracting the maximum produces inf or NaN.
```python
import numpy as np
from scipy.special import softmax

def demonstrate_pitfalls():
    """Demonstrate common attention implementation mistakes."""
    np.random.seed(42)
    n, d_k = 4, 64
    Q = np.random.randn(n, d_k)
    K = np.random.randn(n, d_k)
    scores = Q @ K.T

    print("Demonstration of Attention Pitfalls")
    print("=" * 60)

    # Pitfall 1: Wrong softmax axis
    wrong_axis = softmax(scores, axis=-2)    # Normalizes columns
    correct_axis = softmax(scores, axis=-1)  # Normalizes rows

    print("\n1. WRONG SOFTMAX AXIS")
    print(f"   Wrong (axis=-2) row sums: {wrong_axis.sum(axis=-1)}")      # Not 1!
    print(f"   Wrong (axis=-2) col sums: {wrong_axis.sum(axis=-2)}")      # 1, but wrong!
    print(f"   Correct (axis=-1) row sums: {correct_axis.sum(axis=-1)}")  # 1 ✓

    # Pitfall 2: Missing scale factor
    print("\n2. MISSING SCALE FACTOR")
    unscaled = softmax(scores, axis=-1)
    scaled = softmax(scores / np.sqrt(d_k), axis=-1)

    unscaled_entropy = -np.sum(unscaled * np.log(unscaled + 1e-10), axis=-1).mean()
    scaled_entropy = -np.sum(scaled * np.log(scaled + 1e-10), axis=-1).mean()

    print(f"   Unscaled entropy: {unscaled_entropy:.4f}")
    print(f"   Scaled entropy: {scaled_entropy:.4f}")
    print(f"   Scaling spreads attention (higher entropy), better gradients")

    # Pitfall 3: Weak mask values
    print("\n3. WEAK MASK VALUES")
    causal_mask = np.triu(np.ones((n, n)), k=1)

    # Weak mask
    weak_masked = softmax((scores - causal_mask * 1e4) / np.sqrt(d_k), axis=-1)
    # Strong mask
    strong_masked = softmax((scores - causal_mask * 1e9) / np.sqrt(d_k), axis=-1)

    print(f"   Weak mask (-1e4), position [0,3] (should be 0): {weak_masked[0,3]:.2e}")
    print(f"   Strong mask (-1e9), position [0,3]: {strong_masked[0,3]:.2e}")

    # Pitfall 4: Numerical overflow without max-subtraction
    print("\n4. NUMERICAL STABILITY")
    # Simulate large scores
    large_scores = scores * 100  # Artificially large
    try:
        # Naive: might overflow
        naive_exp = np.exp(large_scores)
        naive_result = naive_exp / naive_exp.sum(axis=-1, keepdims=True)
        print(f"   Naive on large scores: {naive_result[0, :3]}")
    except:
        print("   Naive failed with overflow!")

    # Stable version
    stable_result = softmax(large_scores, axis=-1)
    print(f"   Stable on large scores: {stable_result[0, :3]}")

demonstrate_pitfalls()
```

Always verify: (1) Row sums equal 1, (2) All weights are positive, (3) Masked positions have weight ≈0, (4) Argmax is preserved from scores. These simple checks catch most implementation bugs.
We've thoroughly examined how raw attention scores become normalized attention weights—the probability distributions that determine information flow in transformers.
What's Next:
Now that we understand how attention weights are computed, we'll examine the scaling factor in detail. The next page explores scaled dot-product attention—why the √d_k divisor is crucial for training stability and how it relates to softmax behavior.
You now understand the complete pipeline from attention scores to attention weights, including softmax mechanics, numerical considerations, gradient properties, and pattern analysis. This knowledge is essential for implementing, debugging, and improving attention-based models.