On the previous page, we established the self-attention formulation: queries take dot products with keys to produce scores, which are then transformed into weights used to aggregate values. The critical transformation in this process is converting raw scores into a probability distribution.
This page examines the attention weight computation in detail—the mathematical operation that turns unbounded real numbers into meaningful attention probabilities. While seemingly simple, this computation has profound implications for transformer behavior.
By the end of this page, you will understand the complete pipeline from attention scores to attention weights, including softmax mechanics, numerical stability techniques, gradient properties, and the interpretation of resulting attention distributions.
Raw attention scores $S_{ij} = Q_i \cdot K_j$ are unbounded real numbers—they can be arbitrarily positive, negative, or close to zero. To use these scores for weighted aggregation, we need a normalization that satisfies several requirements.

Requirements for Attention Weights:

- Non-negativity: every weight must be positive, since negative mixing coefficients have no meaning for aggregation.
- Normalization: the weights for each query must sum to 1, so the output is a convex combination of the values.
- Order preservation: a higher score should always receive a higher weight.
- Differentiability: the operation must be smooth so gradients can flow during training.
Why Not Simple Normalization?
A naive approach might be: $$\alpha_{ij} = \frac{S_{ij}}{\sum_k S_{ik}}$$
This fails because:

- Negative scores produce negative "weights," which are meaningless as mixing coefficients.
- The denominator $\sum_k S_{ik}$ can be zero or negative, making the result undefined or flipping its sign.
- Nothing constrains the result to be a valid probability distribution.
```python
import numpy as np

def naive_normalize(scores: np.ndarray) -> np.ndarray:
    """Naive normalization - PROBLEMATIC."""
    return scores / scores.sum(axis=-1, keepdims=True)

def softmax_normalize(scores: np.ndarray) -> np.ndarray:
    """Softmax normalization - CORRECT."""
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Example: Score vector with negative values
scores = np.array([2.0, -1.0, 0.5, -0.5])

naive = naive_normalize(scores)
softmax = softmax_normalize(scores)

print("Scores:", scores)
print(f"Naive normalization: {naive}")
print(f"  Sum: {naive.sum():.4f}, Min: {naive.min():.4f}")  # Negative weights!
print(f"\nSoftmax normalization: {np.round(softmax, 4)}")
print(f"  Sum: {softmax.sum():.4f}, Min: {softmax.min():.4f}")  # All positive!

# Verify softmax properties
print(f"\nSoftmax properties:")
print(f"  All positive: {(softmax > 0).all()}")
print(f"  Sums to 1: {np.isclose(softmax.sum(), 1.0)}")
print(f"  Highest score → highest weight: {scores.argmax() == softmax.argmax()}")
```

Softmax exponentiates scores before normalizing. This guarantees positivity (exponentials are always positive) while preserving relative ordering (exp is monotonic). The exponential also accentuates differences—larger scores get disproportionately more weight.
The softmax function is the core mathematical operation that transforms attention scores into weights:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}$$
For a score vector $z = [z_1, ..., z_n]$, softmax produces a probability distribution where each element is positive and the total sums to 1.
Key Properties of Softmax:

- Range: every output lies strictly between 0 and 1.
- Normalization: the outputs sum to 1.
- Translation invariance: adding the same constant to every input leaves the output unchanged.
- Monotonicity: the ordering of inputs is preserved in the outputs.
- Exponential amplification: larger scores receive disproportionately more weight.
```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    z_shifted = z - z.max(axis=axis, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum(axis=axis, keepdims=True)

# Property 1: Output range (0, 1)
z = np.array([1.0, 2.0, 3.0])
s = softmax(z)
print(f"Outputs: {s}")
print(f"All in (0,1): {((s > 0) & (s < 1)).all()}")

# Property 2: Sums to 1
print(f"Sum: {s.sum()}")

# Property 3: Translation invariance
z_shifted = z + 1000
s_shifted = softmax(z_shifted)
print(f"\nOriginal softmax: {s}")
print(f"Shifted (+1000) softmax: {s_shifted}")
print(f"Difference: {np.abs(s - s_shifted).max()}")  # Essentially 0

# Property 4: Monotonicity preserved
z_ordered = np.array([1.0, 3.0, 2.0])  # Order: 1 < 2 < 3
s_ordered = softmax(z_ordered)
print(f"\nInput order: {np.argsort(z_ordered)}")
print(f"Output order: {np.argsort(s_ordered)}")  # Same ordering

# Property 5: Exponential amplification (temperature effect)
z = np.array([1.0, 2.0, 3.0])
print("\nExponential amplification:")
for temp in [2.0, 1.0, 0.5, 0.1]:
    s = softmax(z / temp)
    print(f"  T={temp}: {np.round(s, 4)}")
# Lower temperature → sharper distribution
```

Temperature and Sharpness:
A key insight is that softmax sharpness can be controlled by scaling input scores:
$$\text{softmax}(z/T)_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$
Where $T$ is the "temperature":

- $T = 1$: the standard softmax.
- $T > 1$: a flatter, more uniform distribution.
- $T < 1$: a sharper, more peaked distribution.
- $T \to 0$: approaches a one-hot argmax; $T \to \infty$ approaches the uniform distribution.
This temperature concept will become important when we discuss the scaling factor in scaled dot-product attention.
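To see the limiting behavior numerically, here is a brief sketch (using `scipy.special.softmax`, which later examples on this page also rely on; the score values are illustrative): a large $T$ flattens the weights toward uniform, while a small $T$ concentrates nearly all weight on the largest score.

```python
import numpy as np
from scipy.special import softmax

z = np.array([1.0, 2.0, 3.0])

for T in [100.0, 1.0, 0.01]:
    w = softmax(z / T)
    print(f"T={T:>6}: {np.round(w, 4)}")
# T=100  → roughly uniform, e.g. [0.3322 0.3333 0.3344]
# T=1    → the standard softmax
# T=0.01 → essentially one-hot on the largest score
```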
Softmax has deep connections to the maximum entropy principle and logistic regression. It's the natural output for categorical distributions when we want to maximize entropy subject to expected value constraints. This makes it theoretically principled, not just a convenient choice.
The naive softmax implementation can fail catastrophically due to numerical overflow:
$$e^{z_i} \rightarrow \infty \text{ when } z_i \text{ is large}$$
For example, $e^{1000}$ exceeds the float64 range, producing inf. Conversely, if every score is very negative, each exponential underflows to 0 and the denominator becomes zero, producing NaN.
The Stable Softmax Trick:
Since softmax is translation-invariant, we can subtract the maximum value:
$$\text{softmax}(z)_i = \frac{e^{z_i - \max(z)}}{\sum_j e^{z_j - \max(z)}}$$
Now:

- The largest shifted score is 0, so the largest exponential is $e^0 = 1$—no overflow.
- Every other exponential lies in $(0, 1]$, and the denominator is at least 1—no division by zero.
- Because softmax is translation-invariant, the result is mathematically identical to the unshifted version.
```python
import numpy as np
import warnings

def softmax_naive(z: np.ndarray) -> np.ndarray:
    """Naive softmax - numerically UNSTABLE."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_stable(z: np.ndarray) -> np.ndarray:
    """Stable softmax - subtract max before exp."""
    z_shifted = z - z.max()
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum()

# Test with moderate values
z_moderate = np.array([1.0, 2.0, 3.0])
print("Moderate values:")
print(f"  Naive:  {softmax_naive(z_moderate)}")
print(f"  Stable: {softmax_stable(z_moderate)}")

# Test with large values (np.exp overflows float64 for arguments above ~709)
z_large = np.array([600.0, 700.0, 800.0])
print("\nLarge values:")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    naive_result = softmax_naive(z_large)
    print(f"  Naive:  {naive_result}")  # [0. 0. nan] - BROKEN
print(f"  Stable: {softmax_stable(z_large)}")  # Correct

# Test with very large values (every exponential overflows float64)
z_huge = np.array([1000.0, 1001.0, 1002.0])
print("\nVery large values (would overflow float64):")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    naive_result = softmax_naive(z_huge)
    print(f"  Naive:  {naive_result}")  # [nan nan nan]
print(f"  Stable: {softmax_stable(z_huge)}")  # Still correct!

# Demonstration of why this works
print("\nWhy stable version works:")
print(f"  z_huge: {z_huge}")
print(f"  z_huge - max(z_huge): {z_huge - z_huge.max()}")  # [-2, -1, 0]
print(f"  exp of shifted: {np.exp(z_huge - z_huge.max())}")
```

Log-Sum-Exp for Log-Probabilities:
When working with log-probabilities (common in practice), we need log-softmax:
$$\log\text{softmax}(z)_i = z_i - \log\sum_j e^{z_j}$$
The log-sum-exp (LSE) requires special handling:
$$\text{LSE}(z) = \max(z) + \log\sum_j e^{z_j - \max(z)}$$
```python
import numpy as np

def logsumexp(z: np.ndarray) -> float:
    """Numerically stable log-sum-exp."""
    z_max = z.max()
    return z_max + np.log(np.exp(z - z_max).sum())

def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax."""
    return z - logsumexp(z)

# Example
z = np.array([1000.0, 1001.0, 1002.0])

# Naive log-softmax would fail:
# log_prob = z - np.log(np.exp(z).sum())  # np.exp(z) overflows to inf, collapsing the log-probs

# Stable version works
log_prob = log_softmax(z)
print(f"Log probabilities: {log_prob}")
print(f"Probabilities (exp): {np.exp(log_prob)}")
print(f"Sum: {np.exp(log_prob).sum()}")  # Should be ~1
```

Deep learning frameworks (PyTorch, TensorFlow) implement numerically stable softmax internally. However, if you ever implement attention from scratch (e.g., in CUDA kernels), numerical stability is critical. Always use the subtract-max trick.
For training transformers via backpropagation, we need the gradient of softmax. The softmax Jacobian has a beautiful structure:
$$\frac{\partial\, \text{softmax}(z)_i}{\partial z_j} = \text{softmax}(z)_i \left( \delta_{ij} - \text{softmax}(z)_j \right)$$
Where $\delta_{ij}$ is the Kronecker delta (1 if $i=j$, else 0).
Intuition:

- Diagonal entries $s_i(1 - s_i)$ are positive: increasing a score increases its own weight.
- Off-diagonal entries $-s_i s_j$ are negative: increasing one score necessarily decreases every other weight.
- Each row of the Jacobian sums to zero, reflecting the constraint that the outputs always sum to 1.
```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Stable softmax."""
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def softmax_jacobian(z: np.ndarray) -> np.ndarray:
    """
    Compute the Jacobian matrix of softmax.

    Returns:
        J[i,j] = d(softmax(z)_i) / d(z_j)
    """
    s = softmax(z)
    # J[i,j] = s[i] * (delta[i,j] - s[j])
    #        = s[i] * delta[i,j] - s[i] * s[j]
    #        = diag(s) - outer(s, s)
    jacobian = np.diag(s) - np.outer(s, s)
    return jacobian

def softmax_backward(z: np.ndarray, grad_output: np.ndarray) -> np.ndarray:
    """
    Efficient backward pass without computing the full Jacobian.

    Given the gradient w.r.t. output s, compute the gradient w.r.t. input z.
    """
    s = softmax(z)
    # grad_z = s * (grad_output - sum(grad_output * s))
    sum_term = np.sum(grad_output * s)
    grad_z = s * (grad_output - sum_term)
    return grad_z

# Demonstration
z = np.array([1.0, 2.0, 3.0])
s = softmax(z)
J = softmax_jacobian(z)

print("Softmax Jacobian Analysis")
print("=" * 50)
print(f"Input z: {z}")
print(f"Softmax s: {np.round(s, 4)}")
print(f"\nJacobian matrix:")
print(np.round(J, 4))

# Verify Jacobian properties
print(f"\nJacobian properties:")
print(f"  Row sums (should be 0): {J.sum(axis=1)}")  # Sum to 0
print(f"  Symmetric: {np.allclose(J, J.T)}")  # Jacobian is symmetric

# Verify backward pass matches Jacobian product
grad_output = np.array([1.0, 0.0, 0.0])  # Gradient from loss
grad_z_via_jacobian = J.T @ grad_output
grad_z_efficient = softmax_backward(z, grad_output)
print(f"\nBackward pass verification:")
print(f"  Via Jacobian: {grad_z_via_jacobian}")
print(f"  Efficient:    {grad_z_efficient}")
print(f"  Match: {np.allclose(grad_z_via_jacobian, grad_z_efficient)}")
```

Gradient Implications for Attention:
The softmax gradient structure has important implications for attention-based learning:
Gradient Competition: When one attention weight increases, others must decrease (zero-sum). This creates competition between positions for attention.
Saturation: When softmax output is near 0 or 1, gradients become small. Very confident attention patterns are hard to change. This can cause:

- Vanishing gradients through the attention weights, which slows learning.
- Attention patterns that lock in early in training and become difficult to revise.

A short numeric sketch of this effect follows the list below.
Credit Assignment: The gradient flows back to all positions proportionally to their attention weight. Highly-attended positions receive more gradient signal.
Scaling Matters: The magnitude of scores before softmax affects gradient magnitude. This is another reason why scaled dot-product attention is important.
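To make the saturation point concrete, here is a minimal sketch (reusing the stable softmax and Jacobian helpers defined above; the specific score vectors are illustrative) comparing the largest Jacobian entry for a moderately spread distribution and for a nearly one-hot one.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def softmax_jacobian(z: np.ndarray) -> np.ndarray:
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z_moderate = np.array([1.0, 2.0, 3.0])    # spread-out attention
z_saturated = np.array([1.0, 2.0, 30.0])  # nearly one-hot attention

for name, z in [("moderate", z_moderate), ("saturated", z_saturated)]:
    J = softmax_jacobian(z)
    print(f"{name:>9}: max |J| = {np.abs(J).max():.2e}, "
          f"peak weight = {softmax(z).max():.4f}")
# moderate:  max |J| on the order of 1e-1  — gradients flow
# saturated: max |J| on the order of 1e-12 — gradients essentially vanish
```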
Computing the full n×n Jacobian is O(n²). The efficient backward formula grad_z = s * (grad_output - sum(grad_output * s)) is O(n), which matters because in attention this backward pass must be applied to every row of the n×n attention matrix.
Now let's trace through the complete pipeline from input to attention weights:
Step-by-Step Pipeline:
```python
import numpy as np
from scipy.special import softmax

def attention_weights_pipeline(
    X: np.ndarray,
    W_q: np.ndarray,
    W_k: np.ndarray,
    mask: np.ndarray = None,
    return_intermediates: bool = False,
) -> dict:
    """
    Complete pipeline for computing attention weights with all intermediates.

    Args:
        X: Input sequence, shape (n, d_model)
        W_q: Query projection, shape (d_model, d_k)
        W_k: Key projection, shape (d_model, d_k)
        mask: Optional mask, shape (n, n), with -inf for masked positions
        return_intermediates: Whether to return all intermediate values

    Returns:
        dict with 'weights' and optionally intermediate values
    """
    d_k = W_q.shape[1]

    # Step 1: Project to Q and K
    Q = X @ W_q  # (n, d_k)
    K = X @ W_k  # (n, d_k)

    # Step 2: Compute raw scores
    scores_raw = Q @ K.T  # (n, n)

    # Step 3: Scale by sqrt(d_k)
    scale = np.sqrt(d_k)
    scores_scaled = scores_raw / scale

    # Step 4: Apply mask (if provided)
    if mask is not None:
        scores_masked = scores_scaled + mask
    else:
        scores_masked = scores_scaled

    # Step 5: Softmax over keys (axis=-1)
    weights = softmax(scores_masked, axis=-1)

    if return_intermediates:
        return {
            'Q': Q,
            'K': K,
            'scores_raw': scores_raw,
            'scale': scale,
            'scores_scaled': scores_scaled,
            'scores_masked': scores_masked,
            'weights': weights,
        }
    return {'weights': weights}

# Example: Process a small sequence
np.random.seed(42)
n, d_model, d_k = 4, 16, 8

X = np.random.randn(n, d_model)
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1

# Compute with intermediates
result = attention_weights_pipeline(X, W_q, W_k, return_intermediates=True)

print("Attention Weight Computation Pipeline")
print("=" * 50)
print(f"\nStep 1 - Q shape: {result['Q'].shape}")
print(f"Step 1 - K shape: {result['K'].shape}")
print(f"\nStep 2 - Raw scores ({result['scores_raw'].shape}):")
print(np.round(result['scores_raw'], 3))
print(f"\nStep 3 - Scale factor: {result['scale']:.3f}")
print(f"Step 3 - Scaled scores:")
print(np.round(result['scores_scaled'], 3))
print(f"\nStep 5 - Attention weights:")
print(np.round(result['weights'], 3))
print(f"\nWeight row sums: {result['weights'].sum(axis=1)}")
```

Understanding Each Step:
| Step | Operation | Effect | Shape |
|---|---|---|---|
| Project | Linear transformation | Create query/key representations | (n, d_k) |
| Dot Product | Score computation | Measure query-key compatibility | (n, n) |
| Scale | Divide by √d_k | Control softmax temperature | (n, n) |
| Mask | Add −∞ to invalid positions | Prevent attending to certain positions | (n, n) |
| Softmax | Normalize rows | Convert to probability distribution | (n, n) |
Each ROW of the attention matrix corresponds to one query position. Row i contains the attention weights from position i to all key positions. Thus, softmax is applied row-wise (axis=-1), and each row sums to 1.
The attention weight matrix reveals which positions the model considers important for each query. Analyzing these patterns provides insight into model behavior.
Common Attention Patterns:

- Peaked vs. diffuse: a few positions dominate the weights vs. attention spread broadly (measured by entropy and the peak weight).
- Self-attention: strong weight on the diagonal, i.e., a position attending to itself.
- Local vs. long-range: weight concentrated near the query position vs. on distant positions.
- Sparse: only a handful of positions receive significant weight (the "effective positions" count below).
```python
import numpy as np
from scipy.special import softmax

def analyze_attention_pattern(weights: np.ndarray, tokens: list = None) -> dict:
    """
    Analyze attention weight matrix for common patterns.
    """
    n = weights.shape[0]
    analysis = {}

    # Entropy per query (high = diffuse, low = peaked)
    # H = -sum(p * log(p))
    log_weights = np.log(weights + 1e-10)
    entropy = -np.sum(weights * log_weights, axis=-1)
    max_entropy = np.log(n)  # Maximum entropy for uniform distribution
    analysis['entropy'] = entropy
    analysis['normalized_entropy'] = entropy / max_entropy

    # Peakedness: max attention weight per row
    analysis['peak_weights'] = weights.max(axis=-1)

    # Self-attention strength: diagonal elements
    diagonal = np.diag(weights)
    analysis['self_attention'] = diagonal

    # Locality: average distance to attended positions
    positions = np.arange(n)
    expected_position = np.sum(weights * positions[None, :], axis=-1)
    locality = np.abs(expected_position - positions)
    analysis['locality_shift'] = locality

    # Sparsity: how many positions get significant attention
    significant_threshold = 1.0 / (2 * n)  # Weights above half the uniform value 1/n count as significant
    significant_count = (weights > significant_threshold).sum(axis=-1)
    analysis['effective_positions'] = significant_count

    return analysis

def visualize_pattern_stats(analysis: dict, tokens: list = None):
    """Print pattern analysis statistics."""
    n = len(analysis['entropy'])

    print("Attention Pattern Analysis")
    print("=" * 60)
    print(f"\n{'Position':<10} {'Token':<10} {'Entropy':<10} {'Peak':<10} "
          f"{'Self-Attn':<10} {'Eff.Pos':<10}")
    print("-" * 60)

    for i in range(n):
        token = tokens[i] if tokens else f"pos_{i}"
        print(f"{i:<10} {token:<10} "
              f"{analysis['normalized_entropy'][i]:<10.3f} "
              f"{analysis['peak_weights'][i]:<10.3f} "
              f"{analysis['self_attention'][i]:<10.3f} "
              f"{analysis['effective_positions'][i]:<10}")

    print(f"\nSummary:")
    print(f"  Average normalized entropy: {analysis['normalized_entropy'].mean():.3f}")
    print(f"  Average peak weight: {analysis['peak_weights'].mean():.3f}")
    print(f"  Average self-attention: {analysis['self_attention'].mean():.3f}")

# Example: Simulate different patterns
np.random.seed(42)
n = 6
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Simulate learned attention weights (peaked pattern)
raw_scores = np.random.randn(n, n)
# Make "sat" attend strongly to "cat" and "mat"
raw_scores[2, 1] = 3.0  # sat → cat
raw_scores[2, 5] = 2.5  # sat → mat

weights = softmax(raw_scores, axis=-1)
analysis = analyze_attention_pattern(weights, tokens)
visualize_pattern_stats(analysis, tokens)

print(f"\nAttention weights for 'sat' (position 2):")
for j, (token, weight) in enumerate(zip(tokens, weights[2])):
    bar = "█" * int(weight * 30)
    print(f"  → {token}: {weight:.3f} {bar}")
```

Entropy as a Diagnostic:
Attention entropy measures how spread out the attention is: $H_i = -\sum_j \alpha_{ij} \log \alpha_{ij}$. It is maximal ($\log n$) for uniform attention and zero when all weight falls on a single position.

During training, attention typically:

- Starts close to uniform (high entropy), since randomly initialized projections produce small, undifferentiated scores.
- Sharpens as training progresses (entropy drops) as useful query–key patterns emerge.

Analyzing entropy across layers often reveals:

- Heads with very different roles—some stay diffuse (high entropy), others become sharply peaked.
- Degenerate heads whose patterns collapse to near-uniform or to a single position, a useful debugging signal.
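As a quick numerical illustration (a sketch using the same entropy definition as the analysis code above; the three example rows are hand-picked, not learned weights), compare the normalized entropy of a uniform, a moderately peaked, and a near-one-hot attention row:

```python
import numpy as np

def normalized_entropy(row: np.ndarray) -> float:
    """Entropy of one attention row, normalized so uniform = 1.0."""
    h = -np.sum(row * np.log(row + 1e-10))
    return h / np.log(len(row))

n = 8
uniform = np.full(n, 1.0 / n)                                       # maximally diffuse
peaked = np.array([0.5, 0.2, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02])    # a few positions dominate
near_onehot = np.array([0.97] + [0.03 / (n - 1)] * (n - 1))         # almost all weight on one key

for name, row in [("uniform", uniform), ("peaked", peaked), ("near one-hot", near_onehot)]:
    print(f"{name:>13}: normalized entropy = {normalized_entropy(row):.3f}")
# uniform ≈ 1.0, peaked in between (~0.7), near one-hot close to 0 (~0.1)
```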
When transformers fail on specific examples, visualizing attention patterns can reveal issues: Is the model attending to relevant tokens? Are attention patterns degenerate (all uniform or all to one position)? These diagnostics guide debugging and model improvement.
In practice, attention weights are computed for batches of sequences. This requires careful handling of dimensions and efficient tensor operations.
Batch Dimensions:
For a batch of $B$ sequences, each of length $n$:

- $Q$ and $K$ have shape $(B, n, d_k)$.
- The score and attention-weight matrices have shape $(B, n, n)$.
- Softmax is still applied along the last (key) axis, independently for every batch element and query.
The batch dimension is independent—each sequence has its own attention matrix.
```python
import numpy as np
from scipy.special import softmax

def batched_attention_weights(
    Q: np.ndarray,            # (B, n, d_k)
    K: np.ndarray,            # (B, n, d_k)
    mask: np.ndarray = None,  # (B, n, n) or (1, n, n) or (n, n)
) -> np.ndarray:
    """
    Compute attention weights for a batch of sequences.

    Returns:
        A: Attention weights, shape (B, n, n)
    """
    d_k = Q.shape[-1]

    # Batched matrix multiplication: (B, n, d_k) @ (B, d_k, n) → (B, n, n)
    # Note: K.transpose swaps last two dims, keeping batch dim
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)

    # Apply mask if provided (broadcasting handles shape differences)
    if mask is not None:
        scores = scores + mask

    # Softmax along key dimension (last axis)
    weights = softmax(scores, axis=-1)

    return weights

# Example: Batch of 3 sequences, length 5, d_k=8
B, n, d_k = 3, 5, 8

Q = np.random.randn(B, n, d_k)
K = np.random.randn(B, n, d_k)

# Compute batched attention
weights = batched_attention_weights(Q, K)

print(f"Batched Attention Computation")
print(f"=" * 50)
print(f"Q shape: {Q.shape}")
print(f"K shape: {K.shape}")
print(f"Weights shape: {weights.shape}")

# Verify each batch element
print(f"\nVerifying batch independence:")
for b in range(B):
    row_sums = weights[b].sum(axis=-1)
    print(f"  Batch {b}: Row sums = {np.allclose(row_sums, 1.0)}, "
          f"All positive = {(weights[b] > 0).all()}")

# Broadcasting example: Single mask applied to all batches
causal_mask = np.triu(np.ones((n, n)), k=1) * -1e9  # (n, n)
weights_masked = batched_attention_weights(Q, K, mask=causal_mask)

print(f"\nWith causal mask (same for all batches):")
print(f"  Mask shape: {causal_mask.shape}")
print(f"  Broadcast to: {weights_masked.shape}")
print(f"  Sample masked weights (batch 0):")
print(np.round(weights_masked[0], 3))
```

Memory Considerations:
The attention matrix is $O(n^2)$ per sequence, so for a batch the weights alone occupy $B \times h \times n \times n$ entries, where $h$ is the number of attention heads—and this must be stored per layer when activations are kept for backpropagation.
For long sequences (n = 4096 or more), this becomes the memory bottleneck. Various efficient attention mechanisms address this (covered in Module 6: Transformer Variants).
Gradient Checkpointing:
During backpropagation, attention weights must be stored or recomputed. Gradient checkpointing trades compute for memory by recomputing activations during backward pass instead of storing them.
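A rough sketch of how this looks in practice, assuming PyTorch is available (`torch.utils.checkpoint.checkpoint` is PyTorch's built-in utility for this; the `use_reentrant=False` flag assumes a reasonably recent PyTorch version): the attention block's intermediates, including the (B, n, n) weight matrix, are recomputed during the backward pass rather than stored.

```python
import torch
from torch.utils.checkpoint import checkpoint

def attention_block(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = torch.softmax(scores, dim=-1)  # (B, n, n) — recomputed, not stored
    return weights @ v

B, n, d_k = 2, 1024, 64
q = torch.randn(B, n, d_k, requires_grad=True)
k = torch.randn(B, n, d_k, requires_grad=True)
v = torch.randn(B, n, d_k, requires_grad=True)

# Checkpointed forward: activations inside attention_block are dropped after
# the forward pass and recomputed during backward, trading compute for memory.
out = checkpoint(attention_block, q, k, v, use_reentrant=False)
out.sum().backward()
```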
For sequences of length 4096, each attention matrix has 16M entries (4096²). With float16 precision, that's 32MB per layer per head. For a 12-layer, 12-head model, storing all attention matrices requires ~4.6GB just for attention weights—for a single sequence! This motivates efficient attention variants.
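The arithmetic behind those numbers is easy to reproduce. A minimal sketch, assuming float16 (2 bytes per entry) and a single sequence:

```python
n_layers, n_heads, seq_len = 12, 12, 4096
bytes_per_entry = 2  # float16

entries_per_matrix = seq_len * seq_len          # 4096² ≈ 16.8M entries
bytes_per_matrix = entries_per_matrix * bytes_per_entry
total_bytes = bytes_per_matrix * n_layers * n_heads

print(f"Per attention matrix:  {bytes_per_matrix / 2**20:.0f} MiB")  # ~32 MiB
print(f"All layers and heads:  {total_bytes / 2**30:.1f} GiB")       # ~4.5 GiB
```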
Implementing attention weight computation correctly requires avoiding several common mistakes:
- Wrong softmax axis: calling softmax(scores, axis=-2) normalizes each column to 1, which is meaningless—each query's weights over the keys (each row) must sum to 1.
- Missing scale factor: skipping the division by √d_k produces overly peaked weights and poorer gradients (the next page covers this in detail).
- Weak mask values: using -1e4 instead of -1e9 or -inf may still allow gradient flow to masked positions, especially with float16. Use -inf or very large negative values.
- Numerical instability: exponentiating large scores without subtracting the maximum produces inf or NaN.
```python
import numpy as np
from scipy.special import softmax

def demonstrate_pitfalls():
    """Demonstrate common attention implementation mistakes."""
    np.random.seed(42)
    n, d_k = 4, 64
    Q = np.random.randn(n, d_k)
    K = np.random.randn(n, d_k)
    scores = Q @ K.T

    print("Demonstration of Attention Pitfalls")
    print("=" * 60)

    # Pitfall 1: Wrong softmax axis
    wrong_axis = softmax(scores, axis=-2)    # Normalizes columns
    correct_axis = softmax(scores, axis=-1)  # Normalizes rows

    print("\n1. WRONG SOFTMAX AXIS")
    print(f"   Wrong (axis=-2) row sums: {wrong_axis.sum(axis=-1)}")      # Not 1!
    print(f"   Wrong (axis=-2) col sums: {wrong_axis.sum(axis=-2)}")      # 1, but wrong!
    print(f"   Correct (axis=-1) row sums: {correct_axis.sum(axis=-1)}")  # 1 ✓

    # Pitfall 2: Missing scale factor
    print("\n2. MISSING SCALE FACTOR")
    unscaled = softmax(scores, axis=-1)
    scaled = softmax(scores / np.sqrt(d_k), axis=-1)

    unscaled_entropy = -np.sum(unscaled * np.log(unscaled + 1e-10), axis=-1).mean()
    scaled_entropy = -np.sum(scaled * np.log(scaled + 1e-10), axis=-1).mean()

    print(f"   Unscaled entropy: {unscaled_entropy:.4f}")
    print(f"   Scaled entropy: {scaled_entropy:.4f}")
    print(f"   Scaling spreads attention (higher entropy), better gradients")

    # Pitfall 3: Weak mask values
    print("\n3. WEAK MASK VALUES")
    causal_mask = np.triu(np.ones((n, n)), k=1)

    # Weak mask
    weak_masked = softmax((scores - causal_mask * 1e4) / np.sqrt(d_k), axis=-1)
    # Strong mask
    strong_masked = softmax((scores - causal_mask * 1e9) / np.sqrt(d_k), axis=-1)

    print(f"   Weak mask (-1e4), position [0,3] (should be 0): {weak_masked[0,3]:.2e}")
    print(f"   Strong mask (-1e9), position [0,3]: {strong_masked[0,3]:.2e}")

    # Pitfall 4: Numerical overflow without max-subtraction
    print("\n4. NUMERICAL STABILITY")
    # Simulate large scores
    large_scores = scores * 100  # Artificially large
    try:
        # Naive: might overflow
        naive_exp = np.exp(large_scores)
        naive_result = naive_exp / naive_exp.sum(axis=-1, keepdims=True)
        print(f"   Naive on large scores: {naive_result[0, :3]}")
    except:
        print("   Naive failed with overflow!")

    # Stable version
    stable_result = softmax(large_scores, axis=-1)
    print(f"   Stable on large scores: {stable_result[0, :3]}")

demonstrate_pitfalls()
```

Always verify: (1) Row sums equal 1, (2) All weights are positive, (3) Masked positions have weight ≈0, (4) Argmax is preserved from scores. These simple checks catch most implementation bugs.
We've thoroughly examined how raw attention scores become normalized attention weights—the probability distributions that determine information flow in transformers.
What's Next:
Now that we understand how attention weights are computed, we'll examine the scaling factor in detail. The next page explores scaled dot-product attention—why the √d_k divisor is crucial for training stability and how it relates to softmax behavior.
You now understand the complete pipeline from attention scores to attention weights, including softmax mechanics, numerical considerations, gradient properties, and pattern analysis. This knowledge is essential for implementing, debugging, and improving attention-based models.