In the previous module, we explored the fundamental attention mechanism—how a decoder query can attend to encoder outputs to dynamically weight relevant information. Self-attention takes this powerful concept and applies it reflexively: a sequence attends to itself.
This seemingly simple modification—letting each position in a sequence compute weighted combinations of all other positions—is the core innovation that powers transformers. Unlike RNNs, which process sequences step-by-step and struggle with long-range dependencies, self-attention enables direct, parallel connections between any two positions, regardless of distance.
The implications are profound: self-attention computes every position in parallel, connects any two positions directly regardless of distance, and yields representations in which each token is contextualized by the entire sequence.
Understanding self-attention deeply—its formulation, computation, and properties—is essential for mastering modern deep learning architectures.
By the end of this page, you will understand the complete mathematical formulation of self-attention, how queries, keys, and values are derived from the same input sequence, and why this self-referential structure enables richer representations than traditional sequential processing.
To understand self-attention, we must first recall the standard attention mechanism. In encoder-decoder attention, we have three distinct components:
Standard Attention Components:

- Query (Q): derived from the decoder's hidden states
- Key (K): derived from the encoder outputs
- Value (V): derived from the encoder outputs

The key insight is that Q, K, and V come from different sources: Q from the decoder, K and V from the encoder.
The Self-Attention Transformation:
Self-attention eliminates this asymmetry. Given an input sequence, we derive all three components from the same source:
```python
# Standard Encoder-Decoder Attention
# Q from decoder, K and V from encoder
Q = decoder_hidden_states @ W_q   # (T_dec, d_model) @ (d_model, d_k)
K = encoder_outputs @ W_k         # (T_enc, d_model) @ (d_model, d_k)
V = encoder_outputs @ W_v         # (T_enc, d_model) @ (d_model, d_v)
# Attention: Q attends to K/V from different sequence

# Self-Attention
# Q, K, and V ALL from the SAME input sequence
X = input_sequence   # (T, d_model) - same source for all three
Q = X @ W_q          # (T, d_model) @ (d_model, d_k) → (T, d_k)
K = X @ W_k          # (T, d_model) @ (d_model, d_k) → (T, d_k)
V = X @ W_v          # (T, d_model) @ (d_model, d_v) → (T, d_v)
# Attention: sequence attends to ITSELF
```

Why is this powerful?
When a sequence attends to itself, each position can gather information directly from any other position, acting simultaneously as a searcher and a source of information.

Consider translating "The cat sat on the mat because it was soft": to handle the pronoun "it" correctly, the model must decide whether it refers to "the mat" or "the cat", and attention lets "it" look directly at both candidates and weight "mat" more heavily.
This coreference resolution happens naturally through the attention weights, without explicit linguistic rules.
Self-attention is not just attention applied within a sequence—it's a fundamentally different computational paradigm. Each position simultaneously acts as a query (asking questions), a key (being searchable), and a value (providing information). This triple role enables rich, contextual representations.
The Query-Key-Value (QKV) framework is the mathematical backbone of self-attention. Understanding each component's role and how they interact is crucial for deep comprehension.
Formal Definition:
Given an input sequence $X \in \mathbb{R}^{n \times d_{model}}$ where $n$ is sequence length and $d_{model}$ is the embedding dimension, we compute:
```python
import numpy as np

def compute_qkv(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray,
                W_v: np.ndarray) -> tuple:
    """
    Compute Query, Key, Value matrices from input.

    Args:
        X: Input sequence, shape (n, d_model)
        W_q: Query weight matrix, shape (d_model, d_k)
        W_k: Key weight matrix, shape (d_model, d_k)
        W_v: Value weight matrix, shape (d_model, d_v)

    Returns:
        Q: Query matrix, shape (n, d_k)
        K: Key matrix, shape (n, d_k)
        V: Value matrix, shape (n, d_v)
    """
    Q = X @ W_q   # Each row i: query for position i
    K = X @ W_k   # Each row j: key for position j
    V = X @ W_v   # Each row j: value for position j
    return Q, K, V

# Example dimensions
n = 10         # Sequence length
d_model = 512  # Model dimension
d_k = 64       # Key/Query dimension
d_v = 64       # Value dimension

# Initialize
X = np.random.randn(n, d_model)
W_q = np.random.randn(d_model, d_k) * 0.02
W_k = np.random.randn(d_model, d_k) * 0.02
W_v = np.random.randn(d_model, d_v) * 0.02

Q, K, V = compute_qkv(X, W_q, W_k, W_v)
print(f"Q shape: {Q.shape}")  # (10, 64)
print(f"K shape: {K.shape}")  # (10, 64)
print(f"V shape: {V.shape}")  # (10, 64)
```

Semantic Interpretation of QKV:
Think of the QKV framework as an information retrieval system:
| Component | Analogy | Mathematical Role | Learned Representation |
|---|---|---|---|
| Query (Q) | Search query | What position i is looking for | Encodes 'what information do I need?' |
| Key (K) | Index/tag | How position j describes itself | Encodes 'what information do I have?' |
| Value (V) | Content | What position j contributes | Encodes 'what to return if matched' |
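To make the retrieval analogy concrete, here is a minimal sketch (with arbitrary toy dimensions, not taken from the examples above) of a single position performing its "lookup": its query is scored against every key, the softmax turns those scores into retrieval weights, and the result is a weighted blend of values.

```python
import numpy as np
from scipy.special import softmax

# Toy dimensions, chosen only for illustration
n, d_k, d_v = 6, 4, 4
Q = np.random.randn(n, d_k)   # one query per position
K = np.random.randn(n, d_k)   # one key per position
V = np.random.randn(n, d_v)   # one value per position

i = 2                                       # the position acting as the "searcher"
scores_i = Q[i] @ K.T / np.sqrt(d_k)        # match between query i and every key, shape (n,)
weights_i = softmax(scores_i)               # retrieval weights over positions, sum to 1
output_i = weights_i @ V                    # blended "content" returned to position i

print(weights_i.round(3), weights_i.sum())  # distribution over the n positions
print(output_i.shape)                       # (d_v,) = (4,)
```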
Why Separate Projections?
A natural question: why not use $X$ directly for Q, K, and V? The learned projection matrices ($W_q$, $W_k$, $W_v$) provide three critical capabilities:

- Role specialization: a token can represent itself differently when asking (query), when being searched (key), and when contributing content (value).
- Dimensionality control: projecting from $d_{model}$ down to $d_k$ and $d_v$ keeps the attention computation manageable.
- Learned relevance: the model learns which aspects of the representations should determine attention, rather than relying on raw embedding similarity.
Example: Role Specialization
Consider the word "bank" in "river bank" vs. "money bank": as a query, "bank" needs to look for disambiguating context such as "river" or "money"; as a key, it needs to advertise what kind of information it can offer to other tokens. Without separate projections, a token asking a question would have to use the same representation as when being searched, limiting expressiveness. A small numerical sketch of this limitation follows.
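The sketch below (assuming we simply reuse $X$ as both query and key) shows that without projections the score matrix $XX^T$ is forced to be symmetric, and each position's strongest match tends to be itself, because the diagonal entries are squared norms.

```python
import numpy as np

np.random.seed(0)
n, d_model, d_k = 6, 32, 8
X = np.random.randn(n, d_model)

# Without projections: scores = X X^T is symmetric, diagonal holds squared norms
scores_no_proj = X @ X.T
print(np.allclose(scores_no_proj, scores_no_proj.T))          # True: score(i, j) always equals score(j, i)
print((scores_no_proj.argmax(axis=1) == np.arange(n)).all())  # typically True: each row peaks on itself

# With learned projections, the score matrix need not be symmetric
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1
scores_proj = (X @ W_q) @ (X @ W_k).T
print(np.allclose(scores_proj, scores_proj.T))                # False: asymmetric relationships become possible
```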
Typically d_k = d_v = d_model / h where h is the number of attention heads (covered in the next module). For a 512-dimensional model with 8 heads, d_k = d_v = 64. This keeps computational cost manageable while enabling multi-head attention.
Now we assemble the complete self-attention operation. Given our Q, K, V matrices, the self-attention output is computed as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Let's break this down step-by-step:
```python
import numpy as np
from scipy.special import softmax

def self_attention(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray,
                   W_v: np.ndarray) -> tuple:
    """
    Complete self-attention computation.

    Args:
        X: Input sequence, shape (n, d_model)
        W_q, W_k, W_v: Projection matrices

    Returns:
        output: Contextualized representations, shape (n, d_v)
        attention_weights: Attention matrix, shape (n, n)
    """
    # Step 1: Compute Q, K, V
    Q = X @ W_q   # (n, d_k)
    K = X @ W_k   # (n, d_k)
    V = X @ W_v   # (n, d_v)

    d_k = Q.shape[1]

    # Step 2: Compute attention scores
    scores = Q @ K.T   # (n, n) - each entry is a dot product

    # Step 3: Scale by sqrt(d_k)
    scores_scaled = scores / np.sqrt(d_k)

    # Step 4: Softmax over keys (last axis)
    attention_weights = softmax(scores_scaled, axis=-1)   # (n, n)

    # Step 5: Weighted aggregation of values
    output = attention_weights @ V   # (n, d_v)

    return output, attention_weights

# Demonstration
n, d_model, d_k, d_v = 5, 32, 8, 8

X = np.random.randn(n, d_model)
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1
W_v = np.random.randn(d_model, d_v) * 0.1

output, attn_weights = self_attention(X, W_q, W_k, W_v)

print(f"Input shape: {X.shape}")                          # (5, 32)
print(f"Output shape: {output.shape}")                    # (5, 8)
print(f"Attention weights shape: {attn_weights.shape}")   # (5, 5)
print(f"Row sums (should be 1): {attn_weights.sum(axis=1)}")
```

Dimensional Analysis:
Understanding how shapes propagate through self-attention is crucial:
| Operation | Input Shape | Weight Shape | Output Shape |
|---|---|---|---|
| Q = XW_q | (n, d_model) | (d_model, d_k) | (n, d_k) |
| K = XW_k | (n, d_model) | (d_model, d_k) | (n, d_k) |
| V = XW_v | (n, d_model) | (d_model, d_v) | (n, d_v) |
| S = QK^T | (n, d_k) × (d_k, n) | — | (n, n) |
| A = softmax(S/√d_k) | (n, n) | — | (n, n) |
| O = AV | (n, n) × (n, d_v) | — | (n, d_v) |
Notice that the attention matrix is $n \times n$, so its cost grows quadratically with sequence length; the output has the same number of positions as the input; and $d_{model}$ enters only through the projections, never the attention matrix itself.
The softmax must be applied row-wise (axis=-1), not globally. Each row corresponds to one query position, and its attention weights must sum to 1. A global softmax would incorrectly normalize across all n² entries, breaking the interpretation as attention distribution.
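A small sketch of the difference, using an arbitrary random score matrix purely for illustration:

```python
import numpy as np
from scipy.special import softmax

np.random.seed(1)
scores = np.random.randn(4, 4)   # toy (n, n) score matrix

row_wise = softmax(scores, axis=-1)     # correct: one distribution per query position
global_sm = softmax(scores, axis=None)  # wrong: one distribution over all n*n entries

print(row_wise.sum(axis=1))    # [1. 1. 1. 1.] — each query's weights sum to 1
print(global_sm.sum())         # 1.0 — a single distribution over the whole matrix
print(global_sm.sum(axis=1))   # rows no longer sum to 1
```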
The attention score matrix $S = QK^T$ is the heart of self-attention. Understanding what these scores represent provides deep insight into the mechanism.
Dot Product as Similarity:
The score $S_{ij} = Q_i \cdot K_j$ measures the alignment between what position $i$ is looking for (query) and what position $j$ offers (key). This dot product captures both directional alignment (how similarly the two vectors are oriented) and magnitude (strongly expressed queries and keys produce larger scores).
```python
import numpy as np

def analyze_attention_patterns(sentence: list, Q: np.ndarray,
                               K: np.ndarray) -> np.ndarray:
    """
    Analyze attention score patterns for a sentence.
    """
    # Raw scores
    scores = Q @ K.T

    # Scaled scores
    d_k = Q.shape[1]
    scores_scaled = scores / np.sqrt(d_k)

    # Print pairwise interpretations
    print("Attention Score Analysis:")
    print("-" * 50)
    for i, word_i in enumerate(sentence):
        # Find top attended positions for this word
        top_indices = np.argsort(scores_scaled[i])[::-1][:3]
        print(f"'{word_i}' attends most to: ", end="")
        for j in top_indices:
            print(f"'{sentence[j]}' ({scores_scaled[i, j]:.2f}), ", end="")
        print()

    return scores_scaled

# Example: Coreference resolution
sentence = ["The", "cat", "sat", "on", "the", "mat", "because", "it", "was", "soft"]
n = len(sentence)
d_k = 8

# Simulate learned Q, K that capture coreference
np.random.seed(42)
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)

# Manually boost "it" query toward "mat" and "cat" keys
# to simulate learned coreference patterns
Q[7] = K[5] * 0.8 + np.random.randn(d_k) * 0.2   # "it" → "mat"
Q[7] += K[1] * 0.3                               # Also some attention to "cat"

scores = analyze_attention_patterns(sentence, Q, K)
```

Geometric Interpretation:
Self-attention can be viewed geometrically in the $d_k$-dimensional space: queries and keys are vectors in this shared space, a query attends most strongly to keys that point in a similar direction (large dot product), and the softmax turns these alignments into a probability distribution over positions.
What Do Attention Scores Capture?
Through training, attention scores learn to reflect various linguistic and structural phenomena:
| Phenomenon | Example | Attention Pattern |
|---|---|---|
| Coreference | 'it' → 'cat' in 'The cat... it...' | Pronoun attends strongly to antecedent noun |
| Subject-Verb | 'runs' → 'dog' in 'The dog runs' | Verb attends to its subject for agreement |
| Modifier-Head | 'red' → 'car' in 'the red car' | Adjective attends to the noun it modifies |
| Dependency | 'with' → 'telescope' in 'saw... with telescope' | Preposition attends to object, resolving attachment |
| Negation Scope | 'happy' → 'not' in 'not very happy' | Modifier attends to negation affecting it |
These patterns aren't hard-coded—they emerge from training. The model learns that attending to certain positions helps with downstream tasks like translation or generation. This makes attention a powerful, general-purpose mechanism that discovers task-relevant relationships automatically.
Understanding self-attention's advantages requires contrasting it with the recurrent approach it largely replaced.
The Recurrent Paradigm:
RNNs process sequences step-by-step, maintaining a hidden state that accumulates information:
$$h_t = f(h_{t-1}, x_t)$$
This creates a sequential bottleneck: to compute $h_t$, we must first compute $h_1, h_2, ..., h_{t-1}$.
The Self-Attention Paradigm:
Self-attention computes each position's output in parallel:
$$o_i = \sum_j \alpha_{ij} V_j$$
where $\alpha_{ij}$ depends only on the input, not on previous computations.
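This contrast shows up directly in code. The following toy sketch (hypothetical shapes and a simple tanh recurrence, not a full RNN implementation) highlights that the recurrent outputs require a Python loop over time steps, while all self-attention outputs fall out of a couple of matrix multiplications:

```python
import numpy as np
from scipy.special import softmax

n, d = 8, 16
X = np.random.randn(n, d)

# Recurrent: h_t depends on h_{t-1}, so positions must be processed in order
W_h, W_x = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1
h = np.zeros(d)
rnn_outputs = []
for t in range(n):                        # sequential bottleneck: cannot parallelize over t
    h = np.tanh(h @ W_h + X[t] @ W_x)
    rnn_outputs.append(h)

# Self-attention: every output position computed at once from the same input
W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)
attn_outputs = weights @ V                # all n rows produced in parallel

print(len(rnn_outputs), attn_outputs.shape)   # 8 (8, 16)
```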
Path Length and Long-Range Dependencies:
Consider deciding whether "soft" describes the cat or the mat in "The cat that sat on the mat was soft". In both architectures, the relevant words are far apart:
RNN Path:
cat → that → sat → on → the → mat → was → soft
(7 recurrent steps, gradient must flow through all)
Self-Attention Path:
soft ←→ mat (direct attention connection, 1 step)
soft ←→ cat (also direct, model learns to distinguish)
The self-attention path length is O(1) regardless of distance, making long-range dependencies as easy to learn as local ones.
```python
def rnn_path_length(pos_from: int, pos_to: int) -> int:
    """Path length in RNN for information to flow between positions."""
    return abs(pos_to - pos_from)

def self_attention_path_length(pos_from: int, pos_to: int) -> int:
    """Path length in self-attention for information flow."""
    return 1   # Always 1 - direct connection via attention

# Example: Long sequence
seq_length = 1000

# Information from position 0 to position 999
print(f"RNN path length: {rnn_path_length(0, 999)}")                        # 999
print(f"Self-attention path length: {self_attention_path_length(0, 999)}")  # 1

# For L stacked layers:
# - RNN: L layers, each with O(n) path → total O(L·n) max path
# - Self-attention: L layers, each O(1) path → total O(L) max path

L = 12    # Typical transformer depth
n = 512   # Typical sequence length

print(f"\nFor {L}-layer model, sequence length {n}:")
print(f"  RNN worst-case path: {L * n}")           # 6144
print(f"  Self-attention worst-case path: {L}")    # 12
```

Self-attention sacrifices the inductive bias of sequential processing (locality, recent-context priority) for the ability to learn arbitrary connections. This is appropriate for many tasks but means the model must learn even obvious local patterns from data. Positional encodings partially address this (covered later in this chapter).
The output of self-attention is a new sequence of the same length, where each position's representation has been contextualized by information from all other positions.
Output Interpretation:
For each position $i$, the output $O_i$ is:
$$O_i = \sum_{j=1}^{n} \alpha_{ij} V_j$$
This is a weighted average of all value vectors, where the weights $\alpha_{ij}$ are determined by attention scores. The result is a contextualized representation: each output row blends information from every position, in proportion to its relevance to position $i$.
```python
import numpy as np
from scipy.special import softmax

def analyze_output_contributions(sentence: list, V: np.ndarray,
                                 attention_weights: np.ndarray,
                                 position: int) -> None:
    """
    Analyze which positions contributed to a specific output position.
    """
    weights = attention_weights[position]
    output = weights @ V

    print(f"\nAnalyzing output for position {position}: '{sentence[position]}'")
    print("-" * 50)
    print(f"Output vector (shape {output.shape}): norm = {np.linalg.norm(output):.3f}")
    print("\nContributions by position:")

    # Sort by contribution weight
    sorted_indices = np.argsort(weights)[::-1]
    for j in sorted_indices[:5]:   # Top 5 contributors
        contribution = weights[j] * V[j]
        print(f"  {sentence[j]:10s} (pos {j}): "
              f"weight={weights[j]:.3f}, "
              f"contrib_norm={np.linalg.norm(contribution):.3f}")

    # Verify output is weighted sum
    reconstructed = (weights[:, None] * V).sum(axis=0)
    assert np.allclose(output, reconstructed)
    print(f"\n✓ Output verified as weighted sum of values")

# Example
sentence = ["The", "bank", "by", "the", "river", "was", "eroded"]
n = len(sentence)
d_v = 8

# Simulated attention weights learned to resolve "bank" → "river"
# Position 1 ("bank") attends strongly to position 4 ("river")
attention_weights = softmax(np.random.randn(n, n), axis=-1)
attention_weights[1, 4] = 0.6                                               # Strong attention to "river"
attention_weights[1] = attention_weights[1] / attention_weights[1].sum()    # Renormalize

V = np.random.randn(n, d_v)

analyze_output_contributions(sentence, V, attention_weights, position=1)
```

Contextualization in Practice:
Before self-attention, the representation for "bank" might look the same in "river bank" and "bank account". After self-attention, the "bank" in "river bank" has absorbed information from "river", while the "bank" in "bank account" has absorbed information from "account".
These are now distinguishable embeddings, even though the input token was identical.
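A minimal sketch of this effect (toy random vectors and a hypothetical two-token context, purely illustrative): the same "bank" embedding is placed next to two different neighbors, and after one self-attention pass its output representation differs because different values get mixed in.

```python
import numpy as np
from scipy.special import softmax

np.random.seed(3)
d_model, d_k = 16, 8
bank = np.random.randn(d_model)        # identical input embedding in both contexts
river, account = np.random.randn(d_model), np.random.randn(d_model)

X_river = np.stack([river, bank])      # toy 2-token context: "river", "bank"
X_money = np.stack([account, bank])    # toy 2-token context: "account", "bank"

W_q, W_k, W_v = (np.random.randn(d_model, d_k) * 0.1 for _ in range(3))

def attend(X):
    """One self-attention pass over a tiny sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return A @ V

out_river = attend(X_river)[1]   # contextualized "bank" next to "river"
out_money = attend(X_money)[1]   # contextualized "bank" next to "account"
print(np.allclose(out_river, out_money))               # False: same token, different outputs
print(np.linalg.norm(out_river - out_money).round(3))  # nonzero distance between the two
```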
Stacking Self-Attention Layers:
A single self-attention layer produces contextualized representations. Stacking multiple layers creates hierarchical contextualization: early layers tend to capture local, surface-level relationships, while deeper layers combine these into increasingly abstract, longer-range ones.
This is analogous to how CNNs build up from edges → textures → objects.
In practice, self-attention output is added to the input via residual connections: Output = X + SelfAttention(X). This preserves the original representation while adding contextual information—making optimization easier and preventing the loss of position-specific information.
A fundamental property of self-attention is its inherent bidirectionality: each position can attend to all other positions, including those before and after it.
Implications: every position's representation incorporates both left and right context, and the mechanism itself imposes no notion of temporal order or causality.
This is ideal for encoder tasks (understanding complete sequences) but problematic for decoder tasks (generating sequences left-to-right).
```python
import numpy as np
from scipy.special import softmax

def bidirectional_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> tuple:
    """Standard bidirectional self-attention."""
    d_k = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    attention_weights = softmax(scores, axis=-1)
    output = attention_weights @ V
    return output, attention_weights

def causal_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> tuple:
    """Causal (masked) self-attention - positions can only attend to the past."""
    d_k = Q.shape[1]
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(d_k)

    # Create causal mask: position i can only attend to positions <= i
    # Upper triangular = future positions = should be masked
    causal_mask = np.triu(np.ones((n, n)), k=1)   # Ones above diagonal
    scores = scores - 1e9 * causal_mask           # Effectively -inf for future positions

    attention_weights = softmax(scores, axis=-1)
    output = attention_weights @ V
    return output, attention_weights

# Demonstration
n, d = 5, 4
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)

_, bidirectional_weights = bidirectional_attention(Q, K, V)
_, causal_weights = causal_attention(Q, K, V)

print("Bidirectional attention weights (rounded):")
print(np.round(bidirectional_weights, 2))
print("\nCausal attention weights (rounded):")
print(np.round(causal_weights, 2))

# Notice: Causal has zeros above the diagonal (can't attend to the future)
```

Causal Masking Mechanism:
For autoregressive generation (predicting the next token), we must prevent the model from "cheating" by looking at future tokens. The causal mask sets attention scores for future positions to $-\infty$, which becomes 0 after softmax:
$$\text{mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \text{ (past/present)} \\ -\infty & \text{if } j > i \text{ (future)} \end{cases}$$
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + \text{mask}}{\sqrt{d_k}}\right)V$$
In the full transformer, three types of attention coexist: (1) Bidirectional self-attention in the encoder, (2) Causal self-attention in the decoder, and (3) Cross-attention where decoder queries attend to encoder outputs. Each serves a different purpose in the architecture.
In practice, a self-attention layer includes several additional components beyond the core attention operation: an output projection back to $d_{model}$, residual connections, layer normalization, and a position-wise feedforward network.
Full Layer Architecture:
```python
import numpy as np
from scipy.special import softmax

class SelfAttentionLayer:
    """
    Complete self-attention layer with all standard components.
    """

    def __init__(self, d_model: int, d_k: int, d_v: int, d_ff: int):
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        self.d_ff = d_ff

        # QKV projections
        scale = 0.02
        self.W_q = np.random.randn(d_model, d_k) * scale
        self.W_k = np.random.randn(d_model, d_k) * scale
        self.W_v = np.random.randn(d_model, d_v) * scale

        # Output projection (project d_v back to d_model)
        self.W_o = np.random.randn(d_v, d_model) * scale

        # Feedforward network
        self.W_ff1 = np.random.randn(d_model, d_ff) * scale
        self.b_ff1 = np.zeros(d_ff)
        self.W_ff2 = np.random.randn(d_ff, d_model) * scale
        self.b_ff2 = np.zeros(d_model)

        # LayerNorm parameters (simplified)
        self.ln1_gamma = np.ones(d_model)
        self.ln1_beta = np.zeros(d_model)
        self.ln2_gamma = np.ones(d_model)
        self.ln2_beta = np.zeros(d_model)

    def layer_norm(self, x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
                   eps: float = 1e-5) -> np.ndarray:
        """Layer normalization over the last dimension."""
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        x_norm = (x - mean) / np.sqrt(var + eps)
        return gamma * x_norm + beta

    def attention(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                  mask: np.ndarray = None) -> np.ndarray:
        """Scaled dot-product attention."""
        scores = Q @ K.T / np.sqrt(self.d_k)
        if mask is not None:
            scores = scores + mask
        weights = softmax(scores, axis=-1)
        return weights @ V

    def feedforward(self, x: np.ndarray) -> np.ndarray:
        """Position-wise feedforward network with ReLU."""
        hidden = np.maximum(0, x @ self.W_ff1 + self.b_ff1)   # ReLU
        return hidden @ self.W_ff2 + self.b_ff2

    def forward(self, x: np.ndarray, mask: np.ndarray = None) -> np.ndarray:
        """
        Full forward pass through the self-attention layer.

        Args:
            x: Input, shape (n, d_model)
            mask: Optional attention mask

        Returns:
            Output, shape (n, d_model)
        """
        # === Self-Attention Sublayer ===
        # 1. Compute Q, K, V
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v

        # 2. Attention
        attn_output = self.attention(Q, K, V, mask)

        # 3. Output projection
        attn_output = attn_output @ self.W_o

        # 4. Residual connection + LayerNorm
        x = self.layer_norm(x + attn_output, self.ln1_gamma, self.ln1_beta)

        # === Feedforward Sublayer ===
        # 5. Feedforward network
        ff_output = self.feedforward(x)

        # 6. Residual connection + LayerNorm
        x = self.layer_norm(x + ff_output, self.ln2_gamma, self.ln2_beta)

        return x

# Usage
layer = SelfAttentionLayer(d_model=64, d_k=16, d_v=16, d_ff=256)
x = np.random.randn(10, 64)   # Sequence of 10 tokens, 64-dim
output = layer.forward(x)
print(f"Input shape: {x.shape}")        # (10, 64)
print(f"Output shape: {output.shape}")  # (10, 64)
```

This complete layer is the fundamental building block of transformers. Stack L such layers (typically 6-24), and you have either a transformer encoder or decoder. The only difference is whether the attention is bidirectional or causal.
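As a brief usage sketch (reusing the SelfAttentionLayer class and imports from the block above, with the same toy sizes), stacking is just repeated application: each layer consumes the previous layer's output, and the (n, d_model) shape is preserved throughout.

```python
# Stack several layers: each layer re-contextualizes the previous layer's output
num_layers = 4
layers = [SelfAttentionLayer(d_model=64, d_k=16, d_v=16, d_ff=256) for _ in range(num_layers)]

x = np.random.randn(10, 64)   # 10 tokens, 64-dim embeddings
for layer in layers:
    x = layer.forward(x)      # shape is preserved: (10, 64) in, (10, 64) out

print(x.shape)                # (10, 64)
```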
We've established a comprehensive understanding of self-attention—the mechanism that enables transformers to capture complex dependencies across sequences.
What's Next:
Now that we understand the formulation, we need to examine the details of computing attention weights. The next page explores how raw scores become probability distributions, including the crucial scaling factor and the softmax operation.
You now understand the complete formulation of self-attention—from the QKV framework to the full layer architecture. This foundation is essential for understanding the computational details, multi-head attention, and transformer architectures that follow.