The Parallel Attention Heads Mechanism is the cornerstone of transformer-based architectures that power modern large language models, machine translation systems, and numerous state-of-the-art AI applications. This mechanism enables models to simultaneously attend to information from different representation subspaces at different positions, providing a richer understanding of contextual relationships within sequences.
Unlike single-head attention that captures one type of relationship, parallel attention heads allow the model to learn multiple distinct attention patterns simultaneously. Each head focuses on different aspects of the input—one head might capture syntactic dependencies while another tracks semantic relationships. The final representation combines insights from all heads through concatenation.
Implement three interconnected functions that together form the complete parallel attention heads mechanism:
compute_qkv(X, W_q, W_k, W_v) — Projection Function
Transform the input embedding matrix X into three distinct representation spaces using learned weight matrices:
Each projection is computed as a simple matrix multiplication: Q = X @ W_q, K = X @ W_k, V = X @ W_v.
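In NumPy, the projection step is just three matrix products. A minimal sketch under that assumption (the function name and signature come from the problem statement; the choice of NumPy is ours):

```python
import numpy as np

def compute_qkv(X, W_q, W_k, W_v):
    """Project input embeddings X into query, key, and value spaces."""
    X = np.asarray(X, dtype=float)
    Q = X @ W_q   # queries
    K = X @ W_k   # keys
    V = X @ W_v   # values
    return Q, K, V
```

With identity weight matrices, as in the examples below, Q, K, and V all equal X.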
self_attention(Q, K, V) — Scaled Dot-Product Attention
Compute attention scores and weighted values for a single attention head:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where Q, K, and V are the query, key, and value matrices, and d_k is the feature dimension of the keys; dividing by √d_k keeps the dot products from growing with the dimension and saturating the softmax.
Critical Implementation Note: Use numerically stable softmax by subtracting the maximum value before exponentiating to prevent numerical overflow.
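A sketch of the scaled dot-product step with the row-wise max-subtraction trick (NumPy is our assumption; the problem only fixes the signature):

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention with a numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) compatibility scores
    # Stable softmax: subtract each row's max before exponentiating,
    # so the largest exponent is 0 and np.exp cannot overflow.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

The shift leaves the softmax output unchanged, since softmax is invariant to adding a constant to every score in a row.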
multi_head_attention(Q, K, V, n_heads) — Parallel Head Processing
Split the Q, K, V matrices across multiple heads along the feature dimension, apply self-attention independently on each head, and concatenate results:
• Split Q, K, and V into n_heads equal parts along the last dimension
• Apply self_attention to each corresponding (Qᵢ, Kᵢ, Vᵢ) triplet
• Concatenate the per-head outputs along the feature dimension

Input shapes: X is (sequence_length, model_dimension) containing token representations; each weight matrix is (model_dimension, model_dimension), so Q, K, and V are (sequence_length, model_dimension). Return a matrix of shape (sequence_length, model_dimension) representing the attention-weighted combination of values across all heads.
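The split-attend-concatenate steps can be sketched as follows. The snippet bundles a minimal self_attention so it is self-contained; NumPy and np.hsplit for the feature-dimension split are our assumptions:

```python
import numpy as np

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, n_heads):
    """Split features across heads, attend per head, concatenate results."""
    d_model = Q.shape[-1]
    assert d_model % n_heads == 0, "model dimension must divide evenly across heads"
    # np.hsplit cuts the last (feature) dimension into n_heads equal slices.
    head_outputs = [
        self_attention(Qi, Ki, Vi)
        for Qi, Ki, Vi in zip(np.hsplit(Q, n_heads),
                              np.hsplit(K, n_heads),
                              np.hsplit(V, n_heads))
    ]
    return np.concatenate(head_outputs, axis=-1)  # back to (seq_len, d_model)
```

Each head attends over the full sequence but only over its own slice of the feature dimension, which is what lets the heads learn distinct attention patterns in parallel.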
Your solution should first call compute_qkv(X, W_q, W_k, W_v) to obtain the Q, K, V matrices, then call multi_head_attention(Q, K, V, n_heads) to get the final output.

Example 1
X = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
W_q = W_k = W_v = [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]] # Identity matrices
n_heads = 2

Expected output: [[4.999174, 5.999174, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0]]

With a 2-token sequence (each with 4 features) and identity weight matrices, the Q, K, V matrices equal the input X. With 2 attention heads:
• Each head processes 2 features (4 ÷ 2 = 2)
• Head 1 processes features [1, 2] and [5, 6] for each token
• Head 2 processes features [3, 4] and [7, 8] for each token
The scaled dot-product attention computes compatibility between tokens. Due to the larger dot product of position 2 with itself (larger values in second row), the attention weights heavily favor the second token. The soft attention distribution causes the first position to pull values toward the second position, while position 2 attends almost entirely to itself. The results are concatenated back to produce the final 4-dimensional output.
Example 2
X = [[1.0, 2.0], [3.0, 4.0]]
W_q = W_k = W_v = [[1, 0], [0, 1]] # 2x2 Identity matrices
n_heads = 1

Expected output: [[2.971668, 3.971668], [2.9999, 3.9999]]

With a single attention head processing the entire 2-dimensional feature space:
• Q = K = V = X (due to identity weights)
• Scaling factor √d_k = √2 ≈ 1.414
• Attention scores matrix: [[5/√2, 11/√2], [11/√2, 25/√2]] = [[3.54, 7.78], [7.78, 17.68]]
• After softmax, position 2 (with larger values) receives higher attention weights
• Both positions shift toward the representation of position 2, with position 1 shifting more significantly
The output shows both positions producing values very close to [3, 4], the second token's values.
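The arithmetic in this walkthrough can be checked numerically. A standalone sketch reproducing Example 2 (NumPy assumed; the attention computation is inlined since Q = K = V = X here):

```python
import numpy as np

# Example 2: identity weights make Q = K = V = X.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
scores = X @ X.T / np.sqrt(2)                  # [[3.54, 7.78], [7.78, 17.68]]
scores -= scores.max(axis=-1, keepdims=True)   # stable softmax shift
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ X

# Both rows land close to [3, 4], the second token's values.
assert np.allclose(out, [[2.971668, 3.971668], [2.9999, 3.9999]], atol=1e-4)
```

Note how much closer row 2 is to [3, 4] than row 1: position 2's score for itself (17.68) dwarfs its cross score (7.78), so after the softmax its self-weight is nearly 1.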
Example 3
X = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18]]
W_q = W_k = W_v = 6x6 Identity Matrix
n_heads = 3

Expected output: [[12.999982, 13.999982, 15.0, 16.0, 17.0, 18.0], [13.0, 14.0, 15.0, 16.0, 17.0, 18.0], [13.0, 14.0, 15.0, 16.0, 17.0, 18.0]]

With 3 tokens (6 features each) and 3 attention heads:
• Each head processes 2 features (6 ÷ 3 = 2)
• Head 1 sees features [1, 2], [7, 8], [13, 14] (one pair per token); Head 2 sees [3, 4], [9, 10], [15, 16]; Head 3 sees [5, 6], [11, 12], [17, 18]
• The largest-valued token (position 3) dominates attention across all heads
• Attention weights exponentially favor position 3 due to its higher dot products
All positions converge nearly to the third token's values [13,14,15,16,17,18], demonstrating how strongly self-attention can focus on dominant representations.
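Putting the three functions together, an end-to-end sketch that runs Example 3 through the full pipeline (NumPy assumed; function names and call order follow the problem statement):

```python
import numpy as np

def compute_qkv(X, W_q, W_k, W_v):
    X = np.asarray(X, dtype=float)
    return X @ W_q, X @ W_k, X @ W_v

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, n_heads):
    heads = [self_attention(Qi, Ki, Vi)
             for Qi, Ki, Vi in zip(np.hsplit(Q, n_heads),
                                   np.hsplit(K, n_heads),
                                   np.hsplit(V, n_heads))]
    return np.concatenate(heads, axis=-1)

X = np.arange(1.0, 19.0).reshape(3, 6)   # Example 3 input
I = np.eye(6)                            # identity weight matrices
Q, K, V = compute_qkv(X, I, I, I)
out = multi_head_attention(Q, K, V, 3)
# Every row converges to (nearly) the third token's values.
print(np.round(out, 6))
```

The same pipeline reproduces Examples 1 and 2 by swapping in their X, weight, and n_heads values.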
Constraints