The Parallel Attention Heads Mechanism is the cornerstone of transformer-based architectures that power modern large language models, machine translation systems, and numerous state-of-the-art AI applications. This mechanism enables models to simultaneously attend to information from different representation subspaces at different positions, providing a richer understanding of contextual relationships within sequences.
Unlike single-head attention that captures one type of relationship, parallel attention heads allow the model to learn multiple distinct attention patterns simultaneously. Each head focuses on different aspects of the input—one head might capture syntactic dependencies while another tracks semantic relationships. The final representation combines insights from all heads through concatenation.
Implement three interconnected functions that together form the complete parallel attention heads mechanism:
compute_qkv(X, W_q, W_k, W_v) — Projection Function
Transform the input embedding matrix X into three distinct representation spaces using learned weight matrices:
Each projection is computed as a simple matrix multiplication: Q = X @ W_q, K = X @ W_k, V = X @ W_v.
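In NumPy, the projection step is just three matrix products. A minimal sketch under that assumption (the function name and signature come from the problem statement; the choice of NumPy is ours):

```python
import numpy as np

def compute_qkv(X, W_q, W_k, W_v):
    """Project input embeddings X into query, key, and value spaces."""
    X = np.asarray(X, dtype=float)
    Q = X @ W_q   # queries
    K = X @ W_k   # keys
    V = X @ W_v   # values
    return Q, K, V
```

With identity weight matrices, as in the examples below, Q, K, and V all equal X.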
self_attention(Q, K, V) — Scaled Dot-Product Attention
Compute attention scores and weighted values for a single attention head:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where Q, K, and V are the query, key, and value matrices, and d_k is the feature dimension of the keys; dividing by √d_k keeps the dot products from growing with the dimension and saturating the softmax.
Critical Implementation Note: Use numerically stable softmax by subtracting the maximum value before exponentiating to prevent numerical overflow.
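A sketch of the scaled dot-product step with the row-wise max-subtraction trick (NumPy is our assumption; the problem only fixes the signature):

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention with a numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) compatibility scores
    # Stable softmax: subtract each row's max before exponentiating,
    # so the largest exponent is 0 and np.exp cannot overflow.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

The shift leaves the softmax output unchanged, since softmax is invariant to adding a constant to every score in a row.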
multi_head_attention(Q, K, V, n_heads) — Parallel Head Processing
Split the Q, K, V matrices across multiple heads along the feature dimension, apply self-attention independently on each head, and concatenate results:
• Split Q, K, and V into n_heads equal parts along the last dimension
• Apply self_attention to each corresponding (Qᵢ, Kᵢ, Vᵢ) triplet
• Concatenate the per-head outputs along the feature dimension

Input shapes: X is (sequence_length, model_dimension) containing token representations; each weight matrix is (model_dimension, model_dimension), so Q, K, and V are (sequence_length, model_dimension). Return a matrix of shape (sequence_length, model_dimension) representing the attention-weighted combination of values across all heads.
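The split-attend-concatenate steps can be sketched as follows. The snippet bundles a minimal self_attention so it is self-contained; NumPy and np.hsplit for the feature-dimension split are our assumptions:

```python
import numpy as np

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, n_heads):
    """Split features across heads, attend per head, concatenate results."""
    d_model = Q.shape[-1]
    assert d_model % n_heads == 0, "model dimension must divide evenly across heads"
    # np.hsplit cuts the last (feature) dimension into n_heads equal slices.
    head_outputs = [
        self_attention(Qi, Ki, Vi)
        for Qi, Ki, Vi in zip(np.hsplit(Q, n_heads),
                              np.hsplit(K, n_heads),
                              np.hsplit(V, n_heads))
    ]
    return np.concatenate(head_outputs, axis=-1)  # back to (seq_len, d_model)
```

Each head attends over the full sequence but only over its own slice of the feature dimension, which is what lets the heads learn distinct attention patterns in parallel.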
Your solution should first call compute_qkv(X, W_q, W_k, W_v) to obtain the Q, K, V matrices, then call multi_head_attention(Q, K, V, n_heads) to get the final output.

Example 1
X = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
W_q = W_k = W_v = [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]] # Identity matrices
n_heads = 2

Expected output: [[4.999174, 5.999174, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0]]

With a 2-token sequence (each with 4 features) and identity weight matrices, the Q, K, V matrices equal the input X. With 2 attention heads:
• Each head processes 2 features (4 ÷ 2 = 2)
• Head 1 processes features [1, 2] and [5, 6] for each token
• Head 2 processes features [3, 4] and [7, 8] for each token
The scaled dot-product attention computes compatibility between tokens. Due to the larger dot product of position 2 with itself (larger values in second row), the attention weights heavily favor the second token. The soft attention distribution causes the first position to pull values toward the second position, while position 2 attends almost entirely to itself. The results are concatenated back to produce the final 4-dimensional output.
Example 2
X = [[1.0, 2.0], [3.0, 4.0]]
W_q = W_k = W_v = [[1, 0], [0, 1]] # 2x2 Identity matrices
n_heads = 1

Expected output: [[2.971668, 3.971668], [2.9999, 3.9999]]

With a single attention head processing the entire 2-dimensional feature space:
• Q = K = V = X (due to identity weights)
• Scaling factor √d_k = √2 ≈ 1.414
• Attention scores matrix: [[5/√2, 11/√2], [11/√2, 25/√2]] = [[3.54, 7.78], [7.78, 17.68]]
• After softmax, position 2 (with larger values) receives higher attention weights
• Both positions shift toward the representation of position 2, with position 1 shifting more significantly
The output shows both positions producing values very close to [3, 4], the second token's values.
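The arithmetic in this walkthrough can be checked numerically. A standalone sketch reproducing Example 2 (NumPy assumed; the attention computation is inlined since Q = K = V = X here):

```python
import numpy as np

# Example 2: identity weights make Q = K = V = X.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
scores = X @ X.T / np.sqrt(2)                  # [[3.54, 7.78], [7.78, 17.68]]
scores -= scores.max(axis=-1, keepdims=True)   # stable softmax shift
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ X

# Both rows land close to [3, 4], the second token's values.
assert np.allclose(out, [[2.971668, 3.971668], [2.9999, 3.9999]], atol=1e-4)
```

Note how much closer row 2 is to [3, 4] than row 1: position 2's score for itself (17.68) dwarfs its cross score (7.78), so after the softmax its self-weight is nearly 1.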
Example 3
X = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18]]
W_q = W_k = W_v = 6x6 Identity Matrix
n_heads = 3

Expected output: [[12.999982, 13.999982, 15.0, 16.0, 17.0, 18.0], [13.0, 14.0, 15.0, 16.0, 17.0, 18.0], [13.0, 14.0, 15.0, 16.0, 17.0, 18.0]]

With 3 tokens (6 features each) and 3 attention heads:
• Each head processes 2 features (6 ÷ 3 = 2)
• Head 1 sees features [1, 2], [7, 8], [13, 14] (one pair per token); Head 2 sees [3, 4], [9, 10], [15, 16]; Head 3 sees [5, 6], [11, 12], [17, 18]
• The largest-valued token (position 3) dominates attention across all heads
• Attention weights exponentially favor position 3 due to its higher dot products
All positions converge nearly to the third token's values [13,14,15,16,17,18], demonstrating how strongly self-attention can focus on dominant representations.
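Putting the three functions together, an end-to-end sketch that runs Example 3 through the full pipeline (NumPy assumed; function names and call order follow the problem statement):

```python
import numpy as np

def compute_qkv(X, W_q, W_k, W_v):
    X = np.asarray(X, dtype=float)
    return X @ W_q, X @ W_k, X @ W_v

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, n_heads):
    heads = [self_attention(Qi, Ki, Vi)
             for Qi, Ki, Vi in zip(np.hsplit(Q, n_heads),
                                   np.hsplit(K, n_heads),
                                   np.hsplit(V, n_heads))]
    return np.concatenate(heads, axis=-1)

X = np.arange(1.0, 19.0).reshape(3, 6)   # Example 3 input
I = np.eye(6)                            # identity weight matrices
Q, K, V = compute_qkv(X, I, I, I)
out = multi_head_attention(Q, K, V, 3)
# Every row converges to (nearly) the third token's values.
print(np.round(out, 6))
```

The same pipeline reproduces Examples 1 and 2 by swapping in their X, weight, and n_heads values.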
Constraints