The Scaled Dot-Product Attention mechanism is the cornerstone of modern deep learning architectures, particularly the Transformer model. This elegant mathematical operation enables neural networks to dynamically weigh and aggregate information from different positions in a sequence, allowing them to capture long-range dependencies and contextual relationships with remarkable efficiency.
At its core, attention answers a fundamental question: "When processing a particular element in a sequence, how much should I focus on each other element?" Unlike recurrent architectures that process sequences step-by-step, attention allows every position to directly interact with every other position in a single operation.
The mechanism operates on three learned representations:
- Queries (Q): what each position is looking for.
- Keys (K): what each position offers to be matched against.
- Values (V): the content each position contributes to the output.
Given an input sequence X of shape (seq_len, d_model) and learned weight matrices W_q, W_k, and W_v, the attention computation proceeds as follows:
Step 1: Linear Projections
First, compute the Query, Key, and Value matrices through linear transformations:
$$Q = X \cdot W_q$$ $$K = X \cdot W_k$$ $$V = X \cdot W_v$$
Step 2: Attention Score Computation
Compute compatibility scores between queries and keys using dot products, then scale by the square root of the key dimension to maintain stable gradients:
$$\text{Attention Scores} = \frac{Q \cdot K^T}{\sqrt{d_k}}$$
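A quick numerical sketch of why the √d_k factor matters (NumPy is used here purely for illustration): for random queries and keys with unit-variance entries, the raw dot product has variance d_k, so its typical magnitude grows with the dimension; dividing by √d_k brings the scores back to unit scale, which keeps the subsequent softmax out of its saturated, small-gradient regime.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
# 10,000 random query/key pairs with unit-variance components
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = np.sum(q * k, axis=1)        # unscaled dot products
scaled = raw / np.sqrt(d_k)        # scaled attention scores

print(raw.std())     # close to sqrt(64) = 8: grows with d_k
print(scaled.std())  # close to 1: stable regardless of d_k
```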
Step 3: Softmax Normalization
Convert scores to attention weights that sum to 1 across each row:
$$\text{Attention Weights} = \text{softmax}(\text{Attention Scores})$$
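In practice the row-wise softmax is computed in a numerically stable form by subtracting each row's maximum before exponentiating; this leaves the result unchanged, since softmax is invariant to a constant shift. A minimal NumPy sketch, with the function name chosen for illustration:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row max prevents overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Each row of the result is non-negative and sums to 1
weights = softmax(np.array([[0.707, 0.0], [0.0, 0.707]]))
```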
Step 4: Weighted Aggregation
Produce the final output by computing a weighted sum of values:
$$\text{Output} = \text{Attention Weights} \cdot V$$
Implement two functions:
compute_qkv: Transform the input matrix into Query, Key, and Value representations using the provided weight matrices.
self_attention: Given Q, K, and V matrices, compute and return the attention output using the scaled dot-product attention formula.
Your implementation should handle varying sequence lengths and embedding dimensions while maintaining numerical stability through proper scaling.
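The four steps above can be sketched directly in NumPy (NumPy is an assumption here; the exact function signatures your grader expects may differ). The usage at the bottom reproduces the first worked example below:

```python
import numpy as np

def compute_qkv(X, W_q, W_k, W_v):
    """Project the input into Query, Key, and Value matrices."""
    return X @ W_q, X @ W_k, X @ W_v

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)
    # Numerically stable softmax over each row
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Identity W_q, W_k and W_v = [[1, 2], [3, 4]]
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Q, K, V = compute_qkv(X, np.eye(2), np.eye(2),
                      np.array([[1.0, 2.0], [3.0, 4.0]]))
out = self_attention(Q, K, V)
print(np.round(out, 6))  # [[1.660477 2.660477] [2.339523 3.339523]]
```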
Example 1:
Input:
X = [[1, 0], [0, 1]]
W_q = [[1, 0], [0, 1]]
W_k = [[1, 0], [0, 1]]
W_v = [[1, 2], [3, 4]]

Output:
[[1.660477, 2.660477], [2.339523, 3.339523]]

Reasoning:
Step 1: Compute Q, K, V. With identity matrices for W_q and W_k, Q = K = X = [[1, 0], [0, 1]], and V = X · W_v = [[1, 2], [3, 4]].
Step 2: Compute Attention Scores Scores = (Q · K^T) / √d_k = [[1, 0], [0, 1]] / √2 ≈ [[0.707, 0], [0, 0.707]]
Step 3: Apply Softmax. Row 1: softmax([0.707, 0]) ≈ [0.669762, 0.330238]. Row 2: softmax([0, 0.707]) ≈ [0.330238, 0.669762].
Step 4: Weighted Sum of Values. Output[0] = 0.669762 × [1, 2] + 0.330238 × [3, 4] ≈ [1.660477, 2.660477]. Output[1] = 0.330238 × [1, 2] + 0.669762 × [3, 4] ≈ [2.339523, 3.339523].
Example 2:
Input:
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_q = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_k = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_v = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

Output:
[[0.471083, 0.264458, 0.264458], [0.264458, 0.471083, 0.264458], [0.264458, 0.264458, 0.471083]]

Reasoning:
When all projection matrices are identity matrices, Q = K = V = X.
The attention scores become Q · K^T / √3, which after softmax results in each position attending most to itself (≈0.471) and equally to others (≈0.264).
This demonstrates the symmetric attention pattern when positions have orthogonal representations.
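This pattern is easy to verify directly; a standalone NumPy sketch of the computation above:

```python
import numpy as np

# With identity projections, Q = K = V = X = I_3
X = np.eye(3)
scores = (X @ X.T) / np.sqrt(3)    # 1/sqrt(3) on the diagonal, 0 elsewhere
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
out = weights @ X
print(np.round(out, 6))  # diagonal ≈ 0.471083, off-diagonal ≈ 0.264458
```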
Example 3:
Input:
X = [[1, 2, 3], [4, 5, 6]]
W_q = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_k = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_v = [[1, 0], [0, 1], [1, 1]]

Output:
[[9.999928, 10.999928], [10.0, 11.0]]

Reasoning:
With non-trivial projection matrices, the queries and keys are projected into a 2-dimensional space: Q = K = X · W_q = [[2.2, 2.8], [4.9, 6.4]].
V = X · W_v projects to the value space: [[4, 5], [10, 11]]
Because W_q = W_k, queries and keys are aligned and the attention scores are large and lopsided, so the softmax saturates: the first position attends almost entirely to the second, and the second position attends almost entirely to itself. That is why both output rows are close to V's second row, [10, 11].
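A standalone NumPy sketch that reproduces this saturation effect:

```python
import numpy as np

# Inputs from the example above; W_q = W_k share the same matrix W
X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
W = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
W_v = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

Q, K, V = X @ W, X @ W, X @ W_v
scores = (Q @ K.T) / np.sqrt(K.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# The large score gap pushes the softmax weights toward [0, 1],
# so both output rows land near V's second row, [10, 11].
out = weights @ V
print(np.round(out, 6))
```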
Constraints