The Scaled Dot-Product Attention mechanism is the cornerstone of modern deep learning architectures, particularly the Transformer model. This elegant mathematical operation enables neural networks to dynamically weigh and aggregate information from different positions in a sequence, allowing them to capture long-range dependencies and contextual relationships with remarkable efficiency.
At its core, attention answers a fundamental question: "When processing a particular element in a sequence, how much should I focus on each other element?" Unlike recurrent architectures that process sequences step-by-step, attention allows every position to directly interact with every other position in a single operation.
The mechanism operates on three learned representations:
- Queries (Q): what each position is looking for.
- Keys (K): what each position offers to be matched against.
- Values (V): the content each position contributes to the output.
Given an input sequence X of shape (seq_len, d_model) and learned weight matrices W_q, W_k, and W_v, the attention computation proceeds as follows:
Step 1: Linear Projections
First, compute the Query, Key, and Value matrices through linear transformations:
$$Q = X \cdot W_q$$ $$K = X \cdot W_k$$ $$V = X \cdot W_v$$
Step 2: Attention Score Computation
Compute compatibility scores between queries and keys using dot products, then scale by the square root of the key dimension to maintain stable gradients:
$$\text{Attention Scores} = \frac{Q \cdot K^T}{\sqrt{d_k}}$$
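A quick numerical sketch of why the √d_k factor matters (NumPy is used here purely for illustration): for random queries and keys with unit-variance entries, the raw dot product has variance d_k, so its typical magnitude grows with the dimension; dividing by √d_k brings the scores back to unit scale, which keeps the subsequent softmax out of its saturated, small-gradient regime.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
# 10,000 random query/key pairs with unit-variance components
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = np.sum(q * k, axis=1)        # unscaled dot products
scaled = raw / np.sqrt(d_k)        # scaled attention scores

print(raw.std())     # close to sqrt(64) = 8: grows with d_k
print(scaled.std())  # close to 1: stable regardless of d_k
```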
Step 3: Softmax Normalization
Convert scores to attention weights that sum to 1 across each row:
$$\text{Attention Weights} = \text{softmax}(\text{Attention Scores})$$
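In practice the row-wise softmax is computed in a numerically stable form by subtracting each row's maximum before exponentiating; this leaves the result unchanged, since softmax is invariant to a constant shift. A minimal NumPy sketch, with the function name chosen for illustration:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row max prevents overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Each row of the result is non-negative and sums to 1
weights = softmax(np.array([[0.707, 0.0], [0.0, 0.707]]))
```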
Step 4: Weighted Aggregation
Produce the final output by computing a weighted sum of values:
$$\text{Output} = \text{Attention Weights} \cdot V$$
Implement two functions:
compute_qkv: Transform the input matrix into Query, Key, and Value representations using the provided weight matrices.
self_attention: Given Q, K, and V matrices, compute and return the attention output using the scaled dot-product attention formula.
Your implementation should handle varying sequence lengths and embedding dimensions while maintaining numerical stability through proper scaling.
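The four steps above can be sketched directly in NumPy (NumPy is an assumption here; the exact function signatures your grader expects may differ). The usage at the bottom reproduces the first worked example below:

```python
import numpy as np

def compute_qkv(X, W_q, W_k, W_v):
    """Project the input into Query, Key, and Value matrices."""
    return X @ W_q, X @ W_k, X @ W_v

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)
    # Numerically stable softmax over each row
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Identity W_q, W_k and W_v = [[1, 2], [3, 4]]
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Q, K, V = compute_qkv(X, np.eye(2), np.eye(2),
                      np.array([[1.0, 2.0], [3.0, 4.0]]))
out = self_attention(Q, K, V)
print(np.round(out, 6))  # [[1.660477 2.660477] [2.339523 3.339523]]
```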
Example 1:
Input:
X = [[1, 0], [0, 1]]
W_q = [[1, 0], [0, 1]]
W_k = [[1, 0], [0, 1]]
W_v = [[1, 2], [3, 4]]

Output:
[[1.660477, 2.660477], [2.339523, 3.339523]]

Reasoning:
Step 1: Compute Q, K, V. With identity matrices for W_q and W_k, Q = K = X = [[1, 0], [0, 1]], and V = X · W_v = [[1, 2], [3, 4]].
Step 2: Compute Attention Scores Scores = (Q · K^T) / √d_k = [[1, 0], [0, 1]] / √2 ≈ [[0.707, 0], [0, 0.707]]
Step 3: Apply Softmax. Row 1: softmax([0.707, 0]) ≈ [0.669762, 0.330238]. Row 2: softmax([0, 0.707]) ≈ [0.330238, 0.669762].
Step 4: Weighted Sum of Values. Output[0] = 0.669762 × [1, 2] + 0.330238 × [3, 4] ≈ [1.660477, 2.660477]. Output[1] = 0.330238 × [1, 2] + 0.669762 × [3, 4] ≈ [2.339523, 3.339523].
Example 2:
Input:
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_q = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_k = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_v = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

Output:
[[0.471083, 0.264458, 0.264458], [0.264458, 0.471083, 0.264458], [0.264458, 0.264458, 0.471083]]

Reasoning:
When all projection matrices are identity matrices, Q = K = V = X.
The attention scores become Q · K^T / √3, which after softmax results in each position attending most to itself (≈0.471) and equally to others (≈0.264).
This demonstrates the symmetric attention pattern when positions have orthogonal representations.
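This pattern is easy to verify directly; a standalone NumPy sketch of the computation above:

```python
import numpy as np

# With identity projections, Q = K = V = X = I_3
X = np.eye(3)
scores = (X @ X.T) / np.sqrt(3)    # 1/sqrt(3) on the diagonal, 0 elsewhere
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
out = weights @ X
print(np.round(out, 6))  # diagonal ≈ 0.471083, off-diagonal ≈ 0.264458
```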
Example 3:
Input:
X = [[1, 2, 3], [4, 5, 6]]
W_q = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_k = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_v = [[1, 0], [0, 1], [1, 1]]

Output:
[[9.999928, 10.999928], [10.0, 11.0]]

Reasoning:
With non-trivial projection matrices, the queries and keys are projected into a 2-dimensional space: Q = K = X · W_q = [[2.2, 2.8], [4.9, 6.4]].
V = X · W_v projects to the value space: [[4, 5], [10, 11]]
Because W_q = W_k, queries and keys are aligned and the attention scores are large and lopsided, so the softmax saturates: the first position attends almost entirely to the second, and the second position attends almost entirely to itself. That is why both output rows are close to V's second row, [10, 11].
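A standalone NumPy sketch that reproduces this saturation effect:

```python
import numpy as np

# Inputs from the example above; W_q = W_k share the same matrix W
X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
W = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
W_v = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

Q, K, V = X @ W, X @ W, X @ W_v
scores = (Q @ K.T) / np.sqrt(K.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# The large score gap pushes the softmax weights toward [0, 1],
# so both output rows land near V's second row, [10, 11].
out = weights @ V
print(np.round(out, 6))
```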
Constraints