In traditional self-attention mechanisms, the softmax function normalizes attention weights to form a probability distribution that sums to 1. While this design elegantly enables the model to focus on different parts of the input sequence, it introduces a fundamental limitation: the model is forced to attend to something even when no tokens in the context are genuinely relevant to the current query.
The Gated Attention Mechanism addresses this architectural constraint by introducing a learnable gating function that can dynamically modulate the attention output. This gate, computed directly from the input representation, applies element-wise scaling to the attention result, effectively allowing the model to "turn down" or completely suppress outputs when attending provides no useful information.
Given an input sequence matrix X of shape (seq_len, d_model) and weight matrices W_q, W_k, W_v, and W_g (each of shape (d_model, d_model)):
The gated attention mechanism computes:
Step 1: Project inputs to query, key, and value spaces $$Q = X \cdot W_q, \quad K = X \cdot W_k, \quad V = X \cdot W_v$$
Step 2: Compute scaled dot-product attention $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
Step 3: Compute query-dependent gate $$G = \sigma(X \cdot W_g)$$
where σ denotes the sigmoid activation function, producing values in the range (0, 1).
Step 4: Apply gating to modulate output $$\text{Output} = G \odot \text{Attention}(Q, K, V)$$
where ⊙ denotes element-wise (Hadamard) multiplication.
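The four steps above translate almost line-for-line into NumPy. The following is a minimal sketch (the function name and the row-wise softmax are choices made here, not part of the specification):

```python
import numpy as np

def gated_attention(X, W_q, W_k, W_v, W_g):
    # Step 1: project inputs to query, key, and value spaces
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Step 2: scaled dot-product attention with a row-wise softmax
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    attn = weights @ V
    # Step 3: query-dependent sigmoid gate, computed from the raw input X
    G = 1.0 / (1.0 + np.exp(-(X @ W_g)))
    # Step 4: element-wise (Hadamard) modulation of the attention output
    return G * attn
```

With identity projections and a zero gate matrix, the gate is σ(0) = 0.5 everywhere, so the attention output is simply halved.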
The gate G serves as a learned relevance filter: values near 0 suppress the attention output, while values near 1 let it pass through largely unchanged.
This mechanism allows the network to learn when attention is valuable versus when it should be dampened, improving model expressiveness without adding significant computational overhead.
Your Task: Implement the gated attention mechanism that computes the query, key, and value projections, applies scaled dot-product attention, and modulates the result using a learned sigmoid gate. Round all output values to 4 decimal places.
X = [[1.0, 0.0], [0.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
W_v = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[0.0, 0.0], [0.0, 0.0]]

Expected output:
[[0.3349, 0.1651], [0.1651, 0.3349]]

Step-by-step computation:
Projections (with identity matrices): Q = K = V = X = [[1.0, 0.0], [0.0, 1.0]]
Attention scores (Q K^T / √2): [[0.7071, 0.0], [0.0, 0.7071]]
Softmax attention weights: [[0.6698, 0.3302], [0.3302, 0.6698]]
Attention output (weights × V): [[0.6698, 0.3302], [0.3302, 0.6698]]
Gate computation (W_g is zeros): X · W_g = 0, so G = σ(0) = [[0.5, 0.5], [0.5, 0.5]]
Final gated output: [[0.3349, 0.1651], [0.1651, 0.3349]]
X = [[1.0, 0.5], [0.5, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
W_v = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[1.0, 0.0], [0.0, 1.0]]

Expected output:
[[0.5644, 0.4531], [0.4531, 0.5644]]

Step-by-step computation:
Projections: Q = K = V = X = [[1.0, 0.5], [0.5, 1.0]]
Attention scores (Q @ K^T): [[1.25, 1.0], [1.0, 1.25]]; scaled by 1/√2: [[0.8839, 0.7071], [0.7071, 0.8839]]
Softmax attention weights: [[0.5441, 0.4559], [0.4559, 0.5441]]
Attention output: [[0.7720, 0.7280], [0.7280, 0.7720]]
Gate computation (W_g is identity): G = σ(X) = [[0.7311, 0.6225], [0.6225, 0.7311]]
Final gated output: [[0.5644, 0.4531], [0.4531, 0.5644]]
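This example can be checked numerically in a few lines of NumPy (the variable names below are illustrative):

```python
import numpy as np

X = np.array([[1.0, 0.5], [0.5, 1.0]])
W = np.eye(2)                      # W_q = W_k = W_v = W_g are all the identity here
Q = K = V = X @ W                  # projections reduce to X itself
scores = Q @ K.T / np.sqrt(2)      # scaled dot-product scores
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
attn = weights @ V
G = 1.0 / (1.0 + np.exp(-(X @ W)))              # sigmoid gate, non-trivial since W_g = I
out = np.round(G * attn, 4)        # [[0.5644, 0.4531], [0.4531, 0.5644]]
```

Because both X and all the weight matrices are symmetric, every intermediate matrix (and hence the output) is symmetric as well.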
The symmetric input and parameters produce a symmetric output matrix.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_k = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_v = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_g = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

Expected output:
[[0.2355, 0.1322, 0.1322], [0.1322, 0.2355, 0.1322], [0.1322, 0.1322, 0.2355]]

Step-by-step computation for 3×3 identity case:
Projections: With identity matrices, Q = K = V = X (3×3 identity matrix)
Attention scores (Q @ K^T): The identity matrix product yields the identity matrix again.
Softmax attention weights (after scaling the unit diagonal by 1/√3 ≈ 0.5774): [[0.4711, 0.2645, 0.2645], [0.2645, 0.4711, 0.2645], [0.2645, 0.2645, 0.4711]]
Attention output (weights × V): identical to the weights, since V is the identity.
Gate computation (W_g is zeros): X · W_g = 0, so G = σ(0) = 0.5 for every element.
Final gated output: [[0.2355, 0.1322, 0.1322], [0.1322, 0.2355, 0.1322], [0.1322, 0.1322, 0.2355]]
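The 3×3 case follows the same pattern; a short NumPy check (variable names chosen here for clarity):

```python
import numpy as np

X = np.eye(3)                           # X and all projection matrices are 3x3 identities
scores = X @ X.T / np.sqrt(3)           # Q = K = V = X, so scores = I / sqrt(3)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
attn = weights @ X                      # V is the identity, so this equals the weights
G = np.full((3, 3), 0.5)                # sigmoid(X @ 0) = 0.5 everywhere
out = np.round(G * attn, 4)
```

Note the scaling factor is 1/√3 here (d_k = 3), not 1/√2 as in the 2×2 examples.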
Constraints