In traditional self-attention mechanisms, the softmax function normalizes attention weights to form a probability distribution that sums to 1. While this design elegantly enables the model to focus on different parts of the input sequence, it introduces a fundamental limitation: the model is forced to attend to something even when no tokens in the context are genuinely relevant to the current query.
The Gated Attention Mechanism addresses this architectural constraint by introducing a learnable gating function that can dynamically modulate the attention output. This gate, computed directly from the input representation, applies element-wise scaling to the attention result, effectively allowing the model to "turn down" or completely suppress outputs when attending provides no useful information.
Given an input sequence matrix X of shape (seq_len, d_model) and weight matrices W_q, W_k, W_v, and W_g (each of shape (d_model, d_model)):
The gated attention mechanism computes:
Step 1: Project inputs to query, key, and value spaces $$Q = X \cdot W_q, \quad K = X \cdot W_k, \quad V = X \cdot W_v$$
Step 2: Compute scaled dot-product attention $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
Step 3: Compute query-dependent gate $$G = \sigma(X \cdot W_g)$$
where σ denotes the sigmoid activation function, producing values in the range (0, 1).
Step 4: Apply gating to modulate output $$\text{Output} = G \odot \text{Attention}(Q, K, V)$$
where ⊙ denotes element-wise (Hadamard) multiplication.
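The four steps above translate almost line-for-line into NumPy. The following is a minimal sketch (the function name and the row-wise softmax are choices made here, not part of the specification):

```python
import numpy as np

def gated_attention(X, W_q, W_k, W_v, W_g):
    # Step 1: project inputs to query, key, and value spaces
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Step 2: scaled dot-product attention with a row-wise softmax
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    attn = weights @ V
    # Step 3: query-dependent sigmoid gate, computed from the raw input X
    G = 1.0 / (1.0 + np.exp(-(X @ W_g)))
    # Step 4: element-wise (Hadamard) modulation of the attention output
    return G * attn
```

With identity projections and a zero gate matrix, the gate is σ(0) = 0.5 everywhere, so the attention output is simply halved.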
The gate G serves as a learned relevance filter: values near 0 suppress the attention output, while values near 1 let it pass through largely unchanged.
This mechanism allows the network to learn when attention is valuable versus when it should be dampened, improving model expressiveness without adding significant computational overhead.
Your Task: Implement the gated attention mechanism that computes the query, key, and value projections, applies scaled dot-product attention, and modulates the result using a learned sigmoid gate. Round all output values to 4 decimal places.
X = [[1.0, 0.0], [0.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
W_v = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[0.0, 0.0], [0.0, 0.0]]

Expected output:
[[0.3349, 0.1651], [0.1651, 0.3349]]

Step-by-step computation:
Projections (with identity matrices): Q = K = V = X = [[1.0, 0.0], [0.0, 1.0]]
Attention scores (Q K^T / √2): [[0.7071, 0.0], [0.0, 0.7071]]
Softmax attention weights: [[0.6698, 0.3302], [0.3302, 0.6698]]
Attention output (weights × V): [[0.6698, 0.3302], [0.3302, 0.6698]]
Gate computation (W_g is zeros): X · W_g = 0, so G = σ(0) = [[0.5, 0.5], [0.5, 0.5]]
Final gated output: [[0.3349, 0.1651], [0.1651, 0.3349]]
X = [[1.0, 0.5], [0.5, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
W_v = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[1.0, 0.0], [0.0, 1.0]]

Expected output:
[[0.5644, 0.4531], [0.4531, 0.5644]]

Step-by-step computation:
Projections: Q = K = V = X = [[1.0, 0.5], [0.5, 1.0]]
Attention scores (Q @ K^T): [[1.25, 1.0], [1.0, 1.25]]; scaled by 1/√2: [[0.8839, 0.7071], [0.7071, 0.8839]]
Softmax attention weights: [[0.5441, 0.4559], [0.4559, 0.5441]]
Attention output: [[0.7720, 0.7280], [0.7280, 0.7720]]
Gate computation (W_g is identity): G = σ(X) = [[0.7311, 0.6225], [0.6225, 0.7311]]
Final gated output: [[0.5644, 0.4531], [0.4531, 0.5644]]
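This example can be checked numerically in a few lines of NumPy (the variable names below are illustrative):

```python
import numpy as np

X = np.array([[1.0, 0.5], [0.5, 1.0]])
W = np.eye(2)                      # W_q = W_k = W_v = W_g are all the identity here
Q = K = V = X @ W                  # projections reduce to X itself
scores = Q @ K.T / np.sqrt(2)      # scaled dot-product scores
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
attn = weights @ V
G = 1.0 / (1.0 + np.exp(-(X @ W)))              # sigmoid gate, non-trivial since W_g = I
out = np.round(G * attn, 4)        # [[0.5644, 0.4531], [0.4531, 0.5644]]
```

Because both X and all the weight matrices are symmetric, every intermediate matrix (and hence the output) is symmetric as well.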
The symmetric input and parameters produce a symmetric output matrix.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_k = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_v = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_g = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

Expected output:
[[0.2355, 0.1322, 0.1322], [0.1322, 0.2355, 0.1322], [0.1322, 0.1322, 0.2355]]

Step-by-step computation for 3×3 identity case:
Projections: With identity matrices, Q = K = V = X (3×3 identity matrix)
Attention scores (Q @ K^T): The identity matrix product yields the identity matrix again.
Softmax attention weights (after scaling the unit diagonal by 1/√3 ≈ 0.5774): [[0.4711, 0.2645, 0.2645], [0.2645, 0.4711, 0.2645], [0.2645, 0.2645, 0.4711]]
Attention output (weights × V): identical to the weights, since V is the identity.
Gate computation (W_g is zeros): X · W_g = 0, so G = σ(0) = 0.5 for every element.
Final gated output: [[0.2355, 0.1322, 0.1322], [0.1322, 0.2355, 0.1322], [0.1322, 0.1322, 0.2355]]
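The 3×3 case follows the same pattern; a short NumPy check (variable names chosen here for clarity):

```python
import numpy as np

X = np.eye(3)                           # X and all projection matrices are 3x3 identities
scores = X @ X.T / np.sqrt(3)           # Q = K = V = X, so scores = I / sqrt(3)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
attn = weights @ X                      # V is the identity, so this equals the weights
G = np.full((3, 3), 0.5)                # sigmoid(X @ 0) = 0.5 everywhere
out = np.round(G * attn, 4)
```

Note the scaling factor is 1/√3 here (d_k = 3), not 1/√2 as in the 2×2 examples.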
Constraints