In modern sequence modeling, particularly when dealing with extremely long sequences, computing full self-attention becomes computationally prohibitive due to its O(n²) complexity. The Localized Sliding Attention mechanism addresses this challenge by restricting each token's attention to a local neighborhood, dramatically reducing computational costs while preserving the model's ability to capture relevant contextual information.
The core idea behind localized sliding attention is elegantly simple yet powerful: instead of allowing each token to attend to every other token in the sequence, we define a window of radius w around each position. A token at position i can only attend to tokens within the range [max(0, i - w), min(seq_len - 1, i + w)], where seq_len is the total sequence length.
This approach creates a band-diagonal attention pattern, where attention is concentrated along the diagonal of the attention matrix, with bandwidth determined by the window size.
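The band-diagonal pattern is easy to visualize as a boolean mask. The snippet below is a small illustration (the variable names `seq_len`, `w`, and `mask` are chosen here for clarity, not taken from any reference implementation):

```python
import numpy as np

seq_len, w = 6, 1
idx = np.arange(seq_len)
# True where |i - j| <= w: a band of width 2w + 1 around the diagonal
mask = np.abs(idx[:, None] - idx[None, :]) <= w
print(mask.astype(int))
```

Each row `i` of the printed matrix has ones only in columns `i - w` through `i + w`, clipped at the sequence boundaries.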
For each query position i, the localized attention is computed as follows:
1. Define the local window: determine the attending positions $\mathcal{N}_i = \{j : |i - j| \leq w\}$.
2. Compute scaled dot-product scores for positions within the window: $$s_{ij} = \frac{Q_i \cdot K_j^T}{\sqrt{d_k}} \quad \text{for } j \in \mathcal{N}_i$$
3. Apply softmax normalization over the local window: $$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(s_{ik})}$$
4. Compute the weighted sum of values: $$\text{Output}_i = \sum_{j \in \mathcal{N}_i} \alpha_{ij} \cdot V_j$$
Tokens near the beginning or end of the sequence naturally have smaller effective windows since they cannot attend to positions beyond the sequence boundaries. No padding is added—the window simply truncates at the boundaries, and the softmax normalization automatically adjusts to the reduced set of attending positions.
Return a NumPy array of shape (seq_len, d_v) containing the attention-weighted output for each position in the sequence.
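The steps above can be sketched directly in NumPy. This is a minimal reference sketch, not a definitive implementation; the function name `sliding_window_attention` is an assumption for illustration:

```python
import numpy as np

def sliding_window_attention(Q, K, V, window_size):
    """Localized sliding attention sketch.

    Q, K: (seq_len, d_k); V: (seq_len, d_v); returns (seq_len, d_v).
    """
    seq_len, d_k = Q.shape
    out = np.zeros((seq_len, V.shape[1]))
    for i in range(seq_len):
        # Window truncates at the sequence boundaries; no padding is added.
        lo = max(0, i - window_size)
        hi = min(seq_len - 1, i + window_size)
        scores = Q[i] @ K[lo:hi + 1].T / np.sqrt(d_k)
        # Numerically stable softmax over the local window only
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi + 1]
    return out
```

Because the softmax is taken over only the positions inside the window, boundary tokens automatically renormalize over their smaller neighborhoods.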
Q = np.array([[1.0], [1.0], [1.0]])
K = np.array([[1.0], [1.0], [1.0]])
V = np.array([[1.0], [2.0], [3.0]])
window_size = 1

Expected output: [[1.5], [2.0], [2.5]]

With a window radius of 1, each token attends to itself and its immediate neighbors:
• Position 0: Attends to positions [0, 1]. Since Q·K = 1 for both, attention weights are [0.5, 0.5]. Output = 0.5 × 1.0 + 0.5 × 2.0 = 1.5
• Position 1: Attends to positions [0, 1, 2]. Equal attention weights of ≈0.333 each. Output = (1.0 + 2.0 + 3.0) / 3 = 2.0
• Position 2: Attends to positions [1, 2]. Equal weights of [0.5, 0.5]. Output = 0.5 × 2.0 + 0.5 × 3.0 = 2.5
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
window_size = 0

Expected output: [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

With window_size = 0, each token attends only to itself. This creates an identity-like behavior where each output is simply its own value:

• Position 0: Only sees itself → attention weight = 1.0 → Output = V[0] = [1.0, 2.0]
• Position 1: Only sees itself → attention weight = 1.0 → Output = V[1] = [3.0, 4.0]
• Position 2: Only sees itself → attention weight = 1.0 → Output = V[2] = [5.0, 6.0]
This demonstrates the extreme case of purely local (single-token) attention.
Q = np.array([[1.0], [1.0], [1.0], [1.0]])
K = np.array([[1.0], [1.0], [1.0], [1.0]])
V = np.array([[1.0], [2.0], [3.0], [4.0]])
window_size = 10

Expected output: [[2.5], [2.5], [2.5], [2.5]]

When window_size (10) exceeds the sequence length (4), every position can attend to every other position, which is equivalent to full self-attention.
Since all Q·K products are equal (uniform attention), each position's output is the mean of all values:
Output = (1.0 + 2.0 + 3.0 + 4.0) / 4 = 2.5 for all positions.
This shows how localized attention gracefully degrades to full attention when the window is sufficiently large.
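The full-attention limit in Example 3 can be checked directly: with identical queries and keys, plain (unwindowed) softmax attention produces uniform weights, and every output is the mean of the values. A quick self-contained check (variable names are illustrative):

```python
import numpy as np

Q = K = np.ones((4, 1))
V = np.array([[1.0], [2.0], [3.0], [4.0]])

scores = Q @ K.T / np.sqrt(1)                    # all scores equal (1.0)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # uniform 0.25 per row
out = weights @ V                                # mean of V for every position
print(out)
```

Since a window radius of 10 covers the whole 4-token sequence, the localized computation reduces to exactly this calculation.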
Constraints