In modern sequence modeling, particularly when dealing with extremely long sequences, computing full self-attention becomes computationally prohibitive due to its O(n²) complexity. The Localized Sliding Attention mechanism addresses this challenge by restricting each token's attention to a local neighborhood, dramatically reducing computational costs while preserving the model's ability to capture relevant contextual information.
The core idea behind localized sliding attention is elegantly simple yet powerful: instead of allowing each token to attend to every other token in the sequence, we define a window of radius w around each position. A token at position i can only attend to tokens within the range [max(0, i - w), min(seq_len - 1, i + w)], where seq_len is the total sequence length.
This approach creates a band-diagonal attention pattern, where attention is concentrated along the diagonal of the attention matrix, with bandwidth determined by the window size.
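The band-diagonal pattern is easy to visualize as a boolean mask. The snippet below is a small illustration (the variable names `seq_len`, `w`, and `mask` are chosen here for clarity, not taken from any reference implementation):

```python
import numpy as np

seq_len, w = 6, 1
idx = np.arange(seq_len)
# True where |i - j| <= w: a band of width 2w + 1 around the diagonal
mask = np.abs(idx[:, None] - idx[None, :]) <= w
print(mask.astype(int))
```

Each row `i` of the printed matrix has ones only in columns `i - w` through `i + w`, clipped at the sequence boundaries.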
For each query position i, the localized attention is computed as follows:
1. Define the local window: determine the attending positions $\mathcal{N}_i = \{j : |i - j| \leq w\}$.
2. Compute scaled dot-product scores for positions within the window: $$s_{ij} = \frac{Q_i \cdot K_j^T}{\sqrt{d_k}} \quad \text{for } j \in \mathcal{N}_i$$
3. Apply softmax normalization over the local window: $$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(s_{ik})}$$
4. Compute the weighted sum of values: $$\text{Output}_i = \sum_{j \in \mathcal{N}_i} \alpha_{ij} \cdot V_j$$
Tokens near the beginning or end of the sequence naturally have smaller effective windows since they cannot attend to positions beyond the sequence boundaries. No padding is added—the window simply truncates at the boundaries, and the softmax normalization automatically adjusts to the reduced set of attending positions.
Return a NumPy array of shape (seq_len, d_v) containing the attention-weighted output for each position in the sequence.
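The steps above can be sketched directly in NumPy. This is a minimal reference sketch, not a definitive implementation; the function name `sliding_window_attention` is an assumption for illustration:

```python
import numpy as np

def sliding_window_attention(Q, K, V, window_size):
    """Localized sliding attention sketch.

    Q, K: (seq_len, d_k); V: (seq_len, d_v); returns (seq_len, d_v).
    """
    seq_len, d_k = Q.shape
    out = np.zeros((seq_len, V.shape[1]))
    for i in range(seq_len):
        # Window truncates at the sequence boundaries; no padding is added.
        lo = max(0, i - window_size)
        hi = min(seq_len - 1, i + window_size)
        scores = Q[i] @ K[lo:hi + 1].T / np.sqrt(d_k)
        # Numerically stable softmax over the local window only
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi + 1]
    return out
```

Because the softmax is taken over only the positions inside the window, boundary tokens automatically renormalize over their smaller neighborhoods.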
Q = np.array([[1.0], [1.0], [1.0]])
K = np.array([[1.0], [1.0], [1.0]])
V = np.array([[1.0], [2.0], [3.0]])
window_size = 1

Expected output: [[1.5], [2.0], [2.5]]

With a window radius of 1, each token attends to itself and its immediate neighbors:
• Position 0: Attends to positions [0, 1]. Since Q·K = 1 for both, attention weights are [0.5, 0.5]. Output = 0.5 × 1.0 + 0.5 × 2.0 = 1.5
• Position 1: Attends to positions [0, 1, 2]. Equal attention weights of ≈0.333 each. Output = (1.0 + 2.0 + 3.0) / 3 = 2.0
• Position 2: Attends to positions [1, 2]. Equal weights of [0.5, 0.5]. Output = 0.5 × 2.0 + 0.5 × 3.0 = 2.5
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
window_size = 0

Expected output: [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

With window_size = 0, each token attends only to itself. This creates an identity-like behavior where each output is simply its own value:

• Position 0: Only sees itself → attention weight = 1.0 → Output = V[0] = [1.0, 2.0]
• Position 1: Only sees itself → attention weight = 1.0 → Output = V[1] = [3.0, 4.0]
• Position 2: Only sees itself → attention weight = 1.0 → Output = V[2] = [5.0, 6.0]
This demonstrates the extreme case of purely local (single-token) attention.
Q = np.array([[1.0], [1.0], [1.0], [1.0]])
K = np.array([[1.0], [1.0], [1.0], [1.0]])
V = np.array([[1.0], [2.0], [3.0], [4.0]])
window_size = 10

Expected output: [[2.5], [2.5], [2.5], [2.5]]

When window_size (10) exceeds the sequence length (4), every position can attend to every other position, which is equivalent to full self-attention.
Since all Q·K products are equal (uniform attention), each position's output is the mean of all values:
Output = (1.0 + 2.0 + 3.0 + 4.0) / 4 = 2.5 for all positions.
This shows how localized attention gracefully degrades to full attention when the window is sufficiently large.
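The full-attention limit in Example 3 can be checked directly: with identical queries and keys, plain (unwindowed) softmax attention produces uniform weights, and every output is the mean of the values. A quick self-contained check (variable names are illustrative):

```python
import numpy as np

Q = K = np.ones((4, 1))
V = np.array([[1.0], [2.0], [3.0], [4.0]])

scores = Q @ K.T / np.sqrt(1)                    # all scores equal (1.0)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # uniform 0.25 per row
out = weights @ V                                # mean of V for every position
print(out)
```

Since a window radius of 10 covers the whole 4-token sequence, the localized computation reduces to exactly this calculation.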
Constraints