The positional encoding approaches we've examined so far—sinusoidal and learned embeddings—share a common philosophy: they assign a unique representation to each absolute position in the sequence. Position 5 always has the same encoding, regardless of what other positions exist or what tokens occupy them.
But is absolute position really what we need? Consider these linguistic observations: an adjective modifies the noun right next to it whether the phrase opens a document or closes it; subject-verb agreement depends on how far apart the two words are, not on where the clause sits; and a pronoun's antecedent is usually found a short distance back, regardless of its absolute offset.
These observations motivate relative positional representations—approaches that encode the distance or relationship between position pairs rather than the positions themselves.
This paradigm shift, pioneered by models like Transformer-XL and T5, has profound implications for generalization, computational efficiency, and what the model can learn about positional structure.
Instead of asking "What is the encoding of position i?", relative methods ask "What is the relationship between positions i and j?" This shift from absolute coordinates to pairwise relationships enables translation invariance and better length generalization.
The Translation Invariance Principle
In many sequence processing tasks, the meaning of positional relationships is translation invariant—shifting all positions by a constant doesn't change the relevant structure.
Consider the sentence: "The quick brown fox jumps."
Whether this appears at positions [0, 1, 2, 3, 4] or [100, 101, 102, 103, 104], its internal structure is identical: "quick" and "brown" modify "fox," and "fox" is the subject of "jumps." The grammatical relationships depend only on how far apart the words are.
Absolute positional encoding breaks this invariance—the same sentence at different positions has different representations. Relative encoding preserves it.
Mathematical Formalization
In absolute positional encoding, the attention score between positions $i$ and $j$ includes terms that depend on $PE(i)$ and $PE(j)$ separately:
$$\text{score}(i, j) = f(x_i, x_j, PE(i), PE(j))$$
In relative positional encoding, the score depends on the offset $i - j$:
$$\text{score}(i, j) = f(x_i, x_j, R_{i-j})$$
where $R_{i-j}$ is a representation of the relative distance.
This formulation ensures: $$\text{score}(i, j) = \text{score}(i+k, j+k) \quad \forall k \in \mathbb{Z}$$
The attention pattern is invariant to absolute position—only relative positions matter.
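To make the invariance concrete, here is a minimal sketch (NumPy, with an arbitrary toy bias function standing in for a learned one) showing that any score built from pairwise offsets is unchanged when every position is shifted by a constant:

```python
import numpy as np

def relative_offsets(positions: np.ndarray) -> np.ndarray:
    """Matrix of pairwise offsets: offsets[i, j] = positions[j] - positions[i]."""
    return positions[None, :] - positions[:, None]

# The same sentence placed at two different absolute locations
a = np.arange(0, 5)        # positions [0, 1, 2, 3, 4]
b = np.arange(100, 105)    # positions [100, 101, 102, 103, 104]

# Any score that depends only on the offset matrix is identical in both cases
toy_bias = lambda offsets: -np.abs(offsets)   # e.g., penalize distance
print(np.array_equal(toy_bias(relative_offsets(a)),
                     toy_bias(relative_offsets(b))))  # True
```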
Pure relative position encoding loses absolute position information. This matters for tasks where absolute position is meaningful—e.g., "the first word is usually the subject in English" or "document titles appear at position 0." Practical implementations often combine relative and absolute signals.
The foundational work on relative position in transformers came from Shaw et al. (2018), which proposed adding relative position representations directly into the attention mechanism.
The Key Insight
Instead of adding position to embeddings (which then propagates through Q, K, V projections), Shaw et al. inject relative position directly into the attention score computation.
Modified Attention Formulation
Standard self-attention: $$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_k}}$$
Shaw et al.'s relative attention: $$e_{ij} = \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^T}{\sqrt{d_k}}$$
And for the values: $$z_i = \sum_j \alpha_{ij}(x_j W^V + a_{ij}^V)$$
Where $a_{ij}^K$ and $a_{ij}^V$ are learned embeddings indexed by the (clipped) relative distance $j - i$: $a_{ij}^K$ augments the keys when computing attention scores, and $a_{ij}^V$ augments the values when aggregating the output.
Clipped Distance
The relative distance is clipped to a maximum range $[-k, k]$:
$$\text{clip}(x, -k, k) = \max(-k, \min(k, x))$$
This ensures a finite set of $2k+1$ relative position embeddings, regardless of sequence length. Positions more than $k$ apart are treated as "maximally distant."
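A quick sketch of the clipping rule, assuming a toy sequence of length 6 and $k = 2$, makes the index structure visible: every diagonal of the matrix shares one index, and diagonals beyond $\pm k$ collapse onto the boundary values.

```python
import torch

seq_len, k = 6, 2
pos = torch.arange(seq_len)
rel = pos[None, :] - pos[:, None]   # rel[i, j] = j - i
idx = rel.clamp(-k, k) + k          # shift into [0, 2k] for embedding lookup
print(idx)
# tensor([[2, 3, 4, 4, 4, 4],
#         [1, 2, 3, 4, 4, 4],
#         [0, 1, 2, 3, 4, 4],
#         [0, 0, 1, 2, 3, 4],
#         [0, 0, 0, 1, 2, 3],
#         [0, 0, 0, 0, 1, 2]])
```

These are exactly the indices used to look up $a_{ij}^K$ and $a_{ij}^V$ in the implementation below.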
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class ShawRelativeAttention(nn.Module):
    """
    Relative Position Representations (Shaw et al., 2018)

    Adds learned relative position embeddings to keys and values
    in the attention computation.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        max_relative_position: int = 16,  # The 'k' in the paper (clip range)
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.max_relative_position = max_relative_position

        # Standard QKV projections
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

        # Relative position embeddings for keys and values
        # Total positions: 2 * max_relative_position + 1 (from -k to +k)
        num_relative_positions = 2 * max_relative_position + 1
        self.rel_pos_k = nn.Embedding(num_relative_positions, self.d_k)
        self.rel_pos_v = nn.Embedding(num_relative_positions, self.d_k)

        self.dropout = nn.Dropout(dropout)

        # Initialize
        nn.init.xavier_uniform_(self.rel_pos_k.weight)
        nn.init.xavier_uniform_(self.rel_pos_v.weight)

    def _get_relative_positions(self, seq_len: int, device: torch.device):
        """
        Compute relative position indices for all pairs.

        Returns matrix R where R[i, j] = clip(j - i, -k, k) + k
        (shifted by k to get non-negative indices for embedding lookup)
        """
        # Create position index grid
        positions = torch.arange(seq_len, device=device)

        # Compute relative distances: rel_pos[i, j] = j - i
        rel_pos = positions.unsqueeze(0) - positions.unsqueeze(1)

        # Clip to [-max_rel, max_rel] and shift to [0, 2*max_rel]
        rel_pos_clipped = rel_pos.clamp(
            -self.max_relative_position, self.max_relative_position
        )
        rel_pos_indices = rel_pos_clipped + self.max_relative_position

        return rel_pos_indices

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        """
        Compute self-attention with relative position representations.

        Args:
            x: Input [batch_size, seq_len, d_model]
            mask: Optional attention mask [batch_size, 1, seq_len, seq_len]

        Returns:
            Output [batch_size, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        Q = self.W_Q(x).view(batch_size, seq_len, self.num_heads, self.d_k)
        K = self.W_K(x).view(batch_size, seq_len, self.num_heads, self.d_k)
        V = self.W_V(x).view(batch_size, seq_len, self.num_heads, self.d_k)

        # Transpose for attention: [batch, heads, seq_len, d_k]
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)

        # Get relative position indices and embeddings
        rel_pos_indices = self._get_relative_positions(seq_len, x.device)
        rel_pos_k = self.rel_pos_k(rel_pos_indices)  # [seq_len, seq_len, d_k]
        rel_pos_v = self.rel_pos_v(rel_pos_indices)  # [seq_len, seq_len, d_k]

        # Standard content-based attention scores
        content_scores = torch.matmul(Q, K.transpose(-2, -1))  # [batch, heads, seq, seq]

        # Relative position contribution to attention
        # Q: [batch, heads, seq, 1, d_k] broadcast against
        # rel_pos_k: [1, 1, seq, seq, d_k], then sum over d_k
        Q_expanded = Q.unsqueeze(3)                             # [batch, heads, seq, 1, d_k]
        rel_pos_k_expanded = rel_pos_k.unsqueeze(0).unsqueeze(0)  # [1, 1, seq, seq, d_k]
        position_scores = (Q_expanded * rel_pos_k_expanded).sum(-1)  # [batch, heads, seq, seq]

        # Combined scores
        scores = (content_scores + position_scores) / np.sqrt(self.d_k)

        # Apply mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Softmax
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply to values (content + position)
        content_output = torch.matmul(attn_weights, V)  # [batch, heads, seq, d_k]

        # Position contribution to values
        attn_weights_expanded = attn_weights.unsqueeze(-1)        # [batch, heads, seq, seq, 1]
        rel_pos_v_expanded = rel_pos_v.unsqueeze(0).unsqueeze(0)  # [1, 1, seq, seq, d_k]
        position_output = (attn_weights_expanded * rel_pos_v_expanded).sum(3)  # [batch, heads, seq, d_k]

        # Combine
        output = content_output + position_output

        # Reshape and project
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        output = self.W_O(output)

        return output


# Demonstration
def demo_shaw_attention():
    batch_size, seq_len, d_model = 2, 64, 256

    attention = ShawRelativeAttention(
        d_model=d_model,
        num_heads=8,
        max_relative_position=16
    )

    x = torch.randn(batch_size, seq_len, d_model)
    output = attention(x)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Relative position embeddings: {2 * 16 + 1} distances × {d_model // 8} dim")
    print(f"  = {33 * 32 * 2} total parameters for position (K and V)")

    return attention


demo_shaw_attention()
```

Shaw et al.'s approach adds computational cost: a relative position embedding must be looked up and multiplied for each pair of positions. Clipping keeps the embedding table finite, but the pairwise computation is still O(seq_len² × d_k), the same order as standard attention, with larger constant factors.
Transformer-XL (Dai et al., 2019) introduced a more sophisticated relative position mechanism designed to handle very long sequences through segment-level recurrence.
The Decomposed Attention Score
Transformer-XL decomposes the attention score into four interpretable components. Starting from:
$$e_{ij} = (x_i W^Q)(x_j W^K)^T$$
With absolute position embeddings $U_i, U_j$:
$$e_{ij} = (x_i + U_i) W^Q (W^K)^T (x_j + U_j)^T$$
$$= \underbrace{x_i W^Q (W^K)^T x_j^T}_{(a)} + \underbrace{x_i W^Q (W^K)^T U_j^T}_{(b)} + \underbrace{U_i W^Q (W^K)^T x_j^T}_{(c)} + \underbrace{U_i W^Q (W^K)^T U_j^T}_{(d)}$$
Transformer-XL's Reformulation
Transformer-XL replaces the absolute position terms as follows: every appearance of the key position $U_j$ becomes a sinusoidal relative encoding $R_{i-j}$, projected by a dedicated matrix $W^{K,R}$, and the query-side products $U_i W^Q$ in terms (c) and (d) become two learned, position-independent bias vectors $u$ and $v$:
$$(a): x_i W^Q (W^K)^T x_j^T \quad \text{(content-based)}$$ $$(b): x_i W^Q (W^{K,R})^T R_{i-j}^T \quad \text{(content-dependent position bias)}$$ $$(c): u (W^K)^T x_j^T \quad \text{(global content bias)}$$ $$(d): v (W^{K,R})^T R_{i-j}^T \quad \text{(global position bias)}$$
Each term has an interpretable role:
| Term | Formula | Meaning |
|---|---|---|
| (a) Content-Content | $x_i W^Q (W^K)^T x_j^T$ | Query content attending to key content |
| (b) Content-Position | $x_i W^Q (W^{K,R})^T R_{i-j}^T$ | Query content attending to relative position |
| (c) Position-Content | $u (W^K)^T x_j^T$ | Global query bias for key content |
| (d) Position-Position | $v (W^{K,R})^T R_{i-j}^T$ | Global query bias for relative position |
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class TransformerXLAttention(nn.Module):
    """
    Transformer-XL Relative Positional Attention

    Features:
    - Decomposed 4-term attention
    - Sinusoidal relative position embeddings
    - Learned global query biases (u, v)
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_head = d_model // num_heads

        # Projections
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_KR = nn.Linear(d_model, d_model, bias=False)  # For relative position
        self.W_O = nn.Linear(d_model, d_model, bias=False)

        # Global query biases (shared across all positions)
        # u: for content attention, v: for position attention
        self.u = nn.Parameter(torch.randn(num_heads, self.d_head))
        self.v = nn.Parameter(torch.randn(num_heads, self.d_head))

        self.dropout = nn.Dropout(dropout)
        self.scale = np.sqrt(self.d_head)

    def _create_sinusoidal_embeddings(self, seq_len: int, device: torch.device):
        """Create sinusoidal relative position embeddings."""
        # Positions from -(seq_len-1) to +(seq_len-1)
        positions = torch.arange(-(seq_len - 1), seq_len, device=device).float()

        # Frequency terms
        dim_indices = torch.arange(0, self.d_model, 2, device=device).float()
        div_term = torch.exp(dim_indices * (-np.log(10000.0) / self.d_model))

        # Create embeddings: row m corresponds to relative position m - (seq_len - 1)
        R = torch.zeros(2 * seq_len - 1, self.d_model, device=device)
        R[:, 0::2] = torch.sin(positions.unsqueeze(1) * div_term)
        R[:, 1::2] = torch.cos(positions.unsqueeze(1) * div_term)

        return R

    def _relative_shift(self, x: torch.Tensor):
        """
        Realign relative position scores.

        Input:  x[..., i, m] = score of query i against relative position m - (seq_len - 1)
        Output: x[..., i, j] = score of query i against relative position j - i
        """
        batch, heads, q, _ = x.shape  # last dim is 2q - 1

        # Append one column, flatten, and re-slice so each row is shifted by one
        x = F.pad(x, (0, 1))                             # [batch, heads, q, 2q]
        x = x.reshape(batch, heads, 2 * q * q)           # flatten the last two dims
        x = x[:, :, q - 1: q - 1 + q * (2 * q - 1)]      # drop the leading q-1 entries
        x = x.reshape(batch, heads, q, 2 * q - 1)

        # Keep the q aligned columns: entry [i, j] now holds relative distance j - i
        return x[:, :, :, :q]

    def forward(self, x: torch.Tensor, memory: torch.Tensor = None):
        """
        Compute attention with Transformer-XL relative position.

        Args:
            x: Input [batch, seq_len, d_model]
            memory: Optional memory from previous segment [batch, mem_len, d_model]
                    (segment-level recurrence is omitted in this simplified version)
        """
        batch_size, seq_len, _ = x.shape

        # Compute Q, K, V
        Q = self.W_Q(x)  # [batch, seq_len, d_model]
        K = self.W_K(x)  # [batch, seq_len, d_model]
        V = self.W_V(x)  # [batch, seq_len, d_model]

        # Reshape for multi-head: [batch, seq_len, heads, d_head]
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_head)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_head)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_head)

        # Get relative position embeddings
        R = self._create_sinusoidal_embeddings(seq_len, x.device)
        R = self.W_KR(R)  # Project: [2*seq_len-1, d_model]
        R = R.view(2 * seq_len - 1, self.num_heads, self.d_head)

        # Reshape for attention: [batch, heads, seq_len, d_head]
        Q_t = Q.transpose(1, 2)
        K_t = K.transpose(1, 2)
        V_t = V.transpose(1, 2)
        R_t = R.transpose(0, 1)  # [heads, 2*seq_len-1, d_head]

        # Term (a): content -> content
        AC = torch.matmul(Q_t, K_t.transpose(-2, -1))  # [batch, heads, seq, seq]

        # Term (b): content -> position
        BD = torch.einsum('bhid,hjd->bhij', Q_t, R_t)  # [batch, heads, seq, 2*seq-1]
        BD = self._relative_shift(BD)                  # [batch, heads, seq, seq]

        # Term (c): global bias -> content
        global_content = torch.einsum('hd,bhkd->bhk', self.u, K_t)  # [batch, heads, seq]
        global_content = global_content.unsqueeze(2)                # [batch, heads, 1, seq]

        # Term (d): global bias -> position
        global_position = torch.einsum('hd,hjd->hj', self.v, R_t)   # [heads, 2*seq-1]
        global_position = global_position.unsqueeze(0).unsqueeze(2)  # [1, heads, 1, 2*seq-1]
        global_position = global_position.expand(batch_size, -1, seq_len, -1)
        global_position = self._relative_shift(global_position)

        # Combine all terms
        attn_scores = AC + BD + global_content + global_position
        attn_scores = attn_scores / self.scale

        # Causal mask
        mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        attn_scores.masked_fill_(mask.unsqueeze(0).unsqueeze(0), float('-inf'))

        # Softmax and apply
        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        output = torch.matmul(attn_weights, V_t)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        output = self.W_O(output)

        return output


# Demonstration
def demo_transformer_xl():
    batch_size, seq_len, d_model = 2, 128, 512

    attention = TransformerXLAttention(
        d_model=d_model,
        num_heads=8
    )

    x = torch.randn(batch_size, seq_len, d_model)
    output = attention(x)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Global biases (u, v): 2 × {8} heads × {64} d_head = {2 * 8 * 64} params")


demo_transformer_xl()
```

The `_relative_shift` operation is crucial for efficiency. Instead of gathering a separate relative encoding for every $(i, j)$ pair, we multiply $Q$ against all $2n-1$ relative positions once, then shift the result so that entry $[i, j]$ contains the score for relative distance $j - i$. This keeps the positional term to a single matrix multiply and O(n²) scores.
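To see the shift trick in isolation, here is a small standalone check, mirroring the padding-and-reshape steps of the `_relative_shift` method above on a matrix whose entries simply record which relative position each column stands for:

```python
import torch
import torch.nn.functional as F

q = 4
# raw[i, m] encodes the relative position that column m stands for: m - (q - 1)
raw = (torch.arange(2 * q - 1) - (q - 1)).repeat(q, 1).float()  # [q, 2q-1]

x = F.pad(raw, (0, 1)).reshape(-1)        # append one column, flatten
x = x[q - 1: q - 1 + q * (2 * q - 1)]     # drop the leading q-1 entries
x = x.reshape(q, 2 * q - 1)[:, :q]        # keep the q aligned columns

print(x)
# tensor([[ 0.,  1.,  2.,  3.],
#         [-1.,  0.,  1.,  2.],
#         [-2., -1.,  0.,  1.],
#         [-3., -2., -1.,  0.]])   -> entry [i, j] is exactly j - i
```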
T5 (Raffel et al., 2020) introduced a simpler form of relative position encoding: learned scalar biases added directly to attention scores.
The T5 Approach
Instead of relative position embeddings (d-dimensional vectors), T5 uses relative position biases (scalar values):
$$e_{ij} = \frac{x_i W^Q (x_j W^K)^T}{\sqrt{d_k}} + b_{i-j}$$
Where $b_{i-j}$ is a learned scalar bias for relative position $i-j$.
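As a minimal illustration, assuming a hypothetical 1-D table `b` of per-offset scalars and ignoring T5's bucketing for the moment, the biases form a Toeplitz matrix (constant along diagonals) that is simply added to the attention logits:

```python
import torch

seq_len = 5
# Hypothetical learned scalars b for offsets -(seq_len-1) .. +(seq_len-1)
b = torch.nn.Parameter(torch.randn(2 * seq_len - 1))

pos = torch.arange(seq_len)
offsets = pos[None, :] - pos[:, None]        # offsets[i, j] = j - i
bias = b[offsets + (seq_len - 1)]            # [seq_len, seq_len], constant along diagonals

logits = torch.randn(seq_len, seq_len)       # stand-in for QK^T / sqrt(d_k)
attn = torch.softmax(logits + bias, dim=-1)  # bias added directly to the logits
print(bias)
```

Because the bias is a single scalar per head rather than a vector per position pair, the extra cost over vanilla attention is negligible.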
Key Simplifications

- Each relative position contributes a single learned scalar per head, not a $d_k$-dimensional embedding.
- The bias is added directly to the attention logits; keys and values are left untouched.
- Relative distances are grouped into a fixed number of buckets (exact buckets for small distances, logarithmically spaced buckets for larger ones), which bounds the parameter count and handles arbitrary sequence lengths.
```python
import torch
import torch.nn as nn
import numpy as np


class T5RelativePositionBias(nn.Module):
    """
    T5-style relative position bias.

    Uses logarithmic bucketing to reduce the number of unique biases
    while still distinguishing between fine-grained nearby distances
    and coarser far distances.
    """

    def __init__(
        self,
        num_heads: int,
        num_buckets: int = 32,
        max_distance: int = 128,
        bidirectional: bool = True
    ):
        """
        Args:
            num_heads: Number of attention heads
            num_buckets: Number of distinct position bias buckets
            max_distance: Maximum distance to consider (beyond this, use max bucket)
            bidirectional: If True, distinguish positive/negative offsets
        """
        super().__init__()
        self.num_heads = num_heads
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.bidirectional = bidirectional

        # Learned bias for each bucket and each head
        self.relative_attention_bias = nn.Embedding(num_buckets, num_heads)

    @staticmethod
    def _relative_position_bucket(
        relative_position: torch.Tensor,
        bidirectional: bool,
        num_buckets: int,
        max_distance: int
    ) -> torch.Tensor:
        """
        Convert relative positions to bucket indices.

        The bucket scheme:
        - Bucket 0: relative position = 0
        - Buckets 1 to num_buckets//2: exact small distances
        - Remaining buckets: logarithmically spaced for larger distances
        """
        relative_buckets = 0

        if bidirectional:
            # Separate buckets for positive and negative distances
            num_buckets = num_buckets // 2
            relative_buckets += (relative_position > 0).long() * num_buckets
            relative_position = relative_position.abs()
        else:
            # Causal: clamp negative to 0
            relative_position = -torch.min(
                relative_position, torch.zeros_like(relative_position)
            )

        # Half buckets for exact small distances
        max_exact = num_buckets // 2
        is_small = relative_position < max_exact

        # Logarithmic bucketing for larger distances
        relative_position_if_large = max_exact + (
            torch.log(relative_position.float() / max_exact)
            / np.log(max_distance / max_exact)
            * (num_buckets - max_exact)
        ).long()
        relative_position_if_large = torch.min(
            relative_position_if_large,
            torch.full_like(relative_position_if_large, num_buckets - 1)
        )

        relative_buckets += torch.where(
            is_small, relative_position, relative_position_if_large
        )

        return relative_buckets

    def compute_bias(self, query_length: int, key_length: int, device: torch.device):
        """
        Compute relative position bias matrix.

        Args:
            query_length: Length of query sequence
            key_length: Length of key sequence
            device: Device for tensors

        Returns:
            Bias tensor of shape [1, num_heads, query_length, key_length]
        """
        # Create relative position matrix
        context_position = torch.arange(query_length, device=device)[:, None]
        memory_position = torch.arange(key_length, device=device)[None, :]
        relative_position = memory_position - context_position  # [query_len, key_len]

        # Convert to bucket indices
        relative_position_bucket = self._relative_position_bucket(
            relative_position,
            bidirectional=self.bidirectional,
            num_buckets=self.num_buckets,
            max_distance=self.max_distance
        )

        # Look up biases: [query_len, key_len, num_heads]
        values = self.relative_attention_bias(relative_position_bucket)

        # Reshape to [1, num_heads, query_len, key_len]
        values = values.permute(2, 0, 1).unsqueeze(0)

        return values

    def forward(self, query_length: int, key_length: int, device: torch.device):
        return self.compute_bias(query_length, key_length, device)


class T5Attention(nn.Module):
    """
    T5-style attention with relative position bias.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        num_buckets: int = 32,
        max_distance: int = 128,
        dropout: float = 0.1
    ):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads

        # Linear projections (T5 uses no bias in projections)
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)

        # Relative position bias
        self.relative_position_bias = T5RelativePositionBias(
            num_heads=num_heads,
            num_buckets=num_buckets,
            max_distance=max_distance
        )

        self.dropout = nn.Dropout(dropout)
        self.scale = np.sqrt(self.d_head)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        batch_size, seq_len, d_model = x.shape

        # Compute Q, K, V
        Q = self.W_Q(x).view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        K = self.W_K(x).view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        V = self.W_V(x).view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        # Content-based attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        # Add relative position bias
        position_bias = self.relative_position_bias(seq_len, seq_len, x.device)
        scores = scores + position_bias

        # Apply mask and softmax
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = torch.nn.functional.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply to values
        output = torch.matmul(attn_weights, V)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        output = self.W_O(output)

        return output


# Demonstration
def demo_t5_bias():
    bias_module = T5RelativePositionBias(num_heads=8, num_buckets=32, max_distance=128)

    # Show bucketing behavior
    print("=== T5 Logarithmic Bucket Scheme ===\n")

    test_positions = torch.tensor([0, 1, 2, 3, 4, 5, 10, 20, 50, 100, 200])
    buckets = T5RelativePositionBias._relative_position_bucket(
        test_positions, bidirectional=True, num_buckets=32, max_distance=128
    )

    print("Distance -> Bucket")
    for pos, bucket in zip(test_positions.tolist(), buckets.tolist()):
        print(f"  {pos:4d} -> {bucket:2d}")

    print(f"\nBias matrix shape: {bias_module(64, 64, torch.device('cpu')).shape}")
    print(f"Total bias parameters: {32 * 8} (buckets × heads)")


demo_t5_bias()
```

The logarithmic bucketing scheme reflects diminishing returns of precise distance information. Whether a word is 100 or 101 positions away matters much less than whether it's 3 or 4 positions away. By using exact buckets for small distances and logarithmic buckets for large ones, T5 focuses parameters where they matter most.
We've examined three major relative position approaches. Let's compare them systematically:
| Aspect | Shaw et al. | Transformer-XL | T5 Bias |
|---|---|---|---|
| Position Input | Learned embeddings | Sinusoidal embeddings | Learned scalar biases |
| Integration Point | Keys and Values | Keys only (4-term) | Attention logits |
| Parameters | O(k × d) | O(d) + global biases | O(buckets × heads) |
| Computation | Medium | Higher | Low |
| Clipping/Bucketing | Hard clip at ±k | No clip (unbounded) | Logarithmic buckets |
| Length Generalization | Limited | Good | Good (if within max_distance) |
| Per-Head Patterns | Shared embeddings | Learned u, v per head | Independent biases |
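For a rough sense of scale, here is a back-of-the-envelope sketch, assuming d_model = 512 with 8 heads, Shaw's k = 16, and T5's 32 buckets (matching the demos above), counting only the position-specific parameters:

```python
d_model, num_heads = 512, 8
d_head = d_model // num_heads

# Shaw et al.: (2k + 1) relative embeddings of size d_k, for keys and for values
k = 16
shaw_params = (2 * k + 1) * d_head * 2

# Transformer-XL: sinusoidal R is parameter-free; the learned extras are the global
# biases u and v (the shared projection W_KR adds d_model * d_model, like any projection)
xl_params = 2 * num_heads * d_head

# T5: one scalar per (bucket, head)
num_buckets = 32
t5_params = num_buckets * num_heads

print(f"Shaw et al.:    {shaw_params:>6,} position-specific parameters")
print(f"Transformer-XL: {xl_params:>6,} (plus the W_KR projection)")
print(f"T5 bias:        {t5_params:>6,}")
```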
Empirical Findings
Research comparing these approaches reveals:
Task Sensitivity: Different tasks benefit from different approaches
Training Stability: T5's simple additive bias is generally most stable to train
Computational Efficiency: T5 has the lowest overhead; Transformer-XL has more complex indexing
Interpretability: T5 biases are easiest to analyze (just plot bias vs. distance)
One of the primary motivations for relative position encoding is better length generalization—the ability to handle sequences longer than those seen during training. Let's analyze this carefully.
Why Relative Positions Generalize Better
With absolute positions, seeing position 512 at inference after training only on positions 0-256 means encountering completely unseen embeddings. With relative positions, most of the pairwise offsets that arise at inference (distances 1, 2, 3, ...) were already observed many times during training; only the largest offsets are new, and clipping or bucketing maps them onto representations the model has already learned.
The Remaining Challenges
However, relative position doesn't solve everything: all distances beyond the clipping range or the largest bucket collapse onto a single representation, and attention spread over many more keys becomes more diffuse than anything encountered during training. The analysis below illustrates both effects.
```python
import numpy as np


def analyze_length_generalization():
    """
    Analyze how different relative position methods handle length extrapolation.
    """

    # Simulate T5-style bucketing behavior at different distances
    def t5_bucket(distance, num_buckets=32, max_distance=128):
        """Simplified T5 bucketing for analysis."""
        if distance < num_buckets // 4:
            return distance
        else:
            # Logarithmic bucketing
            max_exact = num_buckets // 4
            bucket = max_exact + int(
                np.log(distance / max_exact)
                / np.log(max_distance / max_exact)
                * (num_buckets // 2 - max_exact)
            )
            return min(bucket, num_buckets // 2 - 1)

    # Test distances
    distances = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]

    print("=== T5 Bucketing Behavior at Various Distances ===\n")
    print("Distance -> Bucket (trained max_distance=128)\n")

    for d in distances:
        bucket = t5_bucket(d)
        in_training = "✓" if d <= 128 else "✗ (extrapolation)"
        print(f"  {d:5d} -> Bucket {bucket:2d}  {in_training}")

    print("\n=== Key Observations ===")
    print("1. All distances > 128 map to the same max bucket")
    print("2. Fine-grained distinction only for nearby positions")
    print("3. Model can't distinguish distance 256 from distance 2048")

    # Simulate impact on attention patterns
    print("\n=== Attention Pattern at Different Lengths ===\n")

    def simulate_attention_entropy(seq_len, local_bias_strength=0.5):
        """
        Simulate attention entropy with local position bias.
        Higher entropy = more diffuse attention.
        """
        # Simple model: bias toward nearby positions
        positions = np.arange(seq_len)
        query_pos = seq_len - 1  # Query from last position

        # Distance-based bias (negative for nearby = higher attention)
        distances = np.abs(positions - query_pos)
        logits = -local_bias_strength * np.log1p(distances)

        # Softmax
        exp_logits = np.exp(logits - logits.max())
        attention = exp_logits / exp_logits.sum()

        # Entropy
        entropy = -np.sum(attention * np.log(attention + 1e-10))

        # Effective attention span (positions receiving >1% attention)
        effective_span = np.sum(attention > 0.01)

        return entropy, effective_span

    seq_lengths = [64, 128, 256, 512, 1024, 2048]

    print(f"{'Seq Length':>10} {'Entropy':>10} {'Effective Span':>15}")
    print("-" * 40)

    for seq_len in seq_lengths:
        entropy, span = simulate_attention_entropy(seq_len)
        print(f"{seq_len:>10} {entropy:>10.3f} {span:>15}")

    print("\nNote: Entropy increases with length = attention becomes more diffuse")
    print("This can hurt performance even if relative positions generalize correctly")


analyze_length_generalization()
```

Even with perfect relative position generalization, models may fail at long sequences because they never learned to utilize distant information effectively. Training on short sequences doesn't teach the model that distant context is useful. This is a separate challenge from positional encoding design.
Let's consolidate practical guidance for implementing relative position encoding in your models.
Choosing an Approach
| Use Case | Recommended Method | Rationale |
|---|---|---|
| General-purpose encoder | T5 bias or RoPE | Simple, well-tested, good generalization |
| Long-context LM | RoPE or ALiBi | Best length extrapolation |
| Sequence-to-sequence | T5 bias | Works well for encoder-decoder |
| Very long documents | Transformer-XL | Segment recurrence handles length |
| Research/analysis | T5 bias | Most interpretable patterns |
For new projects in 2024+, consider Rotary Position Embeddings (RoPE) as your default. It combines the benefits of relative position encoding with elegant integration into attention and excellent length generalization. We cover RoPE in depth in the next page.
Relative positional representations marked a significant evolution in how transformers encode sequential structure. Let's consolidate the key insights:

- Relative methods encode the offset between position pairs rather than absolute indices, making attention patterns translation invariant.
- Shaw et al. (2018) inject learned relative embeddings into the keys and values, with distances clipped to ±k.
- Transformer-XL decomposes the attention score into four terms, combining sinusoidal relative encodings with learned global biases u and v.
- T5 reduces the machinery to bucketed scalar biases added directly to the attention logits, the simplest and cheapest of the three.
- Relative encoding improves length generalization, but bucket saturation and increasingly diffuse attention at long lengths remain open challenges.
What's Next: Rotary Position Embeddings (RoPE)
The evolution of positional encoding continues with Rotary Position Embeddings (RoPE), perhaps the most elegant solution yet. RoPE encodes relative position through the very geometry of the attention computation: queries and keys are rotated by position-dependent angles, so their dot product depends only on the relative offset between them.
RoPE has become the standard in modern large language models like LLaMA, Mistral, and many others. Understanding it completes your knowledge of the positional encoding landscape.
You now understand the principles, implementations, and tradeoffs of relative positional encoding. From Shaw et al.'s foundational work through Transformer-XL's decomposition to T5's elegant simplification, you've traced the evolution that led to modern approaches. Next, we explore Rotary Position Embeddings—the state-of-the-art synthesis of these ideas.