The positional encoding approaches we've examined so far—sinusoidal and learned embeddings—share a common philosophy: they assign a unique representation to each absolute position in the sequence. Position 5 always has the same encoding, regardless of what other positions exist or what tokens occupy them.
But is absolute position really what we need? Consider these linguistic observations: an adjective modifies the noun right next to it whether the phrase opens a document or closes it; subject-verb agreement depends on how far apart the two words are, not on where the clause sits; and a pronoun's antecedent is usually found a short distance back, regardless of its absolute offset.
These observations motivate relative positional representations—approaches that encode the distance or relationship between position pairs rather than the positions themselves.
This paradigm shift, pioneered by models like Transformer-XL and T5, has profound implications for generalization, computational efficiency, and what the model can learn about positional structure.
Instead of asking "What is the encoding of position i?", relative methods ask "What is the relationship between positions i and j?" This shift from absolute coordinates to pairwise relationships enables translation invariance and better length generalization.
The Translation Invariance Principle
In many sequence processing tasks, the meaning of positional relationships is translation invariant—shifting all positions by a constant doesn't change the relevant structure.
Consider the sentence: "The quick brown fox jumps."
Whether this appears at positions [0, 1, 2, 3, 4] or [100, 101, 102, 103, 104], its internal structure is identical: "quick" and "brown" modify "fox," and "fox" is the subject of "jumps." The grammatical relationships depend only on how far apart the words are.
Absolute positional encoding breaks this invariance—the same sentence at different positions has different representations. Relative encoding preserves it.
Mathematical Formalization
In absolute positional encoding, the attention score between positions $i$ and $j$ includes terms that depend on $PE(i)$ and $PE(j)$ separately:
$$\text{score}(i, j) = f(x_i, x_j, PE(i), PE(j))$$
In relative positional encoding, the score depends on the offset $i - j$:
$$\text{score}(i, j) = f(x_i, x_j, R_{i-j})$$
where $R_{i-j}$ is a representation of the relative distance.
This formulation ensures: $$\text{score}(i, j) = \text{score}(i+k, j+k) \quad \forall k \in \mathbb{Z}$$
The attention pattern is invariant to absolute position—only relative positions matter.
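To make the invariance concrete, here is a minimal sketch (NumPy, with an arbitrary toy bias function standing in for a learned one) showing that any score built from pairwise offsets is unchanged when every position is shifted by a constant:

```python
import numpy as np

def relative_offsets(positions: np.ndarray) -> np.ndarray:
    """Matrix of pairwise offsets: offsets[i, j] = positions[j] - positions[i]."""
    return positions[None, :] - positions[:, None]

# The same sentence placed at two different absolute locations
a = np.arange(0, 5)        # positions [0, 1, 2, 3, 4]
b = np.arange(100, 105)    # positions [100, 101, 102, 103, 104]

# Any score that depends only on the offset matrix is identical in both cases
toy_bias = lambda offsets: -np.abs(offsets)   # e.g., penalize distance
print(np.array_equal(toy_bias(relative_offsets(a)),
                     toy_bias(relative_offsets(b))))  # True
```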
Pure relative position encoding loses absolute position information. This matters for tasks where absolute position is meaningful—e.g., "the first word is usually the subject in English" or "document titles appear at position 0." Practical implementations often combine relative and absolute signals.
The foundational work on relative position in transformers came from Shaw et al. (2018), which proposed adding relative position representations directly into the attention mechanism.
The Key Insight
Instead of adding position to embeddings (which then propagates through Q, K, V projections), Shaw et al. inject relative position directly into the attention score computation.
Modified Attention Formulation
Standard self-attention: $$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_k}}$$
Shaw et al.'s relative attention: $$e_{ij} = \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^T}{\sqrt{d_k}}$$
And for the values: $$z_i = \sum_j \alpha_{ij}(x_j W^V + a_{ij}^V)$$
Where $a_{ij}^K$ and $a_{ij}^V$ are learned embeddings indexed by the (clipped) relative distance $j - i$: $a_{ij}^K$ augments the keys when computing attention scores, and $a_{ij}^V$ augments the values when aggregating the output.
Clipped Distance
The relative distance is clipped to a maximum range $[-k, k]$:
$$\text{clip}(x, -k, k) = \max(-k, \min(k, x))$$
This ensures a finite set of $2k+1$ relative position embeddings, regardless of sequence length. Positions more than $k$ apart are treated as "maximally distant."
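A quick sketch of the clipping rule, assuming a toy sequence of length 6 and $k = 2$, makes the index structure visible: every diagonal of the matrix shares one index, and diagonals beyond $\pm k$ collapse onto the boundary values.

```python
import torch

seq_len, k = 6, 2
pos = torch.arange(seq_len)
rel = pos[None, :] - pos[:, None]   # rel[i, j] = j - i
idx = rel.clamp(-k, k) + k          # shift into [0, 2k] for embedding lookup
print(idx)
# tensor([[2, 3, 4, 4, 4, 4],
#         [1, 2, 3, 4, 4, 4],
#         [0, 1, 2, 3, 4, 4],
#         [0, 0, 1, 2, 3, 4],
#         [0, 0, 0, 1, 2, 3],
#         [0, 0, 0, 0, 1, 2]])
```

These are exactly the indices used to look up $a_{ij}^K$ and $a_{ij}^V$ in the implementation below.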
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class ShawRelativeAttention(nn.Module):
    """
    Relative Position Representations (Shaw et al., 2018)

    Adds learned relative position embeddings to keys and values
    in the attention computation.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        max_relative_position: int = 16,  # The 'k' in the paper (clip range)
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.max_relative_position = max_relative_position

        # Standard QKV projections
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

        # Relative position embeddings for keys and values
        # Total positions: 2 * max_relative_position + 1 (from -k to +k)
        num_relative_positions = 2 * max_relative_position + 1
        self.rel_pos_k = nn.Embedding(num_relative_positions, self.d_k)
        self.rel_pos_v = nn.Embedding(num_relative_positions, self.d_k)

        self.dropout = nn.Dropout(dropout)

        # Initialize
        nn.init.xavier_uniform_(self.rel_pos_k.weight)
        nn.init.xavier_uniform_(self.rel_pos_v.weight)

    def _get_relative_positions(self, seq_len: int, device: torch.device):
        """
        Compute relative position indices for all pairs.

        Returns matrix R where R[i, j] = clip(j - i, -k, k) + k
        (shifted by k to get non-negative indices for embedding lookup)
        """
        # Create position index grid
        positions = torch.arange(seq_len, device=device)

        # Compute relative distances: rel_pos[i, j] = j - i
        rel_pos = positions.unsqueeze(0) - positions.unsqueeze(1)

        # Clip to [-max_rel, max_rel] and shift to [0, 2*max_rel]
        rel_pos_clipped = rel_pos.clamp(
            -self.max_relative_position, self.max_relative_position
        )
        rel_pos_indices = rel_pos_clipped + self.max_relative_position

        return rel_pos_indices

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        """
        Compute self-attention with relative position representations.

        Args:
            x: Input [batch_size, seq_len, d_model]
            mask: Optional attention mask [batch_size, 1, seq_len, seq_len]

        Returns:
            Output [batch_size, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        Q = self.W_Q(x).view(batch_size, seq_len, self.num_heads, self.d_k)
        K = self.W_K(x).view(batch_size, seq_len, self.num_heads, self.d_k)
        V = self.W_V(x).view(batch_size, seq_len, self.num_heads, self.d_k)

        # Transpose for attention: [batch, heads, seq_len, d_k]
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)

        # Get relative position indices and embeddings
        rel_pos_indices = self._get_relative_positions(seq_len, x.device)
        rel_pos_k = self.rel_pos_k(rel_pos_indices)  # [seq_len, seq_len, d_k]
        rel_pos_v = self.rel_pos_v(rel_pos_indices)  # [seq_len, seq_len, d_k]

        # Standard content-based attention scores
        content_scores = torch.matmul(Q, K.transpose(-2, -1))  # [batch, heads, seq, seq]

        # Relative position contribution to attention
        # Q: [batch, heads, seq, 1, d_k] broadcast against
        # rel_pos_k: [1, 1, seq, seq, d_k], then sum over d_k
        Q_expanded = Q.unsqueeze(3)                             # [batch, heads, seq, 1, d_k]
        rel_pos_k_expanded = rel_pos_k.unsqueeze(0).unsqueeze(0)  # [1, 1, seq, seq, d_k]
        position_scores = (Q_expanded * rel_pos_k_expanded).sum(-1)  # [batch, heads, seq, seq]

        # Combined scores
        scores = (content_scores + position_scores) / np.sqrt(self.d_k)

        # Apply mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Softmax
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply to values (content + position)
        content_output = torch.matmul(attn_weights, V)  # [batch, heads, seq, d_k]

        # Position contribution to values
        attn_weights_expanded = attn_weights.unsqueeze(-1)        # [batch, heads, seq, seq, 1]
        rel_pos_v_expanded = rel_pos_v.unsqueeze(0).unsqueeze(0)  # [1, 1, seq, seq, d_k]
        position_output = (attn_weights_expanded * rel_pos_v_expanded).sum(3)  # [batch, heads, seq, d_k]

        # Combine
        output = content_output + position_output

        # Reshape and project
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        output = self.W_O(output)

        return output


# Demonstration
def demo_shaw_attention():
    batch_size, seq_len, d_model = 2, 64, 256

    attention = ShawRelativeAttention(
        d_model=d_model,
        num_heads=8,
        max_relative_position=16
    )

    x = torch.randn(batch_size, seq_len, d_model)
    output = attention(x)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Relative position embeddings: {2 * 16 + 1} distances × {d_model // 8} dim")
    print(f"  = {33 * 32 * 2} total parameters for position (K and V)")

    return attention


demo_shaw_attention()
```

Shaw et al.'s approach adds computational cost: a relative position embedding must be looked up and multiplied for each pair of positions. Clipping keeps the embedding table finite, but the pairwise computation is still O(seq_len² × d_k), the same order as standard attention, with larger constant factors.
Transformer-XL (Dai et al., 2019) introduced a more sophisticated relative position mechanism designed to handle very long sequences through segment-level recurrence.
The Decomposed Attention Score
Transformer-XL decomposes the attention score into four interpretable components. Starting from:
$$e_{ij} = (x_i W^Q)(x_j W^K)^T$$
With absolute position embeddings $U_i, U_j$:
$$e_{ij} = (x_i + U_i) W^Q (W^K)^T (x_j + U_j)^T$$
$$= \underbrace{x_i W^Q (W^K)^T x_j^T}_{(a)} + \underbrace{x_i W^Q (W^K)^T U_j^T}_{(b)} + \underbrace{U_i W^Q (W^K)^T x_j^T}_{(c)} + \underbrace{U_i W^Q (W^K)^T U_j^T}_{(d)}$$
Transformer-XL's Reformulation
Transformer-XL replaces the absolute position terms as follows: every appearance of the key position $U_j$ becomes a sinusoidal relative encoding $R_{i-j}$, projected by a dedicated matrix $W^{K,R}$, and the query-side products $U_i W^Q$ in terms (c) and (d) become two learned, position-independent bias vectors $u$ and $v$:
$$(a): x_i W^Q (W^K)^T x_j^T \quad \text{(content-based)}$$ $$(b): x_i W^Q (W^{K,R})^T R_{i-j}^T \quad \text{(content-dependent position bias)}$$ $$(c): u (W^K)^T x_j^T \quad \text{(global content bias)}$$ $$(d): v (W^{K,R})^T R_{i-j}^T \quad \text{(global position bias)}$$
Each term has an interpretable role:
| Term | Formula | Meaning |
|---|---|---|
| (a) Content-Content | $x_i W^Q (W^K)^T x_j^T$ | Query content attending to key content |
| (b) Content-Position | $x_i W^Q (W^{K,R})^T R_{i-j}^T$ | Query content attending to relative position |
| (c) Position-Content | $u (W^K)^T x_j^T$ | Global query bias for key content |
| (d) Position-Position | $v (W^{K,R})^T R_{i-j}^T$ | Global query bias for relative position |
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class TransformerXLAttention(nn.Module):
    """
    Transformer-XL Relative Positional Attention

    Features:
    - Decomposed 4-term attention
    - Sinusoidal relative position embeddings
    - Learned global query biases (u, v)
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_head = d_model // num_heads

        # Projections
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_KR = nn.Linear(d_model, d_model, bias=False)  # For relative position
        self.W_O = nn.Linear(d_model, d_model, bias=False)

        # Global query biases (shared across all positions)
        # u: for content attention, v: for position attention
        self.u = nn.Parameter(torch.randn(num_heads, self.d_head))
        self.v = nn.Parameter(torch.randn(num_heads, self.d_head))

        self.dropout = nn.Dropout(dropout)
        self.scale = np.sqrt(self.d_head)

    def _create_sinusoidal_embeddings(self, seq_len: int, device: torch.device):
        """Create sinusoidal relative position embeddings."""
        # Positions from -(seq_len-1) to +(seq_len-1)
        positions = torch.arange(-(seq_len - 1), seq_len, device=device).float()

        # Frequency terms
        dim_indices = torch.arange(0, self.d_model, 2, device=device).float()
        div_term = torch.exp(dim_indices * (-np.log(10000.0) / self.d_model))

        # Create embeddings: row m corresponds to relative position m - (seq_len - 1)
        R = torch.zeros(2 * seq_len - 1, self.d_model, device=device)
        R[:, 0::2] = torch.sin(positions.unsqueeze(1) * div_term)
        R[:, 1::2] = torch.cos(positions.unsqueeze(1) * div_term)

        return R

    def _relative_shift(self, x: torch.Tensor):
        """
        Realign relative position scores.

        Input:  x[..., i, m] = score of query i against relative position m - (seq_len - 1)
        Output: x[..., i, j] = score of query i against relative position j - i
        """
        batch, heads, q, _ = x.shape  # last dim is 2q - 1

        # Append one column, flatten, and re-slice so each row is shifted by one
        x = F.pad(x, (0, 1))                             # [batch, heads, q, 2q]
        x = x.reshape(batch, heads, 2 * q * q)           # flatten the last two dims
        x = x[:, :, q - 1: q - 1 + q * (2 * q - 1)]      # drop the leading q-1 entries
        x = x.reshape(batch, heads, q, 2 * q - 1)

        # Keep the q aligned columns: entry [i, j] now holds relative distance j - i
        return x[:, :, :, :q]

    def forward(self, x: torch.Tensor, memory: torch.Tensor = None):
        """
        Compute attention with Transformer-XL relative position.

        Args:
            x: Input [batch, seq_len, d_model]
            memory: Optional memory from previous segment [batch, mem_len, d_model]
                    (segment-level recurrence is omitted in this simplified version)
        """
        batch_size, seq_len, _ = x.shape

        # Compute Q, K, V
        Q = self.W_Q(x)  # [batch, seq_len, d_model]
        K = self.W_K(x)  # [batch, seq_len, d_model]
        V = self.W_V(x)  # [batch, seq_len, d_model]

        # Reshape for multi-head: [batch, seq_len, heads, d_head]
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_head)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_head)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_head)

        # Get relative position embeddings
        R = self._create_sinusoidal_embeddings(seq_len, x.device)
        R = self.W_KR(R)  # Project: [2*seq_len-1, d_model]
        R = R.view(2 * seq_len - 1, self.num_heads, self.d_head)

        # Reshape for attention: [batch, heads, seq_len, d_head]
        Q_t = Q.transpose(1, 2)
        K_t = K.transpose(1, 2)
        V_t = V.transpose(1, 2)
        R_t = R.transpose(0, 1)  # [heads, 2*seq_len-1, d_head]

        # Term (a): content -> content
        AC = torch.matmul(Q_t, K_t.transpose(-2, -1))  # [batch, heads, seq, seq]

        # Term (b): content -> position
        BD = torch.einsum('bhid,hjd->bhij', Q_t, R_t)  # [batch, heads, seq, 2*seq-1]
        BD = self._relative_shift(BD)                  # [batch, heads, seq, seq]

        # Term (c): global bias -> content
        global_content = torch.einsum('hd,bhkd->bhk', self.u, K_t)  # [batch, heads, seq]
        global_content = global_content.unsqueeze(2)                # [batch, heads, 1, seq]

        # Term (d): global bias -> position
        global_position = torch.einsum('hd,hjd->hj', self.v, R_t)   # [heads, 2*seq-1]
        global_position = global_position.unsqueeze(0).unsqueeze(2)  # [1, heads, 1, 2*seq-1]
        global_position = global_position.expand(batch_size, -1, seq_len, -1)
        global_position = self._relative_shift(global_position)

        # Combine all terms
        attn_scores = AC + BD + global_content + global_position
        attn_scores = attn_scores / self.scale

        # Causal mask
        mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        attn_scores.masked_fill_(mask.unsqueeze(0).unsqueeze(0), float('-inf'))

        # Softmax and apply
        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        output = torch.matmul(attn_weights, V_t)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        output = self.W_O(output)

        return output


# Demonstration
def demo_transformer_xl():
    batch_size, seq_len, d_model = 2, 128, 512

    attention = TransformerXLAttention(
        d_model=d_model,
        num_heads=8
    )

    x = torch.randn(batch_size, seq_len, d_model)
    output = attention(x)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Global biases (u, v): 2 × {8} heads × {64} d_head = {2 * 8 * 64} params")


demo_transformer_xl()
```

The `_relative_shift` operation is crucial for efficiency. Instead of gathering a separate relative encoding for every $(i, j)$ pair, we multiply $Q$ against all $2n-1$ relative positions once, then shift the result so that entry $[i, j]$ contains the score for relative distance $j - i$. This keeps the positional term to a single matrix multiply and O(n²) scores.
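To see the shift trick in isolation, here is a small standalone check, mirroring the padding-and-reshape steps of the `_relative_shift` method above on a matrix whose entries simply record which relative position each column stands for:

```python
import torch
import torch.nn.functional as F

q = 4
# raw[i, m] encodes the relative position that column m stands for: m - (q - 1)
raw = (torch.arange(2 * q - 1) - (q - 1)).repeat(q, 1).float()  # [q, 2q-1]

x = F.pad(raw, (0, 1)).reshape(-1)        # append one column, flatten
x = x[q - 1: q - 1 + q * (2 * q - 1)]     # drop the leading q-1 entries
x = x.reshape(q, 2 * q - 1)[:, :q]        # keep the q aligned columns

print(x)
# tensor([[ 0.,  1.,  2.,  3.],
#         [-1.,  0.,  1.,  2.],
#         [-2., -1.,  0.,  1.],
#         [-3., -2., -1.,  0.]])   -> entry [i, j] is exactly j - i
```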
T5 (Raffel et al., 2020) introduced a simpler form of relative position encoding: learned scalar biases added directly to attention scores.
The T5 Approach
Instead of relative position embeddings (d-dimensional vectors), T5 uses relative position biases (scalar values):
$$e_{ij} = \frac{x_i W^Q (x_j W^K)^T}{\sqrt{d_k}} + b_{i-j}$$
Where $b_{i-j}$ is a learned scalar bias for relative position $i-j$.
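As a minimal illustration, assuming a hypothetical 1-D table `b` of per-offset scalars and ignoring T5's bucketing for the moment, the biases form a Toeplitz matrix (constant along diagonals) that is simply added to the attention logits:

```python
import torch

seq_len = 5
# Hypothetical learned scalars b for offsets -(seq_len-1) .. +(seq_len-1)
b = torch.nn.Parameter(torch.randn(2 * seq_len - 1))

pos = torch.arange(seq_len)
offsets = pos[None, :] - pos[:, None]        # offsets[i, j] = j - i
bias = b[offsets + (seq_len - 1)]            # [seq_len, seq_len], constant along diagonals

logits = torch.randn(seq_len, seq_len)       # stand-in for QK^T / sqrt(d_k)
attn = torch.softmax(logits + bias, dim=-1)  # bias added directly to the logits
print(bias)
```

Because the bias is a single scalar per head rather than a vector per position pair, the extra cost over vanilla attention is negligible.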
Key Simplifications

- Each relative position contributes a single learned scalar per head, not a $d_k$-dimensional embedding.
- The bias is added directly to the attention logits; keys and values are left untouched.
- Relative distances are grouped into a fixed number of buckets (exact buckets for small distances, logarithmically spaced buckets for larger ones), which bounds the parameter count and handles arbitrary sequence lengths.
```python
import torch
import torch.nn as nn
import numpy as np


class T5RelativePositionBias(nn.Module):
    """
    T5-style relative position bias.

    Uses logarithmic bucketing to reduce the number of unique biases
    while still distinguishing between fine-grained nearby distances
    and coarser far distances.
    """

    def __init__(
        self,
        num_heads: int,
        num_buckets: int = 32,
        max_distance: int = 128,
        bidirectional: bool = True
    ):
        """
        Args:
            num_heads: Number of attention heads
            num_buckets: Number of distinct position bias buckets
            max_distance: Maximum distance to consider (beyond this, use max bucket)
            bidirectional: If True, distinguish positive/negative offsets
        """
        super().__init__()
        self.num_heads = num_heads
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.bidirectional = bidirectional

        # Learned bias for each bucket and each head
        self.relative_attention_bias = nn.Embedding(num_buckets, num_heads)

    @staticmethod
    def _relative_position_bucket(
        relative_position: torch.Tensor,
        bidirectional: bool,
        num_buckets: int,
        max_distance: int
    ) -> torch.Tensor:
        """
        Convert relative positions to bucket indices.

        The bucket scheme:
        - Bucket 0: relative position = 0
        - Buckets 1 to num_buckets//2: exact small distances
        - Remaining buckets: logarithmically spaced for larger distances
        """
        relative_buckets = 0

        if bidirectional:
            # Separate buckets for positive and negative distances
            num_buckets = num_buckets // 2
            relative_buckets += (relative_position > 0).long() * num_buckets
            relative_position = relative_position.abs()
        else:
            # Causal: clamp negative to 0
            relative_position = -torch.min(
                relative_position, torch.zeros_like(relative_position)
            )

        # Half buckets for exact small distances
        max_exact = num_buckets // 2
        is_small = relative_position < max_exact

        # Logarithmic bucketing for larger distances
        relative_position_if_large = max_exact + (
            torch.log(relative_position.float() / max_exact)
            / np.log(max_distance / max_exact)
            * (num_buckets - max_exact)
        ).long()
        relative_position_if_large = torch.min(
            relative_position_if_large,
            torch.full_like(relative_position_if_large, num_buckets - 1)
        )

        relative_buckets += torch.where(
            is_small, relative_position, relative_position_if_large
        )

        return relative_buckets

    def compute_bias(self, query_length: int, key_length: int, device: torch.device):
        """
        Compute relative position bias matrix.

        Args:
            query_length: Length of query sequence
            key_length: Length of key sequence
            device: Device for tensors

        Returns:
            Bias tensor of shape [1, num_heads, query_length, key_length]
        """
        # Create relative position matrix
        context_position = torch.arange(query_length, device=device)[:, None]
        memory_position = torch.arange(key_length, device=device)[None, :]
        relative_position = memory_position - context_position  # [query_len, key_len]

        # Convert to bucket indices
        relative_position_bucket = self._relative_position_bucket(
            relative_position,
            bidirectional=self.bidirectional,
            num_buckets=self.num_buckets,
            max_distance=self.max_distance
        )

        # Look up biases: [query_len, key_len, num_heads]
        values = self.relative_attention_bias(relative_position_bucket)

        # Reshape to [1, num_heads, query_len, key_len]
        values = values.permute(2, 0, 1).unsqueeze(0)

        return values

    def forward(self, query_length: int, key_length: int, device: torch.device):
        return self.compute_bias(query_length, key_length, device)


class T5Attention(nn.Module):
    """
    T5-style attention with relative position bias.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        num_buckets: int = 32,
        max_distance: int = 128,
        dropout: float = 0.1
    ):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads

        # Linear projections (T5 uses no bias in projections)
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)

        # Relative position bias
        self.relative_position_bias = T5RelativePositionBias(
            num_heads=num_heads,
            num_buckets=num_buckets,
            max_distance=max_distance
        )

        self.dropout = nn.Dropout(dropout)
        self.scale = np.sqrt(self.d_head)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        batch_size, seq_len, d_model = x.shape

        # Compute Q, K, V
        Q = self.W_Q(x).view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        K = self.W_K(x).view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        V = self.W_V(x).view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        # Content-based attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        # Add relative position bias
        position_bias = self.relative_position_bias(seq_len, seq_len, x.device)
        scores = scores + position_bias

        # Apply mask and softmax
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = torch.nn.functional.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply to values
        output = torch.matmul(attn_weights, V)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        output = self.W_O(output)

        return output


# Demonstration
def demo_t5_bias():
    bias_module = T5RelativePositionBias(num_heads=8, num_buckets=32, max_distance=128)

    # Show bucketing behavior
    print("=== T5 Logarithmic Bucket Scheme ===\n")

    test_positions = torch.tensor([0, 1, 2, 3, 4, 5, 10, 20, 50, 100, 200])
    buckets = T5RelativePositionBias._relative_position_bucket(
        test_positions, bidirectional=True, num_buckets=32, max_distance=128
    )

    print("Distance -> Bucket")
    for pos, bucket in zip(test_positions.tolist(), buckets.tolist()):
        print(f"  {pos:4d} -> {bucket:2d}")

    print(f"\nBias matrix shape: {bias_module(64, 64, torch.device('cpu')).shape}")
    print(f"Total bias parameters: {32 * 8} (buckets × heads)")


demo_t5_bias()
```

The logarithmic bucketing scheme reflects diminishing returns of precise distance information. Whether a word is 100 or 101 positions away matters much less than whether it's 3 or 4 positions away. By using exact buckets for small distances and logarithmic buckets for large ones, T5 focuses parameters where they matter most.
We've examined three major relative position approaches. Let's compare them systematically:
| Aspect | Shaw et al. | Transformer-XL | T5 Bias |
|---|---|---|---|
| Position Input | Learned embeddings | Sinusoidal embeddings | Learned scalar biases |
| Integration Point | Keys and Values | Keys only (4-term) | Attention logits |
| Parameters | O(k × d) | O(d) + global biases | O(buckets × heads) |
| Computation | Medium | Higher | Low |
| Clipping/Bucketing | Hard clip at ±k | No clip (unbounded) | Logarithmic buckets |
| Length Generalization | Limited | Good | Good (if within max_distance) |
| Per-Head Patterns | Shared embeddings | Learned u, v per head | Independent biases |
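For a rough sense of scale, here is a back-of-the-envelope sketch, assuming d_model = 512 with 8 heads, Shaw's k = 16, and T5's 32 buckets (matching the demos above), counting only the position-specific parameters:

```python
d_model, num_heads = 512, 8
d_head = d_model // num_heads

# Shaw et al.: (2k + 1) relative embeddings of size d_k, for keys and for values
k = 16
shaw_params = (2 * k + 1) * d_head * 2

# Transformer-XL: sinusoidal R is parameter-free; the learned extras are the global
# biases u and v (the shared projection W_KR adds d_model * d_model, like any projection)
xl_params = 2 * num_heads * d_head

# T5: one scalar per (bucket, head)
num_buckets = 32
t5_params = num_buckets * num_heads

print(f"Shaw et al.:    {shaw_params:>6,} position-specific parameters")
print(f"Transformer-XL: {xl_params:>6,} (plus the W_KR projection)")
print(f"T5 bias:        {t5_params:>6,}")
```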
Empirical Findings
Research comparing these approaches reveals:
Task Sensitivity: Different tasks benefit from different approaches
Training Stability: T5's simple additive bias is generally most stable to train
Computational Efficiency: T5 has the lowest overhead; Transformer-XL has more complex indexing
Interpretability: T5 biases are easiest to analyze (just plot bias vs. distance)
One of the primary motivations for relative position encoding is better length generalization—the ability to handle sequences longer than those seen during training. Let's analyze this carefully.
Why Relative Positions Generalize Better
With absolute positions, seeing position 512 at inference after training only on positions 0-256 means encountering completely unseen embeddings. With relative positions, most of the pairwise offsets that arise at inference (distances 1, 2, 3, ...) were already observed many times during training; only the largest offsets are new, and clipping or bucketing maps them onto representations the model has already learned.
The Remaining Challenges
However, relative position doesn't solve everything: all distances beyond the clipping range or the largest bucket collapse onto a single representation, and attention spread over many more keys becomes more diffuse than anything encountered during training. The analysis below illustrates both effects.
```python
import numpy as np


def analyze_length_generalization():
    """
    Analyze how different relative position methods handle length extrapolation.
    """

    # Simulate T5-style bucketing behavior at different distances
    def t5_bucket(distance, num_buckets=32, max_distance=128):
        """Simplified T5 bucketing for analysis."""
        if distance < num_buckets // 4:
            return distance
        else:
            # Logarithmic bucketing
            max_exact = num_buckets // 4
            bucket = max_exact + int(
                np.log(distance / max_exact)
                / np.log(max_distance / max_exact)
                * (num_buckets // 2 - max_exact)
            )
            return min(bucket, num_buckets // 2 - 1)

    # Test distances
    distances = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]

    print("=== T5 Bucketing Behavior at Various Distances ===\n")
    print("Distance -> Bucket (trained max_distance=128)\n")

    for d in distances:
        bucket = t5_bucket(d)
        in_training = "✓" if d <= 128 else "✗ (extrapolation)"
        print(f"  {d:5d} -> Bucket {bucket:2d}  {in_training}")

    print("\n=== Key Observations ===")
    print("1. All distances > 128 map to the same max bucket")
    print("2. Fine-grained distinction only for nearby positions")
    print("3. Model can't distinguish distance 256 from distance 2048")

    # Simulate impact on attention patterns
    print("\n=== Attention Pattern at Different Lengths ===\n")

    def simulate_attention_entropy(seq_len, local_bias_strength=0.5):
        """
        Simulate attention entropy with local position bias.
        Higher entropy = more diffuse attention.
        """
        # Simple model: bias toward nearby positions
        positions = np.arange(seq_len)
        query_pos = seq_len - 1  # Query from last position

        # Distance-based bias (negative for nearby = higher attention)
        distances = np.abs(positions - query_pos)
        logits = -local_bias_strength * np.log1p(distances)

        # Softmax
        exp_logits = np.exp(logits - logits.max())
        attention = exp_logits / exp_logits.sum()

        # Entropy
        entropy = -np.sum(attention * np.log(attention + 1e-10))

        # Effective attention span (positions receiving >1% attention)
        effective_span = np.sum(attention > 0.01)

        return entropy, effective_span

    seq_lengths = [64, 128, 256, 512, 1024, 2048]

    print(f"{'Seq Length':>10} {'Entropy':>10} {'Effective Span':>15}")
    print("-" * 40)

    for seq_len in seq_lengths:
        entropy, span = simulate_attention_entropy(seq_len)
        print(f"{seq_len:>10} {entropy:>10.3f} {span:>15}")

    print("\nNote: Entropy increases with length = attention becomes more diffuse")
    print("This can hurt performance even if relative positions generalize correctly")


analyze_length_generalization()
```

Even with perfect relative position generalization, models may fail at long sequences because they never learned to utilize distant information effectively. Training on short sequences doesn't teach the model that distant context is useful. This is a separate challenge from positional encoding design.
Let's consolidate practical guidance for implementing relative position encoding in your models.
Choosing an Approach
| Use Case | Recommended Method | Rationale |
|---|---|---|
| General-purpose encoder | T5 bias or RoPE | Simple, well-tested, good generalization |
| Long-context LM | RoPE or ALiBi | Best length extrapolation |
| Sequence-to-sequence | T5 bias | Works well for encoder-decoder |
| Very long documents | Transformer-XL | Segment recurrence handles length |
| Research/analysis | T5 bias | Most interpretable patterns |
For new projects in 2024+, consider Rotary Position Embeddings (RoPE) as your default. It combines the benefits of relative position encoding with elegant integration into attention and excellent length generalization. We cover RoPE in depth in the next page.
Relative positional representations marked a significant evolution in how transformers encode sequential structure. Let's consolidate the key insights:

- Relative methods encode the offset between position pairs rather than absolute indices, making attention patterns translation invariant.
- Shaw et al. (2018) inject learned relative embeddings into the keys and values, with distances clipped to ±k.
- Transformer-XL decomposes the attention score into four terms, combining sinusoidal relative encodings with learned global biases u and v.
- T5 reduces the machinery to bucketed scalar biases added directly to the attention logits, the simplest and cheapest of the three.
- Relative encoding improves length generalization, but bucket saturation and increasingly diffuse attention at long lengths remain open challenges.
What's Next: Rotary Position Embeddings (RoPE)
The evolution of positional encoding continues with Rotary Position Embeddings (RoPE), perhaps the most elegant solution yet. RoPE encodes relative position through the very geometry of the attention computation: queries and keys are rotated by position-dependent angles, so their dot product depends only on the relative offset between them.
RoPE has become the standard in modern large language models like LLaMA, Mistral, and many others. Understanding it completes your knowledge of the positional encoding landscape.
You now understand the principles, implementations, and tradeoffs of relative positional encoding. From Shaw et al.'s foundational work through Transformer-XL's decomposition to T5's elegant simplification, you've traced the evolution that led to modern approaches. Next, we explore Rotary Position Embeddings—the state-of-the-art synthesis of these ideas.