While sinusoidal encoding provides an elegant, parameter-free solution to positional encoding, a natural question arises: what if we let the model learn the optimal position representations?
This is precisely what learned positional embeddings do. Instead of computing position encodings through fixed mathematical formulas, we treat position indices as tokens and learn their embeddings through standard backpropagation. This approach was popularized by landmark models including BERT, GPT-2, and RoBERTa, and remains widely used in many transformer architectures.
Learned embeddings offer a compelling value proposition: maximum flexibility. The model can discover whatever positional patterns are most useful for the task, unconstrained by the geometric structure of sinusoidal functions. But this flexibility comes with costs—additional parameters, fixed sequence length limits, and potential generalization challenges.
This page provides a comprehensive treatment of learned positional embeddings: their implementation, theoretical properties, empirical comparisons with sinusoidal encoding, and guidance on when each approach is preferred.
Learned positional embeddings are used in BERT (512 positions), GPT-2 (1024 positions), RoBERTa, ALBERT, DistilBERT, and many other influential models. Their simplicity and effectiveness make them a default choice for many practitioners, though modern large language models increasingly favor relative position methods.
The Core Idea
Learned positional embeddings treat position indices exactly like vocabulary tokens. We create an embedding matrix:
$$P \in \mathbb{R}^{L_{\max} \times d_{\text{model}}}$$
where:
- $P$ is the position embedding matrix
- $L_{\max}$ is the maximum sequence length the model supports
- $d_{\text{model}}$ is the embedding dimension
For a token at position $i$, its positional embedding is simply the $i$-th row of $P$:
$$PE(i) = P[i, :] \in \mathbb{R}^{d_{\text{model}}}$$
The final input to the transformer combines token and position embeddings:
$$\tilde{x}_i = \text{TokenEmbed}(\text{token}_i) + \text{PositionEmbed}(i)$$
That's it. The simplicity is striking—we're just adding another embedding lookup.
```python
import torch
import torch.nn as nn
import numpy as np
from typing import Optional


class LearnedPositionalEmbedding(nn.Module):
    """
    Learned positional embeddings as used in BERT, GPT-2, etc.

    Each position has a trainable d_model-dimensional vector that is
    learned end-to-end with the rest of the model.
    """

    def __init__(
        self,
        max_length: int,
        d_model: int,
        dropout: float = 0.1,
        initialization: str = "normal",  # "normal", "uniform", or "sinusoidal"
    ):
        """
        Args:
            max_length: Maximum sequence length (L_max)
            d_model: Embedding dimension
            dropout: Dropout after adding positional embeddings
            initialization: How to initialize the embeddings
        """
        super().__init__()
        self.max_length = max_length
        self.d_model = d_model

        # The position embedding matrix: [max_length, d_model]
        self.position_embeddings = nn.Embedding(max_length, d_model)

        # Initialize based on strategy
        self._initialize_embeddings(initialization)

        self.dropout = nn.Dropout(p=dropout)

        # Pre-compute position indices for efficiency
        self.register_buffer(
            "position_ids",
            torch.arange(max_length).unsqueeze(0),  # [1, max_length]
        )

    def _initialize_embeddings(self, method: str):
        """Initialize position embeddings."""
        if method == "normal":
            # Standard normal initialization (like BERT)
            nn.init.normal_(self.position_embeddings.weight, std=0.02)
        elif method == "uniform":
            # Uniform initialization
            nn.init.uniform_(self.position_embeddings.weight, -0.1, 0.1)
        elif method == "sinusoidal":
            # Initialize with sinusoidal values, then allow learning
            sinusoidal = self._create_sinusoidal_embeddings()
            self.position_embeddings.weight.data.copy_(sinusoidal)
        else:
            raise ValueError(f"Unknown initialization: {method}")

    def _create_sinusoidal_embeddings(self) -> torch.Tensor:
        """Create sinusoidal embeddings for initialization."""
        position = torch.arange(self.max_length).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, self.d_model, 2) * (-np.log(10000.0) / self.d_model)
        )
        pe = torch.zeros(self.max_length, self.d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(
        self,
        x: torch.Tensor,
        position_ids: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Add positional embeddings to input.

        Args:
            x: Input tensor of shape [batch_size, seq_len, d_model]
            position_ids: Optional explicit position indices [batch_size, seq_len].
                If None, uses positions [0, 1, 2, ..., seq_len-1].

        Returns:
            Tensor with positional information added
        """
        seq_len = x.size(1)
        if seq_len > self.max_length:
            raise ValueError(
                f"Sequence length {seq_len} exceeds maximum {self.max_length}"
            )

        if position_ids is None:
            # Use default positions: [0, 1, 2, ..., seq_len-1]
            position_ids = self.position_ids[:, :seq_len]

        # Look up position embeddings
        position_embeddings = self.position_embeddings(position_ids)

        # Add to input
        x = x + position_embeddings
        return self.dropout(x)

    def get_position_embedding(self, position: int) -> torch.Tensor:
        """Get embedding for a specific position."""
        return self.position_embeddings.weight[position]


# Full BERT-style embeddings combining all components
class BERTEmbeddings(nn.Module):
    """
    Complete embedding layer as used in BERT.

    Combines token embeddings, position embeddings, and segment embeddings.
    """

    def __init__(
        self,
        vocab_size: int,
        max_length: int,
        d_model: int,
        num_segments: int = 2,
        dropout: float = 0.1,
        layer_norm_eps: float = 1e-12,
    ):
        super().__init__()

        # Token embeddings
        self.token_embeddings = nn.Embedding(vocab_size, d_model)

        # Position embeddings
        self.position_embeddings = nn.Embedding(max_length, d_model)

        # Segment embeddings (for sentence A vs B)
        self.segment_embeddings = nn.Embedding(num_segments, d_model)

        # Layer normalization after combining
        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_eps)
        self.dropout = nn.Dropout(dropout)

        # Pre-compute positions
        self.register_buffer(
            "position_ids",
            torch.arange(max_length).unsqueeze(0),
        )

        # Initialize
        self._init_weights()

    def _init_weights(self):
        """Initialize all embeddings with small normal values."""
        nn.init.normal_(self.token_embeddings.weight, std=0.02)
        nn.init.normal_(self.position_embeddings.weight, std=0.02)
        nn.init.normal_(self.segment_embeddings.weight, std=0.02)

    def forward(
        self,
        input_ids: torch.Tensor,
        segment_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Compute combined embeddings.

        Args:
            input_ids: Token indices [batch_size, seq_len]
            segment_ids: Segment indices [batch_size, seq_len], default all 0s
            position_ids: Position indices [batch_size, seq_len], default 0..seq_len-1
        """
        seq_len = input_ids.size(1)

        # Defaults
        if position_ids is None:
            position_ids = self.position_ids[:, :seq_len]
        if segment_ids is None:
            segment_ids = torch.zeros_like(input_ids)

        # Look up all embeddings
        token_embeds = self.token_embeddings(input_ids)
        position_embeds = self.position_embeddings(position_ids)
        segment_embeds = self.segment_embeddings(segment_ids)

        # Combine
        embeddings = token_embeds + position_embeds + segment_embeds

        # Normalize and dropout
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings


# Example usage
def demonstrate_learned_embeddings():
    """Demonstrate learned positional embeddings."""
    batch_size = 2
    seq_len = 128
    d_model = 768
    max_length = 512

    # Create embedding layer
    pos_embed = LearnedPositionalEmbedding(
        max_length=max_length,
        d_model=d_model,
        initialization="normal",
    )

    # Random input (simulating token embeddings)
    x = torch.randn(batch_size, seq_len, d_model)

    # Add positional embeddings
    x_with_pos = pos_embed(x)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {x_with_pos.shape}")
    print(f"Position embedding parameters: {pos_embed.max_length * d_model:,}")
    print(f"  = {max_length} positions × {d_model} dimensions")
    return pos_embed


pos_embed = demonstrate_learned_embeddings()
```

Most implementations pre-register position indices as a buffer (a non-parameter tensor saved with the model). This allows efficient position lookup during inference without recreating the indices each time. The position embeddings themselves are regular `nn.Embedding` parameters that are updated during training.
One of the most significant differences between learned and sinusoidal positional embeddings is the parameter count. Let's analyze this carefully.
Parameter Count Formula
Learned positional embeddings require:
$$\text{Parameters} = L_{\max} \times d_{\text{model}}$$
For typical model configurations:
| Model | Max Length | d_model | Position Params | % of Total |
|---|---|---|---|---|
| BERT-Base | 512 | 768 | 393,216 | 0.36% |
| BERT-Large | 512 | 1024 | 524,288 | 0.16% |
| GPT-2 Small | 1024 | 768 | 786,432 | 0.63% |
| GPT-2 Large | 1024 | 1280 | 1,310,720 | 0.17% |
| Hypothetical 8K | 8192 | 1024 | 8,388,608 | ~1%+ |
| Hypothetical 32K | 32768 | 2048 | 67,108,864 | Significant |
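The counts in the table follow directly from the formula above; a quick sketch to reproduce them (the model names are just labels for the table rows):

```python
# Learned position embeddings cost exactly max_length * d_model parameters.
def position_embedding_params(max_length: int, d_model: int) -> int:
    return max_length * d_model

configs = {
    "BERT-Base": (512, 768),
    "BERT-Large": (512, 1024),
    "GPT-2 Small": (1024, 768),
    "GPT-2 Large": (1024, 1280),
    "Hypothetical 8K": (8192, 1024),
    "Hypothetical 32K": (32768, 2048),
}

for name, (max_length, d_model) in configs.items():
    print(f"{name:16s}: {position_embedding_params(max_length, d_model):>12,}")
# e.g. BERT-Base -> 393,216 and Hypothetical 32K -> 67,108,864
```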
Parameter Efficiency Comparison
| Aspect | Learned Embeddings | Sinusoidal Encoding |
|---|---|---|
| Parameters | $O(L_{\max} \times d)$ | 0 (computed) |
| Computation | One embedding lookup | Sin/cos computation |
| Memory at training | Store gradient for each position | No gradient needed |
| Memory at inference | Store full embedding matrix | Compute on demand |
When Parameters Matter
For BERT-scale models (110M-340M parameters), position embeddings are a negligible fraction. But as context lengths grow, the cost becomes significant.
The linear scaling of learned position parameters with context length is one reason modern long-context models (GPT-4, Claude, etc.) use relative position methods like RoPE or ALiBi. These approaches decouple parameter count from sequence length.
When we learn positional embeddings from data, the optimization process discovers position representations suited to the task. Let's examine what the model learns and how.
Initialization Strategies
The initialization of position embeddings significantly affects training:
- Normal initialization (std ≈ 0.02): the BERT-style default; small initial values keep the positional signal from overwhelming token embeddings early in training.
- Uniform initialization: a simple alternative with similarly small starting values.
- Sinusoidal initialization: starts from the fixed encoding and lets training adjust it, combining principled structure with learned flexibility.
What Does the Model Learn?
Research analyzing learned position embeddings reveals structured patterns:
Similarity Structure: Adjacent positions learn similar embeddings. The embedding space develops a smooth manifold where position distance correlates with embedding distance.
Periodic Components: Principal component analysis (PCA) of learned embeddings often reveals sinusoidal-like patterns, suggesting the model rediscovers useful periodic structure.
Task-Specific Patterns: Different tasks induce different position embedding structures, so embeddings trained under different objectives can look quite different when analyzed.
```python
import torch
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity


def analyze_position_embeddings(position_embeddings: torch.Tensor):
    """
    Comprehensive analysis of learned position embeddings.

    Args:
        position_embeddings: Tensor of shape [max_length, d_model]
    """
    pe = position_embeddings.detach().cpu().numpy()
    max_len, d_model = pe.shape
    print(f"Analyzing {max_len} position embeddings of dimension {d_model}")

    # === Analysis 1: Similarity Structure ===
    print("\n=== Similarity Analysis ===")
    sim_matrix = cosine_similarity(pe)

    # Check if adjacent positions are more similar
    adjacent_sims = [sim_matrix[i, i + 1] for i in range(max_len - 1)]
    distant_sims = [sim_matrix[0, i] for i in range(10, min(50, max_len))]
    print(f"Mean adjacent similarity: {np.mean(adjacent_sims):.4f}")
    print(f"Mean distant similarity (pos 0 to 10-50): {np.mean(distant_sims):.4f}")

    # === Analysis 2: PCA of Embeddings ===
    print("\n=== PCA Analysis ===")
    pca = PCA(n_components=min(10, d_model))
    pe_pca = pca.fit_transform(pe)
    print(
        f"Variance explained by first 5 components: "
        f"{pca.explained_variance_ratio_[:5].sum():.2%}"
    )

    # Check for periodic patterns in top components
    for i in range(min(3, pe_pca.shape[1])):
        component = pe_pca[:, i]
        # Simple periodicity check via autocorrelation
        autocorr = np.correlate(component, component, mode='full')
        autocorr = autocorr[len(autocorr) // 2:]
        autocorr = autocorr / autocorr[0]  # Normalize

        # Find significant peaks (excluding lag 0)
        peaks = []
        for j in range(1, len(autocorr) - 1):
            if autocorr[j] > autocorr[j - 1] and autocorr[j] > autocorr[j + 1]:
                if autocorr[j] > 0.3:  # Threshold for significance
                    peaks.append((j, autocorr[j]))
        if peaks:
            print(f"PC{i+1}: Detected periodicity at distances {[p[0] for p in peaks[:3]]}")
        else:
            print(f"PC{i+1}: No strong periodic pattern detected")

    # === Analysis 3: Norm Distribution ===
    print("\n=== Norm Analysis ===")
    norms = np.linalg.norm(pe, axis=1)
    print(f"Norm range: [{norms.min():.4f}, {norms.max():.4f}]")
    print(f"Norm mean: {norms.mean():.4f}, std: {norms.std():.4f}")

    # Check if early positions have different norms (common pattern)
    early_norm = norms[:20].mean()
    late_norm = norms[-20:].mean()
    print(f"Early positions (0-19) mean norm: {early_norm:.4f}")
    print(f"Late positions mean norm: {late_norm:.4f}")

    # === Analysis 4: Distance Structure ===
    print("\n=== Distance Structure ===")
    # Euclidean distance as a function of position difference
    distances = {}
    for delta in [1, 2, 5, 10, 20, 50, 100]:
        if delta < max_len:
            dists = [
                np.linalg.norm(pe[i] - pe[i + delta])
                for i in range(max_len - delta)
            ]
            distances[delta] = np.mean(dists)
            print(f"Mean distance for Δ={delta:3d}: {distances[delta]:.4f}")

    return {
        "similarity_matrix": sim_matrix,
        "pca_result": pe_pca,
        "explained_variance": pca.explained_variance_ratio_,
        "norms": norms,
        "distances": distances,
    }


# Example: Analyze a randomly initialized embedding (simulating pre-training)
# In practice, you would load weights from a trained model
max_length, d_model = 512, 768
position_embed = torch.nn.Embedding(max_length, d_model)
torch.nn.init.normal_(position_embed.weight, std=0.02)

print("Analysis of RANDOM initialization (before training):")
print("=" * 50)
random_analysis = analyze_position_embeddings(position_embed.weight)

# Show that random init has no structure
print("\nNote: Random initialization shows no meaningful patterns.")
print("After training, we would expect:")
print("  - Higher adjacent similarity")
print("  - Periodic patterns in PCA components")
print("  - Structured distance relationships")
```

Research by Shaw et al. and others has shown that learned position embeddings, when analyzed after training, often exhibit structure similar to sinusoidal encodings: smooth transitions between positions and periodic components. The model independently discovers that this structure is useful for representing position.
A critical question is whether learned embeddings outperform sinusoidal encoding. Research and practical experience provide nuanced answers.
Original Transformer Results
The original "Attention Is All You Need" paper reported:
"We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results."
For the WMT machine translation task, both approaches achieved similar BLEU scores (within 0.1 BLEU). This early result suggested the choice might not matter significantly.
Subsequent Research Findings
More extensive experiments across tasks have revealed patterns:
| Task/Setting | Learned | Sinusoidal | Winner |
|---|---|---|---|
| Machine Translation (WMT) | 27.4 BLEU | 27.3 BLEU | ~Tie |
| Language Understanding (GLUE avg) | 82.1 | 81.8 | Learned (slight) |
| Question Answering (SQuAD) | 88.5 F1 | 88.2 F1 | Learned (slight) |
| Length Generalization (2x train) | Poor | Moderate | Sinusoidal |
| Low-Data Regime | Underperforms | Stable | Sinusoidal |
| Very Long Contexts (>4K) | Poor scaling | Good scaling | Sinusoidal |
When Learned Embeddings Excel
Abundant training data: With sufficient data, the model can learn task-specific position patterns that sinusoidal encoding cannot capture.
Fixed sequence length at inference: If all inputs have similar lengths to training data, learned embeddings shine.
Task-specific position patterns: Some tasks have unusual positional requirements (e.g., structured documents) that benefit from learned representations.
When Sinusoidal Encoding Excels
Length generalization: Sinusoidal encoding can generate valid embeddings for any position, while learned embeddings fail for unseen positions.
Low-data regimes: With limited training data, sinusoidal encoding provides useful inductive bias.
Parameter efficiency: For long-context models, sinusoidal encoding avoids the parameter explosion.
Interpretability: The mathematical structure of sinusoidal encoding is fully understood; learned embeddings are opaque.
Modern large language models have largely moved beyond both pure learned and pure sinusoidal approaches, favoring relative position methods like RoPE. These combine the parameter efficiency of sinusoidal encoding with learned flexibility through the attention mechanism. For medium-scale models with fixed context, learned embeddings remain a solid default.
The most significant limitation of learned positional embeddings is their inability to generalize to sequence lengths not seen during training. This is a fundamental architectural constraint.
The Hard Limit
If a model is trained with $L_{\max} = 512$, the position embedding matrix has 512 rows. At inference time:
- Positions 0 through 511 have learned embeddings and work normally.
- Position 512 and beyond have no corresponding row; the embedding lookup fails with an index error.
There is no mathematically principled way to extend learned embeddings to new positions. Common workarounds have significant limitations:
```python
import torch
import torch.nn as nn


def demonstrate_length_limitation():
    """Show the hard limit of learned position embeddings."""
    max_length = 512
    d_model = 768

    # Standard learned embedding
    pos_embed = nn.Embedding(max_length, d_model)
    nn.init.normal_(pos_embed.weight, std=0.02)

    # Works fine for positions within training range
    valid_positions = torch.tensor([0, 100, 256, 511])
    embeddings = pos_embed(valid_positions)
    print(f"Valid positions {valid_positions.tolist()}: OK")
    print(f"Embedding shape: {embeddings.shape}")

    # Fails for positions beyond training range
    invalid_positions = torch.tensor([512, 600, 1000])
    try:
        embeddings = pos_embed(invalid_positions)
        print("This should not print")
    except IndexError:
        print(f"Invalid positions {invalid_positions.tolist()}: FAILED")
        print("Error: Index out of range")

    return pos_embed


def workaround_modular(pos_embed: nn.Embedding, positions: torch.Tensor):
    """
    Modular position workaround: position mod max_length.

    Problem: Position 512 becomes indistinguishable from position 0!
    """
    max_length = pos_embed.num_embeddings
    modular_positions = positions % max_length
    return pos_embed(modular_positions)


def workaround_clipping(pos_embed: nn.Embedding, positions: torch.Tensor):
    """
    Clipping workaround: cap at max_length - 1.

    Problem: All positions >= max_length are identical!
    """
    max_length = pos_embed.num_embeddings
    clipped_positions = torch.clamp(positions, max=max_length - 1)
    return pos_embed(clipped_positions)


def workaround_interpolation(
    pos_embed: nn.Embedding,
    positions: torch.Tensor,
    target_max: int,
):
    """
    Position interpolation: rescale positions to fit within trained range.

    Problem: Adjacent positions may get the same embedding, losing resolution.
    """
    max_length = pos_embed.num_embeddings
    # Scale positions: [0, target_max] -> [0, max_length-1]
    scaled = (positions.float() / target_max * (max_length - 1)).long()
    scaled = torch.clamp(scaled, max=max_length - 1)
    return pos_embed(scaled)


# Demonstration
print("=== Length Generalization with Learned Embeddings ===\n")
pos_embed = demonstrate_length_limitation()
print()

# Test workarounds
print("=== Workaround Analysis ===\n")
long_positions = torch.tensor([0, 256, 512, 600, 1024])

# Modular
modular_emb = workaround_modular(pos_embed, long_positions)
mod_positions = long_positions % 512
print(f"Modular: positions {long_positions.tolist()} -> {mod_positions.tolist()}")
print("  Problem: 512 and 0 have identical embeddings!")

# Check similarity
sim_0_512 = torch.cosine_similarity(
    workaround_modular(pos_embed, torch.tensor([0])),
    workaround_modular(pos_embed, torch.tensor([512])),
    dim=1,
)
print(f"  Cosine similarity(pos 0, pos 512) = {sim_0_512.item():.4f}")
print()

# Clipping
clip_positions = torch.clamp(long_positions, max=511)
print(f"Clipping: positions {long_positions.tolist()} -> {clip_positions.tolist()}")
print("  Problem: 512, 600, 1024 all map to 511!")
print()

# Interpolation (assuming target max is 1024)
interp_emb = workaround_interpolation(pos_embed, long_positions, 1024)
scaled_positions = (long_positions.float() / 1024 * 511).long()
print(f"Interpolation: positions {long_positions.tolist()} -> {scaled_positions.tolist()}")
print("  Problem: Lost half the position resolution!")
```

There is no principled way to extend learned positional embeddings beyond their trained range; this is a fundamental limitation of the approach. For applications requiring variable or long sequence lengths, relative position methods (RoPE, ALiBi) are strongly preferred.
Researchers have developed several hybrid approaches that combine the benefits of learned and fixed positional encodings.
Sinusoidal Initialization with Learning
A popular approach initializes position embeddings with sinusoidal values, then allows gradient updates during training:
```python
import torch
import torch.nn as nn
import numpy as np


class SinusoidalInitializedLearnedPE(nn.Module):
    """
    Position embeddings initialized with sinusoidal values,
    then fine-tuned during training.

    Benefits:
    - Starts with mathematically principled structure
    - Can adapt structure to task-specific needs
    - Better early training dynamics
    """

    def __init__(self, max_length: int, d_model: int):
        super().__init__()
        # Create sinusoidal initialization
        pe = self._create_sinusoidal(max_length, d_model)
        # Register as learnable parameter
        self.position_embeddings = nn.Parameter(pe)

    def _create_sinusoidal(self, max_length: int, d_model: int) -> torch.Tensor:
        position = torch.arange(max_length).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_length, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        return x + self.position_embeddings[:seq_len]


class FrozenSinusoidalWithLearnedScaling(nn.Module):
    """
    Frozen sinusoidal base with learned per-dimension scaling.

    This keeps the extrapolation properties of sinusoidal encoding
    while allowing learned adjustments.
    """

    def __init__(self, max_length: int, d_model: int):
        super().__init__()
        # Frozen sinusoidal base
        pe = self._create_sinusoidal(max_length, d_model)
        self.register_buffer('sinusoidal_base', pe)

        # Learned per-dimension scaling
        self.scale = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))

    def _create_sinusoidal(self, max_length: int, d_model: int) -> torch.Tensor:
        position = torch.arange(max_length).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_length, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Apply learned transformation to frozen base
        pe = self.sinusoidal_base[:seq_len] * self.scale + self.bias
        return x + pe

    def get_extended_embedding(self, position: int) -> torch.Tensor:
        """
        Get embedding for a position beyond the training range.
        Still works, because the sinusoidal base can be computed for any position!
        """
        d_model = self.scale.size(0)
        # Compute sinusoidal values for the new position
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
        )
        pe = torch.zeros(d_model)
        pe[0::2] = torch.sin(position * div_term)
        pe[1::2] = torch.cos(position * div_term)
        # Apply learned transformation
        return pe * self.scale + self.bias


# Demonstration
def compare_initialization_strategies():
    """Compare learnable parameter counts across strategies."""
    max_length, d_model = 512, 256

    strategies = {
        "Random Normal": lambda: nn.Embedding(max_length, d_model),
        "Sinusoidal init (learnable)": lambda: SinusoidalInitializedLearnedPE(max_length, d_model),
        "Frozen sinusoidal + learned scale": lambda: FrozenSinusoidalWithLearnedScaling(max_length, d_model),
    }

    for name, create_fn in strategies.items():
        embed = create_fn()
        # Count learnable parameters
        num_params = sum(p.numel() for p in embed.parameters() if p.requires_grad)
        print(f"{name}:")
        print(f"  Learnable parameters: {num_params:,}")
        if isinstance(embed, SinusoidalInitializedLearnedPE):
            print(f"  Initial norm: {embed.position_embeddings.norm():.2f}")
        elif isinstance(embed, nn.Embedding):
            print(f"  Initial norm: {embed.weight.norm():.2f}")
        print()


compare_initialization_strategies()
```

Factored Position Embeddings
Some architectures decompose position embeddings into lower-rank components to reduce parameters:
$$P = A B, \qquad PE(i) = A[i, :] \, B$$

where:
- $A \in \mathbb{R}^{L_{\max} \times r}$ assigns each position a rank-$r$ code
- $B \in \mathbb{R}^{r \times d_{\text{model}}}$ projects codes up to the model dimension
- $r \ll d_{\text{model}}$ is the factorization rank

This reduces parameters from $O(L_{\max} \times d)$ to $O(L_{\max} \times r + r \times d)$.
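A minimal sketch of such a factorization (the class name, default rank, and bias-free linear projection are illustrative assumptions, not a specific published design):

```python
import torch
import torch.nn as nn


class FactoredPositionalEmbedding(nn.Module):
    """Low-rank factored position embeddings: each position gets a rank-r
    code that is projected up to d_model, reducing parameters from
    L_max * d_model to L_max * r + r * d_model."""

    def __init__(self, max_length: int, d_model: int, rank: int = 32):
        super().__init__()
        self.codes = nn.Embedding(max_length, rank)          # A: [L_max, r]
        self.project = nn.Linear(rank, d_model, bias=False)  # B: [r, d_model]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        pe = self.project(self.codes(positions))  # [seq_len, d_model]
        return x + pe


# Savings for a hypothetical 8K-context model:
full = 8192 * 1024                 # 8,388,608 parameters (full table)
factored = 8192 * 32 + 32 * 1024   # 294,912 parameters (rank 32)
print(full, factored)
```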
Segment-Level Positions
For very long documents, some models use hierarchical position representations:
- a segment-level position identifying which chunk of the document a token belongs to
- a within-segment position giving the token's offset inside that chunk
This reduces the position vocabulary for each level while maintaining global position information.
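One way the hierarchical scheme could be realized (a sketch; the segment size and the additive combination of the two embeddings are assumed design choices):

```python
import torch
import torch.nn as nn


class HierarchicalPositionalEmbedding(nn.Module):
    """Position = (segment index, offset within segment). Each level has its
    own small embedding table and the two embeddings are summed, so
    num_segments * segment_size positions need only
    (num_segments + segment_size) * d_model parameters."""

    def __init__(self, num_segments: int, segment_size: int, d_model: int):
        super().__init__()
        self.segment_size = segment_size
        self.segment_embed = nn.Embedding(num_segments, d_model)
        self.offset_embed = nn.Embedding(segment_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        seg = positions // self.segment_size  # which chunk of the document
        off = positions % self.segment_size   # offset inside that chunk
        return x + self.segment_embed(seg) + self.offset_embed(off)


# 64 segments x 512 offsets covers 32,768 positions with
# (64 + 512) * d_model parameters instead of 32,768 * d_model.
```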
For new projects with fixed-length contexts (≤2K tokens), learned embeddings initialized from sinusoidal values often provide a good balance of flexibility and stability. For longer contexts or length-variable deployments, consider RoPE or other relative methods from the start.
Let's consolidate practical recommendations for implementing learned positional embeddings.
Decision Framework
Use learned positional embeddings when:
- Sequence lengths are fixed and inference inputs match the training distribution
- Training data is abundant enough to learn position patterns from scratch
- The task may benefit from position patterns that fixed encodings cannot express
Avoid learned positional embeddings when:
- Inputs may exceed the trained maximum length
- Contexts are long (>4K tokens) and the $L_{\max} \times d_{\text{model}}$ parameter cost matters
- Training data is limited and the inductive bias of a fixed encoding would help
Implementation Checklist
```python
import torch
import torch.nn as nn
import numpy as np
from typing import Optional
import logging

logger = logging.getLogger(__name__)


class ProductionLearnedPositionalEmbedding(nn.Module):
    """
    Production-ready learned positional embedding with all best practices.
    """

    def __init__(
        self,
        max_length: int,
        d_model: int,
        dropout: float = 0.1,
        init_std: float = 0.02,
        use_sinusoidal_init: bool = False,
    ):
        super().__init__()
        self.max_length = max_length
        self.d_model = d_model

        # Create embeddings
        self.embeddings = nn.Embedding(max_length, d_model)

        # Initialize
        if use_sinusoidal_init:
            self._init_sinusoidal()
            logger.info("Initialized position embeddings with sinusoidal values")
        else:
            nn.init.normal_(self.embeddings.weight, std=init_std)
            logger.info(f"Initialized position embeddings with N(0, {init_std}²)")

        self.dropout = nn.Dropout(p=dropout)

        # Pre-compute position indices
        positions = torch.arange(max_length, dtype=torch.long)
        self.register_buffer('position_ids', positions.unsqueeze(0))

        # Log configuration
        logger.info(
            f"LearnedPositionalEmbedding: max_length={max_length}, "
            f"d_model={d_model}, params={max_length * d_model:,}"
        )

    def _init_sinusoidal(self):
        """Initialize with sinusoidal values."""
        position = torch.arange(self.max_length).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, self.d_model, 2).float()
            * (-np.log(10000.0) / self.d_model)
        )
        weight = torch.zeros(self.max_length, self.d_model)
        weight[:, 0::2] = torch.sin(position * div_term)
        weight[:, 1::2] = torch.cos(position * div_term)
        self.embeddings.weight.data.copy_(weight)

    def forward(
        self,
        x: torch.Tensor,
        position_ids: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Add positional embeddings to input.

        Args:
            x: Input embeddings [batch_size, seq_len, d_model]
            position_ids: Optional custom positions [batch_size, seq_len]
        """
        batch_size, seq_len, _ = x.shape

        # Validation
        if seq_len > self.max_length:
            raise ValueError(
                f"Input sequence length {seq_len} exceeds maximum "
                f"supported length {self.max_length}. Consider using "
                f"a different positional encoding method for longer sequences."
            )

        # Get positions
        if position_ids is None:
            position_ids = self.position_ids[:, :seq_len]
            position_ids = position_ids.expand(batch_size, -1)

        # Look up embeddings
        position_embeddings = self.embeddings(position_ids)

        # Add to input and apply dropout
        output = x + position_embeddings
        return self.dropout(output)

    def get_embedding_stats(self) -> dict:
        """Return statistics about the embeddings for monitoring."""
        weight = self.embeddings.weight.data
        return {
            "mean": weight.mean().item(),
            "std": weight.std().item(),
            "min": weight.min().item(),
            "max": weight.max().item(),
            "norm_mean": weight.norm(dim=1).mean().item(),
            "norm_std": weight.norm(dim=1).std().item(),
        }


# Configuration template
DEFAULT_CONFIG = {
    "max_length": 512,
    "d_model": 768,
    "dropout": 0.1,
    "init_std": 0.02,
    "use_sinusoidal_init": False,
}


def create_position_embeddings(config: dict):
    """Factory function for creating position embeddings."""
    merged_config = {**DEFAULT_CONFIG, **config}
    return ProductionLearnedPositionalEmbedding(**merged_config)
```

Learned positional embeddings represent a flexible, data-driven approach to positional encoding. Let's consolidate our understanding:
When to Choose Learned Embeddings
| Scenario | Recommendation |
|---|---|
| Standard BERT/GPT-style model, fixed context | ✅ Learned embeddings work well |
| Long document processing (>4K tokens) | ❌ Use RoPE or ALiBi |
| Deployment with variable input lengths | ⚠️ Consider relative methods |
| Research exploring position patterns | ✅ Learned allows analysis |
| Low-resource training | ⚠️ Sinusoidal may generalize better |
What's Next: Relative Positional Representations
We've covered the two main "absolute" approaches to positional encoding: sinusoidal (fixed) and learned. But both share a fundamental design decision: assigning a unique vector to each absolute position.
In the next page, we explore relative positional representations—approaches that encode the distance between positions rather than their absolute indices. This paradigm shift offers:
- better length generalization, since relative distances recur regardless of absolute position
- parameter counts decoupled from sequence length
- a closer match to how attention compares pairs of tokens
These insights laid the groundwork for modern approaches like RoPE that now dominate large-scale language models.
You now understand learned positional embeddings in depth—their implementation, training dynamics, empirical performance, fundamental limitations, and practical deployment considerations. This knowledge enables informed decisions about positional encoding choices and prepares you for understanding the relative position methods that followed.