While sinusoidal encoding provides an elegant, parameter-free solution to positional encoding, a natural question arises: what if we let the model learn the optimal position representations?
This is precisely what learned positional embeddings do. Instead of computing position encodings through fixed mathematical formulas, we treat position indices as tokens and learn their embeddings through standard backpropagation. This approach was popularized by landmark models including BERT, GPT-2, and RoBERTa, and remains widely used in many transformer architectures.
Learned embeddings offer a compelling value proposition: maximum flexibility. The model can discover whatever positional patterns are most useful for the task, unconstrained by the geometric structure of sinusoidal functions. But this flexibility comes with costs—additional parameters, fixed sequence length limits, and potential generalization challenges.
This page provides a comprehensive treatment of learned positional embeddings: their implementation, theoretical properties, empirical comparisons with sinusoidal encoding, and guidance on when each approach is preferred.
Learned positional embeddings are used in BERT (512 positions), GPT-2 (1024 positions), RoBERTa, ALBERT, DistilBERT, and many other influential models. Their simplicity and effectiveness make them a default choice for many practitioners, though modern large language models increasingly favor relative position methods.
The Core Idea
Learned positional embeddings treat position indices exactly like vocabulary tokens. We create an embedding matrix:
$$P \in \mathbb{R}^{L_{\max} \times d_{\text{model}}}$$
where:
- $P$ is the position embedding matrix
- $L_{\max}$ is the maximum sequence length the model supports
- $d_{\text{model}}$ is the embedding dimension
For a token at position $i$, its positional embedding is simply the $i$-th row of $P$:
$$PE(i) = P[i, :] \in \mathbb{R}^{d_{\text{model}}}$$
The final input to the transformer combines token and position embeddings:
$$\tilde{x}_i = \text{TokenEmbed}(\text{token}_i) + \text{PositionEmbed}(i)$$
That's it. The simplicity is striking—we're just adding another embedding lookup.
```python
import torch
import torch.nn as nn
import numpy as np
from typing import Optional


class LearnedPositionalEmbedding(nn.Module):
    """
    Learned positional embeddings as used in BERT, GPT-2, etc.

    Each position has a trainable d_model-dimensional vector that is
    learned end-to-end with the rest of the model.
    """

    def __init__(
        self,
        max_length: int,
        d_model: int,
        dropout: float = 0.1,
        initialization: str = "normal",  # "normal", "uniform", or "sinusoidal"
    ):
        """
        Args:
            max_length: Maximum sequence length (L_max)
            d_model: Embedding dimension
            dropout: Dropout after adding positional embeddings
            initialization: How to initialize the embeddings
        """
        super().__init__()
        self.max_length = max_length
        self.d_model = d_model

        # The position embedding matrix: [max_length, d_model]
        self.position_embeddings = nn.Embedding(max_length, d_model)

        # Initialize based on strategy
        self._initialize_embeddings(initialization)

        self.dropout = nn.Dropout(p=dropout)

        # Pre-compute position indices for efficiency
        self.register_buffer(
            "position_ids",
            torch.arange(max_length).unsqueeze(0),  # [1, max_length]
        )

    def _initialize_embeddings(self, method: str):
        """Initialize position embeddings."""
        if method == "normal":
            # Standard normal initialization (like BERT)
            nn.init.normal_(self.position_embeddings.weight, std=0.02)
        elif method == "uniform":
            # Uniform initialization
            nn.init.uniform_(self.position_embeddings.weight, -0.1, 0.1)
        elif method == "sinusoidal":
            # Initialize with sinusoidal values, then allow learning
            sinusoidal = self._create_sinusoidal_embeddings()
            self.position_embeddings.weight.data.copy_(sinusoidal)
        else:
            raise ValueError(f"Unknown initialization: {method}")

    def _create_sinusoidal_embeddings(self) -> torch.Tensor:
        """Create sinusoidal embeddings for initialization."""
        position = torch.arange(self.max_length).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, self.d_model, 2) * (-np.log(10000.0) / self.d_model)
        )
        pe = torch.zeros(self.max_length, self.d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(
        self,
        x: torch.Tensor,
        position_ids: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Add positional embeddings to input.

        Args:
            x: Input tensor of shape [batch_size, seq_len, d_model]
            position_ids: Optional explicit position indices [batch_size, seq_len].
                If None, uses positions [0, 1, 2, ..., seq_len-1].

        Returns:
            Tensor with positional information added
        """
        seq_len = x.size(1)
        if seq_len > self.max_length:
            raise ValueError(
                f"Sequence length {seq_len} exceeds maximum {self.max_length}"
            )

        if position_ids is None:
            # Use default positions: [0, 1, 2, ..., seq_len-1]
            position_ids = self.position_ids[:, :seq_len]

        # Look up position embeddings
        position_embeddings = self.position_embeddings(position_ids)

        # Add to input
        x = x + position_embeddings
        return self.dropout(x)

    def get_position_embedding(self, position: int) -> torch.Tensor:
        """Get embedding for a specific position."""
        return self.position_embeddings.weight[position]


# Full BERT-style embeddings combining all components
class BERTEmbeddings(nn.Module):
    """
    Complete embedding layer as used in BERT.

    Combines token embeddings, position embeddings, and segment embeddings.
    """

    def __init__(
        self,
        vocab_size: int,
        max_length: int,
        d_model: int,
        num_segments: int = 2,
        dropout: float = 0.1,
        layer_norm_eps: float = 1e-12,
    ):
        super().__init__()

        # Token embeddings
        self.token_embeddings = nn.Embedding(vocab_size, d_model)

        # Position embeddings
        self.position_embeddings = nn.Embedding(max_length, d_model)

        # Segment embeddings (for sentence A vs B)
        self.segment_embeddings = nn.Embedding(num_segments, d_model)

        # Layer normalization after combining
        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_eps)
        self.dropout = nn.Dropout(dropout)

        # Pre-compute positions
        self.register_buffer(
            "position_ids",
            torch.arange(max_length).unsqueeze(0),
        )

        # Initialize
        self._init_weights()

    def _init_weights(self):
        """Initialize all embeddings with small normal values."""
        nn.init.normal_(self.token_embeddings.weight, std=0.02)
        nn.init.normal_(self.position_embeddings.weight, std=0.02)
        nn.init.normal_(self.segment_embeddings.weight, std=0.02)

    def forward(
        self,
        input_ids: torch.Tensor,
        segment_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Compute combined embeddings.

        Args:
            input_ids: Token indices [batch_size, seq_len]
            segment_ids: Segment indices [batch_size, seq_len], default all 0s
            position_ids: Position indices [batch_size, seq_len], default 0..seq_len-1
        """
        seq_len = input_ids.size(1)

        # Defaults
        if position_ids is None:
            position_ids = self.position_ids[:, :seq_len]
        if segment_ids is None:
            segment_ids = torch.zeros_like(input_ids)

        # Look up all embeddings
        token_embeds = self.token_embeddings(input_ids)
        position_embeds = self.position_embeddings(position_ids)
        segment_embeds = self.segment_embeddings(segment_ids)

        # Combine
        embeddings = token_embeds + position_embeds + segment_embeds

        # Normalize and dropout
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings


# Example usage
def demonstrate_learned_embeddings():
    """Demonstrate learned positional embeddings."""
    batch_size = 2
    seq_len = 128
    d_model = 768
    max_length = 512

    # Create embedding layer
    pos_embed = LearnedPositionalEmbedding(
        max_length=max_length,
        d_model=d_model,
        initialization="normal",
    )

    # Random input (simulating token embeddings)
    x = torch.randn(batch_size, seq_len, d_model)

    # Add positional embeddings
    x_with_pos = pos_embed(x)

    print(f"Input shape: {x.shape}")
    print(f"Output shape: {x_with_pos.shape}")
    print(f"Position embedding parameters: {pos_embed.max_length * d_model:,}")
    print(f"  = {max_length} positions × {d_model} dimensions")
    return pos_embed


pos_embed = demonstrate_learned_embeddings()
```

Most implementations pre-register position indices as a buffer (a non-parameter tensor saved with the model). This allows efficient position lookup during inference without recreating the indices each time. The position embeddings themselves are regular `nn.Embedding` parameters that are updated during training.
One of the most significant differences between learned and sinusoidal positional embeddings is the parameter count. Let's analyze this carefully.
Parameter Count Formula
Learned positional embeddings require:
$$\text{Parameters} = L_{\max} \times d_{\text{model}}$$
For typical model configurations:
| Model | Max Length | d_model | Position Params | % of Total |
|---|---|---|---|---|
| BERT-Base | 512 | 768 | 393,216 | 0.36% |
| BERT-Large | 512 | 1024 | 524,288 | 0.16% |
| GPT-2 Small | 1024 | 768 | 786,432 | 0.63% |
| GPT-2 Large | 1024 | 1280 | 1,310,720 | 0.17% |
| Hypothetical 8K | 8192 | 1024 | 8,388,608 | ~1%+ |
| Hypothetical 32K | 32768 | 2048 | 67,108,864 | Significant |
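The counts in the table follow directly from the formula above; a quick sketch to reproduce them (the model names are just labels for the table rows):

```python
# Learned position embeddings cost exactly max_length * d_model parameters.
def position_embedding_params(max_length: int, d_model: int) -> int:
    return max_length * d_model

configs = {
    "BERT-Base": (512, 768),
    "BERT-Large": (512, 1024),
    "GPT-2 Small": (1024, 768),
    "GPT-2 Large": (1024, 1280),
    "Hypothetical 8K": (8192, 1024),
    "Hypothetical 32K": (32768, 2048),
}

for name, (max_length, d_model) in configs.items():
    print(f"{name:16s}: {position_embedding_params(max_length, d_model):>12,}")
# e.g. BERT-Base -> 393,216 and Hypothetical 32K -> 67,108,864
```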
Parameter Efficiency Comparison
| Aspect | Learned Embeddings | Sinusoidal Encoding |
|---|---|---|
| Parameters | $O(L_{\max} \times d)$ | 0 (computed) |
| Computation | One embedding lookup | Sin/cos computation |
| Memory at training | Store gradient for each position | No gradient needed |
| Memory at inference | Store full embedding matrix | Compute on demand |
When Parameters Matter
For BERT-scale models (110M-340M parameters), position embeddings are a negligible fraction. But as context lengths grow, the cost becomes significant.
The linear scaling of learned position parameters with context length is one reason modern long-context models (GPT-4, Claude, etc.) use relative position methods like RoPE or ALiBi. These approaches decouple parameter count from sequence length.
When we learn positional embeddings from data, the optimization process discovers position representations suited to the task. Let's examine what the model learns and how.
Initialization Strategies
The initialization of position embeddings significantly affects training:
- Normal initialization (std ≈ 0.02): the BERT-style default; small initial values keep the positional signal from overwhelming token embeddings early in training.
- Uniform initialization: a simple alternative with similarly small starting values.
- Sinusoidal initialization: starts from the fixed encoding and lets training adjust it, combining principled structure with learned flexibility.
What Does the Model Learn?
Research analyzing learned position embeddings reveals structured patterns:
Similarity Structure: Adjacent positions learn similar embeddings. The embedding space develops a smooth manifold where position distance correlates with embedding distance.
Periodic Components: Principal component analysis (PCA) of learned embeddings often reveals sinusoidal-like patterns, suggesting the model rediscovers useful periodic structure.
Task-Specific Patterns: Different tasks induce different position embedding structures, so embeddings trained under different objectives can look quite different when analyzed.
```python
import torch
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity


def analyze_position_embeddings(position_embeddings: torch.Tensor):
    """
    Comprehensive analysis of learned position embeddings.

    Args:
        position_embeddings: Tensor of shape [max_length, d_model]
    """
    pe = position_embeddings.detach().cpu().numpy()
    max_len, d_model = pe.shape
    print(f"Analyzing {max_len} position embeddings of dimension {d_model}")

    # === Analysis 1: Similarity Structure ===
    print("\n=== Similarity Analysis ===")
    sim_matrix = cosine_similarity(pe)

    # Check if adjacent positions are more similar
    adjacent_sims = [sim_matrix[i, i + 1] for i in range(max_len - 1)]
    distant_sims = [sim_matrix[0, i] for i in range(10, min(50, max_len))]
    print(f"Mean adjacent similarity: {np.mean(adjacent_sims):.4f}")
    print(f"Mean distant similarity (pos 0 to 10-50): {np.mean(distant_sims):.4f}")

    # === Analysis 2: PCA of Embeddings ===
    print("\n=== PCA Analysis ===")
    pca = PCA(n_components=min(10, d_model))
    pe_pca = pca.fit_transform(pe)
    print(
        f"Variance explained by first 5 components: "
        f"{pca.explained_variance_ratio_[:5].sum():.2%}"
    )

    # Check for periodic patterns in top components
    for i in range(min(3, pe_pca.shape[1])):
        component = pe_pca[:, i]
        # Simple periodicity check via autocorrelation
        autocorr = np.correlate(component, component, mode='full')
        autocorr = autocorr[len(autocorr) // 2:]
        autocorr = autocorr / autocorr[0]  # Normalize

        # Find significant peaks (excluding lag 0)
        peaks = []
        for j in range(1, len(autocorr) - 1):
            if autocorr[j] > autocorr[j - 1] and autocorr[j] > autocorr[j + 1]:
                if autocorr[j] > 0.3:  # Threshold for significance
                    peaks.append((j, autocorr[j]))
        if peaks:
            print(f"PC{i+1}: Detected periodicity at distances {[p[0] for p in peaks[:3]]}")
        else:
            print(f"PC{i+1}: No strong periodic pattern detected")

    # === Analysis 3: Norm Distribution ===
    print("\n=== Norm Analysis ===")
    norms = np.linalg.norm(pe, axis=1)
    print(f"Norm range: [{norms.min():.4f}, {norms.max():.4f}]")
    print(f"Norm mean: {norms.mean():.4f}, std: {norms.std():.4f}")

    # Check if early positions have different norms (common pattern)
    early_norm = norms[:20].mean()
    late_norm = norms[-20:].mean()
    print(f"Early positions (0-19) mean norm: {early_norm:.4f}")
    print(f"Late positions mean norm: {late_norm:.4f}")

    # === Analysis 4: Distance Structure ===
    print("\n=== Distance Structure ===")
    # Euclidean distance as a function of position difference
    distances = {}
    for delta in [1, 2, 5, 10, 20, 50, 100]:
        if delta < max_len:
            dists = [
                np.linalg.norm(pe[i] - pe[i + delta])
                for i in range(max_len - delta)
            ]
            distances[delta] = np.mean(dists)
            print(f"Mean distance for Δ={delta:3d}: {distances[delta]:.4f}")

    return {
        "similarity_matrix": sim_matrix,
        "pca_result": pe_pca,
        "explained_variance": pca.explained_variance_ratio_,
        "norms": norms,
        "distances": distances,
    }


# Example: Analyze a randomly initialized embedding (simulating pre-training)
# In practice, you would load weights from a trained model
max_length, d_model = 512, 768
position_embed = torch.nn.Embedding(max_length, d_model)
torch.nn.init.normal_(position_embed.weight, std=0.02)

print("Analysis of RANDOM initialization (before training):")
print("=" * 50)
random_analysis = analyze_position_embeddings(position_embed.weight)

# Show that random init has no structure
print("\nNote: Random initialization shows no meaningful patterns.")
print("After training, we would expect:")
print("  - Higher adjacent similarity")
print("  - Periodic patterns in PCA components")
print("  - Structured distance relationships")
```

Research by Shaw et al. and others has shown that learned position embeddings, when analyzed after training, often exhibit structure similar to sinusoidal encodings: smooth transitions between positions and periodic components. The model independently discovers that this structure is useful for representing position.
A critical question is whether learned embeddings outperform sinusoidal encoding. Research and practical experience provide nuanced answers.
Original Transformer Results
The original "Attention Is All You Need" paper reported:
"We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results."
For the WMT machine translation task, both approaches achieved similar BLEU scores (within 0.1 BLEU). This early result suggested the choice might not matter significantly.
Subsequent Research Findings
More extensive experiments across tasks have revealed patterns:
| Task/Setting | Learned | Sinusoidal | Winner |
|---|---|---|---|
| Machine Translation (WMT) | 27.4 BLEU | 27.3 BLEU | ~Tie |
| Language Understanding (GLUE avg) | 82.1 | 81.8 | Learned (slight) |
| Question Answering (SQuAD) | 88.5 F1 | 88.2 F1 | Learned (slight) |
| Length Generalization (2x train) | Poor | Moderate | Sinusoidal |
| Low-Data Regime | Underperforms | Stable | Sinusoidal |
| Very Long Contexts (>4K) | Poor scaling | Good scaling | Sinusoidal |
When Learned Embeddings Excel
Abundant training data: With sufficient data, the model can learn task-specific position patterns that sinusoidal encoding cannot capture.
Fixed sequence length at inference: If all inputs have similar lengths to training data, learned embeddings shine.
Task-specific position patterns: Some tasks have unusual positional requirements (e.g., structured documents) that benefit from learned representations.
When Sinusoidal Encoding Excels
Length generalization: Sinusoidal encoding can generate valid embeddings for any position, while learned embeddings fail for unseen positions.
Low-data regimes: With limited training data, sinusoidal encoding provides useful inductive bias.
Parameter efficiency: For long-context models, sinusoidal encoding avoids the parameter explosion.
Interpretability: The mathematical structure of sinusoidal encoding is fully understood; learned embeddings are opaque.
Modern large language models have largely moved beyond both pure learned and pure sinusoidal approaches, favoring relative position methods like RoPE. These combine the parameter efficiency of sinusoidal encoding with learned flexibility through the attention mechanism. For medium-scale models with fixed context, learned embeddings remain a solid default.
The most significant limitation of learned positional embeddings is their inability to generalize to sequence lengths not seen during training. This is a fundamental architectural constraint.
The Hard Limit
If a model is trained with $L_{\max} = 512$, the position embedding matrix has 512 rows. At inference time:
- Positions 0 through 511 have learned embeddings and work normally.
- Position 512 and beyond have no corresponding row; the embedding lookup fails with an index error.
There is no mathematically principled way to extend learned embeddings to new positions. Common workarounds have significant limitations:
```python
import torch
import torch.nn as nn


def demonstrate_length_limitation():
    """Show the hard limit of learned position embeddings."""
    max_length = 512
    d_model = 768

    # Standard learned embedding
    pos_embed = nn.Embedding(max_length, d_model)
    nn.init.normal_(pos_embed.weight, std=0.02)

    # Works fine for positions within training range
    valid_positions = torch.tensor([0, 100, 256, 511])
    embeddings = pos_embed(valid_positions)
    print(f"Valid positions {valid_positions.tolist()}: OK")
    print(f"Embedding shape: {embeddings.shape}")

    # Fails for positions beyond training range
    invalid_positions = torch.tensor([512, 600, 1000])
    try:
        embeddings = pos_embed(invalid_positions)
        print("This should not print")
    except IndexError:
        print(f"Invalid positions {invalid_positions.tolist()}: FAILED")
        print("Error: Index out of range")

    return pos_embed


def workaround_modular(pos_embed: nn.Embedding, positions: torch.Tensor):
    """
    Modular position workaround: position mod max_length.

    Problem: Position 512 becomes indistinguishable from position 0!
    """
    max_length = pos_embed.num_embeddings
    modular_positions = positions % max_length
    return pos_embed(modular_positions)


def workaround_clipping(pos_embed: nn.Embedding, positions: torch.Tensor):
    """
    Clipping workaround: cap at max_length - 1.

    Problem: All positions >= max_length are identical!
    """
    max_length = pos_embed.num_embeddings
    clipped_positions = torch.clamp(positions, max=max_length - 1)
    return pos_embed(clipped_positions)


def workaround_interpolation(
    pos_embed: nn.Embedding,
    positions: torch.Tensor,
    target_max: int,
):
    """
    Position interpolation: rescale positions to fit within trained range.

    Problem: Adjacent positions may get the same embedding, losing resolution.
    """
    max_length = pos_embed.num_embeddings
    # Scale positions: [0, target_max] -> [0, max_length-1]
    scaled = (positions.float() / target_max * (max_length - 1)).long()
    scaled = torch.clamp(scaled, max=max_length - 1)
    return pos_embed(scaled)


# Demonstration
print("=== Length Generalization with Learned Embeddings ===\n")
pos_embed = demonstrate_length_limitation()
print()

# Test workarounds
print("=== Workaround Analysis ===\n")
long_positions = torch.tensor([0, 256, 512, 600, 1024])

# Modular
modular_emb = workaround_modular(pos_embed, long_positions)
mod_positions = long_positions % 512
print(f"Modular: positions {long_positions.tolist()} -> {mod_positions.tolist()}")
print("  Problem: 512 and 0 have identical embeddings!")

# Check similarity
sim_0_512 = torch.cosine_similarity(
    workaround_modular(pos_embed, torch.tensor([0])),
    workaround_modular(pos_embed, torch.tensor([512])),
    dim=1,
)
print(f"  Cosine similarity(pos 0, pos 512) = {sim_0_512.item():.4f}")
print()

# Clipping
clip_positions = torch.clamp(long_positions, max=511)
print(f"Clipping: positions {long_positions.tolist()} -> {clip_positions.tolist()}")
print("  Problem: 512, 600, 1024 all map to 511!")
print()

# Interpolation (assuming target max is 1024)
interp_emb = workaround_interpolation(pos_embed, long_positions, 1024)
scaled_positions = (long_positions.float() / 1024 * 511).long()
print(f"Interpolation: positions {long_positions.tolist()} -> {scaled_positions.tolist()}")
print("  Problem: Lost half the position resolution!")
```

There is no principled way to extend learned positional embeddings beyond their trained range; this is a fundamental limitation of the approach. For applications requiring variable or long sequence lengths, relative position methods (RoPE, ALiBi) are strongly preferred.
Researchers have developed several hybrid approaches that combine the benefits of learned and fixed positional encodings.
Sinusoidal Initialization with Learning
A popular approach initializes position embeddings with sinusoidal values, then allows gradient updates during training:
```python
import torch
import torch.nn as nn
import numpy as np


class SinusoidalInitializedLearnedPE(nn.Module):
    """
    Position embeddings initialized with sinusoidal values,
    then fine-tuned during training.

    Benefits:
    - Starts with mathematically principled structure
    - Can adapt structure to task-specific needs
    - Better early training dynamics
    """

    def __init__(self, max_length: int, d_model: int):
        super().__init__()
        # Create sinusoidal initialization
        pe = self._create_sinusoidal(max_length, d_model)
        # Register as learnable parameter
        self.position_embeddings = nn.Parameter(pe)

    def _create_sinusoidal(self, max_length: int, d_model: int) -> torch.Tensor:
        position = torch.arange(max_length).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_length, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        return x + self.position_embeddings[:seq_len]


class FrozenSinusoidalWithLearnedScaling(nn.Module):
    """
    Frozen sinusoidal base with learned per-dimension scaling.

    This keeps the extrapolation properties of sinusoidal encoding
    while allowing learned adjustments.
    """

    def __init__(self, max_length: int, d_model: int):
        super().__init__()
        # Frozen sinusoidal base
        pe = self._create_sinusoidal(max_length, d_model)
        self.register_buffer('sinusoidal_base', pe)

        # Learned per-dimension scaling
        self.scale = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))

    def _create_sinusoidal(self, max_length: int, d_model: int) -> torch.Tensor:
        position = torch.arange(max_length).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_length, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Apply learned transformation to frozen base
        pe = self.sinusoidal_base[:seq_len] * self.scale + self.bias
        return x + pe

    def get_extended_embedding(self, position: int) -> torch.Tensor:
        """
        Get embedding for a position beyond the training range.
        Still works, because the sinusoidal base can be computed for any position!
        """
        d_model = self.scale.size(0)
        # Compute sinusoidal values for the new position
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
        )
        pe = torch.zeros(d_model)
        pe[0::2] = torch.sin(position * div_term)
        pe[1::2] = torch.cos(position * div_term)
        # Apply learned transformation
        return pe * self.scale + self.bias


# Demonstration
def compare_initialization_strategies():
    """Compare learnable parameter counts across strategies."""
    max_length, d_model = 512, 256

    strategies = {
        "Random Normal": lambda: nn.Embedding(max_length, d_model),
        "Sinusoidal init (learnable)": lambda: SinusoidalInitializedLearnedPE(max_length, d_model),
        "Frozen sinusoidal + learned scale": lambda: FrozenSinusoidalWithLearnedScaling(max_length, d_model),
    }

    for name, create_fn in strategies.items():
        embed = create_fn()
        # Count learnable parameters
        num_params = sum(p.numel() for p in embed.parameters() if p.requires_grad)
        print(f"{name}:")
        print(f"  Learnable parameters: {num_params:,}")
        if isinstance(embed, SinusoidalInitializedLearnedPE):
            print(f"  Initial norm: {embed.position_embeddings.norm():.2f}")
        elif isinstance(embed, nn.Embedding):
            print(f"  Initial norm: {embed.weight.norm():.2f}")
        print()


compare_initialization_strategies()
```

Factored Position Embeddings
Some architectures decompose position embeddings into lower-rank components to reduce parameters:
$$P = A B, \qquad PE(i) = A[i, :] \, B$$

where:
- $A \in \mathbb{R}^{L_{\max} \times r}$ assigns each position a rank-$r$ code
- $B \in \mathbb{R}^{r \times d_{\text{model}}}$ projects codes up to the model dimension
- $r \ll d_{\text{model}}$ is the factorization rank

This reduces parameters from $O(L_{\max} \times d)$ to $O(L_{\max} \times r + r \times d)$.
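A minimal sketch of such a factorization (the class name, default rank, and bias-free linear projection are illustrative assumptions, not a specific published design):

```python
import torch
import torch.nn as nn


class FactoredPositionalEmbedding(nn.Module):
    """Low-rank factored position embeddings: each position gets a rank-r
    code that is projected up to d_model, reducing parameters from
    L_max * d_model to L_max * r + r * d_model."""

    def __init__(self, max_length: int, d_model: int, rank: int = 32):
        super().__init__()
        self.codes = nn.Embedding(max_length, rank)          # A: [L_max, r]
        self.project = nn.Linear(rank, d_model, bias=False)  # B: [r, d_model]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        pe = self.project(self.codes(positions))  # [seq_len, d_model]
        return x + pe


# Savings for a hypothetical 8K-context model:
full = 8192 * 1024                 # 8,388,608 parameters (full table)
factored = 8192 * 32 + 32 * 1024   # 294,912 parameters (rank 32)
print(full, factored)
```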
Segment-Level Positions
For very long documents, some models use hierarchical position representations:
- a segment-level position identifying which chunk of the document a token belongs to
- a within-segment position giving the token's offset inside that chunk
This reduces the position vocabulary for each level while maintaining global position information.
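One way the hierarchical scheme could be realized (a sketch; the segment size and the additive combination of the two embeddings are assumed design choices):

```python
import torch
import torch.nn as nn


class HierarchicalPositionalEmbedding(nn.Module):
    """Position = (segment index, offset within segment). Each level has its
    own small embedding table and the two embeddings are summed, so
    num_segments * segment_size positions need only
    (num_segments + segment_size) * d_model parameters."""

    def __init__(self, num_segments: int, segment_size: int, d_model: int):
        super().__init__()
        self.segment_size = segment_size
        self.segment_embed = nn.Embedding(num_segments, d_model)
        self.offset_embed = nn.Embedding(segment_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        seg = positions // self.segment_size  # which chunk of the document
        off = positions % self.segment_size   # offset inside that chunk
        return x + self.segment_embed(seg) + self.offset_embed(off)


# 64 segments x 512 offsets covers 32,768 positions with
# (64 + 512) * d_model parameters instead of 32,768 * d_model.
```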
For new projects with fixed-length contexts (≤2K tokens), learned embeddings initialized from sinusoidal values often provide a good balance of flexibility and stability. For longer contexts or length-variable deployments, consider RoPE or other relative methods from the start.
Let's consolidate practical recommendations for implementing learned positional embeddings.
Decision Framework
Use learned positional embeddings when:
- Sequence lengths are fixed and inference inputs match the training distribution
- Training data is abundant enough to learn position patterns from scratch
- The task may benefit from position patterns that fixed encodings cannot express
Avoid learned positional embeddings when:
- Inputs may exceed the trained maximum length
- Contexts are long (>4K tokens) and the $L_{\max} \times d_{\text{model}}$ parameter cost matters
- Training data is limited and the inductive bias of a fixed encoding would help
Implementation Checklist
```python
import torch
import torch.nn as nn
import numpy as np
from typing import Optional
import logging

logger = logging.getLogger(__name__)


class ProductionLearnedPositionalEmbedding(nn.Module):
    """
    Production-ready learned positional embedding with all best practices.
    """

    def __init__(
        self,
        max_length: int,
        d_model: int,
        dropout: float = 0.1,
        init_std: float = 0.02,
        use_sinusoidal_init: bool = False,
    ):
        super().__init__()
        self.max_length = max_length
        self.d_model = d_model

        # Create embeddings
        self.embeddings = nn.Embedding(max_length, d_model)

        # Initialize
        if use_sinusoidal_init:
            self._init_sinusoidal()
            logger.info("Initialized position embeddings with sinusoidal values")
        else:
            nn.init.normal_(self.embeddings.weight, std=init_std)
            logger.info(f"Initialized position embeddings with N(0, {init_std}²)")

        self.dropout = nn.Dropout(p=dropout)

        # Pre-compute position indices
        positions = torch.arange(max_length, dtype=torch.long)
        self.register_buffer('position_ids', positions.unsqueeze(0))

        # Log configuration
        logger.info(
            f"LearnedPositionalEmbedding: max_length={max_length}, "
            f"d_model={d_model}, params={max_length * d_model:,}"
        )

    def _init_sinusoidal(self):
        """Initialize with sinusoidal values."""
        position = torch.arange(self.max_length).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, self.d_model, 2).float()
            * (-np.log(10000.0) / self.d_model)
        )
        weight = torch.zeros(self.max_length, self.d_model)
        weight[:, 0::2] = torch.sin(position * div_term)
        weight[:, 1::2] = torch.cos(position * div_term)
        self.embeddings.weight.data.copy_(weight)

    def forward(
        self,
        x: torch.Tensor,
        position_ids: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Add positional embeddings to input.

        Args:
            x: Input embeddings [batch_size, seq_len, d_model]
            position_ids: Optional custom positions [batch_size, seq_len]
        """
        batch_size, seq_len, _ = x.shape

        # Validation
        if seq_len > self.max_length:
            raise ValueError(
                f"Input sequence length {seq_len} exceeds maximum "
                f"supported length {self.max_length}. Consider using "
                f"a different positional encoding method for longer sequences."
            )

        # Get positions
        if position_ids is None:
            position_ids = self.position_ids[:, :seq_len]
            position_ids = position_ids.expand(batch_size, -1)

        # Look up embeddings
        position_embeddings = self.embeddings(position_ids)

        # Add to input and apply dropout
        output = x + position_embeddings
        return self.dropout(output)

    def get_embedding_stats(self) -> dict:
        """Return statistics about the embeddings for monitoring."""
        weight = self.embeddings.weight.data
        return {
            "mean": weight.mean().item(),
            "std": weight.std().item(),
            "min": weight.min().item(),
            "max": weight.max().item(),
            "norm_mean": weight.norm(dim=1).mean().item(),
            "norm_std": weight.norm(dim=1).std().item(),
        }


# Configuration template
DEFAULT_CONFIG = {
    "max_length": 512,
    "d_model": 768,
    "dropout": 0.1,
    "init_std": 0.02,
    "use_sinusoidal_init": False,
}


def create_position_embeddings(config: dict):
    """Factory function for creating position embeddings."""
    merged_config = {**DEFAULT_CONFIG, **config}
    return ProductionLearnedPositionalEmbedding(**merged_config)
```

Learned positional embeddings represent a flexible, data-driven approach to positional encoding. Let's consolidate our understanding:
When to Choose Learned Embeddings
| Scenario | Recommendation |
|---|---|
| Standard BERT/GPT-style model, fixed context | ✅ Learned embeddings work well |
| Long document processing (>4K tokens) | ❌ Use RoPE or ALiBi |
| Deployment with variable input lengths | ⚠️ Consider relative methods |
| Research exploring position patterns | ✅ Learned allows analysis |
| Low-resource training | ⚠️ Sinusoidal may generalize better |
What's Next: Relative Positional Representations
We've covered the two main "absolute" approaches to positional encoding: sinusoidal (fixed) and learned. But both share a fundamental design decision: assigning a unique vector to each absolute position.
In the next page, we explore relative positional representations—approaches that encode the distance between positions rather than their absolute indices. This paradigm shift offers:
- better length generalization, since relative distances recur regardless of absolute position
- parameter counts decoupled from sequence length
- a closer match to how attention compares pairs of tokens
These insights laid the groundwork for modern approaches like RoPE that now dominate large-scale language models.
You now understand learned positional embeddings in depth—their implementation, training dynamics, empirical performance, fundamental limitations, and practical deployment considerations. This knowledge enables informed decisions about positional encoding choices and prepares you for understanding the relative position methods that followed.