In the landmark 2017 paper "Attention Is All You Need," Vaswani et al. introduced not only the transformer architecture but also an elegant solution to its positional blindness: sinusoidal positional encoding. This approach encodes position using sine and cosine functions of varying frequencies, creating a unique signature for each position that the model can leverage for sequence understanding.
The sinusoidal encoding is remarkable for several reasons: it gives every position a unique, bounded signature; relative offsets correspond to simple rotations of that signature; its frequencies span scales from adjacent tokens to entire documents; and it requires no learned parameters, so encodings can be generated for sequences of any length.
This page provides a comprehensive treatment of sinusoidal encoding: its precise formulation, mathematical properties, implementation details, and the insights that made it successful.
The sinusoidal encoding was the first positional encoding used in transformers and influenced all subsequent approaches. Even as modern architectures adopt alternatives like RoPE or ALiBi, understanding sinusoidal encoding provides essential intuition for why those alternatives were developed and how they improve upon the original.
The Core Formula
The sinusoidal positional encoding is defined as follows. For a position $pos$ in the sequence and dimension $i$ of the encoding:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Where:
- $pos$ is the token's position in the sequence ($0, 1, 2, \ldots$),
- $i$ indexes the dimension pair, so $2i$ and $2i+1$ range over all $d_{\text{model}}$ embedding dimensions,
- $d_{\text{model}}$ is the model's embedding dimension.
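As a quick worked example, take $d_{\text{model}} = 4$ and $pos = 1$. For the first dimension pair ($i = 0$) the divisor is $10000^{0} = 1$, and for the second ($i = 1$) it is $10000^{2/4} = 100$, giving
$$PE(1) = \big(\sin 1,\ \cos 1,\ \sin 0.01,\ \cos 0.01\big) \approx (0.841,\ 0.540,\ 0.010,\ 1.000).$$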
Unpacking the Formula
The encoding creates a unique $d_{\text{model}}$-dimensional vector for each position. Let's understand each component:
1. The Frequency Term: $10000^{2i/d_{\text{model}}}$
This term determines how rapidly the sine/cosine oscillates along the position axis. Its reciprocal is the angular frequency for dimension pair $i$:
$$\omega_i = \frac{1}{10000^{2i/d_{\text{model}}}}$$
The corresponding wavelength is:
$$\lambda_i = \frac{2\pi}{\omega_i} = 2\pi \cdot 10000^{2i/d_{\text{model}}}$$
2. The Dimension Indexing
Even dimensions ($2i$) use sine; odd dimensions ($2i+1$) use cosine. This pairing is crucial for the relative position property we'll explore shortly.
Why 10000? This value was chosen to span a wide range of wavelengths. The lowest frequency (i = d_model/2 - 1) has wavelength ≈ 2π × 10000 ≈ 62,832 positions, far exceeding typical sequence lengths. The highest frequency (i = 0) has wavelength 2π ≈ 6.28 positions. This geometric progression covers the range from very local to very global positional patterns.
Frequency Progression
The frequencies form a geometric progression from $1$ to $1/10000$:
| Dimension Pair $i$ | Wavelength | Pattern Detection Scale |
|---|---|---|
| $i = 0$ | $2\pi \approx 6.28$ | Very local (adjacent tokens) |
| $i = d_{\text{model}}/4$ | $2\pi \cdot 100 \approx 628$ | Sentence-level |
| $i = d_{\text{model}}/2 - 1$ | $2\pi \cdot 10000 \approx 62,832$ | Document-level |
This multi-scale encoding allows the model to detect positional patterns at all relevant scales simultaneously—from the adjacency of consecutive tokens to the broad structure of long documents.
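The progression above can be reproduced in a few lines. The sketch below assumes $d_{\text{model}} = 512$ and prints the frequency and wavelength for the three dimension pairs in the table; the slowest wavelength comes out just under the $2\pi \cdot 10000 \approx 62{,}832$ approximation because $2i/d_{\text{model}}$ is slightly less than $1$ at $i = d_{\text{model}}/2 - 1$.

```python
import numpy as np

# Sketch: tabulate angular frequency and wavelength for a few dimension pairs.
# d_model = 512 is an illustrative choice, not a requirement of the encoding.
d_model = 512
for i in [0, d_model // 4, d_model // 2 - 1]:        # dimension-pair index i
    omega = 1.0 / 10000 ** (2 * i / d_model)         # angular frequency omega_i
    wavelength = 2 * np.pi / omega                   # lambda_i = 2*pi * 10000^(2i/d_model)
    print(f"i = {i:3d}: omega = {omega:.2e}, wavelength ≈ {wavelength:,.1f} positions")
```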
Let's implement sinusoidal positional encoding from scratch, examining each step in detail:
```python
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt


class SinusoidalPositionalEncoding(nn.Module):
    """
    Sinusoidal Positional Encoding as described in
    "Attention Is All You Need" (Vaswani et al., 2017).

    The encoding uses sine and cosine functions of different frequencies
    to create unique positional signatures.
    """

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        """
        Args:
            d_model: Model embedding dimension (must be even)
            max_len: Maximum sequence length to pre-compute
            dropout: Dropout probability applied after adding positional encoding
        """
        super().__init__()
        assert d_model % 2 == 0, "d_model must be even for sin/cos pairing"

        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)

        # Pre-compute positional encodings for all positions up to max_len
        pe = self._create_positional_encoding(max_len, d_model)

        # Register as buffer (not a parameter, but saved with the model)
        self.register_buffer('pe', pe)

    def _create_positional_encoding(self, max_len: int, d_model: int) -> torch.Tensor:
        """
        Create the sinusoidal positional encoding matrix.

        Returns:
            Tensor of shape [1, max_len, d_model]
        """
        # Position indices: [0, 1, 2, ..., max_len-1]
        position = torch.arange(max_len).unsqueeze(1)  # Shape: [max_len, 1]

        # Dimension indices for even positions: [0, 2, 4, ..., d_model-2]
        # We compute div_term = 10000^(-2i/d_model)
        # using exp(-log(10000) * 2i/d_model) for numerical stability
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-np.log(10000.0) / d_model)
        )  # Shape: [d_model/2]

        # Initialize encoding matrix
        pe = torch.zeros(max_len, d_model)

        # Apply sin to even dimensions, cos to odd dimensions
        pe[:, 0::2] = torch.sin(position * div_term)  # Even: sin(pos / 10000^(2i/d))
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd: cos(pos / 10000^(2i/d))

        # Add batch dimension
        pe = pe.unsqueeze(0)  # Shape: [1, max_len, d_model]
        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Add positional encoding to input embeddings.

        Args:
            x: Input tensor of shape [batch_size, seq_len, d_model]

        Returns:
            Tensor of shape [batch_size, seq_len, d_model] with positions added
        """
        # Slice the pre-computed encodings to match this batch's sequence length
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len, :]
        return self.dropout(x)

    def get_encoding(self, positions: torch.Tensor) -> torch.Tensor:
        """
        Get positional encoding for specific positions (useful for analysis).

        Args:
            positions: Tensor of position indices

        Returns:
            Positional encodings for those positions
        """
        return self.pe[0, positions, :]


# Demonstration
def visualize_positional_encoding():
    """Visualize the sinusoidal encoding patterns."""
    d_model = 128
    max_len = 100

    encoder = SinusoidalPositionalEncoding(d_model, max_len, dropout=0.0)

    # Get encodings for all positions
    pe = encoder.pe[0, :max_len, :].numpy()

    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # 1. Heatmap of positional encodings
    im = axes[0, 0].imshow(pe, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
    axes[0, 0].set_xlabel('Dimension')
    axes[0, 0].set_ylabel('Position')
    axes[0, 0].set_title('Positional Encoding Heatmap')
    plt.colorbar(im, ax=axes[0, 0])

    # 2. First few dimensions across positions
    for dim in [0, 1, 10, 11, 50, 51]:
        axes[0, 1].plot(pe[:, dim], label=f'Dim {dim}')
    axes[0, 1].set_xlabel('Position')
    axes[0, 1].set_ylabel('Encoding Value')
    axes[0, 1].set_title('Encoding Values by Dimension')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Position similarity matrix
    pe_normalized = pe / np.linalg.norm(pe, axis=1, keepdims=True)
    similarity = pe_normalized @ pe_normalized.T
    im = axes[1, 0].imshow(similarity, aspect='equal', cmap='viridis')
    axes[1, 0].set_xlabel('Position')
    axes[1, 0].set_ylabel('Position')
    axes[1, 0].set_title('Cosine Similarity Between Positions')
    plt.colorbar(im, ax=axes[1, 0])

    # 4. Distance to position 0
    distances = np.linalg.norm(pe - pe[0:1, :], axis=1)
    axes[1, 1].plot(distances)
    axes[1, 1].set_xlabel('Position')
    axes[1, 1].set_ylabel('L2 Distance from Position 0')
    axes[1, 1].set_title('Encoding Distance from First Position')
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


# Run demonstration
# (instantiate an encoder here so the summary prints below have one to inspect)
encoder = SinusoidalPositionalEncoding(d_model=128, max_len=100, dropout=0.0)
print("Sinusoidal Positional Encoding Properties:")
print(f"- All values in [-1, 1]: {encoder.pe.min():.4f} to {encoder.pe.max():.4f}")
print(f"- Shape: {encoder.pe.shape}")
```

The implementation computes `div_term` as exp(-log(10000) · 2i/d_model) rather than evaluating 10000^(2i/d_model) and then dividing. The two are mathematically equivalent, but the log-space form is more stable and avoids large intermediate powers; this is a common pattern in neural network implementations.
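Assuming the `SinusoidalPositionalEncoding` class above is in scope, usage is a single call on a batch of embeddings:

```python
import torch

# Usage sketch: batch size, sequence length, and d_model are arbitrary choices.
pos_enc = SinusoidalPositionalEncoding(d_model=128, max_len=512, dropout=0.0)
tokens = torch.randn(4, 50, 128)   # [batch, seq_len, d_model] dummy token embeddings
out = pos_enc(tokens)              # positional encoding added, dropout applied
print(out.shape)                   # torch.Size([4, 50, 128])
```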
The most elegant aspect of sinusoidal encoding is its ability to represent relative positions through linear transformations. This property was a key motivation for choosing sine and cosine functions.
The Key Insight
For any fixed offset $k$, there exists a linear transformation $M_k$ such that:
$$PE(pos + k) = M_k \cdot PE(pos)$$
This means the model can attend to relative positions using simple matrix operations.
Derivation
Using the angle addition formulas $\sin(a + b) = \sin a \cos b + \cos a \sin b$ and $\cos(a + b) = \cos a \cos b - \sin a \sin b$:
For a position pair $(pos, pos+k)$ at dimension index $i$:
$$\begin{pmatrix} \sin(\omega_i(pos+k)) \\ \cos(\omega_i(pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \cdot pos) \\ \cos(\omega_i \cdot pos) \end{pmatrix}$$
Where $\omega_i = 1/10000^{2i/d_{\text{model}}}$ is the angular frequency for dimension $i$.
This is a rotation matrix! The positional encoding at position $pos + k$ is simply the encoding at position $pos$ rotated by an angle proportional to $k$.
Each (sin, cos) pair can be viewed as a point on a unit circle. Advancing by k positions rotates this point by angle ω_i × k. Different frequency components rotate at different rates, creating a unique trajectory for each position. This geometric interpretation is foundational to Rotary Position Embedding (RoPE), which we'll cover in a later page.
```python
import torch
import numpy as np


def compute_relative_transformation(d_model: int = 64, k: int = 5):
    """
    Demonstrate that PE(pos + k) = M_k @ PE(pos) for sinusoidal encoding.
    """
    # Compute angular frequencies for each dimension pair
    dim_indices = torch.arange(0, d_model, 2).float()
    omega = 1.0 / (10000.0 ** (dim_indices / d_model))  # Shape: [d_model/2]

    # For a test position
    pos = 10

    # Direct computation of PE(pos) and PE(pos + k)
    pe_pos = torch.zeros(d_model)
    pe_pos[0::2] = torch.sin(pos * omega)
    pe_pos[1::2] = torch.cos(pos * omega)

    pe_pos_k = torch.zeros(d_model)
    pe_pos_k[0::2] = torch.sin((pos + k) * omega)
    pe_pos_k[1::2] = torch.cos((pos + k) * omega)

    # Construct the transformation matrix M_k:
    # for each (sin, cos) pair, apply a 2x2 rotation
    M_k = torch.zeros(d_model, d_model)
    for i in range(d_model // 2):
        angle = k * omega[i]
        cos_angle = torch.cos(angle)
        sin_angle = torch.sin(angle)

        # Rotation matrix for this dimension pair
        M_k[2*i, 2*i] = cos_angle
        M_k[2*i, 2*i + 1] = sin_angle
        M_k[2*i + 1, 2*i] = -sin_angle
        M_k[2*i + 1, 2*i + 1] = cos_angle

    # Compute PE(pos + k) via the transformation
    pe_pos_k_computed = M_k @ pe_pos

    # Verify they match (tolerance accounts for float32 rounding)
    error = torch.norm(pe_pos_k - pe_pos_k_computed).item()
    print(f"Position: {pos}, Offset: {k}")
    print(f"Direct PE(pos+k) vs M_k @ PE(pos) error: {error:.2e}")
    assert error < 1e-5, "Relative position property violated!"

    return M_k, pe_pos, pe_pos_k


# Verify for multiple offsets
for offset in [1, 5, 10, 50, 100]:
    compute_relative_transformation(d_model=64, k=offset)

print("\n✓ Relative position property verified for all offsets!")

# Show that M_k is indeed a rotation (orthogonal matrix with det = 1)
M_5, _, _ = compute_relative_transformation(d_model=64, k=5)
print(f"\nM_5 is orthogonal: |M_5 @ M_5.T - I| = {torch.norm(M_5 @ M_5.T - torch.eye(64)):.2e}")
print(f"det(M_5) = {torch.linalg.det(M_5):.4f} (should be ~1.0)")
```

Why This Matters
The relative position property has profound implications:
Efficient Relative Attention: The attention mechanism can compute relative position relationships through standard operations on the position-encoded representations.
Translation Invariance Potential: The same relative offset produces the same transformation regardless of absolute position. "Word at position 5 attending to word at position 3" has the same positional offset as "position 105 attending to 103."
Theoretical Foundation: This property directly inspired Rotary Position Embeddings (RoPE), which make the rotation explicit in the attention computation.
Limitation: Dot Product Doesn't Directly Encode Distance
However, there's a subtlety. The dot product $PE(pos_1) \cdot PE(pos_2)$ does not simply encode $|pos_1 - pos_2|$:
$$PE(pos_1) \cdot PE(pos_2) = \sum_{i=0}^{d/2-1} \left[\sin(\omega_i p_1)\sin(\omega_i p_2) + \cos(\omega_i p_1)\cos(\omega_i p_2)\right]$$
Using the identity $\cos(a - b) = \cos a \cos b + \sin a \sin b$:
$$= \sum_{i=0}^{d/2-1} \cos(\omega_i (p_1 - p_2))$$
This is a sum of cosines at different frequencies—not a simple function of the distance. The model must learn to extract relative position information from this representation.
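A short sketch (with an assumed $d_{\text{model}}$ and arbitrary positions) that evaluates both sides of this identity and confirms the dot product depends only on the offset, not on the absolute positions:

```python
import numpy as np

# Sketch: compare PE(p1)·PE(p2) with the sum-of-cosines form derived above.
d_model = 64
pair_idx = np.arange(0, d_model, 2)
omega = 1.0 / 10000 ** (pair_idx / d_model)   # angular frequency per (sin, cos) pair


def pe(pos: int) -> np.ndarray:
    """Sinusoidal encoding of a single position."""
    enc = np.empty(d_model)
    enc[0::2] = np.sin(pos * omega)
    enc[1::2] = np.cos(pos * omega)
    return enc


p1, p2 = 37, 52
direct = pe(p1) @ pe(p2)
cosine_sum = np.sum(np.cos(omega * (p1 - p2)))
print(f"PE({p1})·PE({p2}) = {direct:.6f}, sum of cosines = {cosine_sum:.6f}")

# The dot product depends only on the offset: shifting both positions leaves it unchanged.
print(f"PE({p1 + 100})·PE({p2 + 100}) = {pe(p1 + 100) @ pe(p2 + 100):.6f}")
```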
The sinusoidal encoding has a beautiful geometric interpretation that provides intuition for its behavior.
Multi-Scale Clocks
Imagine $d_{\text{model}}/2$ clocks, each running at a different speed: the fastest completes a full revolution every $2\pi \approx 6.3$ positions, while the slowest takes roughly $2\pi \cdot 10000 \approx 62{,}832$ positions.
Each clock's (sin, cos) outputs form a 2D point on a unit circle. The full positional encoding is the concatenation of all these 2D points—a point on a $d_{\text{model}}/2$-dimensional torus.
Why This Representation Works
Uniqueness: Different positions land at different points on each clock. With enough clocks at incommensurate frequencies, no two positions have the same encoding (up to very long sequences).
Smoothness: Adjacent positions differ by small rotations on each clock, creating smooth encoding changes.
Multi-resolution: Fast clocks capture local variations; slow clocks capture global position in the sequence.
| Dimension Pair | Wavelength (positions, for $d_{\text{model}} = 256$) | Sensitivity Scale | Clock Speed Analogy |
|---|---|---|---|
| 0-1 | ~6.3 | Token-level | Second hand |
| 64-65 | ~63 | Phrase-level | Minute hand |
| 128-129 | ~628 | Sentence-level | Hour hand |
| 192-193 | ~6,283 | Paragraph-level | Day dial |
| 254-255 | ~62,832 | Document-level | Year dial |
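To make the clock picture concrete, here is a brief sketch ($d_{\text{model}}$, the position, and the sampled dimension pairs are arbitrary illustrative choices) that prints how many revolutions each clock has completed at a given position:

```python
import numpy as np

# Sketch of the "clock" picture: phase of a few (sin, cos) clocks at one position.
d_model = 256
pos = 1000
for pair in [0, 32, 64, 96, 127]:                 # pair index i -> dimensions (2i, 2i+1)
    omega = 1.0 / 10000 ** (2 * pair / d_model)   # this clock's angular speed
    turns = pos * omega / (2 * np.pi)             # full revolutions completed so far
    phase = (pos * omega) % (2 * np.pi)           # where the hand currently points, in radians
    print(f"pair {pair:3d}: {turns:10.2f} revolutions, current phase {phase:.3f} rad")
```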
The Torus Manifold
Mathematically, the sinusoidal encoding maps positions to points on a high-dimensional torus $\mathcal{T}^{d/2}$. Each (sin, cos) pair corresponds to a circle, and their Cartesian product forms the torus.
Properties of this embedding:
- Injectivity: distinct positions map to distinct points, for sequence lengths far below the longest wavelength.
- Smoothness: moving one position forward rotates each circle by a small, fixed angle, so nearby positions map to nearby points.
- Boundedness: every coordinate lies in $[-1, 1]$, so the embedding stays on a compact manifold.
- Shift equivariance: advancing by $k$ positions acts as the same rotation of the torus regardless of the starting position.
Position Similarity Patterns
When visualizing pairwise similarities between position encodings:
$$\text{sim}(p_1, p_2) = \frac{PE(p_1) \cdot PE(p_2)}{\|PE(p_1)\| \, \|PE(p_2)\|}$$
We observe:
- each position is most similar to itself (a bright diagonal),
- similarity is generally high for nearby positions and falls off with distance,
- the decay is overlaid with oscillations contributed by the different frequency components.
A common misconception is that nearby positions always have higher similarity. Due to the multi-frequency nature, similarity does not decay strictly monotonically: in some dimension pairs, position 10 is closer to position 20 than to position 15. The model must learn to interpret these similarity patterns appropriately.
Let's analyze the mathematical properties of sinusoidal encoding more rigorously.
Property 1: Boundedness
For all positions $pos$ and dimensions $i$: $$-1 \leq PE(pos, i) \leq 1$$
This follows directly from the range of $\sin$ and $\cos$. Bounded encodings prevent positional information from overwhelming token embeddings.
Property 2: Orthogonality of Transformations
The relative position transformation $M_k$ is an orthogonal matrix: $$M_k^T M_k = M_k M_k^T = I$$ $$\det(M_k) = 1$$
This means:
- applying $M_k$ preserves the norm of the encoding, so shifting by $k$ positions loses no positional information,
- the transformation is invertible ($M_k^{-1} = M_k^T$), so relative shifts can be undone,
- geometrically, $M_k$ is a pure rotation with no reflection or scaling.
Property 3: Group Structure
The transformations form a group: $$M_{k_1} M_{k_2} = M_{k_1 + k_2}$$ $$M_0 = I$$ $$M_{-k} = M_k^{-1} = M_k^T$$
This means relative position is additive: going forward $k_1$ positions then $k_2$ positions equals going $k_1 + k_2$ positions.
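A small sketch (the helper name `rotation_block_matrix` and the choice $d_{\text{model}} = 8$ are illustrative assumptions) that checks these group identities numerically:

```python
import torch


def rotation_block_matrix(k: float, d_model: int = 8) -> torch.Tensor:
    """Build the block-diagonal rotation matrix M_k for offset k (sketch)."""
    dim_indices = torch.arange(0, d_model, 2).float()
    omega = 1.0 / (10000.0 ** (dim_indices / d_model))
    M = torch.zeros(d_model, d_model)
    for i, w in enumerate(omega):
        a = k * w                           # rotation angle for this dimension pair
        c, s = torch.cos(a), torch.sin(a)
        M[2*i, 2*i], M[2*i, 2*i + 1] = c, s
        M[2*i + 1, 2*i], M[2*i + 1, 2*i + 1] = -s, c
    return M


k1, k2 = 3, 7
M1, M2 = rotation_block_matrix(k1), rotation_block_matrix(k2)
M12 = rotation_block_matrix(k1 + k2)

print(torch.allclose(M1 @ M2, M12, atol=1e-6))                        # M_{k1} M_{k2} = M_{k1+k2}
print(torch.allclose(rotation_block_matrix(0), torch.eye(8)))         # M_0 = I
print(torch.allclose(rotation_block_matrix(-k1), M1.T, atol=1e-6))    # M_{-k} = M_k^T
```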
```python
import torch
import numpy as np


def analyze_sinusoidal_properties(d_model: int = 128, max_pos: int = 512):
    """
    Rigorous analysis of sinusoidal encoding properties.
    """
    # Create encoding
    positions = torch.arange(max_pos).float()
    dim_indices = torch.arange(0, d_model, 2).float()
    omega = 1.0 / (10000.0 ** (dim_indices / d_model))

    PE = torch.zeros(max_pos, d_model)
    PE[:, 0::2] = torch.sin(positions.unsqueeze(1) * omega)
    PE[:, 1::2] = torch.cos(positions.unsqueeze(1) * omega)

    # Property 1: Boundedness
    print("=== Property 1: Boundedness ===")
    print(f"Min value: {PE.min().item():.6f} (should be >= -1)")
    print(f"Max value: {PE.max().item():.6f} (should be <= 1)")

    # Property 2: Norm distribution
    print("\n=== Property 2: Norm Distribution ===")
    norms = torch.norm(PE, dim=1)
    print(f"Norm range: [{norms.min():.4f}, {norms.max():.4f}]")
    print(f"Expected norm: sqrt(d_model/2) = {np.sqrt(d_model/2):.4f}")
    # Each sin^2 + cos^2 = 1, so summing over d_model/2 pairs gives d_model/2

    # Property 3: Dot product analysis
    print("\n=== Property 3: Dot Product Structure ===")
    # PE(p1) · PE(p2) = sum_i cos(omega_i * (p1 - p2))
    p1, p2 = 10, 15
    direct_dot = (PE[p1] @ PE[p2]).item()
    theoretical_dot = torch.sum(torch.cos(omega * (p1 - p2))).item()
    print(f"Direct dot product PE({p1}) · PE({p2}): {direct_dot:.6f}")
    print(f"Theoretical sum of cos(ω(p1-p2)): {theoretical_dot:.6f}")
    print(f"Match: {abs(direct_dot - theoretical_dot) < 1e-4}")

    # Property 4: Distance structure
    print("\n=== Property 4: Distance Structure ===")
    # L2 distance as a function of position difference
    distances = []
    for delta in range(0, 50):
        if delta < max_pos:
            d = torch.norm(PE[0] - PE[delta]).item()
            distances.append((delta, d))
    print("Position difference -> L2 distance:")
    for delta, dist in distances[:10]:
        print(f"  Δ = {delta:3d}: distance = {dist:.4f}")

    # Property 5: Uniqueness (check all pairwise distances)
    print("\n=== Property 5: Uniqueness ===")
    num_collisions = 0
    threshold = 0.01  # Very small distance = potential collision
    for i in range(min(100, max_pos)):
        for j in range(i + 1, min(100, max_pos)):
            if torch.norm(PE[i] - PE[j]) < threshold:
                num_collisions += 1
                print(f"  Near-collision: positions {i} and {j}")
    print(f"Near-collisions (threshold {threshold}): {num_collisions}")

    return PE


PE = analyze_sinusoidal_properties()
```

All position encodings have the same L2 norm of √(d_model/2) (up to floating-point error). This is because each (sin, cos) pair contributes sin²(x) + cos²(x) = 1 to the squared norm. With d_model/2 such pairs, the total squared norm is d_model/2.
The positional encoding is added to the token embeddings before they enter the transformer. This seemingly simple operation has complex implications for what the attention mechanism computes.
The Combined Representation
After adding positional encoding: $$\tilde{x}_i = x_i + PE(i)$$
The attention scores become: $$\alpha_{ij} \propto \exp\left(\frac{(\tilde{x}_i W_Q)(\tilde{x}_j W_K)^T}{\sqrt{d_k}}\right)$$
Expanding: $$(\tilde{x}_i W_Q)(\tilde{x}_j W_K)^T = (x_i + PE(i))W_Q W_K^T (x_j + PE(j))^T$$
$$= \underbrace{x_i W_Q W_K^T x_j^T}_{\text{content-content}} + \underbrace{x_i W_Q W_K^T PE(j)^T}_{\text{content-position}} + \underbrace{PE(i) W_Q W_K^T x_j^T}_{\text{position-content}} + \underbrace{PE(i) W_Q W_K^T PE(j)^T}_{\text{position-position}}$$
Four Attention Components
The attention score decomposes into four terms:
Content-Content: How much the content of token $i$ wants to attend to the content of token $j$ (independent of position)
Content-Position: How much the content of token $i$ wants to attend to the position of token $j$ (e.g., "verbs often look at the end of the sentence")
Position-Content: How much position $i$ wants to attend to the content of token $j$ (e.g., "the first position looks for proper nouns")
Position-Position: How much position $i$ wants to attend to position $j$ (e.g., "position 5 attends more to position 3 than position 1")
| Component | Formula | Linguistic Interpretation |
|---|---|---|
| Content-Content | $x_i W_Q W_K^T x_j^T$ | Semantic similarity between tokens |
| Content-Position | $x_i W_Q W_K^T PE(j)^T$ | Token type preferences for certain positions |
| Position-Content | $PE(i) W_Q W_K^T x_j^T$ | Positional preferences for token types |
| Position-Position | $PE(i) W_Q W_K^T PE(j)^T$ | Structural attention patterns (e.g., local bias) |
All four components are computed together—there's no architectural separation. The model must learn to disentangle positional and content information through the projection matrices W_Q and W_K. This entanglement is one motivation for alternative approaches that explicitly separate position from content (e.g., relative position biases added directly to attention scores).
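The decomposition itself is easy to check numerically. Below is a minimal sketch (the vectors, dimension, and weight matrices are arbitrary illustrative values, not taken from any particular model) showing that the full bilinear score equals the sum of the four terms above:

```python
import torch

# Sketch: verify the four-term decomposition of the pre-softmax attention logit.
torch.manual_seed(0)
d_model = 16

W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)

x_i, x_j = torch.randn(d_model), torch.randn(d_model)     # token content vectors
pe_i, pe_j = torch.randn(d_model), torch.randn(d_model)   # positional encodings (any fixed vectors)

A = W_Q @ W_K.T  # the bilinear form shared by all four terms

full = (x_i + pe_i) @ A @ (x_j + pe_j)

content_content   = x_i  @ A @ x_j
content_position  = x_i  @ A @ pe_j
position_content  = pe_i @ A @ x_j
position_position = pe_i @ A @ pe_j

print(torch.allclose(
    full,
    content_content + content_position + position_content + position_position,
    atol=1e-5,
))  # True
```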
Learned vs. Fixed Position Patterns
Because the positional encoding is fixed, the position-position attention patterns are limited to what can be expressed as inner products in the sinusoidal encoding space, transformed by the learned $W_Q$ and $W_K$.
This creates interesting constraints: purely positional attention patterns must be expressible through the fixed sinusoidal basis (filtered through $W_Q$ and $W_K$), so the model cannot assign an arbitrary, independent bias to each position pair the way explicit relative-bias methods can.
Research has shown that transformers do learn meaningful positional attention patterns—for example, attending to adjacent positions, attending to sentence boundaries, and attending to syntactically related positions. The sinusoidal encoding provides sufficient structure for these patterns to emerge.
Despite its elegance, sinusoidal encoding has several limitations that motivated the development of alternative approaches.
Limitation 1: Length Extrapolation
The sinusoidal encoding is parameter-free and can generate encodings for any position. However, this doesn't mean models generalize well to longer sequences: during training the model only sees positions up to the training length, so the low-frequency dimensions are observed over just a small fraction of their cycle, and attention patterns tuned to that range often degrade on longer inputs.
Limitation 2: Absolute Position Dependence
The encoding assigns unique vectors to absolute positions 1, 2, 3, ... This creates challenges: the same phrase receives different encodings depending on where it appears in the sequence, so translation invariance must be learned from data rather than being built into the architecture.
Limitation 3: Information Interference
Adding position to content embeddings means the attention mechanism sees both together. This can cause interference between the positional signal and the semantic content sharing the same embedding dimensions: the projections $W_Q$ and $W_K$ must learn to disentangle the two, spending capacity that could otherwise model content alone.
Limitation 4: Fixed Frequency Spectrum
The frequencies are hardcoded (based on the constant 10000). This may not be optimal for sequences much shorter or much longer than the wavelength range anticipates, or for domains whose positional structure differs markedly from natural-language text.
| Approach | Year | Key Innovation | Addresses Limitation |
|---|---|---|---|
| Sinusoidal (Original) | 2017 | Fixed sin/cos functions | — |
| Learned Embeddings (BERT) | 2018 | Trainable position vectors | Flexibility |
| Relative Position (TransformerXL) | 2019 | Relative position in attention | Translation invariance |
| Rotary (RoPE) | 2021 | Rotation-based relative position | Extrapolation + efficiency |
| ALiBi | 2021 | Linear attention bias | Strong extrapolation |
Despite these limitations, sinusoidal encoding established the foundational design principles: boundedness, smoothness, relative position accessibility. Modern approaches like RoPE directly build on the rotation intuition from sinusoidal encoding, making it more explicit and effective.
Sinusoidal positional encoding represents an elegant first solution to the transformer's positional blindness. Let's consolidate the key insights: each position receives a unique, bounded, multi-scale signature built from sines and cosines in geometric frequency progression; relative offsets correspond to rotations of that signature, which makes relative position linearly accessible; the encoding requires no parameters and extends to arbitrary positions; and because it is added directly to token embeddings, content and position become entangled, which, together with weak length extrapolation, motivated the alternatives covered in the following pages.
What's Next: Learned Positional Embeddings
In the next page, we explore learned positional embeddings—the approach used by BERT, GPT-2, and many other models. Rather than using fixed sinusoidal functions, these models learn position vectors from data. We'll analyze how these position vectors are trained, what structure they end up encoding in practice, and how they trade flexibility against generalization to unseen sequence lengths.
Learned embeddings offer flexibility at the cost of generalization, creating a fundamental tradeoff in positional encoding design.
You now have deep understanding of sinusoidal positional encoding—its mathematical formulation, geometric interpretation, theoretical properties, and practical limitations. This foundational knowledge is essential for understanding why modern architectures have evolved toward alternative approaches while building on these core insights.