In the landmark 2017 paper "Attention Is All You Need," Vaswani et al. introduced not only the transformer architecture but also an elegant solution to its positional blindness: sinusoidal positional encoding. This approach encodes position using sine and cosine functions of varying frequencies, creating a unique signature for each position that the model can leverage for sequence understanding.
The sinusoidal encoding is remarkable for several reasons: it gives every position a unique, bounded signature; relative offsets correspond to simple rotations of that signature; its frequencies span scales from adjacent tokens to entire documents; and it requires no learned parameters, so encodings can be generated for sequences of any length.
This page provides a comprehensive treatment of sinusoidal encoding: its precise formulation, mathematical properties, implementation details, and the insights that made it successful.
The sinusoidal encoding was the first positional encoding used in transformers and influenced all subsequent approaches. Even as modern architectures adopt alternatives like RoPE or ALiBi, understanding sinusoidal encoding provides essential intuition for why those alternatives were developed and how they improve upon the original.
The Core Formula
The sinusoidal positional encoding is defined as follows. For a position $pos$ in the sequence and dimension $i$ of the encoding:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Where:
- $pos$ is the token's position in the sequence ($0, 1, 2, \ldots$),
- $i$ indexes the dimension pair, so $2i$ and $2i+1$ range over all $d_{\text{model}}$ embedding dimensions,
- $d_{\text{model}}$ is the model's embedding dimension.
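As a quick worked example, take $d_{\text{model}} = 4$ and $pos = 1$. For the first dimension pair ($i = 0$) the divisor is $10000^{0} = 1$, and for the second ($i = 1$) it is $10000^{2/4} = 100$, giving
$$PE(1) = \big(\sin 1,\ \cos 1,\ \sin 0.01,\ \cos 0.01\big) \approx (0.841,\ 0.540,\ 0.010,\ 1.000).$$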
Unpacking the Formula
The encoding creates a unique $d_{\text{model}}$-dimensional vector for each position. Let's understand each component:
1. The Frequency Term: $10000^{2i/d_{\text{model}}}$
This term determines how rapidly the sine/cosine oscillates along the position axis. Its reciprocal is the angular frequency for dimension pair $i$:
$$\omega_i = \frac{1}{10000^{2i/d_{\text{model}}}}$$
The corresponding wavelength is:
$$\lambda_i = \frac{2\pi}{\omega_i} = 2\pi \cdot 10000^{2i/d_{\text{model}}}$$
2. The Dimension Indexing
Even dimensions ($2i$) use sine; odd dimensions ($2i+1$) use cosine. This pairing is crucial for the relative position property we'll explore shortly.
Why 10000? This value was chosen to span a wide range of wavelengths. The lowest frequency (i = d_model/2 - 1) has wavelength ≈ 2π × 10000 ≈ 62,832 positions, far exceeding typical sequence lengths. The highest frequency (i = 0) has wavelength 2π ≈ 6.28 positions. This geometric progression covers the range from very local to very global positional patterns.
Frequency Progression
The frequencies form a geometric progression from $1$ to $1/10000$:
| Dimension Pair $i$ | Wavelength | Pattern Detection Scale |
|---|---|---|
| $i = 0$ | $2\pi \approx 6.28$ | Very local (adjacent tokens) |
| $i = d_{\text{model}}/4$ | $2\pi \cdot 100 \approx 628$ | Sentence-level |
| $i = d_{\text{model}}/2 - 1$ | $2\pi \cdot 10000 \approx 62,832$ | Document-level |
This multi-scale encoding allows the model to detect positional patterns at all relevant scales simultaneously—from the adjacency of consecutive tokens to the broad structure of long documents.
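The progression above can be reproduced in a few lines. The sketch below assumes $d_{\text{model}} = 512$ and prints the frequency and wavelength for the three dimension pairs in the table; the slowest wavelength comes out just under the $2\pi \cdot 10000 \approx 62{,}832$ approximation because $2i/d_{\text{model}}$ is slightly less than $1$ at $i = d_{\text{model}}/2 - 1$.

```python
import numpy as np

# Sketch: tabulate angular frequency and wavelength for a few dimension pairs.
# d_model = 512 is an illustrative choice, not a requirement of the encoding.
d_model = 512
for i in [0, d_model // 4, d_model // 2 - 1]:        # dimension-pair index i
    omega = 1.0 / 10000 ** (2 * i / d_model)         # angular frequency omega_i
    wavelength = 2 * np.pi / omega                   # lambda_i = 2*pi * 10000^(2i/d_model)
    print(f"i = {i:3d}: omega = {omega:.2e}, wavelength ≈ {wavelength:,.1f} positions")
```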
Let's implement sinusoidal positional encoding from scratch, examining each step in detail:
```python
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt


class SinusoidalPositionalEncoding(nn.Module):
    """
    Sinusoidal Positional Encoding as described in
    "Attention Is All You Need" (Vaswani et al., 2017).

    The encoding uses sine and cosine functions of different frequencies
    to create unique positional signatures.
    """

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        """
        Args:
            d_model: Model embedding dimension (must be even)
            max_len: Maximum sequence length to pre-compute
            dropout: Dropout probability applied after adding positional encoding
        """
        super().__init__()
        assert d_model % 2 == 0, "d_model must be even for sin/cos pairing"

        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)

        # Pre-compute positional encodings for all positions up to max_len
        pe = self._create_positional_encoding(max_len, d_model)

        # Register as buffer (not a parameter, but saved with the model)
        self.register_buffer('pe', pe)

    def _create_positional_encoding(self, max_len: int, d_model: int) -> torch.Tensor:
        """
        Create the sinusoidal positional encoding matrix.

        Returns:
            Tensor of shape [1, max_len, d_model]
        """
        # Position indices: [0, 1, 2, ..., max_len-1]
        position = torch.arange(max_len).unsqueeze(1)  # Shape: [max_len, 1]

        # Dimension indices for even positions: [0, 2, 4, ..., d_model-2]
        # We compute div_term = 10000^(-2i/d_model)
        # using exp(-log(10000) * 2i/d_model) for numerical stability
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-np.log(10000.0) / d_model)
        )  # Shape: [d_model/2]

        # Initialize encoding matrix
        pe = torch.zeros(max_len, d_model)

        # Apply sin to even dimensions, cos to odd dimensions
        pe[:, 0::2] = torch.sin(position * div_term)  # Even: sin(pos / 10000^(2i/d))
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd: cos(pos / 10000^(2i/d))

        # Add batch dimension
        pe = pe.unsqueeze(0)  # Shape: [1, max_len, d_model]
        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Add positional encoding to input embeddings.

        Args:
            x: Input tensor of shape [batch_size, seq_len, d_model]

        Returns:
            Tensor of shape [batch_size, seq_len, d_model] with positions added
        """
        # Slice the pre-computed encodings to match this batch's sequence length
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len, :]
        return self.dropout(x)

    def get_encoding(self, positions: torch.Tensor) -> torch.Tensor:
        """
        Get positional encoding for specific positions (useful for analysis).

        Args:
            positions: Tensor of position indices

        Returns:
            Positional encodings for those positions
        """
        return self.pe[0, positions, :]


# Demonstration
def visualize_positional_encoding():
    """Visualize the sinusoidal encoding patterns."""
    d_model = 128
    max_len = 100

    encoder = SinusoidalPositionalEncoding(d_model, max_len, dropout=0.0)

    # Get encodings for all positions
    pe = encoder.pe[0, :max_len, :].numpy()

    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # 1. Heatmap of positional encodings
    im = axes[0, 0].imshow(pe, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
    axes[0, 0].set_xlabel('Dimension')
    axes[0, 0].set_ylabel('Position')
    axes[0, 0].set_title('Positional Encoding Heatmap')
    plt.colorbar(im, ax=axes[0, 0])

    # 2. First few dimensions across positions
    for dim in [0, 1, 10, 11, 50, 51]:
        axes[0, 1].plot(pe[:, dim], label=f'Dim {dim}')
    axes[0, 1].set_xlabel('Position')
    axes[0, 1].set_ylabel('Encoding Value')
    axes[0, 1].set_title('Encoding Values by Dimension')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Position similarity matrix
    pe_normalized = pe / np.linalg.norm(pe, axis=1, keepdims=True)
    similarity = pe_normalized @ pe_normalized.T
    im = axes[1, 0].imshow(similarity, aspect='equal', cmap='viridis')
    axes[1, 0].set_xlabel('Position')
    axes[1, 0].set_ylabel('Position')
    axes[1, 0].set_title('Cosine Similarity Between Positions')
    plt.colorbar(im, ax=axes[1, 0])

    # 4. Distance to position 0
    distances = np.linalg.norm(pe - pe[0:1, :], axis=1)
    axes[1, 1].plot(distances)
    axes[1, 1].set_xlabel('Position')
    axes[1, 1].set_ylabel('L2 Distance from Position 0')
    axes[1, 1].set_title('Encoding Distance from First Position')
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


# Run demonstration
# (instantiate an encoder here so the summary prints below have one to inspect)
encoder = SinusoidalPositionalEncoding(d_model=128, max_len=100, dropout=0.0)
print("Sinusoidal Positional Encoding Properties:")
print(f"- All values in [-1, 1]: {encoder.pe.min():.4f} to {encoder.pe.max():.4f}")
print(f"- Shape: {encoder.pe.shape}")
```

The implementation computes `div_term` as exp(-log(10000) · 2i/d_model) rather than evaluating 10000^(2i/d_model) and then dividing. The two are mathematically equivalent, but the log-space form is more stable and avoids large intermediate powers; this is a common pattern in neural network implementations.
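Assuming the `SinusoidalPositionalEncoding` class above is in scope, usage is a single call on a batch of embeddings:

```python
import torch

# Usage sketch: batch size, sequence length, and d_model are arbitrary choices.
pos_enc = SinusoidalPositionalEncoding(d_model=128, max_len=512, dropout=0.0)
tokens = torch.randn(4, 50, 128)   # [batch, seq_len, d_model] dummy token embeddings
out = pos_enc(tokens)              # positional encoding added, dropout applied
print(out.shape)                   # torch.Size([4, 50, 128])
```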
The most elegant aspect of sinusoidal encoding is its ability to represent relative positions through linear transformations. This property was a key motivation for choosing sine and cosine functions.
The Key Insight
For any fixed offset $k$, there exists a linear transformation $M_k$ such that:
$$PE(pos + k) = M_k \cdot PE(pos)$$
This means the model can attend to relative positions using simple matrix operations.
Derivation
Using the angle addition formulas $\sin(a + b) = \sin a \cos b + \cos a \sin b$ and $\cos(a + b) = \cos a \cos b - \sin a \sin b$:
For a position pair $(pos, pos+k)$ at dimension index $i$:
$$\begin{pmatrix} \sin(\omega_i(pos+k)) \\ \cos(\omega_i(pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \cdot pos) \\ \cos(\omega_i \cdot pos) \end{pmatrix}$$
Where $\omega_i = 1/10000^{2i/d_{\text{model}}}$ is the angular frequency for dimension $i$.
This is a rotation matrix! The positional encoding at position $pos + k$ is simply the encoding at position $pos$ rotated by an angle proportional to $k$.
Each (sin, cos) pair can be viewed as a point on a unit circle. Advancing by k positions rotates this point by angle ω_i × k. Different frequency components rotate at different rates, creating a unique trajectory for each position. This geometric interpretation is foundational to Rotary Position Embedding (RoPE), which we'll cover in a later page.
```python
import torch
import numpy as np


def compute_relative_transformation(d_model: int = 64, k: int = 5):
    """
    Demonstrate that PE(pos + k) = M_k @ PE(pos) for sinusoidal encoding.
    """
    # Compute angular frequencies for each dimension pair
    dim_indices = torch.arange(0, d_model, 2).float()
    omega = 1.0 / (10000.0 ** (dim_indices / d_model))  # Shape: [d_model/2]

    # For a test position
    pos = 10

    # Direct computation of PE(pos) and PE(pos + k)
    pe_pos = torch.zeros(d_model)
    pe_pos[0::2] = torch.sin(pos * omega)
    pe_pos[1::2] = torch.cos(pos * omega)

    pe_pos_k = torch.zeros(d_model)
    pe_pos_k[0::2] = torch.sin((pos + k) * omega)
    pe_pos_k[1::2] = torch.cos((pos + k) * omega)

    # Construct the transformation matrix M_k:
    # for each (sin, cos) pair, apply a 2x2 rotation
    M_k = torch.zeros(d_model, d_model)
    for i in range(d_model // 2):
        angle = k * omega[i]
        cos_angle = torch.cos(angle)
        sin_angle = torch.sin(angle)

        # Rotation matrix for this dimension pair
        M_k[2*i, 2*i] = cos_angle
        M_k[2*i, 2*i + 1] = sin_angle
        M_k[2*i + 1, 2*i] = -sin_angle
        M_k[2*i + 1, 2*i + 1] = cos_angle

    # Compute PE(pos + k) via the transformation
    pe_pos_k_computed = M_k @ pe_pos

    # Verify they match (tolerance accounts for float32 rounding)
    error = torch.norm(pe_pos_k - pe_pos_k_computed).item()
    print(f"Position: {pos}, Offset: {k}")
    print(f"Direct PE(pos+k) vs M_k @ PE(pos) error: {error:.2e}")
    assert error < 1e-5, "Relative position property violated!"

    return M_k, pe_pos, pe_pos_k


# Verify for multiple offsets
for offset in [1, 5, 10, 50, 100]:
    compute_relative_transformation(d_model=64, k=offset)

print("\n✓ Relative position property verified for all offsets!")

# Show that M_k is indeed a rotation (orthogonal matrix with det = 1)
M_5, _, _ = compute_relative_transformation(d_model=64, k=5)
print(f"\nM_5 is orthogonal: |M_5 @ M_5.T - I| = {torch.norm(M_5 @ M_5.T - torch.eye(64)):.2e}")
print(f"det(M_5) = {torch.linalg.det(M_5):.4f} (should be ~1.0)")
```

Why This Matters
The relative position property has profound implications:
Efficient Relative Attention: The attention mechanism can compute relative position relationships through standard operations on the position-encoded representations.
Translation Invariance Potential: The same relative offset produces the same transformation regardless of absolute position. "Word at position 5 attending to word at position 3" has the same positional offset as "position 105 attending to 103."
Theoretical Foundation: This property directly inspired Rotary Position Embeddings (RoPE), which make the rotation explicit in the attention computation.
Limitation: Dot Product Doesn't Directly Encode Distance
However, there's a subtlety. The dot product $PE(pos_1) \cdot PE(pos_2)$ does not simply encode $|pos_1 - pos_2|$:
$$PE(pos_1) \cdot PE(pos_2) = \sum_{i=0}^{d/2-1} \left[\sin(\omega_i p_1)\sin(\omega_i p_2) + \cos(\omega_i p_1)\cos(\omega_i p_2)\right]$$
Using the identity $\cos(a - b) = \cos a \cos b + \sin a \sin b$:
$$= \sum_{i=0}^{d/2-1} \cos(\omega_i (p_1 - p_2))$$
This is a sum of cosines at different frequencies—not a simple function of the distance. The model must learn to extract relative position information from this representation.
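A short sketch (with an assumed $d_{\text{model}}$ and arbitrary positions) that evaluates both sides of this identity and confirms the dot product depends only on the offset, not on the absolute positions:

```python
import numpy as np

# Sketch: compare PE(p1)·PE(p2) with the sum-of-cosines form derived above.
d_model = 64
pair_idx = np.arange(0, d_model, 2)
omega = 1.0 / 10000 ** (pair_idx / d_model)   # angular frequency per (sin, cos) pair


def pe(pos: int) -> np.ndarray:
    """Sinusoidal encoding of a single position."""
    enc = np.empty(d_model)
    enc[0::2] = np.sin(pos * omega)
    enc[1::2] = np.cos(pos * omega)
    return enc


p1, p2 = 37, 52
direct = pe(p1) @ pe(p2)
cosine_sum = np.sum(np.cos(omega * (p1 - p2)))
print(f"PE({p1})·PE({p2}) = {direct:.6f}, sum of cosines = {cosine_sum:.6f}")

# The dot product depends only on the offset: shifting both positions leaves it unchanged.
print(f"PE({p1 + 100})·PE({p2 + 100}) = {pe(p1 + 100) @ pe(p2 + 100):.6f}")
```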
The sinusoidal encoding has a beautiful geometric interpretation that provides intuition for its behavior.
Multi-Scale Clocks
Imagine $d_{\text{model}}/2$ clocks, each running at a different speed: the fastest completes a full revolution every $2\pi \approx 6.3$ positions, while the slowest takes roughly $2\pi \cdot 10000 \approx 62{,}832$ positions.
Each clock's (sin, cos) outputs form a 2D point on a unit circle. The full positional encoding is the concatenation of all these 2D points—a point on a $d_{\text{model}}/2$-dimensional torus.
Why This Representation Works
Uniqueness: Different positions land at different points on each clock. With enough clocks at incommensurate frequencies, no two positions have the same encoding (up to very long sequences).
Smoothness: Adjacent positions differ by small rotations on each clock, creating smooth encoding changes.
Multi-resolution: Fast clocks capture local variations; slow clocks capture global position in the sequence.
| Dimension Pair | Wavelength (positions, for $d_{\text{model}} = 256$) | Sensitivity Scale | Clock Speed Analogy |
|---|---|---|---|
| 0-1 | ~6.3 | Token-level | Second hand |
| 64-65 | ~63 | Phrase-level | Minute hand |
| 128-129 | ~628 | Sentence-level | Hour hand |
| 192-193 | ~6,283 | Paragraph-level | Day dial |
| 254-255 | ~62,832 | Document-level | Year dial |
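To make the clock picture concrete, here is a brief sketch ($d_{\text{model}}$, the position, and the sampled dimension pairs are arbitrary illustrative choices) that prints how many revolutions each clock has completed at a given position:

```python
import numpy as np

# Sketch of the "clock" picture: phase of a few (sin, cos) clocks at one position.
d_model = 256
pos = 1000
for pair in [0, 32, 64, 96, 127]:                 # pair index i -> dimensions (2i, 2i+1)
    omega = 1.0 / 10000 ** (2 * pair / d_model)   # this clock's angular speed
    turns = pos * omega / (2 * np.pi)             # full revolutions completed so far
    phase = (pos * omega) % (2 * np.pi)           # where the hand currently points, in radians
    print(f"pair {pair:3d}: {turns:10.2f} revolutions, current phase {phase:.3f} rad")
```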
The Torus Manifold
Mathematically, the sinusoidal encoding maps positions to points on a high-dimensional torus $\mathcal{T}^{d/2}$. Each (sin, cos) pair corresponds to a circle, and their Cartesian product forms the torus.
Properties of this embedding:
- Injectivity: distinct positions map to distinct points, for sequence lengths far below the longest wavelength.
- Smoothness: moving one position forward rotates each circle by a small, fixed angle, so nearby positions map to nearby points.
- Boundedness: every coordinate lies in $[-1, 1]$, so the embedding stays on a compact manifold.
- Shift equivariance: advancing by $k$ positions acts as the same rotation of the torus regardless of the starting position.
Position Similarity Patterns
When visualizing pairwise similarities between position encodings:
$$\text{sim}(p_1, p_2) = \frac{PE(p_1) \cdot PE(p_2)}{\|PE(p_1)\| \, \|PE(p_2)\|}$$
We observe:
- each position is most similar to itself (a bright diagonal),
- similarity is generally high for nearby positions and falls off with distance,
- the decay is overlaid with oscillations contributed by the different frequency components.
A common misconception is that nearby positions always have higher similarity. Due to the multi-frequency nature, similarity does not decay strictly monotonically: in some dimension pairs, position 10 is closer to position 20 than to position 15. The model must learn to interpret these similarity patterns appropriately.
Let's analyze the mathematical properties of sinusoidal encoding more rigorously.
Property 1: Boundedness
For all positions $pos$ and dimensions $i$: $$-1 \leq PE(pos, i) \leq 1$$
This follows directly from the range of $\sin$ and $\cos$. Bounded encodings prevent positional information from overwhelming token embeddings.
Property 2: Orthogonality of Transformations
The relative position transformation $M_k$ is an orthogonal matrix: $$M_k^T M_k = M_k M_k^T = I$$ $$\det(M_k) = 1$$
This means:
- applying $M_k$ preserves the norm of the encoding, so shifting by $k$ positions loses no positional information,
- the transformation is invertible ($M_k^{-1} = M_k^T$), so relative shifts can be undone,
- geometrically, $M_k$ is a pure rotation with no reflection or scaling.
Property 3: Group Structure
The transformations form a group: $$M_{k_1} M_{k_2} = M_{k_1 + k_2}$$ $$M_0 = I$$ $$M_{-k} = M_k^{-1} = M_k^T$$
This means relative position is additive: going forward $k_1$ positions then $k_2$ positions equals going $k_1 + k_2$ positions.
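A small sketch (the helper name `rotation_block_matrix` and the choice $d_{\text{model}} = 8$ are illustrative assumptions) that checks these group identities numerically:

```python
import torch


def rotation_block_matrix(k: float, d_model: int = 8) -> torch.Tensor:
    """Build the block-diagonal rotation matrix M_k for offset k (sketch)."""
    dim_indices = torch.arange(0, d_model, 2).float()
    omega = 1.0 / (10000.0 ** (dim_indices / d_model))
    M = torch.zeros(d_model, d_model)
    for i, w in enumerate(omega):
        a = k * w                           # rotation angle for this dimension pair
        c, s = torch.cos(a), torch.sin(a)
        M[2*i, 2*i], M[2*i, 2*i + 1] = c, s
        M[2*i + 1, 2*i], M[2*i + 1, 2*i + 1] = -s, c
    return M


k1, k2 = 3, 7
M1, M2 = rotation_block_matrix(k1), rotation_block_matrix(k2)
M12 = rotation_block_matrix(k1 + k2)

print(torch.allclose(M1 @ M2, M12, atol=1e-6))                        # M_{k1} M_{k2} = M_{k1+k2}
print(torch.allclose(rotation_block_matrix(0), torch.eye(8)))         # M_0 = I
print(torch.allclose(rotation_block_matrix(-k1), M1.T, atol=1e-6))    # M_{-k} = M_k^T
```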
```python
import torch
import numpy as np


def analyze_sinusoidal_properties(d_model: int = 128, max_pos: int = 512):
    """
    Rigorous analysis of sinusoidal encoding properties.
    """
    # Create encoding
    positions = torch.arange(max_pos).float()
    dim_indices = torch.arange(0, d_model, 2).float()
    omega = 1.0 / (10000.0 ** (dim_indices / d_model))

    PE = torch.zeros(max_pos, d_model)
    PE[:, 0::2] = torch.sin(positions.unsqueeze(1) * omega)
    PE[:, 1::2] = torch.cos(positions.unsqueeze(1) * omega)

    # Property 1: Boundedness
    print("=== Property 1: Boundedness ===")
    print(f"Min value: {PE.min().item():.6f} (should be >= -1)")
    print(f"Max value: {PE.max().item():.6f} (should be <= 1)")

    # Property 2: Norm distribution
    print("\n=== Property 2: Norm Distribution ===")
    norms = torch.norm(PE, dim=1)
    print(f"Norm range: [{norms.min():.4f}, {norms.max():.4f}]")
    print(f"Expected norm: sqrt(d_model/2) = {np.sqrt(d_model/2):.4f}")
    # Each sin^2 + cos^2 = 1, so summing over d_model/2 pairs gives d_model/2

    # Property 3: Dot product analysis
    print("\n=== Property 3: Dot Product Structure ===")
    # PE(p1) · PE(p2) = sum_i cos(omega_i * (p1 - p2))
    p1, p2 = 10, 15
    direct_dot = (PE[p1] @ PE[p2]).item()
    theoretical_dot = torch.sum(torch.cos(omega * (p1 - p2))).item()
    print(f"Direct dot product PE({p1}) · PE({p2}): {direct_dot:.6f}")
    print(f"Theoretical sum of cos(ω(p1-p2)): {theoretical_dot:.6f}")
    print(f"Match: {abs(direct_dot - theoretical_dot) < 1e-4}")

    # Property 4: Distance structure
    print("\n=== Property 4: Distance Structure ===")
    # L2 distance as a function of position difference
    distances = []
    for delta in range(0, 50):
        if delta < max_pos:
            d = torch.norm(PE[0] - PE[delta]).item()
            distances.append((delta, d))
    print("Position difference -> L2 distance:")
    for delta, dist in distances[:10]:
        print(f"  Δ = {delta:3d}: distance = {dist:.4f}")

    # Property 5: Uniqueness (check all pairwise distances)
    print("\n=== Property 5: Uniqueness ===")
    num_collisions = 0
    threshold = 0.01  # Very small distance = potential collision
    for i in range(min(100, max_pos)):
        for j in range(i + 1, min(100, max_pos)):
            if torch.norm(PE[i] - PE[j]) < threshold:
                num_collisions += 1
                print(f"  Near-collision: positions {i} and {j}")
    print(f"Near-collisions (threshold {threshold}): {num_collisions}")

    return PE


PE = analyze_sinusoidal_properties()
```

All position encodings have the same L2 norm of √(d_model/2) (up to floating-point error). This is because each (sin, cos) pair contributes sin²(x) + cos²(x) = 1 to the squared norm. With d_model/2 such pairs, the total squared norm is d_model/2.
The positional encoding is added to the token embeddings before they enter the transformer. This seemingly simple operation has complex implications for what the attention mechanism computes.
The Combined Representation
After adding positional encoding: $$\tilde{x}_i = x_i + PE(i)$$
The attention scores become: $$\alpha_{ij} \propto \exp\left(\frac{(\tilde{x}_i W_Q)(\tilde{x}_j W_K)^T}{\sqrt{d_k}}\right)$$
Expanding: $$(\tilde{x}_i W_Q)(\tilde{x}_j W_K)^T = (x_i + PE(i))W_Q W_K^T (x_j + PE(j))^T$$
$$= \underbrace{x_i W_Q W_K^T x_j^T}_{\text{content-content}} + \underbrace{x_i W_Q W_K^T PE(j)^T}_{\text{content-position}} + \underbrace{PE(i) W_Q W_K^T x_j^T}_{\text{position-content}} + \underbrace{PE(i) W_Q W_K^T PE(j)^T}_{\text{position-position}}$$
Four Attention Components
The attention score decomposes into four terms:
Content-Content: How much the content of token $i$ wants to attend to the content of token $j$ (independent of position)
Content-Position: How much the content of token $i$ wants to attend to the position of token $j$ (e.g., "verbs often look at the end of the sentence")
Position-Content: How much position $i$ wants to attend to the content of token $j$ (e.g., "the first position looks for proper nouns")
Position-Position: How much position $i$ wants to attend to position $j$ (e.g., "position 5 attends more to position 3 than position 1")
| Component | Formula | Linguistic Interpretation |
|---|---|---|
| Content-Content | $x_i W_Q W_K^T x_j^T$ | Semantic similarity between tokens |
| Content-Position | $x_i W_Q W_K^T PE(j)^T$ | Token type preferences for certain positions |
| Position-Content | $PE(i) W_Q W_K^T x_j^T$ | Positional preferences for token types |
| Position-Position | $PE(i) W_Q W_K^T PE(j)^T$ | Structural attention patterns (e.g., local bias) |
All four components are computed together—there's no architectural separation. The model must learn to disentangle positional and content information through the projection matrices W_Q and W_K. This entanglement is one motivation for alternative approaches that explicitly separate position from content (e.g., relative position biases added directly to attention scores).
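The decomposition itself is easy to check numerically. Below is a minimal sketch (the vectors, dimension, and weight matrices are arbitrary illustrative values, not taken from any particular model) showing that the full bilinear score equals the sum of the four terms above:

```python
import torch

# Sketch: verify the four-term decomposition of the pre-softmax attention logit.
torch.manual_seed(0)
d_model = 16

W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)

x_i, x_j = torch.randn(d_model), torch.randn(d_model)     # token content vectors
pe_i, pe_j = torch.randn(d_model), torch.randn(d_model)   # positional encodings (any fixed vectors)

A = W_Q @ W_K.T  # the bilinear form shared by all four terms

full = (x_i + pe_i) @ A @ (x_j + pe_j)

content_content   = x_i  @ A @ x_j
content_position  = x_i  @ A @ pe_j
position_content  = pe_i @ A @ x_j
position_position = pe_i @ A @ pe_j

print(torch.allclose(
    full,
    content_content + content_position + position_content + position_position,
    atol=1e-5,
))  # True
```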
Learned vs. Fixed Position Patterns
Because the positional encoding is fixed, the position-position attention patterns are limited to what can be expressed as inner products in the sinusoidal encoding space, transformed by the learned $W_Q$ and $W_K$.
This creates interesting constraints: purely positional attention patterns must be expressible through the fixed sinusoidal basis (filtered through $W_Q$ and $W_K$), so the model cannot assign an arbitrary, independent bias to each position pair the way explicit relative-bias methods can.
Research has shown that transformers do learn meaningful positional attention patterns—for example, attending to adjacent positions, attending to sentence boundaries, and attending to syntactically related positions. The sinusoidal encoding provides sufficient structure for these patterns to emerge.
Despite its elegance, sinusoidal encoding has several limitations that motivated the development of alternative approaches.
Limitation 1: Length Extrapolation
The sinusoidal encoding is parameter-free and can generate encodings for any position. However, this doesn't mean models generalize well to longer sequences: during training the model only sees positions up to the training length, so the low-frequency dimensions are observed over just a small fraction of their cycle, and attention patterns tuned to that range often degrade on longer inputs.
Limitation 2: Absolute Position Dependence
The encoding assigns unique vectors to absolute positions 1, 2, 3, ... This creates challenges: the same phrase receives different encodings depending on where it appears in the sequence, so translation invariance must be learned from data rather than being built into the architecture.
Limitation 3: Information Interference
Adding position to content embeddings means the attention mechanism sees both together. This can cause interference between the positional signal and the semantic content sharing the same embedding dimensions: the projections $W_Q$ and $W_K$ must learn to disentangle the two, spending capacity that could otherwise model content alone.
Limitation 4: Fixed Frequency Spectrum
The frequencies are hardcoded (based on the constant 10000). This may not be optimal for sequences much shorter or much longer than the wavelength range anticipates, or for domains whose positional structure differs markedly from natural-language text.
| Approach | Year | Key Innovation | Addresses Limitation |
|---|---|---|---|
| Sinusoidal (Original) | 2017 | Fixed sin/cos functions | — |
| Learned Embeddings (BERT) | 2018 | Trainable position vectors | Flexibility |
| Relative Position (TransformerXL) | 2019 | Relative position in attention | Translation invariance |
| Rotary (RoPE) | 2021 | Rotation-based relative position | Extrapolation + efficiency |
| ALiBi | 2021 | Linear attention bias | Strong extrapolation |
Despite these limitations, sinusoidal encoding established the foundational design principles: boundedness, smoothness, relative position accessibility. Modern approaches like RoPE directly build on the rotation intuition from sinusoidal encoding, making it more explicit and effective.
Sinusoidal positional encoding represents an elegant first solution to the transformer's positional blindness. Let's consolidate the key insights: each position receives a unique, bounded, multi-scale signature built from sines and cosines in geometric frequency progression; relative offsets correspond to rotations of that signature, which makes relative position linearly accessible; the encoding requires no parameters and extends to arbitrary positions; and because it is added directly to token embeddings, content and position become entangled, which, together with weak length extrapolation, motivated the alternatives covered in the following pages.
What's Next: Learned Positional Embeddings
In the next page, we explore learned positional embeddings—the approach used by BERT, GPT-2, and many other models. Rather than using fixed sinusoidal functions, these models learn position vectors from data. We'll analyze how these position vectors are trained, what structure they end up encoding in practice, and how they trade flexibility against generalization to unseen sequence lengths.
Learned embeddings offer flexibility at the cost of generalization, creating a fundamental tradeoff in positional encoding design.
You now have deep understanding of sinusoidal positional encoding—its mathematical formulation, geometric interpretation, theoretical properties, and practical limitations. This foundational knowledge is essential for understanding why modern architectures have evolved toward alternative approaches while building on these core insights.