Transformers have revolutionized machine learning, achieving state-of-the-art results across language, vision, and multimodal domains. Yet beneath their remarkable success lies a fundamental architectural blind spot: transformers are inherently position-agnostic.
Unlike recurrent neural networks that process tokens sequentially—one after another in strict temporal order—or convolutional networks that operate on local spatial neighborhoods, the transformer's self-attention mechanism treats all input positions as an unordered set. Every token attends to every other token simultaneously, with no built-in mechanism to distinguish whether a word appears at the beginning, middle, or end of a sentence.
This presents a profound challenge for any task where order matters—which is to say, virtually every task involving sequences.
Without positional information, a transformer cannot distinguish between 'The cat sat on the mat' and 'The mat sat on the cat'. Both sentences contain identical tokens and would produce identical representations. Positional encoding is not optional decoration—it is structurally essential.
To understand why positional encoding is necessary, we must first understand what the self-attention mechanism actually computes and why it lacks positional awareness by design.
Self-Attention: A Set Operation
Recall the core self-attention computation:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where $Q = XW_Q$, $K = XW_K$, and $V = XW_V$ are the query, key, and value projections of the input sequence $X$, and $d_k$ is the dimensionality of the keys.
The critical observation is that this operation is permutation equivariant. If we permute the input sequence $X$ to get $X' = PX$ (where $P$ is a permutation matrix), the output is similarly permuted:
$$\text{Attention}(PX) = P \cdot \text{Attention}(X)$$
This means shuffling the input tokens shuffles the output in exactly the same way—the relative relationships between positions are preserved, but the absolute positions carry no special meaning.
A function $f$ is permutation equivariant if $f(P \cdot x) = P \cdot f(x)$ for any permutation matrix $P$. Self-attention has this property because the attention weights depend only on pairwise similarities, not on which positions those pairs occupy. This is by design—it enables parallel computation—but it eliminates all notion of sequential order.
Mathematical Demonstration
Consider a simple three-token sequence: ["A", "B", "C"]. The attention weight from token $i$ to token $j$ is:
$$\alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{m} \exp(q_i \cdot k_m / \sqrt{d_k})}$$
Note that this formula depends only on the content of the queries and keys—the embedding vectors $q_i$ and $k_j$—not on the indices $i$ and $j$ themselves. If we swap the positions of "B" and "C" to get ["A", "C", "B"], the attention computation proceeds identically; only the final output ordering changes.
This is in stark contrast to an RNN, where the hidden state $h_t$ is computed as:
$$h_t = f(h_{t-1}, x_t)$$
Here, the hidden state at position $t$ explicitly depends on the hidden state at position $t-1$, creating an unavoidable sequential dependency that encodes position implicitly through the order of computation itself.
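To see the contrast concretely, here is a minimal sketch assuming a bare Elman-style cell with randomly initialized weights (not any particular library module): permuting the input does not simply permute the RNN's outputs, because every step consumes the previous hidden state.

```python
import torch

def rnn_forward(X, W_h, W_x, b):
    """Bare Elman-style RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    h = torch.zeros(W_h.shape[0])
    states = []
    for x_t in X:                                  # strict left-to-right processing
        h = torch.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return torch.stack(states)

torch.manual_seed(0)
d = 8
X = torch.randn(5, d)
W_h, W_x, b = torch.randn(d, d), torch.randn(d, d), torch.zeros(d)

out = rnn_forward(X, W_h, W_x, b)
out_swapped = rnn_forward(X[[0, 3, 2, 1, 4]], W_h, W_x, b)   # swap positions 1 and 3

# Unlike self-attention, the outputs are NOT simply a permutation of each other:
print((out[[0, 3, 2, 1, 4]] - out_swapped).abs().max().item())   # clearly nonzero
```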
Before diving into technical solutions, let's examine why position matters from a linguistic perspective. Understanding these phenomena motivates the design of positional encoding schemes that must capture subtle positional dependencies.
Syntactic Structure Depends on Position
Natural language encodes meaning through word order. Consider these examples:
| Sentence A | Sentence B | Key Difference |
|---|---|---|
| The dog bit the man | The man bit the dog | Subject-object reversal changes who does what |
| Only I love you | I only love you | 'Only' modifies different constituents |
| She decided quickly to leave | She decided to leave quickly | Adverb scope changes temporal interpretation |
| Not all students passed | All students did not pass | Negation scope: some passed vs. none passed |
| Time flies like an arrow | Fruit flies like a banana | Parallel surface form, entirely different parse |
Long-Range Dependencies
Position matters not just for adjacent words but across arbitrary distances. Consider:
"The keys to the cabinet that was bought yesterday were lost."
The verb "were" must agree with "keys" (plural), not "cabinet" (singular), despite "cabinet" being much closer. A model must track that "keys" is the syntactic subject across an intervening relative clause. This requires understanding both the absolute position of "keys" and its relationship to "were" across multiple intervening positions.
Relative vs. Absolute Position
Linguistic phenomena often care about relative rather than absolute position:
- Subject–verb agreement depends on the structural relationship between the two words, not on where in a document the sentence appears
- Pronouns typically resolve to recently mentioned entities, a constraint on distance rather than on absolute index
- Modifiers attach to nearby heads whether the construction occurs early or late in the sentence
This insight has driven the development of relative positional encodings, which encode the distance between token pairs rather than their absolute indices.
Effective positional encodings must capture both absolute position (for tasks like named entity recognition where sentence-initial words behave differently) and relative position (for syntactic agreement across variable distances). This dual requirement has driven much of the innovation in positional encoding design.
To appreciate positional encoding's importance, let's examine what a transformer actually computes when no positional information is provided. This pathological case illuminates exactly what is missing.
The Bag-of-Words Collapse
Without positional encoding, a transformer's output for each position depends only on:
- The embedding of the token at that position (its content)
- The unordered multiset of all other token embeddings in the sequence
This yields an effectively bag-of-words representation—useful for some tasks, but catastrophically insufficient for most sequence understanding.
Consider the self-attention output for position $i$ without positional encoding:
$$z_i = \sum_{j=1}^{n} \alpha_{ij} v_j$$
where the attention weights $\alpha_{ij}$ depend only on: $$\alpha_{ij} \propto \exp\left(\frac{(x_i W_Q)(x_j W_K)^T}{\sqrt{d_k}}\right)$$
If tokens "dog" and "cat" appear in a sentence, the attention from any other token to "dog" vs. "cat" depends solely on their respective embeddings—not on whether "dog" precedes or follows "cat".
```python
import torch
import torch.nn.functional as F

def attention_no_position(X, W_Q, W_K, W_V, d_k):
    """
    Standard attention without positional encoding.
    Demonstrates permutation equivariance.
    """
    Q = X @ W_Q  # [batch, seq_len, d_k]
    K = X @ W_K  # [batch, seq_len, d_k]
    V = X @ W_V  # [batch, seq_len, d_v]

    # Attention scores depend only on content
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ V

# Demonstration: permuting input permutes output identically
torch.manual_seed(42)
batch_size, seq_len, d_model, d_k = 1, 5, 64, 64

# Random input and weights
X = torch.randn(batch_size, seq_len, d_model)
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

# Compute attention on original sequence
output_original = attention_no_position(X, W_Q, W_K, W_V, d_k)

# Create a permutation (swap positions 1 and 3)
perm = torch.tensor([0, 3, 2, 1, 4])
X_permuted = X[:, perm, :]

# Compute attention on permuted sequence
output_permuted = attention_no_position(X_permuted, W_Q, W_K, W_V, d_k)

# The outputs are permuted identically
output_original_reordered = output_original[:, perm, :]
print("Max difference:", (output_original_reordered - output_permuted).abs().max().item())
# Output: Max difference: 0.0 (within floating point precision)
```

Empirical Consequences
Experiments removing positional encoding from trained transformers reveal severe performance degradation on order-sensitive tasks:
| Task | With Position | Without Position | Degradation |
|---|---|---|---|
| Machine Translation (BLEU) | 27.3 | 8.2 | −70% |
| Named Entity Recognition (F1) | 91.2 | 62.1 | −32% |
| Sentence Ordering (Accuracy) | 94.7 | 24.3 | ~Random |
| Text Classification (Acc) | 93.1 | 89.4 | −4% |
| Reading Comprehension (F1) | 88.6 | 41.2 | −53% |
Notice that text classification degrades less than other tasks. This is because simple sentiment analysis often relies on keyword presence rather than word order ('terrible movie' vs 'movie terrible' convey similar sentiment). Tasks requiring syntactic or sequential reasoning suffer catastrophically.
Given that position information is essential, what properties should a positional encoding scheme have? The original transformer paper and subsequent research have identified several desiderata that guide the design of effective positional representations.
Core Requirements
An ideal positional encoding should satisfy:
- Uniqueness: each position receives a distinct, deterministic representation
- Bounded scale: values comparable in magnitude to the token embeddings they are combined with
- Distance awareness: relative offsets between positions should be easy for the model to recover
- Length generalization: behavior should degrade gracefully on sequences longer than those seen in training
- Efficiency: minimal additional parameters and computation
The Encoding Interface
Mathematically, positional encoding is typically formalized as a function that maps position indices to vectors:
$$PE: \mathbb{N} \rightarrow \mathbb{R}^d$$
where $d$ is the model dimension. The encoded positions are then combined with token embeddings:
$$\tilde{x}_i = x_i + PE(i)$$
This additive combination is the standard approach in the original transformer. Alternative combination strategies include:
- Concatenation, which keeps positional and semantic dimensions separate at the cost of a larger model dimension
- Multiplicative or gated interactions between content and position
- Injecting position directly into the attention computation (as a score bias or a rotation of queries and keys) rather than into the input embeddings
Each approach has tradeoffs we will explore in subsequent sections.
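Returning to the standard additive interface, here is a minimal sketch of how $\tilde{x}_i = x_i + PE(i)$ is applied in practice; the `positional_encoding` helper is a placeholder standing in for any concrete scheme, not a specific published one.

```python
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Placeholder PE: any map from position index to a d_model-dimensional vector.
    A fixed random table is used purely for illustration (think of a learned
    embedding table at initialization)."""
    g = torch.Generator().manual_seed(0)              # same vectors on every call
    return torch.randn(seq_len, d_model, generator=g)

seq_len, d_model = 10, 64
token_embeddings = torch.randn(1, seq_len, d_model)   # [batch, seq_len, d_model]

pe = positional_encoding(seq_len, d_model)            # [seq_len, d_model]
x_tilde = token_embeddings + pe                       # additive combination, broadcast over batch

print(x_tilde.shape)                                  # torch.Size([1, 10, 64])
```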
Adding positional encoding to token embeddings assumes these two types of information can coexist in the same vector space. This is a strong assumption—semantic content and positional information may interfere. Some architectures explore alternative integration strategies to avoid this potential conflict.
Before the transformer, various neural architectures addressed positional information in different ways. Understanding these precedents helps contextualize the transformer's positional encoding design.
Recurrent Neural Networks (Pre-Transformer)
RNNs encode position implicitly through their sequential computation structure. The hidden state at position $t$ is computed as:
$$h_t = f(W_h h_{t-1} + W_x x_t + b)$$
Position information is captured in:
- The order of computation itself: $x_1$ updates the hidden state before $x_2$ is ever seen
- The hidden state trajectory, which accumulates a history of everything processed so far
- The number of recurrence steps between two tokens, which implicitly encodes their distance
Advantages: Position is implicit and unlimited—RNNs can, in principle, process arbitrarily long sequences.
Disadvantages: Sequential processing prevents parallelization; gradients vanish or explode over long distances; the hidden state becomes a bottleneck for long-range information.
Convolutional Approaches
CNNs for sequence modeling (e.g., WaveNet, ByteNet) encode position through:
- Local receptive fields: each kernel sees only a fixed window of neighboring positions
- Stacked and dilated layers that enlarge the receptive field while preserving relative offsets
- Causal (masked) convolutions that enforce left-to-right ordering for autoregressive prediction
Some CNN approaches add explicit position encodings similar to transformers, but the convolution structure provides some implicit positional bias toward local context.
Simple Index Encoding (Naive Baseline)
The simplest possible positional encoding is just the position index itself:
$$PE_{\text{simple}}(i) = [i, 0, 0, \ldots, 0] \in \mathbb{R}^d$$
or normalized:
$$PE_{\text{normalized}}(i) = [i / n_{\max}, 0, 0, \ldots, 0]$$
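A quick sketch of this naive baseline, with `pe_simple` and `pe_normalized` as illustrative helper names rather than standard functions:

```python
import torch

def pe_simple(i: int, d: int) -> torch.Tensor:
    """Raw position index in the first dimension, zeros elsewhere."""
    v = torch.zeros(d)
    v[0] = float(i)
    return v

def pe_normalized(i: int, d: int, n_max: int) -> torch.Tensor:
    """Position index divided by a fixed maximum length."""
    v = torch.zeros(d)
    v[0] = i / n_max
    return v

d, n_max = 64, 512
print(pe_simple(3, d)[0].item(), pe_simple(5000, d)[0].item())   # 3.0 vs 5000.0: unbounded growth
print(pe_normalized(256, d, n_max)[0].item())                    # 0.5, but its meaning shifts if n_max changes
```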
Problems with simple indexing:
- The raw index grows without bound, so for long sequences it dwarfs the token embedding it is added to
- Normalizing by $n_{\max}$ ties the meaning of each encoding to a fixed maximum length, which breaks when sequence lengths change
- A single scalar dimension forces the model to read fine-grained positional distinctions out of tiny numeric differences
- Neither variant gives the model an easy handle on relative distances between positions
These shortcomings motivated the development of more sophisticated encoding schemes, which we will explore in the following pages.
The research community has developed diverse approaches to positional encoding, which can be organized into several broad categories. This taxonomy provides a roadmap for the detailed treatments in subsequent pages.
Taxonomy of Positional Encoding Approaches
| Category | Examples | Key Idea | Pros/Cons |
|---|---|---|---|
| Absolute Sinusoidal | Original Transformer | Fixed frequencies encode position | Deterministic, smooth; fixed at training length |
| Learned Absolute | BERT, GPT-2 | Position embeddings as parameters | Flexible; requires many parameters |
| Relative Bias | T5, Transformer-XL | Encode distance in attention | Captures relative structure; added complexity |
| Rotary (RoPE) | LLaMA, GPT-NeoX | Rotation matrices in embedding space | Relative position via dot product; excellent extrapolation |
| ALiBi | BLOOM, MPT | Linear attention bias by distance | Simple; strong extrapolation |
| Hybrid | Various modern LLMs | Combine multiple approaches | Best of multiple worlds; engineering complexity |
Absolute vs. Relative: The Central Divide
The most fundamental distinction in positional encoding is between absolute and relative approaches:
Absolute Positional Encoding:
- Assigns each position index $i$ its own representation, independent of the surrounding tokens
- Is typically added to the token embeddings at the input layer (sinusoidal functions or learned tables)
- Is simple and efficient, but positions beyond the training length may be poorly represented
Relative Positional Encoding:
- Encodes the offset $j - i$ between pairs of tokens rather than their absolute indices
- Is usually injected inside the attention computation, as a bias on the scores or a transformation of queries and keys
- Matches how language tends to work (nearby context matters most) and generalizes more gracefully to unseen lengths
Each approach has implications for generalization, extrapolation, and computational efficiency that we will explore in detail.
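To make the relative idea concrete, here is a minimal sketch of a distance-based attention bias in the spirit of ALiBi; the single `slope` value is an arbitrary illustration, not the published per-head schedule.

```python
import torch
import torch.nn.functional as F

def attention_with_distance_bias(Q, K, V, slope=0.25):
    """Attention whose scores are penalized linearly by the distance |i - j|
    between query position i and key position j."""
    d_k = Q.shape[-1]
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)   # content term, same as before

    seq_len = Q.shape[-2]
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()       # [seq_len, seq_len] matrix of |i - j|
    scores = scores - slope * distance                   # position enters only through distance

    return F.softmax(scores, dim=-1) @ V

torch.manual_seed(0)
Q, K, V = (torch.randn(1, 6, 16) for _ in range(3))
out = attention_with_distance_bias(Q, K, V)
print(out.shape)                                          # torch.Size([1, 6, 16])
```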
Recent large language models have converged on relative positional encodings, particularly RoPE and ALiBi. These methods offer better length generalization—the ability to process sequences longer than those seen during training—which is crucial for practical deployment where input lengths vary widely.
While this module focuses primarily on sequence (text) transformers, positional encoding generalizes to other domains where transformers have been applied. Understanding these extensions reveals the core principles underlying all positional encoding schemes.
Vision Transformers (ViT)
In Vision Transformers, images are divided into patches that are flattened and treated as tokens. Position now involves 2D spatial coordinates:
$$PE_{\text{2D}}(i, j) = [PE_{\text{row}}(i); PE_{\text{col}}(j)]$$
or more sophisticated 2D sinusoidal encodings:
$$PE_{\text{2D}}(x, y) = [\sin(\omega_1 x), \cos(\omega_1 x), \sin(\omega_2 y), \cos(\omega_2 y), \ldots]$$
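A minimal sketch of the concatenated row/column variant, assuming a standard 1D sinusoidal helper (the sinusoidal formula itself is derived on the next page):

```python
import math
import torch

def sinusoidal_1d(n_pos: int, d: int) -> torch.Tensor:
    """Standard 1D sinusoidal table (derived in detail on the next page)."""
    pos = torch.arange(n_pos, dtype=torch.float32)[:, None]                  # [n_pos, 1]
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, d, 2, dtype=torch.float32) / d)
    pe = torch.zeros(n_pos, d)
    pe[:, 0::2] = torch.sin(pos * freqs)
    pe[:, 1::2] = torch.cos(pos * freqs)
    return pe

def pe_2d(n_rows: int, n_cols: int, d: int) -> torch.Tensor:
    """Concatenate a row encoding and a column encoding, each of size d/2
    (assumes d is divisible by 4)."""
    row_pe = sinusoidal_1d(n_rows, d // 2)                                   # [n_rows, d/2]
    col_pe = sinusoidal_1d(n_cols, d // 2)                                   # [n_cols, d/2]
    grid = torch.cat([
        row_pe[:, None, :].expand(n_rows, n_cols, d // 2),                   # same row vector along each row
        col_pe[None, :, :].expand(n_rows, n_cols, d // 2),                   # same column vector down each column
    ], dim=-1)
    return grid.reshape(n_rows * n_cols, d)                                  # one vector per flattened patch

print(pe_2d(14, 14, 64).shape)                                               # torch.Size([196, 64])
```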
Audio and Speech
Audio transformers (e.g., Wav2Vec, Whisper) process spectrograms or raw waveforms. Position corresponds to time, often at very high resolution (e.g., 16kHz sampling). Efficient positional encoding is critical given the long sequences involved.
Video
Video transformers must encode position in three dimensions: two spatial and one temporal. Strategies include:
- Factorized encodings with separate spatial and temporal components that are summed or concatenated
- Joint 3D sinusoidal encodings over $(x, y, t)$
- Divided attention schemes that handle spatial and temporal positions in separate attention layers
Graphs and Sets
Some transformer variants operate on graphs (e.g., message passing transformers) or sets (e.g., set transformers). Here, absolute position is undefined—there is no canonical ordering. Instead:
- Graph transformers derive positional features from the graph structure itself, for example Laplacian eigenvectors or random-walk statistics that describe where a node sits in the topology (sketched below)
- Set transformers typically omit positional encoding altogether, deliberately preserving permutation invariance
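As a small illustration of the Laplacian-eigenvector idea, here is a sketch on a toy path graph; the choice of graph and the number of eigenvectors are arbitrary.

```python
import torch

def laplacian_positional_encoding(adj: torch.Tensor, k: int) -> torch.Tensor:
    """Use the k smallest non-trivial Laplacian eigenvectors as node 'positions'.
    (In practice, eigenvector sign ambiguity is handled separately, e.g. by random flips.)"""
    degree = torch.diag(adj.sum(dim=1))
    laplacian = degree - adj
    eigvals, eigvecs = torch.linalg.eigh(laplacian)    # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                         # skip the constant (trivial) eigenvector

# Toy graph: a path of 5 nodes, 0 - 1 - 2 - 3 - 4
adj = torch.zeros(5, 5)
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

pe = laplacian_positional_encoding(adj, k=2)           # [5, 2]: one 2-dim 'position' per node
print(pe)
```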
Despite domain differences, the core principle is consistent: inject information about the structure and ordering of the input that the self-attention mechanism cannot infer on its own. The specific encoding should match the domain's notion of 'proximity' and 'order'.
We have established why positional encoding is fundamental to transformer architectures. Let's consolidate the key insights:
- Self-attention is permutation equivariant: it treats its input as an unordered set and cannot recover token order on its own
- Word order carries essential meaning, from local syntax to long-range agreement, so order-sensitive tasks collapse without positional information
- Positional encodings inject the missing ordering information, most commonly by adding a position-dependent vector to each token embedding
- Good schemes balance absolute and relative information, bounded scale, and generalization to unseen sequence lengths
- The main families are absolute (sinusoidal, learned), relative (bias-based, rotary, ALiBi), and hybrids, each with different extrapolation behavior
What's Next: Sinusoidal Encoding
In the next page, we dive deep into the original transformer's positional encoding: the sinusoidal scheme. We will derive its sine/cosine formulation, build intuition for its frequency structure, see how it exposes relative offsets through simple transformations, and implement it in code.
The sinusoidal encoding remains influential despite newer alternatives. Understanding it provides essential intuition for all subsequent positional encoding research.
You now understand why positional encoding is not merely a technical detail but a structural necessity for transformer architectures. Without it, these powerful models would be fundamentally incapable of understanding sequence order. Next, we explore the elegant sinusoidal solution proposed in the original 'Attention Is All You Need' paper.