One of the most profound properties of self-attention—and one that distinguishes it from recurrent architectures—is its inherent position independence. Self-attention treats sequences as sets, not ordered lists. It cannot distinguish between "The cat sat on the mat" and "mat the on sat cat The" unless we explicitly provide positional information.
This property, formally known as permutation equivariance, is simultaneously a limitation (for ordered data such as language, where word order carries meaning) and a feature (for set-structured data, where no natural order exists).
Understanding this property deeply is crucial because it explains why positional encodings are essential rather than optional, and because it guides which architectures and encodings fit which data structures.
This page rigorously examines position independence, its mathematical formulation, and its practical implications.
By the end of this page, you will understand permutation equivariance, mathematically prove self-attention's position independence, see concrete examples of its consequences, and appreciate why positional encodings are not optional but essential for sequence modeling.
Equivariance means that a function's output transforms in the same way as its input. For permutations:
Definition: Permutation Equivariance
A function $f: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ is permutation equivariant if for any permutation $\pi$ applied to the input rows:
$$f(\pi(X)) = \pi(f(X))$$
In other words: permute the input → the output is permuted in the same way.
Contrast with Permutation Invariance:
A function is permutation invariant if $f(\pi(X)) = f(X)$ — the output doesn't change at all when input is permuted. Mean/sum pooling are invariant; self-attention is equivariant.
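To make the distinction concrete, here is a minimal numpy check (using a toy layer with Q = K = V = X for brevity, not the full projected attention used later) contrasting the two behaviors:

```python
import numpy as np
from scipy.special import softmax

np.random.seed(0)
X = np.random.randn(5, 3)            # 5 tokens, 3-dim features
perm = np.random.permutation(5)      # a random reordering of the rows

# Invariance: mean pooling collapses the set, so order cannot matter
assert np.allclose(X.mean(axis=0), X[perm].mean(axis=0))

# Equivariance: a toy attention layer (Q = K = V = X) permutes along with its input
def toy_attention(X):
    weights = softmax(X @ X.T / np.sqrt(X.shape[1]), axis=-1)
    return weights @ X

assert np.allclose(toy_attention(X[perm]), toy_attention(X)[perm])
print("mean pooling: invariant; toy attention: equivariant")
```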
Why This Matters:
Equivariance means self-attention preserves the correspondence between input and output positions, but doesn't use positional information to compute the output values.
```python
import numpy as np
from scipy.special import softmax


def self_attention(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray, W_v: np.ndarray):
    """Standard self-attention without positional encoding."""
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    attn_weights = softmax(scores, axis=-1)
    output = attn_weights @ V
    return output, attn_weights


def apply_permutation(X: np.ndarray, perm: np.ndarray) -> np.ndarray:
    """Permute rows of X according to permutation."""
    return X[perm]


def demonstrate_equivariance():
    """Demonstrate that self-attention is permutation equivariant."""
    np.random.seed(42)
    n, d_model, d_k = 4, 8, 4

    # Input and weights
    X = np.random.randn(n, d_model)
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)

    # Define a permutation: [0,1,2,3] → [2,0,3,1]
    perm = np.array([2, 0, 3, 1])
    inverse_perm = np.argsort(perm)  # To undo the permutation

    # Approach 1: Apply attention, then permute output
    output_original, _ = self_attention(X, W_q, W_k, W_v)
    output_then_permute = apply_permutation(output_original, perm)

    # Approach 2: Permute input, then apply attention
    X_permuted = apply_permutation(X, perm)
    output_permuted_input, _ = self_attention(X_permuted, W_q, W_k, W_v)

    print("Permutation Equivariance Demonstration")
    print("=" * 60)
    print("Original input X:")
    print(X.round(3))
    print(f"\nPermutation: {list(range(n))} → {list(perm)}")
    print("\nPermuted input X[π]:")
    print(X_permuted.round(3))

    print("\n--- Method 1: f(X), then permute ---")
    print("f(X):")
    print(output_original.round(3))
    print("π(f(X)):")
    print(output_then_permute.round(3))

    print("\n--- Method 2: permute X, then f ---")
    print("f(π(X)):")
    print(output_permuted_input.round(3))

    # Check if they're equal
    are_equal = np.allclose(output_then_permute, output_permuted_input)
    print(f"\n✓ π(f(X)) == f(π(X)): {are_equal}")
    if are_equal:
        print("Self-attention IS permutation equivariant!")


demonstrate_equivariance()
```

Permutation equivariance means: if you shuffle a deck of cards (input), run attention, and get a result, that result is just the shuffled version of what you'd get by running attention first, then shuffling. The computation doesn't 'know' where cards originally were.
Let's rigorously prove that self-attention is permutation equivariant by tracing through the computation.
Setup:
Let $X \in \mathbb{R}^{n \times d}$ be the input, and $P_\pi \in \{0,1\}^{n \times n}$ be a permutation matrix representing permutation $\pi$. Note that $P_\pi$ has exactly one 1 in each row and each column, and that it is orthogonal: $P_\pi^T P_\pi = I$.
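As a quick sanity check of these facts, here is a small numpy snippet (using the same permutation as the verification code later on this page):

```python
import numpy as np

n = 4
perm = np.array([2, 0, 3, 1])        # same permutation used in the proof code below
P = np.eye(n)[perm]                  # permutation matrix: row i is one-hot at perm[i]
X = np.random.randn(n, 3)

assert np.allclose(P.T @ P, np.eye(n))   # orthogonality: P^T P = I
assert np.allclose(P @ X, X[perm])       # left-multiplying by P reorders the rows of X
```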
Step 1: Computing Q, K, V
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$
For permuted input $P_\pi X$: $$Q' = P_\pi X W_Q = P_\pi Q$$ $$K' = P_\pi X W_K = P_\pi K$$ $$V' = P_\pi X W_V = P_\pi V$$
So Q, K, V are permuted in the same way as X.
Step 2: Attention Scores
Original scores: $S = QK^T$. Permuted scores:

$$S' = Q'K'^T = (P_\pi Q)(P_\pi K)^T = P_\pi Q K^T P_\pi^T = P_\pi S P_\pi^T$$
Because $P_\pi^T = P_\pi^{-1}$, this is a similarity transformation: the permutation reorders both the rows and the columns of $S$, so the $(i, j)$ entry of $S'$ is the original score between the tokens that now occupy positions $i$ and $j$.
Step 3: Softmax
Softmax is applied row-wise. Permuting rows only changes which row gets normalized where, and permuting the entries within a row permutes the resulting probabilities in the same way, so softmax commutes with the transformation $P_\pi (\cdot) P_\pi^T$: $$A' = \text{softmax}(S' / \sqrt{d_k}) = P_\pi \, \text{softmax}(S / \sqrt{d_k}) \, P_\pi^T = P_\pi A P_\pi^T$$
Step 4: Output Aggregation
$$O' = A' V' = (P_\pi A P_\pi^T)(P_\pi V) = P_\pi A (P_\pi^T P_\pi) V = P_\pi A V = P_\pi O$$
Conclusion: $\text{Attention}(P_\pi X) = P_\pi \cdot \text{Attention}(X)$ ∎
```python
import numpy as np
from scipy.special import softmax


def verify_proof_steps():
    """Verify each step of the permutation equivariance proof."""
    np.random.seed(42)
    n, d = 4, 8
    d_k = 4

    # Setup
    X = np.random.randn(n, d)
    W_q = np.random.randn(d, d_k)
    W_k = np.random.randn(d, d_k)
    W_v = np.random.randn(d, d_k)

    # Permutation: [0,1,2,3] → [2,0,3,1]
    perm = np.array([2, 0, 3, 1])
    P = np.eye(n)[perm]  # Permutation matrix

    print("Proof Verification: Self-Attention Permutation Equivariance")
    print("=" * 60)

    # Step 1: Q, K, V
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    Q_perm = (P @ X) @ W_q
    check_1 = np.allclose(Q_perm, P @ Q)
    print(f"Step 1: Q' = P·Q: {check_1}")

    # Step 2: Attention scores
    S = Q @ K.T
    S_perm = Q_perm @ (P @ K).T
    check_2 = np.allclose(S_perm, P @ S @ P.T)
    print(f"Step 2: S' = P·S·P^T: {check_2}")

    # Step 3: Softmax (row-wise)
    A = softmax(S / np.sqrt(d_k), axis=-1)
    A_perm = softmax(S_perm / np.sqrt(d_k), axis=-1)
    check_3 = np.allclose(A_perm, P @ A @ P.T)
    print(f"Step 3: A' = P·A·P^T: {check_3}")

    # Step 4: Output
    O = A @ V
    O_perm = A_perm @ (P @ V)
    check_4 = np.allclose(O_perm, P @ O)
    print(f"Step 4: O' = P·O: {check_4}")

    print(f"\nAll steps verified: {all([check_1, check_2, check_3, check_4])}")

    # Demonstrate the key insight: attention "content" is position-blind
    print("\n" + "=" * 60)
    print("Key Insight: Attention Patterns Follow Permutation")
    print("=" * 60)
    print("\nOriginal attention weights A:")
    print(A.round(3))
    print("\nPermuted attention weights A':")
    print(A_perm.round(3))
    print("\nP·A·P^T (should equal A'):")
    print((P @ A @ P.T).round(3))

    # Show what this means: "who attends to whom" is content-based, not position-based
    tokens = ["The", "cat", "sat", "mat"]
    print("\nInterpretation:")
    print("Original order:", tokens)
    print(f"  'sat' (pos 2) attends to 'cat' (pos 1) with weight {A[2,1]:.3f}")

    permuted_tokens = [tokens[p] for p in perm]
    print("\nPermuted order:", permuted_tokens)
    # Find where 'sat' and 'cat' are now
    sat_new_pos = perm.tolist().index(2)  # Where is original pos 2 now?
    cat_new_pos = perm.tolist().index(1)  # Where is original pos 1 now?
    print(f"  'sat' (now pos {sat_new_pos}) attends to 'cat' (now pos {cat_new_pos}) with weight {A_perm[sat_new_pos, cat_new_pos]:.3f}")
    print("\n→ Same content-based attention, positions don't matter!")


verify_proof_steps()
```

The proof shows that self-attention only knows 'which input element' attends to 'which other input element'—not 'which POSITION attends to which POSITION'. The actual positions are invisible to the computation.
Position independence has profound consequences for processing natural language, where word order is often crucial for meaning.
Example 1: Semantic Ambiguity
Consider these two sentences:

- "John loves Mary"
- "Mary loves John"
Without positional encoding, self-attention sees both as the same unordered set: {John, loves, Mary}. It cannot distinguish subject from object.
```python
import numpy as np
from scipy.special import softmax


def simulate_attention_without_position():
    """Show how position-blind attention confuses word order."""
    # Simulate word embeddings (position-free)
    embeddings = {
        "john": np.array([1.0, 0.5, 0.2]),
        "loves": np.array([0.3, 1.0, 0.4]),
        "mary": np.array([0.8, 0.6, 0.3]),
    }

    # Two sentences with opposite meaning
    sentence1 = ["john", "loves", "mary"]
    sentence2 = ["mary", "loves", "john"]

    X1 = np.array([embeddings[w] for w in sentence1])
    X2 = np.array([embeddings[w] for w in sentence2])

    # Random projection weights (same for both)
    np.random.seed(42)
    d_k = 2
    W_q = np.random.randn(3, d_k)
    W_k = np.random.randn(3, d_k)
    W_v = np.random.randn(3, d_k)

    def compute_attention_output(X, W_q, W_k, W_v):
        Q = X @ W_q
        K = X @ W_k
        V = X @ W_v
        scores = Q @ K.T / np.sqrt(d_k)
        weights = softmax(scores, axis=-1)
        output = weights @ V
        return output, weights

    out1, weights1 = compute_attention_output(X1, W_q, W_k, W_v)
    out2, weights2 = compute_attention_output(X2, W_q, W_k, W_v)

    print("Position Independence in Language")
    print("=" * 60)
    print(f"\nSentence 1: '{' '.join(sentence1)}'")
    print("  Meaning: John → Mary (John is subject)")
    print(f"\nSentence 2: '{' '.join(sentence2)}'")
    print("  Meaning: Mary → John (Mary is subject)")

    print("\n--- Attention Weights ---")
    print("Sentence 1 attention (row = query, col = key):")
    for i, word in enumerate(sentence1):
        weights_str = " ".join([f"{sentence1[j]}:{weights1[i,j]:.2f}" for j in range(3)])
        print(f"  {word}: [{weights_str}]")
    print("\nSentence 2 attention:")
    for i, word in enumerate(sentence2):
        weights_str = " ".join([f"{sentence2[j]}:{weights2[i,j]:.2f}" for j in range(3)])
        print(f"  {word}: [{weights_str}]")

    # Key observation: The attention FROM "john" TO "loves" is the same
    # regardless of john's position
    print("\n--- Key Observation ---")
    john_loves_1 = weights1[0, 1]  # john→loves in sentence 1
    john_loves_2 = weights2[2, 1]  # john→loves in sentence 2 (john is now at pos 2)
    print(f"'john' → 'loves' weight in sentence 1: {john_loves_1:.4f}")
    print(f"'john' → 'loves' weight in sentence 2: {john_loves_2:.4f}")
    print(f"Same weight: {np.isclose(john_loves_1, john_loves_2)}")
    print("\n→ Attention can't tell that 'john' is subject in one and object in other!")


simulate_attention_without_position()
```

Example 2: Negation Sensitivity
Consider "the movie was good, not bad" versus "the movie was bad, not good": both contain exactly the same tokens, and only their order determines which adjective is negated. Without position, these are indistinguishable to self-attention.
Example 3: Temporal Order
For sequences like event logs or time series, meaning depends on what happened before what: "log in, purchase, log out" describes a different session than "log out, purchase, log in".
The causal order is lost without positional information.
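A quick sketch of this point, using made-up event embeddings (the names and values below are illustrative, not from the original): self-attention followed by mean pooling produces exactly the same summary for a session and its reversal.

```python
import numpy as np
from scipy.special import softmax

# Hypothetical event embeddings, chosen only for illustration
events = {
    "log_in":   np.array([1.0, 0.1, 0.0]),
    "purchase": np.array([0.2, 1.0, 0.3]),
    "log_out":  np.array([0.0, 0.2, 1.0]),
}

def pooled_attention(X):
    """Toy self-attention (Q = K = V = X) followed by mean pooling."""
    weights = softmax(X @ X.T / np.sqrt(X.shape[1]), axis=-1)
    return (weights @ X).mean(axis=0)

session = np.array([events[e] for e in ["log_in", "purchase", "log_out"]])
reversed_session = session[::-1]

# Both orderings produce the same pooled summary: the causal order is invisible
print(np.allclose(pooled_attention(session), pooled_attention(reversed_session)))  # True
```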
What Self-Attention CAN Capture:
Despite position blindness, pure self-attention can still learn content-based relationships: which tokens are semantically similar, which tokens should exchange information, and how strongly each token's output should draw on the others.
But it CANNOT learn position-dependent patterns without help.
Without positional encodings, a transformer reduces to a sophisticated bag-of-words model. It can weight and combine tokens, but cannot distinguish 'dog bites man' from 'man bites dog'. Positional encodings are absolutely essential for any position-sensitive task.
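The following sketch makes this concrete, using made-up three-dimensional embeddings and Q = K = V = X for brevity: every token receives exactly the same attention output in 'dog bites man' as in 'man bites dog', so the two sentences yield the same set of representations.

```python
import numpy as np
from scipy.special import softmax

# Illustrative embeddings, invented for this sketch
emb = {
    "dog":   np.array([1.0, 0.2, 0.1]),
    "bites": np.array([0.3, 1.0, 0.2]),
    "man":   np.array([0.1, 0.4, 1.0]),
}

def attention_outputs(words):
    """Toy self-attention (Q = K = V = X); returns per-token output vectors."""
    X = np.array([emb[w] for w in words])
    weights = softmax(X @ X.T / np.sqrt(X.shape[1]), axis=-1)
    return dict(zip(words, weights @ X))

out1 = attention_outputs(["dog", "bites", "man"])
out2 = attention_outputs(["man", "bites", "dog"])

# Each word gets exactly the same output vector in both "sentences"
for w in emb:
    print(w, np.allclose(out1[w], out2[w]))   # True for every word
```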
Comparing self-attention to RNNs and CNNs reveals the trade-offs of position independence.
Recurrent Neural Networks:
RNNs are inherently position-aware through their sequential processing: $$h_t = f(h_{t-1}, x_t)$$
The hidden state $h_t$ "knows" it's at position $t$ because it required $t$ recurrent steps to reach. This provides:

- Built-in position awareness, with no extra machinery needed
- A natural bias toward recent inputs

But also requires:

- Strictly sequential computation, with no parallelization across time steps
- Squeezing all earlier context through a fixed-size hidden state, which limits the effective range
Convolutional Neural Networks:
CNNs process local windows, making them positionally-aware within their kernel: $$y_t = \sum_{k=-K}^{K} w_k \cdot x_{t+k}$$
Different kernel positions have different weights, encoding relative position within the window. This provides:

- Full parallelization across positions
- A useful inductive bias toward local patterns

But requires:

- Stacking many layers (or dilated kernels) to connect distant positions, since each layer sees only a limited receptive field
| Architecture | Position Awareness | Range | Parallelization | Inductive Bias |
|---|---|---|---|---|
| RNN | Inherent (step count) | Limited by hidden state | None (sequential) | Strong locality |
| CNN | Local (kernel weights) | Limited by receptive field | Full | Local patterns |
| Self-Attention | None (must add) | Full (all-to-all) | Full | None |
```python
import numpy as np


def rnn_step(h_prev, x_t, W_h, W_x, b):
    """Single RNN step - inherently knows position via h_prev."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)


def cnn_1d(x, kernel):
    """1D convolution - positional within kernel window."""
    n, k = len(x), len(kernel)
    output = np.zeros(n)
    for i in range(n):
        for j in range(k):
            if 0 <= i + j - k // 2 < n:
                output[i] += kernel[j] * x[i + j - k // 2]
    return output


def demonstrate_position_handling():
    """Show how different architectures handle position."""
    # Sequence: [A, B, C, A, B, C]
    # Question: Can the model distinguish first 'A' from second 'A'?
    seq = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])  # A=1, B=2, C=3
    n = len(seq)

    print("Position Handling: Can we distinguish repeated tokens?")
    print("=" * 60)
    print(f"Sequence: [A, B, C, A, B, C] = {list(seq)}")
    print()

    # RNN: Different hidden states at positions 0 and 3
    print("RNN:")
    d = 2
    W_h = np.random.randn(d, d) * 0.1
    W_x = np.random.randn(d, 1) * 0.1
    b = np.zeros(d)
    h = np.zeros(d)
    hidden_states = []
    for t, x_t in enumerate(seq):
        h = rnn_step(h, np.array([x_t]), W_h, W_x, b)
        hidden_states.append(h.copy())
    print(f"  h[0] (first A):  {hidden_states[0].round(4)}")
    print(f"  h[3] (second A): {hidden_states[3].round(4)}")
    print(f"  Different: {not np.allclose(hidden_states[0], hidden_states[3])}")
    print("  → RNN naturally distinguishes by accumulated state")

    # CNN: Different if kernel reaches different neighbors
    print("\nCNN (kernel size 3):")
    kernel = np.array([0.2, 0.6, 0.2])  # Weighted by neighbors
    cnn_out = cnn_1d(seq, kernel)
    print(f"  output[0] (first A):  {cnn_out[0]:.4f} (neighbors: _, A=1, B=2)")
    print(f"  output[3] (second A): {cnn_out[3]:.4f} (neighbors: C=3, A=1, B=2)")
    print(f"  Different: {not np.isclose(cnn_out[0], cnn_out[3])}")
    print("  → CNN distinguishes by different local neighborhoods")

    # Self-Attention (without positional encoding)
    print("\nSelf-Attention (no positional encoding):")
    # For identical tokens with identical embeddings, their Q, K, V are identical,
    # so their attention patterns and outputs are identical too.
    print("  If A→A: embeddings are the same")
    print("  So Q_0 = Q_3, K_0 = K_3, V_0 = V_3")
    print("  Attention weights from both positions are IDENTICAL")
    print("  Output[0] = Output[3]")
    print("  → Self-attention CANNOT distinguish without position info!")


demonstrate_position_handling()
```

RNNs and CNNs have position awareness built-in but limited range. Self-attention has unlimited range but no position awareness. Transformers solve this by adding positional encoding—the best of both worlds: O(1) path length + position information.
While position independence seems like a limitation, it's actually a feature for certain domains and tasks.
Set-Based Problems:
Some problems have inherently unordered inputs: point clouds, sets of detected objects, graph nodes (whose storage order is arbitrary), and other set-structured data.
For these, position independence is the correct inductive bias!
Example: Object Detection
In image processing, detected objects might be returned in an arbitrary order: sorted by confidence, by scan position, or not sorted at all.
The order in which we list detected objects shouldn't affect understanding. Self-attention over object features naturally respects this.
```python
import numpy as np
from scipy.special import softmax


def point_cloud_attention():
    """
    Demonstrate self-attention on point clouds where order doesn't matter.
    """
    # 3D point cloud: 4 points with (x, y, z) coordinates
    # Order shouldn't affect the output
    points_v1 = np.array([
        [1.0, 0.0, 0.0],  # Point A
        [0.0, 1.0, 0.0],  # Point B
        [0.0, 0.0, 1.0],  # Point C
        [1.0, 1.0, 1.0],  # Point D
    ])

    # Same points, different order
    points_v2 = np.array([
        [0.0, 0.0, 1.0],  # Point C
        [1.0, 1.0, 1.0],  # Point D
        [1.0, 0.0, 0.0],  # Point A
        [0.0, 1.0, 0.0],  # Point B
    ])

    # Simple self-attention (Q=K=V=X for simplicity)
    def set_attention(X):
        scores = X @ X.T / np.sqrt(X.shape[1])
        weights = softmax(scores, axis=-1)
        output = weights @ X
        # Aggregate to single representation (mean pool)
        return output.mean(axis=0)

    rep_v1 = set_attention(points_v1)
    rep_v2 = set_attention(points_v2)

    print("Point Cloud Self-Attention (Order-Invariant)")
    print("=" * 60)
    print("Points (two different orderings of same 4 points):")
    print("Version 1: A, B, C, D")
    print("Version 2: C, D, A, B")
    print("\nGlobal representation (mean-pooled attention output):")
    print(f"  Version 1: {rep_v1.round(4)}")
    print(f"  Version 2: {rep_v2.round(4)}")
    print(f"  Identical: {np.allclose(rep_v1, rep_v2)}")
    print("\n→ Self-attention + mean pooling is permutation INVARIANT!")
    print("  Perfect for sets and point clouds.")


def graph_attention():
    """
    Demonstrate self-attention on graph nodes.
    """
    # Graph with 3 nodes and their features
    # The "order" of nodes in memory is arbitrary
    print("\nGraph Node Attention")
    print("=" * 60)
    print("Graph: A -- B -- C (linear chain)")
    print("Node features don't depend on their ordering in memory.")

    node_features = np.array([
        [1.0, 0.0],  # Node A
        [0.5, 0.5],  # Node B
        [0.0, 1.0],  # Node C
    ])

    # In graph attention networks (GAT), attention is further
    # constrained by adjacency, but the base self-attention
    # is still permutation equivariant
    print("\nSelf-attention computes content-based relationships.")
    print("GAT adds adjacency masking to further constrain attention.")
    print("The base mechanism is position-free by design.")


point_cloud_attention()
graph_attention()
```

Position independence provides the correct inductive bias for set-structured data. Using positional encodings here would actually HURT by imposing artificial ordering. The key is matching architecture to data structure.
For sequential data where position matters, the solution is to inject positional information into the input. This is done through positional encodings.
The Basic Idea:
Instead of feeding raw token embeddings $x_i$ to self-attention, we add a position-dependent vector:
$$\tilde{x}_i = x_i + \text{PE}(i)$$
Where $\text{PE}(i)$ is a vector that uniquely identifies position $i$.
Now, two identical tokens at different positions have different representations, breaking the symmetry.
Requirements for Positional Encodings:

- Each position must receive a distinct encoding, so identical tokens at different positions become distinguishable
- The encoding should be deterministic: the same position always maps to the same vector
- It should handle variable-length sequences, ideally generalizing to lengths not seen during training
- It should let the model recover useful relationships between positions (such as relative offsets), not just raw indices
```python
import numpy as np


def demonstrate_positional_encoding_effect():
    """Show how positional encoding breaks symmetry."""
    # Two identical tokens at different positions
    token_embedding = np.array([1.0, 0.5, 0.3, 0.2])  # Token "A"

    # Simple positional encodings (will cover proper ones later)
    pe_0 = np.array([0.0, 0.0, 1.0, 0.0])  # Position 0
    pe_3 = np.array([0.3, 0.3, 0.7, 0.6])  # Position 3

    # Without positional encoding
    x_0_no_pe = token_embedding
    x_3_no_pe = token_embedding

    # With positional encoding
    x_0_with_pe = token_embedding + pe_0
    x_3_with_pe = token_embedding + pe_3

    print("Breaking Symmetry with Positional Encodings")
    print("=" * 60)
    print("Same token 'A' at positions 0 and 3:")
    print("\nWithout positional encoding:")
    print(f"  x[0] = {x_0_no_pe}")
    print(f"  x[3] = {x_3_no_pe}")
    print(f"  Identical: {np.allclose(x_0_no_pe, x_3_no_pe)}")
    print("\nWith positional encoding:")
    print(f"  x[0] + PE[0] = {x_0_with_pe}")
    print(f"  x[3] + PE[3] = {x_3_with_pe}")
    print(f"  Identical: {np.allclose(x_0_with_pe, x_3_with_pe)}")

    # Show effect on attention
    print("\n--- Effect on Self-Attention ---")
    # Q and K for position 0 attending to both positions
    W_q = W_k = np.eye(4) * 0.5  # Simple projection

    # Without PE
    Q_0_no = x_0_no_pe @ W_q
    K_0_no = x_0_no_pe @ W_k
    K_3_no = x_3_no_pe @ W_k
    score_0_to_0_no = Q_0_no @ K_0_no
    score_0_to_3_no = Q_0_no @ K_3_no
    print("Without PE:")
    print(f"  Score from pos 0 to pos 0: {score_0_to_0_no:.4f}")
    print(f"  Score from pos 0 to pos 3: {score_0_to_3_no:.4f}")
    print("  (Identical because tokens are identical)")

    # With PE
    Q_0_pe = x_0_with_pe @ W_q
    K_0_pe = x_0_with_pe @ W_k
    K_3_pe = x_3_with_pe @ W_k
    score_0_to_0_pe = Q_0_pe @ K_0_pe
    score_0_to_3_pe = Q_0_pe @ K_3_pe
    print("\nWith PE:")
    print(f"  Score from pos 0 to pos 0: {score_0_to_0_pe:.4f}")
    print(f"  Score from pos 0 to pos 3: {score_0_to_3_pe:.4f}")
    print("  (Different because positions encoded!)")


demonstrate_positional_encoding_effect()
```

Types of Positional Encodings (Preview):
The main families are sinusoidal encodings (fixed functions of position), learned absolute position embeddings, and relative position encodings. Each has trade-offs in terms of generalization, computational cost, and handling of variable-length sequences. These are covered in detail in Module 5: Positional Encoding.
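As a small taste of the first family, here is a minimal numpy sketch of the fixed sin/cos scheme from the original transformer paper (Module 5 treats this properly; the function name and sizes below are just for illustration):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(n_positions)[:, None]        # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=6, d_model=8)
print(pe.shape)                     # (6, 8)
print(np.allclose(pe[0], pe[3]))    # False: every position gets a distinct vector
```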
Positional encodings aren't an afterthought—they're essential for any task where order matters. Without them, a transformer cannot learn word order, temporal sequences, or any position-dependent pattern. They're as important as the attention mechanism itself.
Understanding position independence guides several important design decisions:
1. Choosing When to Use Positional Encodings

Add positional encodings whenever the input has a meaningful order (text, audio, time series). Omit them for genuinely set-structured inputs (point clouds, unordered object sets), where imposing an order would introduce a false inductive bias.
2. Designing Custom Positional Schemes
For non-standard domains, match the encoding to the data's structure: 2D encodings for image patch grids, structural information such as adjacency for graphs, and so on (see the patterns sketched below).
3. Combining with Other Architectures

Position handling can also come from other components. A common hybrid uses a CNN front-end to capture local order while the transformer layers add global context (Pattern 5 below).
"""Design patterns for handling position in self-attention architectures.""" # Pattern 1: Standard sequence (NLP)# → Sinusoidal or learned positional embeddings added to input class SequenceTransformer: def forward(self, tokens): x = self.embed(tokens) x = x + self.positional_encoding(len(tokens)) # Add position return self.transformer_layers(x) # Pattern 2: Set input (point clouds)# → No positional encoding, order-invariant pooling class SetTransformer: def forward(self, points): x = self.point_embed(points) # No position added x = self.transformer_layers(x) return self.pool(x) # Mean/max pool for invariance # Pattern 3: Grid input (images)# → 2D positional encoding class VisionTransformer: def forward(self, image): patches = self.patchify(image) # (H/P) × (W/P) patches x = self.patch_embed(patches) x = x + self.pos_embed_2d(H // P, W // P) # 2D positions return self.transformer_layers(x) # Pattern 4: Graph input# → Graph structure via attention masking, optional node position class GraphTransformer: def forward(self, node_features, adjacency): x = node_features # No position if graph structure suffices mask = self.make_attention_mask(adjacency) # Limit attention to edges return self.transformer_layers(x, mask=mask) # Pattern 5: Hybrid with locality# → CNN for local, Transformer for global class LocalGlobalTransformer: def forward(self, x): local_features = self.cnn(x) # CNN handles local patterns x = local_features + self.positional_encoding(len(x)) global_context = self.transformer(x) # Transformer adds global return global_context print("Design patterns demonstrate how position independence")print("should be addressed based on data structure and task.")The best architectures match their inductive biases to the data structure. Self-attention's position independence is a feature for sets, a limitation for sequences. Always choose based on your actual data, not on what's popular.
We've thoroughly explored one of self-attention's most important properties: its inherent position independence, or permutation equivariance.
What's Next:
Now that we understand self-attention's position independence and why positional encodings are necessary, we'll analyze the computational complexity of self-attention in detail. Understanding the O(n²) nature of attention is crucial for scaling to long sequences.
You now understand the fundamental permutation equivariance of self-attention, its mathematical proof, consequences for different domains, and why positional encodings are essential for sequence modeling. This knowledge is crucial for designing and applying transformer architectures appropriately.