One of the most profound properties of self-attention—and one that distinguishes it from recurrent architectures—is its inherent position independence. Self-attention treats sequences as sets, not ordered lists. It cannot distinguish between "The cat sat on the mat" and "mat the on sat cat The" unless we explicitly provide positional information.
This property, formally known as permutation equivariance, is simultaneously a limitation (for ordered data such as language, where word order carries meaning) and a feature (for set-structured data, where no natural order exists).
Understanding this property deeply is crucial because it explains why positional encodings are essential rather than optional, and because it guides which architectures and encodings fit which data structures.
This page rigorously examines position independence, its mathematical formulation, and its practical implications.
By the end of this page, you will understand permutation equivariance, mathematically prove self-attention's position independence, see concrete examples of its consequences, and appreciate why positional encodings are not optional but essential for sequence modeling.
Equivariance means that a function's output transforms in the same way as its input. For permutations:
Definition: Permutation Equivariance
A function $f: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ is permutation equivariant if for any permutation $\pi$ applied to the input rows:
$$f(\pi(X)) = \pi(f(X))$$
In other words: permute the input → the output is permuted in the same way.
Contrast with Permutation Invariance:
A function is permutation invariant if $f(\pi(X)) = f(X)$ — the output doesn't change at all when input is permuted. Mean/sum pooling are invariant; self-attention is equivariant.
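To make the distinction concrete, here is a minimal numpy check (using a toy layer with Q = K = V = X for brevity, not the full projected attention used later) contrasting the two behaviors:

```python
import numpy as np
from scipy.special import softmax

np.random.seed(0)
X = np.random.randn(5, 3)            # 5 tokens, 3-dim features
perm = np.random.permutation(5)      # a random reordering of the rows

# Invariance: mean pooling collapses the set, so order cannot matter
assert np.allclose(X.mean(axis=0), X[perm].mean(axis=0))

# Equivariance: a toy attention layer (Q = K = V = X) permutes along with its input
def toy_attention(X):
    weights = softmax(X @ X.T / np.sqrt(X.shape[1]), axis=-1)
    return weights @ X

assert np.allclose(toy_attention(X[perm]), toy_attention(X)[perm])
print("mean pooling: invariant; toy attention: equivariant")
```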
Why This Matters:
Equivariance means self-attention preserves the correspondence between input and output positions, but doesn't use positional information to compute the output values.
```python
import numpy as np
from scipy.special import softmax


def self_attention(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray, W_v: np.ndarray):
    """Standard self-attention without positional encoding."""
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    attn_weights = softmax(scores, axis=-1)
    output = attn_weights @ V
    return output, attn_weights


def apply_permutation(X: np.ndarray, perm: np.ndarray) -> np.ndarray:
    """Permute rows of X according to permutation."""
    return X[perm]


def demonstrate_equivariance():
    """Demonstrate that self-attention is permutation equivariant."""
    np.random.seed(42)
    n, d_model, d_k = 4, 8, 4

    # Input and weights
    X = np.random.randn(n, d_model)
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)

    # Define a permutation: [0,1,2,3] → [2,0,3,1]
    perm = np.array([2, 0, 3, 1])
    inverse_perm = np.argsort(perm)  # To undo the permutation

    # Approach 1: Apply attention, then permute output
    output_original, _ = self_attention(X, W_q, W_k, W_v)
    output_then_permute = apply_permutation(output_original, perm)

    # Approach 2: Permute input, then apply attention
    X_permuted = apply_permutation(X, perm)
    output_permuted_input, _ = self_attention(X_permuted, W_q, W_k, W_v)

    print("Permutation Equivariance Demonstration")
    print("=" * 60)
    print("Original input X:")
    print(X.round(3))
    print(f"\nPermutation: {list(range(n))} → {list(perm)}")
    print("\nPermuted input X[π]:")
    print(X_permuted.round(3))

    print("\n--- Method 1: f(X), then permute ---")
    print("f(X):")
    print(output_original.round(3))
    print("π(f(X)):")
    print(output_then_permute.round(3))

    print("\n--- Method 2: permute X, then f ---")
    print("f(π(X)):")
    print(output_permuted_input.round(3))

    # Check if they're equal
    are_equal = np.allclose(output_then_permute, output_permuted_input)
    print(f"\n✓ π(f(X)) == f(π(X)): {are_equal}")
    if are_equal:
        print("Self-attention IS permutation equivariant!")


demonstrate_equivariance()
```

Permutation equivariance means: if you shuffle a deck of cards (input), run attention, and get a result, that result is just the shuffled version of what you'd get by running attention first, then shuffling. The computation doesn't 'know' where cards originally were.
Let's rigorously prove that self-attention is permutation equivariant by tracing through the computation.
Setup:
Let $X \in \mathbb{R}^{n \times d}$ be the input, and $P_\pi \in \{0,1\}^{n \times n}$ be a permutation matrix representing permutation $\pi$. Note that $P_\pi$ has exactly one 1 in each row and each column, and that it is orthogonal: $P_\pi^T P_\pi = I$.
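As a quick sanity check of these facts, here is a small numpy snippet (using the same permutation as the verification code later on this page):

```python
import numpy as np

n = 4
perm = np.array([2, 0, 3, 1])        # same permutation used in the proof code below
P = np.eye(n)[perm]                  # permutation matrix: row i is one-hot at perm[i]
X = np.random.randn(n, 3)

assert np.allclose(P.T @ P, np.eye(n))   # orthogonality: P^T P = I
assert np.allclose(P @ X, X[perm])       # left-multiplying by P reorders the rows of X
```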
Step 1: Computing Q, K, V
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$
For permuted input $P_\pi X$: $$Q' = P_\pi X W_Q = P_\pi Q$$ $$K' = P_\pi X W_K = P_\pi K$$ $$V' = P_\pi X W_V = P_\pi V$$
So Q, K, V are permuted in the same way as X.
Step 2: Attention Scores
Original scores: $S = QK^T$. Permuted scores:

$$S' = Q'K'^T = (P_\pi Q)(P_\pi K)^T = P_\pi Q K^T P_\pi^T = P_\pi S P_\pi^T$$
Because $P_\pi^T = P_\pi^{-1}$, this is a similarity transformation: the permutation reorders both the rows and the columns of $S$, so the $(i, j)$ entry of $S'$ is the original score between the tokens that now occupy positions $i$ and $j$.
Step 3: Softmax
Softmax is applied row-wise. Permuting rows only changes which row gets normalized where, and permuting the entries within a row permutes the resulting probabilities in the same way, so softmax commutes with the transformation $P_\pi (\cdot) P_\pi^T$: $$A' = \text{softmax}(S' / \sqrt{d_k}) = P_\pi \, \text{softmax}(S / \sqrt{d_k}) \, P_\pi^T = P_\pi A P_\pi^T$$
Step 4: Output Aggregation
$$O' = A' V' = (P_\pi A P_\pi^T)(P_\pi V) = P_\pi A (P_\pi^T P_\pi) V = P_\pi A V = P_\pi O$$
Conclusion: $\text{Attention}(P_\pi X) = P_\pi \cdot \text{Attention}(X)$ ∎
```python
import numpy as np
from scipy.special import softmax


def verify_proof_steps():
    """Verify each step of the permutation equivariance proof."""
    np.random.seed(42)
    n, d = 4, 8
    d_k = 4

    # Setup
    X = np.random.randn(n, d)
    W_q = np.random.randn(d, d_k)
    W_k = np.random.randn(d, d_k)
    W_v = np.random.randn(d, d_k)

    # Permutation: [0,1,2,3] → [2,0,3,1]
    perm = np.array([2, 0, 3, 1])
    P = np.eye(n)[perm]  # Permutation matrix

    print("Proof Verification: Self-Attention Permutation Equivariance")
    print("=" * 60)

    # Step 1: Q, K, V
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    Q_perm = (P @ X) @ W_q
    check_1 = np.allclose(Q_perm, P @ Q)
    print(f"Step 1: Q' = P·Q: {check_1}")

    # Step 2: Attention scores
    S = Q @ K.T
    S_perm = Q_perm @ (P @ K).T
    check_2 = np.allclose(S_perm, P @ S @ P.T)
    print(f"Step 2: S' = P·S·P^T: {check_2}")

    # Step 3: Softmax (row-wise)
    A = softmax(S / np.sqrt(d_k), axis=-1)
    A_perm = softmax(S_perm / np.sqrt(d_k), axis=-1)
    check_3 = np.allclose(A_perm, P @ A @ P.T)
    print(f"Step 3: A' = P·A·P^T: {check_3}")

    # Step 4: Output
    O = A @ V
    O_perm = A_perm @ (P @ V)
    check_4 = np.allclose(O_perm, P @ O)
    print(f"Step 4: O' = P·O: {check_4}")

    print(f"\nAll steps verified: {all([check_1, check_2, check_3, check_4])}")

    # Demonstrate the key insight: attention "content" is position-blind
    print("\n" + "=" * 60)
    print("Key Insight: Attention Patterns Follow Permutation")
    print("=" * 60)
    print("\nOriginal attention weights A:")
    print(A.round(3))
    print("\nPermuted attention weights A':")
    print(A_perm.round(3))
    print("\nP·A·P^T (should equal A'):")
    print((P @ A @ P.T).round(3))

    # Show what this means: "who attends to whom" is content-based, not position-based
    tokens = ["The", "cat", "sat", "mat"]
    print("\nInterpretation:")
    print("Original order:", tokens)
    print(f"  'sat' (pos 2) attends to 'cat' (pos 1) with weight {A[2,1]:.3f}")

    permuted_tokens = [tokens[p] for p in perm]
    print("\nPermuted order:", permuted_tokens)
    # Find where 'sat' and 'cat' are now
    sat_new_pos = perm.tolist().index(2)  # Where is original pos 2 now?
    cat_new_pos = perm.tolist().index(1)  # Where is original pos 1 now?
    print(f"  'sat' (now pos {sat_new_pos}) attends to 'cat' (now pos {cat_new_pos}) with weight {A_perm[sat_new_pos, cat_new_pos]:.3f}")
    print("\n→ Same content-based attention, positions don't matter!")


verify_proof_steps()
```

The proof shows that self-attention only knows 'which input element' attends to 'which other input element'—not 'which POSITION attends to which POSITION'. The actual positions are invisible to the computation.
Position independence has profound consequences for processing natural language, where word order is often crucial for meaning.
Example 1: Semantic Ambiguity
Consider these two sentences:

- "John loves Mary"
- "Mary loves John"
Without positional encoding, self-attention sees both as the same unordered set: {John, loves, Mary}. It cannot distinguish subject from object.
```python
import numpy as np
from scipy.special import softmax


def simulate_attention_without_position():
    """Show how position-blind attention confuses word order."""
    # Simulate word embeddings (position-free)
    embeddings = {
        "john": np.array([1.0, 0.5, 0.2]),
        "loves": np.array([0.3, 1.0, 0.4]),
        "mary": np.array([0.8, 0.6, 0.3]),
    }

    # Two sentences with opposite meaning
    sentence1 = ["john", "loves", "mary"]
    sentence2 = ["mary", "loves", "john"]

    X1 = np.array([embeddings[w] for w in sentence1])
    X2 = np.array([embeddings[w] for w in sentence2])

    # Random projection weights (same for both)
    np.random.seed(42)
    d_k = 2
    W_q = np.random.randn(3, d_k)
    W_k = np.random.randn(3, d_k)
    W_v = np.random.randn(3, d_k)

    def compute_attention_output(X, W_q, W_k, W_v):
        Q = X @ W_q
        K = X @ W_k
        V = X @ W_v
        scores = Q @ K.T / np.sqrt(d_k)
        weights = softmax(scores, axis=-1)
        output = weights @ V
        return output, weights

    out1, weights1 = compute_attention_output(X1, W_q, W_k, W_v)
    out2, weights2 = compute_attention_output(X2, W_q, W_k, W_v)

    print("Position Independence in Language")
    print("=" * 60)
    print(f"\nSentence 1: '{' '.join(sentence1)}'")
    print("  Meaning: John → Mary (John is subject)")
    print(f"\nSentence 2: '{' '.join(sentence2)}'")
    print("  Meaning: Mary → John (Mary is subject)")

    print("\n--- Attention Weights ---")
    print("Sentence 1 attention (row = query, col = key):")
    for i, word in enumerate(sentence1):
        weights_str = " ".join([f"{sentence1[j]}:{weights1[i,j]:.2f}" for j in range(3)])
        print(f"  {word}: [{weights_str}]")
    print("\nSentence 2 attention:")
    for i, word in enumerate(sentence2):
        weights_str = " ".join([f"{sentence2[j]}:{weights2[i,j]:.2f}" for j in range(3)])
        print(f"  {word}: [{weights_str}]")

    # Key observation: The attention FROM "john" TO "loves" is the same
    # regardless of john's position
    print("\n--- Key Observation ---")
    john_loves_1 = weights1[0, 1]  # john→loves in sentence 1
    john_loves_2 = weights2[2, 1]  # john→loves in sentence 2 (john is now at pos 2)
    print(f"'john' → 'loves' weight in sentence 1: {john_loves_1:.4f}")
    print(f"'john' → 'loves' weight in sentence 2: {john_loves_2:.4f}")
    print(f"Same weight: {np.isclose(john_loves_1, john_loves_2)}")
    print("\n→ Attention can't tell that 'john' is subject in one and object in other!")


simulate_attention_without_position()
```

Example 2: Negation Sensitivity
Consider "the movie was good, not bad" versus "the movie was bad, not good": both contain exactly the same tokens, and only their order determines which adjective is negated. Without position, these are indistinguishable to self-attention.
Example 3: Temporal Order
For sequences like event logs or time series, meaning depends on what happened before what: "log in, purchase, log out" describes a different session than "log out, purchase, log in".
The causal order is lost without positional information.
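A quick sketch of this point, using made-up event embeddings (the names and values below are illustrative, not from the original): self-attention followed by mean pooling produces exactly the same summary for a session and its reversal.

```python
import numpy as np
from scipy.special import softmax

# Hypothetical event embeddings, chosen only for illustration
events = {
    "log_in":   np.array([1.0, 0.1, 0.0]),
    "purchase": np.array([0.2, 1.0, 0.3]),
    "log_out":  np.array([0.0, 0.2, 1.0]),
}

def pooled_attention(X):
    """Toy self-attention (Q = K = V = X) followed by mean pooling."""
    weights = softmax(X @ X.T / np.sqrt(X.shape[1]), axis=-1)
    return (weights @ X).mean(axis=0)

session = np.array([events[e] for e in ["log_in", "purchase", "log_out"]])
reversed_session = session[::-1]

# Both orderings produce the same pooled summary: the causal order is invisible
print(np.allclose(pooled_attention(session), pooled_attention(reversed_session)))  # True
```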
What Self-Attention CAN Capture:
Despite position blindness, pure self-attention can still learn content-based relationships: which tokens are semantically similar, which tokens should exchange information, and how strongly each token's output should draw on the others.
But it CANNOT learn position-dependent patterns without help.
Without positional encodings, a transformer reduces to a sophisticated bag-of-words model. It can weight and combine tokens, but cannot distinguish 'dog bites man' from 'man bites dog'. Positional encodings are absolutely essential for any position-sensitive task.
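The following sketch makes this concrete, using made-up three-dimensional embeddings and Q = K = V = X for brevity: every token receives exactly the same attention output in 'dog bites man' as in 'man bites dog', so the two sentences yield the same set of representations.

```python
import numpy as np
from scipy.special import softmax

# Illustrative embeddings, invented for this sketch
emb = {
    "dog":   np.array([1.0, 0.2, 0.1]),
    "bites": np.array([0.3, 1.0, 0.2]),
    "man":   np.array([0.1, 0.4, 1.0]),
}

def attention_outputs(words):
    """Toy self-attention (Q = K = V = X); returns per-token output vectors."""
    X = np.array([emb[w] for w in words])
    weights = softmax(X @ X.T / np.sqrt(X.shape[1]), axis=-1)
    return dict(zip(words, weights @ X))

out1 = attention_outputs(["dog", "bites", "man"])
out2 = attention_outputs(["man", "bites", "dog"])

# Each word gets exactly the same output vector in both "sentences"
for w in emb:
    print(w, np.allclose(out1[w], out2[w]))   # True for every word
```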
Comparing self-attention to RNNs and CNNs reveals the trade-offs of position independence.
Recurrent Neural Networks:
RNNs are inherently position-aware through their sequential processing: $$h_t = f(h_{t-1}, x_t)$$
The hidden state $h_t$ "knows" it's at position $t$ because it required $t$ recurrent steps to reach. This provides:

- Built-in position awareness, with no extra machinery needed
- A natural bias toward recent inputs

But also requires:

- Strictly sequential computation, with no parallelization across time steps
- Squeezing all earlier context through a fixed-size hidden state, which limits the effective range
Convolutional Neural Networks:
CNNs process local windows, making them positionally-aware within their kernel: $$y_t = \sum_{k=-K}^{K} w_k \cdot x_{t+k}$$
Different kernel positions have different weights, encoding relative position within the window. This provides:

- Full parallelization across positions
- A useful inductive bias toward local patterns

But requires:

- Stacking many layers (or dilated kernels) to connect distant positions, since each layer sees only a limited receptive field
| Architecture | Position Awareness | Range | Parallelization | Inductive Bias |
|---|---|---|---|---|
| RNN | Inherent (step count) | Limited by hidden state | None (sequential) | Strong locality |
| CNN | Local (kernel weights) | Limited by receptive field | Full | Local patterns |
| Self-Attention | None (must add) | Full (all-to-all) | Full | None |
```python
import numpy as np


def rnn_step(h_prev, x_t, W_h, W_x, b):
    """Single RNN step - inherently knows position via h_prev."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)


def cnn_1d(x, kernel):
    """1D convolution - positional within kernel window."""
    n, k = len(x), len(kernel)
    output = np.zeros(n)
    for i in range(n):
        for j in range(k):
            if 0 <= i + j - k // 2 < n:
                output[i] += kernel[j] * x[i + j - k // 2]
    return output


def demonstrate_position_handling():
    """Show how different architectures handle position."""
    # Sequence: [A, B, C, A, B, C]
    # Question: Can the model distinguish first 'A' from second 'A'?
    seq = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])  # A=1, B=2, C=3
    n = len(seq)

    print("Position Handling: Can we distinguish repeated tokens?")
    print("=" * 60)
    print(f"Sequence: [A, B, C, A, B, C] = {list(seq)}")
    print()

    # RNN: Different hidden states at positions 0 and 3
    print("RNN:")
    d = 2
    W_h = np.random.randn(d, d) * 0.1
    W_x = np.random.randn(d, 1) * 0.1
    b = np.zeros(d)
    h = np.zeros(d)
    hidden_states = []
    for t, x_t in enumerate(seq):
        h = rnn_step(h, np.array([x_t]), W_h, W_x, b)
        hidden_states.append(h.copy())
    print(f"  h[0] (first A):  {hidden_states[0].round(4)}")
    print(f"  h[3] (second A): {hidden_states[3].round(4)}")
    print(f"  Different: {not np.allclose(hidden_states[0], hidden_states[3])}")
    print("  → RNN naturally distinguishes by accumulated state")

    # CNN: Different if kernel reaches different neighbors
    print("\nCNN (kernel size 3):")
    kernel = np.array([0.2, 0.6, 0.2])  # Weighted by neighbors
    cnn_out = cnn_1d(seq, kernel)
    print(f"  output[0] (first A):  {cnn_out[0]:.4f} (neighbors: _, A=1, B=2)")
    print(f"  output[3] (second A): {cnn_out[3]:.4f} (neighbors: C=3, A=1, B=2)")
    print(f"  Different: {not np.isclose(cnn_out[0], cnn_out[3])}")
    print("  → CNN distinguishes by different local neighborhoods")

    # Self-Attention (without positional encoding)
    print("\nSelf-Attention (no positional encoding):")
    # For identical tokens with identical embeddings, their Q, K, V are identical,
    # so their attention patterns and outputs are identical too.
    print("  If A→A: embeddings are the same")
    print("  So Q_0 = Q_3, K_0 = K_3, V_0 = V_3")
    print("  Attention weights from both positions are IDENTICAL")
    print("  Output[0] = Output[3]")
    print("  → Self-attention CANNOT distinguish without position info!")


demonstrate_position_handling()
```

RNNs and CNNs have position awareness built-in but limited range. Self-attention has unlimited range but no position awareness. Transformers solve this by adding positional encoding—the best of both worlds: O(1) path length + position information.
While position independence seems like a limitation, it's actually a feature for certain domains and tasks.
Set-Based Problems:
Some problems have inherently unordered inputs: point clouds, sets of detected objects, graph nodes (whose storage order is arbitrary), and other set-structured data.
For these, position independence is the correct inductive bias!
Example: Object Detection
In image processing, detected objects might be returned in an arbitrary order: sorted by confidence, by scan position, or not sorted at all.
The order in which we list detected objects shouldn't affect understanding. Self-attention over object features naturally respects this.
```python
import numpy as np
from scipy.special import softmax


def point_cloud_attention():
    """
    Demonstrate self-attention on point clouds where order doesn't matter.
    """
    # 3D point cloud: 4 points with (x, y, z) coordinates
    # Order shouldn't affect the output
    points_v1 = np.array([
        [1.0, 0.0, 0.0],  # Point A
        [0.0, 1.0, 0.0],  # Point B
        [0.0, 0.0, 1.0],  # Point C
        [1.0, 1.0, 1.0],  # Point D
    ])

    # Same points, different order
    points_v2 = np.array([
        [0.0, 0.0, 1.0],  # Point C
        [1.0, 1.0, 1.0],  # Point D
        [1.0, 0.0, 0.0],  # Point A
        [0.0, 1.0, 0.0],  # Point B
    ])

    # Simple self-attention (Q=K=V=X for simplicity)
    def set_attention(X):
        scores = X @ X.T / np.sqrt(X.shape[1])
        weights = softmax(scores, axis=-1)
        output = weights @ X
        # Aggregate to single representation (mean pool)
        return output.mean(axis=0)

    rep_v1 = set_attention(points_v1)
    rep_v2 = set_attention(points_v2)

    print("Point Cloud Self-Attention (Order-Invariant)")
    print("=" * 60)
    print("Points (two different orderings of same 4 points):")
    print("Version 1: A, B, C, D")
    print("Version 2: C, D, A, B")
    print("\nGlobal representation (mean-pooled attention output):")
    print(f"  Version 1: {rep_v1.round(4)}")
    print(f"  Version 2: {rep_v2.round(4)}")
    print(f"  Identical: {np.allclose(rep_v1, rep_v2)}")
    print("\n→ Self-attention + mean pooling is permutation INVARIANT!")
    print("  Perfect for sets and point clouds.")


def graph_attention():
    """
    Demonstrate self-attention on graph nodes.
    """
    # Graph with 3 nodes and their features
    # The "order" of nodes in memory is arbitrary
    print("\nGraph Node Attention")
    print("=" * 60)
    print("Graph: A -- B -- C (linear chain)")
    print("Node features don't depend on their ordering in memory.")

    node_features = np.array([
        [1.0, 0.0],  # Node A
        [0.5, 0.5],  # Node B
        [0.0, 1.0],  # Node C
    ])

    # In graph attention networks (GAT), attention is further
    # constrained by adjacency, but the base self-attention
    # is still permutation equivariant
    print("\nSelf-attention computes content-based relationships.")
    print("GAT adds adjacency masking to further constrain attention.")
    print("The base mechanism is position-free by design.")


point_cloud_attention()
graph_attention()
```

Position independence provides the correct inductive bias for set-structured data. Using positional encodings here would actually HURT by imposing artificial ordering. The key is matching architecture to data structure.
For sequential data where position matters, the solution is to inject positional information into the input. This is done through positional encodings.
The Basic Idea:
Instead of feeding raw token embeddings $x_i$ to self-attention, we add a position-dependent vector:
$$\tilde{x}_i = x_i + \text{PE}(i)$$
Where $\text{PE}(i)$ is a vector that uniquely identifies position $i$.
Now, two identical tokens at different positions have different representations, breaking the symmetry.
Requirements for Positional Encodings:

- Each position must receive a distinct encoding, so identical tokens at different positions become distinguishable
- The encoding should be deterministic: the same position always maps to the same vector
- It should handle variable-length sequences, ideally generalizing to lengths not seen during training
- It should let the model recover useful relationships between positions (such as relative offsets), not just raw indices
```python
import numpy as np


def demonstrate_positional_encoding_effect():
    """Show how positional encoding breaks symmetry."""
    # Two identical tokens at different positions
    token_embedding = np.array([1.0, 0.5, 0.3, 0.2])  # Token "A"

    # Simple positional encodings (will cover proper ones later)
    pe_0 = np.array([0.0, 0.0, 1.0, 0.0])  # Position 0
    pe_3 = np.array([0.3, 0.3, 0.7, 0.6])  # Position 3

    # Without positional encoding
    x_0_no_pe = token_embedding
    x_3_no_pe = token_embedding

    # With positional encoding
    x_0_with_pe = token_embedding + pe_0
    x_3_with_pe = token_embedding + pe_3

    print("Breaking Symmetry with Positional Encodings")
    print("=" * 60)
    print("Same token 'A' at positions 0 and 3:")
    print("\nWithout positional encoding:")
    print(f"  x[0] = {x_0_no_pe}")
    print(f"  x[3] = {x_3_no_pe}")
    print(f"  Identical: {np.allclose(x_0_no_pe, x_3_no_pe)}")
    print("\nWith positional encoding:")
    print(f"  x[0] + PE[0] = {x_0_with_pe}")
    print(f"  x[3] + PE[3] = {x_3_with_pe}")
    print(f"  Identical: {np.allclose(x_0_with_pe, x_3_with_pe)}")

    # Show effect on attention
    print("\n--- Effect on Self-Attention ---")
    # Q and K for position 0 attending to both positions
    W_q = W_k = np.eye(4) * 0.5  # Simple projection

    # Without PE
    Q_0_no = x_0_no_pe @ W_q
    K_0_no = x_0_no_pe @ W_k
    K_3_no = x_3_no_pe @ W_k
    score_0_to_0_no = Q_0_no @ K_0_no
    score_0_to_3_no = Q_0_no @ K_3_no
    print("Without PE:")
    print(f"  Score from pos 0 to pos 0: {score_0_to_0_no:.4f}")
    print(f"  Score from pos 0 to pos 3: {score_0_to_3_no:.4f}")
    print("  (Identical because tokens are identical)")

    # With PE
    Q_0_pe = x_0_with_pe @ W_q
    K_0_pe = x_0_with_pe @ W_k
    K_3_pe = x_3_with_pe @ W_k
    score_0_to_0_pe = Q_0_pe @ K_0_pe
    score_0_to_3_pe = Q_0_pe @ K_3_pe
    print("\nWith PE:")
    print(f"  Score from pos 0 to pos 0: {score_0_to_0_pe:.4f}")
    print(f"  Score from pos 0 to pos 3: {score_0_to_3_pe:.4f}")
    print("  (Different because positions encoded!)")


demonstrate_positional_encoding_effect()
```

Types of Positional Encodings (Preview):
The main families are sinusoidal encodings (fixed functions of position), learned absolute position embeddings, and relative position encodings. Each has trade-offs in terms of generalization, computational cost, and handling of variable-length sequences. These are covered in detail in Module 5: Positional Encoding.
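As a small taste of the first family, here is a minimal numpy sketch of the fixed sin/cos scheme from the original transformer paper (Module 5 treats this properly; the function name and sizes below are just for illustration):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(n_positions)[:, None]        # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=6, d_model=8)
print(pe.shape)                     # (6, 8)
print(np.allclose(pe[0], pe[3]))    # False: every position gets a distinct vector
```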
Positional encodings aren't an afterthought—they're essential for any task where order matters. Without them, a transformer cannot learn word order, temporal sequences, or any position-dependent pattern. They're as important as the attention mechanism itself.
Understanding position independence guides several important design decisions:
1. Choosing When to Use Positional Encodings

Add positional encodings whenever the input has a meaningful order (text, audio, time series). Omit them for genuinely set-structured inputs (point clouds, unordered object sets), where imposing an order would introduce a false inductive bias.
2. Designing Custom Positional Schemes
For non-standard domains, match the encoding to the data's structure: 2D encodings for image patch grids, structural information such as adjacency for graphs, and so on (see the patterns sketched below).
3. Combining with Other Architectures

Position handling can also come from other components. A common hybrid uses a CNN front-end to capture local order while the transformer layers add global context (Pattern 5 below).
"""Design patterns for handling position in self-attention architectures.""" # Pattern 1: Standard sequence (NLP)# → Sinusoidal or learned positional embeddings added to input class SequenceTransformer: def forward(self, tokens): x = self.embed(tokens) x = x + self.positional_encoding(len(tokens)) # Add position return self.transformer_layers(x) # Pattern 2: Set input (point clouds)# → No positional encoding, order-invariant pooling class SetTransformer: def forward(self, points): x = self.point_embed(points) # No position added x = self.transformer_layers(x) return self.pool(x) # Mean/max pool for invariance # Pattern 3: Grid input (images)# → 2D positional encoding class VisionTransformer: def forward(self, image): patches = self.patchify(image) # (H/P) × (W/P) patches x = self.patch_embed(patches) x = x + self.pos_embed_2d(H // P, W // P) # 2D positions return self.transformer_layers(x) # Pattern 4: Graph input# → Graph structure via attention masking, optional node position class GraphTransformer: def forward(self, node_features, adjacency): x = node_features # No position if graph structure suffices mask = self.make_attention_mask(adjacency) # Limit attention to edges return self.transformer_layers(x, mask=mask) # Pattern 5: Hybrid with locality# → CNN for local, Transformer for global class LocalGlobalTransformer: def forward(self, x): local_features = self.cnn(x) # CNN handles local patterns x = local_features + self.positional_encoding(len(x)) global_context = self.transformer(x) # Transformer adds global return global_context print("Design patterns demonstrate how position independence")print("should be addressed based on data structure and task.")The best architectures match their inductive biases to the data structure. Self-attention's position independence is a feature for sets, a limitation for sequences. Always choose based on your actual data, not on what's popular.
We've thoroughly explored one of self-attention's most important properties: its inherent position independence, or permutation equivariance.
What's Next:
Now that we understand self-attention's position independence and why positional encodings are necessary, we'll analyze the computational complexity of self-attention in detail. Understanding the O(n²) nature of attention is crucial for scaling to long sequences.
You now understand the fundamental permutation equivariance of self-attention, its mathematical proof, consequences for different domains, and why positional encodings are essential for sequence modeling. This knowledge is crucial for designing and applying transformer architectures appropriately.