In the previous module, we explored the fundamental attention mechanism—how a decoder query can attend to encoder outputs to dynamically weight relevant information. Self-attention takes this powerful concept and applies it reflexively: a sequence attends to itself.
This seemingly simple modification—letting each position in a sequence compute weighted combinations of all other positions—is the core innovation that powers transformers. Unlike RNNs, which process sequences step-by-step and struggle with long-range dependencies, self-attention enables direct, parallel connections between any two positions, regardless of distance.
The implications are profound: self-attention computes every position in parallel, connects any two positions directly regardless of distance, and yields representations in which each token is contextualized by the entire sequence.
Understanding self-attention deeply—its formulation, computation, and properties—is essential for mastering modern deep learning architectures.
By the end of this page, you will understand the complete mathematical formulation of self-attention, how queries, keys, and values are derived from the same input sequence, and why this self-referential structure enables richer representations than traditional sequential processing.
To understand self-attention, we must first recall the standard attention mechanism. In encoder-decoder attention, we have three distinct components:
Standard Attention Components:

- Query (Q): derived from the decoder's hidden states
- Key (K): derived from the encoder outputs
- Value (V): derived from the encoder outputs

The key insight is that Q, K, and V come from different sources: Q from the decoder, K and V from the encoder.
The Self-Attention Transformation:
Self-attention eliminates this asymmetry. Given an input sequence, we derive all three components from the same source:
```python
# Standard Encoder-Decoder Attention
# Q from decoder, K and V from encoder
Q = decoder_hidden_states @ W_q   # (T_dec, d_model) @ (d_model, d_k)
K = encoder_outputs @ W_k         # (T_enc, d_model) @ (d_model, d_k)
V = encoder_outputs @ W_v         # (T_enc, d_model) @ (d_model, d_v)
# Attention: Q attends to K/V from different sequence

# Self-Attention
# Q, K, and V ALL from the SAME input sequence
X = input_sequence   # (T, d_model) - same source for all three
Q = X @ W_q          # (T, d_model) @ (d_model, d_k) → (T, d_k)
K = X @ W_k          # (T, d_model) @ (d_model, d_k) → (T, d_k)
V = X @ W_v          # (T, d_model) @ (d_model, d_v) → (T, d_v)
# Attention: sequence attends to ITSELF
```

Why is this powerful?
When a sequence attends to itself, each position can gather information directly from any other position, acting simultaneously as a searcher and a source of information.

Consider translating "The cat sat on the mat because it was soft": to handle the pronoun "it" correctly, the model must decide whether it refers to "the mat" or "the cat", and attention lets "it" look directly at both candidates and weight "mat" more heavily.
This coreference resolution happens naturally through the attention weights, without explicit linguistic rules.
Self-attention is not just attention applied within a sequence—it's a fundamentally different computational paradigm. Each position simultaneously acts as a query (asking questions), a key (being searchable), and a value (providing information). This triple role enables rich, contextual representations.
The Query-Key-Value (QKV) framework is the mathematical backbone of self-attention. Understanding each component's role and how they interact is crucial for deep comprehension.
Formal Definition:
Given an input sequence $X \in \mathbb{R}^{n \times d_{model}}$ where $n$ is sequence length and $d_{model}$ is the embedding dimension, we compute:
```python
import numpy as np

def compute_qkv(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray,
                W_v: np.ndarray) -> tuple:
    """
    Compute Query, Key, Value matrices from input.

    Args:
        X: Input sequence, shape (n, d_model)
        W_q: Query weight matrix, shape (d_model, d_k)
        W_k: Key weight matrix, shape (d_model, d_k)
        W_v: Value weight matrix, shape (d_model, d_v)

    Returns:
        Q: Query matrix, shape (n, d_k)
        K: Key matrix, shape (n, d_k)
        V: Value matrix, shape (n, d_v)
    """
    Q = X @ W_q   # Each row i: query for position i
    K = X @ W_k   # Each row j: key for position j
    V = X @ W_v   # Each row j: value for position j
    return Q, K, V

# Example dimensions
n = 10         # Sequence length
d_model = 512  # Model dimension
d_k = 64       # Key/Query dimension
d_v = 64       # Value dimension

# Initialize
X = np.random.randn(n, d_model)
W_q = np.random.randn(d_model, d_k) * 0.02
W_k = np.random.randn(d_model, d_k) * 0.02
W_v = np.random.randn(d_model, d_v) * 0.02

Q, K, V = compute_qkv(X, W_q, W_k, W_v)
print(f"Q shape: {Q.shape}")  # (10, 64)
print(f"K shape: {K.shape}")  # (10, 64)
print(f"V shape: {V.shape}")  # (10, 64)
```

Semantic Interpretation of QKV:
Think of the QKV framework as an information retrieval system:
| Component | Analogy | Mathematical Role | Learned Representation |
|---|---|---|---|
| Query (Q) | Search query | What position i is looking for | Encodes 'what information do I need?' |
| Key (K) | Index/tag | How position j describes itself | Encodes 'what information do I have?' |
| Value (V) | Content | What position j contributes | Encodes 'what to return if matched' |
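To make the retrieval analogy concrete, here is a minimal sketch (with arbitrary toy dimensions, not taken from the examples above) of a single position performing its "lookup": its query is scored against every key, the softmax turns those scores into retrieval weights, and the result is a weighted blend of values.

```python
import numpy as np
from scipy.special import softmax

# Toy dimensions, chosen only for illustration
n, d_k, d_v = 6, 4, 4
Q = np.random.randn(n, d_k)   # one query per position
K = np.random.randn(n, d_k)   # one key per position
V = np.random.randn(n, d_v)   # one value per position

i = 2                                       # the position acting as the "searcher"
scores_i = Q[i] @ K.T / np.sqrt(d_k)        # match between query i and every key, shape (n,)
weights_i = softmax(scores_i)               # retrieval weights over positions, sum to 1
output_i = weights_i @ V                    # blended "content" returned to position i

print(weights_i.round(3), weights_i.sum())  # distribution over the n positions
print(output_i.shape)                       # (d_v,) = (4,)
```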
Why Separate Projections?
A natural question: why not use $X$ directly for Q, K, and V? The learned projection matrices ($W_q$, $W_k$, $W_v$) provide three critical capabilities:

- Role specialization: a token can represent itself differently when asking (query), when being searched (key), and when contributing content (value).
- Dimensionality control: projecting from $d_{model}$ down to $d_k$ and $d_v$ keeps the attention computation manageable.
- Learned relevance: the model learns which aspects of the representations should determine attention, rather than relying on raw embedding similarity.
Example: Role Specialization
Consider the word "bank" in "river bank" vs. "money bank": as a query, "bank" needs to look for disambiguating context such as "river" or "money"; as a key, it needs to advertise what kind of information it can offer to other tokens. Without separate projections, a token asking a question would have to use the same representation as when being searched, limiting expressiveness. A small numerical sketch of this limitation follows.
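The sketch below (assuming we simply reuse $X$ as both query and key) shows that without projections the score matrix $XX^T$ is forced to be symmetric, and each position's strongest match tends to be itself, because the diagonal entries are squared norms.

```python
import numpy as np

np.random.seed(0)
n, d_model, d_k = 6, 32, 8
X = np.random.randn(n, d_model)

# Without projections: scores = X X^T is symmetric, diagonal holds squared norms
scores_no_proj = X @ X.T
print(np.allclose(scores_no_proj, scores_no_proj.T))          # True: score(i, j) always equals score(j, i)
print((scores_no_proj.argmax(axis=1) == np.arange(n)).all())  # typically True: each row peaks on itself

# With learned projections, the score matrix need not be symmetric
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1
scores_proj = (X @ W_q) @ (X @ W_k).T
print(np.allclose(scores_proj, scores_proj.T))                # False: asymmetric relationships become possible
```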
Typically d_k = d_v = d_model / h where h is the number of attention heads (covered in the next module). For a 512-dimensional model with 8 heads, d_k = d_v = 64. This keeps computational cost manageable while enabling multi-head attention.
Now we assemble the complete self-attention operation. Given our Q, K, V matrices, the self-attention output is computed as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Let's break this down step-by-step:
```python
import numpy as np
from scipy.special import softmax

def self_attention(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray,
                   W_v: np.ndarray) -> tuple:
    """
    Complete self-attention computation.

    Args:
        X: Input sequence, shape (n, d_model)
        W_q, W_k, W_v: Projection matrices

    Returns:
        output: Contextualized representations, shape (n, d_v)
        attention_weights: Attention matrix, shape (n, n)
    """
    # Step 1: Compute Q, K, V
    Q = X @ W_q   # (n, d_k)
    K = X @ W_k   # (n, d_k)
    V = X @ W_v   # (n, d_v)

    d_k = Q.shape[1]

    # Step 2: Compute attention scores
    scores = Q @ K.T   # (n, n) - each entry is a dot product

    # Step 3: Scale by sqrt(d_k)
    scores_scaled = scores / np.sqrt(d_k)

    # Step 4: Softmax over keys (last axis)
    attention_weights = softmax(scores_scaled, axis=-1)   # (n, n)

    # Step 5: Weighted aggregation of values
    output = attention_weights @ V   # (n, d_v)

    return output, attention_weights

# Demonstration
n, d_model, d_k, d_v = 5, 32, 8, 8

X = np.random.randn(n, d_model)
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1
W_v = np.random.randn(d_model, d_v) * 0.1

output, attn_weights = self_attention(X, W_q, W_k, W_v)

print(f"Input shape: {X.shape}")                          # (5, 32)
print(f"Output shape: {output.shape}")                    # (5, 8)
print(f"Attention weights shape: {attn_weights.shape}")   # (5, 5)
print(f"Row sums (should be 1): {attn_weights.sum(axis=1)}")
```

Dimensional Analysis:
Understanding how shapes propagate through self-attention is crucial:
| Operation | Input Shape | Weight Shape | Output Shape |
|---|---|---|---|
| Q = XW_q | (n, d_model) | (d_model, d_k) | (n, d_k) |
| K = XW_k | (n, d_model) | (d_model, d_k) | (n, d_k) |
| V = XW_v | (n, d_model) | (d_model, d_v) | (n, d_v) |
| S = QK^T | (n, d_k) × (d_k, n) | — | (n, n) |
| A = softmax(S/√d_k) | (n, n) | — | (n, n) |
| O = AV | (n, n) × (n, d_v) | — | (n, d_v) |
Notice that the attention matrix is $n \times n$, so its cost grows quadratically with sequence length; the output has the same number of positions as the input; and $d_{model}$ enters only through the projections, never the attention matrix itself.
The softmax must be applied row-wise (axis=-1), not globally. Each row corresponds to one query position, and its attention weights must sum to 1. A global softmax would incorrectly normalize across all n² entries, breaking the interpretation as attention distribution.
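A small sketch of the difference, using an arbitrary random score matrix purely for illustration:

```python
import numpy as np
from scipy.special import softmax

np.random.seed(1)
scores = np.random.randn(4, 4)   # toy (n, n) score matrix

row_wise = softmax(scores, axis=-1)     # correct: one distribution per query position
global_sm = softmax(scores, axis=None)  # wrong: one distribution over all n*n entries

print(row_wise.sum(axis=1))    # [1. 1. 1. 1.] — each query's weights sum to 1
print(global_sm.sum())         # 1.0 — a single distribution over the whole matrix
print(global_sm.sum(axis=1))   # rows no longer sum to 1
```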
The attention score matrix $S = QK^T$ is the heart of self-attention. Understanding what these scores represent provides deep insight into the mechanism.
Dot Product as Similarity:
The score $S_{ij} = Q_i \cdot K_j$ measures the alignment between what position $i$ is looking for (query) and what position $j$ offers (key). This dot product captures both directional alignment (how similarly the two vectors are oriented) and magnitude (strongly expressed queries and keys produce larger scores).
```python
import numpy as np

def analyze_attention_patterns(sentence: list, Q: np.ndarray,
                               K: np.ndarray) -> np.ndarray:
    """
    Analyze attention score patterns for a sentence.
    """
    # Raw scores
    scores = Q @ K.T

    # Scaled scores
    d_k = Q.shape[1]
    scores_scaled = scores / np.sqrt(d_k)

    # Print pairwise interpretations
    print("Attention Score Analysis:")
    print("-" * 50)
    for i, word_i in enumerate(sentence):
        # Find top attended positions for this word
        top_indices = np.argsort(scores_scaled[i])[::-1][:3]
        print(f"'{word_i}' attends most to: ", end="")
        for j in top_indices:
            print(f"'{sentence[j]}' ({scores_scaled[i, j]:.2f}), ", end="")
        print()

    return scores_scaled

# Example: Coreference resolution
sentence = ["The", "cat", "sat", "on", "the", "mat", "because", "it", "was", "soft"]
n = len(sentence)
d_k = 8

# Simulate learned Q, K that capture coreference
np.random.seed(42)
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)

# Manually boost "it" query toward "mat" and "cat" keys
# to simulate learned coreference patterns
Q[7] = K[5] * 0.8 + np.random.randn(d_k) * 0.2   # "it" → "mat"
Q[7] += K[1] * 0.3                               # Also some attention to "cat"

scores = analyze_attention_patterns(sentence, Q, K)
```

Geometric Interpretation:
Self-attention can be viewed geometrically in the $d_k$-dimensional space: queries and keys are vectors in this shared space, a query attends most strongly to keys that point in a similar direction (large dot product), and the softmax turns these alignments into a probability distribution over positions.
What Do Attention Scores Capture?
Through training, attention scores learn to reflect various linguistic and structural phenomena:
| Phenomenon | Example | Attention Pattern |
|---|---|---|
| Coreference | 'it' → 'cat' in 'The cat... it...' | Pronoun attends strongly to antecedent noun |
| Subject-Verb | 'runs' → 'dog' in 'The dog runs' | Verb attends to its subject for agreement |
| Modifier-Head | 'red' → 'car' in 'the red car' | Adjective attends to the noun it modifies |
| Dependency | 'with' → 'telescope' in 'saw... with telescope' | Preposition attends to object, resolving attachment |
| Negation Scope | 'happy' → 'not' in 'not very happy' | Modifier attends to negation affecting it |
These patterns aren't hard-coded—they emerge from training. The model learns that attending to certain positions helps with downstream tasks like translation or generation. This makes attention a powerful, general-purpose mechanism that discovers task-relevant relationships automatically.
Understanding self-attention's advantages requires contrasting it with the recurrent approach it largely replaced.
The Recurrent Paradigm:
RNNs process sequences step-by-step, maintaining a hidden state that accumulates information:
$$h_t = f(h_{t-1}, x_t)$$
This creates a sequential bottleneck: to compute $h_t$, we must first compute $h_1, h_2, ..., h_{t-1}$.
The Self-Attention Paradigm:
Self-attention computes each position's output in parallel:
$$o_i = \sum_j \alpha_{ij} V_j$$
where $\alpha_{ij}$ depends only on the input, not on previous computations.
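This contrast shows up directly in code. The following toy sketch (hypothetical shapes and a simple tanh recurrence, not a full RNN implementation) highlights that the recurrent outputs require a Python loop over time steps, while all self-attention outputs fall out of a couple of matrix multiplications:

```python
import numpy as np
from scipy.special import softmax

n, d = 8, 16
X = np.random.randn(n, d)

# Recurrent: h_t depends on h_{t-1}, so positions must be processed in order
W_h, W_x = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1
h = np.zeros(d)
rnn_outputs = []
for t in range(n):                        # sequential bottleneck: cannot parallelize over t
    h = np.tanh(h @ W_h + X[t] @ W_x)
    rnn_outputs.append(h)

# Self-attention: every output position computed at once from the same input
W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)
attn_outputs = weights @ V                # all n rows produced in parallel

print(len(rnn_outputs), attn_outputs.shape)   # 8 (8, 16)
```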
Path Length and Long-Range Dependencies:
Consider deciding whether "soft" describes the cat or the mat in "The cat that sat on the mat was soft". In both architectures, the relevant words are far apart:
RNN Path:
cat → that → sat → on → the → mat → was → soft
(7 recurrent steps, gradient must flow through all)
Self-Attention Path:
soft ←→ mat (direct attention connection, 1 step)
soft ←→ cat (also direct, model learns to distinguish)
The self-attention path length is O(1) regardless of distance, making long-range dependencies as easy to learn as local ones.
```python
def rnn_path_length(pos_from: int, pos_to: int) -> int:
    """Path length in RNN for information to flow between positions."""
    return abs(pos_to - pos_from)

def self_attention_path_length(pos_from: int, pos_to: int) -> int:
    """Path length in self-attention for information flow."""
    return 1   # Always 1 - direct connection via attention

# Example: Long sequence
seq_length = 1000

# Information from position 0 to position 999
print(f"RNN path length: {rnn_path_length(0, 999)}")                        # 999
print(f"Self-attention path length: {self_attention_path_length(0, 999)}")  # 1

# For L stacked layers:
# - RNN: L layers, each with O(n) path → total O(L·n) max path
# - Self-attention: L layers, each O(1) path → total O(L) max path

L = 12    # Typical transformer depth
n = 512   # Typical sequence length

print(f"\nFor {L}-layer model, sequence length {n}:")
print(f"  RNN worst-case path: {L * n}")           # 6144
print(f"  Self-attention worst-case path: {L}")    # 12
```

Self-attention sacrifices the inductive bias of sequential processing (locality, recent-context priority) for the ability to learn arbitrary connections. This is appropriate for many tasks but means the model must learn even obvious local patterns from data. Positional encodings partially address this (covered later in this chapter).
The output of self-attention is a new sequence of the same length, where each position's representation has been contextualized by information from all other positions.
Output Interpretation:
For each position $i$, the output $O_i$ is:
$$O_i = \sum_{j=1}^{n} \alpha_{ij} V_j$$
This is a weighted average of all value vectors, where the weights $\alpha_{ij}$ are determined by attention scores. The result is a contextualized representation: each output row blends information from every position, in proportion to its relevance to position $i$.
```python
import numpy as np
from scipy.special import softmax

def analyze_output_contributions(sentence: list, V: np.ndarray,
                                 attention_weights: np.ndarray,
                                 position: int) -> None:
    """
    Analyze which positions contributed to a specific output position.
    """
    weights = attention_weights[position]
    output = weights @ V

    print(f"\nAnalyzing output for position {position}: '{sentence[position]}'")
    print("-" * 50)
    print(f"Output vector (shape {output.shape}): norm = {np.linalg.norm(output):.3f}")
    print("\nContributions by position:")

    # Sort by contribution weight
    sorted_indices = np.argsort(weights)[::-1]
    for j in sorted_indices[:5]:   # Top 5 contributors
        contribution = weights[j] * V[j]
        print(f"  {sentence[j]:10s} (pos {j}): "
              f"weight={weights[j]:.3f}, "
              f"contrib_norm={np.linalg.norm(contribution):.3f}")

    # Verify output is weighted sum
    reconstructed = (weights[:, None] * V).sum(axis=0)
    assert np.allclose(output, reconstructed)
    print(f"\n✓ Output verified as weighted sum of values")

# Example
sentence = ["The", "bank", "by", "the", "river", "was", "eroded"]
n = len(sentence)
d_v = 8

# Simulated attention weights learned to resolve "bank" → "river"
# Position 1 ("bank") attends strongly to position 4 ("river")
attention_weights = softmax(np.random.randn(n, n), axis=-1)
attention_weights[1, 4] = 0.6                                               # Strong attention to "river"
attention_weights[1] = attention_weights[1] / attention_weights[1].sum()    # Renormalize

V = np.random.randn(n, d_v)

analyze_output_contributions(sentence, V, attention_weights, position=1)
```

Contextualization in Practice:
Before self-attention, the representation for "bank" might look the same in "river bank" and "bank account". After self-attention, the "bank" in "river bank" has absorbed information from "river", while the "bank" in "bank account" has absorbed information from "account".
These are now distinguishable embeddings, even though the input token was identical.
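A minimal sketch of this effect (toy random vectors and a hypothetical two-token context, purely illustrative): the same "bank" embedding is placed next to two different neighbors, and after one self-attention pass its output representation differs because different values get mixed in.

```python
import numpy as np
from scipy.special import softmax

np.random.seed(3)
d_model, d_k = 16, 8
bank = np.random.randn(d_model)        # identical input embedding in both contexts
river, account = np.random.randn(d_model), np.random.randn(d_model)

X_river = np.stack([river, bank])      # toy 2-token context: "river", "bank"
X_money = np.stack([account, bank])    # toy 2-token context: "account", "bank"

W_q, W_k, W_v = (np.random.randn(d_model, d_k) * 0.1 for _ in range(3))

def attend(X):
    """One self-attention pass over a tiny sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return A @ V

out_river = attend(X_river)[1]   # contextualized "bank" next to "river"
out_money = attend(X_money)[1]   # contextualized "bank" next to "account"
print(np.allclose(out_river, out_money))               # False: same token, different outputs
print(np.linalg.norm(out_river - out_money).round(3))  # nonzero distance between the two
```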
Stacking Self-Attention Layers:
A single self-attention layer produces contextualized representations. Stacking multiple layers creates hierarchical contextualization: early layers tend to capture local, surface-level relationships, while deeper layers combine these into increasingly abstract, longer-range ones.
This is analogous to how CNNs build up from edges → textures → objects.
In practice, self-attention output is added to the input via residual connections: Output = X + SelfAttention(X). This preserves the original representation while adding contextual information—making optimization easier and preventing the loss of position-specific information.
A fundamental property of self-attention is its inherent bidirectionality: each position can attend to all other positions, including those before and after it.
Implications: every position's representation incorporates both left and right context, and the mechanism itself imposes no notion of temporal order or causality.
This is ideal for encoder tasks (understanding complete sequences) but problematic for decoder tasks (generating sequences left-to-right).
```python
import numpy as np
from scipy.special import softmax

def bidirectional_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> tuple:
    """Standard bidirectional self-attention."""
    d_k = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    attention_weights = softmax(scores, axis=-1)
    output = attention_weights @ V
    return output, attention_weights

def causal_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> tuple:
    """Causal (masked) self-attention - positions can only attend to the past."""
    d_k = Q.shape[1]
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(d_k)

    # Create causal mask: position i can only attend to positions <= i
    # Upper triangular = future positions = should be masked
    causal_mask = np.triu(np.ones((n, n)), k=1)   # Ones above diagonal
    scores = scores - 1e9 * causal_mask           # Effectively -inf for future positions

    attention_weights = softmax(scores, axis=-1)
    output = attention_weights @ V
    return output, attention_weights

# Demonstration
n, d = 5, 4
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)

_, bidirectional_weights = bidirectional_attention(Q, K, V)
_, causal_weights = causal_attention(Q, K, V)

print("Bidirectional attention weights (rounded):")
print(np.round(bidirectional_weights, 2))
print("\nCausal attention weights (rounded):")
print(np.round(causal_weights, 2))

# Notice: Causal has zeros above the diagonal (can't attend to the future)
```

Causal Masking Mechanism:
For autoregressive generation (predicting the next token), we must prevent the model from "cheating" by looking at future tokens. The causal mask sets attention scores for future positions to $-\infty$, which becomes 0 after softmax:
$$\text{mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \text{ (past/present)} \\ -\infty & \text{if } j > i \text{ (future)} \end{cases}$$
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + \text{mask}}{\sqrt{d_k}}\right)V$$
In the full transformer, three types of attention coexist: (1) Bidirectional self-attention in the encoder, (2) Causal self-attention in the decoder, and (3) Cross-attention where decoder queries attend to encoder outputs. Each serves a different purpose in the architecture.
In practice, a self-attention layer includes several additional components beyond the core attention operation: an output projection back to $d_{model}$, residual connections, layer normalization, and a position-wise feedforward network.
Full Layer Architecture:
```python
import numpy as np
from scipy.special import softmax

class SelfAttentionLayer:
    """
    Complete self-attention layer with all standard components.
    """

    def __init__(self, d_model: int, d_k: int, d_v: int, d_ff: int):
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        self.d_ff = d_ff

        # QKV projections
        scale = 0.02
        self.W_q = np.random.randn(d_model, d_k) * scale
        self.W_k = np.random.randn(d_model, d_k) * scale
        self.W_v = np.random.randn(d_model, d_v) * scale

        # Output projection (project d_v back to d_model)
        self.W_o = np.random.randn(d_v, d_model) * scale

        # Feedforward network
        self.W_ff1 = np.random.randn(d_model, d_ff) * scale
        self.b_ff1 = np.zeros(d_ff)
        self.W_ff2 = np.random.randn(d_ff, d_model) * scale
        self.b_ff2 = np.zeros(d_model)

        # LayerNorm parameters (simplified)
        self.ln1_gamma = np.ones(d_model)
        self.ln1_beta = np.zeros(d_model)
        self.ln2_gamma = np.ones(d_model)
        self.ln2_beta = np.zeros(d_model)

    def layer_norm(self, x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
                   eps: float = 1e-5) -> np.ndarray:
        """Layer normalization over the last dimension."""
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        x_norm = (x - mean) / np.sqrt(var + eps)
        return gamma * x_norm + beta

    def attention(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                  mask: np.ndarray = None) -> np.ndarray:
        """Scaled dot-product attention."""
        scores = Q @ K.T / np.sqrt(self.d_k)
        if mask is not None:
            scores = scores + mask
        weights = softmax(scores, axis=-1)
        return weights @ V

    def feedforward(self, x: np.ndarray) -> np.ndarray:
        """Position-wise feedforward network with ReLU."""
        hidden = np.maximum(0, x @ self.W_ff1 + self.b_ff1)   # ReLU
        return hidden @ self.W_ff2 + self.b_ff2

    def forward(self, x: np.ndarray, mask: np.ndarray = None) -> np.ndarray:
        """
        Full forward pass through the self-attention layer.

        Args:
            x: Input, shape (n, d_model)
            mask: Optional attention mask

        Returns:
            Output, shape (n, d_model)
        """
        # === Self-Attention Sublayer ===
        # 1. Compute Q, K, V
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v

        # 2. Attention
        attn_output = self.attention(Q, K, V, mask)

        # 3. Output projection
        attn_output = attn_output @ self.W_o

        # 4. Residual connection + LayerNorm
        x = self.layer_norm(x + attn_output, self.ln1_gamma, self.ln1_beta)

        # === Feedforward Sublayer ===
        # 5. Feedforward network
        ff_output = self.feedforward(x)

        # 6. Residual connection + LayerNorm
        x = self.layer_norm(x + ff_output, self.ln2_gamma, self.ln2_beta)

        return x

# Usage
layer = SelfAttentionLayer(d_model=64, d_k=16, d_v=16, d_ff=256)
x = np.random.randn(10, 64)   # Sequence of 10 tokens, 64-dim
output = layer.forward(x)
print(f"Input shape: {x.shape}")        # (10, 64)
print(f"Output shape: {output.shape}")  # (10, 64)
```

This complete layer is the fundamental building block of transformers. Stack L such layers (typically 6-24), and you have either a transformer encoder or decoder. The only difference is whether the attention is bidirectional or causal.
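As a brief usage sketch (reusing the SelfAttentionLayer class and imports from the block above, with the same toy sizes), stacking is just repeated application: each layer consumes the previous layer's output, and the (n, d_model) shape is preserved throughout.

```python
# Stack several layers: each layer re-contextualizes the previous layer's output
num_layers = 4
layers = [SelfAttentionLayer(d_model=64, d_k=16, d_v=16, d_ff=256) for _ in range(num_layers)]

x = np.random.randn(10, 64)   # 10 tokens, 64-dim embeddings
for layer in layers:
    x = layer.forward(x)      # shape is preserved: (10, 64) in, (10, 64) out

print(x.shape)                # (10, 64)
```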
We've established a comprehensive understanding of self-attention—the mechanism that enables transformers to capture complex dependencies across sequences.
What's Next:
Now that we understand the formulation, we need to examine the details of computing attention weights. The next page explores how raw scores become probability distributions, including the crucial scaling factor and the softmax operation.
You now understand the complete formulation of self-attention—from the QKV framework to the full layer architecture. This foundation is essential for understanding the computational details, multi-head attention, and transformer architectures that follow.