In the previous page, we built intuition for attention as dynamic, content-based retrieval. Now we formalize this intuition into precise mathematics that can be implemented and trained.
The Query-Key-Value (QKV) framework provides the mathematical vocabulary for attention. It is not just notation: it is a computational paradigm that separates "what am I looking for" (Query), "what is available to look at" (Key), and "what information to retrieve" (Value). This separation enables attention to be modular, composable, and remarkably powerful.
Nearly every modern attention variant—from Bahdanau's original 2014 formulation to the multi-head attention in GPT-4—can be expressed in this framework.
By the end of this page, you will understand: (1) The mathematical definition of queries, keys, and values, (2) How attention scores are computed and normalized, (3) The role of learned projection matrices, (4) Different scoring functions and their tradeoffs, (5) Complete forward pass computation with concrete examples, and (6) How this framework enables the transformer architecture.
The QKV framework structures attention around three distinct roles, each with a clear semantic meaning:
Query (Q) — "What am I looking for?"
The query represents the current information need. In cross-attention for translation, for example, the query comes from the decoder's current state and effectively asks which source words are relevant to the word being generated right now.
Key (K) — "What do I contain?"
Keys are identifiers that describe what information is available at each position. Think of them as labels or indices for retrieval: each key is compared against the query to decide how relevant its position is.
Value (V) — "Here is my actual content"
Values contain the information to be retrieved when attention focuses on a position: once the attention weights are computed, it is the values that get mixed into the output.
| Component | Symbol | Source | Purpose | Dimension |
|---|---|---|---|---|
| Query | Q | Decoder / current position | What to search for | d_q |
| Key | K | Encoder / memory positions | What's available, for matching | d_k = d_q |
| Value | V | Encoder / memory positions | Information to retrieve | d_v |
The Information Retrieval Analogy:
The separation of Key and Value mirrors well-known patterns:
Library Catalog: the catalog entry (title, subject headings) is the key you search against; the book on the shelf is the value you actually take away.
Database: the indexed column is the key used for lookup; the row's contents are the value that gets returned.
Hash Table: the hash key determines which slot to look in; the object stored in that slot is the value.
Why Not Just Use One Representation?
Early attention (Bahdanau) used encoder hidden states as both keys and values. The separation offers important benefits:
Specialization: Keys optimized for matching (what makes two things related?) can differ from values optimized for content (what should be returned?)
Dimensionality Freedom: d_k and d_v can be different. Small keys → fast matching; large values → rich content.
Composability: In self-attention, the same input produces all three, but through different learned projections—enabling different "views" of the same content.
In the original Transformer, d_k = d_q = d_v = d_model/h where h is the number of attention heads. This symmetric choice simplifies implementation but isn't fundamental—many variants use different dimensions for keys versus values.
The heart of attention is computing a relevance score between each query and each key. This score determines how much attention each memory position receives.
The General Formulation:
For a query q and key k, we compute:
score(q, k) = f(q, k)
where f is a scoring function that measures compatibility. Several scoring functions have been proposed, each with distinct characteristics.
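The rest of this page focuses on the scaled dot product, but it helps to see the main alternatives side by side. Below is a minimal NumPy sketch, not a definitive implementation: the function names and the additive form's W1, W2, and v are illustrative assumptions, not from the original text.

```python
import numpy as np

def dot_product_score(q, k):
    """Unscaled dot product (Luong-style): score = q . k."""
    return q @ k

def scaled_dot_product_score(q, k):
    """Dot product divided by sqrt(d_k), as in the Transformer."""
    return (q @ k) / np.sqrt(q.shape[-1])

def additive_score(q, k, W1, W2, v):
    """Bahdanau-style additive scoring: v . tanh(W1 q + W2 k)."""
    return v @ np.tanh(W1 @ q + W2 @ k)

# Toy usage with random vectors
rng = np.random.default_rng(0)
d_k, d_hidden = 64, 32
q, k = rng.standard_normal(d_k), rng.standard_normal(d_k)
W1, W2 = rng.standard_normal((d_hidden, d_k)), rng.standard_normal((d_hidden, d_k))
v = rng.standard_normal(d_hidden)

print(dot_product_score(q, k))         # large magnitude when d_k is big
print(scaled_dot_product_score(q, k))  # same score divided by sqrt(64) = 8
print(additive_score(q, k, W1, W2, v))
```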
Why Scaling Matters:
The scaled dot-product attention divides by √d_k. This seemingly minor detail is crucial:
Consider what happens without scaling. If q and k are vectors with d_k independent components, each with mean 0 and variance 1, then:
q · k = Σᵢ q_i k_i
This sum has variance d_k (sum of d_k independent products, each with variance 1). For large d_k (e.g., 512), the dot products become very large in magnitude.
Large magnitudes cause softmax to saturate: one position gets a weight near 1, the rest get weights near 0, and the gradients flowing through the softmax become vanishingly small.
Dividing by √d_k normalizes the variance back to 1, keeping scores in a reasonable range where softmax gradients remain healthy.
Proper scaling is essential for training stability. Without it, attention tends to collapse to hard one-hot patterns early in training, making gradients extremely sparse. The √d_k scaling was a key innovation that made transformers trainable at scale.
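A quick numerical check of this argument, as a sketch (the sample size and dimensions are arbitrary choices): with unit-variance components, raw dot products have variance close to d_k, while scaled ones have variance close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 512, 10_000

# Query/key components with mean 0 and variance 1, as in the argument above
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

raw = np.sum(q * k, axis=-1)      # unscaled dot products
scaled = raw / np.sqrt(d_k)       # scaled dot products

print(f"variance of raw dot products    ~ {raw.var():.0f} (close to d_k = {d_k})")
print(f"variance of scaled dot products ~ {scaled.var():.2f} (close to 1)")
```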
Raw attention scores are not directly usable—we need to normalize them into proper weights. The standard approach uses softmax, but alternatives exist with different properties.
Softmax Normalization:
Given scores s = [s₁, s₂, ..., sₙ], softmax produces weights:
α_i = exp(s_i) / Σⱼ exp(s_j)
Properties of softmax attention weights: every weight is strictly positive, the weights sum to 1 across positions, and larger scores receive disproportionately larger weight.
The Softmax Temperature:
A generalization adds a temperature parameter τ:
α_i = exp(s_i/τ) / Σⱼ exp(s_j/τ)
Low temperature (τ → 0): the distribution approaches a hard argmax, with nearly all weight on the highest-scoring position.
High temperature (τ → ∞): the distribution approaches uniform, with all positions weighted roughly equally.
Temperature τ = 1: the standard softmax used by default.
The √d_k Scaling as Implicit Temperature:
Recall that scaled dot-product divides scores by √d_k:
softmax(scores / √d_k) = softmax(scores, τ=√d_k)
This is equivalent to softmax with temperature √d_k. For d_k = 64, that is a temperature of 8. This prevents extremely peaked distributions that would harm gradient flow.
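A small sketch of temperature-controlled softmax (softmax_with_temperature is an illustrative name, not a standard API):

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    """Softmax over scores/tau; small tau sharpens, large tau flattens."""
    z = scores / tau
    z = z - z.max()          # subtract max for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_with_temperature(scores, tau=0.1))   # nearly one-hot
print(softmax_with_temperature(scores, tau=1.0))   # standard softmax
print(softmax_with_temperature(scores, tau=8.0))   # flatter, like tau = sqrt(64)
```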
| Method | Formula | Key Property | Use Case |
|---|---|---|---|
| Softmax | exp(s)/Σexp(s) | Smooth, dense, sums to 1 | Standard, works well generally |
| Sparsemax | Euclidean projection to simplex | Truly sparse (some weights = 0) | Interpretable, discrete-like |
| Sigmoid | σ(s) for each position | Independent weights, don't sum to 1 | Multi-label / independent attention |
| Hardmax / Argmax | 1 at max, 0 elsewhere | Completely sparse | Discrete / hard attention |
Why Softmax Dominates:
Despite alternatives, softmax remains the standard for several reasons:
Gradient Flow: Softmax has well-behaved gradients. Even low-weight positions receive small but non-zero gradient signal.
Probabilistic Interpretation: Weights form a proper distribution over positions. This enables clean information-theoretic analysis.
Smooth Transitions: Changes in scores produce smooth changes in weights. No discontinuities that could destabilize training.
Matrix Compatibility: Softmax composes cleanly with matrix operations, enabling efficient implementation.
Sparsemax for Interpretability:
Sparsemax attracts interest for interpretable attention. Instead of "smooth" but dense weights, sparsemax can produce exactly zero weights:
Standard softmax: [0.52, 0.27, 0.15, 0.06]
Sparsemax: [0.60, 0.40, 0.00, 0.00]
This makes attention patterns easier to interpret—positions with zero weight truly receive no influence. However, sparsemax can be harder to train and may lose the beneficial "soft" gradient signal to all positions.
In practice, softmax is computed as: softmax(s) = exp(s - max(s)) / Σexp(s - max(s)). This shift by max(s) prevents overflow when scores are large. The subtraction doesn't change the mathematical result (the constant factor exp(-max(s)) cancels between numerator and denominator) but ensures numerical stability.
While the scoring function compares queries to keys, the learned projection matrices are where attention truly learns its task. These matrices transform raw representations into the query, key, and value spaces.
The Projection Framework:
Given input representations x and context representations z (x = z for self-attention):
Q = x · W_Q (project to query space)
K = z · W_K (project to key space)
V = z · W_V (project to value space)
Where W_Q ∈ ℝ^{d_model × d_q}, W_K ∈ ℝ^{d_model × d_k}, and W_V ∈ ℝ^{d_model × d_v} are learned weight matrices, and x and z are matrices whose rows are d_model-dimensional token representations.
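A minimal NumPy sketch of the projection step for the self-attention case (x = z); the shapes follow the definitions above, and the random initialization is only a stand-in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 64, 16, 16    # sequence length and dimensions

x = rng.standard_normal((n, d_model))   # input token representations

# Learned projection matrices (randomly initialized here purely for illustration)
W_Q = rng.standard_normal((d_model, d_k)) * 0.1
W_K = rng.standard_normal((d_model, d_k)) * 0.1
W_V = rng.standard_normal((d_model, d_v)) * 0.1

Q = x @ W_Q    # (n, d_k): the "query view" of the sequence
K = x @ W_K    # (n, d_k): the "key view"
V = x @ W_V    # (n, d_v): the "value view"
print(Q.shape, K.shape, V.shape)
```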
What Projections Learn:
The beauty of learned projections is that the network discovers useful representations for attention without explicit supervision:
W_Q (Query Projection): learns to extract from each position a representation of what that position is looking for.
W_K (Key Projection): learns how each position should advertise itself so that relevant queries can match it.
W_V (Value Projection): learns what content each position should contribute to the output when it is attended to.
W_O (Output Projection): applied after attention, learns how to mix the attended result back into the model's representation space.
In self-attention, Q, K, and V all derive from the same input sequence, but through different learned projections. This enables the network to create specialized representations: the query view (what am I looking for?), the key view (how should I be found?), and the value view (what should I contribute?). Three different learned perspectives on identical data.
Let's now present the complete, canonical formulation of scaled dot-product attention—the version used in transformers and most modern architectures.
The Transformer Attention Equation:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
Where Q ∈ ℝ^{n × d_k} holds the n query vectors, K ∈ ℝ^{m × d_k} holds the m key vectors, and V ∈ ℝ^{m × d_v} holds the corresponding value vectors.
Step-by-Step Forward Pass:
Let's trace through exactly what happens during attention computation:
Step 1: Score Computation
S = Q · K^T → S ∈ ℝ^{n × m}
Each entry S[i,j] is the dot product of query i with key j. This is a dense n × m matrix of raw compatibility scores.
Step 2: Scaling
S_scaled = S / √d_k
Divide all scores by √d_k to normalize variance. Prevents softmax saturation.
Step 3: Masking (if applicable)
S_masked = S_scaled + M (where M contains -∞ for masked positions and 0 elsewhere)
For causal attention, earlier positions are visible and future positions are masked. Masked positions receive -∞ and become 0 after softmax.
Step 4: Softmax Normalization
A = softmax(S_masked, dim=-1) → A ∈ ℝ^{n × m}
Apply softmax row-wise. Each row i of A sums to 1 and represents the attention weights for query i over all keys.
Step 5: Value Aggregation
Output = A · V → Output ∈ ℝ^{n × d_v}
Take the weighted sum of the values. Each output row is a combination of all value vectors, weighted by the attention weights.
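Putting the five steps together, a compact NumPy sketch of the forward pass (the function and argument names are illustrative; the optional mask uses the additive -∞ convention described in the next section):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v); mask: (n, m) additive (0 or -inf)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # Steps 1-2: score and scale
    if mask is not None:
        scores = scores + mask                             # Step 3: additive masking
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # Step 4: row-wise softmax
    return weights @ V, weights                            # Step 5: weighted sum of values

# Toy self-attention example (n = m = 4)
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))   # (4, 8), each row of weights sums to 1
```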
Attention has O(n·m) space and time complexity for n queries and m keys. For self-attention (n=m), this is O(n²). This quadratic scaling is the main computational limitation of transformers on long sequences, driving research into efficient attention variants (linear attention, sparse attention, etc.).
Attention masking controls which positions each query can attend to. This is crucial for tasks where certain information should not be accessible.
Common Masking Patterns:
1. Causal (Autoregressive) Masking: Position i can only attend to positions j ≤ i. Used in decoder-only models (GPT) for language modeling.
Mask (4×4) for sequence length 4:
[[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 0],
[1, 1, 1, 1]]
2. Padding Masking: Mask out padding tokens (variable-length sequences in a batch). Applied to both rows (queries) and columns (keys).
Sequence: [real, real, real, PAD, PAD]
Mask: [[1, 1, 1, 0, 0], ...] (zeros for PAD positions)
3. Bidirectional (Full) Attention: Every position can attend to every other position. Used in BERT-style encoders.
Mask: all ones (or no mask)
4. Prefix-Causal (Prefix-LM): Prefix is bidirectional; generation is causal. Used in some encoder-decoder models.
Prefix: bidirectional attention
Generated: causal attention
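The patterns above can be expressed as additive masks (0 where attention is allowed, -∞ where it is blocked) that plug directly into the attention sketch from the previous section. A sketch, with illustrative function names:

```python
import numpy as np

def causal_mask(n):
    """Additive causal mask: position i may attend only to positions j <= i."""
    allowed = np.tril(np.ones((n, n), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)

def padding_mask(lengths, max_len):
    """Additive padding mask: block key positions beyond each sequence's true length."""
    positions = np.arange(max_len)
    allowed = positions[None, :] < np.asarray(lengths)[:, None]   # (batch, max_len)
    return np.where(allowed, 0.0, -np.inf)

print(causal_mask(4))                 # matches the 4x4 lower-triangular pattern above
print(padding_mask([3], max_len=5))   # [real, real, real, PAD, PAD]
```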
Why Masking is Crucial:
Preventing Information Leakage: In autoregressive generation (like GPT), the model predicts the next token. If position i could attend to position i+1, it would see the answer it's supposed to predict—making the task trivial and useless for generation.
Handling Variable Lengths: Batching requires padding to uniform length, but PAD tokens shouldn't influence computation. Masking ensures they contribute zero to attention outputs.
Modeling Different Tasks: The same transformer architecture can be adapted via masking: a causal mask yields a GPT-style language model, a full bidirectional mask yields a BERT-style encoder, and a prefix-causal mask yields a prefix LM.
Implementation via -∞:
Masking is applied before softmax by adding -∞ to masked positions:
scores_masked = scores + (1 - mask) * (-1e9)   (mask is 1 for allowed positions, 0 for masked positions)
weights = softmax(scores_masked)
In practice a large negative constant such as -1e9 stands in for -∞. Since exp of a very large negative number is effectively 0, masked positions get (essentially) zero weight and contribute nothing to the output; softmax renormalizes the remaining weights to sum to 1.
Masks must broadcast correctly with the attention score tensor (batch, heads, queries, keys). Common errors: (1) forgetting the batch or head dimensions, (2) transposing the mask relative to the scores, (3) passing a 0/1 mask where an additive -inf mask is expected. Always verify that the mask shape matches the score tensor before applying it.
The QKV framework elegantly handles both cross-attention (queries from one source, keys/values from another) and self-attention (all from the same source) with the same computational machinery.
Cross-Attention:
Different sources for Q versus K,V:
Q = decoder_states · W_Q (from decoder)
K = encoder_states · W_K (from encoder)
V = encoder_states · W_V (from encoder)
Example Use Cases: machine translation (the decoder attends to encoder states, as in the translation example earlier) and image captioning (a text decoder attends to image features).
Self-Attention:
Same source for Q, K, V:
Q = sequence · W_Q
K = sequence · W_K
V = sequence · W_V
Example Use Cases: language modeling in decoder-only models like GPT (with causal masking) and bidirectional encoding in models like BERT, where each position builds its representation by attending to the rest of the sequence.
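Both cases run through exactly the same computation; only the source of the keys and values changes. A sketch reusing the scaled_dot_product_attention function from the forward-pass section above (the projection matrices are random stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_v = 32, 8, 8
W_Q = rng.standard_normal((d_model, d_k)) * 0.1
W_K = rng.standard_normal((d_model, d_k)) * 0.1
W_V = rng.standard_normal((d_model, d_v)) * 0.1

decoder_states = rng.standard_normal((3, d_model))   # 3 target positions
encoder_states = rng.standard_normal((5, d_model))   # 5 source positions

# Cross-attention: queries from the decoder, keys/values from the encoder
out_cross, _ = scaled_dot_product_attention(
    decoder_states @ W_Q, encoder_states @ W_K, encoder_states @ W_V)

# Self-attention: queries, keys, and values all from the same sequence
out_self, _ = scaled_dot_product_attention(
    encoder_states @ W_Q, encoder_states @ W_K, encoder_states @ W_V)

print(out_cross.shape, out_self.shape)   # (3, 8) and (5, 8)
```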
Why Self-Attention Was Revolutionary:
Before transformers, capturing long-range dependencies required sequential processing. Self-attention changed this:
RNN Processing: information travels step by step, so two positions that are far apart can only interact through a long chain of sequential state updates, and computation cannot be parallelized across time.
Self-Attention Processing: every position attends to every other position directly in a single step, and all positions are processed in parallel.
The key insight: replace recurrence (sequential temporal processing) with attention (parallel relationship computation). This enabled full parallelization across positions during training, a constant-length path between any two positions, and scaling to far larger models and datasets.
Transformer encoder-decoder uses both: (1) Self-attention in encoder for source relationships, (2) Causal self-attention in decoder for autoregressive generation, (3) Cross-attention from decoder to encoder for conditioning on source. Decoder-only models (GPT) use only causal self-attention. Encoder-only models (BERT) use only bidirectional self-attention.
We've now established the complete mathematical framework for attention. To consolidate: learned projections produce queries, keys, and values; scaled dot products measure query-key compatibility; softmax turns scores into normalized weights; masking controls which positions are visible; and the same machinery covers both self-attention and cross-attention.
What's Next:
With the QKV framework established, we'll explore two variants that differ fundamentally in their approach to attention: soft attention (the differentiable, weighted combination we've studied) and hard attention (discrete, sampling-based selection). Understanding this distinction reveals important tradeoffs in attention design.
You now understand the Query-Key-Value framework—the mathematical foundation of all modern attention mechanisms. This knowledge is essential for understanding transformers, multi-head attention, and the many attention variants we'll encounter. The scaled dot-product formula you just learned is executed trillions of times per day across the world's AI systems.