In the previous page, we built intuition for attention as dynamic, content-based retrieval. Now we formalize this intuition into precise mathematics that can be implemented and trained.
The Query-Key-Value (QKV) framework provides the mathematical vocabulary for attention. It is not just notation: it is a computational paradigm that separates "what am I looking for" (Query), "what is available to look at" (Key), and "what information to retrieve" (Value). This separation enables attention to be modular, composable, and remarkably powerful.
Nearly every modern attention variant—from Bahdanau's original 2014 formulation to the multi-head attention in GPT-4—can be expressed in this framework.
By the end of this page, you will understand: (1) The mathematical definition of queries, keys, and values, (2) How attention scores are computed and normalized, (3) The role of learned projection matrices, (4) Different scoring functions and their tradeoffs, (5) Complete forward pass computation with concrete examples, and (6) How this framework enables the transformer architecture.
The QKV framework structures attention around three distinct roles, each with a clear semantic meaning:
Query (Q) — "What am I looking for?"
The query represents the current information need. In cross-attention for translation, for example, the query comes from the decoder's current state and effectively asks which source words are relevant to the word being generated right now.
Key (K) — "What do I contain?"
Keys are identifiers that describe what information is available at each position. Think of them as labels or indices for retrieval: each key is compared against the query to decide how relevant its position is.
Value (V) — "Here is my actual content"
Values contain the information to be retrieved when attention focuses on a position: once the attention weights are computed, it is the values that get mixed into the output.
| Component | Symbol | Source | Purpose | Dimension |
|---|---|---|---|---|
| Query | Q | Decoder / current position | What to search for | d_q |
| Key | K | Encoder / memory positions | What's available, for matching | d_k = d_q |
| Value | V | Encoder / memory positions | Information to retrieve | d_v |
The Information Retrieval Analogy:
The separation of Key and Value mirrors well-known patterns:
Library Catalog: the catalog entry (title, subject headings) is the key you search against; the book on the shelf is the value you actually take away.
Database: the indexed column is the key used for lookup; the row's contents are the value that gets returned.
Hash Table: the hash key determines which slot to look in; the object stored in that slot is the value.
Why Not Just Use One Representation?
Early attention (Bahdanau) used encoder hidden states as both keys and values. The separation offers important benefits:
Specialization: Keys optimized for matching (what makes two things related?) can differ from values optimized for content (what should be returned?)
Dimensionality Freedom: d_k and d_v can be different. Small keys → fast matching; large values → rich content.
Composability: In self-attention, the same input produces all three, but through different learned projections—enabling different "views" of the same content.
In the original Transformer, d_k = d_q = d_v = d_model/h where h is the number of attention heads. This symmetric choice simplifies implementation but isn't fundamental—many variants use different dimensions for keys versus values.
The heart of attention is computing a relevance score between each query and each key. This score determines how much attention each memory position receives.
The General Formulation:
For a query q and key k, we compute:
score(q, k) = f(q, k)
where f is a scoring function that measures compatibility. Several scoring functions have been proposed, each with distinct characteristics.
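The rest of this page focuses on the scaled dot product, but it helps to see the main alternatives side by side. Below is a minimal NumPy sketch, not a definitive implementation: the function names and the additive form's W1, W2, and v are illustrative assumptions, not from the original text.

```python
import numpy as np

def dot_product_score(q, k):
    """Unscaled dot product (Luong-style): score = q . k."""
    return q @ k

def scaled_dot_product_score(q, k):
    """Dot product divided by sqrt(d_k), as in the Transformer."""
    return (q @ k) / np.sqrt(q.shape[-1])

def additive_score(q, k, W1, W2, v):
    """Bahdanau-style additive scoring: v . tanh(W1 q + W2 k)."""
    return v @ np.tanh(W1 @ q + W2 @ k)

# Toy usage with random vectors
rng = np.random.default_rng(0)
d_k, d_hidden = 64, 32
q, k = rng.standard_normal(d_k), rng.standard_normal(d_k)
W1, W2 = rng.standard_normal((d_hidden, d_k)), rng.standard_normal((d_hidden, d_k))
v = rng.standard_normal(d_hidden)

print(dot_product_score(q, k))         # large magnitude when d_k is big
print(scaled_dot_product_score(q, k))  # same score divided by sqrt(64) = 8
print(additive_score(q, k, W1, W2, v))
```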
Why Scaling Matters:
The scaled dot-product attention divides by √d_k. This seemingly minor detail is crucial:
Consider what happens without scaling. If q and k are vectors with d_k independent components, each with mean 0 and variance 1, then:
q · k = Σᵢ q_i k_i
This sum has variance d_k (sum of d_k independent products, each with variance 1). For large d_k (e.g., 512), the dot products become very large in magnitude.
Large magnitudes cause softmax to saturate: one position gets a weight near 1, the rest get weights near 0, and the gradients flowing through the softmax become vanishingly small.
Dividing by √d_k normalizes the variance back to 1, keeping scores in a reasonable range where softmax gradients remain healthy.
Proper scaling is essential for training stability. Without it, attention tends to collapse to hard one-hot patterns early in training, making gradients extremely sparse. The √d_k scaling was a key innovation that made transformers trainable at scale.
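A quick numerical check of this argument, as a sketch (the sample size and dimensions are arbitrary choices): with unit-variance components, raw dot products have variance close to d_k, while scaled ones have variance close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 512, 10_000

# Query/key components with mean 0 and variance 1, as in the argument above
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

raw = np.sum(q * k, axis=-1)      # unscaled dot products
scaled = raw / np.sqrt(d_k)       # scaled dot products

print(f"variance of raw dot products    ~ {raw.var():.0f} (close to d_k = {d_k})")
print(f"variance of scaled dot products ~ {scaled.var():.2f} (close to 1)")
```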
Raw attention scores are not directly usable—we need to normalize them into proper weights. The standard approach uses softmax, but alternatives exist with different properties.
Softmax Normalization:
Given scores s = [s₁, s₂, ..., sₙ], softmax produces weights:
α_i = exp(s_i) / Σⱼ exp(s_j)
Properties of softmax attention weights: every weight is strictly positive, the weights sum to 1 across positions, and larger scores receive disproportionately larger weight.
The Softmax Temperature:
A generalization adds a temperature parameter τ:
α_i = exp(s_i/τ) / Σⱼ exp(s_j/τ)
Low temperature (τ → 0): the distribution approaches a hard argmax, with nearly all weight on the highest-scoring position.
High temperature (τ → ∞): the distribution approaches uniform, with all positions weighted roughly equally.
Temperature τ = 1: the standard softmax used by default.
The √d_k Scaling as Implicit Temperature:
Recall that scaled dot-product divides scores by √d_k:
softmax(scores / √d_k) = softmax(scores, τ=√d_k)
This is equivalent to softmax with temperature √d_k. For d_k = 64, that is a temperature of 8. This prevents extremely peaked distributions that would harm gradient flow.
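A small sketch of temperature-controlled softmax (softmax_with_temperature is an illustrative name, not a standard API):

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    """Softmax over scores/tau; small tau sharpens, large tau flattens."""
    z = scores / tau
    z = z - z.max()          # subtract max for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_with_temperature(scores, tau=0.1))   # nearly one-hot
print(softmax_with_temperature(scores, tau=1.0))   # standard softmax
print(softmax_with_temperature(scores, tau=8.0))   # flatter, like tau = sqrt(64)
```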
| Method | Formula | Key Property | Use Case |
|---|---|---|---|
| Softmax | exp(s)/Σexp(s) | Smooth, dense, sums to 1 | Standard, works well generally |
| Sparsemax | Euclidean projection to simplex | Truly sparse (some weights = 0) | Interpretable, discrete-like |
| Sigmoid | σ(s) for each position | Independent weights, don't sum to 1 | Multi-label / independent attention |
| Hardmax / Argmax | 1 at max, 0 elsewhere | Completely sparse | Discrete / hard attention |
Why Softmax Dominates:
Despite alternatives, softmax remains the standard for several reasons:
Gradient Flow: Softmax has well-behaved gradients. Even low-weight positions receive small but non-zero gradient signal.
Probabilistic Interpretation: Weights form a proper distribution over positions. This enables clean information-theoretic analysis.
Smooth Transitions: Changes in scores produce smooth changes in weights. No discontinuities that could destabilize training.
Matrix Compatibility: Softmax composes cleanly with matrix operations, enabling efficient implementation.
Sparsemax for Interpretability:
Sparsemax attracts interest for interpretable attention. Instead of "smooth" but dense weights, sparsemax can produce exactly zero weights:
Standard softmax: [0.52, 0.27, 0.15, 0.06]
Sparsemax: [0.60, 0.40, 0.00, 0.00]
This makes attention patterns easier to interpret—positions with zero weight truly receive no influence. However, sparsemax can be harder to train and may lose the beneficial "soft" gradient signal to all positions.
In practice, softmax is computed as: softmax(s) = exp(s - max(s)) / Σexp(s - max(s)). This shift by max(s) prevents overflow when scores are large. The subtraction doesn't change the mathematical result (the constant factor exp(-max(s)) cancels between numerator and denominator) but ensures numerical stability.
While the scoring function compares queries to keys, the learned projection matrices are where attention truly learns its task. These matrices transform raw representations into the query, key, and value spaces.
The Projection Framework:
Given input representations x and context representations z (x = z for self-attention):
Q = x · W_Q (project to query space)
K = z · W_K (project to key space)
V = z · W_V (project to value space)
Where W_Q ∈ ℝ^{d_model × d_q}, W_K ∈ ℝ^{d_model × d_k}, and W_V ∈ ℝ^{d_model × d_v} are learned weight matrices, and x and z are matrices whose rows are d_model-dimensional token representations.
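A minimal NumPy sketch of the projection step for the self-attention case (x = z); the shapes follow the definitions above, and the random initialization is only a stand-in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 64, 16, 16    # sequence length and dimensions

x = rng.standard_normal((n, d_model))   # input token representations

# Learned projection matrices (randomly initialized here purely for illustration)
W_Q = rng.standard_normal((d_model, d_k)) * 0.1
W_K = rng.standard_normal((d_model, d_k)) * 0.1
W_V = rng.standard_normal((d_model, d_v)) * 0.1

Q = x @ W_Q    # (n, d_k): the "query view" of the sequence
K = x @ W_K    # (n, d_k): the "key view"
V = x @ W_V    # (n, d_v): the "value view"
print(Q.shape, K.shape, V.shape)
```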
What Projections Learn:
The beauty of learned projections is that the network discovers useful representations for attention without explicit supervision:
W_Q (Query Projection): learns to extract from each position a representation of what that position is looking for.
W_K (Key Projection): learns how each position should advertise itself so that relevant queries can match it.
W_V (Value Projection): learns what content each position should contribute to the output when it is attended to.
W_O (Output Projection): applied after attention, learns how to mix the attended result back into the model's representation space.
In self-attention, Q, K, and V all derive from the same input sequence, but through different learned projections. This enables the network to create specialized representations: the query view (what am I looking for?), the key view (how should I be found?), and the value view (what should I contribute?). Three different learned perspectives on identical data.
Let's now present the complete, canonical formulation of scaled dot-product attention—the version used in transformers and most modern architectures.
The Transformer Attention Equation:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
Where Q ∈ ℝ^{n × d_k} holds the n query vectors, K ∈ ℝ^{m × d_k} holds the m key vectors, and V ∈ ℝ^{m × d_v} holds the corresponding value vectors.
Step-by-Step Forward Pass:
Let's trace through exactly what happens during attention computation:
Step 1: Score Computation
S = Q · K^T → S ∈ ℝ^{n × m}
Each entry S[i,j] is the dot product of query i with key j. This is a dense n × m matrix of raw compatibility scores.
Step 2: Scaling
S_scaled = S / √d_k
Divide all scores by √d_k to normalize variance. Prevents softmax saturation.
Step 3: Masking (if applicable)
S_masked = S_scaled + M (where M contains -∞ for masked positions and 0 elsewhere)
For causal attention, earlier positions are visible and future positions are masked. Masked positions receive -∞ and become 0 after softmax.
Step 4: Softmax Normalization
A = softmax(S_masked, dim=-1) → A ∈ ℝ^{n × m}
Apply softmax row-wise. Each row i of A sums to 1 and represents the attention weights for query i over all keys.
Step 5: Value Aggregation
Output = A · V → Output ∈ ℝ^{n × d_v}
Take the weighted sum of the values. Each output row is a combination of all value vectors, weighted by the attention weights.
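Putting the five steps together, a compact NumPy sketch of the forward pass (the function and argument names are illustrative; the optional mask uses the additive -∞ convention described in the next section):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v); mask: (n, m) additive (0 or -inf)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # Steps 1-2: score and scale
    if mask is not None:
        scores = scores + mask                             # Step 3: additive masking
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # Step 4: row-wise softmax
    return weights @ V, weights                            # Step 5: weighted sum of values

# Toy self-attention example (n = m = 4)
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))   # (4, 8), each row of weights sums to 1
```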
Attention has O(n·m) space and time complexity for n queries and m keys. For self-attention (n=m), this is O(n²). This quadratic scaling is the main computational limitation of transformers on long sequences, driving research into efficient attention variants (linear attention, sparse attention, etc.).
Attention masking controls which positions each query can attend to. This is crucial for tasks where certain information should not be accessible.
Common Masking Patterns:
1. Causal (Autoregressive) Masking: Position i can only attend to positions j ≤ i. Used in decoder-only models (GPT) for language modeling.
Mask (4×4) for sequence length 4:
[[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 0],
[1, 1, 1, 1]]
2. Padding Masking: Mask out padding tokens (variable-length sequences in a batch). Applied to both rows (queries) and columns (keys).
Sequence: [real, real, real, PAD, PAD]
Mask: [[1, 1, 1, 0, 0], ...] (zeros for PAD positions)
3. Bidirectional (Full) Attention: Every position can attend to every other position. Used in BERT-style encoders.
Mask: all ones (or no mask)
4. Prefix-Causal (Prefix-LM): Prefix is bidirectional; generation is causal. Used in some encoder-decoder models.
Prefix: bidirectional attention
Generated: causal attention
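The patterns above can be expressed as additive masks (0 where attention is allowed, -∞ where it is blocked) that plug directly into the attention sketch from the previous section. A sketch, with illustrative function names:

```python
import numpy as np

def causal_mask(n):
    """Additive causal mask: position i may attend only to positions j <= i."""
    allowed = np.tril(np.ones((n, n), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)

def padding_mask(lengths, max_len):
    """Additive padding mask: block key positions beyond each sequence's true length."""
    positions = np.arange(max_len)
    allowed = positions[None, :] < np.asarray(lengths)[:, None]   # (batch, max_len)
    return np.where(allowed, 0.0, -np.inf)

print(causal_mask(4))                 # matches the 4x4 lower-triangular pattern above
print(padding_mask([3], max_len=5))   # [real, real, real, PAD, PAD]
```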
Why Masking is Crucial:
Preventing Information Leakage: In autoregressive generation (like GPT), the model predicts the next token. If position i could attend to position i+1, it would see the answer it's supposed to predict—making the task trivial and useless for generation.
Handling Variable Lengths: Batching requires padding to uniform length, but PAD tokens shouldn't influence computation. Masking ensures they contribute zero to attention outputs.
Modeling Different Tasks: The same transformer architecture can be adapted via masking: a causal mask yields a GPT-style language model, a full bidirectional mask yields a BERT-style encoder, and a prefix-causal mask yields a prefix LM.
Implementation via -∞:
Masking is applied before softmax by adding -∞ to masked positions:
scores_masked = scores + (1 - mask) * (-1e9)   (mask is 1 for allowed positions, 0 for masked positions)
weights = softmax(scores_masked)
In practice a large negative constant such as -1e9 stands in for -∞. Since exp of a very large negative number is effectively 0, masked positions get (essentially) zero weight and contribute nothing to the output; softmax renormalizes the remaining weights to sum to 1.
Masks must broadcast correctly with the attention score tensor (batch, heads, queries, keys). Common errors: (1) forgetting the batch or head dimensions, (2) transposing the mask relative to the scores, (3) passing a 0/1 mask where an additive -inf mask is expected. Always verify that the mask shape matches the score tensor before applying it.
The QKV framework elegantly handles both cross-attention (queries from one source, keys/values from another) and self-attention (all from the same source) with the same computational machinery.
Cross-Attention:
Different sources for Q versus K,V:
Q = decoder_states · W_Q (from decoder)
K = encoder_states · W_K (from encoder)
V = encoder_states · W_V (from encoder)
Example Use Cases: machine translation (the decoder attends to encoder states, as in the translation example earlier) and image captioning (a text decoder attends to image features).
Self-Attention:
Same source for Q, K, V:
Q = sequence · W_Q
K = sequence · W_K
V = sequence · W_V
Example Use Cases: language modeling in decoder-only models like GPT (with causal masking) and bidirectional encoding in models like BERT, where each position builds its representation by attending to the rest of the sequence.
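Both cases run through exactly the same computation; only the source of the keys and values changes. A sketch reusing the scaled_dot_product_attention function from the forward-pass section above (the projection matrices are random stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_v = 32, 8, 8
W_Q = rng.standard_normal((d_model, d_k)) * 0.1
W_K = rng.standard_normal((d_model, d_k)) * 0.1
W_V = rng.standard_normal((d_model, d_v)) * 0.1

decoder_states = rng.standard_normal((3, d_model))   # 3 target positions
encoder_states = rng.standard_normal((5, d_model))   # 5 source positions

# Cross-attention: queries from the decoder, keys/values from the encoder
out_cross, _ = scaled_dot_product_attention(
    decoder_states @ W_Q, encoder_states @ W_K, encoder_states @ W_V)

# Self-attention: queries, keys, and values all from the same sequence
out_self, _ = scaled_dot_product_attention(
    encoder_states @ W_Q, encoder_states @ W_K, encoder_states @ W_V)

print(out_cross.shape, out_self.shape)   # (3, 8) and (5, 8)
```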
Why Self-Attention Was Revolutionary:
Before transformers, capturing long-range dependencies required sequential processing. Self-attention changed this:
RNN Processing: information travels step by step, so two positions that are far apart can only interact through a long chain of sequential state updates, and computation cannot be parallelized across time.
Self-Attention Processing: every position attends to every other position directly in a single step, and all positions are processed in parallel.
The key insight: replace recurrence (sequential temporal processing) with attention (parallel relationship computation). This enabled full parallelization across positions during training, a constant-length path between any two positions, and scaling to far larger models and datasets.
Transformer encoder-decoder uses both: (1) Self-attention in encoder for source relationships, (2) Causal self-attention in decoder for autoregressive generation, (3) Cross-attention from decoder to encoder for conditioning on source. Decoder-only models (GPT) use only causal self-attention. Encoder-only models (BERT) use only bidirectional self-attention.
We've now established the complete mathematical framework for attention. To consolidate: learned projections produce queries, keys, and values; scaled dot products measure query-key compatibility; softmax turns scores into normalized weights; masking controls which positions are visible; and the same machinery covers both self-attention and cross-attention.
What's Next:
With the QKV framework established, we'll explore two variants that differ fundamentally in their approach to attention: soft attention (the differentiable, weighted combination we've studied) and hard attention (discrete, sampling-based selection). Understanding this distinction reveals important tradeoffs in attention design.
You now understand the Query-Key-Value framework—the mathematical foundation of all modern attention mechanisms. This knowledge is essential for understanding transformers, multi-head attention, and the many attention variants we'll encounter. The scaled dot-product formula you just learned is executed trillions of times per day across the world's AI systems.