What if attention didn't blend—what if it truly selected?
Soft attention's weighted combination is elegant and differentiable, but it's an approximation of selection. When you want to copy a word verbatim, point to a specific location, or focus on exactly one item, the "blur" of soft attention is a limitation.
Hard attention takes a different path: instead of computing a weighted average, it samples or argmaxes a single position. The output is the value at that position, not a blend. This enables true discrete selection—but at the cost of differentiability.
This page explores hard attention: its definition, the training challenges it introduces, techniques to make it trainable, and situations where it shines.
By the end of this page, you will understand: (1) The mathematical definition of hard attention, (2) Why discrete selection breaks backpropagation, (3) The REINFORCE algorithm for training through sampling, (4) Variance reduction and control variates, (5) The Gumbel-Softmax trick for differentiable approximation, (6) Straight-Through Estimators, and (7) When to use hard vs soft attention.
Hard attention performs discrete selection rather than continuous weighting.
Formal Definition:
Given attention scores e = [e₁, ..., eₘ] and values V = [v₁, ..., vₘ]:
Deterministic Hard Attention (Argmax):
i* = argmax_i e_i
context = v_{i*}
Stochastic Hard Attention (Sampling):
α = softmax(e) # Convert to probabilities
i ~ Categorical(α) # Sample position
context = v_i # Return sampled value
In both cases, the output is exactly one value, not a blend.
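Both variants can be sketched in a few lines of NumPy (a minimal illustration of the definitions above, not a production implementation):

```python
import numpy as np

def hard_attention(scores, values, stochastic=False, rng=None):
    """Select exactly one value: argmax (deterministic) or sample (stochastic)."""
    if stochastic:
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                          # alpha = softmax(scores)
        i = (rng or np.random.default_rng()).choice(len(scores), p=probs)
    else:
        i = int(np.argmax(scores))                    # deterministic winner
    return values[i], i

scores = np.array([1.2, 0.8, 0.5, -0.2, 0.3])
values = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.], [2., 2.]])
ctx, idx = hard_attention(scores, values)   # argmax picks position 0
```

Note that the returned context is one row of `values`, never a blend of rows.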
Why Consider Hard Attention?
1. True Copying Behavior: For tasks like neural machine translation with rare words, or text summarization with quoted phrases, we want to copy tokens exactly—not create blended representations.
2. Discrete Action Spaces: When attention is selecting among discrete choices (which tool to use, which memory slot to read), discrete selection matches the problem structure.
3. Interpretability: Hard attention produces unambiguous "I looked at position 5" decisions, easier to interpret than "I looked 0.3 at position 5, 0.25 at position 3, ..."
4. Computational Efficiency (In Theory): If you only need one value, you don't need to compute the weighted sum over all values. However, computing attention scores still requires O(m) work, so savings are modest.
5. Biological Plausibility: Human eye movements (saccades) are discrete—we look at one spot, not a weighted average of all spots. Hard attention models this more directly.
Hard attention's discrete selection is non-differentiable. The operation i* = argmax(e) has no meaningful gradient—we cannot ask "how should I change e to improve the loss?" through standard backpropagation. This is the fundamental obstacle that makes hard attention significantly more difficult to train.
Let's understand precisely why discrete selection breaks backpropagation.
The Gradient Through Argmax:
Consider i* = argmax_i e_i. What is ∂i*/∂e_j?
The argmax function is piecewise constant: small perturbations to the scores leave i* unchanged until a tie flips the winner.
Mathematically: ∂i*/∂e_j = 0 almost everywhere (and undefined at the tie points where the argmax switches).
This means we cannot use the chain rule: ∂L/∂e = ∂L/∂i* · ∂i*/∂e = ∂L/∂i* · 0 = 0.
No gradient flows back through the selection decision.
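This can be checked numerically: a finite-difference probe of argmax returns zero in every direction (a small NumPy sketch):

```python
import numpy as np

scores = np.array([1.2, 0.8, 0.5, -0.2, 0.3])
eps = 1e-4

# Finite-difference probe: how does argmax respond to nudging each score?
grads = []
for j in range(len(scores)):
    bumped = scores.copy()
    bumped[j] += eps
    grads.append((np.argmax(bumped) - np.argmax(scores)) / eps)

# argmax is piecewise constant, so every finite difference is exactly zero
```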
The Gradient Through Sampling:
Stochastic hard attention samples i ~ Categorical(softmax(e)). This involves two problems:
1. The Sampling Operation: Sampling from a categorical distribution is fundamentally stochastic. Given the same probabilities, different random seeds yield different samples. There's no deterministic function to differentiate.
2. The Indicator Function: The selected value is: context = Σ_i 𝟙[i = sampled] · v_i
The indicator function 𝟙[i = sampled] is 1 for the sampled position and 0 everywhere else—a discrete step function with no smooth gradient.
Visual Intuition:
Scores: [1.2, 0.8, 0.5, -0.2, 0.3]
↓
Softmax: [0.35, 0.24, 0.18, 0.09, 0.14]
↓
Sample: [0, 1, 0, 0, 0] ← Position 1 selected
↓
Output: v_1 (exactly)
How should we adjust the softmax logits to improve the loss? Standard backprop cannot answer this through the sampling step.
We want discrete selection (hard attention), but neural network training requires differentiable operations. The entire field of techniques we'll study—REINFORCE, Gumbel-Softmax, Straight-Through Estimators—are solutions to this fundamental tension between discrete computation and gradient-based learning.
When differentiating through discrete samples, we turn to policy gradient methods from reinforcement learning. The REINFORCE algorithm provides unbiased gradient estimates for stochastic decisions.
The Key Insight:
We cannot differentiate through the sample, but we can differentiate the probability of taking that sample. If an action leads to good outcomes, make it more probable; if bad, less probable.
Mathematical Derivation:
Let:
- a be a sampled action (here, the attended position),
- p_θ(a) be the probability of sampling a under model parameters θ,
- R(a) be the reward obtained after taking action a (e.g., negative task loss).
We want to maximize expected reward:
J(θ) = E_{a ~ p_θ}[R(a)] = Σ_a p_θ(a) · R(a)
Taking gradients:
∇_θ J = Σ_a ∇_θ p_θ(a) · R(a)
= Σ_a p_θ(a) · ∇_θ log p_θ(a) · R(a) (log-derivative trick)
= E_{a ~ p_θ}[∇_θ log p_θ(a) · R(a)] (back to expectation)
This expectation can be estimated by sampling:
∇_θ J ≈ ∇_θ log p_θ(a_sampled) · R(a_sampled)
The REINFORCE Gradient Estimator:
For hard attention, with α = softmax(e) and sampled position i:
∇_θ J ≈ ∇_θ log α_i · R
The gradient ∇_θ log α_i CAN be computed—it flows through the softmax and score computation. We're just weighting it by the observed reward.
Intuition: treat the attended position as an action. If sampling position i led to high reward, increase log α_i (make that choice more likely); if it led to low reward, decrease it. Probabilities shift toward choices that worked.
Why This Works:
REINFORCE doesn't require differentiating through the sampling operation. It only needs: (1) the ability to draw samples from p_θ, (2) the gradient ∇_θ log p_θ(a) for the sampled action (which flows through the softmax and score computation), and (3) a scalar reward for the outcome.
The sample itself is treated as fixed; we adjust probabilities based on outcomes.
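The estimator can be sketched in NumPy using the closed-form gradient of the log-softmax, ∇_e log α_i = onehot(i) − α (a minimal single-sample version):

```python
import numpy as np

def reinforce_grad(scores, reward, rng):
    """One-sample REINFORCE estimate of d E[R] / d scores."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # alpha = softmax(scores)
    i = rng.choice(len(scores), p=probs)          # sample an attention position
    grad_log_p = np.eye(len(scores))[i] - probs   # d log alpha_i / d scores
    return reward * grad_log_p, i

rng = np.random.default_rng(0)
scores = np.array([1.2, 0.8, 0.5, -0.2, 0.3])
grad, i = reinforce_grad(scores, reward=1.0, rng=rng)
```

Note the sample `i` is treated as a constant; only the probability of having chosen it receives a gradient.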
REINFORCE gradients have notoriously high variance. A single sample might give R(a) = +10 or R(a) = -10 for the same action, just due to stochasticity in other parts of the model. This variance makes training slow and unstable. Variance reduction is essential.
The high variance of REINFORCE gradients is the primary obstacle to effective hard attention training. Several techniques help reduce this variance.
1. Baseline Subtraction:
The REINFORCE gradient is:
∇_θ J ≈ ∇_θ log p_θ(a) · R(a)
We can subtract any baseline b that doesn't depend on a without changing the expected gradient:
∇_θ J ≈ ∇_θ log p_θ(a) · (R(a) - b)
Why does this reduce variance?
The variance of the product ∇_θ log p_θ(a) · R(a) scales with the magnitude of R. Centering R around its mean (using b ≈ E[R]) shrinks that magnitude without changing the expected gradient: actions better than average still receive a positive signal, worse-than-average actions a negative one.
Common Baseline Choices:
- A moving average of recent rewards (simple, no extra parameters)
- A learned value function b(x) that predicts the expected reward for the current input
- The self-critical baseline: the reward obtained by the greedy (argmax) action on the same input
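As one concrete option, a moving-average baseline can be sketched in a few lines (the momentum value here is an arbitrary illustrative choice):

```python
class MovingAverageBaseline:
    """Exponential moving average of rewards, used as b in (R - b)."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = 0.0
        self.initialized = False

    def update(self, reward):
        # First reward initializes the average; later rewards blend in slowly.
        if not self.initialized:
            self.value, self.initialized = reward, True
        else:
            self.value = self.momentum * self.value + (1 - self.momentum) * reward
        return self.value
```

During training, the advantage `reward - baseline.update(reward)` replaces the raw reward in the REINFORCE update.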
2. Multiple Sampling:
Instead of one sample per example, draw k samples and average:
∇_θ J ≈ (1/k) Σ_{j=1}^k ∇_θ log p_θ(a_j) · R(a_j)
Variance reduces by factor of k. Cost increases by factor of k.
3. Control Variates:
More sophisticated: use correlated random variables to subtract variance without changing expectation. The soft attention output can serve as a control variate:
∇_θ J ≈ ∇_θ log p_θ(a) · R(a) - c · (soft_output - E[soft_output])
where c is tuned to minimize variance.
4. Entropy Regularization:
Add entropy bonus to encourage exploration:
Loss = -E[R] - β · H(p_θ)
Higher entropy means more uniform attention probabilities, which keeps the model sampling a diverse set of positions (exploration) and prevents it from collapsing prematurely onto a single position before better options have been tried.
For most applications, combining a moving average baseline with entropy regularization provides reasonable variance reduction. Self-critical baseline is powerful for generation tasks (image captioning, summarization) where you can efficiently compute the greedy action's reward. Multiple sampling is effective but computationally expensive.
An alternative to REINFORCE is the Gumbel-Softmax (also called Concrete Distribution). This provides a differentiable approximation to categorical sampling.
The Core Idea:
Instead of sampling discretely, we sample from a continuous distribution that approximates the discrete one. The approximation becomes exact as temperature approaches zero.
The Gumbel-Max Trick:
First, recall a fundamental result. To sample from a categorical distribution with probabilities p:
i ~ Categorical(p)
is equivalent to:
g_i ~ Gumbel(0, 1) for each i
i = argmax_i (log p_i + g_i)
where Gumbel(0, 1) noise is sampled as: g = -log(-log(u)), u ~ Uniform(0, 1).
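The equivalence can be verified empirically: sampling via Gumbel-max reproduces the categorical frequencies (a NumPy sketch with a fixed seed):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
n = 200_000

u = rng.uniform(size=(n, 3))
g = -np.log(-np.log(u))                        # Gumbel(0, 1) noise
samples = np.argmax(np.log(p) + g, axis=1)     # Gumbel-max sampling

freqs = np.bincount(samples, minlength=3) / n  # empirical frequencies ~ p
```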
The Gumbel-Softmax Relaxation:
The argmax is still non-differentiable. The trick: replace argmax with softmax!
y_i = exp((log p_i + g_i) / τ) / Σ_j exp((log p_j + g_j) / τ)
where τ > 0 is a temperature parameter: as τ → 0, samples approach one-hot vectors (true discrete selection); as τ → ∞, they approach uniform; moderate τ gives smooth, fully differentiable outputs that approximate discrete samples.
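A minimal NumPy sketch of the relaxation (in a real model you would implement this in an autograd framework so gradients flow through y):

```python
import numpy as np

def gumbel_softmax(log_p, tau, rng):
    """Relaxed sample: softmax((log p + Gumbel noise) / tau)."""
    g = -np.log(-np.log(rng.uniform(size=log_p.shape)))
    z = (log_p + g) / tau
    z = z - z.max()                # numerical stability
    y = np.exp(z)
    return y / y.sum()

rng = np.random.default_rng(1)
log_p = np.log(np.array([0.5, 0.3, 0.2]))
y_warm = gumbel_softmax(log_p, tau=5.0, rng=rng)   # smooth, spread out
y_cold = gumbel_softmax(log_p, tau=0.01, rng=rng)  # typically near one-hot
```

Every output is a valid probability vector, so it can be used directly as attention weights during training.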
Temperature Annealing:
A common practice is to start with high temperature (soft, easy to optimize) and gradually decrease it (more discrete, closer to true hard attention):
# Temperature schedule: geometric interpolation from start_temp down to end_temp
start_temp = 5.0
end_temp = 0.1
for epoch in range(num_epochs):
    temp = start_temp * (end_temp / start_temp) ** (epoch / num_epochs)
    model.temperature = temp
This provides easy optimization early in training (smooth, low-variance gradients) and near-discrete behavior late in training (a small gap to true hard attention at test time).
Comparison with REINFORCE:
| Aspect | REINFORCE | Gumbel-Softmax |
|---|---|---|
| Gradient | Unbiased but high variance | Biased but low variance |
| Training | Slower, needs variance reduction | Faster, standard backprop |
| Test time | True discrete | Either discrete or relaxed |
| Hyperparameters | Baseline choice, entropy weight | Temperature schedule |
| Implementation | Custom loss, multiple concepts | Simple, drop-in replacement |
REINFORCE gives unbiased gradients but high variance. Gumbel-Softmax gives biased gradients (the relaxation doesn't perfectly match discrete sampling) but low variance. In practice, low variance often wins—Gumbel-Softmax usually trains faster and more stably. The bias becomes negligible as temperature decreases.
The Straight-Through Estimator (STE) is a simple but effective technique for training through discrete operations. It uses a discrete forward pass but a continuous backward pass.
The Core Idea:
Forward: Use discrete operation (argmax gives true one-hot)
Backward: Pretend the discrete operation was continuous (pass gradients through)
Pseudocode:
# Forward pass: discrete
y_hard = one_hot(argmax(logits))
# Backward pass: treat it as if we used softmax
# Achieved via: y_hard - softmax(logits).detach() + softmax(logits)
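The STE identity can be sketched in NumPy, with the backward pass written out by hand via the softmax Jacobian (in an autograd framework, `detach` handles the routing automatically):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

logits = np.array([1.2, 0.8, 0.5])
y_soft = softmax(logits)
y_hard = np.eye(3)[np.argmax(logits)]

# The STE identity: the value is exactly the hard one-hot, but if y_soft
# carried gradients (as in an autograd framework), y would inherit them.
y = y_hard - y_soft + y_soft      # == y_hard in value

# Manual backward pass: an upstream gradient dL/dy is routed through the
# softmax Jacobian, as if the forward pass had used softmax.
dL_dy = np.array([1.0, 0.0, 0.0])
J = np.diag(y_soft) - np.outer(y_soft, y_soft)   # softmax Jacobian
dL_dlogits = J @ dL_dy
```

The Jacobian's columns each sum to zero, reflecting that shifting all logits by a constant leaves the softmax unchanged.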
Why Does STE Work?
The STE is theoretically questionable—the gradient doesn't match the actual computation. But empirically it often works well because the softmax gradient usually points in a useful direction (especially once the logits are peaked and softmax closely tracks argmax), and a biased gradient that is roughly right is often enough for SGD to make progress.
Comparison of Techniques:
| Technique | Forward | Backward | Pros | Cons |
|---|---|---|---|---|
| REINFORCE | Discrete | Policy gradient | Unbiased | High variance |
| Gumbel-Softmax | Soft (approx discrete) | Standard | Low variance, simple | Biased, needs annealing |
| STE | Discrete | Through soft | Simple, truly discrete | Gradient mismatch |
| Gumbel + STE | Discrete | Through soft | Best of both? | Complex |
Start with Gumbel-Softmax (hard=True) for most cases—it's simple and effective. Use REINFORCE when you need strictly unbiased gradients or have reward signals rather than differentiable losses. Use STE when you want truly discrete forward passes and can tolerate gradient approximation. Often, the choice matters less than proper hyperparameter tuning.
While soft attention dominates, hard attention shines in specific scenarios where discrete selection is inherent to the task.
1. Pointer Networks:
Pointer Networks use hard attention to point to positions in the input, for tasks such as combinatorial problems (convex hull, traveling-salesman tours), sorting, and extractive summarization (selecting spans from the source).
The output IS the pointed position—soft averaging doesn't make sense.
2. Neural Turing Machines / Memory Networks:
Memory access often benefits from hard attention: reading or writing exactly one slot avoids blurring stored contents across slots and matches memory's inherently discrete addressing.
Though many implementations use soft attention for tractability.
3. Copy Mechanism in Sequence-to-Sequence:
When copying rare words (names, technical terms), soft attention produces a blend of token representations that can decode to the wrong word.
Hard attention ensures the copied token is exact.
4. Discrete Latent Variable Models:
Models with discrete hidden structure—discrete VAEs, latent parse trees, mixture or cluster assignments:
Hard attention enforces the discrete structure.
5. Interpretable Attention:
For interpretability, hard attention provides clear decisions: each output is traceable to exactly one input position—a single auditable choice rather than a distribution to summarize.
Binary decisions are easier to explain to non-experts.
6. Efficient Inference:
At inference time, even if trained with soft attention, you can replace the weighted sum with an argmax lookup—skipping the blend over all values and yielding deterministic, easily cached selections.
This is common in production: train soft, deploy hard.
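A sketch of this pattern—one attention function with a `hard` switch flipped at deployment:

```python
import numpy as np

def attend(scores, values, hard=False):
    """Soft attention at train time; switch hard=True at inference."""
    if hard:
        return values[int(np.argmax(scores))]   # pick one value, no blend
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax weights
    return w @ values                           # weighted average of values

scores = np.array([2.0, 0.1, -1.0])
values = np.array([[1., 0.], [0., 1.], [5., 5.]])
soft_out = attend(scores, values)               # blended context
hard_out = attend(scores, values, hard=True)    # exactly values[0]
```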
Despite these applications, soft attention remains dominant in practice. Transformers (BERT, GPT) use soft attention. Most state-of-the-art systems use soft attention. Hard attention is reserved for special cases where discrete selection is essential, or for research into discrete neural computation.
| Scenario | Why Hard Attention | Implementation Approach |
|---|---|---|
| Pointer/selection tasks | Output IS the selection | Pointer networks with REINFORCE or Gumbel |
| Copying mechanisms | Exact token copy needed | Copy attention with STE |
| Memory networks | Precise memory access | Often use soft despite motivation |
| Interpretability requirements | Binary decisions easier to explain | Hard inference, soft training |
| Latent discrete structure | Task has inherent discreteness | Discrete VAE techniques |
We've explored hard attention—the discrete, selection-based alternative to soft attention. Let's consolidate the key insights:
- Hard attention selects exactly one position (argmax or sampling) instead of computing a weighted blend.
- Discrete selection breaks backpropagation: argmax and sampling have zero gradient almost everywhere.
- REINFORCE gives unbiased but high-variance gradients; baselines, multiple samples, and entropy regularization tame the variance.
- Gumbel-Softmax trades a small bias for low variance and standard backprop, controlled by a temperature schedule.
- The Straight-Through Estimator keeps a truly discrete forward pass with a soft backward pass.
- Soft attention still dominates practice; hard attention is reserved for inherently discrete tasks.
What's Next:
With both soft and hard attention understood, we now turn to attention visualization—how to interpret, debug, and understand what attention mechanisms learn. This practical skill is essential for developing attention-based models and diagnosing their behavior.
You now understand hard attention and the techniques for training discrete selection mechanisms. While soft attention dominates practice, understanding hard attention deepens your grasp of attention as a design space, and the training techniques (REINFORCE, Gumbel-Softmax, STE) appear throughout deep learning whenever discrete decisions must be learned.