What if attention didn't blend—what if it truly selected?
Soft attention's weighted combination is elegant and differentiable, but it's an approximation of selection. When you want to copy a word verbatim, point to a specific location, or focus on exactly one item, the "blur" of soft attention is a limitation.
Hard attention takes a different path: instead of computing a weighted average, it samples or argmaxes a single position. The output is the value at that position, not a blend. This enables true discrete selection—but at the cost of differentiability.
This page explores hard attention: its definition, the training challenges it introduces, techniques to make it trainable, and situations where it shines.
By the end of this page, you will understand: (1) The mathematical definition of hard attention, (2) Why discrete selection breaks backpropagation, (3) The REINFORCE algorithm for training through sampling, (4) Variance reduction and control variates, (5) The Gumbel-Softmax trick for differentiable approximation, (6) Straight-Through Estimators, and (7) When to use hard vs soft attention.
Hard attention performs discrete selection rather than continuous weighting.
Formal Definition:
Given attention scores e = [e₁, ..., eₘ] and values V = [v₁, ..., vₘ]:
Deterministic Hard Attention (Argmax):
i* = argmax_i e_i
context = v_{i*}
Stochastic Hard Attention (Sampling):
α = softmax(e) # Convert to probabilities
i ~ Categorical(α) # Sample position
context = v_i # Return sampled value
In both cases, the output is exactly one value, not a blend.
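Both variants can be sketched in a few lines of NumPy (a minimal illustration of the definitions above, not a production implementation):

```python
import numpy as np

def hard_attention(scores, values, stochastic=False, rng=None):
    """Select exactly one value: argmax (deterministic) or sample (stochastic)."""
    if stochastic:
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                          # alpha = softmax(scores)
        i = (rng or np.random.default_rng()).choice(len(scores), p=probs)
    else:
        i = int(np.argmax(scores))                    # deterministic winner
    return values[i], i

scores = np.array([1.2, 0.8, 0.5, -0.2, 0.3])
values = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.], [2., 2.]])
ctx, idx = hard_attention(scores, values)   # argmax picks position 0
```

Note that the returned context is one row of `values`, never a blend of rows.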
Why Consider Hard Attention?
1. True Copying Behavior: For tasks like neural machine translation with rare words, or text summarization with quoted phrases, we want to copy tokens exactly—not create blended representations.
2. Discrete Action Spaces: When attention is selecting among discrete choices (which tool to use, which memory slot to read), discrete selection matches the problem structure.
3. Interpretability: Hard attention produces unambiguous "I looked at position 5" decisions, easier to interpret than "I looked 0.3 at position 5, 0.25 at position 3, ..."
4. Computational Efficiency (In Theory): If you only need one value, you don't need to compute the weighted sum over all values. However, computing attention scores still requires O(m) work, so savings are modest.
5. Biological Plausibility: Human eye movements (saccades) are discrete—we look at one spot, not a weighted average of all spots. Hard attention models this more directly.
Hard attention's discrete selection is non-differentiable. The operation i* = argmax(e) has no meaningful gradient—we cannot ask "how should I change e to improve the loss?" through standard backpropagation. This is the fundamental obstacle that makes hard attention significantly more difficult to train.
Let's understand precisely why discrete selection breaks backpropagation.
The Gradient Through Argmax:
Consider i* = argmax_i e_i. What is ∂i*/∂e_j?
The argmax function is piecewise constant: small perturbations to the scores leave i* unchanged until a tie flips the winner.
Mathematically: ∂i*/∂e_j = 0 almost everywhere (and undefined at the tie points where the argmax switches).
This means we cannot use the chain rule: ∂L/∂e = ∂L/∂i* · ∂i*/∂e = ∂L/∂i* · 0 = 0.
No gradient flows back through the selection decision.
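This can be checked numerically: a finite-difference probe of argmax returns zero in every direction (a small NumPy sketch):

```python
import numpy as np

scores = np.array([1.2, 0.8, 0.5, -0.2, 0.3])
eps = 1e-4

# Finite-difference probe: how does argmax respond to nudging each score?
grads = []
for j in range(len(scores)):
    bumped = scores.copy()
    bumped[j] += eps
    grads.append((np.argmax(bumped) - np.argmax(scores)) / eps)

# argmax is piecewise constant, so every finite difference is exactly zero
```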
The Gradient Through Sampling:
Stochastic hard attention samples i ~ Categorical(softmax(e)). This involves two problems:
1. The Sampling Operation: Sampling from a categorical distribution is fundamentally stochastic. Given the same probabilities, different random seeds yield different samples. There's no deterministic function to differentiate.
2. The Indicator Function: The selected value is: context = Σ_i 𝟙[i = sampled] · v_i
The indicator function 𝟙[i = sampled] is 1 for the sampled position and 0 everywhere else—a discrete step function with no smooth gradient.
Visual Intuition:
Scores: [1.2, 0.8, 0.5, -0.2, 0.3]
↓
Softmax: [0.35, 0.24, 0.18, 0.09, 0.14]
↓
Sample: [0, 1, 0, 0, 0] ← Position 1 selected
↓
Output: v_1 (exactly)
How should we adjust the softmax logits to improve the loss? Standard backprop cannot answer this through the sampling step.
We want discrete selection (hard attention), but neural network training requires differentiable operations. The entire field of techniques we'll study—REINFORCE, Gumbel-Softmax, Straight-Through Estimators—are solutions to this fundamental tension between discrete computation and gradient-based learning.
When differentiating through discrete samples, we turn to policy gradient methods from reinforcement learning. The REINFORCE algorithm provides unbiased gradient estimates for stochastic decisions.
The Key Insight:
We cannot differentiate through the sample, but we can differentiate the probability of taking that sample. If an action leads to good outcomes, make it more probable; if bad, less probable.
Mathematical Derivation:
Let:
- a be a sampled action (here, the attended position),
- p_θ(a) be the probability of sampling a under model parameters θ,
- R(a) be the reward obtained after taking action a (e.g., negative task loss).
We want to maximize expected reward:
J(θ) = E_{a ~ p_θ}[R(a)] = Σ_a p_θ(a) · R(a)
Taking gradients:
∇_θ J = Σ_a ∇_θ p_θ(a) · R(a)
= Σ_a p_θ(a) · ∇_θ log p_θ(a) · R(a) (log-derivative trick)
= E_{a ~ p_θ}[∇_θ log p_θ(a) · R(a)] (back to expectation)
This expectation can be estimated by sampling:
∇_θ J ≈ ∇_θ log p_θ(a_sampled) · R(a_sampled)
The REINFORCE Gradient Estimator:
For hard attention, with α = softmax(e) and sampled position i:
∇_θ J ≈ ∇_θ log α_i · R
The gradient ∇_θ log α_i CAN be computed—it flows through the softmax and score computation. We're just weighting it by the observed reward.
Intuition: treat the attended position as an action. If sampling position i led to high reward, increase log α_i (make that choice more likely); if it led to low reward, decrease it. Probabilities shift toward choices that worked.
Why This Works:
REINFORCE doesn't require differentiating through the sampling operation. It only needs: (1) the ability to draw samples from p_θ, (2) the gradient ∇_θ log p_θ(a) for the sampled action (which flows through the softmax and score computation), and (3) a scalar reward for the outcome.
The sample itself is treated as fixed; we adjust probabilities based on outcomes.
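The estimator can be sketched in NumPy using the closed-form gradient of the log-softmax, ∇_e log α_i = onehot(i) − α (a minimal single-sample version):

```python
import numpy as np

def reinforce_grad(scores, reward, rng):
    """One-sample REINFORCE estimate of d E[R] / d scores."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # alpha = softmax(scores)
    i = rng.choice(len(scores), p=probs)          # sample an attention position
    grad_log_p = np.eye(len(scores))[i] - probs   # d log alpha_i / d scores
    return reward * grad_log_p, i

rng = np.random.default_rng(0)
scores = np.array([1.2, 0.8, 0.5, -0.2, 0.3])
grad, i = reinforce_grad(scores, reward=1.0, rng=rng)
```

Note the sample `i` is treated as a constant; only the probability of having chosen it receives a gradient.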
REINFORCE gradients have notoriously high variance. A single sample might give R(a) = +10 or R(a) = -10 for the same action, just due to stochasticity in other parts of the model. This variance makes training slow and unstable. Variance reduction is essential.
The high variance of REINFORCE gradients is the primary obstacle to effective hard attention training. Several techniques help reduce this variance.
1. Baseline Subtraction:
The REINFORCE gradient is:
∇_θ J ≈ ∇_θ log p_θ(a) · R(a)
We can subtract any baseline b that doesn't depend on a without changing the expected gradient:
∇_θ J ≈ ∇_θ log p_θ(a) · (R(a) - b)
Why does this reduce variance?
The variance of the product ∇_θ log p_θ(a) · R(a) scales with the magnitude of R. Centering R around its mean (using b ≈ E[R]) shrinks that magnitude without changing the expected gradient: actions better than average still receive a positive signal, worse-than-average actions a negative one.
Common Baseline Choices:
- A moving average of recent rewards (simple, no extra parameters)
- A learned value function b(x) that predicts the expected reward for the current input
- The self-critical baseline: the reward obtained by the greedy (argmax) action on the same input
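As one concrete option, a moving-average baseline can be sketched in a few lines (the momentum value here is an arbitrary illustrative choice):

```python
class MovingAverageBaseline:
    """Exponential moving average of rewards, used as b in (R - b)."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = 0.0
        self.initialized = False

    def update(self, reward):
        # First reward initializes the average; later rewards blend in slowly.
        if not self.initialized:
            self.value, self.initialized = reward, True
        else:
            self.value = self.momentum * self.value + (1 - self.momentum) * reward
        return self.value
```

During training, the advantage `reward - baseline.update(reward)` replaces the raw reward in the REINFORCE update.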
2. Multiple Sampling:
Instead of one sample per example, draw k samples and average:
∇_θ J ≈ (1/k) Σ_{j=1}^k ∇_θ log p_θ(a_j) · R(a_j)
Variance reduces by factor of k. Cost increases by factor of k.
3. Control Variates:
More sophisticated: use correlated random variables to subtract variance without changing expectation. The soft attention output can serve as a control variate:
∇_θ J ≈ ∇_θ log p_θ(a) · R(a) - c · (soft_output - E[soft_output])
where c is tuned to minimize variance.
4. Entropy Regularization:
Add entropy bonus to encourage exploration:
Loss = -E[R] - β · H(p_θ)
Higher entropy means more uniform attention probabilities, which keeps the model sampling a diverse set of positions (exploration) and prevents it from collapsing prematurely onto a single position before better options have been tried.
For most applications, combining a moving average baseline with entropy regularization provides reasonable variance reduction. Self-critical baseline is powerful for generation tasks (image captioning, summarization) where you can efficiently compute the greedy action's reward. Multiple sampling is effective but computationally expensive.
An alternative to REINFORCE is the Gumbel-Softmax (also called Concrete Distribution). This provides a differentiable approximation to categorical sampling.
The Core Idea:
Instead of sampling discretely, we sample from a continuous distribution that approximates the discrete one. The approximation becomes exact as temperature approaches zero.
The Gumbel-Max Trick:
First, recall a fundamental result. To sample from a categorical distribution with probabilities p:
i ~ Categorical(p)
is equivalent to:
g_i ~ Gumbel(0, 1) for each i
i = argmax_i (log p_i + g_i)
where Gumbel(0, 1) noise is sampled as: g = -log(-log(u)), u ~ Uniform(0, 1).
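The equivalence can be verified empirically: sampling via Gumbel-max reproduces the categorical frequencies (a NumPy sketch with a fixed seed):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
n = 200_000

u = rng.uniform(size=(n, 3))
g = -np.log(-np.log(u))                        # Gumbel(0, 1) noise
samples = np.argmax(np.log(p) + g, axis=1)     # Gumbel-max sampling

freqs = np.bincount(samples, minlength=3) / n  # empirical frequencies ~ p
```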
The Gumbel-Softmax Relaxation:
The argmax is still non-differentiable. The trick: replace argmax with softmax!
y_i = exp((log p_i + g_i) / τ) / Σ_j exp((log p_j + g_j) / τ)
where τ > 0 is a temperature parameter: as τ → 0, samples approach one-hot vectors (true discrete selection); as τ → ∞, they approach uniform; moderate τ gives smooth, fully differentiable outputs that approximate discrete samples.
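A minimal NumPy sketch of the relaxation (in a real model you would implement this in an autograd framework so gradients flow through y):

```python
import numpy as np

def gumbel_softmax(log_p, tau, rng):
    """Relaxed sample: softmax((log p + Gumbel noise) / tau)."""
    g = -np.log(-np.log(rng.uniform(size=log_p.shape)))
    z = (log_p + g) / tau
    z = z - z.max()                # numerical stability
    y = np.exp(z)
    return y / y.sum()

rng = np.random.default_rng(1)
log_p = np.log(np.array([0.5, 0.3, 0.2]))
y_warm = gumbel_softmax(log_p, tau=5.0, rng=rng)   # smooth, spread out
y_cold = gumbel_softmax(log_p, tau=0.01, rng=rng)  # typically near one-hot
```

Every output is a valid probability vector, so it can be used directly as attention weights during training.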
Temperature Annealing:
A common practice is to start with high temperature (soft, easy to optimize) and gradually decrease it (more discrete, closer to true hard attention):
# Temperature schedule: geometric interpolation from start_temp down to end_temp
start_temp = 5.0
end_temp = 0.1
for epoch in range(num_epochs):
    temp = start_temp * (end_temp / start_temp) ** (epoch / num_epochs)
    model.temperature = temp
This provides easy optimization early in training (smooth, low-variance gradients) and near-discrete behavior late in training (a small gap to true hard attention at test time).
Comparison with REINFORCE:
| Aspect | REINFORCE | Gumbel-Softmax |
|---|---|---|
| Gradient | Unbiased but high variance | Biased but low variance |
| Training | Slower, needs variance reduction | Faster, standard backprop |
| Test time | True discrete | Either discrete or relaxed |
| Hyperparameters | Baseline choice, entropy weight | Temperature schedule |
| Implementation | Custom loss, multiple concepts | Simple, drop-in replacement |
REINFORCE gives unbiased gradients but high variance. Gumbel-Softmax gives biased gradients (the relaxation doesn't perfectly match discrete sampling) but low variance. In practice, low variance often wins—Gumbel-Softmax usually trains faster and more stably. The bias becomes negligible as temperature decreases.
The Straight-Through Estimator (STE) is a simple but effective technique for training through discrete operations. It uses a discrete forward pass but a continuous backward pass.
The Core Idea:
Forward: Use discrete operation (argmax gives true one-hot)
Backward: Pretend the discrete operation was continuous (pass gradients through)
Pseudocode:
# Forward pass: discrete
y_hard = one_hot(argmax(logits))
# Backward pass: treat it as if we used softmax
# Achieved via: y_hard - softmax(logits).detach() + softmax(logits)
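The STE identity can be sketched in NumPy, with the backward pass written out by hand via the softmax Jacobian (in an autograd framework, `detach` handles the routing automatically):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

logits = np.array([1.2, 0.8, 0.5])
y_soft = softmax(logits)
y_hard = np.eye(3)[np.argmax(logits)]

# The STE identity: the value is exactly the hard one-hot, but if y_soft
# carried gradients (as in an autograd framework), y would inherit them.
y = y_hard - y_soft + y_soft      # == y_hard in value

# Manual backward pass: an upstream gradient dL/dy is routed through the
# softmax Jacobian, as if the forward pass had used softmax.
dL_dy = np.array([1.0, 0.0, 0.0])
J = np.diag(y_soft) - np.outer(y_soft, y_soft)   # softmax Jacobian
dL_dlogits = J @ dL_dy
```

The Jacobian's columns each sum to zero, reflecting that shifting all logits by a constant leaves the softmax unchanged.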
Why Does STE Work?
The STE is theoretically questionable—the gradient doesn't match the actual computation. But empirically it often works well because the softmax gradient usually points in a useful direction (especially once the logits are peaked and softmax closely tracks argmax), and a biased gradient that is roughly right is often enough for SGD to make progress.
Comparison of Techniques:
| Technique | Forward | Backward | Pros | Cons |
|---|---|---|---|---|
| REINFORCE | Discrete | Policy gradient | Unbiased | High variance |
| Gumbel-Softmax | Soft (approx discrete) | Standard | Low variance, simple | Biased, needs annealing |
| STE | Discrete | Through soft | Simple, truly discrete | Gradient mismatch |
| Gumbel + STE | Discrete | Through soft | Best of both? | Complex |
Start with Gumbel-Softmax (hard=True) for most cases—it's simple and effective. Use REINFORCE when you need strictly unbiased gradients or have reward signals rather than differentiable losses. Use STE when you want truly discrete forward passes and can tolerate gradient approximation. Often, the choice matters less than proper hyperparameter tuning.
While soft attention dominates, hard attention shines in specific scenarios where discrete selection is inherent to the task.
1. Pointer Networks:
Pointer Networks use hard attention to point to positions in the input, for tasks such as combinatorial problems (convex hull, traveling-salesman tours), sorting, and extractive summarization (selecting spans from the source).
The output IS the pointed position—soft averaging doesn't make sense.
2. Neural Turing Machines / Memory Networks:
Memory access often benefits from hard attention: reading or writing exactly one slot avoids blurring stored contents across slots and matches memory's inherently discrete addressing.
Though many implementations use soft attention for tractability.
3. Copy Mechanism in Sequence-to-Sequence:
When copying rare words (names, technical terms), soft attention produces a blend of token representations that can decode to the wrong word.
Hard attention ensures the copied token is exact.
4. Discrete Latent Variable Models:
Models with discrete hidden structure—discrete VAEs, latent parse trees, mixture or cluster assignments:
Hard attention enforces the discrete structure.
5. Interpretable Attention:
For interpretability, hard attention provides clear decisions: each output is traceable to exactly one input position—a single auditable choice rather than a distribution to summarize.
Binary decisions are easier to explain to non-experts.
6. Efficient Inference:
At inference time, even if trained with soft attention, you can replace the weighted sum with an argmax lookup—skipping the blend over all values and yielding deterministic, easily cached selections.
This is common in production: train soft, deploy hard.
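A sketch of this pattern—one attention function with a `hard` switch flipped at deployment:

```python
import numpy as np

def attend(scores, values, hard=False):
    """Soft attention at train time; switch hard=True at inference."""
    if hard:
        return values[int(np.argmax(scores))]   # pick one value, no blend
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax weights
    return w @ values                           # weighted average of values

scores = np.array([2.0, 0.1, -1.0])
values = np.array([[1., 0.], [0., 1.], [5., 5.]])
soft_out = attend(scores, values)               # blended context
hard_out = attend(scores, values, hard=True)    # exactly values[0]
```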
Despite these applications, soft attention remains dominant in practice. Transformers (BERT, GPT) use soft attention. Most state-of-the-art systems use soft attention. Hard attention is reserved for special cases where discrete selection is essential, or for research into discrete neural computation.
| Scenario | Why Hard Attention | Implementation Approach |
|---|---|---|
| Pointer/selection tasks | Output IS the selection | Pointer networks with REINFORCE or Gumbel |
| Copying mechanisms | Exact token copy needed | Copy attention with STE |
| Memory networks | Precise memory access | Often use soft despite motivation |
| Interpretability requirements | Binary decisions easier to explain | Hard inference, soft training |
| Latent discrete structure | Task has inherent discreteness | Discrete VAE techniques |
We've explored hard attention—the discrete, selection-based alternative to soft attention. Let's consolidate the key insights:
- Hard attention selects exactly one position (argmax or sampling) instead of computing a weighted blend.
- Discrete selection breaks backpropagation: argmax and sampling have zero gradient almost everywhere.
- REINFORCE gives unbiased but high-variance gradients; baselines, multiple samples, and entropy regularization tame the variance.
- Gumbel-Softmax trades a small bias for low variance and standard backprop, controlled by a temperature schedule.
- The Straight-Through Estimator keeps a truly discrete forward pass with a soft backward pass.
- Soft attention still dominates practice; hard attention is reserved for inherently discrete tasks.
What's Next:
With both soft and hard attention understood, we now turn to attention visualization—how to interpret, debug, and understand what attention mechanisms learn. This practical skill is essential for developing attention-based models and diagnosing their behavior.
You now understand hard attention and the techniques for training discrete selection mechanisms. While soft attention dominates practice, understanding hard attention deepens your grasp of attention as a design space, and the training techniques (REINFORCE, Gumbel-Softmax, STE) appear throughout deep learning whenever discrete decisions must be learned.