In the previous pages, we identified the information bottleneck as the fundamental limitation of basic seq2seq models: compressing an entire source sequence into a single fixed-size vector inevitably loses information, especially for long sequences.
Attention mechanisms solve this problem elegantly. Instead of forcing the encoder to compress everything into one vector, attention allows the decoder to dynamically access any part of the encoded sequence at each generation step. The decoder learns to focus on relevant source positions while generating each output token.
This seemingly simple idea—letting the decoder look back at the encoder outputs—revolutionized neural sequence modeling and laid the groundwork for the Transformer architecture that now dominates deep learning.
This page previews attention in the RNN context: the core attention mechanism, additive versus multiplicative attention, how attention integrates with seq2seq models, and the properties that make attention so powerful. Chapter 35 will provide a comprehensive treatment of attention and Transformers.
Before diving into the mechanism, let's understand why attention is necessary through a concrete example.
Machine Translation Example
Consider translating: "The black cat sat on the mat" → "Le chat noir était assis sur le tapis"
When generating the French word "chat" (cat), the decoder needs to focus on the English word "cat". When generating "noir" (black), it needs "black". When generating "tapis" (mat), it needs "mat".
Without attention, the decoder must recover each of these correspondences from a single fixed-length vector. With attention, it can focus directly on the relevant source word at every generation step.
The attention weights α form a probability distribution over source positions—this is 'soft' attention, a differentiable weighted average. 'Hard' attention would discretely select one position, which is not differentiable and requires reinforcement learning. We focus on soft attention.
The attention mechanism computes a context vector as a weighted sum of encoder hidden states, where weights depend on the current decoder state.
Given:
- Encoder hidden states $\mathbf{h}_1^{\text{enc}}, \ldots, \mathbf{h}_{T_x}^{\text{enc}}$, one per source position
- The current decoder state $\mathbf{s}_t$
Step 1: Compute Alignment Scores
For each source position $i$, compute how well it aligns with the current decoder state:
$$e_{ti} = \text{score}(\mathbf{s}_t, \mathbf{h}_i^{\text{enc}})$$
Step 2: Normalize to Attention Weights
Apply softmax to get a probability distribution:
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{T_x} \exp(e_{tj})}$$
Step 3: Compute Context Vector
Weighted sum of encoder states:
$$\mathbf{c}_t = \sum_{i=1}^{T_x} \alpha_{ti} \mathbf{h}_i^{\text{enc}}$$
Step 4: Use Context in Decoder
Combine context with decoder hidden state for output:
$$\tilde{\mathbf{s}}_t = \tanh(\mathbf{W}_c[\mathbf{c}_t; \mathbf{s}_t])$$

$$P(y_t | y_{<t}, \mathbf{x}) = \text{softmax}(\mathbf{W}_o \tilde{\mathbf{s}}_t)$$
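The four steps can be sketched in a few lines of PyTorch, using the simple dot-product score for brevity (all dimensions here are illustrative, not from a real model):

```python
import torch

torch.manual_seed(0)

T_x, d = 5, 8                          # source length, hidden size (illustrative)
h_enc = torch.randn(T_x, d)            # encoder hidden states h_i
s_t = torch.randn(d)                   # current decoder state s_t

# Step 1: alignment scores (dot-product score for simplicity)
e_t = h_enc @ s_t                      # [T_x]

# Step 2: softmax normalizes scores into attention weights
alpha_t = torch.softmax(e_t, dim=-1)   # [T_x], sums to 1

# Step 3: context vector = weighted sum of encoder states
c_t = alpha_t @ h_enc                  # [d]

# Step 4 would concatenate c_t with s_t and project to the vocabulary.
```

Note that every operation here is differentiable, which is what allows attention to be trained end-to-end with the rest of the network.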
Key Properties
| Property | Implication |
|---|---|
| Weights sum to 1 | $\sum_i \alpha_{ti} = 1$ (valid probability distribution) |
| Differentiable | End-to-end training with backpropagation |
| Position-specific | Different context for each decoder step |
| Soft selection | Smooth combination, not discrete choice |
| Interpretable | Weights show which source positions are attended |
The Score Function
The score function $\text{score}(\mathbf{s}, \mathbf{h})$ determines how alignment is computed. Different choices yield different attention variants (next section).
Several score functions have been proposed, each with different computational and modeling properties.
Additive Attention (Bahdanau)
Also called "concat" attention. Uses a feedforward network:
$$e_{ti} = \mathbf{v}^\top \tanh(\mathbf{W}_s \mathbf{s}_t + \mathbf{W}_h \mathbf{h}_i^{\text{enc}})$$
where:
- $\mathbf{W}_s$ and $\mathbf{W}_h$ are learned projection matrices for the decoder and encoder states
- $\mathbf{v}$ is a learned vector that reduces the combined representation to a scalar score
Multiplicative Attention (Luong)
Also called "dot-product" attention. Direct dot product:
$$e_{ti} = \mathbf{s}_t^\top \mathbf{h}_i^{\text{enc}}$$
or with learned transformation:
$$e_{ti} = \mathbf{s}_t^\top \mathbf{W}_a \mathbf{h}_i^{\text{enc}}$$
Scaled Dot-Product Attention
Divides by square root of dimension to prevent large logits:
$$e_{ti} = \frac{\mathbf{s}_t^\top \mathbf{h}_i^{\text{enc}}}{\sqrt{d}}$$
This is the variant used in Transformers.
| Variant | Formula | Parameters | Complexity |
|---|---|---|---|
| Additive | $\mathbf{v}^\top \tanh(\mathbf{W}_s \mathbf{s} + \mathbf{W}_h \mathbf{h})$ | $d_a(d_s + d_h) + d_a$ | Slower (non-parallelizable) |
| Dot-Product | $\mathbf{s}^\top \mathbf{h}$ | 0 (requires $d_s = d_h$) | Fast (matrix multiplication) |
| General | $\mathbf{s}^\top \mathbf{W}_a \mathbf{h}$ | $d_s \times d_h$ | Fast (one matrix) |
| Scaled Dot-Product | $\frac{\mathbf{s}^\top \mathbf{h}}{\sqrt{d}}$ | 0 | Fast + numerically stable |
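A quick numerical check of why the $\sqrt{d}$ scaling matters: a dot product of $d$-dimensional vectors with independent unit-variance components has standard deviation about $\sqrt{d}$, so unscaled logits grow with dimension and push the softmax toward saturation. This sketch (with arbitrary dimensions) estimates both scales empirically:

```python
import torch

torch.manual_seed(0)

# Raw dot-product logits have std ~ sqrt(d); dividing by sqrt(d)
# keeps their scale roughly constant as dimension grows.
for d in (16, 256):
    q = torch.randn(10_000, d)
    k = torch.randn(10_000, d)
    raw = (q * k).sum(dim=-1)          # 10,000 sample dot products
    scaled = raw / d ** 0.5
    print(f"d={d}: raw std {raw.std():.1f}, scaled std {scaled.std():.2f}")
```

The raw standard deviation grows like $\sqrt{d}$ while the scaled one stays near 1 regardless of dimension.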
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class AdditiveAttention(nn.Module):
    """
    Bahdanau-style additive attention.
    Uses a feedforward network to compute alignment scores.
    """

    def __init__(
        self,
        encoder_dim: int,
        decoder_dim: int,
        attention_dim: int
    ):
        super().__init__()
        self.encoder_proj = nn.Linear(encoder_dim, attention_dim, bias=False)
        self.decoder_proj = nn.Linear(decoder_dim, attention_dim, bias=False)
        self.v = nn.Linear(attention_dim, 1, bias=False)

    def forward(
        self,
        encoder_outputs: torch.Tensor,
        decoder_hidden: torch.Tensor,
        mask: torch.Tensor = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            encoder_outputs: [batch, src_len, encoder_dim]
            decoder_hidden: [batch, decoder_dim]
            mask: [batch, src_len] - True for valid positions

        Returns:
            context: [batch, encoder_dim]
            attention_weights: [batch, src_len]
        """
        src_len = encoder_outputs.size(1)

        # Project encoder and decoder states
        encoder_proj = self.encoder_proj(encoder_outputs)  # [batch, src_len, attn_dim]
        decoder_proj = self.decoder_proj(decoder_hidden)   # [batch, attn_dim]

        # Expand decoder projection to match source length
        decoder_proj = decoder_proj.unsqueeze(1).expand(-1, src_len, -1)

        # Compute scores
        energy = torch.tanh(encoder_proj + decoder_proj)  # [batch, src_len, attn_dim]
        scores = self.v(energy).squeeze(-1)               # [batch, src_len]

        # Mask padding positions
        if mask is not None:
            scores = scores.masked_fill(~mask, float('-inf'))

        # Normalize to attention weights
        attention_weights = F.softmax(scores, dim=-1)  # [batch, src_len]

        # Compute context vector
        context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)
        context = context.squeeze(1)  # [batch, encoder_dim]

        return context, attention_weights


class DotProductAttention(nn.Module):
    """
    Luong-style dot-product attention.
    Fast and efficient, requires matching dimensions.
    """

    def __init__(self, scaled: bool = True):
        super().__init__()
        self.scaled = scaled

    def forward(
        self,
        query: torch.Tensor,
        keys: torch.Tensor,
        values: torch.Tensor,
        mask: torch.Tensor = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            query: [batch, query_dim]
            keys: [batch, src_len, key_dim]
            values: [batch, src_len, value_dim]
            mask: [batch, src_len]

        Returns:
            context: [batch, value_dim]
            attention_weights: [batch, src_len]
        """
        # Compute dot product: query @ keys^T
        # query: [batch, 1, query_dim], keys: [batch, src_len, key_dim]
        scores = torch.bmm(query.unsqueeze(1), keys.transpose(1, 2))
        scores = scores.squeeze(1)  # [batch, src_len]

        # Scale by sqrt(d) for numerical stability
        if self.scaled:
            d = query.size(-1)
            scores = scores / math.sqrt(d)

        # Mask and normalize
        if mask is not None:
            scores = scores.masked_fill(~mask, float('-inf'))
        attention_weights = F.softmax(scores, dim=-1)

        # Compute context
        context = torch.bmm(attention_weights.unsqueeze(1), values)
        context = context.squeeze(1)

        return context, attention_weights


class GeneralAttention(nn.Module):
    """
    General attention with learned transformation.
    Allows different encoder/decoder dimensions.
    """

    def __init__(
        self,
        encoder_dim: int,
        decoder_dim: int,
        scaled: bool = True
    ):
        super().__init__()
        self.W = nn.Linear(encoder_dim, decoder_dim, bias=False)
        self.scaled = scaled

    def forward(
        self,
        decoder_hidden: torch.Tensor,
        encoder_outputs: torch.Tensor,
        mask: torch.Tensor = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            decoder_hidden: [batch, decoder_dim]
            encoder_outputs: [batch, src_len, encoder_dim]
            mask: [batch, src_len]

        Returns:
            context: [batch, encoder_dim]
            attention_weights: [batch, src_len]
        """
        # Transform encoder outputs
        transformed = self.W(encoder_outputs)  # [batch, src_len, decoder_dim]

        # Compute scores: decoder @ transformed^T
        scores = torch.bmm(
            decoder_hidden.unsqueeze(1),  # [batch, 1, decoder_dim]
            transformed.transpose(1, 2)   # [batch, decoder_dim, src_len]
        ).squeeze(1)  # [batch, src_len]

        if self.scaled:
            scores = scores / math.sqrt(decoder_hidden.size(-1))

        if mask is not None:
            scores = scores.masked_fill(~mask, float('-inf'))

        attention_weights = F.softmax(scores, dim=-1)
        context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)
        context = context.squeeze(1)

        return context, attention_weights
```

Let's see how attention integrates into the full seq2seq architecture we developed earlier.
Bahdanau Attention (Input-Feeding)
In the original Bahdanau attention, the context is computed from the previous decoder state and fed as additional input to the current decoder step:
Luong Attention
In Luong attention, context is computed from the current decoder state:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionDecoder(nn.Module):
    """
    LSTM decoder with Luong-style attention.
    Computes context from current hidden state after each LSTM step.
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        encoder_dim: int,
        decoder_dim: int,
        attention_dim: int,
        num_layers: int = 1,
        dropout: float = 0.2
    ):
        super().__init__()
        self.vocab_size = vocab_size
        self.decoder_dim = decoder_dim

        self.embedding = nn.Embedding(vocab_size, embed_dim)

        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=decoder_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Attention mechanism
        self.attention = AdditiveAttention(encoder_dim, decoder_dim, attention_dim)

        # Combine context + hidden for output
        self.context_combine = nn.Linear(encoder_dim + decoder_dim, decoder_dim)

        # Output projection
        self.fc_out = nn.Linear(decoder_dim, vocab_size)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        input_token: torch.Tensor,
        hidden: tuple[torch.Tensor, torch.Tensor],
        encoder_outputs: torch.Tensor,
        src_mask: torch.Tensor = None
    ) -> tuple[torch.Tensor, tuple, torch.Tensor]:
        """
        Single decoder step with attention.

        Args:
            input_token: [batch, 1] - current input token
            hidden: (h, c) - LSTM states
            encoder_outputs: [batch, src_len, encoder_dim]
            src_mask: [batch, src_len] - True for valid positions

        Returns:
            output: [batch, vocab_size] - token probabilities
            hidden: Updated (h, c)
            attention_weights: [batch, src_len]
        """
        # Embed input
        embedded = self.dropout(self.embedding(input_token))  # [batch, 1, embed]

        # LSTM step
        rnn_output, hidden = self.lstm(embedded, hidden)
        rnn_output = rnn_output.squeeze(1)  # [batch, decoder_dim]

        # Attention over encoder outputs
        context, attention_weights = self.attention(
            encoder_outputs, rnn_output, src_mask
        )  # context: [batch, encoder_dim]

        # Combine context and RNN output
        combined = torch.cat([context, rnn_output], dim=-1)
        combined = torch.tanh(self.context_combine(combined))
        combined = self.dropout(combined)

        # Project to vocabulary
        output = self.fc_out(combined)  # [batch, vocab_size]

        return output, hidden, attention_weights


class AttentionSeq2Seq(nn.Module):
    """
    Complete seq2seq model with attention.
    """

    def __init__(
        self,
        encoder: nn.Module,
        decoder: AttentionDecoder,
        device: torch.device
    ):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(
        self,
        src: torch.Tensor,
        src_lengths: torch.Tensor,
        trg: torch.Tensor,
        teacher_forcing_ratio: float = 0.5
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Training forward with attention.

        Args:
            src: [batch, src_len]
            src_lengths: [batch]
            trg: [batch, trg_len]
            teacher_forcing_ratio: probability of teacher forcing

        Returns:
            outputs: [batch, trg_len-1, vocab_size]
            attentions: [batch, trg_len-1, src_len]
        """
        batch_size = src.size(0)
        trg_len = trg.size(1)
        src_len = src.size(1)
        vocab_size = self.decoder.vocab_size

        # Storage for outputs and attention weights
        outputs = torch.zeros(batch_size, trg_len - 1, vocab_size).to(self.device)
        attentions = torch.zeros(batch_size, trg_len - 1, src_len).to(self.device)

        # Encode source
        encoder_outputs, hidden = self.encoder(src, src_lengths)

        # Create source mask
        src_mask = torch.arange(src_len, device=self.device)[None, :] < src_lengths[:, None]

        # First input is <sos>
        decoder_input = trg[:, 0:1]

        for t in range(1, trg_len):
            # Decode with attention
            output, hidden, attn_weights = self.decoder(
                decoder_input, hidden, encoder_outputs, src_mask
            )

            outputs[:, t-1] = output
            attentions[:, t-1] = attn_weights

            # Next input
            use_tf = torch.rand(1).item() < teacher_forcing_ratio
            decoder_input = trg[:, t:t+1] if use_tf else output.argmax(-1, keepdim=True)

        return outputs, attentions

    def translate(
        self,
        src: torch.Tensor,
        src_lengths: torch.Tensor,
        max_length: int = 50,
        sos_idx: int = 2,
        eos_idx: int = 3
    ) -> tuple[list[int], torch.Tensor]:
        """
        Greedy translation with attention visualization.

        Returns:
            tokens: List of generated token indices
            attention_matrix: [generated_len, src_len]
        """
        self.eval()

        with torch.no_grad():
            encoder_outputs, hidden = self.encoder(src, src_lengths)

            src_mask = torch.arange(
                src.size(1), device=self.device
            )[None, :] < src_lengths[:, None]

            decoder_input = torch.tensor([[sos_idx]], device=self.device)
            tokens = []
            attentions_list = []

            for _ in range(max_length):
                output, hidden, attn = self.decoder(
                    decoder_input, hidden, encoder_outputs, src_mask
                )

                pred_token = output.argmax(dim=-1).item()
                attentions_list.append(attn.squeeze(0))

                if pred_token == eos_idx:
                    break

                tokens.append(pred_token)
                decoder_input = torch.tensor([[pred_token]], device=self.device)

        attention_matrix = torch.stack(attentions_list, dim=0)
        return tokens, attention_matrix
```

One of attention's most valuable properties is interpretability. The attention weights reveal which source positions influenced each output position.
Attention Heatmap
For a translation from English to French, the attention matrix shows alignment:
| | The | black | cat | sat | on | the | mat |
|---|---|---|---|---|---|---|---|
| Le | 0.60 | 0.05 | 0.15 | 0.05 | 0.05 | 0.05 | 0.05 |
| chat | 0.05 | 0.05 | 0.80 | 0.03 | 0.02 | 0.02 | 0.03 |
| noir | 0.05 | 0.80 | 0.05 | 0.03 | 0.02 | 0.02 | 0.03 |
| était | 0.05 | 0.05 | 0.10 | 0.60 | 0.10 | 0.05 | 0.05 |
| assis | 0.05 | 0.05 | 0.05 | 0.75 | 0.05 | 0.03 | 0.02 |
| sur | 0.03 | 0.03 | 0.03 | 0.03 | 0.75 | 0.10 | 0.03 |
| le | 0.03 | 0.03 | 0.03 | 0.03 | 0.05 | 0.75 | 0.08 |
| tapis | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.08 | 0.77 |
The diagonal pattern shows the model learning approximate word alignment, with some deviations for reordering (e.g., "black cat" → "chat noir").
While attention weights are interpretable, they should be viewed cautiously. High attention doesn't necessarily mean 'the model used this for prediction'—it means 'this contributed to the context vector.' The relationship between attention and model behavior is complex.
```python
import torch
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


def visualize_attention(
    source_tokens: list[str],
    target_tokens: list[str],
    attention_matrix: torch.Tensor,
    save_path: str = None
):
    """
    Create attention heatmap visualization.

    Args:
        source_tokens: List of source tokens
        target_tokens: List of target tokens (generated)
        attention_matrix: [target_len, source_len] tensor
        save_path: Optional path to save figure
    """
    # Convert to numpy
    attn = attention_matrix.cpu().numpy()

    # Create figure
    fig, ax = plt.subplots(figsize=(10, 8))

    # Create heatmap
    sns.heatmap(
        attn,
        xticklabels=source_tokens,
        yticklabels=target_tokens,
        cmap='Blues',
        ax=ax,
        cbar_kws={'label': 'Attention Weight'}
    )

    ax.set_xlabel('Source Tokens')
    ax.set_ylabel('Target Tokens')
    ax.set_title('Attention Weights')

    # Rotate x labels for readability
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)

    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')

    plt.show()


def analyze_attention_patterns(
    attention_matrices: list[torch.Tensor],
    threshold: float = 0.3
) -> dict:
    """
    Analyze attention patterns across multiple examples.

    Args:
        attention_matrices: List of [target_len, source_len] tensors
        threshold: Minimum attention weight to consider "focused"

    Returns:
        Dictionary of statistics
    """
    stats = {
        'avg_entropy': [],
        'avg_max_attention': [],
        'diagonal_alignment': [],
        'num_focused_positions': []
    }

    for attn in attention_matrices:
        attn = attn.cpu()

        # Attention entropy (lower = more focused)
        # H = -sum(p * log(p))
        entropy = -torch.sum(attn * torch.log(attn + 1e-10), dim=-1)
        stats['avg_entropy'].append(entropy.mean().item())

        # Maximum attention per target position
        max_attn = attn.max(dim=-1).values
        stats['avg_max_attention'].append(max_attn.mean().item())

        # Diagonal alignment (for monotonic tasks like translation)
        min_len = min(attn.size(0), attn.size(1))
        diagonal = torch.diag(attn[:min_len, :min_len])
        stats['diagonal_alignment'].append(diagonal.mean().item())

        # Number of positions with attention > threshold
        focused = (attn > threshold).sum(dim=-1).float()
        stats['num_focused_positions'].append(focused.mean().item())

    # Aggregate statistics
    return {
        'entropy': {
            'mean': np.mean(stats['avg_entropy']),
            'std': np.std(stats['avg_entropy'])
        },
        'max_attention': {
            'mean': np.mean(stats['avg_max_attention']),
            'std': np.std(stats['avg_max_attention'])
        },
        'diagonal_alignment': {
            'mean': np.mean(stats['diagonal_alignment']),
            'std': np.std(stats['diagonal_alignment'])
        },
        'focused_positions': {
            'mean': np.mean(stats['num_focused_positions']),
            'std': np.std(stats['num_focused_positions'])
        }
    }
```

Attention's effectiveness stems from several complementary factors:
1. Eliminates the Bottleneck
Without attention: information flows through the single vector $\mathbf{c}$.

With attention: direct pathways from every $\mathbf{h}_i^{\text{enc}}$ to the decoder.
$$\text{Information capacity:} \quad d \quad \text{vs} \quad T_x \cdot d$$
2. Shortens Gradient Paths
In vanilla seq2seq, gradients from output $y_t$ to source $x_1$ must traverse: $$y_t \to \mathbf{s}_t \to \ldots \to \mathbf{s}_1 \to \mathbf{c} \to \mathbf{h}_{T_x}^{\text{enc}} \to \ldots \to \mathbf{h}_1^{\text{enc}}$$
With attention, there's a direct path: $$y_t \to \mathbf{c}_t \to \mathbf{h}_1^{\text{enc}}$$
This dramatically improves gradient flow for learning long-range dependencies.
3. Task-Appropriate Inductive Bias
Attention encodes the assumption that outputs depend on weighted combinations of inputs—which is accurate for many sequence transduction tasks. The model learns which combinations, but the compositional structure is built-in.
| Component | Without Attention | With Attention |
|---|---|---|
| Encoder burden | Must compress everything into c | Can produce any useful representation per position |
| Decoder access | Same c for all timesteps | Dynamic c_t specific to each timestep |
| Long sequences | Performance degrades significantly | Scales well with sequence length |
| Gradient flow | Long paths, vanishing gradients | Direct paths to every source position |
| Interpretability | Black box | Attention weights show alignment |
Attention's success with RNNs raised a natural question: if attention provides direct access to all positions, do we even need the sequential RNN? The answer—no—led to the Transformer architecture, which uses attention exclusively. Chapter 35 covers this evolution in depth.
Basic attention has spawned numerous extensions addressing its limitations.
Coverage Mechanism
Problem: Attention may repeatedly focus on the same positions (over-translation) or ignore some positions (under-translation).
Solution: Track cumulative attention and penalize re-attention:
$$\text{coverage}_t = \sum_{t'=1}^{t-1} \alpha_{t'}$$
$$e_{ti} = f(\mathbf{s}_t, \mathbf{h}_i^{\text{enc}}, \text{coverage}_{t-1,i})$$
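The idea can be sketched in a few lines. Note this is a simplified additive penalty (subtracting scaled cumulative attention from the raw scores), an illustrative stand-in for the learned score function $f$ above; dimensions and the penalty weight are arbitrary:

```python
import torch

torch.manual_seed(0)

# Simplified coverage sketch: positions that have already received
# attention mass get their scores pushed down, discouraging re-attention.
T_x, lam = 6, 1.0
coverage = torch.zeros(T_x)

for t in range(3):                        # a few decoder steps
    raw_scores = torch.randn(T_x)         # stand-in for score(s_t, h_i)
    scores = raw_scores - lam * coverage  # penalize already-covered positions
    alpha = torch.softmax(scores, dim=-1)
    coverage = coverage + alpha           # accumulate attention mass
```

Since each step's weights sum to 1, total coverage after $t$ steps sums to $t$, giving a direct readout of which positions have been over- or under-attended.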
Local Attention
Problem: Global attention over all $T_x$ positions is $O(T_x)$ per step.
Solution: Attend only to a window around a predicted position:
$$\alpha_{ti} \propto \exp(-\frac{(i - p_t)^2}{2\sigma^2}) \cdot \text{score}(\mathbf{s}_t, \mathbf{h}_i)$$
where $p_t$ is a predicted or fixed alignment position.
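The windowing can be sketched as follows, with a fixed $p_t$ and post-hoc renormalization for simplicity (illustrative values; in practice $p_t$ is typically predicted from the decoder state):

```python
import torch

torch.manual_seed(0)

# Local attention sketch: multiply content-based weights by a Gaussian
# window centered at p_t, so distant positions are suppressed.
T_x, sigma = 10, 2.0
p_t = 4.0
positions = torch.arange(T_x, dtype=torch.float32)

content = torch.softmax(torch.randn(T_x), dim=-1)  # stand-in content weights
window = torch.exp(-(positions - p_t) ** 2 / (2 * sigma ** 2))

alpha = content * window
alpha = alpha / alpha.sum()   # renormalize after windowing
```

Positions far from $p_t$ receive exponentially damped weight, so in a practical implementation scores outside the window need not be computed at all.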
Multi-Head Attention
Problem: Single attention head captures one type of relationship.
Solution: Multiple parallel attention heads, each learning different patterns:
$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = [\text{head}_1; \ldots; \text{head}_h]\mathbf{W}^O$$
This is central to Transformers (Chapter 35).
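As a preview, PyTorch ships a built-in implementation in `nn.MultiheadAttention`; a minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Minimal multi-head sketch: 4 heads jointly attend over 5 source
# positions for a single decoder query position.
d_model, num_heads, src_len = 16, 4, 5
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

query = torch.randn(1, 1, d_model)         # one decoder position
memory = torch.randn(1, src_len, d_model)  # encoder outputs as keys/values

context, weights = mha(query, memory, memory)
# context: [1, 1, d_model]; weights: [1, 1, src_len] (averaged over heads)
```

Each head applies its own learned projections before scaled dot-product attention, so different heads can specialize in different alignment patterns.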
These limitations motivated the development of self-attention and Transformers, which eliminate the recurrent bottleneck entirely. Chapter 35 covers the full attention and Transformer story.
We have introduced the attention mechanism—the key innovation that overcomes the seq2seq bottleneck and laid the foundation for modern deep learning architectures.
Module Complete!
This concludes our exploration of Advanced RNN Topics. You've mastered bidirectional processing, deep stacking, sequence-to-sequence modeling, and attention mechanisms.
In Chapter 35, we'll dive deep into Attention & Transformers—exploring self-attention, multi-head attention, positional encodings, and the Transformer architecture that now powers state-of-the-art models across NLP, vision, and beyond.
Congratulations! You've completed Module 6: Advanced RNN Topics. You now understand the full landscape of RNN architectures—from basic recurrence through bidirectional processing, deep stacking, sequence-to-sequence translation, and attention mechanisms. These concepts form the foundation for understanding modern sequence models.