Previous RNN architectures we've studied—for sequence classification or labeling—produce outputs of a predetermined structure: a single class label, or one label per input position. But many compelling applications require generating outputs whose length is independent of the input length: machine translation, summarization, dialogue generation, and question answering all map a source sequence to a target sequence whose length the input does not determine.
Sequence-to-sequence (seq2seq) models solve this fundamental challenge by combining two key components: an encoder RNN that compresses the source sequence into a context vector, and a decoder RNN that generates the target sequence conditioned on that context.
This paradigm, introduced by Sutskever et al. (2014) and Cho et al. (2014), revolutionized NLP and remains foundational even as attention mechanisms and Transformers have refined it.
By the end of this page, you will understand the seq2seq architecture in depth, including encoder compression, context vector formation, decoder generation, teacher forcing, inference strategies (greedy vs beam search), and the critical information bottleneck that motivated attention mechanisms.
The sequence-to-sequence model consists of two distinct RNNs—an encoder and a decoder—connected through a context vector.
High-Level Data Flow
Key Architectural Properties
| Component | Input | Output | Role |
|---|---|---|---|
| Encoder | Source sequence $(x_1, \ldots, x_{T_x})$ | Hidden states + context vector | Compress source into representation |
| Context Vector | Final encoder state(s) | Fixed-dimension vector $\mathbf{c}$ | Bridge encoder to decoder |
| Decoder | Context + previously generated tokens | Output sequence $(y_1, \ldots, y_{T_y})$ | Generate target sequence |
Vocabulary and Embeddings
The encoder and decoder typically have separate vocabularies and embedding matrices: a source vocabulary with embedding matrix $\mathbf{E}_x$ for the encoder, and a target vocabulary with embedding matrix $\mathbf{E}_y$ for the decoder.
For same-language tasks (like summarization), embeddings may be shared.
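A minimal sketch of both options in PyTorch, with hypothetical vocabulary and embedding sizes:

```python
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 8000, 256   # hypothetical sizes

# Separate tables: typical for translation between two languages
src_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
trg_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

# Shared table: common for same-language tasks like summarization,
# where source and target draw from one vocabulary
shared = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
src_embedding.weight = shared.weight
trg_embedding.weight = shared.weight
```

Sharing requires that both sides use the same token IDs (e.g., a shared subword vocabulary); with separate languages and vocabularies, separate tables are the default.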
The encoder's job is to compress the variable-length input sequence into a fixed-dimensional context that captures all information the decoder needs.
Encoder Computation
Given source sequence $\mathbf{x} = (x_1, x_2, \ldots, x_{T_x})$ with embeddings $\mathbf{e}_t = \mathbf{E}_x[x_t]$:
$$\mathbf{h}_t^{\text{enc}} = \text{LSTM}(\mathbf{e}_t, \mathbf{h}_{t-1}^{\text{enc}}, \mathbf{c}_{t-1}^{\text{enc}})$$
where $\mathbf{h}_0^{\text{enc}}$ and $\mathbf{c}_0^{\text{enc}}$ are typically initialized to zero.
Bidirectional Encoder (Preferred)
For most seq2seq tasks, a bidirectional encoder captures richer context:
$$\overrightarrow{\mathbf{h}}_t^{\text{enc}} = \text{LSTM}_{\text{forward}}(\mathbf{e}_t, \overrightarrow{\mathbf{h}}_{t-1}^{\text{enc}})$$ $$\overleftarrow{\mathbf{h}}_t^{\text{enc}} = \text{LSTM}_{\text{backward}}(\mathbf{e}_t, \overleftarrow{\mathbf{h}}_{t+1}^{\text{enc}})$$ $$\mathbf{h}_t^{\text{enc}} = [\overrightarrow{\mathbf{h}}_t^{\text{enc}}; \overleftarrow{\mathbf{h}}_t^{\text{enc}}]$$
The encoder processes a complete, known input sequence. Unlike the decoder (which generates autoregressively), the encoder can look at future tokens. Bidirectional encoding ensures every source position has full context awareness, improving the quality of the context vector.
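A quick sanity check of this concatenation using PyTorch's bidirectional LSTM (the batch, length, and dimension values are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bilstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)
embedded = torch.randn(4, 10, 16)        # [batch, T_x, embed_dim]
outputs, (h_n, c_n) = bilstm(embedded)

# Each outputs[:, t] is the concatenation [forward_t ; backward_t],
# so the feature dimension is 2 * hidden_size
print(outputs.shape)  # torch.Size([4, 10, 64])
# h_n holds one final state per direction: [num_layers * 2, batch, hidden_size]
print(h_n.shape)      # torch.Size([2, 4, 32])
```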
Context Vector Construction
Several strategies exist for constructing the context vector $\mathbf{c}$:
| Strategy | Formula | When to Use |
|---|---|---|
| Last hidden state | $\mathbf{c} = \mathbf{h}_{T_x}^{\text{enc}}$ | Simple baseline |
| Bidirectional concat | $\mathbf{c} = [\overrightarrow{\mathbf{h}}_{T_x}; \overleftarrow{\mathbf{h}}_1]$ | Capture sequence endpoints |
| Mean pooling | $\mathbf{c} = \frac{1}{T_x}\sum_{t=1}^{T_x}\mathbf{h}_t^{\text{enc}}$ | Reduce position bias |
| Max pooling | $\mathbf{c} = \max_t(\mathbf{h}_t^{\text{enc}})$ | Highlight strongest features |
| Learned pooling | $\mathbf{c} = \sum_t \alpha_t \mathbf{h}_t^{\text{enc}}$ | Attention mechanism (next page) |
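The non-attention strategies in the table are each one tensor operation over the encoder outputs. In this sketch, random tensors stand in for real hidden states; with padded batches you would mask pad positions before pooling:

```python
import torch

torch.manual_seed(0)
enc = torch.randn(4, 10, 64)     # [batch, T_x, hidden] stand-in for encoder outputs

c_last = enc[:, -1]              # last hidden state
c_mean = enc.mean(dim=1)         # mean pooling over time
c_max = enc.max(dim=1).values    # elementwise max over time

# Uniform weights alpha_t = 1/T_x make "learned pooling" collapse to the mean;
# attention replaces these fixed weights with learned, input-dependent ones
alpha = torch.full((4, 10, 1), 1.0 / 10)
c_uniform = (alpha * enc).sum(dim=1)
assert torch.allclose(c_uniform, c_mean, atol=1e-5)
```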
The Information Bottleneck
Using a single fixed-size context vector to encode entire paragraphs or documents creates an information bottleneck. This is the fundamental limitation that attention mechanisms address.
The decoder generates the output sequence autoregressively—one token at a time, conditioning on the context vector and all previously generated tokens.
Decoder Initialization
The decoder's initial hidden state is set from the context vector:
$$\mathbf{h}_0^{\text{dec}} = \tanh(\mathbf{W}_{\text{init}} \mathbf{c} + \mathbf{b}_{\text{init}})$$
or simply $\mathbf{h}_0^{\text{dec}} = \mathbf{c}$ if dimensions match.
For LSTM, both hidden and cell states need initialization: $$\mathbf{h}_0^{\text{dec}} = \tanh(\mathbf{W}_h \mathbf{c} + \mathbf{b}_h)$$ $$\mathbf{c}_0^{\text{dec}} = \tanh(\mathbf{W}_c \mathbf{c} + \mathbf{b}_c)$$
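A minimal sketch of these two projections (the dimensions are hypothetical; `W_h` and `W_c` mirror $\mathbf{W}_h$ and $\mathbf{W}_c$ above, with the bias terms folded into `nn.Linear`):

```python
import torch
import torch.nn as nn

ENC_DIM, DEC_DIM = 128, 64          # e.g. bidirectional context size vs decoder size
W_h = nn.Linear(ENC_DIM, DEC_DIM)   # projection for the initial hidden state
W_c = nn.Linear(ENC_DIM, DEC_DIM)   # separate projection for the initial cell state

context = torch.randn(4, ENC_DIM)   # [batch, enc_dim] context vectors
h0 = torch.tanh(W_h(context))
c0 = torch.tanh(W_c(context))
```

The tanh keeps both initial states in (-1, 1), matching the range of states an LSTM produces on its own.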
Decoder Step Computation
At each step $t$, the decoder embeds the previously generated (or ground-truth) token, updates its hidden state with one LSTM step, and projects the new state to a distribution over the target vocabulary.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """
    Bidirectional LSTM encoder for seq2seq.
    Compresses source sequence into context vector(s).
    """
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        hidden_dim: int,
        num_layers: int = 2,
        dropout: float = 0.2
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0
        )
        self.dropout = nn.Dropout(dropout)
        # Project bidirectional hidden states to decoder dimension
        self.fc_hidden = nn.Linear(hidden_dim * 2, hidden_dim)
        self.fc_cell = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(
        self,
        src: torch.Tensor,
        src_lengths: torch.Tensor
    ) -> tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
        """
        Args:
            src: [batch, src_len] - source token indices
            src_lengths: [batch] - actual lengths

        Returns:
            encoder_outputs: [batch, src_len, hidden_dim*2]
            (hidden, cell): Decoder initial states, each [num_layers, batch, hidden_dim]
        """
        # Embed source tokens
        embedded = self.dropout(self.embedding(src))

        # Pack for efficient processing
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, src_lengths.cpu(), batch_first=True, enforce_sorted=False
        )

        # Run bidirectional LSTM
        packed_outputs, (hidden, cell) = self.lstm(packed)

        # Unpack
        encoder_outputs, _ = nn.utils.rnn.pad_packed_sequence(
            packed_outputs, batch_first=True
        )

        # Process bidirectional hidden states for decoder initialization
        # hidden/cell: [num_layers*2, batch, hidden_dim]
        batch_size = src.size(0)

        # Reshape to separate directions
        hidden = hidden.view(self.num_layers, 2, batch_size, self.hidden_dim)
        cell = cell.view(self.num_layers, 2, batch_size, self.hidden_dim)

        # Concatenate forward and backward, then project
        # [num_layers, batch, hidden_dim*2] -> [num_layers, batch, hidden_dim]
        hidden = torch.cat([hidden[:, 0], hidden[:, 1]], dim=-1)
        cell = torch.cat([cell[:, 0], cell[:, 1]], dim=-1)
        hidden = torch.tanh(self.fc_hidden(hidden))
        cell = torch.tanh(self.fc_cell(cell))

        return encoder_outputs, (hidden, cell)


class Decoder(nn.Module):
    """
    LSTM decoder for seq2seq.
    Generates output sequence conditioned on encoder context.
    """
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        hidden_dim: int,
        num_layers: int = 2,
        dropout: float = 0.2
    ):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        # Output projection to vocabulary
        self.fc_out = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        input_token: torch.Tensor,
        hidden: torch.Tensor,
        cell: torch.Tensor
    ) -> tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
        """
        Single decoder step.

        Args:
            input_token: [batch, 1] - current input token
            hidden: [num_layers, batch, hidden_dim]
            cell: [num_layers, batch, hidden_dim]

        Returns:
            output: [batch, vocab_size] - next-token logits (pre-softmax scores)
            (hidden, cell): Updated states
        """
        # Embed input
        embedded = self.dropout(self.embedding(input_token))  # [batch, 1, embed_dim]

        # LSTM step
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))

        # Project to vocabulary
        output = self.fc_out(output.squeeze(1))  # [batch, vocab_size]

        return output, (hidden, cell)


class Seq2Seq(nn.Module):
    """
    Complete sequence-to-sequence model combining encoder and decoder.
    """
    def __init__(
        self,
        encoder: Encoder,
        decoder: Decoder,
        device: torch.device
    ):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(
        self,
        src: torch.Tensor,
        src_lengths: torch.Tensor,
        trg: torch.Tensor,
        teacher_forcing_ratio: float = 0.5
    ) -> torch.Tensor:
        """
        Training forward pass with teacher forcing.

        Args:
            src: [batch, src_len] - source sequences
            src_lengths: [batch] - source lengths
            trg: [batch, trg_len] - target sequences (with <sos>)
            teacher_forcing_ratio: probability of using ground truth as next input

        Returns:
            outputs: [batch, trg_len-1, vocab_size] - output predictions
        """
        batch_size = src.size(0)
        trg_len = trg.size(1)
        vocab_size = self.decoder.vocab_size

        # Store outputs
        outputs = torch.zeros(batch_size, trg_len - 1, vocab_size).to(self.device)

        # Encode source
        _, (hidden, cell) = self.encoder(src, src_lengths)

        # First decoder input is <sos> token
        decoder_input = trg[:, 0:1]  # [batch, 1]

        for t in range(1, trg_len):
            # Decode one step
            output, (hidden, cell) = self.decoder(decoder_input, hidden, cell)

            # Store output
            outputs[:, t - 1] = output

            # Decide next input: teacher forcing or predicted
            use_teacher_forcing = torch.rand(1).item() < teacher_forcing_ratio
            if use_teacher_forcing:
                decoder_input = trg[:, t:t + 1]
            else:
                decoder_input = output.argmax(dim=-1, keepdim=True)

        return outputs
```

During training, we have access to the correct target sequence. Teacher forcing uses this ground truth to accelerate and stabilize training.
The Teacher Forcing Strategy
Instead of feeding the model's own predictions as inputs to the next timestep, teacher forcing feeds the actual correct previous token: at step $t$, the decoder receives the ground-truth token $y_{t-1}$ rather than its own prediction $\hat{y}_{t-1}$.
Benefits of Teacher Forcing
| Benefit | Explanation |
|---|---|
| Faster convergence | Model sees correct context, learns mapping directly |
| Stable gradients | No error accumulation during training |
| Parallelizable | All decoder steps use known inputs (can partially parallelize) |
| Curriculum effect | Model learns from perfect examples before imperfect |
The Exposure Bias Problem
Teacher forcing creates a train-test mismatch called exposure bias: during training, the decoder always conditions on ground-truth prefixes, but at inference it must condition on its own, possibly erroneous, predictions.
The model never learns to recover from its mistakes during training, which can cause error cascades at inference time.
If the model makes an early mistake during inference (e.g., predicting 'chat' instead of 'cat'), subsequent predictions may be completely derailed because the model never trained on recovering from such errors. This is especially problematic for long sequences.
Mitigating Exposure Bias
Several strategies address the train-test mismatch:
| Strategy | Implementation | Trade-off |
|---|---|---|
| Scheduled Sampling | Gradually reduce teacher forcing ratio during training | Slower training, better inference |
| Curriculum Learning | Start easy (short sequences), increase difficulty | Careful schedule design needed |
| Sequence-Level Training | REINFORCE/policy gradient on full sequences | High variance, complex |
| Minimum Risk Training | Optimize expected BLEU/metric directly | Computationally expensive |
| Mixed Strategy | Random mix of teacher forcing and prediction | Simple, often effective |
```python
import torch
import math


def get_teacher_forcing_ratio(
    epoch: int,
    total_epochs: int,
    strategy: str = 'linear',
    min_ratio: float = 0.0,
    max_ratio: float = 1.0
) -> float:
    """
    Compute teacher forcing ratio based on training progress.

    Strategies:
    - 'constant': Always return max_ratio
    - 'linear': Linear decay from max_ratio to min_ratio
    - 'exponential': Exponential decay
    - 'inverse_sigmoid': Sigmoid schedule (slow start/end, fast middle)

    Args:
        epoch: Current epoch (0-indexed)
        total_epochs: Total training epochs
        strategy: Decay strategy
        min_ratio: Minimum teacher forcing ratio
        max_ratio: Maximum teacher forcing ratio

    Returns:
        Teacher forcing ratio for current epoch
    """
    progress = epoch / max(total_epochs - 1, 1)  # 0 to 1

    if strategy == 'constant':
        return max_ratio
    elif strategy == 'linear':
        # Linear decay
        ratio = max_ratio - progress * (max_ratio - min_ratio)
    elif strategy == 'exponential':
        # Exponential decay: r(t) = max * exp(-k*t)
        k = 5.0  # Decay rate
        ratio = max_ratio * math.exp(-k * progress)
        ratio = max(ratio, min_ratio)
    elif strategy == 'inverse_sigmoid':
        # Inverse sigmoid: slow start, fast middle, slow end
        k = 10.0  # Steepness
        mid_point = 0.5 * total_epochs
        ratio = max_ratio / (1 + math.exp(k * (epoch - mid_point) / total_epochs))
        ratio = max(ratio, min_ratio)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    return ratio


class ScheduledSamplingTrainer:
    """
    Trainer with scheduled sampling for seq2seq models.
    """
    def __init__(
        self,
        model: torch.nn.Module,
        optimizer: torch.optim.Optimizer,
        criterion: torch.nn.Module,
        sampling_strategy: str = 'linear',
        min_tf_ratio: float = 0.0
    ):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.sampling_strategy = sampling_strategy
        self.min_tf_ratio = min_tf_ratio

    def train_epoch(
        self,
        dataloader,
        epoch: int,
        total_epochs: int
    ) -> float:
        """Train one epoch with scheduled sampling."""
        self.model.train()
        total_loss = 0

        # Get teacher forcing ratio for this epoch
        tf_ratio = get_teacher_forcing_ratio(
            epoch, total_epochs,
            strategy=self.sampling_strategy,
            min_ratio=self.min_tf_ratio
        )
        print(f"Epoch {epoch}: Teacher Forcing Ratio = {tf_ratio:.3f}")

        for batch in dataloader:
            src, src_lengths, trg = batch
            self.optimizer.zero_grad()

            # Forward pass with current teacher forcing ratio
            outputs = self.model(
                src, src_lengths, trg,
                teacher_forcing_ratio=tf_ratio
            )

            # Compute loss (ignore padding and <sos>)
            output_dim = outputs.size(-1)
            outputs = outputs.view(-1, output_dim)
            trg = trg[:, 1:].contiguous().view(-1)  # Skip <sos>

            loss = self.criterion(outputs, trg)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()

            total_loss += loss.item()

        return total_loss / len(dataloader)
```

At inference time, we don't have ground-truth targets. The decoder must generate the output sequence based solely on the encoded input, selecting tokens autoregressively.
Greedy Decoding
The simplest approach: at each step, select the most probable token.
$$\hat{y}_t = \arg\max_{y} P(y \mid \hat{y}_{<t}, \mathbf{x})$$
| Pros | Cons |
|---|---|
| Fast ($O(T_y)$ selections) | May miss globally optimal sequence |
| Simple to implement | No backtracking from mistakes |
| Low memory | Sensitive to early errors |
Beam Search
Beam search maintains the $k$ (the beam width) best partial hypotheses at each step: every hypothesis is extended with its top-$k$ next tokens, and the best $k$ of the resulting candidates survive to the next step.
```python
import torch
import torch.nn.functional as F
from dataclasses import dataclass
from typing import List


@dataclass
class BeamHypothesis:
    """A single hypothesis in beam search."""
    tokens: List[int]
    log_prob: float
    hidden: tuple  # (h, c) for LSTM
    finished: bool = False


def beam_search(
    model,
    src: torch.Tensor,
    src_lengths: torch.Tensor,
    beam_width: int = 5,
    max_length: int = 50,
    sos_token: int = 2,
    eos_token: int = 3,
    length_penalty: float = 0.6
) -> List[int]:
    """
    Beam search decoding for seq2seq.

    Args:
        model: Seq2Seq model
        src: [1, src_len] - single source sequence
        src_lengths: [1] - source length
        beam_width: Number of beams to maintain
        max_length: Maximum output length
        sos_token: Start-of-sequence token ID
        eos_token: End-of-sequence token ID
        length_penalty: Penalty for length normalization (0 = no penalty)

    Returns:
        Best output sequence as list of token IDs
    """
    device = src.device
    model.eval()

    def score(beam: BeamHypothesis) -> float:
        # Length penalty: normalize by ((5 + length) / 6) ** alpha
        lp = ((5 + len(beam.tokens)) / 6) ** length_penalty
        return beam.log_prob / lp

    with torch.no_grad():
        # Encode source
        _, (hidden, cell) = model.encoder(src, src_lengths)

        # Initialize beams
        beams = [BeamHypothesis(
            tokens=[sos_token],
            log_prob=0.0,
            hidden=(hidden, cell),
            finished=False
        )]
        completed = []

        for step in range(max_length):
            if not beams:
                break

            all_candidates = []
            for beam in beams:
                # Get last token
                last_token = torch.tensor([[beam.tokens[-1]]], device=device)
                h, c = beam.hidden

                # Decoder step
                output, (new_h, new_c) = model.decoder(last_token, h, c)
                log_probs = F.log_softmax(output, dim=-1)  # [1, vocab_size]

                # Get top-k next tokens
                topk_log_probs, topk_tokens = log_probs.topk(beam_width)
                for i in range(beam_width):
                    token = topk_tokens[0, i].item()
                    token_log_prob = topk_log_probs[0, i].item()
                    all_candidates.append(BeamHypothesis(
                        tokens=beam.tokens + [token],
                        log_prob=beam.log_prob + token_log_prob,
                        hidden=(new_h.clone(), new_c.clone()),
                        finished=(token == eos_token)
                    ))

            # Rank candidates by length-normalized score
            all_candidates.sort(key=score, reverse=True)

            # Candidates that just emitted <eos> move to the completed pool;
            # only unfinished hypotheses stay in the active beam
            completed.extend(b for b in all_candidates[:beam_width] if b.finished)
            beams = [b for b in all_candidates if not b.finished][:beam_width]

        # Add remaining active beams to completed
        completed.extend(beams)

        # Return best completed sequence
        if completed:
            best = max(completed, key=lambda b: b.log_prob / len(b.tokens))
            # Remove <sos> and <eos> from output
            output_tokens = best.tokens[1:]
            if output_tokens and output_tokens[-1] == eos_token:
                output_tokens = output_tokens[:-1]
            return output_tokens
        return []


def greedy_decode(
    model,
    src: torch.Tensor,
    src_lengths: torch.Tensor,
    max_length: int = 50,
    sos_token: int = 2,
    eos_token: int = 3
) -> List[int]:
    """
    Greedy decoding: always pick the most probable token.

    Args:
        model: Seq2Seq model
        src: [1, src_len] - single source sequence
        src_lengths: [1] - source length
        max_length: Maximum output length
        sos_token: Start-of-sequence token ID
        eos_token: End-of-sequence token ID

    Returns:
        Output sequence as list of token IDs
    """
    device = src.device
    model.eval()

    with torch.no_grad():
        # Encode
        _, (hidden, cell) = model.encoder(src, src_lengths)

        # Start with <sos>
        decoder_input = torch.tensor([[sos_token]], device=device)
        output_tokens = []

        for _ in range(max_length):
            # Decode one step
            output, (hidden, cell) = model.decoder(decoder_input, hidden, cell)

            # Greedy selection
            predicted_token = output.argmax(dim=-1).item()
            if predicted_token == eos_token:
                break
            output_tokens.append(predicted_token)
            decoder_input = torch.tensor([[predicted_token]], device=device)

    return output_tokens
```

A larger beam width finds better-scoring sequences but increases computation roughly linearly with the beam width. Common values: machine translation uses beam = 4-10; open-ended generation may use smaller beams (1-5) to maintain diversity. The length penalty prevents beam search from favoring very short outputs.
The fundamental limitation of the basic seq2seq architecture is the information bottleneck: the entire source sequence must be compressed into a single fixed-size context vector.
Why This Is Problematic
Consider translating a 100-word sentence. All lexical, syntactic, and semantic information from those 100 words must fit into a vector of dimension ~512-1024. This creates several issues:
| Problem | Manifestation |
|---|---|
| Capacity limit | Can't encode arbitrarily long sequences |
| Vanishing context | Early tokens fade as encoder processes more input |
| Missing alignment | No direct correspondence between source and target positions |
| Uniform treatment | All source information equally compressed, even if some is irrelevant |
Empirical Evidence
Research showed clear performance degradation with sequence length:
| Source Length | BLEU Score | Relative Performance |
|---|---|---|
| 0-10 tokens | ~28 | 100% (baseline) |
| 10-20 tokens | ~26 | 93% |
| 20-30 tokens | ~23 | 82% |
| 30-40 tokens | ~19 | 68% |
| 40-50 tokens | ~15 | 54% |
| 50+ tokens | <12 | <43% |
Why a Single Vector Can't Suffice
The decoder generates each output token using the same context vector $\mathbf{c}$. But different output positions need different source information: when producing the translation of the source's first word, the decoder needs details from the beginning of the sentence; when producing the last word, it needs the end.
A fixed vector can't dynamically select which source information is relevant for each output position.
The Solution: Attention
Attention mechanisms (covered in the next page) solve this by computing a fresh context vector at every decoder step, as a weighted combination of all encoder hidden states.
This eliminates the fixed bottleneck: the decoder accesses the full encoded sequence at every generation step.
The attention mechanism was introduced precisely to address the bottleneck problem in neural machine translation. Bahdanau et al. (2015) showed that attention dramatically improved translation quality, especially for long sentences. This was a pivotal moment in NLP that eventually led to the Transformer architecture.
Building production-quality seq2seq systems requires attention to many practical details beyond the core architecture.
Vocabulary Handling
| Challenge | Solution |
|---|---|
| Unknown words (OOV) | Special $\langle\text{unk}\rangle$ token; subword tokenization (BPE, WordPiece) |
| Large vocabulary | Softmax approximations; hierarchical softmax; sampled softmax |
| Rare words | Weight tying between embedding and output layers |
| Cross-lingual | Shared subword vocabulary for similar languages |
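Weight tying, mentioned in the table, is a one-line assignment once the decoder's hidden size equals the embedding dimension (the sizes here are hypothetical; with mismatched sizes an extra projection is needed):

```python
import torch.nn as nn

VOCAB, DIM = 10000, 256                     # hypothetical; hidden_dim == embed_dim

embedding = nn.Embedding(VOCAB, DIM)        # weight shape: [VOCAB, DIM]
fc_out = nn.Linear(DIM, VOCAB, bias=False)  # weight shape: [VOCAB, DIM] as well

# nn.Linear stores its weight as [out_features, in_features], the same
# shape as the embedding table, so tying is direct assignment
fc_out.weight = embedding.weight
```

Tying halves the parameter count of these two large matrices and gives rare words more gradient signal, since each row is updated both as an input embedding and as an output class vector.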
Length Control
Controlling output length is often necessary: common tools include length normalization of beam scores, minimum/maximum length constraints during decoding, and penalties that discourage emitting $\langle\text{eos}\rangle$ too early.
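As one concrete example, the length-normalized score used in the beam-search code earlier on this page (a form popularized by GNMT, Wu et al., 2016) can be factored into a small helper; the log-probabilities below are toy numbers:

```python
def length_normalized_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Divide total log-probability by the penalty ((5 + length) / 6) ** alpha."""
    lp = ((5 + length) / 6) ** alpha
    return log_prob / lp

# Raw log-probs favor the short hypothesis (-4.0 > -5.0) ...
short, long_ = -4.0, -5.0   # totals for a 4-token and a 10-token hypothesis
# ... but per token the long one is better, and normalization lets it win
assert length_normalized_score(long_, 10) > length_normalized_score(short, 4)
```

With alpha = 0 the penalty is 1 and scores reduce to raw log-probabilities, which systematically favor shorter outputs because every added token makes the total log-probability more negative.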
Training Efficiency
Bucketing examples by length to minimize padding, packing padded sequences (as the encoder above does), and gradient clipping all keep seq2seq training fast and stable.
Handling Special Cases
| Case | Approach |
|---|---|
| Copy mechanism | Pointer network to copy from source (useful for names, rare words) |
| Coverage | Track attention history to avoid repetition or omission |
| Diverse outputs | Nucleus sampling, diverse beam search |
| Factual grounding | Attention over external knowledge base |
Modern seq2seq systems almost universally use subword tokenization (BPE, WordPiece, or SentencePiece). This balances vocabulary size (10-50K tokens), handles unseen words gracefully, and improves translation of morphologically rich languages.
We have thoroughly explored sequence-to-sequence models—the foundational architecture for variable-length input-to-output mapping. Key takeaways: the encoder compresses the source into a context vector; the decoder generates the target autoregressively; teacher forcing accelerates training but introduces exposure bias; beam search trades computation for better outputs; and the fixed-size context vector is an information bottleneck.
What's Next:
The information bottleneck is the fundamental limitation of basic seq2seq. In the next page, we'll explore the encoder-decoder architecture with attention—how dynamically computing context for each decoder step revolutionized sequence-to-sequence learning.
You now understand the sequence-to-sequence paradigm—how to encode variable-length inputs, decode variable-length outputs, train with teacher forcing, and decode with beam search. This architecture revolutionized machine translation and remains the foundation for modern encoder-decoder models.