Previous RNN architectures we've studied—for sequence classification or labeling—produce outputs of a predetermined structure: a single class label, or one label per input position. But many compelling applications require generating outputs whose length is independent of the input length: machine translation, summarization, dialogue generation, and question answering all map a source sequence to a target sequence whose length the input does not determine.
Sequence-to-sequence (seq2seq) models solve this fundamental challenge by combining two key components: an encoder RNN that compresses the source sequence into a context vector, and a decoder RNN that generates the target sequence conditioned on that context.
This paradigm, introduced by Sutskever et al. (2014) and Cho et al. (2014), revolutionized NLP and remains foundational even as attention mechanisms and Transformers have refined it.
By the end of this page, you will understand the seq2seq architecture in depth, including encoder compression, context vector formation, decoder generation, teacher forcing, inference strategies (greedy vs beam search), and the critical information bottleneck that motivated attention mechanisms.
The sequence-to-sequence model consists of two distinct RNNs—an encoder and a decoder—connected through a context vector.
High-Level Data Flow
Key Architectural Properties
| Component | Input | Output | Role |
|---|---|---|---|
| Encoder | Source sequence $(x_1, \ldots, x_{T_x})$ | Hidden states + context vector | Compress source into representation |
| Context Vector | Final encoder state(s) | Fixed-dimension vector $\mathbf{c}$ | Bridge encoder to decoder |
| Decoder | Context + previously generated tokens | Output sequence $(y_1, \ldots, y_{T_y})$ | Generate target sequence |
Vocabulary and Embeddings
The encoder and decoder typically have separate vocabularies and embedding matrices: a source vocabulary with embedding matrix $\mathbf{E}_x$ for the encoder, and a target vocabulary with embedding matrix $\mathbf{E}_y$ for the decoder.
For same-language tasks (like summarization), embeddings may be shared.
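A minimal sketch of both options in PyTorch, with hypothetical vocabulary and embedding sizes:

```python
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 8000, 256   # hypothetical sizes

# Separate tables: typical for translation between two languages
src_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
trg_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

# Shared table: common for same-language tasks like summarization,
# where source and target draw from one vocabulary
shared = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
src_embedding.weight = shared.weight
trg_embedding.weight = shared.weight
```

Sharing requires that both sides use the same token IDs (e.g., a shared subword vocabulary); with separate languages and vocabularies, separate tables are the default.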
The encoder's job is to compress the variable-length input sequence into a fixed-dimensional context that captures all information the decoder needs.
Encoder Computation
Given source sequence $\mathbf{x} = (x_1, x_2, \ldots, x_{T_x})$ with embeddings $\mathbf{e}_t = \mathbf{E}_x[x_t]$:
$$\mathbf{h}_t^{\text{enc}} = \text{LSTM}(\mathbf{e}_t, \mathbf{h}_{t-1}^{\text{enc}}, \mathbf{c}_{t-1}^{\text{enc}})$$
where $\mathbf{h}_0^{\text{enc}}$ and $\mathbf{c}_0^{\text{enc}}$ are typically initialized to zero.
Bidirectional Encoder (Preferred)
For most seq2seq tasks, a bidirectional encoder captures richer context:
$$\overrightarrow{\mathbf{h}}_t^{\text{enc}} = \text{LSTM}_{\text{forward}}(\mathbf{e}_t, \overrightarrow{\mathbf{h}}_{t-1}^{\text{enc}})$$ $$\overleftarrow{\mathbf{h}}_t^{\text{enc}} = \text{LSTM}_{\text{backward}}(\mathbf{e}_t, \overleftarrow{\mathbf{h}}_{t+1}^{\text{enc}})$$ $$\mathbf{h}_t^{\text{enc}} = [\overrightarrow{\mathbf{h}}_t^{\text{enc}}; \overleftarrow{\mathbf{h}}_t^{\text{enc}}]$$
The encoder processes a complete, known input sequence. Unlike the decoder (which generates autoregressively), the encoder can look at future tokens. Bidirectional encoding ensures every source position has full context awareness, improving the quality of the context vector.
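A quick sanity check of this concatenation using PyTorch's bidirectional LSTM (the batch, length, and dimension values are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bilstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)
embedded = torch.randn(4, 10, 16)        # [batch, T_x, embed_dim]
outputs, (h_n, c_n) = bilstm(embedded)

# Each outputs[:, t] is the concatenation [forward_t ; backward_t],
# so the feature dimension is 2 * hidden_size
print(outputs.shape)  # torch.Size([4, 10, 64])
# h_n holds one final state per direction: [num_layers * 2, batch, hidden_size]
print(h_n.shape)      # torch.Size([2, 4, 32])
```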
Context Vector Construction
Several strategies exist for constructing the context vector $\mathbf{c}$:
| Strategy | Formula | When to Use |
|---|---|---|
| Last hidden state | $\mathbf{c} = \mathbf{h}_{T_x}^{\text{enc}}$ | Simple baseline |
| Bidirectional concat | $\mathbf{c} = [\overrightarrow{\mathbf{h}}_{T_x}; \overleftarrow{\mathbf{h}}_1]$ | Capture sequence endpoints |
| Mean pooling | $\mathbf{c} = \frac{1}{T_x}\sum_{t=1}^{T_x}\mathbf{h}_t^{\text{enc}}$ | Reduce position bias |
| Max pooling | $\mathbf{c} = \max_t(\mathbf{h}_t^{\text{enc}})$ | Highlight strongest features |
| Learned pooling | $\mathbf{c} = \sum_t \alpha_t \mathbf{h}_t^{\text{enc}}$ | Attention mechanism (next page) |
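The non-attention strategies in the table are each one tensor operation over the encoder outputs. In this sketch, random tensors stand in for real hidden states; with padded batches you would mask pad positions before pooling:

```python
import torch

torch.manual_seed(0)
enc = torch.randn(4, 10, 64)     # [batch, T_x, hidden] stand-in for encoder outputs

c_last = enc[:, -1]              # last hidden state
c_mean = enc.mean(dim=1)         # mean pooling over time
c_max = enc.max(dim=1).values    # elementwise max over time

# Uniform weights alpha_t = 1/T_x make "learned pooling" collapse to the mean;
# attention replaces these fixed weights with learned, input-dependent ones
alpha = torch.full((4, 10, 1), 1.0 / 10)
c_uniform = (alpha * enc).sum(dim=1)
assert torch.allclose(c_uniform, c_mean, atol=1e-5)
```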
The Information Bottleneck
Using a single fixed-size context vector to encode entire paragraphs or documents creates an information bottleneck. This is the fundamental limitation that attention mechanisms address.
The decoder generates the output sequence autoregressively—one token at a time, conditioning on the context vector and all previously generated tokens.
Decoder Initialization
The decoder's initial hidden state is set from the context vector:
$$\mathbf{h}_0^{\text{dec}} = \tanh(\mathbf{W}_{\text{init}} \mathbf{c} + \mathbf{b}_{\text{init}})$$
or simply $\mathbf{h}_0^{\text{dec}} = \mathbf{c}$ if dimensions match.
For LSTM, both hidden and cell states need initialization: $$\mathbf{h}_0^{\text{dec}} = \tanh(\mathbf{W}_h \mathbf{c} + \mathbf{b}_h)$$ $$\mathbf{c}_0^{\text{dec}} = \tanh(\mathbf{W}_c \mathbf{c} + \mathbf{b}_c)$$
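A minimal sketch of these two projections (the dimensions are hypothetical; `W_h` and `W_c` mirror $\mathbf{W}_h$ and $\mathbf{W}_c$ above, with the bias terms folded into `nn.Linear`):

```python
import torch
import torch.nn as nn

ENC_DIM, DEC_DIM = 128, 64          # e.g. bidirectional context size vs decoder size
W_h = nn.Linear(ENC_DIM, DEC_DIM)   # projection for the initial hidden state
W_c = nn.Linear(ENC_DIM, DEC_DIM)   # separate projection for the initial cell state

context = torch.randn(4, ENC_DIM)   # [batch, enc_dim] context vectors
h0 = torch.tanh(W_h(context))
c0 = torch.tanh(W_c(context))
```

The tanh keeps both initial states in (-1, 1), matching the range of states an LSTM produces on its own.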
Decoder Step Computation
At each step $t$, the decoder embeds the previously generated (or ground-truth) token, updates its hidden state with one LSTM step, and projects the new state to a distribution over the target vocabulary.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """
    Bidirectional LSTM encoder for seq2seq.
    Compresses source sequence into context vector(s).
    """
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        hidden_dim: int,
        num_layers: int = 2,
        dropout: float = 0.2
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0
        )
        self.dropout = nn.Dropout(dropout)
        # Project bidirectional hidden states to decoder dimension
        self.fc_hidden = nn.Linear(hidden_dim * 2, hidden_dim)
        self.fc_cell = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(
        self,
        src: torch.Tensor,
        src_lengths: torch.Tensor
    ) -> tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
        """
        Args:
            src: [batch, src_len] - source token indices
            src_lengths: [batch] - actual lengths

        Returns:
            encoder_outputs: [batch, src_len, hidden_dim*2]
            (hidden, cell): Decoder initial states, each [num_layers, batch, hidden_dim]
        """
        # Embed source tokens
        embedded = self.dropout(self.embedding(src))

        # Pack for efficient processing
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, src_lengths.cpu(), batch_first=True, enforce_sorted=False
        )

        # Run bidirectional LSTM
        packed_outputs, (hidden, cell) = self.lstm(packed)

        # Unpack
        encoder_outputs, _ = nn.utils.rnn.pad_packed_sequence(
            packed_outputs, batch_first=True
        )

        # Process bidirectional hidden states for decoder initialization
        # hidden/cell: [num_layers*2, batch, hidden_dim]
        batch_size = src.size(0)

        # Reshape to separate directions
        hidden = hidden.view(self.num_layers, 2, batch_size, self.hidden_dim)
        cell = cell.view(self.num_layers, 2, batch_size, self.hidden_dim)

        # Concatenate forward and backward, then project
        # [num_layers, batch, hidden_dim*2] -> [num_layers, batch, hidden_dim]
        hidden = torch.cat([hidden[:, 0], hidden[:, 1]], dim=-1)
        cell = torch.cat([cell[:, 0], cell[:, 1]], dim=-1)
        hidden = torch.tanh(self.fc_hidden(hidden))
        cell = torch.tanh(self.fc_cell(cell))

        return encoder_outputs, (hidden, cell)


class Decoder(nn.Module):
    """
    LSTM decoder for seq2seq.
    Generates output sequence conditioned on encoder context.
    """
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        hidden_dim: int,
        num_layers: int = 2,
        dropout: float = 0.2
    ):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        # Output projection to vocabulary
        self.fc_out = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        input_token: torch.Tensor,
        hidden: torch.Tensor,
        cell: torch.Tensor
    ) -> tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
        """
        Single decoder step.

        Args:
            input_token: [batch, 1] - current input token
            hidden: [num_layers, batch, hidden_dim]
            cell: [num_layers, batch, hidden_dim]

        Returns:
            output: [batch, vocab_size] - next-token logits (pre-softmax scores)
            (hidden, cell): Updated states
        """
        # Embed input
        embedded = self.dropout(self.embedding(input_token))  # [batch, 1, embed_dim]

        # LSTM step
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))

        # Project to vocabulary
        output = self.fc_out(output.squeeze(1))  # [batch, vocab_size]

        return output, (hidden, cell)


class Seq2Seq(nn.Module):
    """
    Complete sequence-to-sequence model combining encoder and decoder.
    """
    def __init__(
        self,
        encoder: Encoder,
        decoder: Decoder,
        device: torch.device
    ):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(
        self,
        src: torch.Tensor,
        src_lengths: torch.Tensor,
        trg: torch.Tensor,
        teacher_forcing_ratio: float = 0.5
    ) -> torch.Tensor:
        """
        Training forward pass with teacher forcing.

        Args:
            src: [batch, src_len] - source sequences
            src_lengths: [batch] - source lengths
            trg: [batch, trg_len] - target sequences (with <sos>)
            teacher_forcing_ratio: probability of using ground truth as next input

        Returns:
            outputs: [batch, trg_len-1, vocab_size] - output predictions
        """
        batch_size = src.size(0)
        trg_len = trg.size(1)
        vocab_size = self.decoder.vocab_size

        # Store outputs
        outputs = torch.zeros(batch_size, trg_len - 1, vocab_size).to(self.device)

        # Encode source
        _, (hidden, cell) = self.encoder(src, src_lengths)

        # First decoder input is <sos> token
        decoder_input = trg[:, 0:1]  # [batch, 1]

        for t in range(1, trg_len):
            # Decode one step
            output, (hidden, cell) = self.decoder(decoder_input, hidden, cell)

            # Store output
            outputs[:, t - 1] = output

            # Decide next input: teacher forcing or predicted
            use_teacher_forcing = torch.rand(1).item() < teacher_forcing_ratio
            if use_teacher_forcing:
                decoder_input = trg[:, t:t + 1]
            else:
                decoder_input = output.argmax(dim=-1, keepdim=True)

        return outputs
```

During training, we have access to the correct target sequence. Teacher forcing uses this ground truth to accelerate and stabilize training.
The Teacher Forcing Strategy
Instead of feeding the model's own predictions as inputs to the next timestep, teacher forcing feeds the actual correct previous token: at step $t$, the decoder receives the ground-truth token $y_{t-1}$ rather than its own prediction $\hat{y}_{t-1}$.
Benefits of Teacher Forcing
| Benefit | Explanation |
|---|---|
| Faster convergence | Model sees correct context, learns mapping directly |
| Stable gradients | No error accumulation during training |
| Parallelizable | All decoder steps use known inputs (can partially parallelize) |
| Curriculum effect | Model learns from perfect examples before imperfect |
The Exposure Bias Problem
Teacher forcing creates a train-test mismatch called exposure bias: during training, the decoder always conditions on ground-truth prefixes, but at inference it must condition on its own, possibly erroneous, predictions.
The model never learns to recover from its mistakes during training, which can cause error cascades at inference time.
If the model makes an early mistake during inference (e.g., predicting 'chat' instead of 'cat'), subsequent predictions may be completely derailed because the model never trained on recovering from such errors. This is especially problematic for long sequences.
Mitigating Exposure Bias
Several strategies address the train-test mismatch:
| Strategy | Implementation | Trade-off |
|---|---|---|
| Scheduled Sampling | Gradually reduce teacher forcing ratio during training | Slower training, better inference |
| Curriculum Learning | Start easy (short sequences), increase difficulty | Careful schedule design needed |
| Sequence-Level Training | REINFORCE/policy gradient on full sequences | High variance, complex |
| Minimum Risk Training | Optimize expected BLEU/metric directly | Computationally expensive |
| Mixed Strategy | Random mix of teacher forcing and prediction | Simple, often effective |
```python
import torch
import math


def get_teacher_forcing_ratio(
    epoch: int,
    total_epochs: int,
    strategy: str = 'linear',
    min_ratio: float = 0.0,
    max_ratio: float = 1.0
) -> float:
    """
    Compute teacher forcing ratio based on training progress.

    Strategies:
    - 'constant': Always return max_ratio
    - 'linear': Linear decay from max_ratio to min_ratio
    - 'exponential': Exponential decay
    - 'inverse_sigmoid': Sigmoid schedule (slow start/end, fast middle)

    Args:
        epoch: Current epoch (0-indexed)
        total_epochs: Total training epochs
        strategy: Decay strategy
        min_ratio: Minimum teacher forcing ratio
        max_ratio: Maximum teacher forcing ratio

    Returns:
        Teacher forcing ratio for current epoch
    """
    progress = epoch / max(total_epochs - 1, 1)  # 0 to 1

    if strategy == 'constant':
        return max_ratio
    elif strategy == 'linear':
        # Linear decay
        ratio = max_ratio - progress * (max_ratio - min_ratio)
    elif strategy == 'exponential':
        # Exponential decay: r(t) = max * exp(-k*t)
        k = 5.0  # Decay rate
        ratio = max_ratio * math.exp(-k * progress)
        ratio = max(ratio, min_ratio)
    elif strategy == 'inverse_sigmoid':
        # Inverse sigmoid: slow start, fast middle, slow end
        k = 10.0  # Steepness
        mid_point = 0.5 * total_epochs
        ratio = max_ratio / (1 + math.exp(k * (epoch - mid_point) / total_epochs))
        ratio = max(ratio, min_ratio)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    return ratio


class ScheduledSamplingTrainer:
    """
    Trainer with scheduled sampling for seq2seq models.
    """
    def __init__(
        self,
        model: torch.nn.Module,
        optimizer: torch.optim.Optimizer,
        criterion: torch.nn.Module,
        sampling_strategy: str = 'linear',
        min_tf_ratio: float = 0.0
    ):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.sampling_strategy = sampling_strategy
        self.min_tf_ratio = min_tf_ratio

    def train_epoch(
        self,
        dataloader,
        epoch: int,
        total_epochs: int
    ) -> float:
        """Train one epoch with scheduled sampling."""
        self.model.train()
        total_loss = 0

        # Get teacher forcing ratio for this epoch
        tf_ratio = get_teacher_forcing_ratio(
            epoch, total_epochs,
            strategy=self.sampling_strategy,
            min_ratio=self.min_tf_ratio
        )
        print(f"Epoch {epoch}: Teacher Forcing Ratio = {tf_ratio:.3f}")

        for batch in dataloader:
            src, src_lengths, trg = batch
            self.optimizer.zero_grad()

            # Forward pass with current teacher forcing ratio
            outputs = self.model(
                src, src_lengths, trg,
                teacher_forcing_ratio=tf_ratio
            )

            # Compute loss (ignore padding and <sos>)
            output_dim = outputs.size(-1)
            outputs = outputs.view(-1, output_dim)
            trg = trg[:, 1:].contiguous().view(-1)  # Skip <sos>

            loss = self.criterion(outputs, trg)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()

            total_loss += loss.item()

        return total_loss / len(dataloader)
```

At inference time, we don't have ground-truth targets. The decoder must generate the output sequence based solely on the encoded input, selecting tokens autoregressively.
Greedy Decoding
The simplest approach: at each step, select the most probable token.
$$\hat{y}_t = \arg\max_{y} P(y \mid \hat{y}_{<t}, \mathbf{x})$$
| Pros | Cons |
|---|---|
| Fast ($O(T_y)$ selections) | May miss globally optimal sequence |
| Simple to implement | No backtracking from mistakes |
| Low memory | Sensitive to early errors |
Beam Search
Beam search maintains the $k$ (the beam width) best partial hypotheses at each step: every hypothesis is extended with its top-$k$ next tokens, and the best $k$ of the resulting candidates survive to the next step.
```python
import torch
import torch.nn.functional as F
from dataclasses import dataclass
from typing import List


@dataclass
class BeamHypothesis:
    """A single hypothesis in beam search."""
    tokens: List[int]
    log_prob: float
    hidden: tuple  # (h, c) for LSTM
    finished: bool = False


def beam_search(
    model,
    src: torch.Tensor,
    src_lengths: torch.Tensor,
    beam_width: int = 5,
    max_length: int = 50,
    sos_token: int = 2,
    eos_token: int = 3,
    length_penalty: float = 0.6
) -> List[int]:
    """
    Beam search decoding for seq2seq.

    Args:
        model: Seq2Seq model
        src: [1, src_len] - single source sequence
        src_lengths: [1] - source length
        beam_width: Number of beams to maintain
        max_length: Maximum output length
        sos_token: Start-of-sequence token ID
        eos_token: End-of-sequence token ID
        length_penalty: Penalty for length normalization (0 = no penalty)

    Returns:
        Best output sequence as list of token IDs
    """
    device = src.device
    model.eval()

    def score(beam: BeamHypothesis) -> float:
        # Length penalty: normalize by ((5 + length) / 6) ** alpha
        lp = ((5 + len(beam.tokens)) / 6) ** length_penalty
        return beam.log_prob / lp

    with torch.no_grad():
        # Encode source
        _, (hidden, cell) = model.encoder(src, src_lengths)

        # Initialize beams
        beams = [BeamHypothesis(
            tokens=[sos_token],
            log_prob=0.0,
            hidden=(hidden, cell),
            finished=False
        )]
        completed = []

        for step in range(max_length):
            if not beams:
                break

            all_candidates = []
            for beam in beams:
                # Get last token
                last_token = torch.tensor([[beam.tokens[-1]]], device=device)
                h, c = beam.hidden

                # Decoder step
                output, (new_h, new_c) = model.decoder(last_token, h, c)
                log_probs = F.log_softmax(output, dim=-1)  # [1, vocab_size]

                # Get top-k next tokens
                topk_log_probs, topk_tokens = log_probs.topk(beam_width)
                for i in range(beam_width):
                    token = topk_tokens[0, i].item()
                    token_log_prob = topk_log_probs[0, i].item()
                    all_candidates.append(BeamHypothesis(
                        tokens=beam.tokens + [token],
                        log_prob=beam.log_prob + token_log_prob,
                        hidden=(new_h.clone(), new_c.clone()),
                        finished=(token == eos_token)
                    ))

            # Rank candidates by length-normalized score
            all_candidates.sort(key=score, reverse=True)

            # Candidates that just emitted <eos> move to the completed pool;
            # only unfinished hypotheses stay in the active beam
            completed.extend(b for b in all_candidates[:beam_width] if b.finished)
            beams = [b for b in all_candidates if not b.finished][:beam_width]

        # Add remaining active beams to completed
        completed.extend(beams)

        # Return best completed sequence
        if completed:
            best = max(completed, key=lambda b: b.log_prob / len(b.tokens))
            # Remove <sos> and <eos> from output
            output_tokens = best.tokens[1:]
            if output_tokens and output_tokens[-1] == eos_token:
                output_tokens = output_tokens[:-1]
            return output_tokens
        return []


def greedy_decode(
    model,
    src: torch.Tensor,
    src_lengths: torch.Tensor,
    max_length: int = 50,
    sos_token: int = 2,
    eos_token: int = 3
) -> List[int]:
    """
    Greedy decoding: always pick the most probable token.

    Args:
        model: Seq2Seq model
        src: [1, src_len] - single source sequence
        src_lengths: [1] - source length
        max_length: Maximum output length
        sos_token: Start-of-sequence token ID
        eos_token: End-of-sequence token ID

    Returns:
        Output sequence as list of token IDs
    """
    device = src.device
    model.eval()

    with torch.no_grad():
        # Encode
        _, (hidden, cell) = model.encoder(src, src_lengths)

        # Start with <sos>
        decoder_input = torch.tensor([[sos_token]], device=device)
        output_tokens = []

        for _ in range(max_length):
            # Decode one step
            output, (hidden, cell) = model.decoder(decoder_input, hidden, cell)

            # Greedy selection
            predicted_token = output.argmax(dim=-1).item()
            if predicted_token == eos_token:
                break
            output_tokens.append(predicted_token)
            decoder_input = torch.tensor([[predicted_token]], device=device)

    return output_tokens
```

A larger beam width finds better-scoring sequences but increases computation roughly linearly with the beam width. Common values: machine translation uses beam = 4-10; open-ended generation may use smaller beams (1-5) to maintain diversity. The length penalty prevents beam search from favoring very short outputs.
The fundamental limitation of the basic seq2seq architecture is the information bottleneck: the entire source sequence must be compressed into a single fixed-size context vector.
Why This Is Problematic
Consider translating a 100-word sentence. All lexical, syntactic, and semantic information from those 100 words must fit into a vector of dimension ~512-1024. This creates several issues:
| Problem | Manifestation |
|---|---|
| Capacity limit | Can't encode arbitrarily long sequences |
| Vanishing context | Early tokens fade as encoder processes more input |
| Missing alignment | No direct correspondence between source and target positions |
| Uniform treatment | All source information equally compressed, even if some is irrelevant |
Empirical Evidence
Research showed clear performance degradation with sequence length:
| Source Length | BLEU Score | Relative Performance |
|---|---|---|
| 0-10 tokens | ~28 | 100% (baseline) |
| 10-20 tokens | ~26 | 93% |
| 20-30 tokens | ~23 | 82% |
| 30-40 tokens | ~19 | 68% |
| 40-50 tokens | ~15 | 54% |
| 50+ tokens | <12 | <43% |
Why a Single Vector Can't Suffice
The decoder generates each output token using the same context vector $\mathbf{c}$. But different output positions need different source information: when producing the translation of the source's first word, the decoder needs details from the beginning of the sentence; when producing the last word, it needs the end.
A fixed vector can't dynamically select which source information is relevant for each output position.
The Solution: Attention
Attention mechanisms (covered in the next page) solve this by computing a fresh context vector at every decoder step, as a weighted combination of all encoder hidden states.
This eliminates the fixed bottleneck: the decoder accesses the full encoded sequence at every generation step.
The attention mechanism was introduced precisely to address the bottleneck problem in neural machine translation. Bahdanau et al. (2015) showed that attention dramatically improved translation quality, especially for long sentences. This was a pivotal moment in NLP that eventually led to the Transformer architecture.
Building production-quality seq2seq systems requires attention to many practical details beyond the core architecture.
Vocabulary Handling
| Challenge | Solution |
|---|---|
| Unknown words (OOV) | Special $\langle\text{unk}\rangle$ token; subword tokenization (BPE, WordPiece) |
| Large vocabulary | Softmax approximations; hierarchical softmax; sampled softmax |
| Rare words | Weight tying between embedding and output layers |
| Cross-lingual | Shared subword vocabulary for similar languages |
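Weight tying, mentioned in the table, is a one-line assignment once the decoder's hidden size equals the embedding dimension (the sizes here are hypothetical; with mismatched sizes an extra projection is needed):

```python
import torch.nn as nn

VOCAB, DIM = 10000, 256                     # hypothetical; hidden_dim == embed_dim

embedding = nn.Embedding(VOCAB, DIM)        # weight shape: [VOCAB, DIM]
fc_out = nn.Linear(DIM, VOCAB, bias=False)  # weight shape: [VOCAB, DIM] as well

# nn.Linear stores its weight as [out_features, in_features], the same
# shape as the embedding table, so tying is direct assignment
fc_out.weight = embedding.weight
```

Tying halves the parameter count of these two large matrices and gives rare words more gradient signal, since each row is updated both as an input embedding and as an output class vector.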
Length Control
Controlling output length is often necessary: common tools include length normalization of beam scores, minimum/maximum length constraints during decoding, and penalties that discourage emitting $\langle\text{eos}\rangle$ too early.
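As one concrete example, the length-normalized score used in the beam-search code earlier on this page (a form popularized by GNMT, Wu et al., 2016) can be factored into a small helper; the log-probabilities below are toy numbers:

```python
def length_normalized_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Divide total log-probability by the penalty ((5 + length) / 6) ** alpha."""
    lp = ((5 + length) / 6) ** alpha
    return log_prob / lp

# Raw log-probs favor the short hypothesis (-4.0 > -5.0) ...
short, long_ = -4.0, -5.0   # totals for a 4-token and a 10-token hypothesis
# ... but per token the long one is better, and normalization lets it win
assert length_normalized_score(long_, 10) > length_normalized_score(short, 4)
```

With alpha = 0 the penalty is 1 and scores reduce to raw log-probabilities, which systematically favor shorter outputs because every added token makes the total log-probability more negative.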
Training Efficiency
Bucketing examples by length to minimize padding, packing padded sequences (as the encoder above does), and gradient clipping all keep seq2seq training fast and stable.
Handling Special Cases
| Case | Approach |
|---|---|
| Copy mechanism | Pointer network to copy from source (useful for names, rare words) |
| Coverage | Track attention history to avoid repetition or omission |
| Diverse outputs | Nucleus sampling, diverse beam search |
| Factual grounding | Attention over external knowledge base |
Modern seq2seq systems almost universally use subword tokenization (BPE, WordPiece, or SentencePiece). This balances vocabulary size (10-50K tokens), handles unseen words gracefully, and improves translation of morphologically rich languages.
We have thoroughly explored sequence-to-sequence models—the foundational architecture for variable-length input-to-output mapping. Key takeaways: the encoder compresses the source into a context vector; the decoder generates the target autoregressively; teacher forcing accelerates training but introduces exposure bias; beam search trades computation for better outputs; and the fixed-size context vector is an information bottleneck.
What's Next:
The information bottleneck is the fundamental limitation of basic seq2seq. In the next page, we'll explore the encoder-decoder architecture with attention—how dynamically computing context for each decoder step revolutionized sequence-to-sequence learning.
You now understand the sequence-to-sequence paradigm—how to encode variable-length inputs, decode variable-length outputs, train with teacher forcing, and decode with beam search. This architecture revolutionized machine translation and remains the foundation for modern encoder-decoder models.