The encoder-decoder framework is a general architectural pattern that extends far beyond basic seq2seq. While we introduced the core concept in the previous page, here we dive deeper into the design principles, variations, and engineering considerations that make encoder-decoder systems work effectively in practice.
The fundamental insight is separation of concerns: the encoder specializes in understanding the input, while the decoder specializes in generating the output. This separation enables mixing and matching components across modalities, sizing each side to the demands of its task, and reusing encoders or decoders across applications.
By the end of this page, you will understand encoder-decoder design principles, state transfer mechanisms, cross-modal applications, asymmetric architectures, and the engineering decisions that affect real-world performance.
Effective encoder-decoder systems follow several key design principles that guide architectural choices.
Principle 1: Representation Completeness
The encoder must produce a representation that contains all information the decoder needs to generate the correct output. Missing information cannot be recovered.
$$I(\mathbf{z}; \mathbf{y}) = I(\mathbf{x}; \mathbf{y})$$
where $\mathbf{z}$ is the encoded representation. When this equality holds, $\mathbf{z}$ is a sufficient statistic for predicting $\mathbf{y}$ from $\mathbf{x}$: encoding loses none of the input's information about the output.
Principle 2: Representational Compactness
The encoding should be as compact as possible while preserving necessary information. Over-encoding wastes parameters, invites overfitting on spurious input details, and slows both training and inference.
Principle 3: Decoder Autonomy
The decoder should be capable of conditional generation—producing valid outputs for any plausible encoding, not just those seen during training. This enables generalization to new inputs.
| Design Choice | Benefit | Cost |
|---|---|---|
| Deep encoder, shallow decoder | Rich source understanding | Less expressive generation |
| Shallow encoder, deep decoder | Powerful generation | May miss source nuances |
| Balanced depth | Versatile | May not optimize either function |
| Bidirectional encoder | Full source context | Cannot share weights with decoder |
| Shared encoder-decoder layers | Parameter efficiency | May limit specialization |
| Large hidden dimension | More capacity | Harder to train, slower inference |
Many successful systems use different capacities for encoder and decoder. For translation, encoders often have more layers (understanding foreign language is harder). For summarization, decoders may need more capacity (generating coherent summaries is harder than understanding articles).
The interface between encoder and decoder—how information transfers from encoding to decoding—is a critical design decision with significant implications.
Strategy 1: Final State Transfer
The simplest approach: initialize decoder hidden state with encoder's final hidden state.
$$\mathbf{h}_0^{\text{dec}} = \mathbf{h}_{T_x}^{\text{enc}}$$
For bidirectional encoders with different dimensions:
$$\mathbf{h}_0^{\text{dec}} = \mathbf{W}_{\text{proj}}[\overrightarrow{\mathbf{h}}_{T_x}; \overleftarrow{\mathbf{h}}_1] + \mathbf{b}_{\text{proj}}$$
Strategy 2: All-States Access (Attention)
Instead of passing only the final state, store all encoder states and let the decoder access them dynamically:
$$\mathbf{c}_t = \sum_{i=1}^{T_x} \alpha_{ti} \mathbf{h}_i^{\text{enc}}$$
where $\alpha_{ti}$ are attention weights (covered in detail in the next page).
Strategy 3: Mean/Max Pooling
Aggregate all encoder states:
$$\mathbf{c} = \frac{1}{T_x} \sum_{i=1}^{T_x} \mathbf{h}_i^{\text{enc}} \quad \text{(mean)}$$ $$\mathbf{c} = \max_i \mathbf{h}_i^{\text{enc}} \quad \text{(max)}$$
```python
import torch
import torch.nn as nn


class StateTransferMethods(nn.Module):
    """
    Various methods for transferring encoder state to decoder.
    This module demonstrates different initialization strategies.
    """

    def __init__(
        self,
        encoder_hidden_dim: int,
        decoder_hidden_dim: int,
        num_encoder_layers: int,
        num_decoder_layers: int,
        bidirectional: bool = True
    ):
        super().__init__()
        self.bidirectional = bidirectional
        self.num_encoder_layers = num_encoder_layers
        self.num_decoder_layers = num_decoder_layers

        # Calculate effective encoder output dimension
        encoder_output_dim = encoder_hidden_dim * (2 if bidirectional else 1)

        # Projection layers for state transfer
        self.hidden_proj = nn.Linear(encoder_output_dim, decoder_hidden_dim)
        self.cell_proj = nn.Linear(encoder_output_dim, decoder_hidden_dim)

        # For multi-layer decoder with different depth
        if num_decoder_layers != num_encoder_layers:
            # Bridge network to map encoder layers to decoder layers
            self.layer_bridge = nn.Linear(
                num_encoder_layers * encoder_output_dim,
                num_decoder_layers * decoder_hidden_dim
            )

    def final_state_transfer(
        self,
        encoder_hidden: torch.Tensor,
        encoder_cell: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Transfer final encoder states to initialize decoder.

        Args:
            encoder_hidden: [num_layers*num_directions, batch, hidden]
            encoder_cell: [num_layers*num_directions, batch, hidden]

        Returns:
            decoder_hidden: [num_decoder_layers, batch, decoder_hidden]
            decoder_cell: [num_decoder_layers, batch, decoder_hidden]
        """
        batch_size = encoder_hidden.size(1)

        if self.bidirectional:
            # Reshape to separate directions: [num_layers, 2, batch, hidden]
            hidden = encoder_hidden.view(
                self.num_encoder_layers, 2, batch_size, -1
            )
            cell = encoder_cell.view(
                self.num_encoder_layers, 2, batch_size, -1
            )
            # Concatenate forward and backward for the last encoder layer
            hidden = torch.cat([hidden[-1, 0], hidden[-1, 1]], dim=-1)
            cell = torch.cat([cell[-1, 0], cell[-1, 1]], dim=-1)
        else:
            # Just take last layer
            hidden = encoder_hidden[-1]
            cell = encoder_cell[-1]

        # Project to decoder dimension
        decoder_hidden = torch.tanh(self.hidden_proj(hidden))
        decoder_cell = torch.tanh(self.cell_proj(cell))

        # Expand to decoder layers (simple: repeat)
        decoder_hidden = decoder_hidden.unsqueeze(0).expand(
            self.num_decoder_layers, -1, -1
        ).contiguous()
        decoder_cell = decoder_cell.unsqueeze(0).expand(
            self.num_decoder_layers, -1, -1
        ).contiguous()

        return decoder_hidden, decoder_cell

    def pooled_state_transfer(
        self,
        encoder_outputs: torch.Tensor,
        src_lengths: torch.Tensor,
        pooling: str = 'mean'
    ) -> torch.Tensor:
        """
        Create context by pooling all encoder outputs.

        Args:
            encoder_outputs: [batch, src_len, encoder_dim]
            src_lengths: [batch] - actual sequence lengths
            pooling: 'mean', 'max', or 'last'

        Returns:
            context: [batch, encoder_dim]
        """
        max_len = encoder_outputs.size(1)
        device = encoder_outputs.device

        # Create mask for variable-length sequences
        mask = torch.arange(max_len, device=device)[None, :] < src_lengths[:, None]
        mask = mask.unsqueeze(-1).float()  # [batch, src_len, 1]

        if pooling == 'mean':
            # Masked mean pooling
            summed = (encoder_outputs * mask).sum(dim=1)
            context = summed / src_lengths.unsqueeze(-1).float()
        elif pooling == 'max':
            # Masked max pooling (set padding to large negative)
            masked_outputs = encoder_outputs.masked_fill(
                ~mask.bool(), float('-inf')
            )
            context, _ = masked_outputs.max(dim=1)
        elif pooling == 'last':
            # Get last valid position for each sequence
            last_indices = (src_lengths - 1).view(-1, 1, 1)
            last_indices = last_indices.expand(-1, 1, encoder_outputs.size(-1))
            context = encoder_outputs.gather(1, last_indices).squeeze(1)
        else:
            raise ValueError(f"Unknown pooling: {pooling}")

        return context

    def layer_wise_transfer(
        self,
        encoder_hidden: torch.Tensor,
        encoder_cell: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Transfer state from each encoder layer to the corresponding decoder
        layer. Handles cases where num_encoder_layers != num_decoder_layers.

        Args:
            encoder_hidden: [num_layers*num_directions, batch, hidden]
            encoder_cell: [num_layers*num_directions, batch, hidden]

        Returns:
            decoder_hidden: [num_decoder_layers, batch, decoder_hidden]
            decoder_cell: [num_decoder_layers, batch, decoder_hidden]
        """
        batch_size = encoder_hidden.size(1)

        if self.bidirectional:
            # Reshape and concatenate directions
            hidden = encoder_hidden.view(
                self.num_encoder_layers, 2, batch_size, -1
            )
            cell = encoder_cell.view(
                self.num_encoder_layers, 2, batch_size, -1
            )
            # Concat forward and backward: [num_layers, batch, 2*hidden]
            hidden = torch.cat([hidden[:, 0], hidden[:, 1]], dim=-1)
            cell = torch.cat([cell[:, 0], cell[:, 1]], dim=-1)

        # If same number of layers, project each layer directly
        if self.num_encoder_layers == self.num_decoder_layers:
            decoder_hidden = torch.tanh(self.hidden_proj(hidden))
            decoder_cell = torch.tanh(self.cell_proj(cell))
        else:
            # Flatten all layers, project through bridge, reshape
            hidden_flat = hidden.permute(1, 0, 2).contiguous()  # [batch, layers, dim]
            hidden_flat = hidden_flat.view(batch_size, -1)      # [batch, layers*dim]
            cell_flat = cell.permute(1, 0, 2).contiguous().view(batch_size, -1)

            # Project through bridge
            hidden_bridged = self.layer_bridge(hidden_flat)
            cell_bridged = self.layer_bridge(cell_flat)

            # Reshape to decoder layers
            decoder_hidden = hidden_bridged.view(
                batch_size, self.num_decoder_layers, -1
            ).permute(1, 0, 2).contiguous()
            decoder_cell = cell_bridged.view(
                batch_size, self.num_decoder_layers, -1
            ).permute(1, 0, 2).contiguous()

            decoder_hidden = torch.tanh(decoder_hidden)
            decoder_cell = torch.tanh(decoder_cell)

        return decoder_hidden, decoder_cell
```

When both encoder and decoder have multiple layers, managing state transfer becomes more nuanced.
Layer Correspondence Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Top-to-Top | Only transfer top encoder layer to bottom decoder layer | Simple, most common |
| Layer-wise | Each encoder layer initializes corresponding decoder layer | Matched architecture |
| Full Bridge | Dense connection from all encoder layers to all decoder layers | Maximum flexibility |
| Selective | Learned selection of which encoder layers feed which decoder layers | Auto-discovered correspondence |
Residual Connections Across Encoder-Decoder
For deep architectures, residual connections can span the encoder-decoder boundary, shortening the gradient path from the decoder's loss back to early encoder layers.
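As a minimal sketch of this idea (the module name and the choice of a mean-pooled encoder summary are illustrative, not a standard recipe): project a summary of the encoder states and add it to each decoder layer's input, forming a shortcut across the boundary.

```python
import torch
import torch.nn as nn


class ResidualBridge(nn.Module):
    """Illustrative sketch: add a projected encoder summary to a
    decoder layer's input, forming a residual path across the boundary."""

    def __init__(self, encoder_dim: int, decoder_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, decoder_dim)

    def forward(self, decoder_layer_input: torch.Tensor,
                encoder_summary: torch.Tensor) -> torch.Tensor:
        # decoder_layer_input: [batch, trg_len, decoder_dim]
        # encoder_summary:     [batch, encoder_dim] (e.g., mean-pooled states)
        shortcut = self.proj(encoder_summary).unsqueeze(1)  # [batch, 1, dec_dim]
        return decoder_layer_input + shortcut  # broadcast over trg_len


bridge = ResidualBridge(encoder_dim=32, decoder_dim=16)
out = bridge(torch.randn(4, 7, 16), torch.randn(4, 32))
assert out.shape == (4, 7, 16)
```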
Depth Considerations
| Configuration | Total Gradient Path | Training Difficulty |
|---|---|---|
| 2-layer encoder + 2-layer decoder | $T_x + 2 + T_y + 2$ | Moderate |
| 4-layer encoder + 4-layer decoder | $T_x + 4 + T_y + 4$ | Challenging |
| 6-layer encoder + 6-layer decoder | $T_x + 6 + T_y + 6$ | Requires residual/attention |
Practical Recommendation:
For deep encoder-decoder models (≥4 layers each): use residual connections within both stacks, add layer normalization, rely on attention rather than final-state transfer alone, and combine gradient clipping with learning-rate warmup.
Encoder and decoder don't need to be symmetric. Asymmetric architectures match capacity to task demands.
Encoder-Heavy (Understanding-Focused): allocate more layers or capacity to the encoder when comprehending the source is the hard part, as in translation from a complex source language.
Decoder-Heavy (Generation-Focused): allocate more capacity to the decoder when fluent, coherent generation is the hard part, as in abstractive summarization.
Mixed Modality
Different modalities often require radically different encoders and decoders:
| Task | Encoder | Decoder | Interface |
|---|---|---|---|
| Image Captioning | CNN (ResNet, ViT) | LSTM/Transformer | Image features → RNN state |
| Speech Recognition | CNN + RNN (acoustic) | RNN (language model) | Audio frames → text tokens |
| Video Description | 3D CNN + Temporal RNN | Language LSTM | Spatio-temporal features → words |
| Document QA | BERT/RoBERTa | Pointer network | Contextualized embeddings → spans |
| Text-to-Speech | Text encoder (RNN) | Spectrogram decoder (autoregressive) | Phoneme sequence → acoustic features |
When encoder and decoder process different modalities, a learned projection layer is essential. The projection bridges the representation spaces, mapping image features to the manifold expected by a text decoder, or acoustic embeddings to language model hidden states.
```python
import torch
import torch.nn as nn
import torchvision.models as models


class ImageCaptioningModel(nn.Module):
    """
    Asymmetric encoder-decoder for image captioning.
    CNN encoder (frozen or fine-tuned) + LSTM decoder.
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 256,
        hidden_dim: int = 512,
        num_decoder_layers: int = 2,
        encoder_pretrained: bool = True,
        encoder_finetune: bool = False
    ):
        super().__init__()

        # === CNN ENCODER ===
        # Use pretrained ResNet-50 as feature extractor
        resnet = models.resnet50(pretrained=encoder_pretrained)
        # Remove final classification layer
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])

        # Freeze encoder if not fine-tuning
        if not encoder_finetune:
            for param in self.encoder.parameters():
                param.requires_grad = False

        # CNN outputs [batch, 2048, 7, 7] for 224x224 input
        encoder_dim = 2048

        # === BRIDGE: CNN features → RNN state ===
        # Project pooled features to decoder hidden dimension
        self.feature_projection = nn.Linear(encoder_dim, hidden_dim)

        # Initialize decoder hidden/cell from image features
        self.init_h = nn.Linear(encoder_dim, hidden_dim)
        self.init_c = nn.Linear(encoder_dim, hidden_dim)

        # === LSTM DECODER ===
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(
            input_size=embed_dim + hidden_dim,  # Attend to image
            hidden_size=hidden_dim,
            num_layers=num_decoder_layers,
            batch_first=True,
            dropout=0.3 if num_decoder_layers > 1 else 0
        )

        # Simple attention over spatial positions
        self.attention = nn.Linear(hidden_dim + encoder_dim, 1)

        # Output projection
        self.fc_out = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(0.5)

        self.hidden_dim = hidden_dim
        self.encoder_dim = encoder_dim

    def encode(self, images: torch.Tensor) -> torch.Tensor:
        """
        Encode images to feature maps.

        Args:
            images: [batch, 3, 224, 224]

        Returns:
            features: [batch, 49, 2048] - flattened spatial features
        """
        with torch.set_grad_enabled(self.encoder[0].weight.requires_grad):
            features = self.encoder(images)  # [batch, 2048, 7, 7]

        # Flatten spatial dimensions
        batch_size = features.size(0)
        features = features.view(batch_size, self.encoder_dim, -1)  # [batch, 2048, 49]
        features = features.permute(0, 2, 1)  # [batch, 49, 2048]

        return features

    def init_decoder_state(
        self,
        features: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Initialize decoder state from image features.

        Args:
            features: [batch, 49, 2048]

        Returns:
            h0: [num_layers, batch, hidden]
            c0: [num_layers, batch, hidden]
        """
        # Pool over spatial positions
        mean_features = features.mean(dim=1)  # [batch, 2048]

        # Project to hidden dimension
        h0 = torch.tanh(self.init_h(mean_features))  # [batch, hidden]
        c0 = torch.tanh(self.init_c(mean_features))  # [batch, hidden]

        # Expand to decoder layers
        h0 = h0.unsqueeze(0).expand(self.decoder.num_layers, -1, -1).contiguous()
        c0 = c0.unsqueeze(0).expand(self.decoder.num_layers, -1, -1).contiguous()

        return h0, c0

    def attend(
        self,
        features: torch.Tensor,
        decoder_hidden: torch.Tensor
    ) -> torch.Tensor:
        """
        Attend to image features based on decoder state.

        Args:
            features: [batch, 49, 2048]
            decoder_hidden: [batch, hidden]

        Returns:
            context: [batch, hidden] - attended image features
        """
        num_positions = features.size(1)

        # Expand decoder hidden to match spatial positions
        decoder_expanded = decoder_hidden.unsqueeze(1).expand(-1, num_positions, -1)

        # Concatenate and score
        combined = torch.cat([features, decoder_expanded], dim=-1)
        scores = self.attention(combined).squeeze(-1)  # [batch, 49]

        # Softmax attention weights
        weights = torch.softmax(scores, dim=-1)  # [batch, 49]

        # Weighted sum of features
        context = torch.bmm(weights.unsqueeze(1), features)  # [batch, 1, 2048]
        context = context.squeeze(1)  # [batch, 2048]

        # Project to hidden dimension
        context = self.feature_projection(context)  # [batch, hidden]

        return context

    def forward(
        self,
        images: torch.Tensor,
        captions: torch.Tensor,
        teacher_forcing_ratio: float = 1.0
    ) -> torch.Tensor:
        """
        Training forward pass.

        Args:
            images: [batch, 3, 224, 224]
            captions: [batch, caption_len] - ground truth captions with <sos>
            teacher_forcing_ratio: probability of using ground truth

        Returns:
            outputs: [batch, caption_len-1, vocab_size]
        """
        batch_size = images.size(0)
        caption_len = captions.size(1)

        # Encode image
        features = self.encode(images)  # [batch, 49, 2048]

        # Initialize decoder
        hidden, cell = self.init_decoder_state(features)

        # Store outputs
        outputs = torch.zeros(batch_size, caption_len - 1, self.fc_out.out_features)
        outputs = outputs.to(images.device)

        # First input is <sos>
        decoder_input = captions[:, 0]  # [batch]

        for t in range(1, caption_len):
            # Embed input
            embedded = self.dropout(self.embedding(decoder_input))  # [batch, embed]

            # Attend to image
            h_for_attention = hidden[-1]  # Use top layer for attention
            context = self.attend(features, h_for_attention)  # [batch, hidden]

            # Concatenate embedding and context
            rnn_input = torch.cat([embedded, context], dim=-1)  # [batch, embed+hidden]
            rnn_input = rnn_input.unsqueeze(1)  # [batch, 1, embed+hidden]

            # Decoder step
            output, (hidden, cell) = self.decoder(rnn_input, (hidden, cell))
            output = output.squeeze(1)  # [batch, hidden]

            # Project to vocabulary
            output = self.fc_out(self.dropout(output))  # [batch, vocab]
            outputs[:, t - 1] = output

            # Next input
            use_tf = torch.rand(1).item() < teacher_forcing_ratio
            decoder_input = captions[:, t] if use_tf else output.argmax(dim=-1)

        return outputs
```

In some scenarios, sharing parameters between encoder and decoder improves regularization and efficiency.
Weight Tying Strategies
| Sharing Type | What's Shared | When to Use |
|---|---|---|
| Embedding tying | Source and target embeddings | Same/similar vocabulary |
| Output-embedding tying | Decoder embeddings and output projection | Saves $\lvert V\rvert \times d$ parameters |
| Encoder-decoder tying | All or some layers | Auto-encoding tasks |
| Cross-lingual tying | Embeddings across languages | Multilingual models |
Three-Way Tying
For same-vocabulary tasks (e.g., summarization, paraphrasing), three-way tying combines the source (encoder) input embeddings, the target (decoder) input embeddings, and the decoder output projection.
All three become the same matrix (with appropriate transposition for output):
$$\mathbf{E}_x = \mathbf{E}_y = \mathbf{W}_{\text{out}}^\top$$
Weight tying reduces parameters significantly (vocabulary is often the largest parameter component) and can improve generalization by forcing embeddings to work in both encoding and generation contexts. Press & Wolf (2017) showed consistent improvements from output-embedding tying.
```python
import torch
import torch.nn as nn


class TiedSeq2Seq(nn.Module):
    """
    Seq2Seq with weight tying between embeddings and output layer.
    Suitable for same-vocabulary tasks.
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        hidden_dim: int,
        num_layers: int = 2,
        tie_weights: bool = True
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.tie_weights = tie_weights

        # Single embedding matrix (will be shared)
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # Encoder
        self.encoder = nn.LSTM(
            embed_dim, hidden_dim, num_layers,
            batch_first=True, bidirectional=True
        )

        # Bridge from bidirectional to unidirectional
        self.bridge = nn.Linear(hidden_dim * 2, hidden_dim)

        # Decoder
        self.decoder = nn.LSTM(
            embed_dim, hidden_dim, num_layers, batch_first=True
        )

        # Output projection
        if tie_weights:
            # Output projection will use transposed embedding weights;
            # only need a bias term
            self.output_bias = nn.Parameter(torch.zeros(vocab_size))

            # Projection from hidden to embed (if dimensions differ)
            if hidden_dim != embed_dim:
                self.output_projection = nn.Linear(hidden_dim, embed_dim, bias=False)
            else:
                self.output_projection = None
        else:
            # Standard untied output layer
            self.output_layer = nn.Linear(hidden_dim, vocab_size)

    def get_output_embeddings(self) -> nn.Embedding:
        """Return the output embedding matrix (same as input if tied)."""
        return self.embedding

    def compute_output(self, hidden: torch.Tensor) -> torch.Tensor:
        """
        Compute output logits from decoder hidden state.

        Args:
            hidden: [batch, hidden_dim]

        Returns:
            logits: [batch, vocab_size]
        """
        if self.tie_weights:
            # Project to embedding dimension if needed
            if self.output_projection is not None:
                hidden = self.output_projection(hidden)

            # Multiply by embedding weights (transposed)
            # embedding.weight: [vocab_size, embed_dim]
            # hidden: [batch, embed_dim]
            logits = torch.matmul(hidden, self.embedding.weight.t())
            logits = logits + self.output_bias
        else:
            logits = self.output_layer(hidden)

        return logits

    def forward(
        self,
        src: torch.Tensor,
        trg: torch.Tensor
    ) -> torch.Tensor:
        """
        Forward pass with weight tying.

        Args:
            src: [batch, src_len]
            trg: [batch, trg_len]

        Returns:
            outputs: [batch, trg_len-1, vocab_size]
        """
        # Embed with shared embeddings
        src_embedded = self.embedding(src)
        trg_embedded = self.embedding(trg[:, :-1])  # Exclude last for teacher forcing

        # Encode
        encoder_outputs, (hidden, cell) = self.encoder(src_embedded)

        # Bridge bidirectional → unidirectional
        # hidden: [num_layers*2, batch, hidden]
        num_layers = hidden.size(0) // 2
        hidden = hidden.view(num_layers, 2, -1, self.hidden_dim)
        cell = cell.view(num_layers, 2, -1, self.hidden_dim)

        # Concatenate and project
        hidden = self.bridge(torch.cat([hidden[:, 0], hidden[:, 1]], dim=-1))
        cell = self.bridge(torch.cat([cell[:, 0], cell[:, 1]], dim=-1))

        # Decode
        decoder_outputs, _ = self.decoder(trg_embedded, (hidden, cell))

        # Compute output logits for each position
        batch_size, seq_len, _ = decoder_outputs.size()
        decoder_outputs = decoder_outputs.contiguous().view(-1, self.hidden_dim)
        logits = self.compute_output(decoder_outputs)
        logits = logits.view(batch_size, seq_len, -1)

        return logits

    def count_parameters(self) -> dict:
        """Count parameters with and without tying."""
        total = sum(p.numel() for p in self.parameters())
        embedding_params = self.embedding.weight.numel()
        return {
            'total': total,
            'embedding': embedding_params,
            'savings_from_tying': embedding_params if self.tie_weights else 0,
            'effective_total': total - (embedding_params if self.tie_weights else 0)
        }
```

Training encoder-decoder models effectively requires careful attention to several strategies beyond basic gradient descent.
Loss Masking
Sequences in a batch have different lengths, requiring padding. Loss should be computed only on valid positions:
$$\mathcal{L} = -\frac{1}{\sum_t m_t} \sum_{t=1}^{T} m_t \log P(y_t^* | \hat{y}_{<t}, \mathbf{x})$$
where $m_t = 1$ if position $t$ is valid, $0$ for padding.
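The masked loss above maps directly to code. A minimal sketch (tensor shapes and the padding index are illustrative), showing that the manual masked average matches PyTorch's built-in `ignore_index` handling:

```python
import torch
import torch.nn.functional as F

PAD_IDX = 0  # illustrative padding token id
logits = torch.randn(2, 4, 5)                       # [batch, T, vocab]
targets = torch.tensor([[3, 1, 2, PAD_IDX],         # second sequence is shorter
                        [4, 2, PAD_IDX, PAD_IDX]])

# Manual masked loss: m_t = 1 on valid positions, 0 on padding
log_probs = F.log_softmax(logits, dim=-1)
token_loss = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [batch, T]
mask = (targets != PAD_IDX).float()
loss_manual = (token_loss * mask).sum() / mask.sum()

# Equivalent built-in: ignore_index skips padded positions
loss_builtin = F.cross_entropy(
    logits.view(-1, 5), targets.view(-1), ignore_index=PAD_IDX
)

assert torch.allclose(loss_manual, loss_builtin, atol=1e-6)
```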
Gradient Clipping
Encoder-decoder models are susceptible to exploding gradients, especially with long sequences:
$$\tilde{\mathbf{g}} = \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}$$
Typical clip value $\theta \in [1.0, 5.0]$.
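In PyTorch this is a one-liner between `backward()` and the optimizer step; a small sketch (the tiny LSTM stands in for a full encoder-decoder model):

```python
import torch
import torch.nn as nn

model = nn.LSTM(8, 16)  # stand-in for an encoder-decoder model
x = torch.randn(5, 3, 8)
out, _ = model(x)
out.sum().backward()

# Clip the global gradient norm to theta = 1.0 before the optimizer step.
theta = 1.0
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=theta)

# After clipping, the global gradient norm is at most theta.
grads = torch.cat([p.grad.view(-1) for p in model.parameters()])
assert grads.norm() <= theta + 1e-4
```

Note that `clip_grad_norm_` returns the norm *before* clipping, which is worth logging: frequent large values signal instability.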
Label Smoothing
Soften one-hot targets to prevent overconfidence:
$$y_k^{\text{smooth}} = (1-\epsilon) y_k + \frac{\epsilon}{|V|}$$
with $\epsilon \approx 0.1$ common.
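The smoothed-target formula can be checked against PyTorch's built-in `label_smoothing` argument (available in `F.cross_entropy` since PyTorch 1.10); shapes here are illustrative:

```python
import torch
import torch.nn.functional as F

vocab_size, eps = 5, 0.1
logits = torch.randn(3, vocab_size)
targets = torch.tensor([1, 0, 4])

# Built-in label smoothing
loss_builtin = F.cross_entropy(logits, targets, label_smoothing=eps)

# Manual smoothed targets: (1 - eps) * one-hot + eps/|V| uniform,
# so the gold token gets probability (1 - eps) + eps/|V|
smooth = torch.full((3, vocab_size), eps / vocab_size)
smooth.scatter_(1, targets.unsqueeze(1), 1 - eps + eps / vocab_size)
loss_manual = -(smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

assert torch.allclose(loss_builtin, loss_manual, atol=1e-5)
```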
Cross-entropy loss during training doesn't perfectly correlate with generation quality (BLEU, ROUGE, etc.). A model with slightly higher validation loss may actually produce better translations. Always monitor actual metrics, not just loss.
Building encoder-decoder systems involves navigating several common failure modes.
| Problem | Symptoms | Solution |
|---|---|---|
| Repetitive outputs | Decoder repeats same phrase/token | Coverage mechanism, repetition penalty, nucleus sampling |
| Premature EOS | Very short outputs | Length penalty in beam search, minimum length constraint |
| Generic outputs | Safe, common phrases regardless of input | Reduce teacher forcing, increase diversity, check data balance |
| Hallucination | Decoder invents content not in source | Stronger attention to source, copy mechanisms |
| Catastrophic forgetting | Long-range info lost in decoder | Attention mechanism, deeper decoder, residual connections |
| Exposure bias | Good loss, poor generation | Scheduled sampling, sequence-level training |
| Mode collapse | All inputs → similar outputs | Check encoder capacity, data diversity, temperature sampling |
| Training instability | Loss spikes, NaN values | Lower LR, gradient clipping, layer norm, check for data issues |
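As one concrete fix from the table, a repetition penalty can be applied at decode time by down-weighting the logits of tokens already generated. A sketch of the common CTRL-style formulation (the penalty value 1.2 is a typical but illustrative choice):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated: list[int],
                             penalty: float = 1.2) -> torch.Tensor:
    """Down-weight tokens already generated (CTRL-style penalty)."""
    logits = logits.clone()
    for tok in set(generated):
        # Divide positive logits, multiply negative ones, so the
        # penalized token's probability always decreases.
        if logits[tok] > 0:
            logits[tok] = logits[tok] / penalty
        else:
            logits[tok] = logits[tok] * penalty
    return logits

logits = torch.tensor([2.0, -1.0, 0.5])
penalized = apply_repetition_penalty(logits, generated=[0, 1])
assert penalized[0] < logits[0]   # positive logit divided by penalty
assert penalized[1] < logits[1]   # negative logit multiplied by penalty
assert penalized[2] == logits[2]  # unseen token untouched
```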
When generation quality is poor: (1) Visualize attention weights to check source-target alignment, (2) Compare greedy vs beam search outputs, (3) Check if teacher forcing ratio affects validation quality, (4) Sample multiple outputs to see diversity, (5) Test on simple synthetic data first.
We have thoroughly explored encoder-decoder architectures—the general framework underlying sequence transduction tasks. The key takeaways: the encoder and decoder separate understanding from generation; the state-transfer interface (final state, pooling, or attention) is a central design decision; capacities can be asymmetric and modality-specific; weight tying saves parameters and regularizes; and training details such as loss masking, gradient clipping, and label smoothing matter in practice.
What's Next:
We've established the encoder-decoder framework and its limitations (particularly the information bottleneck). The next page introduces Attention Preview—how attention mechanisms revolutionize the encoder-decoder interface by enabling dynamic, position-specific access to encoded representations.
You now understand encoder-decoder design principles, state transfer mechanisms, asymmetric architectures, weight tying, and practical training strategies. This framework underlies machine translation, summarization, and countless other sequence transduction applications.