The encoder-decoder framework is a general architectural pattern that extends far beyond basic seq2seq. While we introduced the core concept in the previous page, here we dive deeper into the design principles, variations, and engineering considerations that make encoder-decoder systems work effectively in practice.
The fundamental insight is separation of concerns: the encoder specializes in understanding the input, while the decoder specializes in generating the output. This separation enables mixing and matching components across modalities, sizing each side to the demands of its task, and reusing encoders or decoders across applications.
By the end of this page, you will understand encoder-decoder design principles, state transfer mechanisms, cross-modal applications, asymmetric architectures, and the engineering decisions that affect real-world performance.
Effective encoder-decoder systems follow several key design principles that guide architectural choices.
Principle 1: Representation Completeness
The encoder must produce a representation that contains all information the decoder needs to generate the correct output. Missing information cannot be recovered.
$$I(\mathbf{z}; \mathbf{y}) = I(\mathbf{x}; \mathbf{y})$$
where $\mathbf{z}$ is the encoded representation. When this equality holds, $\mathbf{z}$ is a sufficient statistic for predicting $\mathbf{y}$ from $\mathbf{x}$: encoding loses none of the input's information about the output.
Principle 2: Representational Compactness
The encoding should be as compact as possible while preserving necessary information. Over-encoding wastes parameters, invites overfitting on spurious input details, and slows both training and inference.
Principle 3: Decoder Autonomy
The decoder should be capable of conditional generation—producing valid outputs for any plausible encoding, not just those seen during training. This enables generalization to new inputs.
| Design Choice | Benefit | Cost |
|---|---|---|
| Deep encoder, shallow decoder | Rich source understanding | Less expressive generation |
| Shallow encoder, deep decoder | Powerful generation | May miss source nuances |
| Balanced depth | Versatile | May not optimize either function |
| Bidirectional encoder | Full source context | Cannot share weights with decoder |
| Shared encoder-decoder layers | Parameter efficiency | May limit specialization |
| Large hidden dimension | More capacity | Harder to train, slower inference |
Many successful systems use different capacities for encoder and decoder. For translation, encoders often have more layers (understanding foreign language is harder). For summarization, decoders may need more capacity (generating coherent summaries is harder than understanding articles).
The interface between encoder and decoder—how information transfers from encoding to decoding—is a critical design decision with significant implications.
Strategy 1: Final State Transfer
The simplest approach: initialize decoder hidden state with encoder's final hidden state.
$$\mathbf{h}_0^{\text{dec}} = \mathbf{h}_{T_x}^{\text{enc}}$$
For bidirectional encoders with different dimensions:
$$\mathbf{h}_0^{\text{dec}} = \mathbf{W}_{\text{proj}}[\overrightarrow{\mathbf{h}}_{T_x}; \overleftarrow{\mathbf{h}}_1] + \mathbf{b}_{\text{proj}}$$
Strategy 2: All-States Access (Attention)
Instead of passing only the final state, store all encoder states and let the decoder access them dynamically:
$$\mathbf{c}_t = \sum_{i=1}^{T_x} \alpha_{ti} \mathbf{h}_i^{\text{enc}}$$
where $\alpha_{ti}$ are attention weights (covered in detail in the next page).
Strategy 3: Mean/Max Pooling
Aggregate all encoder states:
$$\mathbf{c} = \frac{1}{T_x} \sum_{i=1}^{T_x} \mathbf{h}_i^{\text{enc}} \quad \text{(mean)}$$ $$\mathbf{c} = \max_i \mathbf{h}_i^{\text{enc}} \quad \text{(max)}$$
```python
import torch
import torch.nn as nn


class StateTransferMethods(nn.Module):
    """
    Various methods for transferring encoder state to decoder.
    This module demonstrates different initialization strategies.
    """

    def __init__(
        self,
        encoder_hidden_dim: int,
        decoder_hidden_dim: int,
        num_encoder_layers: int,
        num_decoder_layers: int,
        bidirectional: bool = True
    ):
        super().__init__()
        self.bidirectional = bidirectional
        self.num_encoder_layers = num_encoder_layers
        self.num_decoder_layers = num_decoder_layers

        # Calculate effective encoder output dimension
        encoder_output_dim = encoder_hidden_dim * (2 if bidirectional else 1)

        # Projection layers for state transfer
        self.hidden_proj = nn.Linear(encoder_output_dim, decoder_hidden_dim)
        self.cell_proj = nn.Linear(encoder_output_dim, decoder_hidden_dim)

        # For multi-layer decoder with different depth
        if num_decoder_layers != num_encoder_layers:
            # Bridge network to map encoder layers to decoder layers
            self.layer_bridge = nn.Linear(
                num_encoder_layers * encoder_output_dim,
                num_decoder_layers * decoder_hidden_dim
            )

    def final_state_transfer(
        self,
        encoder_hidden: torch.Tensor,
        encoder_cell: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Transfer final encoder states to initialize decoder.

        Args:
            encoder_hidden: [num_layers*num_directions, batch, hidden]
            encoder_cell: [num_layers*num_directions, batch, hidden]

        Returns:
            decoder_hidden: [num_decoder_layers, batch, decoder_hidden]
            decoder_cell: [num_decoder_layers, batch, decoder_hidden]
        """
        batch_size = encoder_hidden.size(1)

        if self.bidirectional:
            # Reshape to separate directions: [num_layers, 2, batch, hidden]
            hidden = encoder_hidden.view(
                self.num_encoder_layers, 2, batch_size, -1
            )
            cell = encoder_cell.view(
                self.num_encoder_layers, 2, batch_size, -1
            )
            # Concatenate forward and backward for the last encoder layer
            hidden = torch.cat([hidden[-1, 0], hidden[-1, 1]], dim=-1)
            cell = torch.cat([cell[-1, 0], cell[-1, 1]], dim=-1)
        else:
            # Just take last layer
            hidden = encoder_hidden[-1]
            cell = encoder_cell[-1]

        # Project to decoder dimension
        decoder_hidden = torch.tanh(self.hidden_proj(hidden))
        decoder_cell = torch.tanh(self.cell_proj(cell))

        # Expand to decoder layers (simple: repeat)
        decoder_hidden = decoder_hidden.unsqueeze(0).expand(
            self.num_decoder_layers, -1, -1
        ).contiguous()
        decoder_cell = decoder_cell.unsqueeze(0).expand(
            self.num_decoder_layers, -1, -1
        ).contiguous()

        return decoder_hidden, decoder_cell

    def pooled_state_transfer(
        self,
        encoder_outputs: torch.Tensor,
        src_lengths: torch.Tensor,
        pooling: str = 'mean'
    ) -> torch.Tensor:
        """
        Create context by pooling all encoder outputs.

        Args:
            encoder_outputs: [batch, src_len, encoder_dim]
            src_lengths: [batch] - actual sequence lengths
            pooling: 'mean', 'max', or 'last'

        Returns:
            context: [batch, encoder_dim]
        """
        max_len = encoder_outputs.size(1)
        device = encoder_outputs.device

        # Create mask for variable-length sequences
        mask = torch.arange(max_len, device=device)[None, :] < src_lengths[:, None]
        mask = mask.unsqueeze(-1).float()  # [batch, src_len, 1]

        if pooling == 'mean':
            # Masked mean pooling
            summed = (encoder_outputs * mask).sum(dim=1)
            context = summed / src_lengths.unsqueeze(-1).float()
        elif pooling == 'max':
            # Masked max pooling (set padding to large negative)
            masked_outputs = encoder_outputs.masked_fill(
                ~mask.bool(), float('-inf')
            )
            context, _ = masked_outputs.max(dim=1)
        elif pooling == 'last':
            # Get last valid position for each sequence
            last_indices = (src_lengths - 1).view(-1, 1, 1)
            last_indices = last_indices.expand(-1, 1, encoder_outputs.size(-1))
            context = encoder_outputs.gather(1, last_indices).squeeze(1)
        else:
            raise ValueError(f"Unknown pooling: {pooling}")

        return context

    def layer_wise_transfer(
        self,
        encoder_hidden: torch.Tensor,
        encoder_cell: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Transfer state from each encoder layer to the corresponding decoder
        layer. Handles cases where num_encoder_layers != num_decoder_layers.

        Args:
            encoder_hidden: [num_layers*num_directions, batch, hidden]
            encoder_cell: [num_layers*num_directions, batch, hidden]

        Returns:
            decoder_hidden: [num_decoder_layers, batch, decoder_hidden]
            decoder_cell: [num_decoder_layers, batch, decoder_hidden]
        """
        batch_size = encoder_hidden.size(1)

        if self.bidirectional:
            # Reshape and concatenate directions
            hidden = encoder_hidden.view(
                self.num_encoder_layers, 2, batch_size, -1
            )
            cell = encoder_cell.view(
                self.num_encoder_layers, 2, batch_size, -1
            )
            # Concat forward and backward: [num_layers, batch, 2*hidden]
            hidden = torch.cat([hidden[:, 0], hidden[:, 1]], dim=-1)
            cell = torch.cat([cell[:, 0], cell[:, 1]], dim=-1)

        # If same number of layers, project each layer directly
        if self.num_encoder_layers == self.num_decoder_layers:
            decoder_hidden = torch.tanh(self.hidden_proj(hidden))
            decoder_cell = torch.tanh(self.cell_proj(cell))
        else:
            # Flatten all layers, project through bridge, reshape
            hidden_flat = hidden.permute(1, 0, 2).contiguous()  # [batch, layers, dim]
            hidden_flat = hidden_flat.view(batch_size, -1)      # [batch, layers*dim]
            cell_flat = cell.permute(1, 0, 2).contiguous().view(batch_size, -1)

            # Project through bridge
            hidden_bridged = self.layer_bridge(hidden_flat)
            cell_bridged = self.layer_bridge(cell_flat)

            # Reshape to decoder layers
            decoder_hidden = hidden_bridged.view(
                batch_size, self.num_decoder_layers, -1
            ).permute(1, 0, 2).contiguous()
            decoder_cell = cell_bridged.view(
                batch_size, self.num_decoder_layers, -1
            ).permute(1, 0, 2).contiguous()

            decoder_hidden = torch.tanh(decoder_hidden)
            decoder_cell = torch.tanh(decoder_cell)

        return decoder_hidden, decoder_cell
```

When both encoder and decoder have multiple layers, managing state transfer becomes more nuanced.
Layer Correspondence Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Top-to-Top | Only transfer top encoder layer to bottom decoder layer | Simple, most common |
| Layer-wise | Each encoder layer initializes corresponding decoder layer | Matched architecture |
| Full Bridge | Dense connection from all encoder layers to all decoder layers | Maximum flexibility |
| Selective | Learned selection of which encoder layers feed which decoder layers | Auto-discovered correspondence |
Residual Connections Across Encoder-Decoder
For deep architectures, residual connections can span the encoder-decoder boundary, shortening the gradient path from the decoder's loss back to early encoder layers.
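As a minimal sketch of this idea (the module name and the choice of a mean-pooled encoder summary are illustrative, not a standard recipe): project a summary of the encoder states and add it to each decoder layer's input, forming a shortcut across the boundary.

```python
import torch
import torch.nn as nn


class ResidualBridge(nn.Module):
    """Illustrative sketch: add a projected encoder summary to a
    decoder layer's input, forming a residual path across the boundary."""

    def __init__(self, encoder_dim: int, decoder_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, decoder_dim)

    def forward(self, decoder_layer_input: torch.Tensor,
                encoder_summary: torch.Tensor) -> torch.Tensor:
        # decoder_layer_input: [batch, trg_len, decoder_dim]
        # encoder_summary:     [batch, encoder_dim] (e.g., mean-pooled states)
        shortcut = self.proj(encoder_summary).unsqueeze(1)  # [batch, 1, dec_dim]
        return decoder_layer_input + shortcut  # broadcast over trg_len


bridge = ResidualBridge(encoder_dim=32, decoder_dim=16)
out = bridge(torch.randn(4, 7, 16), torch.randn(4, 32))
assert out.shape == (4, 7, 16)
```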
Depth Considerations
| Configuration | Total Gradient Path | Training Difficulty |
|---|---|---|
| 2-layer encoder + 2-layer decoder | $T_x + 2 + T_y + 2$ | Moderate |
| 4-layer encoder + 4-layer decoder | $T_x + 4 + T_y + 4$ | Challenging |
| 6-layer encoder + 6-layer decoder | $T_x + 6 + T_y + 6$ | Requires residual/attention |
Practical Recommendation:
For deep encoder-decoder models (≥4 layers each): use residual connections within both stacks, add layer normalization, rely on attention rather than final-state transfer alone, and combine gradient clipping with learning-rate warmup.
Encoder and decoder don't need to be symmetric. Asymmetric architectures match capacity to task demands.
Encoder-Heavy (Understanding-Focused): allocate more layers or capacity to the encoder when comprehending the source is the hard part, as in translation from a complex source language.
Decoder-Heavy (Generation-Focused): allocate more capacity to the decoder when fluent, coherent generation is the hard part, as in abstractive summarization.
Mixed Modality
Different modalities often require radically different encoders and decoders:
| Task | Encoder | Decoder | Interface |
|---|---|---|---|
| Image Captioning | CNN (ResNet, ViT) | LSTM/Transformer | Image features → RNN state |
| Speech Recognition | CNN + RNN (acoustic) | RNN (language model) | Audio frames → text tokens |
| Video Description | 3D CNN + Temporal RNN | Language LSTM | Spatio-temporal features → words |
| Document QA | BERT/RoBERTa | Pointer network | Contextualized embeddings → spans |
| Text-to-Speech | Text encoder (RNN) | Spectrogram decoder (autoregressive) | Phoneme sequence → acoustic features |
When encoder and decoder process different modalities, a learned projection layer is essential. The projection bridges the representation spaces, mapping image features to the manifold expected by a text decoder, or acoustic embeddings to language model hidden states.
```python
import torch
import torch.nn as nn
import torchvision.models as models


class ImageCaptioningModel(nn.Module):
    """
    Asymmetric encoder-decoder for image captioning.
    CNN encoder (frozen or fine-tuned) + LSTM decoder.
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 256,
        hidden_dim: int = 512,
        num_decoder_layers: int = 2,
        encoder_pretrained: bool = True,
        encoder_finetune: bool = False
    ):
        super().__init__()

        # === CNN ENCODER ===
        # Use pretrained ResNet-50 as feature extractor
        resnet = models.resnet50(pretrained=encoder_pretrained)
        # Remove final classification layer
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])

        # Freeze encoder if not fine-tuning
        if not encoder_finetune:
            for param in self.encoder.parameters():
                param.requires_grad = False

        # CNN outputs [batch, 2048, 7, 7] for 224x224 input
        encoder_dim = 2048

        # === BRIDGE: CNN features → RNN state ===
        # Project pooled features to decoder hidden dimension
        self.feature_projection = nn.Linear(encoder_dim, hidden_dim)

        # Initialize decoder hidden/cell from image features
        self.init_h = nn.Linear(encoder_dim, hidden_dim)
        self.init_c = nn.Linear(encoder_dim, hidden_dim)

        # === LSTM DECODER ===
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(
            input_size=embed_dim + hidden_dim,  # Attend to image
            hidden_size=hidden_dim,
            num_layers=num_decoder_layers,
            batch_first=True,
            dropout=0.3 if num_decoder_layers > 1 else 0
        )

        # Simple attention over spatial positions
        self.attention = nn.Linear(hidden_dim + encoder_dim, 1)

        # Output projection
        self.fc_out = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(0.5)

        self.hidden_dim = hidden_dim
        self.encoder_dim = encoder_dim

    def encode(self, images: torch.Tensor) -> torch.Tensor:
        """
        Encode images to feature maps.

        Args:
            images: [batch, 3, 224, 224]

        Returns:
            features: [batch, 49, 2048] - flattened spatial features
        """
        with torch.set_grad_enabled(self.encoder[0].weight.requires_grad):
            features = self.encoder(images)  # [batch, 2048, 7, 7]

        # Flatten spatial dimensions
        batch_size = features.size(0)
        features = features.view(batch_size, self.encoder_dim, -1)  # [batch, 2048, 49]
        features = features.permute(0, 2, 1)  # [batch, 49, 2048]

        return features

    def init_decoder_state(
        self,
        features: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Initialize decoder state from image features.

        Args:
            features: [batch, 49, 2048]

        Returns:
            h0: [num_layers, batch, hidden]
            c0: [num_layers, batch, hidden]
        """
        # Pool over spatial positions
        mean_features = features.mean(dim=1)  # [batch, 2048]

        # Project to hidden dimension
        h0 = torch.tanh(self.init_h(mean_features))  # [batch, hidden]
        c0 = torch.tanh(self.init_c(mean_features))  # [batch, hidden]

        # Expand to decoder layers
        h0 = h0.unsqueeze(0).expand(self.decoder.num_layers, -1, -1).contiguous()
        c0 = c0.unsqueeze(0).expand(self.decoder.num_layers, -1, -1).contiguous()

        return h0, c0

    def attend(
        self,
        features: torch.Tensor,
        decoder_hidden: torch.Tensor
    ) -> torch.Tensor:
        """
        Attend to image features based on decoder state.

        Args:
            features: [batch, 49, 2048]
            decoder_hidden: [batch, hidden]

        Returns:
            context: [batch, hidden] - attended image features
        """
        num_positions = features.size(1)

        # Expand decoder hidden to match spatial positions
        decoder_expanded = decoder_hidden.unsqueeze(1).expand(-1, num_positions, -1)

        # Concatenate and score
        combined = torch.cat([features, decoder_expanded], dim=-1)
        scores = self.attention(combined).squeeze(-1)  # [batch, 49]

        # Softmax attention weights
        weights = torch.softmax(scores, dim=-1)  # [batch, 49]

        # Weighted sum of features
        context = torch.bmm(weights.unsqueeze(1), features)  # [batch, 1, 2048]
        context = context.squeeze(1)  # [batch, 2048]

        # Project to hidden dimension
        context = self.feature_projection(context)  # [batch, hidden]

        return context

    def forward(
        self,
        images: torch.Tensor,
        captions: torch.Tensor,
        teacher_forcing_ratio: float = 1.0
    ) -> torch.Tensor:
        """
        Training forward pass.

        Args:
            images: [batch, 3, 224, 224]
            captions: [batch, caption_len] - ground truth captions with <sos>
            teacher_forcing_ratio: probability of using ground truth

        Returns:
            outputs: [batch, caption_len-1, vocab_size]
        """
        batch_size = images.size(0)
        caption_len = captions.size(1)

        # Encode image
        features = self.encode(images)  # [batch, 49, 2048]

        # Initialize decoder
        hidden, cell = self.init_decoder_state(features)

        # Store outputs
        outputs = torch.zeros(batch_size, caption_len - 1, self.fc_out.out_features)
        outputs = outputs.to(images.device)

        # First input is <sos>
        decoder_input = captions[:, 0]  # [batch]

        for t in range(1, caption_len):
            # Embed input
            embedded = self.dropout(self.embedding(decoder_input))  # [batch, embed]

            # Attend to image
            h_for_attention = hidden[-1]  # Use top layer for attention
            context = self.attend(features, h_for_attention)  # [batch, hidden]

            # Concatenate embedding and context
            rnn_input = torch.cat([embedded, context], dim=-1)  # [batch, embed+hidden]
            rnn_input = rnn_input.unsqueeze(1)  # [batch, 1, embed+hidden]

            # Decoder step
            output, (hidden, cell) = self.decoder(rnn_input, (hidden, cell))
            output = output.squeeze(1)  # [batch, hidden]

            # Project to vocabulary
            output = self.fc_out(self.dropout(output))  # [batch, vocab]
            outputs[:, t - 1] = output

            # Next input
            use_tf = torch.rand(1).item() < teacher_forcing_ratio
            decoder_input = captions[:, t] if use_tf else output.argmax(dim=-1)

        return outputs
```

In some scenarios, sharing parameters between encoder and decoder improves regularization and efficiency.
Weight Tying Strategies
| Sharing Type | What's Shared | When to Use |
|---|---|---|
| Embedding tying | Source and target embeddings | Same/similar vocabulary |
| Output-embedding tying | Decoder embeddings and output projection | Saves $\lvert V\rvert \times d$ parameters |
| Encoder-decoder tying | All or some layers | Auto-encoding tasks |
| Cross-lingual tying | Embeddings across languages | Multilingual models |
Three-Way Tying
For same-vocabulary tasks (e.g., summarization, paraphrasing), three-way tying combines the source (encoder) input embeddings, the target (decoder) input embeddings, and the decoder output projection.
All three become the same matrix (with appropriate transposition for output):
$$\mathbf{E}_x = \mathbf{E}_y = \mathbf{W}_{\text{out}}^\top$$
Weight tying reduces parameters significantly (vocabulary is often the largest parameter component) and can improve generalization by forcing embeddings to work in both encoding and generation contexts. Press & Wolf (2017) showed consistent improvements from output-embedding tying.
```python
import torch
import torch.nn as nn


class TiedSeq2Seq(nn.Module):
    """
    Seq2Seq with weight tying between embeddings and output layer.
    Suitable for same-vocabulary tasks.
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        hidden_dim: int,
        num_layers: int = 2,
        tie_weights: bool = True
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.tie_weights = tie_weights

        # Single embedding matrix (will be shared)
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # Encoder
        self.encoder = nn.LSTM(
            embed_dim, hidden_dim, num_layers,
            batch_first=True, bidirectional=True
        )

        # Bridge from bidirectional to unidirectional
        self.bridge = nn.Linear(hidden_dim * 2, hidden_dim)

        # Decoder
        self.decoder = nn.LSTM(
            embed_dim, hidden_dim, num_layers, batch_first=True
        )

        # Output projection
        if tie_weights:
            # Output projection will use transposed embedding weights;
            # only need a bias term
            self.output_bias = nn.Parameter(torch.zeros(vocab_size))

            # Projection from hidden to embed (if dimensions differ)
            if hidden_dim != embed_dim:
                self.output_projection = nn.Linear(hidden_dim, embed_dim, bias=False)
            else:
                self.output_projection = None
        else:
            # Standard untied output layer
            self.output_layer = nn.Linear(hidden_dim, vocab_size)

    def get_output_embeddings(self) -> nn.Embedding:
        """Return the output embedding matrix (same as input if tied)."""
        return self.embedding

    def compute_output(self, hidden: torch.Tensor) -> torch.Tensor:
        """
        Compute output logits from decoder hidden state.

        Args:
            hidden: [batch, hidden_dim]

        Returns:
            logits: [batch, vocab_size]
        """
        if self.tie_weights:
            # Project to embedding dimension if needed
            if self.output_projection is not None:
                hidden = self.output_projection(hidden)

            # Multiply by embedding weights (transposed)
            # embedding.weight: [vocab_size, embed_dim]
            # hidden: [batch, embed_dim]
            logits = torch.matmul(hidden, self.embedding.weight.t())
            logits = logits + self.output_bias
        else:
            logits = self.output_layer(hidden)

        return logits

    def forward(
        self,
        src: torch.Tensor,
        trg: torch.Tensor
    ) -> torch.Tensor:
        """
        Forward pass with weight tying.

        Args:
            src: [batch, src_len]
            trg: [batch, trg_len]

        Returns:
            outputs: [batch, trg_len-1, vocab_size]
        """
        # Embed with shared embeddings
        src_embedded = self.embedding(src)
        trg_embedded = self.embedding(trg[:, :-1])  # Exclude last for teacher forcing

        # Encode
        encoder_outputs, (hidden, cell) = self.encoder(src_embedded)

        # Bridge bidirectional → unidirectional
        # hidden: [num_layers*2, batch, hidden]
        num_layers = hidden.size(0) // 2
        hidden = hidden.view(num_layers, 2, -1, self.hidden_dim)
        cell = cell.view(num_layers, 2, -1, self.hidden_dim)

        # Concatenate and project
        hidden = self.bridge(torch.cat([hidden[:, 0], hidden[:, 1]], dim=-1))
        cell = self.bridge(torch.cat([cell[:, 0], cell[:, 1]], dim=-1))

        # Decode
        decoder_outputs, _ = self.decoder(trg_embedded, (hidden, cell))

        # Compute output logits for each position
        batch_size, seq_len, _ = decoder_outputs.size()
        decoder_outputs = decoder_outputs.contiguous().view(-1, self.hidden_dim)
        logits = self.compute_output(decoder_outputs)
        logits = logits.view(batch_size, seq_len, -1)

        return logits

    def count_parameters(self) -> dict:
        """Count parameters with and without tying."""
        total = sum(p.numel() for p in self.parameters())
        embedding_params = self.embedding.weight.numel()
        return {
            'total': total,
            'embedding': embedding_params,
            'savings_from_tying': embedding_params if self.tie_weights else 0,
            'effective_total': total - (embedding_params if self.tie_weights else 0)
        }
```

Training encoder-decoder models effectively requires careful attention to several strategies beyond basic gradient descent.
Loss Masking
Sequences in a batch have different lengths, requiring padding. Loss should be computed only on valid positions:
$$\mathcal{L} = -\frac{1}{\sum_t m_t} \sum_{t=1}^{T} m_t \log P(y_t^* | \hat{y}_{<t}, \mathbf{x})$$
where $m_t = 1$ if position $t$ is valid, $0$ for padding.
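The masked loss above maps directly to code. A minimal sketch (tensor shapes and the padding index are illustrative), showing that the manual masked average matches PyTorch's built-in `ignore_index` handling:

```python
import torch
import torch.nn.functional as F

PAD_IDX = 0  # illustrative padding token id
logits = torch.randn(2, 4, 5)                       # [batch, T, vocab]
targets = torch.tensor([[3, 1, 2, PAD_IDX],         # second sequence is shorter
                        [4, 2, PAD_IDX, PAD_IDX]])

# Manual masked loss: m_t = 1 on valid positions, 0 on padding
log_probs = F.log_softmax(logits, dim=-1)
token_loss = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [batch, T]
mask = (targets != PAD_IDX).float()
loss_manual = (token_loss * mask).sum() / mask.sum()

# Equivalent built-in: ignore_index skips padded positions
loss_builtin = F.cross_entropy(
    logits.view(-1, 5), targets.view(-1), ignore_index=PAD_IDX
)

assert torch.allclose(loss_manual, loss_builtin, atol=1e-6)
```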
Gradient Clipping
Encoder-decoder models are susceptible to exploding gradients, especially with long sequences:
$$\tilde{\mathbf{g}} = \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}$$
Typical clip value $\theta \in [1.0, 5.0]$.
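In PyTorch this is a one-liner between `backward()` and the optimizer step; a small sketch (the tiny LSTM stands in for a full encoder-decoder model):

```python
import torch
import torch.nn as nn

model = nn.LSTM(8, 16)  # stand-in for an encoder-decoder model
x = torch.randn(5, 3, 8)
out, _ = model(x)
out.sum().backward()

# Clip the global gradient norm to theta = 1.0 before the optimizer step.
theta = 1.0
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=theta)

# After clipping, the global gradient norm is at most theta.
grads = torch.cat([p.grad.view(-1) for p in model.parameters()])
assert grads.norm() <= theta + 1e-4
```

Note that `clip_grad_norm_` returns the norm *before* clipping, which is worth logging: frequent large values signal instability.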
Label Smoothing
Soften one-hot targets to prevent overconfidence:
$$y_k^{\text{smooth}} = (1-\epsilon) y_k + \frac{\epsilon}{|V|}$$
with $\epsilon \approx 0.1$ common.
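The smoothed-target formula can be checked against PyTorch's built-in `label_smoothing` argument (available in `F.cross_entropy` since PyTorch 1.10); shapes here are illustrative:

```python
import torch
import torch.nn.functional as F

vocab_size, eps = 5, 0.1
logits = torch.randn(3, vocab_size)
targets = torch.tensor([1, 0, 4])

# Built-in label smoothing
loss_builtin = F.cross_entropy(logits, targets, label_smoothing=eps)

# Manual smoothed targets: (1 - eps) * one-hot + eps/|V| uniform,
# so the gold token gets probability (1 - eps) + eps/|V|
smooth = torch.full((3, vocab_size), eps / vocab_size)
smooth.scatter_(1, targets.unsqueeze(1), 1 - eps + eps / vocab_size)
loss_manual = -(smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

assert torch.allclose(loss_builtin, loss_manual, atol=1e-5)
```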
Cross-entropy loss during training doesn't perfectly correlate with generation quality (BLEU, ROUGE, etc.). A model with slightly higher validation loss may actually produce better translations. Always monitor actual metrics, not just loss.
Building encoder-decoder systems involves navigating several common failure modes.
| Problem | Symptoms | Solution |
|---|---|---|
| Repetitive outputs | Decoder repeats same phrase/token | Coverage mechanism, repetition penalty, nucleus sampling |
| Premature EOS | Very short outputs | Length penalty in beam search, minimum length constraint |
| Generic outputs | Safe, common phrases regardless of input | Reduce teacher forcing, increase diversity, check data balance |
| Hallucination | Decoder invents content not in source | Stronger attention to source, copy mechanisms |
| Catastrophic forgetting | Long-range info lost in decoder | Attention mechanism, deeper decoder, residual connections |
| Exposure bias | Good loss, poor generation | Scheduled sampling, sequence-level training |
| Mode collapse | All inputs → similar outputs | Check encoder capacity, data diversity, temperature sampling |
| Training instability | Loss spikes, NaN values | Lower LR, gradient clipping, layer norm, check for data issues |
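As one concrete fix from the table, a repetition penalty can be applied at decode time by down-weighting the logits of tokens already generated. A sketch of the common CTRL-style formulation (the penalty value 1.2 is a typical but illustrative choice):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated: list[int],
                             penalty: float = 1.2) -> torch.Tensor:
    """Down-weight tokens already generated (CTRL-style penalty)."""
    logits = logits.clone()
    for tok in set(generated):
        # Divide positive logits, multiply negative ones, so the
        # penalized token's probability always decreases.
        if logits[tok] > 0:
            logits[tok] = logits[tok] / penalty
        else:
            logits[tok] = logits[tok] * penalty
    return logits

logits = torch.tensor([2.0, -1.0, 0.5])
penalized = apply_repetition_penalty(logits, generated=[0, 1])
assert penalized[0] < logits[0]   # positive logit divided by penalty
assert penalized[1] < logits[1]   # negative logit multiplied by penalty
assert penalized[2] == logits[2]  # unseen token untouched
```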
When generation quality is poor: (1) Visualize attention weights to check source-target alignment, (2) Compare greedy vs beam search outputs, (3) Check if teacher forcing ratio affects validation quality, (4) Sample multiple outputs to see diversity, (5) Test on simple synthetic data first.
We have thoroughly explored encoder-decoder architectures—the general framework underlying sequence transduction tasks. The key takeaways: the encoder and decoder separate understanding from generation; the state-transfer interface (final state, pooling, or attention) is a central design decision; capacities can be asymmetric and modality-specific; weight tying saves parameters and regularizes; and training details such as loss masking, gradient clipping, and label smoothing matter in practice.
What's Next:
We've established the encoder-decoder framework and its limitations (particularly the information bottleneck). The next page introduces Attention Preview—how attention mechanisms revolutionize the encoder-decoder interface by enabling dynamic, position-specific access to encoded representations.
You now understand encoder-decoder design principles, state transfer mechanisms, asymmetric architectures, weight tying, and practical training strategies. This framework underlies machine translation, summarization, and countless other sequence transduction applications.