The Transformer architecture, introduced in the seminal 2017 paper "Attention Is All You Need" by Vaswani et al., fundamentally reimagined how neural networks process sequential data. At its core lies an encoder-decoder structure—a design pattern with roots in earlier sequence-to-sequence models, but implemented here with a radical departure: the complete elimination of recurrence and convolutions in favor of pure attention mechanisms.
Understanding this encoder-decoder structure is essential because it provides the conceptual framework upon which all subsequent Transformer variants are built. Whether you're working with BERT (encoder-only), GPT (decoder-only), or T5 (full encoder-decoder), the principles established in the original architecture inform every design decision.
This page provides an exhaustive exploration of the encoder-decoder architecture, examining how information flows through the system, why specific design choices were made, and how the components work together to achieve state-of-the-art performance on sequence-to-sequence tasks.
This page assumes familiarity with attention mechanisms, self-attention, and multi-head attention from previous modules. We build upon those concepts to understand how they're orchestrated at the architectural level.
Before diving into the Transformer's encoder-decoder structure, we must understand the landscape it emerged from. This context illuminates why the architecture was designed as it was.
The Sequence-to-Sequence Paradigm
Sequence-to-sequence (seq2seq) learning addresses the challenge of mapping one variable-length sequence to another. This encompasses machine translation (English → French), summarization (document → summary), speech recognition (audio → text), and countless other tasks.
The dominant approach before 2017 used recurrent neural networks (RNNs), particularly LSTMs and GRUs, in an encoder-decoder configuration: an encoder RNN reads the input token by token and compresses it into a single fixed-length context vector, which a decoder RNN then unrolls into the output sequence.
This approach, pioneered by Sutskever et al. (2014) and Cho et al. (2014), achieved impressive results but suffered from fundamental limitations: the fixed-length context vector becomes an information bottleneck for long inputs, and the strictly sequential computation prevents parallelization across time steps.
The Attention Solution
The attention mechanism, introduced by Bahdanau et al. (2015), addressed the bottleneck problem by allowing the decoder to "look back" at all encoder hidden states rather than relying solely on the context vector. At each decoding step, the decoder computes a weighted average of the encoder states, with the weights learned from data:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$
where $\alpha_{ij}$ is the attention weight between decoder position $i$ and encoder position $j$, $e_{ij}$ is an alignment score, and $c_i$ is the context vector for decoder step $i$.
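To make these two equations concrete, here is a minimal PyTorch sketch of a single decoder step. It uses a plain dot product as the alignment score $e_{ij}$ in place of Bahdanau's learned feed-forward scorer, and the tensors `h` and `s_i` are toy stand-ins for encoder and decoder hidden states.

```python
import torch

def attention_context(decoder_state, encoder_states):
    """Compute attention weights alpha_ij and context vector c_i for one decoder step.

    Uses a dot-product alignment score as a simple stand-in for Bahdanau's
    learned feed-forward scorer.

    Args:
        decoder_state: [d_hidden] decoder hidden state s_i at step i
        encoder_states: [T_x, d_hidden] encoder hidden states h_1..h_Tx
    Returns:
        context: [d_hidden] weighted sum of encoder states (c_i)
        alpha: [T_x] attention weights (sum to 1)
    """
    e = encoder_states @ decoder_state      # alignment scores e_ij, shape [T_x]
    alpha = torch.softmax(e, dim=0)         # alpha_ij = exp(e_ij) / sum_k exp(e_ik)
    context = alpha @ encoder_states        # c_i = sum_j alpha_ij * h_j
    return context, alpha

# Toy example: 5 encoder positions, hidden size 8
h = torch.randn(5, 8)
s_i = torch.randn(8)
c_i, alpha = attention_context(s_i, h)
print(alpha.sum())   # tensor(1.) -- the weights form a distribution over source positions
```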
This attention-augmented encoder-decoder became the state-of-the-art for machine translation. But it still relied on RNNs for sequential processing—attention was an addition to the recurrent architecture, not a replacement.
The Transformer's genius was recognizing that attention alone—without any recurrence—could serve as the complete computational mechanism. If attention lets you access any position directly, why maintain the sequential processing constraint at all?
The Transformer maintains the encoder-decoder structure but reimplements both components using attention mechanisms exclusively. Let's first understand the high-level information flow before diving into details.
The Encoder Stack
The encoder's job is to transform an input sequence $X = (x_1, x_2, ..., x_n)$ into a sequence of continuous representations $Z = (z_1, z_2, ..., z_n)$. Unlike RNN encoders, the Transformer encoder processes all positions simultaneously: every token can attend directly to every other token via self-attention, so no information has to be squeezed through a sequential bottleneck.
The original Transformer uses 6 identical encoder layers stacked on top of each other. Each layer refines the representations, with lower layers capturing local patterns and higher layers capturing more global, abstract relationships.
The Decoder Stack
The decoder generates the output sequence $Y = (y_1, y_2, ..., y_m)$ one token at a time, conditioned on the encoder representations $Z$ and the previously generated tokens. The decoder uses masked self-attention so that each position sees only earlier output positions, attends to the encoder output through cross-attention, and ends with a linear projection and softmax that produce a distribution over the target vocabulary.
Information Flow Summary
The encoder processes all input positions in parallel during both training and inference. The decoder, however, generates outputs autoregressively during inference (one token at a time), though training parallelizes across positions using teacher forcing and masked attention.
Each encoder layer is a self-contained computational unit that transforms its input representations while maintaining the sequence length. Let's dissect each component.
Encoder Layer Structure
Each of the $N$ encoder layers contains two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network.
Around each sub-layer, the architecture employs a residual connection followed by layer normalization.
Mathematically, for input $x$ to each sub-layer:
$$\text{Output} = \text{LayerNorm}(x + \text{SubLayer}(x))$$
Input Processing
Before the first encoder layer, input tokens are converted to vectors by a learned embedding layer (scaled by $\sqrt{d_{model}}$) and combined with sinusoidal positional encodings:
$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$ $$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$
The combined input is thus: $$X_0 = \sqrt{d_{model}} \cdot \text{Embed}(tokens) + PE$$
```python
import torch
import torch.nn as nn
import math


class EncoderLayer(nn.Module):
    """
    A single Transformer encoder layer.

    Architecture:
        Input → MultiHeadAttention → Add & Norm → FFN → Add & Norm → Output

    All sequence positions are processed in parallel.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        # Multi-head self-attention
        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True
        )

        # Position-wise feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

        # Layer normalization (Post-LN variant)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        """
        Forward pass through the encoder layer.

        Args:
            x: Input tensor of shape [batch_size, seq_len, d_model]
            mask: Optional attention mask (e.g., for padding)

        Returns:
            Output tensor of shape [batch_size, seq_len, d_model]
        """
        # 1. Multi-head self-attention with residual + layer norm
        attn_output, _ = self.self_attn(
            query=x, key=x, value=x,
            key_padding_mask=mask
        )
        x = self.norm1(x + self.dropout(attn_output))

        # 2. Feed-forward network with residual + layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)

        return x


class TransformerEncoder(nn.Module):
    """
    Complete Transformer encoder: embedding + N encoder layers.
    """

    def __init__(
        self,
        vocab_size: int,
        max_seq_len: int,
        d_model: int = 512,
        n_layers: int = 6,
        n_heads: int = 8,
        d_ff: int = 2048,
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model

        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Positional encoding (sinusoidal, fixed)
        self.register_buffer(
            'positional_encoding',
            self._create_positional_encoding(max_seq_len, d_model)
        )

        # Stack of encoder layers
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        self.dropout = nn.Dropout(dropout)

    def _create_positional_encoding(self, max_len: int, d_model: int) -> torch.Tensor:
        """Create sinusoidal positional encodings."""
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe.unsqueeze(0)  # [1, max_len, d_model]

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        """
        Encode input sequence.

        Args:
            x: Token indices of shape [batch_size, seq_len]
            mask: Optional padding mask

        Returns:
            Encoded representations [batch_size, seq_len, d_model]
        """
        seq_len = x.size(1)

        # Embed tokens and scale
        x = self.token_embedding(x) * math.sqrt(self.d_model)

        # Add positional encoding
        x = x + self.positional_encoding[:, :seq_len, :]
        x = self.dropout(x)

        # Pass through encoder layers
        for layer in self.layers:
            x = layer(x, mask)

        return x
```

Understanding Encoder Self-Attention
In the encoder, self-attention allows every position to attend to every other position in the input sequence. This enables the model to build contextualized representations where each token's representation incorporates information from the entire sequence.
Consider encoding the sentence "The cat sat on the mat": when the model computes the representation of "sat", self-attention lets it draw directly on "cat" (the subject) and "mat" (the location), producing a contextualized vector for the verb in a single layer.
The encoder attention is bidirectional—position 3 can attend to both position 1 and position 5. This is crucial for understanding: disambiguating a word often requires context from both directions.
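A small sketch of this bidirectionality using `nn.MultiheadAttention`, with random vectors standing in for the six token embeddings: because no mask is supplied, the returned attention matrix is a full 6×6 grid, so every position (including "sat" at position 3) receives weight from positions on both sides.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_heads, seq_len = 512, 8, 6      # six tokens: "The cat sat on the mat"
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)       # random stand-ins for token embeddings
_, weights = attn(x, x, x)                 # no mask: every position attends everywhere

print(weights.shape)                        # torch.Size([1, 6, 6])
print((weights > 0).all())                  # tensor(True): no position is blocked
```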
Why Stack Multiple Layers?
Each encoder layer refines the representations by mixing information across positions through self-attention and then transforming each position independently through the feed-forward network, with residual connections preserving what earlier layers have already computed.
This hierarchical refinement is analogous to how CNNs build from edges → textures → parts → objects, but for sequential data.
The encoder's output is a sequence of contextualized vectors—one per input token. Unlike an RNN's final hidden state which "summarizes" the input, the Transformer encoder preserves all positions, allowing the decoder to selectively attend to whichever input positions are relevant for each output token.
The decoder is more complex than the encoder, featuring three sub-layers per layer instead of two. It must perform two distinct tasks: model the target sequence generated so far, and incorporate relevant information from the encoded source sequence.
Decoder Layer Structure
Each decoder layer contains three sub-layers: masked multi-head self-attention over the target, cross-attention over the encoder output, and a position-wise feed-forward network.
The Masking Mechanism
During training, we feed the entire target sequence to the decoder simultaneously for parallel computation. However, we must prevent position $i$ from "seeing" future positions ($i+1$, $i+2$, etc.) during self-attention—otherwise the model would "cheat" by looking at the answer.
This is achieved through causal masking (also called "look-ahead" masking):
$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$
When added to attention scores before softmax, positions $j > i$ receive effectively zero attention weight.
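A small numeric sketch of this mechanism: adding the mask to random scores and applying softmax row by row leaves each position with non-zero weight only on itself and earlier positions.

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)      # raw attention scores QK^T / sqrt(d_k)

# Causal mask: 0 where j <= i, -inf where j > i
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

weights = torch.softmax(scores + mask, dim=-1)
print(weights)
# Row i has non-zero weights only in columns 0..i; each row still sums to 1.
```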
Cross-Attention: The Bridge
Cross-attention is where encoder and decoder meet. In this layer, the queries come from the previous decoder sub-layer, while the keys and values come from the encoder's output.
This allows each decoder position to "look at" the entire encoded input and decide which input tokens are most relevant for generating the next output token.
For machine translation, this manifests as soft alignment: when generating the French word "chat", for example, the decoder places most of its cross-attention weight on the English word "cat".
```python
import torch
import torch.nn as nn
import math


class DecoderLayer(nn.Module):
    """
    A single Transformer decoder layer.

    Architecture:
        Input → Masked Self-Attention → Add & Norm
              → Cross-Attention → Add & Norm
              → FFN → Add & Norm → Output
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        # Masked multi-head self-attention
        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True
        )

        # Cross-attention (attending to encoder output)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True
        )

        # Position-wise feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        encoder_output: torch.Tensor,
        tgt_mask: torch.Tensor = None,
        memory_mask: torch.Tensor = None
    ) -> torch.Tensor:
        """
        Forward pass through the decoder layer.

        Args:
            x: Decoder input [batch_size, tgt_len, d_model]
            encoder_output: Encoder output [batch_size, src_len, d_model]
            tgt_mask: Causal mask for self-attention
            memory_mask: Optional mask for cross-attention (e.g., padding)

        Returns:
            Decoder output [batch_size, tgt_len, d_model]
        """
        # 1. Masked self-attention
        self_attn_output, _ = self.self_attn(
            query=x, key=x, value=x,
            attn_mask=tgt_mask  # Causal mask
        )
        x = self.norm1(x + self.dropout(self_attn_output))

        # 2. Cross-attention to encoder output
        cross_attn_output, _ = self.cross_attn(
            query=x,                 # Query from decoder
            key=encoder_output,      # Key from encoder
            value=encoder_output,    # Value from encoder
            key_padding_mask=memory_mask
        )
        x = self.norm2(x + self.dropout(cross_attn_output))

        # 3. Feed-forward network
        ff_output = self.feed_forward(x)
        x = self.norm3(x + ff_output)

        return x


def generate_causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    """
    Generate causal (look-ahead) mask for decoder self-attention.

    Returns a mask where position i can only attend to positions <= i.
    Upper triangular entries (j > i) are set to -inf.
    """
    mask = torch.triu(
        torch.ones(seq_len, seq_len, device=device) * float('-inf'),
        diagonal=1
    )
    return mask


class TransformerDecoder(nn.Module):
    """
    Complete Transformer decoder: embedding + N decoder layers + output projection.
    """

    def __init__(
        self,
        vocab_size: int,
        max_seq_len: int,
        d_model: int = 512,
        n_layers: int = 6,
        n_heads: int = 8,
        d_ff: int = 2048,
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model

        # Token embedding (often shared with encoder and output projection)
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Positional encoding
        self.register_buffer(
            'positional_encoding',
            self._create_positional_encoding(max_seq_len, d_model)
        )

        # Stack of decoder layers
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        # Output projection to vocabulary
        self.output_projection = nn.Linear(d_model, vocab_size)

        self.dropout = nn.Dropout(dropout)

    def _create_positional_encoding(self, max_len: int, d_model: int) -> torch.Tensor:
        """Create sinusoidal positional encodings."""
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe.unsqueeze(0)

    def forward(
        self,
        tgt: torch.Tensor,
        encoder_output: torch.Tensor,
        memory_mask: torch.Tensor = None
    ) -> torch.Tensor:
        """
        Decode target sequence given encoder output.

        Args:
            tgt: Target token indices [batch_size, tgt_len]
            encoder_output: Encoder output [batch_size, src_len, d_model]
            memory_mask: Optional padding mask for source

        Returns:
            Logits over vocabulary [batch_size, tgt_len, vocab_size]
        """
        seq_len = tgt.size(1)

        # Generate causal mask
        tgt_mask = generate_causal_mask(seq_len, tgt.device)

        # Embed and add positional encoding
        x = self.token_embedding(tgt) * math.sqrt(self.d_model)
        x = x + self.positional_encoding[:, :seq_len, :]
        x = self.dropout(x)

        # Pass through decoder layers
        for layer in self.layers:
            x = layer(x, encoder_output, tgt_mask, memory_mask)

        # Project to vocabulary
        logits = self.output_projection(x)

        return logits
```

| Aspect | Encoder Self-Attention | Decoder Self-Attention |
|---|---|---|
| Attention Pattern | Bidirectional (all positions) | Unidirectional (past positions only) |
| Masking | None (or padding mask only) | Causal mask (upper triangular) |
| Purpose | Build contextualized input representations | Model dependencies in generated sequence |
| Parallelization | Fully parallel (training + inference) | Parallel training, sequential inference |
| Information Flow | Position i sees positions 1...n | Position i sees positions 1...i |
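As a quick check of how the pieces defined above fit together, the following sketch wires a `TransformerEncoder` and `TransformerDecoder` into a single forward pass; the vocabulary sizes, batch size, and sequence lengths are arbitrary illustrative values.

```python
# Usage sketch for the classes defined above (illustrative sizes only).
import torch

src_vocab, tgt_vocab, max_len = 10_000, 10_000, 512
encoder = TransformerEncoder(vocab_size=src_vocab, max_seq_len=max_len)
decoder = TransformerDecoder(vocab_size=tgt_vocab, max_seq_len=max_len)

src = torch.randint(0, src_vocab, (2, 20))   # [batch=2, src_len=20] token ids
tgt = torch.randint(0, tgt_vocab, (2, 15))   # [batch=2, tgt_len=15] token ids

memory = encoder(src)                        # [2, 20, 512] contextualized source
logits = decoder(tgt, memory)                # [2, 15, tgt_vocab] next-token logits
print(memory.shape, logits.shape)
```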
Cross-attention is the critical bridge between the encoder and decoder. Understanding its mechanics deeply is essential for grasping how information flows from input to output.
The Query-Key-Value Split
In cross-attention, the queries are projected from the decoder's hidden states, while the keys and values are projected from the encoder's output:
$$Q = W^Q \cdot \text{DecoderHidden}$$ $$K = W^K \cdot \text{EncoderOutput}$$ $$V = W^V \cdot \text{EncoderOutput}$$
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The decoder's current state (query) asks: "What information from the input should I retrieve?" The encoder's output (keys and values) answers: "Here's what's available, and here's the information associated with each."
Information Retrieval Interpretation
Think of cross-attention as a soft database lookup: the decoder state acts as the query, the encoder outputs act as the keys that index the database, and the associated values are returned not as a single hard match but as a weighted mixture over all source positions.
Why Cross-Attention in Every Layer?
The original Transformer applies cross-attention in every decoder layer, not just once. Why?
Each layer can learn different retrieval patterns. Early layers might attend to local alignments; later layers might capture non-local relationships like long-range discourse coherence.
Cross-attention naturally handles different source and target lengths. The attention matrix has shape [tgt_len × src_len], allowing the decoder (e.g., 50 tokens) to attend to a different-length encoder output (e.g., 80 tokens). No padding or truncation to match lengths is required—the mechanism is inherently flexible.
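The following sketch makes this concrete with `nn.MultiheadAttention`: queries come from 50 decoder positions, keys and values from 80 encoder positions, and the attention weight matrix comes out with shape [tgt_len, src_len].

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

decoder_states = torch.randn(1, 50, d_model)   # tgt_len = 50 (queries)
encoder_output = torch.randn(1, 80, d_model)   # src_len = 80 (keys and values)

out, weights = cross_attn(query=decoder_states,
                          key=encoder_output,
                          value=encoder_output)

print(out.shape)      # torch.Size([1, 50, 512]) -- one vector per target position
print(weights.shape)  # torch.Size([1, 50, 80]) -- [tgt_len, src_len] attention matrix
```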
Visualizing Cross-Attention Patterns
For machine translation, well-trained cross-attention often shows roughly diagonal alignment patterns for language pairs with similar word order, with off-diagonal jumps where the languages reorder words (for example, adjective-noun inversion between English and French).
These patterns are interpretable and provide insights into what the model is "doing" when translating.
The encoder-decoder architecture behaves quite differently during training versus inference. Understanding this distinction is crucial for implementation and optimization.
Training: Teacher Forcing
During training, the entire target sequence (shifted right by one position) is fed to the decoder at once, and the causal mask ensures each position is predicted using only the ground-truth tokens before it.
This is called teacher forcing: the decoder always sees the correct previous tokens, not its own predictions. This allows full parallelization across target positions and a loss computed over every position in a single forward pass:
$$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t | y_{<t}, Z)$$
where $Z$ is the encoder output and $y_{<t}$ are the ground-truth previous tokens.
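A minimal sketch of this objective using the encoder and decoder classes defined earlier; the special-token ids (`BOS_ID`, `PAD_ID`) are hypothetical placeholders, and the right-shift of the target plus a standard cross-entropy over the logits implements the summed log-likelihood above.

```python
import torch
import torch.nn.functional as F

BOS_ID, PAD_ID = 1, 0   # hypothetical special-token ids

def teacher_forcing_loss(encoder, decoder, src, tgt):
    """Cross-entropy loss with teacher forcing.

    src: [batch, src_len] source token ids
    tgt: [batch, tgt_len] ground-truth target token ids
    """
    # Decoder input is the target shifted right: <bos>, y_1, ..., y_{T-1}
    bos = torch.full((tgt.size(0), 1), BOS_ID, dtype=tgt.dtype, device=tgt.device)
    decoder_input = torch.cat([bos, tgt[:, :-1]], dim=1)

    memory = encoder(src)                        # [batch, src_len, d_model]
    logits = decoder(decoder_input, memory)      # [batch, tgt_len, vocab]

    # L = -sum_t log P(y_t | y_<t, Z), averaged over non-padding positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt.reshape(-1),
        ignore_index=PAD_ID
    )
```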
Inference: Autoregressive Generation
During inference, generation starts from a begin-of-sequence token; at each step the decoder consumes everything generated so far, predicts a distribution over the next token, appends the chosen token, and repeats until an end-of-sequence token (or a length limit) is reached.
This sequential generation is a computational bottleneck: producing $m$ tokens requires $m$ separate decoder passes, and each pass must attend over an ever-growing prefix.
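A greedy-decoding sketch of this loop, again assuming the classes defined earlier and hypothetical `BOS_ID`/`EOS_ID` tokens; note that every iteration re-runs the decoder over the entire prefix, which is exactly the cost that KV caching (next subsection) removes.

```python
import torch

BOS_ID, EOS_ID = 1, 2   # hypothetical special-token ids

@torch.no_grad()
def greedy_decode(encoder, decoder, src, max_len=50):
    """Generate one target sequence token by token (batch size 1)."""
    memory = encoder(src)                                   # encode once
    generated = torch.tensor([[BOS_ID]], dtype=torch.long)

    for _ in range(max_len):
        logits = decoder(generated, memory)                 # re-runs over the whole prefix
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == EOS_ID:
            break
    return generated[:, 1:]                                 # drop the <bos> token
```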
KV Cache Optimization
During inference, a crucial optimization is KV caching. Since the decoder is autoregressive, the key and value projections for positions $1$ through $t-1$ are identical at every later step, so recomputing them is wasted work.
We can cache those Key and Value projections and compute K and V only for the new position. This reduces the self-attention computation from $O(t^2)$ to $O(t)$ per generation step.
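A minimal single-head sketch of the idea, written with explicit projection matrices rather than the `nn.MultiheadAttention` module used earlier: each step projects only the newest position, appends its key/value to the cache, and attends over the cached tensors.

```python
import torch

class CachedSelfAttention(torch.nn.Module):
    """Single-head self-attention with a KV cache (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5
        self.k_cache = None   # [batch, t, d_model]
        self.v_cache = None

    def forward(self, x_new: torch.Tensor) -> torch.Tensor:
        """x_new: [batch, 1, d_model] -- only the newest decoder position."""
        q = self.q_proj(x_new)                        # query for the new position only
        k_new, v_new = self.k_proj(x_new), self.v_proj(x_new)

        # Append the new K/V; keys and values of earlier steps are reused, not recomputed.
        self.k_cache = k_new if self.k_cache is None else torch.cat([self.k_cache, k_new], dim=1)
        self.v_cache = v_new if self.v_cache is None else torch.cat([self.v_cache, v_new], dim=1)

        # Attend over all cached positions: O(t) work per step instead of O(t^2).
        scores = q @ self.k_cache.transpose(1, 2) * self.scale   # [batch, 1, t]
        weights = torch.softmax(scores, dim=-1)
        return weights @ self.v_cache                            # [batch, 1, d_model]
```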
Exposure Bias
Teacher forcing introduces exposure bias: during training, the model always sees ground-truth context, but during inference, it sees its own (potentially erroneous) predictions. If the model makes an error at step 3, it may have never learned to recover from that type of error.
Mitigation strategies include scheduled sampling (occasionally feeding the model its own predictions during training) and decoding strategies such as beam search that are less sensitive to individual token errors.
Exposure bias is a fundamental challenge in autoregressive models. The model trains on perfect context but must generate from imperfect context. Advanced techniques like sequence-level training and minimum risk training address this partially, but it remains an active research area.
The Transformer's performance is highly sensitive to architectural hyperparameters. The original "base" and "big" configurations have become standard reference points, but understanding what each parameter controls is essential for adaptation to new domains.
| Parameter | Base Model | Big Model | Description |
|---|---|---|---|
| $N$ (layers) | 6 | 6 | Number of encoder and decoder layers |
| $d_{model}$ | 512 | 1024 | Hidden dimension throughout model |
| $d_{ff}$ | 2048 | 4096 | Feed-forward inner dimension (typically 4× $d_{model}$) |
| $h$ (heads) | 8 | 16 | Number of attention heads |
| $d_k = d_v$ | 64 | 64 | $d_{model} / h$, dimension per head |
| $P_{drop}$ | 0.1 | 0.3 | Dropout probability |
| Parameters | ~65M | ~213M | Total trainable parameters |
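A rough back-of-the-envelope sketch of where the base model's parameter count comes from; it ignores biases, layer-norm parameters, and embedding-sharing details, and assumes a shared vocabulary of about 37,000 tokens, so it lands near but not exactly on the published ~65M figure.

```python
def approx_transformer_params(n_layers=6, d_model=512, d_ff=2048, vocab_size=37_000):
    """Rough parameter count for an encoder-decoder Transformer (biases/LayerNorm ignored)."""
    attn = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O projections
    ffn = 2 * d_model * d_ff                # two linear layers
    encoder_layer = attn + ffn              # self-attention + FFN
    decoder_layer = 2 * attn + ffn          # self-attention + cross-attention + FFN
    embeddings = vocab_size * d_model       # assumed shared embedding matrix
    return n_layers * (encoder_layer + decoder_layer) + embeddings

print(f"{approx_transformer_params() / 1e6:.0f}M parameters")   # roughly 63M for the base config
```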
Parameter Relationships and Trade-offs
Model dimension ($d_{model}$): The "width" of the model. Larger values increase representational capacity, but parameter count and memory grow roughly quadratically with $d_{model}$ in the attention and feed-forward projections.
Number of layers ($N$): The "depth" of the model. More layers allow more hierarchical abstraction over the sequence, at the cost of slower training, higher latency, and harder optimization.
Number of heads ($h$): Typically $d_{model} / d_k$ where $d_k = 64$. More heads let the model attend to different kinds of relationships in parallel, but each head operates in a smaller subspace, so adding heads is not automatically better.
Feed-forward dimension ($d_{ff}$): Usually $4 \times d_{model}$. This wide inner layer gives each position a high-capacity nonlinear transformation and accounts for a large share of the model's parameters and compute.
Dropout ($P_{drop}$): Applied after attention, after the FFN, and to the embeddings. Essential for regularization: it discourages co-adaptation and reduces overfitting, particularly when training data is limited.
Modern research on scaling laws (Kaplan et al., Hoffmann et al.) shows that performance improves predictably with compute, data, and parameters. For a given compute budget, there's an optimal allocation between model size and training data. These laws guide decisions about when to scale width vs. depth.
We have thoroughly examined the encoder-decoder structure that forms the backbone of the original Transformer architecture. The key insights: the encoder builds contextualized representations of all input positions in parallel; the decoder generates output autoregressively using masked self-attention; cross-attention bridges the two stacks by letting every decoder position query the full encoded input; and training exploits teacher forcing for parallelism while inference remains sequential.
Looking Ahead
This encoder-decoder structure is the foundation upon which all Transformer variants are built: BERT keeps only the encoder stack, GPT keeps only the decoder stack (dropping cross-attention), and T5 retains the full encoder-decoder design.
In the next pages, we'll examine the critical components that make this architecture work: layer normalization, feed-forward networks, and residual connections. Each plays an essential role in training stability and representational power.
You now understand the Transformer's encoder-decoder structure—how information flows from input to output, the roles of self-attention and cross-attention, and the key differences between training and inference. Next, we'll examine layer normalization and its critical role in training stability.