The Transformer architecture, introduced in the seminal 2017 paper "Attention Is All You Need" by Vaswani et al., fundamentally reimagined how neural networks process sequential data. At its core lies an encoder-decoder structure—a design pattern with roots in earlier sequence-to-sequence models, but implemented here with a radical departure: the complete elimination of recurrence and convolutions in favor of pure attention mechanisms.
Understanding this encoder-decoder structure is essential because it provides the conceptual framework upon which all subsequent Transformer variants are built. Whether you're working with BERT (encoder-only), GPT (decoder-only), or T5 (full encoder-decoder), the principles established in the original architecture inform every design decision.
This page provides an exhaustive exploration of the encoder-decoder architecture, examining how information flows through the system, why specific design choices were made, and how the components work together to achieve state-of-the-art performance on sequence-to-sequence tasks.
This page assumes familiarity with attention mechanisms, self-attention, and multi-head attention from previous modules. We build upon those concepts to understand how they're orchestrated at the architectural level.
Before diving into the Transformer's encoder-decoder structure, we must understand the landscape it emerged from. This context illuminates why the architecture was designed as it was.
The Sequence-to-Sequence Paradigm
Sequence-to-sequence (seq2seq) learning addresses the challenge of mapping one variable-length sequence to another. This encompasses machine translation (English → French), summarization (document → summary), speech recognition (audio → text), and countless other tasks.
The dominant approach before 2017 used recurrent neural networks (RNNs), particularly LSTMs and GRUs, in an encoder-decoder configuration: an encoder RNN reads the input token by token and compresses it into a single fixed-length context vector, which a decoder RNN then unrolls into the output sequence.
This approach, pioneered by Sutskever et al. (2014) and Cho et al. (2014), achieved impressive results but suffered from fundamental limitations: the fixed-length context vector becomes an information bottleneck for long inputs, and the strictly sequential computation prevents parallelization across time steps.
The Attention Solution
The attention mechanism, introduced by Bahdanau et al. (2015), addressed the bottleneck problem by allowing the decoder to "look back" at all encoder hidden states rather than relying solely on the context vector. At each decoding step, the decoder computes a weighted average of the encoder states, with the weights learned from data:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$
where $\alpha_{ij}$ is the attention weight between decoder position $i$ and encoder position $j$, $e_{ij}$ is an alignment score, and $c_i$ is the context vector for decoder step $i$.
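To make these two equations concrete, here is a minimal PyTorch sketch of a single decoder step. It uses a plain dot product as the alignment score $e_{ij}$ in place of Bahdanau's learned feed-forward scorer, and the tensors `h` and `s_i` are toy stand-ins for encoder and decoder hidden states.

```python
import torch

def attention_context(decoder_state, encoder_states):
    """Compute attention weights alpha_ij and context vector c_i for one decoder step.

    Uses a dot-product alignment score as a simple stand-in for Bahdanau's
    learned feed-forward scorer.

    Args:
        decoder_state: [d_hidden] decoder hidden state s_i at step i
        encoder_states: [T_x, d_hidden] encoder hidden states h_1..h_Tx
    Returns:
        context: [d_hidden] weighted sum of encoder states (c_i)
        alpha: [T_x] attention weights (sum to 1)
    """
    e = encoder_states @ decoder_state      # alignment scores e_ij, shape [T_x]
    alpha = torch.softmax(e, dim=0)         # alpha_ij = exp(e_ij) / sum_k exp(e_ik)
    context = alpha @ encoder_states        # c_i = sum_j alpha_ij * h_j
    return context, alpha

# Toy example: 5 encoder positions, hidden size 8
h = torch.randn(5, 8)
s_i = torch.randn(8)
c_i, alpha = attention_context(s_i, h)
print(alpha.sum())   # tensor(1.) -- the weights form a distribution over source positions
```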
This attention-augmented encoder-decoder became the state-of-the-art for machine translation. But it still relied on RNNs for sequential processing—attention was an addition to the recurrent architecture, not a replacement.
The Transformer's genius was recognizing that attention alone—without any recurrence—could serve as the complete computational mechanism. If attention lets you access any position directly, why maintain the sequential processing constraint at all?
The Transformer maintains the encoder-decoder structure but reimplements both components using attention mechanisms exclusively. Let's first understand the high-level information flow before diving into details.
The Encoder Stack
The encoder's job is to transform an input sequence $X = (x_1, x_2, ..., x_n)$ into a sequence of continuous representations $Z = (z_1, z_2, ..., z_n)$. Unlike RNN encoders, the Transformer encoder processes all positions simultaneously: every token can attend directly to every other token via self-attention, so no information has to be squeezed through a sequential bottleneck.
The original Transformer uses 6 identical encoder layers stacked on top of each other. Each layer refines the representations, with lower layers capturing local patterns and higher layers capturing more global, abstract relationships.
The Decoder Stack
The decoder generates the output sequence $Y = (y_1, y_2, ..., y_m)$ one token at a time, conditioned on the encoder representations $Z$ and the previously generated tokens. The decoder uses masked self-attention so that each position sees only earlier output positions, attends to the encoder output through cross-attention, and ends with a linear projection and softmax that produce a distribution over the target vocabulary.
Information Flow Summary
The encoder processes all input positions in parallel during both training and inference. The decoder, however, generates outputs autoregressively during inference (one token at a time), though training parallelizes across positions using teacher forcing and masked attention.
Each encoder layer is a self-contained computational unit that transforms its input representations while maintaining the sequence length. Let's dissect each component.
Encoder Layer Structure
Each of the $N$ encoder layers contains two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network.
Around each sub-layer, the architecture employs a residual connection followed by layer normalization.
Mathematically, for input $x$ to each sub-layer:
$$\text{Output} = \text{LayerNorm}(x + \text{SubLayer}(x))$$
Input Processing
Before the first encoder layer, input tokens are converted to vectors by a learned embedding layer (scaled by $\sqrt{d_{model}}$) and combined with sinusoidal positional encodings:
$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$ $$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$
The combined input is thus: $$X_0 = \sqrt{d_{model}} \cdot \text{Embed}(tokens) + PE$$
```python
import torch
import torch.nn as nn
import math


class EncoderLayer(nn.Module):
    """
    A single Transformer encoder layer.

    Architecture:
        Input → MultiHeadAttention → Add & Norm → FFN → Add & Norm → Output

    All sequence positions are processed in parallel.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        # Multi-head self-attention
        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True
        )

        # Position-wise feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

        # Layer normalization (Post-LN variant)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        """
        Forward pass through the encoder layer.

        Args:
            x: Input tensor of shape [batch_size, seq_len, d_model]
            mask: Optional attention mask (e.g., for padding)

        Returns:
            Output tensor of shape [batch_size, seq_len, d_model]
        """
        # 1. Multi-head self-attention with residual + layer norm
        attn_output, _ = self.self_attn(
            query=x, key=x, value=x,
            key_padding_mask=mask
        )
        x = self.norm1(x + self.dropout(attn_output))

        # 2. Feed-forward network with residual + layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)

        return x


class TransformerEncoder(nn.Module):
    """
    Complete Transformer encoder: embedding + N encoder layers.
    """

    def __init__(
        self,
        vocab_size: int,
        max_seq_len: int,
        d_model: int = 512,
        n_layers: int = 6,
        n_heads: int = 8,
        d_ff: int = 2048,
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model

        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Positional encoding (sinusoidal, fixed)
        self.register_buffer(
            'positional_encoding',
            self._create_positional_encoding(max_seq_len, d_model)
        )

        # Stack of encoder layers
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        self.dropout = nn.Dropout(dropout)

    def _create_positional_encoding(self, max_len: int, d_model: int) -> torch.Tensor:
        """Create sinusoidal positional encodings."""
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe.unsqueeze(0)  # [1, max_len, d_model]

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        """
        Encode input sequence.

        Args:
            x: Token indices of shape [batch_size, seq_len]
            mask: Optional padding mask

        Returns:
            Encoded representations [batch_size, seq_len, d_model]
        """
        seq_len = x.size(1)

        # Embed tokens and scale
        x = self.token_embedding(x) * math.sqrt(self.d_model)

        # Add positional encoding
        x = x + self.positional_encoding[:, :seq_len, :]
        x = self.dropout(x)

        # Pass through encoder layers
        for layer in self.layers:
            x = layer(x, mask)

        return x
```

Understanding Encoder Self-Attention
In the encoder, self-attention allows every position to attend to every other position in the input sequence. This enables the model to build contextualized representations where each token's representation incorporates information from the entire sequence.
Consider encoding the sentence "The cat sat on the mat": when the model computes the representation of "sat", self-attention lets it draw directly on "cat" (the subject) and "mat" (the location), producing a contextualized vector for the verb in a single layer.
The encoder attention is bidirectional—position 3 can attend to both position 1 and position 5. This is crucial for understanding: disambiguating a word often requires context from both directions.
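A small sketch of this bidirectionality using `nn.MultiheadAttention`, with random vectors standing in for the six token embeddings: because no mask is supplied, the returned attention matrix is a full 6×6 grid, so every position (including "sat" at position 3) receives weight from positions on both sides.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_heads, seq_len = 512, 8, 6      # six tokens: "The cat sat on the mat"
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)       # random stand-ins for token embeddings
_, weights = attn(x, x, x)                 # no mask: every position attends everywhere

print(weights.shape)                        # torch.Size([1, 6, 6])
print((weights > 0).all())                  # tensor(True): no position is blocked
```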
Why Stack Multiple Layers?
Each encoder layer refines the representations by mixing information across positions through self-attention and then transforming each position independently through the feed-forward network, with residual connections preserving what earlier layers have already computed.
This hierarchical refinement is analogous to how CNNs build from edges → textures → parts → objects, but for sequential data.
The encoder's output is a sequence of contextualized vectors—one per input token. Unlike an RNN's final hidden state which "summarizes" the input, the Transformer encoder preserves all positions, allowing the decoder to selectively attend to whichever input positions are relevant for each output token.
The decoder is more complex than the encoder, featuring three sub-layers per layer instead of two. It must perform two distinct tasks: model the target sequence generated so far, and incorporate relevant information from the encoded source sequence.
Decoder Layer Structure
Each decoder layer contains three sub-layers: masked multi-head self-attention over the target, cross-attention over the encoder output, and a position-wise feed-forward network.
The Masking Mechanism
During training, we feed the entire target sequence to the decoder simultaneously for parallel computation. However, we must prevent position $i$ from "seeing" future positions ($i+1$, $i+2$, etc.) during self-attention—otherwise the model would "cheat" by looking at the answer.
This is achieved through causal masking (also called "look-ahead" masking):
$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$
When added to attention scores before softmax, positions $j > i$ receive effectively zero attention weight.
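A small numeric sketch of this mechanism: adding the mask to random scores and applying softmax row by row leaves each position with non-zero weight only on itself and earlier positions.

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)      # raw attention scores QK^T / sqrt(d_k)

# Causal mask: 0 where j <= i, -inf where j > i
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

weights = torch.softmax(scores + mask, dim=-1)
print(weights)
# Row i has non-zero weights only in columns 0..i; each row still sums to 1.
```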
Cross-Attention: The Bridge
Cross-attention is where encoder and decoder meet. In this layer, the queries come from the previous decoder sub-layer, while the keys and values come from the encoder's output.
This allows each decoder position to "look at" the entire encoded input and decide which input tokens are most relevant for generating the next output token.
For machine translation, this manifests as soft alignment: when generating the French word "chat", for example, the decoder places most of its cross-attention weight on the English word "cat".
```python
import torch
import torch.nn as nn
import math


class DecoderLayer(nn.Module):
    """
    A single Transformer decoder layer.

    Architecture:
        Input → Masked Self-Attention → Add & Norm
              → Cross-Attention → Add & Norm
              → FFN → Add & Norm → Output
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        # Masked multi-head self-attention
        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True
        )

        # Cross-attention (attending to encoder output)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True
        )

        # Position-wise feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        encoder_output: torch.Tensor,
        tgt_mask: torch.Tensor = None,
        memory_mask: torch.Tensor = None
    ) -> torch.Tensor:
        """
        Forward pass through the decoder layer.

        Args:
            x: Decoder input [batch_size, tgt_len, d_model]
            encoder_output: Encoder output [batch_size, src_len, d_model]
            tgt_mask: Causal mask for self-attention
            memory_mask: Optional mask for cross-attention (e.g., padding)

        Returns:
            Decoder output [batch_size, tgt_len, d_model]
        """
        # 1. Masked self-attention
        self_attn_output, _ = self.self_attn(
            query=x, key=x, value=x,
            attn_mask=tgt_mask  # Causal mask
        )
        x = self.norm1(x + self.dropout(self_attn_output))

        # 2. Cross-attention to encoder output
        cross_attn_output, _ = self.cross_attn(
            query=x,                 # Query from decoder
            key=encoder_output,      # Key from encoder
            value=encoder_output,    # Value from encoder
            key_padding_mask=memory_mask
        )
        x = self.norm2(x + self.dropout(cross_attn_output))

        # 3. Feed-forward network
        ff_output = self.feed_forward(x)
        x = self.norm3(x + ff_output)

        return x


def generate_causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    """
    Generate causal (look-ahead) mask for decoder self-attention.

    Returns a mask where position i can only attend to positions <= i.
    Upper triangular entries (j > i) are set to -inf.
    """
    mask = torch.triu(
        torch.ones(seq_len, seq_len, device=device) * float('-inf'),
        diagonal=1
    )
    return mask


class TransformerDecoder(nn.Module):
    """
    Complete Transformer decoder: embedding + N decoder layers + output projection.
    """

    def __init__(
        self,
        vocab_size: int,
        max_seq_len: int,
        d_model: int = 512,
        n_layers: int = 6,
        n_heads: int = 8,
        d_ff: int = 2048,
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model

        # Token embedding (often shared with encoder and output projection)
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Positional encoding
        self.register_buffer(
            'positional_encoding',
            self._create_positional_encoding(max_seq_len, d_model)
        )

        # Stack of decoder layers
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        # Output projection to vocabulary
        self.output_projection = nn.Linear(d_model, vocab_size)

        self.dropout = nn.Dropout(dropout)

    def _create_positional_encoding(self, max_len: int, d_model: int) -> torch.Tensor:
        """Create sinusoidal positional encodings."""
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe.unsqueeze(0)

    def forward(
        self,
        tgt: torch.Tensor,
        encoder_output: torch.Tensor,
        memory_mask: torch.Tensor = None
    ) -> torch.Tensor:
        """
        Decode target sequence given encoder output.

        Args:
            tgt: Target token indices [batch_size, tgt_len]
            encoder_output: Encoder output [batch_size, src_len, d_model]
            memory_mask: Optional padding mask for source

        Returns:
            Logits over vocabulary [batch_size, tgt_len, vocab_size]
        """
        seq_len = tgt.size(1)

        # Generate causal mask
        tgt_mask = generate_causal_mask(seq_len, tgt.device)

        # Embed and add positional encoding
        x = self.token_embedding(tgt) * math.sqrt(self.d_model)
        x = x + self.positional_encoding[:, :seq_len, :]
        x = self.dropout(x)

        # Pass through decoder layers
        for layer in self.layers:
            x = layer(x, encoder_output, tgt_mask, memory_mask)

        # Project to vocabulary
        logits = self.output_projection(x)

        return logits
```

| Aspect | Encoder Self-Attention | Decoder Self-Attention |
|---|---|---|
| Attention Pattern | Bidirectional (all positions) | Unidirectional (past positions only) |
| Masking | None (or padding mask only) | Causal mask (upper triangular) |
| Purpose | Build contextualized input representations | Model dependencies in generated sequence |
| Parallelization | Fully parallel (training + inference) | Parallel training, sequential inference |
| Information Flow | Position i sees positions 1...n | Position i sees positions 1...i |
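As a quick check of how the pieces defined above fit together, the following sketch wires a `TransformerEncoder` and `TransformerDecoder` into a single forward pass; the vocabulary sizes, batch size, and sequence lengths are arbitrary illustrative values.

```python
# Usage sketch for the classes defined above (illustrative sizes only).
import torch

src_vocab, tgt_vocab, max_len = 10_000, 10_000, 512
encoder = TransformerEncoder(vocab_size=src_vocab, max_seq_len=max_len)
decoder = TransformerDecoder(vocab_size=tgt_vocab, max_seq_len=max_len)

src = torch.randint(0, src_vocab, (2, 20))   # [batch=2, src_len=20] token ids
tgt = torch.randint(0, tgt_vocab, (2, 15))   # [batch=2, tgt_len=15] token ids

memory = encoder(src)                        # [2, 20, 512] contextualized source
logits = decoder(tgt, memory)                # [2, 15, tgt_vocab] next-token logits
print(memory.shape, logits.shape)
```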
Cross-attention is the critical bridge between the encoder and decoder. Understanding its mechanics deeply is essential for grasping how information flows from input to output.
The Query-Key-Value Split
In cross-attention, the queries are projected from the decoder's hidden states, while the keys and values are projected from the encoder's output:
$$Q = W^Q \cdot \text{DecoderHidden}$$ $$K = W^K \cdot \text{EncoderOutput}$$ $$V = W^V \cdot \text{EncoderOutput}$$
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The decoder's current state (query) asks: "What information from the input should I retrieve?" The encoder's output (keys and values) answers: "Here's what's available, and here's the information associated with each."
Information Retrieval Interpretation
Think of cross-attention as a soft database lookup: the decoder state acts as the query, the encoder outputs act as the keys that index the database, and the associated values are returned not as a single hard match but as a weighted mixture over all source positions.
Why Cross-Attention in Every Layer?
The original Transformer applies cross-attention in every decoder layer, not just once. Why?
Each layer can learn different retrieval patterns. Early layers might attend to local alignments; later layers might capture non-local relationships like long-range discourse coherence.
Cross-attention naturally handles different source and target lengths. The attention matrix has shape [tgt_len × src_len], allowing the decoder (e.g., 50 tokens) to attend to a different-length encoder output (e.g., 80 tokens). No padding or truncation to match lengths is required—the mechanism is inherently flexible.
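The following sketch makes this concrete with `nn.MultiheadAttention`: queries come from 50 decoder positions, keys and values from 80 encoder positions, and the attention weight matrix comes out with shape [tgt_len, src_len].

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

decoder_states = torch.randn(1, 50, d_model)   # tgt_len = 50 (queries)
encoder_output = torch.randn(1, 80, d_model)   # src_len = 80 (keys and values)

out, weights = cross_attn(query=decoder_states,
                          key=encoder_output,
                          value=encoder_output)

print(out.shape)      # torch.Size([1, 50, 512]) -- one vector per target position
print(weights.shape)  # torch.Size([1, 50, 80]) -- [tgt_len, src_len] attention matrix
```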
Visualizing Cross-Attention Patterns
For machine translation, well-trained cross-attention often shows roughly diagonal alignment patterns for language pairs with similar word order, with off-diagonal jumps where the languages reorder words (for example, adjective-noun inversion between English and French).
These patterns are interpretable and provide insights into what the model is "doing" when translating.
The encoder-decoder architecture behaves quite differently during training versus inference. Understanding this distinction is crucial for implementation and optimization.
Training: Teacher Forcing
During training, the entire target sequence (shifted right by one position) is fed to the decoder at once, and the causal mask ensures each position is predicted using only the ground-truth tokens before it.
This is called teacher forcing: the decoder always sees the correct previous tokens, not its own predictions. This allows full parallelization across target positions and a loss computed over every position in a single forward pass:
$$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t | y_{<t}, Z)$$
where $Z$ is the encoder output and $y_{<t}$ are the ground-truth previous tokens.
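A minimal sketch of this objective using the encoder and decoder classes defined earlier; the special-token ids (`BOS_ID`, `PAD_ID`) are hypothetical placeholders, and the right-shift of the target plus a standard cross-entropy over the logits implements the summed log-likelihood above.

```python
import torch
import torch.nn.functional as F

BOS_ID, PAD_ID = 1, 0   # hypothetical special-token ids

def teacher_forcing_loss(encoder, decoder, src, tgt):
    """Cross-entropy loss with teacher forcing.

    src: [batch, src_len] source token ids
    tgt: [batch, tgt_len] ground-truth target token ids
    """
    # Decoder input is the target shifted right: <bos>, y_1, ..., y_{T-1}
    bos = torch.full((tgt.size(0), 1), BOS_ID, dtype=tgt.dtype, device=tgt.device)
    decoder_input = torch.cat([bos, tgt[:, :-1]], dim=1)

    memory = encoder(src)                        # [batch, src_len, d_model]
    logits = decoder(decoder_input, memory)      # [batch, tgt_len, vocab]

    # L = -sum_t log P(y_t | y_<t, Z), averaged over non-padding positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt.reshape(-1),
        ignore_index=PAD_ID
    )
```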
Inference: Autoregressive Generation
During inference, generation starts from a begin-of-sequence token; at each step the decoder consumes everything generated so far, predicts a distribution over the next token, appends the chosen token, and repeats until an end-of-sequence token (or a length limit) is reached.
This sequential generation is a computational bottleneck: producing $m$ tokens requires $m$ separate decoder passes, and each pass must attend over an ever-growing prefix.
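A greedy-decoding sketch of this loop, again assuming the classes defined earlier and hypothetical `BOS_ID`/`EOS_ID` tokens; note that every iteration re-runs the decoder over the entire prefix, which is exactly the cost that KV caching (next subsection) removes.

```python
import torch

BOS_ID, EOS_ID = 1, 2   # hypothetical special-token ids

@torch.no_grad()
def greedy_decode(encoder, decoder, src, max_len=50):
    """Generate one target sequence token by token (batch size 1)."""
    memory = encoder(src)                                   # encode once
    generated = torch.tensor([[BOS_ID]], dtype=torch.long)

    for _ in range(max_len):
        logits = decoder(generated, memory)                 # re-runs over the whole prefix
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == EOS_ID:
            break
    return generated[:, 1:]                                 # drop the <bos> token
```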
KV Cache Optimization
During inference, a crucial optimization is KV caching. Since the decoder is autoregressive, the key and value projections for positions $1$ through $t-1$ are identical at every later step, so recomputing them is wasted work.
We can cache those Key and Value projections and compute K and V only for the new position. This reduces the self-attention computation from $O(t^2)$ to $O(t)$ per generation step.
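A minimal single-head sketch of the idea, written with explicit projection matrices rather than the `nn.MultiheadAttention` module used earlier: each step projects only the newest position, appends its key/value to the cache, and attends over the cached tensors.

```python
import torch

class CachedSelfAttention(torch.nn.Module):
    """Single-head self-attention with a KV cache (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5
        self.k_cache = None   # [batch, t, d_model]
        self.v_cache = None

    def forward(self, x_new: torch.Tensor) -> torch.Tensor:
        """x_new: [batch, 1, d_model] -- only the newest decoder position."""
        q = self.q_proj(x_new)                        # query for the new position only
        k_new, v_new = self.k_proj(x_new), self.v_proj(x_new)

        # Append the new K/V; keys and values of earlier steps are reused, not recomputed.
        self.k_cache = k_new if self.k_cache is None else torch.cat([self.k_cache, k_new], dim=1)
        self.v_cache = v_new if self.v_cache is None else torch.cat([self.v_cache, v_new], dim=1)

        # Attend over all cached positions: O(t) work per step instead of O(t^2).
        scores = q @ self.k_cache.transpose(1, 2) * self.scale   # [batch, 1, t]
        weights = torch.softmax(scores, dim=-1)
        return weights @ self.v_cache                            # [batch, 1, d_model]
```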
Exposure Bias
Teacher forcing introduces exposure bias: during training, the model always sees ground-truth context, but during inference, it sees its own (potentially erroneous) predictions. If the model makes an error at step 3, it may have never learned to recover from that type of error.
Mitigation strategies include scheduled sampling (occasionally feeding the model its own predictions during training) and decoding strategies such as beam search that are less sensitive to individual token errors.
Exposure bias is a fundamental challenge in autoregressive models. The model trains on perfect context but must generate from imperfect context. Advanced techniques like sequence-level training and minimum risk training address this partially, but it remains an active research area.
The Transformer's performance is highly sensitive to architectural hyperparameters. The original "base" and "big" configurations have become standard reference points, but understanding what each parameter controls is essential for adaptation to new domains.
| Parameter | Base Model | Big Model | Description |
|---|---|---|---|
| $N$ (layers) | 6 | 6 | Number of encoder and decoder layers |
| $d_{model}$ | 512 | 1024 | Hidden dimension throughout model |
| $d_{ff}$ | 2048 | 4096 | Feed-forward inner dimension (typically 4× $d_{model}$) |
| $h$ (heads) | 8 | 16 | Number of attention heads |
| $d_k = d_v$ | 64 | 64 | $d_{model} / h$, dimension per head |
| $P_{drop}$ | 0.1 | 0.3 | Dropout probability |
| Parameters | ~65M | ~213M | Total trainable parameters |
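A rough back-of-the-envelope sketch of where the base model's parameter count comes from; it ignores biases, layer-norm parameters, and embedding-sharing details, and assumes a shared vocabulary of about 37,000 tokens, so it lands near but not exactly on the published ~65M figure.

```python
def approx_transformer_params(n_layers=6, d_model=512, d_ff=2048, vocab_size=37_000):
    """Rough parameter count for an encoder-decoder Transformer (biases/LayerNorm ignored)."""
    attn = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O projections
    ffn = 2 * d_model * d_ff                # two linear layers
    encoder_layer = attn + ffn              # self-attention + FFN
    decoder_layer = 2 * attn + ffn          # self-attention + cross-attention + FFN
    embeddings = vocab_size * d_model       # assumed shared embedding matrix
    return n_layers * (encoder_layer + decoder_layer) + embeddings

print(f"{approx_transformer_params() / 1e6:.0f}M parameters")   # roughly 63M for the base config
```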
Parameter Relationships and Trade-offs
Model dimension ($d_{model}$): The "width" of the model. Larger values increase representational capacity, but parameter count and memory grow roughly quadratically with $d_{model}$ in the attention and feed-forward projections.
Number of layers ($N$): The "depth" of the model. More layers allow more hierarchical abstraction over the sequence, at the cost of slower training, higher latency, and harder optimization.
Number of heads ($h$): Typically $d_{model} / d_k$ where $d_k = 64$. More heads let the model attend to different kinds of relationships in parallel, but each head operates in a smaller subspace, so adding heads is not automatically better.
Feed-forward dimension ($d_{ff}$): Usually $4 \times d_{model}$. This wide inner layer gives each position a high-capacity nonlinear transformation and accounts for a large share of the model's parameters and compute.
Dropout ($P_{drop}$): Applied after attention, after the FFN, and to the embeddings. Essential for regularization: it discourages co-adaptation and reduces overfitting, particularly when training data is limited.
Modern research on scaling laws (Kaplan et al., Hoffmann et al.) shows that performance improves predictably with compute, data, and parameters. For a given compute budget, there's an optimal allocation between model size and training data. These laws guide decisions about when to scale width vs. depth.
We have thoroughly examined the encoder-decoder structure that forms the backbone of the original Transformer architecture. The key insights: the encoder builds contextualized representations of all input positions in parallel; the decoder generates output autoregressively using masked self-attention; cross-attention bridges the two stacks by letting every decoder position query the full encoded input; and training exploits teacher forcing for parallelism while inference remains sequential.
Looking Ahead
This encoder-decoder structure is the foundation upon which all Transformer variants are built: BERT keeps only the encoder stack, GPT keeps only the decoder stack (dropping cross-attention), and T5 retains the full encoder-decoder design.
In the next pages, we'll examine the critical components that make this architecture work: layer normalization, feed-forward networks, and residual connections. Each plays an essential role in training stability and representational power.
You now understand the Transformer's encoder-decoder structure—how information flows from input to output, the roles of self-attention and cross-attention, and the key differences between training and inference. Next, we'll examine layer normalization and its critical role in training stability.