Machine LearningAttention & Transformers

Transformer Variants

LevelAdvanced

Duration180 mins

TopicAttention & Transformers

2 / 5

GPT: Generative Pre-trained Transformer

The Generative Revolution

While BERT approached language understanding through bidirectional encoding and masked prediction, OpenAI charted a different course with GPT (Generative Pre-trained Transformer). Released just months before BERT in June 2018, GPT pioneered the use of transformer decoder blocks for language modeling, establishing an architecture that would eventually scale to systems capable of human-like text generation, code writing, and complex reasoning.

The GPT family represents a fundamentally different philosophy from BERT: rather than learning to fill in blanks bidirectionally, GPT learns to predict what comes next—one token at a time, left to right. This autoregressive approach, while seemingly simpler, has proven extraordinarily powerful when scaled, leading to emergent capabilities that continue to surprise researchers.

What You Will Learn

This page covers the complete GPT architecture—from causal self-attention to positional encoding, from the original GPT-1 to the evolution through GPT-2, GPT-3, and GPT-4. You'll understand why autoregressive modeling scales so effectively, how in-context learning emerges, and how to implement and use GPT models in practice.

Two Philosophies of Language Understanding:

Aspect	BERT (Encoder)	GPT (Decoder)
Attention	Bidirectional (sees all tokens)	Causal (sees only past tokens)
Training Objective	Masked Language Modeling	Next Token Prediction
Primary Strength	Understanding & Classification	Generation & Completion
Pre-training Signal	~15% of tokens (masked)	100% of tokens (all predicted)
Fine-tuning Style	Task-specific heads	Prompting / Few-shot

These aren't merely engineering choices—they reflect deep assumptions about how language models should learn and what they should optimize for.

Autoregressive Language Modeling

At GPT's core is autoregressive language modeling—the task of predicting the next token given all previous tokens. This is expressed mathematically as modeling the joint probability of a sequence by decomposing it into conditional probabilities:

$$P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | x_1, x_2, ..., x_{i-1})$$

This factorization is exact (no approximation) and provides a natural way to model sequential data. Each token prediction is conditioned on the complete left context, and the training objective maximizes the likelihood of the training corpus.

The Training Objective:

GPT minimizes the negative log-likelihood over the training corpus:

$$\mathcal{L} = -\sum_{i=1}^{n} \log P(x_i | x_1, ..., x_{i-1}; \theta)$$

This is equivalent to cross-entropy loss between the predicted probability distribution and the true next token at each position.

Why Autoregressive Modeling Is Powerful

Unlike BERT which only predicts ~15% of masked tokens, autoregressive models predict EVERY token in the sequence. This means: (1) More training signal per example, (2) Natural alignment with generation tasks, (3) No pre-train/fine-tune distribution mismatch from [MASK] tokens. The trade-off is losing bidirectional context, but scale seems to compensate for this limitation.

Efficient Training with Teacher Forcing:

During training, GPT uses teacher forcing—at each position, the model receives the ground truth previous tokens rather than its own predictions. This allows parallel computation of all position predictions:

Input:   [BOS] The  cat  sat  on   the  mat
Target:  The  cat  sat  on   the  mat  [EOS]

All predictions can be computed in a single forward pass because the causal attention mask ensures each position only sees previous tokens—exactly what it would see during generation.

Generation at Inference:

At inference time, GPT generates autoregressively:

Start with a prompt
Predict next token distribution
Sample or select a token
Append to context
Repeat until stopping criterion

This sequential generation is inherently slower than encoding (which is fully parallel), but enables open-ended text generation.

autoregressive_generation.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
import torch
import torch.nn.functional as F
from typing import List, Optional
 
def autoregressive_generate(
    model,
    tokenizer,
    prompt: str,
    max_length: int = 100,
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    stop_tokens: Optional[List[int]] = None
) -> str:
    """
    Generate text autoregressively using GPT-style model.
    
    Args:
        model: GPT model with forward method returning logits
        tokenizer: Tokenizer with encode/decode methods
        prompt: Starting text
        max_length: Maximum tokens to generate
        temperature: Sampling temperature (1.0 = neutral, <1 = conservative, >1 = creative)
        top_k: If set, sample from top-k most likely tokens
        top_p: If set, sample from smallest set with cumulative prob >= top_p
        stop_tokens: Token IDs that terminate generation
    """
    model.eval()
    device = next(model.parameters()).device
    
    # Encode prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
    generated = input_ids.clone()
    
    stop_tokens = stop_tokens or [tokenizer.eos_token_id]
    
    with torch.no_grad():
        for _ in range(max_length):
            # Forward pass - get logits for next token
            outputs = model(generated)
            next_token_logits = outputs[:, -1, :]  # [batch, vocab]
            
            # Apply temperature
            next_token_logits = next_token_logits / temperature
            
            # Apply top-k filtering
            if top_k is not None:
                indices_to_remove = next_token_logits < torch.topk(next_token_logits, top_k)[0][..., -1, None]
                next_token_logits[indices_to_remove] = float('-inf')
            
            # Apply top-p (nucleus) filtering
            if top_p is not None:
                sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                
                # Remove tokens with cumulative probability above threshold
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0
                
                indices_to_remove = sorted_indices_to_remove.scatter(
                    dim=-1, index=sorted_indices, src=sorted_indices_to_remove
                )
                next_token_logits[indices_to_remove] = float('-inf')
            
            # Sample from distribution
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Check for stop token
            if next_token.item() in stop_tokens:
                break
            
            # Append and continue
            generated = torch.cat([generated, next_token], dim=-1)
    
    return tokenizer.decode(generated[0], skip_special_tokens=True)
 
 
class SamplingStrategies:
    """
    Common sampling strategies for text generation.
    """
    
    @staticmethod
    def greedy(logits: torch.Tensor) -> torch.Tensor:
        """Select highest probability token."""
        return logits.argmax(dim=-1, keepdim=True)
    
    @staticmethod
    def temperature_sample(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        """Sample with temperature scaling."""
        scaled_logits = logits / temperature
        probs = F.softmax(scaled_logits, dim=-1)
        return torch.multinomial(probs, num_samples=1)
    
    @staticmethod
    def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
        """Sample from top-k most likely tokens."""
        logits = logits / temperature
        top_k_logits, top_k_indices = torch.topk(logits, k, dim=-1)
        probs = F.softmax(top_k_logits, dim=-1)
        sampled_idx = torch.multinomial(probs, num_samples=1)
        return torch.gather(top_k_indices, dim=-1, index=sampled_idx)
    
    @staticmethod  
    def nucleus_sample(logits: torch.Tensor, p: float = 0.9, temperature: float = 1.0) -> torch.Tensor:
        """Sample from smallest set with cumulative probability >= p."""
        logits = logits / temperature
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        
        # Find cutoff index
        cutoff_mask = cumulative_probs > p
        cutoff_mask[..., 1:] = cutoff_mask[..., :-1].clone()
        cutoff_mask[..., 0] = False
        
        sorted_logits[cutoff_mask] = float('-inf')
        probs = F.softmax(sorted_logits, dim=-1)
        sampled_idx = torch.multinomial(probs, num_samples=1)
        
        return torch.gather(sorted_indices, dim=-1, index=sampled_idx)

The Decoder-Only Architecture

GPT uses a decoder-only transformer architecture—specifically, it uses only the decoder blocks from the original transformer, modified to remove cross-attention (since there's no encoder to attend to).

Key Architectural Components:

Causal Self-Attention (Masked Self-Attention)
- Each token can only attend to previous tokens and itself
- Implemented via a triangular attention mask
- Prevents information leakage from future tokens during training
Position-wise Feed-Forward Networks
- Same as in BERT: two linear layers with activation
- Applied independently to each position
Pre-Layer Normalization (GPT-2 onward)
- Layer norm applied before (not after) each sub-layer
- Improves training stability for deep networks

Pre-LN vs Post-LN

GPT-1 used post-layer normalization (like the original transformer), applying LayerNorm after the residual connection. GPT-2 and later switched to pre-layer normalization, applying LayerNorm before the attention/FFN sub-layers. Pre-LN provides more stable gradients and enables training of much deeper models without careful learning rate warmup.

gpt_architecture.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
 
class CausalSelfAttention(nn.Module):
    """
    Causal (masked) self-attention for GPT.
    Each position can only attend to previous positions.
    """
    def __init__(self, hidden_size: int, num_heads: int, max_seq_length: int, dropout: float = 0.1):
        super().__init__()
        assert hidden_size % num_heads == 0
        
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.hidden_size = hidden_size
        
        # Combined projection for Q, K, V (more efficient)
        self.c_attn = nn.Linear(hidden_size, 3 * hidden_size)
        self.c_proj = nn.Linear(hidden_size, hidden_size)
        
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        
        # Causal mask: lower triangular matrix
        # Register as buffer (not a parameter, but saved in state_dict)
        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(max_seq_length, max_seq_length)).view(
                1, 1, max_seq_length, max_seq_length
            )
        )
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_length, hidden_size = x.shape
        
        # Compute Q, K, V
        qkv = self.c_attn(x)  # [batch, seq, 3 * hidden]
        q, k, v = qkv.split(self.hidden_size, dim=-1)
        
        # Reshape for multi-head attention
        q = q.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        
        # Scaled dot-product attention with causal mask
        scale = 1.0 / math.sqrt(self.head_dim)
        attn_weights = torch.matmul(q, k.transpose(-2, -1)) * scale
        
        # Apply causal mask: set future positions to -inf
        causal_mask = self.causal_mask[:, :, :seq_length, :seq_length]
        attn_weights = attn_weights.masked_fill(causal_mask == 0, float('-inf'))
        
        attn_weights = F.softmax(attn_weights, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)
        
        # Weighted sum of values
        output = torch.matmul(attn_weights, v)
        
        # Reshape and project
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_length, hidden_size)
        output = self.c_proj(output)
        output = self.resid_dropout(output)
        
        return output
 
 
class GPTBlock(nn.Module):
    """
    Single GPT transformer block with pre-layer normalization.
    """
    def __init__(self, hidden_size: int, num_heads: int, max_seq_length: int, dropout: float = 0.1):
        super().__init__()
        self.ln_1 = nn.LayerNorm(hidden_size)
        self.attn = CausalSelfAttention(hidden_size, num_heads, max_seq_length, dropout)
        self.ln_2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
            nn.Dropout(dropout)
        )
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: LayerNorm before attention/MLP
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
 
 
class GPT(nn.Module):
    """
    GPT model: stack of decoder blocks with language modeling head.
    """
    def __init__(
        self,
        vocab_size: int,
        hidden_size: int = 768,
        num_layers: int = 12,
        num_heads: int = 12,
        max_seq_length: int = 1024,
        dropout: float = 0.1
    ):
        super().__init__()
        
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.position_embedding = nn.Embedding(max_seq_length, hidden_size)
        self.dropout = nn.Dropout(dropout)
        
        self.blocks = nn.ModuleList([
            GPTBlock(hidden_size, num_heads, max_seq_length, dropout)
            for _ in range(num_layers)
        ])
        
        self.ln_f = nn.LayerNorm(hidden_size)
        
        # Language modeling head (weight tied with token embedding)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight = self.token_embedding.weight  # Weight tying
        
        # Initialize weights
        self.apply(self._init_weights)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        batch_size, seq_length = input_ids.shape
        device = input_ids.device
        
        # Get embeddings
        positions = torch.arange(seq_length, device=device).unsqueeze(0)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)
        x = self.dropout(x)
        
        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x)
            
        x = self.ln_f(x)
        
        # Language model logits
        logits = self.lm_head(x)
        
        return logits

Weight Tying:

GPT ties the weights of the token embedding layer and the language modeling head. This means the same matrix is used to:

Convert input token IDs to embeddings
Convert final hidden states to vocabulary logits

Weight tying reduces parameters significantly (vocab_size × hidden_size parameters saved) and provides a form of regularization by constraining the embedding and output spaces to share the same geometry.

Positional Encoding:

GPT uses learned positional embeddings rather than sinusoidal encodings. Each position from 0 to max_seq_length-1 has its own learned embedding vector added to the token embedding. This allows the model to learn position-specific patterns from data.

The GPT Evolution: From GPT-1 to GPT-4

The GPT family has evolved through several major iterations, each bringing architectural refinements, scale increases, and surprising emergent capabilities.

GPT-1 (2018): Establishing the Paradigm

GPT-1 was the first to demonstrate that a transformer decoder pre-trained on language modeling could be effectively fine-tuned for diverse NLP tasks.

Key Contributions:

Demonstrated transformer decoders for transfer learning
Showed competitive results on 9 of 12 benchmarks with minimal task-specific architecture
Introduced the generative pre-training approach

GPT Family Evolution
Version	Parameters	Layers	Hidden Size	Context Length	Training Data
GPT-1	117M	12	768	512	BookCorpus (4.5GB)
GPT-2	1.5B	48	1600	1024	WebText (40GB)
GPT-3	175B	96	12288	2048	CommonCrawl, etc. (570GB)
GPT-4	~1.8T*	Unknown	Unknown	8K/32K/128K	Unknown

*GPT-4 specifications are based on estimates and leaks; OpenAI has not disclosed official architecture details.

GPT-2 (2019): Scale and Zero-Shot Learning

GPT-2 increased scale by 10x and demonstrated remarkable zero-shot capabilities—performing tasks without any fine-tuning or task-specific training.

Key Innovations:

Pre-layer normalization: Moved LayerNorm before attention/FFN
Larger context window: 1024 tokens vs 512
Zero-shot task framing: Tasks expressed as text ("translate English to French: ...")
WebText dataset: Curated web pages linked from Reddit (3+ upvotes)

The Zero-Shot Insight

GPT-2's key insight was that language models trained at sufficient scale implicitly learn to perform tasks described in text. Instead of training a classifier for sentiment, you prompt: 'Review: Great movie! Sentiment: positive. Review: Terrible plot. Sentiment:' and let the model complete. This suggested that scale might be sufficient for general intelligence.

GPT-3 (2020): The Emergence of In-Context Learning

GPT-3's 175 billion parameters crossed a threshold where qualitatively new capabilities emerged.

In-Context Learning:

GPT-3 demonstrated remarkable few-shot learning—improving at tasks simply by seeing a few examples in the prompt:

Translate English to French:

English: cheese
French: fromage

English: hello
French: bonjour

English: computer
French:

Without any gradient updates, GPT-3 learns from the pattern in the context and produces "ordinateur" (the correct translation).

GPT-3 Emergent Capabilities

•Few-shot learning: Learn new tasks from a few examples in the prompt
•Arithmetic: Perform multi-digit addition, subtraction (with varying accuracy)
•Code generation: Write working code from natural language descriptions
•Analogical reasoning: Complete analogies and pattern matching
•Translation: Translate between many language pairs without explicit training
•Question answering: Answer factual questions from knowledge in the weights

GPT-4 (2023): Multimodal and Enhanced Reasoning

GPT-4 represents the current frontier, with capabilities that approach human-level performance on many professional benchmarks.

Key Advances:

Multimodal input: Accepts both text and images
Longer context: Up to 128K tokens
Improved reasoning: Chain-of-thought, multi-step problem solving
Better calibration: More accurate uncertainty estimates
RLHF fine-tuning: Extensive reinforcement learning from human feedback
Professional-level performance: Passes bar exam, medical licensing exams

The Scaling Laws

Research has shown that model performance follows predictable power laws with compute, data, and parameters. GPT's evolution has been driven largely by scaling these factors—more parameters, more data, more compute. However, the returns appear to be logarithmic: each 10x increase in compute yields approximately linear improvements in loss.

In-Context Learning and Prompt Engineering

One of GPT's most remarkable properties is in-context learning—the ability to adapt to new tasks at inference time without any parameter updates, simply by conditioning on a suitable prompt.

Prompt Paradigms:

Zero-shot: Task description only, no examples

Classify the sentiment of the following review as positive or negative.
Review: This product exceeded all my expectations!
Sentiment:

One-shot: Single example provided

Classify sentiment:
Review: Great product! Sentiment: positive
Review: Waste of money. Sentiment:

Few-shot: Multiple examples provided

Review: Amazing quality → positive
Review: Terrible service → negative
Review: Exactly what I needed → positive
Review: Broke after one day →

Why In-Context Learning Works

The mechanism behind in-context learning remains an active research area. Leading theories suggest that: (1) Pre-training implicitly meta-learns algorithms for learning from context, (2) Larger models have more capacity to store and apply these implicit algorithms, (3) The attention mechanism can implement gradient descent-like updates in the forward pass.

Prompt Engineering Techniques:

Effective prompting can dramatically improve GPT performance. Key techniques include:

Chain-of-Thought (CoT) Prompting:

For reasoning tasks, asking the model to "think step by step" often improves accuracy:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. 
Each can has 3 balls. How many tennis balls does he have now?

A: Let's think step by step.
Roger started with 5 balls.
He bought 2 cans with 3 balls each = 2 × 3 = 6 balls.
Total = 5 + 6 = 11 balls.

The answer is 11.

CoT prompting can improve accuracy on math problems from ~20% to ~90%+ on some benchmarks.

prompting_strategies.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
from dataclasses import dataclass
from typing import List, Optional
import json
 
@dataclass
class PromptTemplate:
    """
    Structured prompt template for GPT-style models.
    """
    system_prompt: str
    examples: List[dict]
    instruction: str
    
    def format(self, input_text: str) -> str:
        """Format complete prompt with examples and input."""
        parts = []
        
        # System context
        if self.system_prompt:
            parts.append(self.system_prompt + "\n\n")
        
        # Few-shot examples
        for example in self.examples:
            parts.append(f"Input: {example['input']}\n")
            parts.append(f"Output: {example['output']}\n\n")
        
        # Current query
        parts.append(f"Input: {input_text}\n")
        parts.append(f"Output:")
        
        return "".join(parts)
 
 
class PromptingStrategies:
    """
    Common prompting strategies for GPT models.
    """
    
    @staticmethod
    def zero_shot(task_description: str, input_text: str) -> str:
        """Zero-shot prompting with task description only."""
        return f"""{task_description}
 
Input: {input_text}
Output:"""
 
    @staticmethod
    def few_shot(examples: List[dict], input_text: str, instruction: str = "") -> str:
        """Few-shot prompting with examples."""
        prompt_parts = []
        
        if instruction:
            prompt_parts.append(instruction + "\n\n")
        
        for ex in examples:
            prompt_parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}\n\n")
        
        prompt_parts.append(f"Input: {input_text}\nOutput:")
        
        return "".join(prompt_parts)
 
    @staticmethod
    def chain_of_thought(question: str, examples_with_reasoning: Optional[List[dict]] = None) -> str:
        """Chain-of-thought prompting for reasoning tasks."""
        if examples_with_reasoning:
            prompt_parts = []
            for ex in examples_with_reasoning:
                prompt_parts.append(f"Q: {ex['question']}\n")
                prompt_parts.append(f"A: Let's think step by step.\n{ex['reasoning']}\n")
                prompt_parts.append(f"The answer is {ex['answer']}.\n\n")
            
            prompt_parts.append(f"Q: {question}\n")
            prompt_parts.append("A: Let's think step by step.\n")
            return "".join(prompt_parts)
        else:
            return f"Q: {question}\nA: Let's think step by step.\n"
 
    @staticmethod
    def structured_output(input_text: str, output_format: dict, examples: Optional[List[dict]] = None) -> str:
        """Prompt for structured (JSON) output."""
        format_spec = json.dumps(output_format, indent=2)
        
        prompt = f"""Extract information from the input and return it in the following JSON format:
{format_spec}
 
"""
        if examples:
            for ex in examples:
                prompt += f"Input: {ex['input']}\nOutput: {json.dumps(ex['output'])}\n\n"
        
        prompt += f"Input: {input_text}\nOutput:"
        return prompt
 
 
# Example usage
if __name__ == "__main__":
    # Sentiment analysis with few-shot
    examples = [
        {"input": "This movie was amazing!", "output": "positive"},
        {"input": "Terrible waste of time.", "output": "negative"},
        {"input": "Pretty good, worth watching.", "output": "positive"},
    ]
    
    prompt = PromptingStrategies.few_shot(
        examples=examples,
        input_text="I've seen better but it was okay.",
        instruction="Classify the sentiment as positive or negative."
    )
    print(prompt)

Prompt Engineering Best Practices

•Be specific: Clear, unambiguous instructions yield better results
•Use consistent formatting: Same structure for examples and queries
•Order matters: More recent examples have stronger influence
•Include diverse examples: Cover edge cases and variations
•Specify output format: Structured formats (JSON) are more reliable
•Iterative refinement: Treat prompting as an engineering process

Training GPT at Scale

Training GPT-scale models requires sophisticated distributed training techniques and careful engineering. The computational and data requirements are immense.

Distributed Training Strategies:

Data Parallelism: Replicate model across GPUs, each processing different data
- Works well for models that fit in GPU memory
- Gradient synchronization becomes bottleneck at scale
Model Parallelism (Tensor Parallelism): Split individual layers across GPUs
- Attention heads and FFN can be partitioned
- Requires all-reduce communication between partitions
Pipeline Parallelism: Assign different layers to different GPUs
- Micro-batching to keep all GPUs busy
- Memory-efficient but introduces bubbles
3D Parallelism: Combine all three approaches
- Used for GPT-3 scale training
- Requires careful load balancing

GPT-3 Training Statistics
Metric	Value
Total training compute	3.14 × 10²³ FLOPS
Training tokens	~300 billion
Training time	~34 days on 1024 V100 GPUs
Estimated cost	$4.6 million (at cloud prices)
Peak GPU memory per GPU	~32 GB

Memory Optimization Techniques:

Gradient Checkpointing: Trade compute for memory by recomputing activations during backward pass
Mixed Precision Training: Use FP16/BF16 for forward/backward, FP32 for accumulation
ZeRO (Zero Redundancy Optimizer): Partition optimizer states, gradients, and parameters across GPUs
Activation Offloading: Move activations to CPU memory during forward pass
Sparse Attention: Reduce attention computation (discussed in efficient transformers)

The Compute Frontier

Training frontier GPT models is restricted to organizations with access to massive compute clusters. GPT-4's training reportedly cost over $100 million and required months on thousands of specialized accelerators. This concentration of capability raises important questions about accessibility and democratization of AI research.

Data Quality and Curation:

Training data quality significantly impacts model capability:

Deduplication: Remove duplicate documents to prevent memorization
Quality filtering: Remove low-quality, toxic, or personally identifiable content
Domain balancing: Mix of books, web, code, and curated sources
Tokenization: Subword tokenization (BPE or SentencePiece) optimized for vocabulary efficiency

GPT-3's training mix was approximately: 60% CommonCrawl (filtered), 22% WebText, 16% Books, 3% Wikipedia.

RLHF and Instruction Tuning

Raw GPT models trained on next-token prediction can generate coherent text but often fail to be helpful, safe, or aligned with human intentions. Reinforcement Learning from Human Feedback (RLHF) addresses this by fine-tuning models to follow instructions and produce preferred outputs.

The RLHF Pipeline:

Supervised Fine-Tuning (SFT): Fine-tune base model on human-written demonstrations of desired behavior
Reward Model Training: Train a separate model to predict human preference scores
- Collect pairs of outputs and ask humans which is better
- Train model to assign higher scores to preferred outputs
PPO Optimization: Use reinforcement learning (Proximal Policy Optimization) to maximize reward
- Model generates outputs, reward model scores them
- Model is updated to increase probability of high-reward outputs
- KL penalty prevents diverging too far from SFT model

Why RLHF Matters

RLHF transforms GPT from a text continuation engine into an assistant that follows instructions. InstructGPT (RLHF-tuned GPT-3) with 1.3B parameters was preferred by human raters over vanilla GPT-3 with 175B parameters. This demonstrates that alignment is as important as raw capability.

reward_model.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple
 
class RewardModel(nn.Module):
    """
    Reward model for RLHF: predicts scalar reward for text sequences.
    Built on top of pre-trained GPT backbone.
    """
    def __init__(self, base_model: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = base_model
        
        # Remove language modeling head, add reward head
        self.reward_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)  # Scalar reward
        )
        
    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Get hidden states from backbone
        hidden_states = self.backbone(input_ids, attention_mask, return_hidden_states=True)
        
        # Use last token hidden state (or mean pool)
        # Last token represents the complete sequence
        last_hidden = hidden_states[:, -1, :]
        
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)
    
    def compute_preference_loss(
        self, 
        chosen_ids: torch.Tensor, 
        chosen_mask: torch.Tensor,
        rejected_ids: torch.Tensor, 
        rejected_mask: torch.Tensor
    ) -> Tuple[torch.Tensor, float]:
        """
        Compute loss for preference learning.
        We want: reward(chosen) > reward(rejected)
        
        Uses Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
        """
        reward_chosen = self(chosen_ids, chosen_mask)
        reward_rejected = self(rejected_ids, rejected_mask)
        
        # Log sigmoid of reward difference
        # Loss = -log(sigmoid(r_chosen - r_rejected))
        loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
        
        # Accuracy for monitoring
        accuracy = (reward_chosen > reward_rejected).float().mean().item()
        
        return loss, accuracy
 
 
class PPOTrainer:
    """
    Simplified PPO trainer for RLHF.
    """
    def __init__(
        self, 
        policy_model: nn.Module,
        reward_model: RewardModel,
        ref_model: nn.Module,  # Frozen reference model
        kl_coef: float = 0.1,
        clip_range: float = 0.2
    ):
        self.policy = policy_model
        self.reward_model = reward_model
        self.ref_model = ref_model
        self.kl_coef = kl_coef
        self.clip_range = clip_range
        
    def compute_rewards(
        self, 
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor, 
        response_ids: torch.Tensor
    ) -> torch.Tensor:
        """Compute reward with KL penalty."""
        # Get reward from reward model
        full_sequence = torch.cat([input_ids, response_ids], dim=1)
        full_mask = torch.cat([attention_mask, torch.ones_like(response_ids)], dim=1)
        reward = self.reward_model(full_sequence, full_mask)
        
        # Compute KL divergence between policy and reference
        with torch.no_grad():
            policy_logits = self.policy(full_sequence)
            ref_logits = self.ref_model(full_sequence)
            
            policy_logprobs = F.log_softmax(policy_logits, dim=-1)
            ref_logprobs = F.log_softmax(ref_logits, dim=-1)
            
            kl_div = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1).mean()
        
        # Final reward = external reward - KL penalty
        final_reward = reward - self.kl_coef * kl_div
        
        return final_reward, kl_div
    
    def ppo_step(
        self,
        input_ids: torch.Tensor,
        response_ids: torch.Tensor,
        old_logprobs: torch.Tensor,
        rewards: torch.Tensor,
        advantages: torch.Tensor
    ) -> dict:
        """Single PPO update step."""
        # Get current policy log probabilities
        full_sequence = torch.cat([input_ids, response_ids], dim=1)
        logits = self.policy(full_sequence)
        
        # Extract logprobs for response tokens only
        response_logits = logits[:, input_ids.size(1):-1, :]
        current_logprobs = F.log_softmax(response_logits, dim=-1)
        current_logprobs = torch.gather(
            current_logprobs, 
            dim=-1, 
            index=response_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1)
        
        # PPO clipped objective
        ratio = (current_logprobs - old_logprobs).exp()
        clipped_ratio = torch.clamp(ratio, 1 - self.clip_range, 1 + self.clip_range)
        
        policy_loss = -torch.min(
            ratio * advantages,
            clipped_ratio * advantages
        ).mean()
        
        return {
            'loss': policy_loss,
            'ratio_mean': ratio.mean().item(),
            'clipped_fraction': ((ratio - clipped_ratio).abs() > 1e-6).float().mean().item()
        }

Effects of RLHF

•Improved helpfulness: Models follow instructions more reliably
•Reduced toxicity: Explicit harmful content is filtered
•Better calibration: Models express uncertainty appropriately
•Refusal behavior: Models decline inappropriate requests
•Capabilities trade-off: Some creative/edge capabilities may be reduced

GPT vs BERT: Choosing the Right Approach

GPT and BERT represent two fundamental approaches to language modeling. Understanding when to use each is crucial for practitioners.

GPT vs BERT Comparison
Aspect	GPT (Decoder)	BERT (Encoder)
Best for	Generation, completion, dialogue	Classification, NER, understanding
Attention	Causal (left-to-right)	Bidirectional (all tokens)
Fine-tuning	Prompting / instruction tuning	Add classification heads
Few-shot learning	Excellent via prompting	Limited without fine-tuning
Training signal	100% of tokens	~15% masked tokens
Inference efficiency	Sequential (slow for long outputs)	Parallel (fast)
Open-ended tasks	Natural fit	Requires reformulation

Rule of Thumb

Use GPT-style models when you need to generate text or when your task can be framed as completion. Use BERT-style models when you need fixed-size representations for classification, retrieval, or when you need efficient parallel processing of inputs.

Practical Guidance:

Choose GPT when:

Building chatbots or dialogue systems
Generating creative content (stories, code, summaries)
Tasks can be solved with few examples in prompt
You need flexible, instruction-following behavior

Choose BERT when:

Building search/retrieval systems
Need representations for clustering or similarity
Classification tasks with clear categories
Token-level tasks (NER, POS tagging)
Efficiency is critical (parallel encoding)

Hybrid Approaches:

Many production systems use both:

BERT for efficient retrieval and encoding
GPT for generation and re-ranking
Ensemble of both for robust predictions

Page Complete

You now understand GPT's autoregressive architecture, its evolution from GPT-1 to GPT-4, in-context learning, RLHF alignment, and how it compares to BERT. GPT's generative paradigm has proven remarkably versatile, enabling capabilities from code generation to complex reasoning. Next, we'll explore T5, which unifies encoder-decoder architectures under a text-to-text framework.

2 / 5

Loading learning content...

Machine LearningAttention & Transformers

Transformer Variants

LevelAdvanced

Duration180 mins

TopicAttention & Transformers

2 / 5

GPT: Generative Pre-trained Transformer

The Generative Revolution

What You Will Learn

Two Philosophies of Language Understanding:

Aspect	BERT (Encoder)	GPT (Decoder)
Attention	Bidirectional (sees all tokens)	Causal (sees only past tokens)
Training Objective	Masked Language Modeling	Next Token Prediction
Primary Strength	Understanding & Classification	Generation & Completion
Pre-training Signal	~15% of tokens (masked)	100% of tokens (all predicted)
Fine-tuning Style	Task-specific heads	Prompting / Few-shot

These aren't merely engineering choices—they reflect deep assumptions about how language models should learn and what they should optimize for.

Autoregressive Language Modeling

$$P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | x_1, x_2, ..., x_{i-1})$$

The Training Objective:

GPT minimizes the negative log-likelihood over the training corpus:

$$\mathcal{L} = -\sum_{i=1}^{n} \log P(x_i | x_1, ..., x_{i-1}; \theta)$$

This is equivalent to cross-entropy loss between the predicted probability distribution and the true next token at each position.

Why Autoregressive Modeling Is Powerful

Efficient Training with Teacher Forcing:

Input:   [BOS] The  cat  sat  on   the  mat
Target:  The  cat  sat  on   the  mat  [EOS]

All predictions can be computed in a single forward pass because the causal attention mask ensures each position only sees previous tokens—exactly what it would see during generation.

Generation at Inference:

At inference time, GPT generates autoregressively:

Start with a prompt
Predict next token distribution
Sample or select a token
Append to context
Repeat until stopping criterion

This sequential generation is inherently slower than encoding (which is fully parallel), but enables open-ended text generation.

autoregressive_generation.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
import torch
import torch.nn.functional as F
from typing import List, Optional
 
def autoregressive_generate(
    model,
    tokenizer,
    prompt: str,
    max_length: int = 100,
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    stop_tokens: Optional[List[int]] = None
) -> str:
    """
    Generate text autoregressively using GPT-style model.
    
    Args:
        model: GPT model with forward method returning logits
        tokenizer: Tokenizer with encode/decode methods
        prompt: Starting text
        max_length: Maximum tokens to generate
        temperature: Sampling temperature (1.0 = neutral, <1 = conservative, >1 = creative)
        top_k: If set, sample from top-k most likely tokens
        top_p: If set, sample from smallest set with cumulative prob >= top_p
        stop_tokens: Token IDs that terminate generation
    """
    model.eval()
    device = next(model.parameters()).device
    
    # Encode prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
    generated = input_ids.clone()
    
    stop_tokens = stop_tokens or [tokenizer.eos_token_id]
    
    with torch.no_grad():
        for _ in range(max_length):
            # Forward pass - get logits for next token
            outputs = model(generated)
            next_token_logits = outputs[:, -1, :]  # [batch, vocab]
            
            # Apply temperature
            next_token_logits = next_token_logits / temperature
            
            # Apply top-k filtering
            if top_k is not None:
                indices_to_remove = next_token_logits < torch.topk(next_token_logits, top_k)[0][..., -1, None]
                next_token_logits[indices_to_remove] = float('-inf')
            
            # Apply top-p (nucleus) filtering
            if top_p is not None:
                sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                
                # Remove tokens with cumulative probability above threshold
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0
                
                indices_to_remove = sorted_indices_to_remove.scatter(
                    dim=-1, index=sorted_indices, src=sorted_indices_to_remove
                )
                next_token_logits[indices_to_remove] = float('-inf')
            
            # Sample from distribution
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Check for stop token
            if next_token.item() in stop_tokens:
                break
            
            # Append and continue
            generated = torch.cat([generated, next_token], dim=-1)
    
    return tokenizer.decode(generated[0], skip_special_tokens=True)
 
 
class SamplingStrategies:
    """
    Common sampling strategies for text generation.
    """
    
    @staticmethod
    def greedy(logits: torch.Tensor) -> torch.Tensor:
        """Select highest probability token."""
        return logits.argmax(dim=-1, keepdim=True)
    
    @staticmethod
    def temperature_sample(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        """Sample with temperature scaling."""
        scaled_logits = logits / temperature
        probs = F.softmax(scaled_logits, dim=-1)
        return torch.multinomial(probs, num_samples=1)
    
    @staticmethod
    def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
        """Sample from top-k most likely tokens."""
        logits = logits / temperature
        top_k_logits, top_k_indices = torch.topk(logits, k, dim=-1)
        probs = F.softmax(top_k_logits, dim=-1)
        sampled_idx = torch.multinomial(probs, num_samples=1)
        return torch.gather(top_k_indices, dim=-1, index=sampled_idx)
    
    @staticmethod  
    def nucleus_sample(logits: torch.Tensor, p: float = 0.9, temperature: float = 1.0) -> torch.Tensor:
        """Sample from smallest set with cumulative probability >= p."""
        logits = logits / temperature
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        
        # Find cutoff index
        cutoff_mask = cumulative_probs > p
        cutoff_mask[..., 1:] = cutoff_mask[..., :-1].clone()
        cutoff_mask[..., 0] = False
        
        sorted_logits[cutoff_mask] = float('-inf')
        probs = F.softmax(sorted_logits, dim=-1)
        sampled_idx = torch.multinomial(probs, num_samples=1)
        
        return torch.gather(sorted_indices, dim=-1, index=sampled_idx)

The Decoder-Only Architecture

Key Architectural Components:

Causal Self-Attention (Masked Self-Attention)
- Each token can only attend to previous tokens and itself
- Implemented via a triangular attention mask
- Prevents information leakage from future tokens during training
Position-wise Feed-Forward Networks
- Same as in BERT: two linear layers with activation
- Applied independently to each position
Pre-Layer Normalization (GPT-2 onward)
- Layer norm applied before (not after) each sub-layer
- Improves training stability for deep networks

Pre-LN vs Post-LN

gpt_architecture.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
 
class CausalSelfAttention(nn.Module):
    """
    Causal (masked) self-attention for GPT.
    Each position can only attend to previous positions.
    """
    def __init__(self, hidden_size: int, num_heads: int, max_seq_length: int, dropout: float = 0.1):
        super().__init__()
        assert hidden_size % num_heads == 0
        
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.hidden_size = hidden_size
        
        # Combined projection for Q, K, V (more efficient)
        self.c_attn = nn.Linear(hidden_size, 3 * hidden_size)
        self.c_proj = nn.Linear(hidden_size, hidden_size)
        
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        
        # Causal mask: lower triangular matrix
        # Register as buffer (not a parameter, but saved in state_dict)
        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(max_seq_length, max_seq_length)).view(
                1, 1, max_seq_length, max_seq_length
            )
        )
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_length, hidden_size = x.shape
        
        # Compute Q, K, V
        qkv = self.c_attn(x)  # [batch, seq, 3 * hidden]
        q, k, v = qkv.split(self.hidden_size, dim=-1)
        
        # Reshape for multi-head attention
        q = q.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        
        # Scaled dot-product attention with causal mask
        scale = 1.0 / math.sqrt(self.head_dim)
        attn_weights = torch.matmul(q, k.transpose(-2, -1)) * scale
        
        # Apply causal mask: set future positions to -inf
        causal_mask = self.causal_mask[:, :, :seq_length, :seq_length]
        attn_weights = attn_weights.masked_fill(causal_mask == 0, float('-inf'))
        
        attn_weights = F.softmax(attn_weights, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)
        
        # Weighted sum of values
        output = torch.matmul(attn_weights, v)
        
        # Reshape and project
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_length, hidden_size)
        output = self.c_proj(output)
        output = self.resid_dropout(output)
        
        return output
 
 
class GPTBlock(nn.Module):
    """
    Single GPT transformer block with pre-layer normalization.
    """
    def __init__(self, hidden_size: int, num_heads: int, max_seq_length: int, dropout: float = 0.1):
        super().__init__()
        self.ln_1 = nn.LayerNorm(hidden_size)
        self.attn = CausalSelfAttention(hidden_size, num_heads, max_seq_length, dropout)
        self.ln_2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
            nn.Dropout(dropout)
        )
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: LayerNorm before attention/MLP
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
 
 
class GPT(nn.Module):
    """
    GPT model: stack of decoder blocks with language modeling head.
    """
    def __init__(
        self,
        vocab_size: int,
        hidden_size: int = 768,
        num_layers: int = 12,
        num_heads: int = 12,
        max_seq_length: int = 1024,
        dropout: float = 0.1
    ):
        super().__init__()
        
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.position_embedding = nn.Embedding(max_seq_length, hidden_size)
        self.dropout = nn.Dropout(dropout)
        
        self.blocks = nn.ModuleList([
            GPTBlock(hidden_size, num_heads, max_seq_length, dropout)
            for _ in range(num_layers)
        ])
        
        self.ln_f = nn.LayerNorm(hidden_size)
        
        # Language modeling head (weight tied with token embedding)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight = self.token_embedding.weight  # Weight tying
        
        # Initialize weights
        self.apply(self._init_weights)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        batch_size, seq_length = input_ids.shape
        device = input_ids.device
        
        # Get embeddings
        positions = torch.arange(seq_length, device=device).unsqueeze(0)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)
        x = self.dropout(x)
        
        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x)
            
        x = self.ln_f(x)
        
        # Language model logits
        logits = self.lm_head(x)
        
        return logits

Weight Tying:

GPT ties the weights of the token embedding layer and the language modeling head. This means the same matrix is used to:

Convert input token IDs to embeddings
Convert final hidden states to vocabulary logits

Positional Encoding:

The GPT Evolution: From GPT-1 to GPT-4

The GPT family has evolved through several major iterations, each bringing architectural refinements, scale increases, and surprising emergent capabilities.

GPT-1 (2018): Establishing the Paradigm

GPT-1 was the first to demonstrate that a transformer decoder pre-trained on language modeling could be effectively fine-tuned for diverse NLP tasks.

Key Contributions:

Demonstrated transformer decoders for transfer learning
Showed competitive results on 9 of 12 benchmarks with minimal task-specific architecture
Introduced the generative pre-training approach

GPT Family Evolution
Version	Parameters	Layers	Hidden Size	Context Length	Training Data
GPT-1	117M	12	768	512	BookCorpus (4.5GB)
GPT-2	1.5B	48	1600	1024	WebText (40GB)
GPT-3	175B	96	12288	2048	CommonCrawl, etc. (570GB)
GPT-4	~1.8T*	Unknown	Unknown	8K/32K/128K	Unknown

*GPT-4 specifications are based on estimates and leaks; OpenAI has not disclosed official architecture details.

GPT-2 (2019): Scale and Zero-Shot Learning

GPT-2 increased scale by 10x and demonstrated remarkable zero-shot capabilities—performing tasks without any fine-tuning or task-specific training.

Key Innovations:

Pre-layer normalization: Moved LayerNorm before attention/FFN
Larger context window: 1024 tokens vs 512
Zero-shot task framing: Tasks expressed as text ("translate English to French: ...")
WebText dataset: Curated web pages linked from Reddit (3+ upvotes)

The Zero-Shot Insight

GPT-3 (2020): The Emergence of In-Context Learning

GPT-3's 175 billion parameters crossed a threshold where qualitatively new capabilities emerged.

In-Context Learning:

GPT-3 demonstrated remarkable few-shot learning—improving at tasks simply by seeing a few examples in the prompt:

Translate English to French:

English: cheese
French: fromage

English: hello
French: bonjour

English: computer
French:

Without any gradient updates, GPT-3 learns from the pattern in the context and produces "ordinateur" (the correct translation).

GPT-3 Emergent Capabilities

•Few-shot learning: Learn new tasks from a few examples in the prompt
•Arithmetic: Perform multi-digit addition, subtraction (with varying accuracy)
•Code generation: Write working code from natural language descriptions
•Analogical reasoning: Complete analogies and pattern matching
•Translation: Translate between many language pairs without explicit training
•Question answering: Answer factual questions from knowledge in the weights

GPT-4 (2023): Multimodal and Enhanced Reasoning

GPT-4 represents the current frontier, with capabilities that approach human-level performance on many professional benchmarks.

Key Advances:

Multimodal input: Accepts both text and images
Longer context: Up to 128K tokens
Improved reasoning: Chain-of-thought, multi-step problem solving
Better calibration: More accurate uncertainty estimates
RLHF fine-tuning: Extensive reinforcement learning from human feedback
Professional-level performance: Passes bar exam, medical licensing exams

The Scaling Laws

In-Context Learning and Prompt Engineering

One of GPT's most remarkable properties is in-context learning—the ability to adapt to new tasks at inference time without any parameter updates, simply by conditioning on a suitable prompt.

Prompt Paradigms:

Zero-shot: Task description only, no examples

Classify the sentiment of the following review as positive or negative.
Review: This product exceeded all my expectations!
Sentiment:

One-shot: Single example provided

Classify sentiment:
Review: Great product! Sentiment: positive
Review: Waste of money. Sentiment:

Few-shot: Multiple examples provided

Review: Amazing quality → positive
Review: Terrible service → negative
Review: Exactly what I needed → positive
Review: Broke after one day →

Why In-Context Learning Works

Prompt Engineering Techniques:

Effective prompting can dramatically improve GPT performance. Key techniques include:

Chain-of-Thought (CoT) Prompting:

For reasoning tasks, asking the model to "think step by step" often improves accuracy:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. 
Each can has 3 balls. How many tennis balls does he have now?

A: Let's think step by step.
Roger started with 5 balls.
He bought 2 cans with 3 balls each = 2 × 3 = 6 balls.
Total = 5 + 6 = 11 balls.

The answer is 11.

CoT prompting can improve accuracy on math problems from ~20% to ~90%+ on some benchmarks.

prompting_strategies.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
from dataclasses import dataclass
from typing import List, Optional
import json
 
@dataclass
class PromptTemplate:
    """
    Structured prompt template for GPT-style models.
    """
    system_prompt: str
    examples: List[dict]
    instruction: str
    
    def format(self, input_text: str) -> str:
        """Format complete prompt with examples and input."""
        parts = []
        
        # System context
        if self.system_prompt:
            parts.append(self.system_prompt + "\n\n")
        
        # Few-shot examples
        for example in self.examples:
            parts.append(f"Input: {example['input']}\n")
            parts.append(f"Output: {example['output']}\n\n")
        
        # Current query
        parts.append(f"Input: {input_text}\n")
        parts.append(f"Output:")
        
        return "".join(parts)
 
 
class PromptingStrategies:
    """
    Common prompting strategies for GPT models.
    """
    
    @staticmethod
    def zero_shot(task_description: str, input_text: str) -> str:
        """Zero-shot prompting with task description only."""
        return f"""{task_description}
 
Input: {input_text}
Output:"""
 
    @staticmethod
    def few_shot(examples: List[dict], input_text: str, instruction: str = "") -> str:
        """Few-shot prompting with examples."""
        prompt_parts = []
        
        if instruction:
            prompt_parts.append(instruction + "\n\n")
        
        for ex in examples:
            prompt_parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}\n\n")
        
        prompt_parts.append(f"Input: {input_text}\nOutput:")
        
        return "".join(prompt_parts)
 
    @staticmethod
    def chain_of_thought(question: str, examples_with_reasoning: Optional[List[dict]] = None) -> str:
        """Chain-of-thought prompting for reasoning tasks."""
        if examples_with_reasoning:
            prompt_parts = []
            for ex in examples_with_reasoning:
                prompt_parts.append(f"Q: {ex['question']}\n")
                prompt_parts.append(f"A: Let's think step by step.\n{ex['reasoning']}\n")
                prompt_parts.append(f"The answer is {ex['answer']}.\n\n")
            
            prompt_parts.append(f"Q: {question}\n")
            prompt_parts.append("A: Let's think step by step.\n")
            return "".join(prompt_parts)
        else:
            return f"Q: {question}\nA: Let's think step by step.\n"
 
    @staticmethod
    def structured_output(input_text: str, output_format: dict, examples: Optional[List[dict]] = None) -> str:
        """Prompt for structured (JSON) output."""
        format_spec = json.dumps(output_format, indent=2)
        
        prompt = f"""Extract information from the input and return it in the following JSON format:
{format_spec}
 
"""
        if examples:
            for ex in examples:
                prompt += f"Input: {ex['input']}\nOutput: {json.dumps(ex['output'])}\n\n"
        
        prompt += f"Input: {input_text}\nOutput:"
        return prompt
 
 
# Example usage
if __name__ == "__main__":
    # Sentiment analysis with few-shot
    examples = [
        {"input": "This movie was amazing!", "output": "positive"},
        {"input": "Terrible waste of time.", "output": "negative"},
        {"input": "Pretty good, worth watching.", "output": "positive"},
    ]
    
    prompt = PromptingStrategies.few_shot(
        examples=examples,
        input_text="I've seen better but it was okay.",
        instruction="Classify the sentiment as positive or negative."
    )
    print(prompt)

Prompt Engineering Best Practices

•Be specific: Clear, unambiguous instructions yield better results
•Use consistent formatting: Same structure for examples and queries
•Order matters: More recent examples have stronger influence
•Include diverse examples: Cover edge cases and variations
•Specify output format: Structured formats (JSON) are more reliable
•Iterative refinement: Treat prompting as an engineering process

Training GPT at Scale

Training GPT-scale models requires sophisticated distributed training techniques and careful engineering. The computational and data requirements are immense.

Distributed Training Strategies:

Data Parallelism: Replicate model across GPUs, each processing different data
- Works well for models that fit in GPU memory
- Gradient synchronization becomes bottleneck at scale
Model Parallelism (Tensor Parallelism): Split individual layers across GPUs
- Attention heads and FFN can be partitioned
- Requires all-reduce communication between partitions
Pipeline Parallelism: Assign different layers to different GPUs
- Micro-batching to keep all GPUs busy
- Memory-efficient but introduces bubbles
3D Parallelism: Combine all three approaches
- Used for GPT-3 scale training
- Requires careful load balancing

GPT-3 Training Statistics
Metric	Value
Total training compute	3.14 × 10²³ FLOPS
Training tokens	~300 billion
Training time	~34 days on 1024 V100 GPUs
Estimated cost	$4.6 million (at cloud prices)
Peak GPU memory per GPU	~32 GB

Memory Optimization Techniques:

Gradient Checkpointing: Trade compute for memory by recomputing activations during backward pass
Mixed Precision Training: Use FP16/BF16 for forward/backward, FP32 for accumulation
ZeRO (Zero Redundancy Optimizer): Partition optimizer states, gradients, and parameters across GPUs
Activation Offloading: Move activations to CPU memory during forward pass
Sparse Attention: Reduce attention computation (discussed in efficient transformers)

The Compute Frontier

Data Quality and Curation:

Training data quality significantly impacts model capability:

Deduplication: Remove duplicate documents to prevent memorization
Quality filtering: Remove low-quality, toxic, or personally identifiable content
Domain balancing: Mix of books, web, code, and curated sources
Tokenization: Subword tokenization (BPE or SentencePiece) optimized for vocabulary efficiency

GPT-3's training mix was approximately: 60% CommonCrawl (filtered), 22% WebText, 16% Books, 3% Wikipedia.

RLHF and Instruction Tuning

The RLHF Pipeline:

Supervised Fine-Tuning (SFT): Fine-tune base model on human-written demonstrations of desired behavior
Reward Model Training: Train a separate model to predict human preference scores
- Collect pairs of outputs and ask humans which is better
- Train model to assign higher scores to preferred outputs
PPO Optimization: Use reinforcement learning (Proximal Policy Optimization) to maximize reward
- Model generates outputs, reward model scores them
- Model is updated to increase probability of high-reward outputs
- KL penalty prevents diverging too far from SFT model

Why RLHF Matters

reward_model.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple
 
class RewardModel(nn.Module):
    """
    Reward model for RLHF: predicts scalar reward for text sequences.
    Built on top of pre-trained GPT backbone.
    """
    def __init__(self, base_model: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = base_model
        
        # Remove language modeling head, add reward head
        self.reward_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)  # Scalar reward
        )
        
    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Get hidden states from backbone
        hidden_states = self.backbone(input_ids, attention_mask, return_hidden_states=True)
        
        # Use last token hidden state (or mean pool)
        # Last token represents the complete sequence
        last_hidden = hidden_states[:, -1, :]
        
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)
    
    def compute_preference_loss(
        self, 
        chosen_ids: torch.Tensor, 
        chosen_mask: torch.Tensor,
        rejected_ids: torch.Tensor, 
        rejected_mask: torch.Tensor
    ) -> Tuple[torch.Tensor, float]:
        """
        Compute loss for preference learning.
        We want: reward(chosen) > reward(rejected)
        
        Uses Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
        """
        reward_chosen = self(chosen_ids, chosen_mask)
        reward_rejected = self(rejected_ids, rejected_mask)
        
        # Log sigmoid of reward difference
        # Loss = -log(sigmoid(r_chosen - r_rejected))
        loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
        
        # Accuracy for monitoring
        accuracy = (reward_chosen > reward_rejected).float().mean().item()
        
        return loss, accuracy
 
 
class PPOTrainer:
    """
    Simplified PPO trainer for RLHF.
    """
    def __init__(
        self, 
        policy_model: nn.Module,
        reward_model: RewardModel,
        ref_model: nn.Module,  # Frozen reference model
        kl_coef: float = 0.1,
        clip_range: float = 0.2
    ):
        self.policy = policy_model
        self.reward_model = reward_model
        self.ref_model = ref_model
        self.kl_coef = kl_coef
        self.clip_range = clip_range
        
    def compute_rewards(
        self, 
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor, 
        response_ids: torch.Tensor
    ) -> torch.Tensor:
        """Compute reward with KL penalty."""
        # Get reward from reward model
        full_sequence = torch.cat([input_ids, response_ids], dim=1)
        full_mask = torch.cat([attention_mask, torch.ones_like(response_ids)], dim=1)
        reward = self.reward_model(full_sequence, full_mask)
        
        # Compute KL divergence between policy and reference
        with torch.no_grad():
            policy_logits = self.policy(full_sequence)
            ref_logits = self.ref_model(full_sequence)
            
            policy_logprobs = F.log_softmax(policy_logits, dim=-1)
            ref_logprobs = F.log_softmax(ref_logits, dim=-1)
            
            kl_div = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1).mean()
        
        # Final reward = external reward - KL penalty
        final_reward = reward - self.kl_coef * kl_div
        
        return final_reward, kl_div
    
    def ppo_step(
        self,
        input_ids: torch.Tensor,
        response_ids: torch.Tensor,
        old_logprobs: torch.Tensor,
        rewards: torch.Tensor,
        advantages: torch.Tensor
    ) -> dict:
        """Single PPO update step."""
        # Get current policy log probabilities
        full_sequence = torch.cat([input_ids, response_ids], dim=1)
        logits = self.policy(full_sequence)
        
        # Extract logprobs for response tokens only
        response_logits = logits[:, input_ids.size(1):-1, :]
        current_logprobs = F.log_softmax(response_logits, dim=-1)
        current_logprobs = torch.gather(
            current_logprobs, 
            dim=-1, 
            index=response_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1)
        
        # PPO clipped objective
        ratio = (current_logprobs - old_logprobs).exp()
        clipped_ratio = torch.clamp(ratio, 1 - self.clip_range, 1 + self.clip_range)
        
        policy_loss = -torch.min(
            ratio * advantages,
            clipped_ratio * advantages
        ).mean()
        
        return {
            'loss': policy_loss,
            'ratio_mean': ratio.mean().item(),
            'clipped_fraction': ((ratio - clipped_ratio).abs() > 1e-6).float().mean().item()
        }

Effects of RLHF

•Improved helpfulness: Models follow instructions more reliably
•Reduced toxicity: Explicit harmful content is filtered
•Better calibration: Models express uncertainty appropriately
•Refusal behavior: Models decline inappropriate requests
•Capabilities trade-off: Some creative/edge capabilities may be reduced

GPT vs BERT: Choosing the Right Approach

GPT and BERT represent two fundamental approaches to language modeling. Understanding when to use each is crucial for practitioners.

GPT vs BERT Comparison
Aspect	GPT (Decoder)	BERT (Encoder)
Best for	Generation, completion, dialogue	Classification, NER, understanding
Attention	Causal (left-to-right)	Bidirectional (all tokens)
Fine-tuning	Prompting / instruction tuning	Add classification heads
Few-shot learning	Excellent via prompting	Limited without fine-tuning
Training signal	100% of tokens	~15% masked tokens
Inference efficiency	Sequential (slow for long outputs)	Parallel (fast)
Open-ended tasks	Natural fit	Requires reformulation

Rule of Thumb

Practical Guidance:

Choose GPT when:

Building chatbots or dialogue systems
Generating creative content (stories, code, summaries)
Tasks can be solved with few examples in prompt
You need flexible, instruction-following behavior

Choose BERT when:

Building search/retrieval systems
Need representations for clustering or similarity
Classification tasks with clear categories
Token-level tasks (NER, POS tagging)
Efficiency is critical (parallel encoding)

Hybrid Approaches:

Many production systems use both:

BERT for efficient retrieval and encoding
GPT for generation and re-ranking
Ensemble of both for robust predictions

Page Complete

2 / 5