While BERT approached language understanding through bidirectional encoding and masked prediction, OpenAI charted a different course with GPT (Generative Pre-trained Transformer). Released in June 2018, a few months before BERT, GPT pioneered the use of transformer decoder blocks for language modeling, establishing an architecture that would eventually scale to systems capable of human-like text generation, code writing, and complex reasoning.
The GPT family represents a fundamentally different philosophy from BERT: rather than learning to fill in blanks bidirectionally, GPT learns to predict what comes next—one token at a time, left to right. This autoregressive approach, while seemingly simpler, has proven extraordinarily powerful when scaled, leading to emergent capabilities that continue to surprise researchers.
This page covers the complete GPT architecture—from causal self-attention to positional encoding, from the original GPT-1 to the evolution through GPT-2, GPT-3, and GPT-4. You'll understand why autoregressive modeling scales so effectively, how in-context learning emerges, and how to implement and use GPT models in practice.
Two Philosophies of Language Understanding:
| Aspect | BERT (Encoder) | GPT (Decoder) |
|---|---|---|
| Attention | Bidirectional (sees all tokens) | Causal (sees only past tokens) |
| Training Objective | Masked Language Modeling | Next Token Prediction |
| Primary Strength | Understanding & Classification | Generation & Completion |
| Pre-training Signal | ~15% of tokens (masked) | 100% of tokens (all predicted) |
| Fine-tuning Style | Task-specific heads | Prompting / Few-shot |
These aren't merely engineering choices—they reflect deep assumptions about how language models should learn and what they should optimize for.
At GPT's core is autoregressive language modeling—the task of predicting the next token given all previous tokens. This is expressed mathematically as modeling the joint probability of a sequence by decomposing it into conditional probabilities:
$$P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | x_1, x_2, ..., x_{i-1})$$
This factorization is exact (no approximation) and provides a natural way to model sequential data. Each token prediction is conditioned on the complete left context, and the training objective maximizes the likelihood of the training corpus.
The Training Objective:
GPT minimizes the negative log-likelihood over the training corpus:
$$\mathcal{L} = -\sum_{i=1}^{n} \log P(x_i | x_1, ..., x_{i-1}; \theta)$$
This is equivalent to cross-entropy loss between the predicted probability distribution and the true next token at each position.
Unlike BERT, which computes a loss on only the ~15% of tokens that are masked, autoregressive models predict EVERY token in the sequence. This means: (1) More training signal per example, (2) Natural alignment with generation tasks, (3) No pre-train/fine-tune distribution mismatch from [MASK] tokens. The trade-off is losing bidirectional context, but scale seems to compensate for this limitation.
Efficient Training with Teacher Forcing:
During training, GPT uses teacher forcing—at each position, the model receives the ground truth previous tokens rather than its own predictions. This allows parallel computation of all position predictions:
Input: [BOS] The cat sat on the mat
Target: The cat sat on the mat [EOS]
All predictions can be computed in a single forward pass because the causal attention mask ensures each position only sees previous tokens—exactly what it would see during generation.
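Concretely, the shifted input/target setup reduces to a single cross-entropy computation. The sketch below (function name is illustrative) assumes a model like the GPT class implemented later on this page, which returns logits of shape [batch, seq, vocab]:

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss with teacher forcing.

    token_ids: [batch, seq] integer tensor, e.g. [BOS] The cat sat on the mat [EOS]
    """
    inputs = token_ids[:, :-1]    # model sees tokens 0..n-2
    targets = token_ids[:, 1:]    # and must predict tokens 1..n-1
    logits = model(inputs)        # [batch, seq-1, vocab]; the causal mask is applied inside the model

    # Cross-entropy at every position: 100% of tokens contribute training signal
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1)
    )
```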
Generation at Inference:
At inference time, GPT generates autoregressively: sample (or greedily pick) the next token from the predicted distribution, append it to the context, and repeat until a stop token or length limit is reached.
This sequential generation is inherently slower than encoding (which is fully parallel), but enables open-ended text generation.
```python
import torch
import torch.nn.functional as F
from typing import List, Optional


def autoregressive_generate(
    model,
    tokenizer,
    prompt: str,
    max_length: int = 100,
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    stop_tokens: Optional[List[int]] = None
) -> str:
    """
    Generate text autoregressively using GPT-style model.

    Args:
        model: GPT model with forward method returning logits
        tokenizer: Tokenizer with encode/decode methods
        prompt: Starting text
        max_length: Maximum tokens to generate
        temperature: Sampling temperature (1.0 = neutral, <1 = conservative, >1 = creative)
        top_k: If set, sample from top-k most likely tokens
        top_p: If set, sample from smallest set with cumulative prob >= top_p
        stop_tokens: Token IDs that terminate generation
    """
    model.eval()
    device = next(model.parameters()).device

    # Encode prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
    generated = input_ids.clone()
    stop_tokens = stop_tokens or [tokenizer.eos_token_id]

    with torch.no_grad():
        for _ in range(max_length):
            # Forward pass - get logits for next token
            outputs = model(generated)
            next_token_logits = outputs[:, -1, :]  # [batch, vocab]

            # Apply temperature
            next_token_logits = next_token_logits / temperature

            # Apply top-k filtering
            if top_k is not None:
                indices_to_remove = next_token_logits < torch.topk(next_token_logits, top_k)[0][..., -1, None]
                next_token_logits[indices_to_remove] = float('-inf')

            # Apply top-p (nucleus) filtering
            if top_p is not None:
                sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                # Remove tokens with cumulative probability above threshold
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices_to_remove.scatter(
                    dim=-1, index=sorted_indices, src=sorted_indices_to_remove
                )
                next_token_logits[indices_to_remove] = float('-inf')

            # Sample from distribution
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            # Check for stop token
            if next_token.item() in stop_tokens:
                break

            # Append and continue
            generated = torch.cat([generated, next_token], dim=-1)

    return tokenizer.decode(generated[0], skip_special_tokens=True)


class SamplingStrategies:
    """
    Common sampling strategies for text generation.
    """

    @staticmethod
    def greedy(logits: torch.Tensor) -> torch.Tensor:
        """Select highest probability token."""
        return logits.argmax(dim=-1, keepdim=True)

    @staticmethod
    def temperature_sample(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        """Sample with temperature scaling."""
        scaled_logits = logits / temperature
        probs = F.softmax(scaled_logits, dim=-1)
        return torch.multinomial(probs, num_samples=1)

    @staticmethod
    def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
        """Sample from top-k most likely tokens."""
        logits = logits / temperature
        top_k_logits, top_k_indices = torch.topk(logits, k, dim=-1)
        probs = F.softmax(top_k_logits, dim=-1)
        sampled_idx = torch.multinomial(probs, num_samples=1)
        return torch.gather(top_k_indices, dim=-1, index=sampled_idx)

    @staticmethod
    def nucleus_sample(logits: torch.Tensor, p: float = 0.9, temperature: float = 1.0) -> torch.Tensor:
        """Sample from smallest set with cumulative probability >= p."""
        logits = logits / temperature
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Find cutoff index
        cutoff_mask = cumulative_probs > p
        cutoff_mask[..., 1:] = cutoff_mask[..., :-1].clone()
        cutoff_mask[..., 0] = False

        sorted_logits[cutoff_mask] = float('-inf')
        probs = F.softmax(sorted_logits, dim=-1)
        sampled_idx = torch.multinomial(probs, num_samples=1)
        return torch.gather(sorted_indices, dim=-1, index=sampled_idx)
```

GPT uses a decoder-only transformer architecture—specifically, it uses only the decoder blocks from the original transformer, modified to remove cross-attention (since there's no encoder to attend to).
Key Architectural Components:
Causal Self-Attention (Masked Self-Attention)
Position-wise Feed-Forward Networks
Pre-Layer Normalization (GPT-2 onward)
GPT-1 used post-layer normalization (like the original transformer), applying LayerNorm after the residual connection. GPT-2 and later switched to pre-layer normalization, applying LayerNorm before the attention/FFN sub-layers. Pre-LN provides more stable gradients and enables training of much deeper models without careful learning rate warmup.
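As a schematic contrast (the GPTBlock implementation below uses the Pre-LN form), the two orderings differ only in where LayerNorm sits relative to the residual connection:

```python
import torch.nn as nn

def post_ln(x, sublayer, norm: nn.LayerNorm):
    """GPT-1 / original transformer: LayerNorm applied AFTER the residual addition."""
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm: nn.LayerNorm):
    """GPT-2 onward: LayerNorm applied to the sublayer input; the residual path is
    left untouched, which keeps gradient magnitudes stable in very deep stacks."""
    return x + sublayer(norm(x))
```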
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class CausalSelfAttention(nn.Module):
    """
    Causal (masked) self-attention for GPT.
    Each position can only attend to previous positions.
    """

    def __init__(self, hidden_size: int, num_heads: int, max_seq_length: int, dropout: float = 0.1):
        super().__init__()
        assert hidden_size % num_heads == 0

        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.hidden_size = hidden_size

        # Combined projection for Q, K, V (more efficient)
        self.c_attn = nn.Linear(hidden_size, 3 * hidden_size)
        self.c_proj = nn.Linear(hidden_size, hidden_size)

        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)

        # Causal mask: lower triangular matrix
        # Register as buffer (not a parameter, but saved in state_dict)
        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(max_seq_length, max_seq_length)).view(
                1, 1, max_seq_length, max_seq_length
            )
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_length, hidden_size = x.shape

        # Compute Q, K, V
        qkv = self.c_attn(x)  # [batch, seq, 3 * hidden]
        q, k, v = qkv.split(self.hidden_size, dim=-1)

        # Reshape for multi-head attention
        q = q.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention with causal mask
        scale = 1.0 / math.sqrt(self.head_dim)
        attn_weights = torch.matmul(q, k.transpose(-2, -1)) * scale

        # Apply causal mask: set future positions to -inf
        causal_mask = self.causal_mask[:, :, :seq_length, :seq_length]
        attn_weights = attn_weights.masked_fill(causal_mask == 0, float('-inf'))

        attn_weights = F.softmax(attn_weights, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)

        # Weighted sum of values
        output = torch.matmul(attn_weights, v)

        # Reshape and project
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_length, hidden_size)
        output = self.c_proj(output)
        output = self.resid_dropout(output)

        return output


class GPTBlock(nn.Module):
    """
    Single GPT transformer block with pre-layer normalization.
    """

    def __init__(self, hidden_size: int, num_heads: int, max_seq_length: int, dropout: float = 0.1):
        super().__init__()
        self.ln_1 = nn.LayerNorm(hidden_size)
        self.attn = CausalSelfAttention(hidden_size, num_heads, max_seq_length, dropout)
        self.ln_2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
            nn.Dropout(dropout)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: LayerNorm before attention/MLP
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


class GPT(nn.Module):
    """
    GPT model: stack of decoder blocks with language modeling head.
    """

    def __init__(
        self,
        vocab_size: int,
        hidden_size: int = 768,
        num_layers: int = 12,
        num_heads: int = 12,
        max_seq_length: int = 1024,
        dropout: float = 0.1
    ):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.position_embedding = nn.Embedding(max_seq_length, hidden_size)
        self.dropout = nn.Dropout(dropout)

        self.blocks = nn.ModuleList([
            GPTBlock(hidden_size, num_heads, max_seq_length, dropout)
            for _ in range(num_layers)
        ])

        self.ln_f = nn.LayerNorm(hidden_size)

        # Language modeling head (weight tied with token embedding)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight = self.token_embedding.weight  # Weight tying

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        batch_size, seq_length = input_ids.shape
        device = input_ids.device

        # Get embeddings
        positions = torch.arange(seq_length, device=device).unsqueeze(0)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)
        x = self.dropout(x)

        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x)

        x = self.ln_f(x)

        # Language model logits
        logits = self.lm_head(x)
        return logits
```

Weight Tying:
GPT ties the weights of the token embedding layer and the language modeling head. This means the same matrix is used both to map input token IDs to embedding vectors and to project final hidden states onto vocabulary logits.
Weight tying reduces parameters significantly (vocab_size × hidden_size parameters saved) and provides a form of regularization by constraining the embedding and output spaces to share the same geometry.
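A quick check against the GPT implementation above makes this concrete (a sketch; the dimensions are those of a GPT-2-small-sized model):

```python
model = GPT(vocab_size=50257, hidden_size=768)

# The LM head and the token embedding share one parameter tensor
assert model.lm_head.weight is model.token_embedding.weight

# Parameters saved by not keeping a separate output projection
print(f"Saved: {50257 * 768 / 1e6:.1f}M parameters")  # ≈ 38.6M
```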
Positional Encoding:
GPT uses learned positional embeddings rather than sinusoidal encodings. Each position from 0 to max_seq_length-1 has its own learned embedding vector added to the token embedding. This allows the model to learn position-specific patterns from data.
The GPT family has evolved through several major iterations, each bringing architectural refinements, scale increases, and surprising emergent capabilities.
GPT-1 was the first to demonstrate that a transformer decoder pre-trained on language modeling could be effectively fine-tuned for diverse NLP tasks.
Key Contributions: a single, task-agnostic transformer decoder, generatively pre-trained on unlabeled text and then fine-tuned with lightweight task-specific input transformations, outperforming many purpose-built architectures of the time.
| Version | Parameters | Layers | Hidden Size | Context Length | Training Data |
|---|---|---|---|---|---|
| GPT-1 | 117M | 12 | 768 | 512 | BookCorpus (4.5GB) |
| GPT-2 | 1.5B | 48 | 1600 | 1024 | WebText (40GB) |
| GPT-3 | 175B | 96 | 12288 | 2048 | CommonCrawl, etc. (570GB) |
| GPT-4 | ~1.8T* | Unknown | Unknown | 8K/32K/128K | Unknown |
*GPT-4 specifications are based on estimates and leaks; OpenAI has not disclosed official architecture details.
GPT-2 increased scale by 10x and demonstrated remarkable zero-shot capabilities—performing tasks without any fine-tuning or task-specific training.
Key Innovations:
GPT-2's key insight was that language models trained at sufficient scale implicitly learn to perform tasks described in text. Instead of training a classifier for sentiment, you prompt: 'Review: Great movie! Sentiment: positive. Review: Terrible plot. Sentiment:' and let the model complete. This suggested that scaling up language modeling alone might yield increasingly general capabilities.
GPT-3's 175 billion parameters crossed a threshold where qualitatively new capabilities emerged.
In-Context Learning:
GPT-3 demonstrated remarkable few-shot learning—improving at tasks simply by seeing a few examples in the prompt:
Translate English to French:
English: cheese
French: fromage
English: hello
French: bonjour
English: computer
French:
Without any gradient updates, GPT-3 learns from the pattern in the context and produces "ordinateur" (the correct translation).
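You can try this style of in-context completion with an open checkpoint through the Hugging Face transformers library. The sketch below uses GPT-2 small purely because it is freely available; a model that size completes such patterns far less reliably than GPT-3-scale models:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French:\n"
    "English: cheese\nFrench: fromage\n"
    "English: hello\nFrench: bonjour\n"
    "English: computer\nFrench:"
)

# Greedy continuation of the few-shot pattern; no gradient updates are involved
output = generator(prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"])
```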
GPT-4 represents the current frontier, with capabilities that approach human-level performance on many professional benchmarks.
Key Advances: multimodal inputs (images as well as text), much longer context windows (8K, 32K, and 128K-token variants), markedly stronger reasoning and professional-exam performance, and more extensive alignment and safety tuning.
Research has shown that model performance follows predictable power laws with compute, data, and parameters. GPT's evolution has been driven largely by scaling these factors—more parameters, more data, more compute. The returns are real but diminishing: because loss falls as a power law in compute, each 10x increase in compute buys only a roughly constant absolute reduction in loss.
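To make the shape of these returns concrete, here is an illustrative sketch; the exponent and constant below are assumed round numbers in the spirit of published compute scaling laws, not measured values:

```python
# Illustrative compute scaling law: L(C) = (C_c / C) ** alpha
alpha = 0.05   # assumed exponent (roughly the order reported in scaling-law studies)
C_c = 2.3e8    # assumed constant (arbitrary compute units, illustrative only)

for compute in [1e3, 1e4, 1e5, 1e6]:
    loss = (C_c / compute) ** alpha
    print(f"compute = {compute:.0e} -> loss ≈ {loss:.3f}")

# Each 10x jump in compute multiplies the loss by 10 ** -alpha ≈ 0.89:
# a steady relative improvement, but shrinking absolute gains.
```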
One of GPT's most remarkable properties is in-context learning—the ability to adapt to new tasks at inference time without any parameter updates, simply by conditioning on a suitable prompt.
Prompt Paradigms:
Zero-shot: Task description only, no examples
Classify the sentiment of the following review as positive or negative.
Review: This product exceeded all my expectations!
Sentiment:
One-shot: Single example provided
Classify sentiment:
Review: Great product! Sentiment: positive
Review: Waste of money. Sentiment:
Few-shot: Multiple examples provided
Review: Amazing quality → positive
Review: Terrible service → negative
Review: Exactly what I needed → positive
Review: Broke after one day →
The mechanism behind in-context learning remains an active research area. Leading theories suggest that: (1) Pre-training implicitly meta-learns algorithms for learning from context, (2) Larger models have more capacity to store and apply these implicit algorithms, (3) The attention mechanism can implement gradient descent-like updates in the forward pass.
Prompt Engineering Techniques:
Effective prompting can dramatically improve GPT performance. Key techniques include:
Chain-of-Thought (CoT) Prompting:
For reasoning tasks, asking the model to "think step by step" often improves accuracy:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 balls. How many tennis balls does he have now?
A: Let's think step by step.
Roger started with 5 balls.
He bought 2 cans with 3 balls each = 2 × 3 = 6 balls.
Total = 5 + 6 = 11 balls.
The answer is 11.
On some math word-problem benchmarks, CoT prompting has been reported to raise large-model accuracy severalfold compared with standard prompting.
```python
from dataclasses import dataclass
from typing import List, Optional
import json


@dataclass
class PromptTemplate:
    """
    Structured prompt template for GPT-style models.
    """
    system_prompt: str
    examples: List[dict]
    instruction: str

    def format(self, input_text: str) -> str:
        """Format complete prompt with examples and input."""
        parts = []

        # System context
        if self.system_prompt:
            parts.append(self.system_prompt + "\n\n")

        # Few-shot examples
        for example in self.examples:
            parts.append(f"Input: {example['input']}\n")
            parts.append(f"Output: {example['output']}\n\n")

        # Current query
        parts.append(f"Input: {input_text}\n")
        parts.append("Output:")

        return "".join(parts)


class PromptingStrategies:
    """
    Common prompting strategies for GPT models.
    """

    @staticmethod
    def zero_shot(task_description: str, input_text: str) -> str:
        """Zero-shot prompting with task description only."""
        return f"""{task_description}

Input: {input_text}
Output:"""

    @staticmethod
    def few_shot(examples: List[dict], input_text: str, instruction: str = "") -> str:
        """Few-shot prompting with examples."""
        prompt_parts = []
        if instruction:
            prompt_parts.append(instruction + "\n\n")

        for ex in examples:
            prompt_parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}\n\n")

        prompt_parts.append(f"Input: {input_text}\nOutput:")
        return "".join(prompt_parts)

    @staticmethod
    def chain_of_thought(question: str, examples_with_reasoning: Optional[List[dict]] = None) -> str:
        """Chain-of-thought prompting for reasoning tasks."""
        if examples_with_reasoning:
            prompt_parts = []
            for ex in examples_with_reasoning:
                prompt_parts.append(f"Q: {ex['question']}\n")
                prompt_parts.append(f"A: Let's think step by step.\n{ex['reasoning']}\n")
                prompt_parts.append(f"The answer is {ex['answer']}.\n\n")
            prompt_parts.append(f"Q: {question}\n")
            prompt_parts.append("A: Let's think step by step.\n")
            return "".join(prompt_parts)
        else:
            return f"Q: {question}\nA: Let's think step by step.\n"

    @staticmethod
    def structured_output(input_text: str, output_format: dict, examples: Optional[List[dict]] = None) -> str:
        """Prompt for structured (JSON) output."""
        format_spec = json.dumps(output_format, indent=2)
        prompt = f"""Extract information from the input and return it in the following JSON format:
{format_spec}

"""
        if examples:
            for ex in examples:
                prompt += f"Input: {ex['input']}\nOutput: {json.dumps(ex['output'])}\n\n"

        prompt += f"Input: {input_text}\nOutput:"
        return prompt


# Example usage
if __name__ == "__main__":
    # Sentiment analysis with few-shot
    examples = [
        {"input": "This movie was amazing!", "output": "positive"},
        {"input": "Terrible waste of time.", "output": "negative"},
        {"input": "Pretty good, worth watching.", "output": "positive"},
    ]

    prompt = PromptingStrategies.few_shot(
        examples=examples,
        input_text="I've seen better but it was okay.",
        instruction="Classify the sentiment as positive or negative."
    )
    print(prompt)
```

Training GPT-scale models requires sophisticated distributed training techniques and careful engineering. The computational and data requirements are immense.
Distributed Training Strategies:
Data Parallelism: Replicate model across GPUs, each processing different data
Model Parallelism (Tensor Parallelism): Split individual layers across GPUs
Pipeline Parallelism: Assign different layers to different GPUs
3D Parallelism: Combine all three approaches
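Of these, data parallelism is the easiest to reproduce on commodity hardware. Below is a minimal sketch using PyTorch's DistributedDataParallel (it assumes launch via torchrun, which sets the rank and world-size environment variables); tensor and pipeline parallelism typically rely on specialized frameworks such as Megatron-LM or DeepSpeed and are not shown:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_data_parallel(model: torch.nn.Module) -> DDP:
    """Wrap a model for data parallelism: each process holds a full replica,
    consumes a different shard of each batch, and gradients are all-reduced."""
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    return DDP(model, device_ids=[local_rank])
```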
| GPT-3 Training Metric | Value |
|---|---|
| Total training compute | 3.14 × 10²³ FLOPs |
| Training tokens | ~300 billion |
| Training time | ~34 days on 1024 V100 GPUs |
| Estimated cost | ~$4.6 million (at cloud prices) |
| Peak memory per GPU | ~32 GB |
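The compute figure is consistent with the common back-of-the-envelope estimate of about 6 FLOPs per parameter per training token (a rough rule of thumb, not an official accounting):

```python
params = 175e9              # GPT-3 parameter count
tokens = 300e9              # training tokens
flops = 6 * params * tokens # forward + backward ≈ 6 FLOPs per parameter per token
print(f"{flops:.2e} FLOPs") # ≈ 3.15e+23, in line with the reported 3.14 × 10²³
```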
Memory Optimization Techniques:
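Commonly used techniques at this scale include mixed-precision training, gradient checkpointing (recomputing activations in the backward pass), gradient accumulation, and optimizer-state sharding (e.g. ZeRO). Here is a minimal sketch of the first and third patterns, mixed precision plus gradient accumulation, using PyTorch's AMP utilities (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def accumulation_step(model, micro_batches, optimizer,
                      scaler: torch.cuda.amp.GradScaler, accum_steps: int = 8):
    """One optimizer update spread over several micro-batches (illustrative sketch)."""
    optimizer.zero_grad(set_to_none=True)
    for input_ids, targets in micro_batches[:accum_steps]:
        with torch.cuda.amp.autocast():   # activations in fp16/bf16, master weights stay fp32
            logits = model(input_ids)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        # Scale so the accumulated gradient matches one large batch, then accumulate
        scaler.scale(loss / accum_steps).backward()
    scaler.step(optimizer)                # unscales gradients, skips the step on inf/nan
    scaler.update()
```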
Training frontier GPT models is restricted to organizations with access to massive compute clusters. GPT-4's training reportedly cost over $100 million and required months on thousands of specialized accelerators. This concentration of capability raises important questions about accessibility and democratization of AI research.
Data Quality and Curation:
Training data quality significantly impacts model capability:
GPT-3's training mix was approximately: 60% CommonCrawl (filtered), 22% WebText, 16% Books, 3% Wikipedia.
Raw GPT models trained on next-token prediction can generate coherent text but often fail to be helpful, safe, or aligned with human intentions. Reinforcement Learning from Human Feedback (RLHF) addresses this by fine-tuning models to follow instructions and produce preferred outputs.
The RLHF Pipeline:
Supervised Fine-Tuning (SFT): Fine-tune base model on human-written demonstrations of desired behavior
Reward Model Training: Train a separate model to predict human preference scores
PPO Optimization: Use reinforcement learning (Proximal Policy Optimization) to maximize reward
RLHF transforms GPT from a text continuation engine into an assistant that follows instructions. InstructGPT (RLHF-tuned GPT-3) with 1.3B parameters was preferred by human raters over vanilla GPT-3 with 175B parameters. This demonstrates that alignment is as important as raw capability.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple


class RewardModel(nn.Module):
    """
    Reward model for RLHF: predicts scalar reward for text sequences.
    Built on top of pre-trained GPT backbone.
    """

    def __init__(self, base_model: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = base_model
        # Remove language modeling head, add reward head
        self.reward_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)  # Scalar reward
        )

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Get hidden states from backbone
        hidden_states = self.backbone(input_ids, attention_mask, return_hidden_states=True)

        # Use last token hidden state (or mean pool)
        # Last token represents the complete sequence
        last_hidden = hidden_states[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)

    def compute_preference_loss(
        self,
        chosen_ids: torch.Tensor,
        chosen_mask: torch.Tensor,
        rejected_ids: torch.Tensor,
        rejected_mask: torch.Tensor
    ) -> Tuple[torch.Tensor, float]:
        """
        Compute loss for preference learning.
        We want: reward(chosen) > reward(rejected)
        Uses Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
        """
        reward_chosen = self(chosen_ids, chosen_mask)
        reward_rejected = self(rejected_ids, rejected_mask)

        # Log sigmoid of reward difference
        # Loss = -log(sigmoid(r_chosen - r_rejected))
        loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()

        # Accuracy for monitoring
        accuracy = (reward_chosen > reward_rejected).float().mean().item()

        return loss, accuracy


class PPOTrainer:
    """
    Simplified PPO trainer for RLHF.
    """

    def __init__(
        self,
        policy_model: nn.Module,
        reward_model: RewardModel,
        ref_model: nn.Module,  # Frozen reference model
        kl_coef: float = 0.1,
        clip_range: float = 0.2
    ):
        self.policy = policy_model
        self.reward_model = reward_model
        self.ref_model = ref_model
        self.kl_coef = kl_coef
        self.clip_range = clip_range

    def compute_rewards(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
        response_ids: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Compute reward with KL penalty."""
        # Get reward from reward model
        full_sequence = torch.cat([input_ids, response_ids], dim=1)
        full_mask = torch.cat([attention_mask, torch.ones_like(response_ids)], dim=1)
        reward = self.reward_model(full_sequence, full_mask)

        # Compute KL divergence between policy and reference
        with torch.no_grad():
            policy_logits = self.policy(full_sequence)
            ref_logits = self.ref_model(full_sequence)

            policy_logprobs = F.log_softmax(policy_logits, dim=-1)
            ref_logprobs = F.log_softmax(ref_logits, dim=-1)
            kl_div = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1).mean()

        # Final reward = external reward - KL penalty
        final_reward = reward - self.kl_coef * kl_div
        return final_reward, kl_div

    def ppo_step(
        self,
        input_ids: torch.Tensor,
        response_ids: torch.Tensor,
        old_logprobs: torch.Tensor,
        rewards: torch.Tensor,
        advantages: torch.Tensor
    ) -> dict:
        """Single PPO update step."""
        # Get current policy log probabilities
        full_sequence = torch.cat([input_ids, response_ids], dim=1)
        logits = self.policy(full_sequence)

        # Extract logprobs for response tokens only
        response_logits = logits[:, input_ids.size(1):-1, :]
        current_logprobs = F.log_softmax(response_logits, dim=-1)
        current_logprobs = torch.gather(
            current_logprobs, dim=-1, index=response_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1)

        # PPO clipped objective
        ratio = (current_logprobs - old_logprobs).exp()
        clipped_ratio = torch.clamp(ratio, 1 - self.clip_range, 1 + self.clip_range)

        policy_loss = -torch.min(
            ratio * advantages,
            clipped_ratio * advantages
        ).mean()

        return {
            'loss': policy_loss,
            'ratio_mean': ratio.mean().item(),
            'clipped_fraction': ((ratio - clipped_ratio).abs() > 1e-6).float().mean().item()
        }
```

GPT and BERT represent two fundamental approaches to language modeling. Understanding when to use each is crucial for practitioners.
| Aspect | GPT (Decoder) | BERT (Encoder) |
|---|---|---|
| Best for | Generation, completion, dialogue | Classification, NER, understanding |
| Attention | Causal (left-to-right) | Bidirectional (all tokens) |
| Fine-tuning | Prompting / instruction tuning | Add classification heads |
| Few-shot learning | Excellent via prompting | Limited without fine-tuning |
| Training signal | 100% of tokens | ~15% masked tokens |
| Inference efficiency | Sequential (slow for long outputs) | Parallel (fast) |
| Open-ended tasks | Natural fit | Requires reformulation |
Use GPT-style models when you need to generate text or when your task can be framed as completion. Use BERT-style models when you need fixed-size representations for classification, retrieval, or when you need efficient parallel processing of inputs.
Practical Guidance:
Choose GPT when: the task involves generating text (dialogue, summarization, code), can naturally be framed as completion, or benefits from zero-/few-shot prompting without task-specific fine-tuning.
Choose BERT when: you need sentence- or token-level classification (sentiment, NER), fixed-size representations for retrieval and ranking, or efficient parallel encoding of large batches of inputs.
Hybrid Approaches:
Many production systems use both: for example, a BERT-style encoder retrieves and ranks candidate documents, while a GPT-style model generates the final response conditioned on the retrieved context.
You now understand GPT's autoregressive architecture, its evolution from GPT-1 to GPT-4, in-context learning, RLHF alignment, and how it compares to BERT. GPT's generative paradigm has proven remarkably versatile, enabling capabilities from code generation to complex reasoning. Next, we'll explore T5, which unifies encoder-decoder architectures under a text-to-text framework.