No single development in machine learning history has had the cultural and economic impact of Large Language Models (LLMs). ChatGPT reached 100 million users within two months of launch, at the time the fastest adoption of any consumer application. GPT-4 passes professional exams. Claude writes production code. LLaMA powers thousands of applications.
These models represent the convergence of the scaling and emergence phenomena we've explored. They are transformers trained at unprecedented scale on vast text corpora, demonstrating capabilities that have surprised even their creators. This page provides a deep technical understanding of LLMs—how they work, why they work, and what they can and cannot do.
By the end of this page, you will understand: (1) the evolution of the GPT architecture from GPT-1 to GPT-4, (2) the transformer decoder architecture that underlies these models, (3) pre-training objectives and their implications, (4) instruction tuning and RLHF, (5) the capabilities and limitations of current LLMs, and (6) how to reason about LLM behavior.
The GPT (Generative Pre-trained Transformer) series, developed by OpenAI, charts the evolution of language models from research curiosity to transformative technology. Each generation represents not just larger scale but fundamental advances in how we train and deploy these models.
GPT-1 (2018): The Proof of Concept
GPT-1 demonstrated that pre-training a transformer on unlabeled text, then fine-tuning on downstream tasks, could achieve competitive performance across diverse NLP benchmarks.
| Model | Year | Parameters | Training Data | Key Capabilities |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | ~5GB text | Transfer learning to NLP tasks |
| GPT-2 | 2019 | 1.5B | 40GB (WebText) | Zero-shot task performance, coherent generation |
| GPT-3 | 2020 | 175B | ~570GB | Few-shot learning, emergent capabilities |
| InstructGPT | 2022 | ~175B | GPT-3 + RLHF | Instruction following, alignment |
| GPT-4 | 2023 | ~1.8T (MoE) | ~13T tokens (est.) | Multimodal, complex reasoning, professional-level tasks |
GPT-2 (2019): The First Shock
GPT-2 scaled up 10× and introduced genuinely impressive text generation. OpenAI initially withheld the full model over misuse concerns—a first for AI releases.
GPT-3 (2020): The Paradigm Shift
GPT-3 was 100× larger than GPT-2 and introduced few-shot learning—the ability to learn tasks from examples in the prompt. This was the inflection point.
GPT-4 (2023): The Multimodal Giant
GPT-4's architecture and training details remain largely undisclosed, but it represents the current frontier: multimodal input, markedly stronger reasoning, and professional-level performance on standardized exams.
The model behind ChatGPT is not 'GPT-3' or 'GPT-4' directly. ChatGPT was initially based on GPT-3.5—an instruction-tuned variant of GPT-3. Later versions use GPT-4. The distinction between base models (GPT-3, GPT-4) and their fine-tuned, instruction-following variants is crucial for understanding LLM behavior.
All GPT models share the same fundamental architecture: the transformer decoder. Understanding this architecture is essential for understanding LLM behavior.
Core Components:
A transformer decoder consists of stacked identical layers, each containing:
- Masked (causal) multi-head self-attention
- A position-wise feed-forward network
- Layer normalization (applied before each sublayer in GPT-2-style Pre-LN blocks)
- Residual connections around each sublayer
The 'masked' property is crucial: each position can only see tokens that come before it, enabling autoregressive generation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class MultiHeadSelfAttention(nn.Module):
    """
    The core attention mechanism of GPT.

    Masked self-attention ensures each token only attends to previous tokens,
    enabling autoregressive generation.
    """

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        # Q, K, V projections (combined for efficiency)
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        qkv = self.qkv_proj(x)
        q, k, v = qkv.chunk(3, dim=-1)

        # Reshape for multi-head attention
        q = q.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        # Scaled dot-product attention
        scale = math.sqrt(self.d_head)
        attn_weights = torch.matmul(q, k.transpose(-2, -1)) / scale

        # Apply causal mask (crucial for autoregressive generation)
        if mask is None:
            mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
            mask = mask.to(x.device)
        attn_weights = attn_weights.masked_fill(mask, float('-inf'))

        attn_weights = F.softmax(attn_weights, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply attention to values
        output = torch.matmul(attn_weights, v)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        return self.out_proj(output)


class TransformerBlock(nn.Module):
    """
    A single transformer decoder block.

    Each block: LayerNorm -> Attention -> Residual -> LayerNorm -> FFN -> Residual
    GPT uses Pre-LN (LayerNorm before sublayer) for training stability.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(d_model, n_heads, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # GPT-2+ uses GELU activation
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN architecture (more stable for deep models)
        x = x + self.dropout(self.attn(self.ln1(x)))
        x = x + self.ffn(self.ln2(x))
        return x


class GPTModel(nn.Module):
    """
    Simplified GPT architecture.

    This captures the essential structure: embedding + transformer blocks + LM head.
    Production models include many optimizations not shown here.
    """

    def __init__(
        self,
        vocab_size: int = 50257,   # GPT-2 vocabulary size
        d_model: int = 768,
        n_heads: int = 12,
        n_layers: int = 12,
        max_seq_len: int = 1024,
        d_ff: int = 3072,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)

        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        self.ln_final = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying: embedding and output projection share weights
        self.lm_head.weight = self.token_embedding.weight

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len = input_ids.shape

        # Embeddings
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)
        x = self.dropout(x)

        # Transformer blocks
        for block in self.blocks:
            x = block(x)

        # Output projection
        x = self.ln_final(x)
        logits = self.lm_head(x)
        return logits  # Shape: [batch, seq_len, vocab_size]


# GPT-2 configurations
GPT2_CONFIGS = {
    'small':  {'d_model': 768,  'n_heads': 12, 'n_layers': 12},  # 117M
    'medium': {'d_model': 1024, 'n_heads': 16, 'n_layers': 24},  # 345M
    'large':  {'d_model': 1280, 'n_heads': 20, 'n_layers': 36},  # 762M
    'xl':     {'d_model': 1600, 'n_heads': 25, 'n_layers': 48},  # 1.5B
}
```
Key Architectural Insights:
Autoregressive Factorization: The masked attention forces the model to predict each token based only on preceding context: $P(x) = \prod_{t=1}^{T} P(x_t | x_{<t})$
Position Encoding: GPT-1/2 use learned absolute position embeddings; many later and open variants use rotary position embeddings (RoPE), which generalize better to longer sequences.
Weight Tying: The input embedding matrix and output projection share weights, reducing parameters and leveraging symmetry.
Pre-Layer Normalization: Moving LayerNorm before sublayers (Pre-LN) improves training stability for very deep networks.
FFN Expansion: The feed-forward network typically expands dimension by 4×, providing capacity for knowledge storage.
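To make the autoregressive factorization above concrete, here is a minimal sampling loop built on the GPTModel class defined earlier. The generate helper, the temperature, and the top-k cutoff are illustrative choices for this sketch, not the settings of any released GPT model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=50, temperature=1.0, top_k=50):
    """Generate tokens one at a time, feeding each prediction back in as input."""
    model.eval()
    for _ in range(max_new_tokens):
        # Crop the context to the model's maximum sequence length
        context = input_ids[:, -model.position_embedding.num_embeddings:]
        logits = model(context)                    # [batch, seq_len, vocab_size]
        logits = logits[:, -1, :] / temperature    # only the last position predicts the next token

        # Keep the top-k candidates and sample from the renormalized distribution
        topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)
        probs = F.softmax(topk_vals, dim=-1)
        next_token = topk_idx.gather(-1, torch.multinomial(probs, num_samples=1))

        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids

model = GPTModel(**GPT2_CONFIGS['small'])
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")  # count for this simplified config
prompt = torch.randint(0, 50257, (1, 8))     # stand-in for tokenized text
print(generate(model, prompt, max_new_tokens=10).shape)  # torch.Size([1, 18])
```

Because each new token is appended to the context and fed back in, generation cost grows with sequence length; production systems cache the attention keys and values to avoid recomputing them at every step.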
The architecture alone doesn't explain LLM capabilities. The same architecture at different scales produces qualitatively different behavior. Understanding LLMs requires understanding the interaction between architecture, scale, training data, and training procedure.
LLMs are trained on a deceptively simple objective: predict the next token. This language modeling objective, at sufficient scale, produces models with remarkable capabilities.
The Next-Token Prediction Objective:
Given a sequence of tokens $x_1, x_2, ..., x_{t-1}$, predict $x_t$. The training loss is the cross-entropy between predicted and actual next tokens:
$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t | x_{<t}; \theta)$$
This objective is self-supervised (it needs no labels beyond the text itself), applies to any text in any domain, and provides a training signal at every token position, which is what makes training at internet scale practical.
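A sketch of how this loss is computed in practice, reusing the GPTModel defined above; the shift-by-one indexing is the important detail, since the prediction at position t is scored against the token at position t+1.

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    """Cross-entropy between each position's prediction and the next token."""
    logits = model(token_ids)                               # [batch, seq_len, vocab_size]
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))   # predictions for positions 1..T-1
    target = token_ids[:, 1:].reshape(-1)                   # the tokens that actually came next
    return F.cross_entropy(pred, target)

lm = GPTModel(**GPT2_CONFIGS['small'])
batch = torch.randint(0, 50257, (4, 128))     # stand-in for tokenized text
print(language_modeling_loss(lm, batch))      # roughly ln(50257) ≈ 10.8 before any training
```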
Why Next-Token Prediction Works So Well:
The power of next-token prediction is often underestimated. Consider what good prediction requires: syntax and grammar (to continue a sentence legally), factual knowledge (to complete 'The capital of France is ...'), reasoning (to continue a step-by-step argument or a proof), and long-range coherence (to keep track of characters, variables, and claims introduced thousands of tokens earlier).
Good prediction requires good understanding. A model that perfectly predicts the next token must, in some sense, understand the text deeply.
The Compression Perspective:
Another way to understand pre-training: language modeling is lossless compression. A model that predicts tokens well can compress text efficiently. The minimum description length principle suggests that good compression requires discovering the underlying structure of data. A model that compresses internet text well has learned enormous amounts about language, knowledge, and reasoning.
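A rough back-of-the-envelope version of this argument converts cross-entropy loss into a compression rate. The loss value and characters-per-token figure below are assumed for illustration, not measurements of any particular model.

```python
import math

loss_nats_per_token = 2.3                  # hypothetical language-modeling loss (nats/token)
bits_per_token = loss_nats_per_token / math.log(2)   # ≈ 3.3 bits/token
chars_per_token = 4                        # rough average for English BPE tokenizers
bits_per_char = bits_per_token / chars_per_token      # ≈ 0.83 bits/character

# Compare against a naive 8-bits-per-character encoding
print(f"{bits_per_token:.2f} bits/token, {bits_per_char:.2f} bits/char, "
      f"~{8 / bits_per_char:.0f}x smaller than raw 8-bit text")
```

What the model compresses matters as much as how well it compresses. A typical pre-training mixture looks roughly like this (proportions are estimates and vary by model):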
| Source | Proportion | What It Provides |
|---|---|---|
| Web pages (Common Crawl) | 50-60% | Broad knowledge, diverse styles, some noise |
| Books | 10-15% | Deep knowledge, coherent reasoning, formal writing |
| Wikipedia | 3-5% | Factual knowledge, encyclopedic coverage |
| Code (GitHub) | 10-15% | Programming, formal reasoning, structured thinking |
| Academic papers (arXiv, etc.) | 5-10% | Technical knowledge, scientific reasoning |
| Conversations/Forums | 5-10% | Dialogue patterns, Q&A format |
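Pipelines usually turn proportions like these into sampling weights rather than simply concatenating the corpora. Below is a minimal sketch of weighted source sampling; the weights are just midpoints of the ranges above, and real pipelines add deduplication, filtering, and quality-based reweighting.

```python
import random

# Assumed weights, taken as rough midpoints of the table above
mixture = {
    "common_crawl": 0.55,
    "books": 0.12,
    "wikipedia": 0.04,
    "code": 0.12,
    "papers": 0.08,
    "forums": 0.09,
}

def sample_source(rng=random):
    """Pick which corpus the next training document is drawn from."""
    sources, weights = zip(*mixture.items())
    return rng.choices(sources, weights=weights, k=1)[0]

counts = {source: 0 for source in mixture}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)    # empirical counts track the mixture weights
```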
It's remarkable that predicting the next token—a task with no explicit understanding component—produces models that pass bar exams and write production code. This suggests that prediction and understanding are more deeply connected than previously thought, or that 'understanding' itself is more about prediction than we assumed.
Base pre-trained models are powerful but not directly useful. They predict text—but don't reliably follow instructions or engage in helpful dialogue. Instruction tuning and RLHF transform prediction engines into assistants.
The Problem with Base Models:
A base GPT model, given the prompt 'What is the capital of France?', might continue with more quiz questions ('What is the capital of Germany? What is the capital of Spain?') rather than an answer, because a list of questions is a perfectly plausible continuation of that text.
Base models complete text, they don't answer questions. The distribution of internet text doesn't privilege helpful responses.
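Once a model has been instruction-tuned (the process described next), its prompts are wrapped in a chat format rather than raw text. A brief sketch using Hugging Face's apply_chat_template; the checkpoint id is a placeholder, and every model family defines its own template tokens.

```python
from transformers import AutoTokenizer

# Placeholder id: substitute any instruction-tuned checkpoint you have access to
tokenizer = AutoTokenizer.from_pretrained("some-org/some-instruct-model")

messages = [
    {"role": "system", "content": "You are a concise, factual assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Renders the conversation with the model's special role tokens and
# appends the assistant prefix so generation starts in the right place
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```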
Stage 1: Supervised Fine-Tuning (SFT)
The first alignment stage trains the model on demonstrations of good behavior: (instruction, response) pairs in which the response is what a helpful assistant should say.
SFT data typically comes from demonstrations written by trained human annotators (often in response to prompts representative of real usage), curated examples from existing high-quality datasets, and, increasingly, filtered or human-edited model outputs.
Stage 2: Reward Modeling
Human preferences are captured in a reward model: the SFT model generates several candidate responses to each prompt, human labelers rank or compare them, and a separate model is trained to predict those judgments.
The reward model typically answers the question: 'Given prompt X and response Y, how likely is a human to prefer Y over an alternative?'
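Concretely, preference pairs are usually fit with a Bradley-Terry style objective (the formulation used in InstructGPT-style pipelines), where $y_w$ and $y_l$ are the preferred and rejected responses and $r_\theta$ is the reward model:

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) \right]$$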
Stage 3: Reinforcement Learning from Human Feedback (RLHF)
The model is optimized to maximize the learned reward:
$$\max_{\pi} \; \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot \mid x)} \left[ R(x, y) \right] - \beta \cdot \mathrm{KL}\left[ \pi \,\|\, \pi_{\text{SFT}} \right]$$
where $\pi$ is the policy being trained, $D$ is the distribution of prompts, $R$ is the learned reward model, $\pi_{\text{SFT}}$ is the frozen supervised fine-tuned model, and $\beta$ controls how strongly the policy is penalized for drifting away from it.
This is typically optimized using PPO (Proximal Policy Optimization).
"""RLHF Training Pipeline (Conceptual Overview) This illustrates the key stages of post-training alignment.Production implementations are more complex.""" # Stage 1: Supervised Fine-Tuning (SFT)def supervised_fine_tuning(base_model, sft_dataset): """ Fine-tune on (instruction, response) pairs. Dataset format: [ {"instruction": "Explain quantum computing", "response": "Quantum computing uses quantum mechanical phenomena..."}, {"instruction": "Write a Python function to sort a list", "response": "def sort_list(lst):\n return sorted(lst)"}, ... ] """ # Standard supervised learning: minimize cross-entropy # loss = -log P(response | instruction) for instruction, response in sft_dataset: prompt = format_instruction(instruction) loss = cross_entropy_loss(base_model(prompt), response) loss.backward() optimizer.step() return sft_model # Stage 2: Reward Model Trainingdef train_reward_model(sft_model, comparison_dataset): """ Train a model to predict human preferences. Dataset format: [ {"prompt": "...", "chosen": "better response", "rejected": "worse response"}, ... ] """ reward_model = copy_and_add_scalar_head(sft_model) # Bradley-Terry model: P(chosen > rejected) = σ(r_chosen - r_rejected) for prompt, chosen, rejected in comparison_dataset: r_chosen = reward_model(prompt + chosen) r_rejected = reward_model(prompt + rejected) # Maximize log probability that chosen is preferred loss = -log_sigmoid(r_chosen - r_rejected) loss.backward() optimizer.step() return reward_model # Stage 3: RLHF with PPOdef rlhf_training(sft_model, reward_model, prompts): """ Optimize the policy to maximize reward while staying close to SFT. Key components: - Policy: the LLM we're training - Reference: the SFT model (frozen) - Reward: score from reward model minus KL penalty """ policy = copy(sft_model) reference = sft_model # frozen for prompt in prompts: # Sample response from current policy response = policy.generate(prompt) # Compute reward reward = reward_model(prompt + response) # Compute KL penalty (stay close to SFT behavior) log_prob_policy = policy.log_prob(response | prompt) log_prob_ref = reference.log_prob(response | prompt) kl_penalty = beta * (log_prob_policy - log_prob_ref) # Total reward = external reward - KL penalty total_reward = reward - kl_penalty # PPO update (simplified) ppo_update(policy, prompt, response, total_reward) return policy # Alternative: Direct Preference Optimization (DPO)def dpo_training(sft_model, comparison_dataset): """ DPO directly optimizes preferences without reward model. Key insight: The optimal policy under RLHF can be derived analytically. We can optimize directly for preferences without explicit RL. Loss: -log σ(β * (log π(chosen)/π_ref(chosen) - log π(rejected)/π_ref(rejected))) """ policy = copy(sft_model) reference = sft_model # frozen for prompt, chosen, rejected in comparison_dataset: # Log probability ratios log_ratio_chosen = policy.log_prob(chosen | prompt) - reference.log_prob(chosen | prompt) log_ratio_rejected = policy.log_prob(rejected | prompt) - reference.log_prob(rejected | prompt) # DPO loss loss = -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected)) loss.backward() optimizer.step() return policyRLHF can reduce certain capabilities (the 'alignment tax'). Models may become less willing to engage with edge cases or provide complete information. Modern techniques like Constitutional AI, DPO, and careful reward model design aim to achieve helpfulness while minimizing capability loss.
Understanding LLM capabilities requires moving beyond impressions to systematic analysis. Modern LLMs demonstrate remarkable skills across diverse domains—but also consistent patterns of failure.
Core Capability Categories:
Language Understanding and Generation: fluent, coherent text across registers and languages, including summarization, translation, editing, and open-ended writing.
Knowledge and Factual Recall: broad factual coverage absorbed from pre-training, reflected in strong performance on knowledge-heavy professional exams:
| Exam | GPT-4 Score | Human Reference (approx. passing score or median) |
|---|---|---|
| Bar Exam (MBE) | ~90% | ~63% |
| LSAT | 88th percentile | 50th percentile |
| GRE Verbal | 99th percentile | 50th percentile |
| GRE Quantitative | 80th percentile | 50th percentile |
| SAT Math | ~700/800 | ~520/800 |
| Medical Licensing (USMLE) | ~85% | ~60% |
| AP Calculus BC | 4/5 | 3/5 |
Reasoning and Problem Solving: multi-step reasoning, mathematics, and code generation, particularly when the model is prompted to work step by step.
However, reasoning remains one of the most variable capabilities—sometimes brilliant, sometimes bafflingly wrong.
In-Context Learning: picking up a new task from a handful of examples placed directly in the prompt, with no weight updates; this is the few-shot behavior first highlighted by GPT-3 (a minimal prompt-construction sketch follows below).
Metacognition (Limited): models can sometimes express uncertainty or critique their own outputs, but this self-assessment is unreliable and poorly calibrated.
LLM capabilities are highly variable. The same model may solve a hard problem and fail an easy one. Performance depends heavily on prompting, problem framing, and sometimes apparent luck. Point estimates of capability can be misleading—distributions matter.
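To make in-context learning concrete: the 'training data' for a new task is just text placed in the prompt. Below is a minimal few-shot prompt builder; the sentiment task and labels are an arbitrary example chosen for illustration.

```python
def build_few_shot_prompt(examples, query):
    """Format labeled examples plus a new query as a single prompt string."""
    lines = ["Classify the sentiment of each review as Positive or Negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")    # the model completes this line
    return "\n".join(lines)

examples = [
    ("The plot was gripping from start to finish.", "Positive"),
    ("I walked out halfway through.", "Negative"),
    ("A forgettable, by-the-numbers sequel.", "Negative"),
]
print(build_few_shot_prompt(examples, "An absolute delight for the whole family."))
```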
Understanding LLM limitations is as important as understanding their capabilities. Some limitations are engineering challenges that may yield to scale; others appear more fundamental.
Hallucination: The Persistent Problem
LLMs confidently generate false information, a phenomenon called 'hallucination': fabricated citations and quotes, invented biographical details, and plausible-sounding but nonexistent APIs, papers, and legal cases.
Hallucination persists even in the largest models. It appears to be an inherent consequence of training on prediction rather than truth. The model doesn't 'know' that it's wrong because it doesn't track truth—only probability.
The Stochastic Parrot Critique:
Emily Bender and colleagues introduced the 'stochastic parrot' metaphor—arguing LLMs are sophisticated pattern matchers that don't 'understand' in any meaningful sense. Key claims: the models manipulate linguistic form without access to meaning, have no communicative intent, and stitch together sequences that reflect statistical patterns in their training data rather than any model of the world.
Counterarguments: LLMs handle many tasks unlikely to appear verbatim in their training data, probing studies find structured internal representations of the inputs they process, and, as the compression argument above suggests, sufficiently good prediction may require something functionally close to understanding.
The Practical Stance:
For most practitioners, what matters is: What can I reliably use this for?
Think of LLMs as 'extremely well-read but unreliable interns.' They've 'read' billions of documents and can synthesize, summarize, and generate impressively. But they make errors a competent human wouldn't, they can't verify their own claims, and they need supervision for important work.
The LLM landscape has diversified rapidly. Understanding the ecosystem helps practitioners make informed choices about which models to use.
Major Closed-Source Models:
OpenAI: the GPT-4 family (GPT-4, GPT-4 Turbo, GPT-4o), served through ChatGPT and the OpenAI API.
Anthropic: the Claude 3 family (Opus, Sonnet, Haiku), with an emphasis on long context and safety-focused training.
Google DeepMind: the Gemini family, including long-context Gemini 1.5 variants, integrated across Google products.
Others: offerings such as Cohere's Command models and xAI's Grok round out the closed-source landscape.
| Family | Organization | Sizes | Key Features |
|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | Strong baseline, extensive fine-tuning ecosystem |
| Mistral | Mistral AI | 7B, 8x7B, 8x22B | Efficient, MoE variants, permissive license |
| Qwen 2 | Alibaba | 0.5B-72B | Strong multilingual, tools, coding |
| Gemma 2 | Google | 2B, 9B, 27B | Efficient, strong for size, research-friendly |
| Falcon | TII | 7B, 40B, 180B | Large-scale, Apache 2.0 license |
| DeepSeek | DeepSeek | 7B, 67B, MoE | Strong reasoning, code, MoE efficiency |
Open vs. Closed Trade-offs:
Closed-source advantages: typically the strongest raw capability, no infrastructure to manage, continuous improvements without redeployment, and built-in safety tooling.
Open-weight advantages: full control over deployment and data privacy, freedom to fine-tune and modify the model, reproducibility, and lower cost at high volume.
The Fine-Tuning Ecosystem:
Open models power a vast fine-tuning ecosystem: parameter-efficient methods such as LoRA and QLoRA make adaptation feasible on a single GPU, quantized builds enable local and edge deployment, and thousands of domain- and task-specific derivatives are published on hubs like Hugging Face.
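As a sketch of what parameter-efficient fine-tuning looks like with the peft library: the checkpoint id, rank, and target modules below are assumptions for illustration, and the appropriate target_modules depend on the architecture being adapted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "some-org/some-open-7b-model"    # placeholder checkpoint id
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# LoRA: freeze the base weights and train small low-rank adapter matrices
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections; names vary by architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of total parameters
```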
The LLM ecosystem changes monthly. New models frequently reset capability frontiers. Practitioners should stay current with model releases while maintaining understanding of fundamental principles that persist across models.
We have explored LLMs from architecture through capabilities to limitations. The key insights: a single decoder-only architecture, scaled up and trained on next-token prediction, underlies the entire GPT lineage; instruction tuning and RLHF turn raw prediction engines into useful assistants; capabilities are broad but uneven, with hallucination as the persistent failure mode; and the ecosystem now spans both frontier closed models and a fast-moving open-weight world.
What's Next:
LLMs are not the only manifestation of foundation models. The next page explores multimodal models—systems that process and generate across modalities (text, image, audio, video). We'll see how the foundation model paradigm extends beyond language to vision-language models like GPT-4V and CLIP, and to unified architectures that blur the boundaries between modalities.
You now have a comprehensive understanding of large language models—their evolution, architecture, training, capabilities, and limitations. This foundation enables you to work with LLMs effectively, understand their behavior, and stay current as the field evolves.