No single development in machine learning history has had the cultural and economic impact of Large Language Models (LLMs). ChatGPT reached 100 million users within two months of launch, at the time the fastest adoption of any consumer application. GPT-4 passes professional exams. Claude writes production code. LLaMA powers thousands of applications.
These models represent the convergence of the scaling and emergence phenomena we've explored. They are transformers trained at unprecedented scale on vast text corpora, demonstrating capabilities that have surprised even their creators. This page provides a deep technical understanding of LLMs—how they work, why they work, and what they can and cannot do.
By the end of this page, you will understand: (1) the evolution of the GPT architecture from GPT-1 to GPT-4, (2) the transformer decoder architecture that underlies these models, (3) pre-training objectives and their implications, (4) instruction tuning and RLHF, (5) the capabilities and limitations of current LLMs, and (6) how to reason about LLM behavior.
The GPT (Generative Pre-trained Transformer) series, developed by OpenAI, charts the evolution of language models from research curiosity to transformative technology. Each generation represents not just larger scale but fundamental advances in how we train and deploy these models.
GPT-1 (2018): The Proof of Concept
GPT-1 demonstrated that pre-training a transformer on unlabeled text, then fine-tuning on downstream tasks, could achieve competitive performance across diverse NLP benchmarks.
| Model | Year | Parameters | Training Data | Key Capabilities |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | ~5GB text | Transfer learning to NLP tasks |
| GPT-2 | 2019 | 1.5B | 40GB (WebText) | Zero-shot task performance, coherent generation |
| GPT-3 | 2020 | 175B | ~570GB | Few-shot learning, emergent capabilities |
| InstructGPT | 2022 | ~175B | GPT-3 + RLHF | Instruction following, alignment |
| GPT-4 | 2023 | ~1.8T (MoE) | ~13T tokens (est.) | Multimodal, complex reasoning, professional-level tasks |
GPT-2 (2019): The First Shock
GPT-2 scaled up 10× and introduced genuinely impressive text generation. OpenAI initially withheld the full model over misuse concerns—a first for AI releases.
GPT-3 (2020): The Paradigm Shift
GPT-3 was 100× larger than GPT-2 and introduced few-shot learning—the ability to learn tasks from examples in the prompt. This was the inflection point.
GPT-4 (2023): The Multimodal Giant
GPT-4's architecture and training details remain largely undisclosed, but it represents the current frontier: multimodal input, markedly stronger reasoning, and professional-level performance on standardized exams.
The model behind ChatGPT is not 'GPT-3' or 'GPT-4' directly. ChatGPT was initially based on GPT-3.5—an instruction-tuned variant of GPT-3. Later versions use GPT-4. The distinction between base models (GPT-3, GPT-4) and their fine-tuned, instruction-following variants is crucial for understanding LLM behavior.
All GPT models share the same fundamental architecture: the transformer decoder. Understanding this architecture is essential for understanding LLM behavior.
Core Components:
A transformer decoder consists of stacked identical layers, each containing:
- Masked (causal) multi-head self-attention
- A position-wise feed-forward network
- Layer normalization (applied before each sublayer in GPT-2-style Pre-LN blocks)
- Residual connections around each sublayer
The 'masked' property is crucial: each position can only see tokens that come before it, enabling autoregressive generation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class MultiHeadSelfAttention(nn.Module):
    """
    The core attention mechanism of GPT.

    Masked self-attention ensures each token only attends to previous tokens,
    enabling autoregressive generation.
    """

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        # Q, K, V projections (combined for efficiency)
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        qkv = self.qkv_proj(x)
        q, k, v = qkv.chunk(3, dim=-1)

        # Reshape for multi-head attention
        q = q.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        # Scaled dot-product attention
        scale = math.sqrt(self.d_head)
        attn_weights = torch.matmul(q, k.transpose(-2, -1)) / scale

        # Apply causal mask (crucial for autoregressive generation)
        if mask is None:
            mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
            mask = mask.to(x.device)
        attn_weights = attn_weights.masked_fill(mask, float('-inf'))

        attn_weights = F.softmax(attn_weights, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply attention to values
        output = torch.matmul(attn_weights, v)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        return self.out_proj(output)


class TransformerBlock(nn.Module):
    """
    A single transformer decoder block.

    Each block: LayerNorm -> Attention -> Residual -> LayerNorm -> FFN -> Residual
    GPT uses Pre-LN (LayerNorm before sublayer) for training stability.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(d_model, n_heads, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # GPT-2+ uses GELU activation
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN architecture (more stable for deep models)
        x = x + self.dropout(self.attn(self.ln1(x)))
        x = x + self.ffn(self.ln2(x))
        return x


class GPTModel(nn.Module):
    """
    Simplified GPT architecture.

    This captures the essential structure: embedding + transformer blocks + LM head.
    Production models include many optimizations not shown here.
    """

    def __init__(
        self,
        vocab_size: int = 50257,   # GPT-2 vocabulary size
        d_model: int = 768,
        n_heads: int = 12,
        n_layers: int = 12,
        max_seq_len: int = 1024,
        d_ff: int = 3072,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)

        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])

        self.ln_final = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying: embedding and output projection share weights
        self.lm_head.weight = self.token_embedding.weight

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len = input_ids.shape

        # Embeddings
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)
        x = self.dropout(x)

        # Transformer blocks
        for block in self.blocks:
            x = block(x)

        # Output projection
        x = self.ln_final(x)
        logits = self.lm_head(x)
        return logits  # Shape: [batch, seq_len, vocab_size]


# GPT-2 configurations
GPT2_CONFIGS = {
    'small':  {'d_model': 768,  'n_heads': 12, 'n_layers': 12},  # 117M
    'medium': {'d_model': 1024, 'n_heads': 16, 'n_layers': 24},  # 345M
    'large':  {'d_model': 1280, 'n_heads': 20, 'n_layers': 36},  # 762M
    'xl':     {'d_model': 1600, 'n_heads': 25, 'n_layers': 48},  # 1.5B
}
```
Key Architectural Insights:
Autoregressive Factorization: The masked attention forces the model to predict each token based only on preceding context: $P(x) = \prod_{t=1}^{T} P(x_t | x_{<t})$
Position Encoding: GPT-1/2 use learned absolute position embeddings; many later and open variants use rotary position embeddings (RoPE), which generalize better to longer sequences.
Weight Tying: The input embedding matrix and output projection share weights, reducing parameters and leveraging symmetry.
Pre-Layer Normalization: Moving LayerNorm before sublayers (Pre-LN) improves training stability for very deep networks.
FFN Expansion: The feed-forward network typically expands dimension by 4×, providing capacity for knowledge storage.
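To make the autoregressive factorization above concrete, here is a minimal sampling loop built on the GPTModel class defined earlier. The generate helper, the temperature, and the top-k cutoff are illustrative choices for this sketch, not the settings of any released GPT model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=50, temperature=1.0, top_k=50):
    """Generate tokens one at a time, feeding each prediction back in as input."""
    model.eval()
    for _ in range(max_new_tokens):
        # Crop the context to the model's maximum sequence length
        context = input_ids[:, -model.position_embedding.num_embeddings:]
        logits = model(context)                    # [batch, seq_len, vocab_size]
        logits = logits[:, -1, :] / temperature    # only the last position predicts the next token

        # Keep the top-k candidates and sample from the renormalized distribution
        topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)
        probs = F.softmax(topk_vals, dim=-1)
        next_token = topk_idx.gather(-1, torch.multinomial(probs, num_samples=1))

        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids

model = GPTModel(**GPT2_CONFIGS['small'])
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")  # count for this simplified config
prompt = torch.randint(0, 50257, (1, 8))     # stand-in for tokenized text
print(generate(model, prompt, max_new_tokens=10).shape)  # torch.Size([1, 18])
```

Because each new token is appended to the context and fed back in, generation cost grows with sequence length; production systems cache the attention keys and values to avoid recomputing them at every step.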
The architecture alone doesn't explain LLM capabilities. The same architecture at different scales produces qualitatively different behavior. Understanding LLMs requires understanding the interaction between architecture, scale, training data, and training procedure.
LLMs are trained on a deceptively simple objective: predict the next token. This language modeling objective, at sufficient scale, produces models with remarkable capabilities.
The Next-Token Prediction Objective:
Given a sequence of tokens $x_1, x_2, ..., x_{t-1}$, predict $x_t$. The training loss is the cross-entropy between predicted and actual next tokens:
$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t | x_{<t}; \theta)$$
This objective is self-supervised (it needs no labels beyond the text itself), applies to any text in any domain, and provides a training signal at every token position, which is what makes training at internet scale practical.
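A sketch of how this loss is computed in practice, reusing the GPTModel defined above; the shift-by-one indexing is the important detail, since the prediction at position t is scored against the token at position t+1.

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    """Cross-entropy between each position's prediction and the next token."""
    logits = model(token_ids)                               # [batch, seq_len, vocab_size]
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))   # predictions for positions 1..T-1
    target = token_ids[:, 1:].reshape(-1)                   # the tokens that actually came next
    return F.cross_entropy(pred, target)

lm = GPTModel(**GPT2_CONFIGS['small'])
batch = torch.randint(0, 50257, (4, 128))     # stand-in for tokenized text
print(language_modeling_loss(lm, batch))      # roughly ln(50257) ≈ 10.8 before any training
```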
Why Next-Token Prediction Works So Well:
The power of next-token prediction is often underestimated. Consider what good prediction requires: syntax and grammar (to continue a sentence legally), factual knowledge (to complete 'The capital of France is ...'), reasoning (to continue a step-by-step argument or a proof), and long-range coherence (to keep track of characters, variables, and claims introduced thousands of tokens earlier).
Good prediction requires good understanding. A model that perfectly predicts the next token must, in some sense, understand the text deeply.
The Compression Perspective:
Another way to understand pre-training: language modeling is lossless compression. A model that predicts tokens well can compress text efficiently. The minimum description length principle suggests that good compression requires discovering the underlying structure of data. A model that compresses internet text well has learned enormous amounts about language, knowledge, and reasoning.
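A rough back-of-the-envelope version of this argument converts cross-entropy loss into a compression rate. The loss value and characters-per-token figure below are assumed for illustration, not measurements of any particular model.

```python
import math

loss_nats_per_token = 2.3                  # hypothetical language-modeling loss (nats/token)
bits_per_token = loss_nats_per_token / math.log(2)   # ≈ 3.3 bits/token
chars_per_token = 4                        # rough average for English BPE tokenizers
bits_per_char = bits_per_token / chars_per_token      # ≈ 0.83 bits/character

# Compare against a naive 8-bits-per-character encoding
print(f"{bits_per_token:.2f} bits/token, {bits_per_char:.2f} bits/char, "
      f"~{8 / bits_per_char:.0f}x smaller than raw 8-bit text")
```

What the model compresses matters as much as how well it compresses. A typical pre-training mixture looks roughly like this (proportions are estimates and vary by model):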
| Source | Proportion | What It Provides |
|---|---|---|
| Web pages (Common Crawl) | 50-60% | Broad knowledge, diverse styles, some noise |
| Books | 10-15% | Deep knowledge, coherent reasoning, formal writing |
| Wikipedia | 3-5% | Factual knowledge, encyclopedic coverage |
| Code (GitHub) | 10-15% | Programming, formal reasoning, structured thinking |
| Academic papers (arXiv, etc.) | 5-10% | Technical knowledge, scientific reasoning |
| Conversations/Forums | 5-10% | Dialogue patterns, Q&A format |
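Pipelines usually turn proportions like these into sampling weights rather than simply concatenating the corpora. Below is a minimal sketch of weighted source sampling; the weights are just midpoints of the ranges above, and real pipelines add deduplication, filtering, and quality-based reweighting.

```python
import random

# Assumed weights, taken as rough midpoints of the table above
mixture = {
    "common_crawl": 0.55,
    "books": 0.12,
    "wikipedia": 0.04,
    "code": 0.12,
    "papers": 0.08,
    "forums": 0.09,
}

def sample_source(rng=random):
    """Pick which corpus the next training document is drawn from."""
    sources, weights = zip(*mixture.items())
    return rng.choices(sources, weights=weights, k=1)[0]

counts = {source: 0 for source in mixture}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)    # empirical counts track the mixture weights
```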
It's remarkable that predicting the next token—a task with no explicit understanding component—produces models that pass bar exams and write production code. This suggests that prediction and understanding are more deeply connected than previously thought, or that 'understanding' itself is more about prediction than we assumed.
Base pre-trained models are powerful but not directly useful. They predict text—but don't reliably follow instructions or engage in helpful dialogue. Instruction tuning and RLHF transform prediction engines into assistants.
The Problem with Base Models:
A base GPT model, given the prompt 'What is the capital of France?', might continue with more quiz questions ('What is the capital of Germany? What is the capital of Spain?') rather than an answer, because a list of questions is a perfectly plausible continuation of that text.
Base models complete text, they don't answer questions. The distribution of internet text doesn't privilege helpful responses.
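Once a model has been instruction-tuned (the process described next), its prompts are wrapped in a chat format rather than raw text. A brief sketch using Hugging Face's apply_chat_template; the checkpoint id is a placeholder, and every model family defines its own template tokens.

```python
from transformers import AutoTokenizer

# Placeholder id: substitute any instruction-tuned checkpoint you have access to
tokenizer = AutoTokenizer.from_pretrained("some-org/some-instruct-model")

messages = [
    {"role": "system", "content": "You are a concise, factual assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Renders the conversation with the model's special role tokens and
# appends the assistant prefix so generation starts in the right place
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```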
Stage 1: Supervised Fine-Tuning (SFT)
The first alignment stage trains the model on demonstrations of good behavior: (instruction, response) pairs in which the response is what a helpful assistant should say.
SFT data typically comes from demonstrations written by trained human annotators (often in response to prompts representative of real usage), curated examples from existing high-quality datasets, and, increasingly, filtered or human-edited model outputs.
Stage 2: Reward Modeling
Human preferences are captured in a reward model: the SFT model generates several candidate responses to each prompt, human labelers rank or compare them, and a separate model is trained to predict those judgments.
The reward model typically answers the question: 'Given prompt X and response Y, how likely is a human to prefer Y over an alternative?'
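Concretely, preference pairs are usually fit with a Bradley-Terry style objective (the formulation used in InstructGPT-style pipelines), where $y_w$ and $y_l$ are the preferred and rejected responses and $r_\theta$ is the reward model:

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) \right]$$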
Stage 3: Reinforcement Learning from Human Feedback (RLHF)
The model is optimized to maximize the learned reward:
$$\max_{\pi} \; \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot \mid x)} \left[ R(x, y) \right] - \beta \cdot \mathrm{KL}\left[ \pi \,\|\, \pi_{\text{SFT}} \right]$$
where $\pi$ is the policy being trained, $D$ is the distribution of prompts, $R$ is the learned reward model, $\pi_{\text{SFT}}$ is the frozen supervised fine-tuned model, and $\beta$ controls how strongly the policy is penalized for drifting away from it.
This is typically optimized using PPO (Proximal Policy Optimization).
"""RLHF Training Pipeline (Conceptual Overview) This illustrates the key stages of post-training alignment.Production implementations are more complex.""" # Stage 1: Supervised Fine-Tuning (SFT)def supervised_fine_tuning(base_model, sft_dataset): """ Fine-tune on (instruction, response) pairs. Dataset format: [ {"instruction": "Explain quantum computing", "response": "Quantum computing uses quantum mechanical phenomena..."}, {"instruction": "Write a Python function to sort a list", "response": "def sort_list(lst):\n return sorted(lst)"}, ... ] """ # Standard supervised learning: minimize cross-entropy # loss = -log P(response | instruction) for instruction, response in sft_dataset: prompt = format_instruction(instruction) loss = cross_entropy_loss(base_model(prompt), response) loss.backward() optimizer.step() return sft_model # Stage 2: Reward Model Trainingdef train_reward_model(sft_model, comparison_dataset): """ Train a model to predict human preferences. Dataset format: [ {"prompt": "...", "chosen": "better response", "rejected": "worse response"}, ... ] """ reward_model = copy_and_add_scalar_head(sft_model) # Bradley-Terry model: P(chosen > rejected) = σ(r_chosen - r_rejected) for prompt, chosen, rejected in comparison_dataset: r_chosen = reward_model(prompt + chosen) r_rejected = reward_model(prompt + rejected) # Maximize log probability that chosen is preferred loss = -log_sigmoid(r_chosen - r_rejected) loss.backward() optimizer.step() return reward_model # Stage 3: RLHF with PPOdef rlhf_training(sft_model, reward_model, prompts): """ Optimize the policy to maximize reward while staying close to SFT. Key components: - Policy: the LLM we're training - Reference: the SFT model (frozen) - Reward: score from reward model minus KL penalty """ policy = copy(sft_model) reference = sft_model # frozen for prompt in prompts: # Sample response from current policy response = policy.generate(prompt) # Compute reward reward = reward_model(prompt + response) # Compute KL penalty (stay close to SFT behavior) log_prob_policy = policy.log_prob(response | prompt) log_prob_ref = reference.log_prob(response | prompt) kl_penalty = beta * (log_prob_policy - log_prob_ref) # Total reward = external reward - KL penalty total_reward = reward - kl_penalty # PPO update (simplified) ppo_update(policy, prompt, response, total_reward) return policy # Alternative: Direct Preference Optimization (DPO)def dpo_training(sft_model, comparison_dataset): """ DPO directly optimizes preferences without reward model. Key insight: The optimal policy under RLHF can be derived analytically. We can optimize directly for preferences without explicit RL. Loss: -log σ(β * (log π(chosen)/π_ref(chosen) - log π(rejected)/π_ref(rejected))) """ policy = copy(sft_model) reference = sft_model # frozen for prompt, chosen, rejected in comparison_dataset: # Log probability ratios log_ratio_chosen = policy.log_prob(chosen | prompt) - reference.log_prob(chosen | prompt) log_ratio_rejected = policy.log_prob(rejected | prompt) - reference.log_prob(rejected | prompt) # DPO loss loss = -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected)) loss.backward() optimizer.step() return policyRLHF can reduce certain capabilities (the 'alignment tax'). Models may become less willing to engage with edge cases or provide complete information. Modern techniques like Constitutional AI, DPO, and careful reward model design aim to achieve helpfulness while minimizing capability loss.
Understanding LLM capabilities requires moving beyond impressions to systematic analysis. Modern LLMs demonstrate remarkable skills across diverse domains—but also consistent patterns of failure.
Core Capability Categories:
Language Understanding and Generation: fluent, coherent text across registers and languages, including summarization, translation, editing, and open-ended writing.
Knowledge and Factual Recall: broad factual coverage absorbed from pre-training, reflected in strong performance on knowledge-heavy professional exams:
| Exam | GPT-4 Score | Human Reference (approx. passing score or median) |
|---|---|---|
| Bar Exam (MBE) | ~90% | ~63% |
| LSAT | 88th percentile | 50th percentile |
| GRE Verbal | 99th percentile | 50th percentile |
| GRE Quantitative | 80th percentile | 50th percentile |
| SAT Math | ~700/800 | ~520/800 |
| Medical Licensing (USMLE) | ~85% | ~60% |
| AP Calculus BC | 4/5 | 3/5 |
Reasoning and Problem Solving: multi-step reasoning, mathematics, and code generation, particularly when the model is prompted to work step by step.
However, reasoning remains one of the most variable capabilities—sometimes brilliant, sometimes bafflingly wrong.
In-Context Learning: picking up a new task from a handful of examples placed directly in the prompt, with no weight updates; this is the few-shot behavior first highlighted by GPT-3 (a minimal prompt-construction sketch follows below).
Metacognition (Limited): models can sometimes express uncertainty or critique their own outputs, but this self-assessment is unreliable and poorly calibrated.
LLM capabilities are highly variable. The same model may solve a hard problem and fail an easy one. Performance depends heavily on prompting, problem framing, and sometimes apparent luck. Point estimates of capability can be misleading—distributions matter.
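To make in-context learning concrete: the 'training data' for a new task is just text placed in the prompt. Below is a minimal few-shot prompt builder; the sentiment task and labels are an arbitrary example chosen for illustration.

```python
def build_few_shot_prompt(examples, query):
    """Format labeled examples plus a new query as a single prompt string."""
    lines = ["Classify the sentiment of each review as Positive or Negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")    # the model completes this line
    return "\n".join(lines)

examples = [
    ("The plot was gripping from start to finish.", "Positive"),
    ("I walked out halfway through.", "Negative"),
    ("A forgettable, by-the-numbers sequel.", "Negative"),
]
print(build_few_shot_prompt(examples, "An absolute delight for the whole family."))
```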
Understanding LLM limitations is as important as understanding their capabilities. Some limitations are engineering challenges that may yield to scale; others appear more fundamental.
Hallucination: The Persistent Problem
LLMs confidently generate false information, a phenomenon called 'hallucination': fabricated citations and quotes, invented biographical details, and plausible-sounding but nonexistent APIs, papers, and legal cases.
Hallucination persists even in the largest models. It appears to be an inherent consequence of training on prediction rather than truth. The model doesn't 'know' that it's wrong because it doesn't track truth—only probability.
The Stochastic Parrot Critique:
Emily Bender and colleagues introduced the 'stochastic parrot' metaphor—arguing LLMs are sophisticated pattern matchers that don't 'understand' in any meaningful sense. Key claims: the models manipulate linguistic form without access to meaning, have no communicative intent, and stitch together sequences that reflect statistical patterns in their training data rather than any model of the world.
Counterarguments: LLMs handle many tasks unlikely to appear verbatim in their training data, probing studies find structured internal representations of the inputs they process, and, as the compression argument above suggests, sufficiently good prediction may require something functionally close to understanding.
The Practical Stance:
For most practitioners, what matters is: What can I reliably use this for?
Think of LLMs as 'extremely well-read but unreliable interns.' They've 'read' billions of documents and can synthesize, summarize, and generate impressively. But they make errors a competent human wouldn't, they can't verify their own claims, and they need supervision for important work.
The LLM landscape has diversified rapidly. Understanding the ecosystem helps practitioners make informed choices about which models to use.
Major Closed-Source Models:
OpenAI: the GPT-4 family (GPT-4, GPT-4 Turbo, GPT-4o), served through ChatGPT and the OpenAI API.
Anthropic: the Claude 3 family (Opus, Sonnet, Haiku), with an emphasis on long context and safety-focused training.
Google DeepMind: the Gemini family, including long-context Gemini 1.5 variants, integrated across Google products.
Others: offerings such as Cohere's Command models and xAI's Grok round out the closed-source landscape.
| Family | Organization | Sizes | Key Features |
|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | Strong baseline, extensive fine-tuning ecosystem |
| Mistral | Mistral AI | 7B, 8x7B, 8x22B | Efficient, MoE variants, permissive license |
| Qwen 2 | Alibaba | 0.5B-72B | Strong multilingual, tools, coding |
| Gemma 2 | Google | 2B, 9B, 27B | Efficient, strong for size, research-friendly |
| Falcon | TII | 7B, 40B, 180B | Large-scale, Apache 2.0 license |
| DeepSeek | DeepSeek | 7B, 67B, MoE | Strong reasoning, code, MoE efficiency |
Open vs. Closed Trade-offs:
Closed-source advantages: typically the strongest raw capability, no infrastructure to manage, continuous improvements without redeployment, and built-in safety tooling.
Open-weight advantages: full control over deployment and data privacy, freedom to fine-tune and modify the model, reproducibility, and lower cost at high volume.
The Fine-Tuning Ecosystem:
Open models power a vast fine-tuning ecosystem: parameter-efficient methods such as LoRA and QLoRA make adaptation feasible on a single GPU, quantized builds enable local and edge deployment, and thousands of domain- and task-specific derivatives are published on hubs like Hugging Face.
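As a sketch of what parameter-efficient fine-tuning looks like with the peft library: the checkpoint id, rank, and target modules below are assumptions for illustration, and the appropriate target_modules depend on the architecture being adapted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "some-org/some-open-7b-model"    # placeholder checkpoint id
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# LoRA: freeze the base weights and train small low-rank adapter matrices
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections; names vary by architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of total parameters
```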
The LLM ecosystem changes monthly. New models frequently reset capability frontiers. Practitioners should stay current with model releases while maintaining understanding of fundamental principles that persist across models.
We have explored LLMs from architecture through capabilities to limitations. The key insights: a single decoder-only architecture, scaled up and trained on next-token prediction, underlies the entire GPT lineage; instruction tuning and RLHF turn raw prediction engines into useful assistants; capabilities are broad but uneven, with hallucination as the persistent failure mode; and the ecosystem now spans both frontier closed models and a fast-moving open-weight world.
What's Next:
LLMs are not the only manifestation of foundation models. The next page explores multimodal models—systems that process and generate across modalities (text, image, audio, video). We'll see how the foundation model paradigm extends beyond language to vision-language models like GPT-4V and CLIP, and to unified architectures that blur the boundaries between modalities.
You now have a comprehensive understanding of large language models—their evolution, architecture, training, capabilities, and limitations. This foundation enables you to work with LLMs effectively, understand their behavior, and stay current as the field evolves.