One of the most striking and philosophically troubling discoveries in modern machine learning is the phenomenon of emergent capabilities. As language models scale, they don't simply get incrementally better at what smaller models already do—they suddenly acquire entirely new abilities that were absent at smaller scales.
A model with 10 billion parameters cannot perform multi-step arithmetic. Train an otherwise identical model with 100 billion parameters, and it can. No one engineered this capability. No one wrote code for arithmetic. It simply... emerged.
This page explores the phenomenon of emergence in foundation models—what it means, how it manifests, why it matters, and what it tells us about the nature of these systems.
By the end of this page, you will understand: (1) the formal definition of emergent capabilities, (2) concrete examples of emergent behaviors across different domains, (3) the debate about whether emergence is 'real' or a measurement artifact, (4) theoretical explanations for why emergence occurs, and (5) the implications of emergence for AI safety and predictability.
The term 'emergence' has a rich history in philosophy and complex systems theory. In the context of large language models, we adopt a specific, empirically grounded definition.
Formal Definition:
An ability is emergent if it is not present in smaller models but is present in larger models; in other words, its appearance cannot be predicted simply by extrapolating performance curves from smaller scales.
The key feature distinguishing emergence from ordinary scaling is the shape of the performance curve. For most capabilities, performance improves gradually and predictably with scale. For emergent capabilities, performance is flat (often near-random) until a critical scale, then jumps sharply.
| Characteristic | Continuous Scaling | Emergent Capability |
|---|---|---|
| Performance curve | Smooth, predictable improvement | Flat then sudden jump |
| Extrapolation | Can predict from smaller scales | Cannot predict threshold |
| Small model performance | Worse but non-zero | Near-random or zero |
| Threshold | No clear threshold | Sharp transition point |
| Example | Perplexity improvement | Multi-digit arithmetic |
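To make the distinction concrete, here is a minimal sketch (with made-up accuracy numbers) of one way to quantify it: compare how much of the total improvement across scales comes from the single largest jump between adjacent scales. A smooth capability spreads its gains evenly; an emergent one concentrates them at the threshold.

```python
import numpy as np

def largest_jump_fraction(scales: np.ndarray, accuracy: np.ndarray) -> float:
    """Fraction of the total improvement contributed by the single largest
    jump between adjacent scales.

    Values near 1/(number of gaps) suggest smooth scaling; values close to 1
    suggest an emergent, threshold-like transition.
    """
    gains = np.diff(accuracy)
    total = accuracy[-1] - accuracy[0]
    return float(gains.max() / total) if total > 0 else 0.0

scales = np.array([1e9, 1e10, 1e11, 1e12])       # 1B -> 1T parameters
smooth = np.array([0.40, 0.55, 0.70, 0.85])      # hypothetical smoothly scaling metric
emergent = np.array([0.02, 0.03, 0.05, 0.80])    # hypothetical exact-match arithmetic

print(f"smooth:   {largest_jump_fraction(scales, smooth):.2f}")    # ~0.33 (gains evenly spread)
print(f"emergent: {largest_jump_fraction(scales, emergent):.2f}")  # ~0.96 (one dominant jump)
```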
The Phase Transition Analogy:
Emergence in LLMs is often compared to phase transitions in physics—the sudden change from water to ice at 0°C, or from liquid to gas at 100°C. In both cases, a smooth change in an underlying quantity (temperature in one case, scale in the other) produces an abrupt, qualitative change in behavior at a critical threshold.
This analogy is more than metaphor. Some researchers argue that emergent capabilities in LLMs represent genuine phase transitions in the model's internal representations—qualitative reorganizations of how information is processed.
We say capabilities 'emerge' rather than are 'learned' because the training objective never explicitly targets these skills. GPT models are trained to predict the next token—nothing in the loss function mentions arithmetic, translation, or reasoning. These capabilities arise as a byproduct of scale, not as an explicit training goal.
Researchers have documented dozens of emergent capabilities across different model families and benchmarks. Understanding these examples helps build intuition about what emerges and when.
Arithmetic and Mathematical Reasoning:
One of the most studied emergent capabilities is multi-digit arithmetic. Across several model families, accuracy on tasks like three-digit addition sits near zero below a threshold scale and then climbs steeply once that scale is crossed.
Importantly, single-digit arithmetic is NOT emergent—it improves gradually with scale. The emergence is specific to multi-digit operations requiring carry propagation.
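A rough back-of-the-envelope sketch of why this happens, assuming (hypothetically) that each digit-with-carry is handled correctly with some probability p that improves smoothly with scale: an exact-match answer to a k-digit problem then succeeds with probability roughly p^k, so single-digit accuracy rises gradually while multi-digit accuracy stays low before climbing steeply.

```python
# Hypothetical per-digit reliabilities at increasing scales (illustrative, not measured values)
per_digit_reliability = [0.70, 0.80, 0.90, 0.97]

for p in per_digit_reliability:
    single = p            # single-digit addition: one step, improves smoothly
    five_digit = p ** 5   # 5-digit addition with carries: every digit must be right
    print(f"per-digit p={p:.2f} -> single-digit {single:.2f}, 5-digit exact match {five_digit:.2f}")
```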
| Domain | Emergent Capability | Approximate Threshold | Measurement |
|---|---|---|---|
| Mathematics | Multi-digit addition with carries | ~60B parameters | Exact match accuracy |
| Mathematics | Multi-step word problems (GSM8K) | ~100B parameters | Problem-solving accuracy |
| Reasoning | Chain-of-thought effectiveness | ~60B parameters | Accuracy gain from CoT prompting |
| Language | Word unscrambling | ~10B parameters | Accuracy on anagram tasks |
| Translation | Low-resource language translation | ~30B parameters | BLEU score |
| Instruction Following | Zero-shot task generalization | ~50B parameters | Average benchmark accuracy |
| Programming | Complex code synthesis | ~70B parameters | Pass@1 on HumanEval |
| Theory of Mind | False belief understanding | ~100B parameters | Accuracy on ToM benchmarks |
In-Context Learning and Few-Shot Prompting:
Perhaps the most consequential emergent capability is in-context learning—the ability to learn new tasks from a few examples provided in the prompt, without any gradient updates.
This capability is barely detectable in small models, becomes reliable only at GPT-3 scale and beyond, and was never an explicit training objective: the model is only ever trained to predict the next token.
Crucially, in-context learning is meta-emergence: it enables other emergent behaviors. Once a model can learn from examples in context, you can demonstrate almost any task format to it.
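As an illustration, here is a minimal few-shot prompt builder for a hypothetical task (pluralizing words); the helper name and task are invented for this sketch, and the "learning" happens entirely inside the model's forward pass over the resulting prompt, with no gradient updates.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Format input/output demonstrations followed by a new query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Hypothetical task: convert a word to its plural form
demos = [("cat", "cats"), ("box", "boxes"), ("child", "children")]
print(build_few_shot_prompt(demos, "mouse"))
```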
Chain-of-Thought Reasoning:
Chain-of-thought (CoT) prompting—asking models to 'think step by step'—provides another striking example of emergence: below a threshold scale (roughly tens of billions of parameters), CoT prompting provides little benefit and can even hurt accuracy, while above that scale it yields large gains on arithmetic and multi-step reasoning benchmarks.
This suggests that the ability to productively use intermediate reasoning steps is itself emergent.
Emergence interacts with prompting. A capability might be 'present' in a model but only accessible with the right prompt; chain-of-thought prompting demonstrates this, since the same model can look incapable under direct prompting and capable when asked to reason step by step. Claims about emergence therefore require careful attention to how capabilities are elicited and measured.
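A sketch of what careful elicitation might look like in practice: score the same question under a direct prompt and a chain-of-thought prompt and report both. The `model_generate` interface and the toy stand-in model below are hypothetical placeholders, not a real API.

```python
from typing import Callable

def evaluate_with_prompts(
    model_generate: Callable[[str], str],   # hypothetical interface: prompt -> completion
    question: str,
    expected_answer: str,
) -> dict[str, bool]:
    """Score the same question under two elicitation strategies.

    A capability can look 'absent' under direct prompting yet 'present'
    under chain-of-thought prompting, so emergence claims should report
    how the capability was elicited.
    """
    prompts = {
        "direct": f"Q: {question}\nA:",
        "chain_of_thought": f"Q: {question}\nA: Let's think step by step.",
    }
    return {
        name: expected_answer in model_generate(prompt)
        for name, prompt in prompts.items()
    }

def fake_model(prompt: str) -> str:
    """Toy stand-in for a real model, for illustration only."""
    return "... so the answer is 42" if "step by step" in prompt else "The answer is 7"

print(evaluate_with_prompts(fake_model, "What is 6 * 7?", "42"))
# {'direct': False, 'chain_of_thought': True}
```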
The reality of emergence in LLMs has become controversial. In 2023, researchers from Stanford argued that emergence is largely a measurement artifact rather than a genuine property of model behavior. This debate has profound implications for how we understand scaling.
The Critique: Emergence as Metric Mirage
Schaeffer, Miranda, and Koyejo (2023) argued that observed emergence often reflects properties of the metric used to measure performance, not properties of the model. Their key claims:
Discontinuous metrics create apparent discontinuities: When using metrics like 'exact match' accuracy, performance appears to jump suddenly because partial credit isn't given. Under continuous metrics (e.g., edit distance), the same capabilities show smooth improvement.
Resolution effects: When tasks have discrete answers (right/wrong), there's a measurement threshold below which all partial capability reads as zero.
Statistical power: Small models may have the capability but fail to demonstrate it consistently due to lower statistical reliability.
```python
import numpy as np


def simulate_emergence_metric_effect(
    true_capability: np.ndarray,
    threshold: float = 0.5,  # note: not used in this simplified simulation
) -> tuple[np.ndarray, np.ndarray]:
    """
    Demonstrates how metric choice affects emergence perception.

    Argument:
    - True capability improves smoothly with scale
    - But discrete metrics create apparent discontinuities

    Args:
        true_capability: Underlying smooth capability curve (e.g., 0.1 to 0.9)
        threshold: Accuracy threshold for "success" on a discrete metric

    Returns:
        continuous_metric: What we'd see with a continuous metric
        discrete_metric: What we'd see with exact-match accuracy
    """
    # Continuous metric: reflects true capability directly
    continuous_metric = true_capability

    # Discrete metric: binary success/failure.
    # If each step succeeds with probability = true_capability
    # and there are N steps, then P(all correct) = capability^N
    n_steps = 5  # Typical for multi-step reasoning
    discrete_metric = true_capability ** n_steps

    return continuous_metric, discrete_metric


# Simulate smooth capability improvement
model_scales = np.logspace(9, 12, 100)  # 1B to 1T parameters

# True capability improves as log(scale)
true_capability = 0.5 + 0.4 * np.log10(model_scales / 1e9) / 3
true_capability = np.clip(true_capability, 0.1, 0.95)

continuous, discrete = simulate_emergence_metric_effect(true_capability)

# The discrete metric shows "emergence" around 100B parameters,
# but the underlying capability is improving smoothly throughout.
print("Scale: 1B params   -> Continuous: {:.2f}, Discrete: {:.2f}".format(
    continuous[0], discrete[0]))
print("Scale: 10B params  -> Continuous: {:.2f}, Discrete: {:.2f}".format(
    continuous[33], discrete[33]))
print("Scale: 100B params -> Continuous: {:.2f}, Discrete: {:.2f}".format(
    continuous[66], discrete[66]))
print("Scale: 1T params   -> Continuous: {:.2f}, Discrete: {:.2f}".format(
    continuous[99], discrete[99]))

# Key insight: the "jump" in discrete performance around 100B
# corresponds to the capability crossing the threshold where
# P(all 5 steps correct) becomes non-negligible
```

The Defense: Emergence is Still Real
Defenders of emergence argue that the metric critique, while valid in some cases, doesn't explain away all emergent phenomena:
Some capabilities genuinely absent: For certain tasks, small models generate completely wrong outputs—not 'close but not quite right.' No continuous metric makes random outputs partially correct.
Behavioral discontinuities: The ability to use chain-of-thought effectively isn't a graded capability—small models actively get worse with CoT prompting.
Qualitative changes in output structure: Some emergent behavior involves generating entirely different kinds of outputs, not just more accurate versions.
Internal representation changes: Mechanistic interpretability work suggests that model internals change qualitatively at scale, not just quantitatively.
The Synthesis:
The current consensus (to the extent one exists) is nuanced: metric choice does explain some reported cases of emergence, but not all of them; some capabilities show genuinely abrupt changes in behavior, and which picture holds depends on the task and on how performance is measured and elicited.
Whether emergence is 'real' or an artifact matters practically. If capabilities emerge unpredictably, safety evaluation is harder—we can't know what a model can do until we test exhaustively at scale. If emergence is merely a metric artifact, we may be able to predict capabilities from smaller models. The current evidence suggests the truth lies somewhere in between.
Beyond the debate about measurement, researchers seek to understand why emergence occurs at all. Several theoretical frameworks offer partial explanations.
Theory 1: Compositional Capability Thresholds
Complex capabilities require combining multiple simpler sub-capabilities. Each sub-capability improves gradually with scale, but the composite capability only works when all sub-capabilities exceed a threshold.
Consider multi-digit arithmetic: it requires digit recognition, single-digit addition, carry propagation, and consistent output formatting, and the final answer is exactly right only if every one of these sub-steps succeeds.
Small improvements in sub-capabilities produce large jumps in composite capability when sub-capabilities cross reliability thresholds.
Theory 2: Circuit Formation in Neural Networks
Mechanistic interpretability research suggests that neural networks implement specific 'circuits'—subnetworks that compute particular functions. Emergence may correspond to new circuits forming once the network has enough capacity, or to existing circuits becoming reliable enough to dominate the model's outputs.
This view suggests emergence reflects discrete changes in the network's computational structure, not just quantitative improvements in existing computations.
Theory 3: Representation Learning Transitions
The quality of learned representations may itself undergo phase transitions: below a critical scale the model leans on surface statistics, while above it the representations capture more abstract structure that the capability depends on.
This connects to the manifold hypothesis from the scaling page—larger models better approximate the true data manifold, and some capabilities only work with sufficiently faithful approximations.
```python
import numpy as np


def compositional_capability(
    scale: float,
    n_subcapabilities: int = 5,
    scale_coefficient: float = 0.85,  # chosen so sub-skills transition over the 100M-1T range
    base_capability: float = 0.5,
) -> dict:
    """
    Model emergent capabilities as composition of sub-capabilities.

    Core insight: Complex tasks require multiple sub-skills. Each sub-skill
    improves smoothly with scale, but the composite task only succeeds when
    ALL sub-skills succeed. This creates apparent emergence even from smooth
    underlying improvement.

    Args:
        scale: Model scale (log10 parameters, e.g., 9 for 1B)
        n_subcapabilities: Number of required sub-capabilities
        scale_coefficient: How quickly sub-capabilities improve with scale
        base_capability: Sub-capability level far below the transition

    Returns:
        Dict with sub-capability level and composite success probability
    """
    # Each sub-capability improves with log(scale);
    # a sigmoid bounds the improvement between 0 and 1
    sub_capability_level = 1 / (1 + np.exp(-scale_coefficient * (scale - 10)))

    # Individual sub-capability success rate
    sub_success_rate = base_capability + (1 - base_capability) * sub_capability_level

    # Composite capability requires ALL sub-capabilities to succeed
    composite_success = sub_success_rate ** n_subcapabilities

    return {
        'scale': scale,
        'sub_capability_level': sub_success_rate,
        'composite_success': composite_success,
        'n_subcapabilities': n_subcapabilities,
    }


# Demonstrate emergence from composition
scales = np.linspace(8, 12, 50)  # Log scale: 100M to 1T
results = [compositional_capability(s) for s in scales]

# Print the transition region, where the composite capability "emerges"
for r in results:
    scale = r['scale']
    sub = r['sub_capability_level']
    comp = r['composite_success']
    if 9 <= scale <= 11:  # Focus on transition region
        print(f"Scale: 10^{scale:.1f} | Sub: {sub:.3f} | Composite: {comp:.3f}")

# Key observation:
# - Sub-capabilities improve smoothly (~0.65 -> ~0.85 over this range)
# - Composite shows a sharp transition (~0.12 -> ~0.44)
# This explains "emergence" even when underlying skills improve gradually
```

Despite these frameworks, we lack a complete theory of emergence. We cannot reliably predict what capabilities will emerge, at what scale, or for which architectures. This theoretical gap is both a fascinating research direction and a practical concern for AI development.
Emergence has profound implications for AI safety—both concerning and hopeful. Understanding these implications is essential for responsible development and deployment of foundation models.
The Core Safety Concern: Unpredictability
The fundamental problem emergence poses for safety is unpredictability: we cannot reliably anticipate which capabilities a model will have before training and evaluating it at scale, and capabilities discovered only after deployment may include ones we would rather it did not have.
This contrasts sharply with traditional software, where capabilities are explicitly programmed and therefore known in advance.
Mitigating Factors:
The emergence phenomenon also has features that may help with safety:
Emergent alignment capabilities: The same mechanisms that produce emergent dangerous capabilities might produce emergent safety-relevant capabilities (instruction following, honesty, corrigibility).
Scaling predictability: While specific emergent capabilities are unpredictable, overall trends in capabilities are somewhat predictable from scaling laws, allowing planning.
Gradual deployment: Real-world deployment is gradual, providing opportunities to observe emergence before widespread impact.
Mechanistic understanding: Growing interpretability research may eventually allow us to predict and detect emergent circuits before they manifest behaviorally.
Current Best Practices:
The AI safety community has developed practices to manage emergence-related risks, including staged scale-ups with capability evaluations at intermediate checkpoints, red-teaming for dangerous capabilities before release, and gradual, monitored deployment.
There is a fundamental tension between the value of scale (better capabilities) and the risks of emergence (unpredictable capabilities). Managing this tension—capturing the benefits of scale while mitigating the risks of unpredictable emergence—is one of the central challenges of contemporary AI development.
While this discussion has focused on language models, emergence is not unique to text. Understanding how emergence manifests across modalities and architectures provides deeper insight into the phenomenon.
Emergence in Vision Models:
Large vision transformers (ViT, CLIP) show their own emergent behaviors: CLIP's zero-shot classification of categories it was never explicitly trained on, and the object-segmentation-like structure that appears in the attention maps of large self-supervised ViTs without any segmentation labels.
Emergence in Multimodal Models:
Models trained on multiple modalities (image + text, audio + text) show particularly interesting emergent behaviors; the table below summarizes representative examples across model types and modalities.
| Model Type | Emergent Capability Example | Approximate Scale |
|---|---|---|
| Language (Decoder-only) | Chain-of-thought reasoning | ~60B parameters |
| Language (Encoder-only) | Sentence-level semantics | ~1B parameters |
| Vision (ViT) | Fine-grained recognition | ~300M parameters |
| Vision-Language (CLIP) | Zero-shot classification | ~400M parameters |
| Multimodal (GPT-4V) | Visual reasoning | ~1T parameters (estimate) |
| Code (Codex) | Multi-file understanding | ~100B parameters |
| Speech (Whisper) | Multi-speaker diarization | ~1B parameters |
Architecture Effects on Emergence:
Different architectures show different emergence patterns: as the table above suggests, decoder-only language models, encoder-only models, vision transformers, and multimodal models appear to reach their characteristic emergent capabilities at quite different scales.
The Universality Question:
A deep open question is whether emergence is a universal phenomenon of scale or specific to current architectures.
Current evidence suggests transformers are particularly prone to emergence, possibly due to their global attention mechanism, but this remains an active research area.
The presence of emergence in an architecture can be seen as evidence that it's capable of learning complex, compositional representations. Some researchers argue that architectures showing rich emergent behavior are 'doing something right'—capturing the right inductive biases for general intelligence.
Emergence has transformed how we approach ML research and practice. Understanding these implications helps practitioners navigate the current landscape.
Implication 1: Evaluation at Scale
Because capabilities emerge at specific scales, evaluating a model requires testing at the target deployment scale: evaluations run on smaller proxy models can entirely miss capabilities that only appear at full scale.
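One way this is often operationalized is to flag tasks on which the deployment-scale model still performs near chance, as a reminder that those capabilities may simply not have emerged yet. The function and numbers below are an illustrative sketch, not a standard tool.

```python
def flag_unemerged_capabilities(
    results: dict[str, float],        # task name -> accuracy at deployment scale
    chance_levels: dict[str, float],  # task name -> chance-level accuracy
    margin: float = 0.05,
) -> list[str]:
    """Return tasks whose accuracy is within `margin` of chance.

    Near-chance performance at the evaluated scale may simply mean the
    capability has not emerged yet; it says little about larger models.
    """
    return [
        task for task, acc in results.items()
        if acc <= chance_levels.get(task, 0.0) + margin
    ]

# Hypothetical evaluation results for a single model scale
results = {"3-digit addition": 0.27, "5-step word problems": 0.26, "word unscrambling": 0.81}
chance = {"3-digit addition": 0.00, "5-step word problems": 0.25, "word unscrambling": 0.00}
print(flag_unemerged_capabilities(results, chance))  # ['5-step word problems']
```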
Implication 2: Prompt Engineering Matters More
Emergent capabilities are often latent—present but not accessible without proper prompting. This makes prompt engineering critical: the difference between a capability looking absent and looking robust can come down to the prompt format, the demonstrations provided, and whether the model is asked to reason step by step.
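A common practical consequence, sketched below with invented numbers: report capability as the best accuracy over a set of prompt variants, since any single prompt format only lower-bounds what the model can be elicited to do.

```python
def elicited_accuracy(per_prompt_accuracy: dict[str, float]) -> tuple[str, float]:
    """Best accuracy across prompt variants, as a rough lower bound on what
    the model can do when properly elicited.

    Reporting only one (possibly poor) prompt format can make a latent
    capability look absent.
    """
    best_prompt = max(per_prompt_accuracy, key=per_prompt_accuracy.get)
    return best_prompt, per_prompt_accuracy[best_prompt]

# Hypothetical accuracies for the same task under different prompt formats
per_prompt = {"zero-shot": 0.12, "few-shot (4 demos)": 0.58, "few-shot + chain-of-thought": 0.74}
print(elicited_accuracy(per_prompt))  # ('few-shot + chain-of-thought', 0.74)
```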
Implication 3: The Death of Benchmark-Driven Research
Classic ML research targeted specific benchmarks. Emergence disrupts this: a benchmark can move from unsolved to saturated through scale alone, without any task-specific modeling innovation, which weakens the benchmark-by-benchmark view of progress.
Implication 4: Reduced Control over Capabilities
Emergence means practitioners have less control over model capabilities: desired capabilities may fail to appear at an affordable scale, while capabilities no one asked for can appear anyway and must then be discovered and managed.
Implication 5: Scaling as Exploration
Training at scale becomes partially an exploration process: you do not know in advance everything the resulting model will be able to do, so systematic capability discovery after training becomes part of the workflow.
Emergence forces a shift from the traditional ML paradigm of 'build a model for a task' to 'build a capable model and discover what tasks it can do.' This inversion—where capabilities are discovered rather than designed—is a fundamental change in how we do machine learning.
We have explored the phenomenon of emergent capabilities in foundation models—from definition through documentation to theoretical explanation and practical implications. The key insights: emergent capabilities are those that are near-absent below a critical scale and present above it; they have been documented across arithmetic, reasoning, translation, code, and multimodal domains; part of the apparent discontinuity can be an artifact of discontinuous metrics, but not all of it; compositional thresholds, circuit formation, and representation transitions offer partial explanations; and the resulting unpredictability is a central concern for safety and evaluation.
What's Next:
With our understanding of scale and emergence in place, we now turn to the most impactful manifestation of these phenomena: Large Language Models (LLMs). The next page explores GPT and its successors—how they work, what they can do, and why they have transformed AI and its applications.
You now understand the phenomenon of emergent capabilities—what it means, how it manifests, why it might occur, and what it implies. This understanding is essential for anyone working with or reasoning about foundation models, as emergence fundamentally shapes what these models can do and how predictable they are.