One of the most striking and philosophically troubling discoveries in modern machine learning is the phenomenon of emergent capabilities. As language models scale, they don't simply get incrementally better at what smaller models already do—they suddenly acquire entirely new abilities that were absent at smaller scales.
A model with 10 billion parameters cannot perform multi-step arithmetic. Train an otherwise identical model with 100 billion parameters, and it can. No one engineered this capability. No one wrote code for arithmetic. It simply... emerged.
This page explores the phenomenon of emergence in foundation models—what it means, how it manifests, why it matters, and what it tells us about the nature of these systems.
By the end of this page, you will understand: (1) the formal definition of emergent capabilities, (2) concrete examples of emergent behaviors across different domains, (3) the debate about whether emergence is 'real' or a measurement artifact, (4) theoretical explanations for why emergence occurs, and (5) the implications of emergence for AI safety and predictability.
The term 'emergence' has a rich history in philosophy and complex systems theory. In the context of large language models, we adopt a specific, empirically grounded definition.
Formal Definition:
An ability is emergent if it is not present in smaller models but is present in larger models; in other words, its appearance cannot be predicted simply by extrapolating performance curves from smaller scales.
The key feature distinguishing emergence from ordinary scaling is the shape of the performance curve. For most capabilities, performance improves gradually and predictably with scale. For emergent capabilities, performance is flat (often near-random) until a critical scale, then jumps sharply.
| Characteristic | Continuous Scaling | Emergent Capability |
|---|---|---|
| Performance curve | Smooth, predictable improvement | Flat then sudden jump |
| Extrapolation | Can predict from smaller scales | Cannot predict threshold |
| Small model performance | Worse but non-zero | Near-random or zero |
| Threshold | No clear threshold | Sharp transition point |
| Example | Perplexity improvement | Multi-digit arithmetic |
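To make the distinction concrete, here is a minimal sketch (with made-up accuracy numbers) of one way to quantify it: compare how much of the total improvement across scales comes from the single largest jump between adjacent scales. A smooth capability spreads its gains evenly; an emergent one concentrates them at the threshold.

```python
import numpy as np

def largest_jump_fraction(scales: np.ndarray, accuracy: np.ndarray) -> float:
    """Fraction of the total improvement contributed by the single largest
    jump between adjacent scales.

    Values near 1/(number of gaps) suggest smooth scaling; values close to 1
    suggest an emergent, threshold-like transition.
    """
    gains = np.diff(accuracy)
    total = accuracy[-1] - accuracy[0]
    return float(gains.max() / total) if total > 0 else 0.0

scales = np.array([1e9, 1e10, 1e11, 1e12])       # 1B -> 1T parameters
smooth = np.array([0.40, 0.55, 0.70, 0.85])      # hypothetical smoothly scaling metric
emergent = np.array([0.02, 0.03, 0.05, 0.80])    # hypothetical exact-match arithmetic

print(f"smooth:   {largest_jump_fraction(scales, smooth):.2f}")    # ~0.33 (gains evenly spread)
print(f"emergent: {largest_jump_fraction(scales, emergent):.2f}")  # ~0.96 (one dominant jump)
```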
The Phase Transition Analogy:
Emergence in LLMs is often compared to phase transitions in physics—the sudden change from water to ice at 0°C, or from liquid to gas at 100°C. In both cases, a smooth change in an underlying quantity (temperature in one case, scale in the other) produces an abrupt, qualitative change in behavior at a critical threshold.
This analogy is more than metaphor. Some researchers argue that emergent capabilities in LLMs represent genuine phase transitions in the model's internal representations—qualitative reorganizations of how information is processed.
We say capabilities 'emerge' rather than are 'learned' because the training objective never explicitly targets these skills. GPT models are trained to predict the next token—nothing in the loss function mentions arithmetic, translation, or reasoning. These capabilities arise as a byproduct of scale, not as an explicit training goal.
Researchers have documented dozens of emergent capabilities across different model families and benchmarks. Understanding these examples helps build intuition about what emerges and when.
Arithmetic and Mathematical Reasoning:
One of the most studied emergent capabilities is multi-digit arithmetic. Across several model families, accuracy on tasks like three-digit addition sits near zero below a threshold scale and then climbs steeply once that scale is crossed.
Importantly, single-digit arithmetic is NOT emergent—it improves gradually with scale. The emergence is specific to multi-digit operations requiring carry propagation.
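A rough back-of-the-envelope sketch of why this happens, assuming (hypothetically) that each digit-with-carry is handled correctly with some probability p that improves smoothly with scale: an exact-match answer to a k-digit problem then succeeds with probability roughly p^k, so single-digit accuracy rises gradually while multi-digit accuracy stays low before climbing steeply.

```python
# Hypothetical per-digit reliabilities at increasing scales (illustrative, not measured values)
per_digit_reliability = [0.70, 0.80, 0.90, 0.97]

for p in per_digit_reliability:
    single = p            # single-digit addition: one step, improves smoothly
    five_digit = p ** 5   # 5-digit addition with carries: every digit must be right
    print(f"per-digit p={p:.2f} -> single-digit {single:.2f}, 5-digit exact match {five_digit:.2f}")
```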
| Domain | Emergent Capability | Approximate Threshold | Measurement |
|---|---|---|---|
| Mathematics | Multi-digit addition with carries | ~60B parameters | Exact match accuracy |
| Mathematics | Multi-step word problems (GSM8K) | ~100B parameters | Problem-solving accuracy |
| Reasoning | Chain-of-thought effectiveness | ~60B parameters | Accuracy gain from CoT prompting |
| Language | Word unscrambling | ~10B parameters | Accuracy on anagram tasks |
| Translation | Low-resource language translation | ~30B parameters | BLEU score |
| Instruction Following | Zero-shot task generalization | ~50B parameters | Average benchmark accuracy |
| Programming | Complex code synthesis | ~70B parameters | Pass@1 on HumanEval |
| Theory of Mind | False belief understanding | ~100B parameters | Accuracy on ToM benchmarks |
In-Context Learning and Few-Shot Prompting:
Perhaps the most consequential emergent capability is in-context learning—the ability to learn new tasks from a few examples provided in the prompt, without any gradient updates.
This capability is barely detectable in small models, becomes reliable only at GPT-3 scale and beyond, and was never an explicit training objective: the model is only ever trained to predict the next token.
Crucially, in-context learning is meta-emergence: it enables other emergent behaviors. Once a model can learn from examples in context, you can demonstrate almost any task format to it.
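As an illustration, here is a minimal few-shot prompt builder for a hypothetical task (pluralizing words); the helper name and task are invented for this sketch, and the "learning" happens entirely inside the model's forward pass over the resulting prompt, with no gradient updates.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Format input/output demonstrations followed by a new query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Hypothetical task: convert a word to its plural form
demos = [("cat", "cats"), ("box", "boxes"), ("child", "children")]
print(build_few_shot_prompt(demos, "mouse"))
```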
Chain-of-Thought Reasoning:
Chain-of-thought (CoT) prompting—asking models to 'think step by step'—provides another striking example of emergence: below a threshold scale (roughly tens of billions of parameters), CoT prompting provides little benefit and can even hurt accuracy, while above that scale it yields large gains on arithmetic and multi-step reasoning benchmarks.
This suggests that the ability to productively use intermediate reasoning steps is itself emergent.
Emergence interacts with prompting. A capability might be 'present' in a model but only accessible with the right prompt; chain-of-thought prompting demonstrates this, since the same model can look incapable under direct prompting and capable when asked to reason step by step. Claims about emergence therefore require careful attention to how capabilities are elicited and measured.
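A sketch of what careful elicitation might look like in practice: score the same question under a direct prompt and a chain-of-thought prompt and report both. The `model_generate` interface and the toy stand-in model below are hypothetical placeholders, not a real API.

```python
from typing import Callable

def evaluate_with_prompts(
    model_generate: Callable[[str], str],   # hypothetical interface: prompt -> completion
    question: str,
    expected_answer: str,
) -> dict[str, bool]:
    """Score the same question under two elicitation strategies.

    A capability can look 'absent' under direct prompting yet 'present'
    under chain-of-thought prompting, so emergence claims should report
    how the capability was elicited.
    """
    prompts = {
        "direct": f"Q: {question}\nA:",
        "chain_of_thought": f"Q: {question}\nA: Let's think step by step.",
    }
    return {
        name: expected_answer in model_generate(prompt)
        for name, prompt in prompts.items()
    }

def fake_model(prompt: str) -> str:
    """Toy stand-in for a real model, for illustration only."""
    return "... so the answer is 42" if "step by step" in prompt else "The answer is 7"

print(evaluate_with_prompts(fake_model, "What is 6 * 7?", "42"))
# {'direct': False, 'chain_of_thought': True}
```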
The reality of emergence in LLMs has become controversial. In 2023, researchers from Stanford argued that emergence is largely a measurement artifact rather than a genuine property of model behavior. This debate has profound implications for how we understand scaling.
The Critique: Emergence as Metric Mirage
Schaeffer, Miranda, and Koyejo (2023) argued that observed emergence often reflects properties of the metric used to measure performance, not properties of the model. Their key claims:
Discontinuous metrics create apparent discontinuities: When using metrics like 'exact match' accuracy, performance appears to jump suddenly because partial credit isn't given. Under continuous metrics (e.g., edit distance), the same capabilities show smooth improvement.
Resolution effects: When tasks have discrete answers (right/wrong), there's a measurement threshold below which all partial capability reads as zero.
Statistical power: Small models may have the capability but fail to demonstrate it consistently due to lower statistical reliability.
```python
import numpy as np


def simulate_emergence_metric_effect(
    true_capability: np.ndarray,
    threshold: float = 0.5,  # note: not used in this simplified simulation
) -> tuple[np.ndarray, np.ndarray]:
    """
    Demonstrates how metric choice affects emergence perception.

    Argument:
    - True capability improves smoothly with scale
    - But discrete metrics create apparent discontinuities

    Args:
        true_capability: Underlying smooth capability curve (e.g., 0.1 to 0.9)
        threshold: Accuracy threshold for "success" on a discrete metric

    Returns:
        continuous_metric: What we'd see with a continuous metric
        discrete_metric: What we'd see with exact-match accuracy
    """
    # Continuous metric: reflects true capability directly
    continuous_metric = true_capability

    # Discrete metric: binary success/failure.
    # If each step succeeds with probability = true_capability
    # and there are N steps, then P(all correct) = capability^N
    n_steps = 5  # Typical for multi-step reasoning
    discrete_metric = true_capability ** n_steps

    return continuous_metric, discrete_metric


# Simulate smooth capability improvement
model_scales = np.logspace(9, 12, 100)  # 1B to 1T parameters

# True capability improves as log(scale)
true_capability = 0.5 + 0.4 * np.log10(model_scales / 1e9) / 3
true_capability = np.clip(true_capability, 0.1, 0.95)

continuous, discrete = simulate_emergence_metric_effect(true_capability)

# The discrete metric shows "emergence" around 100B parameters,
# but the underlying capability is improving smoothly throughout.
print("Scale: 1B params   -> Continuous: {:.2f}, Discrete: {:.2f}".format(
    continuous[0], discrete[0]))
print("Scale: 10B params  -> Continuous: {:.2f}, Discrete: {:.2f}".format(
    continuous[33], discrete[33]))
print("Scale: 100B params -> Continuous: {:.2f}, Discrete: {:.2f}".format(
    continuous[66], discrete[66]))
print("Scale: 1T params   -> Continuous: {:.2f}, Discrete: {:.2f}".format(
    continuous[99], discrete[99]))

# Key insight: the "jump" in discrete performance around 100B
# corresponds to the capability crossing the threshold where
# P(all 5 steps correct) becomes non-negligible
```

The Defense: Emergence is Still Real
Defenders of emergence argue that the metric critique, while valid in some cases, doesn't explain away all emergent phenomena:
Some capabilities genuinely absent: For certain tasks, small models generate completely wrong outputs—not 'close but not quite right.' No continuous metric makes random outputs partially correct.
Behavioral discontinuities: The ability to use chain-of-thought effectively isn't a graded capability—small models actively get worse with CoT prompting.
Qualitative changes in output structure: Some emergent behavior involves generating entirely different kinds of outputs, not just more accurate versions.
Internal representation changes: Mechanistic interpretability work suggests that model internals change qualitatively at scale, not just quantitatively.
The Synthesis:
The current consensus (to the extent one exists) is nuanced: metric choice does explain some reported cases of emergence, but not all of them; some capabilities show genuinely abrupt changes in behavior, and which picture holds depends on the task and on how performance is measured and elicited.
Whether emergence is 'real' or an artifact matters practically. If capabilities emerge unpredictably, safety evaluation is harder—we can't know what a model can do until we test exhaustively at scale. If emergence is merely a metric artifact, we may be able to predict capabilities from smaller models. The current evidence suggests the truth lies somewhere in between.
Beyond the debate about measurement, researchers seek to understand why emergence occurs at all. Several theoretical frameworks offer partial explanations.
Theory 1: Compositional Capability Thresholds
Complex capabilities require combining multiple simpler sub-capabilities. Each sub-capability improves gradually with scale, but the composite capability only works when all sub-capabilities exceed a threshold.
Consider multi-digit arithmetic: it requires digit recognition, single-digit addition, carry propagation, and consistent output formatting, and the final answer is exactly right only if every one of these sub-steps succeeds.
Small improvements in sub-capabilities produce large jumps in composite capability when sub-capabilities cross reliability thresholds.
Theory 2: Circuit Formation in Neural Networks
Mechanistic interpretability research suggests that neural networks implement specific 'circuits'—subnetworks that compute particular functions. Emergence may correspond to new circuits forming once the network has enough capacity, or to existing circuits becoming reliable enough to dominate the model's outputs.
This view suggests emergence reflects discrete changes in the network's computational structure, not just quantitative improvements in existing computations.
Theory 3: Representation Learning Transitions
The quality of learned representations may itself undergo phase transitions: below a critical scale the model leans on surface statistics, while above it the representations capture more abstract structure that the capability depends on.
This connects to the manifold hypothesis from the scaling page—larger models better approximate the true data manifold, and some capabilities only work with sufficiently faithful approximations.
```python
import numpy as np


def compositional_capability(
    scale: float,
    n_subcapabilities: int = 5,
    scale_coefficient: float = 0.85,  # chosen so sub-skills transition over the 100M-1T range
    base_capability: float = 0.5,
) -> dict:
    """
    Model emergent capabilities as composition of sub-capabilities.

    Core insight: Complex tasks require multiple sub-skills. Each sub-skill
    improves smoothly with scale, but the composite task only succeeds when
    ALL sub-skills succeed. This creates apparent emergence even from smooth
    underlying improvement.

    Args:
        scale: Model scale (log10 parameters, e.g., 9 for 1B)
        n_subcapabilities: Number of required sub-capabilities
        scale_coefficient: How quickly sub-capabilities improve with scale
        base_capability: Sub-capability level far below the transition

    Returns:
        Dict with sub-capability level and composite success probability
    """
    # Each sub-capability improves with log(scale);
    # a sigmoid bounds the improvement between 0 and 1
    sub_capability_level = 1 / (1 + np.exp(-scale_coefficient * (scale - 10)))

    # Individual sub-capability success rate
    sub_success_rate = base_capability + (1 - base_capability) * sub_capability_level

    # Composite capability requires ALL sub-capabilities to succeed
    composite_success = sub_success_rate ** n_subcapabilities

    return {
        'scale': scale,
        'sub_capability_level': sub_success_rate,
        'composite_success': composite_success,
        'n_subcapabilities': n_subcapabilities,
    }


# Demonstrate emergence from composition
scales = np.linspace(8, 12, 50)  # Log scale: 100M to 1T
results = [compositional_capability(s) for s in scales]

# Print the transition region, where the composite capability "emerges"
for r in results:
    scale = r['scale']
    sub = r['sub_capability_level']
    comp = r['composite_success']
    if 9 <= scale <= 11:  # Focus on transition region
        print(f"Scale: 10^{scale:.1f} | Sub: {sub:.3f} | Composite: {comp:.3f}")

# Key observation:
# - Sub-capabilities improve smoothly (~0.65 -> ~0.85 over this range)
# - Composite shows a sharp transition (~0.12 -> ~0.44)
# This explains "emergence" even when underlying skills improve gradually
```

Despite these frameworks, we lack a complete theory of emergence. We cannot reliably predict what capabilities will emerge, at what scale, or for which architectures. This theoretical gap is both a fascinating research direction and a practical concern for AI development.
Emergence has profound implications for AI safety—both concerning and hopeful. Understanding these implications is essential for responsible development and deployment of foundation models.
The Core Safety Concern: Unpredictability
The fundamental problem emergence poses for safety is unpredictability: we cannot reliably anticipate which capabilities a model will have before training and evaluating it at scale, and capabilities discovered only after deployment may include ones we would rather it did not have.
This contrasts sharply with traditional software, where capabilities are explicitly programmed and therefore known in advance.
Mitigating Factors:
The emergence phenomenon also has features that may help with safety:
Emergent alignment capabilities: The same mechanisms that produce emergent dangerous capabilities might produce emergent safety-relevant capabilities (instruction following, honesty, corrigibility).
Scaling predictability: While specific emergent capabilities are unpredictable, overall trends in capabilities are somewhat predictable from scaling laws, allowing planning.
Gradual deployment: Real-world deployment is gradual, providing opportunities to observe emergence before widespread impact.
Mechanistic understanding: Growing interpretability research may eventually allow us to predict and detect emergent circuits before they manifest behaviorally.
Current Best Practices:
The AI safety community has developed practices to manage emergence-related risks, including staged scale-ups with capability evaluations at intermediate checkpoints, red-teaming for dangerous capabilities before release, and gradual, monitored deployment.
There is a fundamental tension between the value of scale (better capabilities) and the risks of emergence (unpredictable capabilities). Managing this tension—capturing the benefits of scale while mitigating the risks of unpredictable emergence—is one of the central challenges of contemporary AI development.
While this discussion has focused on language models, emergence is not unique to text. Understanding how emergence manifests across modalities and architectures provides deeper insight into the phenomenon.
Emergence in Vision Models:
Large vision transformers (ViT, CLIP) show their own emergent behaviors: CLIP's zero-shot classification of categories it was never explicitly trained on, and the object-segmentation-like structure that appears in the attention maps of large self-supervised ViTs without any segmentation labels.
Emergence in Multimodal Models:
Models trained on multiple modalities (image + text, audio + text) show particularly interesting emergent behaviors; the table below summarizes representative examples across model types and modalities.
| Model Type | Emergent Capability Example | Approximate Scale |
|---|---|---|
| Language (Decoder-only) | Chain-of-thought reasoning | ~60B parameters |
| Language (Encoder-only) | Sentence-level semantics | ~1B parameters |
| Vision (ViT) | Fine-grained recognition | ~300M parameters |
| Vision-Language (CLIP) | Zero-shot classification | ~400M parameters |
| Multimodal (GPT-4V) | Visual reasoning | ~1T parameters (estimate) |
| Code (Codex) | Multi-file understanding | ~100B parameters |
| Speech (Whisper) | Multi-speaker diarization | ~1B parameters |
Architecture Effects on Emergence:
Different architectures show different emergence patterns: as the table above suggests, decoder-only language models, encoder-only models, vision transformers, and multimodal models appear to reach their characteristic emergent capabilities at quite different scales.
The Universality Question:
A deep open question is whether emergence is a universal phenomenon of scale or specific to current architectures.
Current evidence suggests transformers are particularly prone to emergence, possibly due to their global attention mechanism, but this remains an active research area.
The presence of emergence in an architecture can be seen as evidence that it's capable of learning complex, compositional representations. Some researchers argue that architectures showing rich emergent behavior are 'doing something right'—capturing the right inductive biases for general intelligence.
Emergence has transformed how we approach ML research and practice. Understanding these implications helps practitioners navigate the current landscape.
Implication 1: Evaluation at Scale
Because capabilities emerge at specific scales, evaluating a model requires testing at the target deployment scale: evaluations run on smaller proxy models can entirely miss capabilities that only appear at full scale.
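One way this is often operationalized is to flag tasks on which the deployment-scale model still performs near chance, as a reminder that those capabilities may simply not have emerged yet. The function and numbers below are an illustrative sketch, not a standard tool.

```python
def flag_unemerged_capabilities(
    results: dict[str, float],        # task name -> accuracy at deployment scale
    chance_levels: dict[str, float],  # task name -> chance-level accuracy
    margin: float = 0.05,
) -> list[str]:
    """Return tasks whose accuracy is within `margin` of chance.

    Near-chance performance at the evaluated scale may simply mean the
    capability has not emerged yet; it says little about larger models.
    """
    return [
        task for task, acc in results.items()
        if acc <= chance_levels.get(task, 0.0) + margin
    ]

# Hypothetical evaluation results for a single model scale
results = {"3-digit addition": 0.27, "5-step word problems": 0.26, "word unscrambling": 0.81}
chance = {"3-digit addition": 0.00, "5-step word problems": 0.25, "word unscrambling": 0.00}
print(flag_unemerged_capabilities(results, chance))  # ['5-step word problems']
```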
Implication 2: Prompt Engineering Matters More
Emergent capabilities are often latent—present but not accessible without proper prompting. This makes prompt engineering critical: the difference between a capability looking absent and looking robust can come down to the prompt format, the demonstrations provided, and whether the model is asked to reason step by step.
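A common practical consequence, sketched below with invented numbers: report capability as the best accuracy over a set of prompt variants, since any single prompt format only lower-bounds what the model can be elicited to do.

```python
def elicited_accuracy(per_prompt_accuracy: dict[str, float]) -> tuple[str, float]:
    """Best accuracy across prompt variants, as a rough lower bound on what
    the model can do when properly elicited.

    Reporting only one (possibly poor) prompt format can make a latent
    capability look absent.
    """
    best_prompt = max(per_prompt_accuracy, key=per_prompt_accuracy.get)
    return best_prompt, per_prompt_accuracy[best_prompt]

# Hypothetical accuracies for the same task under different prompt formats
per_prompt = {"zero-shot": 0.12, "few-shot (4 demos)": 0.58, "few-shot + chain-of-thought": 0.74}
print(elicited_accuracy(per_prompt))  # ('few-shot + chain-of-thought', 0.74)
```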
Implication 3: The Death of Benchmark-Driven Research
Classic ML research targeted specific benchmarks. Emergence disrupts this: a benchmark can move from unsolved to saturated through scale alone, without any task-specific modeling innovation, which weakens the benchmark-by-benchmark view of progress.
Implication 4: Reduced Control over Capabilities
Emergence means practitioners have less control over model capabilities: desired capabilities may fail to appear at an affordable scale, while capabilities no one asked for can appear anyway and must then be discovered and managed.
Implication 5: Scaling as Exploration
Training at scale becomes partially an exploration process: you do not know in advance everything the resulting model will be able to do, so systematic capability discovery after training becomes part of the workflow.
Emergence forces a shift from the traditional ML paradigm of 'build a model for a task' to 'build a capable model and discover what tasks it can do.' This inversion—where capabilities are discovered rather than designed—is a fundamental change in how we do machine learning.
We have explored the phenomenon of emergent capabilities in foundation models—from definition through documentation to theoretical explanation and practical implications. The key insights: emergent capabilities are those that are near-absent below a critical scale and present above it; they have been documented across arithmetic, reasoning, translation, code, and multimodal domains; part of the apparent discontinuity can be an artifact of discontinuous metrics, but not all of it; compositional thresholds, circuit formation, and representation transitions offer partial explanations; and the resulting unpredictability is a central concern for safety and evaluation.
What's Next:
With our understanding of scale and emergence in place, we now turn to the most impactful manifestation of these phenomena: Large Language Models (LLMs). The next page explores GPT and its successors—how they work, what they can do, and why they have transformed AI and its applications.
You now understand the phenomenon of emergent capabilities—what it means, how it manifests, why it might occur, and what it implies. This understanding is essential for anyone working with or reasoning about foundation models, as emergence fundamentally shapes what these models can do and how predictable they are.