In 2017, the transformer architecture introduced in Attention Is All You Need achieved state-of-the-art machine translation. By 2020, GPT-3 demonstrated that scaling transformers to 175 billion parameters unlocked remarkable capabilities—from code generation to creative writing—that smaller models simply could not achieve. The field had discovered a profound truth: scale fundamentally changes what models can do.
This wasn't merely incremental improvement. Researchers observed qualitative changes in behavior as models grew—capabilities that emerged suddenly rather than gradually, abilities that defied prediction from smaller-scale experiments. Understanding transformer scaling has become essential for anyone working at the frontier of machine learning, influencing everything from research priorities to infrastructure investments worth billions of dollars.
This page covers the science and practice of scaling transformers: the empirical laws governing model performance, the architectural innovations enabling scale, the infrastructure required for training, and the economic realities of frontier model development. You will understand both why scale matters and how to reason about scaling decisions.
The modern understanding of transformer scaling is grounded in scaling laws—empirical relationships that predict model performance as a function of compute, data, and parameters. These laws, discovered through systematic experimentation, provide the theoretical foundation for billion-dollar training decisions.
The seminal work by Kaplan et al. established that language model loss follows predictable power-law relationships:
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$
Where $L$ is the loss, $N$ is the number of parameters, and $N_c$ and $\alpha_N$ are fitted constants. Similar relationships exist for compute $C$ and dataset size $D$.
Key findings from the Kaplan laws:
| Scaling Dimension | Exponent (α) | Interpretation |
|---|---|---|
| Parameters (N) | ~0.076 | Doubling parameters reduces loss by ~5% |
| Compute (C) | ~0.050 | 10× compute reduces loss by ~11% |
| Data (D) | ~0.095 | Doubling data reduces loss by ~6.5% |
| Optimal N given C | ~0.73 | Model size should scale with C^0.73 |
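To make the power-law form concrete, the sketch below plugs the parameter exponent into $L(N)$; the constant $N_c$ is an order-of-magnitude placeholder rather than the paper's fitted value, so the printed losses are illustrative only:

```python
# Illustrative sketch of the Kaplan-style power law L(N) = (N_c / N)^alpha_N.
# alpha_N matches the table above; N_c is an order-of-magnitude placeholder.
alpha_N = 0.076
N_c = 8.8e13

def loss_from_params(n_params: float) -> float:
    return (N_c / n_params) ** alpha_N

# Doubling N multiplies the loss by 2 ** -alpha_N ≈ 0.949, i.e. ~5% lower.
for n in (1e9, 2e9, 4e9):
    print(f"N = {n:.0e} -> L ≈ {loss_from_params(n):.3f}")
```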
Hoffmann et al. revised our understanding with the Chinchilla paper, demonstrating that many large models were significantly undertrained. Their analysis suggested a more balanced approach to scaling:
Chinchilla optimal scaling: for a fixed compute budget, parameters and training tokens should grow in roughly equal proportion ($N \propto C^{0.5}$, $D \propto C^{0.5}$), which works out to roughly 20 training tokens per parameter.
This had profound implications:
| Model | Parameters | Training Tokens | Tokens per Parameter | Chinchilla Optimal? |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | Significantly undertrained |
| Gopher | 280B | 300B | 1.1 | Significantly undertrained |
| Chinchilla | 70B | 1.4T | 20 | Compute-optimal |
Chinchilla (70B parameters) outperformed Gopher (280B parameters) while using the same compute, simply by training longer on more data.
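A minimal sketch of this budget split, assuming the roughly 20-tokens-per-parameter heuristic and the $C \approx 6ND$ compute approximation introduced later on this page (the function name and example budget are illustrative):

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens) using C ≈ 6·N·D with D ≈ 20·N.

    Substituting D = tokens_per_param · N gives N = sqrt(C / (6 · tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of ~5.9e23 FLOPs (what a 70B / 1.4T-token run costs under C ≈ 6ND)
# lands near Chinchilla's own configuration:
n, d = chinchilla_allocation(5.9e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")   # ≈ 70B, ≈ 1.4T
```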
Scaling laws are empirical fits, not fundamental laws of nature. They may break down at extreme scales, depend on architecture and data quality, and don't capture everything that matters (like reasoning ability or factual recall). Use them as guides, not gospel.
Integrating the dimensions of scale, performance can be modeled as:
$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$
Where $E$ is the irreducible loss (the entropy of natural text), $N$ is the number of parameters, $D$ is the number of training tokens, and $A$, $B$, $\alpha$, $\beta$ are fitted constants.
This formulation reveals a crucial insight: performance is bottlenecked by whichever term is larger. An undertrained large model and an overtrained small model both waste compute; optimal allocation balances the two contributions.
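A small sketch that evaluates this two-term loss; the default constants are close to those reported in the Chinchilla paper's parametric fit and are used here purely for illustration, not as authoritative values:

```python
def parametric_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

# A Gopher-like allocation is dominated by the data term; a Chinchilla-like
# allocation at similar compute balances the two and reaches a lower loss.
print(f"280B params, 300B tokens: L ≈ {parametric_loss(280e9, 300e9):.3f}")
print(f" 70B params, 1.4T tokens: L ≈ {parametric_loss(70e9, 1.4e12):.3f}")
```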
Perhaps the most fascinating aspect of scaling is the emergence of capabilities that cannot be predicted from smaller models. These emergent abilities appear suddenly once a model reaches a critical scale threshold, transforming from near-random performance to rapid improvement.
An ability is considered emergent if it is essentially absent in smaller models (performance near random chance) and appears only once models exceed a critical scale, beyond which performance improves sharply.
This is distinct from smooth scaling, where doubling parameters yields predictable, incremental gains. Emergence represents a phase transition—a qualitative change in what the model can do.
| Capability | Emergence Threshold | Description |
|---|---|---|
| Multi-step Arithmetic | ~10B parameters | Solving 3+ digit addition, multiplication |
| Chain-of-Thought Reasoning | ~100B parameters | Producing step-by-step reasoning chains |
| Word Unscrambling | ~10B parameters | Anagram solving, letter manipulation |
| In-Context Learning | ~10B parameters | Learning new tasks from few examples in prompt |
| Code Generation | ~10B parameters | Producing functional code from descriptions |
| Instruction Following | ~50B parameters | Reliable following of natural language instructions |
| Theory of Mind | ~50-100B parameters | Modeling beliefs and intentions of others |
Emergent abilities create a fundamental challenge for AI development: you cannot predict what a larger model will be capable of until you build it. This uncertainty drives speculative large-scale investments and makes safety evaluation more difficult—harmful capabilities could also emerge unpredictably.
Several hypotheses attempt to explain why emergence occurs:
1. Task Decomposition Threshold: Complex tasks require multiple sub-capabilities. Performance remains low until all required sub-capabilities reach sufficient quality, then jumps rapidly. For example, arithmetic requires digit recognition, positional understanding, and carry propagation—all must work for the overall task to succeed.
2. Circuit Formation: Neural networks form computational circuits that perform specific functions. Emergent abilities may correspond to circuits that only form (or become reliably activated) above certain parameter counts.
3. Grokking Phenomenon: Models sometimes exhibit 'grokking'—sudden generalization long after memorization. This may happen at scale when sufficient capacity enables the model to discover generalizable algorithms rather than memorizing patterns.
4. Measurement Artifacts: Some researchers argue emergence is partially an artifact of discrete metrics. When measuring accuracy on binary tasks, gradual improvements in underlying probability appear as sudden jumps. Continuous metrics may show smoother scaling.
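One way to see the measurement-artifact argument: an answer that spans several tokens only scores under exact match if every token is correct, so accuracy behaves roughly like $p^k$ for per-token probability $p$ and answer length $k$, and smooth gains in $p$ look like a sudden jump. The numbers below are purely illustrative:

```python
# Illustrative only: a smoothly improving per-token probability produces an
# apparently abrupt jump in an all-or-nothing exact-match metric.
answer_length = 8  # tokens that must all be correct for credit

for per_token_prob in (0.5, 0.7, 0.9, 0.95, 0.99):
    exact_match = per_token_prob ** answer_length
    print(f"per-token p = {per_token_prob:.2f} -> exact-match ≈ {exact_match:.3f}")
# p improves gradually, but exact match stays near zero until p is already high.
```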
Emergent capabilities shape how frontier AI labs approach development.
Understanding emergence is crucial because it fundamentally changes the return profile of scale investments: you're not just getting proportionally better performance, you might unlock entirely new capabilities.
Scaling transformers is not merely about adding more parameters—it requires architectural innovations that maintain training stability, computational efficiency, and model capability as dimensions increase. The standard transformer has evolved substantially to enable billion-parameter scale.
Pre-Layer Normalization (Pre-LN)
The original transformer used post-layer normalization, applying LayerNorm after the attention/FFN sublayers. This creates gradient scaling issues at extreme depths. Pre-LN applies normalization before each sublayer:
```python
# Post-LN (original, unstable at depth)
x = LayerNorm(x + Attention(x))      # ❌ norm applied after the residual add
x = LayerNorm(x + FFN(x))

# Pre-LN (modern, stable)
x = x + Attention(LayerNorm(x))      # ✓ norm before the sublayer, not after
x = x + FFN(LayerNorm(x))
```
Pre-LN enables training models with 100+ layers where Post-LN would diverge.
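A minimal PyTorch sketch of a Pre-LN block, using standard `torch.nn` modules; it mirrors the pseudocode above rather than any particular model's implementation, and the dimensions are arbitrary:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN transformer block: normalize before each sublayer, not after."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)                                      # norm before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual add
        x = x + self.ffn(self.ln2(x))                        # norm before FFN
        return x

x = torch.randn(2, 16, 512)            # (batch, sequence, d_model)
print(PreLNBlock()(x).shape)           # torch.Size([2, 16, 512])
```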
RMSNorm
Root Mean Square Layer Normalization simplifies LayerNorm by removing mean centering:
$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum x_i^2 + \epsilon}} \cdot \gamma$$
This reduces computation (~30% faster) while maintaining performance, important at scale.
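A minimal RMSNorm sketch following the formula above; this is illustrative, and production implementations differ in details such as dtype handling:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale features by their root mean square; no mean centering, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learned gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.gamma

print(RMSNorm(512)(torch.randn(2, 16, 512)).shape)   # torch.Size([2, 16, 512])
```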
DeepNorm
For extremely deep transformers (1000+ layers), DeepNorm modifies initialization and residual connections:
$$x_{l+1} = \text{LN}(\alpha x_l + \text{Sublayer}(x_l))$$
Where $\alpha > 1$ amplifies residual connections, and sublayer weights are scaled down accordingly.
When scaling model size, practitioners must decide how to allocate additional parameters across dimensions:
| Dimension | Symbol | Typical Scaling | Considerations |
|---|---|---|---|
| Hidden size | d_model | d ∝ √N | Wider models more stable, but memory-intensive |
| Layers | L | L ∝ N^0.2 | Depth helps reasoning but causes gradient issues |
| Attention heads | H | H ∝ d_model/64 | Head dimension typically fixed at 64-128 |
| FFN dimension | d_ff | d_ff = 4×d_model | SwiGLU models often use 8/3×d_model |
| Vocabulary | V | 50K-250K tokens | Larger vocab improves compression, adds parameters |
Common configurations:
| Model Size | d_model | Layers | Heads | d_ff | Parameters |
|---|---|---|---|---|---|
| Small | 768 | 12 | 12 | 3072 | ~125M |
| Medium | 1024 | 24 | 16 | 4096 | ~355M |
| Large | 1536 | 24 | 24 | 6144 | ~760M |
| XL | 2048 | 24 | 32 | 8192 | ~1.5B |
| 7B-class | 4096 | 32 | 32 | 11008 | ~7B |
| 70B-class | 8192 | 80 | 64 | 28672 | ~70B |
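Rows like the 7B- and 70B-class configurations can be sanity-checked with a rough parameter count. The sketch below assumes a LLaMA-style layout (SwiGLU FFN, untied 32K-token embeddings, optional grouped-query attention) and ignores small terms such as norm gains and biases:

```python
def estimate_params(d_model, n_layers, d_ff, vocab, n_heads, n_kv_heads=None):
    """Rough parameter count for a LLaMA-style decoder-only transformer."""
    n_kv_heads = n_kv_heads or n_heads
    head_dim = d_model // n_heads
    attn = 2 * d_model * d_model                   # Q and output projections
    attn += 2 * d_model * n_kv_heads * head_dim    # K and V (smaller under GQA)
    ffn = 3 * d_model * d_ff                       # SwiGLU: gate, up, down
    embeddings = 2 * vocab * d_model               # input + output (untied)
    return n_layers * (attn + ffn) + embeddings

print(f"{estimate_params(4096, 32, 11008, 32000, 32) / 1e9:.1f}B")      # ≈ 6.7B
print(f"{estimate_params(8192, 80, 28672, 32000, 64, 8) / 1e9:.1f}B")   # ≈ 69.0B
```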
Empirically, the ratio of depth to width matters more than absolute values. 'Tall and thin' (many layers, smaller d_model) vs 'short and wide' have different properties: depth helps compositional reasoning while width helps memorization and parallel processing. Most successful large models favor moderate depth with substantial width.
Training models with hundreds of billions of parameters requires sophisticated distributed systems. A single GPU cannot hold the model weights, let alone the gradients and optimizer states. Modern training infrastructure employs multiple parallelism strategies working in concert.
For a model with $N$ parameters using mixed precision training:
| Component | Memory per Parameter | 7B Model | 70B Model |
|---|---|---|---|
| Weights (fp16) | 2 bytes | 14 GB | 140 GB |
| Gradients (fp16) | 2 bytes | 14 GB | 140 GB |
| Adam Optimizer States | 12 bytes | 84 GB | 840 GB |
| Activations | Variable | 50-500 GB | 500 GB-5 TB |
| Minimum Total | ~16 bytes | ~165 GB | ~1.6 TB |
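The per-parameter arithmetic behind this table can be reproduced in a few lines; activations are omitted because they depend on batch size, sequence length, and checkpointing strategy:

```python
def training_state_gb(n_params: float) -> dict:
    """Approximate training-state memory in GB (mixed precision + Adam), no activations."""
    bytes_per_param = {
        "weights (fp16)": 2,
        "gradients (fp16)": 2,
        "Adam states (fp32)": 12,   # fp32 master weights + momentum + variance
    }
    return {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}

for n in (7e9, 70e9):
    parts = training_state_gb(n)
    detail = ", ".join(f"{name}: {gb:.0f} GB" for name, gb in parts.items())
    print(f"{n / 1e9:.0f}B model -> {detail} -> total ≈ {sum(parts.values()):.0f} GB")
```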
A 70B-parameter model requires ~1.6 TB just for the training state—far more than any single GPU can hold (flagship data-center GPUs in 2024 top out at roughly 80–192 GB). This necessitates distributed training.
1. Data Parallelism (DP)
The simplest form: replicate the entire model on each device, split each batch across devices, and average gradients. Because every replica must hold the full model, data parallelism alone cannot scale beyond single-device memory.
2. Tensor Parallelism (TP)
Split individual operations across devices. For matrix multiplication $Y = XW$, split $W$ across devices:
W split column-wise across devices: W = [W_0 | W_1 | W_2 | W_3]
GPU 0: Y_0 = X @ W_0
GPU 1: Y_1 = X @ W_1
...
AllGather([Y_0, Y_1, Y_2, Y_3]) → Y
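A NumPy sketch of the column-split matrix multiplication above; real systems (e.g. Megatron-LM) do this with sharded GPU tensors and NCCL collectives, whereas here the "devices" are just array slices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 512))       # activations: (batch, d_model)
W = rng.standard_normal((512, 2048))    # full weight matrix

# Tensor parallelism: shard W column-wise across 4 pretend devices.
shards = np.split(W, 4, axis=1)                # each shard is (512, 512)
partials = [X @ W_i for W_i in shards]         # each "device" computes its slice of Y

Y = np.concatenate(partials, axis=1)           # the AllGather step
assert np.allclose(Y, X @ W)                   # matches the unsharded result
```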
3. Pipeline Parallelism (PP)
Split model layers across devices, process micro-batches in pipeline fashion:
Time →
GPU 0 (Layers 0-7): [B1] [B2] [B3] [B4] [B1 bwd] [B2 bwd]...
GPU 1 (Layers 8-15): [B1] [B2] [B3] [B4] [B1 bwd]...
GPU 2 (Layers 16-23): [B1] [B2] [B3] [B4]...
4. Sequence Parallelism (SP)
Split along sequence dimension, particularly for attention computation where memory scales as O(n²).
5. ZeRO (Zero Redundancy Optimizer)
Shards optimizer states, gradients, and optionally weights across data-parallel ranks:
| Stage | Shards | Memory Reduction | Communication |
|---|---|---|---|
| ZeRO-1 | Optimizer states | ~4× | Minimal overhead |
| ZeRO-2 | + Gradients | ~8× | Reduce-scatter gradients |
| ZeRO-3 | + Parameters | Linear in DP degree | All-gather parameters per layer |
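A rough sketch of per-GPU training-state memory under each ZeRO stage, assuming the ~16 bytes per parameter derived earlier and an example data-parallel degree of 64 (activations excluded):

```python
def zero_memory_per_gpu_gb(n_params: float, dp_degree: int, stage: int) -> float:
    """Per-GPU memory (GB) for weights + gradients + Adam states under ZeRO."""
    weights, grads, optim = 2.0, 2.0, 12.0      # bytes per parameter
    if stage >= 1:
        optim /= dp_degree                      # shard optimizer states
    if stage >= 2:
        grads /= dp_degree                      # shard gradients
    if stage >= 3:
        weights /= dp_degree                    # shard the parameters too
    return n_params * (weights + grads + optim) / 1e9

for stage in range(4):
    gb = zero_memory_per_gpu_gb(70e9, dp_degree=64, stage=stage)
    print(f"ZeRO-{stage}: ≈ {gb:,.0f} GB per GPU")   # 1,120 / 293 / 155 / 18
```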
```python
# Example: 70B model training configuration
# Using 3D parallelism: TP × PP × DP

training_config = {
    # Model architecture
    "hidden_size": 8192,
    "num_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,            # GQA: 8 KV heads
    "intermediate_size": 28672,
    "vocab_size": 128256,

    # Parallelism configuration
    "tensor_model_parallel_size": 8,     # TP across 8 GPUs in a node
    "pipeline_model_parallel_size": 8,   # PP across 8 nodes
    "data_parallel_size": 64,            # DP with ZeRO-1
    # Total: 8 × 8 × 64 = 4096 GPUs

    # Memory optimization
    "use_flash_attention": True,
    "activation_checkpointing": "selective",
    "mixed_precision": "bf16",

    # Training hyperparameters
    "global_batch_size": 4_000_000,      # ~4M tokens per step
    "micro_batch_size": 2,
    "learning_rate": 1.5e-4,
    "warmup_steps": 2000,
    "total_training_tokens": 2_000_000_000_000,  # 2T tokens
}

# Training time estimate
tokens_per_batch = training_config["global_batch_size"]
total_tokens = training_config["total_training_tokens"]
steps = total_tokens // tokens_per_batch                  # 500,000 steps
seconds_per_step = 2.5                                    # depends on hardware
training_time_hours = (steps * seconds_per_step) / 3600   # ~347 hours
training_time_days = training_time_hours / 24             # ~14.5 days
```

At 4096 GPUs running for 2 weeks, hardware failures are near-certain. Training systems must checkpoint frequently (every few hundred steps) and gracefully recover from failures. A single day of lost training can cost $1M+ in compute time at frontier scale.
Understanding the economics of large-scale training is essential for practitioners. The costs involved shape research priorities, influence architectural decisions, and determine who can participate in frontier AI development.
Training compute is measured in FLOPs (floating-point operations). For a transformer:
$$C \approx 6ND$$
Where $N$ is the number of parameters and $D$ is the number of training tokens.
The factor of 6 reflects roughly 2 FLOPs per parameter per token for the forward pass (a multiply and an add) and roughly 4 for the backward pass (gradients with respect to both activations and weights).
Example calculations:
| Model | Parameters | Tokens | Training FLOPs | GPU Hours (A100) | Est. Cost |
|---|---|---|---|---|---|
| GPT-3 | 175B | 300B | ~3.1×10²³ | ~1M | ~$5M |
| LLaMA 2 70B | 70B | 2T | ~8.4×10²³ | ~1.7M | ~$10M |
| GPT-4 (est.) | 1.7T | 13T | ~1.3×10²⁵ | ~50M | ~$100M |
| Llama 3 405B | 405B | 15T | ~3.6×10²⁵ | ~90M | ~$200M |

The GPT-4 row is an unconfirmed estimate; if, as widely reported, it is a mixture-of-experts model, its training FLOPs track the parameters active per token rather than the total count, which is why that row does not follow $C \approx 6ND$ exactly.
Theoretical FLOPS ≠ achieved FLOPS. Model FLOPs Utilization (MFU) measures actual efficiency:
$$\text{MFU} = \frac{\text{Achieved FLOPS}}{\text{Peak Hardware FLOPS}}$$
| Hardware Configuration | Typical MFU | Notes |
|---|---|---|
| Single GPU | 50-70% | Memory bandwidth limited |
| Multi-GPU (TP only) | 40-55% | Communication overhead |
| Multi-Node (PP+TP+DP) | 30-45% | Pipeline bubbles, network latency |
| Optimized (Megatron-LM) | 45-55% | State-of-the-art distributed training |
A rule of thumb: Plan for ~40% MFU for large-scale training. A 50% MFU is excellent; 55%+ is exceptional.
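Combining $C \approx 6ND$ with an assumed MFU gives a quick GPU-hour estimator. The only hardware-specific number below is the A100's ~312 TFLOPS peak at bf16; swap in your own peak and MFU:

```python
def gpu_hours(n_params: float, n_tokens: float,
              peak_flops: float = 312e12, mfu: float = 0.40) -> float:
    """Estimate GPU-hours from C ≈ 6·N·D, peak hardware FLOPS, and assumed MFU."""
    total_flops = 6.0 * n_params * n_tokens
    achieved_flops = peak_flops * mfu
    return total_flops / achieved_flops / 3600.0

# A LLaMA-2-70B-scale run: 70B parameters on 2T tokens at 40% MFU on A100s.
hours = gpu_hours(70e9, 2e12)
print(f"≈ {hours / 1e6:.1f}M GPU-hours "
      f"(≈ ${hours * 2.50 / 1e6:.1f}M at a $2.50/GPU-hour rate)")
```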
Total Training Cost Breakdown (70B model, 2T tokens):
├── GPU Compute (H100 cloud): $7.5M
│   └── ~3M GPU hours × $2.50/hr (negotiated rate)
│
├── Infrastructure: $1.5M
│ ├── Networking (InfiniBand/RoCE fabric)
│ ├── Storage (100+ TB high-speed)
│ └── Cooling and power overhead
│
├── Engineering: $1M
│ ├── Training infrastructure (6 months × 5 engineers)
│ ├── Data pipeline development
│ └── Monitoring and recovery systems
│
├── Data: $500K
│ ├── Acquisition and licensing
│ ├── Cleaning and preprocessing
│ └── Quality filtering
│
└── Experiments: $2M
├── Architecture ablations at smaller scales
├── Hyperparameter search
└── Failed training runs
────────────────────────────────────────
TOTAL: ~$12.5M for production 70B model
Before training at full scale, labs typically invest 5-10% of compute budget on scaling law experiments at smaller scales. This allows prediction of final model performance and early detection of issues. A $500K scaling law study can prevent a $10M failed training run.
The economics of scale create a stark divide in AI capability:
| Tier | Annual Compute Budget | Achievable Scale | Examples |
|---|---|---|---|
| Frontier Labs | $100M-$1B+ | 100B-1T+ parameters | OpenAI, Anthropic, Google, Meta |
| Major Tech | $10M-$100M | 10B-100B parameters | Microsoft, Amazon, NVIDIA |
| Well-funded Startups | $1M-$10M | 1B-10B parameters | Mistral, Character.AI |
| Academic Groups | $100K-$1M | 100M-1B parameters | University labs, research institutes |
| Individual Researchers | <$100K | <100M parameters | Fine-tuning pretrained models only |
This concentration of capability raises important questions about who can participate in frontier AI research, how broadly access to powerful models is distributed, and how independently their claims can be scrutinized.
Open-weight models (LLaMA, Mistral) partially address this by allowing smaller players to access frontier capabilities through fine-tuning rather than pre-training.
Given a fixed compute budget, how should resources be allocated? This is the central practical question of scaling. The answer depends on your goals, constraints, and downstream use cases.
Step 1: Define Your Objective
Different objectives favor different scales:
| Objective | Preferred Strategy |
|---|---|
| Minimize training loss | Follow Chinchilla scaling |
| Maximize inference efficiency | Undertrain larger model, then distill |
| Achieve specific capability | Scale until capability emerges |
| Minimize total cost of ownership | Consider inference cost in scaling |
| Deploy on constrained hardware | Train smaller model with more data |
Step 2: Account for Inference
For deployment, inference cost often dominates:
$$\text{Total Cost} = C_{\text{training}} + N_{\text{inferences}} \times C_{\text{inference}}$$
For high-volume applications, the Chinchilla-optimal model may be suboptimal overall. A smaller, overtrained model can match the larger model's quality while costing less per token to serve, fitting on cheaper hardware, and responding with lower latency.
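A toy total-cost-of-ownership comparison under the formula above; every number (training cost, serving cost per million tokens, lifetime volume) is made up for illustration:

```python
def total_cost_musd(train_cost_musd: float, lifetime_tokens: float,
                    serve_cost_per_mtok_usd: float) -> float:
    """Total cost in millions of USD: training plus lifetime inference."""
    inference_musd = lifetime_tokens / 1e6 * serve_cost_per_mtok_usd / 1e6
    return train_cost_musd + inference_musd

# Hypothetical: a Chinchilla-optimal large model vs. a smaller overtrained model
# of comparable quality, serving a high-volume application.
tokens = 50e12   # tokens generated over the deployment's lifetime
print(total_cost_musd(10.0, tokens, serve_cost_per_mtok_usd=0.80))   # ≈ 50.0 ($M)
print(total_cost_musd(15.0, tokens, serve_cost_per_mtok_usd=0.15))   # ≈ 22.5 ($M)
```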
Scaling laws optimize for loss on the training distribution. But practitioners care about downstream task performance, reasoning, factual reliability, and safety, which loss predicts only imperfectly.
The uncomfortable truth: Scale is necessary but not sufficient. A 100B model trained on poor data will underperform a 10B model trained on excellent data. Scale amplifies the effect of other factors—good data becomes great, bad data becomes terrible.
Current trends suggest the era of 'just scale it' may be giving way to an era of 'scale it intelligently'.
Active research areas include: sample-efficient pre-training, compute-efficient architectures (Mamba, RWKV), better scaling laws that predict emergent capabilities, and automated methods for finding optimal configurations. The field is evolving rapidly.
Transformer scaling has transformed machine learning from an incremental science to one characterized by dramatic capability leaps. Understanding scale is now essential knowledge for anyone working with modern language models.
What's next:
With an understanding of how transformers scale, we turn to what they learn during pre-training. The next page explores pre-training objectives—the loss functions and training tasks that give large language models their remarkable capabilities. We'll see how simple objectives like next-token prediction create models that understand language, reason about the world, and follow complex instructions.
You now understand the science and practice of transformer scaling—from empirical laws to architectural innovations to compute economics. This foundation enables you to reason about modern AI development at the frontier. Next, we explore the pre-training objectives that shape what scaled models learn.