In 2017, the transformer architecture introduced in Attention Is All You Need achieved state-of-the-art machine translation. By 2020, GPT-3 demonstrated that scaling transformers to 175 billion parameters unlocked remarkable capabilities—from code generation to creative writing—that smaller models simply could not achieve. The field had discovered a profound truth: scale fundamentally changes what models can do.
This wasn't merely incremental improvement. Researchers observed qualitative changes in behavior as models grew—capabilities that emerged suddenly rather than gradually, abilities that defied prediction from smaller-scale experiments. Understanding transformer scaling has become essential for anyone working at the frontier of machine learning, influencing everything from research priorities to infrastructure investments worth billions of dollars.
This page covers the science and practice of scaling transformers: the empirical laws governing model performance, the architectural innovations enabling scale, the infrastructure required for training, and the economic realities of frontier model development. You will understand both why scale matters and how to reason about scaling decisions.
The modern understanding of transformer scaling is grounded in scaling laws—empirical relationships that predict model performance as a function of compute, data, and parameters. These laws, discovered through systematic experimentation, provide the theoretical foundation for billion-dollar training decisions.
The seminal work by Kaplan et al. established that language model loss follows predictable power-law relationships:
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$
Where $L$ is the loss, $N$ is the number of parameters, and $N_c$ and $\alpha_N$ are fitted constants. Similar relationships exist for compute $C$ and dataset size $D$.
Key findings from the Kaplan laws:
| Scaling Dimension | Exponent (α) | Interpretation |
|---|---|---|
| Parameters (N) | ~0.076 | Doubling parameters reduces loss by ~5% |
| Compute (C) | ~0.050 | 10× compute reduces loss by ~11% |
| Data (D) | ~0.095 | Doubling data reduces loss by ~6.5% |
| Optimal N given C | ~0.73 | Model size should scale with C^0.73 |
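To make the power-law form concrete, the sketch below plugs the parameter exponent into $L(N)$; the constant $N_c$ is an order-of-magnitude placeholder rather than the paper's fitted value, so the printed losses are illustrative only:

```python
# Illustrative sketch of the Kaplan-style power law L(N) = (N_c / N)^alpha_N.
# alpha_N matches the table above; N_c is an order-of-magnitude placeholder.
alpha_N = 0.076
N_c = 8.8e13

def loss_from_params(n_params: float) -> float:
    return (N_c / n_params) ** alpha_N

# Doubling N multiplies the loss by 2 ** -alpha_N ≈ 0.949, i.e. ~5% lower.
for n in (1e9, 2e9, 4e9):
    print(f"N = {n:.0e} -> L ≈ {loss_from_params(n):.3f}")
```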
Hoffmann et al. revised our understanding with the Chinchilla paper, demonstrating that many large models were significantly undertrained. Their analysis suggested a more balanced approach to scaling:
Chinchilla optimal scaling: for a fixed compute budget, parameters and training tokens should grow in roughly equal proportion ($N \propto C^{0.5}$, $D \propto C^{0.5}$), which works out to roughly 20 training tokens per parameter.
This had profound implications:
| Model | Parameters | Training Tokens | Tokens per Parameter | Chinchilla Optimal? |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | Significantly undertrained |
| Gopher | 280B | 300B | 1.1 | Significantly undertrained |
| Chinchilla | 70B | 1.4T | 20 | Compute-optimal |
Chinchilla (70B parameters) outperformed Gopher (280B parameters) while using the same compute, simply by training longer on more data.
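A minimal sketch of this budget split, assuming the roughly 20-tokens-per-parameter heuristic and the $C \approx 6ND$ compute approximation introduced later on this page (the function name and example budget are illustrative):

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens) using C ≈ 6·N·D with D ≈ 20·N.

    Substituting D = tokens_per_param · N gives N = sqrt(C / (6 · tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of ~5.9e23 FLOPs (what a 70B / 1.4T-token run costs under C ≈ 6ND)
# lands near Chinchilla's own configuration:
n, d = chinchilla_allocation(5.9e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")   # ≈ 70B, ≈ 1.4T
```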
Scaling laws are empirical fits, not fundamental laws of nature. They may break down at extreme scales, depend on architecture and data quality, and don't capture everything that matters (like reasoning ability or factual recall). Use them as guides, not gospel.
Integrating the dimensions of scale, performance can be modeled as:
$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$
Where $E$ is the irreducible loss (the entropy of natural text), $N$ is the number of parameters, $D$ is the number of training tokens, and $A$, $B$, $\alpha$, $\beta$ are fitted constants.
This formulation reveals a crucial insight: performance is bottlenecked by whichever term is larger. An undertrained large model and an overtrained small model both waste compute; optimal allocation balances the two contributions.
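A small sketch that evaluates this two-term loss; the default constants are close to those reported in the Chinchilla paper's parametric fit and are used here purely for illustration, not as authoritative values:

```python
def parametric_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

# A Gopher-like allocation is dominated by the data term; a Chinchilla-like
# allocation at similar compute balances the two and reaches a lower loss.
print(f"280B params, 300B tokens: L ≈ {parametric_loss(280e9, 300e9):.3f}")
print(f" 70B params, 1.4T tokens: L ≈ {parametric_loss(70e9, 1.4e12):.3f}")
```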
Perhaps the most fascinating aspect of scaling is the emergence of capabilities that cannot be predicted from smaller models. These emergent abilities appear suddenly once a model reaches a critical scale threshold, transforming from near-random performance to rapid improvement.
An ability is considered emergent if it is essentially absent in smaller models (performance near random chance) and appears only once models exceed a critical scale, beyond which performance improves sharply.
This is distinct from smooth scaling, where doubling parameters yields predictable, incremental gains. Emergence represents a phase transition—a qualitative change in what the model can do.
| Capability | Emergence Threshold | Description |
|---|---|---|
| Multi-step Arithmetic | ~10B parameters | Solving 3+ digit addition, multiplication |
| Chain-of-Thought Reasoning | ~100B parameters | Producing step-by-step reasoning chains |
| Word Unscrambling | ~10B parameters | Anagram solving, letter manipulation |
| In-Context Learning | ~10B parameters | Learning new tasks from few examples in prompt |
| Code Generation | ~10B parameters | Producing functional code from descriptions |
| Instruction Following | ~50B parameters | Reliable following of natural language instructions |
| Theory of Mind | ~50-100B parameters | Modeling beliefs and intentions of others |
Emergent abilities create a fundamental challenge for AI development: you cannot predict what a larger model will be capable of until you build it. This uncertainty drives speculative large-scale investments and makes safety evaluation more difficult—harmful capabilities could also emerge unpredictably.
Several hypotheses attempt to explain why emergence occurs:
1. Task Decomposition Threshold: Complex tasks require multiple sub-capabilities. Performance remains low until all required sub-capabilities reach sufficient quality, then jumps rapidly. For example, arithmetic requires digit recognition, positional understanding, and carry propagation—all must work for the overall task to succeed.
2. Circuit Formation: Neural networks form computational circuits that perform specific functions. Emergent abilities may correspond to circuits that only form (or become reliably activated) above certain parameter counts.
3. Grokking Phenomenon: Models sometimes exhibit 'grokking'—sudden generalization long after memorization. This may happen at scale when sufficient capacity enables the model to discover generalizable algorithms rather than memorizing patterns.
4. Measurement Artifacts: Some researchers argue emergence is partially an artifact of discrete metrics. When measuring accuracy on binary tasks, gradual improvements in underlying probability appear as sudden jumps. Continuous metrics may show smoother scaling.
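One way to see the measurement-artifact argument: an answer that spans several tokens only scores under exact match if every token is correct, so accuracy behaves roughly like $p^k$ for per-token probability $p$ and answer length $k$, and smooth gains in $p$ look like a sudden jump. The numbers below are purely illustrative:

```python
# Illustrative only: a smoothly improving per-token probability produces an
# apparently abrupt jump in an all-or-nothing exact-match metric.
answer_length = 8  # tokens that must all be correct for credit

for per_token_prob in (0.5, 0.7, 0.9, 0.95, 0.99):
    exact_match = per_token_prob ** answer_length
    print(f"per-token p = {per_token_prob:.2f} -> exact-match ≈ {exact_match:.3f}")
# p improves gradually, but exact match stays near zero until p is already high.
```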
Emergent capabilities shape how frontier AI labs approach development.
Understanding emergence is crucial because it fundamentally changes the return profile of scale investments: you're not just getting proportionally better performance, you might unlock entirely new capabilities.
Scaling transformers is not merely about adding more parameters—it requires architectural innovations that maintain training stability, computational efficiency, and model capability as dimensions increase. The standard transformer has evolved substantially to enable billion-parameter scale.
Pre-Layer Normalization (Pre-LN)
The original transformer used post-layer normalization, applying LayerNorm after the attention/FFN sublayers. This creates gradient scaling issues at extreme depths. Pre-LN applies normalization before each sublayer:
```python
# Post-LN (original, unstable at depth)
x = LayerNorm(x + Attention(x))      # ❌ norm applied after the residual add
x = LayerNorm(x + FFN(x))

# Pre-LN (modern, stable)
x = x + Attention(LayerNorm(x))      # ✓ norm before the sublayer, not after
x = x + FFN(LayerNorm(x))
```
Pre-LN enables training models with 100+ layers where Post-LN would diverge.
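A minimal PyTorch sketch of a Pre-LN block, using standard `torch.nn` modules; it mirrors the pseudocode above rather than any particular model's implementation, and the dimensions are arbitrary:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN transformer block: normalize before each sublayer, not after."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)                                      # norm before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual add
        x = x + self.ffn(self.ln2(x))                        # norm before FFN
        return x

x = torch.randn(2, 16, 512)            # (batch, sequence, d_model)
print(PreLNBlock()(x).shape)           # torch.Size([2, 16, 512])
```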
RMSNorm
Root Mean Square Layer Normalization simplifies LayerNorm by removing mean centering:
$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum x_i^2 + \epsilon}} \cdot \gamma$$
This reduces computation (~30% faster) while maintaining performance, important at scale.
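A minimal RMSNorm sketch following the formula above; this is illustrative, and production implementations differ in details such as dtype handling:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale features by their root mean square; no mean centering, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learned gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.gamma

print(RMSNorm(512)(torch.randn(2, 16, 512)).shape)   # torch.Size([2, 16, 512])
```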
DeepNorm
For extremely deep transformers (1000+ layers), DeepNorm modifies initialization and residual connections:
$$x_{l+1} = \text{LN}(\alpha x_l + \text{Sublayer}(x_l))$$
Where $\alpha > 1$ amplifies residual connections, and sublayer weights are scaled down accordingly.
When scaling model size, practitioners must decide how to allocate additional parameters across dimensions:
| Dimension | Symbol | Typical Scaling | Considerations |
|---|---|---|---|
| Hidden size | d_model | d ∝ √N | Wider models more stable, but memory-intensive |
| Layers | L | L ∝ N^0.2 | Depth helps reasoning but causes gradient issues |
| Attention heads | H | H ∝ d_model/64 | Head dimension typically fixed at 64-128 |
| FFN dimension | d_ff | d_ff = 4×d_model | SwiGLU models often use 8/3×d_model |
| Vocabulary | V | 50K-250K tokens | Larger vocab improves compression, adds parameters |
Common configurations:
| Model Size | d_model | Layers | Heads | d_ff | Parameters |
|---|---|---|---|---|---|
| Small | 768 | 12 | 12 | 3072 | ~125M |
| Medium | 1024 | 24 | 16 | 4096 | ~355M |
| Large | 1536 | 24 | 24 | 6144 | ~760M |
| XL | 2048 | 24 | 32 | 8192 | ~1.5B |
| 7B-class | 4096 | 32 | 32 | 11008 | ~7B |
| 70B-class | 8192 | 80 | 64 | 28672 | ~70B |
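Rows like the 7B- and 70B-class configurations can be sanity-checked with a rough parameter count. The sketch below assumes a LLaMA-style layout (SwiGLU FFN, untied 32K-token embeddings, optional grouped-query attention) and ignores small terms such as norm gains and biases:

```python
def estimate_params(d_model, n_layers, d_ff, vocab, n_heads, n_kv_heads=None):
    """Rough parameter count for a LLaMA-style decoder-only transformer."""
    n_kv_heads = n_kv_heads or n_heads
    head_dim = d_model // n_heads
    attn = 2 * d_model * d_model                   # Q and output projections
    attn += 2 * d_model * n_kv_heads * head_dim    # K and V (smaller under GQA)
    ffn = 3 * d_model * d_ff                       # SwiGLU: gate, up, down
    embeddings = 2 * vocab * d_model               # input + output (untied)
    return n_layers * (attn + ffn) + embeddings

print(f"{estimate_params(4096, 32, 11008, 32000, 32) / 1e9:.1f}B")      # ≈ 6.7B
print(f"{estimate_params(8192, 80, 28672, 32000, 64, 8) / 1e9:.1f}B")   # ≈ 69.0B
```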
Empirically, the ratio of depth to width matters more than absolute values. 'Tall and thin' (many layers, smaller d_model) vs 'short and wide' have different properties: depth helps compositional reasoning while width helps memorization and parallel processing. Most successful large models favor moderate depth with substantial width.
Training models with hundreds of billions of parameters requires sophisticated distributed systems. A single GPU cannot hold the model weights, let alone the gradients and optimizer states. Modern training infrastructure employs multiple parallelism strategies working in concert.
For a model with $N$ parameters using mixed precision training:
| Component | Memory per Parameter | 7B Model | 70B Model |
|---|---|---|---|
| Weights (fp16) | 2 bytes | 14 GB | 140 GB |
| Gradients (fp16) | 2 bytes | 14 GB | 140 GB |
| Adam Optimizer States | 12 bytes | 84 GB | 840 GB |
| Activations | Variable | 50-500 GB | 500 GB-5 TB |
| Minimum Total | ~16 bytes | ~165 GB | ~1.6 TB |
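The per-parameter arithmetic behind this table can be reproduced in a few lines; activations are omitted because they depend on batch size, sequence length, and checkpointing strategy:

```python
def training_state_gb(n_params: float) -> dict:
    """Approximate training-state memory in GB (mixed precision + Adam), no activations."""
    bytes_per_param = {
        "weights (fp16)": 2,
        "gradients (fp16)": 2,
        "Adam states (fp32)": 12,   # fp32 master weights + momentum + variance
    }
    return {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}

for n in (7e9, 70e9):
    parts = training_state_gb(n)
    detail = ", ".join(f"{name}: {gb:.0f} GB" for name, gb in parts.items())
    print(f"{n / 1e9:.0f}B model -> {detail} -> total ≈ {sum(parts.values()):.0f} GB")
```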
A 70B-parameter model requires ~1.6 TB just for the training state—far more than any single GPU can hold (flagship data-center GPUs in 2024 top out at roughly 80–192 GB). This necessitates distributed training.
1. Data Parallelism (DP)
The simplest form: replicate the entire model on each device, split each batch across devices, and average gradients. Because every replica must hold the full model, data parallelism alone cannot scale beyond single-device memory.
2. Tensor Parallelism (TP)
Split individual operations across devices. For matrix multiplication $Y = XW$, split $W$ across devices:
W split column-wise across devices: W = [W_0 | W_1 | W_2 | W_3]
GPU 0: Y_0 = X @ W_0
GPU 1: Y_1 = X @ W_1
...
AllGather([Y_0, Y_1, Y_2, Y_3]) → Y
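A NumPy sketch of the column-split matrix multiplication above; real systems (e.g. Megatron-LM) do this with sharded GPU tensors and NCCL collectives, whereas here the "devices" are just array slices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 512))       # activations: (batch, d_model)
W = rng.standard_normal((512, 2048))    # full weight matrix

# Tensor parallelism: shard W column-wise across 4 pretend devices.
shards = np.split(W, 4, axis=1)                # each shard is (512, 512)
partials = [X @ W_i for W_i in shards]         # each "device" computes its slice of Y

Y = np.concatenate(partials, axis=1)           # the AllGather step
assert np.allclose(Y, X @ W)                   # matches the unsharded result
```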
3. Pipeline Parallelism (PP)
Split model layers across devices, process micro-batches in pipeline fashion:
Time →
GPU 0 (Layers 0-7): [B1] [B2] [B3] [B4] [B1 bwd] [B2 bwd]...
GPU 1 (Layers 8-15): [B1] [B2] [B3] [B4] [B1 bwd]...
GPU 2 (Layers 16-23): [B1] [B2] [B3] [B4]...
4. Sequence Parallelism (SP)
Split along sequence dimension, particularly for attention computation where memory scales as O(n²).
5. ZeRO (Zero Redundancy Optimizer)
Shards optimizer states, gradients, and optionally weights across data-parallel ranks:
| Stage | Shards | Memory Reduction | Communication |
|---|---|---|---|
| ZeRO-1 | Optimizer states | ~4× | Minimal overhead |
| ZeRO-2 | + Gradients | ~8× | Reduce-scatter gradients |
| ZeRO-3 | + Parameters | Linear in DP degree | All-gather parameters per layer |
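A rough sketch of per-GPU training-state memory under each ZeRO stage, assuming the ~16 bytes per parameter derived earlier and an example data-parallel degree of 64 (activations excluded):

```python
def zero_memory_per_gpu_gb(n_params: float, dp_degree: int, stage: int) -> float:
    """Per-GPU memory (GB) for weights + gradients + Adam states under ZeRO."""
    weights, grads, optim = 2.0, 2.0, 12.0      # bytes per parameter
    if stage >= 1:
        optim /= dp_degree                      # shard optimizer states
    if stage >= 2:
        grads /= dp_degree                      # shard gradients
    if stage >= 3:
        weights /= dp_degree                    # shard the parameters too
    return n_params * (weights + grads + optim) / 1e9

for stage in range(4):
    gb = zero_memory_per_gpu_gb(70e9, dp_degree=64, stage=stage)
    print(f"ZeRO-{stage}: ≈ {gb:,.0f} GB per GPU")   # 1,120 / 293 / 155 / 18
```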
```python
# Example: 70B model training configuration
# Using 3D parallelism: TP × PP × DP

training_config = {
    # Model architecture
    "hidden_size": 8192,
    "num_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,            # GQA: 8 KV heads
    "intermediate_size": 28672,
    "vocab_size": 128256,

    # Parallelism configuration
    "tensor_model_parallel_size": 8,     # TP across 8 GPUs in a node
    "pipeline_model_parallel_size": 8,   # PP across 8 nodes
    "data_parallel_size": 64,            # DP with ZeRO-1
    # Total: 8 × 8 × 64 = 4096 GPUs

    # Memory optimization
    "use_flash_attention": True,
    "activation_checkpointing": "selective",
    "mixed_precision": "bf16",

    # Training hyperparameters
    "global_batch_size": 4_000_000,      # ~4M tokens per step
    "micro_batch_size": 2,
    "learning_rate": 1.5e-4,
    "warmup_steps": 2000,
    "total_training_tokens": 2_000_000_000_000,  # 2T tokens
}

# Training time estimate
tokens_per_batch = training_config["global_batch_size"]
total_tokens = training_config["total_training_tokens"]
steps = total_tokens // tokens_per_batch                  # 500,000 steps
seconds_per_step = 2.5                                    # depends on hardware
training_time_hours = (steps * seconds_per_step) / 3600   # ~347 hours
training_time_days = training_time_hours / 24             # ~14.5 days
```

At 4096 GPUs running for 2 weeks, hardware failures are near-certain. Training systems must checkpoint frequently (every few hundred steps) and gracefully recover from failures. A single day of lost training can cost $1M+ in compute time at frontier scale.
Understanding the economics of large-scale training is essential for practitioners. The costs involved shape research priorities, influence architectural decisions, and determine who can participate in frontier AI development.
Training compute is measured in FLOPs (floating-point operations). For a transformer:
$$C \approx 6ND$$
Where $N$ is the number of parameters and $D$ is the number of training tokens.
The factor of 6 reflects roughly 2 FLOPs per parameter per token for the forward pass (a multiply and an add) and roughly 4 for the backward pass (gradients with respect to both activations and weights).
Example calculations:
| Model | Parameters | Tokens | Training FLOPs | GPU Hours (A100) | Est. Cost |
|---|---|---|---|---|---|
| GPT-3 | 175B | 300B | ~3.1×10²³ | ~1M | ~$5M |
| LLaMA 2 70B | 70B | 2T | ~8.4×10²³ | ~1.7M | ~$10M |
| GPT-4 (est.) | 1.7T | 13T | ~1.3×10²⁵ | ~50M | ~$100M |
| Llama 3 405B | 405B | 15T | ~3.6×10²⁵ | ~90M | ~$200M |

The GPT-4 row is an unconfirmed estimate; if, as widely reported, it is a mixture-of-experts model, its training FLOPs track the parameters active per token rather than the total count, which is why that row does not follow $C \approx 6ND$ exactly.
Theoretical FLOPS ≠ achieved FLOPS. Model FLOPs Utilization (MFU) measures actual efficiency:
$$\text{MFU} = \frac{\text{Achieved FLOPS}}{\text{Peak Hardware FLOPS}}$$
| Hardware Configuration | Typical MFU | Notes |
|---|---|---|
| Single GPU | 50-70% | Memory bandwidth limited |
| Multi-GPU (TP only) | 40-55% | Communication overhead |
| Multi-Node (PP+TP+DP) | 30-45% | Pipeline bubbles, network latency |
| Optimized (Megatron-LM) | 45-55% | State-of-the-art distributed training |
A rule of thumb: Plan for ~40% MFU for large-scale training. A 50% MFU is excellent; 55%+ is exceptional.
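Combining $C \approx 6ND$ with an assumed MFU gives a quick GPU-hour estimator. The only hardware-specific number below is the A100's ~312 TFLOPS peak at bf16; swap in your own peak and MFU:

```python
def gpu_hours(n_params: float, n_tokens: float,
              peak_flops: float = 312e12, mfu: float = 0.40) -> float:
    """Estimate GPU-hours from C ≈ 6·N·D, peak hardware FLOPS, and assumed MFU."""
    total_flops = 6.0 * n_params * n_tokens
    achieved_flops = peak_flops * mfu
    return total_flops / achieved_flops / 3600.0

# A LLaMA-2-70B-scale run: 70B parameters on 2T tokens at 40% MFU on A100s.
hours = gpu_hours(70e9, 2e12)
print(f"≈ {hours / 1e6:.1f}M GPU-hours "
      f"(≈ ${hours * 2.50 / 1e6:.1f}M at a $2.50/GPU-hour rate)")
```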
Total Training Cost Breakdown (70B model, 2T tokens):
├── GPU Compute (H100 cloud): $7.5M
│   └── ~3M GPU hours × $2.50/hr (negotiated rate)
│
├── Infrastructure: $1.5M
│ ├── Networking (InfiniBand/RoCE fabric)
│ ├── Storage (100+ TB high-speed)
│ └── Cooling and power overhead
│
├── Engineering: $1M
│ ├── Training infrastructure (6 months × 5 engineers)
│ ├── Data pipeline development
│ └── Monitoring and recovery systems
│
├── Data: $500K
│ ├── Acquisition and licensing
│ ├── Cleaning and preprocessing
│ └── Quality filtering
│
└── Experiments: $2M
├── Architecture ablations at smaller scales
├── Hyperparameter search
└── Failed training runs
────────────────────────────────────────
TOTAL: ~$12.5M for production 70B model
Before training at full scale, labs typically invest 5-10% of compute budget on scaling law experiments at smaller scales. This allows prediction of final model performance and early detection of issues. A $500K scaling law study can prevent a $10M failed training run.
The economics of scale create a stark divide in AI capability:
| Tier | Annual Compute Budget | Achievable Scale | Examples |
|---|---|---|---|
| Frontier Labs | $100M-$1B+ | 100B-1T+ parameters | OpenAI, Anthropic, Google, Meta |
| Major Tech | $10M-$100M | 10B-100B parameters | Microsoft, Amazon, NVIDIA |
| Well-funded Startups | $1M-$10M | 1B-10B parameters | Mistral, Character.AI |
| Academic Groups | $100K-$1M | 100M-1B parameters | University labs, research institutes |
| Individual Researchers | <$100K | <100M parameters | Fine-tuning pretrained models only |
This concentration of capability raises important questions about who can participate in frontier AI research, how broadly access to powerful models is distributed, and how independently their claims can be scrutinized.
Open-weight models (LLaMA, Mistral) partially address this by allowing smaller players to access frontier capabilities through fine-tuning rather than pre-training.
Given a fixed compute budget, how should resources be allocated? This is the central practical question of scaling. The answer depends on your goals, constraints, and downstream use cases.
Step 1: Define Your Objective
Different objectives favor different scales:
| Objective | Preferred Strategy |
|---|---|
| Minimize training loss | Follow Chinchilla scaling |
| Maximize inference efficiency | Undertrain larger model, then distill |
| Achieve specific capability | Scale until capability emerges |
| Minimize total cost of ownership | Consider inference cost in scaling |
| Deploy on constrained hardware | Train smaller model with more data |
Step 2: Account for Inference
For deployment, inference cost often dominates:
$$\text{Total Cost} = C_{\text{training}} + N_{\text{inferences}} \times C_{\text{inference}}$$
For high-volume applications, the Chinchilla-optimal model may be suboptimal overall. A smaller, overtrained model can match the larger model's quality while costing less per token to serve, fitting on cheaper hardware, and responding with lower latency.
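A toy total-cost-of-ownership comparison under the formula above; every number (training cost, serving cost per million tokens, lifetime volume) is made up for illustration:

```python
def total_cost_musd(train_cost_musd: float, lifetime_tokens: float,
                    serve_cost_per_mtok_usd: float) -> float:
    """Total cost in millions of USD: training plus lifetime inference."""
    inference_musd = lifetime_tokens / 1e6 * serve_cost_per_mtok_usd / 1e6
    return train_cost_musd + inference_musd

# Hypothetical: a Chinchilla-optimal large model vs. a smaller overtrained model
# of comparable quality, serving a high-volume application.
tokens = 50e12   # tokens generated over the deployment's lifetime
print(total_cost_musd(10.0, tokens, serve_cost_per_mtok_usd=0.80))   # ≈ 50.0 ($M)
print(total_cost_musd(15.0, tokens, serve_cost_per_mtok_usd=0.15))   # ≈ 22.5 ($M)
```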
Scaling laws optimize for loss on the training distribution. But practitioners care about downstream task performance, reasoning, factual reliability, and safety, which loss predicts only imperfectly.
The uncomfortable truth: Scale is necessary but not sufficient. A 100B model trained on poor data will underperform a 10B model trained on excellent data. Scale amplifies the effect of other factors—good data becomes great, bad data becomes terrible.
Current trends suggest the era of 'just scale it' may be giving way to an era of 'scale it intelligently'.
Active research areas include: sample-efficient pre-training, compute-efficient architectures (Mamba, RWKV), better scaling laws that predict emergent capabilities, and automated methods for finding optimal configurations. The field is evolving rapidly.
Transformer scaling has transformed machine learning from an incremental science to one characterized by dramatic capability leaps. Understanding scale is now essential knowledge for anyone working with modern language models.
What's next:
With an understanding of how transformers scale, we turn to what they learn during pre-training. The next page explores pre-training objectives—the loss functions and training tasks that give large language models their remarkable capabilities. We'll see how simple objectives like next-token prediction create models that understand language, reason about the world, and follow complex instructions.
You now understand the science and practice of transformer scaling—from empirical laws to architectural innovations to compute economics. This foundation enables you to reason about modern AI development at the frontier. Next, we explore the pre-training objectives that shape what scaled models learn.