In the history of machine learning, no single factor has been more transformative than scale. The models that power today's most impressive AI capabilities—ChatGPT, Claude, GPT-4, Gemini, LLaMA—are not fundamentally different architectures from their predecessors. They are the same architectures, but at a scale that seemed computationally infeasible just a decade ago.
This page explores the profound implications of scale in machine learning. We will examine how scale has transformed from a mere engineering consideration into a fundamental research paradigm, one that has upended decades of conventional wisdom about how to improve AI systems. The story of modern AI is, in many ways, the story of scale—and understanding this story is essential for any practitioner seeking to work with or understand foundation models.
By the end of this page, you will understand: (1) why scale has become the dominant paradigm in modern ML, (2) the mathematical foundations of scaling laws, (3) the computational and economic implications of training at scale, (4) the historical context that led to the scaling revolution, and (5) how to reason about scale in your own work.
To appreciate the significance of scale in modern ML, we must understand the paradigm it replaced. For most of machine learning's history, progress came primarily through algorithmic innovation—developing better models, better features, and better optimization techniques.
The Traditional ML Paradigm (1950s-2010s):
For decades, the dominant approach to improving ML systems followed a predictable pattern:
Careful Feature Engineering: Domain experts would spend months or years crafting features that captured the essential structure of problems. Computer vision practitioners developed SIFT, HOG, and SURF features. NLP researchers built syntactic parsers and hand-crafted lexicons.
Clever Architectural Innovations: Researchers focused on developing objectively 'better' models—support vector machines over perceptrons, random forests over single decision trees, deep networks over shallow ones.
Sophisticated Optimization: Training algorithms became increasingly sophisticated—from gradient descent to momentum, RMSprop, Adam, and beyond.
In this paradigm, data and compute were secondary concerns. The prevailing wisdom held that more data helped, but with diminishing returns. Extra compute could speed up experiments but wouldn't fundamentally change what was possible.
| Aspect | Traditional ML (1950s-2010s) | Scale-First ML (2017-Present) |
|---|---|---|
| Primary lever for progress | Algorithmic innovation | Scale (data, compute, parameters) |
| Feature engineering | Critical, domain-dependent | Minimal, learned from data |
| Model architecture | Task-specific designs | General-purpose transformers |
| Data requirements | Moderate (thousands-millions) | Massive (billions-trillions tokens) |
| Compute requirements | Modest (single GPU/CPU) | Enormous (thousands of GPUs) |
| Dominant practitioners | Academic labs | Well-funded industry labs |
| Key bottleneck | Algorithm design | Compute budget & data access |
The Turning Point: Deep Learning at Scale (2012-2017)
The shift began with AlexNet in 2012, which demonstrated that deep neural networks trained on large datasets (ImageNet's 1.2 million images) using GPUs could dramatically outperform hand-engineered approaches. But AlexNet was still modest by modern standards—60 million parameters, trained on a few GPUs.
Over the next five years, a growing body of evidence suggested that scale followed predictable laws. Researchers at OpenAI, Google, and elsewhere noticed that test loss fell smoothly and predictably as models, datasets, and compute budgets grew, with little sign of the plateau that conventional wisdom expected.
This observation—that you could buy better performance by simply scaling up—upended the traditional paradigm. If more compute reliably means better results, then the bottleneck shifts from algorithm design to compute access and efficient training.
Rich Sutton's influential 2019 essay 'The Bitter Lesson' captured this paradigm shift: 'The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.' Human knowledge and clever algorithms consistently lose to simple methods scaled up—a bitter pill for researchers who devoted careers to algorithmic sophistication.
The most remarkable discovery of the scaling era is that improvements from scale are not arbitrary—they follow precise mathematical laws. These scaling laws allow researchers to predict how much performance will improve from additional compute, data, or parameters, enabling rational planning of massive training runs.
The Power Law Relationship:
Scaling laws typically take the form of power laws:
$$L(X) = \left(\frac{X_c}{X}\right)^{\alpha_X}$$
where $L$ is the loss, $X$ is the scaling variable (compute $C$, dataset size $D$, or parameter count $N$), $X_c$ is a fitted constant, and $\alpha_X$ is the scaling exponent for that variable.
This relationship means that each order of magnitude increase in scale yields a consistent, predictable decrease in loss. The exponent $\alpha$ determines how quickly performance improves with scale.
```python
import numpy as np
import matplotlib.pyplot as plt

def scaling_law(x, x_c, alpha):
    """
    Compute loss under scaling law.

    The power law relationship: L(X) = (X_c/X)^alpha

    Args:
        x: Scaling variable (compute, data, or parameters)
        x_c: Constant term
        alpha: Scaling exponent

    Returns:
        Predicted loss value
    """
    return (x_c / x) ** alpha

# Typical scaling exponents observed in language models
# Note: These are approximate values from published research
scaling_exponents = {
    'compute': 0.050,     # Loss ~ C^(-0.050) from Kaplan et al.
    'parameters': 0.076,  # Loss ~ N^(-0.076)
    'data': 0.095,        # Loss ~ D^(-0.095)
}

# Example: Predicting loss for different model sizes
parameters = np.logspace(6, 12, 100)  # 1M to 1T parameters
x_c = 8.8e13                          # Empirical constant for language modeling

losses = scaling_law(parameters, x_c, scaling_exponents['parameters'])

# Key insight: Loss decreases as a power law with parameters
# Doubling parameters reduces loss by: 2^(-0.076) ≈ 0.949
# This means ~5% improvement in loss per doubling

print(f"Loss reduction per 10x parameters: {10**(-scaling_exponents['parameters']):.3f}")
print(f"To halve the loss, need: {2**(1/scaling_exponents['parameters']):.1e}x more parameters")
```

The Chinchilla Scaling Laws:
A pivotal contribution came from DeepMind's 2022 'Chinchilla' paper, which refined our understanding of optimal scaling. The key insight was that previous models were undertrained relative to their size.
The Chinchilla team proposed that compute-optimal training requires balancing model size and training data according to:
$$N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}$$
where $N$ is parameters, $D$ is training tokens, and $C$ is compute. In practice, this means model size and training data should be scaled up in equal proportion, which for Chinchilla worked out to roughly 20 training tokens per parameter, as the table below shows.
| Model | Parameters | Training Tokens | Tokens/Parameter Ratio | Training Regime |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | Undertrained |
| Gopher | 280B | 300B | 1.1 | Severely undertrained |
| Chinchilla | 70B | 1.4T | 20 | Compute-optimal |
| LLaMA-2 70B | 70B | 2T | 29 | Over-trained (inference-efficient) |
| LLaMA-3 70B | 70B | 15T | 214 | Heavily over-trained |
Chinchilla optimality minimizes training compute, but production models must also consider inference costs. A smaller, overtrained model (like LLaMA) is cheaper to serve than a larger, optimally-trained model with equal performance. This has led to a trend of 'over-training' models well beyond the Chinchilla optimal point for deployment efficiency.
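To see why over-training a smaller model can pay off, the following back-of-the-envelope sketch compares total lifetime compute (training plus serving) for a larger versus a smaller model. It relies on the $C \approx 6ND$ training estimate derived in the next subsection, plus the common rule of thumb that inference costs roughly $2N$ FLOPs per generated token (an assumption not stated in this text); the parameter counts, token counts, and serving volume are purely illustrative.

```python
def lifetime_flops(params: float, train_tokens: float, served_tokens: float) -> float:
    """Total compute: ~6*N*D to train, plus ~2*N FLOPs per token served."""
    return 6 * params * train_tokens + 2 * params * served_tokens

served = 1e13  # hypothetical 10T tokens generated over the deployment lifetime

# Larger model with a Chinchilla-style token budget vs. smaller, heavily over-trained model
big = lifetime_flops(params=175e9, train_tokens=3.5e12, served_tokens=served)
small = lifetime_flops(params=70e9, train_tokens=10e12, served_tokens=served)

print(f"175B model, 3.5T training tokens: {big:.2e} lifetime FLOPs")
print(f" 70B model, 10T training tokens:  {small:.2e} lifetime FLOPs")
# The smaller model costs more to train in this example, but every served
# token is 2.5x cheaper, so at high serving volume its lifetime compute is lower.
```

If the two models reach comparable quality, as the paragraph above suggests over-trained models can, the smaller one wins once serving volume is large enough.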
Abstract discussions of scale become concrete when we consider the actual computational requirements of training modern foundation models. The numbers are staggering and have profound implications for who can participate in frontier ML research.
Measuring Compute: FLOPs and GPU-hours
We measure training compute in FLOPs (floating point operations)—specifically, total FLOPs across the entire training run. The relationship between model parameters, training tokens, and compute follows the approximation:
$$C \approx 6ND$$
where $C$ is total training compute in FLOPs, $N$ is the number of model parameters, and $D$ is the number of training tokens; the factor of 6 reflects roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass.
This allows us to estimate compute for any training configuration. For example, training GPT-3 (175B parameters, 300B tokens):
$$C = 6 \times 175 \times 10^9 \times 300 \times 10^9 = 3.15 \times 10^{23} \text{ FLOPs}$$
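As a cross-check that ties this approximation back to the Chinchilla discussion above, here is a minimal sketch (the function name is ours) that splits a fixed FLOP budget into a compute-optimal parameter count and token count by combining $C \approx 6ND$ with the roughly 20-tokens-per-parameter rule:

```python
import math

def compute_optimal_allocation(compute_flops: float,
                               tokens_per_param: float = 20.0) -> tuple:
    """
    Split a training-compute budget into a compute-optimal model size
    and token count, using two approximations from this page:

        C ≈ 6 * N * D      (training compute)
        D ≈ 20 * N         (Chinchilla-style tokens-per-parameter rule)

    Substituting the second into the first gives C ≈ 6 * 20 * N^2,
    so N ≈ sqrt(C / 120) and D = 20 * N.
    """
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# GPT-3's estimated budget from the worked example above
n_opt, d_opt = compute_optimal_allocation(3.15e23)
print(f"Compute-optimal split: ~{n_opt / 1e9:.0f}B parameters, "
      f"~{d_opt / 1e12:.2f}T tokens")
# vs. GPT-3's actual 175B parameters / 300B tokens
```

Running it on GPT-3's $3.15 \times 10^{23}$ FLOPs gives roughly 50B parameters and about 1T tokens, consistent with the earlier table's verdict that GPT-3 was undertrained for its compute budget.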
| Model | Year | Parameters | Tokens | Estimated FLOPs | Estimated Cost (USD) |
|---|---|---|---|---|---|
| BERT-Large | 2018 | 340M | 3.3B | ~10²⁰ | ~$10K |
| GPT-2 | 2019 | 1.5B | 40B | ~10²¹ | ~$50K |
| GPT-3 | 2020 | 175B | 300B | ~3×10²³ | ~$4-12M |
| Chinchilla | 2022 | 70B | 1.4T | ~6×10²³ | ~$10M |
| GPT-4 (estimate) | 2023 | ~1.8T (MoE) | ~13T | ~10²⁵ | ~$100M+ |
| Claude 3 Opus (estimate) | 2024 | Unknown | Unknown | ~10²⁵ | ~$100M+ |
| Llama 3 405B | 2024 | 405B | 15T | ~4×10²⁵ | ~$150M+ |
The Hardware Landscape:
Frontier model training runs on specialized hardware. The dominant platforms are NVIDIA's A100 and H100 GPUs, organized into massive clusters:
| GPU | bf16 Performance | Memory | Power | Typical Cloud Cost |
|---|---|---|---|---|
| A100 (40GB) | 312 TFLOPS | 40 GB HBM2e | 400W | ~$2-3/hour |
| A100 (80GB) | 312 TFLOPS | 80 GB HBM2e | 400W | ~$3-4/hour |
| H100 (80GB) | 1,979 TFLOPS (with sparsity; ~990 dense) | 80 GB HBM3 | 700W | ~$4-6/hour |
Example: Training LLaMA 2 70B
Meta reported that pretraining LLaMA 2 70B consumed roughly 1.7 million A100 (80 GB) GPU-hours over its 2 trillion training tokens.
This figure excludes preliminary experiments, failed or restarted runs, evaluation, and the fine-tuning stages that follow pretraining.
The compute requirements for frontier models create a stark divide in AI research. Training a GPT-4 class model requires resources available to perhaps a dozen organizations worldwide. This concentration of capability in well-funded labs raises important questions about research access, reproducibility, and who gets to shape AI development.
```python
def estimate_training_compute(
    parameters: float,
    tokens: float,
    flops_per_token_factor: float = 6.0,
) -> dict:
    """
    Estimate training compute requirements.

    The standard approximation: C ≈ 6 * N * D
      - Forward pass: ~2 * N FLOPs per token
      - Backward pass: ~4 * N FLOPs per token (2x forward, for gradients)
      - Total: ~6 * N FLOPs per token, times D tokens

    Args:
        parameters: Number of model parameters
        tokens: Number of training tokens
        flops_per_token_factor: FLOPs multiplier (default 6 for transformers)

    Returns:
        Dict with compute estimates
    """
    total_flops = flops_per_token_factor * parameters * tokens

    # GPU performance benchmarks (realistic training efficiency)
    a100_peak_flops = 312e12   # 312 TFLOPS bf16
    h100_peak_flops = 1979e12  # ~2 PFLOPS bf16 (with structured sparsity)

    # Typical training efficiency: 30-50% of peak
    a100_effective_flops = a100_peak_flops * 0.4  # ~125 TFLOPS effective
    h100_effective_flops = h100_peak_flops * 0.4  # ~800 TFLOPS effective

    a100_hours = total_flops / (a100_effective_flops * 3600)
    h100_hours = total_flops / (h100_effective_flops * 3600)

    # Cost estimates (cloud pricing)
    a100_cost_per_hour = 3.00  # USD
    h100_cost_per_hour = 5.00  # USD

    return {
        'total_flops': total_flops,
        'total_flops_scientific': f"{total_flops:.2e}",
        'a100_gpu_hours': a100_hours,
        'h100_gpu_hours': h100_hours,
        'a100_cost_usd': a100_hours * a100_cost_per_hour,
        'h100_cost_usd': h100_hours * h100_cost_per_hour,
        'a100_days_1000_gpus': a100_hours / (1000 * 24),
        'h100_days_1000_gpus': h100_hours / (1000 * 24),
    }

# Example: Estimate for LLaMA 2 70B training
llama2_70b = estimate_training_compute(
    parameters=70e9,  # 70 billion parameters
    tokens=2e12,      # 2 trillion tokens
)

print("LLaMA 2 70B Training Estimate:")
print(f"  Total FLOPs: {llama2_70b['total_flops_scientific']}")
print(f"  A100 GPU-hours: {llama2_70b['a100_gpu_hours']:,.0f}")
print(f"  H100 GPU-hours: {llama2_70b['h100_gpu_hours']:,.0f}")
print(f"  A100 cost: ${llama2_70b['a100_cost_usd']:,.0f}")
print(f"  Days on 1000 A100s: {llama2_70b['a100_days_1000_gpus']:.1f}")
```

Scale isn't just about compute and parameters—data is equally fundamental. The shift to web-scale training data represents a transformation as significant as the architectural innovations. Understanding data scaling is crucial for anyone working with foundation models.
The Data Scaling Journey:
The progression of training datasets tells the story of scale:
| Era | Dataset | Size | Curation Level |
|---|---|---|---|
| 2010s | Penn Treebank | 1M tokens | Expertly curated, annotated |
| 2015 | Wikipedia | ~3B tokens | High-quality, curated |
| 2019 | WebText (GPT-2) | 40B tokens | Filtered web scrape |
| 2020 | C4 (T5) | 156B tokens | Filtered Common Crawl |
| 2020 | The Pile | 825 GB | Diverse sources |
| 2023 | RefinedWeb | 5T tokens | Deduplicated, filtered |
| 2023 | RedPajama | 1.2T tokens | Open replication of LLaMA |
| 2024 | FineWeb | 15T tokens | High-quality web data |
The Data Quality-Quantity Tradeoff:
Scaling data is not simply 'more is better.' Research has shown that data quality interacts nonlinearly with scale:
Deduplication Matters Enormously: Training on duplicated data leads to memorization rather than generalization. Near-duplicate detection at scale requires sophisticated MinHash and locality-sensitive hashing techniques (a minimal sketch follows this list).
Quality Filtering Improves Efficiency: Filtering low-quality data (spam, boilerplate, machine-generated text) can improve performance more than adding equivalent amounts of unfiltered data.
Domain Mixing Requires Care: The ratio of data from different domains (code, books, web, academic papers) significantly affects downstream capabilities.
Repetition Has Costs: Training on the same data multiple epochs provides diminishing returns and can harm generalization.
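As a toy illustration of the deduplication point above, the sketch below estimates document similarity with MinHash signatures. It is a deliberately minimal version of the idea (the function names, shingle size, and hash count are our own choices); production pipelines bucket these signatures with locality-sensitive hashing so that only candidate pairs are ever compared.

```python
import hashlib
from typing import List, Set

def shingles(text: str, n: int = 5) -> Set[str]:
    """Split text into overlapping word n-grams ('shingles')."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items: Set[str], num_hashes: int = 128) -> List[int]:
    """
    MinHash signature: for each of num_hashes seeded hash functions, keep the
    minimum hash value over all shingles. The probability that two signatures
    agree at a given position equals the Jaccard similarity of the sets.
    """
    signature = []
    for seed in range(num_hashes):
        min_val = min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        )
        signature.append(min_val)
    return signature

def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river today"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
print(f"Estimated near-duplicate similarity: {estimated_similarity(sig1, sig2):.2f}")
```

Because the signature comparison approximates Jaccard similarity without ever comparing raw documents, this is what makes near-duplicate detection tractable at web scale.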
Some researchers argue we are approaching a 'data wall'—the limit of useful web text for training. The internet contains perhaps 50-100 trillion tokens of unique, high-quality text in English. If frontier models already train on 10-15T tokens, we may be 3-5× away from exhausting unique data. This has driven major investments in synthetic data generation, multimodal data, and multilingual expansion.
The empirical success of scaling is undeniable, but why does it work so well? This remains an active area of research. We present several theoretical perspectives that help explain the scaling phenomenon.
Perspective 1: Capacity and Task Coverage
Language modeling requires representing a vast space of knowledge and skills. A more capacious model can store more facts, retain rarer patterns, and represent subtler distinctions without trading one off against another.
Under this view, scaling provides the raw capacity to represent the full complexity of language and knowledge. Smaller models must make compromises—forgetting rare information, simplifying patterns, or failing to capture subtle distinctions.
Perspective 2: Feature Learning and Representation Quality
Neural networks learn hierarchical representations. Empirically, larger models learn richer and more abstract features at every level of this hierarchy.
The quality of learned representations improves smoothly with scale, enabling better generalization to new tasks.
Perspective 3: The Manifold Hypothesis
High-dimensional data often lies on lower-dimensional manifolds. Language—despite its apparent complexity—has significant structure. Larger models can capture this underlying structure more faithfully and generalize more reliably across it.
Perspective 4: Compression and Generalization
Kolmogorov complexity theory suggests that learning is compression—finding short programs that generate training data. Larger models with more training compute can find more compressed, and therefore more general, descriptions of the data instead of memorizing it.
Perspective 5: Phase Transitions and Emergent Structure
Some capabilities appear to 'emerge' suddenly at specific scales, suggesting phase transitions in the model's internal structure. These may represent the formation of internal structure, such as circuits or algorithms, that only becomes viable once sufficient capacity and training are available.
Despite these perspectives, we lack a complete theoretical account of why scaling laws take the specific form they do—why the exponents are what they are, why transformers scale better than earlier architectures, and whether the current trends will continue indefinitely. This theoretical gap motivates significant ongoing research.
The scaling paradigm does not mean that efficiency is irrelevant—quite the opposite. Given the enormous costs of frontier training, efficiency gains are multiplicative. A 2× efficiency improvement saves millions of dollars. This section surveys key approaches to efficient scaling.
Architectural Efficiency:
Mixture of Experts (MoE):
MoE models activate only a subset of parameters for each input, dramatically reducing compute while maintaining (or exceeding) the effective capacity of dense models:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExpertsLayer(nn.Module):
    """
    Simplified Mixture of Experts layer.

    Key insight: Instead of one large FFN, use K smaller FFNs (experts)
    and route each token to the top-k experts based on a learned gating
    function. This allows scaling parameters without proportionally
    scaling compute.
    """

    def __init__(
        self,
        hidden_dim: int,
        expert_dim: int,
        num_experts: int = 8,
        top_k: int = 2,
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Router learns which experts to use for each token
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

        # K independent expert networks
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, hidden_dim = x.shape

        # Compute routing weights (which experts to use)
        router_logits = self.gate(x)  # [batch, seq, num_experts]
        routing_weights = F.softmax(router_logits, dim=-1)

        # Select top-k experts for each token
        topk_weights, topk_indices = torch.topk(
            routing_weights, self.top_k, dim=-1
        )

        # Renormalize weights for selected experts
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

        # Compute expert outputs and combine
        output = torch.zeros_like(x)
        for k in range(self.top_k):
            for expert_idx in range(self.num_experts):
                # Find tokens routed to this expert at position k
                mask = (topk_indices[:, :, k] == expert_idx)
                if mask.any():
                    expert_input = x[mask]  # Tokens for this expert
                    expert_output = self.experts[expert_idx](expert_input)
                    output[mask] += topk_weights[:, :, k][mask].unsqueeze(-1) * expert_output

        return output

# Efficiency comparison:
# Dense model: hidden_dim = 4096, FFN_dim = 16384
#   - FFN parameters: 4096 * 16384 * 2 = 134M per layer
#   - FLOPs per token: 134M * 2 = 268M
#
# MoE model: 8 experts, each FFN_dim = 4096, top-k = 2
#   - Total parameters: 4096 * 4096 * 2 * 8 = 268M per layer (2x more)
#   - Active parameters per token: 4096 * 4096 * 2 * 2 = 67M
#   - FLOPs per token: 67M * 2 = 134M (half the dense cost)
#
# Result: 2x parameters, 0.5x compute per token, competitive performance
```

The goal of efficient scaling is to shift the Pareto frontier—achieving the same performance with less compute, or better performance with the same compute. This is not in opposition to scaling; it means your scaling budget goes further. A 2× efficiency gain is equivalent to having 2× the compute budget.
While scale has delivered remarkable progress, it is not a panacea. Critical questions remain about the limits and limitations of the scaling paradigm.
Question 1: Do Scaling Laws Eventually Break?
Current scaling laws are empirical observations, not theoretical guarantees. They may break down at scales not yet explored, saturate as high-quality data runs out, or be overtaken by hardware and energy constraints.
Some researchers observe that loss improvements with scale are slowing on certain benchmarks, though interpretations differ.
Question 2: Does Lower Loss Mean Better Capabilities?
Scaling laws predict loss (perplexity), not downstream task performance. The relationship between pretraining loss and the capabilities users actually care about is indirect and only partially understood.
Models with similar losses can have very different capabilities and failure modes.
Question 3: What Can't Scale Solve?
Some challenges may require more than scale: commonly cited candidates include reliable multi-step reasoning, factual grounding, long-horizon planning, and alignment with human values.
Question 4: Environmental and Economic Sustainability
Training costs are growing faster than efficiency improvements: frontier training runs are now estimated at $100M or more, with each generation demanding substantially more compute, energy, and specialized hardware than the last.
This raises questions about sustainable scaling and equitable access.
Question 5: What Comes After Scale?
Scale may be necessary but not sufficient, and a range of emerging research directions ask what must be combined with it.
The scaling paradigm has been extraordinarily productive, but treating it as the only path forward would be a mistake. The most capable AI systems of the future will likely combine scale with architectural innovations, improved training objectives, better data curation, and techniques we haven't yet discovered.
We have explored the role of scale in modern machine learning—from historical context through mathematical foundations to practical implications. Let's consolidate the key insights: (1) scale has displaced hand-crafted features and bespoke algorithms as the primary lever of progress; (2) performance improves along predictable power-law scaling laws; (3) compute-optimal training balances parameters against training tokens, with production models often over-trained for inference efficiency; (4) training compute can be estimated as $C \approx 6ND$, placing frontier runs at tens to hundreds of millions of dollars; (5) data quality, deduplication, and the approaching 'data wall' matter as much as raw quantity; and (6) efficiency techniques such as mixture-of-experts shift the Pareto frontier, making a given scaling budget go further.
What's Next:
Now that we understand the role of scale, we'll explore its most surprising consequence: emergent capabilities. In the next page, we examine how scaling sometimes produces capabilities that were not present at smaller scales—behaviors that 'emerge' abruptly and unpredictably, challenging our understanding of what these models actually learn.
You now understand the fundamental role of scale in modern ML—the historical context, the mathematical foundations, the computational realities, and the open questions. This foundation is essential for understanding how and why foundation models work the way they do.