In the history of machine learning, no single factor has been more transformative than scale. The models that power today's most impressive AI capabilities—ChatGPT, Claude, GPT-4, Gemini, LLaMA—are not fundamentally different architectures from their predecessors. They are the same architectures, but at a scale that seemed computationally infeasible just a decade ago.
This page explores the profound implications of scale in machine learning. We will examine how scale has transformed from a mere engineering consideration into a fundamental research paradigm, one that has upended decades of conventional wisdom about how to improve AI systems. The story of modern AI is, in many ways, the story of scale—and understanding this story is essential for any practitioner seeking to work with or understand foundation models.
By the end of this page, you will understand: (1) why scale has become the dominant paradigm in modern ML, (2) the mathematical foundations of scaling laws, (3) the computational and economic implications of training at scale, (4) the historical context that led to the scaling revolution, and (5) how to reason about scale in your own work.
To appreciate the significance of scale in modern ML, we must understand the paradigm it replaced. For most of machine learning's history, progress came primarily through algorithmic innovation—developing better models, better features, and better optimization techniques.
The Traditional ML Paradigm (1950s-2010s):
For decades, the dominant approach to improving ML systems followed a predictable pattern:
Careful Feature Engineering: Domain experts would spend months or years crafting features that captured the essential structure of problems. Computer vision practitioners developed SIFT, HOG, and SURF features. NLP researchers built syntactic parsers and hand-crafted lexicons.
Clever Architectural Innovations: Researchers focused on developing objectively 'better' models—support vector machines over perceptrons, random forests over single decision trees, deep networks over shallow ones.
Sophisticated Optimization: Training algorithms became increasingly sophisticated—from gradient descent to momentum, RMSprop, Adam, and beyond.
In this paradigm, data and compute were secondary concerns. The prevailing wisdom held that more data helped, but with diminishing returns. Extra compute could speed up experiments but wouldn't fundamentally change what was possible.
| Aspect | Traditional ML (1950s-2010s) | Scale-First ML (2017-Present) |
|---|---|---|
| Primary lever for progress | Algorithmic innovation | Scale (data, compute, parameters) |
| Feature engineering | Critical, domain-dependent | Minimal, learned from data |
| Model architecture | Task-specific designs | General-purpose transformers |
| Data requirements | Moderate (thousands-millions) | Massive (billions-trillions tokens) |
| Compute requirements | Modest (single GPU/CPU) | Enormous (thousands of GPUs) |
| Dominant practitioners | Academic labs | Well-funded industry labs |
| Key bottleneck | Algorithm design | Compute budget & data access |
The Turning Point: Deep Learning at Scale (2012-2017)
The shift began with AlexNet in 2012, which demonstrated that deep neural networks trained on large datasets (ImageNet's 1.2 million images) using GPUs could dramatically outperform hand-engineered approaches. But AlexNet was still modest by modern standards—60 million parameters, trained on a few GPUs.
Over the next five years, a growing body of evidence suggested that scale followed predictable laws. Researchers at OpenAI, Google, and elsewhere noticed that test loss fell smoothly and predictably as models, datasets, and compute budgets grew, with little sign of the plateau that conventional wisdom expected.
This observation—that you could buy better performance by simply scaling up—upended the traditional paradigm. If more compute reliably means better results, then the bottleneck shifts from algorithm design to compute access and efficient training.
Rich Sutton's influential 2019 essay 'The Bitter Lesson' captured this paradigm shift: 'The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.' Human knowledge and clever algorithms consistently lose to simple methods scaled up—a bitter pill for researchers who devoted careers to algorithmic sophistication.
The most remarkable discovery of the scaling era is that improvements from scale are not arbitrary—they follow precise mathematical laws. These scaling laws allow researchers to predict how much performance will improve from additional compute, data, or parameters, enabling rational planning of massive training runs.
The Power Law Relationship:
Scaling laws typically take the form of power laws:
$$L(X) = \left(\frac{X_c}{X}\right)^{\alpha_X}$$
where $L$ is the loss, $X$ is the scaling variable (compute $C$, dataset size $D$, or parameter count $N$), $X_c$ is a fitted constant, and $\alpha_X$ is the scaling exponent for that variable.
This relationship means that each order of magnitude increase in scale yields a consistent, predictable decrease in loss. The exponent $\alpha$ determines how quickly performance improves with scale.
```python
import numpy as np
import matplotlib.pyplot as plt

def scaling_law(x, x_c, alpha):
    """
    Compute loss under scaling law.

    The power law relationship: L(X) = (X_c/X)^alpha

    Args:
        x: Scaling variable (compute, data, or parameters)
        x_c: Constant term
        alpha: Scaling exponent

    Returns:
        Predicted loss value
    """
    return (x_c / x) ** alpha

# Typical scaling exponents observed in language models
# Note: These are approximate values from published research
scaling_exponents = {
    'compute': 0.050,     # Loss ~ C^(-0.050) from Kaplan et al.
    'parameters': 0.076,  # Loss ~ N^(-0.076)
    'data': 0.095,        # Loss ~ D^(-0.095)
}

# Example: Predicting loss for different model sizes
parameters = np.logspace(6, 12, 100)  # 1M to 1T parameters
x_c = 8.8e13                          # Empirical constant for language modeling

losses = scaling_law(parameters, x_c, scaling_exponents['parameters'])

# Key insight: Loss decreases as a power law with parameters
# Doubling parameters reduces loss by: 2^(-0.076) ≈ 0.949
# This means ~5% improvement in loss per doubling

print(f"Loss reduction per 10x parameters: {10**(-scaling_exponents['parameters']):.3f}")
print(f"To halve the loss, need: {2**(1/scaling_exponents['parameters']):.1e}x more parameters")
```

The Chinchilla Scaling Laws:
A pivotal contribution came from DeepMind's 2022 'Chinchilla' paper, which refined our understanding of optimal scaling. The key insight was that previous models were undertrained relative to their size.
The Chinchilla team proposed that compute-optimal training requires balancing model size and training data according to:
$$N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}$$
where $N$ is parameters, $D$ is training tokens, and $C$ is compute. In practice, this means model size and training data should be scaled up in equal proportion, which for Chinchilla worked out to roughly 20 training tokens per parameter, as the table below shows.
| Model | Parameters | Training Tokens | Tokens/Parameter Ratio | Training Regime |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | Undertrained |
| Gopher | 280B | 300B | 1.1 | Severely undertrained |
| Chinchilla | 70B | 1.4T | 20 | Compute-optimal |
| LLaMA-2 70B | 70B | 2T | 29 | Over-trained (inference-efficient) |
| LLaMA-3 70B | 70B | 15T | 214 | Heavily over-trained |
Chinchilla optimality minimizes training compute, but production models must also consider inference costs. A smaller, overtrained model (like LLaMA) is cheaper to serve than a larger, optimally-trained model with equal performance. This has led to a trend of 'over-training' models well beyond the Chinchilla optimal point for deployment efficiency.
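To see why over-training a smaller model can pay off, the following back-of-the-envelope sketch compares total lifetime compute (training plus serving) for a larger versus a smaller model. It relies on the $C \approx 6ND$ training estimate derived in the next subsection, plus the common rule of thumb that inference costs roughly $2N$ FLOPs per generated token (an assumption not stated in this text); the parameter counts, token counts, and serving volume are purely illustrative.

```python
def lifetime_flops(params: float, train_tokens: float, served_tokens: float) -> float:
    """Total compute: ~6*N*D to train, plus ~2*N FLOPs per token served."""
    return 6 * params * train_tokens + 2 * params * served_tokens

served = 1e13  # hypothetical 10T tokens generated over the deployment lifetime

# Larger model with a Chinchilla-style token budget vs. smaller, heavily over-trained model
big = lifetime_flops(params=175e9, train_tokens=3.5e12, served_tokens=served)
small = lifetime_flops(params=70e9, train_tokens=10e12, served_tokens=served)

print(f"175B model, 3.5T training tokens: {big:.2e} lifetime FLOPs")
print(f" 70B model, 10T training tokens:  {small:.2e} lifetime FLOPs")
# The smaller model costs more to train in this example, but every served
# token is 2.5x cheaper, so at high serving volume its lifetime compute is lower.
```

If the two models reach comparable quality, as the paragraph above suggests over-trained models can, the smaller one wins once serving volume is large enough.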
Abstract discussions of scale become concrete when we consider the actual computational requirements of training modern foundation models. The numbers are staggering and have profound implications for who can participate in frontier ML research.
Measuring Compute: FLOPs and GPU-hours
We measure training compute in FLOPs (floating point operations)—specifically, total FLOPs across the entire training run. The relationship between model parameters, training tokens, and compute follows the approximation:
$$C \approx 6ND$$
where $C$ is total training compute in FLOPs, $N$ is the number of model parameters, and $D$ is the number of training tokens; the factor of 6 reflects roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass.
This allows us to estimate compute for any training configuration. For example, training GPT-3 (175B parameters, 300B tokens):
$$C = 6 \times 175 \times 10^9 \times 300 \times 10^9 = 3.15 \times 10^{23} \text{ FLOPs}$$
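As a cross-check that ties this approximation back to the Chinchilla discussion above, here is a minimal sketch (the function name is ours) that splits a fixed FLOP budget into a compute-optimal parameter count and token count by combining $C \approx 6ND$ with the roughly 20-tokens-per-parameter rule:

```python
import math

def compute_optimal_allocation(compute_flops: float,
                               tokens_per_param: float = 20.0) -> tuple:
    """
    Split a training-compute budget into a compute-optimal model size
    and token count, using two approximations from this page:

        C ≈ 6 * N * D      (training compute)
        D ≈ 20 * N         (Chinchilla-style tokens-per-parameter rule)

    Substituting the second into the first gives C ≈ 6 * 20 * N^2,
    so N ≈ sqrt(C / 120) and D = 20 * N.
    """
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# GPT-3's estimated budget from the worked example above
n_opt, d_opt = compute_optimal_allocation(3.15e23)
print(f"Compute-optimal split: ~{n_opt / 1e9:.0f}B parameters, "
      f"~{d_opt / 1e12:.2f}T tokens")
# vs. GPT-3's actual 175B parameters / 300B tokens
```

Running it on GPT-3's $3.15 \times 10^{23}$ FLOPs gives roughly 50B parameters and about 1T tokens, consistent with the earlier table's verdict that GPT-3 was undertrained for its compute budget.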
| Model | Year | Parameters | Tokens | Estimated FLOPs | Estimated Cost (USD) |
|---|---|---|---|---|---|
| BERT-Large | 2018 | 340M | 3.3B | ~10²⁰ | ~$10K |
| GPT-2 | 2019 | 1.5B | 40B | ~10²¹ | ~$50K |
| GPT-3 | 2020 | 175B | 300B | ~3×10²³ | ~$4-12M |
| Chinchilla | 2022 | 70B | 1.4T | ~6×10²³ | ~$10M |
| GPT-4 (estimate) | 2023 | ~1.8T (MoE) | ~13T | ~10²⁵ | ~$100M+ |
| Claude 3 Opus (estimate) | 2024 | Unknown | Unknown | ~10²⁵ | ~$100M+ |
| Llama 3 405B | 2024 | 405B | 15T | ~4×10²⁵ | ~$150M+ |
The Hardware Landscape:
Frontier model training runs on specialized hardware. The dominant platforms are NVIDIA's A100 and H100 GPUs, organized into massive clusters:
| GPU | bf16 Performance | Memory | Power | Typical Cloud Cost |
|---|---|---|---|---|
| A100 (40GB) | 312 TFLOPS | 40 GB HBM2e | 400W | ~$2-3/hour |
| A100 (80GB) | 312 TFLOPS | 80 GB HBM2e | 400W | ~$3-4/hour |
| H100 (80GB) | 1,979 TFLOPS (with sparsity; ~990 dense) | 80 GB HBM3 | 700W | ~$4-6/hour |
Example: Training LLaMA 2 70B
Meta reported that pretraining LLaMA 2 70B consumed roughly 1.7 million A100 (80 GB) GPU-hours over its 2 trillion training tokens.
This figure excludes preliminary experiments, failed or restarted runs, evaluation, and the fine-tuning stages that follow pretraining.
The compute requirements for frontier models create a stark divide in AI research. Training a GPT-4 class model requires resources available to perhaps a dozen organizations worldwide. This concentration of capability in well-funded labs raises important questions about research access, reproducibility, and who gets to shape AI development.
```python
def estimate_training_compute(
    parameters: float,
    tokens: float,
    flops_per_token_factor: float = 6.0,
) -> dict:
    """
    Estimate training compute requirements.

    The standard approximation: C ≈ 6 * N * D
      - Forward pass: ~2 * N FLOPs per token
      - Backward pass: ~4 * N FLOPs per token (2x forward, for gradients)
      - Total: ~6 * N FLOPs per token, times D tokens

    Args:
        parameters: Number of model parameters
        tokens: Number of training tokens
        flops_per_token_factor: FLOPs multiplier (default 6 for transformers)

    Returns:
        Dict with compute estimates
    """
    total_flops = flops_per_token_factor * parameters * tokens

    # GPU performance benchmarks (realistic training efficiency)
    a100_peak_flops = 312e12   # 312 TFLOPS bf16
    h100_peak_flops = 1979e12  # ~2 PFLOPS bf16 (with structured sparsity)

    # Typical training efficiency: 30-50% of peak
    a100_effective_flops = a100_peak_flops * 0.4  # ~125 TFLOPS effective
    h100_effective_flops = h100_peak_flops * 0.4  # ~800 TFLOPS effective

    a100_hours = total_flops / (a100_effective_flops * 3600)
    h100_hours = total_flops / (h100_effective_flops * 3600)

    # Cost estimates (cloud pricing)
    a100_cost_per_hour = 3.00  # USD
    h100_cost_per_hour = 5.00  # USD

    return {
        'total_flops': total_flops,
        'total_flops_scientific': f"{total_flops:.2e}",
        'a100_gpu_hours': a100_hours,
        'h100_gpu_hours': h100_hours,
        'a100_cost_usd': a100_hours * a100_cost_per_hour,
        'h100_cost_usd': h100_hours * h100_cost_per_hour,
        'a100_days_1000_gpus': a100_hours / (1000 * 24),
        'h100_days_1000_gpus': h100_hours / (1000 * 24),
    }

# Example: Estimate for LLaMA 2 70B training
llama2_70b = estimate_training_compute(
    parameters=70e9,  # 70 billion parameters
    tokens=2e12,      # 2 trillion tokens
)

print("LLaMA 2 70B Training Estimate:")
print(f"  Total FLOPs: {llama2_70b['total_flops_scientific']}")
print(f"  A100 GPU-hours: {llama2_70b['a100_gpu_hours']:,.0f}")
print(f"  H100 GPU-hours: {llama2_70b['h100_gpu_hours']:,.0f}")
print(f"  A100 cost: ${llama2_70b['a100_cost_usd']:,.0f}")
print(f"  Days on 1000 A100s: {llama2_70b['a100_days_1000_gpus']:.1f}")
```

Scale isn't just about compute and parameters—data is equally fundamental. The shift to web-scale training data represents a transformation as significant as the architectural innovations. Understanding data scaling is crucial for anyone working with foundation models.
The Data Scaling Journey:
The progression of training datasets tells the story of scale:
| Era | Dataset | Size | Curation Level |
|---|---|---|---|
| 2010s | Penn Treebank | 1M tokens | Expertly curated, annotated |
| 2015 | Wikipedia | ~3B tokens | High-quality, curated |
| 2019 | WebText (GPT-2) | 40B tokens | Filtered web scrape |
| 2020 | C4 (T5) | 156B tokens | Filtered Common Crawl |
| 2020 | The Pile | 825 GB | Diverse sources |
| 2023 | RefinedWeb | 5T tokens | Deduplicated, filtered |
| 2023 | RedPajama | 1.2T tokens | Open replication of LLaMA |
| 2024 | FineWeb | 15T tokens | High-quality web data |
The Data Quality-Quantity Tradeoff:
Scaling data is not simply 'more is better.' Research has shown that data quality interacts nonlinearly with scale:
Deduplication Matters Enormously: Training on duplicated data leads to memorization rather than generalization. Near-duplicate detection at scale requires sophisticated MinHash and locality-sensitive hashing techniques (a minimal sketch follows this list).
Quality Filtering Improves Efficiency: Filtering low-quality data (spam, boilerplate, machine-generated text) can improve performance more than adding equivalent amounts of unfiltered data.
Domain Mixing Requires Care: The ratio of data from different domains (code, books, web, academic papers) significantly affects downstream capabilities.
Repetition Has Costs: Training on the same data multiple epochs provides diminishing returns and can harm generalization.
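As a toy illustration of the deduplication point above, the sketch below estimates document similarity with MinHash signatures. It is a deliberately minimal version of the idea (the function names, shingle size, and hash count are our own choices); production pipelines bucket these signatures with locality-sensitive hashing so that only candidate pairs are ever compared.

```python
import hashlib
from typing import List, Set

def shingles(text: str, n: int = 5) -> Set[str]:
    """Split text into overlapping word n-grams ('shingles')."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items: Set[str], num_hashes: int = 128) -> List[int]:
    """
    MinHash signature: for each of num_hashes seeded hash functions, keep the
    minimum hash value over all shingles. The probability that two signatures
    agree at a given position equals the Jaccard similarity of the sets.
    """
    signature = []
    for seed in range(num_hashes):
        min_val = min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        )
        signature.append(min_val)
    return signature

def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river today"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
print(f"Estimated near-duplicate similarity: {estimated_similarity(sig1, sig2):.2f}")
```

Because the signature comparison approximates Jaccard similarity without ever comparing raw documents, this is what makes near-duplicate detection tractable at web scale.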
Some researchers argue we are approaching a 'data wall'—the limit of useful web text for training. The internet contains perhaps 50-100 trillion tokens of unique, high-quality text in English. If frontier models already train on 10-15T tokens, we may be 3-5× away from exhausting unique data. This has driven major investments in synthetic data generation, multimodal data, and multilingual expansion.
The empirical success of scaling is undeniable, but why does it work so well? This remains an active area of research. We present several theoretical perspectives that help explain the scaling phenomenon.
Perspective 1: Capacity and Task Coverage
Language modeling requires representing a vast space of knowledge and skills. A more capacious model can store more facts, retain rarer patterns, and represent subtler distinctions without trading one off against another.
Under this view, scaling provides the raw capacity to represent the full complexity of language and knowledge. Smaller models must make compromises—forgetting rare information, simplifying patterns, or failing to capture subtle distinctions.
Perspective 2: Feature Learning and Representation Quality
Neural networks learn hierarchical representations. Empirically, larger models learn richer and more abstract features at every level of this hierarchy.
The quality of learned representations improves smoothly with scale, enabling better generalization to new tasks.
Perspective 3: The Manifold Hypothesis
High-dimensional data often lies on lower-dimensional manifolds. Language—despite its apparent complexity—has significant structure. Larger models can capture this underlying structure more faithfully and generalize more reliably across it.
Perspective 4: Compression and Generalization
Kolmogorov complexity theory suggests that learning is compression—finding short programs that generate training data. Larger models with more training compute can find more compressed, and therefore more general, descriptions of the data instead of memorizing it.
Perspective 5: Phase Transitions and Emergent Structure
Some capabilities appear to 'emerge' suddenly at specific scales, suggesting phase transitions in the model's internal structure. These may represent the formation of internal structure, such as circuits or algorithms, that only becomes viable once sufficient capacity and training are available.
Despite these perspectives, we lack a complete theoretical account of why scaling laws take the specific form they do—why the exponents are what they are, why transformers scale better than earlier architectures, and whether the current trends will continue indefinitely. This theoretical gap motivates significant ongoing research.
The scaling paradigm does not mean that efficiency is irrelevant—quite the opposite. Given the enormous costs of frontier training, efficiency gains are multiplicative. A 2× efficiency improvement saves millions of dollars. This section surveys key approaches to efficient scaling.
Architectural Efficiency:
Mixture of Experts (MoE):
MoE models activate only a subset of parameters for each input, dramatically reducing compute while maintaining (or exceeding) the effective capacity of dense models:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExpertsLayer(nn.Module):
    """
    Simplified Mixture of Experts layer.

    Key insight: Instead of one large FFN, use K smaller FFNs (experts)
    and route each token to the top-k experts based on a learned gating
    function. This allows scaling parameters without proportionally
    scaling compute.
    """

    def __init__(
        self,
        hidden_dim: int,
        expert_dim: int,
        num_experts: int = 8,
        top_k: int = 2,
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Router learns which experts to use for each token
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

        # K independent expert networks
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, hidden_dim = x.shape

        # Compute routing weights (which experts to use)
        router_logits = self.gate(x)  # [batch, seq, num_experts]
        routing_weights = F.softmax(router_logits, dim=-1)

        # Select top-k experts for each token
        topk_weights, topk_indices = torch.topk(
            routing_weights, self.top_k, dim=-1
        )

        # Renormalize weights for selected experts
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

        # Compute expert outputs and combine
        output = torch.zeros_like(x)
        for k in range(self.top_k):
            for expert_idx in range(self.num_experts):
                # Find tokens routed to this expert at position k
                mask = (topk_indices[:, :, k] == expert_idx)
                if mask.any():
                    expert_input = x[mask]  # Tokens for this expert
                    expert_output = self.experts[expert_idx](expert_input)
                    output[mask] += topk_weights[:, :, k][mask].unsqueeze(-1) * expert_output

        return output

# Efficiency comparison:
# Dense model: hidden_dim = 4096, FFN_dim = 16384
#   - FFN parameters: 4096 * 16384 * 2 = 134M per layer
#   - FLOPs per token: 134M * 2 = 268M
#
# MoE model: 8 experts, each FFN_dim = 4096, top-k = 2
#   - Total parameters: 4096 * 4096 * 2 * 8 = 268M per layer (2x more)
#   - Active parameters per token: 4096 * 4096 * 2 * 2 = 67M
#   - FLOPs per token: 67M * 2 = 134M (half the dense cost)
#
# Result: 2x parameters, 0.5x compute per token, competitive performance
```

The goal of efficient scaling is to shift the Pareto frontier—achieving the same performance with less compute, or better performance with the same compute. This is not in opposition to scaling; it means your scaling budget goes further. A 2× efficiency gain is equivalent to having 2× the compute budget.
While scale has delivered remarkable progress, it is not a panacea. Critical questions remain about the limits and limitations of the scaling paradigm.
Question 1: Do Scaling Laws Eventually Break?
Current scaling laws are empirical observations, not theoretical guarantees. They may break down at scales not yet explored, saturate as high-quality data runs out, or be overtaken by hardware and energy constraints.
Some researchers observe that loss improvements with scale are slowing on certain benchmarks, though interpretations differ.
Question 2: Does Lower Loss Mean Better Capabilities?
Scaling laws predict loss (perplexity), not downstream task performance. The relationship between pretraining loss and the capabilities users actually care about is indirect and only partially understood.
Models with similar losses can have very different capabilities and failure modes.
Question 3: What Can't Scale Solve?
Some challenges may require more than scale: commonly cited candidates include reliable multi-step reasoning, factual grounding, long-horizon planning, and alignment with human values.
Question 4: Environmental and Economic Sustainability
Training costs are growing faster than efficiency improvements: frontier training runs are now estimated at $100M or more, with each generation demanding substantially more compute, energy, and specialized hardware than the last.
This raises questions about sustainable scaling and equitable access.
Question 5: What Comes After Scale?
Scale may be necessary but not sufficient, and a range of emerging research directions ask what must be combined with it.
The scaling paradigm has been extraordinarily productive, but treating it as the only path forward would be a mistake. The most capable AI systems of the future will likely combine scale with architectural innovations, improved training objectives, better data curation, and techniques we haven't yet discovered.
We have explored the role of scale in modern machine learning—from historical context through mathematical foundations to practical implications. Let's consolidate the key insights: (1) scale has displaced hand-crafted features and bespoke algorithms as the primary lever of progress; (2) performance improves along predictable power-law scaling laws; (3) compute-optimal training balances parameters against training tokens, with production models often over-trained for inference efficiency; (4) training compute can be estimated as $C \approx 6ND$, placing frontier runs at tens to hundreds of millions of dollars; (5) data quality, deduplication, and the approaching 'data wall' matter as much as raw quantity; and (6) efficiency techniques such as mixture-of-experts shift the Pareto frontier, making a given scaling budget go further.
What's Next:
Now that we understand the role of scale, we'll explore its most surprising consequence: emergent capabilities. In the next page, we examine how scaling sometimes produces capabilities that were not present at smaller scales—behaviors that 'emerge' abruptly and unpredictably, challenging our understanding of what these models actually learn.
You now understand the fundamental role of scale in modern ML—the historical context, the mathematical foundations, the computational realities, and the open questions. This foundation is essential for understanding how and why foundation models work the way they do.