How does GPT-4 know that water is wet, that Paris is in France, that print('Hello') outputs text, and that stories should have coherent plots? It was never explicitly taught these facts. Instead, it learned them through a remarkably simple objective: predict the next token.
Pre-training objectives are the loss functions that shape what language models learn during the initial, compute-intensive phase of training on massive text corpora. These objectives seem almost absurdly simple—predict missing words, guess the next token—yet they produce models with stunning breadth and depth of knowledge. Understanding pre-training objectives reveals how neural networks extract and organize the latent structure of human knowledge from raw text.
This page covers the major pre-training objectives: autoregressive language modeling (GPT-style), masked language modeling (BERT-style), and hybrid approaches. You will understand the theoretical foundations, practical trade-offs, and why different objectives produce models suited to different tasks.
Modern language models follow a two-stage training paradigm: large-scale pre-training on unlabeled text, followed by post-training (instruction tuning and alignment) that adapts the model to user intent.
Pre-training is where the majority of compute is spent and where models acquire their general capabilities. The objectives used during this phase determine the fundamental nature of the resulting model.
The key insight enabling modern LLMs is that text itself provides supervision. Unlike traditional supervised learning, which requires expensive human labels, we can construct training signals directly from text: hide the next token and ask the model to predict it, mask out words and ask the model to recover them, or corrupt a passage and ask the model to reconstruct it.
This enables training on virtually unlimited data—the entire internet becomes a training set. But not all self-supervised objectives are equal.
| Objective | Supervision Signal | Model Architecture | Primary Use Case |
|---|---|---|---|
| Autoregressive LM | Predict next token | Decoder-only | Text generation |
| Masked LM | Predict masked tokens | Encoder-only | Text understanding |
| Prefix LM | Predict continuation | Encoder-decoder | Conditional generation |
| Denoising | Reconstruct corrupted text | Encoder-decoder | Translation, summarization |
Why does predicting the next token produce capable models? The answer lies in what prediction requires:
To predict the next token accurately, a model must implicitly learn:

- Syntax and grammar: which continuations are well-formed
- World knowledge: facts like "Paris is in" → "France"
- Semantics: what the preceding text actually means
- Reasoning and discourse structure: how arguments, stories, and code must continue coherently
The prediction objective is a proxy for understanding. A model that predicts perfectly must understand deeply. This is why simple objectives produce such capable models—the objective is simple, but doing it well is not.
Claude Shannon estimated the entropy of English at ~1-1.5 bits per character. A perfect language model would achieve this bound—which requires near-complete understanding of language and the world. Every bit of uncertainty reduced represents knowledge gained.
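To connect training loss to Shannon's estimate, cross-entropy in nats converts directly into bits per token and, given an average token length, into bits per character. A small worked example with illustrative numbers (the loss value and characters-per-token ratio are assumptions, not measurements):

```python
import math

# Cross-entropy reported by a training loop, in nats per token (illustrative value)
loss_nats = 3.0

perplexity = math.exp(loss_nats)          # ~20: model is effectively choosing among ~20 tokens
bits_per_token = loss_nats / math.log(2)  # ~4.3 bits of uncertainty per token

# Assume ~4 characters per token, typical for English BPE vocabularies
chars_per_token = 4.0
bits_per_char = bits_per_token / chars_per_token  # ~1.1 bits/char, near Shannon's estimate

print(f"perplexity={perplexity:.1f}, bits/token={bits_per_token:.2f}, bits/char={bits_per_char:.2f}")
```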
Autoregressive language modeling (AR-LM) is the dominant pre-training objective for modern LLMs, powering GPT, LLaMA, Claude, and most frontier models. The objective is elegantly simple: given a sequence of tokens, predict the next one.
Given a sequence of tokens $x = (x_1, x_2, ..., x_n)$, the autoregressive model factorizes the joint probability as:
$$P(x) = \prod_{i=1}^{n} P(x_i | x_1, x_2, ..., x_{i-1})$$
The training objective minimizes the negative log-likelihood:
$$\mathcal{L}_{AR} = -\sum_{i=1}^{n} \log P(x_i | x_{<i}; \theta)$$
In practice, this is implemented as cross-entropy loss between the model's predicted distribution and the one-hot encoded true next token.
```python
import torch
import torch.nn.functional as F

def compute_ar_loss(model, input_ids, attention_mask):
    """
    Compute autoregressive language modeling loss.

    Args:
        input_ids: [batch_size, seq_len] - Token IDs
        attention_mask: [batch_size, seq_len] - Attention mask

    Returns:
        loss: Scalar loss value
    """
    # Forward pass - get logits for all positions
    # Shape: [batch_size, seq_len, vocab_size]
    logits = model(input_ids, attention_mask=attention_mask).logits

    # Shift: logits[t] predicts token[t+1]
    # Remove last logit (no target) and first token (no prediction)
    shift_logits = logits[:, :-1, :].contiguous()  # [B, S-1, V]
    shift_labels = input_ids[:, 1:].contiguous()   # [B, S-1]

    # Flatten for cross-entropy
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),  # [B*(S-1), V]
        shift_labels.view(-1),                         # [B*(S-1)]
        ignore_index=-100  # Ignore padding (assumes padded label positions are set to -100 by the caller)
    )
    return loss

# The key insight: every position provides a training signal
# A sequence of 2048 tokens yields 2047 prediction tasks
# This is extremely sample-efficient compared to labeled data
```

Autoregressive models use causal attention to ensure each position can only attend to previous positions:
```
Attention Pattern (Causal Mask):

        t=1  t=2  t=3  t=4  t=5
  t=1    ✓    ✗    ✗    ✗    ✗
  t=2    ✓    ✓    ✗    ✗    ✗
  t=3    ✓    ✓    ✓    ✗    ✗
  t=4    ✓    ✓    ✓    ✓    ✗
  t=5    ✓    ✓    ✓    ✓    ✓
```
This triangular mask ensures the model cannot "cheat" by looking at future tokens during training, maintaining consistency between training and generation.
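A minimal sketch of constructing this mask in PyTorch; the shapes and the additive-mask convention are illustrative, not tied to a specific model implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to."""
    # Lower-triangular matrix: position t can see positions <= t
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = causal_mask(5)

# Additive form applied to attention scores before the softmax:
# 0 where attention is allowed, -inf where it is blocked
additive_mask = torch.zeros(5, 5).masked_fill(~mask, float("-inf"))

print(mask.int())  # matches the ✓/✗ pattern above
```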
Several properties make AR-LM particularly effective:

- Dense supervision: every position in every sequence is a prediction task, so a 2048-token sequence yields 2047 training signals
- Exact likelihood: the chain-rule factorization gives a tractable, exact log-likelihood to optimize
- Train/inference consistency: the model is trained on the same left-to-right prediction task it performs at generation time
- Simplicity at scale: no corruption scheme or separate decoder is needed, which simplifies scaling to massive corpora
Autoregressive loss exhibits characteristic training dynamics:

Early training (perplexity > 50): the model learns token frequencies, spelling, and local syntax; output is mostly incoherent.

Mid training (perplexity 10-50): grammar becomes reliable and common facts and phrase-level coherence appear, but longer passages drift.

Late training (perplexity < 10): gains shift to long-range coherence, rarer knowledge, and subtler reasoning; each further drop in perplexity is increasingly hard-won.
Notably, perplexity and capability don't perfectly correlate. A model might stall in perplexity but continue gaining reasoning ability. Evaluation must go beyond the training metric.
During training, the model always sees ground-truth previous tokens. During generation, it sees its own predictions, which may contain errors. This mismatch (exposure bias) can cause compounding errors. Techniques like scheduled sampling and RLHF partially address this, but it remains an active research area.
Masked Language Modeling (MLM), introduced by BERT (Bidirectional Encoder Representations from Transformers), takes a different approach: randomly mask tokens and train the model to predict them using bidirectional context.
Given a sequence $x = (x_1, ..., x_n)$, randomly select positions $M$ to mask. The objective is:
$$\mathcal{L}_{MLM} = -\sum_{i \in M} \log P(x_i | x_{\backslash M}; \theta)$$
Where $x_{\backslash M}$ denotes the sequence with masked positions replaced by a special [MASK] token or corrupted in some way.
BERT's original procedure: select 15% of token positions for prediction. Of the selected tokens:

- 80% are replaced with the special [MASK] token
- 10% are replaced with a random token
- 10% are left unchanged

The 80-10-10 split addresses a key issue: if we only used [MASK], the model would never see this token at inference time (creating a train-test mismatch).
```python
import random

def create_mlm_training_sample(tokens, vocab_size, mask_token_id, mask_prob=0.15):
    """
    Create a masked language modeling training sample.

    Args:
        tokens: List of token IDs
        vocab_size: Size of vocabulary
        mask_token_id: ID of [MASK] token
        mask_prob: Probability of masking each token

    Returns:
        masked_tokens: Input with masks applied
        labels: Original tokens at masked positions, -100 elsewhere
    """
    masked_tokens = tokens.copy()
    labels = [-100] * len(tokens)  # -100 = ignore in loss

    for i in range(len(tokens)):
        if random.random() < mask_prob:
            labels[i] = tokens[i]  # Store true token as label

            rand = random.random()
            if rand < 0.8:
                # 80%: Replace with [MASK]
                masked_tokens[i] = mask_token_id
            elif rand < 0.9:
                # 10%: Replace with random token
                masked_tokens[i] = random.randint(0, vocab_size - 1)
            # else 10%: Keep unchanged

    return masked_tokens, labels

# Key difference from AR: bidirectional attention
# Position i can attend to ALL positions, including future
# This provides richer context but prevents generation
```

RoBERTa (Robustly Optimized BERT): removes the next-sentence prediction objective, trains longer on more data with larger batches, and uses dynamic masking so each sequence receives a fresh mask pattern.
SpanBERT: masks contiguous spans rather than individual tokens and adds a span-boundary objective, improving span-level tasks such as question answering and coreference.

ALBERT: shares parameters across layers and factorizes the embedding matrix to cut parameter count, replacing next-sentence prediction with sentence-order prediction.

ELECTRA: replaces masking with replaced-token detection. A small generator corrupts some tokens, and the main model classifies every position as original or replaced, which yields a training signal at every position.
| Use Case | Preferred Objective | Reason |
|---|---|---|
| Text generation | Autoregressive | Natural left-to-right generation |
| Sentence classification | MLM/Encoder | Full context enables rich representations |
| Named entity recognition | MLM/Encoder | Bidirectional context for boundaries |
| Question answering | Either | Depends on extractive vs. generative |
| Summarization | AR or Encoder-Decoder | Generative task requiring coherent output |
| Few-shot learning | Autoregressive | In-context learning requires causal structure |
While decoder-only AR models dominate today, encoder-decoder models (T5, BART) using denoising objectives remain competitive for certain tasks. The trend toward unified AR models may not be permanent—architectural diversity continues in research.
A third category of pre-training objectives frames language modeling as a denoising or sequence-to-sequence task. Given corrupted input, the model must produce the original (or a corrected/completed) output. This family includes T5, BART, and UL2.
T5 unifies all NLP tasks as text-to-text problems with a span corruption objective:
Span Corruption Process:

1. Randomly select spans of tokens to corrupt (about 15% of tokens, with a mean span length of 3)
2. Replace each corrupted span in the input with a unique sentinel token (`<extra_id_0>`, `<extra_id_1>`, etc.)
3. Build the target as each sentinel followed by the original tokens it replaced

Example (sentinels shown as [X], [Y] for readability):

```
Input:  "The [X] sat on the [Y] and meowed loudly."
Target: "[X] cat [Y] mat"
```
This is more efficient than BERT's approach: the target sequence is much shorter than the input (only corrupted portions), speeding up training.
```python
import numpy as np

def t5_span_corruption(tokens, noise_density=0.15, mean_span_length=3.0):
    """
    Apply T5-style span corruption.

    Args:
        tokens: List of token IDs
        noise_density: Fraction of tokens to corrupt
        mean_span_length: Average length of corrupted spans

    Returns:
        input_ids: Corrupted input with sentinel tokens
        target_ids: Target containing original spans
    """
    n = len(tokens)
    num_noise_tokens = int(n * noise_density)
    num_spans = max(1, int(num_noise_tokens / mean_span_length))

    # Generate span lengths from Poisson distribution
    span_lengths = np.random.poisson(mean_span_length, num_spans)
    span_lengths = np.maximum(1, span_lengths)

    # Select span starts in a "reduced" coordinate space that excludes the noise
    # tokens themselves, then offset each start by the lengths of earlier spans.
    # This guarantees spans are in order and never overlap.
    total_span_length = int(sum(span_lengths))
    num_available = n - total_span_length
    reduced_starts = sorted(np.random.choice(num_available, num_spans, replace=False))

    # Build input and target
    input_ids = []
    target_ids = []
    sentinel_id = 32000  # Starting sentinel ID
    pos = 0
    offset = 0
    for i, (reduced_start, length) in enumerate(zip(reduced_starts, span_lengths)):
        start = reduced_start + offset
        # Add non-corrupted prefix
        input_ids.extend(tokens[pos:start])
        # Add sentinel for corrupted span
        input_ids.append(sentinel_id + i)
        # Target: sentinel followed by original tokens
        target_ids.append(sentinel_id + i)
        target_ids.extend(tokens[start:start + length])
        pos = start + length
        offset += length

    # Add remaining tokens to input
    input_ids.extend(tokens[pos:])
    return input_ids, target_ids
```

BART uses a more flexible noise approach with an autoregressive decoder:
Noise Types in BART:
| Noise Type | Description | Example |
|---|---|---|
| Token Masking | Replace tokens with [MASK] | "The cat sat" → "The [MASK] sat" |
| Token Deletion | Remove tokens without placeholder | "The cat sat" → "The sat" |
| Text Infilling | Replace spans with single [MASK] | "The cat sat there" → "The [MASK] there" |
| Sentence Permutation | Shuffle sentence order | Document sentence reordering |
| Document Rotation | Rotate to start at random token | "The cat sat" → "cat sat The" |
BART's decoder is fully autoregressive, generating the complete original sequence. This makes it naturally suited for generation tasks like summarization.
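To make the text-infilling row above concrete, here is a minimal sketch of that transform with Poisson-distributed span lengths; the string placeholder and sampling details are simplifications, not BART's exact implementation:

```python
import numpy as np

MASK = "[MASK]"  # placeholder; real tokenizers use a dedicated mask token ID
rng = np.random.default_rng(0)

def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0):
    """Replace whole spans of tokens with a single [MASK] (BART-style text infilling)."""
    out, i, masked = [], 0, 0
    budget = int(len(tokens) * mask_ratio)  # roughly how many tokens to corrupt
    while i < len(tokens):
        if masked < budget and rng.random() < mask_ratio:
            span = max(1, rng.poisson(poisson_lambda))  # BART also allows 0-length spans; omitted here
            out.append(MASK)   # the whole span collapses to ONE mask token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

print(text_infilling("the cat sat on the mat and meowed loudly".split()))
```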
UL2 proposes mixing multiple objectives during pre-training:
UL2 Mixture:

- R-denoising (regular): standard T5-style span corruption with short spans and low corruption rates
- S-denoising (sequential): prefix language modeling, predicting the continuation of a given prefix
- X-denoising (extreme): aggressive corruption with long spans and/or high corruption rates

Each sample is prefixed with a mode token ([R], [S], [X]) indicating the denoising mode. The model learns to handle all modes, gaining benefits of each.
Key insight: Different objectives are complementary. Span infilling improves bidirectional understanding; sequential denoising improves generation. Combining them produces a more capable model than either alone.
A hybrid approach: the model attends bidirectionally to a prefix, then autoregressively generates the suffix. This combines encoder-style understanding of context with decoder-style generation. Used in encoder-decoder models and increasingly in decoder-only models via attention mask manipulation.
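A minimal sketch of the attention-mask manipulation this describes, with bidirectional attention over the prefix and causal attention over the suffix; the function and argument names are assumptions:

```python
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """True where attention is allowed: full attention within the prefix,
    causal attention for the generated suffix."""
    # Start from a standard causal (lower-triangular) mask
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Every position may additionally attend to the entire prefix (bidirectional context)
    mask[:, :prefix_len] = True
    return mask

print(prefix_lm_mask(seq_len=6, prefix_len=3).int())
```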
The pre-training objective is only half the story. Equally critical is the data on which that objective is optimized. Data quality, diversity, and composition fundamentally shape what models learn.
| Source | Size | Quality | Properties |
|---|---|---|---|
| Web Crawl (Common Crawl) | 100T+ tokens | Variable | Broad coverage, noise, duplication |
| Wikipedia | ~4B tokens (English) | High | Factual, encyclopedic, neutral |
| Books | 10-100B tokens | High | Long-form, narrative, diverse style |
| Code (GitHub, etc.) | ~500B tokens | Variable | Structured, logical, multilingual |
| Scientific Papers | ~100B tokens | High | Technical, precise, specialized |
| Curated Web (C4, RefinedWeb) | 1-10T tokens | Medium-High | Filtered web, removed duplicates |
| Synthetic Data | Unlimited | Variable | LLM-generated, controlled properties |
Raw web crawl data is not directly usable. Quality filtering is essential:
1. Deduplication
Why it matters: Duplicated data causes memorization, wasted compute, and benchmark contamination. Models can recite training data verbatim if it appears frequently enough. (A minimal hashing sketch follows after this list.)
2. Quality Scoring

Heuristics and learned classifiers estimate how useful each document is for training (length, character composition, repetition, resemblance to known high-quality text); low-scoring documents are dropped or downsampled.

3. Content Filtering

Toxic, adult, or otherwise unwanted content and personally identifiable information are removed, typically with a combination of classifiers and pattern-based rules.
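Exact deduplication, the simplest part of stage 1, can be implemented by hashing normalized documents. A minimal sketch; production pipelines add fuzzy methods such as MinHash over n-grams, and the normalization rules here are assumptions:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(documents):
    """Keep only the first occurrence of each exact (normalized) duplicate."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The cat sat on the mat.", "the cat  sat on the mat.", "A different document."]
print(len(deduplicate(docs)))  # 2
```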
```python
class DataQualityPipeline:
    """
    Multi-stage pipeline for pre-training data quality.
    """
    def __init__(self, quality_classifier, toxicity_classifier):
        self.quality_cls = quality_classifier
        self.toxicity_cls = toxicity_classifier

    def process_document(self, text: str) -> dict:
        """
        Process a single document through all quality stages.

        Returns:
            dict with 'keep' bool and 'reasons' for filtering
        """
        reasons = []

        # Stage 1: Basic heuristics
        if len(text.split()) < 20:
            reasons.append("too_short")

        # Check character quality
        alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
        if alpha_ratio < 0.7:
            reasons.append("low_alpha_ratio")

        # Repetition detection (sliding window)
        words = text.lower().split()
        if len(words) > 50:
            window = 50
            unique_ratio = len(set(words[:window])) / window
            if unique_ratio < 0.3:
                reasons.append("high_repetition")

        # Stage 2: Quality classifier
        # (assumes the classifier returns P(high quality) as a scalar)
        quality_score = self.quality_cls.predict_proba([text])[0]
        if quality_score < 0.5:
            reasons.append(f"quality_score={quality_score:.2f}")

        # Stage 3: Toxicity filtering
        # (assumes the classifier returns P(toxic) as a scalar)
        toxicity_score = self.toxicity_cls.predict_proba([text])[0]
        if toxicity_score > 0.8:
            reasons.append(f"toxicity_score={toxicity_score:.2f}")

        return {
            "keep": len(reasons) == 0,
            "reasons": reasons,
            "quality_score": quality_score,
            "toxicity_score": toxicity_score,
        }
```

Frontier models train on carefully balanced mixtures:
```
Typical LLM Data Mixture:
├── Web (filtered): 50-60%
│   └── High-quality web text, deduplicated
│
├── Code: 15-20%
│   └── GitHub, code documentation
│
├── Books: 10-15%
│   └── Fiction and non-fiction
│
├── Academic: 5-10%
│   └── arXiv, PubMed, patents
│
├── Wikipedia: 2-5%
│   └── Upsampled for factual grounding
│
├── Curated dialogue: 2-5%
│   └── Reddit, forums, Q&A sites
│
└── Math/Reasoning: 2-5%
    └── Textbooks, proofs, problem sets
```
Key insights:

- Raw availability does not dictate the mixture: high-quality sources such as Wikipedia, books, and academic text are upsampled well beyond their share of available tokens
- Web text provides breadth, but only after aggressive filtering and deduplication
- Code and math occupy a deliberate slice of the mixture even for general-purpose models

A sampling sketch based on weights like these follows below.
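As a rough illustration of how such a mixture is consumed during training, here is a minimal sketch that picks a data source for each document according to mixture weights; the weights and source names are illustrative, not taken from any specific model:

```python
import random

rng = random.Random(0)

# Illustrative mixture weights (fractions of training tokens), roughly matching the tree above
MIXTURE = {
    "web_filtered": 0.55,
    "code": 0.18,
    "books": 0.12,
    "academic": 0.07,
    "wikipedia": 0.04,
    "dialogue": 0.02,
    "math": 0.02,
}

def sample_source() -> str:
    """Pick a data source with probability proportional to its mixture weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

counts = {}
for _ in range(10_000):
    s = sample_source()
    counts[s] = counts.get(s, 0) + 1
print(counts)  # empirical proportions approximate the target mixture
```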
A model cannot generate knowledge it has never seen. If a fact, language, or skill is absent from training data, the model will not know it—scale cannot compensate for missing data. This creates fundamental limitations that are often invisible until the model fails.
Converting the theoretical objective into a working training pipeline requires careful attention to optimization, stability, and efficiency. Here we cover the key implementation considerations.
Most LLMs use AdamW with carefully tuned hyperparameters:
```python
optimizer_config = {
    "optimizer": "AdamW",
    "learning_rate": 3e-4,   # Peak LR, lower for larger models
    "beta1": 0.9,            # Momentum term
    "beta2": 0.95,           # Second moment (lower than default 0.999)
    "epsilon": 1e-8,
    "weight_decay": 0.1,     # L2 regularization
    "grad_clip": 1.0,        # Gradient clipping norm
}

lr_schedule = {
    "type": "cosine_with_warmup",
    "warmup_steps": 2000,    # Linear warmup
    "min_lr_ratio": 0.1,     # Decay to 10% of peak LR
}
```
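For concreteness, here is a minimal sketch of the cosine-with-warmup schedule the config above describes; the function name, total step count, and default values are assumptions:

```python
import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_lr_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
    min_lr = min_lr_ratio * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine

for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, f"{lr_at_step(s):.2e}")
```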
Critical hyperparameter scaling:
| Model Size | Peak LR | Batch Size (tokens) | Warmup Steps |
|---|---|---|---|
| 125M | 6e-4 | 256K | 1000 |
| 1B | 3e-4 | 512K | 2000 |
| 7B | 1.5e-4 | 2M | 2000 |
| 70B | 1e-4 | 4M | 2000 |
| 405B | 8e-5 | 8M | 8000 |
Rule of thumb: Larger models need lower learning rates and larger batches for stability.
Modern training uses multiple precision levels:
| Data Type | Bits | Use | Range |
|---|---|---|---|
| FP32 | 32 | Master weights, loss scaling | ±3.4×10³⁸ |
| BF16 | 16 | Forward/backward pass | ±3.4×10³⁸ |
| FP16 | 16 | Alternative to BF16 | ±65504 |
| FP8 | 8 | Emerging for matrix multiply | Limited |
BFloat16 (BF16) is now standard for LLM training: it keeps FP32's exponent (hence the same ±3.4×10³⁸ range in the table above) while halving memory and bandwidth, and it trains stably without loss scaling despite its reduced mantissa precision.
Loss scaling is required with FP16 to prevent underflow in gradients:
```python
# Automatic mixed precision with loss scaling (needed for FP16, not BF16)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast(dtype=torch.float16):
    loss = model(batch).loss

scaler.scale(loss).backward()   # scale the loss to avoid gradient underflow
scaler.step(optimizer)          # unscale gradients, then take the optimizer step
scaler.update()                 # adjust the scale factor for the next iteration
optimizer.zero_grad()
```
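For comparison, with BF16 the same loop can drop the GradScaler entirely, since BF16 shares FP32's exponent range. A minimal sketch, reusing the `model`, `batch`, and `optimizer` assumed in the snippet above:

```python
# BF16 path: no loss scaling needed (same dynamic range as FP32)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(batch).loss

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad_clip from the config above
optimizer.step()
optimizer.zero_grad()
```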
```python
checkpoint_config = {
    # Checkpointing frequency
    "save_interval": 500,        # Steps between checkpoints
    "keep_last_n": 10,           # Rolling window
    "keep_milestones": [10000, 50000, 100000],  # Permanent saves

    # Evaluation schedule
    "eval_interval": 1000,
    "eval_datasets": [
        "validation_set",        # Held-out distribution
        "mmlu_subset",           # Capability tracking
        "hellaswag_subset",
    ],

    # Logging
    "log_interval": 10,          # Steps between loss logging
    "log_grad_norm": True,       # Monitor gradient health
    "log_learning_rate": True,
}
```
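As a sketch of what a resumable checkpoint must capture (the names are assumptions; large-scale runs shard these states across devices rather than writing a single file):

```python
import torch

def save_checkpoint(step, model, optimizer, scheduler, path):
    """Persist everything needed to resume training exactly where it left off."""
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "scheduler_state": scheduler.state_dict(),
            "rng_state": torch.get_rng_state(),  # for reproducible data ordering and dropout
        },
        path,
    )
```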
Evaluation during pre-training:
While validation loss is the primary metric, tracking downstream task performance reveals capability emergence. A common approach is to score few-shot accuracy on small benchmark subsets (such as the MMLU and HellaSwag subsets listed in the config above) at every evaluation interval and watch for inflection points.
Pre-training runs are expensive and non-reproducible without careful logging. Record: random seeds, data ordering, exact configurations, hardware state. A $10M training run without proper logging is a $10M experiment you cannot fully understand or repeat.
Pre-training objectives and data form the foundation upon which all LLM capabilities rest. The remarkable abilities of modern language models emerge from simple prediction tasks applied at massive scale to diverse data.
What's next:
Pre-training produces models with broad capabilities but no particular alignment to user intent. The next page covers instruction tuning—the process of teaching models to follow instructions, answer questions helpfully, and behave as intended. We'll see how supervised fine-tuning transforms a base language model into an assistant.
You now understand the pre-training objectives that give LLMs their foundational capabilities—from autoregressive prediction to masked language modeling to denoising. This knowledge enables you to reason about model capabilities and limitations at their source. Next, we explore instruction tuning.