How does GPT-4 know that water is wet, that Paris is in France, that print('Hello') outputs text, and that stories should have coherent plots? It was never explicitly taught these facts. Instead, it learned them through a remarkably simple objective: predict the next token.
Pre-training objectives are the loss functions that shape what language models learn during the initial, compute-intensive phase of training on massive text corpora. These objectives seem almost absurdly simple—predict missing words, guess the next token—yet they produce models with stunning breadth and depth of knowledge. Understanding pre-training objectives reveals how neural networks extract and organize the latent structure of human knowledge from raw text.
This page covers the major pre-training objectives: autoregressive language modeling (GPT-style), masked language modeling (BERT-style), and hybrid approaches. You will understand the theoretical foundations, practical trade-offs, and why different objectives produce models suited to different tasks.
Modern language models follow a two-stage training paradigm: large-scale pre-training on unlabeled text, followed by post-training (instruction tuning and alignment) that adapts the model to user intent.
Pre-training is where the majority of compute is spent and where models acquire their general capabilities. The objectives used during this phase determine the fundamental nature of the resulting model.
The key insight enabling modern LLMs is that text itself provides supervision. Unlike traditional supervised learning, which requires expensive human labels, we can construct training signals directly from text: hide the next token and ask the model to predict it, mask out words and ask the model to recover them, or corrupt a passage and ask the model to reconstruct it.
This enables training on virtually unlimited data—the entire internet becomes a training set. But not all self-supervised objectives are equal.
| Objective | Supervision Signal | Model Architecture | Primary Use Case |
|---|---|---|---|
| Autoregressive LM | Predict next token | Decoder-only | Text generation |
| Masked LM | Predict masked tokens | Encoder-only | Text understanding |
| Prefix LM | Predict continuation | Encoder-decoder | Conditional generation |
| Denoising | Reconstruct corrupted text | Encoder-decoder | Translation, summarization |
Why does predicting the next token produce capable models? The answer lies in what prediction requires:
To predict the next token accurately, a model must implicitly learn:

- Syntax and grammar: which continuations are well-formed
- World knowledge: facts like "Paris is in" → "France"
- Semantics: what the preceding text actually means
- Reasoning and discourse structure: how arguments, stories, and code must continue coherently
The prediction objective is a proxy for understanding. A model that predicts perfectly must understand deeply. This is why simple objectives produce such capable models—the objective is simple, but doing it well is not.
Claude Shannon estimated the entropy of English at ~1-1.5 bits per character. A perfect language model would achieve this bound—which requires near-complete understanding of language and the world. Every bit of uncertainty reduced represents knowledge gained.
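To connect training loss to Shannon's estimate, cross-entropy in nats converts directly into bits per token and, given an average token length, into bits per character. A small worked example with illustrative numbers (the loss value and characters-per-token ratio are assumptions, not measurements):

```python
import math

# Cross-entropy reported by a training loop, in nats per token (illustrative value)
loss_nats = 3.0

perplexity = math.exp(loss_nats)          # ~20: model is effectively choosing among ~20 tokens
bits_per_token = loss_nats / math.log(2)  # ~4.3 bits of uncertainty per token

# Assume ~4 characters per token, typical for English BPE vocabularies
chars_per_token = 4.0
bits_per_char = bits_per_token / chars_per_token  # ~1.1 bits/char, near Shannon's estimate

print(f"perplexity={perplexity:.1f}, bits/token={bits_per_token:.2f}, bits/char={bits_per_char:.2f}")
```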
Autoregressive language modeling (AR-LM) is the dominant pre-training objective for modern LLMs, powering GPT, LLaMA, Claude, and most frontier models. The objective is elegantly simple: given a sequence of tokens, predict the next one.
Given a sequence of tokens $x = (x_1, x_2, ..., x_n)$, the autoregressive model factorizes the joint probability as:
$$P(x) = \prod_{i=1}^{n} P(x_i | x_1, x_2, ..., x_{i-1})$$
The training objective minimizes the negative log-likelihood:
$$\mathcal{L}_{AR} = -\sum_{i=1}^{n} \log P(x_i | x_{<i}; \theta)$$
In practice, this is implemented as cross-entropy loss between the model's predicted distribution and the one-hot encoded true next token.
```python
import torch
import torch.nn.functional as F

def compute_ar_loss(model, input_ids, attention_mask):
    """
    Compute autoregressive language modeling loss.

    Args:
        input_ids: [batch_size, seq_len] - Token IDs
        attention_mask: [batch_size, seq_len] - Attention mask

    Returns:
        loss: Scalar loss value
    """
    # Forward pass - get logits for all positions
    # Shape: [batch_size, seq_len, vocab_size]
    logits = model(input_ids, attention_mask=attention_mask).logits

    # Shift: logits[t] predicts token[t+1]
    # Remove last logit (no target) and first token (no prediction)
    shift_logits = logits[:, :-1, :].contiguous()  # [B, S-1, V]
    shift_labels = input_ids[:, 1:].contiguous()   # [B, S-1]

    # Flatten for cross-entropy
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),  # [B*(S-1), V]
        shift_labels.view(-1),                         # [B*(S-1)]
        ignore_index=-100  # Ignore padding (assumes padded label positions are set to -100 by the caller)
    )
    return loss

# The key insight: every position provides a training signal
# A sequence of 2048 tokens yields 2047 prediction tasks
# This is extremely sample-efficient compared to labeled data
```

Autoregressive models use causal attention to ensure each position can only attend to previous positions:
```
Attention Pattern (Causal Mask):

        t=1  t=2  t=3  t=4  t=5
  t=1    ✓    ✗    ✗    ✗    ✗
  t=2    ✓    ✓    ✗    ✗    ✗
  t=3    ✓    ✓    ✓    ✗    ✗
  t=4    ✓    ✓    ✓    ✓    ✗
  t=5    ✓    ✓    ✓    ✓    ✓
```
This triangular mask ensures the model cannot "cheat" by looking at future tokens during training, maintaining consistency between training and generation.
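A minimal sketch of constructing this mask in PyTorch; the shapes and the additive-mask convention are illustrative, not tied to a specific model implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to."""
    # Lower-triangular matrix: position t can see positions <= t
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = causal_mask(5)

# Additive form applied to attention scores before the softmax:
# 0 where attention is allowed, -inf where it is blocked
additive_mask = torch.zeros(5, 5).masked_fill(~mask, float("-inf"))

print(mask.int())  # matches the ✓/✗ pattern above
```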
Several properties make AR-LM particularly effective:

- Dense supervision: every position in every sequence is a prediction task, so a 2048-token sequence yields 2047 training signals
- Exact likelihood: the chain-rule factorization gives a tractable, exact log-likelihood to optimize
- Train/inference consistency: the model is trained on the same left-to-right prediction task it performs at generation time
- Simplicity at scale: no corruption scheme or separate decoder is needed, which simplifies scaling to massive corpora
Autoregressive loss exhibits characteristic training dynamics:

Early training (perplexity > 50): the model learns token frequencies, spelling, and local syntax; output is mostly incoherent.

Mid training (perplexity 10-50): grammar becomes reliable and common facts and phrase-level coherence appear, but longer passages drift.

Late training (perplexity < 10): gains shift to long-range coherence, rarer knowledge, and subtler reasoning; each further drop in perplexity is increasingly hard-won.
Notably, perplexity and capability don't perfectly correlate. A model might stall in perplexity but continue gaining reasoning ability. Evaluation must go beyond the training metric.
During training, the model always sees ground-truth previous tokens. During generation, it sees its own predictions, which may contain errors. This mismatch (exposure bias) can cause compounding errors. Techniques like scheduled sampling and RLHF partially address this, but it remains an active research area.
Masked Language Modeling (MLM), introduced by BERT (Bidirectional Encoder Representations from Transformers), takes a different approach: randomly mask tokens and train the model to predict them using bidirectional context.
Given a sequence $x = (x_1, ..., x_n)$, randomly select positions $M$ to mask. The objective is:
$$\mathcal{L}_{MLM} = -\sum_{i \in M} \log P(x_i | x_{\backslash M}; \theta)$$
Where $x_{\backslash M}$ denotes the sequence with masked positions replaced by a special [MASK] token or corrupted in some way.
BERT's original procedure: select 15% of token positions for prediction. Of the selected tokens:

- 80% are replaced with the special [MASK] token
- 10% are replaced with a random token
- 10% are left unchanged

The 80-10-10 split addresses a key issue: if we only used [MASK], the model would never see this token at inference time (creating a train-test mismatch).
```python
import random

def create_mlm_training_sample(tokens, vocab_size, mask_token_id, mask_prob=0.15):
    """
    Create a masked language modeling training sample.

    Args:
        tokens: List of token IDs
        vocab_size: Size of vocabulary
        mask_token_id: ID of [MASK] token
        mask_prob: Probability of masking each token

    Returns:
        masked_tokens: Input with masks applied
        labels: Original tokens at masked positions, -100 elsewhere
    """
    masked_tokens = tokens.copy()
    labels = [-100] * len(tokens)  # -100 = ignore in loss

    for i in range(len(tokens)):
        if random.random() < mask_prob:
            labels[i] = tokens[i]  # Store true token as label

            rand = random.random()
            if rand < 0.8:
                # 80%: Replace with [MASK]
                masked_tokens[i] = mask_token_id
            elif rand < 0.9:
                # 10%: Replace with random token
                masked_tokens[i] = random.randint(0, vocab_size - 1)
            # else 10%: Keep unchanged

    return masked_tokens, labels

# Key difference from AR: bidirectional attention
# Position i can attend to ALL positions, including future
# This provides richer context but prevents generation
```

RoBERTa (Robustly Optimized BERT): removes the next-sentence prediction objective, trains longer on more data with larger batches, and uses dynamic masking so each sequence receives a fresh mask pattern.
SpanBERT: masks contiguous spans rather than individual tokens and adds a span-boundary objective, improving span-level tasks such as question answering and coreference.

ALBERT: shares parameters across layers and factorizes the embedding matrix to cut parameter count, replacing next-sentence prediction with sentence-order prediction.

ELECTRA: replaces masking with replaced-token detection. A small generator corrupts some tokens, and the main model classifies every position as original or replaced, which yields a training signal at every position.
| Use Case | Preferred Objective | Reason |
|---|---|---|
| Text generation | Autoregressive | Natural left-to-right generation |
| Sentence classification | MLM/Encoder | Full context enables rich representations |
| Named entity recognition | MLM/Encoder | Bidirectional context for boundaries |
| Question answering | Either | Depends on extractive vs. generative |
| Summarization | AR or Encoder-Decoder | Generative task requiring coherent output |
| Few-shot learning | Autoregressive | In-context learning requires causal structure |
While decoder-only AR models dominate today, encoder-decoder models (T5, BART) using denoising objectives remain competitive for certain tasks. The trend toward unified AR models may not be permanent—architectural diversity continues in research.
A third category of pre-training objectives frames language modeling as a denoising or sequence-to-sequence task. Given corrupted input, the model must produce the original (or a corrected/completed) output. This family includes T5, BART, and UL2.
T5 unifies all NLP tasks as text-to-text problems with a span corruption objective:
Span Corruption Process:

1. Randomly select spans of tokens to corrupt (about 15% of tokens, with a mean span length of 3)
2. Replace each corrupted span in the input with a unique sentinel token (`<extra_id_0>`, `<extra_id_1>`, etc.)
3. Build the target as each sentinel followed by the original tokens it replaced

Example (sentinels shown as [X], [Y] for readability):

```
Input:  "The [X] sat on the [Y] and meowed loudly."
Target: "[X] cat [Y] mat"
```
This is more efficient than BERT's approach: the target sequence is much shorter than the input (only corrupted portions), speeding up training.
```python
import numpy as np

def t5_span_corruption(tokens, noise_density=0.15, mean_span_length=3.0):
    """
    Apply T5-style span corruption.

    Args:
        tokens: List of token IDs
        noise_density: Fraction of tokens to corrupt
        mean_span_length: Average length of corrupted spans

    Returns:
        input_ids: Corrupted input with sentinel tokens
        target_ids: Target containing original spans
    """
    n = len(tokens)
    num_noise_tokens = int(n * noise_density)
    num_spans = max(1, int(num_noise_tokens / mean_span_length))

    # Generate span lengths from Poisson distribution
    span_lengths = np.random.poisson(mean_span_length, num_spans)
    span_lengths = np.maximum(1, span_lengths)

    # Select span starts in a "reduced" coordinate space that excludes the noise
    # tokens themselves, then offset each start by the lengths of earlier spans.
    # This guarantees spans are in order and never overlap.
    total_span_length = int(sum(span_lengths))
    num_available = n - total_span_length
    reduced_starts = sorted(np.random.choice(num_available, num_spans, replace=False))

    # Build input and target
    input_ids = []
    target_ids = []
    sentinel_id = 32000  # Starting sentinel ID
    pos = 0
    offset = 0
    for i, (reduced_start, length) in enumerate(zip(reduced_starts, span_lengths)):
        start = reduced_start + offset
        # Add non-corrupted prefix
        input_ids.extend(tokens[pos:start])
        # Add sentinel for corrupted span
        input_ids.append(sentinel_id + i)
        # Target: sentinel followed by original tokens
        target_ids.append(sentinel_id + i)
        target_ids.extend(tokens[start:start + length])
        pos = start + length
        offset += length

    # Add remaining tokens to input
    input_ids.extend(tokens[pos:])
    return input_ids, target_ids
```

BART uses a more flexible noise approach with an autoregressive decoder:
Noise Types in BART:
| Noise Type | Description | Example |
|---|---|---|
| Token Masking | Replace tokens with [MASK] | "The cat sat" → "The [MASK] sat" |
| Token Deletion | Remove tokens without placeholder | "The cat sat" → "The sat" |
| Text Infilling | Replace spans with single [MASK] | "The cat sat there" → "The [MASK] there" |
| Sentence Permutation | Shuffle sentence order | Document sentence reordering |
| Document Rotation | Rotate to start at random token | "The cat sat" → "cat sat The" |
BART's decoder is fully autoregressive, generating the complete original sequence. This makes it naturally suited for generation tasks like summarization.
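To make the text-infilling row above concrete, here is a minimal sketch of that transform with Poisson-distributed span lengths; the string placeholder and sampling details are simplifications, not BART's exact implementation:

```python
import numpy as np

MASK = "[MASK]"  # placeholder; real tokenizers use a dedicated mask token ID
rng = np.random.default_rng(0)

def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0):
    """Replace whole spans of tokens with a single [MASK] (BART-style text infilling)."""
    out, i, masked = [], 0, 0
    budget = int(len(tokens) * mask_ratio)  # roughly how many tokens to corrupt
    while i < len(tokens):
        if masked < budget and rng.random() < mask_ratio:
            span = max(1, rng.poisson(poisson_lambda))  # BART also allows 0-length spans; omitted here
            out.append(MASK)   # the whole span collapses to ONE mask token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

print(text_infilling("the cat sat on the mat and meowed loudly".split()))
```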
UL2 proposes mixing multiple objectives during pre-training:
UL2 Mixture:

- R-denoising (regular): standard T5-style span corruption with short spans and low corruption rates
- S-denoising (sequential): prefix language modeling, predicting the continuation of a given prefix
- X-denoising (extreme): aggressive corruption with long spans and/or high corruption rates

Each sample is prefixed with a mode token ([R], [S], [X]) indicating the denoising mode. The model learns to handle all modes, gaining benefits of each.
Key insight: Different objectives are complementary. Span infilling improves bidirectional understanding; sequential denoising improves generation. Combining them produces a more capable model than either alone.
A hybrid approach: the model attends bidirectionally to a prefix, then autoregressively generates the suffix. This combines encoder-style understanding of context with decoder-style generation. Used in encoder-decoder models and increasingly in decoder-only models via attention mask manipulation.
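A minimal sketch of the attention-mask manipulation this describes, with bidirectional attention over the prefix and causal attention over the suffix; the function and argument names are assumptions:

```python
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """True where attention is allowed: full attention within the prefix,
    causal attention for the generated suffix."""
    # Start from a standard causal (lower-triangular) mask
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Every position may additionally attend to the entire prefix (bidirectional context)
    mask[:, :prefix_len] = True
    return mask

print(prefix_lm_mask(seq_len=6, prefix_len=3).int())
```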
The pre-training objective is only half the story. Equally critical is the data on which that objective is optimized. Data quality, diversity, and composition fundamentally shape what models learn.
| Source | Size | Quality | Properties |
|---|---|---|---|
| Web Crawl (Common Crawl) | 100T+ tokens | Variable | Broad coverage, noise, duplication |
| Wikipedia | ~4B tokens (English) | High | Factual, encyclopedic, neutral |
| Books | 10-100B tokens | High | Long-form, narrative, diverse style |
| Code (GitHub, etc.) | ~500B tokens | Variable | Structured, logical, multilingual |
| Scientific Papers | ~100B tokens | High | Technical, precise, specialized |
| Curated Web (C4, RefinedWeb) | 1-10T tokens | Medium-High | Filtered web, removed duplicates |
| Synthetic Data | Unlimited | Variable | LLM-generated, controlled properties |
Raw web crawl data is not directly usable. Quality filtering is essential:
1. Deduplication
Why it matters: Duplicated data causes memorization, wasted compute, and benchmark contamination. Models can recite training data verbatim if it appears frequently enough. (A minimal hashing sketch follows after this list.)
2. Quality Scoring

Heuristics and learned classifiers estimate how useful each document is for training (length, character composition, repetition, resemblance to known high-quality text); low-scoring documents are dropped or downsampled.

3. Content Filtering

Toxic, adult, or otherwise unwanted content and personally identifiable information are removed, typically with a combination of classifiers and pattern-based rules.
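Exact deduplication, the simplest part of stage 1, can be implemented by hashing normalized documents. A minimal sketch; production pipelines add fuzzy methods such as MinHash over n-grams, and the normalization rules here are assumptions:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(documents):
    """Keep only the first occurrence of each exact (normalized) duplicate."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The cat sat on the mat.", "the cat  sat on the mat.", "A different document."]
print(len(deduplicate(docs)))  # 2
```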
```python
class DataQualityPipeline:
    """
    Multi-stage pipeline for pre-training data quality.
    """
    def __init__(self, quality_classifier, toxicity_classifier):
        self.quality_cls = quality_classifier
        self.toxicity_cls = toxicity_classifier

    def process_document(self, text: str) -> dict:
        """
        Process a single document through all quality stages.

        Returns:
            dict with 'keep' bool and 'reasons' for filtering
        """
        reasons = []

        # Stage 1: Basic heuristics
        if len(text.split()) < 20:
            reasons.append("too_short")

        # Check character quality
        alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
        if alpha_ratio < 0.7:
            reasons.append("low_alpha_ratio")

        # Repetition detection (sliding window)
        words = text.lower().split()
        if len(words) > 50:
            window = 50
            unique_ratio = len(set(words[:window])) / window
            if unique_ratio < 0.3:
                reasons.append("high_repetition")

        # Stage 2: Quality classifier
        # (assumes the classifier returns P(high quality) as a scalar)
        quality_score = self.quality_cls.predict_proba([text])[0]
        if quality_score < 0.5:
            reasons.append(f"quality_score={quality_score:.2f}")

        # Stage 3: Toxicity filtering
        # (assumes the classifier returns P(toxic) as a scalar)
        toxicity_score = self.toxicity_cls.predict_proba([text])[0]
        if toxicity_score > 0.8:
            reasons.append(f"toxicity_score={toxicity_score:.2f}")

        return {
            "keep": len(reasons) == 0,
            "reasons": reasons,
            "quality_score": quality_score,
            "toxicity_score": toxicity_score,
        }
```

Frontier models train on carefully balanced mixtures:
```
Typical LLM Data Mixture:
├── Web (filtered): 50-60%
│   └── High-quality web text, deduplicated
│
├── Code: 15-20%
│   └── GitHub, code documentation
│
├── Books: 10-15%
│   └── Fiction and non-fiction
│
├── Academic: 5-10%
│   └── arXiv, PubMed, patents
│
├── Wikipedia: 2-5%
│   └── Upsampled for factual grounding
│
├── Curated dialogue: 2-5%
│   └── Reddit, forums, Q&A sites
│
└── Math/Reasoning: 2-5%
    └── Textbooks, proofs, problem sets
```
Key insights:

- Raw availability does not dictate the mixture: high-quality sources such as Wikipedia, books, and academic text are upsampled well beyond their share of available tokens
- Web text provides breadth, but only after aggressive filtering and deduplication
- Code and math occupy a deliberate slice of the mixture even for general-purpose models

A sampling sketch based on weights like these follows below.
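As a rough illustration of how such a mixture is consumed during training, here is a minimal sketch that picks a data source for each document according to mixture weights; the weights and source names are illustrative, not taken from any specific model:

```python
import random

rng = random.Random(0)

# Illustrative mixture weights (fractions of training tokens), roughly matching the tree above
MIXTURE = {
    "web_filtered": 0.55,
    "code": 0.18,
    "books": 0.12,
    "academic": 0.07,
    "wikipedia": 0.04,
    "dialogue": 0.02,
    "math": 0.02,
}

def sample_source() -> str:
    """Pick a data source with probability proportional to its mixture weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

counts = {}
for _ in range(10_000):
    s = sample_source()
    counts[s] = counts.get(s, 0) + 1
print(counts)  # empirical proportions approximate the target mixture
```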
A model cannot generate knowledge it has never seen. If a fact, language, or skill is absent from training data, the model will not know it—scale cannot compensate for missing data. This creates fundamental limitations that are often invisible until the model fails.
Converting the theoretical objective into a working training pipeline requires careful attention to optimization, stability, and efficiency. Here we cover the key implementation considerations.
Most LLMs use AdamW with carefully tuned hyperparameters:
```python
optimizer_config = {
    "optimizer": "AdamW",
    "learning_rate": 3e-4,   # Peak LR, lower for larger models
    "beta1": 0.9,            # Momentum term
    "beta2": 0.95,           # Second moment (lower than default 0.999)
    "epsilon": 1e-8,
    "weight_decay": 0.1,     # L2 regularization
    "grad_clip": 1.0,        # Gradient clipping norm
}

lr_schedule = {
    "type": "cosine_with_warmup",
    "warmup_steps": 2000,    # Linear warmup
    "min_lr_ratio": 0.1,     # Decay to 10% of peak LR
}
```
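For concreteness, here is a minimal sketch of the cosine-with-warmup schedule the config above describes; the function name, total step count, and default values are assumptions:

```python
import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_lr_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
    min_lr = min_lr_ratio * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine

for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, f"{lr_at_step(s):.2e}")
```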
Critical hyperparameter scaling:
| Model Size | Peak LR | Batch Size (tokens) | Warmup Steps |
|---|---|---|---|
| 125M | 6e-4 | 256K | 1000 |
| 1B | 3e-4 | 512K | 2000 |
| 7B | 1.5e-4 | 2M | 2000 |
| 70B | 1e-4 | 4M | 2000 |
| 405B | 8e-5 | 8M | 8000 |
Rule of thumb: Larger models need lower learning rates and larger batches for stability.
Modern training uses multiple precision levels:
| Data Type | Bits | Use | Range |
|---|---|---|---|
| FP32 | 32 | Master weights, loss scaling | ±3.4×10³⁸ |
| BF16 | 16 | Forward/backward pass | ±3.4×10³⁸ |
| FP16 | 16 | Alternative to BF16 | ±65504 |
| FP8 | 8 | Emerging for matrix multiply | Limited |
BFloat16 (BF16) is now standard for LLM training: it keeps FP32's exponent (hence the same ±3.4×10³⁸ range in the table above) while halving memory and bandwidth, and it trains stably without loss scaling despite its reduced mantissa precision.
Loss scaling is required with FP16 to prevent underflow in gradients:
```python
# Automatic mixed precision with loss scaling (needed for FP16, not BF16)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast(dtype=torch.float16):
    loss = model(batch).loss

scaler.scale(loss).backward()   # scale the loss to avoid gradient underflow
scaler.step(optimizer)          # unscale gradients, then take the optimizer step
scaler.update()                 # adjust the scale factor for the next iteration
optimizer.zero_grad()
```
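For comparison, with BF16 the same loop can drop the GradScaler entirely, since BF16 shares FP32's exponent range. A minimal sketch, reusing the `model`, `batch`, and `optimizer` assumed in the snippet above:

```python
# BF16 path: no loss scaling needed (same dynamic range as FP32)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(batch).loss

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad_clip from the config above
optimizer.step()
optimizer.zero_grad()
```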
```python
checkpoint_config = {
    # Checkpointing frequency
    "save_interval": 500,        # Steps between checkpoints
    "keep_last_n": 10,           # Rolling window
    "keep_milestones": [10000, 50000, 100000],  # Permanent saves

    # Evaluation schedule
    "eval_interval": 1000,
    "eval_datasets": [
        "validation_set",        # Held-out distribution
        "mmlu_subset",           # Capability tracking
        "hellaswag_subset",
    ],

    # Logging
    "log_interval": 10,          # Steps between loss logging
    "log_grad_norm": True,       # Monitor gradient health
    "log_learning_rate": True,
}
```
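As a sketch of what a resumable checkpoint must capture (the names are assumptions; large-scale runs shard these states across devices rather than writing a single file):

```python
import torch

def save_checkpoint(step, model, optimizer, scheduler, path):
    """Persist everything needed to resume training exactly where it left off."""
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "scheduler_state": scheduler.state_dict(),
            "rng_state": torch.get_rng_state(),  # for reproducible data ordering and dropout
        },
        path,
    )
```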
Evaluation during pre-training:
While validation loss is the primary metric, tracking downstream task performance reveals capability emergence. A common approach is to score few-shot accuracy on small benchmark subsets (such as the MMLU and HellaSwag subsets listed in the config above) at every evaluation interval and watch for inflection points.
Pre-training runs are expensive and non-reproducible without careful logging. Record: random seeds, data ordering, exact configurations, hardware state. A $10M training run without proper logging is a $10M experiment you cannot fully understand or repeat.
Pre-training objectives and data form the foundation upon which all LLM capabilities rest. The remarkable abilities of modern language models emerge from simple prediction tasks applied at massive scale to diverse data.
What's next:
Pre-training produces models with broad capabilities but no particular alignment to user intent. The next page covers instruction tuning—the process of teaching models to follow instructions, answer questions helpfully, and behave as intended. We'll see how supervised fine-tuning transforms a base language model into an assistant.
You now understand the pre-training objectives that give LLMs their foundational capabilities—from autoregressive prediction to masked language modeling to denoising. This knowledge enables you to reason about model capabilities and limitations at their source. Next, we explore instruction tuning.