When you read a sentence like 'The cat sat on the ___', your brain automatically generates predictions for the next word—'mat', 'chair', 'floor'. You don't predict the entire rest of the document; you focus on the immediate next token. This simple insight underlies one of the most powerful frameworks in machine learning: autoregressive modeling.
An autoregressive model learns to predict the next element given all previous elements. By chaining these predictions—each output becoming input for the next—we can model entire sequences of arbitrary length. This framework is the foundation of language models from simple n-grams to GPT-4, of speech synthesis systems, of video prediction, and of many other generative models for sequential data.
In this page, we develop the autoregressive framework rigorously. We derive the training objective, show how maximum likelihood leads to next-token prediction, explore generation procedures, and connect autoregressive modeling to information theory and compression.
By the end of this page, you will: (1) Formalize the autoregressive factorization, (2) Derive the maximum likelihood training objective, (3) Understand teacher forcing and its implications, (4) Master generation procedures including sampling methods, and (5) Connect autoregressive models to compression and perplexity.
The autoregressive factorization decomposes the joint probability of a sequence into a product of conditional probabilities:
$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1, x_2, \ldots, x_{t-1}) = \prod_{t=1}^{T} P(x_t | \mathbf{x}_{<t})$$
This is not an approximation—it follows directly from the chain rule of probability and holds for any joint distribution. The key modeling decision is how to parameterize each conditional distribution $P(x_t | \mathbf{x}_{<t})$.
In an autoregressive model, we use a single parameterized function (typically a neural network) to represent all these conditionals:
$$P_\theta(x_t | \mathbf{x}_{<t}) = f_\theta(\mathbf{x}_{<t})$$
where $\theta$ are learnable parameters and the output is a probability distribution over possible values of $x_t$.
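To make the factorization concrete, here is a minimal sketch that scores a short toy sequence by chaining conditionals. The table `cond` is only a stand-in for what a trained network $f_\theta$ would produce for each prefix:

```python
import torch

# Toy vocabulary {0, 1, 2}; cond[t] stands in for P(x_t | x_<t) given the
# actual prefix seq[:t]. A real model would compute these from the history.
torch.manual_seed(0)
seq = [2, 0, 1]                                          # sequence to score
cond = torch.softmax(torch.randn(len(seq), 3), dim=-1)   # stand-in conditionals

# Joint probability as a product of conditionals (the factorization above)
joint = 1.0
for t, token in enumerate(seq):
    joint *= cond[t, token].item()

# Equivalently, sum log-conditionals and exponentiate (numerically preferred)
log_joint = sum(torch.log(cond[t, token]).item() for t, token in enumerate(seq))
print(joint, torch.tensor(log_joint).exp().item())       # the two agree
```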
The term 'autoregressive' comes from time series analysis, where it refers to models that regress a variable on its own past values: $x_t = \sum \phi_i x_{t-i} + \epsilon_t$. Neural autoregressive models generalize this to arbitrary nonlinear dependencies and non-Gaussian distributions, but retain the core idea: predict the present from the past.
Components of an Autoregressive Model:
Encoder: Maps the history $\mathbf{x}_{<t}$ to a fixed-size representation $h_t$
Predictor: Maps the representation to output distribution parameters
Output distribution: Defines $P(x_t | h_t)$ for the domain
For discrete sequences (text): $$P(x_t = k | h_t) = \text{softmax}(W h_t + b)_k = \frac{\exp(W_k^\top h_t + b_k)}{\sum_j \exp(W_j^\top h_t + b_j)}$$
For continuous sequences: $$P(x_t | h_t) = \mathcal{N}(\mu(h_t), \sigma^2(h_t))$$
or more flexible distributions like mixtures of Gaussians, normalizing flows, etc.
| Domain | Output Type | Distribution | Parameters |
|---|---|---|---|
| Text (word-level) | Categorical | Softmax over vocabulary | $V$ logits |
| Text (character) | Categorical | Softmax over characters | ~100-300 logits |
| Audio (WaveNet) | Categorical (μ-law) | Softmax over 256 values | 256 logits |
| Audio (continuous) | Mixture of Gaussians | MoG density | $(\mu_k, \sigma_k, \pi_k)$ for each of $K$ components |
| Time series | Gaussian | $\mathcal{N}(\mu, \sigma^2)$ | Mean and variance |
| Video (pixels) | Categorical per channel | Softmax over 256 values × 3 | 768 logits |
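For the continuous case, an output head might look like the following sketch: a small module (hypothetically named `GaussianHead`) maps the history representation $h_t$ to the mean and standard deviation of a Gaussian, and the per-step loss is the negative log-likelihood, the continuous analogue of cross-entropy:

```python
import torch
import torch.nn as nn


class GaussianHead(nn.Module):
    """Illustrative output head for continuous sequences: maps the history
    representation h_t to the parameters of a Gaussian over x_t."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, 1)
        self.log_sigma = nn.Linear(hidden_dim, 1)  # predict log std so sigma > 0

    def forward(self, h_t: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(
            loc=self.mu(h_t),
            scale=self.log_sigma(h_t).exp(),
        )


# Usage: the per-step loss is the negative log-likelihood of the true value
head = GaussianHead(hidden_dim=16)
h_t = torch.randn(4, 16)                # batch of history representations
x_t = torch.randn(4, 1)                 # true next values
nll = -head(h_t).log_prob(x_t).mean()   # analogue of cross-entropy for continuous data
```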
Given a dataset of sequences $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$, we train the autoregressive model by maximum likelihood estimation (MLE)—finding parameters $\theta$ that maximize the probability of the observed data:
$$\theta^* = \arg\max_\theta \prod_{n=1}^{N} P_\theta(\mathbf{x}^{(n)})$$
Taking logarithms (which doesn't change the argmax) and using the autoregressive factorization:
$$\theta^* = \arg\max_\theta \sum_{n=1}^{N} \log P_\theta(\mathbf{x}^{(n)}) = \arg\max_\theta \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log P_\theta(x_t^{(n)} | \mathbf{x}_{<t}^{(n)})$$
The Cross-Entropy Loss:
Converting maximization to minimization (for gradient descent), and averaging over tokens, we obtain the cross-entropy loss:
$$\mathcal{L}(\theta) = -\frac{1}{\sum_n T_n} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log P_\theta(x_t^{(n)} | \mathbf{x}_{<t}^{(n)})$$
For categorical outputs, this is precisely the categorical cross-entropy between the model's predicted distribution and the one-hot encoded ground truth:
$$\mathcal{L}_t = -\sum_{k=1}^{V} \mathbf{1}[x_t = k] \log P_\theta(x_t = k | \mathbf{x}_{<t}) = -\log P_\theta(x_t | \mathbf{x}_{<t})$$
The loss at each timestep measures how well the model predicted the actual next token. Summing over all positions gives the total sequence loss.
The cross-entropy loss penalizes the model when it assigns low probability to the true next token. If the model predicts $P(\text{'cat'}) = 0.01$ but the true next word is 'cat', the loss is $-\log(0.01) \approx 4.6$—quite high. If it predicts $P(\text{'cat'}) = 0.9$, the loss is only $-\log(0.9) \approx 0.1$. The model learns to assign high probability to likely continuations.
Gradient Computation:
The gradient of the loss with respect to parameters $\theta$ decomposes across timesteps:
$$\nabla_\theta \mathcal{L} = -\frac{1}{\sum_n T_n} \sum_{n,t} \nabla_\theta \log P_\theta(x_t^{(n)} | \mathbf{x}_{<t}^{(n)})$$
For softmax outputs, the gradient with respect to logits $z_t$ has a particularly elegant form:
$$\frac{\partial \mathcal{L}_t}{\partial z_k} = P_\theta(x_t = k | \mathbf{x}_{<t}) - \mathbf{1}[x_t = k]$$
This is simply the predicted probability minus the one-hot target: push up the probability of the correct token, push down all others proportionally.
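This identity is easy to verify numerically. The short sketch below compares the autograd gradient of the cross-entropy with the predicted-probability-minus-one-hot formula for a single position:

```python
import torch
import torch.nn.functional as F

# Numerical check: d(cross-entropy)/d(logits) = softmax(logits) - one_hot(target)
vocab_size = 5
logits = torch.randn(vocab_size, requires_grad=True)
target = torch.tensor(2)

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

expected = F.softmax(logits.detach(), dim=-1) - F.one_hot(target, vocab_size).float()
print(torch.allclose(logits.grad, expected, atol=1e-6))  # True
```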
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def autoregressive_loss(logits, targets, reduction='mean'):
    """
    Compute cross-entropy loss for autoregressive prediction.

    Args:
        logits: Model outputs, shape [batch, seq_len, vocab_size]
        targets: Ground truth token IDs, shape [batch, seq_len]
        reduction: 'mean', 'sum', or 'none'

    Returns:
        Loss value (averaged over valid tokens if reduction='mean')
    """
    batch_size, seq_len, vocab_size = logits.shape

    # Reshape for cross_entropy: [batch * seq_len, vocab_size]
    logits_flat = logits.view(-1, vocab_size)
    targets_flat = targets.view(-1)

    # Compute per-token cross-entropy
    # This is -log P(target_token | context)
    loss = F.cross_entropy(
        logits_flat,
        targets_flat,
        reduction=reduction,
        ignore_index=-100  # Ignore padding tokens
    )

    return loss


def compute_perplexity(model, data_loader) -> float:
    """
    Compute perplexity: exp(average cross-entropy loss).
    Lower perplexity = better model.
    """
    model.eval()
    total_loss = 0.0
    total_tokens = 0

    with torch.no_grad():
        for batch in data_loader:
            inputs, targets = batch['input_ids'], batch['labels']

            # Forward pass
            logits = model(inputs)

            # Compute loss (sum over tokens)
            loss = autoregressive_loss(logits, targets, reduction='sum')

            # Count non-padding tokens
            valid_tokens = (targets != -100).sum().item()

            total_loss += loss.item()
            total_tokens += valid_tokens

    # Perplexity = exp(average negative log likelihood)
    avg_nll = total_loss / total_tokens
    perplexity = torch.exp(torch.tensor(avg_nll)).item()

    return perplexity
```

Teacher forcing is the standard training procedure for autoregressive models. During training, we condition on the ground-truth history rather than the model's own predictions:
$$\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(x_t | x_1, x_2, \ldots, x_{t-1})$$
where $x_1, \ldots, x_{t-1}$ are the true tokens from the training sequence, not tokens generated by the model.
Why teacher forcing?
Efficiency: We can compute all predictions in parallel (for Transformers) or in a single forward pass (for RNNs with ground-truth inputs)
Stable gradients: Early in training, model predictions are random. Conditioning on garbage would produce garbage—useless gradients.
Dense supervision: Every position provides a learning signal. We don't wait for errors to compound before correction.
Teacher forcing creates a mismatch between training and inference. During training, the model sees perfect history; during generation, it sees its own (imperfect) predictions. This 'exposure bias' can lead to cascading errors: one wrong prediction leads to an unfamiliar state, leading to more errors, and so on. The model has never learned to recover from mistakes.
Mitigation Strategies:
1. Scheduled Sampling
Gradually increase the probability of using model predictions during training:
$$p_t = \begin{cases} x_t^{\text{true}} & \text{with probability } \epsilon \\ x_t^{\text{pred}} & \text{with probability } 1 - \epsilon \end{cases}$$
where $\epsilon$ decreases over training (e.g., from 1.0 to 0.1). This exposes the model to its own errors, but the approach is heuristic and can destabilize training.
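A sketch of what one scheduled-sampling training step might look like for a step-wise recurrent model; the `model(input_emb, state) -> (logits, state)` and `embed` interfaces are assumptions made purely for illustration:

```python
import torch
import torch.nn.functional as F


def scheduled_sampling_step(model, embed, x_true, epsilon: float):
    """One training pass with scheduled sampling (sketch).

    x_true: ground-truth tokens, shape [batch, seq_len]
    epsilon: probability of feeding the ground-truth token (decayed over training)
    """
    batch, seq_len = x_true.shape
    state = None
    prev_token = x_true[:, 0]          # always start from the true first token
    total_loss = 0.0

    for t in range(1, seq_len):
        logits, state = model(embed(prev_token), state)
        total_loss = total_loss + F.cross_entropy(logits, x_true[:, t])

        # With probability epsilon feed the true token as the next input,
        # otherwise feed the model's own (greedy) prediction.
        use_truth = torch.rand(batch, device=x_true.device) < epsilon
        prev_token = torch.where(use_truth, x_true[:, t], logits.argmax(dim=-1))

    return total_loss / (seq_len - 1)
```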
2. Sequence-Level Training
Use training objectives that score entire generated sequences rather than individual tokens, for example reinforcement-learning-based fine-tuning on sequence-level rewards or minimum risk training.
3. Data Augmentation
Expose the model to perturbed inputs during training, simulating the noisy inputs it might see during generation.
4. Robust Decoding
At inference time, use decoding strategies that are less sensitive to individual errors (e.g., beam search, nucleus sampling with reranking).
The Parallelization Advantage:
Teacher forcing enables parallel computation during training. Consider a sequence of length $T$: because every input position is the known ground-truth token, all $T$ next-token predictions can be computed at once in a single forward pass, whereas free-running generation would require $T$ sequential steps.
This parallelization is crucial for training on modern hardware (GPUs/TPUs) and scales to sequences of thousands of tokens.
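The following sketch shows the usual setup for a causally masked model: inputs are the sequence without its last token, targets are the sequence shifted by one, and every position is scored in a single forward pass (the `model` interface is an assumption for illustration):

```python
import torch
import torch.nn.functional as F


def teacher_forcing_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """One parallel training step under teacher forcing (sketch).

    `model` is assumed to be a causally masked network returning logits of
    shape [batch, seq_len, vocab_size]; token_ids has shape [batch, seq_len].
    """
    inputs = token_ids[:, :-1]    # x_1 ... x_{T-1} fed as context
    targets = token_ids[:, 1:]    # x_2 ... x_T predicted at each position

    logits = model(inputs)        # all positions computed in one forward pass
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,        # skip padding positions, as in the loss above
    )
```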
Once trained, an autoregressive model can generate new sequences by sampling from the learned distribution. The generation process is inherently sequential: sample $x_1$ from $P_\theta(x_1)$ (or start from a prompt), feed it back as input, sample $x_2$ from $P_\theta(x_2 | x_1)$, and repeat until an end-of-sequence token or a length limit is reached.
The sampling step introduces stochasticity—the same prompt can produce different outputs. Various sampling strategies balance diversity against quality.
Greedy Decoding:
The simplest strategy: always select the most probable token:
$$x_t = \arg\max_k P_\theta(x_t = k | \mathbf{x}_{<t})$$
Pros: deterministic and reproducible; cheap, since no sampling is required.
Cons: locally optimal choices can miss globally more probable sequences, and outputs tend to be bland and repetitive (degenerate loops are a well-known failure mode).
When to use: tasks with an essentially unique correct answer (e.g., extractive QA or code completion) where diversity is not desired.
```python
import torch
import torch.nn.functional as F


def sample_next_token(
    logits: torch.Tensor,
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0
) -> torch.Tensor:
    """
    Sample next token using temperature, top-k, and/or top-p.

    Args:
        logits: Unnormalized log-probs [batch_size, vocab_size]
        temperature: Softmax temperature (lower = more greedy)
        top_k: Keep only top k tokens (0 = disabled)
        top_p: Keep tokens with cumulative prob >= p (1.0 = disabled)

    Returns:
        Sampled token indices [batch_size]
    """
    # Apply temperature
    if temperature != 1.0:
        logits = logits / temperature

    # Apply top-k filtering
    if top_k > 0:
        top_k_logits, top_k_indices = torch.topk(logits, top_k, dim=-1)
        # Set all other logits to -inf
        logits = torch.full_like(logits, float('-inf'))
        logits.scatter_(-1, top_k_indices, top_k_logits)

    # Apply top-p (nucleus) filtering
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cumsum_probs = torch.cumsum(probs, dim=-1)

        # Remove tokens where cumsum > top_p (keep first that exceeds)
        sorted_mask = cumsum_probs > top_p
        # Shift right to keep first token that exceeds threshold
        sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
        sorted_mask[..., 0] = False

        # Set filtered logits to -inf
        sorted_logits[sorted_mask] = float('-inf')
        # Unsort
        logits = sorted_logits.gather(-1, sorted_indices.argsort(-1))

    # Sample
    probs = F.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)

    return next_token


def greedy_decode(model, input_ids, max_length):
    """Greedy decoding - always pick highest prob token."""
    for _ in range(max_length):
        logits = model(input_ids)[:, -1, :]  # Last position
        next_token = logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

        if next_token.item() == model.eos_token_id:
            break

    return input_ids
```

Perplexity is the standard metric for evaluating autoregressive models. It measures how 'surprised' the model is by the test data—lower perplexity means better prediction.
Definition:
$$\text{PPL}(\mathbf{x}) = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t | \mathbf{x}_{<t}) \right) = \exp(\mathcal{L})$$
Perplexity is the exponential of the average cross-entropy loss. This has an intuitive interpretation: perplexity equals the effective number of equally likely choices the model considers at each step.
If a model has perplexity 100 on English text, it's as uncertain as if it were choosing uniformly from 100 equally likely words at each position. A perplexity of 10 means the model has narrowed down to about 10 plausible options. State-of-the-art language models achieve perplexities of 10-30 on standard benchmarks—they've learned substantial structure.
| Model | Dataset | Perplexity | Year |
|---|---|---|---|
| Unigram baseline | Penn Treebank | ~1000 | |
| Kneser-Ney trigram | Penn Treebank | ~140 | 1995 |
| LSTM (large) | Penn Treebank | ~60 | 2016 |
| AWD-LSTM | Penn Treebank | ~52 | 2017 |
| Transformer-XL | Penn Treebank | ~24 | 2019 |
| GPT-2 (large) | Penn Treebank | ~35* | 2019 |
| GPT-3 (175B) | Various | ~20-30* | 2020 |

*Zero-shot evaluation; not directly comparable to models trained on the benchmark's training set.
Connection to Compression:
Perplexity connects directly to information theory and data compression:
$$\text{Bits per character} = \log_2(\text{PPL})$$
If the character-level perplexity is 32, the model needs $\log_2(32) = 5$ bits per character on average to encode the sequence. Shannon's source coding theorem tells us that the entropy rate $H$ is the theoretical minimum: no model can achieve perplexity below $2^{H}$, and any gap above it reflects structure the model has failed to capture.
This connection has practical implications: a language model combined with an entropy coder (such as arithmetic coding) is a lossless compressor, and better compression of held-out data is direct evidence of a better model.
Limitations of Perplexity: it measures only next-token prediction quality, so it says little about long-range coherence, factuality, or downstream usefulness, and values are not comparable across different vocabularies or tokenizations.
Autoregressive models naturally support conditional generation: instead of starting from an empty context, we start from a prompt that steers the generation:
$$P(x_{t+1}, \ldots, x_T | x_1, \ldots, x_t) = \prod_{i=t+1}^{T} P(x_i | x_1, \ldots, x_{i-1})$$
The prompt $x_1, \ldots, x_t$ constrains the output distribution. This enables text continuation, question answering, few-shot in-context learning, and control over style and format, as sketched below.
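A minimal sketch of prompted generation, reusing `sample_next_token` from the sampling code above; the `model` interface (token IDs in, logits out) and the parameter defaults are illustrative assumptions:

```python
import torch


@torch.no_grad()
def generate_from_prompt(model, prompt_ids, max_new_tokens=50,
                         temperature=0.8, top_p=0.9, eos_token_id=None):
    """Conditional generation from a prompt (sketch).

    `model` is assumed to return logits of shape [batch, seq_len, vocab_size]
    for a batch of token IDs; `sample_next_token` is defined earlier.
    """
    ids = prompt_ids                                # [batch, prompt_len]
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]               # distribution over the next token
        next_token = sample_next_token(logits, temperature=temperature, top_p=top_p)
        ids = torch.cat([ids, next_token.unsqueeze(-1)], dim=-1)
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break
    return ids
```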
Large language models have revealed that prompting is a powerful paradigm. Instead of fine-tuning model weights for each task, we can 'program' the model through its input. The prompt provides context, examples, and instructions—all interpreted through the autoregressive probability distribution the model has learned.
Types of Conditioning:
1. Explicit Prefixing Simply prepend conditioning information to the sequence: '[STYLE: Shakespeare] To be or not to be...'
2. Control Codes Train with special tokens indicating attributes: '<positive_sentiment> The movie was...'
3. Cross-Attention Conditioning For encoder-decoder models, condition generation on encoded representations: $$P(y_t | y_{<t}, \text{Encode}(x))$$
Used in translation, summarization, image captioning.
4. Latent Conditioning Condition on continuous latent codes: $$P(x | z) = \prod_t P(x_t | x_{<t}, z)$$
Used in VAEs, controlled generation, style mixing.
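A sketch of one common way to implement latent conditioning: broadcast the latent code $z$ to every timestep and concatenate it with the token embedding before the recurrent layer (the module name and sizes here are illustrative):

```python
import torch
import torch.nn as nn


class LatentConditionedLM(nn.Module):
    """Sketch of latent conditioning: every step sees z alongside the history."""

    def __init__(self, vocab_size=1000, embed_dim=64, latent_dim=16, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim + latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, seq_len], z: [batch, latent_dim]
        emb = self.embed(tokens)                              # [B, T, E]
        z_rep = z.unsqueeze(1).expand(-1, emb.size(1), -1)    # broadcast z to every step
        h, _ = self.rnn(torch.cat([emb, z_rep], dim=-1))
        return self.out(h)                                    # logits for P(x_t | x_<t, z)
```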
Prompt Engineering:
The art of crafting effective prompts (task instructions, worked examples, output-format cues) has become crucial for modern LLMs.
While prompting seems ad-hoc, it's principled in the autoregressive framework: we're conditioning the generation on observations that maximize the conditional probability of desired outputs.
The autoregressive factorization imposes a causal structure: to predict $x_t$, we can only use $x_1, \ldots, x_{t-1}$. But for many tasks—classification, tagging, understanding—we have the entire sequence available. Why not use all of it?
Bidirectional models like BERT compute representations that condition on both past AND future:
$$h_t = f(x_1, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_T)$$
This enables richer representations for understanding tasks but precludes direct generation.
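The structural difference is easy to see in the attention masks: a causal model allows position $t$ to attend only to positions $\le t$, while a bidirectional model attends everywhere. A tiny sketch:

```python
import torch

# True = allowed to attend. Row t of the causal mask exposes only positions <= t.
T = 5
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
bidirectional_mask = torch.ones(T, T, dtype=torch.bool)

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```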
Choosing Between Paradigms:
| Task Type | Preferred Approach | Reason |
|---|---|---|
| Text generation | Autoregressive | Must generate sequentially |
| Classification | Bidirectional | Full context improves accuracy |
| Sequence labeling | Bidirectional | Each token benefits from context |
| Translation | Encoder-decoder | Bidirectional encoder, autoregressive decoder |
| Infilling | Specialized | BERT with mask, or XLNet-style permutation |
Hybrid Approaches: prefix language models read the prompt bidirectionally but generate autoregressively, and permutation-based training (as in XLNet) exposes the model to 'future' context during training while keeping generation autoregressive.
The fundamental issue: you cannot generate what you haven't seen. Autoregressive models respect causality by construction—at each step, only the past exists. Bidirectional models assume the full sequence is observed, which is true for analysis but impossible for generation. This isn't a limitation to overcome; it's a fundamental constraint of the task.
The autoregressive framework provides the foundation for neural sequence modeling. We have developed both the theoretical underpinnings and the practical considerations: the chain-rule factorization, maximum likelihood training with teacher forcing, sampling-based generation, and evaluation via perplexity and its link to compression.
What's Next:
With the autoregressive framework established, we now turn to the specific motivation for Recurrent Neural Networks. How do we efficiently compute the conditionals $P(x_t | \mathbf{x}_{<t})$ when the history can be arbitrarily long? The RNN answer: maintain a hidden state that summarizes all relevant history.
You now understand the complete autoregressive paradigm—from probabilistic foundations through training and generation to evaluation. This framework underlies language models from simple RNNs to GPT-4. The next page motivates the specific architectural choices that make RNNs effective for this task.