When you read a sentence like 'The cat sat on the ___', your brain automatically generates predictions for the next word—'mat', 'chair', 'floor'. You don't predict the entire rest of the document; you focus on the immediate next token. This simple insight underlies one of the most powerful frameworks in machine learning: autoregressive modeling.
An autoregressive model learns to predict the next element given all previous elements. By chaining these predictions—each output becoming input for the next—we can model entire sequences of arbitrary length. This framework is the foundation of language models from simple n-grams to GPT-4, of speech synthesis systems, of video prediction, and of many other generative models for sequential data.
In this page, we develop the autoregressive framework rigorously. We derive the training objective, show how maximum likelihood leads to next-token prediction, explore generation procedures, and connect autoregressive modeling to information theory and compression.
By the end of this page, you will: (1) Formalize the autoregressive factorization, (2) Derive the maximum likelihood training objective, (3) Understand teacher forcing and its implications, (4) Master generation procedures including sampling methods, and (5) Connect autoregressive models to compression and perplexity.
The autoregressive factorization decomposes the joint probability of a sequence into a product of conditional probabilities:
$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1, x_2, \ldots, x_{t-1}) = \prod_{t=1}^{T} P(x_t | \mathbf{x}_{<t})$$
This is not an approximation—it follows directly from the chain rule of probability and holds for any joint distribution. The key modeling decision is how to parameterize each conditional distribution $P(x_t | \mathbf{x}_{<t})$.
In an autoregressive model, we use a single parameterized function (typically a neural network) to represent all these conditionals:
$$P_\theta(x_t | \mathbf{x}_{<t}) = f_\theta(\mathbf{x}_{<t})$$
where $\theta$ are learnable parameters and the output is a probability distribution over possible values of $x_t$.
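To make the factorization concrete, here is a minimal sketch that scores a short toy sequence by chaining conditionals. The table `cond` is only a stand-in for what a trained network $f_\theta$ would produce for each prefix:

```python
import torch

# Toy vocabulary {0, 1, 2}; cond[t] stands in for P(x_t | x_<t) given the
# actual prefix seq[:t]. A real model would compute these from the history.
torch.manual_seed(0)
seq = [2, 0, 1]                                          # sequence to score
cond = torch.softmax(torch.randn(len(seq), 3), dim=-1)   # stand-in conditionals

# Joint probability as a product of conditionals (the factorization above)
joint = 1.0
for t, token in enumerate(seq):
    joint *= cond[t, token].item()

# Equivalently, sum log-conditionals and exponentiate (numerically preferred)
log_joint = sum(torch.log(cond[t, token]).item() for t, token in enumerate(seq))
print(joint, torch.tensor(log_joint).exp().item())       # the two agree
```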
The term 'autoregressive' comes from time series analysis, where it refers to models that regress a variable on its own past values: $x_t = \sum \phi_i x_{t-i} + \epsilon_t$. Neural autoregressive models generalize this to arbitrary nonlinear dependencies and non-Gaussian distributions, but retain the core idea: predict the present from the past.
Components of an Autoregressive Model:
Encoder: Maps the history $\mathbf{x}_{<t}$ to a fixed-size representation $h_t$
Predictor: Maps the representation to output distribution parameters
Output distribution: Defines $P(x_t | h_t)$ for the domain
For discrete sequences (text): $$P(x_t = k | h_t) = \text{softmax}(W h_t + b)_k = \frac{\exp(W_k^\top h_t + b_k)}{\sum_j \exp(W_j^\top h_t + b_j)}$$
For continuous sequences: $$P(x_t | h_t) = \mathcal{N}(\mu(h_t), \sigma^2(h_t))$$
or more flexible distributions like mixtures of Gaussians, normalizing flows, etc.
| Domain | Output Type | Distribution | Parameters |
|---|---|---|---|
| Text (word-level) | Categorical | Softmax over vocabulary | $V$ logits |
| Text (character) | Categorical | Softmax over characters | ~100-300 logits |
| Audio (WaveNet) | Categorical (μ-law) | Softmax over 256 values | 256 logits |
| Audio (continuous) | Mixture of Gaussians | MoG density | $(\mu_k, \sigma_k, \pi_k)$ for each of $K$ components |
| Time series | Gaussian | $\mathcal{N}(\mu, \sigma^2)$ | Mean and variance |
| Video (pixels) | Categorical per channel | Softmax over 256 values × 3 | 768 logits |
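For the continuous case, an output head might look like the following sketch: a small module (hypothetically named `GaussianHead`) maps the history representation $h_t$ to the mean and standard deviation of a Gaussian, and the per-step loss is the negative log-likelihood, the continuous analogue of cross-entropy:

```python
import torch
import torch.nn as nn


class GaussianHead(nn.Module):
    """Illustrative output head for continuous sequences: maps the history
    representation h_t to the parameters of a Gaussian over x_t."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, 1)
        self.log_sigma = nn.Linear(hidden_dim, 1)  # predict log std so sigma > 0

    def forward(self, h_t: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(
            loc=self.mu(h_t),
            scale=self.log_sigma(h_t).exp(),
        )


# Usage: the per-step loss is the negative log-likelihood of the true value
head = GaussianHead(hidden_dim=16)
h_t = torch.randn(4, 16)                # batch of history representations
x_t = torch.randn(4, 1)                 # true next values
nll = -head(h_t).log_prob(x_t).mean()   # analogue of cross-entropy for continuous data
```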
Given a dataset of sequences $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}\}$, we train the autoregressive model by maximum likelihood estimation (MLE)—finding parameters $\theta$ that maximize the probability of the observed data:
$$\theta^* = \arg\max_\theta \prod_{n=1}^{N} P_\theta(\mathbf{x}^{(n)})$$
Taking logarithms (which doesn't change the argmax) and using the autoregressive factorization:
$$\theta^* = \arg\max_\theta \sum_{n=1}^{N} \log P_\theta(\mathbf{x}^{(n)}) = \arg\max_\theta \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log P_\theta(x_t^{(n)} | \mathbf{x}_{<t}^{(n)})$$
The Cross-Entropy Loss:
Converting maximization to minimization (for gradient descent), and averaging over tokens, we obtain the cross-entropy loss:
$$\mathcal{L}(\theta) = -\frac{1}{\sum_n T_n} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log P_\theta(x_t^{(n)} | \mathbf{x}_{<t}^{(n)})$$
For categorical outputs, this is precisely the categorical cross-entropy between the model's predicted distribution and the one-hot encoded ground truth:
$$\mathcal{L}_t = -\sum_{k=1}^{V} \mathbf{1}[x_t = k] \log P_\theta(x_t = k | \mathbf{x}_{<t}) = -\log P_\theta(x_t | \mathbf{x}_{<t})$$
The loss at each timestep measures how well the model predicted the actual next token. Summing over all positions gives the total sequence loss.
The cross-entropy loss penalizes the model when it assigns low probability to the true next token. If the model predicts $P(\text{'cat'}) = 0.01$ but the true next word is 'cat', the loss is $-\log(0.01) \approx 4.6$—quite high. If it predicts $P(\text{'cat'}) = 0.9$, the loss is only $-\log(0.9) \approx 0.1$. The model learns to assign high probability to likely continuations.
Gradient Computation:
The gradient of the loss with respect to parameters $\theta$ decomposes across timesteps:
$$\nabla_\theta \mathcal{L} = -\frac{1}{\sum_n T_n} \sum_{n,t} \nabla_\theta \log P_\theta(x_t^{(n)} | \mathbf{x}_{<t}^{(n)})$$
For softmax outputs, the gradient with respect to logits $z_t$ has a particularly elegant form:
$$\frac{\partial \mathcal{L}_t}{\partial z_k} = P_\theta(x_t = k | \mathbf{x}_{<t}) - \mathbf{1}[x_t = k]$$
This is simply the predicted probability minus the one-hot target: push up the probability of the correct token, push down all others proportionally.
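This identity is easy to verify numerically. The short sketch below compares the autograd gradient of the cross-entropy with the predicted-probability-minus-one-hot formula for a single position:

```python
import torch
import torch.nn.functional as F

# Numerical check: d(cross-entropy)/d(logits) = softmax(logits) - one_hot(target)
vocab_size = 5
logits = torch.randn(vocab_size, requires_grad=True)
target = torch.tensor(2)

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

expected = F.softmax(logits.detach(), dim=-1) - F.one_hot(target, vocab_size).float()
print(torch.allclose(logits.grad, expected, atol=1e-6))  # True
```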
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def autoregressive_loss(logits, targets, reduction='mean'):
    """
    Compute cross-entropy loss for autoregressive prediction.

    Args:
        logits: Model outputs, shape [batch, seq_len, vocab_size]
        targets: Ground truth token IDs, shape [batch, seq_len]
        reduction: 'mean', 'sum', or 'none'

    Returns:
        Loss value (averaged over valid tokens if reduction='mean')
    """
    batch_size, seq_len, vocab_size = logits.shape

    # Reshape for cross_entropy: [batch * seq_len, vocab_size]
    logits_flat = logits.view(-1, vocab_size)
    targets_flat = targets.view(-1)

    # Compute per-token cross-entropy
    # This is -log P(target_token | context)
    loss = F.cross_entropy(
        logits_flat,
        targets_flat,
        reduction=reduction,
        ignore_index=-100  # Ignore padding tokens
    )

    return loss


def compute_perplexity(model, data_loader) -> float:
    """
    Compute perplexity: exp(average cross-entropy loss).
    Lower perplexity = better model.
    """
    model.eval()
    total_loss = 0.0
    total_tokens = 0

    with torch.no_grad():
        for batch in data_loader:
            inputs, targets = batch['input_ids'], batch['labels']

            # Forward pass
            logits = model(inputs)

            # Compute loss (sum over tokens)
            loss = autoregressive_loss(logits, targets, reduction='sum')

            # Count non-padding tokens
            valid_tokens = (targets != -100).sum().item()

            total_loss += loss.item()
            total_tokens += valid_tokens

    # Perplexity = exp(average negative log likelihood)
    avg_nll = total_loss / total_tokens
    perplexity = torch.exp(torch.tensor(avg_nll)).item()

    return perplexity
```

Teacher forcing is the standard training procedure for autoregressive models. During training, we condition on the ground-truth history rather than the model's own predictions:
$$\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(x_t | x_1, x_2, \ldots, x_{t-1})$$
where $x_1, \ldots, x_{t-1}$ are the true tokens from the training sequence, not tokens generated by the model.
Why teacher forcing?
Efficiency: We can compute all predictions in parallel (for Transformers) or in a single forward pass (for RNNs with ground-truth inputs)
Stable gradients: Early in training, model predictions are random. Conditioning on garbage would produce garbage—useless gradients.
Dense supervision: Every position provides a learning signal. We don't wait for errors to compound before correction.
Teacher forcing creates a mismatch between training and inference. During training, the model sees perfect history; during generation, it sees its own (imperfect) predictions. This 'exposure bias' can lead to cascading errors: one wrong prediction leads to an unfamiliar state, leading to more errors, and so on. The model has never learned to recover from mistakes.
Mitigation Strategies:
1. Scheduled Sampling
Gradually increase the probability of using model predictions during training:
$$p_t = \begin{cases} x_t^{\text{true}} & \text{with probability } \epsilon \\ x_t^{\text{pred}} & \text{with probability } 1 - \epsilon \end{cases}$$
where $\epsilon$ decreases over training (e.g., from 1.0 to 0.1). This exposes the model to its own errors, but the approach is heuristic and can destabilize training.
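A sketch of what one scheduled-sampling training step might look like for a step-wise recurrent model; the `model(input_emb, state) -> (logits, state)` and `embed` interfaces are assumptions made purely for illustration:

```python
import torch
import torch.nn.functional as F


def scheduled_sampling_step(model, embed, x_true, epsilon: float):
    """One training pass with scheduled sampling (sketch).

    x_true: ground-truth tokens, shape [batch, seq_len]
    epsilon: probability of feeding the ground-truth token (decayed over training)
    """
    batch, seq_len = x_true.shape
    state = None
    prev_token = x_true[:, 0]          # always start from the true first token
    total_loss = 0.0

    for t in range(1, seq_len):
        logits, state = model(embed(prev_token), state)
        total_loss = total_loss + F.cross_entropy(logits, x_true[:, t])

        # With probability epsilon feed the true token as the next input,
        # otherwise feed the model's own (greedy) prediction.
        use_truth = torch.rand(batch, device=x_true.device) < epsilon
        prev_token = torch.where(use_truth, x_true[:, t], logits.argmax(dim=-1))

    return total_loss / (seq_len - 1)
```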
2. Sequence-Level Training
Use training objectives that score entire generated sequences rather than individual tokens, for example reinforcement-learning-based fine-tuning on sequence-level rewards or minimum risk training.
3. Data Augmentation
Expose the model to perturbed inputs during training, simulating the noisy inputs it might see during generation.
4. Robust Decoding
At inference time, use decoding strategies that are less sensitive to individual errors (e.g., beam search, nucleus sampling with reranking).
The Parallelization Advantage:
Teacher forcing enables parallel computation during training. Consider a sequence of length $T$: because every input position is the known ground-truth token, all $T$ next-token predictions can be computed at once in a single forward pass, whereas free-running generation would require $T$ sequential steps.
This parallelization is crucial for training on modern hardware (GPUs/TPUs) and scales to sequences of thousands of tokens.
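The following sketch shows the usual setup for a causally masked model: inputs are the sequence without its last token, targets are the sequence shifted by one, and every position is scored in a single forward pass (the `model` interface is an assumption for illustration):

```python
import torch
import torch.nn.functional as F


def teacher_forcing_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """One parallel training step under teacher forcing (sketch).

    `model` is assumed to be a causally masked network returning logits of
    shape [batch, seq_len, vocab_size]; token_ids has shape [batch, seq_len].
    """
    inputs = token_ids[:, :-1]    # x_1 ... x_{T-1} fed as context
    targets = token_ids[:, 1:]    # x_2 ... x_T predicted at each position

    logits = model(inputs)        # all positions computed in one forward pass
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,        # skip padding positions, as in the loss above
    )
```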
Once trained, an autoregressive model can generate new sequences by sampling from the learned distribution. The generation process is inherently sequential: sample $x_1$ from $P_\theta(x_1)$ (or start from a prompt), feed it back as input, sample $x_2$ from $P_\theta(x_2 | x_1)$, and repeat until an end-of-sequence token or a length limit is reached.
The sampling step introduces stochasticity—the same prompt can produce different outputs. Various sampling strategies balance diversity against quality.
Greedy Decoding:
The simplest strategy: always select the most probable token:
$$x_t = \arg\max_k P_\theta(x_t = k | \mathbf{x}_{<t})$$
Pros: deterministic and reproducible; cheap, since no sampling is required.
Cons: locally optimal choices can miss globally more probable sequences, and outputs tend to be bland and repetitive (degenerate loops are a well-known failure mode).
When to use: tasks with an essentially unique correct answer (e.g., extractive QA or code completion) where diversity is not desired.
```python
import torch
import torch.nn.functional as F


def sample_next_token(
    logits: torch.Tensor,
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0
) -> torch.Tensor:
    """
    Sample next token using temperature, top-k, and/or top-p.

    Args:
        logits: Unnormalized log-probs [batch_size, vocab_size]
        temperature: Softmax temperature (lower = more greedy)
        top_k: Keep only top k tokens (0 = disabled)
        top_p: Keep tokens with cumulative prob >= p (1.0 = disabled)

    Returns:
        Sampled token indices [batch_size]
    """
    # Apply temperature
    if temperature != 1.0:
        logits = logits / temperature

    # Apply top-k filtering
    if top_k > 0:
        top_k_logits, top_k_indices = torch.topk(logits, top_k, dim=-1)
        # Set all other logits to -inf
        logits = torch.full_like(logits, float('-inf'))
        logits.scatter_(-1, top_k_indices, top_k_logits)

    # Apply top-p (nucleus) filtering
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cumsum_probs = torch.cumsum(probs, dim=-1)

        # Remove tokens where cumsum > top_p (keep first that exceeds)
        sorted_mask = cumsum_probs > top_p
        # Shift right to keep first token that exceeds threshold
        sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
        sorted_mask[..., 0] = False

        # Set filtered logits to -inf
        sorted_logits[sorted_mask] = float('-inf')
        # Unsort
        logits = sorted_logits.gather(-1, sorted_indices.argsort(-1))

    # Sample
    probs = F.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)

    return next_token


def greedy_decode(model, input_ids, max_length):
    """Greedy decoding - always pick highest prob token."""
    for _ in range(max_length):
        logits = model(input_ids)[:, -1, :]  # Last position
        next_token = logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

        if next_token.item() == model.eos_token_id:
            break

    return input_ids
```

Perplexity is the standard metric for evaluating autoregressive models. It measures how 'surprised' the model is by the test data—lower perplexity means better prediction.
Definition:
$$\text{PPL}(\mathbf{x}) = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t | \mathbf{x}_{<t}) \right) = \exp(\mathcal{L})$$
Perplexity is the exponential of the average cross-entropy loss. This has an intuitive interpretation: perplexity equals the effective number of equally likely choices the model considers at each step.
If a model has perplexity 100 on English text, it's as uncertain as if it were choosing uniformly from 100 equally likely words at each position. A perplexity of 10 means the model has narrowed down to about 10 plausible options. State-of-the-art language models achieve perplexities of 10-30 on standard benchmarks—they've learned substantial structure.
| Model | Dataset | Perplexity | Year |
|---|---|---|---|
| Unigram baseline | Penn Treebank | ~1000 | |
| Kneser-Ney trigram | Penn Treebank | ~140 | 1995 |
| LSTM (large) | Penn Treebank | ~60 | 2016 |
| AWD-LSTM | Penn Treebank | ~52 | 2017 |
| Transformer-XL | Penn Treebank | ~24 | 2019 |
| GPT-2 (large) | Penn Treebank | ~35* | 2019 |
| GPT-3 (175B) | Various | ~20-30* | 2020 |

*Zero-shot evaluation; not directly comparable to models trained on the benchmark's training set.
Connection to Compression:
Perplexity connects directly to information theory and data compression:
$$\text{Bits per character} = \log_2(\text{PPL})$$
If the character-level perplexity is 32, the model needs $\log_2(32) = 5$ bits per character on average to encode the sequence. Shannon's source coding theorem tells us that the entropy rate $H$ is the theoretical minimum: no model can achieve perplexity below $2^{H}$, and any gap above it reflects structure the model has failed to capture.
This connection has practical implications: a language model combined with an entropy coder (such as arithmetic coding) is a lossless compressor, and better compression of held-out data is direct evidence of a better model.
Limitations of Perplexity: it measures only next-token prediction quality, so it says little about long-range coherence, factuality, or downstream usefulness, and values are not comparable across different vocabularies or tokenizations.
Autoregressive models naturally support conditional generation: instead of starting from an empty context, we start from a prompt that steers the generation:
$$P(x_{t+1}, \ldots, x_T | x_1, \ldots, x_t) = \prod_{i=t+1}^{T} P(x_i | x_1, \ldots, x_{i-1})$$
The prompt $x_1, \ldots, x_t$ constrains the output distribution. This enables text continuation, question answering, few-shot in-context learning, and control over style and format, as sketched below.
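A minimal sketch of prompted generation, reusing `sample_next_token` from the sampling code above; the `model` interface (token IDs in, logits out) and the parameter defaults are illustrative assumptions:

```python
import torch


@torch.no_grad()
def generate_from_prompt(model, prompt_ids, max_new_tokens=50,
                         temperature=0.8, top_p=0.9, eos_token_id=None):
    """Conditional generation from a prompt (sketch).

    `model` is assumed to return logits of shape [batch, seq_len, vocab_size]
    for a batch of token IDs; `sample_next_token` is defined earlier.
    """
    ids = prompt_ids                                # [batch, prompt_len]
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]               # distribution over the next token
        next_token = sample_next_token(logits, temperature=temperature, top_p=top_p)
        ids = torch.cat([ids, next_token.unsqueeze(-1)], dim=-1)
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break
    return ids
```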
Large language models have revealed that prompting is a powerful paradigm. Instead of fine-tuning model weights for each task, we can 'program' the model through its input. The prompt provides context, examples, and instructions—all interpreted through the autoregressive probability distribution the model has learned.
Types of Conditioning:
1. Explicit Prefixing Simply prepend conditioning information to the sequence: '[STYLE: Shakespeare] To be or not to be...'
2. Control Codes Train with special tokens indicating attributes: '<positive_sentiment> The movie was...'
3. Cross-Attention Conditioning For encoder-decoder models, condition generation on encoded representations: $$P(y_t | y_{<t}, \text{Encode}(x))$$
Used in translation, summarization, image captioning.
4. Latent Conditioning Condition on continuous latent codes: $$P(x | z) = \prod_t P(x_t | x_{<t}, z)$$
Used in VAEs, controlled generation, style mixing.
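A sketch of one common way to implement latent conditioning: broadcast the latent code $z$ to every timestep and concatenate it with the token embedding before the recurrent layer (the module name and sizes here are illustrative):

```python
import torch
import torch.nn as nn


class LatentConditionedLM(nn.Module):
    """Sketch of latent conditioning: every step sees z alongside the history."""

    def __init__(self, vocab_size=1000, embed_dim=64, latent_dim=16, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim + latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, seq_len], z: [batch, latent_dim]
        emb = self.embed(tokens)                              # [B, T, E]
        z_rep = z.unsqueeze(1).expand(-1, emb.size(1), -1)    # broadcast z to every step
        h, _ = self.rnn(torch.cat([emb, z_rep], dim=-1))
        return self.out(h)                                    # logits for P(x_t | x_<t, z)
```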
Prompt Engineering:
The art of crafting effective prompts (task instructions, worked examples, output-format cues) has become crucial for modern LLMs.
While prompting seems ad-hoc, it's principled in the autoregressive framework: we're conditioning the generation on observations that maximize the conditional probability of desired outputs.
The autoregressive factorization imposes a causal structure: to predict $x_t$, we can only use $x_1, \ldots, x_{t-1}$. But for many tasks—classification, tagging, understanding—we have the entire sequence available. Why not use all of it?
Bidirectional models like BERT compute representations that condition on both past AND future:
$$h_t = f(x_1, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_T)$$
This enables richer representations for understanding tasks but precludes direct generation.
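The structural difference is easy to see in the attention masks: a causal model allows position $t$ to attend only to positions $\le t$, while a bidirectional model attends everywhere. A tiny sketch:

```python
import torch

# True = allowed to attend. Row t of the causal mask exposes only positions <= t.
T = 5
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
bidirectional_mask = torch.ones(T, T, dtype=torch.bool)

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```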
Choosing Between Paradigms:
| Task Type | Preferred Approach | Reason |
|---|---|---|
| Text generation | Autoregressive | Must generate sequentially |
| Classification | Bidirectional | Full context improves accuracy |
| Sequence labeling | Bidirectional | Each token benefits from context |
| Translation | Encoder-decoder | Bidirectional encoder, autoregressive decoder |
| Infilling | Specialized | BERT with mask, or XLNet-style permutation |
Hybrid Approaches: prefix language models read the prompt bidirectionally but generate autoregressively, and permutation-based training (as in XLNet) exposes the model to 'future' context during training while keeping generation autoregressive.
The fundamental issue: you cannot generate what you haven't seen. Autoregressive models respect causality by construction—at each step, only the past exists. Bidirectional models assume the full sequence is observed, which is true for analysis but impossible for generation. This isn't a limitation to overcome; it's a fundamental constraint of the task.
The autoregressive framework provides the foundation for neural sequence modeling. We have developed both the theoretical underpinnings and the practical considerations: the chain-rule factorization, maximum likelihood training with teacher forcing, sampling-based generation, and evaluation via perplexity and its link to compression.
What's Next:
With the autoregressive framework established, we now turn to the specific motivation for Recurrent Neural Networks. How do we efficiently compute the conditionals $P(x_t | \mathbf{x}_{<t})$ when the history can be arbitrarily long? The RNN answer: maintain a hidden state that summarizes all relevant history.
You now understand the complete autoregressive paradigm—from probabilistic foundations through training and generation to evaluation. This framework underlies language models from simple RNNs to GPT-4. The next page motivates the specific architectural choices that make RNNs effective for this task.