Standard classification predicts a single label; regression predicts a single number. But many real-world predictions are structured—the output is a complex object where components are interdependent. This is the domain of structured prediction.
Consider tasks such as part-of-speech tagging (label every word in a sentence), dependency parsing (predict a tree over the words), and machine translation (produce a sentence in another language).
In each case, the output has internal structure that must be respected. Predicting each component independently ignores crucial dependencies.
By the end of this page, you will understand structured prediction at a foundational level: why structure matters, the formal framework of structured outputs, common structures (sequences, trees, graphs), computational challenges of inference and learning, and modern approaches from graphical models to neural sequence-to-sequence methods. This knowledge enables you to tackle prediction problems with complex, interdependent outputs.
To appreciate structured prediction, consider what happens when we ignore structure.
Example: Part-of-Speech Tagging
Input: "Time flies like an arrow"
Goal: Label each word with its part of speech.
Naive approach: Train a classifier to predict each word's POS independently.
Problem: "flies" could be a noun (insects) or verb (travels through air). Without context, we can't tell. Even with local context (surrounding words), the classifier might produce:
But this violates grammatical structure: two nouns in a row without a conjunction is unusual.
Structured approach: Model the sequence of labels, enforcing that transitions between tags follow grammatical patterns. Now we correctly get:

Time/Noun flies/Verb like/Preposition an/Determiner arrow/Noun
Structured prediction models the joint distribution over output components, not just marginals. Instead of P(y₁|x), P(y₂|x), ... independently, we model P(y₁, y₂, ..., yₙ|x). This allows capturing dependencies and enforcing consistency across the entire output.
The Computational Challenge:
Modeling structure introduces computational difficulty. If each of n positions can take one of K labels, there are K^n possible outputs. For n=20 and K=50, that's over 10^33 possibilities—impossible to enumerate.
Structured prediction requires:

- Expressive models that capture dependencies among output components
- Tractable inference: finding the best output without enumerating all candidates
- Trainable learning algorithms that scale to exponentially large output spaces
This trifecta—expressiveness, tractable inference, trainable learning—defines the structured prediction research agenda.
Structured prediction generalizes classification to structured output spaces. Let's formalize.
Basic Setup:

- Input space $\mathcal{X}$ (sentences, images, etc.)
- Structured output space $\mathcal{Y}(x)$ for each input $x$
- Training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ of input-output pairs
The output space $\mathcal{Y}(x)$ is typically combinatorially large and may depend on the input (e.g., number of labels = number of words).
Score-Based Formulation:
Most structured prediction models learn a scoring function: $$s(x, y; \theta): \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$$
that assigns higher scores to better outputs. Prediction becomes optimization: $$\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} s(x, y; \theta)$$
The scoring function decomposes over parts of the structure for tractability: $$s(x, y; \theta) = \sum_{c \in \mathcal{C}(y)} \phi_c(x, y_c; \theta)$$
where $\mathcal{C}(y)$ is a set of 'cliques' or 'factors' and $y_c$ denotes the subset of output variables involved in factor $c$.
| Structure | Decomposition | Example |
|---|---|---|
| Linear chain | Unary + pairwise: $\sum_i \phi(x, y_i) + \sum_i \psi(y_i, y_{i+1})$ | POS tagging, NER |
| Tree | Parent-child potentials: $\sum_{(i,j) \in E} \phi(x, y_i, y_j)$ | Dependency parsing |
| General graph | Clique potentials over arbitrary subsets | Image segmentation, MRFs |
| Set | Sum over selected items: $\sum_{i \in S} \phi(x, i)$ | Subset selection, multi-label |
Probabilistic Formulation:
Alternatively, define a probability distribution over outputs: $$p(y | x; \theta) = \frac{1}{Z(x; \theta)} \exp(s(x, y; \theta))$$
where $Z(x; \theta) = \sum_{y' \in \mathcal{Y}(x)} \exp(s(x, y'; \theta))$ is the partition function.
This is a Conditional Random Field (CRF) when the graph structure is predefined, or more generally a structured probabilistic model.
Challenge: Computing the partition function $Z$ requires summing over exponentially many outputs. Special structure (chains, trees) allows polynomial-time computation via dynamic programming. General graphs require approximation.
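For a linear chain, the forward algorithm computes $\log Z$ in $O(nK^2)$ time via dynamic programming. A minimal NumPy sketch, where the arrays `phi` and `psi` are illustrative stand-ins for the unary and transition scores defined below:

```python
import numpy as np

def log_partition(phi, psi):
    """Forward algorithm: log Z(x) for a linear-chain model.

    phi: (n, K) unary scores phi(x, y_i, i)
    psi: (K, K) transition scores psi(y_i, y_{i+1})
    Runs in O(n K^2) instead of summing over all K^n sequences.
    """
    n, K = phi.shape
    alpha = phi[0].copy()               # log-sum over length-1 prefixes
    for i in range(1, n):
        # alpha_new[j] = logsumexp_k(alpha[k] + psi[k, j]) + phi[i, j]
        m = alpha[:, None] + psi        # (K, K)
        mx = m.max(axis=0)
        alpha = mx + np.log(np.exp(m - mx).sum(axis=0)) + phi[i]
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())
```

For small `n` and `K` this agrees with brute-force enumeration of all $K^n$ sequences, which is a useful correctness check when implementing CRFs.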
CRFs are discriminative—they model P(y|x) directly. Hidden Markov Models (HMMs) are generative—they model P(x,y) = P(y)P(x|y). CRFs typically outperform HMMs because they can incorporate arbitrary features of x without modeling its distribution, avoiding the 'independence assumptions' that generative models require.
Sequence labeling is the most common structured prediction task: given a sequence of inputs, predict a sequence of labels.
Formal Setting:

- Input: a sequence $x = (x_1, \ldots, x_n)$
- Output: a label sequence $y = (y_1, \ldots, y_n)$ with each $y_i \in \{1, \ldots, K\}$
Output space size: $K^n$ — exponential in sequence length
Linear-Chain CRF:
The workhorse model for sequence labeling. Score decomposes as: $$s(x, y) = \sum_{i=1}^{n} \phi(x, y_i, i) + \sum_{i=1}^{n-1} \psi(y_i, y_{i+1})$$
where:

- $\phi(x, y_i, i)$ is the emission (unary) score for assigning label $y_i$ at position $i$; it may depend on the entire input $x$
- $\psi(y_i, y_{i+1})$ is the transition score for the adjacent label pair
Inference via Viterbi Algorithm:
Despite $K^n$ possible outputs, the optimal sequence can be found in $O(nK^2)$ time via dynamic programming.
Key insight: The Markov assumption ($y_i$ depends only on $y_{i-1}$, not earlier labels) enables tractable inference. Higher-order dependencies (trigram, etc.) increase complexity to $O(nK^m)$ for order $m-1$.
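A minimal NumPy sketch of Viterbi decoding under the linear-chain decomposition above; the `phi` and `psi` arrays are illustrative stand-ins for the unary and transition scores:

```python
import numpy as np

def viterbi(phi, psi):
    """Find argmax_y sum_i phi[i, y_i] + sum_i psi[y_i, y_{i+1}] in O(n K^2).

    phi: (n, K) unary scores; psi: (K, K) transition scores.
    Returns the best label sequence and its score.
    """
    n, K = phi.shape
    delta = phi[0].copy()              # best score of prefixes ending in each label
    back = np.zeros((n, K), dtype=int) # backpointers to the best previous label
    for i in range(1, n):
        scores = delta[:, None] + psi  # (K, K): previous label -> current label
        back[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + phi[i]
    # Trace back the best path from the best final label
    y = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1], float(delta.max())
```

Despite the $K^n$ candidate outputs, the dynamic program touches only $nK^2$ transition scores, which is what makes linear-chain CRF training and decoding practical.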
The Label Bias Problem:
MEMMs (locally normalized models) suffer from label bias: states with few outgoing transitions effectively ignore observations. In the extreme, a state with a single outgoing transition must assign that transition probability 1 regardless of the observation, so the evidence at that step is discarded.
CRFs (globally normalized) avoid this by normalizing over entire sequences, ensuring observations always influence predictions.
Modern Neural Approach:
Currently, the dominant approach is one of:

(a) A pre-trained neural encoder (e.g., BERT) with an independent softmax over labels at each position
(b) The same encoder with a CRF output layer on top that models label transitions
For many tasks, (a) works well if the encoder is powerful enough. For tasks with strong output constraints (NER, especially), (b) often helps.
Add a CRF output layer when: output has hard constraints (e.g., I-PER can't follow I-LOC), label transitions carry significant information, or independent softmax gives inconsistent outputs. For simple sequence labeling with powerful encoders, independent softmax often suffices and trains faster.
Beyond sequences, many outputs have tree or graph structure: parse trees, dependency graphs, scene graphs, molecular structures.
Dependency Parsing:
Given a sentence, predict a tree where:

- Each word has exactly one head (parent): another word or a special ROOT node
- Each edge is labeled with a grammatical relation (subject, object, modifier, ...)
- The edges form a tree spanning all words
Output space: All rooted trees over n nodes, roughly n^(n-1) possibilities (Cayley's formula gives n^(n-2) labeled trees, each of which can be rooted at any of the n nodes)
Approach 1: Graph-based parsing. Score every candidate head-dependent edge, then find the maximum-scoring tree (e.g., via the Chu-Liu-Edmonds maximum spanning arborescence algorithm).
Approach 2: Transition-based parsing. Build the tree incrementally through a sequence of shift/reduce actions chosen by a classifier; typically fast (often linear time), but greedy decisions can propagate errors.
Constituency Parsing:
Predict a hierarchical tree where:

- Leaves are the words of the sentence
- Internal nodes are labeled with phrase categories (NP, VP, S, ...)
- Each internal node spans a contiguous substring of the sentence
Approaches: chart parsing with the CKY algorithm over a (probabilistic) context-free grammar, and span-based neural models that score labeled spans directly.
Modern state-of-the-art: Transformer-based models that predict spans directly or use constituency-specific architectures.
| Structure | Example Tasks | Inference Algorithm | Complexity |
|---|---|---|---|
| Sequence | POS tagging, NER, chunking | Viterbi (DP) | O(nK²) |
| Projective tree | Dependency parsing (projective) | Eisner's algorithm | O(n³) |
| Non-projective tree | Dependency parsing (general) | Chu-Liu-Edmonds | O(n²) |
| Context-free tree | Constituency parsing | CKY algorithm | O(n³G) |
| General graph | Scene graphs, knowledge graphs | Approximate inference | NP-hard in general |
| Alignment | Machine translation (word alignment) | IBM models, EM | O(n²) to O(n⁴) |
For general graph structures (cycles allowed, arbitrary potentials), exact inference is NP-hard. Approximate methods are required: loopy belief propagation, sampling (MCMC), variational inference, or neural approaches that avoid explicit inference. Designing tractable yet expressive structures is a core challenge.
When input and output are both sequences but of different lengths, we need sequence-to-sequence (seq2seq) models. This covers:

- Machine translation: a sentence in one language to a sentence in another
- Summarization: a long document to a short summary
- Other generation tasks where the output length is not determined by the input length
The output length is itself a prediction—we don't know it in advance.
Encoder-Decoder Architecture:
The foundational seq2seq architecture: an encoder maps the input sequence to hidden representations, and a decoder generates the output conditioned on them, factorizing the output distribution as:
$$p(y_1, \ldots, y_m | x_1, \ldots, x_n) = \prod_{t=1}^{m} p(y_t | y_{<t}, x)$$
Decoder generates autoregressively: Each token conditions on all previous tokens.
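Greedy autoregressive decoding can be sketched in a few lines; here `step_fn` is a hypothetical stand-in for a decoder that returns next-token logits given the prefix (and, implicitly, the encoded input):

```python
import numpy as np

def greedy_decode(step_fn, bos, eos, max_len=20):
    """Greedy autoregressive decoding sketch.

    step_fn(prefix) -> logits over the vocabulary for the next token.
    Generation stops at the end-of-sequence token or at max_len, which is
    how the model decides the output length at prediction time.
    """
    y = [bos]
    for _ in range(max_len):
        tok = int(np.argmax(step_fn(y)))   # pick the highest-scoring next token
        y.append(tok)
        if tok == eos:
            break
    return y

# Toy step function (hypothetical): next token = prefix length, capped at 3 (= eos)
toy = lambda prefix: np.eye(5)[min(len(prefix), 3)]
print(greedy_decode(toy, bos=0, eos=3, max_len=10))  # -> [0, 1, 2, 3]
```

In practice beam search keeps several candidate prefixes instead of one, trading computation for better approximate argmax decoding.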
Attention Mechanism:
Simple encoder-decoder compresses input to fixed vector—bottleneck. Attention allows the decoder to focus on different input positions for different output tokens.
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
At each decoding step, compute relevance of each encoder position, weight accordingly.
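The attention equation above translates directly into NumPy; this sketch handles a single head with no masking:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
    Each output row is a weighted average of the value rows.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (m, n) relevance scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax: rows sum to 1
    return w @ V                                       # (m, d_v)
```

With a single key, the softmax weight is 1 and the output is exactly that key's value, which is a quick sanity check on any implementation.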
Transformer Architecture:
The dominant seq2seq architecture today:

- Self-attention replaces recurrence in both encoder and decoder
- Cross-attention lets each decoder position attend to every encoder position
- Positional encodings supply word-order information
Advantages:

- Training parallelizes across sequence positions (no sequential recurrence)
- Attention models long-range dependencies directly
- Scales effectively to large models and datasets
Decoder variants:

- Encoder-decoder models (T5, BART) with separate encoder and decoder stacks
- Decoder-only models, which concatenate input and output into a single sequence
Modern practice leverages pre-trained models: T5, BART, mT5. These models are trained on massive text corpora with denoising objectives, then fine-tuned for specific tasks. They provide strong initialization and often outperform task-specific architectures trained from scratch.
Training structured prediction models presents unique challenges beyond standard classification. The output space is exponentially large, and loss functions may not decompose cleanly.
Loss Functions for Structure:
Negative Log-Likelihood (CRF loss): $$\mathcal{L} = -\log p(y|x) = -s(x,y) + \log Z(x)$$
Requires computing the partition function Z—tractable only for special structures.
Structured Hinge Loss (Structured SVM): $$\mathcal{L} = \max_{y' \in \mathcal{Y}} [s(x, y') + \Delta(y, y')] - s(x, y)$$
where $\Delta(y, y')$ is a task-specific loss. Requires loss-augmented inference.
Structured Perceptron:
Simple online learning algorithm:

1. Predict $\hat{y} = \arg\max_{y'} s(x, y'; \theta)$
2. If $\hat{y} \neq y$, update $\theta \leftarrow \theta + f(x, y) - f(x, \hat{y})$, where $f(x, y)$ is the feature vector of the pair
Push up features of correct output, push down features of predicted output.
Advantages: simple, no partition function needed.
Disadvantages: uses no margin, and may not converge to a maximum-margin separator.
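One structured perceptron step can be sketched as follows; `feats` and `argmax_fn` are hypothetical stand-ins for the feature map and the inference routine (e.g., Viterbi for chains):

```python
import numpy as np

def perceptron_step(theta, feats, x, y_true, argmax_fn):
    """One structured perceptron update.

    feats(x, y) -> feature vector of an (input, output) pair.
    argmax_fn(theta, x) -> highest-scoring output under theta.
    On a mistake, move theta toward the gold features and away from
    the predicted features; no partition function is ever computed.
    """
    y_hat = argmax_fn(theta, x)
    if y_hat != y_true:
        theta = theta + feats(x, y_true) - feats(x, y_hat)
    return theta
```

On a toy problem with brute-force argmax over a handful of candidate outputs, a few such steps already make the gold output the highest-scoring one.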
Structured SVM:
Add margin: Require $s(x, y^*) \geq s(x, y') + \Delta(y^*, y')$ for all $y' \neq y^*$.
Solve a convex optimization problem with exponentially many constraints—handled via constraint generation (adding violated constraints) or cutting-plane methods.
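The structured hinge loss can be illustrated by brute-force loss-augmented inference over a toy output space; real models replace the enumeration with loss-augmented Viterbi or similar:

```python
import numpy as np
from itertools import product

def structured_hinge(score_fn, y_gold, cands, delta):
    """Structured hinge loss: max_y' [s(y') + Delta(y_gold, y')] - s(y_gold).

    Brute-force loss-augmented inference; only feasible here because the
    candidate set is tiny.
    """
    aug = [score_fn(y) + delta(y_gold, y) for y in cands]
    return max(aug) - score_fn(y_gold)

# Toy 2-position, binary-label output space with Hamming loss and unary scores
hamming = lambda y, yp: sum(a != b for a, b in zip(y, yp))
cands = list(product([0, 1], repeat=2))
phi = np.array([[2.0, 0.0], [0.0, 1.0]])       # illustrative unary scores only
score = lambda y: phi[0, y[0]] + phi[1, y[1]]
```

A gold output whose score beats every competitor by at least its Hamming distance incurs zero loss; otherwise the loss grows with both the score gap and the structural error.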
| Algorithm | Loss Type | Inference Needed | Properties |
|---|---|---|---|
| CRF (max likelihood) | Negative log-likelihood | Forward-backward (Z computation) | Well-calibrated probabilities |
| Structured SVM | Structured hinge | Loss-augmented inference | Maximum margin, sparse |
| Structured Perceptron | Perceptron loss | Argmax only | Simple, online, no margin |
| Softmax (per-position) | Cross-entropy | None (independent) | Fast, ignores structure |
| Seq2seq (teacher forcing) | Cross-entropy | None (given prefix) | Exposure bias potential |
Neural Approaches to Structured Prediction:
Modern neural methods often sidestep explicit structured inference:
1. Powerful Encoders: If the encoder (BERT, transformer) is powerful enough, per-position softmax may suffice—the encoder captures structure implicitly.
2. Autoregressive Generation: For seq2seq, generate output left-to-right, conditioning on previous outputs. Structure emerges from sequential generation.
3. Neural CRF: Combine neural features with a CRF output layer for best of both worlds.
4. Reinforcement Learning: Train with task reward (BLEU, F1) using policy gradient methods. Addresses exposure bias but high variance.
Trade-off: Explicit structure provides guarantees (valid outputs, tractable inference) but adds complexity. Implicit structure via powerful models is simpler but may produce invalid outputs.
In structured prediction, inference (finding the best output) is intimately connected to learning. Many training algorithms require inference as a subroutine (structured perceptron, structured SVM). If inference is intractable, learning becomes intractable. This is why tractable inference is so important.
Evaluating structured predictions requires metrics that capture structural correctness, not just component-level accuracy.
Sequence-Level Metrics:
Exact Match (EM): Fraction of sequences predicted entirely correctly. $$EM = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}^{(i)} = y^{(i)}]$$
Very strict—one wrong label makes the whole sequence wrong.
Token-Level Accuracy: Average accuracy across all tokens. $$Acc = \frac{1}{\sum_i n_i} \sum_{i=1}^{N} \sum_{t=1}^{n_i} \mathbb{1}[\hat{y}_t^{(i)} = y_t^{(i)}]$$
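Both metrics are a few lines of Python; this sketch assumes predictions and golds are lists of equal-length label sequences:

```python
def exact_match(preds, golds):
    """Fraction of sequences predicted entirely correctly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def token_accuracy(preds, golds):
    """Per-token accuracy pooled over all sequences."""
    correct = sum(pt == gt for p, g in zip(preds, golds)
                  for pt, gt in zip(p, g))
    total = sum(len(g) for g in golds)
    return correct / total
```

A single wrong label drops exact match for the whole sequence while barely moving token accuracy, which is why the two metrics can diverge sharply on long sequences.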
| Task | Metric | What It Measures |
|---|---|---|
| NER | Entity-level F1 | Correct entity spans and types |
| NER | Span F1 (strict) | Exact boundary match required |
| POS tagging | Token accuracy | Per-token correctness |
| Parsing | Labeled Attachment Score (LAS) | Correct head + label |
| Parsing | Unlabeled Attachment Score (UAS) | Correct head only |
| Constituency parsing | Bracketed F1 | Correct bracket spans |
| Translation | BLEU | N-gram overlap with reference |
| Translation | METEOR, BLEURT | Semantic similarity variants |
| Summarization | ROUGE | Recall-oriented n-gram overlap |
| General generation | BERTScore | Semantic similarity via embeddings |
BLEU (Bilingual Evaluation Understudy):
The standard metric for machine translation: $$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
where:

- $p_n$ is the modified (clipped) n-gram precision
- $w_n$ are weights, typically uniform ($w_n = 1/N$) with $N = 4$
- $BP$ is the brevity penalty, which penalizes candidates shorter than the reference
Limitations of BLEU:

- Measures surface n-gram overlap, not meaning; valid paraphrases are penalized
- Correlates poorly with human judgments at the sentence level
- Sensitive to tokenization details, complicating comparisons across papers
Modern alternatives: BERTScore (embedding-based), BLEURT (learned metric), human evaluation (gold standard but expensive).
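For intuition, a simplified single-reference, unsmoothed sentence-level BLEU can be sketched as follows; use an established implementation such as sacrebleu in practice:

```python
import math
from collections import Counter

def bleu(cand, ref, max_n=4):
    """Simplified sentence-level BLEU sketch: single reference, uniform
    weights, no smoothing -- illustrative only."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it appears in the reference
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if clipped == 0:
            return 0.0        # unsmoothed: any zero precision -> BLEU of 0
        log_p += math.log(clipped / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))   # brevity penalty
    return bp * math.exp(log_p)
```

An identical candidate and reference score 1.0; a candidate sharing no words with the reference scores 0.0, since one zero n-gram precision zeroes the unsmoothed geometric mean.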
Training loss and evaluation metric often differ. Models optimize cross-entropy but are evaluated on BLEU, F1, or exact match. This mismatch can cause suboptimal performance. Techniques like reinforcement learning fine-tuning (REINFORCE) can optimize metrics directly but add training complexity.
We've explored structured prediction comprehensively—why structure matters, formal frameworks, sequence labeling, tree and graph structures, seq2seq models, learning algorithms, and evaluation. Let's synthesize the key insights.
Connection to Other Problem Types:
Structured prediction connects broadly:

- Classification and regression are the degenerate case of a single, unstructured output
- Multi-label prediction is structured prediction over sets
- Autoregressive seq2seq generation borders on generative modeling, the topic of the next page
What's Next:
We turn to generative modeling—learning to generate new data that resembles training data. While structured prediction focuses on predicting given inputs, generative models learn the underlying distribution itself, enabling synthesis of novel examples.
You now possess a comprehensive understanding of structured prediction in machine learning. From sequences through complex graphs, from traditional CRFs to modern transformers, you have the conceptual framework to tackle any prediction problem with interdependent outputs. Next, we explore generative modeling—learning to synthesize new data.