Standard classification predicts a single label; regression predicts a single number. But many real-world predictions are structured—the output is a complex object where components are interdependent. This is the domain of structured prediction.
Consider tasks such as part-of-speech tagging (label every word in a sentence), dependency parsing (predict a tree over the words), and machine translation (produce a sentence in another language).
In each case, the output has internal structure that must be respected. Predicting each component independently ignores crucial dependencies.
By the end of this page, you will understand structured prediction at a foundational level: why structure matters, the formal framework of structured outputs, common structures (sequences, trees, graphs), computational challenges of inference and learning, and modern approaches from graphical models to neural sequence-to-sequence methods. This knowledge enables you to tackle prediction problems with complex, interdependent outputs.
To appreciate structured prediction, consider what happens when we ignore structure.
Example: Part-of-Speech Tagging
Input: "Time flies like an arrow"
Goal: Label each word with its part of speech.
Naive approach: Train a classifier to predict each word's POS independently.
Problem: "flies" could be a noun (insects) or verb (travels through air). Without context, we can't tell. Even with local context (surrounding words), the classifier might produce:
But this violates grammatical structure: two nouns in a row without a conjunction is unusual.
Structured approach: Model the sequence of labels, enforcing that transitions between tags follow grammatical patterns. Now we correctly get:

Time/Noun flies/Verb like/Preposition an/Determiner arrow/Noun
Structured prediction models the joint distribution over output components, not just marginals. Instead of P(y₁|x), P(y₂|x), ... independently, we model P(y₁, y₂, ..., yₙ|x). This allows capturing dependencies and enforcing consistency across the entire output.
The Computational Challenge:
Modeling structure introduces computational difficulty. If each of n positions can take one of K labels, there are K^n possible outputs. For n=20 and K=50, that's over 10^33 possibilities—impossible to enumerate.
Structured prediction requires:

- Expressive models that capture dependencies among output components
- Tractable inference: finding the best output without enumerating all candidates
- Trainable learning algorithms that scale to exponentially large output spaces
This trifecta—expressiveness, tractable inference, trainable learning—defines the structured prediction research agenda.
Structured prediction generalizes classification to structured output spaces. Let's formalize.
Basic Setup:

- Input space $\mathcal{X}$ (sentences, images, etc.)
- Structured output space $\mathcal{Y}(x)$ for each input $x$
- Training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ of input-output pairs
The output space $\mathcal{Y}(x)$ is typically combinatorially large and may depend on the input (e.g., number of labels = number of words).
Score-Based Formulation:
Most structured prediction models learn a scoring function: $$s(x, y; \theta): \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$$
that assigns higher scores to better outputs. Prediction becomes optimization: $$\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} s(x, y; \theta)$$
The scoring function decomposes over parts of the structure for tractability: $$s(x, y; \theta) = \sum_{c \in \mathcal{C}(y)} \phi_c(x, y_c; \theta)$$
where $\mathcal{C}(y)$ is a set of 'cliques' or 'factors' and $y_c$ denotes the subset of output variables involved in factor $c$.
| Structure | Decomposition | Example |
|---|---|---|
| Linear chain | Unary + pairwise: $\sum_i \phi(x, y_i) + \sum_i \psi(y_i, y_{i+1})$ | POS tagging, NER |
| Tree | Parent-child potentials: $\sum_{(i,j) \in E} \phi(x, y_i, y_j)$ | Dependency parsing |
| General graph | Clique potentials over arbitrary subsets | Image segmentation, MRFs |
| Set | Sum over selected items: $\sum_{i \in S} \phi(x, i)$ | Subset selection, multi-label |
Probabilistic Formulation:
Alternatively, define a probability distribution over outputs: $$p(y | x; \theta) = \frac{1}{Z(x; \theta)} \exp(s(x, y; \theta))$$
where $Z(x; \theta) = \sum_{y' \in \mathcal{Y}(x)} \exp(s(x, y'; \theta))$ is the partition function.
This is a Conditional Random Field (CRF) when the graph structure is predefined, or more generally a structured probabilistic model.
Challenge: Computing the partition function $Z$ requires summing over exponentially many outputs. Special structure (chains, trees) allows polynomial-time computation via dynamic programming. General graphs require approximation.
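For a linear chain, the forward algorithm computes $\log Z$ in $O(nK^2)$ time via dynamic programming. A minimal NumPy sketch, where the arrays `phi` and `psi` are illustrative stand-ins for the unary and transition scores defined below:

```python
import numpy as np

def log_partition(phi, psi):
    """Forward algorithm: log Z(x) for a linear-chain model.

    phi: (n, K) unary scores phi(x, y_i, i)
    psi: (K, K) transition scores psi(y_i, y_{i+1})
    Runs in O(n K^2) instead of summing over all K^n sequences.
    """
    n, K = phi.shape
    alpha = phi[0].copy()               # log-sum over length-1 prefixes
    for i in range(1, n):
        # alpha_new[j] = logsumexp_k(alpha[k] + psi[k, j]) + phi[i, j]
        m = alpha[:, None] + psi        # (K, K)
        mx = m.max(axis=0)
        alpha = mx + np.log(np.exp(m - mx).sum(axis=0)) + phi[i]
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())
```

For small `n` and `K` this agrees with brute-force enumeration of all $K^n$ sequences, which is a useful correctness check when implementing CRFs.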
CRFs are discriminative—they model P(y|x) directly. Hidden Markov Models (HMMs) are generative—they model P(x,y) = P(y)P(x|y). CRFs typically outperform HMMs because they can incorporate arbitrary features of x without modeling its distribution, avoiding the 'independence assumptions' that generative models require.
Sequence labeling is the most common structured prediction task: given a sequence of inputs, predict a sequence of labels.
Formal Setting:

- Input: a sequence $x = (x_1, \ldots, x_n)$
- Output: a label sequence $y = (y_1, \ldots, y_n)$ with each $y_i \in \{1, \ldots, K\}$
Output space size: $K^n$ — exponential in sequence length
Linear-Chain CRF:
The workhorse model for sequence labeling. Score decomposes as: $$s(x, y) = \sum_{i=1}^{n} \phi(x, y_i, i) + \sum_{i=1}^{n-1} \psi(y_i, y_{i+1})$$
where:

- $\phi(x, y_i, i)$ is the emission (unary) score for assigning label $y_i$ at position $i$; it may depend on the entire input $x$
- $\psi(y_i, y_{i+1})$ is the transition score for the adjacent label pair
Inference via Viterbi Algorithm:
Despite $K^n$ possible outputs, the optimal sequence can be found in $O(nK^2)$ time via dynamic programming.
Key insight: The Markov assumption ($y_i$ depends only on $y_{i-1}$, not earlier labels) enables tractable inference. Higher-order dependencies (trigram, etc.) increase complexity to $O(nK^m)$ for order $m-1$.
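A minimal NumPy sketch of Viterbi decoding under the linear-chain decomposition above; the `phi` and `psi` arrays are illustrative stand-ins for the unary and transition scores:

```python
import numpy as np

def viterbi(phi, psi):
    """Find argmax_y sum_i phi[i, y_i] + sum_i psi[y_i, y_{i+1}] in O(n K^2).

    phi: (n, K) unary scores; psi: (K, K) transition scores.
    Returns the best label sequence and its score.
    """
    n, K = phi.shape
    delta = phi[0].copy()              # best score of prefixes ending in each label
    back = np.zeros((n, K), dtype=int) # backpointers to the best previous label
    for i in range(1, n):
        scores = delta[:, None] + psi  # (K, K): previous label -> current label
        back[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + phi[i]
    # Trace back the best path from the best final label
    y = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1], float(delta.max())
```

Despite the $K^n$ candidate outputs, the dynamic program touches only $nK^2$ transition scores, which is what makes linear-chain CRF training and decoding practical.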
The Label Bias Problem:
MEMMs (locally normalized models) suffer from label bias: states with few outgoing transitions effectively ignore observations. In the extreme, a state with a single outgoing transition must assign that transition probability 1 regardless of the observation, so the evidence at that step is discarded.
CRFs (globally normalized) avoid this by normalizing over entire sequences, ensuring observations always influence predictions.
Modern Neural Approach:
Currently, the dominant approach is one of:

(a) A pre-trained neural encoder (e.g., BERT) with an independent softmax over labels at each position
(b) The same encoder with a CRF output layer on top that models label transitions
For many tasks, (a) works well if the encoder is powerful enough. For tasks with strong output constraints (NER, especially), (b) often helps.
Add a CRF output layer when: output has hard constraints (e.g., I-PER can't follow I-LOC), label transitions carry significant information, or independent softmax gives inconsistent outputs. For simple sequence labeling with powerful encoders, independent softmax often suffices and trains faster.
Beyond sequences, many outputs have tree or graph structure: parse trees, dependency graphs, scene graphs, molecular structures.
Dependency Parsing:
Given a sentence, predict a tree where:

- Each word has exactly one head (parent): another word or a special ROOT node
- Each edge is labeled with a grammatical relation (subject, object, modifier, ...)
- The edges form a tree spanning all words
Output space: All rooted trees over n nodes, roughly n^(n-1) possibilities (Cayley's formula gives n^(n-2) labeled trees, each of which can be rooted at any of the n nodes)
Approach 1: Graph-based parsing. Score every candidate head-dependent edge, then find the maximum-scoring tree (e.g., via the Chu-Liu-Edmonds maximum spanning arborescence algorithm).
Approach 2: Transition-based parsing. Build the tree incrementally through a sequence of shift/reduce actions chosen by a classifier; typically fast (often linear time), but greedy decisions can propagate errors.
Constituency Parsing:
Predict a hierarchical tree where:

- Leaves are the words of the sentence
- Internal nodes are labeled with phrase categories (NP, VP, S, ...)
- Each internal node spans a contiguous substring of the sentence
Approaches: chart parsing with the CKY algorithm over a (probabilistic) context-free grammar, and span-based neural models that score labeled spans directly.
Modern state-of-the-art: Transformer-based models that predict spans directly or use constituency-specific architectures.
| Structure | Example Tasks | Inference Algorithm | Complexity |
|---|---|---|---|
| Sequence | POS tagging, NER, chunking | Viterbi (DP) | O(nK²) |
| Projective tree | Dependency parsing (projective) | Eisner's algorithm | O(n³) |
| Non-projective tree | Dependency parsing (general) | Chu-Liu-Edmonds | O(n²) |
| Context-free tree | Constituency parsing | CKY algorithm | O(n³G) |
| General graph | Scene graphs, knowledge graphs | Approximate inference | NP-hard in general |
| Alignment | Machine translation (word alignment) | IBM models, EM | O(n²) to O(n⁴) |
For general graph structures (cycles allowed, arbitrary potentials), exact inference is NP-hard. Approximate methods are required: loopy belief propagation, sampling (MCMC), variational inference, or neural approaches that avoid explicit inference. Designing tractable yet expressive structures is a core challenge.
When input and output are both sequences but of different lengths, we need sequence-to-sequence (seq2seq) models. This covers:

- Machine translation: a sentence in one language to a sentence in another
- Summarization: a long document to a short summary
- Other generation tasks where the output length is not determined by the input length
The output length is itself a prediction—we don't know it in advance.
Encoder-Decoder Architecture:
The foundational seq2seq architecture: an encoder maps the input sequence to hidden representations, and a decoder generates the output conditioned on them, factorizing the output distribution as:
$$p(y_1, \ldots, y_m | x_1, \ldots, x_n) = \prod_{t=1}^{m} p(y_t | y_{<t}, x)$$
Decoder generates autoregressively: Each token conditions on all previous tokens.
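Greedy autoregressive decoding can be sketched in a few lines; here `step_fn` is a hypothetical stand-in for a decoder that returns next-token logits given the prefix (and, implicitly, the encoded input):

```python
import numpy as np

def greedy_decode(step_fn, bos, eos, max_len=20):
    """Greedy autoregressive decoding sketch.

    step_fn(prefix) -> logits over the vocabulary for the next token.
    Generation stops at the end-of-sequence token or at max_len, which is
    how the model decides the output length at prediction time.
    """
    y = [bos]
    for _ in range(max_len):
        tok = int(np.argmax(step_fn(y)))   # pick the highest-scoring next token
        y.append(tok)
        if tok == eos:
            break
    return y

# Toy step function (hypothetical): next token = prefix length, capped at 3 (= eos)
toy = lambda prefix: np.eye(5)[min(len(prefix), 3)]
print(greedy_decode(toy, bos=0, eos=3, max_len=10))  # -> [0, 1, 2, 3]
```

In practice beam search keeps several candidate prefixes instead of one, trading computation for better approximate argmax decoding.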
Attention Mechanism:
Simple encoder-decoder compresses input to fixed vector—bottleneck. Attention allows the decoder to focus on different input positions for different output tokens.
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
At each decoding step, compute relevance of each encoder position, weight accordingly.
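The attention equation above translates directly into NumPy; this sketch handles a single head with no masking:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
    Each output row is a weighted average of the value rows.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (m, n) relevance scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax: rows sum to 1
    return w @ V                                       # (m, d_v)
```

With a single key, the softmax weight is 1 and the output is exactly that key's value, which is a quick sanity check on any implementation.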
Transformer Architecture:
The dominant seq2seq architecture today:

- Self-attention replaces recurrence in both encoder and decoder
- Cross-attention lets each decoder position attend to every encoder position
- Positional encodings supply word-order information
Advantages:

- Training parallelizes across sequence positions (no sequential recurrence)
- Attention models long-range dependencies directly
- Scales effectively to large models and datasets
Decoder variants:

- Encoder-decoder models (T5, BART) with separate encoder and decoder stacks
- Decoder-only models, which concatenate input and output into a single sequence
Modern practice leverages pre-trained models: T5, BART, mT5. These models are trained on massive text corpora with denoising objectives, then fine-tuned for specific tasks. They provide strong initialization and often outperform task-specific architectures trained from scratch.
Training structured prediction models presents unique challenges beyond standard classification. The output space is exponentially large, and loss functions may not decompose cleanly.
Loss Functions for Structure:
Negative Log-Likelihood (CRF loss): $$\mathcal{L} = -\log p(y|x) = -s(x,y) + \log Z(x)$$
Requires computing the partition function Z—tractable only for special structures.
Structured Hinge Loss (Structured SVM): $$\mathcal{L} = \max_{y' \in \mathcal{Y}} [s(x, y') + \Delta(y, y')] - s(x, y)$$
where $\Delta(y, y')$ is a task-specific loss. Requires loss-augmented inference.
Structured Perceptron:
Simple online learning algorithm:

1. Predict $\hat{y} = \arg\max_{y'} s(x, y'; \theta)$
2. If $\hat{y} \neq y$, update $\theta \leftarrow \theta + f(x, y) - f(x, \hat{y})$, where $f(x, y)$ is the feature vector of the pair
Push up features of correct output, push down features of predicted output.
Advantages: simple, no partition function needed.
Disadvantages: uses no margin, and may not converge to a maximum-margin separator.
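One structured perceptron step can be sketched as follows; `feats` and `argmax_fn` are hypothetical stand-ins for the feature map and the inference routine (e.g., Viterbi for chains):

```python
import numpy as np

def perceptron_step(theta, feats, x, y_true, argmax_fn):
    """One structured perceptron update.

    feats(x, y) -> feature vector of an (input, output) pair.
    argmax_fn(theta, x) -> highest-scoring output under theta.
    On a mistake, move theta toward the gold features and away from
    the predicted features; no partition function is ever computed.
    """
    y_hat = argmax_fn(theta, x)
    if y_hat != y_true:
        theta = theta + feats(x, y_true) - feats(x, y_hat)
    return theta
```

On a toy problem with brute-force argmax over a handful of candidate outputs, a few such steps already make the gold output the highest-scoring one.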
Structured SVM:
Add margin: Require $s(x, y^*) \geq s(x, y') + \Delta(y^*, y')$ for all $y' \neq y^*$.
Solve a convex optimization problem with exponentially many constraints—handled via constraint generation (adding violated constraints) or cutting-plane methods.
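The structured hinge loss can be illustrated by brute-force loss-augmented inference over a toy output space; real models replace the enumeration with loss-augmented Viterbi or similar:

```python
import numpy as np
from itertools import product

def structured_hinge(score_fn, y_gold, cands, delta):
    """Structured hinge loss: max_y' [s(y') + Delta(y_gold, y')] - s(y_gold).

    Brute-force loss-augmented inference; only feasible here because the
    candidate set is tiny.
    """
    aug = [score_fn(y) + delta(y_gold, y) for y in cands]
    return max(aug) - score_fn(y_gold)

# Toy 2-position, binary-label output space with Hamming loss and unary scores
hamming = lambda y, yp: sum(a != b for a, b in zip(y, yp))
cands = list(product([0, 1], repeat=2))
phi = np.array([[2.0, 0.0], [0.0, 1.0]])       # illustrative unary scores only
score = lambda y: phi[0, y[0]] + phi[1, y[1]]
```

A gold output whose score beats every competitor by at least its Hamming distance incurs zero loss; otherwise the loss grows with both the score gap and the structural error.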
| Algorithm | Loss Type | Inference Needed | Properties |
|---|---|---|---|
| CRF (max likelihood) | Negative log-likelihood | Forward-backward (Z computation) | Well-calibrated probabilities |
| Structured SVM | Structured hinge | Loss-augmented inference | Maximum margin, sparse |
| Structured Perceptron | Perceptron loss | Argmax only | Simple, online, no margin |
| Softmax (per-position) | Cross-entropy | None (independent) | Fast, ignores structure |
| Seq2seq (teacher forcing) | Cross-entropy | None (given prefix) | Exposure bias potential |
Neural Approaches to Structured Prediction:
Modern neural methods often sidestep explicit structured inference:
1. Powerful Encoders: If the encoder (BERT, transformer) is powerful enough, per-position softmax may suffice—the encoder captures structure implicitly.
2. Autoregressive Generation: For seq2seq, generate output left-to-right, conditioning on previous outputs. Structure emerges from sequential generation.
3. Neural CRF: Combine neural features with a CRF output layer for best of both worlds.
4. Reinforcement Learning: Train with task reward (BLEU, F1) using policy gradient methods. Addresses exposure bias but high variance.
Trade-off: Explicit structure provides guarantees (valid outputs, tractable inference) but adds complexity. Implicit structure via powerful models is simpler but may produce invalid outputs.
In structured prediction, inference (finding the best output) is intimately connected to learning. Many training algorithms require inference as a subroutine (structured perceptron, structured SVM). If inference is intractable, learning becomes intractable. This is why tractable inference is so important.
Evaluating structured predictions requires metrics that capture structural correctness, not just component-level accuracy.
Sequence-Level Metrics:
Exact Match (EM): Fraction of sequences predicted entirely correctly. $$EM = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}^{(i)} = y^{(i)}]$$
Very strict—one wrong label makes the whole sequence wrong.
Token-Level Accuracy: Average accuracy across all tokens. $$Acc = \frac{1}{\sum_i n_i} \sum_{i=1}^{N} \sum_{t=1}^{n_i} \mathbb{1}[\hat{y}_t^{(i)} = y_t^{(i)}]$$
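Both metrics are a few lines of Python; this sketch assumes predictions and golds are lists of equal-length label sequences:

```python
def exact_match(preds, golds):
    """Fraction of sequences predicted entirely correctly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def token_accuracy(preds, golds):
    """Per-token accuracy pooled over all sequences."""
    correct = sum(pt == gt for p, g in zip(preds, golds)
                  for pt, gt in zip(p, g))
    total = sum(len(g) for g in golds)
    return correct / total
```

A single wrong label drops exact match for the whole sequence while barely moving token accuracy, which is why the two metrics can diverge sharply on long sequences.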
| Task | Metric | What It Measures |
|---|---|---|
| NER | Entity-level F1 | Correct entity spans and types |
| NER | Span F1 (strict) | Exact boundary match required |
| POS tagging | Token accuracy | Per-token correctness |
| Parsing | Labeled Attachment Score (LAS) | Correct head + label |
| Parsing | Unlabeled Attachment Score (UAS) | Correct head only |
| Constituency parsing | Bracketed F1 | Correct bracket spans |
| Translation | BLEU | N-gram overlap with reference |
| Translation | METEOR, BLEURT | Semantic similarity variants |
| Summarization | ROUGE | Recall-oriented n-gram overlap |
| General generation | BERTScore | Semantic similarity via embeddings |
BLEU (Bilingual Evaluation Understudy):
The standard metric for machine translation: $$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
where:

- $p_n$ is the modified (clipped) n-gram precision
- $w_n$ are weights, typically uniform ($w_n = 1/N$) with $N = 4$
- $BP$ is the brevity penalty, which penalizes candidates shorter than the reference
Limitations of BLEU:

- Measures surface n-gram overlap, not meaning; valid paraphrases are penalized
- Correlates poorly with human judgments at the sentence level
- Sensitive to tokenization details, complicating comparisons across papers
Modern alternatives: BERTScore (embedding-based), BLEURT (learned metric), human evaluation (gold standard but expensive).
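For intuition, a simplified single-reference, unsmoothed sentence-level BLEU can be sketched as follows; use an established implementation such as sacrebleu in practice:

```python
import math
from collections import Counter

def bleu(cand, ref, max_n=4):
    """Simplified sentence-level BLEU sketch: single reference, uniform
    weights, no smoothing -- illustrative only."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it appears in the reference
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if clipped == 0:
            return 0.0        # unsmoothed: any zero precision -> BLEU of 0
        log_p += math.log(clipped / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))   # brevity penalty
    return bp * math.exp(log_p)
```

An identical candidate and reference score 1.0; a candidate sharing no words with the reference scores 0.0, since one zero n-gram precision zeroes the unsmoothed geometric mean.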
Training loss and evaluation metric often differ. Models optimize cross-entropy but are evaluated on BLEU, F1, or exact match. This mismatch can cause suboptimal performance. Techniques like reinforcement learning fine-tuning (REINFORCE) can optimize metrics directly but add training complexity.
We've explored structured prediction comprehensively—why structure matters, formal frameworks, sequence labeling, tree and graph structures, seq2seq models, learning algorithms, and evaluation. Let's synthesize the key insights.
Connection to Other Problem Types:
Structured prediction connects broadly:

- Classification and regression are the degenerate case of a single, unstructured output
- Multi-label prediction is structured prediction over sets
- Autoregressive seq2seq generation borders on generative modeling, the topic of the next page
What's Next:
We turn to generative modeling—learning to generate new data that resembles training data. While structured prediction focuses on predicting given inputs, generative models learn the underlying distribution itself, enabling synthesis of novel examples.
You now possess a comprehensive understanding of structured prediction in machine learning. From sequences through complex graphs, from traditional CRFs to modern transformers, you have the conceptual framework to tackle any prediction problem with interdependent outputs. Next, we explore generative modeling—learning to synthesize new data.