Machine learning traditionally excels at independent prediction—classifying individual images, predicting single numerical values, or categorizing standalone documents. But what happens when the output isn't a single value, but rather a structured sequence where each element depends on its neighbors?
Consider structured prediction challenges such as part-of-speech tagging, named entity recognition, and other sequence labeling tasks.
In all of these problems, adjacent predictions are not independent: a noun is more likely to follow an adjective than another noun, for example. Recognizing this interdependence is crucial for accurate structured prediction.
By the end of this page, you will understand: (1) Why naive independent classifiers fail for structured outputs, (2) The fundamental difference between generative models (HMMs) and discriminative models (CRFs), (3) The complete mathematical formulation of CRFs including their probabilistic interpretation, and (4) Why CRFs have become the de facto standard for many sequence labeling tasks.
Before diving into CRFs, let's rigorously understand why simpler approaches fail. Consider the task of Part-of-Speech (POS) tagging for the sentence:
"The old man the boats"
A naive approach would train a classifier (logistic regression, neural network, etc.) that predicts each word's tag independently:
$$P(y_i \mid x_i)$$
where $x_i$ is the $i$-th word and $y_i$ is its tag.
This local classifier might predict DET for "The", ADJ for "old", NOUN for "man", DET for "the", and NOUN for "boats".
But this produces DET ADJ NOUN DET NOUN, a tag sequence with no verb at all and hence a grammatically invalid analysis!
The correct interpretation is: "The old [people] man [operate] the boats" → DET NOUN VERB DET NOUN. Here, "old" functions as a noun (the elderly) and "man" is a verb (to operate). Without considering sequence structure, the classifier makes locally reasonable but globally inconsistent predictions.
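To see the failure concretely, here is a minimal sketch using invented per-token tag probabilities (the numbers are purely illustrative, not from a trained model). Picking the highest-scoring tag at each position reproduces the invalid sequence, because no position ever sees what its neighbors chose:

```python
# Hypothetical per-token tag distributions for "The old man the boats".
# All probabilities below are invented for illustration only.
local_probs = {
    "The":   {"DET": 0.95, "NOUN": 0.03, "ADJ": 0.01, "VERB": 0.01},
    "old":   {"ADJ": 0.80, "NOUN": 0.15, "DET": 0.03, "VERB": 0.02},
    "man":   {"NOUN": 0.85, "VERB": 0.10, "ADJ": 0.03, "DET": 0.02},
    "the":   {"DET": 0.95, "NOUN": 0.03, "ADJ": 0.01, "VERB": 0.01},
    "boats": {"NOUN": 0.90, "VERB": 0.05, "ADJ": 0.03, "DET": 0.02},
}

sentence = ["The", "old", "man", "the", "boats"]

# Independent decoding: each position takes its own argmax, ignoring neighbors.
independent_tags = [max(local_probs[w], key=local_probs[w].get) for w in sentence]
print(independent_tags)  # ['DET', 'ADJ', 'NOUN', 'DET', 'NOUN'] -- no verb anywhere
```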
Why does independent classification fail?
The fundamental problem is that the classifier ignores label dependencies. In natural language, the tag of one word constrains the tags of its neighbors: a determiner is typically followed by an adjective or a noun, not by another determiner, and every complete sentence needs a verb somewhere.
The mathematical insight:
When we predict labels independently, we're implicitly assuming:
$$P(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(y_i \mid x_i)$$
This factorization ignores the conditional dependencies between labels. What we actually need is a model that captures:
$$P(\mathbf{y} \mid \mathbf{x})$$
where $\mathbf{y} = (y_1, \ldots, y_n)$ is the entire label sequence and we model how labels interact given the full observation sequence $\mathbf{x}$.
| Approach | Model | Limitation |
|---|---|---|
| Independent Classifier | $P(y_i \mid x_i)$ | Ignores label dependencies entirely |
| Window Classifier | $P(y_i \mid x_{i-k:i+k})$ | Uses context but still predicts labels independently |
| Greedy Sequential | $P(y_i \mid x, y_{1:i-1})$ | Left-to-right bias; cannot correct earlier mistakes |
| Structured Model | $P(\mathbf{y} \mid \mathbf{x})$ | Models the full conditional distribution over label sequences; requires structured inference |
Before introducing CRFs, we must understand the fundamental distinction between generative and discriminative probabilistic models. This dichotomy is central to understanding why CRFs often outperform Hidden Markov Models (HMMs).
Generative Models
Generative models learn the joint distribution $P(\mathbf{x}, \mathbf{y})$ of observations and labels. To make predictions, they use Bayes' rule:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{P(\mathbf{x}, \mathbf{y})}{P(\mathbf{x})} = \frac{P(\mathbf{x} \mid \mathbf{y}) P(\mathbf{y})}{P(\mathbf{x})}$$
The Hidden Markov Model (HMM) is the prototypical generative sequence model:
$$P(\mathbf{x}, \mathbf{y}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$$
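To make the factorization concrete, the sketch below multiplies out $P(\mathbf{x}, \mathbf{y})$ for a toy two-state HMM; the transition and emission tables are invented purely to show where each factor enters the product.

```python
# Toy HMM with invented parameters: two states (DET, NOUN) and two words.
initial = {"DET": 0.6, "NOUN": 0.4}                        # P(y_1)
transition = {("DET", "DET"): 0.3, ("DET", "NOUN"): 0.7,   # P(y_i | y_{i-1})
              ("NOUN", "DET"): 0.4, ("NOUN", "NOUN"): 0.6}
emission = {("DET", "the"): 0.8, ("DET", "dog"): 0.2,      # P(x_i | y_i)
            ("NOUN", "the"): 0.1, ("NOUN", "dog"): 0.9}

def hmm_joint(x, y):
    """P(x, y) = P(y_1) * prod_i P(y_i | y_{i-1}) * prod_i P(x_i | y_i)."""
    p = initial[y[0]] * emission[(y[0], x[0])]
    for i in range(1, len(x)):
        p *= transition[(y[i - 1], y[i])] * emission[(y[i], x[i])]
    return p

print(hmm_joint(["the", "dog"], ["DET", "NOUN"]))  # 0.6 * 0.8 * 0.7 * 0.9 = 0.3024
```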
Discriminative Models
Discriminative models directly model the conditional distribution $P(\mathbf{y} \mid \mathbf{x})$ without explicitly modeling how observations are generated. They answer: "Given these observations, what labels are most likely?" without asking "How would these observations be generated?"
Why discriminative models often win:
The key advantage of discriminative models lies in their feature flexibility. Consider HMMs:
$$P(x_i \mid y_i)$$
This emission probability requires specifying how each observation is generated from each state—a local dependence. If you want to condition on neighboring words, their capitalization, prefixes/suffixes, or external resources (gazetteers), you must either enlarge the generative model to explain how all of those extra features are produced (which quickly becomes intractable) or assume they are independent of one another given the label (which is rarely true).
CRFs escape this constraint. They can incorporate arbitrary features of the entire observation sequence at any position:
$$f(\mathbf{x}, y_i, y_{i-1}, i)$$
Such features might include the identity of the current word and its neighbors, capitalization patterns, prefixes and suffixes, membership in external gazetteers, and the identity of the previous label.
All of these can be incorporated without altering the model structure.
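In code, such features are just functions with the signature shown above; the sketch below gives a few illustrative examples (the tag names and the tiny gazetteer are assumptions for the sake of the example, not part of any fixed CRF API). Each one is free to inspect any part of the observation sequence.

```python
GAZETTEER = {"Paris", "Berlin", "Tokyo"}   # hypothetical external word list

def f_word_capitalized(x, y_i, y_prev, i):
    return 1.0 if x[i][0].isupper() and y_i == "NOUN" else 0.0

def f_suffix_ly(x, y_i, y_prev, i):
    return 1.0 if x[i].endswith("ly") and y_i == "ADV" else 0.0

def f_next_word_ends_in_ed(x, y_i, y_prev, i):
    # Looks one position AHEAD in the observations -- not expressible as an HMM emission.
    return 1.0 if i + 1 < len(x) and x[i + 1].endswith("ed") and y_i == "NOUN" else 0.0

def f_in_gazetteer(x, y_i, y_prev, i):
    return 1.0 if x[i] in GAZETTEER and y_i == "NOUN" else 0.0

def f_det_then_noun(x, y_i, y_prev, i):
    # Pure label-label feature capturing the transition DET -> NOUN.
    return 1.0 if y_prev == "DET" and y_i == "NOUN" else 0.0
```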
Andrew Ng and Michael Jordan's seminal 2002 paper "On Discriminative vs. Generative Classifiers" showed that while generative models may converge faster with limited data, discriminative models typically achieve lower asymptotic error. For sequence labeling with rich features, CRFs consistently outperform HMMs—often by significant margins.
We now present the formal definition of Conditional Random Fields. The formulation builds on the theory of undirected graphical models (Markov Random Fields) but conditions on observed data rather than modeling its generation.
Definition (Conditional Random Field):
Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be an observation sequence and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ be a corresponding label sequence. A Conditional Random Field defines the conditional probability:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k=1}^{K} \lambda_k F_k(\mathbf{x}, \mathbf{y}) \right)$$
where:
- $\lambda_k$ are real-valued weights learned from data,
- $F_k(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} f_k(\mathbf{x}, y_i, y_{i-1}, i)$ are global feature functions, each obtained by summing a local feature function over all positions of the sequence, and
- $Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left( \sum_{k=1}^{K} \lambda_k F_k(\mathbf{x}, \mathbf{y}') \right)$ is the partition function that normalizes the distribution.
The partition function $Z(\mathbf{x})$ sums over ALL possible label sequences. For a sequence of length $n$ with $L$ possible labels per position, there are $L^n$ possible label sequences. This exponential growth makes naive computation intractable, but dynamic programming (the forward algorithm) makes it efficient—O(nL²) for linear-chain CRFs.
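As a preview of that dynamic program, here is a minimal forward-recursion sketch for $Z(\mathbf{x})$ (the function name and the dummy start label are conventions of this sketch; a production implementation would also work in log space to avoid overflow):

```python
import numpy as np

def partition_function_forward(x, weights, feature_functions, labels, start=-1):
    """
    Compute Z(x) by the forward recursion in O(n * L^2) time.

    alpha[y] holds the summed exp-scores of all label prefixes that end in
    label y at the current position; summing the final alpha gives Z(x).
    """
    n, L = len(x), len(labels)

    def local_score(y_i, y_prev, i):
        return sum(w * f(x, y_i, y_prev, i)
                   for w, f in zip(weights, feature_functions))

    # Position 0: the previous label is a dummy "start" symbol.
    alpha = np.array([np.exp(local_score(y, start, 0)) for y in labels])

    # Positions 1..n-1: alpha_new[y] = sum_{y'} alpha[y'] * exp(score(y, y', i))
    for i in range(1, n):
        alpha_new = np.zeros(L)
        for b, y in enumerate(labels):
            for a, y_prev in enumerate(labels):
                alpha_new[b] += alpha[a] * np.exp(local_score(y, y_prev, i))
        alpha = alpha_new

    return float(alpha.sum())
```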
Compact Vector Notation:
Let $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)^T$ be the weight vector and $\mathbf{F}(\mathbf{x}, \mathbf{y}) = (F_1(\mathbf{x}, \mathbf{y}), \ldots, F_K(\mathbf{x}, \mathbf{y}))^T$ be the global feature vector. Then:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}) \right)$$
This is an instance of a log-linear model (exponential family): the log-probability is linear in the features.
Why Exponential Form?
The exponential parameterization has deep theoretical justifications:
Maximum Entropy Principle: Among all distributions consistent with expected feature values, the exponential family distribution has maximum entropy (least commitment beyond constraints)
Hammersley-Clifford Theorem: For undirected graphical models, positive distributions that respect conditional independence must be expressible as products of potential functions—equivalent to exponential form
Convex Optimization: The log-likelihood is concave, ensuring a unique global optimum
Natural Conjugacy: Enables efficient Bayesian treatment with Gaussian priors on $\boldsymbol{\lambda}$
```python
import numpy as np
from typing import List, Tuple, Callable


def compute_crf_probability(
    x: List[str],                       # Observation sequence
    y: List[int],                       # Label sequence (integer encoded)
    weights: np.ndarray,                # Weight vector λ (K,)
    feature_functions: List[Callable],  # List of K feature functions
    all_labels: List[int],              # All possible labels
) -> float:
    """
    Compute P(y | x) for a CRF.

    Each feature function f_k(x, y_i, y_{i-1}, i) -> float
    Feature functions take: (observations, current_label, prev_label, position)

    This is a naive O(L^n) implementation for pedagogical clarity.
    In practice, use dynamic programming for the partition function.
    """
    n = len(x)
    K = len(feature_functions)

    def global_feature_vector(x: List[str], y: List[int]) -> np.ndarray:
        """Compute F(x, y) = sum over positions of local features."""
        F = np.zeros(K)
        for i in range(n):
            y_prev = y[i - 1] if i > 0 else -1  # -1 indicates start
            for k, f_k in enumerate(feature_functions):
                F[k] += f_k(x, y[i], y_prev, i)
        return F

    # Compute score for the given label sequence
    F_xy = global_feature_vector(x, y)
    score_xy = np.dot(weights, F_xy)

    # Compute partition function by summing over all possible y'
    # WARNING: This is O(L^n) - exponential! Only for small examples.
    from itertools import product
    Z = 0.0
    for y_prime in product(all_labels, repeat=n):
        F_xy_prime = global_feature_vector(x, list(y_prime))
        Z += np.exp(np.dot(weights, F_xy_prime))

    # Probability = exp(score) / Z
    probability = np.exp(score_xy) / Z
    return probability


# Example: Simple binary labeling (labels: 0 or 1)
# Feature 1: Word is capitalized AND label is 1
# Feature 2: Transition from label 0 to label 1
def f1(x, y_i, y_prev, i):
    return 1.0 if x[i][0].isupper() and y_i == 1 else 0.0


def f2(x, y_i, y_prev, i):
    return 1.0 if y_prev == 0 and y_i == 1 else 0.0


# Example usage
x = ["The", "Bank", "closed"]
y = [0, 1, 0]
weights = np.array([2.0, 0.5])  # Capitalized+label1 is strong signal
feature_functions = [f1, f2]
all_labels = [0, 1]

prob = compute_crf_probability(x, y, weights, feature_functions, all_labels)
print(f"P(y={y} | x={x}) = {prob:.6f}")
```

CRFs can be understood through the lens of factor graphs, providing a visual and algebraic framework for the model structure.
Factor Graph Definition:
A factor graph is a bipartite graph with two types of nodes: variable nodes (here, the label variables $y_i$) and factor nodes, each of which connects to the variables it jointly scores.
For a linear-chain CRF, the factorization is:
$$P(\mathbf{y} \mid \mathbf{x}) \propto \prod_{i=1}^{n} \Psi_i(y_{i-1}, y_i, \mathbf{x})$$
where $\Psi_i(y_{i-1}, y_i, \mathbf{x}) = \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i) \right)$ is the potential function at position $i$.
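A minimal sketch of this factorization: each position contributes an $L \times L$ matrix of potentials, and the unnormalized score of any label sequence is the product of the corresponding entries (the helper names and the dummy start label are conventions of this sketch):

```python
import numpy as np

def build_potentials(x, weights, feature_functions, labels, start=-1):
    """potentials[i][a, b] = Psi_i(labels[a], labels[b], x) = exp(sum_k lambda_k f_k)."""
    L = len(labels)
    potentials = []
    for i in range(len(x)):
        Psi = np.zeros((L, L))
        for a, y_prev in enumerate(labels):
            for b, y_i in enumerate(labels):
                prev = start if i == 0 else y_prev   # position 0 ignores the previous label
                score = sum(w * f(x, y_i, prev, i)
                            for w, f in zip(weights, feature_functions))
                Psi[a, b] = np.exp(score)
        potentials.append(Psi)
    return potentials

def unnormalized_score(potentials, y_indices):
    """prod_i Psi_i(y_{i-1}, y_i, x); at position 0 every row is identical by construction."""
    p = potentials[0][0, y_indices[0]]
    for i in range(1, len(y_indices)):
        p *= potentials[i][y_indices[i - 1], y_indices[i]]
    return p
```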
Key Structural Properties:
1. Markov Property
In a linear-chain CRF, each label $y_i$ is conditionally independent of all other labels given its neighbors and the observations:
$$P(y_i \mid \mathbf{y}_{\setminus i}, \mathbf{x}) = P(y_i \mid y_{i-1}, y_{i+1}, \mathbf{x})$$
This Markov blanket property enables efficient inference.
2. Global Conditioning
Unlike HMMs where emission probabilities are local $P(x_i \mid y_i)$, CRF factors $\Psi_i$ can depend on the entire observation sequence $\mathbf{x}$. This is because we never need to model $P(\mathbf{x})$—we only compute conditional probabilities given fixed $\mathbf{x}$.
3. Undirected Structure
CRFs are undirected models (Markov Random Fields conditioned on $\mathbf{x}$). Unlike Bayesian Networks, there's no causal interpretation of edges. The model simply captures correlations between adjacent labels.
Formally, a CRF is a Markov Random Field globally conditioned on observations. The graph structure (cliques, potentials) describes dependencies among labels, while observations appear as parameters of these potentials rather than random variables.
An illuminating perspective on CRFs is to view them as a structured extension of logistic regression. This connection clarifies both the model form and training procedure.
Logistic Regression Refresher:
For binary classification with features $\mathbf{x}$ and label $y \in \{0, 1\}$:
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{\exp(\mathbf{w}^T \mathbf{x})}{1 + \exp(\mathbf{w}^T \mathbf{x})}$$
Equivalently, using the softmax formulation for $y \in \{0, 1\}$:
$$P(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^T \mathbf{x})}{\sum_{y'} \exp(\mathbf{w}_{y'}^T \mathbf{x})}$$
CRF as Sequence-Level Softmax:
CRFs extend this to structured outputs. Instead of normalizing over labels for a single prediction, we normalize over label sequences:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{\exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}))}{\sum_{\mathbf{y}'} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}'))}$$
| Aspect | Logistic Regression | Conditional Random Field |
|---|---|---|
| Output Space | Single label $y$ | Label sequence $\mathbf{y} = (y_1, \ldots, y_n)$ |
| Feature Function | $\mathbf{x}$ (input features) | $\mathbf{F}(\mathbf{x}, \mathbf{y})$ (input + output features) |
| Normalization | Sum over label values | Sum over label sequences |
| Partition Function | $\sum_{y'} \exp(\mathbf{w}_{y'}^T \mathbf{x})$ — $O(L)$ | $\sum_{\mathbf{y}'} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}'))$ — $O(nL^2)$ |
| Training Objective | Conditional log-likelihood | Conditional log-likelihood |
| Optimization | Convex (gradient descent) | Convex (gradient descent) |
The Gradient Connection:
Both models share the same gradient structure. The gradient of the log-likelihood with respect to weights is:
$$\nabla_{\boldsymbol{\lambda}} \log P(\mathbf{y} \mid \mathbf{x}) = \mathbf{F}(\mathbf{x}, \mathbf{y}) - \mathbb{E}_{\mathbf{y}' \sim P(\cdot \mid \mathbf{x})}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$$
Intuition: The gradient pushes weights up on features that fire in the observed (true) label sequences and down on features the current model expects to fire; at the optimum, the model's expected feature counts match the observed counts.
This is the classic "observed minus expected" gradient form characteristic of exponential families.
For logistic regression, the expectation is easy—sum over $L$ labels. For CRFs, computing $\mathbb{E}[\mathbf{F}]$ requires summing over $L^n$ sequences, but the linear-chain structure enables efficient computation via the forward-backward algorithm.
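The sketch below computes this gradient by brute-force enumeration for a tiny example, so the observed-minus-expected structure is directly visible; it mirrors the feature-function conventions of the code above and is only feasible for very short sequences.

```python
import numpy as np
from itertools import product

def crf_log_likelihood_gradient(x, y, weights, feature_functions, labels):
    """Gradient of log P(y | x): F(x, y) - E_{y' ~ P(.|x)}[F(x, y')], by O(L^n) enumeration."""
    K, n = len(feature_functions), len(x)

    def global_features(y_seq):
        F = np.zeros(K)
        for i in range(n):
            y_prev = y_seq[i - 1] if i > 0 else -1   # -1 marks the start, as above
            for k, f in enumerate(feature_functions):
                F[k] += f(x, y_seq[i], y_prev, i)
        return F

    # Feature vectors and unnormalized scores for every candidate label sequence.
    candidates = [list(c) for c in product(labels, repeat=n)]
    feats = [global_features(c) for c in candidates]
    scores = np.array([np.exp(weights @ F) for F in feats])
    probs = scores / scores.sum()

    expected_F = sum(p * F for p, F in zip(probs, feats))   # E[F(x, y')]
    observed_F = global_features(y)                         # F(x, y)
    return observed_F - expected_F
```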
Like logistic regression, the CRF log-likelihood is a concave function of the weights. This means any local optimum is also the global optimum—no risk of getting stuck in bad local minima. Standard convex optimization techniques (L-BFGS, SGD with proper learning rate) are guaranteed to find the optimal weights.
Since CRFs are often presented as an improvement over Hidden Markov Models, let's rigorously compare these two approaches to sequence labeling.
HMM Formulation:
$$P(\mathbf{x}, \mathbf{y}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$$
CRF Formulation:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_i \sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i) \right)$$
The Feature Flexibility Advantage:
Consider Named Entity Recognition. An HMM might use:
$$P(\text{"Obama"} \mid \text{PERSON})$$
But what if we also want to use whether the word is capitalized, whether it appears in a gazetteer of known names, and the identity of the previous word (e.g., "President")?
In an HMM, these become:
$$P(\text{capitalized=true},\ \text{in\_gazetteer=true},\ \text{prev\_word="President"} \mid \text{PERSON})$$
This requires modeling the joint distribution of all these features given each label—often forcing either unrealistic independence assumptions among the features or severe data sparsity as the feature combinations multiply.
CRFs, by contrast, simply add each of these cues as a separate feature.
Each feature gets its own weight. Features can overlap freely. No independence assumptions required.
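In code, the three cues above become three independent indicator features, each paired with its own weight; the weight values shown are invented for illustration.

```python
# Overlapping indicator features for the PERSON label -- no independence assumptions needed.
def f_cap(x, y_i, y_prev, i):        # current word is capitalized
    return 1.0 if x[i][0].isupper() and y_i == "PERSON" else 0.0

def f_gaz(x, y_i, y_prev, i):        # current word appears in a (tiny, hypothetical) gazetteer
    return 1.0 if x[i] in {"Obama", "Merkel"} and y_i == "PERSON" else 0.0

def f_prev(x, y_i, y_prev, i):       # previous word is "President"
    return 1.0 if i > 0 and x[i - 1] == "President" and y_i == "PERSON" else 0.0

weights = {"f_cap": 1.2, "f_gaz": 2.5, "f_prev": 1.8}   # illustrative values, learned in practice
```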
HMMs can outperform CRFs when: (1) Training data is very limited and the generative assumptions are approximately correct, (2) We need to handle missing observations naturally, (3) We want to generate synthetic data, or (4) Computational resources for CRF training are constrained. However, for most NLP sequence labeling tasks with reasonable training data, CRFs dominate.
Let's establish the key mathematical properties that make CRFs both theoretically elegant and practically useful.
Property 1: Normalization
For any observation sequence $\mathbf{x}$:
$$\sum_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = 1$$
Proof: By definition of the partition function:
$$\sum_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{y}} \frac{\exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}))}{Z(\mathbf{x})} = \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{y}} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y})) = \frac{Z(\mathbf{x})}{Z(\mathbf{x})} = 1 \quad \square$$
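This can be checked numerically with the `compute_crf_probability` example from earlier on this page (assuming that snippet has been run, so `x`, `weights`, `feature_functions`, and `all_labels` are defined): summing over all candidate label sequences should return 1 up to floating-point error.

```python
from itertools import product

# Sum P(y' | x) over every possible label sequence y' of length len(x).
total = sum(
    compute_crf_probability(x, list(y_prime), weights, feature_functions, all_labels)
    for y_prime in product(all_labels, repeat=len(x))
)
print(total)  # ~1.0
```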
Property 2: Log-Linearity
The log-probability is linear in the parameters:
$$\log P(\mathbf{y} \mid \mathbf{x}) = \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}) - \log Z(\mathbf{x})$$
This log-linear form is fundamental to the exponential family and enables efficient gradient computation.
Property 3: Concave Log-Likelihood
The log-likelihood $\mathcal{L}(\boldsymbol{\lambda}) = \sum_{(\mathbf{x}, \mathbf{y})} \log P(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\lambda})$ is a concave function of the weights $\boldsymbol{\lambda}$.
Proof Sketch:
The Hessian of the log-likelihood is:
$$\nabla^2_{\boldsymbol{\lambda}} \log P(\mathbf{y} \mid \mathbf{x}) = -\text{Var}_{P(\mathbf{y}' \mid \mathbf{x})}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$$
Since variance is non-negative, the Hessian is negative semi-definite, proving concavity. $\square$
Implication: Any local optimum of the log-likelihood is also the global optimum. Standard optimization algorithms (gradient descent, L-BFGS, Newton methods) are guaranteed to converge to the (unique) optimal parameters.
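As a numerical illustration of the proof sketch (reusing the toy setup and the feature functions `f1`, `f2` from the example above), the snippet below computes the feature covariance under $P(\mathbf{y}' \mid \mathbf{x})$ by enumeration and checks that its eigenvalues are non-negative, so the Hessian $-\text{Cov}[\mathbf{F}]$ is negative semi-definite.

```python
import numpy as np
from itertools import product

def feature_covariance(x, weights, feature_functions, labels):
    """Cov_{P(y'|x)}[F(x, y')], computed by explicit O(L^n) enumeration."""
    K, n = len(feature_functions), len(x)

    def global_features(y_seq):
        F = np.zeros(K)
        for i in range(n):
            y_prev = y_seq[i - 1] if i > 0 else -1
            for k, f in enumerate(feature_functions):
                F[k] += f(x, y_seq[i], y_prev, i)
        return F

    feats = np.array([global_features(list(c)) for c in product(labels, repeat=n)])
    probs = np.exp(feats @ weights)
    probs /= probs.sum()
    centered = feats - probs @ feats            # F(x, y') minus its expectation
    return (centered * probs[:, None]).T @ centered

cov = feature_covariance(["The", "Bank", "closed"], np.array([2.0, 0.5]), [f1, f2], [0, 1])
print(np.linalg.eigvalsh(cov))  # all eigenvalues >= 0  =>  Hessian is negative semi-definite
```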
Property 4: Consistency
Under mild regularity conditions (compact parameter space, positive data probability), maximum likelihood estimation for CRFs is consistent: as training data grows, $\hat{\boldsymbol{\lambda}}_{\text{MLE}} \to \boldsymbol{\lambda}^*$ (the true parameters, if data is generated from the model family).
Property 5: Markov Property (Linear-Chain)
For linear-chain CRFs, labels satisfy the first-order Markov property:
$$y_i \perp \mathbf{y}_{\setminus \{i-1, i, i+1\}} \mid y_{i-1}, y_{i+1}, \mathbf{x}$$
The label at position $i$ is conditionally independent of all other labels given its immediate neighbors and all observations.
These mathematical properties have practical consequences: (1) Concavity of the log-likelihood means training is insensitive to initialization, since any starting point converges to the global optimum, (2) Consistency means more data always helps, (3) The Markov property enables O(nL²) inference instead of O(L^n), (4) Log-linearity enables efficient gradient computation for training.
We have established the complete mathematical foundation for Conditional Random Fields. Let's consolidate the key concepts:
- Independent per-position classifiers ignore label dependencies and can produce globally inconsistent label sequences.
- Generative models (HMMs) model the joint $P(\mathbf{x}, \mathbf{y})$; discriminative models (CRFs) model $P(\mathbf{y} \mid \mathbf{x})$ directly and can use arbitrary, overlapping features of the entire observation sequence.
- A CRF is a globally normalized log-linear model: the log-probability is linear in the weights up to the term $\log Z(\mathbf{x})$.
- The conditional log-likelihood is concave, its gradient has the "observed minus expected" form, and the linear-chain structure makes inference $O(nL^2)$ rather than $O(L^n)$.
The core CRF equation to internalize:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{i=1}^{n} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i) \right)$$
What's next:
Now that we understand the general CRF formulation, we'll specialize to Linear-Chain CRFs—the most common variant used for sequence labeling. We'll see how the chain structure enables efficient inference through dynamic programming, and how to implement the forward-backward algorithm for computing partition functions and marginal probabilities.
You now understand the foundational formulation of Conditional Random Fields: why structured prediction requires modeling label dependencies, how CRFs differ from generative models like HMMs, the mathematical form of the model, and its key properties. Next, we'll explore the linear-chain structure that makes CRFs computationally tractable.