Machine learning traditionally excels at independent prediction—classifying individual images, predicting single numerical values, or categorizing standalone documents. But what happens when the output isn't a single value, but rather a structured sequence where each element depends on its neighbors?
Consider structured prediction challenges such as part-of-speech tagging, named entity recognition, and other sequence labeling tasks.
In all of these problems, adjacent predictions are not independent: a noun is more likely to follow an adjective than another noun, for example. Recognizing this interdependence is crucial for accurate structured prediction.
By the end of this page, you will understand: (1) Why naive independent classifiers fail for structured outputs, (2) The fundamental difference between generative models (HMMs) and discriminative models (CRFs), (3) The complete mathematical formulation of CRFs including their probabilistic interpretation, and (4) Why CRFs have become the de facto standard for many sequence labeling tasks.
Before diving into CRFs, let's rigorously understand why simpler approaches fail. Consider the task of Part-of-Speech (POS) tagging for the sentence:
"The old man the boats"
A naive approach would train a classifier (logistic regression, neural network, etc.) that predicts each word's tag independently:
$$P(y_i \mid x_i)$$
where $x_i$ is the $i$-th word and $y_i$ is its tag.
This local classifier might predict DET for "The", ADJ for "old", NOUN for "man", DET for "the", and NOUN for "boats".
But this produces DET ADJ NOUN DET NOUN, a tag sequence with no verb at all and hence a grammatically invalid analysis!
The correct interpretation is: "The old [people] man [operate] the boats" → DET NOUN VERB DET NOUN. Here, "old" functions as a noun (the elderly) and "man" is a verb (to operate). Without considering sequence structure, the classifier makes locally reasonable but globally inconsistent predictions.
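To see the failure concretely, here is a minimal sketch using invented per-token tag probabilities (the numbers are purely illustrative, not from a trained model). Picking the highest-scoring tag at each position reproduces the invalid sequence, because no position ever sees what its neighbors chose:

```python
# Hypothetical per-token tag distributions for "The old man the boats".
# All probabilities below are invented for illustration only.
local_probs = {
    "The":   {"DET": 0.95, "NOUN": 0.03, "ADJ": 0.01, "VERB": 0.01},
    "old":   {"ADJ": 0.80, "NOUN": 0.15, "DET": 0.03, "VERB": 0.02},
    "man":   {"NOUN": 0.85, "VERB": 0.10, "ADJ": 0.03, "DET": 0.02},
    "the":   {"DET": 0.95, "NOUN": 0.03, "ADJ": 0.01, "VERB": 0.01},
    "boats": {"NOUN": 0.90, "VERB": 0.05, "ADJ": 0.03, "DET": 0.02},
}

sentence = ["The", "old", "man", "the", "boats"]

# Independent decoding: each position takes its own argmax, ignoring neighbors.
independent_tags = [max(local_probs[w], key=local_probs[w].get) for w in sentence]
print(independent_tags)  # ['DET', 'ADJ', 'NOUN', 'DET', 'NOUN'] -- no verb anywhere
```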
Why does independent classification fail?
The fundamental problem is that the classifier ignores label dependencies. In natural language, the tag of one word constrains the tags of its neighbors: a determiner is typically followed by an adjective or a noun, not by another determiner, and every complete sentence needs a verb somewhere.
The mathematical insight:
When we predict labels independently, we're implicitly assuming:
$$P(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(y_i \mid x_i)$$
This factorization ignores the conditional dependencies between labels. What we actually need is a model that captures:
$$P(\mathbf{y} \mid \mathbf{x})$$
where $\mathbf{y} = (y_1, \ldots, y_n)$ is the entire label sequence and we model how labels interact given the full observation sequence $\mathbf{x}$.
| Approach | Model | Limitation |
|---|---|---|
| Independent Classifier | $P(y_i \mid x_i)$ | Ignores label dependencies entirely |
| Window Classifier | $P(y_i \mid x_{i-k:i+k})$ | Uses context but still predicts labels independently |
| Greedy Sequential | $P(y_i \mid x, y_{1:i-1})$ | Left-to-right bias; cannot correct earlier mistakes |
| Structured Model | $P(\mathbf{y} \mid \mathbf{x})$ | Models the full conditional distribution over label sequences; requires structured inference |
Before introducing CRFs, we must understand the fundamental distinction between generative and discriminative probabilistic models. This dichotomy is central to understanding why CRFs often outperform Hidden Markov Models (HMMs).
Generative Models
Generative models learn the joint distribution $P(\mathbf{x}, \mathbf{y})$ of observations and labels. To make predictions, they use Bayes' rule:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{P(\mathbf{x}, \mathbf{y})}{P(\mathbf{x})} = \frac{P(\mathbf{x} \mid \mathbf{y}) P(\mathbf{y})}{P(\mathbf{x})}$$
The Hidden Markov Model (HMM) is the prototypical generative sequence model:
$$P(\mathbf{x}, \mathbf{y}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$$
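To make the factorization concrete, the sketch below multiplies out $P(\mathbf{x}, \mathbf{y})$ for a toy two-state HMM; the transition and emission tables are invented purely to show where each factor enters the product.

```python
# Toy HMM with invented parameters: two states (DET, NOUN) and two words.
initial = {"DET": 0.6, "NOUN": 0.4}                        # P(y_1)
transition = {("DET", "DET"): 0.3, ("DET", "NOUN"): 0.7,   # P(y_i | y_{i-1})
              ("NOUN", "DET"): 0.4, ("NOUN", "NOUN"): 0.6}
emission = {("DET", "the"): 0.8, ("DET", "dog"): 0.2,      # P(x_i | y_i)
            ("NOUN", "the"): 0.1, ("NOUN", "dog"): 0.9}

def hmm_joint(x, y):
    """P(x, y) = P(y_1) * prod_i P(y_i | y_{i-1}) * prod_i P(x_i | y_i)."""
    p = initial[y[0]] * emission[(y[0], x[0])]
    for i in range(1, len(x)):
        p *= transition[(y[i - 1], y[i])] * emission[(y[i], x[i])]
    return p

print(hmm_joint(["the", "dog"], ["DET", "NOUN"]))  # 0.6 * 0.8 * 0.7 * 0.9 = 0.3024
```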
Discriminative Models
Discriminative models directly model the conditional distribution $P(\mathbf{y} \mid \mathbf{x})$ without explicitly modeling how observations are generated. They answer: "Given these observations, what labels are most likely?" without asking "How would these observations be generated?"
Why discriminative models often win:
The key advantage of discriminative models lies in their feature flexibility. Consider HMMs:
$$P(x_i \mid y_i)$$
This emission probability requires specifying how each observation is generated from each state—a local dependence. If you want to condition on neighboring words, their capitalization, prefixes/suffixes, or external resources (gazetteers), you must either enlarge the generative model to explain how all of those extra features are produced (which quickly becomes intractable) or assume they are independent of one another given the label (which is rarely true).
CRFs escape this constraint. They can incorporate arbitrary features of the entire observation sequence at any position:
$$f(\mathbf{x}, y_i, y_{i-1}, i)$$
Such features might include the identity of the current word and its neighbors, capitalization patterns, prefixes and suffixes, membership in external gazetteers, and the identity of the previous label.
All of these can be incorporated without altering the model structure.
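In code, such features are just functions with the signature shown above; the sketch below gives a few illustrative examples (the tag names and the tiny gazetteer are assumptions for the sake of the example, not part of any fixed CRF API). Each one is free to inspect any part of the observation sequence.

```python
GAZETTEER = {"Paris", "Berlin", "Tokyo"}   # hypothetical external word list

def f_word_capitalized(x, y_i, y_prev, i):
    return 1.0 if x[i][0].isupper() and y_i == "NOUN" else 0.0

def f_suffix_ly(x, y_i, y_prev, i):
    return 1.0 if x[i].endswith("ly") and y_i == "ADV" else 0.0

def f_next_word_ends_in_ed(x, y_i, y_prev, i):
    # Looks one position AHEAD in the observations -- not expressible as an HMM emission.
    return 1.0 if i + 1 < len(x) and x[i + 1].endswith("ed") and y_i == "NOUN" else 0.0

def f_in_gazetteer(x, y_i, y_prev, i):
    return 1.0 if x[i] in GAZETTEER and y_i == "NOUN" else 0.0

def f_det_then_noun(x, y_i, y_prev, i):
    # Pure label-label feature capturing the transition DET -> NOUN.
    return 1.0 if y_prev == "DET" and y_i == "NOUN" else 0.0
```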
Andrew Ng and Michael Jordan's seminal 2002 paper "On Discriminative vs. Generative Classifiers" showed that while generative models may converge faster with limited data, discriminative models typically achieve lower asymptotic error. For sequence labeling with rich features, CRFs consistently outperform HMMs—often by significant margins.
We now present the formal definition of Conditional Random Fields. The formulation builds on the theory of undirected graphical models (Markov Random Fields) but conditions on observed data rather than modeling its generation.
Definition (Conditional Random Field):
Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be an observation sequence and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ be a corresponding label sequence. A Conditional Random Field defines the conditional probability:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k=1}^{K} \lambda_k F_k(\mathbf{x}, \mathbf{y}) \right)$$
where:
- $\lambda_k$ are real-valued weights learned from data,
- $F_k(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} f_k(\mathbf{x}, y_i, y_{i-1}, i)$ are global feature functions, each obtained by summing a local feature function over all positions of the sequence, and
- $Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left( \sum_{k=1}^{K} \lambda_k F_k(\mathbf{x}, \mathbf{y}') \right)$ is the partition function that normalizes the distribution.
The partition function $Z(\mathbf{x})$ sums over ALL possible label sequences. For a sequence of length $n$ with $L$ possible labels per position, there are $L^n$ possible label sequences. This exponential growth makes naive computation intractable, but dynamic programming (the forward algorithm) makes it efficient—O(nL²) for linear-chain CRFs.
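As a preview of that dynamic program, here is a minimal forward-recursion sketch for $Z(\mathbf{x})$ (the function name and the dummy start label are conventions of this sketch; a production implementation would also work in log space to avoid overflow):

```python
import numpy as np

def partition_function_forward(x, weights, feature_functions, labels, start=-1):
    """
    Compute Z(x) by the forward recursion in O(n * L^2) time.

    alpha[y] holds the summed exp-scores of all label prefixes that end in
    label y at the current position; summing the final alpha gives Z(x).
    """
    n, L = len(x), len(labels)

    def local_score(y_i, y_prev, i):
        return sum(w * f(x, y_i, y_prev, i)
                   for w, f in zip(weights, feature_functions))

    # Position 0: the previous label is a dummy "start" symbol.
    alpha = np.array([np.exp(local_score(y, start, 0)) for y in labels])

    # Positions 1..n-1: alpha_new[y] = sum_{y'} alpha[y'] * exp(score(y, y', i))
    for i in range(1, n):
        alpha_new = np.zeros(L)
        for b, y in enumerate(labels):
            for a, y_prev in enumerate(labels):
                alpha_new[b] += alpha[a] * np.exp(local_score(y, y_prev, i))
        alpha = alpha_new

    return float(alpha.sum())
```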
Compact Vector Notation:
Let $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)^T$ be the weight vector and $\mathbf{F}(\mathbf{x}, \mathbf{y}) = (F_1(\mathbf{x}, \mathbf{y}), \ldots, F_K(\mathbf{x}, \mathbf{y}))^T$ be the global feature vector. Then:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}) \right)$$
This is an instance of a log-linear model (exponential family): the log-probability is linear in the features.
Why Exponential Form?
The exponential parameterization has deep theoretical justifications:
Maximum Entropy Principle: Among all distributions consistent with expected feature values, the exponential family distribution has maximum entropy (least commitment beyond constraints)
Hammersley-Clifford Theorem: For undirected graphical models, positive distributions that respect conditional independence must be expressible as products of potential functions—equivalent to exponential form
Convex Optimization: The log-likelihood is concave, ensuring a unique global optimum
Natural Conjugacy: Enables efficient Bayesian treatment with Gaussian priors on $\boldsymbol{\lambda}$
```python
import numpy as np
from typing import List, Tuple, Callable


def compute_crf_probability(
    x: List[str],                       # Observation sequence
    y: List[int],                       # Label sequence (integer encoded)
    weights: np.ndarray,                # Weight vector λ (K,)
    feature_functions: List[Callable],  # List of K feature functions
    all_labels: List[int],              # All possible labels
) -> float:
    """
    Compute P(y | x) for a CRF.

    Each feature function f_k(x, y_i, y_{i-1}, i) -> float
    Feature functions take: (observations, current_label, prev_label, position)

    This is a naive O(L^n) implementation for pedagogical clarity.
    In practice, use dynamic programming for the partition function.
    """
    n = len(x)
    K = len(feature_functions)

    def global_feature_vector(x: List[str], y: List[int]) -> np.ndarray:
        """Compute F(x, y) = sum over positions of local features."""
        F = np.zeros(K)
        for i in range(n):
            y_prev = y[i - 1] if i > 0 else -1  # -1 indicates start
            for k, f_k in enumerate(feature_functions):
                F[k] += f_k(x, y[i], y_prev, i)
        return F

    # Compute score for the given label sequence
    F_xy = global_feature_vector(x, y)
    score_xy = np.dot(weights, F_xy)

    # Compute partition function by summing over all possible y'
    # WARNING: This is O(L^n) - exponential! Only for small examples.
    from itertools import product
    Z = 0.0
    for y_prime in product(all_labels, repeat=n):
        F_xy_prime = global_feature_vector(x, list(y_prime))
        Z += np.exp(np.dot(weights, F_xy_prime))

    # Probability = exp(score) / Z
    probability = np.exp(score_xy) / Z
    return probability


# Example: Simple binary labeling (labels: 0 or 1)
# Feature 1: Word is capitalized AND label is 1
# Feature 2: Transition from label 0 to label 1
def f1(x, y_i, y_prev, i):
    return 1.0 if x[i][0].isupper() and y_i == 1 else 0.0


def f2(x, y_i, y_prev, i):
    return 1.0 if y_prev == 0 and y_i == 1 else 0.0


# Example usage
x = ["The", "Bank", "closed"]
y = [0, 1, 0]
weights = np.array([2.0, 0.5])  # Capitalized+label1 is strong signal
feature_functions = [f1, f2]
all_labels = [0, 1]

prob = compute_crf_probability(x, y, weights, feature_functions, all_labels)
print(f"P(y={y} | x={x}) = {prob:.6f}")
```

CRFs can be understood through the lens of factor graphs, providing a visual and algebraic framework for the model structure.
Factor Graph Definition:
A factor graph is a bipartite graph with two types of nodes: variable nodes (here, the label variables $y_i$) and factor nodes, each of which connects to the variables it jointly scores.
For a linear-chain CRF, the factorization is:
$$P(\mathbf{y} \mid \mathbf{x}) \propto \prod_{i=1}^{n} \Psi_i(y_{i-1}, y_i, \mathbf{x})$$
where $\Psi_i(y_{i-1}, y_i, \mathbf{x}) = \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i) \right)$ is the potential function at position $i$.
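A minimal sketch of this factorization: each position contributes an $L \times L$ matrix of potentials, and the unnormalized score of any label sequence is the product of the corresponding entries (the helper names and the dummy start label are conventions of this sketch):

```python
import numpy as np

def build_potentials(x, weights, feature_functions, labels, start=-1):
    """potentials[i][a, b] = Psi_i(labels[a], labels[b], x) = exp(sum_k lambda_k f_k)."""
    L = len(labels)
    potentials = []
    for i in range(len(x)):
        Psi = np.zeros((L, L))
        for a, y_prev in enumerate(labels):
            for b, y_i in enumerate(labels):
                prev = start if i == 0 else y_prev   # position 0 ignores the previous label
                score = sum(w * f(x, y_i, prev, i)
                            for w, f in zip(weights, feature_functions))
                Psi[a, b] = np.exp(score)
        potentials.append(Psi)
    return potentials

def unnormalized_score(potentials, y_indices):
    """prod_i Psi_i(y_{i-1}, y_i, x); at position 0 every row is identical by construction."""
    p = potentials[0][0, y_indices[0]]
    for i in range(1, len(y_indices)):
        p *= potentials[i][y_indices[i - 1], y_indices[i]]
    return p
```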
Key Structural Properties:
1. Markov Property
In a linear-chain CRF, each label $y_i$ is conditionally independent of all other labels given its neighbors and the observations:
$$P(y_i \mid \mathbf{y}_{\setminus i}, \mathbf{x}) = P(y_i \mid y_{i-1}, y_{i+1}, \mathbf{x})$$
This Markov blanket property enables efficient inference.
2. Global Conditioning
Unlike HMMs where emission probabilities are local $P(x_i \mid y_i)$, CRF factors $\Psi_i$ can depend on the entire observation sequence $\mathbf{x}$. This is because we never need to model $P(\mathbf{x})$—we only compute conditional probabilities given fixed $\mathbf{x}$.
3. Undirected Structure
CRFs are undirected models (Markov Random Fields conditioned on $\mathbf{x}$). Unlike Bayesian Networks, there's no causal interpretation of edges. The model simply captures correlations between adjacent labels.
Formally, a CRF is a Markov Random Field globally conditioned on observations. The graph structure (cliques, potentials) describes dependencies among labels, while observations appear as parameters of these potentials rather than random variables.
An illuminating perspective on CRFs is to view them as a structured extension of logistic regression. This connection clarifies both the model form and training procedure.
Logistic Regression Refresher:
For binary classification with features $\mathbf{x}$ and label $y \in \{0, 1\}$:
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{\exp(\mathbf{w}^T \mathbf{x})}{1 + \exp(\mathbf{w}^T \mathbf{x})}$$
Equivalently, using the softmax formulation for $y \in \{0, 1\}$:
$$P(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^T \mathbf{x})}{\sum_{y'} \exp(\mathbf{w}_{y'}^T \mathbf{x})}$$
CRF as Sequence-Level Softmax:
CRFs extend this to structured outputs. Instead of normalizing over labels for a single prediction, we normalize over label sequences:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{\exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}))}{\sum_{\mathbf{y}'} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}'))}$$
| Aspect | Logistic Regression | Conditional Random Field |
|---|---|---|
| Output Space | Single label $y$ | Label sequence $\mathbf{y} = (y_1, \ldots, y_n)$ |
| Feature Function | $\mathbf{x}$ (input features) | $\mathbf{F}(\mathbf{x}, \mathbf{y})$ (input + output features) |
| Normalization | Sum over label values | Sum over label sequences |
| Partition Function | $\sum_{y'} \exp(\mathbf{w}_{y'}^T \mathbf{x})$ — $O(L)$ | $\sum_{\mathbf{y}'} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}'))$ — $O(nL^2)$ |
| Training Objective | Conditional log-likelihood | Conditional log-likelihood |
| Optimization | Convex (gradient descent) | Convex (gradient descent) |
The Gradient Connection:
Both models share the same gradient structure. The gradient of the log-likelihood with respect to weights is:
$$\nabla_{\boldsymbol{\lambda}} \log P(\mathbf{y} \mid \mathbf{x}) = \mathbf{F}(\mathbf{x}, \mathbf{y}) - \mathbb{E}_{\mathbf{y}' \sim P(\cdot \mid \mathbf{x})}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$$
Intuition: The gradient pushes weights up on features that fire in the observed (true) label sequences and down on features the current model expects to fire; at the optimum, the model's expected feature counts match the observed counts.
This is the classic "observed minus expected" gradient form characteristic of exponential families.
For logistic regression, the expectation is easy—sum over $L$ labels. For CRFs, computing $\mathbb{E}[\mathbf{F}]$ requires summing over $L^n$ sequences, but the linear-chain structure enables efficient computation via the forward-backward algorithm.
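The sketch below computes this gradient by brute-force enumeration for a tiny example, so the observed-minus-expected structure is directly visible; it mirrors the feature-function conventions of the code above and is only feasible for very short sequences.

```python
import numpy as np
from itertools import product

def crf_log_likelihood_gradient(x, y, weights, feature_functions, labels):
    """Gradient of log P(y | x): F(x, y) - E_{y' ~ P(.|x)}[F(x, y')], by O(L^n) enumeration."""
    K, n = len(feature_functions), len(x)

    def global_features(y_seq):
        F = np.zeros(K)
        for i in range(n):
            y_prev = y_seq[i - 1] if i > 0 else -1   # -1 marks the start, as above
            for k, f in enumerate(feature_functions):
                F[k] += f(x, y_seq[i], y_prev, i)
        return F

    # Feature vectors and unnormalized scores for every candidate label sequence.
    candidates = [list(c) for c in product(labels, repeat=n)]
    feats = [global_features(c) for c in candidates]
    scores = np.array([np.exp(weights @ F) for F in feats])
    probs = scores / scores.sum()

    expected_F = sum(p * F for p, F in zip(probs, feats))   # E[F(x, y')]
    observed_F = global_features(y)                         # F(x, y)
    return observed_F - expected_F
```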
Like logistic regression, the CRF log-likelihood is a concave function of the weights. This means any local optimum is also the global optimum—no risk of getting stuck in bad local minima. Standard convex optimization techniques (L-BFGS, SGD with proper learning rate) are guaranteed to find the optimal weights.
Since CRFs are often presented as an improvement over Hidden Markov Models, let's rigorously compare these two approaches to sequence labeling.
HMM Formulation:
$$P(\mathbf{x}, \mathbf{y}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$$
CRF Formulation:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_i \sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i) \right)$$
The Feature Flexibility Advantage:
Consider Named Entity Recognition. An HMM might use:
$$P(\text{"Obama"} \mid \text{PERSON})$$
But what if we also want to use whether the word is capitalized, whether it appears in a gazetteer of known names, and the identity of the previous word (e.g., "President")?
In an HMM, these become:
$$P(\text{capitalized=true},\ \text{in\_gazetteer=true},\ \text{prev\_word="President"} \mid \text{PERSON})$$
This requires modeling the joint distribution of all these features given each label—often forcing either unrealistic independence assumptions among the features or severe data sparsity as the feature combinations multiply.
CRFs, by contrast, simply add each of these cues as a separate feature.
Each feature gets its own weight. Features can overlap freely. No independence assumptions required.
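In code, the three cues above become three independent indicator features, each paired with its own weight; the weight values shown are invented for illustration.

```python
# Overlapping indicator features for the PERSON label -- no independence assumptions needed.
def f_cap(x, y_i, y_prev, i):        # current word is capitalized
    return 1.0 if x[i][0].isupper() and y_i == "PERSON" else 0.0

def f_gaz(x, y_i, y_prev, i):        # current word appears in a (tiny, hypothetical) gazetteer
    return 1.0 if x[i] in {"Obama", "Merkel"} and y_i == "PERSON" else 0.0

def f_prev(x, y_i, y_prev, i):       # previous word is "President"
    return 1.0 if i > 0 and x[i - 1] == "President" and y_i == "PERSON" else 0.0

weights = {"f_cap": 1.2, "f_gaz": 2.5, "f_prev": 1.8}   # illustrative values, learned in practice
```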
HMMs can outperform CRFs when: (1) Training data is very limited and the generative assumptions are approximately correct, (2) We need to handle missing observations naturally, (3) We want to generate synthetic data, or (4) Computational resources for CRF training are constrained. However, for most NLP sequence labeling tasks with reasonable training data, CRFs dominate.
Let's establish the key mathematical properties that make CRFs both theoretically elegant and practically useful.
Property 1: Normalization
For any observation sequence $\mathbf{x}$:
$$\sum_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = 1$$
Proof: By definition of the partition function:
$$\sum_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{y}} \frac{\exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}))}{Z(\mathbf{x})} = \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{y}} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y})) = \frac{Z(\mathbf{x})}{Z(\mathbf{x})} = 1 \quad \square$$
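This can be checked numerically with the `compute_crf_probability` example from earlier on this page (assuming that snippet has been run, so `x`, `weights`, `feature_functions`, and `all_labels` are defined): summing over all candidate label sequences should return 1 up to floating-point error.

```python
from itertools import product

# Sum P(y' | x) over every possible label sequence y' of length len(x).
total = sum(
    compute_crf_probability(x, list(y_prime), weights, feature_functions, all_labels)
    for y_prime in product(all_labels, repeat=len(x))
)
print(total)  # ~1.0
```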
Property 2: Log-Linearity
The log-probability is linear in the parameters:
$$\log P(\mathbf{y} \mid \mathbf{x}) = \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}) - \log Z(\mathbf{x})$$
This log-linear form is fundamental to the exponential family and enables efficient gradient computation.
Property 3: Concave Log-Likelihood
The log-likelihood $\mathcal{L}(\boldsymbol{\lambda}) = \sum_{(\mathbf{x}, \mathbf{y})} \log P(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\lambda})$ is a concave function of the weights $\boldsymbol{\lambda}$.
Proof Sketch:
The Hessian of the log-likelihood is:
$$\nabla^2_{\boldsymbol{\lambda}} \log P(\mathbf{y} \mid \mathbf{x}) = -\text{Var}_{P(\mathbf{y}' \mid \mathbf{x})}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$$
Since variance is non-negative, the Hessian is negative semi-definite, proving concavity. $\square$
Implication: Any local optimum of the log-likelihood is also the global optimum. Standard optimization algorithms (gradient descent, L-BFGS, Newton methods) are guaranteed to converge to the (unique) optimal parameters.
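As a numerical illustration of the proof sketch (reusing the toy setup and the feature functions `f1`, `f2` from the example above), the snippet below computes the feature covariance under $P(\mathbf{y}' \mid \mathbf{x})$ by enumeration and checks that its eigenvalues are non-negative, so the Hessian $-\text{Cov}[\mathbf{F}]$ is negative semi-definite.

```python
import numpy as np
from itertools import product

def feature_covariance(x, weights, feature_functions, labels):
    """Cov_{P(y'|x)}[F(x, y')], computed by explicit O(L^n) enumeration."""
    K, n = len(feature_functions), len(x)

    def global_features(y_seq):
        F = np.zeros(K)
        for i in range(n):
            y_prev = y_seq[i - 1] if i > 0 else -1
            for k, f in enumerate(feature_functions):
                F[k] += f(x, y_seq[i], y_prev, i)
        return F

    feats = np.array([global_features(list(c)) for c in product(labels, repeat=n)])
    probs = np.exp(feats @ weights)
    probs /= probs.sum()
    centered = feats - probs @ feats            # F(x, y') minus its expectation
    return (centered * probs[:, None]).T @ centered

cov = feature_covariance(["The", "Bank", "closed"], np.array([2.0, 0.5]), [f1, f2], [0, 1])
print(np.linalg.eigvalsh(cov))  # all eigenvalues >= 0  =>  Hessian is negative semi-definite
```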
Property 4: Consistency
Under mild regularity conditions (compact parameter space, positive data probability), maximum likelihood estimation for CRFs is consistent: as training data grows, $\hat{\boldsymbol{\lambda}}_{\text{MLE}} \to \boldsymbol{\lambda}^*$ (the true parameters, if data is generated from the model family).
Property 5: Markov Property (Linear-Chain)
For linear-chain CRFs, labels satisfy the first-order Markov property:
$$y_i \perp \mathbf{y}_{\setminus \{i-1, i, i+1\}} \mid y_{i-1}, y_{i+1}, \mathbf{x}$$
The label at position $i$ is conditionally independent of all other labels given its immediate neighbors and all observations.
These mathematical properties have practical consequences: (1) Concavity of the log-likelihood means training is insensitive to initialization, since any starting point converges to the global optimum, (2) Consistency means more data always helps, (3) The Markov property enables O(nL²) inference instead of O(L^n), (4) Log-linearity enables efficient gradient computation for training.
We have established the complete mathematical foundation for Conditional Random Fields. Let's consolidate the key concepts:
- Independent per-position classifiers ignore label dependencies and can produce globally inconsistent label sequences.
- Generative models (HMMs) model the joint $P(\mathbf{x}, \mathbf{y})$; discriminative models (CRFs) model $P(\mathbf{y} \mid \mathbf{x})$ directly and can use arbitrary, overlapping features of the entire observation sequence.
- A CRF is a globally normalized log-linear model: the log-probability is linear in the weights up to the term $\log Z(\mathbf{x})$.
- The conditional log-likelihood is concave, its gradient has the "observed minus expected" form, and the linear-chain structure makes inference $O(nL^2)$ rather than $O(L^n)$.
The core CRF equation to internalize:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{i=1}^{n} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i) \right)$$
What's next:
Now that we understand the general CRF formulation, we'll specialize to Linear-Chain CRFs—the most common variant used for sequence labeling. We'll see how the chain structure enables efficient inference through dynamic programming, and how to implement the forward-backward algorithm for computing partition functions and marginal probabilities.
You now understand the foundational formulation of Conditional Random Fields: why structured prediction requires modeling label dependencies, how CRFs differ from generative models like HMMs, the mathematical form of the model, and its key properties. Next, we'll explore the linear-chain structure that makes CRFs computationally tractable.