In the previous section, we introduced the general Conditional Random Field formulation. While CRFs can be defined over arbitrary graph structures—trees, grids, fully-connected graphs—the most widely used variant in practice is the Linear-Chain CRF.
Linear-chain CRFs are specifically designed for sequential data where labels form a chain: each label depends on its immediate predecessor. This structure is ideal for sequence labeling tasks such as part-of-speech tagging and named entity recognition, where neighboring labels strongly constrain one another.
The linear-chain structure is not merely a simplification—it's a carefully chosen balance between expressiveness (capturing local dependencies) and tractability (enabling polynomial-time inference).
By the end of this page, you will understand: (1) The precise mathematical form of linear-chain CRFs, (2) How the chain structure enables efficient dynamic programming, (3) The forward algorithm for computing partition functions in O(nL²), (4) The backward algorithm for computing marginal probabilities, and (5) How to implement Viterbi decoding for finding the most probable label sequence.
A Linear-Chain CRF is a CRF whose underlying graph is a simple chain connecting labels $y_1 - y_2 - \cdots - y_n$. The key structural assumption is that the potential functions only involve adjacent label pairs.
Formal Definition:
For observation sequence $\mathbf{x} = (x_1, \ldots, x_n)$ and label sequence $\mathbf{y} = (y_1, \ldots, y_n)$ where each $y_i \in \mathcal{Y}$ (a finite label set of size $L = |\mathcal{Y}|$):
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{i=1}^{n} \Psi_i(y_{i-1}, y_i, \mathbf{x})$$
where $y_0 = \text{START}$ is a distinguished start symbol, each potential $\Psi_i(y_{i-1}, y_i, \mathbf{x}) = \exp\left(\sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i)\right)$ scores the label pair at position $i$ given the full observation sequence, and $Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{i=1}^{n} \Psi_i(y_{i-1}, y_i, \mathbf{x})$ is the partition function that normalizes the distribution.
The Score Function:
It's often convenient to work with log-space scores. Define the score of a label sequence:
$$\text{score}(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^{n} \sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i) = \sum_{i=1}^{n} s(y_{i-1}, y_i, \mathbf{x}, i)$$
where $s(y_{i-1}, y_i, \mathbf{x}, i) = \log \Psi_i(y_{i-1}, y_i, \mathbf{x})$ is the position score for transitioning from $y_{i-1}$ to $y_i$ at position $i$.
Matrix Formulation:
For efficient computation, we represent the potential functions as transition matrices. At each position $i$, define an $L \times L$ matrix $\mathbf{M}_i(\mathbf{x})$ where:
$$[\mathbf{M}_i(\mathbf{x})]_{y', y} = \Psi_i(y', y, \mathbf{x}) = \exp(s(y', y, \mathbf{x}, i))$$
Here, row index $y'$ represents the previous label, and column index $y$ represents the current label.
The partition function then becomes:
$$Z(\mathbf{x}) = \mathbf{1}^T_{\text{start}} \cdot \mathbf{M}_1(\mathbf{x}) \cdot \mathbf{M}_2(\mathbf{x}) \cdots \mathbf{M}_n(\mathbf{x}) \cdot \mathbf{1}$$
where $\mathbf{1}_{\text{start}}$ is a vector with 1 in the START position and $\mathbf{1}$ is the all-ones vector.
The matrix formulation reveals that computing the partition function is equivalent to multiplying n transition matrices. This directly leads to the O(nL²) forward algorithm—we perform n matrix-vector products, each taking O(L²) time.
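To make the matrix view concrete, here is a minimal sketch (our own illustration, assuming the transition matrices are precomputed in ordinary, non-log space, so it is only numerically safe for short sequences) that evaluates $Z(\mathbf{x})$ as a chain of matrix-vector products:

```python
import numpy as np

def partition_function_from_matrices(M: np.ndarray, start_index: int = 0) -> float:
    """
    Compute Z(x) = 1_start^T · M_1 · M_2 ··· M_n · 1.

    M has shape (n, L, L) with M[i, y_prev, y] = Ψ_{i+1}(y_prev, y, x);
    start_index is the row of M[0] that plays the role of START.
    """
    n, L, _ = M.shape
    v = M[0, start_index, :]      # 1_start^T · M_1
    for i in range(1, n):
        v = v @ M[i]              # one O(L^2) matrix-vector product per position
    return float(v.sum())         # final multiplication by the all-ones vector
```

After multiplying in $\mathbf{M}_1, \ldots, \mathbf{M}_i$, the vector `v` holds exactly the forward variables $\alpha_i(\cdot)$ introduced next, which is why this computation and the forward algorithm coincide.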
The forward algorithm is a dynamic programming procedure for efficiently computing the partition function $Z(\mathbf{x})$. Without dynamic programming, we would need to sum over all $L^n$ possible label sequences—exponential complexity.
The Key Insight:
Instead of summing over complete sequences, we build up partial sums position by position. Define the forward variable:
$$\alpha_i(y) = \sum_{y_1, \ldots, y_{i-1}} \prod_{j=1}^{i} \Psi_j(y_{j-1}, y_j, \mathbf{x})$$
where $y_i = y$ (the sum is over all partial sequences ending in label $y$ at position $i$).
Intuition: $\alpha_i(y)$ represents the (unnormalized) sum of scores for all paths from the start to position $i$ that end in label $y$.
Recurrence Relation:
$$\alpha_1(y) = \Psi_1(\text{START}, y, \mathbf{x})$$
$$\alpha_i(y) = \sum_{y' \in \mathcal{Y}} \alpha_{i-1}(y') \cdot \Psi_i(y', y, \mathbf{x}) \quad \text{for } i = 2, \ldots, n$$
Partition Function:
$$Z(\mathbf{x}) = \sum_{y \in \mathcal{Y}} \alpha_n(y)$$
```python
import numpy as np
from typing import List, Callable


def forward_algorithm(
    x: List[str],               # Observation sequence
    num_labels: int,            # Number of possible labels L
    compute_potential: Callable[[int, int, int], float],  # Ψ(y_prev, y, position)
) -> tuple[np.ndarray, float]:
    """
    Compute forward variables and partition function for linear-chain CRF.

    Args:
        x: Observation sequence of length n
        num_labels: Number of possible labels (L)
        compute_potential: Function Ψ(y_prev, y, i) returning exp(score);
                           y_prev=-1 represents START

    Returns:
        alpha: Forward variables, shape (n, L)
        Z: Partition function

    Complexity: O(n * L^2) time, O(n * L) space
    """
    n = len(x)
    L = num_labels

    # alpha[i, y] = sum over all paths from START to position i ending in y
    alpha = np.zeros((n, L))

    # Initialization: α_1(y) = Ψ_1(START, y, x)
    for y in range(L):
        alpha[0, y] = compute_potential(-1, y, 0)  # -1 = START

    # Recursion: α_i(y) = Σ_{y'} α_{i-1}(y') · Ψ_i(y', y, x)
    for i in range(1, n):
        for y in range(L):
            total = 0.0
            for y_prev in range(L):
                total += alpha[i-1, y_prev] * compute_potential(y_prev, y, i)
            alpha[i, y] = total

    # Partition function: Z = Σ_y α_n(y)
    Z = np.sum(alpha[n-1, :])

    return alpha, Z


def forward_algorithm_log_space(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],  # s(y_prev, y, position)
) -> tuple[np.ndarray, float]:
    """
    Log-space implementation for numerical stability.

    Instead of computing potentials Ψ = exp(s), we work with log-scores s
    and use the log-sum-exp trick to avoid overflow/underflow.

    Returns:
        log_alpha: Log forward variables, shape (n, L)
        log_Z: Log partition function
    """
    n = len(x)
    L = num_labels

    # log_alpha[i, y] = log(α_i(y))
    log_alpha = np.full((n, L), -np.inf)  # -inf represents log(0)

    # Initialization
    for y in range(L):
        log_alpha[0, y] = compute_score(-1, y, 0)

    # Recursion using log-sum-exp for numerical stability
    for i in range(1, n):
        for y in range(L):
            # log(Σ_{y'} α_{i-1}(y') · Ψ_i(y', y))
            #   = log(Σ_{y'} exp(log_α_{i-1}(y') + s_i(y', y)))
            #   = log_sum_exp over y' of (log_α_{i-1}(y') + s_i(y', y))
            terms = np.array([
                log_alpha[i-1, y_prev] + compute_score(y_prev, y, i)
                for y_prev in range(L)
            ])
            log_alpha[i, y] = log_sum_exp(terms)

    # Log partition function
    log_Z = log_sum_exp(log_alpha[n-1, :])

    return log_alpha, log_Z


def log_sum_exp(arr: np.ndarray) -> float:
    """
    Compute log(sum(exp(arr))) in a numerically stable way.

    The trick: log(Σ_i exp(a_i)) = max(a) + log(Σ_i exp(a_i - max(a)))
    """
    max_val = np.max(arr)
    if max_val == -np.inf:
        return -np.inf
    return max_val + np.log(np.sum(np.exp(arr - max_val)))


# Example usage
def example_score(y_prev: int, y: int, i: int) -> float:
    """Simple scoring function for demonstration."""
    # Emission scores (based on position/label)
    emission = 0.5 if y == i % 3 else -0.5
    # Transition scores (prefer same label)
    transition = 0.3 if y_prev == y else -0.1 if y_prev >= 0 else 0.0
    return emission + transition


x = ["word1", "word2", "word3", "word4"]
log_alpha, log_Z = forward_algorithm_log_space(x, num_labels=3, compute_score=example_score)

print("Log-forward variables:")
print(log_alpha)
print(f"\nLog partition function: log Z = {log_Z:.4f}")
print(f"Partition function: Z = {np.exp(log_Z):.4f}")
```

Always use the log-space implementation in practice! The non-log version multiplies many small numbers (potentials can be < 1), leading to underflow, while multiplying many large numbers leads to overflow. The log-sum-exp trick converts products to sums in log-space, maintaining numerical precision even for long sequences.
The backward algorithm computes the sum of scores for all paths from a given position to the end of the sequence. Combined with forward variables, it enables efficient computation of marginal probabilities.
Definition of Backward Variable:
$$\beta_i(y) = \sum_{y_{i+1}, \ldots, y_n} \prod_{j=i+1}^{n} \Psi_j(y_{j-1}, y_j, \mathbf{x})$$
where $y_i = y$ (the sum is over all partial sequences starting from label $y$ at position $i$).
Intuition: $\beta_i(y)$ represents the (unnormalized) sum of scores for all paths from position $i$ (with label $y$) to the end of the sequence.
Recurrence Relation (right-to-left):
$$\beta_n(y) = 1 \quad \text{(or END potential if used)}$$
$$\beta_i(y) = \sum_{y' \in \mathcal{Y}} \Psi_{i+1}(y, y', \mathbf{x}) \cdot \beta_{i+1}(y') \quad \text{for } i = n-1, \ldots, 1$$
```python
import numpy as np
from typing import List, Callable


def backward_algorithm_log_space(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],
) -> np.ndarray:
    """
    Compute backward variables in log-space.

    Args:
        x: Observation sequence of length n
        num_labels: Number of possible labels (L)
        compute_score: Function s(y_prev, y, i) returning log-potential

    Returns:
        log_beta: Backward variables, shape (n, L)

    Complexity: O(n * L^2) time, O(n * L) space
    """
    n = len(x)
    L = num_labels

    log_beta = np.full((n, L), -np.inf)

    # Initialization: β_n(y) = 1, so log β_n(y) = 0
    log_beta[n-1, :] = 0.0

    # Recursion (right to left)
    for i in range(n-2, -1, -1):
        for y in range(L):
            # log β_i(y) = log Σ_{y'} exp(s_{i+1}(y, y') + log β_{i+1}(y'))
            terms = np.array([
                compute_score(y, y_next, i+1) + log_beta[i+1, y_next]
                for y_next in range(L)
            ])
            log_beta[i, y] = log_sum_exp(terms)

    return log_beta


def log_sum_exp(arr: np.ndarray) -> float:
    """Log-sum-exp for numerical stability."""
    max_val = np.max(arr)
    if max_val == -np.inf:
        return -np.inf
    return max_val + np.log(np.sum(np.exp(arr - max_val)))


def compute_marginals(
    log_alpha: np.ndarray,  # (n, L) forward variables
    log_beta: np.ndarray,   # (n, L) backward variables
    log_Z: float,           # Log partition function
) -> np.ndarray:
    """
    Compute marginal probabilities P(y_i = y | x) for each position.

    Using: P(y_i = y | x) = α_i(y) · β_i(y) / Z
    In log-space: log P = log_α_i(y) + log_β_i(y) - log_Z

    Returns:
        marginals: Shape (n, L), where marginals[i, y] = P(y_i = y | x)
    """
    n, L = log_alpha.shape
    log_marginals = log_alpha + log_beta - log_Z
    marginals = np.exp(log_marginals)
    return marginals


def compute_pairwise_marginals(
    x: List[str],
    log_alpha: np.ndarray,
    log_beta: np.ndarray,
    log_Z: float,
    compute_score: Callable[[int, int, int], float],
) -> np.ndarray:
    """
    Compute pairwise marginals P(y_{i-1} = y', y_i = y | x).

    These are needed for gradient computation during training.

    Using: P(y_{i-1}=y', y_i=y | x) = α_{i-1}(y') · Ψ_i(y',y) · β_i(y) / Z

    Returns:
        pairwise: Shape (n-1, L, L), where
                  pairwise[i, y', y] = P(y_i=y', y_{i+1}=y | x)
    """
    n, L = log_alpha.shape
    pairwise = np.zeros((n-1, L, L))

    for i in range(1, n):
        for y_prev in range(L):
            for y in range(L):
                log_prob = (
                    log_alpha[i-1, y_prev]
                    + compute_score(y_prev, y, i)
                    + log_beta[i, y]
                    - log_Z
                )
                pairwise[i-1, y_prev, y] = np.exp(log_prob)

    return pairwise


# Example demonstrating marginal computation
def example_score(y_prev: int, y: int, i: int) -> float:
    emission = 0.5 if y == i % 3 else -0.5
    transition = 0.3 if y_prev == y else -0.1 if y_prev >= 0 else 0.0
    return emission + transition


x = ["word1", "word2", "word3", "word4"]
L = 3

# Run forward-backward
log_alpha, log_Z = forward_algorithm_log_space(x, L, example_score)
log_beta = backward_algorithm_log_space(x, L, example_score)

# Compute marginals
marginals = compute_marginals(log_alpha, log_beta, log_Z)
print("Marginal probabilities P(y_i = y | x):")
print(marginals)
print(f"\nVerify: Each row sums to 1: {np.sum(marginals, axis=1)}")
```

The Forward-Backward Identity:
A crucial relationship connects forward and backward variables:
$$Z(\mathbf{x}) = \sum_{y \in \mathcal{Y}} \alpha_i(y) \cdot \beta_i(y) \quad \text{for any } i$$
Proof: At any position $i$, the forward variable $\alpha_i(y)$ sums all paths from start to $i$ ending in $y$, while $\beta_i(y)$ sums all paths from $i$ to end starting with $y$. Their product covers all complete paths through label $y$ at position $i$. Summing over all $y$ gives all paths. $\square$
This identity is most useful as a correctness check: computing $\sum_{y} \alpha_i(y)\,\beta_i(y)$ at several positions and verifying that every value matches $Z(\mathbf{x})$ quickly exposes indexing or off-by-one bugs in a forward-backward implementation, as the sketch below illustrates.
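A minimal version of such a check, reusing `forward_algorithm_log_space`, `backward_algorithm_log_space`, and `log_sum_exp` from the listings above (the helper name `check_forward_backward_identity` is ours):

```python
import numpy as np

def check_forward_backward_identity(log_alpha: np.ndarray,
                                     log_beta: np.ndarray,
                                     log_Z: float,
                                     tol: float = 1e-8) -> bool:
    """Verify that log Σ_y α_i(y)·β_i(y) equals log Z(x) at every position i."""
    n, _ = log_alpha.shape
    for i in range(n):
        log_Z_i = log_sum_exp(log_alpha[i, :] + log_beta[i, :])
        if abs(log_Z_i - log_Z) > tol:
            raise AssertionError(f"Identity violated at position {i}: {log_Z_i} vs {log_Z}")
    return True

# Using the forward/backward outputs from the examples above:
check_forward_backward_identity(log_alpha, log_beta, log_Z)
```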
The gradient of the log-likelihood requires computing the model expectation $\mathbb{E}[F_k(\mathbf{x}, \mathbf{y})]$ of each global feature. This expectation decomposes into sums of pairwise marginal probabilities $P(y_{i-1}=y', y_i=y \mid \mathbf{x})$ weighted by feature values, which is exactly what forward-backward provides.
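As a rough sketch of how this could look in code (the feature interface `features(x, y, y_prev, i)` returning a length-$K$ vector is our own assumption, not defined elsewhere on this page), expected feature counts can be accumulated from the unary and pairwise marginals computed above:

```python
import numpy as np
from typing import Callable, List

def expected_feature_counts(
    x: List[str],
    marginals: np.ndarray,   # (n, L) unary marginals P(y_i = y | x)
    pairwise: np.ndarray,    # (n-1, L, L) pairwise marginals P(y_{i-1}=y', y_i=y | x)
    features: Callable[[List[str], int, int, int], np.ndarray],  # f(x, y, y_prev, i) -> (K,)
    K: int,
) -> np.ndarray:
    """E[F_k] = Σ_i Σ_{y', y} P(y_{i-1}=y', y_i=y | x) · f_k(x, y, y', i)."""
    n, L = marginals.shape
    expectation = np.zeros(K)
    # Position 1: the predecessor is START (-1), so the unary marginal suffices
    for y in range(L):
        expectation += marginals[0, y] * features(x, y, -1, 0)
    # Positions 2..n: weight the features by the pairwise marginals
    for i in range(1, n):
        for y_prev in range(L):
            for y in range(L):
                expectation += pairwise[i-1, y_prev, y] * features(x, y, y_prev, i)
    return expectation
```

Subtracting this expectation from the empirical feature counts of the gold label sequence yields the gradient of the log-likelihood with respect to $\lambda_k$.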
While the forward-backward algorithm computes probabilities over the entire label space, we often want to find the single best label sequence:
$$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} \text{score}(\mathbf{y} \mid \mathbf{x})$$
The Viterbi algorithm solves this efficiently using dynamic programming. It's structurally similar to the forward algorithm, but replaces sum with max.
Viterbi Variable:
$$\delta_i(y) = \max_{y_1, \ldots, y_{i-1}} \sum_{j=1}^{i} s(y_{j-1}, y_j, \mathbf{x}, j)$$
where $y_i = y$. This equals the maximum score among all paths from start to position $i$ ending in label $y$.
Recurrence:
$$\delta_1(y) = s(\text{START}, y, \mathbf{x}, 1)$$
$$\delta_i(y) = \max_{y' \in \mathcal{Y}} \left[ \delta_{i-1}(y') + s(y', y, \mathbf{x}, i) \right]$$
Backpointers:
To recover the actual sequence, we store backpointers:
$$\psi_i(y) = \arg\max_{y' \in \mathcal{Y}} \left[ \delta_{i-1}(y') + s(y', y, \mathbf{x}, i) \right]$$
```python
import numpy as np
from typing import List, Callable, Tuple


def viterbi_decode(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],
) -> Tuple[List[int], float]:
    """
    Find the highest-scoring label sequence using the Viterbi algorithm.

    Args:
        x: Observation sequence of length n
        num_labels: Number of possible labels (L)
        compute_score: Function s(y_prev, y, i) returning log-potential;
                       y_prev=-1 represents START

    Returns:
        best_path: List of label indices [y_1, y_2, ..., y_n]
        best_score: Score of the best path

    Complexity: O(n * L^2) time, O(n * L) space
    """
    n = len(x)
    L = num_labels

    # δ[i, y] = max score of any path ending in y at position i
    delta = np.full((n, L), -np.inf)
    # ψ[i, y] = best previous label to reach y at position i
    psi = np.zeros((n, L), dtype=int)

    # Initialization
    for y in range(L):
        delta[0, y] = compute_score(-1, y, 0)  # -1 = START
        psi[0, y] = -1  # No previous label

    # Recursion (forward pass)
    for i in range(1, n):
        for y in range(L):
            # Find best previous label
            scores = np.array([
                delta[i-1, y_prev] + compute_score(y_prev, y, i)
                for y_prev in range(L)
            ])
            best_prev = np.argmax(scores)
            delta[i, y] = scores[best_prev]
            psi[i, y] = best_prev

    # Termination: find best final label
    best_final_label = np.argmax(delta[n-1, :])
    best_score = delta[n-1, best_final_label]

    # Backtrack to find best path
    best_path = [0] * n
    best_path[n-1] = best_final_label
    for i in range(n-2, -1, -1):
        best_path[i] = psi[i+1, best_path[i+1]]

    return best_path, best_score


def viterbi_with_constraints(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],
    valid_transitions: np.ndarray,  # (L, L) boolean matrix
) -> Tuple[List[int], float]:
    """
    Viterbi with transition constraints.

    valid_transitions[y', y] = True if transition from y' to y is allowed.
    Invalid transitions are assigned -inf score.

    Example: In NER, B-PER cannot be followed by I-LOC.
    """
    n = len(x)
    L = num_labels

    delta = np.full((n, L), -np.inf)
    psi = np.zeros((n, L), dtype=int)

    for y in range(L):
        delta[0, y] = compute_score(-1, y, 0)
        psi[0, y] = -1

    for i in range(1, n):
        for y in range(L):
            scores = np.array([
                delta[i-1, y_prev] + compute_score(y_prev, y, i)
                if valid_transitions[y_prev, y] else -np.inf
                for y_prev in range(L)
            ])
            if np.all(scores == -np.inf):
                delta[i, y] = -np.inf
                psi[i, y] = 0
            else:
                best_prev = np.argmax(scores)
                delta[i, y] = scores[best_prev]
                psi[i, y] = best_prev

    # Handle case where no valid path exists
    if np.all(delta[n-1, :] == -np.inf):
        return [], float('-inf')

    best_final = np.argmax(delta[n-1, :])
    best_score = delta[n-1, best_final]

    best_path = [0] * n
    best_path[n-1] = best_final
    for i in range(n-2, -1, -1):
        best_path[i] = psi[i+1, best_path[i+1]]

    return best_path, best_score


# Example with named entity tags
LABELS = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG']
L = len(LABELS)

def ner_score(y_prev: int, y: int, i: int) -> float:
    """Score function for NER tagging."""
    # Simple example scores
    transition_scores = {
        ('O', 'O'): 0.5,
        ('O', 'B-PER'): 0.1,
        ('O', 'B-ORG'): 0.1,
        ('B-PER', 'I-PER'): 0.8,
        ('B-ORG', 'I-ORG'): 0.8,
        ('I-PER', 'I-PER'): 0.6,
        ('I-ORG', 'I-ORG'): 0.6,
    }
    prev_label = LABELS[y_prev] if y_prev >= 0 else 'START'
    curr_label = LABELS[y]
    trans = transition_scores.get((prev_label, curr_label), -0.5)
    emission = 0.5 if curr_label == 'O' else 0.0
    return trans + emission


x = ["John", "Smith", "works", "at", "Google"]
best_path, best_score = viterbi_decode(x, L, ner_score)

print("Best label sequence:")
for word, label_idx in zip(x, best_path):
    print(f"  {word}: {LABELS[label_idx]}")
print(f"\nBest score: {best_score:.4f}")
```

The best label at each position according to marginals may NOT form the best sequence! Example: if $P(y_1=A)=0.6$ but $P(y_1=A, y_2=A)=0.01$, choosing A at position 1 leads to suboptimal outcomes. Viterbi finds the globally optimal sequence; marginal-based decoding is greedy and can fail.
Let's rigorously analyze the computational complexity of linear-chain CRF algorithms. Let $n$ be sequence length, $L$ be the number of labels, and $K$ be the number of features.
Inference Complexity:
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Forward Algorithm | $O(n L^2)$ | $O(nL)$ or $O(L)$ | Sum over predecessors at each position |
| Backward Algorithm | $O(n L^2)$ | $O(nL)$ | Same structure, reverse direction |
| Partition Function | $O(n L^2)$ | $O(L)$ | From forward variables at last position |
| Marginal Probabilities | $O(n L^2)$ | $O(nL)$ | Requires both forward and backward |
| Viterbi Decoding | $O(n L^2)$ | $O(nL)$ | Max replaces sum; backpointers needed |
| Feature Expectations | $O(n L^2 K)$ | $O(nL)$ | Sum features weighted by pairwise marginals |
| Gradient Computation | $O(n L^2 + nK)$ | $O(nL + K)$ | Forward-backward plus feature aggregation |
Why $O(nL^2)$ and not $O(L^n)$?
The naive approach enumerates all $L^n$ label sequences. The forward algorithm exploits the Markov property: the score at position $i$ depends only on $(y_{i-1}, y_i)$, not the entire history.
Decomposition insight:
$$\sum_{\mathbf{y}} \exp(\text{score}(\mathbf{y})) = \sum_{\mathbf{y}} \prod_{i} \Psi_i(y_{i-1}, y_i)$$
We can rearrange the sum by "pushing" sums inward:
$$= \sum_{y_n} \sum_{y_{n-1}} \Psi_n(y_{n-1}, y_n) \left( \sum_{y_{n-2}} \Psi_{n-1}(y_{n-2}, y_{n-1}) \cdots \left( \sum_{y_1} \Psi_2(y_1, y_2)\, \Psi_1(y_0, y_1) \right) \cdots \right)$$
Each parenthesized quantity is a forward variable: the innermost sum combines $\alpha_1(y_1) = \Psi_1(y_0, y_1)$ with $\Psi_2$ to produce $\alpha_2(y_2)$, and each further sum produces the next forward variable. Computing one new forward variable takes an $O(L)$ sum for each of the $L$ values of the next label, i.e. $O(L^2)$ work per position; doing this for all $n$ positions gives $O(nL^2)$.
Training Complexity:
For $N$ training sequences of average length $\bar{n}$ and $T$ gradient descent iterations:
$$\text{Total training time} = O(T \cdot N \cdot \bar{n} \cdot L^2)$$
Plus the cost of feature computation (often the bottleneck in practice).
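For a rough sense of scale (numbers chosen purely for illustration, not taken from any benchmark): with $N = 10{,}000$ sequences, $\bar{n} = 25$, $L = 20$, and $T = 100$ iterations, the forward-backward work alone is about $100 \cdot 10{,}000 \cdot 25 \cdot 20^2 = 10^{10}$ elementary operations, before accounting for feature computation at all.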
In practice, feature computation often dominates. With sparse features (common in NLP), we only sum over non-zero features. GPU parallelization across positions and mini-batches can significantly speed up training. For very large label sets (e.g., L > 1000), specialized techniques like hierarchical softmax or sampled softmax may be needed.
Space Optimization:
The full forward and backward arrays require $O(nL)$ space. For very long sequences, we can trade off:
Checkpointing: Store forward variables only at every $\sqrt{n}$ positions. Recompute intermediate values during backward pass. Reduces space to $O(\sqrt{n} \cdot L)$ but increases time by constant factor.
Streaming: For partition function only, forward algorithm needs only $O(L)$ space (keep only previous row). But full marginals require both forward and backward arrays.
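A minimal sketch of the streaming variant, keeping only the previous row of forward variables; it reuses the `compute_score` convention and the `log_sum_exp` helper defined earlier on this page (so it assumes those are in scope):

```python
import numpy as np
from typing import Callable, List

def log_partition_streaming(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],  # s(y_prev, y, i); -1 = START
) -> float:
    """Compute log Z(x) in O(L) memory by keeping only the previous forward row."""
    n, L = len(x), num_labels
    prev = np.array([compute_score(-1, y, 0) for y in range(L)])  # log α_1
    for i in range(1, n):
        curr = np.empty(L)
        for y in range(L):
            terms = np.array([prev[y_prev] + compute_score(y_prev, y, i)
                              for y_prev in range(L)])
            curr[y] = log_sum_exp(terms)
        prev = curr  # earlier rows are discarded: only the previous position is needed
    return float(log_sum_exp(prev))
```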
Label Set Considerations:
The $L^2$ factor comes from considering all label pairs. For structured label sets (e.g., IOB tagging, where many transitions are invalid), sparse transition matrices reduce this to $O(n \cdot |\mathcal{T}|)$, where $\mathcal{T} \subseteq \mathcal{Y} \times \mathcal{Y}$ is the set of valid transitions.
While first-order linear-chain CRFs (where features depend on $y_{i-1}, y_i$) are most common, we sometimes need to capture longer-range dependencies between labels.
Second-Order CRFs:
Features depend on $(y_{i-2}, y_{i-1}, y_i)$:
$$P(\mathbf{y} \mid \mathbf{x}) \propto \exp\left( \sum_i \sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, y_{i-2}, i) \right)$$
Example motivation: in POS tagging, the patterns "to [verb] [noun]" and "to [verb] [verb]" call for different tag choices that depend on two preceding labels at once, which a first-order model cannot express in its transition features.
Complexity Impact:
| CRF Order | State Space | Time Complexity | Space Complexity |
|---|---|---|---|
| First-order | $O(L)$ | $O(nL^2)$ | $O(nL)$ |
| Second-order | $O(L^2)$ | $O(nL^3)$ | $O(nL^2)$ |
| $k$-th order | $O(L^k)$ | $O(nL^{k+1})$ | $O(nL^k)$ |
Implementation via State Augmentation:
A second-order CRF can be implemented as a first-order CRF with an augmented state space. Each "label" becomes a label pair $(y_{i-1}, y_i)$.
This trick preserves the first-order algorithm structure; only the state space grows, as in the sketch below.
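A sketch of the augmentation, assuming a second-order scoring function `s2(y_prev2, y_prev, y, i)` (the helper names here are illustrative, not from any library, and boundary handling for the first position is deliberately simplified):

```python
from itertools import product
from typing import Callable, List, Tuple

def build_augmented_states(L: int) -> List[Tuple[int, int]]:
    """Each augmented state at position i is the pair (y_{i-1}, y_i): L^2 states in total."""
    return list(product(range(L), range(L)))

def make_first_order_score(
    states: List[Tuple[int, int]],
    s2: Callable[[int, int, int, int], float],  # s2(y_{i-2}, y_{i-1}, y_i, i); -1 = START
) -> Callable[[int, int, int], float]:
    """Wrap a second-order score as a first-order score over augmented states,
    so the forward/Viterbi code above can be reused unchanged."""
    def score(s_prev: int, s_curr: int, i: int) -> float:
        y_im1_curr, y_i = states[s_curr]         # current pair (y_{i-1}, y_i)
        if s_prev < 0:                           # START: no previous pair to agree with
            return s2(-1, y_im1_curr, y_i, i)
        y_im2, y_im1 = states[s_prev]            # previous pair (y_{i-2}, y_{i-1})
        if y_im1 != y_im1_curr:
            return float('-inf')                 # pairs must overlap on the shared label
        return s2(y_im2, y_im1, y_i, i)
    return score
```

Running `viterbi_decode` with `num_labels=len(states)` and this wrapped score works but naively loops over all $L^2 \times L^2$ state pairs; exploiting the fact that each augmented state has only $L$ consistent predecessors recovers the $O(nL^3)$ bound from the table above.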
When to Use Higher-Order CRFs: higher-order models pay off only when label patterns genuinely span more than two positions and there is enough training data to estimate the much larger set of transition parameters; otherwise the exponential growth of the state space rarely justifies the cost.
Practical Alternative: Semi-Markov CRFs
For segment-level labeling (assigning labels to spans rather than tokens), Semi-Markov CRFs model segment boundaries and can capture arbitrary segment features without explicit higher-order structure.
In contemporary NLP, neural network encoders (LSTMs, Transformers) implicitly capture long-range dependencies in their hidden states. A first-order CRF on top of neural features often outperforms higher-order CRFs on hand-crafted features, with better computational efficiency.
We have covered the complete algorithmic framework for linear-chain CRFs: the chain-structured model and its potentials, the forward and backward recursions, marginal computation, and Viterbi decoding.
The core algorithms summarized:
| Algorithm | Computes | Key Operation | Result |
|---|---|---|---|
| Forward | $\alpha_i(y)$ | Sum over predecessors | Partition function |
| Backward | $\beta_i(y)$ | Sum over successors | Enables marginals |
| Forward-Backward | Both + combine | $\alpha \cdot \beta / Z$ | Marginal probabilities |
| Viterbi | $\delta_i(y)$ + backtrack | Max over predecessors | Best label sequence |
What's next:
With the inference algorithms in place, we need to understand how to design features that CRFs can use effectively. The next page covers feature functions—the representations that encode domain knowledge and enable CRFs to learn from data.
You now understand the linear-chain CRF structure and its core inference algorithms: forward, backward, and Viterbi. You can implement these algorithms efficiently in log-space and understand their computational complexity. Next, we'll explore how to design powerful feature functions for CRFs.