In the previous section, we introduced the general Conditional Random Field formulation. While CRFs can be defined over arbitrary graph structures—trees, grids, fully-connected graphs—the most widely used variant in practice is the Linear-Chain CRF.
Linear-chain CRFs are specifically designed for sequential data where labels form a chain: each label depends on its immediate predecessor. This structure is ideal for sequence labeling tasks such as part-of-speech tagging and named entity recognition, where neighboring labels strongly constrain one another.
The linear-chain structure is not merely a simplification—it's a carefully chosen balance between expressiveness (capturing local dependencies) and tractability (enabling polynomial-time inference).
By the end of this page, you will understand: (1) The precise mathematical form of linear-chain CRFs, (2) How the chain structure enables efficient dynamic programming, (3) The forward algorithm for computing partition functions in O(nL²), (4) The backward algorithm for computing marginal probabilities, and (5) How to implement Viterbi decoding for finding the most probable label sequence.
A Linear-Chain CRF is a CRF whose underlying graph is a simple chain connecting labels $y_1 - y_2 - \cdots - y_n$. The key structural assumption is that the potential functions only involve adjacent label pairs.
Formal Definition:
For observation sequence $\mathbf{x} = (x_1, \ldots, x_n)$ and label sequence $\mathbf{y} = (y_1, \ldots, y_n)$ where each $y_i \in \mathcal{Y}$ (a finite label set of size $L = |\mathcal{Y}|$):
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{i=1}^{n} \Psi_i(y_{i-1}, y_i, \mathbf{x})$$
where $y_0 = \text{START}$ is a distinguished start symbol, each potential $\Psi_i(y_{i-1}, y_i, \mathbf{x}) = \exp\left(\sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i)\right)$ scores the label pair at position $i$ given the full observation sequence, and $Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{i=1}^{n} \Psi_i(y_{i-1}, y_i, \mathbf{x})$ is the partition function that normalizes the distribution.
The Score Function:
It's often convenient to work with log-space scores. Define the score of a label sequence:
$$\text{score}(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^{n} \sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, i) = \sum_{i=1}^{n} s(y_{i-1}, y_i, \mathbf{x}, i)$$
where $s(y_{i-1}, y_i, \mathbf{x}, i) = \log \Psi_i(y_{i-1}, y_i, \mathbf{x})$ is the position score for transitioning from $y_{i-1}$ to $y_i$ at position $i$.
Matrix Formulation:
For efficient computation, we represent the potential functions as transition matrices. At each position $i$, define an $L \times L$ matrix $\mathbf{M}_i(\mathbf{x})$ where:
$$[\mathbf{M}_i(\mathbf{x})]_{y', y} = \Psi_i(y', y, \mathbf{x}) = \exp(s(y', y, \mathbf{x}, i))$$
Here, row index $y'$ represents the previous label, and column index $y$ represents the current label.
The partition function then becomes:
$$Z(\mathbf{x}) = \mathbf{1}^T_{\text{start}} \cdot \mathbf{M}_1(\mathbf{x}) \cdot \mathbf{M}_2(\mathbf{x}) \cdots \mathbf{M}_n(\mathbf{x}) \cdot \mathbf{1}$$
where $\mathbf{1}_{\text{start}}$ is a vector with 1 in the START position and $\mathbf{1}$ is the all-ones vector.
The matrix formulation reveals that computing the partition function is equivalent to multiplying n transition matrices. This directly leads to the O(nL²) forward algorithm—we perform n matrix-vector products, each taking O(L²) time.
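To make the matrix view concrete, here is a minimal sketch (our own illustration, assuming the transition matrices are precomputed in ordinary, non-log space, so it is only numerically safe for short sequences) that evaluates $Z(\mathbf{x})$ as a chain of matrix-vector products:

```python
import numpy as np

def partition_function_from_matrices(M: np.ndarray, start_index: int = 0) -> float:
    """
    Compute Z(x) = 1_start^T · M_1 · M_2 ··· M_n · 1.

    M has shape (n, L, L) with M[i, y_prev, y] = Ψ_{i+1}(y_prev, y, x);
    start_index is the row of M[0] that plays the role of START.
    """
    n, L, _ = M.shape
    v = M[0, start_index, :]      # 1_start^T · M_1
    for i in range(1, n):
        v = v @ M[i]              # one O(L^2) matrix-vector product per position
    return float(v.sum())         # final multiplication by the all-ones vector
```

After multiplying in $\mathbf{M}_1, \ldots, \mathbf{M}_i$, the vector `v` holds exactly the forward variables $\alpha_i(\cdot)$ introduced next, which is why this computation and the forward algorithm coincide.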
The forward algorithm is a dynamic programming procedure for efficiently computing the partition function $Z(\mathbf{x})$. Without dynamic programming, we would need to sum over all $L^n$ possible label sequences—exponential complexity.
The Key Insight:
Instead of summing over complete sequences, we build up partial sums position by position. Define the forward variable:
$$\alpha_i(y) = \sum_{y_1, \ldots, y_{i-1}} \prod_{j=1}^{i} \Psi_j(y_{j-1}, y_j, \mathbf{x})$$
where $y_i = y$ (the sum is over all partial sequences ending in label $y$ at position $i$).
Intuition: $\alpha_i(y)$ represents the (unnormalized) sum of scores for all paths from the start to position $i$ that end in label $y$.
Recurrence Relation:
$$\alpha_1(y) = \Psi_1(\text{START}, y, \mathbf{x})$$
$$\alpha_i(y) = \sum_{y' \in \mathcal{Y}} \alpha_{i-1}(y') \cdot \Psi_i(y', y, \mathbf{x}) \quad \text{for } i = 2, \ldots, n$$
Partition Function:
$$Z(\mathbf{x}) = \sum_{y \in \mathcal{Y}} \alpha_n(y)$$
```python
import numpy as np
from typing import List, Callable


def forward_algorithm(
    x: List[str],               # Observation sequence
    num_labels: int,            # Number of possible labels L
    compute_potential: Callable[[int, int, int], float],  # Ψ(y_prev, y, position)
) -> tuple[np.ndarray, float]:
    """
    Compute forward variables and partition function for linear-chain CRF.

    Args:
        x: Observation sequence of length n
        num_labels: Number of possible labels (L)
        compute_potential: Function Ψ(y_prev, y, i) returning exp(score);
                           y_prev=-1 represents START

    Returns:
        alpha: Forward variables, shape (n, L)
        Z: Partition function

    Complexity: O(n * L^2) time, O(n * L) space
    """
    n = len(x)
    L = num_labels

    # alpha[i, y] = sum over all paths from START to position i ending in y
    alpha = np.zeros((n, L))

    # Initialization: α_1(y) = Ψ_1(START, y, x)
    for y in range(L):
        alpha[0, y] = compute_potential(-1, y, 0)  # -1 = START

    # Recursion: α_i(y) = Σ_{y'} α_{i-1}(y') · Ψ_i(y', y, x)
    for i in range(1, n):
        for y in range(L):
            total = 0.0
            for y_prev in range(L):
                total += alpha[i-1, y_prev] * compute_potential(y_prev, y, i)
            alpha[i, y] = total

    # Partition function: Z = Σ_y α_n(y)
    Z = np.sum(alpha[n-1, :])

    return alpha, Z


def forward_algorithm_log_space(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],  # s(y_prev, y, position)
) -> tuple[np.ndarray, float]:
    """
    Log-space implementation for numerical stability.

    Instead of computing potentials Ψ = exp(s), we work with log-scores s
    and use the log-sum-exp trick to avoid overflow/underflow.

    Returns:
        log_alpha: Log forward variables, shape (n, L)
        log_Z: Log partition function
    """
    n = len(x)
    L = num_labels

    # log_alpha[i, y] = log(α_i(y))
    log_alpha = np.full((n, L), -np.inf)  # -inf represents log(0)

    # Initialization
    for y in range(L):
        log_alpha[0, y] = compute_score(-1, y, 0)

    # Recursion using log-sum-exp for numerical stability
    for i in range(1, n):
        for y in range(L):
            # log(Σ_{y'} α_{i-1}(y') · Ψ_i(y', y))
            #   = log(Σ_{y'} exp(log_α_{i-1}(y') + s_i(y', y)))
            #   = log_sum_exp over y' of (log_α_{i-1}(y') + s_i(y', y))
            terms = np.array([
                log_alpha[i-1, y_prev] + compute_score(y_prev, y, i)
                for y_prev in range(L)
            ])
            log_alpha[i, y] = log_sum_exp(terms)

    # Log partition function
    log_Z = log_sum_exp(log_alpha[n-1, :])

    return log_alpha, log_Z


def log_sum_exp(arr: np.ndarray) -> float:
    """
    Compute log(sum(exp(arr))) in a numerically stable way.

    The trick: log(Σ_i exp(a_i)) = max(a) + log(Σ_i exp(a_i - max(a)))
    """
    max_val = np.max(arr)
    if max_val == -np.inf:
        return -np.inf
    return max_val + np.log(np.sum(np.exp(arr - max_val)))


# Example usage
def example_score(y_prev: int, y: int, i: int) -> float:
    """Simple scoring function for demonstration."""
    # Emission scores (based on position/label)
    emission = 0.5 if y == i % 3 else -0.5
    # Transition scores (prefer same label)
    transition = 0.3 if y_prev == y else -0.1 if y_prev >= 0 else 0.0
    return emission + transition


x = ["word1", "word2", "word3", "word4"]
log_alpha, log_Z = forward_algorithm_log_space(x, num_labels=3, compute_score=example_score)

print("Log-forward variables:")
print(log_alpha)
print(f"\nLog partition function: log Z = {log_Z:.4f}")
print(f"Partition function: Z = {np.exp(log_Z):.4f}")
```

Always use the log-space implementation in practice! The non-log version multiplies many small numbers (potentials can be < 1), leading to underflow, while multiplying many large numbers leads to overflow. The log-sum-exp trick converts products to sums in log-space, maintaining numerical precision even for long sequences.
The backward algorithm computes the sum of scores for all paths from a given position to the end of the sequence. Combined with forward variables, it enables efficient computation of marginal probabilities.
Definition of Backward Variable:
$$\beta_i(y) = \sum_{y_{i+1}, \ldots, y_n} \prod_{j=i+1}^{n} \Psi_j(y_{j-1}, y_j, \mathbf{x})$$
where $y_i = y$ (the sum is over all partial sequences starting from label $y$ at position $i$).
Intuition: $\beta_i(y)$ represents the (unnormalized) sum of scores for all paths from position $i$ (with label $y$) to the end of the sequence.
Recurrence Relation (right-to-left):
$$\beta_n(y) = 1 \quad \text{(or END potential if used)}$$
$$\beta_i(y) = \sum_{y' \in \mathcal{Y}} \Psi_{i+1}(y, y', \mathbf{x}) \cdot \beta_{i+1}(y') \quad \text{for } i = n-1, \ldots, 1$$
```python
import numpy as np
from typing import List, Callable


def backward_algorithm_log_space(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],
) -> np.ndarray:
    """
    Compute backward variables in log-space.

    Args:
        x: Observation sequence of length n
        num_labels: Number of possible labels (L)
        compute_score: Function s(y_prev, y, i) returning log-potential

    Returns:
        log_beta: Backward variables, shape (n, L)

    Complexity: O(n * L^2) time, O(n * L) space
    """
    n = len(x)
    L = num_labels

    log_beta = np.full((n, L), -np.inf)

    # Initialization: β_n(y) = 1, so log β_n(y) = 0
    log_beta[n-1, :] = 0.0

    # Recursion (right to left)
    for i in range(n-2, -1, -1):
        for y in range(L):
            # log β_i(y) = log Σ_{y'} exp(s_{i+1}(y, y') + log β_{i+1}(y'))
            terms = np.array([
                compute_score(y, y_next, i+1) + log_beta[i+1, y_next]
                for y_next in range(L)
            ])
            log_beta[i, y] = log_sum_exp(terms)

    return log_beta


def log_sum_exp(arr: np.ndarray) -> float:
    """Log-sum-exp for numerical stability."""
    max_val = np.max(arr)
    if max_val == -np.inf:
        return -np.inf
    return max_val + np.log(np.sum(np.exp(arr - max_val)))


def compute_marginals(
    log_alpha: np.ndarray,  # (n, L) forward variables
    log_beta: np.ndarray,   # (n, L) backward variables
    log_Z: float,           # Log partition function
) -> np.ndarray:
    """
    Compute marginal probabilities P(y_i = y | x) for each position.

    Using: P(y_i = y | x) = α_i(y) · β_i(y) / Z
    In log-space: log P = log_α_i(y) + log_β_i(y) - log_Z

    Returns:
        marginals: Shape (n, L), where marginals[i, y] = P(y_i = y | x)
    """
    n, L = log_alpha.shape
    log_marginals = log_alpha + log_beta - log_Z
    marginals = np.exp(log_marginals)
    return marginals


def compute_pairwise_marginals(
    x: List[str],
    log_alpha: np.ndarray,
    log_beta: np.ndarray,
    log_Z: float,
    compute_score: Callable[[int, int, int], float],
) -> np.ndarray:
    """
    Compute pairwise marginals P(y_{i-1} = y', y_i = y | x).

    These are needed for gradient computation during training.

    Using: P(y_{i-1}=y', y_i=y | x) = α_{i-1}(y') · Ψ_i(y',y) · β_i(y) / Z

    Returns:
        pairwise: Shape (n-1, L, L), where
                  pairwise[i, y', y] = P(y_i=y', y_{i+1}=y | x)
    """
    n, L = log_alpha.shape
    pairwise = np.zeros((n-1, L, L))

    for i in range(1, n):
        for y_prev in range(L):
            for y in range(L):
                log_prob = (
                    log_alpha[i-1, y_prev]
                    + compute_score(y_prev, y, i)
                    + log_beta[i, y]
                    - log_Z
                )
                pairwise[i-1, y_prev, y] = np.exp(log_prob)

    return pairwise


# Example demonstrating marginal computation
def example_score(y_prev: int, y: int, i: int) -> float:
    emission = 0.5 if y == i % 3 else -0.5
    transition = 0.3 if y_prev == y else -0.1 if y_prev >= 0 else 0.0
    return emission + transition


x = ["word1", "word2", "word3", "word4"]
L = 3

# Run forward-backward
log_alpha, log_Z = forward_algorithm_log_space(x, L, example_score)
log_beta = backward_algorithm_log_space(x, L, example_score)

# Compute marginals
marginals = compute_marginals(log_alpha, log_beta, log_Z)
print("Marginal probabilities P(y_i = y | x):")
print(marginals)
print(f"\nVerify: Each row sums to 1: {np.sum(marginals, axis=1)}")
```

The Forward-Backward Identity:
A crucial relationship connects forward and backward variables:
$$Z(\mathbf{x}) = \sum_{y \in \mathcal{Y}} \alpha_i(y) \cdot \beta_i(y) \quad \text{for any } i$$
Proof: At any position $i$, the forward variable $\alpha_i(y)$ sums all paths from start to $i$ ending in $y$, while $\beta_i(y)$ sums all paths from $i$ to end starting with $y$. Their product covers all complete paths through label $y$ at position $i$. Summing over all $y$ gives all paths. $\square$
This identity is most useful as a correctness check: computing $\sum_{y} \alpha_i(y)\,\beta_i(y)$ at several positions and verifying that every value matches $Z(\mathbf{x})$ quickly exposes indexing or off-by-one bugs in a forward-backward implementation, as the sketch below illustrates.
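A minimal version of such a check, reusing `forward_algorithm_log_space`, `backward_algorithm_log_space`, and `log_sum_exp` from the listings above (the helper name `check_forward_backward_identity` is ours):

```python
import numpy as np

def check_forward_backward_identity(log_alpha: np.ndarray,
                                     log_beta: np.ndarray,
                                     log_Z: float,
                                     tol: float = 1e-8) -> bool:
    """Verify that log Σ_y α_i(y)·β_i(y) equals log Z(x) at every position i."""
    n, _ = log_alpha.shape
    for i in range(n):
        log_Z_i = log_sum_exp(log_alpha[i, :] + log_beta[i, :])
        if abs(log_Z_i - log_Z) > tol:
            raise AssertionError(f"Identity violated at position {i}: {log_Z_i} vs {log_Z}")
    return True

# Using the forward/backward outputs from the examples above:
check_forward_backward_identity(log_alpha, log_beta, log_Z)
```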
The gradient of the log-likelihood requires computing the model expectation $\mathbb{E}[F_k(\mathbf{x}, \mathbf{y})]$ of each global feature. This expectation decomposes into sums of pairwise marginal probabilities $P(y_{i-1}=y', y_i=y \mid \mathbf{x})$ weighted by feature values, which is exactly what forward-backward provides.
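As a rough sketch of how this could look in code (the feature interface `features(x, y, y_prev, i)` returning a length-$K$ vector is our own assumption, not defined elsewhere on this page), expected feature counts can be accumulated from the unary and pairwise marginals computed above:

```python
import numpy as np
from typing import Callable, List

def expected_feature_counts(
    x: List[str],
    marginals: np.ndarray,   # (n, L) unary marginals P(y_i = y | x)
    pairwise: np.ndarray,    # (n-1, L, L) pairwise marginals P(y_{i-1}=y', y_i=y | x)
    features: Callable[[List[str], int, int, int], np.ndarray],  # f(x, y, y_prev, i) -> (K,)
    K: int,
) -> np.ndarray:
    """E[F_k] = Σ_i Σ_{y', y} P(y_{i-1}=y', y_i=y | x) · f_k(x, y, y', i)."""
    n, L = marginals.shape
    expectation = np.zeros(K)
    # Position 1: the predecessor is START (-1), so the unary marginal suffices
    for y in range(L):
        expectation += marginals[0, y] * features(x, y, -1, 0)
    # Positions 2..n: weight the features by the pairwise marginals
    for i in range(1, n):
        for y_prev in range(L):
            for y in range(L):
                expectation += pairwise[i-1, y_prev, y] * features(x, y, y_prev, i)
    return expectation
```

Subtracting this expectation from the empirical feature counts of the gold label sequence yields the gradient of the log-likelihood with respect to $\lambda_k$.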
While the forward-backward algorithm computes probabilities over the entire label space, we often want to find the single best label sequence:
$$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} \text{score}(\mathbf{y} \mid \mathbf{x})$$
The Viterbi algorithm solves this efficiently using dynamic programming. It's structurally similar to the forward algorithm, but replaces sum with max.
Viterbi Variable:
$$\delta_i(y) = \max_{y_1, \ldots, y_{i-1}} \sum_{j=1}^{i} s(y_{j-1}, y_j, \mathbf{x}, j)$$
where $y_i = y$. This equals the maximum score among all paths from start to position $i$ ending in label $y$.
Recurrence:
$$\delta_1(y) = s(\text{START}, y, \mathbf{x}, 1)$$
$$\delta_i(y) = \max_{y' \in \mathcal{Y}} \left[ \delta_{i-1}(y') + s(y', y, \mathbf{x}, i) \right]$$
Backpointers:
To recover the actual sequence, we store backpointers:
$$\psi_i(y) = \arg\max_{y' \in \mathcal{Y}} \left[ \delta_{i-1}(y') + s(y', y, \mathbf{x}, i) \right]$$
```python
import numpy as np
from typing import List, Callable, Tuple


def viterbi_decode(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],
) -> Tuple[List[int], float]:
    """
    Find the highest-scoring label sequence using the Viterbi algorithm.

    Args:
        x: Observation sequence of length n
        num_labels: Number of possible labels (L)
        compute_score: Function s(y_prev, y, i) returning log-potential;
                       y_prev=-1 represents START

    Returns:
        best_path: List of label indices [y_1, y_2, ..., y_n]
        best_score: Score of the best path

    Complexity: O(n * L^2) time, O(n * L) space
    """
    n = len(x)
    L = num_labels

    # δ[i, y] = max score of any path ending in y at position i
    delta = np.full((n, L), -np.inf)
    # ψ[i, y] = best previous label to reach y at position i
    psi = np.zeros((n, L), dtype=int)

    # Initialization
    for y in range(L):
        delta[0, y] = compute_score(-1, y, 0)  # -1 = START
        psi[0, y] = -1  # No previous label

    # Recursion (forward pass)
    for i in range(1, n):
        for y in range(L):
            # Find best previous label
            scores = np.array([
                delta[i-1, y_prev] + compute_score(y_prev, y, i)
                for y_prev in range(L)
            ])
            best_prev = np.argmax(scores)
            delta[i, y] = scores[best_prev]
            psi[i, y] = best_prev

    # Termination: find best final label
    best_final_label = np.argmax(delta[n-1, :])
    best_score = delta[n-1, best_final_label]

    # Backtrack to find best path
    best_path = [0] * n
    best_path[n-1] = best_final_label
    for i in range(n-2, -1, -1):
        best_path[i] = psi[i+1, best_path[i+1]]

    return best_path, best_score


def viterbi_with_constraints(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],
    valid_transitions: np.ndarray,  # (L, L) boolean matrix
) -> Tuple[List[int], float]:
    """
    Viterbi with transition constraints.

    valid_transitions[y', y] = True if transition from y' to y is allowed.
    Invalid transitions are assigned -inf score.

    Example: In NER, B-PER cannot be followed by I-LOC.
    """
    n = len(x)
    L = num_labels

    delta = np.full((n, L), -np.inf)
    psi = np.zeros((n, L), dtype=int)

    for y in range(L):
        delta[0, y] = compute_score(-1, y, 0)
        psi[0, y] = -1

    for i in range(1, n):
        for y in range(L):
            scores = np.array([
                delta[i-1, y_prev] + compute_score(y_prev, y, i)
                if valid_transitions[y_prev, y] else -np.inf
                for y_prev in range(L)
            ])
            if np.all(scores == -np.inf):
                delta[i, y] = -np.inf
                psi[i, y] = 0
            else:
                best_prev = np.argmax(scores)
                delta[i, y] = scores[best_prev]
                psi[i, y] = best_prev

    # Handle case where no valid path exists
    if np.all(delta[n-1, :] == -np.inf):
        return [], float('-inf')

    best_final = np.argmax(delta[n-1, :])
    best_score = delta[n-1, best_final]

    best_path = [0] * n
    best_path[n-1] = best_final
    for i in range(n-2, -1, -1):
        best_path[i] = psi[i+1, best_path[i+1]]

    return best_path, best_score


# Example with named entity tags
LABELS = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG']
L = len(LABELS)

def ner_score(y_prev: int, y: int, i: int) -> float:
    """Score function for NER tagging."""
    # Simple example scores
    transition_scores = {
        ('O', 'O'): 0.5,
        ('O', 'B-PER'): 0.1,
        ('O', 'B-ORG'): 0.1,
        ('B-PER', 'I-PER'): 0.8,
        ('B-ORG', 'I-ORG'): 0.8,
        ('I-PER', 'I-PER'): 0.6,
        ('I-ORG', 'I-ORG'): 0.6,
    }
    prev_label = LABELS[y_prev] if y_prev >= 0 else 'START'
    curr_label = LABELS[y]
    trans = transition_scores.get((prev_label, curr_label), -0.5)
    emission = 0.5 if curr_label == 'O' else 0.0
    return trans + emission


x = ["John", "Smith", "works", "at", "Google"]
best_path, best_score = viterbi_decode(x, L, ner_score)

print("Best label sequence:")
for word, label_idx in zip(x, best_path):
    print(f"  {word}: {LABELS[label_idx]}")
print(f"\nBest score: {best_score:.4f}")
```

The best label at each position according to marginals may NOT form the best sequence! Example: if $P(y_1=A)=0.6$ but $P(y_1=A, y_2=A)=0.01$, choosing A at position 1 leads to suboptimal outcomes. Viterbi finds the globally optimal sequence; marginal-based decoding is greedy and can fail.
Let's rigorously analyze the computational complexity of linear-chain CRF algorithms. Let $n$ be sequence length, $L$ be the number of labels, and $K$ be the number of features.
Inference Complexity:
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Forward Algorithm | $O(n L^2)$ | $O(nL)$ or $O(L)$ | Sum over predecessors at each position |
| Backward Algorithm | $O(n L^2)$ | $O(nL)$ | Same structure, reverse direction |
| Partition Function | $O(n L^2)$ | $O(L)$ | From forward variables at last position |
| Marginal Probabilities | $O(n L^2)$ | $O(nL)$ | Requires both forward and backward |
| Viterbi Decoding | $O(n L^2)$ | $O(nL)$ | Max replaces sum; backpointers needed |
| Feature Expectations | $O(n L^2 K)$ | $O(nL)$ | Sum features weighted by pairwise marginals |
| Gradient Computation | $O(n L^2 + nK)$ | $O(nL + K)$ | Forward-backward plus feature aggregation |
Why $O(nL^2)$ and not $O(L^n)$?
The naive approach enumerates all $L^n$ label sequences. The forward algorithm exploits the Markov property: the score at position $i$ depends only on $(y_{i-1}, y_i)$, not the entire history.
Decomposition insight:
$$\sum_{\mathbf{y}} \exp(\text{score}(\mathbf{y})) = \sum_{\mathbf{y}} \prod_{i} \Psi_i(y_{i-1}, y_i)$$
We can rearrange the sum by "pushing" sums inward:
$$= \sum_{y_n} \sum_{y_{n-1}} \Psi_n(y_{n-1}, y_n) \left( \sum_{y_{n-2}} \Psi_{n-1}(y_{n-2}, y_{n-1}) \cdots \left( \sum_{y_1} \Psi_2(y_1, y_2)\, \Psi_1(y_0, y_1) \right) \cdots \right)$$
Each parenthesized quantity is a forward variable: the innermost sum combines $\alpha_1(y_1) = \Psi_1(y_0, y_1)$ with $\Psi_2$ to produce $\alpha_2(y_2)$, and each further sum produces the next forward variable. Computing one new forward variable takes an $O(L)$ sum for each of the $L$ values of the next label, i.e. $O(L^2)$ work per position; doing this for all $n$ positions gives $O(nL^2)$.
Training Complexity:
For $N$ training sequences of average length $\bar{n}$ and $T$ gradient descent iterations:
$$\text{Total training time} = O(T \cdot N \cdot \bar{n} \cdot L^2)$$
Plus the cost of feature computation (often the bottleneck in practice).
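For a rough sense of scale (numbers chosen purely for illustration, not taken from any benchmark): with $N = 10{,}000$ sequences, $\bar{n} = 25$, $L = 20$, and $T = 100$ iterations, the forward-backward work alone is about $100 \cdot 10{,}000 \cdot 25 \cdot 20^2 = 10^{10}$ elementary operations, before accounting for feature computation at all.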
In practice, feature computation often dominates. With sparse features (common in NLP), we only sum over non-zero features. GPU parallelization across positions and mini-batches can significantly speed up training. For very large label sets (e.g., L > 1000), specialized techniques like hierarchical softmax or sampled softmax may be needed.
Space Optimization:
The full forward and backward arrays require $O(nL)$ space. For very long sequences, we can trade off:
Checkpointing: Store forward variables only at every $\sqrt{n}$ positions. Recompute intermediate values during backward pass. Reduces space to $O(\sqrt{n} \cdot L)$ but increases time by constant factor.
Streaming: For partition function only, forward algorithm needs only $O(L)$ space (keep only previous row). But full marginals require both forward and backward arrays.
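A minimal sketch of the streaming variant, keeping only the previous row of forward variables; it reuses the `compute_score` convention and the `log_sum_exp` helper defined earlier on this page (so it assumes those are in scope):

```python
import numpy as np
from typing import Callable, List

def log_partition_streaming(
    x: List[str],
    num_labels: int,
    compute_score: Callable[[int, int, int], float],  # s(y_prev, y, i); -1 = START
) -> float:
    """Compute log Z(x) in O(L) memory by keeping only the previous forward row."""
    n, L = len(x), num_labels
    prev = np.array([compute_score(-1, y, 0) for y in range(L)])  # log α_1
    for i in range(1, n):
        curr = np.empty(L)
        for y in range(L):
            terms = np.array([prev[y_prev] + compute_score(y_prev, y, i)
                              for y_prev in range(L)])
            curr[y] = log_sum_exp(terms)
        prev = curr  # earlier rows are discarded: only the previous position is needed
    return float(log_sum_exp(prev))
```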
Label Set Considerations:
The $L^2$ factor comes from considering all label pairs. For structured label sets (e.g., IOB tagging, where many transitions are invalid), sparse transition matrices reduce this to $O(n \cdot |\mathcal{T}|)$, where $\mathcal{T} \subseteq \mathcal{Y} \times \mathcal{Y}$ is the set of valid transitions.
While first-order linear-chain CRFs (where features depend on $y_{i-1}, y_i$) are most common, we sometimes need to capture longer-range dependencies between labels.
Second-Order CRFs:
Features depend on $(y_{i-2}, y_{i-1}, y_i)$:
$$P(\mathbf{y} \mid \mathbf{x}) \propto \exp\left( \sum_i \sum_k \lambda_k f_k(\mathbf{x}, y_i, y_{i-1}, y_{i-2}, i) \right)$$
Example motivation: in POS tagging, the patterns "to [verb] [noun]" and "to [verb] [verb]" call for different tag choices that depend on two preceding labels at once, which a first-order model cannot express in its transition features.
Complexity Impact:
| CRF Order | State Space | Time Complexity | Space Complexity |
|---|---|---|---|
| First-order | $O(L)$ | $O(nL^2)$ | $O(nL)$ |
| Second-order | $O(L^2)$ | $O(nL^3)$ | $O(nL^2)$ |
| $k$-th order | $O(L^k)$ | $O(nL^{k+1})$ | $O(nL^k)$ |
Implementation via State Augmentation:
A second-order CRF can be implemented as a first-order CRF with an augmented state space. Each "label" becomes a label pair $(y_{i-1}, y_i)$.
This trick preserves the first-order algorithm structure; only the state space grows, as in the sketch below.
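A sketch of the augmentation, assuming a second-order scoring function `s2(y_prev2, y_prev, y, i)` (the helper names here are illustrative, not from any library, and boundary handling for the first position is deliberately simplified):

```python
from itertools import product
from typing import Callable, List, Tuple

def build_augmented_states(L: int) -> List[Tuple[int, int]]:
    """Each augmented state at position i is the pair (y_{i-1}, y_i): L^2 states in total."""
    return list(product(range(L), range(L)))

def make_first_order_score(
    states: List[Tuple[int, int]],
    s2: Callable[[int, int, int, int], float],  # s2(y_{i-2}, y_{i-1}, y_i, i); -1 = START
) -> Callable[[int, int, int], float]:
    """Wrap a second-order score as a first-order score over augmented states,
    so the forward/Viterbi code above can be reused unchanged."""
    def score(s_prev: int, s_curr: int, i: int) -> float:
        y_im1_curr, y_i = states[s_curr]         # current pair (y_{i-1}, y_i)
        if s_prev < 0:                           # START: no previous pair to agree with
            return s2(-1, y_im1_curr, y_i, i)
        y_im2, y_im1 = states[s_prev]            # previous pair (y_{i-2}, y_{i-1})
        if y_im1 != y_im1_curr:
            return float('-inf')                 # pairs must overlap on the shared label
        return s2(y_im2, y_im1, y_i, i)
    return score
```

Running `viterbi_decode` with `num_labels=len(states)` and this wrapped score works but naively loops over all $L^2 \times L^2$ state pairs; exploiting the fact that each augmented state has only $L$ consistent predecessors recovers the $O(nL^3)$ bound from the table above.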
When to Use Higher-Order CRFs: higher-order models pay off only when label patterns genuinely span more than two positions and there is enough training data to estimate the much larger set of transition parameters; otherwise the exponential growth of the state space rarely justifies the cost.
Practical Alternative: Semi-Markov CRFs
For segment-level labeling (assigning labels to spans rather than tokens), Semi-Markov CRFs model segment boundaries and can capture arbitrary segment features without explicit higher-order structure.
In contemporary NLP, neural network encoders (LSTMs, Transformers) implicitly capture long-range dependencies in their hidden states. A first-order CRF on top of neural features often outperforms higher-order CRFs on hand-crafted features, with better computational efficiency.
We have covered the complete algorithmic framework for linear-chain CRFs: the chain-structured model and its potentials, the forward and backward recursions, marginal computation, and Viterbi decoding.
The core algorithms summarized:
| Algorithm | Computes | Key Operation | Result |
|---|---|---|---|
| Forward | $\alpha_i(y)$ | Sum over predecessors | Partition function |
| Backward | $\beta_i(y)$ | Sum over successors | Enables marginals |
| Forward-Backward | Both + combine | $\alpha \cdot \beta / Z$ | Marginal probabilities |
| Viterbi | $\delta_i(y)$ + backtrack | Max over predecessors | Best label sequence |
What's next:
With the inference algorithms in place, we need to understand how to design features that CRFs can use effectively. The next page covers feature functions—the representations that encode domain knowledge and enable CRFs to learn from data.
You now understand the linear-chain CRF structure and its core inference algorithms: forward, backward, and Viterbi. You can implement these algorithms efficiently in log-space and understand their computational complexity. Next, we'll explore how to design powerful feature functions for CRFs.