With the CRF model formulation and feature functions defined, we now address the fundamental learning problem: How do we find optimal feature weights from labeled training data?
CRF training is a well-understood optimization problem with strong theoretical foundations. The conditional log-likelihood objective is concave (equivalently, the negative log-likelihood is convex), guaranteeing that any local optimum is also the global optimum. This mathematical elegance translates to practical reliability—CRF training is stable and predictable.
This page covers the complete training pipeline: from objective function formulation through gradient computation to practical optimization strategies, including modern neural network integration.
By the end of this page, you will understand: (1) The conditional log-likelihood objective and why it's convex, (2) How to compute gradients using forward-backward expectations, (3) Regularization strategies (L1, L2) and their effects, (4) Optimization algorithms: L-BFGS, SGD, and Adam, and (5) How neural networks integrate with CRF training (BiLSTM-CRF).
Given training data $\mathcal{D} = \{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(N)}, \mathbf{y}^{(N)})\}$, where each $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})$ is an observation-label sequence pair, we seek weights $\boldsymbol{\lambda}$ that maximize the conditional probability of the observed labels given observations.
Maximum Conditional Likelihood:
$$\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} \prod_{i=1}^{N} P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\lambda})$$
Log-Likelihood (more convenient for optimization):
$$\mathcal{L}(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\lambda})$$
Expanding using the CRF definition:
$$\mathcal{L}(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \left[ \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) - \log Z(\mathbf{x}^{(i)}) \right]$$
where $\mathbf{F}(\mathbf{x}, \mathbf{y}) = \sum_{j} \mathbf{f}(\mathbf{x}, y_j, y_{j-1}, j)$ is the global feature vector.
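As a quick illustration, here is a minimal sketch of how the global feature vector accumulates the local ones; the `local_features` helper is hypothetical and stands in for whatever feature extractor the previous page defined:

```python
import numpy as np

def global_features(x, y, local_features, K):
    """Compute F(x, y) = sum over positions j of f(x, y_j, y_{j-1}, j).

    local_features(x, y_j, y_prev, j) -> np.ndarray of shape (K,)
    is a hypothetical stand-in for the feature extractor; position 0
    uses -1 to mark "no previous label".
    """
    F = np.zeros(K)
    for j in range(len(y)):
        y_prev = y[j - 1] if j > 0 else -1
        F += local_features(x, y[j], y_prev, j)
    return F
```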
Intuition Behind the Objective:
The log-likelihood decomposes into two terms:
Score term $\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y})$: Reward for the correct label sequence. We want this to be HIGH.
Normalizer term $\log Z(\mathbf{x})$: Penalizes high scores for ALL sequences. We want the correct sequence to score higher than alternatives.
Maximizing the log-likelihood therefore means: increase the gap between the correct sequence's score and the soft maximum (log-sum-exp) of all sequences' scores.
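To make this concrete, here is a minimal toy check (all scores invented) that computes $\log P(\mathbf{y} \mid \mathbf{x})$ by brute-force enumeration of every label sequence—feasible only because the label set and length are tiny:

```python
import numpy as np
from itertools import product

# Toy setup: 2 labels, sequence length 3, invented scores.
# score(y) stands in for the linear score λᵀF(x, y) of each sequence.
rng = np.random.default_rng(0)
labels = [0, 1]
all_seqs = list(product(labels, repeat=3))        # all 2^3 = 8 label sequences
scores = {seq: rng.normal() for seq in all_seqs}  # λᵀF(x, y) for each y

correct = (0, 1, 1)                               # pretend gold sequence
log_Z = np.logaddexp.reduce([scores[s] for s in all_seqs])
log_likelihood = scores[correct] - log_Z

print(f"score(correct) = {scores[correct]:.3f}")
print(f"log Z          = {log_Z:.3f}")
print(f"log P(y|x)     = {log_likelihood:.3f}")   # always <= 0
```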
Why Conditional (not Joint) Likelihood?
We maximize $P(\mathbf{y} \mid \mathbf{x})$, not $P(\mathbf{x}, \mathbf{y})$. This is the discriminative principle: we spend no modeling capacity on the distribution of the inputs $P(\mathbf{x})$, and because $\mathbf{x}$ is always conditioned on, we can use rich, overlapping, global features of the observations without making independence assumptions about them—something generative models like HMMs cannot do.
Convexity of the Objective:
The log-likelihood $\mathcal{L}(\boldsymbol{\lambda})$ is a CONCAVE function of $\boldsymbol{\lambda}$. The score term is linear in $\boldsymbol{\lambda}$ (hence concave). The log-partition function $\log Z(\mathbf{x})$ is convex, and subtracting a convex function preserves concavity. Maximizing a concave function (equivalently, minimizing its negative) is a convex optimization problem in which any local optimum is a global optimum.
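For readers who want the one-line justification that $\log Z$ is convex: its Hessian with respect to $\boldsymbol{\lambda}$ is the covariance matrix of the global feature vector under the model distribution, and covariance matrices are positive semidefinite:

$$\nabla^2_{\boldsymbol{\lambda}} \log Z(\mathbf{x}) = \mathbb{E}_{P(\cdot \mid \mathbf{x})}\left[\mathbf{F}\mathbf{F}^T\right] - \mathbb{E}_{P(\cdot \mid \mathbf{x})}[\mathbf{F}]\, \mathbb{E}_{P(\cdot \mid \mathbf{x})}[\mathbf{F}]^T = \mathrm{Cov}_{P(\cdot \mid \mathbf{x})}\left[\mathbf{F}(\mathbf{x}, \mathbf{y}')\right] \succeq 0$$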
To optimize the log-likelihood using gradient-based methods, we need to compute $\nabla_{\boldsymbol{\lambda}} \mathcal{L}$.
Gradient Derivation:
For a single training example $(\mathbf{x}, \mathbf{y})$:
$$\nabla_{\boldsymbol{\lambda}} \log P(\mathbf{y} \mid \mathbf{x}) = \nabla_{\boldsymbol{\lambda}} \left[ \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}) - \log Z(\mathbf{x}) \right]$$
The first term is straightforward: $$\nabla_{\boldsymbol{\lambda}} \left[ \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}) \right] = \mathbf{F}(\mathbf{x}, \mathbf{y})$$
For the second term, we use the chain rule. Let's compute $\nabla_{\boldsymbol{\lambda}} \log Z(\mathbf{x})$:
$$\nabla_{\boldsymbol{\lambda}} \log Z(\mathbf{x}) = \frac{1}{Z(\mathbf{x})} \nabla_{\boldsymbol{\lambda}} Z(\mathbf{x})$$
$$= \frac{1}{Z(\mathbf{x})} \nabla_{\boldsymbol{\lambda}} \sum_{\mathbf{y}'} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}'))$$
$$= \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{y}'} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}')) \cdot \mathbf{F}(\mathbf{x}, \mathbf{y}')$$
$$= \sum_{\mathbf{y}'} P(\mathbf{y}' \mid \mathbf{x}) \cdot \mathbf{F}(\mathbf{x}, \mathbf{y}')$$
$$= \mathbb{E}_{\mathbf{y}' \sim P(\cdot \mid \mathbf{x})}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$$
The Gradient Formula:
Combining both terms:
$$\nabla_{\boldsymbol{\lambda}} \log P(\mathbf{y} \mid \mathbf{x}) = \mathbf{F}(\mathbf{x}, \mathbf{y}) - \mathbb{E}_{\mathbf{y}' \sim P(\cdot \mid \mathbf{x})}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$$
This is the "observed minus expected" gradient:
Interpretation:
For the full training set:
$$\nabla_{\boldsymbol{\lambda}} \mathcal{L} = \sum_{i=1}^{N} \left[ \mathbf{F}(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) - \mathbb{E}_{P(\cdot \mid \mathbf{x}^{(i)})}[\mathbf{F}(\mathbf{x}^{(i)}, \mathbf{y}')] \right]$$
The expectation $\mathbb{E}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$ sums over ALL possible label sequences—exponentially many! The forward-backward algorithm makes this tractable: the expected value of each feature decomposes into sums over pairwise marginals:
$$\mathbb{E}[f_k] = \sum_{i} \sum_{y, y'} P(y_{i-1} = y',\, y_i = y \mid \mathbf{x}) \cdot f_k(\mathbf{x}, y, y', i)$$
```python
import numpy as np
from typing import List, Tuple, Dict

def compute_gradient(
    sequences: List[Tuple[List[str], List[int]]],  # (tokens, labels) pairs
    weights: np.ndarray,         # Current weights (K,)
    num_labels: int,
    feature_extractor,           # Extract features
    forward_backward_fn,         # Compute marginals
) -> Tuple[np.ndarray, float]:
    """
    Compute gradient of log-likelihood with respect to weights.

    Returns:
        gradient: Gradient vector (K,)
        log_likelihood: Current log-likelihood value
    """
    K = len(weights)
    gradient = np.zeros(K)
    total_log_likelihood = 0.0

    for tokens, correct_labels in sequences:
        n = len(tokens)

        # =====================
        # Step 1: Observed features (for correct label sequence)
        # =====================
        observed_features = np.zeros(K)
        for i in range(n):
            y = correct_labels[i]
            y_prev = correct_labels[i-1] if i > 0 else -1
            # Get active features and accumulate
            active = feature_extractor.get_features(tokens, i, y, y_prev)
            for feat_id, value in active:
                observed_features[feat_id] += value

        # =====================
        # Step 2: Expected features (under current model)
        # =====================
        # Run forward-backward to get marginals
        log_alpha, log_beta, log_Z, marginals, pairwise = forward_backward_fn(
            tokens, weights, num_labels, feature_extractor
        )

        expected_features = np.zeros(K)
        for i in range(n):
            if i == 0:
                # First position: only unigram marginals
                for y in range(num_labels):
                    prob = marginals[i, y]
                    active = feature_extractor.get_features(tokens, i, y, -1)
                    for feat_id, value in active:
                        expected_features[feat_id] += prob * value
            else:
                # Other positions: pairwise marginals
                for y_prev in range(num_labels):
                    for y in range(num_labels):
                        prob = pairwise[i-1, y_prev, y]
                        active = feature_extractor.get_features(tokens, i, y, y_prev)
                        for feat_id, value in active:
                            expected_features[feat_id] += prob * value

        # =====================
        # Step 3: Accumulate gradient and likelihood
        # =====================
        # Gradient = observed - expected
        gradient += observed_features - expected_features

        # Log-likelihood = score(correct) - log Z
        score = np.dot(weights, observed_features)
        total_log_likelihood += score - log_Z

    return gradient, total_log_likelihood


def compute_gradient_with_regularization(
    gradient: np.ndarray,
    weights: np.ndarray,
    regularization: str = 'l2',
    reg_strength: float = 1.0
) -> np.ndarray:
    """
    Add regularization gradient component.

    For L2: penalty = (c/2) ||λ||²  → gradient contribution = -c * λ
    For L1: penalty = c ||λ||₁      → gradient contribution = -c * sign(λ)

    Note: We're MAXIMIZING log-likelihood, so we ADD the penalty
    gradient (which is negative, pulling weights toward zero).
    """
    if regularization == 'l2':
        # L2 regularization: gradient -= c * λ
        reg_gradient = -reg_strength * weights
    elif regularization == 'l1':
        # L1 regularization: gradient -= c * sign(λ)
        reg_gradient = -reg_strength * np.sign(weights)
    else:
        reg_gradient = np.zeros_like(weights)

    return gradient + reg_gradient


# Numerical gradient check (for verification)
def numerical_gradient(
    sequences: List[Tuple[List[str], List[int]]],
    weights: np.ndarray,
    compute_ll_fn,
    epsilon: float = 1e-5
) -> np.ndarray:
    """
    Compute gradient numerically for verification.

    For each weight dimension:
        ∂L/∂λ_k ≈ (L(λ+ε·e_k) - L(λ-ε·e_k)) / (2ε)
    """
    K = len(weights)
    numerical_grad = np.zeros(K)

    for k in range(K):
        weights_plus = weights.copy()
        weights_plus[k] += epsilon
        ll_plus = compute_ll_fn(sequences, weights_plus)

        weights_minus = weights.copy()
        weights_minus[k] -= epsilon
        ll_minus = compute_ll_fn(sequences, weights_minus)

        numerical_grad[k] = (ll_plus - ll_minus) / (2 * epsilon)

    return numerical_grad


print("Gradient computation for CRF training")
print("=" * 50)
print("Key formula: ∇L = F_observed - E[F_expected]")
print("")
print("This is the same gradient structure as:")
print("  - Logistic regression")
print("  - Maximum entropy models")
print("  - All exponential family distributions")
```

With potentially millions of features, CRFs are prone to overfitting—especially when features are sparse and training data is limited. Regularization is essential.
Regularized Objective:
$$\mathcal{L}_{\text{reg}}(\boldsymbol{\lambda}) = \mathcal{L}(\boldsymbol{\lambda}) - \text{penalty}(\boldsymbol{\lambda})$$
L2 Regularization (Ridge / Weight Decay):
$$\text{penalty}_{L2}(\boldsymbol{\lambda}) = \frac{c}{2} \|\boldsymbol{\lambda}\|_2^2 = \frac{c}{2} \sum_k \lambda_k^2$$
L2 shrinks all weights smoothly toward zero but rarely makes them exactly zero; it corresponds to a Gaussian prior on the weights.
L1 Regularization (Lasso):
$$\text{penalty}_{L1}(\boldsymbol{\lambda}) = c \|\boldsymbol{\lambda}\|_1 = c \sum_k |\lambda_k|$$
L1 drives many weights exactly to zero, producing sparse models that perform implicit feature selection—valuable when feature sets number in the millions; it corresponds to a Laplace prior.
Elastic Net (L1 + L2):
Combines benefits of both:
$$\text{penalty}_{EN}(\boldsymbol{\lambda}) = c_1 \|\boldsymbol{\lambda}\|_1 + \frac{c_2}{2} \|\boldsymbol{\lambda}\|_2^2$$
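A minimal sketch of the three penalties and their (sub)gradients, following the same sign conventions as the gradient code above (the function name is illustrative):

```python
import numpy as np

def penalty_and_grad(weights, kind='l2', c1=1.0, c2=1.0):
    """Return (penalty value, gradient of the penalty w.r.t. weights).

    Subtract the penalty from the log-likelihood and its gradient from
    the log-likelihood gradient. For L1, np.sign is a subgradient
    (zero at weights that are exactly zero).
    """
    if kind == 'l2':
        return 0.5 * c2 * np.sum(weights ** 2), c2 * weights
    if kind == 'l1':
        return c1 * np.sum(np.abs(weights)), c1 * np.sign(weights)
    if kind == 'elastic_net':
        l1_pen, l1_grad = penalty_and_grad(weights, 'l1', c1=c1)
        l2_pen, l2_grad = penalty_and_grad(weights, 'l2', c2=c2)
        return l1_pen + l2_pen, l1_grad + l2_grad
    return 0.0, np.zeros_like(weights)
```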
Selecting Regularization Strength:
The regularization strength $c$ is a hyperparameter, selected by training with several candidate values and comparing performance on a held-out validation set (or via cross-validation).
Typical heuristic: Start with $c = 1.0$ and adjust by factors of 10 based on validation performance, as sketched below.
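A hedged sketch of that selection loop; `train_fn` and `eval_fn` are hypothetical hooks for your trainer and validation metric:

```python
import numpy as np

def select_reg_strength(train_data, val_data, train_fn, eval_fn,
                        candidates=(0.01, 0.1, 1.0, 10.0)):
    """Pick the regularization strength with the best validation score.

    train_fn(data, c) -> model and eval_fn(model, data) -> float
    (higher = better) are hypothetical hooks; plug in your own
    trainer and metric (e.g., entity-level F1).
    """
    best_c, best_score = None, -np.inf
    for c in candidates:
        model = train_fn(train_data, c)
        score = eval_fn(model, val_data)
        print(f"c = {c:g}: validation score = {score:.4f}")
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```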
Some practitioners apply weaker regularization to transition features (label bigrams) than to emission features. Rationale: transition patterns are fundamental to sequence structure (e.g., B-PER must be followed by I-PER or O), while emission features are where overfitting typically occurs.
With the objective and gradients defined, we need algorithms to find the optimal weights. Several approaches are used in practice.
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno):
The workhorse of classical CRF training: a quasi-Newton method that approximates curvature from a short history of recent gradients, using memory linear in the number of features.
Advantages: fast, stable convergence in relatively few iterations; no learning rate to tune (step sizes come from line search); a reliable default for batch training.
Disadvantages: each update requires a gradient over the full training set, which is expensive for large corpora; unsuitable for online learning; the standard algorithm assumes a differentiable objective, so L1 regularization requires the OWL-QN variant.
```python
import numpy as np
from scipy.optimize import minimize
from typing import List, Tuple, Callable

class CRFTrainer:
    """
    CRF trainer supporting multiple optimization algorithms.
    """

    def __init__(
        self,
        num_features: int,
        num_labels: int,
        feature_extractor,
        forward_backward_fn=None,  # marginal computation, wired through so methods can use it
        regularization: str = 'l2',
        reg_strength: float = 1.0
    ):
        self.num_features = num_features
        self.num_labels = num_labels
        self.feature_extractor = feature_extractor
        self.forward_backward = forward_backward_fn
        self.regularization = regularization
        self.reg_strength = reg_strength
        self.weights = None

    def _objective_and_gradient(
        self,
        weights: np.ndarray,
        sequences: List[Tuple[List[str], List[int]]]
    ) -> Tuple[float, np.ndarray]:
        """
        Compute negative log-likelihood and its gradient.

        Note: We return NEGATIVE because scipy.optimize.minimize minimizes.
        """
        # Compute log-likelihood and gradient
        gradient, log_likelihood = compute_gradient(
            sequences, weights, self.num_labels,
            self.feature_extractor, self.forward_backward
        )

        # Add regularization
        if self.regularization == 'l2':
            reg_term = 0.5 * self.reg_strength * np.sum(weights ** 2)
            reg_grad = self.reg_strength * weights
        elif self.regularization == 'l1':
            reg_term = self.reg_strength * np.sum(np.abs(weights))
            reg_grad = self.reg_strength * np.sign(weights)
        else:
            reg_term = 0.0
            reg_grad = np.zeros_like(weights)

        # Return negative (for minimization)
        neg_ll = -(log_likelihood - reg_term)
        neg_grad = -(gradient - reg_grad)

        return neg_ll, neg_grad

    def train_lbfgs(
        self,
        sequences: List[Tuple[List[str], List[int]]],
        max_iter: int = 100,
        verbose: bool = True
    ) -> np.ndarray:
        """
        Train CRF using L-BFGS optimizer.

        This is the standard approach for classical CRFs.
        """
        # Initialize weights
        weights = np.zeros(self.num_features)

        # Track iterations
        iteration = [0]

        def callback(xk):
            if verbose and iteration[0] % 10 == 0:
                neg_ll, _ = self._objective_and_gradient(xk, sequences)
                print(f"Iteration {iteration[0]}: neg_ll = {neg_ll:.4f}")
            iteration[0] += 1

        # Run L-BFGS-B (bounded L-BFGS)
        result = minimize(
            fun=lambda w: self._objective_and_gradient(w, sequences),
            x0=weights,
            method='L-BFGS-B',
            jac=True,  # Objective returns gradient
            options={'maxiter': max_iter, 'disp': verbose},
            callback=callback
        )

        self.weights = result.x
        return self.weights

    def train_sgd(
        self,
        sequences: List[Tuple[List[str], List[int]]],
        epochs: int = 10,
        learning_rate: float = 0.1,
        batch_size: int = 32,
        lr_decay: float = 0.01,
        verbose: bool = True
    ) -> np.ndarray:
        """
        Train CRF using Stochastic Gradient Descent.

        Better for large datasets; required for neural CRFs.
        """
        weights = np.zeros(self.num_features)
        n_sequences = len(sequences)

        for epoch in range(epochs):
            # Shuffle data
            indices = np.random.permutation(n_sequences)
            epoch_loss = 0.0

            # Mini-batch iteration
            for batch_start in range(0, n_sequences, batch_size):
                batch_indices = indices[batch_start:batch_start + batch_size]
                batch = [sequences[i] for i in batch_indices]

                # Compute gradient on batch
                gradient, batch_ll = compute_gradient(
                    batch, weights, self.num_labels,
                    self.feature_extractor, self.forward_backward
                )

                # Add regularization gradient
                if self.regularization == 'l2':
                    gradient -= self.reg_strength * weights
                    reg_term = 0.5 * self.reg_strength * np.sum(weights ** 2)
                else:
                    reg_term = 0

                # Decayed learning rate
                t = epoch * (n_sequences // batch_size) + batch_start // batch_size
                lr = learning_rate / (1 + lr_decay * t)

                # Update weights (gradient ascent for maximization)
                weights += lr * gradient

                epoch_loss += -batch_ll + reg_term

            if verbose:
                avg_loss = epoch_loss / (n_sequences // batch_size)
                print(f"Epoch {epoch+1}/{epochs}: avg_loss = {avg_loss:.4f}")

        self.weights = weights
        return self.weights

    def train_adam(
        self,
        sequences: List[Tuple[List[str], List[int]]],
        epochs: int = 10,
        learning_rate: float = 0.001,
        batch_size: int = 32,
        beta1: float = 0.9,
        beta2: float = 0.999,
        epsilon: float = 1e-8,
        verbose: bool = True
    ) -> np.ndarray:
        """
        Train CRF using Adam optimizer.

        Modern variant of SGD with adaptive learning rates.
        Standard for neural CRF training.
        """
        weights = np.zeros(self.num_features)
        m = np.zeros_like(weights)  # First moment
        v = np.zeros_like(weights)  # Second moment
        t = 0

        n_sequences = len(sequences)

        for epoch in range(epochs):
            indices = np.random.permutation(n_sequences)

            for batch_start in range(0, n_sequences, batch_size):
                t += 1
                batch_indices = indices[batch_start:batch_start + batch_size]
                batch = [sequences[i] for i in batch_indices]

                # Compute gradient
                gradient, _ = compute_gradient(
                    batch, weights, self.num_labels,
                    self.feature_extractor, self.forward_backward
                )

                # Add regularization
                if self.regularization == 'l2':
                    gradient -= self.reg_strength * weights

                # Negate for gradient descent (we're maximizing)
                gradient = -gradient

                # Update moments
                m = beta1 * m + (1 - beta1) * gradient
                v = beta2 * v + (1 - beta2) * (gradient ** 2)

                # Bias correction
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)

                # Update weights
                weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

            if verbose:
                print(f"Epoch {epoch+1}/{epochs} complete")

        self.weights = weights
        return self.weights


# Algorithm comparison
print("CRF Optimization Algorithms Comparison")
print("=" * 60)
print(f"{'Algorithm':<15} {'Best For':<25} {'Hyperparameters':<20}")
print("-" * 60)
print(f"{'L-BFGS':<15} {'Small/medium data':<25} {'None (line search)':<20}")
print(f"{'SGD':<15} {'Large data, online':<25} {'learning_rate, decay':<20}")
print(f"{'Adam':<15} {'Neural CRFs':<25} {'lr, beta1, beta2':<20}")
```

Choosing an Algorithm:

| Scenario | Recommended Algorithm | Rationale |
|---|---|---|
| Small dataset (< 10K sequences) | L-BFGS | Fast convergence, no tuning needed |
| Medium dataset (10K-100K) | L-BFGS or Mini-batch SGD | L-BFGS if fits in memory |
| Large dataset (> 100K) | SGD or Adam | Full gradient too expensive |
| Neural CRF (BiLSTM-CRF) | Adam | Standard for neural networks |
| Online learning | SGD | Updates with each example |
| L1 regularization | OWL-QN | Handles non-differentiability |
Modern sequence labeling combines neural networks (for feature learning) with CRFs (for structured prediction). The BiLSTM-CRF architecture has become the de facto standard.
Architecture Overview: tokens → word embeddings → BiLSTM encoder → per-position hidden states → linear projection to emission scores → CRF layer (emission + transition scores, globally normalized) → label sequence.
Key Insight:
The BiLSTM replaces hand-crafted features with learned features. At each position, the BiLSTM hidden state encodes rich contextual information. The CRF layer then ensures the output sequence is globally coherent.
Score Computation:
For a sequence with BiLSTM outputs $\mathbf{h}_1, \ldots, \mathbf{h}_n$:
$$\text{score}(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^{n} \left[ W_{y_i} \cdot \mathbf{h}_i + T_{y_{i-1}, y_i} \right]$$
where $\mathbf{h}_i$ is the BiLSTM hidden state at position $i$, $W_{y_i}$ is the row of the emission projection for label $y_i$ (so $W_{y_i} \cdot \mathbf{h}_i$ is the emission score), and $T_{y_{i-1}, y_i}$ is the learned transition score from label $y_{i-1}$ to label $y_i$.
```python
import torch
import torch.nn as nn
from typing import List, Tuple

class BiLSTMCRF(nn.Module):
    """
    BiLSTM-CRF model for sequence labeling.

    Combines neural feature learning with structured prediction.
    """

    def __init__(
        self,
        vocab_size: int,
        embedding_dim: int,
        hidden_dim: int,
        num_labels: int,
        padding_idx: int = 0
    ):
        super().__init__()

        self.num_labels = num_labels
        self.hidden_dim = hidden_dim

        # Embedding layer
        self.embedding = nn.Embedding(
            vocab_size, embedding_dim, padding_idx=padding_idx
        )

        # BiLSTM encoder
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim // 2,
            num_layers=1,
            bidirectional=True,
            batch_first=True
        )

        # Emission projection: hidden -> label scores
        self.emission = nn.Linear(hidden_dim, num_labels)

        # CRF transition matrix: T[i,j] = score of transitioning from i to j
        # +2 for START and END tags
        self.transitions = nn.Parameter(
            torch.randn(num_labels + 2, num_labels + 2)
        )

        self.START_TAG = num_labels
        self.END_TAG = num_labels + 1

        # Constrain impossible transitions (note the from-i-to-j convention):
        self.transitions.data[:, self.START_TAG] = -10000  # Can't transition TO start
        self.transitions.data[self.END_TAG, :] = -10000    # Can't transition FROM end

    def _forward_alg(self, emissions: torch.Tensor) -> torch.Tensor:
        """
        Forward algorithm in log-space.

        Args:
            emissions: Emission scores (seq_len, num_labels)
        Returns:
            log_Z: Log partition function (scalar)
        """
        seq_len = emissions.size(0)

        # Initialize: α[START] = 0, others = -inf
        alpha = torch.full((self.num_labels + 2,), -10000.0, device=emissions.device)
        alpha[self.START_TAG] = 0.0

        for i in range(seq_len):
            emit = emissions[i]  # (num_labels,)

            # Expand for broadcasting:
            #   alpha:       (L+2,) -> (L+2, 1)
            #   transitions: (L+2, L+2) -> keep label columns (L+2, L)
            #   emit:        (L,)  -> (1, L)
            alpha_expanded = alpha.unsqueeze(1)            # (L+2, 1)
            trans = self.transitions[:, :self.num_labels]  # (L+2, L)
            emit_expanded = emit.unsqueeze(0)              # (1, L)

            # Next alpha via log-sum-exp over previous states
            scores = alpha_expanded + trans + emit_expanded  # (L+2, L)
            alpha_new = torch.logsumexp(scores, dim=0)       # (L,)

            # Expand back to include START and END slots
            alpha = torch.full((self.num_labels + 2,), -10000.0, device=emissions.device)
            alpha[:self.num_labels] = alpha_new

        # Terminal: add transition to END
        terminal = alpha + self.transitions[:, self.END_TAG]
        log_Z = torch.logsumexp(terminal, dim=0)

        return log_Z

    def _score_sentence(
        self,
        emissions: torch.Tensor,
        tags: torch.Tensor
    ) -> torch.Tensor:
        """
        Compute score of a specific tag sequence.

        Args:
            emissions: Emission scores (seq_len, num_labels)
            tags: Gold tag sequence (seq_len,)
        Returns:
            score: Score of this tag sequence (scalar)
        """
        seq_len = emissions.size(0)

        score = self.transitions[self.START_TAG, tags[0]]
        score += emissions[0, tags[0]]

        for i in range(1, seq_len):
            score += self.transitions[tags[i-1], tags[i]]
            score += emissions[i, tags[i]]

        score += self.transitions[tags[-1], self.END_TAG]
        return score

    def neg_log_likelihood(
        self,
        sentence: torch.Tensor,
        tags: torch.Tensor
    ) -> torch.Tensor:
        """
        Compute negative log-likelihood for training.

        Args:
            sentence: Word indices (seq_len,)
            tags: Gold tag indices (seq_len,)
        Returns:
            loss: Negative log-likelihood (scalar)
        """
        # Get emissions from BiLSTM
        embeds = self.embedding(sentence)
        lstm_out, _ = self.lstm(embeds.unsqueeze(0))
        lstm_out = lstm_out.squeeze(0)       # (seq_len, hidden_dim)
        emissions = self.emission(lstm_out)  # (seq_len, num_labels)

        # Compute forward score (log Z)
        forward_score = self._forward_alg(emissions)

        # Compute gold score
        gold_score = self._score_sentence(emissions, tags)

        # NLL = log Z - gold_score
        return forward_score - gold_score

    def viterbi_decode(self, sentence: torch.Tensor) -> List[int]:
        """
        Find best tag sequence using Viterbi.

        Args:
            sentence: Word indices (seq_len,)
        Returns:
            best_path: List of tag indices
        """
        # Get emissions
        embeds = self.embedding(sentence)
        lstm_out, _ = self.lstm(embeds.unsqueeze(0))
        lstm_out = lstm_out.squeeze(0)
        emissions = self.emission(lstm_out)

        seq_len = emissions.size(0)

        # Viterbi algorithm
        backpointers = []

        # Initialize
        viterbi = torch.full((self.num_labels + 2,), -10000.0)
        viterbi[self.START_TAG] = 0.0

        for i in range(seq_len):
            emit = emissions[i]

            viterbi_expanded = viterbi.unsqueeze(1)
            trans = self.transitions[:, :self.num_labels]
            emit_expanded = emit.unsqueeze(0)

            # Max instead of log-sum-exp: track the best previous state
            scores = viterbi_expanded + trans + emit_expanded
            best_prev = torch.argmax(scores, dim=0)
            viterbi_new = scores[best_prev, torch.arange(self.num_labels)]

            backpointers.append(best_prev[:self.num_labels])

            viterbi = torch.full((self.num_labels + 2,), -10000.0)
            viterbi[:self.num_labels] = viterbi_new

        # Terminal
        terminal_scores = viterbi + self.transitions[:, self.END_TAG]
        best_last = torch.argmax(terminal_scores).item()

        # Backtrack
        best_path = [best_last]
        for bp in reversed(backpointers):
            best_path.append(bp[best_path[-1]].item())
        best_path.reverse()
        best_path = best_path[1:]  # Drop the START placeholder at the front
        return best_path


# Example usage
print("BiLSTM-CRF Architecture")
print("=" * 50)
print("Key components:")
print("  1. Word embeddings → BiLSTM → hidden states")
print("  2. Hidden states → Linear → emission scores")
print("  3. Emission scores + transitions → CRF score")
print("  4. Training: minimize -log P(y|x)")
print("  5. Inference: Viterbi decoding")
```

Without the CRF layer, a neural network outputs independent label distributions at each position. This can produce invalid sequences (e.g., I-PER without a preceding B-PER). The CRF layer scores entire sequences, ensuring global coherence. Empirically, BiLSTM-CRF typically improves F1 by 1-3 points over a plain BiLSTM on NER tasks.
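Such label-scheme constraints can also be enforced directly, as hard constraints baked into the transition matrix before decoding. A minimal sketch under the BIO scheme (the tag list is illustrative): build an additive mask that forbids `I-X` unless it follows `B-X` or `I-X` of the same type.

```python
import torch

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # illustrative tag set

def bio_transition_mask(labels):
    """Return an (L, L) mask: 0 for allowed transitions, -10000 for forbidden.

    Rule: I-X may only follow B-X or I-X of the same entity type X.
    Add this mask to the learned transition matrix before decoding.
    """
    L = len(labels)
    mask = torch.zeros(L, L)
    for j, to_tag in enumerate(labels):
        if not to_tag.startswith("I-"):
            continue
        ent_type = to_tag[2:]
        for i, from_tag in enumerate(labels):
            if from_tag not in (f"B-{ent_type}", f"I-{ent_type}"):
                mask[i, j] = -10000.0
    return mask

print(bio_transition_mask(labels))
```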
Training CRFs effectively requires attention to practical details. Here are key considerations based on extensive empirical experience:
| Hyperparameter | Classical CRF | BiLSTM-CRF |
|---|---|---|
| Learning rate | 0.1 (SGD), N/A (L-BFGS) | 0.001-0.01 (Adam) |
| L2 regularization | 0.1-10.0 | 1e-5 to 1e-3 (weight decay) |
| Batch size | Full batch or 32-128 | 16-64 |
| Epochs | Until convergence (10-100) | 20-50 |
| Dropout | N/A | 0.3-0.5 on LSTM/embeddings |
| Hidden dim | N/A | 100-300 per direction |
| Embedding dim | N/A | 100-300 |
Debugging Tips:
| Symptom | Possible Cause | Solution |
|---|---|---|
| Loss not decreasing | Learning rate too low | Increase learning rate |
| Loss oscillating | Learning rate too high | Decrease learning rate |
| Training loss low, val loss high | Overfitting | Increase regularization, use dropout |
| Very slow convergence | Poor initialization | Check feature scaling, embedding initialization |
| NaN loss | Numerical instability | Use log-space algorithms, gradient clipping |
| All predictions same | Bias in data or features | Check label distribution, feature coverage |
In NER, the O (outside) label dominates (>80% of tokens). Without handling, models may over-predict O. Solutions: (1) Class-weighted loss, (2) Downsampling majority class, (3) Focus evaluation on entity-level F1 rather than token accuracy.
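As one concrete version of class-weighted loss, here is a heuristic sketch that weights each label by inverse frequency. Note that a CRF's sequence-level likelihood does not decompose into per-token losses, so weights like these apply most directly to a per-token softmax baseline or to reweighting marginal-based losses:

```python
import numpy as np
from collections import Counter

def inverse_frequency_weights(label_sequences, num_labels, smoothing=1.0):
    """Compute per-label weights inversely proportional to label frequency.

    Rare entity labels get weight > 1, the dominant O label gets weight < 1.
    Multiply each token's loss term by the weight of its gold label.
    """
    counts = Counter(y for seq in label_sequences for y in seq)
    total = sum(counts.values())
    freqs = np.array([
        (counts.get(k, 0) + smoothing) / (total + smoothing * num_labels)
        for k in range(num_labels)
    ])
    weights = 1.0 / freqs
    return weights / weights.mean()  # normalize so the average weight is 1

# Toy example: label 0 (= "O") dominates
seqs = [[0, 0, 0, 0, 1, 2], [0, 0, 0, 0, 0, 0]]
print(inverse_frequency_weights(seqs, num_labels=3))
```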
We have covered the complete CRF training pipeline, from objective formulation to modern neural approaches. The essential concepts: the conditional log-likelihood is concave, so training has no bad local optima; its gradient is observed-minus-expected feature counts, computed tractably with forward-backward; L1/L2 regularization controls overfitting in models with millions of sparse features; L-BFGS is the classical workhorse while SGD and Adam scale to large data and neural models; and BiLSTM-CRF combines learned features with globally coherent structured prediction.
What's next:
With training covered, we'll apply everything we've learned to the canonical application: Sequence Labeling. The next page covers NER, POS tagging, and other sequence labeling tasks, showing how CRFs are used in practice to solve real-world problems.
You now understand CRF training: the conditional log-likelihood objective, gradient computation via forward-backward, regularization strategies, optimization algorithms, and neural network integration. Next, we'll see CRFs in action for sequence labeling tasks.