With the CRF model formulation and feature functions defined, we now address the fundamental learning problem: How do we find optimal feature weights from labeled training data?
CRF training is a well-understood optimization problem with strong theoretical foundations. The conditional log-likelihood objective is concave (equivalently, the negative log-likelihood is convex), guaranteeing that any local optimum is also the global optimum. This mathematical elegance translates to practical reliability—CRF training is stable and predictable.
This page covers the complete training pipeline: from objective function formulation through gradient computation to practical optimization strategies, including modern neural network integration.
By the end of this page, you will understand: (1) The conditional log-likelihood objective and why it's convex, (2) How to compute gradients using forward-backward expectations, (3) Regularization strategies (L1, L2) and their effects, (4) Optimization algorithms: L-BFGS, SGD, and Adam, and (5) How neural networks integrate with CRF training (BiLSTM-CRF).
Given training data $\mathcal{D} = \{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(N)}, \mathbf{y}^{(N)})\}$, where each $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})$ is an observation-label sequence pair, we seek weights $\boldsymbol{\lambda}$ that maximize the conditional probability of the observed labels given observations.
Maximum Conditional Likelihood:
$$\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} \prod_{i=1}^{N} P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\lambda})$$
Log-Likelihood (more convenient for optimization):
$$\mathcal{L}(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\lambda})$$
Expanding using the CRF definition:
$$\mathcal{L}(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \left[ \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) - \log Z(\mathbf{x}^{(i)}) \right]$$
where $\mathbf{F}(\mathbf{x}, \mathbf{y}) = \sum_{j} \mathbf{f}(\mathbf{x}, y_j, y_{j-1}, j)$ is the global feature vector.
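As a quick illustration, here is a minimal sketch of how the global feature vector accumulates the local ones; the `local_features` helper is hypothetical and stands in for whatever feature extractor the previous page defined:

```python
import numpy as np

def global_features(x, y, local_features, K):
    """Compute F(x, y) = sum over positions j of f(x, y_j, y_{j-1}, j).

    local_features(x, y_j, y_prev, j) -> np.ndarray of shape (K,)
    is a hypothetical stand-in for the feature extractor; position 0
    uses -1 to mark "no previous label".
    """
    F = np.zeros(K)
    for j in range(len(y)):
        y_prev = y[j - 1] if j > 0 else -1
        F += local_features(x, y[j], y_prev, j)
    return F
```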
Intuition Behind the Objective:
The log-likelihood decomposes into two terms:
Score term $\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y})$: Reward for the correct label sequence. We want this to be HIGH.
Normalizer term $\log Z(\mathbf{x})$: Penalizes high scores for ALL sequences. We want the correct sequence to score higher than alternatives.
Maximizing the log-likelihood therefore means: increase the gap between the correct sequence's score and the soft maximum (log-sum-exp) of all sequences' scores.
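To make this concrete, here is a minimal toy check (all scores invented) that computes $\log P(\mathbf{y} \mid \mathbf{x})$ by brute-force enumeration of every label sequence—feasible only because the label set and length are tiny:

```python
import numpy as np
from itertools import product

# Toy setup: 2 labels, sequence length 3, invented scores.
# score(y) stands in for the linear score λᵀF(x, y) of each sequence.
rng = np.random.default_rng(0)
labels = [0, 1]
all_seqs = list(product(labels, repeat=3))        # all 2^3 = 8 label sequences
scores = {seq: rng.normal() for seq in all_seqs}  # λᵀF(x, y) for each y

correct = (0, 1, 1)                               # pretend gold sequence
log_Z = np.logaddexp.reduce([scores[s] for s in all_seqs])
log_likelihood = scores[correct] - log_Z

print(f"score(correct) = {scores[correct]:.3f}")
print(f"log Z          = {log_Z:.3f}")
print(f"log P(y|x)     = {log_likelihood:.3f}")   # always <= 0
```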
Why Conditional (not Joint) Likelihood?
We maximize $P(\mathbf{y} \mid \mathbf{x})$, not $P(\mathbf{x}, \mathbf{y})$. This is the discriminative principle: we spend no modeling capacity on the distribution of the inputs $P(\mathbf{x})$, and because $\mathbf{x}$ is always conditioned on, we can use rich, overlapping, global features of the observations without making independence assumptions about them—something generative models like HMMs cannot do.
Convexity of the Objective:
The log-likelihood $\mathcal{L}(\boldsymbol{\lambda})$ is a CONCAVE function of $\boldsymbol{\lambda}$. The score term is linear in $\boldsymbol{\lambda}$ (hence concave). The log-partition function $\log Z(\mathbf{x})$ is convex, and subtracting a convex function preserves concavity. Maximizing a concave function (equivalently, minimizing its negative) is a convex optimization problem in which any local optimum is a global optimum.
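For readers who want the one-line justification that $\log Z$ is convex: its Hessian with respect to $\boldsymbol{\lambda}$ is the covariance matrix of the global feature vector under the model distribution, and covariance matrices are positive semidefinite:

$$\nabla^2_{\boldsymbol{\lambda}} \log Z(\mathbf{x}) = \mathbb{E}_{P(\cdot \mid \mathbf{x})}\left[\mathbf{F}\mathbf{F}^T\right] - \mathbb{E}_{P(\cdot \mid \mathbf{x})}[\mathbf{F}]\, \mathbb{E}_{P(\cdot \mid \mathbf{x})}[\mathbf{F}]^T = \mathrm{Cov}_{P(\cdot \mid \mathbf{x})}\left[\mathbf{F}(\mathbf{x}, \mathbf{y}')\right] \succeq 0$$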
To optimize the log-likelihood using gradient-based methods, we need to compute $\nabla_{\boldsymbol{\lambda}} \mathcal{L}$.
Gradient Derivation:
For a single training example $(\mathbf{x}, \mathbf{y})$:
$$\nabla_{\boldsymbol{\lambda}} \log P(\mathbf{y} \mid \mathbf{x}) = \nabla_{\boldsymbol{\lambda}} \left[ \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}) - \log Z(\mathbf{x}) \right]$$
The first term is straightforward: $$\nabla_{\boldsymbol{\lambda}} \left[ \boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}) \right] = \mathbf{F}(\mathbf{x}, \mathbf{y})$$
For the second term, we use the chain rule. Let's compute $\nabla_{\boldsymbol{\lambda}} \log Z(\mathbf{x})$:
$$\nabla_{\boldsymbol{\lambda}} \log Z(\mathbf{x}) = \frac{1}{Z(\mathbf{x})} \nabla_{\boldsymbol{\lambda}} Z(\mathbf{x})$$
$$= \frac{1}{Z(\mathbf{x})} \nabla_{\boldsymbol{\lambda}} \sum_{\mathbf{y}'} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}'))$$
$$= \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{y}'} \exp(\boldsymbol{\lambda}^T \mathbf{F}(\mathbf{x}, \mathbf{y}')) \cdot \mathbf{F}(\mathbf{x}, \mathbf{y}')$$
$$= \sum_{\mathbf{y}'} P(\mathbf{y}' \mid \mathbf{x}) \cdot \mathbf{F}(\mathbf{x}, \mathbf{y}')$$
$$= \mathbb{E}_{\mathbf{y}' \sim P(\cdot \mid \mathbf{x})}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$$
The Gradient Formula:
Combining both terms:
$$\nabla_{\boldsymbol{\lambda}} \log P(\mathbf{y} \mid \mathbf{x}) = \mathbf{F}(\mathbf{x}, \mathbf{y}) - \mathbb{E}_{\mathbf{y}' \sim P(\cdot \mid \mathbf{x})}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$$
This is the "observed minus expected" gradient:
Interpretation:
For the full training set:
$$\nabla_{\boldsymbol{\lambda}} \mathcal{L} = \sum_{i=1}^{N} \left[ \mathbf{F}(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) - \mathbb{E}_{P(\cdot \mid \mathbf{x}^{(i)})}[\mathbf{F}(\mathbf{x}^{(i)}, \mathbf{y}')] \right]$$
The expectation $\mathbb{E}[\mathbf{F}(\mathbf{x}, \mathbf{y}')]$ sums over ALL possible label sequences—exponentially many! The forward-backward algorithm makes this tractable: the expected value of each feature decomposes into sums over pairwise marginals:
$$\mathbb{E}[f_k] = \sum_{i} \sum_{y, y'} P(y_{i-1} = y',\, y_i = y \mid \mathbf{x}) \cdot f_k(\mathbf{x}, y, y', i)$$
```python
import numpy as np
from typing import List, Tuple, Dict

def compute_gradient(
    sequences: List[Tuple[List[str], List[int]]],  # (tokens, labels) pairs
    weights: np.ndarray,         # Current weights (K,)
    num_labels: int,
    feature_extractor,           # Extract features
    forward_backward_fn,         # Compute marginals
) -> Tuple[np.ndarray, float]:
    """
    Compute gradient of log-likelihood with respect to weights.

    Returns:
        gradient: Gradient vector (K,)
        log_likelihood: Current log-likelihood value
    """
    K = len(weights)
    gradient = np.zeros(K)
    total_log_likelihood = 0.0

    for tokens, correct_labels in sequences:
        n = len(tokens)

        # =====================
        # Step 1: Observed features (for correct label sequence)
        # =====================
        observed_features = np.zeros(K)
        for i in range(n):
            y = correct_labels[i]
            y_prev = correct_labels[i-1] if i > 0 else -1
            # Get active features and accumulate
            active = feature_extractor.get_features(tokens, i, y, y_prev)
            for feat_id, value in active:
                observed_features[feat_id] += value

        # =====================
        # Step 2: Expected features (under current model)
        # =====================
        # Run forward-backward to get marginals
        log_alpha, log_beta, log_Z, marginals, pairwise = forward_backward_fn(
            tokens, weights, num_labels, feature_extractor
        )

        expected_features = np.zeros(K)
        for i in range(n):
            if i == 0:
                # First position: only unigram marginals
                for y in range(num_labels):
                    prob = marginals[i, y]
                    active = feature_extractor.get_features(tokens, i, y, -1)
                    for feat_id, value in active:
                        expected_features[feat_id] += prob * value
            else:
                # Other positions: pairwise marginals
                for y_prev in range(num_labels):
                    for y in range(num_labels):
                        prob = pairwise[i-1, y_prev, y]
                        active = feature_extractor.get_features(tokens, i, y, y_prev)
                        for feat_id, value in active:
                            expected_features[feat_id] += prob * value

        # =====================
        # Step 3: Accumulate gradient and likelihood
        # =====================
        # Gradient = observed - expected
        gradient += observed_features - expected_features

        # Log-likelihood = score(correct) - log Z
        score = np.dot(weights, observed_features)
        total_log_likelihood += score - log_Z

    return gradient, total_log_likelihood


def compute_gradient_with_regularization(
    gradient: np.ndarray,
    weights: np.ndarray,
    regularization: str = 'l2',
    reg_strength: float = 1.0
) -> np.ndarray:
    """
    Add regularization gradient component.

    For L2: penalty = (c/2) ||λ||²  → gradient contribution = -c * λ
    For L1: penalty = c ||λ||₁      → gradient contribution = -c * sign(λ)

    Note: We're MAXIMIZING log-likelihood, so we ADD the penalty
    gradient (which is negative, pulling weights toward zero).
    """
    if regularization == 'l2':
        # L2 regularization: gradient -= c * λ
        reg_gradient = -reg_strength * weights
    elif regularization == 'l1':
        # L1 regularization: gradient -= c * sign(λ)
        reg_gradient = -reg_strength * np.sign(weights)
    else:
        reg_gradient = np.zeros_like(weights)

    return gradient + reg_gradient


# Numerical gradient check (for verification)
def numerical_gradient(
    sequences: List[Tuple[List[str], List[int]]],
    weights: np.ndarray,
    compute_ll_fn,
    epsilon: float = 1e-5
) -> np.ndarray:
    """
    Compute gradient numerically for verification.

    For each weight dimension:
        ∂L/∂λ_k ≈ (L(λ+ε·e_k) - L(λ-ε·e_k)) / (2ε)
    """
    K = len(weights)
    numerical_grad = np.zeros(K)

    for k in range(K):
        weights_plus = weights.copy()
        weights_plus[k] += epsilon
        ll_plus = compute_ll_fn(sequences, weights_plus)

        weights_minus = weights.copy()
        weights_minus[k] -= epsilon
        ll_minus = compute_ll_fn(sequences, weights_minus)

        numerical_grad[k] = (ll_plus - ll_minus) / (2 * epsilon)

    return numerical_grad


print("Gradient computation for CRF training")
print("=" * 50)
print("Key formula: ∇L = F_observed - E[F_expected]")
print("")
print("This is the same gradient structure as:")
print("  - Logistic regression")
print("  - Maximum entropy models")
print("  - All exponential family distributions")
```

With potentially millions of features, CRFs are prone to overfitting—especially when features are sparse and training data is limited. Regularization is essential.
Regularized Objective:
$$\mathcal{L}_{\text{reg}}(\boldsymbol{\lambda}) = \mathcal{L}(\boldsymbol{\lambda}) - \text{penalty}(\boldsymbol{\lambda})$$
L2 Regularization (Ridge / Weight Decay):
$$\text{penalty}_{L2}(\boldsymbol{\lambda}) = \frac{c}{2} \|\boldsymbol{\lambda}\|_2^2 = \frac{c}{2} \sum_k \lambda_k^2$$
L2 shrinks all weights smoothly toward zero but rarely makes them exactly zero; it corresponds to a Gaussian prior on the weights.
L1 Regularization (Lasso):
$$\text{penalty}_{L1}(\boldsymbol{\lambda}) = c \|\boldsymbol{\lambda}\|_1 = c \sum_k |\lambda_k|$$
L1 drives many weights exactly to zero, producing sparse models that perform implicit feature selection—valuable when feature sets number in the millions; it corresponds to a Laplace prior.
Elastic Net (L1 + L2):
Combines benefits of both:
$$\text{penalty}_{EN}(\boldsymbol{\lambda}) = c_1 \|\boldsymbol{\lambda}\|_1 + \frac{c_2}{2} \|\boldsymbol{\lambda}\|_2^2$$
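A minimal sketch of the three penalties and their (sub)gradients, following the same sign conventions as the gradient code above (the function name is illustrative):

```python
import numpy as np

def penalty_and_grad(weights, kind='l2', c1=1.0, c2=1.0):
    """Return (penalty value, gradient of the penalty w.r.t. weights).

    Subtract the penalty from the log-likelihood and its gradient from
    the log-likelihood gradient. For L1, np.sign is a subgradient
    (zero at weights that are exactly zero).
    """
    if kind == 'l2':
        return 0.5 * c2 * np.sum(weights ** 2), c2 * weights
    if kind == 'l1':
        return c1 * np.sum(np.abs(weights)), c1 * np.sign(weights)
    if kind == 'elastic_net':
        l1_pen, l1_grad = penalty_and_grad(weights, 'l1', c1=c1)
        l2_pen, l2_grad = penalty_and_grad(weights, 'l2', c2=c2)
        return l1_pen + l2_pen, l1_grad + l2_grad
    return 0.0, np.zeros_like(weights)
```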
Selecting Regularization Strength:
The regularization strength $c$ is a hyperparameter, selected by training with several candidate values and comparing performance on a held-out validation set (or via cross-validation).
Typical heuristic: Start with $c = 1.0$ and adjust by factors of 10 based on validation performance, as sketched below.
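A hedged sketch of that selection loop; `train_fn` and `eval_fn` are hypothetical hooks for your trainer and validation metric:

```python
import numpy as np

def select_reg_strength(train_data, val_data, train_fn, eval_fn,
                        candidates=(0.01, 0.1, 1.0, 10.0)):
    """Pick the regularization strength with the best validation score.

    train_fn(data, c) -> model and eval_fn(model, data) -> float
    (higher = better) are hypothetical hooks; plug in your own
    trainer and metric (e.g., entity-level F1).
    """
    best_c, best_score = None, -np.inf
    for c in candidates:
        model = train_fn(train_data, c)
        score = eval_fn(model, val_data)
        print(f"c = {c:g}: validation score = {score:.4f}")
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```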
Some practitioners apply weaker regularization to transition features (label bigrams) than to emission features. Rationale: transition patterns are fundamental to sequence structure (e.g., B-PER must be followed by I-PER or O), while emission features are where overfitting typically occurs.
With the objective and gradients defined, we need algorithms to find the optimal weights. Several approaches are used in practice.
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno):
The workhorse of classical CRF training: a quasi-Newton method that approximates curvature from a short history of recent gradients, using memory linear in the number of features.
Advantages: fast, stable convergence in relatively few iterations; no learning rate to tune (step sizes come from line search); a reliable default for batch training.
Disadvantages: each update requires a gradient over the full training set, which is expensive for large corpora; unsuitable for online learning; the standard algorithm assumes a differentiable objective, so L1 regularization requires the OWL-QN variant.
```python
import numpy as np
from scipy.optimize import minimize
from typing import List, Tuple, Callable

class CRFTrainer:
    """
    CRF trainer supporting multiple optimization algorithms.
    """

    def __init__(
        self,
        num_features: int,
        num_labels: int,
        feature_extractor,
        forward_backward_fn=None,  # marginal computation, wired through so methods can use it
        regularization: str = 'l2',
        reg_strength: float = 1.0
    ):
        self.num_features = num_features
        self.num_labels = num_labels
        self.feature_extractor = feature_extractor
        self.forward_backward = forward_backward_fn
        self.regularization = regularization
        self.reg_strength = reg_strength
        self.weights = None

    def _objective_and_gradient(
        self,
        weights: np.ndarray,
        sequences: List[Tuple[List[str], List[int]]]
    ) -> Tuple[float, np.ndarray]:
        """
        Compute negative log-likelihood and its gradient.

        Note: We return NEGATIVE because scipy.optimize.minimize minimizes.
        """
        # Compute log-likelihood and gradient
        gradient, log_likelihood = compute_gradient(
            sequences, weights, self.num_labels,
            self.feature_extractor, self.forward_backward
        )

        # Add regularization
        if self.regularization == 'l2':
            reg_term = 0.5 * self.reg_strength * np.sum(weights ** 2)
            reg_grad = self.reg_strength * weights
        elif self.regularization == 'l1':
            reg_term = self.reg_strength * np.sum(np.abs(weights))
            reg_grad = self.reg_strength * np.sign(weights)
        else:
            reg_term = 0.0
            reg_grad = np.zeros_like(weights)

        # Return negative (for minimization)
        neg_ll = -(log_likelihood - reg_term)
        neg_grad = -(gradient - reg_grad)

        return neg_ll, neg_grad

    def train_lbfgs(
        self,
        sequences: List[Tuple[List[str], List[int]]],
        max_iter: int = 100,
        verbose: bool = True
    ) -> np.ndarray:
        """
        Train CRF using L-BFGS optimizer.

        This is the standard approach for classical CRFs.
        """
        # Initialize weights
        weights = np.zeros(self.num_features)

        # Track iterations
        iteration = [0]

        def callback(xk):
            if verbose and iteration[0] % 10 == 0:
                neg_ll, _ = self._objective_and_gradient(xk, sequences)
                print(f"Iteration {iteration[0]}: neg_ll = {neg_ll:.4f}")
            iteration[0] += 1

        # Run L-BFGS-B (bounded L-BFGS)
        result = minimize(
            fun=lambda w: self._objective_and_gradient(w, sequences),
            x0=weights,
            method='L-BFGS-B',
            jac=True,  # Objective returns gradient
            options={'maxiter': max_iter, 'disp': verbose},
            callback=callback
        )

        self.weights = result.x
        return self.weights

    def train_sgd(
        self,
        sequences: List[Tuple[List[str], List[int]]],
        epochs: int = 10,
        learning_rate: float = 0.1,
        batch_size: int = 32,
        lr_decay: float = 0.01,
        verbose: bool = True
    ) -> np.ndarray:
        """
        Train CRF using Stochastic Gradient Descent.

        Better for large datasets; required for neural CRFs.
        """
        weights = np.zeros(self.num_features)
        n_sequences = len(sequences)

        for epoch in range(epochs):
            # Shuffle data
            indices = np.random.permutation(n_sequences)
            epoch_loss = 0.0

            # Mini-batch iteration
            for batch_start in range(0, n_sequences, batch_size):
                batch_indices = indices[batch_start:batch_start + batch_size]
                batch = [sequences[i] for i in batch_indices]

                # Compute gradient on batch
                gradient, batch_ll = compute_gradient(
                    batch, weights, self.num_labels,
                    self.feature_extractor, self.forward_backward
                )

                # Add regularization gradient
                if self.regularization == 'l2':
                    gradient -= self.reg_strength * weights
                    reg_term = 0.5 * self.reg_strength * np.sum(weights ** 2)
                else:
                    reg_term = 0

                # Decayed learning rate
                t = epoch * (n_sequences // batch_size) + batch_start // batch_size
                lr = learning_rate / (1 + lr_decay * t)

                # Update weights (gradient ascent for maximization)
                weights += lr * gradient

                epoch_loss += -batch_ll + reg_term

            if verbose:
                avg_loss = epoch_loss / (n_sequences // batch_size)
                print(f"Epoch {epoch+1}/{epochs}: avg_loss = {avg_loss:.4f}")

        self.weights = weights
        return self.weights

    def train_adam(
        self,
        sequences: List[Tuple[List[str], List[int]]],
        epochs: int = 10,
        learning_rate: float = 0.001,
        batch_size: int = 32,
        beta1: float = 0.9,
        beta2: float = 0.999,
        epsilon: float = 1e-8,
        verbose: bool = True
    ) -> np.ndarray:
        """
        Train CRF using Adam optimizer.

        Modern variant of SGD with adaptive learning rates.
        Standard for neural CRF training.
        """
        weights = np.zeros(self.num_features)
        m = np.zeros_like(weights)  # First moment
        v = np.zeros_like(weights)  # Second moment
        t = 0

        n_sequences = len(sequences)

        for epoch in range(epochs):
            indices = np.random.permutation(n_sequences)

            for batch_start in range(0, n_sequences, batch_size):
                t += 1
                batch_indices = indices[batch_start:batch_start + batch_size]
                batch = [sequences[i] for i in batch_indices]

                # Compute gradient
                gradient, _ = compute_gradient(
                    batch, weights, self.num_labels,
                    self.feature_extractor, self.forward_backward
                )

                # Add regularization
                if self.regularization == 'l2':
                    gradient -= self.reg_strength * weights

                # Negate for gradient descent (we're maximizing)
                gradient = -gradient

                # Update moments
                m = beta1 * m + (1 - beta1) * gradient
                v = beta2 * v + (1 - beta2) * (gradient ** 2)

                # Bias correction
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)

                # Update weights
                weights -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

            if verbose:
                print(f"Epoch {epoch+1}/{epochs} complete")

        self.weights = weights
        return self.weights


# Algorithm comparison
print("CRF Optimization Algorithms Comparison")
print("=" * 60)
print(f"{'Algorithm':<15} {'Best For':<25} {'Hyperparameters':<20}")
print("-" * 60)
print(f"{'L-BFGS':<15} {'Small/medium data':<25} {'None (line search)':<20}")
print(f"{'SGD':<15} {'Large data, online':<25} {'learning_rate, decay':<20}")
print(f"{'Adam':<15} {'Neural CRFs':<25} {'lr, beta1, beta2':<20}")
```

Choosing an Algorithm:

| Scenario | Recommended Algorithm | Rationale |
|---|---|---|
| Small dataset (< 10K sequences) | L-BFGS | Fast convergence, no tuning needed |
| Medium dataset (10K-100K) | L-BFGS or Mini-batch SGD | L-BFGS if fits in memory |
| Large dataset (> 100K) | SGD or Adam | Full gradient too expensive |
| Neural CRF (BiLSTM-CRF) | Adam | Standard for neural networks |
| Online learning | SGD | Updates with each example |
| L1 regularization | OWL-QN | Handles non-differentiability |
Modern sequence labeling combines neural networks (for feature learning) with CRFs (for structured prediction). The BiLSTM-CRF architecture has become the de facto standard.
Architecture Overview: tokens → word embeddings → BiLSTM encoder → per-position hidden states → linear projection to emission scores → CRF layer (emission + transition scores, globally normalized) → label sequence.
Key Insight:
The BiLSTM replaces hand-crafted features with learned features. At each position, the BiLSTM hidden state encodes rich contextual information. The CRF layer then ensures the output sequence is globally coherent.
Score Computation:
For a sequence with BiLSTM outputs $\mathbf{h}_1, \ldots, \mathbf{h}_n$:
$$\text{score}(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^{n} \left[ W_{y_i} \cdot \mathbf{h}_i + T_{y_{i-1}, y_i} \right]$$
where $\mathbf{h}_i$ is the BiLSTM hidden state at position $i$, $W_{y_i}$ is the row of the emission projection for label $y_i$ (so $W_{y_i} \cdot \mathbf{h}_i$ is the emission score), and $T_{y_{i-1}, y_i}$ is the learned transition score from label $y_{i-1}$ to label $y_i$.
```python
import torch
import torch.nn as nn
from typing import List, Tuple

class BiLSTMCRF(nn.Module):
    """
    BiLSTM-CRF model for sequence labeling.

    Combines neural feature learning with structured prediction.
    """

    def __init__(
        self,
        vocab_size: int,
        embedding_dim: int,
        hidden_dim: int,
        num_labels: int,
        padding_idx: int = 0
    ):
        super().__init__()

        self.num_labels = num_labels
        self.hidden_dim = hidden_dim

        # Embedding layer
        self.embedding = nn.Embedding(
            vocab_size, embedding_dim, padding_idx=padding_idx
        )

        # BiLSTM encoder
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim // 2,
            num_layers=1,
            bidirectional=True,
            batch_first=True
        )

        # Emission projection: hidden -> label scores
        self.emission = nn.Linear(hidden_dim, num_labels)

        # CRF transition matrix: T[i,j] = score of transitioning from i to j
        # +2 for START and END tags
        self.transitions = nn.Parameter(
            torch.randn(num_labels + 2, num_labels + 2)
        )

        self.START_TAG = num_labels
        self.END_TAG = num_labels + 1

        # Constrain impossible transitions (note the from-i-to-j convention):
        self.transitions.data[:, self.START_TAG] = -10000  # Can't transition TO start
        self.transitions.data[self.END_TAG, :] = -10000    # Can't transition FROM end

    def _forward_alg(self, emissions: torch.Tensor) -> torch.Tensor:
        """
        Forward algorithm in log-space.

        Args:
            emissions: Emission scores (seq_len, num_labels)
        Returns:
            log_Z: Log partition function (scalar)
        """
        seq_len = emissions.size(0)

        # Initialize: α[START] = 0, others = -inf
        alpha = torch.full((self.num_labels + 2,), -10000.0, device=emissions.device)
        alpha[self.START_TAG] = 0.0

        for i in range(seq_len):
            emit = emissions[i]  # (num_labels,)

            # Expand for broadcasting:
            #   alpha:       (L+2,) -> (L+2, 1)
            #   transitions: (L+2, L+2) -> keep label columns (L+2, L)
            #   emit:        (L,)  -> (1, L)
            alpha_expanded = alpha.unsqueeze(1)            # (L+2, 1)
            trans = self.transitions[:, :self.num_labels]  # (L+2, L)
            emit_expanded = emit.unsqueeze(0)              # (1, L)

            # Next alpha via log-sum-exp over previous states
            scores = alpha_expanded + trans + emit_expanded  # (L+2, L)
            alpha_new = torch.logsumexp(scores, dim=0)       # (L,)

            # Expand back to include START and END slots
            alpha = torch.full((self.num_labels + 2,), -10000.0, device=emissions.device)
            alpha[:self.num_labels] = alpha_new

        # Terminal: add transition to END
        terminal = alpha + self.transitions[:, self.END_TAG]
        log_Z = torch.logsumexp(terminal, dim=0)

        return log_Z

    def _score_sentence(
        self,
        emissions: torch.Tensor,
        tags: torch.Tensor
    ) -> torch.Tensor:
        """
        Compute score of a specific tag sequence.

        Args:
            emissions: Emission scores (seq_len, num_labels)
            tags: Gold tag sequence (seq_len,)
        Returns:
            score: Score of this tag sequence (scalar)
        """
        seq_len = emissions.size(0)

        score = self.transitions[self.START_TAG, tags[0]]
        score += emissions[0, tags[0]]

        for i in range(1, seq_len):
            score += self.transitions[tags[i-1], tags[i]]
            score += emissions[i, tags[i]]

        score += self.transitions[tags[-1], self.END_TAG]
        return score

    def neg_log_likelihood(
        self,
        sentence: torch.Tensor,
        tags: torch.Tensor
    ) -> torch.Tensor:
        """
        Compute negative log-likelihood for training.

        Args:
            sentence: Word indices (seq_len,)
            tags: Gold tag indices (seq_len,)
        Returns:
            loss: Negative log-likelihood (scalar)
        """
        # Get emissions from BiLSTM
        embeds = self.embedding(sentence)
        lstm_out, _ = self.lstm(embeds.unsqueeze(0))
        lstm_out = lstm_out.squeeze(0)       # (seq_len, hidden_dim)
        emissions = self.emission(lstm_out)  # (seq_len, num_labels)

        # Compute forward score (log Z)
        forward_score = self._forward_alg(emissions)

        # Compute gold score
        gold_score = self._score_sentence(emissions, tags)

        # NLL = log Z - gold_score
        return forward_score - gold_score

    def viterbi_decode(self, sentence: torch.Tensor) -> List[int]:
        """
        Find best tag sequence using Viterbi.

        Args:
            sentence: Word indices (seq_len,)
        Returns:
            best_path: List of tag indices
        """
        # Get emissions
        embeds = self.embedding(sentence)
        lstm_out, _ = self.lstm(embeds.unsqueeze(0))
        lstm_out = lstm_out.squeeze(0)
        emissions = self.emission(lstm_out)

        seq_len = emissions.size(0)

        # Viterbi algorithm
        backpointers = []

        # Initialize
        viterbi = torch.full((self.num_labels + 2,), -10000.0)
        viterbi[self.START_TAG] = 0.0

        for i in range(seq_len):
            emit = emissions[i]

            viterbi_expanded = viterbi.unsqueeze(1)
            trans = self.transitions[:, :self.num_labels]
            emit_expanded = emit.unsqueeze(0)

            # Max instead of log-sum-exp: track the best previous state
            scores = viterbi_expanded + trans + emit_expanded
            best_prev = torch.argmax(scores, dim=0)
            viterbi_new = scores[best_prev, torch.arange(self.num_labels)]

            backpointers.append(best_prev[:self.num_labels])

            viterbi = torch.full((self.num_labels + 2,), -10000.0)
            viterbi[:self.num_labels] = viterbi_new

        # Terminal
        terminal_scores = viterbi + self.transitions[:, self.END_TAG]
        best_last = torch.argmax(terminal_scores).item()

        # Backtrack
        best_path = [best_last]
        for bp in reversed(backpointers):
            best_path.append(bp[best_path[-1]].item())
        best_path.reverse()
        best_path = best_path[1:]  # Drop the START placeholder at the front
        return best_path


# Example usage
print("BiLSTM-CRF Architecture")
print("=" * 50)
print("Key components:")
print("  1. Word embeddings → BiLSTM → hidden states")
print("  2. Hidden states → Linear → emission scores")
print("  3. Emission scores + transitions → CRF score")
print("  4. Training: minimize -log P(y|x)")
print("  5. Inference: Viterbi decoding")
```

Without the CRF layer, a neural network outputs independent label distributions at each position. This can produce invalid sequences (e.g., I-PER without a preceding B-PER). The CRF layer scores entire sequences, ensuring global coherence. Empirically, BiLSTM-CRF typically improves F1 by 1-3 points over a plain BiLSTM on NER tasks.
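Such label-scheme constraints can also be enforced directly, as hard constraints baked into the transition matrix before decoding. A minimal sketch under the BIO scheme (the tag list is illustrative): build an additive mask that forbids `I-X` unless it follows `B-X` or `I-X` of the same type.

```python
import torch

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # illustrative tag set

def bio_transition_mask(labels):
    """Return an (L, L) mask: 0 for allowed transitions, -10000 for forbidden.

    Rule: I-X may only follow B-X or I-X of the same entity type X.
    Add this mask to the learned transition matrix before decoding.
    """
    L = len(labels)
    mask = torch.zeros(L, L)
    for j, to_tag in enumerate(labels):
        if not to_tag.startswith("I-"):
            continue
        ent_type = to_tag[2:]
        for i, from_tag in enumerate(labels):
            if from_tag not in (f"B-{ent_type}", f"I-{ent_type}"):
                mask[i, j] = -10000.0
    return mask

print(bio_transition_mask(labels))
```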
Training CRFs effectively requires attention to practical details. Here are key considerations based on extensive empirical experience:
| Hyperparameter | Classical CRF | BiLSTM-CRF |
|---|---|---|
| Learning rate | 0.1 (SGD), N/A (L-BFGS) | 0.001-0.01 (Adam) |
| L2 regularization | 0.1-10.0 | 1e-5 to 1e-3 (weight decay) |
| Batch size | Full batch or 32-128 | 16-64 |
| Epochs | Until convergence (10-100) | 20-50 |
| Dropout | N/A | 0.3-0.5 on LSTM/embeddings |
| Hidden dim | N/A | 100-300 per direction |
| Embedding dim | N/A | 100-300 |
Debugging Tips:
| Symptom | Possible Cause | Solution |
|---|---|---|
| Loss not decreasing | Learning rate too low | Increase learning rate |
| Loss oscillating | Learning rate too high | Decrease learning rate |
| Training loss low, val loss high | Overfitting | Increase regularization, use dropout |
| Very slow convergence | Poor initialization | Check feature scaling, embedding initialization |
| NaN loss | Numerical instability | Use log-space algorithms, gradient clipping |
| All predictions same | Bias in data or features | Check label distribution, feature coverage |
In NER, the O (outside) label dominates (>80% of tokens). Without handling, models may over-predict O. Solutions: (1) Class-weighted loss, (2) Downsampling majority class, (3) Focus evaluation on entity-level F1 rather than token accuracy.
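As one concrete version of class-weighted loss, here is a heuristic sketch that weights each label by inverse frequency. Note that a CRF's sequence-level likelihood does not decompose into per-token losses, so weights like these apply most directly to a per-token softmax baseline or to reweighting marginal-based losses:

```python
import numpy as np
from collections import Counter

def inverse_frequency_weights(label_sequences, num_labels, smoothing=1.0):
    """Compute per-label weights inversely proportional to label frequency.

    Rare entity labels get weight > 1, the dominant O label gets weight < 1.
    Multiply each token's loss term by the weight of its gold label.
    """
    counts = Counter(y for seq in label_sequences for y in seq)
    total = sum(counts.values())
    freqs = np.array([
        (counts.get(k, 0) + smoothing) / (total + smoothing * num_labels)
        for k in range(num_labels)
    ])
    weights = 1.0 / freqs
    return weights / weights.mean()  # normalize so the average weight is 1

# Toy example: label 0 (= "O") dominates
seqs = [[0, 0, 0, 0, 1, 2], [0, 0, 0, 0, 0, 0]]
print(inverse_frequency_weights(seqs, num_labels=3))
```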
We have covered the complete CRF training pipeline, from objective formulation to modern neural approaches. The essential concepts: the conditional log-likelihood is concave, so training has no bad local optima; its gradient is observed-minus-expected feature counts, computed tractably with forward-backward; L1/L2 regularization controls overfitting in models with millions of sparse features; L-BFGS is the classical workhorse while SGD and Adam scale to large data and neural models; and BiLSTM-CRF combines learned features with globally coherent structured prediction.
What's next:
With training covered, we'll apply everything we've learned to the canonical application: Sequence Labeling. The next page covers NER, POS tagging, and other sequence labeling tasks, showing how CRFs are used in practice to solve real-world problems.
You now understand CRF training: the conditional log-likelihood objective, gradient computation via forward-backward, regularization strategies, optimization algorithms, and neural network integration. Next, we'll see CRFs in action for sequence labeling tasks.