The expressive power of Conditional Random Fields comes not from the model structure (which is relatively simple) but from the feature functions we design. While the linear-chain structure provides efficient inference, features provide the representation—encoding what the model can observe and learn.
Feature engineering for CRFs is where domain expertise meets machine learning. A well-designed feature set can make the difference between 70% and 95% accuracy on sequence labeling tasks. This page provides a comprehensive treatment of feature function design.
The Feature Function Signature:
In a linear-chain CRF, each feature function has the form:
$$f_k(\mathbf{x}, y_i, y_{i-1}, i) \to \mathbb{R}$$
It takes the full observation sequence $\mathbf{x}$, the current label $y_i$, the previous label $y_{i-1}$, and the position $i$, and returns a real-valued score. In practice, most features are binary (0 or 1), though real-valued features are also used.
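To make the signature concrete, here is a minimal sketch of one binary feature function written as plain Python (the function name and the PERSON label are illustrative choices, not part of any fixed API):

```python
# A minimal sketch of a binary feature function f_k(x, y_i, y_prev, i).
# The function name and label strings are illustrative only.
def word_is_obama_and_person(x, y_i, y_prev, i):
    """Fires (returns 1.0) when the current word is 'Obama' and the label is PERSON."""
    return 1.0 if x[i] == "Obama" and y_i == "PERSON" else 0.0

x = ["President", "Obama", "spoke"]
print(word_is_obama_and_person(x, "PERSON", "O", 1))  # 1.0: the feature fires
print(word_is_obama_and_person(x, "O", "O", 1))       # 0.0: wrong label, does not fire
```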
By the end of this page, you will understand: (1) The taxonomy of CRF features (emission, transition, contextual), (2) Feature templates and automatic feature generation, (3) Common feature patterns for NLP tasks, (4) Sparse vs. dense feature representations, and (5) Best practices for feature engineering in practice.
CRF features can be categorized based on what aspects of the input and output they condition on. Understanding this taxonomy helps guide systematic feature design.
1. Emission (State) Features:
These features relate observations to labels at a single position:
$$f_{\text{emission}}(\mathbf{x}, y_i, i)$$
They capture: "Given this observation pattern, which label is likely?"
Example: Word identity feature $$f_{\text{word=Obama}}(\mathbf{x}, y_i, i) = \mathbb{1}[x_i = \text{"Obama"} \wedge y_i = \text{PERSON}]$$
2. Transition Features:
These features involve only adjacent labels, independent of observations:
$$f_{\text{transition}}(y_i, y_{i-1})$$
They capture label grammar: "Which label sequences are likely/unlikely?"
Example: Label bigram $$f_{\text{B-PER} \to \text{I-PER}}(y_i, y_{i-1}) = \mathbb{1}[y_{i-1} = \text{B-PER} \wedge y_i = \text{I-PER}]$$
3. Contextual Emission Features:
These features combine observations from the surrounding context with the current label:
$$f_{\text{context}}(\mathbf{x}, y_i, i)$$
They capture: "Given what's nearby, which label is likely?"
Example: Previous word feature $$f_{\text{prev=President}}(\mathbf{x}, y_i, i) = \mathbb{1}[x_{i-1} = \text{"President"} \wedge y_i = \text{PERSON}]$$
4. Combined Features:
These features involve both observations and label transitions:
$$f_{\text{combined}}(\mathbf{x}, y_i, y_{i-1}, i)$$
They capture: "Given this context AND the previous label, which label is likely?"
Example: $$f_{\text{cap+B-PER} \to \text{I-PER}}(\mathbf{x}, y_i, y_{i-1}, i) = \mathbb{1}[\text{is\_capitalized}(x_i) \wedge y_{i-1} = \text{B-PER} \wedge y_i = \text{I-PER}]$$
| Type | Depends On | Captures | Typical Count |
|---|---|---|---|
| Emission | $x_i, y_i$ | Label-observation correlation | Thousands (vocab × labels) |
| Transition | $y_{i-1}, y_i$ | Label grammar | $L^2$ (label pairs) |
| Contextual | $x_{i-k:i+k}, y_i$ | Context-label correlation | Many thousands |
| Combined | $\mathbf{x}, y_{i-1}, y_i$ | Transition patterns in context | Can be millions |
In CRFs, you're not limited to a single feature per observation. The key insight is to create MANY overlapping features that capture different aspects of the same position. The model learns to weight them appropriately. More features (with proper regularization) generally improve performance.
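As a small illustration of this, several overlapping indicator features typically fire at a single position, and the local score for a candidate label is just their weighted sum. The feature names and weights below are made up for illustration:

```python
# Overlapping features firing at one position for the candidate label B-PER.
# Feature names and weights are made-up illustrations, not learned values.
active_features = {
    "word[0]=Obama|y=B-PER": 3.1,
    "shape[0]=Xx|y=B-PER": 0.8,
    "word[-1]=President|y=B-PER": 1.4,
    "suffix3=ama|y=B-PER": 0.2,
}
score_b_per = sum(active_features.values())
print(f"Local score for B-PER at this position: {score_b_per:.1f}")  # 5.5
```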
In practice, we don't define individual feature functions manually. Instead, we define feature templates that automatically generate features for each position in each training example.
Feature Template Definition:
A feature template is a function that, given a position and observation sequence, extracts a feature identifier. Combined with label values, this identifier becomes a specific feature.
Template notation (used in tools like CRF++):
# Unigram templates (emission features)
U00:%x[-1,0] # Previous word
U01:%x[0,0] # Current word
U02:%x[1,0] # Next word
U03:%x[0,0]/%x[1,0] # Current + next word bigram
# Bigram templates (transition features)
B # Current label given previous label
The %x[row, col] notation indexes into the observation matrix (row = position offset, col = feature column).
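As a rough sketch of how such templates expand, assume a CRF++-style observation matrix with one row per token and the word in column 0; each %x[row, col] is resolved relative to the current position (the boundary placeholder below is a simplification, not CRF++'s exact convention):

```python
# Sketch of CRF++-style template expansion; the observation matrix has one
# row per token and the word in column 0. Boundary handling is simplified.
observations = [["President"], ["Obama"], ["was"]]

def expand(template_name, offsets, position):
    """Expand a unigram template such as U03:%x[0,0]/%x[1,0] at a position."""
    values = []
    for row, col in offsets:
        idx = position + row
        if 0 <= idx < len(observations):
            values.append(observations[idx][col])
        else:
            values.append(f"_B{row}")  # simplified out-of-range placeholder
    return f"{template_name}:{'/'.join(values)}"

print(expand("U01", [(0, 0)], 1))          # U01:Obama        (current word)
print(expand("U00", [(-1, 0)], 1))         # U00:President    (previous word)
print(expand("U03", [(0, 0), (1, 0)], 1))  # U03:Obama/was    (current + next)
```

The fuller listing below implements the same idea directly in Python, without a template file.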
```python
from typing import List, Dict, Set, Tuple, Callable
from dataclasses import dataclass
from collections import defaultdict


@dataclass
class Token:
    """Represents a token with its features."""
    word: str
    pos: str = None    # Part-of-speech (if available)
    chunk: str = None  # Chunk tag (if available)


class FeatureTemplate:
    """
    Feature template that generates feature identifiers.

    A template extracts an observation-based identifier for each position.
    Combined with label(s), this becomes a feature for the CRF.
    """

    def __init__(self, name: str, extractor: Callable[[List[Token], int], str]):
        """
        Args:
            name: Template identifier (e.g., "word[0]")
            extractor: Function (tokens, position) -> feature_id or None
        """
        self.name = name
        self.extractor = extractor

    def extract(self, tokens: List[Token], position: int) -> str:
        """Extract feature identifier at given position."""
        return self.extractor(tokens, position)


def create_standard_templates() -> List[FeatureTemplate]:
    """
    Create standard NER feature templates.

    Returns list of templates commonly used for named entity recognition.
    """
    templates = []

    # ========== Word Features ==========
    # Current word
    templates.append(FeatureTemplate(
        "word[0]",
        lambda tokens, i: f"word[0]={tokens[i].word}" if i < len(tokens) else None
    ))

    # Previous word (with boundary handling)
    templates.append(FeatureTemplate(
        "word[-1]",
        lambda tokens, i: f"word[-1]={tokens[i-1].word}" if i > 0 else "word[-1]=<BOS>"
    ))

    # Next word
    templates.append(FeatureTemplate(
        "word[+1]",
        lambda tokens, i: f"word[+1]={tokens[i+1].word}" if i < len(tokens) - 1 else "word[+1]=<EOS>"
    ))

    # Word bigrams
    templates.append(FeatureTemplate(
        "word[-1,0]",
        lambda tokens, i: f"word[-1,0]={tokens[i-1].word if i > 0 else '<BOS>'}_{tokens[i].word}"
    ))
    templates.append(FeatureTemplate(
        "word[0,+1]",
        lambda tokens, i: f"word[0,+1]={tokens[i].word}_{tokens[i+1].word if i < len(tokens) - 1 else '<EOS>'}"
    ))

    # ========== Shape Features ==========
    def get_word_shape(word: str) -> str:
        """Convert word to shape (e.g., 'John' -> 'Xxxx')."""
        shape = []
        for c in word:
            if c.isupper():
                shape.append('X')
            elif c.islower():
                shape.append('x')
            elif c.isdigit():
                shape.append('d')
            else:
                shape.append(c)
        return ''.join(shape)

    def get_short_shape(word: str) -> str:
        """Collapsed shape (e.g., 'John' -> 'Xx')."""
        shape = get_word_shape(word)
        # Collapse consecutive same characters
        result = [shape[0]] if shape else []
        for c in shape[1:]:
            if c != result[-1]:
                result.append(c)
        return ''.join(result)

    templates.append(FeatureTemplate(
        "shape[0]",
        lambda tokens, i: f"shape[0]={get_short_shape(tokens[i].word)}"
    ))

    # ========== Capitalization Features ==========
    templates.append(FeatureTemplate(
        "capitalized[0]",
        lambda tokens, i: f"capitalized[0]={tokens[i].word[0].isupper()}" if tokens[i].word else None
    ))
    templates.append(FeatureTemplate(
        "all_caps[0]",
        lambda tokens, i: f"all_caps[0]={tokens[i].word.isupper()}"
    ))
    templates.append(FeatureTemplate(
        "all_lower[0]",
        lambda tokens, i: f"all_lower[0]={tokens[i].word.islower()}"
    ))

    # ========== Prefix/Suffix Features ==========
    for length in [2, 3, 4]:
        templates.append(FeatureTemplate(
            f"prefix{length}[0]",
            lambda tokens, i, l=length: f"prefix{l}[0]={tokens[i].word[:l]}" if len(tokens[i].word) >= l else None
        ))
        templates.append(FeatureTemplate(
            f"suffix{length}[0]",
            lambda tokens, i, l=length: f"suffix{l}[0]={tokens[i].word[-l:]}" if len(tokens[i].word) >= l else None
        ))

    # ========== Digit Features ==========
    templates.append(FeatureTemplate(
        "has_digit[0]",
        lambda tokens, i: f"has_digit[0]={any(c.isdigit() for c in tokens[i].word)}"
    ))
    templates.append(FeatureTemplate(
        "all_digits[0]",
        lambda tokens, i: f"all_digits[0]={tokens[i].word.isdigit()}"
    ))

    # ========== Punctuation Features ==========
    templates.append(FeatureTemplate(
        "has_hyphen[0]",
        lambda tokens, i: f"has_hyphen[0]={'-' in tokens[i].word}"
    ))
    templates.append(FeatureTemplate(
        "has_period[0]",
        lambda tokens, i: f"has_period[0]={'.' in tokens[i].word}"
    ))

    return templates


class CRFFeatureExtractor:
    """
    Complete feature extraction system for CRFs.

    Converts observation sequences into sparse feature vectors.
    """

    def __init__(self, templates: List[FeatureTemplate]):
        self.templates = templates
        self.feature_to_id: Dict[str, int] = {}
        self.id_to_feature: Dict[int, str] = {}
        self.label_to_id: Dict[str, int] = {}
        self.is_fitted = False

    def fit(self, sequences: List[List[Token]], labels: List[List[str]]) -> None:
        """
        Build feature vocabulary from training data.

        Args:
            sequences: List of token sequences
            labels: List of label sequences
        """
        # Collect all labels
        all_labels = set()
        for label_seq in labels:
            all_labels.update(label_seq)
        self.label_to_id = {label: i for i, label in enumerate(sorted(all_labels))}

        # Collect all features
        feature_set: Set[str] = set()
        for tokens in sequences:
            for i in range(len(tokens)):
                for template in self.templates:
                    feat_id = template.extract(tokens, i)
                    if feat_id:
                        # Create features for each label
                        for label in all_labels:
                            feature_set.add(f"{feat_id}|y={label}")

        # Create features for label pairs (transitions)
        for label1 in all_labels:
            for label2 in all_labels:
                feature_set.add(f"transition|y[-1]={label1}|y={label2}")

        # Assign IDs
        self.feature_to_id = {f: i for i, f in enumerate(sorted(feature_set))}
        self.id_to_feature = {i: f for f, i in self.feature_to_id.items()}
        self.is_fitted = True

        print(f"Feature vocabulary: {len(self.feature_to_id)} features")
        print(f"Label vocabulary: {len(self.label_to_id)} labels")

    def extract_features(
        self,
        tokens: List[Token],
        position: int,
        current_label: str,
        prev_label: str = None
    ) -> List[int]:
        """
        Extract active feature IDs for a position and label assignment.

        Returns list of feature IDs that fire (have value 1).
        """
        if not self.is_fitted:
            raise ValueError("Must call fit() before extracting features")

        active_features = []

        # Emission features (observation + current label)
        for template in self.templates:
            feat_id = template.extract(tokens, position)
            if feat_id:
                full_feature = f"{feat_id}|y={current_label}"
                if full_feature in self.feature_to_id:
                    active_features.append(self.feature_to_id[full_feature])

        # Transition features (label bigram)
        if prev_label is not None:
            trans_feature = f"transition|y[-1]={prev_label}|y={current_label}"
            if trans_feature in self.feature_to_id:
                active_features.append(self.feature_to_id[trans_feature])

        return active_features


# Example usage
tokens = [
    Token("Barack"), Token("Obama"), Token("was"), Token("born"),
    Token("in"), Token("Hawaii"), Token(".")
]

templates = create_standard_templates()
print(f"Number of templates: {len(templates)}")
print("\nSample features at position 1 (Obama):")
for template in templates[:10]:
    feat = template.extract(tokens, 1)
    if feat:
        print(f"  {template.name}: {feat}")
```

With V vocabulary words, L labels, and T templates, we can have O(V × L × T) features. For typical NLP tasks: V ≈ 50,000 words, L = 10 labels, T = 30 templates → 15 million potential features. Sparse representations and feature hashing help manage this scale.
Over decades of research, practitioners have identified feature patterns that consistently improve performance on NLP sequence labeling tasks. Here we catalog the most effective patterns.
Word Identity Features:
The most basic but often most important features:
| Feature | Template | Example |
|---|---|---|
| Current word | word[0]=X | word[0]=Obama |
| Previous word | word[-1]=X | word[-1]=President |
| Next word | word[+1]=X | word[+1]=was |
| Word bigram | word[-1,0]=X_Y | word[-1,0]=President_Obama |
| Word trigram | word[-1,0,+1]=X_Y_Z | word[-1,0,+1]=President_Obama_was |
Orthographic Features:
Capitalization and character patterns are crucial for NER:
| Feature | Captures | Examples |
|---|---|---|
| is_capitalized | Proper nouns | True for "John", "IBM" |
| all_caps | Acronyms | True for "NATO", "USA" |
| all_lower | Common words | True for "the", "running" |
| mixed_case | Special names | True for "iPhone", "eBay" |
| initial_cap | Sentence start vs. proper noun | Distinguish capitalization causes |
| word_shape | Character pattern | "Xxxx" for "John", "XXXX" for "NASA" |
| short_shape | Collapsed pattern | "Xx" for "John", "X.X." for "U.S." |
Affix Features:
Prefixes and suffixes capture morphological patterns:
| Feature | Captures | Examples |
|---|---|---|
| prefix_2 | Short prefixes | "un-" (negative), "re-" (repetition) |
| prefix_3 | Medium prefixes | "pre-", "dis-", "mis-" |
| suffix_2 | Short suffixes | "-ed" (past), "-ly" (adverb) |
| suffix_3 | Medium suffixes | "-ing", "-tion", "-ness" |
| suffix_4 | Long suffixes | "-ment", "-able" |
Digit and Punctuation Features:

| Feature | Captures |
|---|---|
| has_digit | Contains any digit |
| all_digits | Purely numeric |
| has_hyphen | Contains hyphen (compound words) |
| has_period | Contains period (abbreviations) |
| has_apostrophe | Contractions, possessives |
| contains_at | Email addresses |
| is_url | Web URLs |
| is_punctuation | Pure punctuation |

Gazetteer and Lexicon Features:

| Feature | Captures |
|---|---|
| in_person_names | Known person names |
| in_org_names | Known organization names |
| in_location_names | Known location names |
| in_country_list | Country names |
| in_city_list | City names |
| in_stopwords | Common function words |
| in_product_list | Product names |
| brown_cluster | Word cluster ID |

Contextual Windows:
Contextual features examine patterns in surrounding positions:
# Context window features (position offsets -2 to +2)
word[-2], word[-1], word[0], word[+1], word[+2]
shape[-2], shape[-1], shape[0], shape[+1], shape[+2]
capitalized[-1], capitalized[0], capitalized[+1]
# Contextual conjunctions
word[-1]/word[0] # Bigram
word[0]/word[+1] # Bigram
word[-1]/word[0]/word[+1] # Trigram context
shape[-1]/shape[0] # Shape pattern
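Many CRF toolkits accept per-position feature dictionaries instead of template files; python-crfsuite and sklearn-crfsuite, for example, take one list of feature dicts per sentence. A minimal sketch of packaging the window features above that way (the helper name word2features is a common convention, not a library API):

```python
# Per-position feature dict in the style accepted by python-crfsuite /
# sklearn-crfsuite (mapping of feature name -> value for each token).
def word2features(sent, i):
    word = sent[i]
    return {
        "word[0]": word,
        "word[0].lower": word.lower(),
        "shape[0]": "".join("X" if c.isupper() else "x" if c.islower()
                            else "d" if c.isdigit() else c for c in word),
        "word[-1]": sent[i - 1] if i > 0 else "<BOS>",
        "word[+1]": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
        "word[-1]/word[0]": (sent[i - 1] if i > 0 else "<BOS>") + "_" + word,
    }

sent = ["President", "Obama", "was", "born"]
print(word2features(sent, 1))
```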
POS Tag Features (when available):
If a POS tagger has already processed the text:
| Feature | Captures |
|---|---|
| pos[0]=NNP | Current word is proper noun |
| pos[-1]=DT | Previous word is determiner |
| pos[-1]/pos[0]=JJ_NN | Adjective-noun pattern |
| pos[0]/pos[+1]=VB_TO | Verb followed by "to" |
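If tokens carry POS tags (as the Token dataclass earlier on this page allows), POS templates can be added in the same style as create_standard_templates(). This is a sketch that reuses the FeatureTemplate class defined above; the helper name create_pos_templates is an assumption:

```python
# Sketch: POS-based templates, reusing the FeatureTemplate class defined
# earlier on this page. Assumes tokens carry a .pos attribute from a tagger.
def create_pos_templates():
    templates = []
    templates.append(FeatureTemplate(
        "pos[0]",
        lambda tokens, i: f"pos[0]={tokens[i].pos}" if tokens[i].pos else None
    ))
    templates.append(FeatureTemplate(
        "pos[-1,0]",
        lambda tokens, i: (f"pos[-1,0]={tokens[i-1].pos}_{tokens[i].pos}"
                           if i > 0 and tokens[i].pos and tokens[i - 1].pos else None)
    ))
    return templates

# Usage alongside the word/shape templates:
# templates = create_standard_templates() + create_pos_templates()
```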
The best features are (1) Generalizable — work across different examples, (2) Discriminative — distinguish between labels, (3) Observable — can be computed from input, and (4) Reliable — hold consistently across data. Avoid overly specific features that memorize training data rather than capturing patterns.
Feature representation significantly impacts both model performance and computational efficiency. Let's compare the two main paradigms.
Sparse (Indicator) Features:
Traditional CRF features are sparse binary indicators. Each feature is either active (1) or inactive (0), and at any position only a tiny fraction of all features fire.
Advantages of sparse features:
- Highly interpretable: each weight attaches to a human-readable, named feature
- Easy to inject domain knowledge (gazetteers, hand-designed patterns)
- Cheap scoring: only the few features that fire at a position contribute
- Can work well even with relatively small training sets
Disadvantages:
- Manual feature engineering requires domain expertise and iteration
- Poor handling of out-of-vocabulary words (only fallback features fire)
- The feature space can reach millions of dimensions, almost all zero at any position
```python
import numpy as np
from typing import List, Dict


class SparseFeatures:
    """
    Sparse feature representation for traditional CRFs.

    Features are stored as (feature_id, value) pairs.
    Most features have value 1.0 (binary indicators).
    """

    def __init__(self, feature_to_id: Dict[str, int]):
        self.feature_to_id = feature_to_id
        self.num_features = len(feature_to_id)

    def extract(self, tokens: List[str], position: int, label: str) -> List[int]:
        """
        Extract active feature IDs for a position.

        Returns list of feature IDs (all have value 1).
        """
        active = []
        word = tokens[position]

        # Word identity feature
        feat = f"word={word}|label={label}"
        if feat in self.feature_to_id:
            active.append(self.feature_to_id[feat])

        # Lowercase version (generalization)
        feat = f"word_lower={word.lower()}|label={label}"
        if feat in self.feature_to_id:
            active.append(self.feature_to_id[feat])

        # Capitalization
        feat = f"is_cap={word[0].isupper()}|label={label}"
        if feat in self.feature_to_id:
            active.append(self.feature_to_id[feat])

        # ... many more features
        return active

    def score(self, weights: np.ndarray, active_features: List[int]) -> float:
        """Compute score as sum of weights for active features."""
        return sum(weights[f] for f in active_features)


class DenseFeatures:
    """
    Dense feature representation using neural embeddings.

    Each position is represented as a fixed-size dense vector.
    """

    def __init__(
        self,
        word_embeddings: np.ndarray,  # (vocab_size, embedding_dim)
        word_to_id: Dict[str, int],
        context_window: int = 2
    ):
        self.embeddings = word_embeddings
        self.word_to_id = word_to_id
        self.context_window = context_window
        self.embedding_dim = word_embeddings.shape[1]
        self.feature_dim = (2 * context_window + 1) * self.embedding_dim

    def extract(self, tokens: List[str], position: int) -> np.ndarray:
        """
        Extract dense feature vector for a position.

        Returns concatenation of word embeddings in context window.
        """
        features = []
        for offset in range(-self.context_window, self.context_window + 1):
            idx = position + offset
            if 0 <= idx < len(tokens):
                word = tokens[idx].lower()
                word_id = self.word_to_id.get(word, self.word_to_id.get('<UNK>', 0))
                embedding = self.embeddings[word_id]
            else:
                # Padding for positions outside sequence
                if idx < 0:
                    embedding = self.embeddings[self.word_to_id.get('<BOS>', 0)]
                else:
                    embedding = self.embeddings[self.word_to_id.get('<EOS>', 0)]
            features.append(embedding)
        return np.concatenate(features)

    def score(
        self,
        weights: np.ndarray,  # (num_labels, feature_dim)
        features: np.ndarray
    ) -> np.ndarray:
        """Compute scores for all labels given features."""
        return weights @ features  # (num_labels,)


# Comparison: Memory and computation
print("Sparse Features Example:")
print("  Active features per position: ~50")
print("  Total feature space: 10 million")
print("  Memory per position: 50 × 4 bytes = 200 bytes")
print("  Score computation: 50 additions")

print("\nDense Features Example:")
print("  Embedding dim: 300")
print("  Context window: 2 (5 words)")
print("  Feature dim: 5 × 300 = 1500")
print("  Memory per position: 1500 × 4 bytes = 6000 bytes")
print("  Score computation: 1500 × 10 = 15000 multiply-adds (for 10 labels)")
```

Dense (Neural) Features:
Modern CRFs often use dense, learned representations from neural networks:
Advantages of dense features:
- Learned automatically from data, with little or no manual template design
- Capture semantic similarity, so related words share statistical strength
- Handle out-of-vocabulary words better (e.g., via character-level models)
Disadvantages:
- Individual dimensions are not interpretable
- Typically require more training data
- Domain knowledge is harder to incorporate directly
| Aspect | Sparse Features | Dense Features |
|---|---|---|
| Representation | Binary indicator vectors | Real-valued dense vectors |
| Dimension | Millions (mostly zeros) | Hundreds (all non-zero) |
| Feature engineering | Manual, domain expertise | Automatic, learned |
| OOV handling | Poor (fallback features) | Good (character models) |
| Interpretability | High | Low |
| Data efficiency | Can work with small data | Needs more data |
| Domain knowledge | Easy to incorporate | Harder to incorporate |
| Modern usage | Classical CRFs | Neural CRFs (BiLSTM-CRF) |
State-of-the-art systems often combine both: neural networks produce dense features that capture semantic patterns, while the CRF layer ensures structured output consistency. Some systems also inject sparse gazetteer features into neural models for the best of both worlds.
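As a rough sketch of that combination, a token's dense vector can simply be concatenated with a few binary gazetteer indicators before the scoring layer (the gazetteer contents and the 300-dimensional vector below are made-up assumptions):

```python
import numpy as np

# Concatenate a dense (learned) token vector with sparse gazetteer indicators.
# Gazetteer contents and the embedding dimension are illustrative assumptions.
PERSON_GAZETTEER = {"obama", "merkel"}
LOCATION_GAZETTEER = {"hawaii", "berlin"}

def hybrid_features(dense_vec: np.ndarray, word: str) -> np.ndarray:
    gazetteer_bits = np.array([
        1.0 if word.lower() in PERSON_GAZETTEER else 0.0,
        1.0 if word.lower() in LOCATION_GAZETTEER else 0.0,
    ])
    return np.concatenate([dense_vec, gazetteer_bits])

dense = np.random.randn(300)                   # e.g., a BiLSTM output for this token
print(hybrid_features(dense, "Hawaii").shape)  # (302,)
```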
With millions of potential features, understanding which features matter and how to select them is crucial for building effective CRF models.
Why Feature Selection Matters:
- Memory and speed: millions of weights inflate model size and slow both training and inference
- Overfitting: rare, overly specific features tend to memorize the training data
- Interpretability: a smaller feature set is easier to inspect and debug
Feature Selection Approaches:
1. Frequency Thresholding:
Remove features that occur fewer than $k$ times in training data:
$$\text{Keep feature } f \text{ if } \text{count}(f) \geq k$$
Typical threshold: $k = 1$ to $5$. Removes rare, unreliable features.
2. L1 Regularization (Lasso):
L1 penalty drives many weights to exactly zero:
$$\mathcal{L}(\boldsymbol{\lambda}) = \sum_i \log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) - \alpha \|\boldsymbol{\lambda}\|_1$$
Strength $\alpha$ controls sparsity. Higher $\alpha$ → more zeros → fewer active features.
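In practice the L1 strength is exposed as a trainer hyperparameter. For example, with the sklearn-crfsuite wrapper around CRFsuite, the c1 parameter sets the L1 coefficient (a sketch, assuming the package is installed and X_train/y_train are already in its list-of-feature-dicts format):

```python
# Sketch: L1-regularized CRF training with sklearn-crfsuite. Assumes the
# package is installed and X_train/y_train follow its expected format.
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.5,               # L1 coefficient: larger values zero out more weights
    c2=0.0,               # L2 coefficient, often used alongside c1
    max_iterations=100,
)
# crf.fit(X_train, y_train)
# Features whose weights are driven to zero are effectively pruned.
```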
3. Information Gain:
Rank features by mutual information with labels:
$$\text{IG}(f) = H(Y) - H(Y \mid f)$$
Select top-$k$ features. More principled but computationally expensive.
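A small sketch of computing this quantity for one binary feature from label co-occurrence counts (all counts below are made up for illustration):

```python
import numpy as np

def entropy(counts):
    """Entropy H (in bits) of the distribution given by raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Made-up counts: overall label distribution, split by whether feature f fires.
labels_all   = {"PER": 300, "ORG": 200, "O": 1500}
labels_f_on  = {"PER": 250, "ORG": 40,  "O": 10}     # positions where f = 1
labels_f_off = {"PER": 50,  "ORG": 160, "O": 1490}   # positions where f = 0

n_on, n_off = sum(labels_f_on.values()), sum(labels_f_off.values())
n = n_on + n_off

h_y = entropy(list(labels_all.values()))
h_y_given_f = (n_on / n) * entropy(list(labels_f_on.values())) \
            + (n_off / n) * entropy(list(labels_f_off.values()))
print(f"IG(f) = H(Y) - H(Y|f) = {h_y - h_y_given_f:.3f} bits")
```

The larger listing below extends this idea to full per-feature statistics and frequency-based selection.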
```python
import numpy as np
from typing import Dict, List, Tuple
from collections import Counter


def analyze_learned_features(
    weights: np.ndarray,
    feature_names: List[str],
    top_k: int = 20
) -> Dict[str, List[Tuple[str, float]]]:
    """
    Analyze learned feature weights to understand model behavior.

    Args:
        weights: Weight vector (num_features,)
        feature_names: Human-readable feature names
        top_k: Number of top features to return per category

    Returns:
        Dictionary with analysis results
    """
    assert len(weights) == len(feature_names)

    # Sort features by absolute weight
    sorted_indices = np.argsort(np.abs(weights))[::-1]

    # Analyze by label
    label_features = {}
    for idx in sorted_indices[:500]:  # Top 500 by magnitude
        feat_name = feature_names[idx]
        weight = weights[idx]

        # Parse feature to extract label
        if '|y=' in feat_name:
            parts = feat_name.split('|y=')
            observation_part = parts[0]
            label = parts[1] if len(parts) > 1 else 'unknown'
            if label not in label_features:
                label_features[label] = []
            if len(label_features[label]) < top_k:
                label_features[label].append((observation_part, weight))

    return label_features


def compute_feature_statistics(
    training_data: List[Tuple[List[str], List[str]]],
    feature_extractor
) -> Dict[str, Dict]:
    """
    Compute statistics about feature occurrences.

    Returns:
        Dictionary with feature statistics
    """
    feature_counts = Counter()
    feature_label_counts = {}  # feature -> {label: count}

    for tokens, labels in training_data:
        for i, label in enumerate(labels):
            features = feature_extractor.extract(tokens, i, label)
            for feat in features:
                feature_counts[feat] += 1
                if feat not in feature_label_counts:
                    feature_label_counts[feat] = Counter()
                feature_label_counts[feat][label] += 1

    # Compute entropy for each feature
    feature_stats = {}
    for feat, label_counts in feature_label_counts.items():
        total = sum(label_counts.values())
        probs = np.array(list(label_counts.values())) / total
        entropy = -np.sum(probs * np.log2(probs + 1e-10))
        feature_stats[feat] = {
            'count': feature_counts[feat],
            'entropy': entropy,
            'dominant_label': label_counts.most_common(1)[0][0],
            'purity': label_counts.most_common(1)[0][1] / total
        }

    return feature_stats


def select_features_by_frequency(
    feature_stats: Dict[str, Dict],
    min_count: int = 5,
    max_features: int = 100000
) -> List[str]:
    """
    Select features by frequency threshold.

    Args:
        feature_stats: Statistics from compute_feature_statistics
        min_count: Minimum occurrence count
        max_features: Maximum features to keep

    Returns:
        List of selected feature names
    """
    # Filter by minimum count
    filtered = {
        feat: stats for feat, stats in feature_stats.items()
        if stats['count'] >= min_count
    }

    # Sort by count and take top max_features
    sorted_features = sorted(
        filtered.keys(),
        key=lambda f: filtered[f]['count'],
        reverse=True
    )
    return sorted_features[:max_features]


# Example analysis output
example_weights = {
    'word=Obama|y=B-PER': 3.2,
    'word=Google|y=B-ORG': 2.8,
    'is_cap=True|y=B-PER': 1.5,
    'is_cap=True|y=B-ORG': 1.3,
    'is_cap=True|y=O': -2.1,
    'word=the|y=O': 1.8,
    'word=the|y=B-PER': -2.5,
    'suffix2=ly|y=O': 0.9,
    'trans|y[-1]=B-PER|y=I-PER': 4.1,
    'trans|y[-1]=O|y=I-PER': -3.8,
}

print("Feature Weight Analysis")
print("=" * 50)
print("\nTop positive weights (features indicating label):")
for feat, weight in sorted(example_weights.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {weight:+.1f}  {feat}")

print("\nTop negative weights (features contra-indicating label):")
for feat, weight in sorted(example_weights.items(), key=lambda x: x[1])[:5]:
    print(f"  {weight:+.1f}  {feat}")
```

Large positive weights indicate strong evidence FOR a label. Large negative weights indicate strong evidence AGAINST. Weights near zero suggest the feature is uninformative. Comparing weights across labels reveals what distinguishes categories.
Based on decades of CRF applications, here are proven best practices for feature engineering:
- Start from a standard template set (word identity, shape, affixes, context window) and add task-specific features incrementally
- Generate features with templates rather than enumerating them by hand
- Handle sequence boundaries explicitly (e.g., BOS/EOS placeholder values)
- Normalize text consistently, keeping separate features for case and other surface cues
- Prune rare features with frequency thresholds and use L1/L2 regularization
- Inspect learned weights to find uninformative or misleading features
Common Mistakes to Avoid:
| Mistake | Problem | Solution |
|---|---|---|
| Features on test-only data | Data leakage | Extract features only from training data |
| Label-derived features | Circular reasoning | Features should depend only on observations |
| Overly specific features | Overfitting | Use frequency thresholds, regularization |
| Missing boundary handling | Crashes/bugs | Always handle sequence start/end |
| Case sensitivity issues | Inconsistent matching | Normalize consistently |
| Ignoring feature interactions | Missed patterns | Carefully designed conjunctions |
Neural networks have largely replaced manual feature engineering for NLP. However, understanding features remains essential: (1) Neural features are still features—just learned ones, (2) Hybrid systems combine neural with handcrafted features, (3) Low-resource scenarios still benefit from feature engineering, (4) Understanding features aids debugging and interpretation.
Feature functions are the heart of CRF modeling—they encode what the model observes and can learn. Here are the key concepts:
- A feature function maps (observations, current label, previous label, position) to a score; most are binary indicators
- Features fall into emission, transition, contextual, and combined types
- Feature templates generate large, overlapping feature sets automatically from training data
- Proven NLP patterns include word identity, orthography, affixes, gazetteers, and context windows
- Sparse indicator features and dense neural features trade interpretability and data efficiency against generalization
- Frequency thresholds, L1 regularization, and weight analysis keep large feature sets manageable
What's next:
With features defined, we need to learn the weights that combine them effectively. The next page covers CRF training: the objective function (conditional log-likelihood), gradient computation using feature expectations, and optimization algorithms (L-BFGS, SGD, and modern neural approaches).
You now understand CRF feature functions: their taxonomy, template-based generation, common patterns for NLP, sparse vs. dense representations, and feature selection techniques. Next, we'll learn how to train CRF models by optimizing the feature weights.