The expressive power of Conditional Random Fields comes not from the model structure (which is relatively simple) but from the feature functions we design. While the linear-chain structure provides efficient inference, features provide the representation—encoding what the model can observe and learn.
Feature engineering for CRFs is where domain expertise meets machine learning. A well-designed feature set can make the difference between 70% and 95% accuracy on sequence labeling tasks. This page provides a comprehensive treatment of feature function design.
The Feature Function Signature:
In a linear-chain CRF, each feature function has the form:
$$f_k(\mathbf{x}, y_i, y_{i-1}, i) \to \mathbb{R}$$
It takes the full observation sequence $\mathbf{x}$, the current label $y_i$, the previous label $y_{i-1}$, and the position $i$, and returns a real-valued score. In practice, most features are binary (0 or 1), though real-valued features are also used.
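To make the signature concrete, here is a minimal sketch of one binary feature function written as plain Python (the function name and the PERSON label are illustrative choices, not part of any fixed API):

```python
# A minimal sketch of a binary feature function f_k(x, y_i, y_prev, i).
# The function name and label strings are illustrative only.
def word_is_obama_and_person(x, y_i, y_prev, i):
    """Fires (returns 1.0) when the current word is 'Obama' and the label is PERSON."""
    return 1.0 if x[i] == "Obama" and y_i == "PERSON" else 0.0

x = ["President", "Obama", "spoke"]
print(word_is_obama_and_person(x, "PERSON", "O", 1))  # 1.0: the feature fires
print(word_is_obama_and_person(x, "O", "O", 1))       # 0.0: wrong label, does not fire
```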
By the end of this page, you will understand: (1) The taxonomy of CRF features (emission, transition, contextual), (2) Feature templates and automatic feature generation, (3) Common feature patterns for NLP tasks, (4) Sparse vs. dense feature representations, and (5) Best practices for feature engineering in practice.
CRF features can be categorized based on what aspects of the input and output they condition on. Understanding this taxonomy helps guide systematic feature design.
1. Emission (State) Features:
These features relate observations to labels at a single position:
$$f_{\text{emission}}(\mathbf{x}, y_i, i)$$
They capture: "Given this observation pattern, which label is likely?"
Example: Word identity feature $$f_{\text{word=Obama}}(\mathbf{x}, y_i, i) = \mathbb{1}[x_i = \text{"Obama"} \wedge y_i = \text{PERSON}]$$
2. Transition Features:
These features involve only adjacent labels, independent of observations:
$$f_{\text{transition}}(y_i, y_{i-1})$$
They capture label grammar: "Which label sequences are likely/unlikely?"
Example: Label bigram $$f_{\text{B-PER} \to \text{I-PER}}(y_i, y_{i-1}) = \mathbb{1}[y_{i-1} = \text{B-PER} \wedge y_i = \text{I-PER}]$$
3. Contextual Emission Features:
These features combine observations from the surrounding context with the current label:
$$f_{\text{context}}(\mathbf{x}, y_i, i)$$
They capture: "Given what's nearby, which label is likely?"
Example: Previous word feature $$f_{\text{prev=President}}(\mathbf{x}, y_i, i) = \mathbb{1}[x_{i-1} = \text{"President"} \wedge y_i = \text{PERSON}]$$
4. Combined Features:
These features involve both observations and label transitions:
$$f_{\text{combined}}(\mathbf{x}, y_i, y_{i-1}, i)$$
They capture: "Given this context AND the previous label, which label is likely?"
Example: $$f_{\text{cap+B-PER} \to \text{I-PER}}(\mathbf{x}, y_i, y_{i-1}, i) = \mathbb{1}[\text{is\_capitalized}(x_i) \wedge y_{i-1} = \text{B-PER} \wedge y_i = \text{I-PER}]$$
| Type | Depends On | Captures | Typical Count |
|---|---|---|---|
| Emission | $x_i, y_i$ | Label-observation correlation | Thousands (vocab × labels) |
| Transition | $y_{i-1}, y_i$ | Label grammar | $L^2$ (label pairs) |
| Contextual | $x_{i-k:i+k}, y_i$ | Context-label correlation | Many thousands |
| Combined | $\mathbf{x}, y_{i-1}, y_i$ | Transition patterns in context | Can be millions |
In CRFs, you're not limited to a single feature per observation. The key insight is to create MANY overlapping features that capture different aspects of the same position. The model learns to weight them appropriately. More features (with proper regularization) generally improve performance.
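As a small illustration of this, several overlapping indicator features typically fire at a single position, and the local score for a candidate label is just their weighted sum. The feature names and weights below are made up for illustration:

```python
# Overlapping features firing at one position for the candidate label B-PER.
# Feature names and weights are made-up illustrations, not learned values.
active_features = {
    "word[0]=Obama|y=B-PER": 3.1,
    "shape[0]=Xx|y=B-PER": 0.8,
    "word[-1]=President|y=B-PER": 1.4,
    "suffix3=ama|y=B-PER": 0.2,
}
score_b_per = sum(active_features.values())
print(f"Local score for B-PER at this position: {score_b_per:.1f}")  # 5.5
```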
In practice, we don't define individual feature functions manually. Instead, we define feature templates that automatically generate features for each position in each training example.
Feature Template Definition:
A feature template is a function that, given a position and observation sequence, extracts a feature identifier. Combined with label values, this identifier becomes a specific feature.
Template notation (used in tools like CRF++):
# Unigram templates (emission features)
U00:%x[-1,0] # Previous word
U01:%x[0,0] # Current word
U02:%x[1,0] # Next word
U03:%x[0,0]/%x[1,0] # Current + next word bigram
# Bigram templates (transition features)
B # Current label given previous label
The %x[row, col] notation indexes into the observation matrix (row = position offset, col = feature column).
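As a rough sketch of how such templates expand, assume a CRF++-style observation matrix with one row per token and the word in column 0; each %x[row, col] is resolved relative to the current position (the boundary placeholder below is a simplification, not CRF++'s exact convention):

```python
# Sketch of CRF++-style template expansion; the observation matrix has one
# row per token and the word in column 0. Boundary handling is simplified.
observations = [["President"], ["Obama"], ["was"]]

def expand(template_name, offsets, position):
    """Expand a unigram template such as U03:%x[0,0]/%x[1,0] at a position."""
    values = []
    for row, col in offsets:
        idx = position + row
        if 0 <= idx < len(observations):
            values.append(observations[idx][col])
        else:
            values.append(f"_B{row}")  # simplified out-of-range placeholder
    return f"{template_name}:{'/'.join(values)}"

print(expand("U01", [(0, 0)], 1))          # U01:Obama        (current word)
print(expand("U00", [(-1, 0)], 1))         # U00:President    (previous word)
print(expand("U03", [(0, 0), (1, 0)], 1))  # U03:Obama/was    (current + next)
```

The fuller listing below implements the same idea directly in Python, without a template file.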
```python
from typing import List, Dict, Set, Tuple, Callable
from dataclasses import dataclass
from collections import defaultdict


@dataclass
class Token:
    """Represents a token with its features."""
    word: str
    pos: str = None    # Part-of-speech (if available)
    chunk: str = None  # Chunk tag (if available)


class FeatureTemplate:
    """
    Feature template that generates feature identifiers.

    A template extracts an observation-based identifier for each position.
    Combined with label(s), this becomes a feature for the CRF.
    """

    def __init__(self, name: str, extractor: Callable[[List[Token], int], str]):
        """
        Args:
            name: Template identifier (e.g., "word[0]")
            extractor: Function (tokens, position) -> feature_id or None
        """
        self.name = name
        self.extractor = extractor

    def extract(self, tokens: List[Token], position: int) -> str:
        """Extract feature identifier at given position."""
        return self.extractor(tokens, position)


def create_standard_templates() -> List[FeatureTemplate]:
    """
    Create standard NER feature templates.

    Returns list of templates commonly used for named entity recognition.
    """
    templates = []

    # ========== Word Features ==========
    # Current word
    templates.append(FeatureTemplate(
        "word[0]",
        lambda tokens, i: f"word[0]={tokens[i].word}" if i < len(tokens) else None
    ))

    # Previous word (with boundary handling)
    templates.append(FeatureTemplate(
        "word[-1]",
        lambda tokens, i: f"word[-1]={tokens[i-1].word}" if i > 0 else "word[-1]=<BOS>"
    ))

    # Next word
    templates.append(FeatureTemplate(
        "word[+1]",
        lambda tokens, i: f"word[+1]={tokens[i+1].word}" if i < len(tokens) - 1 else "word[+1]=<EOS>"
    ))

    # Word bigrams
    templates.append(FeatureTemplate(
        "word[-1,0]",
        lambda tokens, i: f"word[-1,0]={tokens[i-1].word if i > 0 else '<BOS>'}_{tokens[i].word}"
    ))
    templates.append(FeatureTemplate(
        "word[0,+1]",
        lambda tokens, i: f"word[0,+1]={tokens[i].word}_{tokens[i+1].word if i < len(tokens) - 1 else '<EOS>'}"
    ))

    # ========== Shape Features ==========
    def get_word_shape(word: str) -> str:
        """Convert word to shape (e.g., 'John' -> 'Xxxx')."""
        shape = []
        for c in word:
            if c.isupper():
                shape.append('X')
            elif c.islower():
                shape.append('x')
            elif c.isdigit():
                shape.append('d')
            else:
                shape.append(c)
        return ''.join(shape)

    def get_short_shape(word: str) -> str:
        """Collapsed shape (e.g., 'John' -> 'Xx')."""
        shape = get_word_shape(word)
        # Collapse consecutive same characters
        result = [shape[0]] if shape else []
        for c in shape[1:]:
            if c != result[-1]:
                result.append(c)
        return ''.join(result)

    templates.append(FeatureTemplate(
        "shape[0]",
        lambda tokens, i: f"shape[0]={get_short_shape(tokens[i].word)}"
    ))

    # ========== Capitalization Features ==========
    templates.append(FeatureTemplate(
        "capitalized[0]",
        lambda tokens, i: f"capitalized[0]={tokens[i].word[0].isupper()}" if tokens[i].word else None
    ))
    templates.append(FeatureTemplate(
        "all_caps[0]",
        lambda tokens, i: f"all_caps[0]={tokens[i].word.isupper()}"
    ))
    templates.append(FeatureTemplate(
        "all_lower[0]",
        lambda tokens, i: f"all_lower[0]={tokens[i].word.islower()}"
    ))

    # ========== Prefix/Suffix Features ==========
    for length in [2, 3, 4]:
        templates.append(FeatureTemplate(
            f"prefix{length}[0]",
            lambda tokens, i, l=length: f"prefix{l}[0]={tokens[i].word[:l]}" if len(tokens[i].word) >= l else None
        ))
        templates.append(FeatureTemplate(
            f"suffix{length}[0]",
            lambda tokens, i, l=length: f"suffix{l}[0]={tokens[i].word[-l:]}" if len(tokens[i].word) >= l else None
        ))

    # ========== Digit Features ==========
    templates.append(FeatureTemplate(
        "has_digit[0]",
        lambda tokens, i: f"has_digit[0]={any(c.isdigit() for c in tokens[i].word)}"
    ))
    templates.append(FeatureTemplate(
        "all_digits[0]",
        lambda tokens, i: f"all_digits[0]={tokens[i].word.isdigit()}"
    ))

    # ========== Punctuation Features ==========
    templates.append(FeatureTemplate(
        "has_hyphen[0]",
        lambda tokens, i: f"has_hyphen[0]={'-' in tokens[i].word}"
    ))
    templates.append(FeatureTemplate(
        "has_period[0]",
        lambda tokens, i: f"has_period[0]={'.' in tokens[i].word}"
    ))

    return templates


class CRFFeatureExtractor:
    """
    Complete feature extraction system for CRFs.

    Converts observation sequences into sparse feature vectors.
    """

    def __init__(self, templates: List[FeatureTemplate]):
        self.templates = templates
        self.feature_to_id: Dict[str, int] = {}
        self.id_to_feature: Dict[int, str] = {}
        self.label_to_id: Dict[str, int] = {}
        self.is_fitted = False

    def fit(self, sequences: List[List[Token]], labels: List[List[str]]) -> None:
        """
        Build feature vocabulary from training data.

        Args:
            sequences: List of token sequences
            labels: List of label sequences
        """
        # Collect all labels
        all_labels = set()
        for label_seq in labels:
            all_labels.update(label_seq)
        self.label_to_id = {label: i for i, label in enumerate(sorted(all_labels))}

        # Collect all features
        feature_set: Set[str] = set()
        for tokens in sequences:
            for i in range(len(tokens)):
                for template in self.templates:
                    feat_id = template.extract(tokens, i)
                    if feat_id:
                        # Create features for each label
                        for label in all_labels:
                            feature_set.add(f"{feat_id}|y={label}")

        # Create features for label pairs (transitions)
        for label1 in all_labels:
            for label2 in all_labels:
                feature_set.add(f"transition|y[-1]={label1}|y={label2}")

        # Assign IDs
        self.feature_to_id = {f: i for i, f in enumerate(sorted(feature_set))}
        self.id_to_feature = {i: f for f, i in self.feature_to_id.items()}
        self.is_fitted = True

        print(f"Feature vocabulary: {len(self.feature_to_id)} features")
        print(f"Label vocabulary: {len(self.label_to_id)} labels")

    def extract_features(
        self,
        tokens: List[Token],
        position: int,
        current_label: str,
        prev_label: str = None
    ) -> List[int]:
        """
        Extract active feature IDs for a position and label assignment.

        Returns list of feature IDs that fire (have value 1).
        """
        if not self.is_fitted:
            raise ValueError("Must call fit() before extracting features")

        active_features = []

        # Emission features (observation + current label)
        for template in self.templates:
            feat_id = template.extract(tokens, position)
            if feat_id:
                full_feature = f"{feat_id}|y={current_label}"
                if full_feature in self.feature_to_id:
                    active_features.append(self.feature_to_id[full_feature])

        # Transition features (label bigram)
        if prev_label is not None:
            trans_feature = f"transition|y[-1]={prev_label}|y={current_label}"
            if trans_feature in self.feature_to_id:
                active_features.append(self.feature_to_id[trans_feature])

        return active_features


# Example usage
tokens = [
    Token("Barack"), Token("Obama"), Token("was"), Token("born"),
    Token("in"), Token("Hawaii"), Token(".")
]

templates = create_standard_templates()
print(f"Number of templates: {len(templates)}")
print("\nSample features at position 1 (Obama):")
for template in templates[:10]:
    feat = template.extract(tokens, 1)
    if feat:
        print(f"  {template.name}: {feat}")
```

With V vocabulary words, L labels, and T templates, we can have O(V × L × T) features. For typical NLP tasks: V ≈ 50,000 words, L = 10 labels, T = 30 templates → 15 million potential features. Sparse representations and feature hashing help manage this scale.
Over decades of research, practitioners have identified feature patterns that consistently improve performance on NLP sequence labeling tasks. Here we catalog the most effective patterns.
Word Identity Features:
The most basic but often most important features:
| Feature | Template | Example |
|---|---|---|
| Current word | word[0]=X | word[0]=Obama |
| Previous word | word[-1]=X | word[-1]=President |
| Next word | word[+1]=X | word[+1]=was |
| Word bigram | word[-1,0]=X_Y | word[-1,0]=President_Obama |
| Word trigram | word[-1,0,+1]=X_Y_Z | word[-1,0,+1]=President_Obama_was |
Orthographic Features:
Capitalization and character patterns are crucial for NER:
| Feature | Captures | Examples |
|---|---|---|
| is_capitalized | Proper nouns | True for "John", "IBM" |
| all_caps | Acronyms | True for "NATO", "USA" |
| all_lower | Common words | True for "the", "running" |
| mixed_case | Special names | True for "iPhone", "eBay" |
| initial_cap | Sentence start vs. proper noun | Distinguish capitalization causes |
| word_shape | Character pattern | "Xxxx" for "John", "XXXX" for "NASA" |
| short_shape | Collapsed pattern | "Xx" for "John", "X.X." for "U.S." |
Affix Features:
Prefixes and suffixes capture morphological patterns:
| Feature | Captures | Examples |
|---|---|---|
| prefix_2 | Short prefixes | "un-" (negative), "re-" (repetition) |
| prefix_3 | Medium prefixes | "pre-", "dis-", "mis-" |
| suffix_2 | Short suffixes | "-ed" (past), "-ly" (adverb) |
| suffix_3 | Medium suffixes | "-ing", "-tion", "-ness" |
| suffix_4 | Long suffixes | "-ment", "-able" |
Digit and Punctuation Features:

| Feature | Captures |
|---|---|
| has_digit | Contains any digit |
| all_digits | Purely numeric |
| has_hyphen | Contains hyphen (compound words) |
| has_period | Contains period (abbreviations) |
| has_apostrophe | Contractions, possessives |
| contains_at | Email addresses |
| is_url | Web URLs |
| is_punctuation | Pure punctuation |

Gazetteer and Lexicon Features:

| Feature | Captures |
|---|---|
| in_person_names | Known person names |
| in_org_names | Known organization names |
| in_location_names | Known location names |
| in_country_list | Country names |
| in_city_list | City names |
| in_stopwords | Common function words |
| in_product_list | Product names |
| brown_cluster | Word cluster ID |

Contextual Windows:
Contextual features examine patterns in surrounding positions:
# Context window features (position offsets -2 to +2)
word[-2], word[-1], word[0], word[+1], word[+2]
shape[-2], shape[-1], shape[0], shape[+1], shape[+2]
capitalized[-1], capitalized[0], capitalized[+1]
# Contextual conjunctions
word[-1]/word[0] # Bigram
word[0]/word[+1] # Bigram
word[-1]/word[0]/word[+1] # Trigram context
shape[-1]/shape[0] # Shape pattern
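Many CRF toolkits accept per-position feature dictionaries instead of template files; python-crfsuite and sklearn-crfsuite, for example, take one list of feature dicts per sentence. A minimal sketch of packaging the window features above that way (the helper name word2features is a common convention, not a library API):

```python
# Per-position feature dict in the style accepted by python-crfsuite /
# sklearn-crfsuite (mapping of feature name -> value for each token).
def word2features(sent, i):
    word = sent[i]
    return {
        "word[0]": word,
        "word[0].lower": word.lower(),
        "shape[0]": "".join("X" if c.isupper() else "x" if c.islower()
                            else "d" if c.isdigit() else c for c in word),
        "word[-1]": sent[i - 1] if i > 0 else "<BOS>",
        "word[+1]": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
        "word[-1]/word[0]": (sent[i - 1] if i > 0 else "<BOS>") + "_" + word,
    }

sent = ["President", "Obama", "was", "born"]
print(word2features(sent, 1))
```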
POS Tag Features (when available):
If a POS tagger has already processed the text:
| Feature | Captures |
|---|---|
| pos[0]=NNP | Current word is proper noun |
| pos[-1]=DT | Previous word is determiner |
| pos[-1]/pos[0]=JJ_NN | Adjective-noun pattern |
| pos[0]/pos[+1]=VB_TO | Verb followed by "to" |
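If tokens carry POS tags (as the Token dataclass earlier on this page allows), POS templates can be added in the same style as create_standard_templates(). This is a sketch that reuses the FeatureTemplate class defined above; the helper name create_pos_templates is an assumption:

```python
# Sketch: POS-based templates, reusing the FeatureTemplate class defined
# earlier on this page. Assumes tokens carry a .pos attribute from a tagger.
def create_pos_templates():
    templates = []
    templates.append(FeatureTemplate(
        "pos[0]",
        lambda tokens, i: f"pos[0]={tokens[i].pos}" if tokens[i].pos else None
    ))
    templates.append(FeatureTemplate(
        "pos[-1,0]",
        lambda tokens, i: (f"pos[-1,0]={tokens[i-1].pos}_{tokens[i].pos}"
                           if i > 0 and tokens[i].pos and tokens[i - 1].pos else None)
    ))
    return templates

# Usage alongside the word/shape templates:
# templates = create_standard_templates() + create_pos_templates()
```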
The best features are (1) Generalizable — work across different examples, (2) Discriminative — distinguish between labels, (3) Observable — can be computed from input, and (4) Reliable — hold consistently across data. Avoid overly specific features that memorize training data rather than capturing patterns.
Feature representation significantly impacts both model performance and computational efficiency. Let's compare the two main paradigms.
Sparse (Indicator) Features:
Traditional CRF features are sparse binary indicators. Each feature is either active (1) or inactive (0), and at any position only a tiny fraction of all features fire.
Advantages of sparse features:
- Highly interpretable: each weight attaches to a human-readable, named feature
- Easy to inject domain knowledge (gazetteers, hand-designed patterns)
- Cheap scoring: only the few features that fire at a position contribute
- Can work well even with relatively small training sets
Disadvantages:
- Manual feature engineering requires domain expertise and iteration
- Poor handling of out-of-vocabulary words (only fallback features fire)
- The feature space can reach millions of dimensions, almost all zero at any position
```python
import numpy as np
from typing import List, Dict


class SparseFeatures:
    """
    Sparse feature representation for traditional CRFs.

    Features are stored as (feature_id, value) pairs.
    Most features have value 1.0 (binary indicators).
    """

    def __init__(self, feature_to_id: Dict[str, int]):
        self.feature_to_id = feature_to_id
        self.num_features = len(feature_to_id)

    def extract(self, tokens: List[str], position: int, label: str) -> List[int]:
        """
        Extract active feature IDs for a position.

        Returns list of feature IDs (all have value 1).
        """
        active = []
        word = tokens[position]

        # Word identity feature
        feat = f"word={word}|label={label}"
        if feat in self.feature_to_id:
            active.append(self.feature_to_id[feat])

        # Lowercase version (generalization)
        feat = f"word_lower={word.lower()}|label={label}"
        if feat in self.feature_to_id:
            active.append(self.feature_to_id[feat])

        # Capitalization
        feat = f"is_cap={word[0].isupper()}|label={label}"
        if feat in self.feature_to_id:
            active.append(self.feature_to_id[feat])

        # ... many more features
        return active

    def score(self, weights: np.ndarray, active_features: List[int]) -> float:
        """Compute score as sum of weights for active features."""
        return sum(weights[f] for f in active_features)


class DenseFeatures:
    """
    Dense feature representation using neural embeddings.

    Each position is represented as a fixed-size dense vector.
    """

    def __init__(
        self,
        word_embeddings: np.ndarray,  # (vocab_size, embedding_dim)
        word_to_id: Dict[str, int],
        context_window: int = 2
    ):
        self.embeddings = word_embeddings
        self.word_to_id = word_to_id
        self.context_window = context_window
        self.embedding_dim = word_embeddings.shape[1]
        self.feature_dim = (2 * context_window + 1) * self.embedding_dim

    def extract(self, tokens: List[str], position: int) -> np.ndarray:
        """
        Extract dense feature vector for a position.

        Returns concatenation of word embeddings in context window.
        """
        features = []
        for offset in range(-self.context_window, self.context_window + 1):
            idx = position + offset
            if 0 <= idx < len(tokens):
                word = tokens[idx].lower()
                word_id = self.word_to_id.get(word, self.word_to_id.get('<UNK>', 0))
                embedding = self.embeddings[word_id]
            else:
                # Padding for positions outside sequence
                if idx < 0:
                    embedding = self.embeddings[self.word_to_id.get('<BOS>', 0)]
                else:
                    embedding = self.embeddings[self.word_to_id.get('<EOS>', 0)]
            features.append(embedding)
        return np.concatenate(features)

    def score(
        self,
        weights: np.ndarray,  # (num_labels, feature_dim)
        features: np.ndarray
    ) -> np.ndarray:
        """Compute scores for all labels given features."""
        return weights @ features  # (num_labels,)


# Comparison: Memory and computation
print("Sparse Features Example:")
print("  Active features per position: ~50")
print("  Total feature space: 10 million")
print("  Memory per position: 50 × 4 bytes = 200 bytes")
print("  Score computation: 50 additions")

print("\nDense Features Example:")
print("  Embedding dim: 300")
print("  Context window: 2 (5 words)")
print("  Feature dim: 5 × 300 = 1500")
print("  Memory per position: 1500 × 4 bytes = 6000 bytes")
print("  Score computation: 1500 × 10 = 15000 multiply-adds (for 10 labels)")
```

Dense (Neural) Features:
Modern CRFs often use dense, learned representations from neural networks:
Advantages of dense features:
- Learned automatically from data, with little or no manual template design
- Capture semantic similarity, so related words share statistical strength
- Handle out-of-vocabulary words better (e.g., via character-level models)
Disadvantages:
- Individual dimensions are not interpretable
- Typically require more training data
- Domain knowledge is harder to incorporate directly
| Aspect | Sparse Features | Dense Features |
|---|---|---|
| Representation | Binary indicator vectors | Real-valued dense vectors |
| Dimension | Millions (mostly zeros) | Hundreds (all non-zero) |
| Feature engineering | Manual, domain expertise | Automatic, learned |
| OOV handling | Poor (fallback features) | Good (character models) |
| Interpretability | High | Low |
| Data efficiency | Can work with small data | Needs more data |
| Domain knowledge | Easy to incorporate | Harder to incorporate |
| Modern usage | Classical CRFs | Neural CRFs (BiLSTM-CRF) |
State-of-the-art systems often combine both: neural networks produce dense features that capture semantic patterns, while the CRF layer ensures structured output consistency. Some systems also inject sparse gazetteer features into neural models for the best of both worlds.
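As a rough sketch of that combination, a token's dense vector can simply be concatenated with a few binary gazetteer indicators before the scoring layer (the gazetteer contents and the 300-dimensional vector below are made-up assumptions):

```python
import numpy as np

# Concatenate a dense (learned) token vector with sparse gazetteer indicators.
# Gazetteer contents and the embedding dimension are illustrative assumptions.
PERSON_GAZETTEER = {"obama", "merkel"}
LOCATION_GAZETTEER = {"hawaii", "berlin"}

def hybrid_features(dense_vec: np.ndarray, word: str) -> np.ndarray:
    gazetteer_bits = np.array([
        1.0 if word.lower() in PERSON_GAZETTEER else 0.0,
        1.0 if word.lower() in LOCATION_GAZETTEER else 0.0,
    ])
    return np.concatenate([dense_vec, gazetteer_bits])

dense = np.random.randn(300)                   # e.g., a BiLSTM output for this token
print(hybrid_features(dense, "Hawaii").shape)  # (302,)
```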
With millions of potential features, understanding which features matter and how to select them is crucial for building effective CRF models.
Why Feature Selection Matters:
- Memory and speed: millions of weights inflate model size and slow both training and inference
- Overfitting: rare, overly specific features tend to memorize the training data
- Interpretability: a smaller feature set is easier to inspect and debug
Feature Selection Approaches:
1. Frequency Thresholding:
Remove features that occur fewer than $k$ times in training data:
$$\text{Keep feature } f \text{ if } \text{count}(f) \geq k$$
Typical threshold: $k = 1$ to $5$. Removes rare, unreliable features.
2. L1 Regularization (Lasso):
L1 penalty drives many weights to exactly zero:
$$\mathcal{L}(\boldsymbol{\lambda}) = \sum_i \log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) - \alpha \|\boldsymbol{\lambda}\|_1$$
Strength $\alpha$ controls sparsity. Higher $\alpha$ → more zeros → fewer active features.
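In practice the L1 strength is exposed as a trainer hyperparameter. For example, with the sklearn-crfsuite wrapper around CRFsuite, the c1 parameter sets the L1 coefficient (a sketch, assuming the package is installed and X_train/y_train are already in its list-of-feature-dicts format):

```python
# Sketch: L1-regularized CRF training with sklearn-crfsuite. Assumes the
# package is installed and X_train/y_train follow its expected format.
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.5,               # L1 coefficient: larger values zero out more weights
    c2=0.0,               # L2 coefficient, often used alongside c1
    max_iterations=100,
)
# crf.fit(X_train, y_train)
# Features whose weights are driven to zero are effectively pruned.
```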
3. Information Gain:
Rank features by mutual information with labels:
$$\text{IG}(f) = H(Y) - H(Y \mid f)$$
Select top-$k$ features. More principled but computationally expensive.
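A small sketch of computing this quantity for one binary feature from label co-occurrence counts (all counts below are made up for illustration):

```python
import numpy as np

def entropy(counts):
    """Entropy H (in bits) of the distribution given by raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Made-up counts: overall label distribution, split by whether feature f fires.
labels_all   = {"PER": 300, "ORG": 200, "O": 1500}
labels_f_on  = {"PER": 250, "ORG": 40,  "O": 10}     # positions where f = 1
labels_f_off = {"PER": 50,  "ORG": 160, "O": 1490}   # positions where f = 0

n_on, n_off = sum(labels_f_on.values()), sum(labels_f_off.values())
n = n_on + n_off

h_y = entropy(list(labels_all.values()))
h_y_given_f = (n_on / n) * entropy(list(labels_f_on.values())) \
            + (n_off / n) * entropy(list(labels_f_off.values()))
print(f"IG(f) = H(Y) - H(Y|f) = {h_y - h_y_given_f:.3f} bits")
```

The larger listing below extends this idea to full per-feature statistics and frequency-based selection.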
```python
import numpy as np
from typing import Dict, List, Tuple
from collections import Counter


def analyze_learned_features(
    weights: np.ndarray,
    feature_names: List[str],
    top_k: int = 20
) -> Dict[str, List[Tuple[str, float]]]:
    """
    Analyze learned feature weights to understand model behavior.

    Args:
        weights: Weight vector (num_features,)
        feature_names: Human-readable feature names
        top_k: Number of top features to return per category

    Returns:
        Dictionary with analysis results
    """
    assert len(weights) == len(feature_names)

    # Sort features by absolute weight
    sorted_indices = np.argsort(np.abs(weights))[::-1]

    # Analyze by label
    label_features = {}
    for idx in sorted_indices[:500]:  # Top 500 by magnitude
        feat_name = feature_names[idx]
        weight = weights[idx]

        # Parse feature to extract label
        if '|y=' in feat_name:
            parts = feat_name.split('|y=')
            observation_part = parts[0]
            label = parts[1] if len(parts) > 1 else 'unknown'
            if label not in label_features:
                label_features[label] = []
            if len(label_features[label]) < top_k:
                label_features[label].append((observation_part, weight))

    return label_features


def compute_feature_statistics(
    training_data: List[Tuple[List[str], List[str]]],
    feature_extractor
) -> Dict[str, Dict]:
    """
    Compute statistics about feature occurrences.

    Returns:
        Dictionary with feature statistics
    """
    feature_counts = Counter()
    feature_label_counts = {}  # feature -> {label: count}

    for tokens, labels in training_data:
        for i, label in enumerate(labels):
            features = feature_extractor.extract(tokens, i, label)
            for feat in features:
                feature_counts[feat] += 1
                if feat not in feature_label_counts:
                    feature_label_counts[feat] = Counter()
                feature_label_counts[feat][label] += 1

    # Compute entropy for each feature
    feature_stats = {}
    for feat, label_counts in feature_label_counts.items():
        total = sum(label_counts.values())
        probs = np.array(list(label_counts.values())) / total
        entropy = -np.sum(probs * np.log2(probs + 1e-10))
        feature_stats[feat] = {
            'count': feature_counts[feat],
            'entropy': entropy,
            'dominant_label': label_counts.most_common(1)[0][0],
            'purity': label_counts.most_common(1)[0][1] / total
        }

    return feature_stats


def select_features_by_frequency(
    feature_stats: Dict[str, Dict],
    min_count: int = 5,
    max_features: int = 100000
) -> List[str]:
    """
    Select features by frequency threshold.

    Args:
        feature_stats: Statistics from compute_feature_statistics
        min_count: Minimum occurrence count
        max_features: Maximum features to keep

    Returns:
        List of selected feature names
    """
    # Filter by minimum count
    filtered = {
        feat: stats for feat, stats in feature_stats.items()
        if stats['count'] >= min_count
    }

    # Sort by count and take top max_features
    sorted_features = sorted(
        filtered.keys(),
        key=lambda f: filtered[f]['count'],
        reverse=True
    )
    return sorted_features[:max_features]


# Example analysis output
example_weights = {
    'word=Obama|y=B-PER': 3.2,
    'word=Google|y=B-ORG': 2.8,
    'is_cap=True|y=B-PER': 1.5,
    'is_cap=True|y=B-ORG': 1.3,
    'is_cap=True|y=O': -2.1,
    'word=the|y=O': 1.8,
    'word=the|y=B-PER': -2.5,
    'suffix2=ly|y=O': 0.9,
    'trans|y[-1]=B-PER|y=I-PER': 4.1,
    'trans|y[-1]=O|y=I-PER': -3.8,
}

print("Feature Weight Analysis")
print("=" * 50)
print("\nTop positive weights (features indicating label):")
for feat, weight in sorted(example_weights.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {weight:+.1f}  {feat}")

print("\nTop negative weights (features contra-indicating label):")
for feat, weight in sorted(example_weights.items(), key=lambda x: x[1])[:5]:
    print(f"  {weight:+.1f}  {feat}")
```

Large positive weights indicate strong evidence FOR a label. Large negative weights indicate strong evidence AGAINST. Weights near zero suggest the feature is uninformative. Comparing weights across labels reveals what distinguishes categories.
Based on decades of CRF applications, here are proven best practices for feature engineering:
- Start from a standard template set (word identity, shape, affixes, context window) and add task-specific features incrementally
- Generate features with templates rather than enumerating them by hand
- Handle sequence boundaries explicitly (e.g., BOS/EOS placeholder values)
- Normalize text consistently, keeping separate features for case and other surface cues
- Prune rare features with frequency thresholds and use L1/L2 regularization
- Inspect learned weights to find uninformative or misleading features
Common Mistakes to Avoid:
| Mistake | Problem | Solution |
|---|---|---|
| Features on test-only data | Data leakage | Extract features only from training data |
| Label-derived features | Circular reasoning | Features should depend only on observations |
| Overly specific features | Overfitting | Use frequency thresholds, regularization |
| Missing boundary handling | Crashes/bugs | Always handle sequence start/end |
| Case sensitivity issues | Inconsistent matching | Normalize consistently |
| Ignoring feature interactions | Missed patterns | Carefully designed conjunctions |
Neural networks have largely replaced manual feature engineering for NLP. However, understanding features remains essential: (1) Neural features are still features—just learned ones, (2) Hybrid systems combine neural with handcrafted features, (3) Low-resource scenarios still benefit from feature engineering, (4) Understanding features aids debugging and interpretation.
Feature functions are the heart of CRF modeling—they encode what the model observes and can learn. Here are the key concepts:
- A feature function maps (observations, current label, previous label, position) to a score; most are binary indicators
- Features fall into emission, transition, contextual, and combined types
- Feature templates generate large, overlapping feature sets automatically from training data
- Proven NLP patterns include word identity, orthography, affixes, gazetteers, and context windows
- Sparse indicator features and dense neural features trade interpretability and data efficiency against generalization
- Frequency thresholds, L1 regularization, and weight analysis keep large feature sets manageable
What's next:
With features defined, we need to learn the weights that combine them effectively. The next page covers CRF training: the objective function (conditional log-likelihood), gradient computation using feature expectations, and optimization algorithms (L-BFGS, SGD, and modern neural approaches).
You now understand CRF feature functions: their taxonomy, template-based generation, common patterns for NLP, sparse vs. dense representations, and feature selection techniques. Next, we'll learn how to train CRF models by optimizing the feature weights.