Every production ML system eventually encounters a category it has never seen before. A new merchant opens on your platform. A user signs up from a country not in your training data. A product launches with a novel category label.
This is the novel category problem (or out-of-vocabulary problem for categories), and handling it gracefully separates robust production systems from fragile prototypes.
What Can Go Wrong:
```python
# Training data had categories: ['A', 'B', 'C']
encoder.fit(train_data)

# Production data includes 'D' (never seen)
test_data = ['A', 'D', 'B']
encoder.transform(test_data)  # KeyError: 'D' not in vocabulary!
```
Without proper handling, your model crashes in production. With poor handling, predictions silently degrade.
In dynamic systems, new categories appear constantly: new users, products, locations, devices. Any encoding scheme that cannot handle novel categories will fail in production. Design for unknown categories from the start.
Strategy 1: Unknown Token (Most Common)
Reserve a special <UNK> category during training. Map all novel categories to this token. The model learns appropriate behavior for "unknown" inputs.
```python
vocab = {'A': 0, 'B': 1, 'C': 2, '<UNK>': 3}

def encode(cat):
    return vocab.get(cat, vocab['<UNK>'])
```
Strategy 2: Global Mean/Mode Imputation
For target encoding or other numeric encodings, fall back to the global statistic computed on the training data: the overall target mean for target encoding, or the mode for discrete encodings.
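A minimal sketch of this fallback for target encoding (pure Python; the class name is illustrative):

```python
from collections import defaultdict

class TargetEncoderWithFallback:
    """Target encoder that falls back to the global mean for unknown categories."""

    def fit(self, categories, targets):
        sums, counts = defaultdict(float), defaultdict(int)
        for cat, y in zip(categories, targets):
            sums[cat] += y
            counts[cat] += 1
        # Global mean is the fallback for categories never seen in training
        self.global_mean = sum(targets) / len(targets)
        self.category_means = {cat: sums[cat] / counts[cat] for cat in sums}
        return self

    def transform(self, categories):
        return [self.category_means.get(cat, self.global_mean) for cat in categories]

encoder = TargetEncoderWithFallback().fit(['A', 'A', 'B', 'B'], [1.0, 0.0, 1.0, 1.0])
print(encoder.transform(['A', 'B', 'Z']))  # unknown 'Z' gets the global mean 0.75
```

The global mean is statistically neutral: an unknown category contributes no signal beyond the base rate, which is usually the safest default for target-encoded features.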
Strategy 3: Zero Vector
For one-hot or embedding outputs, return a zero vector. Effectively says "no information from this feature." Works when other features can compensate.
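A sketch of the zero-vector fallback for one-hot features (the helper name is illustrative):

```python
import numpy as np

def one_hot_with_zero_fallback(categories, vocab):
    """One-hot encode; unknown categories become all-zero rows."""
    out = np.zeros((len(categories), len(vocab)))
    for i, cat in enumerate(categories):
        if cat in vocab:
            out[i, vocab[cat]] = 1.0
        # else: leave the row all zeros -- "no information from this feature"
    return out

vocab = {'A': 0, 'B': 1, 'C': 2}
encoded = one_hot_with_zero_fallback(['A', 'D', 'C'], vocab)
print(encoded)  # row for 'D' is all zeros
```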
Strategy 4: Most Frequent Category
Map unknown to the most common training category. Assumption: unknowns behave like the majority. Risky if unknowns are systematically different.
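In code, this is a one-line change to the lookup (here index 0 is assumed to belong to the most frequent training category):

```python
vocab = {'A': 0, 'B': 1, 'C': 2}  # suppose 'A' was the most frequent category in training
most_frequent_idx = vocab['A']

def encode(cat):
    # Unknown categories borrow the majority category's index
    return vocab.get(cat, most_frequent_idx)

print(encode('D'))  # unknown 'D' maps to index 0, same as 'A'
```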
```python
import numpy as np
from collections import Counter
from typing import Dict

class RobustCategoryEncoder:
    """
    Category encoder with configurable fallback for unknown categories.
    """

    def __init__(self, fallback_strategy: str = 'unknown_token', min_frequency: int = 1):
        """
        Args:
            fallback_strategy: 'unknown_token', 'most_frequent', 'zero', 'error'
            min_frequency: Categories appearing less than this become <UNK>
        """
        self.fallback_strategy = fallback_strategy
        self.min_frequency = min_frequency
        self.category_to_idx: Dict[str, int] = {}
        self.idx_to_category: Dict[int, str] = {}
        self.unknown_idx: int = 0
        self.most_frequent_idx: int = 0

    def fit(self, categories):
        """Learn category vocabulary from training data."""
        # Count frequencies
        counts = Counter(categories)

        # Filter by minimum frequency
        valid_categories = [cat for cat, count in counts.items()
                            if count >= self.min_frequency]

        # Sort by frequency (descending) for deterministic ordering
        valid_categories.sort(key=lambda x: (-counts[x], x))

        # Build vocabulary with <UNK> at index 0
        self.category_to_idx = {'<UNK>': 0}
        for idx, cat in enumerate(valid_categories, start=1):
            self.category_to_idx[cat] = idx

        self.idx_to_category = {v: k for k, v in self.category_to_idx.items()}
        self.unknown_idx = 0
        self.most_frequent_idx = 1 if len(valid_categories) > 0 else 0
        return self

    def transform(self, categories):
        """Transform categories to indices with fallback handling."""
        result = []
        for cat in categories:
            if cat in self.category_to_idx:
                result.append(self.category_to_idx[cat])
            else:
                # Handle unknown category
                if self.fallback_strategy == 'unknown_token':
                    result.append(self.unknown_idx)
                elif self.fallback_strategy == 'most_frequent':
                    result.append(self.most_frequent_idx)
                elif self.fallback_strategy == 'error':
                    raise ValueError(f"Unknown category: {cat}")
                else:
                    result.append(self.unknown_idx)
        return np.array(result)

    def get_vocab_size(self):
        """Return vocabulary size including <UNK>."""
        return len(self.category_to_idx)

# Usage example
encoder = RobustCategoryEncoder(fallback_strategy='unknown_token', min_frequency=2)

train_categories = ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D']  # D appears once
encoder.fit(train_categories)
print(f"Vocabulary: {encoder.category_to_idx}")

test_categories = ['A', 'B', 'E', 'F', 'D']  # E, F never seen; D was rare
encoded = encoder.transform(test_categories)
print(f"Encoded: {encoded}")  # E, F, D all map to <UNK>=0
```

| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Unknown Token | Model learns <UNK> behavior; simple | Single representation for all unknowns | Most use cases |
| Global Mean | Statistically neutral; no new parameters | Ignores category-specific info | Target encoding |
| Zero Vector | Clear "no information" signal | May break models expecting non-zero | Embeddings with other features |
| Most Frequent | No special token; reuses a well-learned category | Dangerous if unknowns are systematically different | When unknowns are truly rare |
Hash encoding inherently handles novel categories, since any string can be hashed. This makes it the go-to choice when the novel-category rate is high.
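As a dependency-free sketch (the hybrid example in this section uses mmh3; here the standard library's hashlib stands in):

```python
import hashlib

def hash_encode(category, n_buckets=1000):
    """Any string maps deterministically to a bucket -- no vocabulary, no unknowns."""
    digest = hashlib.md5(str(category).encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'little') % n_buckets

# Works identically for seen and never-seen categories
bucket = hash_encode('brand_new_category')
print(bucket)
```

The trade-off is collisions: unrelated categories can share a bucket, which is why hashing is usually paired with enough buckets, or with explicit embeddings for high-value categories as below.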
Hybrid Approach: Vocabulary + Hash Fallback
Combine explicit embeddings for known high-value categories with hash fallback for unknowns:
```python
import torch
import torch.nn as nn
import mmh3

class HybridCategoryEmbedding(nn.Module):
    """
    Explicit embeddings for known categories, hash embedding for unknowns.
    """

    def __init__(self, known_categories, embed_dim, hash_buckets=10000):
        super().__init__()
        self.embed_dim = embed_dim
        self.hash_buckets = hash_buckets

        # Build vocabulary for known categories
        self.vocab = {cat: idx for idx, cat in enumerate(known_categories)}
        self.vocab_size = len(known_categories)

        # Explicit embeddings for known categories
        self.known_embedding = nn.Embedding(self.vocab_size, embed_dim)

        # Hash embedding for unknown categories
        self.hash_embedding = nn.Embedding(hash_buckets, embed_dim)

        # Learnable weight for combining (optional)
        self.is_known_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, categories):
        """
        Args:
            categories: list of string category values
        """
        embeddings = []
        for cat in categories:
            if cat in self.vocab:
                # Use explicit embedding
                idx = torch.tensor([self.vocab[cat]])
                emb = self.known_embedding(idx).squeeze(0)
            else:
                # Use hash embedding
                hash_idx = mmh3.hash(str(cat), seed=0) % self.hash_buckets
                idx = torch.tensor([hash_idx])
                emb = self.hash_embedding(idx).squeeze(0)
            embeddings.append(emb)
        return torch.stack(embeddings)

# Example: Known top categories + hash fallback for rest
known_cats = ['electronics', 'clothing', 'home', 'sports', 'books']
hybrid_emb = HybridCategoryEmbedding(known_cats, embed_dim=32, hash_buckets=1000)

test_cats = ['electronics', 'unknown_category', 'books', 'brand_new_category']
output = hybrid_emb(test_cats)
print(f"Output shape: {output.shape}")

# Check parameter count
n_params = sum(p.numel() for p in hybrid_emb.parameters())
print(f"Parameters: {n_params:,}")
```

Track unknown categories in production. When a category accumulates enough samples, add it to the explicit vocabulary in the next retraining. The hash embedding provides reasonable predictions until then.
When category names carry semantic meaning, use text embeddings to map unknowns to similar known categories.
Approach: embed category names with a pretrained text model, find the known categories most similar to the unknown one, and fall back to a weighted combination of their encodings.
Example:
Known: 'laptop_computers', 'desktop_computers', 'smartphones'
Unknown: 'tablet_computers'
Semantic similarity → closest to 'laptop_computers'
Use 'laptop_computers' encoding as fallback
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticCategoryEncoder:
    """
    Falls back to semantically similar known categories for unknowns.
    """

    def __init__(self, known_categories, category_encodings):
        """
        Args:
            known_categories: list of known category strings
            category_encodings: dict mapping category -> encoding vector
        """
        self.known_categories = known_categories
        self.category_encodings = category_encodings

        # Load sentence transformer for semantic similarity
        self.text_model = SentenceTransformer('all-MiniLM-L6-v2')

        # Pre-compute embeddings for known categories
        self.known_embeddings = self.text_model.encode(
            [self._clean_category(c) for c in known_categories]
        )

    def _clean_category(self, cat):
        """Convert category ID to readable text."""
        return cat.replace('_', ' ').replace('-', ' ')

    def encode(self, category, top_k=3):
        """
        Encode a category. Uses semantic similarity for unknowns.
        """
        if category in self.category_encodings:
            return self.category_encodings[category]

        # Compute semantic embedding for unknown
        unknown_emb = self.text_model.encode([self._clean_category(category)])

        # Find most similar known categories
        similarities = cosine_similarity(unknown_emb, self.known_embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        top_similarities = similarities[top_indices]

        # Weighted average of top-K encodings
        weights = top_similarities / top_similarities.sum()
        result = np.zeros_like(list(self.category_encodings.values())[0])
        for idx, weight in zip(top_indices, weights):
            cat = self.known_categories[idx]
            result += weight * self.category_encodings[cat]
        return result

# Example usage (simplified)
known_cats = ['laptop_computer', 'desktop_computer', 'smartphone', 'headphones']
encodings = {cat: np.random.randn(32) for cat in known_cats}  # Pretend encodings

encoder = SemanticCategoryEncoder(known_cats, encodings)

# Encode unknown category
unknown = 'tablet_computer'
encoded = encoder.encode(unknown, top_k=2)
print(f"Encoded unknown '{unknown}' using semantic similarity")
```

This approach assumes category names are semantically meaningful. It fails for opaque IDs (user_123, sku_456789) and may be misled by superficial text similarity that doesn't reflect behavioral similarity.
Rather than static fallbacks, continuously learn embeddings for new categories as data arrives.
Strategy 1: Warm-Start New Embeddings
Initialize new category embeddings as the mean of existing embeddings, as small random noise around zero, or as zeros.
Then fine-tune with incoming labeled data.
Strategy 2: Side Information
For new categories, use auxiliary features (such as text descriptions, attributes, or other metadata) to predict initial embeddings.
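A hypothetical sketch of this idea: a small network maps side features to an embedding, trained on (side features, learned embedding) pairs from known categories, then applied to categories with no interaction data. All dimensions and names here are illustrative.

```python
import torch
import torch.nn as nn

side_feature_dim, embed_dim = 16, 32  # illustrative sizes

# Maps a category's side features (e.g. attribute vector) to an embedding
embedding_predictor = nn.Sequential(
    nn.Linear(side_feature_dim, 64),
    nn.ReLU(),
    nn.Linear(64, embed_dim),
)

# After training on known categories, predict a warm-start embedding
# for a brand-new category from its side features alone
new_category_features = torch.randn(1, side_feature_dim)
initial_embedding = embedding_predictor(new_category_features)
print(initial_embedding.shape)
```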
Strategy 3: Incremental Vocabulary Expansion
Periodically expand the vocabulary: track unknowns in production, and once a category appears often enough, allocate it a dedicated embedding slot.
```python
import torch
import torch.nn as nn
from collections import defaultdict

class ExpandableEmbedding(nn.Module):
    """
    Embedding layer that can expand vocabulary at runtime.
    """

    def __init__(self, initial_vocab_size, embed_dim, max_size=None):
        super().__init__()
        self.embed_dim = embed_dim
        self.max_size = max_size or initial_vocab_size * 10

        # Initialize with extra capacity
        self.embedding = nn.Embedding(self.max_size, embed_dim)
        self.current_vocab_size = initial_vocab_size

        # Track unknown categories
        self.unknown_counts = defaultdict(int)
        self.unknown_threshold = 100  # Add after N samples

        # Category to index mapping
        self.cat_to_idx = {}
        self.idx_to_cat = {}

    def set_vocabulary(self, categories):
        """Set initial vocabulary."""
        for idx, cat in enumerate(categories):
            self.cat_to_idx[cat] = idx
            self.idx_to_cat[idx] = cat
        self.current_vocab_size = len(categories)

    def add_category(self, category, init_strategy='mean'):
        """Add new category to vocabulary."""
        if category in self.cat_to_idx:
            return  # Already exists
        if self.current_vocab_size >= self.max_size:
            raise RuntimeError("Maximum vocabulary size reached")

        new_idx = self.current_vocab_size
        self.cat_to_idx[category] = new_idx
        self.idx_to_cat[new_idx] = category

        # Initialize new embedding
        with torch.no_grad():
            if init_strategy == 'mean':
                # Average of existing embeddings
                init_vec = self.embedding.weight[:self.current_vocab_size].mean(dim=0)
            elif init_strategy == 'noise':
                # Random initialization
                init_vec = torch.randn(self.embed_dim) * 0.01
            else:
                init_vec = torch.zeros(self.embed_dim)
            self.embedding.weight[new_idx] = init_vec

        self.current_vocab_size += 1
        return new_idx

    def forward(self, categories, track_unknowns=True):
        """Forward pass with unknown tracking."""
        indices = []
        for cat in categories:
            if cat in self.cat_to_idx:
                indices.append(self.cat_to_idx[cat])
            else:
                # Track unknown for potential future addition
                if track_unknowns:
                    self.unknown_counts[cat] += 1
                indices.append(0)  # Fallback to first embedding (or <UNK>)
        return self.embedding(torch.tensor(indices))

    def expand_vocabulary(self, min_count=None):
        """Add frequently seen unknowns to vocabulary."""
        min_count = min_count or self.unknown_threshold
        added = []
        for cat, count in list(self.unknown_counts.items()):
            if count >= min_count:
                self.add_category(cat, init_strategy='mean')
                added.append(cat)
                del self.unknown_counts[cat]
        return added

# Usage
emb = ExpandableEmbedding(initial_vocab_size=100, embed_dim=32, max_size=1000)
emb.set_vocabulary([f'cat_{i}' for i in range(100)])

# Simulate production traffic with unknowns
for _ in range(150):
    _ = emb(['cat_0', 'new_category', 'cat_1'], track_unknowns=True)

# Expand vocabulary
added = emb.expand_vocabulary(min_count=100)
print(f"Added categories: {added}")
print(f"New vocab size: {emb.current_vocab_size}")
```

Robust handling of novel categories requires continuous monitoring to detect when fallback mechanisms are overloaded.
```python
from collections import defaultdict
from datetime import datetime
import logging

class CategoryMonitor:
    """Monitor unknown category rates in production."""

    def __init__(self, known_vocab: set, alert_threshold: float = 0.05):
        self.known_vocab = known_vocab
        self.alert_threshold = alert_threshold
        self.total_samples = 0
        self.unknown_samples = 0
        self.unknown_categories = defaultdict(int)
        self.window_start = datetime.now()

    def log_sample(self, category: str):
        """Log a category observation."""
        self.total_samples += 1
        if category not in self.known_vocab:
            self.unknown_samples += 1
            self.unknown_categories[category] += 1

    def get_stats(self):
        """Get current monitoring statistics."""
        unknown_rate = self.unknown_samples / max(1, self.total_samples)
        top_unknowns = sorted(
            self.unknown_categories.items(), key=lambda x: -x[1]
        )[:10]
        return {
            'total_samples': self.total_samples,
            'unknown_samples': self.unknown_samples,
            'unknown_rate': unknown_rate,
            'unique_unknowns': len(self.unknown_categories),
            'top_unknowns': top_unknowns,
            'window_duration': (datetime.now() - self.window_start).seconds,
        }

    def check_alerts(self):
        """Check if alerting thresholds are exceeded."""
        stats = self.get_stats()
        if stats['unknown_rate'] > self.alert_threshold:
            logging.warning(
                f"Unknown category rate {stats['unknown_rate']:.2%} "
                f"exceeds threshold {self.alert_threshold:.2%}"
            )
            return True
        return False

# Example usage
known = {'A', 'B', 'C', 'D', 'E'}
monitor = CategoryMonitor(known, alert_threshold=0.10)

# Simulate production traffic
for cat in ['A', 'B', 'X', 'A', 'Y', 'C', 'Z', 'A', 'B', 'W']:
    monitor.log_sample(cat)

stats = monitor.get_stats()
print(f"Unknown rate: {stats['unknown_rate']:.2%}")
print(f"Top unknowns: {stats['top_unknowns']}")
```

This module has provided a comprehensive toolkit for handling high-cardinality categorical features—one of the most common and impactful challenges in production machine learning.
You now have the expertise to handle categorical features of any cardinality—from simple one-hot encoding for small vocabularies to sophisticated hash embeddings and online learning for unbounded categories. Apply these techniques to build production ML systems that gracefully handle the messiness of real-world data.