Every production ML system eventually encounters a category it has never seen before. A new merchant opens on your platform. A user signs up from a country not in your training data. A product launches with a novel category label.
This is the novel category problem (or out-of-vocabulary problem for categories), and handling it gracefully separates robust production systems from fragile prototypes.
What Can Go Wrong:
```python
# Training data had categories: ['A', 'B', 'C']
encoder.fit(train_data)

# Production data includes 'D' (never seen)
test_data = ['A', 'D', 'B']
encoder.transform(test_data)  # KeyError: 'D' not in vocabulary!
```
Without proper handling, your model crashes in production. With poor handling, predictions silently degrade.
In dynamic systems, new categories appear constantly: new users, products, locations, devices. Any encoding scheme that cannot handle novel categories will fail in production. Design for unknown categories from the start.
Strategy 1: Unknown Token (Most Common)
Reserve a special <UNK> category during training. Map all novel categories to this token. The model learns appropriate behavior for "unknown" inputs.
```python
vocab = {'A': 0, 'B': 1, 'C': 2, '<UNK>': 3}

def encode(cat):
    return vocab.get(cat, vocab['<UNK>'])
```
Strategy 2: Global Mean/Mode Imputation
For target encoding or other numeric encodings, fall back to the global statistic computed on the training data: the overall target mean for target encoding, or the mode for discrete encodings.
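A minimal sketch of this fallback for target encoding (pure Python; the class name is illustrative):

```python
from collections import defaultdict

class TargetEncoderWithFallback:
    """Target encoder that falls back to the global mean for unknown categories."""

    def fit(self, categories, targets):
        sums, counts = defaultdict(float), defaultdict(int)
        for cat, y in zip(categories, targets):
            sums[cat] += y
            counts[cat] += 1
        # Global mean is the fallback for categories never seen in training
        self.global_mean = sum(targets) / len(targets)
        self.category_means = {cat: sums[cat] / counts[cat] for cat in sums}
        return self

    def transform(self, categories):
        return [self.category_means.get(cat, self.global_mean) for cat in categories]

encoder = TargetEncoderWithFallback().fit(['A', 'A', 'B', 'B'], [1.0, 0.0, 1.0, 1.0])
print(encoder.transform(['A', 'B', 'Z']))  # unknown 'Z' gets the global mean 0.75
```

The global mean is statistically neutral: an unknown category contributes no signal beyond the base rate, which is usually the safest default for target-encoded features.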
Strategy 3: Zero Vector
For one-hot or embedding outputs, return a zero vector. Effectively says "no information from this feature." Works when other features can compensate.
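A sketch of the zero-vector fallback for one-hot features (the helper name is illustrative):

```python
import numpy as np

def one_hot_with_zero_fallback(categories, vocab):
    """One-hot encode; unknown categories become all-zero rows."""
    out = np.zeros((len(categories), len(vocab)))
    for i, cat in enumerate(categories):
        if cat in vocab:
            out[i, vocab[cat]] = 1.0
        # else: leave the row all zeros -- "no information from this feature"
    return out

vocab = {'A': 0, 'B': 1, 'C': 2}
encoded = one_hot_with_zero_fallback(['A', 'D', 'C'], vocab)
print(encoded)  # row for 'D' is all zeros
```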
Strategy 4: Most Frequent Category
Map unknown to the most common training category. Assumption: unknowns behave like the majority. Risky if unknowns are systematically different.
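In code, this is a one-line change to the lookup (here index 0 is assumed to belong to the most frequent training category):

```python
vocab = {'A': 0, 'B': 1, 'C': 2}  # suppose 'A' was the most frequent category in training
most_frequent_idx = vocab['A']

def encode(cat):
    # Unknown categories borrow the majority category's index
    return vocab.get(cat, most_frequent_idx)

print(encode('D'))  # unknown 'D' maps to index 0, same as 'A'
```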
```python
import numpy as np
from collections import Counter
from typing import Dict

class RobustCategoryEncoder:
    """
    Category encoder with configurable fallback for unknown categories.
    """

    def __init__(self, fallback_strategy: str = 'unknown_token', min_frequency: int = 1):
        """
        Args:
            fallback_strategy: 'unknown_token', 'most_frequent', 'zero', 'error'
            min_frequency: Categories appearing less than this become <UNK>
        """
        self.fallback_strategy = fallback_strategy
        self.min_frequency = min_frequency
        self.category_to_idx: Dict[str, int] = {}
        self.idx_to_category: Dict[int, str] = {}
        self.unknown_idx: int = 0
        self.most_frequent_idx: int = 0

    def fit(self, categories):
        """Learn category vocabulary from training data."""
        # Count frequencies
        counts = Counter(categories)

        # Filter by minimum frequency
        valid_categories = [cat for cat, count in counts.items()
                            if count >= self.min_frequency]

        # Sort by frequency (descending) for deterministic ordering
        valid_categories.sort(key=lambda x: (-counts[x], x))

        # Build vocabulary with <UNK> at index 0
        self.category_to_idx = {'<UNK>': 0}
        for idx, cat in enumerate(valid_categories, start=1):
            self.category_to_idx[cat] = idx

        self.idx_to_category = {v: k for k, v in self.category_to_idx.items()}
        self.unknown_idx = 0
        self.most_frequent_idx = 1 if len(valid_categories) > 0 else 0
        return self

    def transform(self, categories):
        """Transform categories to indices with fallback handling."""
        result = []
        for cat in categories:
            if cat in self.category_to_idx:
                result.append(self.category_to_idx[cat])
            else:
                # Handle unknown category
                if self.fallback_strategy == 'unknown_token':
                    result.append(self.unknown_idx)
                elif self.fallback_strategy == 'most_frequent':
                    result.append(self.most_frequent_idx)
                elif self.fallback_strategy == 'error':
                    raise ValueError(f"Unknown category: {cat}")
                else:
                    result.append(self.unknown_idx)
        return np.array(result)

    def get_vocab_size(self):
        """Return vocabulary size including <UNK>."""
        return len(self.category_to_idx)

# Usage example
encoder = RobustCategoryEncoder(fallback_strategy='unknown_token', min_frequency=2)

train_categories = ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D']  # D appears once
encoder.fit(train_categories)
print(f"Vocabulary: {encoder.category_to_idx}")

test_categories = ['A', 'B', 'E', 'F', 'D']  # E, F never seen; D was rare
encoded = encoder.transform(test_categories)
print(f"Encoded: {encoded}")  # E, F, D all map to <UNK>=0
```

| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Unknown Token | Model learns <UNK> behavior; simple | Single representation for all unknowns | Most use cases |
| Global Mean | Statistically neutral; no new parameters | Ignores category-specific info | Target encoding |
| Zero Vector | Clear "no information" signal | May break models expecting non-zero | Embeddings with other features |
| Most Frequent | No special token; reuses a well-learned category | Dangerous if unknowns are systematically different | When unknowns are truly rare |
Hash encoding inherently handles novel categories, since any string can be hashed. This makes it the go-to choice when the novel-category rate is high.
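As a dependency-free sketch (the hybrid example in this section uses mmh3; here the standard library's hashlib stands in):

```python
import hashlib

def hash_encode(category, n_buckets=1000):
    """Any string maps deterministically to a bucket -- no vocabulary, no unknowns."""
    digest = hashlib.md5(str(category).encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'little') % n_buckets

# Works identically for seen and never-seen categories
bucket = hash_encode('brand_new_category')
print(bucket)
```

The trade-off is collisions: unrelated categories can share a bucket, which is why hashing is usually paired with enough buckets, or with explicit embeddings for high-value categories as below.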
Hybrid Approach: Vocabulary + Hash Fallback
Combine explicit embeddings for known high-value categories with hash fallback for unknowns:
```python
import torch
import torch.nn as nn
import mmh3

class HybridCategoryEmbedding(nn.Module):
    """
    Explicit embeddings for known categories, hash embedding for unknowns.
    """

    def __init__(self, known_categories, embed_dim, hash_buckets=10000):
        super().__init__()
        self.embed_dim = embed_dim
        self.hash_buckets = hash_buckets

        # Build vocabulary for known categories
        self.vocab = {cat: idx for idx, cat in enumerate(known_categories)}
        self.vocab_size = len(known_categories)

        # Explicit embeddings for known categories
        self.known_embedding = nn.Embedding(self.vocab_size, embed_dim)

        # Hash embedding for unknown categories
        self.hash_embedding = nn.Embedding(hash_buckets, embed_dim)

        # Learnable weight for combining (optional)
        self.is_known_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, categories):
        """
        Args:
            categories: list of string category values
        """
        embeddings = []
        for cat in categories:
            if cat in self.vocab:
                # Use explicit embedding
                idx = torch.tensor([self.vocab[cat]])
                emb = self.known_embedding(idx).squeeze(0)
            else:
                # Use hash embedding
                hash_idx = mmh3.hash(str(cat), seed=0) % self.hash_buckets
                idx = torch.tensor([hash_idx])
                emb = self.hash_embedding(idx).squeeze(0)
            embeddings.append(emb)
        return torch.stack(embeddings)

# Example: Known top categories + hash fallback for rest
known_cats = ['electronics', 'clothing', 'home', 'sports', 'books']
hybrid_emb = HybridCategoryEmbedding(known_cats, embed_dim=32, hash_buckets=1000)

test_cats = ['electronics', 'unknown_category', 'books', 'brand_new_category']
output = hybrid_emb(test_cats)
print(f"Output shape: {output.shape}")

# Check parameter count
n_params = sum(p.numel() for p in hybrid_emb.parameters())
print(f"Parameters: {n_params:,}")
```

Track unknown categories in production. When a category accumulates enough samples, add it to the explicit vocabulary in the next retraining. The hash embedding provides reasonable predictions until then.
When category names carry semantic meaning, use text embeddings to map unknowns to similar known categories.
Approach: embed category names with a pretrained text model, find the known categories most similar to the unknown one, and fall back to a weighted combination of their encodings.
Example:
Known: 'laptop_computers', 'desktop_computers', 'smartphones'
Unknown: 'tablet_computers'
Semantic similarity → closest to 'laptop_computers'
Use 'laptop_computers' encoding as fallback
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticCategoryEncoder:
    """
    Falls back to semantically similar known categories for unknowns.
    """

    def __init__(self, known_categories, category_encodings):
        """
        Args:
            known_categories: list of known category strings
            category_encodings: dict mapping category -> encoding vector
        """
        self.known_categories = known_categories
        self.category_encodings = category_encodings

        # Load sentence transformer for semantic similarity
        self.text_model = SentenceTransformer('all-MiniLM-L6-v2')

        # Pre-compute embeddings for known categories
        self.known_embeddings = self.text_model.encode(
            [self._clean_category(c) for c in known_categories]
        )

    def _clean_category(self, cat):
        """Convert category ID to readable text."""
        return cat.replace('_', ' ').replace('-', ' ')

    def encode(self, category, top_k=3):
        """
        Encode a category. Uses semantic similarity for unknowns.
        """
        if category in self.category_encodings:
            return self.category_encodings[category]

        # Compute semantic embedding for unknown
        unknown_emb = self.text_model.encode([self._clean_category(category)])

        # Find most similar known categories
        similarities = cosine_similarity(unknown_emb, self.known_embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        top_similarities = similarities[top_indices]

        # Weighted average of top-K encodings
        weights = top_similarities / top_similarities.sum()
        result = np.zeros_like(list(self.category_encodings.values())[0])
        for idx, weight in zip(top_indices, weights):
            cat = self.known_categories[idx]
            result += weight * self.category_encodings[cat]
        return result

# Example usage (simplified)
known_cats = ['laptop_computer', 'desktop_computer', 'smartphone', 'headphones']
encodings = {cat: np.random.randn(32) for cat in known_cats}  # Pretend encodings

encoder = SemanticCategoryEncoder(known_cats, encodings)

# Encode unknown category
unknown = 'tablet_computer'
encoded = encoder.encode(unknown, top_k=2)
print(f"Encoded unknown '{unknown}' using semantic similarity")
```

This approach assumes category names are semantically meaningful. It fails for opaque IDs (user_123, sku_456789) and may be misled by superficial text similarity that doesn't reflect behavioral similarity.
Rather than static fallbacks, continuously learn embeddings for new categories as data arrives.
Strategy 1: Warm-Start New Embeddings
Initialize new category embeddings as the mean of existing embeddings, as small random noise around zero, or as zeros.
Then fine-tune with incoming labeled data.
Strategy 2: Side Information
For new categories, use auxiliary features (such as text descriptions, attributes, or other metadata) to predict initial embeddings.
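A hypothetical sketch of this idea: a small network maps side features to an embedding, trained on (side features, learned embedding) pairs from known categories, then applied to categories with no interaction data. All dimensions and names here are illustrative.

```python
import torch
import torch.nn as nn

side_feature_dim, embed_dim = 16, 32  # illustrative sizes

# Maps a category's side features (e.g. attribute vector) to an embedding
embedding_predictor = nn.Sequential(
    nn.Linear(side_feature_dim, 64),
    nn.ReLU(),
    nn.Linear(64, embed_dim),
)

# After training on known categories, predict a warm-start embedding
# for a brand-new category from its side features alone
new_category_features = torch.randn(1, side_feature_dim)
initial_embedding = embedding_predictor(new_category_features)
print(initial_embedding.shape)
```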
Strategy 3: Incremental Vocabulary Expansion
Periodically expand the vocabulary: track unknowns in production, and once a category appears often enough, allocate it a dedicated embedding slot.
```python
import torch
import torch.nn as nn
from collections import defaultdict

class ExpandableEmbedding(nn.Module):
    """
    Embedding layer that can expand vocabulary at runtime.
    """

    def __init__(self, initial_vocab_size, embed_dim, max_size=None):
        super().__init__()
        self.embed_dim = embed_dim
        self.max_size = max_size or initial_vocab_size * 10

        # Initialize with extra capacity
        self.embedding = nn.Embedding(self.max_size, embed_dim)
        self.current_vocab_size = initial_vocab_size

        # Track unknown categories
        self.unknown_counts = defaultdict(int)
        self.unknown_threshold = 100  # Add after N samples

        # Category to index mapping
        self.cat_to_idx = {}
        self.idx_to_cat = {}

    def set_vocabulary(self, categories):
        """Set initial vocabulary."""
        for idx, cat in enumerate(categories):
            self.cat_to_idx[cat] = idx
            self.idx_to_cat[idx] = cat
        self.current_vocab_size = len(categories)

    def add_category(self, category, init_strategy='mean'):
        """Add new category to vocabulary."""
        if category in self.cat_to_idx:
            return  # Already exists
        if self.current_vocab_size >= self.max_size:
            raise RuntimeError("Maximum vocabulary size reached")

        new_idx = self.current_vocab_size
        self.cat_to_idx[category] = new_idx
        self.idx_to_cat[new_idx] = category

        # Initialize new embedding
        with torch.no_grad():
            if init_strategy == 'mean':
                # Average of existing embeddings
                init_vec = self.embedding.weight[:self.current_vocab_size].mean(dim=0)
            elif init_strategy == 'noise':
                # Random initialization
                init_vec = torch.randn(self.embed_dim) * 0.01
            else:
                init_vec = torch.zeros(self.embed_dim)
            self.embedding.weight[new_idx] = init_vec

        self.current_vocab_size += 1
        return new_idx

    def forward(self, categories, track_unknowns=True):
        """Forward pass with unknown tracking."""
        indices = []
        for cat in categories:
            if cat in self.cat_to_idx:
                indices.append(self.cat_to_idx[cat])
            else:
                # Track unknown for potential future addition
                if track_unknowns:
                    self.unknown_counts[cat] += 1
                indices.append(0)  # Fallback to first embedding (or <UNK>)
        return self.embedding(torch.tensor(indices))

    def expand_vocabulary(self, min_count=None):
        """Add frequently seen unknowns to vocabulary."""
        min_count = min_count or self.unknown_threshold
        added = []
        for cat, count in list(self.unknown_counts.items()):
            if count >= min_count:
                self.add_category(cat, init_strategy='mean')
                added.append(cat)
                del self.unknown_counts[cat]
        return added

# Usage
emb = ExpandableEmbedding(initial_vocab_size=100, embed_dim=32, max_size=1000)
emb.set_vocabulary([f'cat_{i}' for i in range(100)])

# Simulate production traffic with unknowns
for _ in range(150):
    _ = emb(['cat_0', 'new_category', 'cat_1'], track_unknowns=True)

# Expand vocabulary
added = emb.expand_vocabulary(min_count=100)
print(f"Added categories: {added}")
print(f"New vocab size: {emb.current_vocab_size}")
```

Robust handling of novel categories requires continuous monitoring to detect when fallback mechanisms are overloaded.
```python
from collections import defaultdict
from datetime import datetime
import logging

class CategoryMonitor:
    """Monitor unknown category rates in production."""

    def __init__(self, known_vocab: set, alert_threshold: float = 0.05):
        self.known_vocab = known_vocab
        self.alert_threshold = alert_threshold
        self.total_samples = 0
        self.unknown_samples = 0
        self.unknown_categories = defaultdict(int)
        self.window_start = datetime.now()

    def log_sample(self, category: str):
        """Log a category observation."""
        self.total_samples += 1
        if category not in self.known_vocab:
            self.unknown_samples += 1
            self.unknown_categories[category] += 1

    def get_stats(self):
        """Get current monitoring statistics."""
        unknown_rate = self.unknown_samples / max(1, self.total_samples)
        top_unknowns = sorted(
            self.unknown_categories.items(), key=lambda x: -x[1]
        )[:10]
        return {
            'total_samples': self.total_samples,
            'unknown_samples': self.unknown_samples,
            'unknown_rate': unknown_rate,
            'unique_unknowns': len(self.unknown_categories),
            'top_unknowns': top_unknowns,
            'window_duration': (datetime.now() - self.window_start).seconds,
        }

    def check_alerts(self):
        """Check if alerting thresholds are exceeded."""
        stats = self.get_stats()
        if stats['unknown_rate'] > self.alert_threshold:
            logging.warning(
                f"Unknown category rate {stats['unknown_rate']:.2%} "
                f"exceeds threshold {self.alert_threshold:.2%}"
            )
            return True
        return False

# Example usage
known = {'A', 'B', 'C', 'D', 'E'}
monitor = CategoryMonitor(known, alert_threshold=0.10)

# Simulate production traffic
for cat in ['A', 'B', 'X', 'A', 'Y', 'C', 'Z', 'A', 'B', 'W']:
    monitor.log_sample(cat)

stats = monitor.get_stats()
print(f"Unknown rate: {stats['unknown_rate']:.2%}")
print(f"Top unknowns: {stats['top_unknowns']}")
```

This module has provided a comprehensive toolkit for handling high-cardinality categorical features—one of the most common and impactful challenges in production machine learning.
You now have the expertise to handle categorical features of any cardinality—from simple one-hot encoding for small vocabularies to sophisticated hash embeddings and online learning for unbounded categories. Apply these techniques to build production ML systems that gracefully handle the messiness of real-world data.