Hash encoding (also called feature hashing or the hashing trick) solves a fundamental problem: how do you encode categorical features when cardinality is unknown, unbounded, or too large to enumerate?
The solution is elegantly simple: apply a hash function to map any category to a fixed-size output space.
```
Hash Function: h(x) → {0, 1, ..., m-1}

'user_123456789'    → h('user_123456789')    → 42
'user_987654321'    → h('user_987654321')    → 7891
'never_seen_before' → h('never_seen_before') → 156
```
Unlike traditional encodings that require a pre-defined vocabulary, hash encoding works on any input string—including categories never seen during training.
- **No vocabulary needed:** Works with streaming data and new categories.
- **Fixed memory:** O(m) regardless of cardinality.
- **No lookup table:** Faster inference without dictionary access.
- **Handles any string:** No preprocessing required.
Mathematical Formulation:
For a hash space of size m and category value v, feature hashing produces:
$$\phi(v) \in \mathbb{R}^m, \quad \phi(v)[h(v) \mod m] = \xi(v)$$
Where:
- $h$ is a hash function mapping any input to $\{0, 1, \ldots, m-1\}$
- $\xi(v) \in \{-1, +1\}$ is a second, independent sign hash
- $m$ is the number of hash buckets
The Sign Hash:
The sign hash $\xi(v)$ is crucial for unbiased estimation. Without it, collisions always add positively, creating systematic bias. With random signs, collisions cancel in expectation:
$$E[\phi(v_1) \cdot \phi(v_2)] = \begin{cases} 1 & \text{if } v_1 = v_2 \\ 0 & \text{if } v_1 \neq v_2 \text{ (in expectation)} \end{cases}$$
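This cancellation is easy to check empirically. The sketch below simulates many independent (bucket, sign) hash pairs using MD5 with different seeds (the `phi` helper and seeding scheme are illustrative, not from a specific library): same-category dot products are always 1, while cross-category dot products average out to roughly zero.

```python
import hashlib
import numpy as np

m = 64  # deliberately small hash space to force frequent collisions

def bucket_and_sign(value, seed):
    """Derive a bucket index and a ±1 sign from one digest (illustrative)."""
    digest = hashlib.md5(f"{seed}:{value}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "little") % m
    sign = 1.0 if digest[4] % 2 == 0 else -1.0
    return bucket, sign

def phi(value, seed):
    """Signed one-hot feature vector, as in the formulation above."""
    vec = np.zeros(m)
    b, s = bucket_and_sign(value, seed)
    vec[b] = s
    return vec

# Same category: dot product is always exactly 1
print(phi('cat_a', 0) @ phi('cat_a', 0))  # 1.0

# Different categories: signed collisions cancel, so the mean is near 0
dots = [phi('cat_a', s) @ phi('cat_b', s) for s in range(5000)]
print(round(float(np.mean(dots)), 3))
```

Without the sign hash, every collision would contribute +1 and the cross-category mean would sit near the collision rate (1/64 here) instead of zero.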
```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
import mmh3  # MurmurHash3

# --- sklearn FeatureHasher ---
# Works with feature dictionaries
data = [
    {'category': 'electronics', 'brand': 'sony'},
    {'category': 'clothing', 'brand': 'nike'},
    {'category': 'electronics', 'brand': 'samsung'},
]

hasher = FeatureHasher(n_features=16, input_type='dict', alternate_sign=True)
hashed = hasher.transform(data)

print("Hashed shape:", hashed.shape)
print("Dense matrix:", hashed.toarray())

# --- Manual implementation with MurmurHash ---
def hash_encode(values, n_buckets, use_sign=True):
    """Hash encode categorical values to a fixed-size vector."""
    result = np.zeros(n_buckets)
    for val in values:
        # Primary hash for bucket index
        bucket = mmh3.hash(str(val), seed=0) % n_buckets
        # Secondary hash for sign (if enabled)
        if use_sign:
            sign = 1 if mmh3.hash(str(val), seed=1) % 2 == 0 else -1
        else:
            sign = 1
        result[bucket] += sign
    return result

# Example
categories = ['user_123', 'product_456', 'category_electronics']
encoded = hash_encode(categories, n_buckets=32, use_sign=True)
print("Manual hash encoding:", encoded)

# --- Handling feature name prefixes ---
def hash_with_namespace(feature_name, value, n_buckets):
    """Prefix with feature name to avoid cross-feature collisions."""
    combined = f"{feature_name}={value}"
    bucket = mmh3.hash(combined, seed=0) % n_buckets
    sign = 1 if mmh3.hash(combined, seed=1) % 2 == 0 else -1
    return bucket, sign

bucket, sign = hash_with_namespace('color', 'red', 1000)
print(f"'color=red' → bucket {bucket}, sign {sign}")
```

Hash collisions are inevitable when m < V (vocabulary size). Two different categories may hash to the same bucket, making them indistinguishable. Choose m large enough to minimize collision probability while respecting memory constraints.
Birthday Paradox Applied:
The probability of at least one collision when hashing n items to m buckets:
$$P(\text{collision}) \approx 1 - e^{-n(n-1)/(2m)}$$
For n << m, this simplifies to:
$$P(\text{collision}) \approx \frac{n^2}{2m}$$
Sizing Guidelines:
| Unique Categories (n) | m for <1% collision | m for <10% collision |
|---|---|---|
| 100 | ~500,000 | ~50,000 |
| 1,000 | ~50,000,000 | ~5,000,000 |
| 10,000 | ~5 billion | ~500 million |
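The table values follow directly from the approximations above; a quick check (pure arithmetic, using the birthday bound and its $n^2/2m$ simplification):

```python
import math

def collision_prob(n, m):
    """P(at least one collision) for n items in m buckets (birthday bound)."""
    return 1 - math.exp(-n * (n - 1) / (2 * m))

# Reproduce the first table row: n=100 unique categories
print(round(collision_prob(100, 500_000), 3))  # ≈ 0.01  (<1% target)
print(round(collision_prob(100, 50_000), 3))   # ≈ 0.094 (<10% target)

# Solve for m given a target probability, via the n^2 / (2m) simplification
def buckets_needed(n, p):
    return int(n * n / (2 * p))

print(buckets_needed(1_000, 0.01))  # 50,000,000 — matches the second row
```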
Practical Reality:
These theoretical numbers are often impractical. In practice, we accept some collision rate and rely on:
- The sign hash, which cancels collision bias in expectation
- Multiple hash functions (hash embeddings, covered later), which make full collisions rare
- The model's capacity to learn around the remaining collision noise
```python
import numpy as np
import mmh3
from collections import Counter

def analyze_collisions(categories, n_buckets):
    """Analyze hash collision rate."""
    buckets = [mmh3.hash(str(c), seed=0) % n_buckets for c in categories]
    bucket_counts = Counter(buckets)

    n_unique_buckets = len(bucket_counts)
    n_colliding = sum(1 for c in bucket_counts.values() if c > 1)
    max_collision = max(bucket_counts.values())
    collision_rate = 1 - (n_unique_buckets / len(set(categories)))

    return {
        'n_categories': len(set(categories)),
        'n_buckets': n_buckets,
        'unique_buckets_used': n_unique_buckets,
        'collision_rate': collision_rate,
        'max_categories_per_bucket': max_collision,
        'utilization': n_unique_buckets / n_buckets,
    }

# Simulate with different bucket sizes
categories = [f'cat_{i}' for i in range(10000)]

for n_buckets in [1000, 10000, 100000, 1000000]:
    stats = analyze_collisions(categories, n_buckets)
    print(f"Buckets: {n_buckets:>8} | Collision Rate: {stats['collision_rate']:.2%} | "
          f"Max per bucket: {stats['max_categories_per_bucket']}")
```

| Hash Size (m) | Memory (float32) | Collision Risk | Use Case |
|---|---|---|---|
| 2^10 (1K) | 4 KB | Very High | Extreme constraints only |
| 2^14 (16K) | 64 KB | High | Embedded systems |
| 2^18 (262K) | 1 MB | Moderate | Standard applications |
| 2^20 (1M) | 4 MB | Low | Most production systems |
| 2^24 (16M) | 64 MB | Very Low | High-cardinality features |
Combining Hashing with Learned Embeddings:
Hash embeddings use multiple hash functions to reduce collision impact:
```
category = 'user_12345'

h₁('user_12345') → 42  → E₁[42]  = [ 0.1,  0.3, ...]
h₂('user_12345') → 891 → E₂[891] = [ 0.5, -0.2, ...]
h₃('user_12345') → 7   → E₃[7]   = [-0.1,  0.8, ...]

Final = E₁[42] + E₂[891] + E₃[7]   (or concatenate)
```
With K hash functions, the probability that two categories collide on ALL K is reduced to (1/m)^K.
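The benefit compounds quickly. A quick calculation (assuming independent hash functions and a table size of m = 10,000, chosen here for illustration):

```python
# Probability that two distinct categories collide in ONE table of size m
m = 10_000
p_single = 1 / m

# Probability they collide in ALL K independent tables: (1/m)^K
for K in (1, 2, 3):
    print(f"K={K}: {p_single ** K:.0e}")
# K=1: 1e-04, K=2: 1e-08, K=3: 1e-12
```

So with three hash functions and modest 10K-row tables, a full collision (identical embeddings for two different categories) becomes vanishingly unlikely, even though partial collisions in individual tables remain common.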
```python
import torch
import torch.nn as nn
import mmh3

class HashEmbedding(nn.Module):
    """
    Hash embedding layer using multiple hash functions.
    Combines the hashing trick with learnable embeddings.
    """

    def __init__(self, num_buckets, embed_dim, num_hashes=2, combine='sum'):
        super().__init__()
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.combine = combine

        # Create an embedding table for each hash function
        self.embeddings = nn.ModuleList([
            nn.Embedding(num_buckets, embed_dim)
            for _ in range(num_hashes)
        ])

        if combine == 'concat':
            self.output_dim = embed_dim * num_hashes
        else:
            self.output_dim = embed_dim

    def hash_values(self, values, hash_idx):
        """Hash string values to bucket indices."""
        # In practice, precompute these during data loading
        indices = []
        for v in values:
            bucket = mmh3.hash(str(v), seed=hash_idx) % self.num_buckets
            indices.append(bucket)
        return torch.tensor(indices, dtype=torch.long)

    def forward(self, string_values):
        """
        Args:
            string_values: list of string category values
        Returns:
            Tensor of shape (batch_size, output_dim)
        """
        embeddings = []
        for i, emb_layer in enumerate(self.embeddings):
            indices = self.hash_values(string_values, hash_idx=i)
            embeddings.append(emb_layer(indices))

        if self.combine == 'sum':
            return sum(embeddings)
        elif self.combine == 'mean':
            return torch.stack(embeddings).mean(dim=0)
        elif self.combine == 'concat':
            return torch.cat(embeddings, dim=-1)

# Example
hash_emb = HashEmbedding(num_buckets=10000, embed_dim=32, num_hashes=3, combine='sum')
categories = ['user_123', 'user_456', 'user_789', 'user_123']
output = hash_emb(categories)
print(f"Output shape: {output.shape}")

# Verify same input → same output
print(f"Same category, same embedding: {torch.allclose(output[0], output[3])}")
```

For efficiency, precompute hash indices during data preprocessing rather than computing hashes in the forward pass. Store hashed indices alongside raw data. This separates CPU-bound hashing from GPU-bound embedding lookups.
Bloom Embeddings:
Inspired by Bloom filters, this technique uses weighted combinations of multiple hash lookups with importance weights:
$$\text{emb}(x) = \sum_{i=1}^{K} w_i \cdot E_i[h_i(x)]$$
Where $w_i$ are learnable scalar weights per hash function. This allows the model to learn which hash functions are most reliable for reducing collision noise.
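A minimal sketch of this weighting scheme follows; the class name, the integer-salting used to simulate K independent hash functions, and the hyperparameters are all illustrative, not from a specific library:

```python
import torch
import torch.nn as nn

class BloomEmbedding(nn.Module):
    """Sum of K hashed embedding lookups, each scaled by a learnable scalar w_i."""

    def __init__(self, num_buckets, embed_dim, num_hashes=3):
        super().__init__()
        self.num_buckets = num_buckets
        self.tables = nn.ModuleList(
            nn.Embedding(num_buckets, embed_dim) for _ in range(num_hashes)
        )
        # One learnable scalar weight per hash function (the w_i above)
        self.weights = nn.Parameter(torch.ones(num_hashes))

    def forward(self, ids):
        # Simulate K independent hash functions by salting integer IDs with
        # a multiplicative constant; real code would hash strings with K seeds.
        out = 0
        for i, table in enumerate(self.tables):
            buckets = (ids * (i + 1) * 2654435761 + i) % self.num_buckets
            out = out + self.weights[i] * table(buckets)
        return out

emb = BloomEmbedding(num_buckets=1000, embed_dim=16, num_hashes=3)
ids = torch.tensor([12345, 67890])
print(emb(ids).shape)  # torch.Size([2, 16])
```

Because each `w_i` is trained jointly with the tables, a hash function whose buckets are badly collided for the frequent categories can be down-weighted rather than polluting every lookup equally.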
Quotient-Remainder Trick:
For extremely large vocabularies, decompose category IDs:
```
id = 123456789
quotient  = id // 10000 = 12345 → embed as 'group'
remainder = id %  10000 = 6789  → embed as 'offset'
final = emb_q(quotient) + emb_r(remainder)
```
This reduces a 100M-row vocabulary to two ~10K-row tables (about 20K embedding rows instead of 100M).
```python
import torch
import torch.nn as nn

class QuotientRemainderEmbedding(nn.Module):
    """
    Compositional embedding using quotient-remainder decomposition.
    Reduces O(V) rows to O(2 * sqrt(V)) when num_buckets ≈ sqrt(V).
    """

    def __init__(self, max_id, embed_dim, num_buckets=10000):
        super().__init__()
        self.num_buckets = num_buckets
        n_quotient_buckets = (max_id // num_buckets) + 1

        self.quotient_emb = nn.Embedding(n_quotient_buckets, embed_dim)
        self.remainder_emb = nn.Embedding(num_buckets, embed_dim)

        # Learnable combination weight
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, ids):
        """
        Args:
            ids: LongTensor of category IDs
        """
        quotients = ids // self.num_buckets
        remainders = ids % self.num_buckets

        q_emb = self.quotient_emb(quotients)
        r_emb = self.remainder_emb(remainders)

        # Weighted combination
        return self.alpha * q_emb + (1 - self.alpha) * r_emb

# Example: 100M vocabulary reduced to ~20K embedding rows
max_vocab = 100_000_000
embed_dim = 64
num_buckets = 10_000

qr_emb = QuotientRemainderEmbedding(max_vocab, embed_dim, num_buckets)

# Count parameters
n_params = sum(p.numel() for p in qr_emb.parameters())
full_params = max_vocab * embed_dim

print(f"QR Embedding params: {n_params:,}")
print(f"Full embedding params: {full_params:,}")
print(f"Compression ratio: {full_params / n_params:.0f}x")

# Test
ids = torch.tensor([0, 1, 9999, 10000, 99999999])
output = qr_emb(ids)
print(f"Output shape: {output.shape}")
```

Best practices:
- **Namespace features:** Hash `f'{feature}={value}'` rather than the raw value. Prevents cross-feature collisions.
- **Cross features:** Hash combined strings like `feature_a=X_feature_b=Y` to capture interactions in a single hashed feature.

Hash encoding excels when: (1) the vocabulary is unknown or unbounded at training time, (2) memory is strictly constrained, (3) new categories appear in production, (4) the feature has very high cardinality with a long tail. Avoid it when category semantics matter (learned embeddings do better) or interpretability is required.
The final page addresses Handling Novel Categories—strategies for when production data contains categories never seen during training. We'll cover fallback mechanisms, online learning approaches, and robust encoding pipelines.