Hash encoding (also called feature hashing or the hashing trick) solves a fundamental problem: how do you encode categorical features when cardinality is unknown, unbounded, or too large to enumerate?
The solution is elegantly simple: apply a hash function to map any category to a fixed-size output space.
```
Hash Function: h(x) → {0, 1, ..., m-1}

'user_123456789'    → h('user_123456789')    → 42
'user_987654321'    → h('user_987654321')    → 7891
'never_seen_before' → h('never_seen_before') → 156
```
Unlike traditional encodings that require a pre-defined vocabulary, hash encoding works on any input string—including categories never seen during training.
- **No vocabulary needed:** Works with streaming data and new categories.
- **Fixed memory:** O(m) regardless of cardinality.
- **No lookup table:** Faster inference without dictionary access.
- **Handles any string:** No preprocessing required.
Mathematical Formulation:
For a hash space of size m and category value v, feature hashing produces:
$$\phi(v) \in \mathbb{R}^m, \quad \phi(v)[h(v) \mod m] = \xi(v)$$
Where:
- $h$ is a hash function mapping any input to $\{0, 1, \ldots, m-1\}$
- $\xi(v) \in \{-1, +1\}$ is a second, independent sign hash
- $m$ is the number of hash buckets
The Sign Hash:
The sign hash $\xi(v)$ is crucial for unbiased estimation. Without it, collisions always add positively, creating systematic bias. With random signs, collisions cancel in expectation:
$$E[\phi(v_1) \cdot \phi(v_2)] = \begin{cases} 1 & \text{if } v_1 = v_2 \\ 0 & \text{if } v_1 \neq v_2 \text{ (in expectation)} \end{cases}$$
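This cancellation is easy to check empirically. The sketch below simulates many independent (bucket, sign) hash pairs using MD5 with different seeds (the `phi` helper and seeding scheme are illustrative, not from a specific library): same-category dot products are always 1, while cross-category dot products average out to roughly zero.

```python
import hashlib
import numpy as np

m = 64  # deliberately small hash space to force frequent collisions

def bucket_and_sign(value, seed):
    """Derive a bucket index and a ±1 sign from one digest (illustrative)."""
    digest = hashlib.md5(f"{seed}:{value}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "little") % m
    sign = 1.0 if digest[4] % 2 == 0 else -1.0
    return bucket, sign

def phi(value, seed):
    """Signed one-hot feature vector, as in the formulation above."""
    vec = np.zeros(m)
    b, s = bucket_and_sign(value, seed)
    vec[b] = s
    return vec

# Same category: dot product is always exactly 1
print(phi('cat_a', 0) @ phi('cat_a', 0))  # 1.0

# Different categories: signed collisions cancel, so the mean is near 0
dots = [phi('cat_a', s) @ phi('cat_b', s) for s in range(5000)]
print(round(float(np.mean(dots)), 3))
```

Without the sign hash, every collision would contribute +1 and the cross-category mean would sit near the collision rate (1/64 here) instead of zero.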
```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
import mmh3  # MurmurHash3

# --- sklearn FeatureHasher ---
# Works with feature dictionaries
data = [
    {'category': 'electronics', 'brand': 'sony'},
    {'category': 'clothing', 'brand': 'nike'},
    {'category': 'electronics', 'brand': 'samsung'},
]

hasher = FeatureHasher(n_features=16, input_type='dict', alternate_sign=True)
hashed = hasher.transform(data)

print("Hashed shape:", hashed.shape)
print("Dense matrix:", hashed.toarray())

# --- Manual implementation with MurmurHash ---
def hash_encode(values, n_buckets, use_sign=True):
    """Hash encode categorical values to a fixed-size vector."""
    result = np.zeros(n_buckets)
    for val in values:
        # Primary hash for bucket index
        bucket = mmh3.hash(str(val), seed=0) % n_buckets
        # Secondary hash for sign (if enabled)
        if use_sign:
            sign = 1 if mmh3.hash(str(val), seed=1) % 2 == 0 else -1
        else:
            sign = 1
        result[bucket] += sign
    return result

# Example
categories = ['user_123', 'product_456', 'category_electronics']
encoded = hash_encode(categories, n_buckets=32, use_sign=True)
print("Manual hash encoding:", encoded)

# --- Handling feature name prefixes ---
def hash_with_namespace(feature_name, value, n_buckets):
    """Prefix with feature name to avoid cross-feature collisions."""
    combined = f"{feature_name}={value}"
    bucket = mmh3.hash(combined, seed=0) % n_buckets
    sign = 1 if mmh3.hash(combined, seed=1) % 2 == 0 else -1
    return bucket, sign

bucket, sign = hash_with_namespace('color', 'red', 1000)
print(f"'color=red' → bucket {bucket}, sign {sign}")
```

Hash collisions are inevitable when m < V (vocabulary size). Two different categories may hash to the same bucket, making them indistinguishable. Choose m large enough to minimize collision probability while respecting memory constraints.
Birthday Paradox Applied:
The probability of at least one collision when hashing n items to m buckets:
$$P(\text{collision}) \approx 1 - e^{-n(n-1)/(2m)}$$
For n << m, this simplifies to:
$$P(\text{collision}) \approx \frac{n^2}{2m}$$
Sizing Guidelines:
| Unique Categories (n) | m for <1% collision | m for <10% collision |
|---|---|---|
| 100 | ~500,000 | ~50,000 |
| 1,000 | ~50,000,000 | ~5,000,000 |
| 10,000 | ~5 billion | ~500 million |
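The table values follow directly from the approximations above; a quick check (pure arithmetic, using the birthday bound and its $n^2/2m$ simplification):

```python
import math

def collision_prob(n, m):
    """P(at least one collision) for n items in m buckets (birthday bound)."""
    return 1 - math.exp(-n * (n - 1) / (2 * m))

# Reproduce the first table row: n=100 unique categories
print(round(collision_prob(100, 500_000), 3))  # ≈ 0.01  (<1% target)
print(round(collision_prob(100, 50_000), 3))   # ≈ 0.094 (<10% target)

# Solve for m given a target probability, via the n^2 / (2m) simplification
def buckets_needed(n, p):
    return int(n * n / (2 * p))

print(buckets_needed(1_000, 0.01))  # 50,000,000 — matches the second row
```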
Practical Reality:
These theoretical numbers are often impractical. In practice, we accept some collision rate and rely on:
- The sign hash, which cancels collision bias in expectation
- Multiple hash functions (hash embeddings, covered later), which make full collisions rare
- The model's capacity to learn around the remaining collision noise
```python
import numpy as np
import mmh3
from collections import Counter

def analyze_collisions(categories, n_buckets):
    """Analyze hash collision rate."""
    buckets = [mmh3.hash(str(c), seed=0) % n_buckets for c in categories]
    bucket_counts = Counter(buckets)

    n_unique_buckets = len(bucket_counts)
    n_colliding = sum(1 for c in bucket_counts.values() if c > 1)
    max_collision = max(bucket_counts.values())
    collision_rate = 1 - (n_unique_buckets / len(set(categories)))

    return {
        'n_categories': len(set(categories)),
        'n_buckets': n_buckets,
        'unique_buckets_used': n_unique_buckets,
        'collision_rate': collision_rate,
        'max_categories_per_bucket': max_collision,
        'utilization': n_unique_buckets / n_buckets,
    }

# Simulate with different bucket sizes
categories = [f'cat_{i}' for i in range(10000)]

for n_buckets in [1000, 10000, 100000, 1000000]:
    stats = analyze_collisions(categories, n_buckets)
    print(f"Buckets: {n_buckets:>8} | Collision Rate: {stats['collision_rate']:.2%} | "
          f"Max per bucket: {stats['max_categories_per_bucket']}")
```

| Hash Size (m) | Memory (float32) | Collision Risk | Use Case |
|---|---|---|---|
| 2^10 (1K) | 4 KB | Very High | Extreme constraints only |
| 2^14 (16K) | 64 KB | High | Embedded systems |
| 2^18 (262K) | 1 MB | Moderate | Standard applications |
| 2^20 (1M) | 4 MB | Low | Most production systems |
| 2^24 (16M) | 64 MB | Very Low | High-cardinality features |
Combining Hashing with Learned Embeddings:
Hash embeddings use multiple hash functions to reduce collision impact:
```
category = 'user_12345'

h₁('user_12345') → 42  → E₁[42]  = [ 0.1,  0.3, ...]
h₂('user_12345') → 891 → E₂[891] = [ 0.5, -0.2, ...]
h₃('user_12345') → 7   → E₃[7]   = [-0.1,  0.8, ...]

Final = E₁[42] + E₂[891] + E₃[7]   (or concatenate)
```
With K hash functions, the probability that two categories collide on ALL K is reduced to (1/m)^K.
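The benefit compounds quickly. A quick calculation (assuming independent hash functions and a table size of m = 10,000, chosen here for illustration):

```python
# Probability that two distinct categories collide in ONE table of size m
m = 10_000
p_single = 1 / m

# Probability they collide in ALL K independent tables: (1/m)^K
for K in (1, 2, 3):
    print(f"K={K}: {p_single ** K:.0e}")
# K=1: 1e-04, K=2: 1e-08, K=3: 1e-12
```

So with three hash functions and modest 10K-row tables, a full collision (identical embeddings for two different categories) becomes vanishingly unlikely, even though partial collisions in individual tables remain common.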
```python
import torch
import torch.nn as nn
import mmh3

class HashEmbedding(nn.Module):
    """
    Hash embedding layer using multiple hash functions.
    Combines the hashing trick with learnable embeddings.
    """

    def __init__(self, num_buckets, embed_dim, num_hashes=2, combine='sum'):
        super().__init__()
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.combine = combine

        # Create an embedding table for each hash function
        self.embeddings = nn.ModuleList([
            nn.Embedding(num_buckets, embed_dim)
            for _ in range(num_hashes)
        ])

        if combine == 'concat':
            self.output_dim = embed_dim * num_hashes
        else:
            self.output_dim = embed_dim

    def hash_values(self, values, hash_idx):
        """Hash string values to bucket indices."""
        # In practice, precompute these during data loading
        indices = []
        for v in values:
            bucket = mmh3.hash(str(v), seed=hash_idx) % self.num_buckets
            indices.append(bucket)
        return torch.tensor(indices, dtype=torch.long)

    def forward(self, string_values):
        """
        Args:
            string_values: list of string category values
        Returns:
            Tensor of shape (batch_size, output_dim)
        """
        embeddings = []
        for i, emb_layer in enumerate(self.embeddings):
            indices = self.hash_values(string_values, hash_idx=i)
            embeddings.append(emb_layer(indices))

        if self.combine == 'sum':
            return sum(embeddings)
        elif self.combine == 'mean':
            return torch.stack(embeddings).mean(dim=0)
        elif self.combine == 'concat':
            return torch.cat(embeddings, dim=-1)

# Example
hash_emb = HashEmbedding(num_buckets=10000, embed_dim=32, num_hashes=3, combine='sum')
categories = ['user_123', 'user_456', 'user_789', 'user_123']
output = hash_emb(categories)
print(f"Output shape: {output.shape}")

# Verify same input → same output
print(f"Same category, same embedding: {torch.allclose(output[0], output[3])}")
```

For efficiency, precompute hash indices during data preprocessing rather than computing hashes in the forward pass. Store hashed indices alongside raw data. This separates CPU-bound hashing from GPU-bound embedding lookups.
Bloom Embeddings:
Inspired by Bloom filters, this technique uses weighted combinations of multiple hash lookups with importance weights:
$$\text{emb}(x) = \sum_{i=1}^{K} w_i \cdot E_i[h_i(x)]$$
Where $w_i$ are learnable scalar weights per hash function. This allows the model to learn which hash functions are most reliable for reducing collision noise.
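A minimal sketch of this weighting scheme follows; the class name, the integer-salting used to simulate K independent hash functions, and the hyperparameters are all illustrative, not from a specific library:

```python
import torch
import torch.nn as nn

class BloomEmbedding(nn.Module):
    """Sum of K hashed embedding lookups, each scaled by a learnable scalar w_i."""

    def __init__(self, num_buckets, embed_dim, num_hashes=3):
        super().__init__()
        self.num_buckets = num_buckets
        self.tables = nn.ModuleList(
            nn.Embedding(num_buckets, embed_dim) for _ in range(num_hashes)
        )
        # One learnable scalar weight per hash function (the w_i above)
        self.weights = nn.Parameter(torch.ones(num_hashes))

    def forward(self, ids):
        # Simulate K independent hash functions by salting integer IDs with
        # a multiplicative constant; real code would hash strings with K seeds.
        out = 0
        for i, table in enumerate(self.tables):
            buckets = (ids * (i + 1) * 2654435761 + i) % self.num_buckets
            out = out + self.weights[i] * table(buckets)
        return out

emb = BloomEmbedding(num_buckets=1000, embed_dim=16, num_hashes=3)
ids = torch.tensor([12345, 67890])
print(emb(ids).shape)  # torch.Size([2, 16])
```

Because each `w_i` is trained jointly with the tables, a hash function whose buckets are badly collided for the frequent categories can be down-weighted rather than polluting every lookup equally.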
Quotient-Remainder Trick:
For extremely large vocabularies, decompose category IDs:
```
id = 123456789
quotient  = id // 10000 = 12345 → embed as 'group'
remainder = id %  10000 = 6789  → embed as 'offset'
final = emb_q(quotient) + emb_r(remainder)
```
This reduces a 100M-row vocabulary to two ~10K-row tables (about 20K embedding rows instead of 100M).
```python
import torch
import torch.nn as nn

class QuotientRemainderEmbedding(nn.Module):
    """
    Compositional embedding using quotient-remainder decomposition.
    Reduces O(V) rows to O(2 * sqrt(V)) when num_buckets ≈ sqrt(V).
    """

    def __init__(self, max_id, embed_dim, num_buckets=10000):
        super().__init__()
        self.num_buckets = num_buckets
        n_quotient_buckets = (max_id // num_buckets) + 1

        self.quotient_emb = nn.Embedding(n_quotient_buckets, embed_dim)
        self.remainder_emb = nn.Embedding(num_buckets, embed_dim)

        # Learnable combination weight
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, ids):
        """
        Args:
            ids: LongTensor of category IDs
        """
        quotients = ids // self.num_buckets
        remainders = ids % self.num_buckets

        q_emb = self.quotient_emb(quotients)
        r_emb = self.remainder_emb(remainders)

        # Weighted combination
        return self.alpha * q_emb + (1 - self.alpha) * r_emb

# Example: 100M vocabulary reduced to ~20K embedding rows
max_vocab = 100_000_000
embed_dim = 64
num_buckets = 10_000

qr_emb = QuotientRemainderEmbedding(max_vocab, embed_dim, num_buckets)

# Count parameters
n_params = sum(p.numel() for p in qr_emb.parameters())
full_params = max_vocab * embed_dim

print(f"QR Embedding params: {n_params:,}")
print(f"Full embedding params: {full_params:,}")
print(f"Compression ratio: {full_params / n_params:.0f}x")

# Test
ids = torch.tensor([0, 1, 9999, 10000, 99999999])
output = qr_emb(ids)
print(f"Output shape: {output.shape}")
```

Best practices:
- **Namespace features:** Hash `f'{feature}={value}'` rather than the raw value. Prevents cross-feature collisions.
- **Cross features:** Hash combined strings like `feature_a=X_feature_b=Y` to capture interactions in a single hashed feature.

Hash encoding excels when: (1) the vocabulary is unknown or unbounded at training time, (2) memory is strictly constrained, (3) new categories appear in production, (4) the feature has very high cardinality with a long tail. Avoid it when category semantics matter (learned embeddings do better) or interpretability is required.
The final page addresses Handling Novel Categories—strategies for when production data contains categories never seen during training. We'll cover fallback mechanisms, online learning approaches, and robust encoding pipelines.