Embedding layers revolutionized how neural networks handle categorical data. Rather than manually designing encodings, we let the network learn dense vector representations optimized for the task at hand.
An embedding layer is simply a lookup table that maps each category index to a trainable dense vector:
```
Category Index → Embedding Vector (d=4)
0 → [0.23, -0.45, 0.12, 0.89]
1 → [-0.67, 0.34, 0.56, -0.21]
2 → [0.11, 0.78, -0.33, 0.44]
...
```
The embedding dimension d is a hyperparameter—typically 10-300 depending on cardinality and model capacity. These vectors are initialized randomly and updated via backpropagation like any other network weights.
Embeddings are powerful because they learn task-specific similarity. Categories that behave similarly with respect to the target end up with similar embedding vectors. This captures relationships that manual encodings cannot—without requiring domain knowledge.
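As a minimal sketch of the lookup-table view, using PyTorch's `nn.Embedding` (the variable names here are illustrative):

```python
import torch
import torch.nn as nn

# A lookup table for 3 categories, each mapped to a 4-dimensional vector
embedding = nn.Embedding(num_embeddings=3, embedding_dim=4)

# Looking up category index 1 returns row 1 of the weight matrix
idx = torch.tensor([1])
vec = embedding(idx)
print(vec.shape)  # torch.Size([1, 4])

# The vectors are ordinary trainable parameters, updated by backprop
print(embedding.weight.requires_grad)  # True
```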
Embedding as Matrix Multiplication:
An embedding layer with vocabulary size V and embedding dimension d maintains a weight matrix E ∈ ℝ^(V×d). For a category index i, the embedding lookup is:
$$\text{embed}(i) = E[i, :] = \mathbf{e}_i \in \mathbb{R}^d$$
Equivalently, using one-hot encoding x ∈ {0,1}^V:
$$\text{embed}(x) = x^T E$$
This is why embedding layers are computationally efficient: they perform row lookups, not full matrix multiplications.
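The equivalence between the lookup and the one-hot matrix product can be checked directly (a small sketch; names are illustrative):

```python
import torch
import torch.nn as nn

V, d = 5, 3
embedding = nn.Embedding(V, d)

i = 2
lookup = embedding(torch.tensor(i))   # row lookup: E[i, :]

# Build the one-hot vector x and compute x^T E
one_hot = torch.zeros(V)
one_hot[i] = 1.0
matmul = one_hot @ embedding.weight

# Both paths yield the same d-dimensional vector
print(torch.allclose(lookup, matmul))  # True
```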
Gradient Flow:
During backpropagation, gradients flow only to the embedding vectors that were actually used in the forward pass. For a loss L, the gradient update for embedding i is:
$$\frac{\partial L}{\partial E[i,:]} = \frac{\partial L}{\partial \mathbf{e}_i}$$
Embeddings for categories not in the batch receive zero gradient—a form of implicit regularization.
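The sparse gradient behavior is easy to observe directly (an illustrative sketch):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# The forward pass touches only indices 1 and 3
out = embedding(torch.tensor([1, 3]))
loss = out.sum()
loss.backward()

# Only the rows that were looked up receive nonzero gradient
grad = embedding.weight.grad
used = grad.abs().sum(dim=1) > 0
print(used)  # True only at rows 1 and 3
```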
| Cardinality (V) | Recommended d | Parameters | Notes |
|---|---|---|---|
| < 10 | 2-4 | < 40 | May not need embeddings; one-hot works |
| 10-100 | 4-16 | < 1,600 | Small embeddings sufficient |
| 100-1,000 | 16-64 | < 64K | Standard embedding sizes |
| 1,000-100,000 | 32-128 | < 12.8M | Consider compression techniques |
| > 100,000 | 64-256 | 25M+ | Hash embeddings may be needed |
A common heuristic: d ≈ min(50, V/2) or d ≈ ceil(V^0.25). Google's recommendation: d = 6 × V^0.25. Always validate with cross-validation.
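The heuristics above can be compared numerically. In this small illustrative script, `embed_dim` is a hypothetical helper, not a library function:

```python
import math

def embed_dim(V):
    """Candidate embedding dimensions from the heuristics above (illustrative)."""
    return {
        "min(50, V/2)": min(50, V // 2),
        "ceil(V^0.25)": math.ceil(V ** 0.25),
        "6 * V^0.25": round(6 * V ** 0.25),
    }

# The heuristics diverge sharply as cardinality grows,
# which is why validation on the actual task is essential
for V in (10, 1_000, 100_000):
    print(V, embed_dim(V))
```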
```python
import torch
import torch.nn as nn


class TabularEmbeddingModel(nn.Module):
    """Neural network with embeddings for categorical features."""

    def __init__(self, cat_dims, cat_embed_dims, num_continuous, hidden_dims, output_dim):
        """
        Args:
            cat_dims: list of vocabulary sizes, one per categorical feature
            cat_embed_dims: list of embedding dimensions, one per categorical feature
            num_continuous: number of continuous features
            hidden_dims: list of hidden layer sizes
            output_dim: output dimension (1 for regression/binary, K for multiclass)
        """
        super().__init__()

        # Create embedding layers for each categorical feature
        self.embeddings = nn.ModuleList([
            nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)
            for vocab_size, embed_dim in zip(cat_dims, cat_embed_dims)
        ])

        # Calculate total embedding output dimension
        total_embed_dim = sum(cat_embed_dims)
        input_dim = total_embed_dim + num_continuous

        # Build MLP layers
        layers = []
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.2),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, output_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, cat_features, cont_features):
        """
        Args:
            cat_features: LongTensor of shape (batch, num_cat_features)
            cont_features: FloatTensor of shape (batch, num_continuous)
        """
        # Embed each categorical feature
        embeddings = [
            emb(cat_features[:, i])
            for i, emb in enumerate(self.embeddings)
        ]
        # Concatenate all embeddings and continuous features
        x = torch.cat(embeddings + [cont_features], dim=1)
        return self.mlp(x)


# Example usage
cat_dims = [100, 500, 50]      # Three categorical features with different cardinalities
cat_embed_dims = [16, 32, 12]  # Embedding dimensions
num_continuous = 10

model = TabularEmbeddingModel(
    cat_dims=cat_dims,
    cat_embed_dims=cat_embed_dims,
    num_continuous=num_continuous,
    hidden_dims=[128, 64],
    output_dim=1,
)

# Sample forward pass
batch_size = 32
cat_input = torch.randint(0, 50, (batch_size, 3))  # Category indices
cont_input = torch.randn(batch_size, 10)           # Continuous features

output = model(cat_input, cont_input)
print(f"Output shape: {output.shape}")

# Check embedding weights
print(f"First embedding shape: {model.embeddings[0].weight.shape}")
```

Transfer Learning for Categorical Features:
Embeddings learned on one task can transfer to related tasks, which is especially valuable when the target task has limited labeled data.
Word2Vec-Style Category Embeddings:
One approach borrows the skip-gram technique from NLP: treat sequences of categories (for example, user purchase histories) as "sentences" and train Word2Vec on them, so categories that co-occur end up with similar vectors.
Initializing with Pre-trained Embeddings:
```python
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Train Word2Vec-style embeddings on sequences
sequences = [
    ['product_1', 'product_5', 'product_3', 'product_1'],
    ['product_2', 'product_7', 'product_5', 'product_8'],
    # ... many more user purchase sequences
]

# Train with gensim
w2v_model = Word2Vec(
    sentences=sequences,
    vector_size=64,
    window=3,
    min_count=1,
    workers=4,
    sg=1,  # Skip-gram
)

# Create vocabulary mapping
vocab = {word: idx for idx, word in enumerate(w2v_model.wv.index_to_key)}
pretrained_weights = w2v_model.wv.vectors

# Initialize embedding layer with pre-trained weights
embedding = nn.Embedding(
    num_embeddings=len(vocab),
    embedding_dim=64,
)
with torch.no_grad():
    embedding.weight.copy_(torch.from_numpy(pretrained_weights))

# Option 1: Freeze embeddings
embedding.weight.requires_grad = False

# Option 2: Fine-tune with a lower learning rate
# (use separate optimizer parameter groups with different LRs)

print(f"Loaded {len(vocab)} pre-trained embeddings")
```

Pre-trained embeddings help most when: (1) labeled data for the target task is limited, (2) large amounts of unlabeled sequential data are available, and (3) categories have rich co-occurrence structure. For purely random category assignments, pre-training provides no benefit.
1. Entity Embeddings of Categorical Variables (Guo & Berkhahn, 2016):
The seminal paper demonstrating that embeddings learned for one competition/task transfer well to others. Key insight: embeddings capture intrinsic category properties.
2. Multi-Feature Interaction:
Combine embeddings from multiple features before feeding to network:
```python
# Concatenation (default)
combined = torch.cat([emb_a, emb_b], dim=-1)

# Element-wise product (captures interactions; requires equal dimensions)
combined = emb_a * emb_b

# Addition (shared semantic space; requires equal dimensions)
combined = emb_a + emb_b
```
3. Embedding Regularization:
```python
import torch
import torch.nn as nn


class RegularizedEmbedding(nn.Module):
    """Embedding with dropout and L2 regularization."""

    def __init__(self, num_embeddings, embedding_dim, dropout=0.1, l2_reg=1e-5):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.dropout = nn.Dropout(dropout)
        self.l2_reg = l2_reg

    def forward(self, x):
        emb = self.embedding(x)
        emb = self.dropout(emb)
        return emb

    def get_l2_loss(self):
        """Return the L2 regularization loss over the embedding weight matrix."""
        return self.l2_reg * torch.norm(self.embedding.weight, p=2)


class FeatureInteractionLayer(nn.Module):
    """Factorization Machine-style feature interactions."""

    def __init__(self, embedding_dims, interaction_type='product'):
        super().__init__()
        self.interaction_type = interaction_type
        if interaction_type == 'bilinear':
            # Learnable interaction matrix
            total_dim = sum(embedding_dims)
            self.W = nn.Parameter(torch.randn(total_dim, total_dim) * 0.01)

    def forward(self, embeddings):
        """
        Args:
            embeddings: list of tensors, each of shape (batch, embed_dim_i)
        """
        concat = torch.cat(embeddings, dim=-1)

        if self.interaction_type == 'product':
            # All pairwise element-wise products (equal-dimension pairs only)
            interactions = []
            for i in range(len(embeddings)):
                for j in range(i + 1, len(embeddings)):
                    if embeddings[i].shape[-1] == embeddings[j].shape[-1]:
                        interactions.append(embeddings[i] * embeddings[j])
            return torch.cat([concat] + interactions, dim=-1)

        elif self.interaction_type == 'bilinear':
            # x^T W x style interaction
            interaction = torch.einsum('bi,ij,bj->b', concat, self.W, concat)
            return torch.cat([concat, interaction.unsqueeze(-1)], dim=-1)

        return concat


# Example
emb_layer = RegularizedEmbedding(1000, 32, dropout=0.2)
interaction = FeatureInteractionLayer([32, 32, 16], 'product')
```

The next page covers Hash Encoding—a technique that handles unlimited cardinality by mapping categories to a fixed-size space via hash functions. Essential for streaming data and extreme cardinality scenarios.