Embedding layers revolutionized how neural networks handle categorical data. Rather than manually designing encodings, we let the network learn dense vector representations optimized for the task at hand.
An embedding layer is simply a lookup table that maps each category index to a trainable dense vector:
```
Category Index → Embedding Vector (d=4)
0 → [0.23, -0.45, 0.12, 0.89]
1 → [-0.67, 0.34, 0.56, -0.21]
2 → [0.11, 0.78, -0.33, 0.44]
...
```
The embedding dimension d is a hyperparameter—typically 10-300 depending on cardinality and model capacity. These vectors are initialized randomly and updated via backpropagation like any other network weights.
Embeddings are powerful because they learn task-specific similarity. Categories that behave similarly with respect to the target end up with similar embedding vectors. This captures relationships that manual encodings cannot—without requiring domain knowledge.
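As a minimal sketch of the lookup-table view, using PyTorch's `nn.Embedding` (the variable names here are illustrative):

```python
import torch
import torch.nn as nn

# A lookup table for 3 categories, each mapped to a 4-dimensional vector
embedding = nn.Embedding(num_embeddings=3, embedding_dim=4)

# Looking up category index 1 returns row 1 of the weight matrix
idx = torch.tensor([1])
vec = embedding(idx)
print(vec.shape)  # torch.Size([1, 4])

# The vectors are ordinary trainable parameters, updated by backprop
print(embedding.weight.requires_grad)  # True
```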
Embedding as Matrix Multiplication:
An embedding layer with vocabulary size V and embedding dimension d maintains a weight matrix E ∈ ℝ^(V×d). For a category index i, the embedding lookup is:
$$\text{embed}(i) = E[i, :] = \mathbf{e}_i \in \mathbb{R}^d$$
Equivalently, using one-hot encoding x ∈ {0,1}^V:
$$\text{embed}(x) = x^T E$$
This is why embedding layers are computationally efficient: they perform row lookups, not full matrix multiplications.
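The equivalence between the lookup and the one-hot matrix product can be checked directly (a small sketch; names are illustrative):

```python
import torch
import torch.nn as nn

V, d = 5, 3
embedding = nn.Embedding(V, d)

i = 2
lookup = embedding(torch.tensor(i))   # row lookup: E[i, :]

# Build the one-hot vector x and compute x^T E
one_hot = torch.zeros(V)
one_hot[i] = 1.0
matmul = one_hot @ embedding.weight

# Both paths yield the same d-dimensional vector
print(torch.allclose(lookup, matmul))  # True
```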
Gradient Flow:
During backpropagation, gradients flow only to the embedding vectors that were actually used in the forward pass. For a loss L, the gradient update for embedding i is:
$$\frac{\partial L}{\partial E[i,:]} = \frac{\partial L}{\partial \mathbf{e}_i}$$
Embeddings for categories not in the batch receive zero gradient—a form of implicit regularization.
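The sparse gradient behavior is easy to observe directly (an illustrative sketch):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# The forward pass touches only indices 1 and 3
out = embedding(torch.tensor([1, 3]))
loss = out.sum()
loss.backward()

# Only the rows that were looked up receive nonzero gradient
grad = embedding.weight.grad
used = grad.abs().sum(dim=1) > 0
print(used)  # True only at rows 1 and 3
```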
| Cardinality (V) | Recommended d | Parameters | Notes |
|---|---|---|---|
| < 10 | 2-4 | < 40 | May not need embeddings; one-hot works |
| 10-100 | 4-16 | < 1,600 | Small embeddings sufficient |
| 100-1,000 | 16-64 | < 64K | Standard embedding sizes |
| 1,000-100,000 | 32-128 | < 12.8M | Consider compression techniques |
| > 100,000 | 64-256 | 25M+ | Hash embeddings may be needed |
A common heuristic: d ≈ min(50, V/2) or d ≈ ceil(V^0.25). Google's recommendation: d = 6 × V^0.25. Always validate with cross-validation.
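The heuristics above can be compared numerically. In this small illustrative script, `embed_dim` is a hypothetical helper, not a library function:

```python
import math

def embed_dim(V):
    """Candidate embedding dimensions from the heuristics above (illustrative)."""
    return {
        "min(50, V/2)": min(50, V // 2),
        "ceil(V^0.25)": math.ceil(V ** 0.25),
        "6 * V^0.25": round(6 * V ** 0.25),
    }

# The heuristics diverge sharply as cardinality grows,
# which is why validation on the actual task is essential
for V in (10, 1_000, 100_000):
    print(V, embed_dim(V))
```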
```python
import torch
import torch.nn as nn


class TabularEmbeddingModel(nn.Module):
    """Neural network with embeddings for categorical features."""

    def __init__(self, cat_dims, cat_embed_dims, num_continuous, hidden_dims, output_dim):
        """
        Args:
            cat_dims: list of vocabulary sizes, one per categorical feature
            cat_embed_dims: list of embedding dimensions, one per categorical feature
            num_continuous: number of continuous features
            hidden_dims: list of hidden layer sizes
            output_dim: output dimension (1 for regression/binary, K for multiclass)
        """
        super().__init__()

        # Create embedding layers for each categorical feature
        self.embeddings = nn.ModuleList([
            nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)
            for vocab_size, embed_dim in zip(cat_dims, cat_embed_dims)
        ])

        # Calculate total embedding output dimension
        total_embed_dim = sum(cat_embed_dims)
        input_dim = total_embed_dim + num_continuous

        # Build MLP layers
        layers = []
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.2),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, output_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, cat_features, cont_features):
        """
        Args:
            cat_features: LongTensor of shape (batch, num_cat_features)
            cont_features: FloatTensor of shape (batch, num_continuous)
        """
        # Embed each categorical feature
        embeddings = [
            emb(cat_features[:, i])
            for i, emb in enumerate(self.embeddings)
        ]
        # Concatenate all embeddings and continuous features
        x = torch.cat(embeddings + [cont_features], dim=1)
        return self.mlp(x)


# Example usage
cat_dims = [100, 500, 50]      # Three categorical features with different cardinalities
cat_embed_dims = [16, 32, 12]  # Embedding dimensions
num_continuous = 10

model = TabularEmbeddingModel(
    cat_dims=cat_dims,
    cat_embed_dims=cat_embed_dims,
    num_continuous=num_continuous,
    hidden_dims=[128, 64],
    output_dim=1,
)

# Sample forward pass
batch_size = 32
cat_input = torch.randint(0, 50, (batch_size, 3))  # Category indices
cont_input = torch.randn(batch_size, 10)           # Continuous features

output = model(cat_input, cont_input)
print(f"Output shape: {output.shape}")

# Check embedding weights
print(f"First embedding shape: {model.embeddings[0].weight.shape}")
```

Transfer Learning for Categorical Features:
Embeddings learned on one task can transfer to related tasks, which is especially valuable when the target task has limited labeled data.
Word2Vec-Style Category Embeddings:
One approach borrows the skip-gram technique from NLP: treat sequences of categories (for example, user purchase histories) as "sentences" and train Word2Vec on them, so categories that co-occur end up with similar vectors.
Initializing with Pre-trained Embeddings:
```python
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Train Word2Vec-style embeddings on sequences
sequences = [
    ['product_1', 'product_5', 'product_3', 'product_1'],
    ['product_2', 'product_7', 'product_5', 'product_8'],
    # ... many more user purchase sequences
]

# Train with gensim
w2v_model = Word2Vec(
    sentences=sequences,
    vector_size=64,
    window=3,
    min_count=1,
    workers=4,
    sg=1,  # Skip-gram
)

# Create vocabulary mapping
vocab = {word: idx for idx, word in enumerate(w2v_model.wv.index_to_key)}
pretrained_weights = w2v_model.wv.vectors

# Initialize embedding layer with pre-trained weights
embedding = nn.Embedding(
    num_embeddings=len(vocab),
    embedding_dim=64,
)
with torch.no_grad():
    embedding.weight.copy_(torch.from_numpy(pretrained_weights))

# Option 1: Freeze embeddings
embedding.weight.requires_grad = False

# Option 2: Fine-tune with a lower learning rate
# (use separate optimizer parameter groups with different LRs)

print(f"Loaded {len(vocab)} pre-trained embeddings")
```

Pre-trained embeddings help most when: (1) labeled data for the target task is limited, (2) large amounts of unlabeled sequential data are available, and (3) categories have rich co-occurrence structure. For purely random category assignments, pre-training provides no benefit.
1. Entity Embeddings of Categorical Variables (Guo & Berkhahn, 2016):
The seminal paper demonstrating that embeddings learned for one competition/task transfer well to others. Key insight: embeddings capture intrinsic category properties.
2. Multi-Feature Interaction:
Combine embeddings from multiple features before feeding to network:
```python
# Concatenation (default)
combined = torch.cat([emb_a, emb_b], dim=-1)

# Element-wise product (captures interactions; requires equal dimensions)
combined = emb_a * emb_b

# Addition (shared semantic space; requires equal dimensions)
combined = emb_a + emb_b
```
3. Embedding Regularization:
```python
import torch
import torch.nn as nn


class RegularizedEmbedding(nn.Module):
    """Embedding with dropout and L2 regularization."""

    def __init__(self, num_embeddings, embedding_dim, dropout=0.1, l2_reg=1e-5):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.dropout = nn.Dropout(dropout)
        self.l2_reg = l2_reg

    def forward(self, x):
        emb = self.embedding(x)
        emb = self.dropout(emb)
        return emb

    def get_l2_loss(self):
        """Return the L2 regularization loss over the embedding weight matrix."""
        return self.l2_reg * torch.norm(self.embedding.weight, p=2)


class FeatureInteractionLayer(nn.Module):
    """Factorization Machine-style feature interactions."""

    def __init__(self, embedding_dims, interaction_type='product'):
        super().__init__()
        self.interaction_type = interaction_type
        if interaction_type == 'bilinear':
            # Learnable interaction matrix
            total_dim = sum(embedding_dims)
            self.W = nn.Parameter(torch.randn(total_dim, total_dim) * 0.01)

    def forward(self, embeddings):
        """
        Args:
            embeddings: list of tensors, each of shape (batch, embed_dim_i)
        """
        concat = torch.cat(embeddings, dim=-1)

        if self.interaction_type == 'product':
            # All pairwise element-wise products (equal-dimension pairs only)
            interactions = []
            for i in range(len(embeddings)):
                for j in range(i + 1, len(embeddings)):
                    if embeddings[i].shape[-1] == embeddings[j].shape[-1]:
                        interactions.append(embeddings[i] * embeddings[j])
            return torch.cat([concat] + interactions, dim=-1)

        elif self.interaction_type == 'bilinear':
            # x^T W x style interaction
            interaction = torch.einsum('bi,ij,bj->b', concat, self.W, concat)
            return torch.cat([concat, interaction.unsqueeze(-1)], dim=-1)

        return concat


# Example
emb_layer = RegularizedEmbedding(1000, 32, dropout=0.2)
interaction = FeatureInteractionLayer([32, 32, 16], 'product')
```

The next page covers Hash Encoding—a technique that handles unlimited cardinality by mapping categories to a fixed-size space via hash functions. Essential for streaming data and extreme cardinality scenarios.