Autoencoders offer an elegant paradigm for recommendation: learn to compress a user's interaction history into a dense latent representation, then decode it to predict preferences. This encode-decode framework naturally handles the collaborative filtering task—users with similar compressed representations will receive similar recommendations.
The autoencoder approach has several compelling properties: non-linear transformations capture complex preference patterns, unobserved items are handled naturally as zeros in the input, and user representations are computed directly from interaction vectors rather than stored as separate per-user embeddings.
This page covers: (1) How autoencoders formulate the recommendation problem, (2) AutoRec and its variants for explicit/implicit feedback, (3) Denoising autoencoders for robust learning, (4) Variational autoencoders (VAE) for recommendations with MultVAE, and (5) Training strategies and regularization techniques.
An autoencoder consists of two components:

- An encoder $f_{enc}$ that compresses the input into a low-dimensional latent representation
- A decoder $f_{dec}$ that reconstructs the original input from that latent representation
For recommendations, the input is a user's interaction vector $\mathbf{r}_u \in \mathbb{R}^{|I|}$ where $|I|$ is the number of items. The goal is to learn a mapping that:
$$\hat{\mathbf{r}}_u = f_{dec}(f_{enc}(\mathbf{r}_u))$$
where the reconstruction $\hat{\mathbf{r}}_u$ predicts scores for all items, including those the user hasn't interacted with.
Why This Works for Recommendations:
The bottleneck architecture forces the encoder to learn compressed representations that capture essential preference patterns. Users with similar preferences map to similar latent codes, and the decoder generalizes to predict unseen items based on these patterns.
```python
import torch
import torch.nn as nn

class RecommenderAutoencoder(nn.Module):
    """
    Basic autoencoder for collaborative filtering.

    Input: User's interaction vector (ratings or binary interactions)
    Output: Predicted scores for all items
    """
    def __init__(
        self,
        num_items: int,
        hidden_dims: list = [512, 256],
        latent_dim: int = 64,
        dropout: float = 0.5
    ):
        super().__init__()

        # Encoder: items -> hidden -> latent
        encoder_layers = []
        prev_dim = num_items
        for hidden_dim in hidden_dims:
            encoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout)
            ])
            prev_dim = hidden_dim
        encoder_layers.append(nn.Linear(prev_dim, latent_dim))
        self.encoder = nn.Sequential(*encoder_layers)

        # Decoder: latent -> hidden -> items
        decoder_layers = []
        prev_dim = latent_dim
        for hidden_dim in reversed(hidden_dims):
            decoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout)
            ])
            prev_dim = hidden_dim
        decoder_layers.append(nn.Linear(prev_dim, num_items))
        self.decoder = nn.Sequential(*decoder_layers)

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: User interaction vector, shape (batch_size, num_items)
               Can contain 0s for missing/unobserved items

        Returns:
            Reconstructed interaction vector (predictions for all items)
        """
        latent = self.encoder(x)
        reconstruction = self.decoder(latent)
        return reconstruction

    def get_user_embedding(self, x: torch.Tensor):
        """Extract latent representation for a user."""
        return self.encoder(x)
```

| Aspect | Matrix Factorization | Autoencoder |
|---|---|---|
| Model type | Linear latent factors | Non-linear encoder/decoder |
| Learning objective | Minimize rating error | Minimize reconstruction error |
| Missing data | Explicit handling needed | Natural zero-masking |
| User representation | Learned embedding | Function of interactions |
| Cold start (items) | Requires embedding | Can use item features |
| Computational cost | Scales with interactions | Scales with items |
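To make the "function of interactions" row concrete, here is a hedged sketch of how a trained autoencoder produces recommendations at inference time: score all items with one forward pass, mask out items already in the user's history, and take the top-K. The `recommend_top_k` helper is illustrative, and the tiny linear layer below stands in for any trained model that maps interaction vectors to item scores.

```python
import torch
import torch.nn as nn

def recommend_top_k(model, interactions: torch.Tensor, k: int = 10):
    """Return top-k unseen item indices per user. `model` is any module
    mapping (batch, num_items) interactions to (batch, num_items) scores."""
    model.eval()
    with torch.no_grad():
        scores = model(interactions)                     # (batch, num_items)
    # Exclude items the user has already interacted with
    scores = scores.masked_fill(interactions > 0, float('-inf'))
    return torch.topk(scores, k, dim=1).indices          # (batch, k)

# Usage with a stand-in "model" (illustrative only)
num_items = 20
model = nn.Linear(num_items, num_items)
history = torch.zeros(1, num_items)
history[0, [2, 5, 7]] = 1.0  # user interacted with items 2, 5, 7
top_k = recommend_top_k(model, history, k=5)  # never contains 2, 5, or 7
```

The `-inf` mask guarantees seen items can never appear in the top-K, which is the standard convention when evaluating on held-out interactions.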
AutoRec (Sedhain et al., 2015) was one of the first successful applications of autoencoders to collaborative filtering. It comes in two variants:
Item-based AutoRec (I-AutoRec):
Takes an item's rating vector across all users as input, learns to reconstruct it:
$$\hat{\mathbf{r}}^{(i)} = f(\mathbf{W}_2 \cdot g(\mathbf{W}_1 \mathbf{r}^{(i)} + \mathbf{b}_1) + \mathbf{b}_2)$$
User-based AutoRec (U-AutoRec):
Takes a user's rating vector across all items as input:
$$\hat{\mathbf{r}}^{(u)} = f(\mathbf{W}_2 \cdot g(\mathbf{W}_1 \mathbf{r}^{(u)} + \mathbf{b}_1) + \mathbf{b}_2)$$
Key Insight: The loss only considers observed ratings, not the full reconstruction:
$$\mathcal{L} = \sum_{u} \sum_{i : r_{ui} \neq 0} (r_{ui} - \hat{r}_{ui})^2 + \lambda \cdot ||\mathbf{W}||_F^2$$
This masked loss is crucial—we don't penalize predictions for items the user hasn't rated.
```python
class AutoRec(nn.Module):
    """
    AutoRec: Autoencoder-based Collaborative Filtering.

    Key difference from standard AE: loss is computed only
    on observed ratings, not the full reconstruction.
    """
    def __init__(
        self,
        num_items: int,
        hidden_dim: int = 500,
        activation: str = 'sigmoid'
    ):
        super().__init__()

        self.encoder = nn.Linear(num_items, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, num_items)

        if activation == 'sigmoid':
            self.activation = nn.Sigmoid()
        elif activation == 'relu':
            self.activation = nn.ReLU()
        else:
            self.activation = nn.Identity()

    def forward(self, x: torch.Tensor):
        # Encode
        hidden = self.activation(self.encoder(x))
        # Decode
        reconstruction = self.decoder(hidden)
        return reconstruction

def autorec_loss(predictions, targets, mask, reg_lambda=0.01, model=None):
    """
    Masked reconstruction loss for AutoRec.

    Args:
        predictions: Reconstructed ratings (batch_size, num_items)
        targets: Original ratings (batch_size, num_items)
        mask: Binary mask, 1 where rating exists (batch_size, num_items)
        reg_lambda: L2 regularization strength
        model: Model for weight regularization

    Returns:
        Total loss (reconstruction + regularization)
    """
    # Masked MSE: only compute loss where ratings exist
    masked_predictions = predictions * mask
    masked_targets = targets * mask
    reconstruction_loss = torch.sum((masked_predictions - masked_targets) ** 2)
    reconstruction_loss = reconstruction_loss / mask.sum()  # Average over observed

    # L2 regularization on weights
    reg_loss = 0
    if model is not None:
        for param in model.parameters():
            reg_loss += torch.sum(param ** 2)
        reg_loss = reg_lambda * reg_loss

    return reconstruction_loss + reg_loss

# Training loop with masking
def train_autorec(model, train_loader, optimizer, epochs=50, device='cuda'):
    model = model.to(device)

    for epoch in range(epochs):
        total_loss = 0
        for ratings, mask in train_loader:
            ratings = ratings.to(device)
            mask = mask.to(device)

            optimizer.zero_grad()
            predictions = model(ratings)
            loss = autorec_loss(predictions, ratings, mask, model=model)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1}: Loss = {total_loss/len(train_loader):.4f}")
```

I-AutoRec typically outperforms U-AutoRec because items generally have more ratings than users, providing richer training signals. However, U-AutoRec is more natural for generating user recommendations and handles new items better.
Denoising Autoencoders (DAE) add noise to inputs during training, forcing the model to learn robust representations. For recommendations, this is particularly powerful:
The Corruption Process:
Given a user's interaction vector $\mathbf{r}_u$, create a corrupted version $\tilde{\mathbf{r}}_u$, typically by randomly zeroing out each observed interaction with some probability (mask-out/dropout noise), so the model must reconstruct the full vector from a partial view.
Training Objective:
$$\mathcal{L} = \mathbb{E}_{\tilde{\mathbf{r}} \sim q(\tilde{\mathbf{r}}|\mathbf{r})} \left[ ||\mathbf{r} - f_{dec}(f_{enc}(\tilde{\mathbf{r}}))||^2 \right]$$
Why Denoising Helps Recommendations:

- Corruption mimics the test-time task: predicting a user's full preferences from an incomplete interaction history
- It prevents the model from learning a trivial identity mapping of its input
- The learned representations become robust to noisy or accidental interactions
```python
class DenoisingAutoRec(nn.Module):
    """
    Denoising AutoRec: Corrupts input during training
    to learn more robust representations.
    """
    def __init__(
        self,
        num_items: int,
        hidden_dims: list = [512, 256],
        latent_dim: int = 128,
        corruption_ratio: float = 0.3
    ):
        super().__init__()
        self.corruption_ratio = corruption_ratio

        # Encoder
        encoder_layers = []
        prev_dim = num_items
        for hidden_dim in hidden_dims:
            encoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.SELU(),  # SELU for self-normalizing property
            ])
            prev_dim = hidden_dim
        encoder_layers.append(nn.Linear(prev_dim, latent_dim))
        self.encoder = nn.Sequential(*encoder_layers)

        # Decoder
        decoder_layers = []
        prev_dim = latent_dim
        for hidden_dim in reversed(hidden_dims):
            decoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.SELU(),
            ])
            prev_dim = hidden_dim
        decoder_layers.append(nn.Linear(prev_dim, num_items))
        self.decoder = nn.Sequential(*decoder_layers)

    def corrupt(self, x: torch.Tensor):
        """
        Apply dropout corruption to input.
        Only corrupts non-zero entries (observed interactions).
        """
        if not self.training:
            return x

        # Create corruption mask only for observed entries
        observed_mask = (x != 0).float()
        corruption_mask = torch.bernoulli(
            torch.ones_like(x) * (1 - self.corruption_ratio)
        )
        # Apply corruption only to observed entries
        final_mask = observed_mask * corruption_mask + (1 - observed_mask)
        return x * final_mask

    def forward(self, x: torch.Tensor, add_noise: bool = True):
        if add_noise and self.training:
            x = self.corrupt(x)

        latent = self.encoder(x)
        reconstruction = self.decoder(latent)
        return reconstruction

class CDAE(nn.Module):
    """
    Collaborative Denoising Autoencoder (CDAE).

    Extends DAE with user-specific parameters for personalization.
    Each user has a learned offset vector added to the encoding.
    """
    def __init__(
        self,
        num_users: int,
        num_items: int,
        hidden_dim: int = 256,
        corruption_ratio: float = 0.5
    ):
        super().__init__()
        self.corruption_ratio = corruption_ratio

        # User-specific offset vectors
        self.user_embeddings = nn.Embedding(num_users, hidden_dim)

        # Encoder (item interactions -> hidden)
        self.encoder = nn.Linear(num_items, hidden_dim)
        # Decoder (hidden -> item predictions)
        self.decoder = nn.Linear(hidden_dim, num_items)

        self._init_weights()

    def _init_weights(self):
        nn.init.xavier_uniform_(self.encoder.weight)
        nn.init.xavier_uniform_(self.decoder.weight)
        nn.init.normal_(self.user_embeddings.weight, std=0.01)

    def forward(self, user_ids: torch.Tensor, interactions: torch.Tensor):
        # Corrupt input during training
        if self.training:
            mask = torch.bernoulli(
                torch.ones_like(interactions) * (1 - self.corruption_ratio)
            )
            corrupted = interactions * mask
        else:
            corrupted = interactions

        # Encode with user-specific offset
        user_emb = self.user_embeddings(user_ids)
        hidden = torch.sigmoid(self.encoder(corrupted) + user_emb)

        # Decode to item predictions
        predictions = self.decoder(hidden)
        return predictions
```

CDAE adds user-specific embeddings that act as 'offsets' in the hidden layer. This allows the model to capture user-specific biases that pure reconstruction cannot learn—similar to how matrix factorization has separate user and item embeddings.
Variational Autoencoders (VAEs) bring probabilistic modeling to the autoencoder framework. Instead of learning point embeddings, VAEs learn distributions over latent representations.
The Generative Story:

For each user $u$, the model assumes:

1. Sample a latent representation from the prior: $\mathbf{z}_u \sim \mathcal{N}(0, \mathbf{I})$
2. Transform it into a probability distribution over items: $\pi(\mathbf{z}_u) = \mathrm{softmax}(f_\theta(\mathbf{z}_u))$
3. Sample the user's interactions from a multinomial: $\mathbf{r}_u \sim \mathrm{Mult}(N_u, \pi(\mathbf{z}_u))$, where $N_u$ is the user's total number of interactions
The VAE Objective (ELBO):
$$\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{r})}[\log p_\theta(\mathbf{r}|\mathbf{z})] - \beta \cdot D_{KL}(q_\phi(\mathbf{z}|\mathbf{r}) \,||\, p(\mathbf{z}))$$
where:

- The first term is the expected log-likelihood of reconstructing the interactions under the approximate posterior $q_\phi(\mathbf{z}|\mathbf{r})$
- The KL term regularizes the posterior toward the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$
- $\beta$ controls the trade-off between reconstruction quality and latent-space regularization
Why VAE for Recommendations:

- The multinomial likelihood matches implicit feedback: it models which items a user interacts with out of the whole catalog, rather than scoring items independently
- Sampling in the latent space acts as regularization, which helps on sparse interaction data
- The probabilistic formulation yields a principled objective (the ELBO) rather than an ad-hoc reconstruction loss
```python
import torch.nn.functional as F  # needed for normalize / log_softmax below

class MultVAE(nn.Module):
    """
    Mult-VAE: Variational Autoencoder for Collaborative Filtering
    with Multinomial Likelihood (Liang et al., 2018).

    Uses multinomial likelihood which is well-suited for
    implicit feedback (click/no-click data).
    """
    def __init__(
        self,
        num_items: int,
        hidden_dim: int = 600,
        latent_dim: int = 200,
        dropout: float = 0.5
    ):
        super().__init__()
        self.latent_dim = latent_dim
        self.dropout = nn.Dropout(dropout)

        # Encoder: items -> hidden -> (mu, logvar)
        self.encoder_hidden = nn.Linear(num_items, hidden_dim)
        self.encoder_mu = nn.Linear(hidden_dim, latent_dim)
        self.encoder_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder: latent -> hidden -> items
        self.decoder_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder_output = nn.Linear(hidden_dim, num_items)

        self._init_weights()

    def _init_weights(self):
        for layer in [self.encoder_hidden, self.encoder_mu,
                      self.encoder_logvar, self.decoder_hidden,
                      self.decoder_output]:
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)

    def encode(self, x: torch.Tensor):
        """Encode input to latent distribution parameters."""
        # Normalize input (important for multinomial)
        x = F.normalize(x, p=2, dim=1)
        x = self.dropout(x)
        hidden = torch.tanh(self.encoder_hidden(x))
        mu = self.encoder_mu(hidden)
        logvar = self.encoder_logvar(hidden)
        return mu, logvar

    def reparameterize(self, mu: torch.Tensor, logvar: torch.Tensor):
        """
        Reparameterization trick: z = mu + std * epsilon
        Enables backpropagation through sampling.
        """
        if self.training:
            std = torch.exp(0.5 * logvar)
            eps = torch.randn_like(std)
            return mu + eps * std
        else:
            return mu  # Use mean at inference

    def decode(self, z: torch.Tensor):
        """Decode latent to item logits."""
        hidden = torch.tanh(self.decoder_hidden(z))
        logits = self.decoder_output(hidden)
        return logits

    def forward(self, x: torch.Tensor):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        logits = self.decode(z)
        return logits, mu, logvar

def mult_vae_loss(logits, targets, mu, logvar, anneal=1.0):
    """
    Mult-VAE loss: multinomial likelihood + KL divergence.

    Args:
        logits: Decoder output (unnormalized log-probs)
        targets: Original interaction vector
        mu: Encoder mean
        logvar: Encoder log-variance
        anneal: KL annealing factor (0 to 1)

    Returns:
        Total loss
    """
    # Multinomial negative log-likelihood
    log_softmax = F.log_softmax(logits, dim=1)
    neg_ll = -torch.sum(log_softmax * targets, dim=1).mean()

    # KL divergence: D_KL(q(z|x) || p(z))
    # For Gaussian prior p(z) = N(0, I):
    kl_div = -0.5 * torch.sum(
        1 + logvar - mu.pow(2) - logvar.exp(), dim=1
    ).mean()

    return neg_ll + anneal * kl_div
```

Training autoencoders for recommendations requires careful attention to several factors:
1. KL Annealing (for VAEs):
The KL term can dominate early training, causing the model to ignore the encoder (posterior collapse). Annealing gradually increases KL weight:
$$\beta_t = \min(1, t / T_{anneal})$$
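As a minimal sketch, this linear schedule is a one-liner (the function name is illustrative):

```python
# Linear KL annealing: beta ramps from 0 to 1 over `anneal_steps`
# training steps, then saturates, matching beta_t = min(1, t / T_anneal).
def kl_anneal(step: int, anneal_steps: int) -> float:
    return min(1.0, step / anneal_steps)

betas = [kl_anneal(t, anneal_steps=100) for t in (0, 50, 100, 200)]
# betas == [0.0, 0.5, 1.0, 1.0]
```

The Mult-VAE paper additionally reports that capping $\beta$ below 1 (partial annealing) can improve ranking quality, so the cap itself is worth tuning.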
2. Dropout as Regularization:
Heavy dropout (0.5+) is essential for sparse recommendation data: randomly hiding observed interactions prevents the model from memorizing its input and mirrors the denoising objective.
3. Negative Sampling for Large Catalogs:
With millions of items, computing the full softmax is expensive. Instead, sample a small set of unobserved items as negatives and compute the loss over the positives plus the sampled negatives only.
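One hedged sketch of such a sampled loss, assuming uniform negative sampling and a BCE-style objective over one positive per user (the function and variable names are illustrative, and production code would also reject negatives that collide with positives):

```python
import torch
import torch.nn.functional as F

def sampled_bce_loss(all_scores, positive_items, num_negatives=5):
    """
    all_scores:     (batch, num_items) raw model outputs
    positive_items: (batch,) index of one observed item per user
    """
    batch_size, num_items = all_scores.shape
    # Score of the observed (positive) item for each user
    pos_scores = all_scores.gather(1, positive_items.unsqueeze(1))  # (batch, 1)
    # Uniformly sampled negatives (may collide with positives; see note)
    neg_items = torch.randint(0, num_items, (batch_size, num_negatives))
    neg_scores = all_scores.gather(1, neg_items)                    # (batch, k)

    # Positives should score high (label 1), negatives low (label 0)
    pos_loss = F.binary_cross_entropy_with_logits(
        pos_scores, torch.ones_like(pos_scores))
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_scores, torch.zeros_like(neg_scores))
    return pos_loss + neg_loss

scores = torch.randn(4, 1000)
positives = torch.tensor([3, 17, 256, 999])
loss = sampled_bce_loss(scores, positives)
```

This replaces an O(num_items) softmax per user with O(num_negatives) work, at the cost of a biased gradient estimate; a sampled-softmax variant is the other common choice.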
| Hyperparameter | Typical Range | Effect |
|---|---|---|
| Learning rate | 1e-4 to 1e-3 | Higher for small data, lower for large |
| Dropout rate | 0.3 to 0.7 | Higher for sparser data |
| Latent dimension | 64 to 256 | Higher captures more patterns |
| Hidden layers | 1 to 3 | Deeper for complex patterns |
| Batch size | 128 to 512 | Larger for stable gradients |
| KL anneal epochs | 10% of total | Longer for stable training |
Training loss (reconstruction error) doesn't correlate perfectly with ranking quality. Always use ranking metrics (NDCG@K, Recall@K) on held-out data for early stopping and model selection.
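As an illustrative sketch, Recall@K can be computed from the model's scores (with training items already masked out) and a binary matrix of held-out positives; the helper name is hypothetical, and users with no held-out positives should be filtered out first to avoid division by zero:

```python
import torch

def recall_at_k(scores: torch.Tensor, heldout: torch.Tensor, k: int) -> float:
    """scores: (batch, num_items) predictions; heldout: (batch, num_items)
    binary matrix of held-out positives. Assumes each user has >= 1 positive."""
    top_k = torch.topk(scores, k, dim=1).indices          # (batch, k)
    hits = heldout.gather(1, top_k).sum(dim=1)            # hits per user
    denom = torch.clamp(heldout.sum(dim=1), max=k)        # cap at k
    return (hits / denom).mean().item()

scores = torch.tensor([[0.9, 0.1, 0.8, 0.2]])
heldout = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
# top-2 scored items are 0 and 2; only item 0 is a held-out positive
r = recall_at_k(scores, heldout, k=2)  # 1 hit / min(2 positives, 2) = 0.5
```

NDCG@K follows the same shape but weights hits by their rank position, so it rewards placing positives near the top of the list.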
Different autoencoder variants excel in different scenarios:
| Variant | Best For | Key Advantage | Limitation |
|---|---|---|---|
| AutoRec | Explicit ratings (1-5 stars) | Simple, fast training | Limited expressiveness |
| DAE | Implicit feedback (clicks) | Robust to noise | No probabilistic interpretation |
| CDAE | Personalized recommendations | User-specific modeling | Scales with users |
| Mult-VAE | Implicit feedback at scale | State-of-art quality, principled | Training complexity |
| β-VAE | Disentangled representations | Interpretable factors | Harder to tune |
Autoencoders treat user interactions as unordered sets. But the sequence of interactions matters—what you clicked yesterday influences what you want today. Next, we'll explore how RNNs, Transformers, and other sequence models capture temporal patterns in user behavior.