Autoencoders offer an elegant paradigm for recommendation: learn to compress a user's interaction history into a dense latent representation, then decode it to predict preferences. This encode-decode framework naturally handles the collaborative filtering task—users with similar compressed representations will receive similar recommendations.
The autoencoder approach has several compelling properties: non-linear transformations capture complex preference patterns, unobserved items are handled naturally as zeros in the input, and user representations are computed directly from interaction vectors rather than stored as separate per-user embeddings.
This page covers: (1) How autoencoders formulate the recommendation problem, (2) AutoRec and its variants for explicit/implicit feedback, (3) Denoising autoencoders for robust learning, (4) Variational autoencoders (VAE) for recommendations with MultVAE, and (5) Training strategies and regularization techniques.
An autoencoder consists of two components:

- An encoder $f_{enc}$ that compresses the input into a low-dimensional latent representation
- A decoder $f_{dec}$ that reconstructs the original input from that latent representation
For recommendations, the input is a user's interaction vector $\mathbf{r}_u \in \mathbb{R}^{|I|}$ where $|I|$ is the number of items. The goal is to learn a mapping that:
$$\hat{\mathbf{r}}_u = f_{dec}(f_{enc}(\mathbf{r}_u))$$
where the reconstruction $\hat{\mathbf{r}}_u$ predicts scores for all items, including those the user hasn't interacted with.
Why This Works for Recommendations:
The bottleneck architecture forces the encoder to learn compressed representations that capture essential preference patterns. Users with similar preferences map to similar latent codes, and the decoder generalizes to predict unseen items based on these patterns.
```python
import torch
import torch.nn as nn

class RecommenderAutoencoder(nn.Module):
    """
    Basic autoencoder for collaborative filtering.

    Input: User's interaction vector (ratings or binary interactions)
    Output: Predicted scores for all items
    """
    def __init__(
        self,
        num_items: int,
        hidden_dims: list = [512, 256],
        latent_dim: int = 64,
        dropout: float = 0.5
    ):
        super().__init__()

        # Encoder: items -> hidden -> latent
        encoder_layers = []
        prev_dim = num_items
        for hidden_dim in hidden_dims:
            encoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout)
            ])
            prev_dim = hidden_dim
        encoder_layers.append(nn.Linear(prev_dim, latent_dim))
        self.encoder = nn.Sequential(*encoder_layers)

        # Decoder: latent -> hidden -> items
        decoder_layers = []
        prev_dim = latent_dim
        for hidden_dim in reversed(hidden_dims):
            decoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout)
            ])
            prev_dim = hidden_dim
        decoder_layers.append(nn.Linear(prev_dim, num_items))
        self.decoder = nn.Sequential(*decoder_layers)

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: User interaction vector, shape (batch_size, num_items)
               Can contain 0s for missing/unobserved items

        Returns:
            Reconstructed interaction vector (predictions for all items)
        """
        latent = self.encoder(x)
        reconstruction = self.decoder(latent)
        return reconstruction

    def get_user_embedding(self, x: torch.Tensor):
        """Extract latent representation for a user."""
        return self.encoder(x)
```

| Aspect | Matrix Factorization | Autoencoder |
|---|---|---|
| Model type | Linear latent factors | Non-linear encoder/decoder |
| Learning objective | Minimize rating error | Minimize reconstruction error |
| Missing data | Explicit handling needed | Natural zero-masking |
| User representation | Learned embedding | Function of interactions |
| Cold start (items) | Requires embedding | Can use item features |
| Computational cost | Scales with interactions | Scales with items |
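To make the "function of interactions" row concrete, here is a hedged sketch of how a trained autoencoder produces recommendations at inference time: score all items with one forward pass, mask out items already in the user's history, and take the top-K. The `recommend_top_k` helper is illustrative, and the tiny linear layer below stands in for any trained model that maps interaction vectors to item scores.

```python
import torch
import torch.nn as nn

def recommend_top_k(model, interactions: torch.Tensor, k: int = 10):
    """Return top-k unseen item indices per user. `model` is any module
    mapping (batch, num_items) interactions to (batch, num_items) scores."""
    model.eval()
    with torch.no_grad():
        scores = model(interactions)                     # (batch, num_items)
    # Exclude items the user has already interacted with
    scores = scores.masked_fill(interactions > 0, float('-inf'))
    return torch.topk(scores, k, dim=1).indices          # (batch, k)

# Usage with a stand-in "model" (illustrative only)
num_items = 20
model = nn.Linear(num_items, num_items)
history = torch.zeros(1, num_items)
history[0, [2, 5, 7]] = 1.0  # user interacted with items 2, 5, 7
top_k = recommend_top_k(model, history, k=5)  # never contains 2, 5, or 7
```

The `-inf` mask guarantees seen items can never appear in the top-K, which is the standard convention when evaluating on held-out interactions.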
AutoRec (Sedhain et al., 2015) was one of the first successful applications of autoencoders to collaborative filtering. It comes in two variants:
Item-based AutoRec (I-AutoRec):
Takes an item's rating vector across all users as input, learns to reconstruct it:
$$\hat{\mathbf{r}}^{(i)} = f(\mathbf{W}_2 \cdot g(\mathbf{W}_1 \mathbf{r}^{(i)} + \mathbf{b}_1) + \mathbf{b}_2)$$
User-based AutoRec (U-AutoRec):
Takes a user's rating vector across all items as input:
$$\hat{\mathbf{r}}^{(u)} = f(\mathbf{W}_2 \cdot g(\mathbf{W}_1 \mathbf{r}^{(u)} + \mathbf{b}_1) + \mathbf{b}_2)$$
Key Insight: The loss only considers observed ratings, not the full reconstruction:
$$\mathcal{L} = \sum_{u} \sum_{i : r_{ui} \neq 0} (r_{ui} - \hat{r}_{ui})^2 + \lambda \cdot ||\mathbf{W}||_F^2$$
This masked loss is crucial—we don't penalize predictions for items the user hasn't rated.
```python
class AutoRec(nn.Module):
    """
    AutoRec: Autoencoder-based Collaborative Filtering.

    Key difference from standard AE: loss is computed only
    on observed ratings, not the full reconstruction.
    """
    def __init__(
        self,
        num_items: int,
        hidden_dim: int = 500,
        activation: str = 'sigmoid'
    ):
        super().__init__()

        self.encoder = nn.Linear(num_items, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, num_items)

        if activation == 'sigmoid':
            self.activation = nn.Sigmoid()
        elif activation == 'relu':
            self.activation = nn.ReLU()
        else:
            self.activation = nn.Identity()

    def forward(self, x: torch.Tensor):
        # Encode
        hidden = self.activation(self.encoder(x))
        # Decode
        reconstruction = self.decoder(hidden)
        return reconstruction

def autorec_loss(predictions, targets, mask, reg_lambda=0.01, model=None):
    """
    Masked reconstruction loss for AutoRec.

    Args:
        predictions: Reconstructed ratings (batch_size, num_items)
        targets: Original ratings (batch_size, num_items)
        mask: Binary mask, 1 where rating exists (batch_size, num_items)
        reg_lambda: L2 regularization strength
        model: Model for weight regularization

    Returns:
        Total loss (reconstruction + regularization)
    """
    # Masked MSE: only compute loss where ratings exist
    masked_predictions = predictions * mask
    masked_targets = targets * mask
    reconstruction_loss = torch.sum((masked_predictions - masked_targets) ** 2)
    reconstruction_loss = reconstruction_loss / mask.sum()  # Average over observed

    # L2 regularization on weights
    reg_loss = 0
    if model is not None:
        for param in model.parameters():
            reg_loss += torch.sum(param ** 2)
        reg_loss = reg_lambda * reg_loss

    return reconstruction_loss + reg_loss

# Training loop with masking
def train_autorec(model, train_loader, optimizer, epochs=50, device='cuda'):
    model = model.to(device)

    for epoch in range(epochs):
        total_loss = 0
        for ratings, mask in train_loader:
            ratings = ratings.to(device)
            mask = mask.to(device)

            optimizer.zero_grad()
            predictions = model(ratings)
            loss = autorec_loss(predictions, ratings, mask, model=model)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1}: Loss = {total_loss/len(train_loader):.4f}")
```

I-AutoRec typically outperforms U-AutoRec because items generally have more ratings than users, providing richer training signals. However, U-AutoRec is more natural for generating user recommendations and handles new items better.
Denoising Autoencoders (DAE) add noise to inputs during training, forcing the model to learn robust representations. For recommendations, this is particularly powerful:
The Corruption Process:
Given a user's interaction vector $\mathbf{r}_u$, create a corrupted version $\tilde{\mathbf{r}}_u$, typically by randomly zeroing out each observed interaction with some probability (mask-out/dropout noise), so the model must reconstruct the full vector from a partial view.
Training Objective:
$$\mathcal{L} = \mathbb{E}_{\tilde{\mathbf{r}} \sim q(\tilde{\mathbf{r}}|\mathbf{r})} \left[ ||\mathbf{r} - f_{dec}(f_{enc}(\tilde{\mathbf{r}}))||^2 \right]$$
Why Denoising Helps Recommendations:

- Corruption mimics the test-time task: predicting a user's full preferences from an incomplete interaction history
- It prevents the model from learning a trivial identity mapping of its input
- The learned representations become robust to noisy or accidental interactions
```python
class DenoisingAutoRec(nn.Module):
    """
    Denoising AutoRec: Corrupts input during training
    to learn more robust representations.
    """
    def __init__(
        self,
        num_items: int,
        hidden_dims: list = [512, 256],
        latent_dim: int = 128,
        corruption_ratio: float = 0.3
    ):
        super().__init__()
        self.corruption_ratio = corruption_ratio

        # Encoder
        encoder_layers = []
        prev_dim = num_items
        for hidden_dim in hidden_dims:
            encoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.SELU(),  # SELU for self-normalizing property
            ])
            prev_dim = hidden_dim
        encoder_layers.append(nn.Linear(prev_dim, latent_dim))
        self.encoder = nn.Sequential(*encoder_layers)

        # Decoder
        decoder_layers = []
        prev_dim = latent_dim
        for hidden_dim in reversed(hidden_dims):
            decoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.SELU(),
            ])
            prev_dim = hidden_dim
        decoder_layers.append(nn.Linear(prev_dim, num_items))
        self.decoder = nn.Sequential(*decoder_layers)

    def corrupt(self, x: torch.Tensor):
        """
        Apply dropout corruption to input.
        Only corrupts non-zero entries (observed interactions).
        """
        if not self.training:
            return x

        # Create corruption mask only for observed entries
        observed_mask = (x != 0).float()
        corruption_mask = torch.bernoulli(
            torch.ones_like(x) * (1 - self.corruption_ratio)
        )
        # Apply corruption only to observed entries
        final_mask = observed_mask * corruption_mask + (1 - observed_mask)
        return x * final_mask

    def forward(self, x: torch.Tensor, add_noise: bool = True):
        if add_noise and self.training:
            x = self.corrupt(x)

        latent = self.encoder(x)
        reconstruction = self.decoder(latent)
        return reconstruction

class CDAE(nn.Module):
    """
    Collaborative Denoising Autoencoder (CDAE).

    Extends DAE with user-specific parameters for personalization.
    Each user has a learned offset vector added to the encoding.
    """
    def __init__(
        self,
        num_users: int,
        num_items: int,
        hidden_dim: int = 256,
        corruption_ratio: float = 0.5
    ):
        super().__init__()
        self.corruption_ratio = corruption_ratio

        # User-specific offset vectors
        self.user_embeddings = nn.Embedding(num_users, hidden_dim)

        # Encoder (item interactions -> hidden)
        self.encoder = nn.Linear(num_items, hidden_dim)
        # Decoder (hidden -> item predictions)
        self.decoder = nn.Linear(hidden_dim, num_items)

        self._init_weights()

    def _init_weights(self):
        nn.init.xavier_uniform_(self.encoder.weight)
        nn.init.xavier_uniform_(self.decoder.weight)
        nn.init.normal_(self.user_embeddings.weight, std=0.01)

    def forward(self, user_ids: torch.Tensor, interactions: torch.Tensor):
        # Corrupt input during training
        if self.training:
            mask = torch.bernoulli(
                torch.ones_like(interactions) * (1 - self.corruption_ratio)
            )
            corrupted = interactions * mask
        else:
            corrupted = interactions

        # Encode with user-specific offset
        user_emb = self.user_embeddings(user_ids)
        hidden = torch.sigmoid(self.encoder(corrupted) + user_emb)

        # Decode to item predictions
        predictions = self.decoder(hidden)
        return predictions
```

CDAE adds user-specific embeddings that act as 'offsets' in the hidden layer. This allows the model to capture user-specific biases that pure reconstruction cannot learn—similar to how matrix factorization has separate user and item embeddings.
Variational Autoencoders (VAEs) bring probabilistic modeling to the autoencoder framework. Instead of learning point embeddings, VAEs learn distributions over latent representations.
The Generative Story:

For each user $u$, the model assumes:

1. Sample a latent representation from the prior: $\mathbf{z}_u \sim \mathcal{N}(0, \mathbf{I})$
2. Transform it into a probability distribution over items: $\pi(\mathbf{z}_u) = \mathrm{softmax}(f_\theta(\mathbf{z}_u))$
3. Sample the user's interactions from a multinomial: $\mathbf{r}_u \sim \mathrm{Mult}(N_u, \pi(\mathbf{z}_u))$, where $N_u$ is the user's total number of interactions
The VAE Objective (ELBO):
$$\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{r})}[\log p_\theta(\mathbf{r}|\mathbf{z})] - \beta \cdot D_{KL}(q_\phi(\mathbf{z}|\mathbf{r}) \,||\, p(\mathbf{z}))$$
where:

- The first term is the expected log-likelihood of reconstructing the interactions under the approximate posterior $q_\phi(\mathbf{z}|\mathbf{r})$
- The KL term regularizes the posterior toward the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$
- $\beta$ controls the trade-off between reconstruction quality and latent-space regularization
Why VAE for Recommendations:

- The multinomial likelihood matches implicit feedback: it models which items a user interacts with out of the whole catalog, rather than scoring items independently
- Sampling in the latent space acts as regularization, which helps on sparse interaction data
- The probabilistic formulation yields a principled objective (the ELBO) rather than an ad-hoc reconstruction loss
```python
import torch.nn.functional as F  # needed for normalize / log_softmax below

class MultVAE(nn.Module):
    """
    Mult-VAE: Variational Autoencoder for Collaborative Filtering
    with Multinomial Likelihood (Liang et al., 2018).

    Uses multinomial likelihood which is well-suited for
    implicit feedback (click/no-click data).
    """
    def __init__(
        self,
        num_items: int,
        hidden_dim: int = 600,
        latent_dim: int = 200,
        dropout: float = 0.5
    ):
        super().__init__()
        self.latent_dim = latent_dim
        self.dropout = nn.Dropout(dropout)

        # Encoder: items -> hidden -> (mu, logvar)
        self.encoder_hidden = nn.Linear(num_items, hidden_dim)
        self.encoder_mu = nn.Linear(hidden_dim, latent_dim)
        self.encoder_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder: latent -> hidden -> items
        self.decoder_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder_output = nn.Linear(hidden_dim, num_items)

        self._init_weights()

    def _init_weights(self):
        for layer in [self.encoder_hidden, self.encoder_mu,
                      self.encoder_logvar, self.decoder_hidden,
                      self.decoder_output]:
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)

    def encode(self, x: torch.Tensor):
        """Encode input to latent distribution parameters."""
        # Normalize input (important for multinomial)
        x = F.normalize(x, p=2, dim=1)
        x = self.dropout(x)
        hidden = torch.tanh(self.encoder_hidden(x))
        mu = self.encoder_mu(hidden)
        logvar = self.encoder_logvar(hidden)
        return mu, logvar

    def reparameterize(self, mu: torch.Tensor, logvar: torch.Tensor):
        """
        Reparameterization trick: z = mu + std * epsilon
        Enables backpropagation through sampling.
        """
        if self.training:
            std = torch.exp(0.5 * logvar)
            eps = torch.randn_like(std)
            return mu + eps * std
        else:
            return mu  # Use mean at inference

    def decode(self, z: torch.Tensor):
        """Decode latent to item logits."""
        hidden = torch.tanh(self.decoder_hidden(z))
        logits = self.decoder_output(hidden)
        return logits

    def forward(self, x: torch.Tensor):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        logits = self.decode(z)
        return logits, mu, logvar

def mult_vae_loss(logits, targets, mu, logvar, anneal=1.0):
    """
    Mult-VAE loss: multinomial likelihood + KL divergence.

    Args:
        logits: Decoder output (unnormalized log-probs)
        targets: Original interaction vector
        mu: Encoder mean
        logvar: Encoder log-variance
        anneal: KL annealing factor (0 to 1)

    Returns:
        Total loss
    """
    # Multinomial negative log-likelihood
    log_softmax = F.log_softmax(logits, dim=1)
    neg_ll = -torch.sum(log_softmax * targets, dim=1).mean()

    # KL divergence: D_KL(q(z|x) || p(z))
    # For Gaussian prior p(z) = N(0, I):
    kl_div = -0.5 * torch.sum(
        1 + logvar - mu.pow(2) - logvar.exp(), dim=1
    ).mean()

    return neg_ll + anneal * kl_div
```

Training autoencoders for recommendations requires careful attention to several factors:
1. KL Annealing (for VAEs):
The KL term can dominate early training, causing the model to ignore the encoder (posterior collapse). Annealing gradually increases KL weight:
$$\beta_t = \min(1, t / T_{anneal})$$
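As a minimal sketch, this linear schedule is a one-liner (the function name is illustrative):

```python
# Linear KL annealing: beta ramps from 0 to 1 over `anneal_steps`
# training steps, then saturates, matching beta_t = min(1, t / T_anneal).
def kl_anneal(step: int, anneal_steps: int) -> float:
    return min(1.0, step / anneal_steps)

betas = [kl_anneal(t, anneal_steps=100) for t in (0, 50, 100, 200)]
# betas == [0.0, 0.5, 1.0, 1.0]
```

The Mult-VAE paper additionally reports that capping $\beta$ below 1 (partial annealing) can improve ranking quality, so the cap itself is worth tuning.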
2. Dropout as Regularization:
Heavy dropout (0.5+) is essential for sparse recommendation data: randomly hiding observed interactions prevents the model from memorizing its input and mirrors the denoising objective.
3. Negative Sampling for Large Catalogs:
With millions of items, computing the full softmax is expensive. Instead, sample a small set of unobserved items as negatives and compute the loss over the positives plus the sampled negatives only.
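One hedged sketch of such a sampled loss, assuming uniform negative sampling and a BCE-style objective over one positive per user (the function and variable names are illustrative, and production code would also reject negatives that collide with positives):

```python
import torch
import torch.nn.functional as F

def sampled_bce_loss(all_scores, positive_items, num_negatives=5):
    """
    all_scores:     (batch, num_items) raw model outputs
    positive_items: (batch,) index of one observed item per user
    """
    batch_size, num_items = all_scores.shape
    # Score of the observed (positive) item for each user
    pos_scores = all_scores.gather(1, positive_items.unsqueeze(1))  # (batch, 1)
    # Uniformly sampled negatives (may collide with positives; see note)
    neg_items = torch.randint(0, num_items, (batch_size, num_negatives))
    neg_scores = all_scores.gather(1, neg_items)                    # (batch, k)

    # Positives should score high (label 1), negatives low (label 0)
    pos_loss = F.binary_cross_entropy_with_logits(
        pos_scores, torch.ones_like(pos_scores))
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_scores, torch.zeros_like(neg_scores))
    return pos_loss + neg_loss

scores = torch.randn(4, 1000)
positives = torch.tensor([3, 17, 256, 999])
loss = sampled_bce_loss(scores, positives)
```

This replaces an O(num_items) softmax per user with O(num_negatives) work, at the cost of a biased gradient estimate; a sampled-softmax variant is the other common choice.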
| Hyperparameter | Typical Range | Effect |
|---|---|---|
| Learning rate | 1e-4 to 1e-3 | Higher for small data, lower for large |
| Dropout rate | 0.3 to 0.7 | Higher for sparser data |
| Latent dimension | 64 to 256 | Higher captures more patterns |
| Hidden layers | 1 to 3 | Deeper for complex patterns |
| Batch size | 128 to 512 | Larger for stable gradients |
| KL anneal epochs | 10% of total | Longer for stable training |
Training loss (reconstruction error) doesn't correlate perfectly with ranking quality. Always use ranking metrics (NDCG@K, Recall@K) on held-out data for early stopping and model selection.
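As an illustrative sketch, Recall@K can be computed from the model's scores (with training items already masked out) and a binary matrix of held-out positives; the helper name is hypothetical, and users with no held-out positives should be filtered out first to avoid division by zero:

```python
import torch

def recall_at_k(scores: torch.Tensor, heldout: torch.Tensor, k: int) -> float:
    """scores: (batch, num_items) predictions; heldout: (batch, num_items)
    binary matrix of held-out positives. Assumes each user has >= 1 positive."""
    top_k = torch.topk(scores, k, dim=1).indices          # (batch, k)
    hits = heldout.gather(1, top_k).sum(dim=1)            # hits per user
    denom = torch.clamp(heldout.sum(dim=1), max=k)        # cap at k
    return (hits / denom).mean().item()

scores = torch.tensor([[0.9, 0.1, 0.8, 0.2]])
heldout = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
# top-2 scored items are 0 and 2; only item 0 is a held-out positive
r = recall_at_k(scores, heldout, k=2)  # 1 hit / min(2 positives, 2) = 0.5
```

NDCG@K follows the same shape but weights hits by their rank position, so it rewards placing positives near the top of the list.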
Different autoencoder variants excel in different scenarios:
| Variant | Best For | Key Advantage | Limitation |
|---|---|---|---|
| AutoRec | Explicit ratings (1-5 stars) | Simple, fast training | Limited expressiveness |
| DAE | Implicit feedback (clicks) | Robust to noise | No probabilistic interpretation |
| CDAE | Personalized recommendations | User-specific modeling | Scales with users |
| Mult-VAE | Implicit feedback at scale | State-of-art quality, principled | Training complexity |
| β-VAE | Disentangled representations | Interpretable factors | Harder to tune |
Autoencoders treat user interactions as unordered sets. But the sequence of interactions matters—what you clicked yesterday influences what you want today. Next, we'll explore how RNNs, Transformers, and other sequence models capture temporal patterns in user behavior.