For over a decade, matrix factorization dominated collaborative filtering. The Netflix Prize (2006-2009) cemented SVD-based approaches as the gold standard, with their elegant mathematical foundations and proven effectiveness. Yet these methods share a fundamental limitation: they model user-item interactions through linear inner products.
Neural Collaborative Filtering (NCF), introduced by He et al. in 2017, asks a provocative question: What if we replaced the fixed inner product with a learnable neural network?
This simple conceptual shift unlocked a new paradigm. Instead of assuming interactions follow a specific mathematical form, we let neural networks learn the interaction function from data. The results were transformative—NCF and its descendants now power recommendations at Netflix, YouTube, Amazon, and virtually every major platform.
By the end of this page, you will understand: (1) Why linear interactions limit traditional CF, (2) The complete NCF architecture including GMF and MLP components, (3) How to train NCF for implicit feedback, (4) The fusion strategies that combine model strengths, and (5) Practical implementation considerations for production systems.
To appreciate NCF's contribution, we must first understand what matrix factorization cannot capture.
The Inner Product Assumption:
In matrix factorization, the predicted interaction between user $u$ and item $i$ is:
$$\hat{y}_{ui} = \mathbf{p}_u^T \mathbf{q}_i = \sum_{k=1}^{K} p_{uk} \cdot q_{ik}$$
where $\mathbf{p}_u \in \mathbb{R}^K$ is the user embedding and $\mathbf{q}_i \in \mathbb{R}^K$ is the item embedding. This formulation makes a strong assumption: the interaction function is bilinear.
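To make the formula concrete, here is a minimal sketch with made-up embedding values showing that the MF prediction is nothing more than a dot product of two learned vectors:

```python
import torch

# Hypothetical 4-dimensional embeddings for one user and one item
p_u = torch.tensor([0.8, -0.2, 0.5, 0.1])   # user factors
q_i = torch.tensor([0.6,  0.3, 0.4, -0.7])  # item factors

# MF prediction: a fixed inner product, with no learnable interaction function
y_hat = torch.dot(p_u, q_i)  # 0.48 - 0.06 + 0.20 - 0.07 = 0.55
print(y_hat)                 # tensor(0.5500)
```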
Bilinearity means the prediction is linear in the user embedding for a fixed item embedding, and linear in the item embedding for a fixed user embedding: every latent dimension contributes independently, and no interaction across dimensions can be modeled.
Consider three users where: user 1 is similar to user 2, user 2 is similar to user 3, but user 1 is dissimilar to user 3. This triangular relationship cannot be perfectly represented in a low-dimensional inner product space—yet such non-transitive preferences are common in real recommendation scenarios.
Concrete Example of MF Failure:
Consider a movie recommendation scenario with 4 users:
| User | Likes Action | Likes Romance | Likes Comedy |
|---|---|---|---|
| Alice | ✓ | ✓ | ✗ |
| Bob | ✓ | ✗ | ✓ |
| Carol | ✗ | ✓ | ✓ |
| Dave | ✓ | ✓ | ✓ |
Using 2D embeddings (a common dimensionality), MF must position Alice near both action and romance items but far from comedy, Bob near action and comedy but far from romance, Carol near romance and comedy but far from action, and Dave near all three.
With only 2 dimensions, we cannot satisfy all constraints simultaneously: the interaction pattern above has rank 3, so no rank-2 factorization reproduces it exactly, as the quick check below confirms.
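As a numerical sanity check (a sketch using a truncated SVD as a stand-in for the best possible 2-factor MF fit), the rank-2 reconstruction of the table above always leaves a residual error:

```python
import numpy as np

# Rows: Alice, Bob, Carol, Dave; columns: Action, Romance, Comedy
R = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float)

# Best rank-2 approximation via truncated SVD (optimal in the least-squares sense)
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R2 = (U[:, :2] * s[:2]) @ Vt[:2, :]

print(np.round(R2, 2))                         # reconstruction is not exactly 0/1
print("max abs error:", np.abs(R - R2).max())  # > 0: 2D factors cannot fit all patterns
```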
| Aspect | Matrix Factorization | Neural Approach |
|---|---|---|
| Interaction function | Fixed (inner product) | Learned from data |
| Feature interactions | First-order only | Arbitrary order |
| Non-linear patterns | Cannot capture | Naturally captured |
| Geometric constraints | Euclidean/cosine only | Arbitrary manifolds |
| Interpretability | High (factor meanings) | Lower (black box) |
| Computational cost | O(K) per prediction | O(network size) |
NCF replaces the inner product with a multi-layer perceptron (MLP) that learns the interaction function. The key insight is treating recommendation as a binary classification problem: given a user-item pair, predict the probability of interaction.
The General NCF Framework:
$$\hat{y}_{ui} = f(\mathbf{p}_u, \mathbf{q}_i | \Theta)$$
where $f$ is a neural network with parameters $\Theta$. This seemingly simple change has profound implications: the interaction function is no longer fixed in advance but is learned end-to-end from the observed interactions, so the model can capture non-linear and higher-order patterns that an inner product cannot express.
```python
import torch
import torch.nn as nn


class NeuralCollaborativeFiltering(nn.Module):
    """
    Neural Collaborative Filtering combining GMF and MLP.

    Architecture:
    - Separate embeddings for GMF and MLP paths
    - GMF: Element-wise product (generalized matrix factorization)
    - MLP: Concatenation + deep layers (learns non-linear interactions)
    - NeuMF: Fusion of GMF and MLP outputs
    """

    def __init__(
        self,
        num_users: int,
        num_items: int,
        gmf_embedding_dim: int = 32,
        mlp_embedding_dim: int = 32,
        mlp_hidden_layers: list = [64, 32, 16],
        dropout: float = 0.2
    ):
        super().__init__()

        # GMF embeddings (for generalized matrix factorization)
        self.gmf_user_embedding = nn.Embedding(num_users, gmf_embedding_dim)
        self.gmf_item_embedding = nn.Embedding(num_items, gmf_embedding_dim)

        # MLP embeddings (separate from GMF for flexibility)
        self.mlp_user_embedding = nn.Embedding(num_users, mlp_embedding_dim)
        self.mlp_item_embedding = nn.Embedding(num_items, mlp_embedding_dim)

        # MLP layers
        mlp_input_dim = mlp_embedding_dim * 2
        self.mlp_layers = nn.ModuleList()
        for hidden_dim in mlp_hidden_layers:
            self.mlp_layers.append(nn.Linear(mlp_input_dim, hidden_dim))
            self.mlp_layers.append(nn.ReLU())
            self.mlp_layers.append(nn.Dropout(dropout))
            mlp_input_dim = hidden_dim

        # Final prediction layer (NeuMF fusion)
        # GMF output dim + MLP output dim -> 1
        final_input_dim = gmf_embedding_dim + mlp_hidden_layers[-1]
        self.prediction_layer = nn.Linear(final_input_dim, 1)

        self._init_weights()

    def _init_weights(self):
        """Initialize embeddings with small random values."""
        for embedding in [self.gmf_user_embedding, self.gmf_item_embedding,
                          self.mlp_user_embedding, self.mlp_item_embedding]:
            nn.init.normal_(embedding.weight, std=0.01)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor):
        # GMF path: element-wise product
        gmf_user = self.gmf_user_embedding(user_ids)
        gmf_item = self.gmf_item_embedding(item_ids)
        gmf_output = gmf_user * gmf_item  # Element-wise product

        # MLP path: concatenation + deep layers
        mlp_user = self.mlp_user_embedding(user_ids)
        mlp_item = self.mlp_item_embedding(item_ids)
        mlp_input = torch.cat([mlp_user, mlp_item], dim=-1)

        mlp_output = mlp_input
        for layer in self.mlp_layers:
            mlp_output = layer(mlp_output)

        # NeuMF: concatenate GMF and MLP outputs
        neumf_input = torch.cat([gmf_output, mlp_output], dim=-1)
        prediction = self.prediction_layer(neumf_input)

        return torch.sigmoid(prediction).squeeze(-1)
```

NCF uses separate embeddings for GMF and MLP paths rather than sharing embeddings. This allows each path to learn optimal representations for its specific interaction modeling approach. GMF embeddings capture multiplicative patterns while MLP embeddings optimize for concatenation-based learning.
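For orientation, here is a small usage sketch (with made-up catalog sizes and arbitrary IDs) showing the expected input and output shapes:

```python
# Hypothetical catalog sizes, for illustration only
model = NeuralCollaborativeFiltering(num_users=1000, num_items=5000)

user_ids = torch.tensor([3, 42, 7])     # batch of user indices
item_ids = torch.tensor([10, 99, 512])  # batch of item indices

scores = model(user_ids, item_ids)      # interaction probabilities in (0, 1)
print(scores.shape)                     # torch.Size([3])
```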
The Generalized Matrix Factorization (GMF) component bridges classical MF with neural approaches. It shows that MF is a special case of NCF.
GMF Formulation:
$$\phi_{GMF} = \mathbf{p}_u \odot \mathbf{q}_i$$
where $\odot$ denotes element-wise (Hadamard) product. The final prediction is:
$$\hat{y}_{ui} = a_{out}(\mathbf{h}^T (\mathbf{p}_u \odot \mathbf{q}_i))$$
where $\mathbf{h}$ is the edge weight vector and $a_{out}$ is the activation function.
Connection to Classical MF:
If we set $\mathbf{h}$ to the all-ones vector and $a_{out}$ to the identity function, GMF reduces to standard matrix factorization:
$$\hat{y}_{ui} = \sum_{k} p_{uk} \cdot q_{ik} = \mathbf{p}_u^T \mathbf{q}_i$$
The Generalization:
By learning $\mathbf{h}$, GMF allows different latent dimensions to have different importance. This weighted element-wise product is strictly more expressive than the uniform inner product of classical MF.
```python
class GeneralizedMatrixFactorization(nn.Module):
    """
    GMF component of NCF.

    Generalizes MF by learning importance weights for each latent dimension.
    When h = [1, 1, ..., 1], this reduces to standard matrix factorization.
    """

    def __init__(self, num_users: int, num_items: int, embedding_dim: int = 32):
        super().__init__()

        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)

        # Learnable importance weights (the 'h' vector)
        # This is what makes it "generalized" - different dimensions
        # can have different contributions to the final score
        self.h = nn.Linear(embedding_dim, 1, bias=False)

        self._init_weights()

    def _init_weights(self):
        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.item_embedding.weight, std=0.01)
        # Initialize h uniformly to approximate standard MF initially
        nn.init.ones_(self.h.weight)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor):
        user_emb = self.user_embedding(user_ids)
        item_emb = self.item_embedding(item_ids)

        # Element-wise product
        element_product = user_emb * item_emb

        # Weighted sum (learned importance per dimension)
        output = self.h(element_product)

        return torch.sigmoid(output).squeeze(-1)

    def get_dimension_importance(self):
        """Analyze which latent dimensions matter most."""
        return self.h.weight.data.squeeze().abs().cpu().numpy()
```

The MLP component is where NCF's true power emerges. While GMF generalizes linear interactions, the MLP learns entirely new interaction patterns that cannot be expressed as weighted products.
Architecture Design:
$$\phi_{MLP} = a_L(W_L^T(a_{L-1}(...a_1(W_1^T[\mathbf{p}_u; \mathbf{q}_i])...)))$$
The key design choices:
Concatenation over Product: Unlike GMF, MLP concatenates embeddings. This preserves individual feature information rather than immediately combining them.
Tower Structure: Layers typically decrease in size (e.g., 128 → 64 → 32). This creates a compression bottleneck that forces the network to learn essential interaction patterns.
ReLU Activations: Produce piecewise-linear functions that, given sufficient width and depth, can approximate arbitrary continuous interaction functions.
Dropout Regularization: Critical for preventing overfitting on sparse interaction data.
| Design Choice | Rationale | Typical Values |
|---|---|---|
| Embedding dimension | Balance expressiveness vs overfitting | 32-128 |
| Number of layers | Deeper = more complex interactions | 2-4 layers |
| Layer size decay | Halving pattern works well | [128, 64, 32] |
| Activation function | ReLU for efficiency, can try GELU | ReLU |
| Dropout rate | Higher for sparse data | 0.2-0.5 |
| Batch normalization | Stabilizes training | Optional |
Element-wise product immediately creates feature interactions, which can lose individual feature information. Concatenation preserves raw features and lets the network decide how to combine them. This is especially valuable when user and item features have asymmetric importance.
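The original paper also trains an MLP-only model in isolation, which is reused later on this page for pre-training NeuMF. Below is a minimal sketch of such a component; the attribute names `layers` and `output_layer` are assumptions chosen to line up with the pre-training loader shown further down:

```python
class MLPRecommender(nn.Module):
    """MLP-only interaction model: concatenated embeddings fed through a tower."""

    def __init__(self, num_users: int, num_items: int,
                 embedding_dim: int = 32,
                 hidden_layers: list = [64, 32, 16],
                 dropout: float = 0.2):
        super().__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)

        # Tower of decreasing layers, mirroring the MLP path of NeuMF
        self.layers = nn.ModuleList()
        input_dim = embedding_dim * 2
        for hidden_dim in hidden_layers:
            self.layers.append(nn.Linear(input_dim, hidden_dim))
            self.layers.append(nn.ReLU())
            self.layers.append(nn.Dropout(dropout))
            input_dim = hidden_dim

        # Scalar prediction head
        self.output_layer = nn.Linear(input_dim, 1)

        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.item_embedding.weight, std=0.01)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor):
        # Concatenation preserves the individual user and item features
        x = torch.cat([self.user_embedding(user_ids),
                       self.item_embedding(item_ids)], dim=-1)
        for layer in self.layers:
            x = layer(x)
        return torch.sigmoid(self.output_layer(x)).squeeze(-1)
```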
Most modern recommendation systems work with implicit feedback—clicks, views, purchases—rather than explicit ratings. This fundamentally changes how we train models.
The Implicit Feedback Challenge:
We cannot treat all unobserved interactions as negative—users haven't seen most items. The solution: negative sampling.
Binary Cross-Entropy Loss:
$$\mathcal{L} = -\sum_{(u,i) \in \mathcal{O}} \log \hat{y}_{ui} - \sum_{(u,j) \in \mathcal{O}^-} \log(1 - \hat{y}_{uj})$$
where $\mathcal{O}$ is the set of observed interactions and $\mathcal{O}^-$ is a set of sampled negative interactions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np


class NCFDataset(Dataset):
    """
    Dataset for NCF training with negative sampling.

    For each positive (user, item) pair, samples negative items
    that the user hasn't interacted with.
    """

    def __init__(
        self,
        user_item_pairs: np.ndarray,       # Shape: (N, 2)
        num_items: int,
        num_negatives: int = 4,
        user_positive_items: dict = None   # user_id -> set of positive items
    ):
        self.user_item_pairs = user_item_pairs
        self.num_items = num_items
        self.num_negatives = num_negatives

        # Build positive item sets for efficient negative sampling
        if user_positive_items is None:
            self.user_positive_items = {}
            for user_id, item_id in user_item_pairs:
                if user_id not in self.user_positive_items:
                    self.user_positive_items[user_id] = set()
                self.user_positive_items[user_id].add(item_id)
        else:
            self.user_positive_items = user_positive_items

    def __len__(self):
        return len(self.user_item_pairs) * (1 + self.num_negatives)

    def __getitem__(self, idx):
        # Determine if this is a positive or negative sample
        pair_idx = idx // (1 + self.num_negatives)
        sample_type = idx % (1 + self.num_negatives)

        user_id = self.user_item_pairs[pair_idx, 0]

        if sample_type == 0:
            # Positive sample
            item_id = self.user_item_pairs[pair_idx, 1]
            label = 1.0
        else:
            # Negative sample - randomly sample item user hasn't interacted with
            item_id = self._sample_negative(user_id)
            label = 0.0

        return torch.tensor(user_id), torch.tensor(item_id), torch.tensor(label)

    def _sample_negative(self, user_id):
        """Sample an item the user hasn't interacted with."""
        positives = self.user_positive_items.get(user_id, set())
        while True:
            neg_item = np.random.randint(0, self.num_items)
            if neg_item not in positives:
                return neg_item


def train_ncf(
    model: nn.Module,
    train_loader: DataLoader,
    optimizer: torch.optim.Optimizer,
    epochs: int = 20,
    device: str = 'cuda'
):
    """Train NCF with binary cross-entropy loss."""
    model = model.to(device)

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        num_batches = 0

        for user_ids, item_ids, labels in train_loader:
            user_ids = user_ids.to(device)
            item_ids = item_ids.to(device)
            labels = labels.to(device).float()

            optimizer.zero_grad()
            predictions = model(user_ids, item_ids)
            loss = F.binary_cross_entropy(predictions, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            num_batches += 1

        avg_loss = total_loss / num_batches
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")
```

The ratio of negative to positive samples significantly impacts model quality. Too few negatives (1:1) may not provide enough signal; too many (1:10+) can overwhelm positive signals and slow training. A ratio of 4-5 negatives per positive typically works well.
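Putting the pieces together, a training run might be wired up as follows (the interaction array, catalog sizes, and hyperparameters here are placeholders for illustration):

```python
# Hypothetical observed interactions: columns are (user_id, item_id)
interactions = np.array([[0, 10], [0, 42], [1, 7], [2, 10], [2, 99]])

dataset = NCFDataset(interactions, num_items=5000, num_negatives=4)
loader = DataLoader(dataset, batch_size=256, shuffle=True)

model = NeuralCollaborativeFiltering(num_users=1000, num_items=5000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_ncf(model, loader, optimizer, epochs=5, device='cpu')
```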
The full Neural Matrix Factorization (NeuMF) model combines GMF and MLP to leverage both linear and non-linear interaction modeling.
Fusion Strategy:
$$\hat{y}_{ui} = \sigma(\mathbf{h}^T[\phi_{GMF}; \phi_{MLP}])$$
where $[;]$ denotes concatenation. This late fusion approach lets each component keep its own embeddings and contribute what it models best: GMF supplies the linear, multiplicative interactions while the MLP supplies the non-linear patterns learned from data.
Pre-training for Better Convergence:
A key insight from the original NCF paper: initialize NeuMF using pre-trained GMF and MLP models:
```python
def load_pretrained_neumf(
    neumf_model: NeuralCollaborativeFiltering,
    pretrained_gmf: GeneralizedMatrixFactorization,
    pretrained_mlp: nn.Module,
    alpha: float = 0.5
):
    """
    Initialize NeuMF with pre-trained GMF and MLP models.

    Args:
        neumf_model: Target NeuMF model to initialize
        pretrained_gmf: Pre-trained GMF model
        pretrained_mlp: Pre-trained MLP model
        alpha: Weight for combining GMF vs MLP in final layer (0.5 = equal)
    """
    # Load GMF embeddings
    neumf_model.gmf_user_embedding.weight.data.copy_(
        pretrained_gmf.user_embedding.weight.data
    )
    neumf_model.gmf_item_embedding.weight.data.copy_(
        pretrained_gmf.item_embedding.weight.data
    )

    # Load MLP embeddings (assuming same architecture)
    neumf_model.mlp_user_embedding.weight.data.copy_(
        pretrained_mlp.user_embedding.weight.data
    )
    neumf_model.mlp_item_embedding.weight.data.copy_(
        pretrained_mlp.item_embedding.weight.data
    )

    # Load MLP layers
    for neumf_layer, pretrained_layer in zip(
        neumf_model.mlp_layers, pretrained_mlp.layers
    ):
        if hasattr(neumf_layer, 'weight'):
            neumf_layer.weight.data.copy_(pretrained_layer.weight.data)
            neumf_layer.bias.data.copy_(pretrained_layer.bias.data)

    # Initialize prediction layer with weighted combination
    gmf_dim = pretrained_gmf.h.weight.shape[1]
    mlp_dim = neumf_model.prediction_layer.weight.shape[1] - gmf_dim

    # Scale GMF contribution by alpha, MLP by (1-alpha)
    gmf_weights = alpha * pretrained_gmf.h.weight.data.squeeze()
    mlp_weights = (1 - alpha) * pretrained_mlp.output_layer.weight.data.squeeze()

    neumf_model.prediction_layer.weight.data = torch.cat(
        [gmf_weights, mlp_weights]
    ).unsqueeze(0)

    return neumf_model
```

Deploying NCF at scale introduces challenges beyond model accuracy:
Inference Latency:
NCF requires a forward pass for each user-item pair, so scoring a user against millions of items in real time is prohibitive. Common mitigations are two-stage retrieval (a cheap candidate generator followed by NCF re-ranking), caching scores or top-N lists for active users, and batching candidate scoring on GPU, as sketched below.
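One common pattern, sketched here under the assumption that the candidate item IDs fit in a single batch, is to score a user's entire candidate set in one forward pass rather than item by item (the `score_candidates` helper is illustrative, not part of the original NCF code):

```python
@torch.no_grad()
def score_candidates(model: nn.Module, user_id: int,
                     candidate_items: torch.Tensor, top_k: int = 10):
    """Score one user against a batch of candidate items and return the top-k."""
    model.eval()
    user_ids = torch.full_like(candidate_items, user_id)  # repeat the user ID
    scores = model(user_ids, candidate_items)             # one batched forward pass
    top_scores, top_idx = torch.topk(scores, k=min(top_k, len(candidate_items)))
    return candidate_items[top_idx], top_scores

# Example: rank 5,000 candidates (e.g., produced by a cheap retrieval stage)
candidates = torch.arange(5000)
items, scores = score_candidates(model, user_id=3, candidate_items=candidates)
```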
Embedding Table Size:
With millions of users and items, embedding tables dominate memory: a table for 10 million users at dimension 64 in float32 already occupies roughly 2.5 GB, before counting items or optimizer state.
Solutions for Scale: hash or share embeddings across rare IDs, store embeddings at reduced precision (quantization), and shard large tables across parameter servers. The sketch below illustrates the hashing approach.
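As one illustration of shrinking the user table, this sketch applies the hashing trick: raw IDs are mapped into a fixed number of buckets, trading occasional collisions for a bounded memory footprint (the `HashedEmbedding` class and its `num_buckets` parameter are assumptions for illustration, not part of the original NCF model):

```python
class HashedEmbedding(nn.Module):
    """Embedding table of fixed size: raw IDs are hashed into num_buckets rows."""

    def __init__(self, num_buckets: int, embedding_dim: int):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, embedding_dim)
        nn.init.normal_(self.table.weight, std=0.01)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # Memory no longer grows with the number of distinct IDs,
        # at the cost of occasional collisions between hashed IDs.
        return self.table(ids % self.num_buckets)

# Example: arbitrarily large raw user IDs compressed into a 1M-row table
hashed_users = HashedEmbedding(num_buckets=1_000_000, embedding_dim=32)
emb = hashed_users(torch.tensor([123, 98_765_432, 123]))  # shape (3, 32)
```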
While newer architectures have superseded NCF in benchmarks, its contribution was foundational: proving that neural networks could outperform handcrafted interaction functions. This opened the door to transformers, graph networks, and the modern deep recommendation ecosystem we'll explore in subsequent pages.
Coming Next: We'll explore Autoencoders for Recommendations—how encoding user behavior into compressed latent spaces enables both collaborative filtering and content-based recommendation with a single architecture.