Recommendation systems are inherently graph-structured. Users connect to items through interactions, items connect to attributes, users connect through social networks. Yet traditional collaborative filtering treats these connections superficially—typically using only direct neighbors.
Graph Neural Networks (GNNs) unlock the power of multi-hop reasoning: a user's preferences propagate through the graph, enriched by neighbors-of-neighbors, shared attributes, and community structure. This enables the model to capture higher-order collaborative signals that direct-neighbor methods miss.
This page covers: (1) Modeling RecSys as bipartite graphs, (2) Graph convolution for collaborative filtering, (3) LightGCN architecture and its simplicity, (4) NGCF and feature transformation approaches, (5) Knowledge graph enhanced recommendations, and (6) Scalability strategies for large graphs.
The User-Item Bipartite Graph:
$$G = (\mathcal{U} \cup \mathcal{I}, \mathcal{E})$$
where $\mathcal{U}$ is the set of user nodes, $\mathcal{I}$ the set of item nodes, and $\mathcal{E}$ the set of interaction edges between them.
The adjacency matrix $A$ has structure:
$$A = \begin{pmatrix} 0 & R \\ R^T & 0 \end{pmatrix}$$
where $R$ is the user-item interaction matrix. The bipartite structure means edges run only between users and items: the diagonal blocks are zero, so there are no direct user-user or item-item connections.
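As a small worked example (assuming two users and two items, where user 1 interacted with both items and user 2 only with item 2):

$$R = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$

The upper-right block of $A$ is $R$, the lower-left block is $R^T$, and the zero diagonal blocks reflect the absence of user-user and item-item edges.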
Why Graphs for RecSys:
Graphs naturally capture collaborative signals. Similar users connect to similar items, forming communities. GNNs propagate information through these communities, learning embeddings that encode both direct preferences and neighborhood structure.
| Graph Element | RecSys Meaning | Example |
|---|---|---|
| User node | Individual user | User ID embedding |
| Item node | Item to recommend | Movie, product, article |
| User-Item edge | Interaction signal | Click, purchase, rating |
| Edge weight | Interaction strength | Rating value, frequency |
| 1-hop neighbors | Direct interactions | Items user clicked |
| 2-hop neighbors | Collaborative signal | Items similar users clicked |
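The 2-hop collaborative signal in the last row can be computed directly from the interaction matrix: $R R^T R$ counts, for each user-item pair, the number of length-3 paths (user → item → similar user → item). A minimal sketch with a toy interaction matrix (the matrix values are illustrative, not from the text):

```python
import numpy as np

# Toy interaction matrix: 3 users x 4 items (1 = interaction)
R = np.array([
    [1, 1, 0, 0],  # user 0
    [1, 0, 1, 0],  # user 1
    [0, 1, 1, 1],  # user 2
])

# R @ R.T counts co-interacted items between user pairs;
# multiplying by R again scores items by 2-hop reachability.
two_hop_scores = R @ R.T @ R

# Mask out items the user already interacted with before ranking
candidate_scores = np.where(R == 1, -np.inf, two_hop_scores)
top_item_for_user0 = int(np.argmax(candidate_scores[0]))  # item 2
```

Here user 0 is steered toward item 2 because both users who share items with user 0 also interacted with it; GNN layers generalize exactly this path-counting idea with learned embeddings.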
```python
import torch
import scipy.sparse as sp
import numpy as np


def build_bipartite_graph(
    interactions: np.ndarray,  # (num_interactions, 2) - [user_id, item_id]
    num_users: int,
    num_items: int
):
    """
    Construct the user-item bipartite graph adjacency matrix.
    Returns the raw adjacency A and interaction matrix R;
    normalize A separately for GCN-style propagation.
    """
    # Build sparse interaction matrix R
    user_ids = interactions[:, 0]
    item_ids = interactions[:, 1]
    data = np.ones(len(interactions))

    R = sp.csr_matrix(
        (data, (user_ids, item_ids)),
        shape=(num_users, num_items)
    )

    # Build full adjacency matrix A = [[0, R], [R^T, 0]]
    zero_u = sp.csr_matrix((num_users, num_users))
    zero_i = sp.csr_matrix((num_items, num_items))

    A = sp.bmat([
        [zero_u, R],
        [R.T, zero_i]
    ]).tocsr()

    return A, R


def normalize_adjacency(A: sp.csr_matrix, symmetric: bool = True):
    """
    Normalize adjacency matrix for GCN propagation.

    Options:
    - Symmetric: D^(-1/2) A D^(-1/2) (LightGCN, GCN)
    - Left-only: D^(-1) A (random walk normalization)
    """
    # Compute degree
    degrees = np.array(A.sum(axis=1)).flatten()
    degrees = np.maximum(degrees, 1)  # Avoid division by zero

    if symmetric:
        # D^(-1/2)
        d_inv_sqrt = np.power(degrees, -0.5)
        D_inv_sqrt = sp.diags(d_inv_sqrt)
        norm_A = D_inv_sqrt @ A @ D_inv_sqrt
    else:
        # D^(-1)
        d_inv = np.power(degrees, -1)
        D_inv = sp.diags(d_inv)
        norm_A = D_inv @ A

    return norm_A.tocsr()


def sparse_to_torch(sparse_mx: sp.csr_matrix):
    """Convert scipy sparse matrix to PyTorch sparse tensor."""
    sparse_mx = sparse_mx.tocoo()
    indices = torch.LongTensor(
        np.vstack([sparse_mx.row, sparse_mx.col])
    )
    values = torch.FloatTensor(sparse_mx.data)
    shape = torch.Size(sparse_mx.shape)
    return torch.sparse_coo_tensor(indices, values, shape)
```

Graph convolution aggregates information from neighbors to refine node embeddings. The core operation:
$$\mathbf{e}_u^{(l+1)} = \text{AGG}(\{\mathbf{e}_i^{(l)} : i \in \mathcal{N}_u\})$$
where $\mathcal{N}_u$ is the neighbor set of user $u$.
Standard GCN Propagation:
$$\mathbf{E}^{(l+1)} = \sigma(\tilde{A} \mathbf{E}^{(l)} W^{(l)})$$
where $\tilde{A}$ is the normalized adjacency matrix, $\mathbf{E}^{(l)}$ the matrix of node embeddings at layer $l$, $W^{(l)}$ a learnable weight matrix, and $\sigma$ a non-linear activation.
For Recommendations:
Each GCN layer propagates embeddings across user-item edges:
```python
import torch
import torch.nn as nn


class GCNConv(nn.Module):
    """
    Standard Graph Convolution layer for recommendations.
    Aggregates neighbor information with learned transformation.
    """
    def __init__(self, in_dim: int, out_dim: int, bias: bool = True):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.xavier_uniform_(self.linear.weight)
        if self.linear.bias is not None:
            nn.init.zeros_(self.linear.bias)

    def forward(
        self,
        x: torch.Tensor,    # Node features (num_nodes, in_dim)
        adj: torch.Tensor   # Normalized adjacency (sparse)
    ):
        """
        GCN forward: Aggregate then transform.
        h = A_norm @ X @ W
        """
        # Aggregate from neighbors
        support = torch.sparse.mm(adj, x)
        # Linear transformation
        output = self.linear(support)
        return output


class NGCF(nn.Module):
    """
    Neural Graph Collaborative Filtering.

    Key features:
    - Feature transformation at each layer
    - Interaction between representations (bi-interaction)
    - Message passing with element-wise products
    """
    def __init__(
        self,
        num_users: int,
        num_items: int,
        embedding_dim: int = 64,
        hidden_dims: list = [64, 64, 64],
        dropout: float = 0.1
    ):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items

        # Initial embeddings
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)

        # NGCF layers
        self.layers = nn.ModuleList()
        prev_dim = embedding_dim
        for hidden_dim in hidden_dims:
            self.layers.append(NGCFConv(prev_dim, hidden_dim, dropout))
            prev_dim = hidden_dim

        self._init_weights()

    def _init_weights(self):
        nn.init.normal_(self.user_embedding.weight, std=0.1)
        nn.init.normal_(self.item_embedding.weight, std=0.1)

    def forward(self, adj: torch.Tensor):
        """Compute embeddings for all users and items."""
        # Concatenate user and item embeddings
        all_embeddings = torch.cat([
            self.user_embedding.weight,
            self.item_embedding.weight
        ], dim=0)

        embeddings_list = [all_embeddings]

        # Propagate through layers
        for layer in self.layers:
            all_embeddings = layer(all_embeddings, adj)
            embeddings_list.append(all_embeddings)

        # Layer combination: concatenate all layers
        final_embeddings = torch.cat(embeddings_list, dim=-1)

        # Split back to users and items
        user_embeds = final_embeddings[:self.num_users]
        item_embeds = final_embeddings[self.num_users:]
        return user_embeds, item_embeds


class NGCFConv(nn.Module):
    """Single NGCF convolution layer with bi-interaction."""
    def __init__(self, in_dim: int, out_dim: int, dropout: float = 0.1):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim, bias=True)
        self.W2 = nn.Linear(in_dim, out_dim, bias=True)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor, adj: torch.Tensor):
        # Aggregate neighbors
        neighbor_agg = torch.sparse.mm(adj, x)

        # NGCF message: includes element-wise product (bi-interaction)
        msg_1 = self.W1(neighbor_agg + x)
        msg_2 = self.W2(neighbor_agg * x)  # Element-wise product

        output = msg_1 + msg_2
        output = self.activation(output)
        output = self.dropout(output)
        return output
```

LightGCN (He et al., 2020) challenged the complexity of NGCF with a surprising finding: simpler is better for recommendation graphs.
Key Simplifications:

- No feature transformation matrices $W^{(l)}$
- No non-linear activations $\sigma$
- No self-connections; instead, embeddings from all layers are combined at the end
LightGCN Propagation:
$$\mathbf{e}_u^{(l+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|} \sqrt{|\mathcal{N}_i|}} \mathbf{e}_i^{(l)}$$
Final Embeddings:
Layer outputs are combined via weighted average: $$\mathbf{e}_u = \sum_{l=0}^{L} \alpha_l \mathbf{e}_u^{(l)}$$
where $\alpha_l = 1/(L+1)$ (uniform) works well in practice.
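The propagation and uniform layer averaging above can be sketched in a few tensor operations. This is a minimal sketch with a hand-built normalized adjacency for 2 users and 2 items (toy values, not a trained model):

```python
import torch

# Toy symmetrically normalized adjacency, node order [u0, u1, i0, i1].
# Edges: u0-i0, u0-i1, u1-i1; each entry is 1/sqrt(deg_u * deg_i).
A_norm = torch.tensor([
    [0.0,    0.0,    0.7071, 0.5],
    [0.0,    0.0,    0.0,    0.7071],
    [0.7071, 0.0,    0.0,    0.0],
    [0.5,    0.7071, 0.0,    0.0],
])

E0 = torch.randn(4, 8)        # initial embeddings: the only parameters
layers = [E0]
for _ in range(3):            # 3 layers: no weights, no activations
    layers.append(A_norm @ layers[-1])

# Final embedding: uniform average over layers, alpha_l = 1/(L+1)
E_final = torch.stack(layers).mean(dim=0)
```

Each propagation step is a single sparse-dense matrix product in the real model; the whole forward pass is linear in the embeddings, with all capacity in $\mathbf{E}^{(0)}$.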
Why This Works:
In bipartite recommendation graphs:

- Nodes have no input features beyond their IDs, so learned feature transformations have little signal to exploit
- The collaborative signal lives in the connectivity itself, and pure linear propagation captures it
- Fewer parameters mean easier optimization and less overfitting on sparse interaction data
```python
import torch
import torch.nn as nn


class LightGCN(nn.Module):
    """
    LightGCN: Simplified Graph Convolution for Recommendations.

    No weight matrices. No non-linearities. Just neighborhood averaging.
    Achieves state-of-the-art while being simpler and faster than NGCF.
    """
    def __init__(
        self,
        num_users: int,
        num_items: int,
        embedding_dim: int = 64,
        num_layers: int = 3
    ):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items
        self.num_layers = num_layers

        # Only learnable parameters: initial embeddings
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)
        self._init_weights()

    def _init_weights(self):
        # Xavier initialization for embeddings
        nn.init.xavier_uniform_(self.user_embedding.weight)
        nn.init.xavier_uniform_(self.item_embedding.weight)

    def forward(self, adj: torch.Tensor):
        """
        Compute LightGCN embeddings for all users and items.

        Args:
            adj: Symmetrically normalized adjacency matrix (sparse)
                 Shape: (num_users + num_items, num_users + num_items)

        Returns:
            user_embeds: (num_users, embedding_dim)
            item_embeds: (num_items, embedding_dim)
        """
        # Concatenate initial embeddings
        all_embeddings = torch.cat([
            self.user_embedding.weight,
            self.item_embedding.weight
        ], dim=0)

        # Store embeddings at each layer for layer combination
        embeddings_list = [all_embeddings]

        # LightGCN propagation: just multiply by normalized adjacency
        for _ in range(self.num_layers):
            all_embeddings = torch.sparse.mm(adj, all_embeddings)
            embeddings_list.append(all_embeddings)

        # Layer combination: simple average
        final_embeddings = torch.stack(embeddings_list, dim=0).mean(dim=0)

        # Split users and items
        user_embeds = final_embeddings[:self.num_users]
        item_embeds = final_embeddings[self.num_users:]
        return user_embeds, item_embeds

    def predict(self, user_ids: torch.Tensor, adj: torch.Tensor):
        """Predict scores for users with all items."""
        user_embeds, item_embeds = self.forward(adj)
        user_vecs = user_embeds[user_ids]    # (batch, dim)
        scores = user_vecs @ item_embeds.T   # (batch, num_items)
        return scores

    def bpr_loss(
        self,
        user_ids: torch.Tensor,
        pos_ids: torch.Tensor,
        neg_ids: torch.Tensor,
        adj: torch.Tensor,
        reg_weight: float = 1e-4
    ):
        """
        BPR loss for training LightGCN.

        Includes L2 regularization on initial embeddings only
        (not propagated embeddings - they're not parameters).
        """
        user_embeds, item_embeds = self.forward(adj)

        user_vecs = user_embeds[user_ids]
        pos_vecs = item_embeds[pos_ids]
        neg_vecs = item_embeds[neg_ids]

        # BPR loss
        pos_scores = (user_vecs * pos_vecs).sum(dim=1)
        neg_scores = (user_vecs * neg_vecs).sum(dim=1)
        bpr_loss = -torch.log(torch.sigmoid(pos_scores - neg_scores) + 1e-8)
        bpr_loss = bpr_loss.mean()

        # L2 regularization on initial embeddings
        reg_loss = (
            self.user_embedding.weight[user_ids].norm(2).pow(2) +
            self.item_embedding.weight[pos_ids].norm(2).pow(2) +
            self.item_embedding.weight[neg_ids].norm(2).pow(2)
        ) / user_ids.shape[0]

        return bpr_loss + reg_weight * reg_loss
```

LightGCN's success teaches an important lesson: understand your domain before adding complexity. In recommendation, the graph structure already carries signal—feature transformations and non-linearities can add noise rather than value.
Graph Attention Networks (GATs) learn which neighbors matter most rather than treating all equally.
Attention Mechanism:
$$\alpha_{uv} = \frac{\exp(\text{LeakyReLU}(\mathbf{a}^T[\mathbf{W}\mathbf{h}_u \,\|\, \mathbf{W}\mathbf{h}_v]))}{\sum_{k \in \mathcal{N}_u} \exp(\text{LeakyReLU}(\mathbf{a}^T[\mathbf{W}\mathbf{h}_u \,\|\, \mathbf{W}\mathbf{h}_k]))}$$
where $\alpha_{uv}$ is the attention weight from node $u$ to neighbor $v$.
Aggregation with Attention:
$$\mathbf{h}_u' = \sigma\left(\sum_{v \in \mathcal{N}_u} \alpha_{uv} \mathbf{W} \mathbf{h}_v\right)$$
For Recommendations: attention lets the model down-weight noisy interactions (accidental clicks, one-off purchases) and emphasize the neighbors most indicative of a user's preferences.
```python
import torch
import torch.nn as nn


class GATConvRec(nn.Module):
    """
    Graph Attention layer adapted for recommendations.
    """
    def __init__(
        self,
        in_dim: int,
        out_dim: int,
        num_heads: int = 4,
        dropout: float = 0.1
    ):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = out_dim // num_heads

        # Shared transformation
        self.W = nn.Linear(in_dim, out_dim, bias=False)

        # Attention parameters per head
        self.attention = nn.Parameter(
            torch.zeros(num_heads, 2 * self.head_dim)
        )
        nn.init.xavier_uniform_(self.attention)

        self.leaky_relu = nn.LeakyReLU(0.2)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,           # (num_nodes, in_dim)
        edge_index: torch.Tensor   # (2, num_edges)
    ):
        """GAT forward with multi-head attention."""
        num_nodes = x.size(0)

        # Transform features
        h = self.W(x)  # (num_nodes, out_dim)
        h = h.view(num_nodes, self.num_heads, self.head_dim)

        # Get source and target node features for edges
        src_idx, tgt_idx = edge_index
        h_src = h[src_idx]  # (num_edges, num_heads, head_dim)
        h_tgt = h[tgt_idx]

        # Compute attention scores
        edge_features = torch.cat([h_src, h_tgt], dim=-1)
        attn_scores = (edge_features * self.attention).sum(dim=-1)
        attn_scores = self.leaky_relu(attn_scores)

        # Softmax normalization over neighbors
        attn_weights = self._edge_softmax(attn_scores, tgt_idx, num_nodes)
        attn_weights = self.dropout(attn_weights)

        # Weighted aggregation
        weighted_src = h_src * attn_weights.unsqueeze(-1)

        # Scatter add to aggregate
        output = torch.zeros(num_nodes, self.num_heads, self.head_dim,
                             device=x.device)
        output.scatter_add_(0,
                            tgt_idx.view(-1, 1, 1).expand_as(weighted_src),
                            weighted_src)

        # Flatten heads
        return output.view(num_nodes, -1)

    def _edge_softmax(self, scores, tgt_idx, num_nodes):
        """Compute softmax over edges grouped by target node."""
        scores_max = torch.zeros(num_nodes, self.num_heads,
                                 device=scores.device)
        scores_max.scatter_reduce_(
            0, tgt_idx.view(-1, 1).expand(-1, self.num_heads),
            scores, reduce='amax'
        )
        scores = scores - scores_max[tgt_idx]
        exp_scores = scores.exp()

        exp_sum = torch.zeros(num_nodes, self.num_heads,
                              device=scores.device)
        exp_sum.scatter_add_(
            0, tgt_idx.view(-1, 1).expand(-1, self.num_heads),
            exp_scores
        )
        return exp_scores / (exp_sum[tgt_idx] + 1e-8)
```

Knowledge Graphs (KGs) enrich the recommendation graph with semantic relationships between items.
KG Structure:
Triples: $(h, r, t)$ - head entity, relation, tail entity, e.g. (Inception, directed_by, Christopher Nolan)
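The text doesn't prescribe a particular KG embedding model; as one hypothetical example, a translation-based scorer in the style of TransE treats a triple as plausible when $h + r \approx t$ in embedding space:

```python
import torch

# Illustrative sizes; entity/relation vocabularies come from the KG
num_entities, num_relations, dim = 100, 10, 32
entity_emb = torch.nn.Embedding(num_entities, dim)
relation_emb = torch.nn.Embedding(num_relations, dim)

def transe_score(h_idx, r_idx, t_idx):
    """TransE-style score: plausible triples have small ||h + r - t||."""
    h = entity_emb(h_idx)
    r = relation_emb(r_idx)
    t = entity_emb(t_idx)
    return -(h + r - t).norm(p=2, dim=-1)  # higher = more plausible

# Score a batch of triples, e.g. (movie, directed_by, person);
# the index values here are arbitrary placeholders
heads = torch.tensor([0, 1])
rels = torch.tensor([2, 2])
tails = torch.tensor([5, 6])
scores = transe_score(heads, rels, tails)  # shape (2,)
```

Methods like KGAT build on such entity/relation embeddings but propagate them attentively through the graph rather than scoring triples in isolation.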
Benefits for Recommendations:

- Alleviates sparsity and cold-start: items with few interactions still connect through shared entities
- Enables explainable recommendations via paths through the KG (e.g., "recommended because it shares a director")
- Captures item-item relations beyond simple co-occurrence
Integration Approaches:
| Method | Approach | Strengths | Limitations |
|---|---|---|---|
| KGAT | Attentive propagation over KG | End-to-end learning | Scalability |
| RippleNet | Preference propagation waves | Interpretable | Limited hop range |
| KGCN | GCN on item-side KG | Efficient | User-side missing |
| KGIN | Intent discovery + KG | User intent modeling | Complexity |
The value of KG-enhanced recommendations depends heavily on knowledge graph quality and coverage. Incomplete or noisy KGs can hurt rather than help. Always evaluate with and without KG to measure actual impact.
Production recommendation systems have millions of users and items. Full-graph GNN training becomes prohibitive.
Scalability Challenges:

- The full adjacency matrix and all node embeddings may not fit in memory
- Multi-hop neighborhoods grow exponentially with depth (neighborhood explosion)
- Full-graph propagation must be recomputed as new interactions arrive

Solutions:

- Neighbor sampling (GraphSAGE-style): sample fixed-size neighborhoods per layer, enabling mini-batch training
- Subgraph sampling (Cluster-GCN-style): partition the graph and train on one partition at a time
- Offline precomputation: compute embeddings in batch and serve retrieval with approximate nearest neighbor search
```python
import numpy as np


class NeighborSampler:
    """
    Neighbor sampling for scalable GNN training.

    Samples fixed-size neighborhoods per layer,
    enabling mini-batch training on large graphs.
    """
    def __init__(
        self,
        adj_list: dict,               # node_id -> list of neighbor ids
        num_samples: list = [10, 10]  # samples per layer
    ):
        self.adj_list = adj_list
        self.num_samples = num_samples

    def sample(self, target_nodes: list):
        """
        Sample multi-hop neighborhoods for target nodes.

        Returns:
            sampled_nodes: Nodes at each layer
            adjs: Adjacency info for each layer
        """
        all_sampled = [target_nodes]
        adjs = []
        current_nodes = target_nodes

        for layer_samples in self.num_samples:
            neighbors = set()
            edge_src, edge_dst = [], []

            for i, node in enumerate(current_nodes):
                node_neighbors = self.adj_list.get(node, [])
                if len(node_neighbors) > layer_samples:
                    sampled = np.random.choice(
                        node_neighbors, layer_samples, replace=False
                    )
                else:
                    sampled = node_neighbors

                for neighbor in sampled:
                    neighbors.add(neighbor)
                    edge_src.append(neighbor)
                    edge_dst.append(i)  # Index in current_nodes

            neighbors = list(neighbors)
            all_sampled.append(neighbors)
            adjs.append((edge_src, edge_dst, len(current_nodes)))
            current_nodes = neighbors

        return all_sampled, adjs
```

GNNs create rich embeddings but require graph access at inference time. Two-tower models separate user and item encoding for efficient retrieval. Next, we'll explore how this architecture powers recommendations at massive scale.