Recommendation systems are inherently graph-structured. Users connect to items through interactions, items connect to attributes, users connect through social networks. Yet traditional collaborative filtering treats these connections superficially—typically using only direct neighbors.
Graph Neural Networks (GNNs) unlock the power of multi-hop reasoning: a user's preferences propagate through the graph, enriched by neighbors-of-neighbors, shared attributes, and community structure. This enables the model to capture higher-order collaborative signals that direct-neighbor methods miss.
This page covers: (1) Modeling RecSys as bipartite graphs, (2) Graph convolution for collaborative filtering, (3) LightGCN architecture and its simplicity, (4) NGCF and feature transformation approaches, (5) Knowledge graph enhanced recommendations, and (6) Scalability strategies for large graphs.
The User-Item Bipartite Graph:
$$G = (\mathcal{U} \cup \mathcal{I}, \mathcal{E})$$
where $\mathcal{U}$ is the set of user nodes, $\mathcal{I}$ the set of item nodes, and $\mathcal{E}$ the set of interaction edges between them.
The adjacency matrix $A$ has structure:
$$A = \begin{pmatrix} 0 & R \\ R^T & 0 \end{pmatrix}$$
where $R$ is the user-item interaction matrix. The bipartite structure means edges run only between users and items: the diagonal blocks are zero, so there are no direct user-user or item-item connections.
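As a small worked example (assuming two users and two items, where user 1 interacted with both items and user 2 only with item 2):

$$R = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$

The upper-right block of $A$ is $R$, the lower-left block is $R^T$, and the zero diagonal blocks reflect the absence of user-user and item-item edges.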
Why Graphs for RecSys:
Graphs naturally capture collaborative signals. Similar users connect to similar items, forming communities. GNNs propagate information through these communities, learning embeddings that encode both direct preferences and neighborhood structure.
| Graph Element | RecSys Meaning | Example |
|---|---|---|
| User node | Individual user | User ID embedding |
| Item node | Item to recommend | Movie, product, article |
| User-Item edge | Interaction signal | Click, purchase, rating |
| Edge weight | Interaction strength | Rating value, frequency |
| 1-hop neighbors | Direct interactions | Items user clicked |
| 2-hop neighbors | Collaborative signal | Items similar users clicked |
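The 2-hop collaborative signal in the last row can be computed directly from the interaction matrix: $R R^T R$ counts, for each user-item pair, the number of length-3 paths (user → item → similar user → item). A minimal sketch with a toy interaction matrix (the matrix values are illustrative, not from the text):

```python
import numpy as np

# Toy interaction matrix: 3 users x 4 items (1 = interaction)
R = np.array([
    [1, 1, 0, 0],  # user 0
    [1, 0, 1, 0],  # user 1
    [0, 1, 1, 1],  # user 2
])

# R @ R.T counts co-interacted items between user pairs;
# multiplying by R again scores items by 2-hop reachability.
two_hop_scores = R @ R.T @ R

# Mask out items the user already interacted with before ranking
candidate_scores = np.where(R == 1, -np.inf, two_hop_scores)
top_item_for_user0 = int(np.argmax(candidate_scores[0]))  # item 2
```

Here user 0 is steered toward item 2 because both users who share items with user 0 also interacted with it; GNN layers generalize exactly this path-counting idea with learned embeddings.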
```python
import torch
import scipy.sparse as sp
import numpy as np


def build_bipartite_graph(
    interactions: np.ndarray,  # (num_interactions, 2) - [user_id, item_id]
    num_users: int,
    num_items: int
):
    """
    Construct the user-item bipartite graph adjacency matrix.
    Returns the raw adjacency A and interaction matrix R;
    normalize A separately for GCN-style propagation.
    """
    # Build sparse interaction matrix R
    user_ids = interactions[:, 0]
    item_ids = interactions[:, 1]
    data = np.ones(len(interactions))

    R = sp.csr_matrix(
        (data, (user_ids, item_ids)),
        shape=(num_users, num_items)
    )

    # Build full adjacency matrix A = [[0, R], [R^T, 0]]
    zero_u = sp.csr_matrix((num_users, num_users))
    zero_i = sp.csr_matrix((num_items, num_items))

    A = sp.bmat([
        [zero_u, R],
        [R.T, zero_i]
    ]).tocsr()

    return A, R


def normalize_adjacency(A: sp.csr_matrix, symmetric: bool = True):
    """
    Normalize adjacency matrix for GCN propagation.

    Options:
    - Symmetric: D^(-1/2) A D^(-1/2) (LightGCN, GCN)
    - Left-only: D^(-1) A (random walk normalization)
    """
    # Compute degree
    degrees = np.array(A.sum(axis=1)).flatten()
    degrees = np.maximum(degrees, 1)  # Avoid division by zero

    if symmetric:
        # D^(-1/2)
        d_inv_sqrt = np.power(degrees, -0.5)
        D_inv_sqrt = sp.diags(d_inv_sqrt)
        norm_A = D_inv_sqrt @ A @ D_inv_sqrt
    else:
        # D^(-1)
        d_inv = np.power(degrees, -1)
        D_inv = sp.diags(d_inv)
        norm_A = D_inv @ A

    return norm_A.tocsr()


def sparse_to_torch(sparse_mx: sp.csr_matrix):
    """Convert scipy sparse matrix to PyTorch sparse tensor."""
    sparse_mx = sparse_mx.tocoo()
    indices = torch.LongTensor(
        np.vstack([sparse_mx.row, sparse_mx.col])
    )
    values = torch.FloatTensor(sparse_mx.data)
    shape = torch.Size(sparse_mx.shape)
    return torch.sparse_coo_tensor(indices, values, shape)
```

Graph convolution aggregates information from neighbors to refine node embeddings. The core operation:
$$\mathbf{e}_u^{(l+1)} = \text{AGG}(\{\mathbf{e}_i^{(l)} : i \in \mathcal{N}_u\})$$
where $\mathcal{N}_u$ is the neighbor set of user $u$.
Standard GCN Propagation:
$$\mathbf{E}^{(l+1)} = \sigma(\tilde{A} \mathbf{E}^{(l)} W^{(l)})$$
where $\tilde{A}$ is the normalized adjacency matrix, $\mathbf{E}^{(l)}$ the matrix of node embeddings at layer $l$, $W^{(l)}$ a learnable weight matrix, and $\sigma$ a non-linear activation.
For Recommendations:
Each GCN layer propagates embeddings across user-item edges:
```python
import torch
import torch.nn as nn


class GCNConv(nn.Module):
    """
    Standard Graph Convolution layer for recommendations.
    Aggregates neighbor information with learned transformation.
    """
    def __init__(self, in_dim: int, out_dim: int, bias: bool = True):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.xavier_uniform_(self.linear.weight)
        if self.linear.bias is not None:
            nn.init.zeros_(self.linear.bias)

    def forward(
        self,
        x: torch.Tensor,    # Node features (num_nodes, in_dim)
        adj: torch.Tensor   # Normalized adjacency (sparse)
    ):
        """
        GCN forward: Aggregate then transform.
        h = A_norm @ X @ W
        """
        # Aggregate from neighbors
        support = torch.sparse.mm(adj, x)
        # Linear transformation
        output = self.linear(support)
        return output


class NGCF(nn.Module):
    """
    Neural Graph Collaborative Filtering.

    Key features:
    - Feature transformation at each layer
    - Interaction between representations (bi-interaction)
    - Message passing with element-wise products
    """
    def __init__(
        self,
        num_users: int,
        num_items: int,
        embedding_dim: int = 64,
        hidden_dims: list = [64, 64, 64],
        dropout: float = 0.1
    ):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items

        # Initial embeddings
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)

        # NGCF layers
        self.layers = nn.ModuleList()
        prev_dim = embedding_dim
        for hidden_dim in hidden_dims:
            self.layers.append(NGCFConv(prev_dim, hidden_dim, dropout))
            prev_dim = hidden_dim

        self._init_weights()

    def _init_weights(self):
        nn.init.normal_(self.user_embedding.weight, std=0.1)
        nn.init.normal_(self.item_embedding.weight, std=0.1)

    def forward(self, adj: torch.Tensor):
        """Compute embeddings for all users and items."""
        # Concatenate user and item embeddings
        all_embeddings = torch.cat([
            self.user_embedding.weight,
            self.item_embedding.weight
        ], dim=0)

        embeddings_list = [all_embeddings]

        # Propagate through layers
        for layer in self.layers:
            all_embeddings = layer(all_embeddings, adj)
            embeddings_list.append(all_embeddings)

        # Layer combination: concatenate all layers
        final_embeddings = torch.cat(embeddings_list, dim=-1)

        # Split back to users and items
        user_embeds = final_embeddings[:self.num_users]
        item_embeds = final_embeddings[self.num_users:]
        return user_embeds, item_embeds


class NGCFConv(nn.Module):
    """Single NGCF convolution layer with bi-interaction."""
    def __init__(self, in_dim: int, out_dim: int, dropout: float = 0.1):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim, bias=True)
        self.W2 = nn.Linear(in_dim, out_dim, bias=True)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor, adj: torch.Tensor):
        # Aggregate neighbors
        neighbor_agg = torch.sparse.mm(adj, x)

        # NGCF message: includes element-wise product (bi-interaction)
        msg_1 = self.W1(neighbor_agg + x)
        msg_2 = self.W2(neighbor_agg * x)  # Element-wise product

        output = msg_1 + msg_2
        output = self.activation(output)
        output = self.dropout(output)
        return output
```

LightGCN (He et al., 2020) challenged the complexity of NGCF with a surprising finding: simpler is better for recommendation graphs.
Key Simplifications:

- No feature transformation matrices $W^{(l)}$
- No non-linear activations $\sigma$
- No self-connections; instead, embeddings from all layers are combined at the end
LightGCN Propagation:
$$\mathbf{e}_u^{(l+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|} \sqrt{|\mathcal{N}_i|}} \mathbf{e}_i^{(l)}$$
Final Embeddings:
Layer outputs are combined via weighted average: $$\mathbf{e}_u = \sum_{l=0}^{L} \alpha_l \mathbf{e}_u^{(l)}$$
where $\alpha_l = 1/(L+1)$ (uniform) works well in practice.
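The propagation and uniform layer averaging above can be sketched in a few tensor operations. This is a minimal sketch with a hand-built normalized adjacency for 2 users and 2 items (toy values, not a trained model):

```python
import torch

# Toy symmetrically normalized adjacency, node order [u0, u1, i0, i1].
# Edges: u0-i0, u0-i1, u1-i1; each entry is 1/sqrt(deg_u * deg_i).
A_norm = torch.tensor([
    [0.0,    0.0,    0.7071, 0.5],
    [0.0,    0.0,    0.0,    0.7071],
    [0.7071, 0.0,    0.0,    0.0],
    [0.5,    0.7071, 0.0,    0.0],
])

E0 = torch.randn(4, 8)        # initial embeddings: the only parameters
layers = [E0]
for _ in range(3):            # 3 layers: no weights, no activations
    layers.append(A_norm @ layers[-1])

# Final embedding: uniform average over layers, alpha_l = 1/(L+1)
E_final = torch.stack(layers).mean(dim=0)
```

Each propagation step is a single sparse-dense matrix product in the real model; the whole forward pass is linear in the embeddings, with all capacity in $\mathbf{E}^{(0)}$.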
Why This Works:
In bipartite recommendation graphs:

- Nodes have no input features beyond their IDs, so learned feature transformations have little signal to exploit
- The collaborative signal lives in the connectivity itself, and pure linear propagation captures it
- Fewer parameters mean easier optimization and less overfitting on sparse interaction data
```python
import torch
import torch.nn as nn


class LightGCN(nn.Module):
    """
    LightGCN: Simplified Graph Convolution for Recommendations.

    No weight matrices. No non-linearities. Just neighborhood averaging.
    Achieves state-of-the-art while being simpler and faster than NGCF.
    """
    def __init__(
        self,
        num_users: int,
        num_items: int,
        embedding_dim: int = 64,
        num_layers: int = 3
    ):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items
        self.num_layers = num_layers

        # Only learnable parameters: initial embeddings
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)
        self._init_weights()

    def _init_weights(self):
        # Xavier initialization for embeddings
        nn.init.xavier_uniform_(self.user_embedding.weight)
        nn.init.xavier_uniform_(self.item_embedding.weight)

    def forward(self, adj: torch.Tensor):
        """
        Compute LightGCN embeddings for all users and items.

        Args:
            adj: Symmetrically normalized adjacency matrix (sparse)
                 Shape: (num_users + num_items, num_users + num_items)

        Returns:
            user_embeds: (num_users, embedding_dim)
            item_embeds: (num_items, embedding_dim)
        """
        # Concatenate initial embeddings
        all_embeddings = torch.cat([
            self.user_embedding.weight,
            self.item_embedding.weight
        ], dim=0)

        # Store embeddings at each layer for layer combination
        embeddings_list = [all_embeddings]

        # LightGCN propagation: just multiply by normalized adjacency
        for _ in range(self.num_layers):
            all_embeddings = torch.sparse.mm(adj, all_embeddings)
            embeddings_list.append(all_embeddings)

        # Layer combination: simple average
        final_embeddings = torch.stack(embeddings_list, dim=0).mean(dim=0)

        # Split users and items
        user_embeds = final_embeddings[:self.num_users]
        item_embeds = final_embeddings[self.num_users:]
        return user_embeds, item_embeds

    def predict(self, user_ids: torch.Tensor, adj: torch.Tensor):
        """Predict scores for users with all items."""
        user_embeds, item_embeds = self.forward(adj)
        user_vecs = user_embeds[user_ids]    # (batch, dim)
        scores = user_vecs @ item_embeds.T   # (batch, num_items)
        return scores

    def bpr_loss(
        self,
        user_ids: torch.Tensor,
        pos_ids: torch.Tensor,
        neg_ids: torch.Tensor,
        adj: torch.Tensor,
        reg_weight: float = 1e-4
    ):
        """
        BPR loss for training LightGCN.

        Includes L2 regularization on initial embeddings only
        (not propagated embeddings - they're not parameters).
        """
        user_embeds, item_embeds = self.forward(adj)

        user_vecs = user_embeds[user_ids]
        pos_vecs = item_embeds[pos_ids]
        neg_vecs = item_embeds[neg_ids]

        # BPR loss
        pos_scores = (user_vecs * pos_vecs).sum(dim=1)
        neg_scores = (user_vecs * neg_vecs).sum(dim=1)
        bpr_loss = -torch.log(torch.sigmoid(pos_scores - neg_scores) + 1e-8)
        bpr_loss = bpr_loss.mean()

        # L2 regularization on initial embeddings
        reg_loss = (
            self.user_embedding.weight[user_ids].norm(2).pow(2) +
            self.item_embedding.weight[pos_ids].norm(2).pow(2) +
            self.item_embedding.weight[neg_ids].norm(2).pow(2)
        ) / user_ids.shape[0]

        return bpr_loss + reg_weight * reg_loss
```

LightGCN's success teaches an important lesson: understand your domain before adding complexity. In recommendation, the graph structure already carries signal—feature transformations and non-linearities can add noise rather than value.
Graph Attention Networks (GATs) learn which neighbors matter most rather than treating all equally.
Attention Mechanism:
$$\alpha_{uv} = \frac{\exp(\text{LeakyReLU}(\mathbf{a}^T[\mathbf{W}\mathbf{h}_u \,\|\, \mathbf{W}\mathbf{h}_v]))}{\sum_{k \in \mathcal{N}_u} \exp(\text{LeakyReLU}(\mathbf{a}^T[\mathbf{W}\mathbf{h}_u \,\|\, \mathbf{W}\mathbf{h}_k]))}$$
where $\alpha_{uv}$ is the attention weight from node $u$ to neighbor $v$.
Aggregation with Attention:
$$\mathbf{h}_u' = \sigma\left(\sum_{v \in \mathcal{N}_u} \alpha_{uv} \mathbf{W} \mathbf{h}_v\right)$$
For Recommendations: attention lets the model down-weight noisy interactions (accidental clicks, one-off purchases) and emphasize the neighbors most indicative of a user's preferences.
```python
import torch
import torch.nn as nn


class GATConvRec(nn.Module):
    """
    Graph Attention layer adapted for recommendations.
    """
    def __init__(
        self,
        in_dim: int,
        out_dim: int,
        num_heads: int = 4,
        dropout: float = 0.1
    ):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = out_dim // num_heads

        # Shared transformation
        self.W = nn.Linear(in_dim, out_dim, bias=False)

        # Attention parameters per head
        self.attention = nn.Parameter(
            torch.zeros(num_heads, 2 * self.head_dim)
        )
        nn.init.xavier_uniform_(self.attention)

        self.leaky_relu = nn.LeakyReLU(0.2)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,           # (num_nodes, in_dim)
        edge_index: torch.Tensor   # (2, num_edges)
    ):
        """GAT forward with multi-head attention."""
        num_nodes = x.size(0)

        # Transform features
        h = self.W(x)  # (num_nodes, out_dim)
        h = h.view(num_nodes, self.num_heads, self.head_dim)

        # Get source and target node features for edges
        src_idx, tgt_idx = edge_index
        h_src = h[src_idx]  # (num_edges, num_heads, head_dim)
        h_tgt = h[tgt_idx]

        # Compute attention scores
        edge_features = torch.cat([h_src, h_tgt], dim=-1)
        attn_scores = (edge_features * self.attention).sum(dim=-1)
        attn_scores = self.leaky_relu(attn_scores)

        # Softmax normalization over neighbors
        attn_weights = self._edge_softmax(attn_scores, tgt_idx, num_nodes)
        attn_weights = self.dropout(attn_weights)

        # Weighted aggregation
        weighted_src = h_src * attn_weights.unsqueeze(-1)

        # Scatter add to aggregate
        output = torch.zeros(num_nodes, self.num_heads, self.head_dim,
                             device=x.device)
        output.scatter_add_(0,
                            tgt_idx.view(-1, 1, 1).expand_as(weighted_src),
                            weighted_src)

        # Flatten heads
        return output.view(num_nodes, -1)

    def _edge_softmax(self, scores, tgt_idx, num_nodes):
        """Compute softmax over edges grouped by target node."""
        scores_max = torch.zeros(num_nodes, self.num_heads,
                                 device=scores.device)
        scores_max.scatter_reduce_(
            0, tgt_idx.view(-1, 1).expand(-1, self.num_heads),
            scores, reduce='amax'
        )
        scores = scores - scores_max[tgt_idx]
        exp_scores = scores.exp()

        exp_sum = torch.zeros(num_nodes, self.num_heads,
                              device=scores.device)
        exp_sum.scatter_add_(
            0, tgt_idx.view(-1, 1).expand(-1, self.num_heads),
            exp_scores
        )
        return exp_scores / (exp_sum[tgt_idx] + 1e-8)
```

Knowledge Graphs (KGs) enrich the recommendation graph with semantic relationships between items.
KG Structure:
Triples: $(h, r, t)$ - head entity, relation, tail entity, e.g. (Inception, directed_by, Christopher Nolan)
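The text doesn't prescribe a particular KG embedding model; as one hypothetical example, a translation-based scorer in the style of TransE treats a triple as plausible when $h + r \approx t$ in embedding space:

```python
import torch

# Illustrative sizes; entity/relation vocabularies come from the KG
num_entities, num_relations, dim = 100, 10, 32
entity_emb = torch.nn.Embedding(num_entities, dim)
relation_emb = torch.nn.Embedding(num_relations, dim)

def transe_score(h_idx, r_idx, t_idx):
    """TransE-style score: plausible triples have small ||h + r - t||."""
    h = entity_emb(h_idx)
    r = relation_emb(r_idx)
    t = entity_emb(t_idx)
    return -(h + r - t).norm(p=2, dim=-1)  # higher = more plausible

# Score a batch of triples, e.g. (movie, directed_by, person);
# the index values here are arbitrary placeholders
heads = torch.tensor([0, 1])
rels = torch.tensor([2, 2])
tails = torch.tensor([5, 6])
scores = transe_score(heads, rels, tails)  # shape (2,)
```

Methods like KGAT build on such entity/relation embeddings but propagate them attentively through the graph rather than scoring triples in isolation.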
Benefits for Recommendations:

- Alleviates sparsity and cold-start: items with few interactions still connect through shared entities
- Enables explainable recommendations via paths through the KG (e.g., "recommended because it shares a director")
- Captures item-item relations beyond simple co-occurrence
Integration Approaches:
| Method | Approach | Strengths | Limitations |
|---|---|---|---|
| KGAT | Attentive propagation over KG | End-to-end learning | Scalability |
| RippleNet | Preference propagation waves | Interpretable | Limited hop range |
| KGCN | GCN on item-side KG | Efficient | User-side missing |
| KGIN | Intent discovery + KG | User intent modeling | Complexity |
The value of KG-enhanced recommendations depends heavily on knowledge graph quality and coverage. Incomplete or noisy KGs can hurt rather than help. Always evaluate with and without KG to measure actual impact.
Production recommendation systems have millions of users and items. Full-graph GNN training becomes prohibitive.
Scalability Challenges:

- The full adjacency matrix and all node embeddings may not fit in memory
- Multi-hop neighborhoods grow exponentially with depth (neighborhood explosion)
- Full-graph propagation must be recomputed as new interactions arrive

Solutions:

- Neighbor sampling (GraphSAGE-style): sample fixed-size neighborhoods per layer, enabling mini-batch training
- Subgraph sampling (Cluster-GCN-style): partition the graph and train on one partition at a time
- Offline precomputation: compute embeddings in batch and serve retrieval with approximate nearest neighbor search
```python
import numpy as np


class NeighborSampler:
    """
    Neighbor sampling for scalable GNN training.

    Samples fixed-size neighborhoods per layer,
    enabling mini-batch training on large graphs.
    """
    def __init__(
        self,
        adj_list: dict,               # node_id -> list of neighbor ids
        num_samples: list = [10, 10]  # samples per layer
    ):
        self.adj_list = adj_list
        self.num_samples = num_samples

    def sample(self, target_nodes: list):
        """
        Sample multi-hop neighborhoods for target nodes.

        Returns:
            sampled_nodes: Nodes at each layer
            adjs: Adjacency info for each layer
        """
        all_sampled = [target_nodes]
        adjs = []
        current_nodes = target_nodes

        for layer_samples in self.num_samples:
            neighbors = set()
            edge_src, edge_dst = [], []

            for i, node in enumerate(current_nodes):
                node_neighbors = self.adj_list.get(node, [])
                if len(node_neighbors) > layer_samples:
                    sampled = np.random.choice(
                        node_neighbors, layer_samples, replace=False
                    )
                else:
                    sampled = node_neighbors

                for neighbor in sampled:
                    neighbors.add(neighbor)
                    edge_src.append(neighbor)
                    edge_dst.append(i)  # Index in current_nodes

            neighbors = list(neighbors)
            all_sampled.append(neighbors)
            adjs.append((edge_src, edge_dst, len(current_nodes)))
            current_nodes = neighbors

        return all_sampled, adjs
```

GNNs create rich embeddings but require graph access at inference time. Two-tower models separate user and item encoding for efficient retrieval. Next, we'll explore how this architecture powers recommendations at massive scale.