Traditional collaborative filtering treats user interactions as unordered sets: the model sees that a user liked items A, B, and C, but ignores when and in what order those interactions occurred. Yet temporal patterns are everywhere: what a user wants next depends on what they just did, on the current session's context, and on how their tastes have drifted over time.
Sequence models capture these temporal dynamics by modeling user behavior as ordered sequences and learning to predict the next interaction.
This page covers: (1) The shift from static to sequential recommendation, (2) GRU4Rec and session-based recommendations, (3) Self-attention and SASRec architecture, (4) BERT4Rec and bidirectional modeling, (5) Handling variable-length sequences and position encoding.
Problem Formulation:
Given a user's historical interaction sequence: $$S_u = [s_1, s_2, ..., s_t]$$
where $s_i$ is the item interacted with at time step $i$, predict the next item $s_{t+1}$.
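To make this concrete, here is a minimal preprocessing sketch (the `make_next_item_examples` helper is an illustrative assumption, not from a specific library) that expands one interaction sequence into (prefix, next-item) training pairs:

```python
import torch

def make_next_item_examples(sequence, max_len: int = 50, pad_id: int = 0):
    """
    Hypothetical helper: expand one sequence [s_1, ..., s_t] into
    (prefix, next-item) pairs. Each prefix [s_1, ..., s_k] is left-padded
    to max_len and labeled with s_{k+1}.
    """
    inputs, labels = [], []
    for k in range(1, len(sequence)):
        prefix = sequence[max(0, k - max_len):k]
        inputs.append([pad_id] * (max_len - len(prefix)) + list(prefix))
        labels.append(sequence[k])
    return torch.tensor(inputs), torch.tensor(labels)

# A sequence of 5 item IDs yields 4 training examples:
# the first is X[0] == [0, 0, 0, 0, 3] with label y[0] == 7
X, y = make_next_item_examples([3, 7, 7, 2, 9], max_len=5)
```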
Key Characteristics: interactions are ordered in time, the training signal is a sequence with next-item labels, and the prediction target is the next item rather than general affinity to any item.
Comparison with Static CF:
| Aspect | Static CF | Sequential CF |
|---|---|---|
| User representation | Single embedding vector | Sequence of interactions |
| Temporal modeling | None (or decay heuristics) | Learned from data |
| Context sensitivity | Fixed preferences | Dynamic, context-aware |
| Training signal | User-item pairs | Sequences with next-item labels |
| Prediction target | Affinity to any item | Next item in sequence |
| Model architecture | MLPs, matrix factorization | RNNs, Transformers |
Session-Based vs User-Based Sequential:
Session-based is common in e-commerce where many users aren't logged in. User-based suits platforms with strong user identity (Netflix, Spotify).
GRU4Rec (Hidasi et al., 2015) pioneered deep learning for session-based recommendations using Gated Recurrent Units (GRUs).
Architecture: an item embedding layer feeds one or more GRU layers, and an output layer projects the hidden state to scores over the item catalog.
The GRU Advantage:
GRUs maintain a hidden state $\mathbf{h}_t$ that summarizes the session so far: $$\mathbf{h}_t = \mathrm{GRU}(\mathbf{x}_t, \mathbf{h}_{t-1})$$
This hidden state captures the evolving intent of the session and is used to score candidate next items:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRU4Rec(nn.Module):
    """
    GRU4Rec: Session-based Recommendations with RNNs.

    Architecture:
    - Item embedding layer
    - Multi-layer GRU
    - Output layer projecting to item scores
    """

    def __init__(
        self,
        num_items: int,
        embedding_dim: int = 64,
        hidden_dim: int = 100,
        num_layers: int = 1,
        dropout: float = 0.25
    ):
        super().__init__()
        self.num_items = num_items
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # Item embeddings
        self.item_embedding = nn.Embedding(
            num_items, embedding_dim, padding_idx=0
        )

        # GRU layers
        self.gru = nn.GRU(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Output projection
        self.output = nn.Linear(hidden_dim, num_items)

        self._init_weights()

    def _init_weights(self):
        nn.init.xavier_uniform_(self.item_embedding.weight)
        nn.init.xavier_uniform_(self.output.weight)

    def forward(
        self,
        item_seq: torch.Tensor,          # (batch, seq_len)
        seq_lengths: torch.Tensor = None
    ):
        """
        Args:
            item_seq: Sequence of item IDs
            seq_lengths: Actual lengths (for variable-length seqs)

        Returns:
            Scores for all items based on final hidden state
        """
        # Embed items
        x = self.item_embedding(item_seq)  # (batch, seq_len, embed_dim)

        # Pack sequence if lengths provided (for efficiency)
        if seq_lengths is not None:
            x = nn.utils.rnn.pack_padded_sequence(
                x, seq_lengths.cpu(), batch_first=True, enforce_sorted=False
            )

        # GRU forward pass
        output, hidden = self.gru(x)

        # Use final hidden state for prediction
        # hidden shape: (num_layers, batch, hidden_dim)
        final_hidden = hidden[-1]  # (batch, hidden_dim)

        # Project to item scores
        scores = self.output(final_hidden)  # (batch, num_items)

        return scores

    def predict_next(self, session: list, top_k: int = 10):
        """Predict top-k next items for a session."""
        self.eval()
        with torch.no_grad():
            item_seq = torch.tensor([session]).long()
            scores = self.forward(item_seq)
            probs = F.softmax(scores, dim=-1)
            top_scores, top_items = probs.topk(top_k)
        return top_items[0].tolist(), top_scores[0].tolist()
```

GRU4Rec introduced session-parallel mini-batching: instead of processing full sessions sequentially, it processes the same position across multiple sessions in parallel. This dramatically improves training efficiency on GPUs.
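To make the idea concrete, below is a minimal sketch of session-parallel mini-batching, assuming sessions are plain lists of item IDs with at least two items each; the generator name and reset mask are illustrative, not part of the original GRU4Rec code. The caller would zero the hidden-state columns flagged by `reset_mask` before each GRU step, and the batch shrinks once the pool of sessions runs out.

```python
import torch

def session_parallel_batches(sessions, batch_size):
    """
    Sketch of GRU4Rec-style session-parallel mini-batching.

    `sessions` is a list of item-ID lists (each of length >= 2). Every step
    yields (inputs, targets, reset_mask): the current item and the next item
    for each active session, plus a boolean mask marking slots that just
    switched to a new session (their hidden state should be reset to zero).
    """
    active = list(range(min(batch_size, len(sessions))))  # session index per slot
    cursor = [0] * len(active)                            # position inside each session
    reset = [True] * len(active)                          # every slot starts fresh
    next_session = len(active)                            # next unseen session

    while active:
        inputs = [sessions[s][c] for s, c in zip(active, cursor)]
        targets = [sessions[s][c + 1] for s, c in zip(active, cursor)]
        yield (torch.tensor(inputs), torch.tensor(targets),
               torch.tensor(reset, dtype=torch.bool))

        reset = [False] * len(active)
        for i in range(len(active) - 1, -1, -1):   # iterate backwards so deletion is safe
            cursor[i] += 1
            if cursor[i] + 1 >= len(sessions[active[i]]):  # session exhausted
                if next_session < len(sessions):
                    active[i], cursor[i], reset[i] = next_session, 0, True
                    next_session += 1
                else:
                    del active[i], cursor[i], reset[i]
```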
Sequential recommendation introduces unique loss function considerations:
1. Cross-Entropy (Softmax) Loss:
$$\mathcal{L}_{CE} = -\log \frac{\exp(s_{pos})}{\sum_{j} \exp(s_j)}$$
Computes probability over all items. Computationally expensive for large catalogs.
2. BPR (Bayesian Personalized Ranking) Loss:
$$\mathcal{L}_{BPR} = -\log \sigma(s_{pos} - s_{neg})$$
Pairwise loss comparing positive vs sampled negative. More efficient, focuses on ranking.
3. TOP1 Loss (GRU4Rec original):
$$\mathcal{L}_{TOP1} = \frac{1}{N_s} \sum_{j=1}^{N_s} \left[ \sigma(s_j - s_{pos}) + \sigma(s_j^2) \right]$$
Regularized pairwise loss designed for session-based recommendations.
```python
def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor):
    """
    BPR Loss: Bayesian Personalized Ranking.

    Maximizes the difference between positive and negative item scores.

    Args:
        pos_scores: Scores for positive (next) items, shape (batch,)
        neg_scores: Scores for sampled negative items,
                    shape (batch,) or (batch, num_negatives)
    """
    if neg_scores.dim() == 2:
        # Multiple negatives: average over them
        pos_scores = pos_scores.unsqueeze(1)   # (batch, 1)
        diff = pos_scores - neg_scores         # (batch, num_negatives)
        loss = -torch.log(torch.sigmoid(diff) + 1e-8).mean()
    else:
        diff = pos_scores - neg_scores
        loss = -torch.log(torch.sigmoid(diff) + 1e-8).mean()
    return loss


def sampled_softmax_loss(
    scores: torch.Tensor,
    targets: torch.Tensor,
    num_negatives: int = 100
):
    """
    Sampled Softmax: Approximate cross-entropy with negative sampling.

    Instead of computing softmax over all items, sample negatives.
    Much faster for large item catalogs.
    """
    batch_size, num_items = scores.shape
    device = scores.device

    # Get positive scores
    pos_scores = scores.gather(1, targets.unsqueeze(1)).squeeze(1)

    # Sample negative item indices (uniform random)
    neg_indices = torch.randint(
        0, num_items, (batch_size, num_negatives), device=device
    )
    neg_scores = scores.gather(1, neg_indices)  # (batch, num_negatives)

    # Compute log-softmax approximation
    # log p(pos) ≈ pos_score - log(exp(pos_score) + sum(exp(neg_scores)))
    all_scores = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    log_softmax = F.log_softmax(all_scores, dim=1)

    # Loss is negative log probability of positive item (index 0)
    loss = -log_softmax[:, 0].mean()
    return loss


def top1_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor):
    """
    TOP1 Loss from original GRU4Rec paper.

    Combines regularization with ranking objective.
    """
    diff = neg_scores - pos_scores.unsqueeze(1)
    loss = torch.sigmoid(diff).mean() + torch.sigmoid(neg_scores ** 2).mean()
    return loss
```

SASRec (Kang & McAuley, 2018) applies Transformer self-attention to sequential recommendation, achieving superior results over RNN-based methods.
Why Self-Attention for Sequences: every prediction can attend directly to any earlier item instead of squeezing the history through a single recurrent hidden state, long-range dependencies are captured explicitly, and training parallelizes across positions.
Architecture Overview: item embeddings plus learned position embeddings feed a stack of causal self-attention blocks; next-item scores are dot products between the output representations and the (tied) item embedding table.
```python
class SASRec(nn.Module):
    """
    Self-Attentive Sequential Recommendation (SASRec).

    Uses causal (left-to-right) self-attention to model sequences.
    Position t can only attend to positions 1...t.
    """

    def __init__(
        self,
        num_items: int,
        max_seq_len: int = 50,
        embedding_dim: int = 64,
        num_heads: int = 2,
        num_layers: int = 2,
        dropout: float = 0.2
    ):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.embedding_dim = embedding_dim

        # Embeddings (index 0 reserved for padding)
        self.item_embedding = nn.Embedding(
            num_items + 1, embedding_dim, padding_idx=0
        )
        self.position_embedding = nn.Embedding(max_seq_len, embedding_dim)

        # Transformer blocks
        self.attention_blocks = nn.ModuleList([
            TransformerBlock(embedding_dim, num_heads, dropout)
            for _ in range(num_layers)
        ])

        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.dropout = nn.Dropout(dropout)

        # Output projection (tied with item embeddings)
        self.output_bias = nn.Parameter(torch.zeros(num_items + 1))

        self._init_weights()

    def _init_weights(self):
        nn.init.xavier_uniform_(self.item_embedding.weight)
        nn.init.xavier_uniform_(self.position_embedding.weight)

    def forward(self, item_seq: torch.Tensor):
        """
        Args:
            item_seq: (batch, seq_len) item IDs, 0 for padding

        Returns:
            scores: (batch, seq_len, num_items + 1) predictions at each position
        """
        batch_size, seq_len = item_seq.shape
        device = item_seq.device

        # Create attention mask (causal + padding)
        # Causal: position t can only see positions <= t
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, device=device), diagonal=1
        ).bool()

        # Padding mask: don't attend to padding tokens
        padding_mask = (item_seq == 0)

        # Get embeddings
        item_emb = self.item_embedding(item_seq)  # (batch, seq, dim)

        # Positional embeddings
        positions = torch.arange(seq_len, device=device).unsqueeze(0)
        pos_emb = self.position_embedding(positions)  # (1, seq, dim)

        # Combine
        x = item_emb + pos_emb
        x = self.dropout(x)

        # Apply transformer blocks
        for block in self.attention_blocks:
            x = block(x, causal_mask, padding_mask)

        x = self.layer_norm(x)

        # Compute scores via dot product with item embeddings
        # (batch, seq, dim) @ (dim, num_items + 1) -> (batch, seq, num_items + 1)
        scores = x @ self.item_embedding.weight.T + self.output_bias

        return scores

    def predict_next(self, item_seq: torch.Tensor, top_k: int = 10):
        """Predict next item based on sequence."""
        self.eval()
        with torch.no_grad():
            scores = self.forward(item_seq)
            # Use last position's prediction
            last_scores = scores[:, -1, :]
            probs = F.softmax(last_scores, dim=-1)
            return probs.topk(top_k)


class TransformerBlock(nn.Module):
    """Single transformer block with causal self-attention."""

    def __init__(self, dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            dim, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * 4, dim),
            nn.Dropout(dropout)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, causal_mask, padding_mask):
        # Self-attention with residual
        attn_out, _ = self.attention(
            x, x, x,
            attn_mask=causal_mask,
            key_padding_mask=padding_mask
        )
        x = self.norm1(x + self.dropout(attn_out))

        # FFN with residual
        x = self.norm2(x + self.ffn(x))
        return x
```

Unlike language models that might use bidirectional attention, SASRec must use causal (left-to-right) masking. At prediction time, we only know items 1...t and must predict t+1. Bidirectional attention would leak future information during training.
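A minimal training-step sketch for such a model is shown below. It is an assumption rather than the paper's exact procedure: SASRec originally optimizes a sampled binary cross-entropy with one negative item per position, while this sketch uses a full-softmax cross-entropy for brevity; `sasrec_training_step` is an illustrative helper name.

```python
import torch
import torch.nn.functional as F

def sasrec_training_step(model: SASRec, item_seq: torch.Tensor, optimizer) -> float:
    """
    One training step for SASRec (hypothetical helper).

    At every position t the model predicts the item at position t+1, so the
    inputs are item_seq[:, :-1] and the targets are item_seq[:, 1:].
    Positions whose target is padding (0) are ignored.
    """
    model.train()
    inputs, targets = item_seq[:, :-1], item_seq[:, 1:]

    scores = model(inputs)                    # (batch, seq_len - 1, num_items + 1)
    loss = F.cross_entropy(
        scores.reshape(-1, scores.size(-1)),  # flatten all positions
        targets.reshape(-1),
        ignore_index=0                        # skip padded targets
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```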
BERT4Rec (Sun et al., 2019) adapts BERT's masked language modeling to sequential recommendation.
Key Insight:
While SASRec uses causal attention for next-item prediction, BERT4Rec uses masked item prediction with bidirectional attention: randomly chosen items in the sequence are replaced by a [MASK] token, and the model reconstructs them using both left and right context.
Advantages of Bidirectional: each position can use both earlier and later context, and masking several positions per sequence yields multiple training signals instead of only the final item.
The Cloze Task:
$$\mathcal{L} = \sum_{m \in M} -\log P(s_m | S \setminus M)$$
where $M$ is the set of masked positions.
```python
class BERT4Rec(nn.Module):
    """
    BERT4Rec: Bidirectional encoder for sequential recommendation.

    Uses masked item prediction (cloze task) for training.
    Bidirectional attention allows each position to see all others.
    """

    def __init__(
        self,
        num_items: int,
        max_seq_len: int = 50,
        embedding_dim: int = 64,
        num_heads: int = 2,
        num_layers: int = 2,
        dropout: float = 0.1,
        mask_prob: float = 0.2
    ):
        super().__init__()
        self.mask_prob = mask_prob
        self.mask_token = num_items + 1  # Special [MASK] token

        # Embeddings (mask token is num_items + 1)
        self.item_embedding = nn.Embedding(
            num_items + 2, embedding_dim, padding_idx=0
        )
        self.position_embedding = nn.Embedding(max_seq_len, embedding_dim)

        # BERT encoder layers (bidirectional, no causal mask)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,
            nhead=num_heads,
            dim_feedforward=embedding_dim * 4,
            dropout=dropout,
            activation='gelu',
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )

        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.dropout = nn.Dropout(dropout)

        # Output head
        self.output = nn.Linear(embedding_dim, num_items + 1)

    def mask_sequence(self, item_seq: torch.Tensor):
        """
        Apply random masking to sequence for training.

        Returns:
            masked_seq: Sequence with some items replaced by [MASK]
            labels: Original items at masked positions, -100 elsewhere
        """
        device = item_seq.device
        batch_size, seq_len = item_seq.shape

        # Don't mask padding (item_id = 0)
        maskable = item_seq != 0

        # Random mask selection
        mask = torch.rand(batch_size, seq_len, device=device) < self.mask_prob
        mask = mask & maskable

        # Create masked sequence
        masked_seq = item_seq.clone()
        masked_seq[mask] = self.mask_token

        # Labels: original items at masked positions, -100 elsewhere
        labels = torch.full_like(item_seq, -100)
        labels[mask] = item_seq[mask]

        return masked_seq, labels

    def forward(self, item_seq: torch.Tensor):
        """
        Forward pass through BERT encoder.

        Args:
            item_seq: (batch, seq_len) - may contain [MASK] tokens

        Returns:
            logits: (batch, seq_len, num_items + 1) predictions at each position
        """
        batch_size, seq_len = item_seq.shape
        device = item_seq.device

        # Padding mask (for attention)
        padding_mask = (item_seq == 0)

        # Embeddings
        item_emb = self.item_embedding(item_seq)
        positions = torch.arange(seq_len, device=device).unsqueeze(0)
        pos_emb = self.position_embedding(positions)

        x = self.dropout(self.layer_norm(item_emb + pos_emb))

        # BERT encoding (bidirectional - no causal mask!)
        x = self.encoder(x, src_key_padding_mask=padding_mask)

        # Output logits
        logits = self.output(x)  # (batch, seq_len, num_items + 1)

        return logits

    def predict_next(self, item_seq: torch.Tensor, top_k: int = 10):
        """
        Predict next item by appending [MASK] to the sequence.
        (Assumes the extended sequence does not exceed max_seq_len.)
        """
        self.eval()
        with torch.no_grad():
            # Append mask token to end
            mask_token = torch.full(
                (item_seq.size(0), 1),
                self.mask_token,
                device=item_seq.device,
                dtype=item_seq.dtype
            )
            masked_seq = torch.cat([item_seq, mask_token], dim=1)

            logits = self.forward(masked_seq)

            # Prediction at last position (the mask)
            last_logits = logits[:, -1, :]
            probs = F.softmax(last_logits, dim=-1)

            return probs.topk(top_k)
```

| Aspect | SASRec | BERT4Rec |
|---|---|---|
| Attention type | Causal (left-to-right) | Bidirectional |
| Training task | Next-item prediction | Masked item prediction |
| Training signal per sequence | 1 (last item) | Multiple (all masked items) |
| Inference method | Predict from last position | Append [MASK], predict |
| Training efficiency | Faster per epoch | More signal per sequence |
| Context utilization | Left context only | Full context |
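The "Training task" row translates into a short cloze training step. The sketch below is an assumption built on the `BERT4Rec` class above (the helper name is illustrative): it masks a batch of sequences, scores every position, and back-propagates cross-entropy only at the masked positions, since labels of -100 are ignored.

```python
import torch
import torch.nn.functional as F

def bert4rec_training_step(model: BERT4Rec, item_seq: torch.Tensor, optimizer) -> float:
    """
    One cloze training step for BERT4Rec (hypothetical helper).

    Random positions are replaced by [MASK]; the loss is computed only at
    those positions (labels are -100 elsewhere, which cross_entropy ignores).
    """
    model.train()
    masked_seq, labels = model.mask_sequence(item_seq)

    logits = model(masked_seq)                 # (batch, seq_len, num_items + 1)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100                      # only masked positions contribute
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```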
Self-attention is permutation-invariant—without position information, it cannot distinguish $[A, B, C]$ from $[C, B, A]$. Position encoding injects ordering information.
Options for Recommendation Sequences:
1. Learned Position Embeddings: $$PE(pos) = W_{pos}[pos]$$ Learnable embedding per position. Works well for fixed max length.
2. Sinusoidal Position Encoding: $$PE(pos, 2i) = \sin(pos / 10000^{2i/d})$$ $$PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$$ Generalizes to unseen lengths. No learned parameters.
3. Relative Position Encoding: Encodes distance between positions rather than absolute positions. Better for variable-length sequences.
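As a sketch of option 3 (a hypothetical class, loosely following T5-style attention biases rather than any implementation from the text), a learned per-head scalar bias for each clipped relative distance can be added to the attention logits:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """
    Hypothetical sketch of relative position encoding: one learned scalar
    per attention head for each clipped relative distance, added directly
    to the attention logits.
    """

    def __init__(self, num_heads: int, max_distance: int = 32):
        super().__init__()
        self.max_distance = max_distance
        # Distances are clipped to [-max_distance, max_distance]
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        positions = torch.arange(seq_len, device=self.bias.weight.device)
        # relative[i, j] = j - i, clipped to the supported range
        relative = (positions[None, :] - positions[:, None]).clamp(
            -self.max_distance, self.max_distance
        )
        bias = self.bias(relative + self.max_distance)   # (seq, seq, num_heads)
        # Return shape (1, num_heads, seq, seq), ready to add to attention scores
        return bias.permute(2, 0, 1).unsqueeze(0)
```

The implementations below cover the first two options: sinusoidal encoding and a time-aware variant that also uses raw timestamps.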
```python
import math

class SinusoidalPE(nn.Module):
    """Sinusoidal position encoding (from Transformer paper)."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float()
            * (-math.log(10000.0) / d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]


class TimeAwarePE(nn.Module):
    """
    Time-aware position encoding for recommendations.

    Encodes both position AND actual timestamps if available.
    Useful when time gaps matter (e.g., purchase sequences).
    """

    def __init__(self, d_model: int, max_len: int = 1000):
        super().__init__()
        self.position_embedding = nn.Embedding(max_len, d_model)
        self.time_projection = nn.Linear(1, d_model)

    def forward(self, seq_len: int, timestamps: torch.Tensor = None):
        """
        Args:
            seq_len: Length of sequence
            timestamps: (batch, seq_len) actual timestamps (optional)
        """
        device = self.position_embedding.weight.device
        positions = torch.arange(seq_len, device=device)
        pos_emb = self.position_embedding(positions)  # (seq_len, d_model)

        if timestamps is not None:
            # Normalize timestamps to [0, 1] range
            timestamps = timestamps.float()
            t_min = timestamps.min(dim=1, keepdim=True)[0]
            t_max = timestamps.max(dim=1, keepdim=True)[0]
            normalized_t = (timestamps - t_min) / (t_max - t_min + 1e-8)

            time_emb = self.time_projection(normalized_t.unsqueeze(-1))
            return pos_emb.unsqueeze(0) + time_emb

        return pos_emb.unsqueeze(0)
```

Beyond position encoding, adding explicit recency features (time since last interaction, position from end) can significantly improve sequential models, especially for capturing the strong recency bias in user behavior.
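As a rough illustration of such recency features (the helper below is a hypothetical sketch, not part of any cited model), one can compute, for each position, the time elapsed since the most recent interaction and the distance from the end of the sequence, then project and add them to the input embeddings:

```python
import torch

def recency_features(item_seq: torch.Tensor, timestamps: torch.Tensor):
    """
    Hypothetical helper computing two recency signals per position:
    - time since the most recent interaction in the sequence
    - position counted from the end (0 = most recent item)

    Both could be projected (e.g., via nn.Linear / nn.Embedding) and added
    to the input embeddings alongside the position encodings above.
    """
    seq_len = item_seq.size(1)

    # Elapsed time relative to the newest interaction, in the timestamps' units
    time_since_last = timestamps[:, -1:].float() - timestamps.float()  # (batch, seq_len)

    # Distance from the end of the sequence
    pos_from_end = torch.arange(seq_len - 1, -1, -1, device=item_seq.device)
    pos_from_end = pos_from_end.unsqueeze(0).expand(item_seq.size(0), -1)

    return time_since_last, pos_from_end
```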
Sequences capture individual user trajectories, but recommendations also depend on item-item and user-user relationships. Next, we explore how Graph Neural Networks model these complex relational structures for recommendations.