Traditional collaborative filtering treats user interactions as unordered sets: the model sees that a user liked items A, B, and C, but ignores when and in what order those interactions occurred. Yet temporal patterns are everywhere: what a user wants next depends on what they just did, on the current session's context, and on how their tastes have drifted over time.
Sequence models capture these temporal dynamics by modeling user behavior as ordered sequences and learning to predict the next interaction.
This page covers: (1) The shift from static to sequential recommendation, (2) GRU4Rec and session-based recommendations, (3) Self-attention and SASRec architecture, (4) BERT4Rec and bidirectional modeling, (5) Handling variable-length sequences and position encoding.
Problem Formulation:
Given a user's historical interaction sequence: $$S_u = [s_1, s_2, ..., s_t]$$
where $s_i$ is the item interacted with at time step $i$, predict the next item $s_{t+1}$.
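To make this concrete, here is a minimal preprocessing sketch (the `make_next_item_examples` helper is an illustrative assumption, not from a specific library) that expands one interaction sequence into (prefix, next-item) training pairs:

```python
import torch

def make_next_item_examples(sequence, max_len: int = 50, pad_id: int = 0):
    """
    Hypothetical helper: expand one sequence [s_1, ..., s_t] into
    (prefix, next-item) pairs. Each prefix [s_1, ..., s_k] is left-padded
    to max_len and labeled with s_{k+1}.
    """
    inputs, labels = [], []
    for k in range(1, len(sequence)):
        prefix = sequence[max(0, k - max_len):k]
        inputs.append([pad_id] * (max_len - len(prefix)) + list(prefix))
        labels.append(sequence[k])
    return torch.tensor(inputs), torch.tensor(labels)

# A sequence of 5 item IDs yields 4 training examples:
# the first is X[0] == [0, 0, 0, 0, 3] with label y[0] == 7
X, y = make_next_item_examples([3, 7, 7, 2, 9], max_len=5)
```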
Key Characteristics: interactions are ordered in time, the training signal is a sequence with next-item labels, and the prediction target is the next item rather than general affinity to any item.
Comparison with Static CF:
| Aspect | Static CF | Sequential CF |
|---|---|---|
| User representation | Single embedding vector | Sequence of interactions |
| Temporal modeling | None (or decay heuristics) | Learned from data |
| Context sensitivity | Fixed preferences | Dynamic, context-aware |
| Training signal | User-item pairs | Sequences with next-item labels |
| Prediction target | Affinity to any item | Next item in sequence |
| Model architecture | MLPs, matrix factorization | RNNs, Transformers |
Session-Based vs User-Based Sequential:
Session-based is common in e-commerce where many users aren't logged in. User-based suits platforms with strong user identity (Netflix, Spotify).
GRU4Rec (Hidasi et al., 2015) pioneered deep learning for session-based recommendations using Gated Recurrent Units (GRUs).
Architecture: an item embedding layer feeds one or more GRU layers, and an output layer projects the hidden state to scores over the item catalog.
The GRU Advantage:
GRUs maintain a hidden state $\mathbf{h}_t$ that summarizes the session so far: $$\mathbf{h}_t = \mathrm{GRU}(\mathbf{x}_t, \mathbf{h}_{t-1})$$
This hidden state captures the evolving intent of the session and is used to score candidate next items:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRU4Rec(nn.Module):
    """
    GRU4Rec: Session-based Recommendations with RNNs.

    Architecture:
    - Item embedding layer
    - Multi-layer GRU
    - Output layer projecting to item scores
    """

    def __init__(
        self,
        num_items: int,
        embedding_dim: int = 64,
        hidden_dim: int = 100,
        num_layers: int = 1,
        dropout: float = 0.25
    ):
        super().__init__()
        self.num_items = num_items
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # Item embeddings
        self.item_embedding = nn.Embedding(
            num_items, embedding_dim, padding_idx=0
        )

        # GRU layers
        self.gru = nn.GRU(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Output projection
        self.output = nn.Linear(hidden_dim, num_items)

        self._init_weights()

    def _init_weights(self):
        nn.init.xavier_uniform_(self.item_embedding.weight)
        nn.init.xavier_uniform_(self.output.weight)

    def forward(
        self,
        item_seq: torch.Tensor,          # (batch, seq_len)
        seq_lengths: torch.Tensor = None
    ):
        """
        Args:
            item_seq: Sequence of item IDs
            seq_lengths: Actual lengths (for variable-length seqs)

        Returns:
            Scores for all items based on final hidden state
        """
        # Embed items
        x = self.item_embedding(item_seq)  # (batch, seq_len, embed_dim)

        # Pack sequence if lengths provided (for efficiency)
        if seq_lengths is not None:
            x = nn.utils.rnn.pack_padded_sequence(
                x, seq_lengths.cpu(), batch_first=True, enforce_sorted=False
            )

        # GRU forward pass
        output, hidden = self.gru(x)

        # Use final hidden state for prediction
        # hidden shape: (num_layers, batch, hidden_dim)
        final_hidden = hidden[-1]  # (batch, hidden_dim)

        # Project to item scores
        scores = self.output(final_hidden)  # (batch, num_items)

        return scores

    def predict_next(self, session: list, top_k: int = 10):
        """Predict top-k next items for a session."""
        self.eval()
        with torch.no_grad():
            item_seq = torch.tensor([session]).long()
            scores = self.forward(item_seq)
            probs = F.softmax(scores, dim=-1)
            top_scores, top_items = probs.topk(top_k)
        return top_items[0].tolist(), top_scores[0].tolist()
```

GRU4Rec introduced session-parallel mini-batching: instead of processing full sessions sequentially, it processes the same position across multiple sessions in parallel. This dramatically improves training efficiency on GPUs.
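To make the idea concrete, below is a minimal sketch of session-parallel mini-batching, assuming sessions are plain lists of item IDs with at least two items each; the generator name and reset mask are illustrative, not part of the original GRU4Rec code. The caller would zero the hidden-state columns flagged by `reset_mask` before each GRU step, and the batch shrinks once the pool of sessions runs out.

```python
import torch

def session_parallel_batches(sessions, batch_size):
    """
    Sketch of GRU4Rec-style session-parallel mini-batching.

    `sessions` is a list of item-ID lists (each of length >= 2). Every step
    yields (inputs, targets, reset_mask): the current item and the next item
    for each active session, plus a boolean mask marking slots that just
    switched to a new session (their hidden state should be reset to zero).
    """
    active = list(range(min(batch_size, len(sessions))))  # session index per slot
    cursor = [0] * len(active)                            # position inside each session
    reset = [True] * len(active)                          # every slot starts fresh
    next_session = len(active)                            # next unseen session

    while active:
        inputs = [sessions[s][c] for s, c in zip(active, cursor)]
        targets = [sessions[s][c + 1] for s, c in zip(active, cursor)]
        yield (torch.tensor(inputs), torch.tensor(targets),
               torch.tensor(reset, dtype=torch.bool))

        reset = [False] * len(active)
        for i in range(len(active) - 1, -1, -1):   # iterate backwards so deletion is safe
            cursor[i] += 1
            if cursor[i] + 1 >= len(sessions[active[i]]):  # session exhausted
                if next_session < len(sessions):
                    active[i], cursor[i], reset[i] = next_session, 0, True
                    next_session += 1
                else:
                    del active[i], cursor[i], reset[i]
```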
Sequential recommendation introduces unique loss function considerations:
1. Cross-Entropy (Softmax) Loss:
$$\mathcal{L}_{CE} = -\log \frac{\exp(s_{pos})}{\sum_{j} \exp(s_j)}$$
Computes probability over all items. Computationally expensive for large catalogs.
2. BPR (Bayesian Personalized Ranking) Loss:
$$\mathcal{L}_{BPR} = -\log \sigma(s_{pos} - s_{neg})$$
Pairwise loss comparing positive vs sampled negative. More efficient, focuses on ranking.
3. TOP1 Loss (GRU4Rec original):
$$\mathcal{L}_{TOP1} = \frac{1}{N_s} \sum_{j=1}^{N_s} \left[ \sigma(s_j - s_{pos}) + \sigma(s_j^2) \right]$$
Regularized pairwise loss designed for session-based recommendations.
```python
def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor):
    """
    BPR Loss: Bayesian Personalized Ranking.

    Maximizes the difference between positive and negative item scores.

    Args:
        pos_scores: Scores for positive (next) items, shape (batch,)
        neg_scores: Scores for sampled negative items,
                    shape (batch,) or (batch, num_negatives)
    """
    if neg_scores.dim() == 2:
        # Multiple negatives: average over them
        pos_scores = pos_scores.unsqueeze(1)   # (batch, 1)
        diff = pos_scores - neg_scores         # (batch, num_negatives)
        loss = -torch.log(torch.sigmoid(diff) + 1e-8).mean()
    else:
        diff = pos_scores - neg_scores
        loss = -torch.log(torch.sigmoid(diff) + 1e-8).mean()
    return loss


def sampled_softmax_loss(
    scores: torch.Tensor,
    targets: torch.Tensor,
    num_negatives: int = 100
):
    """
    Sampled Softmax: Approximate cross-entropy with negative sampling.

    Instead of computing softmax over all items, sample negatives.
    Much faster for large item catalogs.
    """
    batch_size, num_items = scores.shape
    device = scores.device

    # Get positive scores
    pos_scores = scores.gather(1, targets.unsqueeze(1)).squeeze(1)

    # Sample negative item indices (uniform random)
    neg_indices = torch.randint(
        0, num_items, (batch_size, num_negatives), device=device
    )
    neg_scores = scores.gather(1, neg_indices)  # (batch, num_negatives)

    # Compute log-softmax approximation
    # log p(pos) ≈ pos_score - log(exp(pos_score) + sum(exp(neg_scores)))
    all_scores = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    log_softmax = F.log_softmax(all_scores, dim=1)

    # Loss is negative log probability of positive item (index 0)
    loss = -log_softmax[:, 0].mean()
    return loss


def top1_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor):
    """
    TOP1 Loss from original GRU4Rec paper.

    Combines regularization with ranking objective.
    """
    diff = neg_scores - pos_scores.unsqueeze(1)
    loss = torch.sigmoid(diff).mean() + torch.sigmoid(neg_scores ** 2).mean()
    return loss
```

SASRec (Kang & McAuley, 2018) applies Transformer self-attention to sequential recommendation, achieving superior results over RNN-based methods.
Why Self-Attention for Sequences: every prediction can attend directly to any earlier item instead of squeezing the history through a single recurrent hidden state, long-range dependencies are captured explicitly, and training parallelizes across positions.
Architecture Overview: item embeddings plus learned position embeddings feed a stack of causal self-attention blocks; next-item scores are dot products between the output representations and the (tied) item embedding table.
```python
class SASRec(nn.Module):
    """
    Self-Attentive Sequential Recommendation (SASRec).

    Uses causal (left-to-right) self-attention to model sequences.
    Position t can only attend to positions 1...t.
    """

    def __init__(
        self,
        num_items: int,
        max_seq_len: int = 50,
        embedding_dim: int = 64,
        num_heads: int = 2,
        num_layers: int = 2,
        dropout: float = 0.2
    ):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.embedding_dim = embedding_dim

        # Embeddings (index 0 reserved for padding)
        self.item_embedding = nn.Embedding(
            num_items + 1, embedding_dim, padding_idx=0
        )
        self.position_embedding = nn.Embedding(max_seq_len, embedding_dim)

        # Transformer blocks
        self.attention_blocks = nn.ModuleList([
            TransformerBlock(embedding_dim, num_heads, dropout)
            for _ in range(num_layers)
        ])

        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.dropout = nn.Dropout(dropout)

        # Output projection (tied with item embeddings)
        self.output_bias = nn.Parameter(torch.zeros(num_items + 1))

        self._init_weights()

    def _init_weights(self):
        nn.init.xavier_uniform_(self.item_embedding.weight)
        nn.init.xavier_uniform_(self.position_embedding.weight)

    def forward(self, item_seq: torch.Tensor):
        """
        Args:
            item_seq: (batch, seq_len) item IDs, 0 for padding

        Returns:
            scores: (batch, seq_len, num_items + 1) predictions at each position
        """
        batch_size, seq_len = item_seq.shape
        device = item_seq.device

        # Create attention mask (causal + padding)
        # Causal: position t can only see positions <= t
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, device=device), diagonal=1
        ).bool()

        # Padding mask: don't attend to padding tokens
        padding_mask = (item_seq == 0)

        # Get embeddings
        item_emb = self.item_embedding(item_seq)  # (batch, seq, dim)

        # Positional embeddings
        positions = torch.arange(seq_len, device=device).unsqueeze(0)
        pos_emb = self.position_embedding(positions)  # (1, seq, dim)

        # Combine
        x = item_emb + pos_emb
        x = self.dropout(x)

        # Apply transformer blocks
        for block in self.attention_blocks:
            x = block(x, causal_mask, padding_mask)

        x = self.layer_norm(x)

        # Compute scores via dot product with item embeddings
        # (batch, seq, dim) @ (dim, num_items + 1) -> (batch, seq, num_items + 1)
        scores = x @ self.item_embedding.weight.T + self.output_bias

        return scores

    def predict_next(self, item_seq: torch.Tensor, top_k: int = 10):
        """Predict next item based on sequence."""
        self.eval()
        with torch.no_grad():
            scores = self.forward(item_seq)
            # Use last position's prediction
            last_scores = scores[:, -1, :]
            probs = F.softmax(last_scores, dim=-1)
            return probs.topk(top_k)


class TransformerBlock(nn.Module):
    """Single transformer block with causal self-attention."""

    def __init__(self, dim: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            dim, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * 4, dim),
            nn.Dropout(dropout)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, causal_mask, padding_mask):
        # Self-attention with residual
        attn_out, _ = self.attention(
            x, x, x,
            attn_mask=causal_mask,
            key_padding_mask=padding_mask
        )
        x = self.norm1(x + self.dropout(attn_out))

        # FFN with residual
        x = self.norm2(x + self.ffn(x))
        return x
```

Unlike language models that might use bidirectional attention, SASRec must use causal (left-to-right) masking. At prediction time, we only know items 1...t and must predict t+1. Bidirectional attention would leak future information during training.
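A minimal training-step sketch for such a model is shown below. It is an assumption rather than the paper's exact procedure: SASRec originally optimizes a sampled binary cross-entropy with one negative item per position, while this sketch uses a full-softmax cross-entropy for brevity; `sasrec_training_step` is an illustrative helper name.

```python
import torch
import torch.nn.functional as F

def sasrec_training_step(model: SASRec, item_seq: torch.Tensor, optimizer) -> float:
    """
    One training step for SASRec (hypothetical helper).

    At every position t the model predicts the item at position t+1, so the
    inputs are item_seq[:, :-1] and the targets are item_seq[:, 1:].
    Positions whose target is padding (0) are ignored.
    """
    model.train()
    inputs, targets = item_seq[:, :-1], item_seq[:, 1:]

    scores = model(inputs)                    # (batch, seq_len - 1, num_items + 1)
    loss = F.cross_entropy(
        scores.reshape(-1, scores.size(-1)),  # flatten all positions
        targets.reshape(-1),
        ignore_index=0                        # skip padded targets
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```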
BERT4Rec (Sun et al., 2019) adapts BERT's masked language modeling to sequential recommendation.
Key Insight:
While SASRec uses causal attention for next-item prediction, BERT4Rec uses masked item prediction with bidirectional attention: randomly chosen items in the sequence are replaced by a [MASK] token, and the model reconstructs them using both left and right context.
Advantages of Bidirectional: each position can use both earlier and later context, and masking several positions per sequence yields multiple training signals instead of only the final item.
The Cloze Task:
$$\mathcal{L} = \sum_{m \in M} -\log P(s_m | S \setminus M)$$
where $M$ is the set of masked positions.
```python
class BERT4Rec(nn.Module):
    """
    BERT4Rec: Bidirectional encoder for sequential recommendation.

    Uses masked item prediction (cloze task) for training.
    Bidirectional attention allows each position to see all others.
    """

    def __init__(
        self,
        num_items: int,
        max_seq_len: int = 50,
        embedding_dim: int = 64,
        num_heads: int = 2,
        num_layers: int = 2,
        dropout: float = 0.1,
        mask_prob: float = 0.2
    ):
        super().__init__()
        self.mask_prob = mask_prob
        self.mask_token = num_items + 1  # Special [MASK] token

        # Embeddings (mask token is num_items + 1)
        self.item_embedding = nn.Embedding(
            num_items + 2, embedding_dim, padding_idx=0
        )
        self.position_embedding = nn.Embedding(max_seq_len, embedding_dim)

        # BERT encoder layers (bidirectional, no causal mask)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,
            nhead=num_heads,
            dim_feedforward=embedding_dim * 4,
            dropout=dropout,
            activation='gelu',
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )

        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.dropout = nn.Dropout(dropout)

        # Output head
        self.output = nn.Linear(embedding_dim, num_items + 1)

    def mask_sequence(self, item_seq: torch.Tensor):
        """
        Apply random masking to sequence for training.

        Returns:
            masked_seq: Sequence with some items replaced by [MASK]
            labels: Original items at masked positions, -100 elsewhere
        """
        device = item_seq.device
        batch_size, seq_len = item_seq.shape

        # Don't mask padding (item_id = 0)
        maskable = item_seq != 0

        # Random mask selection
        mask = torch.rand(batch_size, seq_len, device=device) < self.mask_prob
        mask = mask & maskable

        # Create masked sequence
        masked_seq = item_seq.clone()
        masked_seq[mask] = self.mask_token

        # Labels: original items at masked positions, -100 elsewhere
        labels = torch.full_like(item_seq, -100)
        labels[mask] = item_seq[mask]

        return masked_seq, labels

    def forward(self, item_seq: torch.Tensor):
        """
        Forward pass through BERT encoder.

        Args:
            item_seq: (batch, seq_len) - may contain [MASK] tokens

        Returns:
            logits: (batch, seq_len, num_items + 1) predictions at each position
        """
        batch_size, seq_len = item_seq.shape
        device = item_seq.device

        # Padding mask (for attention)
        padding_mask = (item_seq == 0)

        # Embeddings
        item_emb = self.item_embedding(item_seq)
        positions = torch.arange(seq_len, device=device).unsqueeze(0)
        pos_emb = self.position_embedding(positions)

        x = self.dropout(self.layer_norm(item_emb + pos_emb))

        # BERT encoding (bidirectional - no causal mask!)
        x = self.encoder(x, src_key_padding_mask=padding_mask)

        # Output logits
        logits = self.output(x)  # (batch, seq_len, num_items + 1)

        return logits

    def predict_next(self, item_seq: torch.Tensor, top_k: int = 10):
        """
        Predict next item by appending [MASK] to the sequence.
        (Assumes the extended sequence does not exceed max_seq_len.)
        """
        self.eval()
        with torch.no_grad():
            # Append mask token to end
            mask_token = torch.full(
                (item_seq.size(0), 1),
                self.mask_token,
                device=item_seq.device,
                dtype=item_seq.dtype
            )
            masked_seq = torch.cat([item_seq, mask_token], dim=1)

            logits = self.forward(masked_seq)

            # Prediction at last position (the mask)
            last_logits = logits[:, -1, :]
            probs = F.softmax(last_logits, dim=-1)

            return probs.topk(top_k)
```

| Aspect | SASRec | BERT4Rec |
|---|---|---|
| Attention type | Causal (left-to-right) | Bidirectional |
| Training task | Next-item prediction | Masked item prediction |
| Training signal per sequence | 1 (last item) | Multiple (all masked items) |
| Inference method | Predict from last position | Append [MASK], predict |
| Training efficiency | Faster per epoch | More signal per sequence |
| Context utilization | Left context only | Full context |
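The "Training task" row translates into a short cloze training step. The sketch below is an assumption built on the `BERT4Rec` class above (the helper name is illustrative): it masks a batch of sequences, scores every position, and back-propagates cross-entropy only at the masked positions, since labels of -100 are ignored.

```python
import torch
import torch.nn.functional as F

def bert4rec_training_step(model: BERT4Rec, item_seq: torch.Tensor, optimizer) -> float:
    """
    One cloze training step for BERT4Rec (hypothetical helper).

    Random positions are replaced by [MASK]; the loss is computed only at
    those positions (labels are -100 elsewhere, which cross_entropy ignores).
    """
    model.train()
    masked_seq, labels = model.mask_sequence(item_seq)

    logits = model(masked_seq)                 # (batch, seq_len, num_items + 1)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100                      # only masked positions contribute
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```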
Self-attention is permutation-invariant—without position information, it cannot distinguish $[A, B, C]$ from $[C, B, A]$. Position encoding injects ordering information.
Options for Recommendation Sequences:
1. Learned Position Embeddings: $$PE(pos) = W_{pos}[pos]$$ Learnable embedding per position. Works well for fixed max length.
2. Sinusoidal Position Encoding: $$PE(pos, 2i) = \sin(pos / 10000^{2i/d})$$ $$PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$$ Generalizes to unseen lengths. No learned parameters.
3. Relative Position Encoding: Encodes distance between positions rather than absolute positions. Better for variable-length sequences.
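As a sketch of option 3 (a hypothetical class, loosely following T5-style attention biases rather than any implementation from the text), a learned per-head scalar bias for each clipped relative distance can be added to the attention logits:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """
    Hypothetical sketch of relative position encoding: one learned scalar
    per attention head for each clipped relative distance, added directly
    to the attention logits.
    """

    def __init__(self, num_heads: int, max_distance: int = 32):
        super().__init__()
        self.max_distance = max_distance
        # Distances are clipped to [-max_distance, max_distance]
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        positions = torch.arange(seq_len, device=self.bias.weight.device)
        # relative[i, j] = j - i, clipped to the supported range
        relative = (positions[None, :] - positions[:, None]).clamp(
            -self.max_distance, self.max_distance
        )
        bias = self.bias(relative + self.max_distance)   # (seq, seq, num_heads)
        # Return shape (1, num_heads, seq, seq), ready to add to attention scores
        return bias.permute(2, 0, 1).unsqueeze(0)
```

The implementations below cover the first two options: sinusoidal encoding and a time-aware variant that also uses raw timestamps.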
```python
import math

class SinusoidalPE(nn.Module):
    """Sinusoidal position encoding (from Transformer paper)."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float()
            * (-math.log(10000.0) / d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]


class TimeAwarePE(nn.Module):
    """
    Time-aware position encoding for recommendations.

    Encodes both position AND actual timestamps if available.
    Useful when time gaps matter (e.g., purchase sequences).
    """

    def __init__(self, d_model: int, max_len: int = 1000):
        super().__init__()
        self.position_embedding = nn.Embedding(max_len, d_model)
        self.time_projection = nn.Linear(1, d_model)

    def forward(self, seq_len: int, timestamps: torch.Tensor = None):
        """
        Args:
            seq_len: Length of sequence
            timestamps: (batch, seq_len) actual timestamps (optional)
        """
        device = self.position_embedding.weight.device
        positions = torch.arange(seq_len, device=device)
        pos_emb = self.position_embedding(positions)  # (seq_len, d_model)

        if timestamps is not None:
            # Normalize timestamps to [0, 1] range
            timestamps = timestamps.float()
            t_min = timestamps.min(dim=1, keepdim=True)[0]
            t_max = timestamps.max(dim=1, keepdim=True)[0]
            normalized_t = (timestamps - t_min) / (t_max - t_min + 1e-8)

            time_emb = self.time_projection(normalized_t.unsqueeze(-1))
            return pos_emb.unsqueeze(0) + time_emb

        return pos_emb.unsqueeze(0)
```

Beyond position encoding, adding explicit recency features (time since last interaction, position from end) can significantly improve sequential models, especially for capturing the strong recency bias in user behavior.
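As a rough illustration of such recency features (the helper below is a hypothetical sketch, not part of any cited model), one can compute, for each position, the time elapsed since the most recent interaction and the distance from the end of the sequence, then project and add them to the input embeddings:

```python
import torch

def recency_features(item_seq: torch.Tensor, timestamps: torch.Tensor):
    """
    Hypothetical helper computing two recency signals per position:
    - time since the most recent interaction in the sequence
    - position counted from the end (0 = most recent item)

    Both could be projected (e.g., via nn.Linear / nn.Embedding) and added
    to the input embeddings alongside the position encodings above.
    """
    seq_len = item_seq.size(1)

    # Elapsed time relative to the newest interaction, in the timestamps' units
    time_since_last = timestamps[:, -1:].float() - timestamps.float()  # (batch, seq_len)

    # Distance from the end of the sequence
    pos_from_end = torch.arange(seq_len - 1, -1, -1, device=item_seq.device)
    pos_from_end = pos_from_end.unsqueeze(0).expand(item_seq.size(0), -1)

    return time_since_last, pos_from_end
```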
Sequences capture individual user trajectories, but recommendations also depend on item-item and user-user relationships. Next, we explore how Graph Neural Networks model these complex relational structures for recommendations.