When Netflix asks 'How do you like this movie?' and you give it a star rating, something remarkable happens. That single number encodes a complex mixture of your preferences—your love for intense plot twists, your tolerance for slow-burn narratives, your appreciation of cinematography, your nostalgia for 1990s aesthetics, and countless other factors you might not even consciously recognize.
The movie, too, contains multitudes: actor chemistry, directorial vision, genre conventions, pacing choices, thematic depth. When you rate a movie, you're essentially computing the compatibility between your hidden preferences and the movie's hidden characteristics.
Latent factor models are the mathematical framework that makes this intuition precise. They reveal the hidden structure underlying user-item interactions by decomposing the vast matrix of ratings into smaller, interpretable components—uncovering the unseen dimensions that drive human preference.
By the end of this page, you will understand the mathematical foundations of latent factor models, how they differ from neighborhood-based methods, the geometric intuition behind matrix factorization, and why discovering latent factors enables powerful generalization to unseen user-item pairs. You'll gain the conceptual framework that underlies modern recommendation systems at scale.
Consider a simple scenario: we have users rating movies. We can represent this as a user-item rating matrix R, where each entry R[u][i] contains user u's rating for item i:
| | Inception | Titanic | The Matrix | The Notebook |
|---|---|---|---|---|
| Alice | 5 | 3 | 5 | 2 |
| Bob | 4 | 5 | 4 | 5 |
| Carol | ? | 2 | ? | 1 |
| Dave | 5 | ? | 5 | ? |
The question marks represent unobserved ratings—movies the users haven't rated yet. The fundamental problem of recommendation is: Given the observed ratings, can we predict the missing ones?
This matrix presents a fascinating structure. Notice that Alice and Dave seem similar—they both love action/sci-fi (Inception, The Matrix) and seem less enthusiastic about romance (Titanic, The Notebook). Bob appears to have different tastes, rating both genres highly. Can we discover this structure automatically?
The key insight behind latent factor models is that the rating matrix isn't random—it has structure. Users can be characterized by hidden preferences, items by hidden attributes, and ratings emerge from the interaction of these hidden factors. If we can discover these latent factors, we can predict any missing rating.
The sparsity problem:
In practice, the rating matrix is extremely sparse. Netflix has millions of users and tens of thousands of movies, but a typical user rates perhaps 100-200 items—less than 1% of the catalog. This sparsity has two implications:

1. Most pairs of users share few or no co-rated items, so similarity computed directly from overlapping ratings is noisy or simply undefined.
2. Any method that hopes to fill in the missing 99% of entries must compress what it learns into a representation far smaller than the matrix itself.
This second point is the foundation of latent factor models. If Alice's preferences can be described by just k = 50 numbers (her affinities for 50 latent factors), and each movie can similarly be described by 50 numbers, then predicting Alice's rating for any movie requires only these 100 numbers—not millions of observed ratings.
| Approach | Storage for N users, M items | Prediction Method | Generalization |
|---|---|---|---|
| Full Matrix | O(N × M) | Look up stored value | Cannot predict missing entries |
| Latent Factors (k) | O((N + M) × k) | Compute p_u · q_i | Can predict any entry |
| Example: 1M users, 100K items, k=50 | 100 billion entries | 50M + 5M = 55M parameters | ~2000× compression |
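The arithmetic behind that last row is worth making explicit. A quick sketch using the same figures (1M users, 100K items, k = 50):

```python
# Storage comparison: dense rating matrix vs latent factor matrices
n_users, n_items, k = 1_000_000, 100_000, 50

full_matrix_entries = n_users * n_items        # storing R densely: O(N * M)
latent_parameters = (n_users + n_items) * k    # storing P and Q: O((N + M) * k)

print(f"{full_matrix_entries:,} entries vs {latent_parameters:,} parameters")
print(f"compression: ~{full_matrix_entries / latent_parameters:,.0f}x")
```

This prints 100 billion entries against 55 million parameters, a compression factor of about 1,800x—the "~2000×" figure quoted in the table.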
The core idea of matrix factorization is elegantly simple: we approximate the large, sparse rating matrix R as the product of two smaller, dense matrices:
R ≈ P × Q^T
Where:

- P is an N × k matrix whose row p_u holds user u's affinities for the k latent factors
- Q is an M × k matrix whose row q_i holds item i's strengths on those same k factors
The predicted rating for user u on item i becomes the dot product:
r̂_ui = p_u · q_i = Σ_{f=1}^{k} p_uf × q_if
Each term in this sum represents how much user u cares about factor f, multiplied by how much item i exhibits factor f. The sum across all factors captures the overall compatibility.
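As a concrete sketch with k = 2, using illustrative factor values in the spirit of the action/romance example developed later on this page:

```python
import numpy as np

# Hypothetical k=2 vectors: factor 0 ~ "action affinity", factor 1 ~ "romance affinity"
p_alice = np.array([1.0, 0.2])       # strong action taste, mild romance taste
q_inception = np.array([5.0, 1.0])   # strongly an action movie, barely romance

# r_hat = p_alice . q_inception = 1.0 * 5.0 + 0.2 * 1.0
r_hat = float(np.dot(p_alice, q_inception))
print(r_hat)  # 5.2
```

The action term dominates (1.0 × 5.0) because both the user's preference and the movie's attribute are strong on that factor; the romance term contributes almost nothing.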
```python
import numpy as np


class BasicMatrixFactorization:
    """
    Basic matrix factorization for recommendation.

    The rating matrix R ≈ P @ Q.T where:
    - P[u, :] is the latent vector for user u
    - Q[i, :] is the latent vector for item i
    - Predicted rating r_hat[u, i] = P[u, :] @ Q[i, :]
    """

    def __init__(self, n_users: int, n_items: int, n_factors: int = 50):
        self.n_factors = n_factors
        # Initialize latent factor matrices with small random values
        # Using Xavier initialization for better convergence
        scale = 1.0 / np.sqrt(n_factors)
        self.P = np.random.normal(0, scale, (n_users, n_factors))
        self.Q = np.random.normal(0, scale, (n_items, n_factors))

    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating as dot product of latent vectors."""
        return np.dot(self.P[user_id], self.Q[item_id])

    def predict_all(self) -> np.ndarray:
        """Reconstruct the full rating matrix (for analysis)."""
        return self.P @ self.Q.T

    def get_user_vector(self, user_id: int) -> np.ndarray:
        """Get the latent representation for a user."""
        return self.P[user_id]

    def get_item_vector(self, item_id: int) -> np.ndarray:
        """Get the latent representation for an item."""
        return self.Q[item_id]

    def similar_items(self, item_id: int, top_k: int = 10) -> list:
        """Find most similar items based on latent space distance."""
        target = self.Q[item_id]
        # Compute cosine similarity to all items
        norms = np.linalg.norm(self.Q, axis=1)
        similarities = (self.Q @ target) / (norms * np.linalg.norm(target) + 1e-8)
        # Get top-k (excluding the item itself)
        most_similar = np.argsort(similarities)[::-1]
        return [idx for idx in most_similar if idx != item_id][:top_k]


# Example: Conceptual demonstration
def demonstrate_factorization():
    """Show how matrix factorization captures hidden structure."""
    # Original rating matrix (5 users, 4 movies)
    # Notice the implicit structure: users 0,2,4 prefer action; users 1,3 prefer romance
    R = np.array([
        [5, 2, 5, 1],  # Action fan
        [2, 5, 1, 5],  # Romance fan
        [5, 1, 5, 2],  # Action fan
        [1, 5, 2, 5],  # Romance fan
        [5, 3, 4, 2],  # Action fan (slightly mixed)
    ], dtype=float)

    # Approximate factorization with k=2 factors
    # Factor 0: "Action affinity", Factor 1: "Romance affinity"
    P_true = np.array([
        [1.0, 0.2],  # User 0: loves action
        [0.2, 1.0],  # User 1: loves romance
        [1.0, 0.1],  # User 2: loves action
        [0.1, 1.0],  # User 3: loves romance
        [0.9, 0.4],  # User 4: mostly action
    ])

    Q_true = np.array([
        [5.0, 1.0],  # Inception: action movie
        [1.0, 5.0],  # Titanic: romance movie
        [5.0, 1.0],  # Matrix: action movie
        [1.0, 5.0],  # Notebook: romance movie
    ])

    R_reconstructed = P_true @ Q_true.T

    print("Original R:")
    print(R)
    print("\nReconstructed (P @ Q^T):")
    print(R_reconstructed)
    print("\nReconstruction Error (Frobenius norm):")
    print(f"{np.linalg.norm(R - R_reconstructed):.4f}")


if __name__ == "__main__":
    demonstrate_factorization()
```

Geometric interpretation:
Each user and item lives in a k-dimensional latent space. The rating is the inner product between the user vector and item vector, which geometrically measures two things: the alignment of the vectors (the angle between them—do the user's preferences point in the same direction as the item's attributes?) and their magnitudes (how strongly the user holds those preferences, and how strongly the item exhibits those attributes).
This geometric view provides powerful intuition: users with similar tastes end up near each other in latent space, similar items cluster together, and an item scores highly for a user when its vector points in the same direction as the user's.
One of the most fascinating aspects of latent factor models is that the discovered factors often correspond to meaningful concepts, even though we never explicitly defined them. The model discovers these dimensions purely from patterns in the data.
Example from movie recommendations:
After training a latent factor model on movie ratings, researchers have observed that individual factors often capture recognizable concepts:
| Factor | High Positive Values | High Negative Values | Interpretation |
|---|---|---|---|
| f₁ | The Godfather, Pulp Fiction, Fight Club | The Little Mermaid, Frozen | Serious/Mature vs. Family-Friendly |
| f₂ | Inception, Interstellar, The Matrix | The Notebook, Titanic | Sci-Fi/Action vs. Romance |
| f₃ | Monty Python, The Office | Schindler's List, 12 Years a Slave | Comedy vs. Drama |
| f₄ | Old Hollywood classics | Recent blockbusters | Era/Nostalgic preference |
| f₅ | Indie/art house films | Mainstream productions | Commercial vs. Artistic |
These factors emerge automatically through optimization—the model wasn't told what 'genre' or 'era' means. It discovered that these dimensions are useful for predicting ratings.
While some latent factors correspond to human-understandable concepts, many others are abstract combinations that defy simple interpretation. A factor might capture 'movies with surprising endings that don't rely on CGI featuring ensemble casts'—a valid preference dimension that humans wouldn't naturally articulate. The model optimizes for prediction, not interpretability.
The number of factors k:
Choosing k is a critical decision that balances expressiveness against overfitting:
Too few factors (k too small): the model underfits. It cannot represent the genuine diversity of tastes, so distinct preference patterns get collapsed together and predictions are blunt.

Too many factors (k too large): the model overfits. With so many parameters per user and item, it starts memorizing noise in the sparse observed ratings rather than learning shared structure, and training and storage costs grow.

Practical guidance: treat k as a hyperparameter. Values from the tens to the low hundreds are common in practice; sweep several candidates, pick the one that minimizes error on held-out ratings, and increase regularization as k grows.
```python
import numpy as np
from typing import Any, Dict, List, Tuple


def analyze_latent_factors(
    Q: np.ndarray,
    item_names: List[str],
    top_k: int = 5
) -> Dict[int, Tuple[List[str], List[str]]]:
    """
    Analyze what each latent factor captures by examining items
    with highest and lowest values for that factor.

    Args:
        Q: Item latent factor matrix (n_items x n_factors)
        item_names: Names of items for interpretation
        top_k: Number of items to show per factor

    Returns:
        Dictionary mapping factor index to (high_items, low_items)
    """
    n_factors = Q.shape[1]
    factor_analysis = {}

    for f in range(n_factors):
        # Get factor f values for all items
        factor_values = Q[:, f]

        # Find items with highest and lowest values
        sorted_indices = np.argsort(factor_values)
        high_indices = sorted_indices[-top_k:][::-1]
        low_indices = sorted_indices[:top_k]

        high_items = [item_names[i] for i in high_indices]
        low_items = [item_names[i] for i in low_indices]
        factor_analysis[f] = (high_items, low_items)

        print(f"\nFactor {f}:")
        print(f"  High: {', '.join(high_items)}")
        print(f"  Low: {', '.join(low_items)}")

    return factor_analysis


def find_factor_correlations(
    Q: np.ndarray,
    item_metadata: Dict[int, Dict[str, Any]]
) -> Dict[int, str]:
    """
    Attempt to correlate latent factors with known metadata.

    This helps interpret what abstract factors might represent
    by correlating them with explicit attributes (genre, year, etc.)
    """
    correlations = {}

    for f in range(Q.shape[1]):
        factor_values = Q[:, f]

        # Correlate with various metadata fields
        best_correlation = 0
        best_attribute = "Unknown"

        for attr in ['genre_action', 'genre_romance', 'year', 'budget']:
            if attr in item_metadata.get(0, {}):
                attr_values = np.array([
                    item_metadata[i].get(attr, 0)
                    for i in range(len(factor_values))
                ])
                corr = np.corrcoef(factor_values, attr_values)[0, 1]

                if abs(corr) > abs(best_correlation):
                    best_correlation = corr
                    best_attribute = (
                        f"{attr} ({'positive' if corr > 0 else 'negative'})"
                    )

        correlations[f] = f"{best_attribute} (r={best_correlation:.2f})"

    return correlations


def visualize_user_in_factor_space(
    P: np.ndarray,
    user_ids: List[int],
    factor_interpretations: List[str]
):
    """
    Print a user's position in the latent factor space
    with human-readable factor interpretations.
    """
    for user_id in user_ids:
        print(f"\nUser {user_id} preference profile:")
        user_vector = P[user_id]

        # Sort factors by absolute magnitude (most defining preferences first)
        sorted_factors = np.argsort(np.abs(user_vector))[::-1]

        for f in sorted_factors[:5]:  # Top 5 defining factors
            value = user_vector[f]
            interpretation = (
                factor_interpretations[f]
                if f < len(factor_interpretations)
                else f"Factor {f}"
            )
            direction = (
                "strongly positive" if value > 1
                else "positive" if value > 0
                else "negative" if value > -1
                else "strongly negative"
            )
            print(f"  {interpretation}: {direction} ({value:.2f})")
```

Before matrix factorization became dominant, neighborhood-based collaborative filtering was the standard approach. Understanding the differences illuminates why latent factor models proved superior for many applications.
Neighborhood-based methods:
These methods predict ratings based on similar users (user-based CF) or similar items (item-based CF):
User-based: 'Users similar to you also liked this item.'

Item-based: 'This item is similar to other items you've liked.'
They compute similarity directly from the rating matrix (using cosine similarity, Pearson correlation, etc.) and predict by weighted averaging of neighbors' ratings.
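A minimal sketch of item-based prediction, using cosine similarity over co-rated entries and the running example matrix (0 stands in for an unobserved rating; the helper names are ours, not a library API):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity over co-rated entries (0 means 'unrated' here)."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    num = float(a[mask] @ b[mask])
    den = float(np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]))
    return num / den if den else 0.0

def item_based_predict(R, user, item):
    """Weighted average of the user's own ratings, weighted by item-item similarity."""
    num = den = 0.0
    for j in range(R.shape[1]):
        if j != item and R[user, j] > 0:          # only items this user rated
            s = cosine_sim(R[:, item], R[:, j])   # similarity in raw rating space
            num += s * R[user, j]
            den += abs(s)
    return num / den if den else 0.0

# The running example (rows: Alice, Bob, Carol, Dave; 0 = unrated)
R = np.array([
    [5, 3, 5, 2],
    [4, 5, 4, 5],
    [0, 2, 0, 1],
    [5, 0, 5, 0],
], dtype=float)

# Carol's prediction for The Matrix (item 2) rests entirely on her two romance ratings
print(f"{item_based_predict(R, user=2, item=2):.2f}")
```

The prediction comes out low (around 1.5) because Carol's only "neighbors" are romance films, which previews the weakness discussed in the text: the method can only average over items the user has directly rated.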
Why latent factors win at scale:
Consider predicting Carol's rating for 'The Matrix' from our earlier example. Using item-based neighborhood methods, we'd find items Carol rated that are similar to The Matrix, then average those ratings weighted by similarity.
But Carol has only rated Titanic (2 stars) and The Notebook (1 star). Neither is particularly similar to The Matrix (an action/sci-fi film). Neighborhood methods would struggle here.
Latent factor models approach this differently. We learn that:

- Titanic and The Notebook load heavily on the romance factor, while The Matrix loads on the action factor
- across all users, low romance affinity tends to co-occur with high action affinity (as with Alice and Dave)
- Carol's low romance ratings therefore place her near Alice and Dave in latent space, yielding a confident high prediction for The Matrix
The key insight: latent factors can infer preferences transitively. Even though Carol hasn't rated action movies, her ratings for romance movies (combined with the learned factor structure) reveal her action movie preferences. This transitive inference is impossible for neighborhood methods that only use direct similarity.
Modern production systems often combine both approaches. Netflix's famous winning ensemble included neighborhood models, matrix factorization, and restricted Boltzmann machines. Each captures different signal patterns. Latent factors excel at discovering global patterns; neighborhood methods capture local, specialized preferences.
To learn good latent factors P and Q, we need an objective function to optimize. The standard formulation minimizes the regularized squared error over observed ratings:
min_{P,Q} Σ_{(u,i) ∈ K} (r_ui - p_u · q_i)² + λ(||P||² + ||Q||²)
Where:

- K is the set of (u, i) pairs with observed ratings—the sum runs only over entries we actually have
- λ is the regularization strength, penalizing large factor values to prevent overfitting
- ||P||² and ||Q||² are squared Frobenius norms (the sum of squared entries of each factor matrix)
This objective has several important properties: it sums only over observed ratings, so the model never trains on the missing entries it is meant to predict; the λ term keeps factor magnitudes small, which is essential given the sparsity of the data; and while it is non-convex in P and Q jointly, it is convex in either matrix when the other is held fixed—the property exploited by alternating least squares.
```python
import numpy as np
from typing import List, Tuple
from dataclasses import dataclass


@dataclass
class Rating:
    """A single observed rating."""
    user_id: int
    item_id: int
    rating: float


def compute_mf_loss(
    P: np.ndarray,
    Q: np.ndarray,
    ratings: List[Rating],
    lambda_reg: float = 0.1
) -> Tuple[float, float, float]:
    """
    Compute the regularized matrix factorization loss.

    L = Σ (r_ui - p_u · q_i)² + λ(||P||² + ||Q||²)

    Returns:
        (total_loss, squared_error, regularization_term)
    """
    # Squared error over observed ratings
    squared_error = 0.0
    for rating in ratings:
        predicted = np.dot(P[rating.user_id], Q[rating.item_id])
        error = rating.rating - predicted
        squared_error += error ** 2

    # L2 regularization
    reg_P = lambda_reg * np.sum(P ** 2)
    reg_Q = lambda_reg * np.sum(Q ** 2)
    regularization = reg_P + reg_Q

    total_loss = squared_error + regularization
    return total_loss, squared_error, regularization


def compute_gradients(
    P: np.ndarray,
    Q: np.ndarray,
    user_id: int,
    item_id: int,
    rating: float,
    lambda_reg: float = 0.1
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute gradients of loss w.r.t. p_u and q_i for a single rating.

    For the loss term: (r_ui - p_u · q_i)² + λ(||p_u||² + ||q_i||²)

    ∂L/∂p_u = -2(r_ui - p_u · q_i) * q_i + 2λp_u
    ∂L/∂q_i = -2(r_ui - p_u · q_i) * p_u + 2λq_i
    """
    p_u = P[user_id]
    q_i = Q[item_id]
    error = rating - np.dot(p_u, q_i)

    # Gradients (without the factor of 2 for computational convenience)
    grad_p = -error * q_i + lambda_reg * p_u
    grad_q = -error * p_u + lambda_reg * q_i
    return grad_p, grad_q


def train_mf_sgd(
    n_users: int,
    n_items: int,
    ratings: List[Rating],
    n_factors: int = 50,
    learning_rate: float = 0.01,
    lambda_reg: float = 0.1,
    n_epochs: int = 100,
    verbose: bool = True
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Train matrix factorization model using stochastic gradient descent.

    This is the core algorithm used in production systems, with
    various optimizations (momentum, adaptive learning rates, etc.)
    """
    # Initialize latent factors
    scale = 1.0 / np.sqrt(n_factors)
    P = np.random.normal(0, scale, (n_users, n_factors))
    Q = np.random.normal(0, scale, (n_items, n_factors))

    for epoch in range(n_epochs):
        # Shuffle ratings for SGD
        np.random.shuffle(ratings)

        for rating in ratings:
            # Compute gradients for this rating
            grad_p, grad_q = compute_gradients(
                P, Q, rating.user_id, rating.item_id,
                rating.rating, lambda_reg
            )

            # Update latent vectors
            P[rating.user_id] -= learning_rate * grad_p
            Q[rating.item_id] -= learning_rate * grad_q

        # Compute and report loss periodically
        if verbose and (epoch + 1) % 10 == 0:
            loss, sq_err, reg = compute_mf_loss(P, Q, ratings, lambda_reg)
            rmse = np.sqrt(sq_err / len(ratings))
            print(f"Epoch {epoch + 1}: Loss = {loss:.4f}, RMSE = {rmse:.4f}")

    return P, Q
```

Understanding the gradient update:
The gradient descent update for user u on rating r_ui is:
p_u ← p_u + α(e_ui × q_i - λp_u)
q_i ← q_i + α(e_ui × p_u - λq_i)
Where e_ui = r_ui - p_u · q_i is the prediction error and α is the learning rate.
Intuitively:

- The error term e_ui × q_i pulls p_u toward (or away from) the item vector: if we underpredicted (e_ui > 0), p_u moves toward q_i to raise the prediction; if we overpredicted, it moves away.
- The regularization term −λp_u constantly shrinks the vectors toward zero, so only factors consistently supported by the data stay large.
- The update for q_i is symmetric, with the roles of user and item swapped.
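A single update step with hand-picked toy values (all numbers illustrative) makes the effect visible—the prediction error shrinks after one step:

```python
import numpy as np

# One SGD update on a single observed rating
alpha, lam = 0.1, 0.01               # learning rate and regularization
p_u = np.array([0.5, 0.5])
q_i = np.array([1.0, 0.0])
r_ui = 4.0

e = r_ui - p_u @ q_i                 # 4.0 - 0.5 = 3.5 (we underpredicted)
p_u_new = p_u + alpha * (e * q_i - lam * p_u)
q_i_new = q_i + alpha * (e * p_u - lam * q_i)

e_after = r_ui - p_u_new @ q_i_new
print(f"error before: {e:.2f}, after: {e_after:.2f}")
```

Both vectors moved in the direction that raises the prediction, so the error drops (here from 3.5 to roughly 2.9); repeated over many ratings and epochs, this is all SGD training is.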
The basic factorization r̂_ui = p_u · q_i assumes that all variation in ratings comes from the interaction between user preferences and item characteristics. But in reality, systematic biases affect ratings independently of that interaction: some users are harsh critics while others rate everything generously, and some items are broadly loved (or panned) by almost everyone who rates them.
We incorporate these insights through bias terms:
r̂_ui = μ + b_u + b_i + p_u · q_i
Where:

- μ is the global average rating over all observed ratings
- b_u is user u's bias: how far above or below the average this user tends to rate
- b_i is item i's bias: how far above or below the average this item tends to be rated
- p_u · q_i is the personalized interaction term from before
```python
import numpy as np
from typing import List
from dataclasses import dataclass


@dataclass
class Rating:
    user_id: int
    item_id: int
    rating: float


class BiasedMatrixFactorization:
    """
    Matrix factorization with user and item biases.

    Prediction: r_hat = μ + b_u + b_i + p_u · q_i

    This model captures:
    - Global baseline (μ)
    - User-specific rating tendencies (b_u)
    - Item-specific quality (b_i)
    - User-item interaction (p_u · q_i)
    """

    def __init__(
        self,
        n_users: int,
        n_items: int,
        n_factors: int = 50,
        learning_rate: float = 0.01,
        reg_bias: float = 0.02,
        reg_factors: float = 0.02
    ):
        self.n_factors = n_factors
        self.lr = learning_rate
        self.reg_bias = reg_bias
        self.reg_factors = reg_factors

        # Biases
        self.global_mean = 0.0
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)

        # Latent factors
        scale = 0.1
        self.P = np.random.normal(0, scale, (n_users, n_factors))
        self.Q = np.random.normal(0, scale, (n_items, n_factors))

    def _raw_predict(self, user_id: int, item_id: int) -> float:
        """Unclipped prediction: μ + b_u + b_i + p_u · q_i."""
        return (
            self.global_mean
            + self.user_bias[user_id]
            + self.item_bias[item_id]
            + np.dot(self.P[user_id], self.Q[item_id])
        )

    def predict(self, user_id: int, item_id: int) -> float:
        """Predict rating with biases and latent factors."""
        # Clip to valid rating range for serving
        return np.clip(self._raw_predict(user_id, item_id), 1.0, 5.0)

    def fit(self, ratings: List[Rating], n_epochs: int = 100):
        """Train the model using SGD."""
        # Compute global mean
        self.global_mean = np.mean([r.rating for r in ratings])

        for epoch in range(n_epochs):
            np.random.shuffle(ratings)

            for r in ratings:
                # Train on the raw (unclipped) prediction so the gradient
                # is not distorted when predictions fall outside [1, 5]
                error = r.rating - self._raw_predict(r.user_id, r.item_id)

                # Update biases
                self.user_bias[r.user_id] += self.lr * (
                    error - self.reg_bias * self.user_bias[r.user_id]
                )
                self.item_bias[r.item_id] += self.lr * (
                    error - self.reg_bias * self.item_bias[r.item_id]
                )

                # Update latent factors (copy so both updates use the old values)
                p_u = self.P[r.user_id].copy()
                q_i = self.Q[r.item_id].copy()
                self.P[r.user_id] += self.lr * (
                    error * q_i - self.reg_factors * p_u
                )
                self.Q[r.item_id] += self.lr * (
                    error * p_u - self.reg_factors * q_i
                )

    def explain_prediction(self, user_id: int, item_id: int) -> dict:
        """
        Decompose a prediction into its components.
        Useful for debugging and interpretation.
        """
        return {
            "global_mean": self.global_mean,
            "user_bias": self.user_bias[user_id],
            "item_bias": self.item_bias[item_id],
            "interaction": np.dot(self.P[user_id], self.Q[item_id]),
            "final_prediction": self.predict(user_id, item_id)
        }

    def analyze_biases(
        self,
        user_names: List[str] = None,
        item_names: List[str] = None,
        top_k: int = 5
    ):
        """Analyze learned biases to understand rating patterns."""
        print("\n=== Highest User Biases (Lenient Raters) ===")
        top_users = np.argsort(self.user_bias)[-top_k:][::-1]
        for u in top_users:
            name = user_names[u] if user_names else f"User {u}"
            print(f"  {name}: {self.user_bias[u]:+.2f}")

        print("\n=== Lowest User Biases (Harsh Critics) ===")
        bottom_users = np.argsort(self.user_bias)[:top_k]
        for u in bottom_users:
            name = user_names[u] if user_names else f"User {u}"
            print(f"  {name}: {self.user_bias[u]:+.2f}")

        print("\n=== Highest Item Biases (Universally Loved) ===")
        top_items = np.argsort(self.item_bias)[-top_k:][::-1]
        for i in top_items:
            name = item_names[i] if item_names else f"Item {i}"
            print(f"  {name}: {self.item_bias[i]:+.2f}")

        print("\n=== Lowest Item Biases (Universally Disliked) ===")
        bottom_items = np.argsort(self.item_bias)[:top_k]
        for i in bottom_items:
            name = item_names[i] if item_names else f"Item {i}"
            print(f"  {name}: {self.item_bias[i]:+.2f}")
```

We often use different regularization strengths for biases (reg_bias) and latent factors (reg_factors). Biases are simple scalar parameters with few degrees of freedom per user or item, so they typically need less regularization than the factor vectors. The Netflix Prize winners tuned these separately and found a significant improvement.
The fundamental assumption underlying matrix factorization is that the true rating matrix has low rank—that is, it can be well-approximated by a matrix of rank k, where k is much smaller than either dimension.
What does low rank mean?
A matrix has rank k if it can be expressed as the sum of k rank-1 matrices (outer products of vectors). Equivalently, it has at most k linearly independent rows or columns.
For our rating matrix R ≈ P × Q^T:

- The product P × Q^T has rank at most k, since each of its columns is a linear combination of the k columns of P.
- Equivalently, every user's row of predicted ratings is a weighted mix of just k basis 'taste profiles' (the rows of Q^T).
Why would preferences be low-rank?
Human preferences, while seemingly infinite in variety, actually cluster around common patterns: affinity for particular genres and tones, tolerance for slow pacing, nostalgia for particular eras, taste for mainstream versus art-house productions, and so on.
These dimensions are finite and shared. Two users who've never interacted might have identical taste profiles because they share the same underlying factor values.
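The low-rank claim can be tested directly with a truncated SVD: keep only the top-k singular values of a small dense rating matrix (the 5-user, 4-movie example from earlier) and measure how well each rank-k approximation reconstructs it.

```python
import numpy as np

R = np.array([
    [5, 2, 5, 1],
    [2, 5, 1, 5],
    [5, 1, 5, 2],
    [1, 5, 2, 5],
    [5, 3, 4, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)

for k in (1, 2, 3):
    R_k = (U[:, :k] * s[:k]) @ Vt[:k]   # best rank-k approximation (Eckart-Young)
    err = np.linalg.norm(R - R_k)       # Frobenius reconstruction error
    print(f"k={k}: reconstruction error = {err:.3f}")
```

The error falls sharply by k = 2, consistent with the two-factor (action/romance) structure built into this example: two shared dimensions explain most of the twenty ratings.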
| Rank k | Parameters* | Compression Ratio | Typical RMSE |
|---|---|---|---|
| 10 | 1.1M | ~900× | ~0.94 |
| 50 | 5.5M | ~180× | ~0.90 |
| 100 | 11M | ~90× | ~0.88 |
| 200 | 22M | ~45× | ~0.87 |
| 1000 | 110M | ~9× | ~0.86 (overfit) |

*Parameters = (N + M) × k for a 100K user × 10K item matrix (1 billion entries stored densely); RMSE values are illustrative.
The bias-variance tradeoff in rank:
As we increase k, we can capture finer-grained patterns, but each additional factor adds parameters that must be estimated from the same sparse data: bias falls while variance rises.
The optimal k depends on the underlying complexity of the domain. Music streaming (Spotify) might justify higher k than movie ratings because musical taste has more dimensions (genre, tempo, energy, lyrics, production style, era, artist relationships).
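In practice the choice reduces to a validation sweep. A minimal sketch on synthetic rank-2 data, with a tiny SGD trainer whose names and hyperparameters are all illustrative:

```python
import numpy as np

def train_and_eval(R, train_mask, val_mask, k, epochs=200, lr=0.01, lam=0.05, seed=0):
    """Tiny SGD matrix factorization; returns RMSE on held-out entries."""
    rng = np.random.default_rng(seed)
    n_u, n_i = R.shape
    P = rng.normal(0, 1 / np.sqrt(k), (n_u, k))
    Q = rng.normal(0, 1 / np.sqrt(k), (n_i, k))
    obs = list(zip(*np.nonzero(train_mask)))
    for _ in range(epochs):
        for u, i in obs:
            pu = P[u].copy()
            e = R[u, i] - pu @ Q[i]
            P[u] += lr * (e * Q[i] - lam * pu)
            Q[i] += lr * (e * pu - lam * Q[i])
    sq = [(R[u, i] - P[u] @ Q[i]) ** 2 for u, i in zip(*np.nonzero(val_mask))]
    return float(np.sqrt(np.mean(sq)))

# Synthetic ratings with true rank 2, then a train/validation split
rng = np.random.default_rng(1)
R = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
observed = rng.random(R.shape) < 0.5
train_mask = observed & (rng.random(R.shape) < 0.8)
val_mask = observed & ~train_mask

for k in (1, 2, 4):
    print(f"k={k}: validation RMSE = {train_and_eval(R, train_mask, val_mask, k):.3f}")
```

On this data, k = 1 underfits visibly while k = 2 matches the true rank; the same sweep, at larger scale, is how k is chosen for real systems.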
The low-rank assumption breaks down when ratings are highly idiosyncratic—driven by personal experiences rather than shared preferences. Your love for a specific obscure film because it was playing during a memorable life event can't be captured by any number of shared factors. This is why hybrid systems combining content features often outperform pure collaborative filtering.
We've established the conceptual foundation for understanding how matrix factorization drives modern recommendation systems. Let's consolidate the key insights:

- Ratings arise from the interaction of hidden user preferences and hidden item attributes; factorizing R ≈ P × Q^T recovers both.
- The low-rank assumption is what turns a billion-entry lookup table into a compact model that can predict unseen entries.
- Bias terms (μ, b_u, b_i) absorb systematic rating tendencies so the latent factors can focus on genuine personalization.
- Unlike neighborhood methods, latent factors generalize transitively to user-item pairs with no direct similarity evidence.
Connections to the broader ML landscape:
Latent factor models connect to several fundamental concepts: they are close cousins of the singular value decomposition and PCA (low-rank approximation of a matrix), precursors of the embedding layers used throughout deep learning, and instances of the broader family of latent variable models.
What's next:
We've established what latent factor models are and why they work. The next pages dive into the singular value decomposition, its extensions for sparse and incomplete rating data, and the optimization machinery that fits these models at scale.
You now understand the foundational concepts of latent factor models: how matrix factorization discovers hidden dimensions of preference, why the low-rank assumption enables powerful generalization, and how bias terms capture systematic rating patterns. Next, we'll explore the mathematical machinery of SVD and its extensions that made matrix factorization the dominant paradigm in recommendation.