In 2006, Netflix offered a $1 million prize to anyone who could improve the accuracy (RMSE) of its recommendation algorithm by 10%. The winning solution, awarded in 2009, relied heavily on Singular Value Decomposition (SVD) and its extensions. This established matrix factorization as the gold standard for collaborative filtering.
SVD provides the mathematical foundation for understanding latent factor models. While pure SVD cannot handle missing data directly, the techniques inspired by it—often called 'SVD-style' or 'Funk SVD'—became the backbone of modern recommendation systems. The extension SVD++ further improved predictions by incorporating implicit feedback signals.
This page covers classical SVD and its relationship to matrix factorization, the practical 'Funk SVD' approach for sparse matrices, SVD++ which incorporates implicit feedback, and the mathematical intuition connecting these techniques.
Singular Value Decomposition is a fundamental matrix factorization from linear algebra. For any m × n matrix R:
R = U × Σ × V^T
Where:
- U is an m × m orthogonal matrix whose columns are the left singular vectors (in a ratings context, they describe users)
- Σ is an m × n diagonal matrix of non-negative singular values, ordered so that σ_1 ≥ σ_2 ≥ … ≥ 0
- V^T is the transpose of an n × n orthogonal matrix whose columns are the right singular vectors (they describe items)
Low-rank approximation:
The power of SVD lies in the Eckart-Young theorem: the best rank-k approximation to R (minimizing Frobenius norm) is obtained by keeping only the top k singular values:
R_k = U_k × Σ_k × V_k^T
This gives the optimal low-rank compression of R, capturing maximum information in k dimensions.
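A useful consequence of the Eckart-Young theorem is that the error of the rank-k approximation depends only on the singular values you discard:

||R − R_k||_F = √(σ_{k+1}² + σ_{k+2}² + … + σ_r²)

So the cumulative "energy" of the leading singular values (printed by the code below) tells you exactly how much structure a given rank preserves.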
```python
import numpy as np

def demonstrate_svd_approximation():
    """Show how SVD provides optimal low-rank approximation."""
    # Full rating matrix (no missing values for demonstration)
    R = np.array([
        [5, 3, 0, 1],
        [4, 0, 0, 1],
        [1, 1, 0, 5],
        [1, 0, 0, 4],
        [0, 1, 5, 4],
    ], dtype=float)

    # Compute full SVD
    U, sigma, Vt = np.linalg.svd(R, full_matrices=False)

    print("Singular values:", sigma)
    print("\nEnergy captured by each component:")
    energy = sigma**2 / np.sum(sigma**2)
    cumulative = np.cumsum(energy)
    for i, (e, c) in enumerate(zip(energy, cumulative)):
        print(f"  σ_{i+1}: {e:.1%} (cumulative: {c:.1%})")

    # Reconstruct with different ranks
    for k in [1, 2, 3]:
        R_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
        error = np.linalg.norm(R - R_k, 'fro')
        print(f"\nRank-{k} approximation error: {error:.3f}")

if __name__ == "__main__":
    demonstrate_svd_approximation()
```

Classical SVD requires a complete matrix. In recommendations, 99%+ of entries are missing. Treating missing values as zeros drastically biases the decomposition (a missing rating isn't a zero rating). This motivated the development of specialized algorithms.
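To make the zero-imputation problem concrete, here is a minimal sketch reusing the matrix above but now reading its zeros as missing ratings: the rank-2 reconstruction dutifully predicts values near zero for those cells, far below any legitimate rating on a 1-5 scale.

```python
import numpy as np

# Same matrix as above, but treat the 0 entries as *missing* ratings
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
observed = R > 0

# Naive approach: leave missing entries as 0 and factorize anyway
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

# Predictions at the missing positions hover near zero, because the
# factorization was told those entries really are zero
print("Predicted values at missing positions:")
print(np.round(R2[~observed], 2))
```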
Simon Funk, competing in the Netflix Prize, popularized a practical approach: instead of computing SVD on the full matrix, learn the factor matrices directly by minimizing error only on observed ratings.
This is the optimization we introduced earlier:
min_{P,Q} Σ_{(u,i) ∈ K} (r_ui - p_u · q_i)² + λ(||P||² + ||Q||²)
This 'Funk SVD' (not a true SVD, since no orthogonality constraints are enforced) became the standard approach for sparse rating matrices.
The key insight: the training loss is computed only over observed entries, yet the learned factors still generalize to the unobserved entries we actually want to predict. Simply skipping missing data during optimization sidesteps the sparsity problem.
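Minimizing this objective with stochastic gradient descent leads to simple per-rating updates. For each observed rating r_ui, compute the error and nudge both factor vectors (γ is the learning rate, λ the regularization strength); the implementation below also carries global-mean and bias terms, which are updated analogously:

e_ui = r_ui − p_u · q_i
p_u ← p_u + γ (e_ui · q_i − λ p_u)
q_i ← q_i + γ (e_ui · p_u − λ q_i)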
```python
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rating:
    user_id: int
    item_id: int
    rating: float

class FunkSVD:
    """
    Funk's SVD-style matrix factorization.

    Trains only on observed ratings, ignoring missing entries.
    """

    def __init__(self, n_users: int, n_items: int, n_factors: int = 50):
        self.n_factors = n_factors

        # Initialize with small random values
        self.global_mean = 0.0
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)
        self.P = np.random.normal(0, 0.1, (n_users, n_factors))
        self.Q = np.random.normal(0, 0.1, (n_items, n_factors))

    def predict(self, user_id: int, item_id: int) -> float:
        pred = (self.global_mean
                + self.user_bias[user_id]
                + self.item_bias[item_id]
                + np.dot(self.P[user_id], self.Q[item_id]))
        return np.clip(pred, 1.0, 5.0)

    def fit(self, ratings: List[Rating], n_epochs: int = 20,
            lr: float = 0.005, reg: float = 0.02):
        """Train using stochastic gradient descent."""
        self.global_mean = np.mean([r.rating for r in ratings])

        for epoch in range(n_epochs):
            np.random.shuffle(ratings)
            total_error = 0.0

            for r in ratings:
                pred = self.predict(r.user_id, r.item_id)
                err = r.rating - pred
                total_error += err ** 2

                # Update biases
                self.user_bias[r.user_id] += lr * (err - reg * self.user_bias[r.user_id])
                self.item_bias[r.item_id] += lr * (err - reg * self.item_bias[r.item_id])

                # Update latent factors
                pu, qi = self.P[r.user_id].copy(), self.Q[r.item_id].copy()
                self.P[r.user_id] += lr * (err * qi - reg * pu)
                self.Q[r.item_id] += lr * (err * pu - reg * qi)

            rmse = np.sqrt(total_error / len(ratings))
            print(f"Epoch {epoch+1}: RMSE = {rmse:.4f}")
```

SVD++, proposed by Yehuda Koren, extends the basic model by incorporating implicit feedback—the fact that a user rated an item at all.
The insight: even before knowing the rating value, the act of rating reveals something about the user. Users who rate many indie films differ from those who rate only blockbusters, even if their explicit ratings are similar.
The SVD++ model:
r̂_ui = μ + b_u + b_i + q_i^T (p_u + |N(u)|^{-1/2} Σ_{j∈N(u)} y_j)
Where:
- μ is the global mean rating, and b_u, b_i are the user and item bias terms
- p_u and q_i are the explicit user and item latent factor vectors
- N(u) is the set of items for which user u has provided implicit feedback (here, the items u has rated)
- y_j is an implicit factor vector for item j; the |N(u)|^{-1/2}-normalized sum shifts the user representation according to which items u has interacted with
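The SGD updates mirror those of Funk SVD, with one extra step: every item j ∈ N(u) receives a share of the gradient through the implicit term. Writing e_ui = r_ui − r̂_ui:

p_u ← p_u + γ (e_ui · q_i − λ p_u)
q_i ← q_i + γ (e_ui · (p_u + |N(u)|^{-1/2} Σ_{j∈N(u)} y_j) − λ q_i)
y_j ← y_j + γ (e_ui · |N(u)|^{-1/2} · q_i − λ y_j)   for every j ∈ N(u)

This is exactly the loop structure in the implementation below.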
```python
import numpy as np
from collections import defaultdict
from typing import List, Set, Dict

class SVDPlusPlus:
    """
    SVD++ model incorporating implicit feedback.

    User representation combines explicit factors (p_u) with
    implicit factors (sum of y_j for items user has rated).
    """

    def __init__(self, n_users: int, n_items: int, n_factors: int = 50):
        self.n_factors = n_factors
        scale = 0.1

        self.global_mean = 0.0
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)
        self.P = np.random.normal(0, scale, (n_users, n_factors))
        self.Q = np.random.normal(0, scale, (n_items, n_factors))
        self.Y = np.random.normal(0, scale, (n_items, n_factors))  # Implicit factors

        # User's rated items (implicit feedback)
        self.user_items: Dict[int, Set[int]] = defaultdict(set)

    def _get_implicit_feedback(self, user_id: int) -> np.ndarray:
        """Compute implicit feedback contribution for user."""
        items = self.user_items[user_id]
        if not items:
            return np.zeros(self.n_factors)

        # Sum of y vectors, normalized by sqrt of count
        y_sum = np.sum([self.Y[j] for j in items], axis=0)
        return y_sum / np.sqrt(len(items))

    def predict(self, user_id: int, item_id: int) -> float:
        """Predict with both explicit and implicit factors."""
        implicit = self._get_implicit_feedback(user_id)
        user_vec = self.P[user_id] + implicit

        pred = (self.global_mean
                + self.user_bias[user_id]
                + self.item_bias[item_id]
                + np.dot(self.Q[item_id], user_vec))
        return np.clip(pred, 1.0, 5.0)

    def fit(self, ratings: list, n_epochs: int = 20,
            lr: float = 0.007, reg: float = 0.02):
        """Train SVD++ with SGD."""
        # Build user-item implicit feedback sets
        for r in ratings:
            self.user_items[r.user_id].add(r.item_id)

        self.global_mean = np.mean([r.rating for r in ratings])

        for epoch in range(n_epochs):
            np.random.shuffle(ratings)
            sq_error = 0.0

            for r in ratings:
                items_u = self.user_items[r.user_id]
                sqrt_nu = np.sqrt(len(items_u)) if items_u else 1.0

                # Implicit contribution
                implicit = self._get_implicit_feedback(r.user_id)
                user_vec = self.P[r.user_id] + implicit

                pred = (self.global_mean
                        + self.user_bias[r.user_id]
                        + self.item_bias[r.item_id]
                        + np.dot(self.Q[r.item_id], user_vec))
                err = r.rating - pred
                sq_error += err ** 2

                # Update biases
                self.user_bias[r.user_id] += lr * (err - reg * self.user_bias[r.user_id])
                self.item_bias[r.item_id] += lr * (err - reg * self.item_bias[r.item_id])

                # Update factors
                qi = self.Q[r.item_id].copy()
                self.P[r.user_id] += lr * (err * qi - reg * self.P[r.user_id])
                self.Q[r.item_id] += lr * (err * user_vec - reg * qi)

                # Update implicit factors
                for j in items_u:
                    self.Y[j] += lr * (err * qi / sqrt_nu - reg * self.Y[j])

            print(f"Epoch {epoch+1}: RMSE = {np.sqrt(sq_error/len(ratings)):.4f}")
```

SVD++ typically improves RMSE by 1-2% over basic SVD. The implicit feedback helps especially for users with few ratings—even one rating places them in a 'neighborhood' based on that item's y vector. This provides a better starting point than an untrained p_u vector alone.
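A minimal usage sketch for the two models, assuming the Rating dataclass and both classes above are in scope; the ratings are random synthetic data, so the printed RMSE values are only illustrative:

```python
import numpy as np

# Synthetic toy data: 100 users, 50 items, ~10 ratings each (illustrative only)
rng = np.random.default_rng(42)
ratings = [
    Rating(user_id=u, item_id=int(rng.integers(0, 50)),
           rating=float(rng.integers(1, 6)))
    for u in range(100) for _ in range(10)
]

funk = FunkSVD(n_users=100, n_items=50, n_factors=20)
funk.fit(ratings, n_epochs=10)

svdpp = SVDPlusPlus(n_users=100, n_items=50, n_factors=20)
svdpp.fit(ratings, n_epochs=10)

print("Funk SVD prediction for user 0, item 3:", funk.predict(0, 3))
print("SVD++  prediction for user 0, item 3:", svdpp.predict(0, 3))
```

The table below summarizes typical hyperparameter ranges for these models.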
| Parameter | Typical Range | Netflix Prize Values |
|---|---|---|
| n_factors (k) | 20-200 | 50-200 |
| learning_rate | 0.001-0.02 | 0.007 |
| regularization | 0.01-0.1 | 0.015-0.05 |
| n_epochs | 20-100 | 30-50 |
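Which values work best within these ranges is dataset-dependent. One minimal way to choose them, sketched below reusing the FunkSVD class and the synthetic ratings list from the usage sketch above, is a small grid search scored by RMSE on held-out ratings:

```python
import numpy as np

def rmse(model, ratings):
    """Root-mean-squared error of model predictions on a list of ratings."""
    errs = [(r.rating - model.predict(r.user_id, r.item_id)) ** 2 for r in ratings]
    return float(np.sqrt(np.mean(errs)))

# Hold out 20% of ratings for validation
rng = np.random.default_rng(0)
idx = rng.permutation(len(ratings))
cut = int(0.8 * len(ratings))
train = [ratings[i] for i in idx[:cut]]
valid = [ratings[i] for i in idx[cut:]]

best = None
for k in (20, 50, 100):
    for reg in (0.02, 0.05):
        model = FunkSVD(n_users=100, n_items=50, n_factors=k)
        model.fit(train, n_epochs=20, reg=reg)
        score = rmse(model, valid)
        if best is None or score < best[0]:
            best = (score, k, reg)

print(f"Best validation RMSE {best[0]:.4f} with k={best[1]}, reg={best[2]}")
```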
You now understand the SVD family of algorithms: from classical SVD to practical Funk SVD to the implicit-aware SVD++. Next, we'll explore Alternating Least Squares (ALS), a different optimization approach that enables parallelization and scales to massive datasets.