Matrix factorization models can easily overfit. With millions of parameters (k factors × (m users + n items)), the model can memorize training ratings rather than learning generalizable patterns. A user with only 5 ratings could have their p_u vector perfectly fit those 5 points—but fail completely on new items.
Regularization is the set of techniques that constrain model complexity, forcing the model to learn patterns that generalize. Without proper regularization, even the best algorithms will produce poor recommendations in production.
This page covers L2 regularization fundamentals, why it works (the bias-variance tradeoff), per-parameter regularization strategies, temporal dynamics handling, and best practices from industry.
The standard regularized objective is:
L = Σ_{(u,i) ∈ K} (r_ui - r̂_ui)² + λ_P ||P||_F² + λ_Q ||Q||_F² + λ_{b_u} Σ_u b_u² + λ_{b_i} Σ_i b_i²
Where ||·||_F is the Frobenius norm (sum of squared entries).
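To make the objective concrete, here is a minimal sketch that evaluates it for given parameters. The function name and default λ values are illustrative assumptions, not from a specific library:

```python
import numpy as np

def regularized_loss(ratings, mu, b_u, b_i, P, Q,
                     lam_p=0.02, lam_q=0.02, lam_bu=0.01, lam_bi=0.01):
    """Squared error over observed ratings plus L2 penalties.

    ratings: iterable of (user_id, item_id, rating) tuples (the observed set K).
    """
    sq_err = 0.0
    for u, i, r in ratings:
        pred = mu + b_u[u] + b_i[i] + P[u] @ Q[i]
        sq_err += (r - pred) ** 2

    return (sq_err
            + lam_p * np.sum(P ** 2)      # λ_P ||P||_F²
            + lam_q * np.sum(Q ** 2)      # λ_Q ||Q||_F²
            + lam_bu * np.sum(b_u ** 2)   # λ_{b_u} Σ b_u²
            + lam_bi * np.sum(b_i ** 2))  # λ_{b_i} Σ b_i²
```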
Why L2 regularization works:
```python
import numpy as np


def demonstrate_regularization_effect():
    """
    Show how regularization affects learned parameters.
    """
    np.random.seed(42)

    # Simulate: user with only 3 ratings
    # True preference: slightly likes action (factor 0), neutral on romance (factor 1)
    true_p = np.array([0.3, 0.0])

    # 3 items this user rated
    Q = np.array([
        [1.0, 0.2],   # Action movie
        [0.8, 0.3],   # Action movie
        [-0.1, 0.9],  # Romance movie
    ])

    # True ratings with noise
    true_ratings = Q @ true_p + np.random.normal(0, 0.2, 3)

    # Solve for p with different regularization strengths
    lambdas = [0.0, 0.1, 1.0, 10.0]

    print("Effect of regularization on learned user vector:")
    print(f"True p: {true_p}")
    print(f"Ratings: {true_ratings}")
    print()

    for lam in lambdas:
        # Ridge regression solution: (Q^T Q + λI)^{-1} Q^T r
        A = Q.T @ Q + lam * np.eye(2)
        b = Q.T @ true_ratings
        p_learned = np.linalg.solve(A, b)

        # Prediction error on training data
        train_pred = Q @ p_learned
        train_rmse = np.sqrt(np.mean((true_ratings - train_pred) ** 2))

        # True error (distance from true p)
        param_error = np.linalg.norm(p_learned - true_p)

        print(f"λ={lam:5.1f}: p={p_learned}, "
              f"train_RMSE={train_rmse:.3f}, param_error={param_error:.3f}")


if __name__ == "__main__":
    demonstrate_regularization_effect()
```

With λ=0 (no regularization), we minimize training error but may have high variance (sensitivity to noise). With large λ, we have low variance but high bias (systematic underfitting). The optimal λ balances these; it is typically found via cross-validation.
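Since the text leaves λ selection to cross-validation, here is a minimal sketch of a held-out validation search, reusing the closed-form ridge solve from the demo above. The λ grid and split fraction are illustrative assumptions:

```python
import numpy as np

def select_lambda(Q, ratings, lambdas=(0.01, 0.1, 1.0, 10.0),
                  val_frac=0.3, seed=0):
    """Pick the ridge λ that minimizes RMSE on a held-out validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(ratings))
    n_val = max(1, int(len(ratings) * val_frac))
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    best_lam, best_rmse = None, np.inf
    for lam in lambdas:
        # Closed-form ridge fit on the training split only
        Qt = Q[train_idx]
        A = Qt.T @ Qt + lam * np.eye(Q.shape[1])
        p = np.linalg.solve(A, Qt.T @ ratings[train_idx])

        # Score on the held-out ratings
        rmse = np.sqrt(np.mean((ratings[val_idx] - Q[val_idx] @ p) ** 2))
        if rmse < best_rmse:
            best_lam, best_rmse = lam, rmse
    return best_lam, best_rmse
```

In practice you would average over several folds (k-fold cross-validation) rather than a single split, but the structure is the same.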
Not all parameters should be regularized equally. The Netflix Prize teams discovered that using different regularization strengths for different parameter types significantly improves results:
| Parameter | Typical λ Range | Reason |
|---|---|---|
| Global mean (μ) | 0 (no reg) | Single parameter, well-determined |
| User bias (b_u) | 0.005 - 0.015 | Simple parameter, one per user |
| Item bias (b_i) | 0.005 - 0.015 | Simple parameter, one per item |
| User factors (P) | 0.015 - 0.05 | Many parameters (k per user) |
| Item factors (Q) | 0.015 - 0.05 | Many parameters (k per item) |
| Implicit factors (Y) | 0.02 - 0.1 | Highest complexity risk |
```python
import numpy as np


class RegularizedMF:
    """Matrix factorization with per-parameter regularization."""

    def __init__(self, n_users, n_items, n_factors=50):
        self.n_factors = n_factors

        # Model parameters
        self.global_mean = 0.0
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)
        self.P = np.random.normal(0, 0.1, (n_users, n_factors))
        self.Q = np.random.normal(0, 0.1, (n_items, n_factors))

        # Per-parameter regularization strengths
        self.reg_bu = 0.01  # User bias
        self.reg_bi = 0.01  # Item bias
        self.reg_pu = 0.02  # User factors
        self.reg_qi = 0.02  # Item factors

    def update(self, user_id, item_id, rating, lr=0.005):
        """Single SGD update with per-parameter regularization."""
        pred = (self.global_mean
                + self.user_bias[user_id]
                + self.item_bias[item_id]
                + np.dot(self.P[user_id], self.Q[item_id]))
        err = rating - pred

        # Each parameter type uses its own regularization
        self.user_bias[user_id] += lr * (err - self.reg_bu * self.user_bias[user_id])
        self.item_bias[item_id] += lr * (err - self.reg_bi * self.item_bias[item_id])

        pu = self.P[user_id].copy()
        qi = self.Q[item_id].copy()
        self.P[user_id] += lr * (err * qi - self.reg_pu * pu)
        self.Q[item_id] += lr * (err * pu - self.reg_qi * qi)
```
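A brief usage sketch for the class above; the toy data, epoch count, and training setup are illustrative assumptions, not part of the class itself:

```python
import random

# Hypothetical toy data: (user_id, item_id, rating) triples
ratings = [(0, 0, 4.0), (0, 1, 3.5), (1, 1, 5.0), (1, 2, 2.0)]

model = RegularizedMF(n_users=2, n_items=3, n_factors=8)
model.global_mean = sum(r for _, _, r in ratings) / len(ratings)

for epoch in range(20):
    random.shuffle(ratings)  # SGD: visit ratings in a fresh random order each epoch
    for u, i, r in ratings:
        model.update(u, i, r)
```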
Users and items with more ratings provide more signal and can support more complex parameters. This suggests adaptive regularization based on activity levels.

Frequency-based regularization:
λ_u = λ_base / √|I_u| (divide by sqrt of number of ratings)
λ_i = λ_base / √|U_i|
Users with 1000 ratings get 10× less regularization than users with 10 ratings. This allows the model to learn richer representations for active users while keeping sparse users simple.
```python
import numpy as np


class AdaptiveRegMF:
    """MF with activity-based adaptive regularization."""

    def __init__(self, n_users, n_items, n_factors=50, base_reg=0.05):
        self.base_reg = base_reg
        self.n_factors = n_factors

        # Count ratings per user/item
        self.user_counts = np.zeros(n_users)
        self.item_counts = np.zeros(n_items)

        self.P = np.random.normal(0, 0.1, (n_users, n_factors))
        self.Q = np.random.normal(0, 0.1, (n_items, n_factors))

    def precompute_counts(self, ratings):
        """Count ratings per user and item."""
        for r in ratings:
            self.user_counts[r.user_id] += 1
            self.item_counts[r.item_id] += 1

    def get_user_reg(self, user_id):
        """Adaptive regularization for user."""
        count = max(1, self.user_counts[user_id])
        return self.base_reg / np.sqrt(count)

    def get_item_reg(self, item_id):
        """Adaptive regularization for item."""
        count = max(1, self.item_counts[item_id])
        return self.base_reg / np.sqrt(count)

    def update(self, user_id, item_id, rating, lr=0.005):
        err = rating - np.dot(self.P[user_id], self.Q[item_id])

        reg_u = self.get_user_reg(user_id)
        reg_i = self.get_item_reg(item_id)

        pu, qi = self.P[user_id].copy(), self.Q[item_id].copy()
        self.P[user_id] += lr * (err * qi - reg_u * pu)
        self.Q[item_id] += lr * (err * pu - reg_i * qi)
```

Adaptive regularization particularly helps cold start. New users with 1-2 ratings get heavy regularization, keeping their vectors near the global average. As they rate more items, the model learns their unique preferences.
User preferences and item perceptions change over time. The Netflix Prize winners found that modeling temporal dynamics with careful regularization significantly improved predictions.
Key temporal effects:

- Item biases drift: an item's popularity rises and falls over time
- User biases drift: a user's rating scale can shift (e.g., becoming a harsher critic)
- User preferences evolve: tastes change gradually, so static factor vectors grow stale
Temporal bias model:
b_u(t) = b_u + α_u × dev_u(t) + b_{u,t}
Where dev_u(t) is a signed measure of how far the rating time t lies from the user's mean rating date (modeling gradual long-term drift), and b_{u,t} is a per-day bias capturing short-term fluctuations.
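The text leaves dev_u(t) unspecified; the minimal sketch below assumes the power-law form used in the Netflix Prize literature, dev_u(t) = sign(t - t_u) × |t - t_u|^β with β ≈ 0.4, and the parameter values in the example are illustrative:

```python
import numpy as np

def dev_u(t, t_u_mean, beta=0.4):
    """Signed deviation of day t from the user's mean rating day.

    Assumes the power-law form sign(t - t_u) * |t - t_u|^beta;
    beta ~ 0.4 is a typical choice, tuned on validation data.
    """
    diff = t - t_u_mean
    return np.sign(diff) * np.abs(diff) ** beta

def user_bias_at(b_u, alpha_u, t, t_u_mean, b_ut=0.0):
    """b_u(t) = b_u + alpha_u * dev_u(t) + b_{u,t}.

    b_u     : user's static bias
    alpha_u : learned drift coefficient (how strongly the bias drifts)
    b_ut    : per-day bias b_{u,t}; 0.0 for days with no learned value
    """
    return b_u + alpha_u * dev_u(t, t_u_mean) + b_ut

# Illustrative values: a user whose ratings center on day 500,
# evaluated 300 days later with a small day-specific spike.
print(user_bias_at(b_u=0.2, alpha_u=0.01, t=800, t_u_mean=500, b_ut=0.05))
# 0.2 + 0.01 * 300**0.4 + 0.05 ≈ 0.35
```

Both α_u and the per-day biases are regularized like any other parameter; the per-day terms in particular see very few observations each, so they typically get the heaviest regularization.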
You now understand regularization in matrix factorization: L2 fundamentals, per-parameter strategies, adaptive schemes, temporal considerations, and best practices. Next, we'll explore how to extend MF to implicit feedback data—clicks, views, and purchases.