At the core of every collaborative filtering system lies a deceptively profound question: What does it mean for two users (or items) to be similar?
This isn't merely a computational convenience—it's an epistemological stance about how preferences relate to each other. Should we care about absolute rating values or relative patterns? Should agreement on rare items count more than agreement on popular ones? How do we handle the vast emptiness of sparse rating matrices?
The choice of similarity measure directly impacts recommendation quality, computational cost, and system behavior. A poor choice can doom an otherwise well-designed system. A thoughtful choice, informed by data characteristics and use case requirements, can unlock substantial accuracy improvements.
This page provides a comprehensive treatment of similarity measures in collaborative filtering—their mathematical foundations, computational properties, and practical guidance for real-world systems.
By the end of this page, you will understand the fundamental distinction between similarity and distance, master the major similarity measures (Pearson, cosine, Jaccard, and their variants), appreciate the mathematical properties that matter in practice, and develop intuition for choosing the right measure for your data and use case.
Before examining specific measures, we must understand the relationship between distance and similarity—two complementary ways of quantifying how alike two objects are.
Distance (Dissimilarity):
A distance function d(x, y) measures how far apart two objects are. Smaller values indicate more similar objects. A proper metric satisfies:
Non-negativity: d(x, y) ≥ 0.
Identity: d(x, y) = 0 if and only if x = y.
Symmetry: d(x, y) = d(y, x).
Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).
Similarity:
A similarity function s(x, y) measures how alike two objects are. Larger values indicate more similar objects. Typically bounded (e.g., [0, 1] or [-1, 1]).
Converting Between Them:
Simplest: sim(x, y) = 1 / (1 + dist(x, y))
Alternative: sim(x, y) = exp(-dist(x, y))
For bounded distance: sim(x, y) = 1 - dist(x, y) / max_dist
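As a quick illustration, here is a minimal sketch of these three conversions in Python (the function names are ours, chosen only for this example):

```python
import math

def dist_to_sim_reciprocal(d: float) -> float:
    """sim = 1 / (1 + dist): maps [0, inf) onto (0, 1]."""
    return 1.0 / (1.0 + d)

def dist_to_sim_exp(d: float) -> float:
    """sim = exp(-dist): also maps [0, inf) onto (0, 1], but decays faster."""
    return math.exp(-d)

def dist_to_sim_bounded(d: float, max_dist: float) -> float:
    """sim = 1 - dist / max_dist: for distances with a known upper bound."""
    return 1.0 - d / max_dist

for d in (0.0, 0.5, 2.0):
    print(d, round(dist_to_sim_reciprocal(d), 3), round(dist_to_sim_exp(d), 3))
```

All three map a distance of 0 to the maximum similarity; they differ only in how quickly similarity decays as distance grows.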
| Property | Distance | Similarity |
|---|---|---|
| Value for identical objects | 0 (minimum) | Maximum (often 1) |
| Value for different objects | Positive (larger = more different) | Smaller (often approaching 0) |
| Intuition | 'How far apart?' | 'How alike?' |
| Used in | Clustering, nearest neighbor search | Collaborative filtering, weighted aggregation |
CF uses similarity rather than distance because we need to weight neighbors' contributions. A neighbor with sim=0.9 should influence predictions 9x more than one with sim=0.1. This multiplicative weighting is natural with similarity but awkward with distance (would a neighbor at distance 0.1 contribute 10x more than one at distance 1?).
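To make the multiplicative weighting concrete, here is a tiny sketch of similarity-weighted aggregation; the neighbor similarities and ratings are invented for illustration, and full neighborhood methods are covered on the next page:

```python
import numpy as np

# Hypothetical neighbors: (similarity to target user, that neighbor's rating of the target item)
neighbors = [(0.9, 5.0), (0.4, 3.0), (0.1, 2.0)]

sims = np.array([s for s, _ in neighbors])
ratings = np.array([r for _, r in neighbors])

# Similarity-weighted average: the sim=0.9 neighbor contributes 9x more than the sim=0.1 neighbor
prediction = np.dot(sims, ratings) / np.sum(np.abs(sims))
print(f"predicted rating: {prediction:.2f}")  # ~4.21
```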
The Pearson correlation coefficient is the most widely used similarity measure for explicit ratings. It measures the linear correlation between two rating vectors.
Mathematical Definition:
For users u and v with co-rated items I<sub>uv</sub>:
$$\text{pearson}(u, v) = \frac{\sum_{i \in I_{uv}} (R_{ui} - \bar{r}_u)(R_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (R_{ui} - \bar{r}_u)^2} \cdot \sqrt{\sum_{i \in I_{uv}} (R_{vi} - \bar{r}_v)^2}}$$
Where r̄<sub>u</sub> is the mean of user u's ratings over co-rated items only.
Key Properties:
Range [-1, 1]: +1 means perfectly aligned preferences, -1 perfectly opposite preferences, 0 no linear relationship.
Computed only over the co-rated items I<sub>uv</sub>.
Invariant to each user's rating scale: shifting or rescaling one user's ratings does not change the correlation.
Undefined when a user's co-rated ratings are all identical (zero variance); implementations typically return 0 in that case.
Why Pearson Excels for Explicit Ratings:
Handles rating bias naturally: Mean-centering removes the effect of harsh/generous raters. A user who rates 3-5 and one who rates 1-3 can still be perfectly correlated if their relative preferences match.
Captures negative correlation: If two users have opposite tastes, Pearson returns negative similarity. This is valuable—we can predict that if my "anti-twin" loves something, I'll hate it.
Well-understood statistically: Significance testing is possible. We can assess whether correlation is statistically significant given sample size.
Example Calculation:
Users A and B rate movies:
| Movie | User A | User B | A - mean(A) | B - mean(B) | Product |
|---|---|---|---|---|---|
| M1 | 5 | 3 | +1.5 | +0.5 | +0.75 |
| M2 | 4 | 2 | +0.5 | -0.5 | -0.25 |
| M3 | 2 | 2 | -1.5 | -0.5 | +0.75 |
| M4 | 3 | 3 | -0.5 | +0.5 | -0.25 |
Mean A = 3.5, Mean B = 2.5
Numerator = 0.75 - 0.25 + 0.75 - 0.25 = 1.0
Denom A = √(1.5² + 0.5² + 1.5² + 0.5²) = √5 ≈ 2.24
Denom B = √(0.5² + 0.5² + 0.5² + 0.5²) = √1 = 1.0
Pearson = 1.0 / (2.24 × 1.0) ≈ 0.45
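The hand calculation can be checked against SciPy (assuming scipy is available):

```python
import numpy as np
from scipy.stats import pearsonr

a = np.array([5, 4, 2, 3])  # User A's ratings for M1-M4
b = np.array([3, 2, 2, 3])  # User B's ratings for M1-M4

r, _ = pearsonr(a, b)
print(f"Pearson(A, B) = {r:.3f}")  # ≈ 0.447, matching 1.0 / (√5 · 1.0) above
```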
Pearson's biggest weakness: it requires sufficient co-rated items for reliable estimation. With only 2-3 shared items, correlation can be near ±1 by chance. Always apply significance weighting or minimum overlap thresholds (typically 3-5 co-rated items minimum).
Cosine similarity measures the angle between two vectors, treating ratings as vectors in item-dimensional space.
Standard Cosine Similarity:
$$\cos(u, v) = \frac{\sum_{i \in I_{uv}} R_{ui} \cdot R_{vi}}{\sqrt{\sum_{i \in I_u} R_{ui}^2} \cdot \sqrt{\sum_{i \in I_v} R_{vi}^2}}$$
Note: The denominator uses all items each user rated, not just co-rated items. This is important for proper normalization.
Key Properties:
Range [0, 1] when all ratings are non-negative ([-1, 1] in general).
Scale-invariant: multiplying all of a user's ratings by a constant leaves the angle unchanged.
Not shift-invariant: adding a constant to a user's ratings changes the angle, so rating bias leaks through.
When computed over full vectors, unrated items effectively count as zeros.
The Problem: Rating Bias
Raw cosine doesn't handle rating bias. Consider a generous rater who gives [4, 5, 5, 4] and a harsh rater who gives [1, 2, 2, 1] to the same four items.
Their relative preferences are identical (Pearson = 1.0), yet raw cosine falls short of 1 because the angle between [4, 5, 5, 4] and [1, 2, 2, 1] isn't 0°. Worse, because all ratings on a 1-5 scale are positive, raw cosine is high for almost every pair of users, compressing scores into a narrow band where genuine agreement and mere shared positivity are hard to tell apart.
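A quick numeric check of this example:

```python
import numpy as np

a = np.array([4, 5, 5, 4])  # generous rater
b = np.array([1, 2, 2, 1])  # harsh rater with the same relative preferences

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
pearson = np.corrcoef(a, b)[0, 1]

print(f"raw cosine: {cosine:.3f}")   # ≈ 0.978: high, but not 1.0
print(f"pearson:    {pearson:.3f}")  # 1.000: captures the identical relative preferences
```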
Adjusted Cosine Similarity:
Subtracts user means before computing cosine:
$$\text{adj\_cos}(u, v) = \frac{\sum_{i \in I_{uv}} (R_{ui} - \bar{r}_u)(R_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_u} (R_{ui} - \bar{r}_u)^2} \cdot \sqrt{\sum_{i \in I_v} (R_{vi} - \bar{r}_v)^2}}$$
This is similar to Pearson but differs in the denominator: Pearson sums only over co-rated items, while adjusted cosine sums over all items each user rated.
Cosine vs Pearson: When Each Wins
| Scenario | Better Choice | Why |
|---|---|---|
| Explicit 1-5 ratings | Pearson | Handles rating bias via mean-centering |
| Binary feedback (like/no-like) | Cosine | No bias issue; angle is meaningful |
| Implicit feedback (clicks, views) | Cosine | Values are counts/frequencies, not preferences |
| Very sparse data | Cosine | More robust with many zeros |
| Item-based CF (explicit ratings) | Adjusted Cosine | Best of both worlds |
```python
import numpy as np
from typing import Dict, Optional, Tuple


def pearson_correlation(
    ratings_u: Dict[int, float],
    ratings_v: Dict[int, float],
    min_overlap: int = 3
) -> Optional[float]:
    """
    Compute Pearson correlation between two users.

    Args:
        ratings_u: {item_id: rating} for user u
        ratings_v: {item_id: rating} for user v
        min_overlap: Minimum co-rated items required

    Returns:
        Pearson correlation in [-1, 1], or None if insufficient overlap
    """
    # Find co-rated items
    common = set(ratings_u.keys()) & set(ratings_v.keys())
    if len(common) < min_overlap:
        return None

    # Extract ratings for common items
    r_u = np.array([ratings_u[i] for i in common])
    r_v = np.array([ratings_v[i] for i in common])

    # Mean-center
    mean_u = np.mean(r_u)
    mean_v = np.mean(r_v)
    centered_u = r_u - mean_u
    centered_v = r_v - mean_v

    # Compute correlation
    numerator = np.dot(centered_u, centered_v)
    denom = np.sqrt(np.sum(centered_u**2)) * np.sqrt(np.sum(centered_v**2))

    if denom == 0:
        return 0.0
    return float(numerator / denom)


def cosine_similarity(
    ratings_u: Dict[int, float],
    ratings_v: Dict[int, float]
) -> float:
    """
    Compute cosine similarity between two users.

    Uses sparse representation - only considers rated items.
    """
    common = set(ratings_u.keys()) & set(ratings_v.keys())
    if not common:
        return 0.0

    # Numerator: dot product of common ratings
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)

    # Denominators: norms of full rating vectors
    norm_u = np.sqrt(sum(r**2 for r in ratings_u.values()))
    norm_v = np.sqrt(sum(r**2 for r in ratings_v.values()))

    if norm_u == 0 or norm_v == 0:
        return 0.0
    return float(dot / (norm_u * norm_v))


def adjusted_cosine_similarity(
    ratings_u: Dict[int, float],
    ratings_v: Dict[int, float],
    user_mean_u: float,
    user_mean_v: float
) -> float:
    """
    Compute adjusted cosine similarity (for item-based CF).

    Subtracts user means to handle rating bias.
    """
    common = set(ratings_u.keys()) & set(ratings_v.keys())
    if not common:
        return 0.0

    # Centered ratings
    centered_u = {i: ratings_u[i] - user_mean_u for i in ratings_u}
    centered_v = {i: ratings_v[i] - user_mean_v for i in ratings_v}

    # Numerator: dot product of common centered ratings
    dot = sum(centered_u[i] * centered_v[i] for i in common)

    # Denominators: norms of full centered vectors
    norm_u = np.sqrt(sum(r**2 for r in centered_u.values()))
    norm_v = np.sqrt(sum(r**2 for r in centered_v.values()))

    if norm_u == 0 or norm_v == 0:
        return 0.0
    return float(dot / (norm_u * norm_v))


def jaccard_similarity(
    items_u: set,
    items_v: set
) -> float:
    """
    Compute Jaccard similarity between item sets.

    Used for binary/implicit feedback where only 'liked' items are known.
    """
    if not items_u or not items_v:
        return 0.0

    intersection = len(items_u & items_v)
    union = len(items_u | items_v)
    return intersection / union


def constrained_pearson(
    ratings_u: Dict[int, float],
    ratings_v: Dict[int, float],
    neutral_rating: float = 3.0,
    min_overlap: int = 3
) -> Optional[float]:
    """
    Constrained Pearson correlation.

    Centers around a fixed neutral value (e.g., 3 on a 1-5 scale)
    rather than user means. Useful when rating scale has inherent meaning.
    """
    common = set(ratings_u.keys()) & set(ratings_v.keys())
    if len(common) < min_overlap:
        return None

    numerator = sum(
        (ratings_u[i] - neutral_rating) * (ratings_v[i] - neutral_rating)
        for i in common
    )
    sum_sq_u = sum((ratings_u[i] - neutral_rating)**2 for i in common)
    sum_sq_v = sum((ratings_v[i] - neutral_rating)**2 for i in common)

    denom = np.sqrt(sum_sq_u) * np.sqrt(sum_sq_v)
    if denom == 0:
        return 0.0
    return float(numerator / denom)


# Example comparison
if __name__ == "__main__":
    # Two users with different rating scales but similar preferences
    user_a = {1: 5, 2: 4, 3: 3, 4: 5, 5: 4}  # Generous rater (mean ~4.2)
    user_b = {1: 3, 2: 2, 3: 1, 4: 3, 5: 2}  # Harsh rater (mean ~2.2)

    print("Comparing similarity measures:")
    print(f"  Pearson:         {pearson_correlation(user_a, user_b):.4f}")
    print(f"  Cosine:          {cosine_similarity(user_a, user_b):.4f}")
    print(f"  Adjusted Cosine: {adjusted_cosine_similarity(user_a, user_b, 4.2, 2.2):.4f}")

    # Despite different absolute ratings, users have identical relative preferences.
    # Pearson and adjusted cosine should show this (high similarity).
    # Raw cosine will be lower because of absolute value differences.
```

When dealing with binary or implicit feedback (did the user buy/click/watch?), set-based similarity measures become appropriate.
Jaccard Similarity:
Measures overlap between two sets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Properties:
Range [0, 1]: 1 only when the sets are identical, 0 when they are disjoint.
Uses only presence or absence of items; rating values are ignored entirely.
Symmetric, and 1 - J(A, B) (the Jaccard distance) is a proper metric.
When to Use Jaccard:
Unary or implicit feedback where only positive signals are observed (purchases, likes, watches).
Binary data where rating magnitude carries no information.
Very sparse data where per-user means and variances cannot be estimated reliably.
Weighted Jaccard:
When items have associated weights (e.g., purchase counts):
$$J_{\text{weighted}}(u, v) = \frac{\sum_i \min(w_{ui}, w_{vi})}{\sum_i \max(w_{ui}, w_{vi})}$$
Dice Coefficient:
A variation that double-counts intersection:
$$\text{Dice}(A, B) = \frac{2 \cdot |A \cap B|}{|A| + |B|}$$
Dice gives slightly higher values than Jaccard for the same sets. Relationship: Dice = 2J / (1 + J).
Simpson/Overlap Coefficient:
Normalizes by the smaller set:
$$\text{Simpson}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$
Useful when set sizes are very different—a small set being a subset of a large one gets maximum similarity.
| Measure | Formula | Set A={1,2,3}, B={2,3,4,5} | Notes |
|---|---|---|---|
| Jaccard | |A∩B| / |A∪B| | 2/5 = 0.40 | Most common choice |
| Dice | 2|A∩B| / (|A|+|B|) | 4/7 = 0.57 | Emphasizes overlap |
| Simpson | |A∩B| / min(|A|,|B|) | 2/3 = 0.67 | Good for subset detection |
| Cosine (binary) | |A∩B| / √(|A|·|B|) | 2/√12 = 0.58 | Geometric mean normalization |
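The table's values can be reproduced in a few lines:

```python
import math

A = {1, 2, 3}
B = {2, 3, 4, 5}

inter = len(A & B)

jaccard = inter / len(A | B)                        # 2/5 = 0.40
dice = 2 * inter / (len(A) + len(B))                # 4/7 ≈ 0.57
simpson = inter / min(len(A), len(B))               # 2/3 ≈ 0.67
binary_cosine = inter / math.sqrt(len(A) * len(B))  # 2/√12 ≈ 0.58

print(jaccard, round(dice, 2), round(simpson, 2), round(binary_cosine, 2))
```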
For implicit feedback (e.g., purchase history), Jaccard between item purchaser sets works well for item-based CF. For user-based, cosine on TF-IDF-weighted purchase vectors often outperforms raw Jaccard by accounting for item popularity.
Beyond the standard measures, several advanced techniques address specific challenges in recommendation data.
1. Significance Weighting:
Reduces similarity estimates when based on few observations:
$$\text{sim}_{\text{weighted}}(u, v) = \frac{\min(|I_{uv}|, \gamma)}{\gamma} \cdot \text{sim}(u, v)$$
With γ=50, similarity based on 10 co-rated items is multiplied by 10/50 = 0.2, heavily discounting unreliable estimates.
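A minimal sketch of this shrinkage (the function name is ours):

```python
def significance_weighted(sim: float, n_common: int, gamma: int = 50) -> float:
    """Shrink a similarity estimate toward 0 when it rests on few co-rated items."""
    return min(n_common, gamma) / gamma * sim

print(significance_weighted(0.9, 10))  # ≈ 0.18 (0.9 scaled by 10/50)
print(significance_weighted(0.9, 80))  # 0.9 (overlap >= gamma leaves it unchanged)
```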
2. Inverse User/Item Frequency:
Weights agreement on rare items more heavily (similar to TF-IDF):
iuf(i) = log(|U| / |U_i|)
Weighted similarity uses (R_ui - r̄_u) · iuf(i) instead of raw deviations
Agreement that users both loved an obscure film is more informative than both liking a blockbuster.
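A sketch of the weight itself; the user counts below are invented for illustration:

```python
import math

def inverse_user_frequency(total_users: int, users_who_rated_item: int) -> float:
    """iuf(i) = log(|U| / |U_i|): the rarer the item, the larger its weight."""
    return math.log(total_users / users_who_rated_item)

# With 1,000,000 users in total:
print(round(inverse_user_frequency(1_000_000, 1_000), 2))    # obscure film:  ≈ 6.91
print(round(inverse_user_frequency(1_000_000, 500_000), 2))  # blockbuster:   ≈ 0.69
```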
3. Mean Squared Difference (MSD):
A distance-based measure often converted to similarity:
$$\text{MSD}(u, v) = \frac{1}{|I_{uv}|} \sum_{i \in I_{uv}} (R_{ui} - R_{vi})^2, \qquad \text{sim}_{\text{MSD}}(u, v) = \frac{1}{1 + \text{MSD}(u, v)}$$
MSD focuses on absolute rating differences rather than correlation. Users who give identical ratings are most similar.
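A small sketch, reusing the ratings from the Pearson example above:

```python
import numpy as np

def msd_similarity(r_u: np.ndarray, r_v: np.ndarray) -> float:
    """sim_MSD = 1 / (1 + mean squared difference) over co-rated items."""
    msd = float(np.mean((r_u - r_v) ** 2))
    return 1.0 / (1.0 + msd)

a = np.array([5, 4, 2, 3])  # User A
b = np.array([3, 2, 2, 3])  # User B

print(msd_similarity(a, a))  # 1.0 (identical ratings are maximally similar)
print(msd_similarity(a, b))  # 0.333... (squared differences 4, 4, 0, 0 give MSD = 2)
```

Note the contrast with Pearson, which rated the same pair at about 0.45: MSD penalizes the constant offset between the two users' ratings, while Pearson ignores it.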
4. Spearman Rank Correlation:
Measures correlation of rankings rather than ratings:
$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$
Where d<sub>i</sub> is the difference in ranks for item i. More robust to outliers and non-linear relationships.
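Using SciPy, which handles tied ranks via average ranks (so its value can differ slightly from the simplified formula above when ties are present):

```python
import numpy as np
from scipy.stats import spearmanr

a = np.array([5, 4, 2, 3])  # User A's ratings
b = np.array([3, 2, 2, 3])  # User B's ratings (contains ties)

rho, _ = spearmanr(a, b)
print(f"Spearman rho = {rho:.3f}")
```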
5. Proximity-Impact-Popularity (PIP):
Combines three factors:
Proximity: how close the two ratings are, with an extra penalty when they fall on opposite sides of the neutral rating.
Impact: how strongly both ratings express a preference, measured by their distance from the neutral value.
Popularity: extra weight when both ratings deviate from the item's average, since agreement on against-the-crowd opinions is more informative.
PIP often outperforms single measures but requires more computation.
Production systems often combine measures. Netflix Prize winners used ensemble methods that weighted predictions from systems using different similarity measures. Each captures different aspects of user/item relationships, and combining them reduces individual weaknesses.
Similarity computation is often the bottleneck in collaborative filtering. Understanding computational costs helps make practical choices.
Pairwise Computation Cost:
For m users and n items:
User-user similarity: O(m²) pairs, each costing up to O(n) per pair, for O(m² · n) overall.
Item-item similarity: O(n²) pairs, each costing up to O(m), for O(n² · m) overall.
Storing a full user-user similarity matrix requires O(m²) memory.
With 1M users and 100K items, naive user-user computation requires on the order of 10¹⁶–10¹⁷ operations—infeasible.
Optimization Strategies:
Inverted indices: for each item, store the users who rated it, so candidate neighbors are found without scanning all users (see the implementation below).
Minimum-overlap pruning: skip pairs with too few co-rated items before computing similarity.
Candidate sampling and truncation: cap the number of candidate neighbors considered per user.
Approximate methods: locality-sensitive hashing or approximate nearest-neighbor search trade exactness for large speedups.
Offline precomputation with incremental updates: compute similarities in batch and refresh only what changed.
```python
from collections import defaultdict
from typing import Dict, Set, List, Tuple
import numpy as np
from scipy.sparse import csr_matrix


class EfficientSimilarityComputer:
    """
    Efficient similarity computation using inverted indices.

    Key optimizations:
    - Inverted index for fast co-rated item/user lookup
    - Early termination for insufficient overlap
    - Sparse matrix operations
    """

    def __init__(self, min_overlap: int = 3):
        self.min_overlap = min_overlap

        # Inverted indices
        self.item_to_users: Dict[int, Set[int]] = defaultdict(set)
        self.user_to_items: Dict[int, Set[int]] = defaultdict(set)

        # Rating data
        self.ratings: Dict[Tuple[int, int], float] = {}
        self.user_means: Dict[int, float] = {}

    def build_indices(self, ratings: List[Tuple[int, int, float]]):
        """Build inverted indices from rating data."""
        user_sums = defaultdict(float)
        user_counts = defaultdict(int)

        for user_id, item_id, rating in ratings:
            self.item_to_users[item_id].add(user_id)
            self.user_to_items[user_id].add(item_id)
            self.ratings[(user_id, item_id)] = rating
            user_sums[user_id] += rating
            user_counts[user_id] += 1

        self.user_means = {
            u: user_sums[u] / user_counts[u] for u in user_counts
        }

    def find_candidate_pairs(
        self,
        target_user: int,
        max_candidates: int = 10000
    ) -> List[int]:
        """
        Find users who share items with target user.

        Uses inverted index for efficient lookup.
        """
        candidate_overlap: Dict[int, int] = defaultdict(int)

        # For each item the target user rated
        for item_id in self.user_to_items[target_user]:
            # Find other users who rated this item
            for other_user in self.item_to_users[item_id]:
                if other_user != target_user:
                    candidate_overlap[other_user] += 1

        # Filter by minimum overlap
        candidates = [
            user_id for user_id, overlap in candidate_overlap.items()
            if overlap >= self.min_overlap
        ]

        # Sort by overlap (highest first) and truncate
        candidates.sort(key=lambda u: candidate_overlap[u], reverse=True)
        return candidates[:max_candidates]

    def pearson_pair(self, user_u: int, user_v: int) -> float:
        """Compute Pearson correlation for a user pair."""
        # Get co-rated items (already filtered by inverted index)
        common = self.user_to_items[user_u] & self.user_to_items[user_v]
        if len(common) < self.min_overlap:
            return 0.0

        mean_u = self.user_means[user_u]
        mean_v = self.user_means[user_v]

        numerator = 0.0
        sum_sq_u = 0.0
        sum_sq_v = 0.0

        for item_id in common:
            r_u = self.ratings[(user_u, item_id)] - mean_u
            r_v = self.ratings[(user_v, item_id)] - mean_v
            numerator += r_u * r_v
            sum_sq_u += r_u ** 2
            sum_sq_v += r_v ** 2

        denom = np.sqrt(sum_sq_u) * np.sqrt(sum_sq_v)
        return 0.0 if denom == 0 else float(numerator / denom)

    def compute_neighbors(
        self,
        target_user: int,
        k: int = 50
    ) -> List[Tuple[int, float]]:
        """
        Compute k-nearest neighbors for a user.

        Returns:
            List of (user_id, similarity) tuples.
        """
        candidates = self.find_candidate_pairs(target_user)

        similarities = [
            (user_id, self.pearson_pair(target_user, user_id))
            for user_id in candidates
        ]

        # Sort by absolute similarity (descending)
        similarities.sort(key=lambda x: abs(x[1]), reverse=True)
        return similarities[:k]


# Complexity analysis
print("Without inverted index:")
print("  Finding co-raters: O(m) scan per user pair")
print("  All pairs: O(m² · n) total")
print()
print("With inverted index:")
print("  Finding candidates: O(|I_u| · avg_item_popularity)")
print("  All neighbors for one user: O(|I_u| · p · n')")
print("  where p = avg popularity, n' = overlap size")
print("  Speedup: Often 100-1000x for sparse data")
```

Selecting the appropriate similarity measure requires understanding your data characteristics and use case requirements. Here's a decision framework:
| Data Type | Recommended Measure | Alternatives | Rationale |
|---|---|---|---|
| Explicit ratings (1-5 stars) | Pearson correlation | Adjusted cosine, constrained Pearson | Handles rating bias naturally |
| Binary feedback (like/dislike) | Cosine similarity | Jaccard | No bias to handle; angle meaningful |
| Implicit feedback (purchases, clicks) | Cosine on TF-IDF | Jaccard, log-weighted cosine | Account for item popularity |
| Item-based CF (explicit) | Adjusted cosine | Pearson (item-centered) | Correct bias handling for items |
| Very sparse data | Jaccard + shrinkage | Cosine with significance weighting | Robust with limited overlap |
| Asymmetric relationships | Conditional probability | Asymmetric cosine | Capture 'if A then B' patterns |
There's no universally best measure. Always hold out a test set and compare measures on your actual data. The 'right' measure depends on data characteristics that are hard to predict theoretically. A 2% improvement in offline metrics can translate to significant business impact at scale.
What's Next:
Similarity measures tell us which users or items to consider as neighbors. The next page covers neighborhood methods—how to select neighbors, size the neighborhood, and aggregate their contributions into predictions. We'll explore threshold-based vs. top-k selection, weighted aggregation schemes, and hybrid approaches.
You now have comprehensive knowledge of similarity measures in collaborative filtering—from foundational mathematics to practical implementation and selection criteria. This knowledge enables you to make informed choices that significantly impact recommendation quality.