All the distance metrics we have studied so far—Euclidean, Manhattan, Minkowski—measure the displacement between two points. They answer the question: "How far apart are these vectors?"
But in many domains, a different question is more meaningful: "How similar are the directions these vectors point?" This is precisely what cosine similarity measures.
Cosine similarity computes the cosine of the angle between two vectors, completely ignoring their magnitudes. Two vectors pointing in the same direction have cosine similarity 1 (identical), orthogonal vectors have similarity 0 (unrelated), and opposite vectors have similarity -1 (maximally dissimilar).
This magnitude-invariance makes cosine similarity indispensable for domains where the "size" of a vector is an artifact rather than a meaningful signal—most notably in text analysis, where document length shouldn't affect similarity assessment.
By mastering this page, you will:

• Derive cosine similarity from the geometric definition of the dot product
• Understand the relationship to Euclidean distance for normalized vectors
• Recognize why magnitude-invariance is crucial for text and sparse data
• Convert between cosine similarity, cosine distance, and angular distance
• Implement efficient cosine similarity using vectorized operations
• Apply cosine similarity correctly in KNN for document retrieval and recommendations
• Understand the limitations and when Euclidean distance is preferable
Recall the geometric definition of the dot product (inner product) between two vectors:
$$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}| |\mathbf{b}| \cos\theta$$
where $\theta$ is the angle between the vectors, and $|\cdot|$ denotes the Euclidean (L²) norm.
Rearranging for the cosine of the angle:
$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}| |\mathbf{b}|}$$
This is the definition of cosine similarity.
For two non-zero vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$:
$$\text{sim}_{\cos}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}|_2 |\mathbf{b}|_2} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}$$
Since $\cos\theta \in [-1, 1]$:
| Cosine Similarity | Angle θ | Interpretation |
|---|---|---|
| 1 | 0° | Identical direction (parallel) |
| 0 | 90° | Orthogonal (no relationship) |
| -1 | 180° | Opposite direction (anti-parallel) |
For non-negative vectors (common in text/count data), cosine similarity is always in $[0, 1]$ since the angle cannot exceed 90°.
```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors.

    sim_cos(a, b) = (a · b) / (||a|| ||b||)

    Parameters:
        a: First vector
        b: Second vector

    Returns:
        Cosine similarity in range [-1, 1]
        Returns 0 if either vector is zero (undefined case)

    Example:
        >>> cosine_similarity(np.array([1, 0]), np.array([1, 0]))
        1.0   # Identical direction
        >>> cosine_similarity(np.array([1, 0]), np.array([0, 1]))
        0.0   # Orthogonal
        >>> cosine_similarity(np.array([1, 0]), np.array([-1, 0]))
        -1.0  # Opposite direction
    """
    # Compute norms
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)

    # Handle zero vectors
    if norm_a == 0 or norm_b == 0:
        return 0.0

    # Compute cosine similarity
    dot_product = np.dot(a, b)
    return dot_product / (norm_a * norm_b)


def cosine_similarity_normalized(a: np.ndarray, b: np.ndarray) -> float:
    """
    Compute cosine similarity for pre-normalized vectors (unit vectors).

    If ||a|| = ||b|| = 1, then cos(a, b) = a · b
    This is much faster when vectors are normalized in advance.
    """
    return np.dot(a, b)


def batch_cosine_similarity(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between a query vector y and all rows of X.

    Parameters:
        X: Matrix of shape (n_samples, n_features)
        y: Query vector of shape (n_features,)

    Returns:
        Array of cosine similarities of shape (n_samples,)
    """
    # Compute norms
    X_norms = np.linalg.norm(X, axis=1)
    y_norm = np.linalg.norm(y)

    # Handle zero query vector
    if y_norm == 0:
        return np.zeros(X.shape[0])

    # Compute dot products (matrix-vector multiplication)
    dot_products = X @ y

    # Normalize (small epsilon guards against zero-norm rows)
    similarities = dot_products / (X_norms * y_norm + 1e-10)

    return similarities


# Demonstration
if __name__ == "__main__":
    # Example vectors
    a = np.array([3, 4])
    b = np.array([6, 8])    # Same direction, different magnitude
    c = np.array([4, -3])   # Orthogonal to a
    d = np.array([-3, -4])  # Opposite to a

    print("Cosine Similarity Examples:")
    print(f"  a = {a}")
    print(f"  b = {b} (same direction, 2x magnitude)")
    print(f"  c = {c} (orthogonal)")
    print(f"  d = {d} (opposite)")
    print()
    print(f"  sim(a, b) = {cosine_similarity(a, b):.4f}")  # 1.0
    print(f"  sim(a, c) = {cosine_similarity(a, c):.4f}")  # 0.0
    print(f"  sim(a, d) = {cosine_similarity(a, d):.4f}")  # -1.0
```

Cosine similarity is a similarity measure, not a distance metric. Higher values mean more similar, not farther apart. This is the opposite convention from distance metrics.

• Similarity: sim(a, a) = 1 (maximum), sim(a, b) → 0 as dissimilarity increases
• Distance: d(a, a) = 0 (minimum), d(a, b) → ∞ as dissimilarity increases

To use cosine with distance-based algorithms like KNN, we must convert to a distance.
There is a beautiful mathematical relationship between cosine similarity and Euclidean distance—but only for unit vectors (vectors with norm 1).
For unit vectors $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ (with $|\hat{\mathbf{a}}| = |\hat{\mathbf{b}}| = 1$):
$$|\hat{\mathbf{a}} - \hat{\mathbf{b}}|_2^2 = |\hat{\mathbf{a}}|^2 - 2\,\hat{\mathbf{a}} \cdot \hat{\mathbf{b}} + |\hat{\mathbf{b}}|^2 = 1 - 2\,\hat{\mathbf{a}} \cdot \hat{\mathbf{b}} + 1 = 2(1 - \hat{\mathbf{a}} \cdot \hat{\mathbf{b}}) = 2(1 - \cos\theta)$$
Therefore:
$$|\hat{\mathbf{a}} - \hat{\mathbf{b}}|_2 = \sqrt{2(1 - \cos\theta)} = \sqrt{2} \cdot \sqrt{1 - \cos\theta}$$
Using the trigonometric identity $1 - \cos\theta = 2\sin^2(\theta/2)$:
$$|\hat{\mathbf{a}} - \hat{\mathbf{b}}|_2 = 2\sin(\theta/2)$$
Since $d = \sqrt{2(1 - \cos\theta)}$ is a monotonically decreasing function of $\cos\theta$, for normalized vectors minimizing Euclidean distance is equivalent to maximizing cosine similarity.

This has a profound practical implication: if you normalize your vectors first, you can use standard Euclidean KNN to get cosine-based neighbors.
```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import NearestNeighbors


def demonstrate_normalization_equivalence():
    """
    Show that Euclidean distance on normalized vectors produces
    the same ranking as cosine similarity.
    """
    # Create sample data
    np.random.seed(42)
    X = np.random.randn(100, 10)  # 100 vectors, 10 dimensions
    query = np.random.randn(10)

    # Method 1: Cosine similarity directly
    norms_X = np.linalg.norm(X, axis=1, keepdims=True)
    norm_query = np.linalg.norm(query)
    cosine_sims = (X @ query) / (norms_X.flatten() * norm_query)
    cosine_ranking = np.argsort(-cosine_sims)  # Descending (most similar first)

    # Method 2: Euclidean distance on normalized vectors
    X_normalized = X / norms_X
    query_normalized = query / norm_query
    euclidean_dists = np.linalg.norm(X_normalized - query_normalized, axis=1)
    euclidean_ranking = np.argsort(euclidean_dists)  # Ascending (closest first)

    # Compare rankings
    print("Demonstrating normalization equivalence:")
    print(f"  Top 5 by cosine similarity:      {cosine_ranking[:5]}")
    print(f"  Top 5 by Euclidean (normalized): {euclidean_ranking[:5]}")
    print(f"  Rankings identical: {np.array_equal(cosine_ranking, euclidean_ranking)}")


def knn_with_cosine():
    """
    Use sklearn's NearestNeighbors with cosine metric.
    """
    # Create sample data
    X = np.random.randn(1000, 50)
    query = np.random.randn(50)

    # Option 1: Use cosine metric directly
    nn_cosine = NearestNeighbors(n_neighbors=5, metric='cosine')
    nn_cosine.fit(X)
    distances1, indices1 = nn_cosine.kneighbors([query])

    # Option 2: Normalize and use Euclidean (equivalent ranking)
    X_norm = normalize(X)
    query_norm = normalize([query])
    nn_euclidean = NearestNeighbors(n_neighbors=5, metric='euclidean')
    nn_euclidean.fit(X_norm)
    distances2, indices2 = nn_euclidean.kneighbors(query_norm)

    print("\nKNN comparison:")
    print(f"  Direct cosine indices:        {indices1[0]}")
    print(f"  Normalized Euclidean indices: {indices2[0]}")
    print(f"  Same neighbors: {np.array_equal(indices1, indices2)}")


if __name__ == "__main__":
    demonstrate_normalization_equivalence()
    knn_with_cosine()
```

In practice, normalizing vectors and using Euclidean distance is often faster and more convenient than computing cosine similarity directly:

1. Normalize once before storing vectors
2. Use standard Euclidean KNN implementations (highly optimized)
3. Benefit from acceleration structures (KD-trees, ball trees) that work with Euclidean distance

The slight overhead of normalization is repaid by faster queries.
To use cosine with algorithms that require a distance metric (like KNN), we must convert similarity to distance. Several conventions exist:
The most common conversion:
$$d_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \text{sim}_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}| |\mathbf{b}|}$$
| Cosine Similarity | Cosine Distance | Interpretation |
|---|---|---|
| 1 | 0 | Identical |
| 0 | 1 | Orthogonal |
| -1 | 2 | Opposite |
Warning: Cosine distance is not a true metric—it violates the triangle inequality!
A proper metric based on the actual angle:
$$d_{\text{angular}}(\mathbf{a}, \mathbf{b}) = \frac{\arccos(\text{sim}_{\cos}(\mathbf{a}, \mathbf{b}))}{\pi}$$
This normalizes the angle to $[0, 1]$, where 0 corresponds to identical direction ($\theta = 0°$) and 1 to opposite direction ($\theta = 180°$).
Advantage: Angular distance satisfies the triangle inequality and is a proper metric.
Disadvantage: Computing arccos is slower than simple subtraction.
| Metric | Formula | Range | True Metric? | Computation |
|---|---|---|---|---|
| Cosine Distance | 1 - cos(θ) | [0, 2] | No | Fast (just 1 - similarity) |
| Angular Distance | arccos(cos(θ)) / π | [0, 1] | Yes | Slower (requires arccos) |
| Euclidean (normalized) | √(2(1-cos(θ))) | [0, 2] | Yes | Moderate |
Cosine distance can violate the triangle inequality. Example:

Let a = (1, 0), b = (1, 1) / √2, c = (0, 1)

• d_cos(a, c) = 1 - 0 = 1
• d_cos(a, b) = 1 - 1/√2 ≈ 0.293
• d_cos(b, c) = 1 - 1/√2 ≈ 0.293

But d_cos(a, c) = 1 > 0.293 + 0.293 = 0.586 = d_cos(a, b) + d_cos(b, c)

This means cosine distance cannot be used with data structures that rely on the triangle inequality (KD-trees, ball trees). Use brute-force KNN or approximate methods like LSH.
```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    Cosine distance: 1 - cosine_similarity
    Range: [0, 2]
    Note: NOT a true metric (violates triangle inequality)
    """
    return 1 - cosine_similarity(a, b)


def angular_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    Angular distance: arccos(cosine_similarity) / π
    Range: [0, 1]
    This IS a true metric (satisfies triangle inequality)
    """
    cos_sim = cosine_similarity(a, b)
    # Clamp to [-1, 1] to avoid arccos domain errors due to float precision
    cos_sim = np.clip(cos_sim, -1, 1)
    return np.arccos(cos_sim) / np.pi


def euclidean_normalized(a: np.ndarray, b: np.ndarray) -> float:
    """
    Euclidean distance after L2 normalization.
    Equivalent to sqrt(2 * (1 - cos_sim)) for unit vectors.
    """
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b)
    return np.linalg.norm(a_norm - b_norm)


def triangle_inequality_example():
    """
    Demonstrate triangle inequality violation for cosine distance.
    """
    a = np.array([1, 0])
    b = np.array([1, 1]) / np.sqrt(2)  # 45 degrees
    c = np.array([0, 1])

    d_ac = cosine_distance(a, c)
    d_ab = cosine_distance(a, b)
    d_bc = cosine_distance(b, c)

    print("Triangle Inequality Test (Cosine Distance):")
    print(f"  a = {a}, b = {b.round(3)}, c = {c}")
    print(f"  d(a,c) = {d_ac:.4f}")
    print(f"  d(a,b) = {d_ab:.4f}")
    print(f"  d(b,c) = {d_bc:.4f}")
    print(f"  d(a,b) + d(b,c) = {d_ab + d_bc:.4f}")
    print(f"  Triangle inequality d(a,c) <= d(a,b) + d(b,c)? {d_ac <= d_ab + d_bc}")

    print("\nTriangle Inequality Test (Angular Distance):")
    d_ac_ang = angular_distance(a, c)
    d_ab_ang = angular_distance(a, b)
    d_bc_ang = angular_distance(b, c)
    print(f"  d(a,c) = {d_ac_ang:.4f}")
    print(f"  d(a,b) + d(b,c) = {d_ab_ang + d_bc_ang:.4f}")
    print(f"  Triangle inequality holds? {d_ac_ang <= d_ab_ang + d_bc_ang + 1e-10}")


if __name__ == "__main__":
    triangle_inequality_example()
```

Cosine similarity's magnitude-invariance makes it the default choice for text analysis and other domains with sparse, high-dimensional data.
Consider comparing documents using term frequency vectors. A document repeating "machine learning" 100 times is saying the same thing as one saying it 10 times—just longer. Suppose Doc A has term counts $(100, 50, 30)$ and Doc B has counts $(10, 5, 3)$ for the same three terms. Their Euclidean distance is large:

Euclidean distance: $\sqrt{(100-10)^2 + (50-5)^2 + (30-3)^2} = \sqrt{8100 + 2025 + 729} \approx 104$
This large distance suggests they're very different, when they're actually about the same topic!
Cosine similarity: Since Doc B is a scalar multiple of Doc A, $\cos(A, B) = 1$ (identical direction).
Cosine similarity correctly identifies them as identical in topic.
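A quick numeric check of this example (a minimal sketch using the counts above):

```python
import numpy as np

doc_a = np.array([100, 50, 30])  # term counts for Doc A
doc_b = np.array([10, 5, 3])     # Doc B: same topic, one tenth the length

euclidean = np.linalg.norm(doc_a - doc_b)
cosine = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

print(f"Euclidean distance: {euclidean:.1f}")  # ≈ 104.2, suggests "very different"
print(f"Cosine similarity:  {cosine:.4f}")     # 1.0, identical direction
```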
The TF-IDF (Term Frequency–Inverse Document Frequency) representation pairs naturally with cosine similarity: term frequency captures how prominent a term is within a document, inverse document frequency downweights terms that appear across many documents, and cosine similarity compares the resulting weighted vectors without penalizing differences in document length.

This combination forms the backbone of classical information retrieval.
Text vectors are extremely sparse (most terms don't appear in most documents). Cosine similarity can be computed efficiently:
$$\text{sim}_{\cos}(\mathbf{a}, \mathbf{b}) = \frac{\sum_{i:\, a_i \neq 0 \text{ and } b_i \neq 0} a_i b_i}{|\mathbf{a}| |\mathbf{b}|}$$
Only non-zero entries in both vectors contribute to the dot product. If documents share few terms, the computation is very fast.
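As a rough illustration of this shortcut, the sketch below stores each document as a dictionary of non-zero weights and accumulates the dot product only over shared keys (the dictionary representation and the `sparse_cosine` helper are illustrative, not part of the lesson's code):

```python
import math

def sparse_cosine(a: dict, b: dict) -> float:
    """Cosine similarity for sparse vectors stored as {index: weight} dicts."""
    # Only indices present in both vectors contribute to the dot product
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    dot = sum(w * large[i] for i, w in small.items() if i in large)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights for two short documents
doc1 = {0: 3.0, 5: 1.0, 42: 2.0}
doc2 = {5: 2.0, 42: 1.0, 99: 4.0}
print(f"sparse cosine: {sparse_cosine(doc1, doc2):.4f}")  # only terms 5 and 42 overlap
```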
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine


def text_similarity_example():
    """
    Demonstrate cosine similarity for document comparison.
    """
    # Sample documents
    documents = [
        "Machine learning is a subset of artificial intelligence.",
        "AI and machine learning are transforming technology.",
        "Python is a popular programming language.",
        "Deep learning uses neural networks for AI tasks.",
    ]

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)

    print("TF-IDF Vocabulary (sample):")
    feature_names = vectorizer.get_feature_names_out()
    print(f"  {feature_names[:10]}...")
    print(f"  Total features: {len(feature_names)}")
    print(f"  Matrix shape: {tfidf_matrix.shape}")
    print(f"  Sparsity: {1 - tfidf_matrix.nnz / np.prod(tfidf_matrix.shape):.1%}")

    # Compute pairwise cosine similarities
    similarity_matrix = sklearn_cosine(tfidf_matrix)

    print("\nCosine Similarity Matrix:")
    print("         Doc0   Doc1   Doc2   Doc3")
    for i, row in enumerate(similarity_matrix):
        print(f"  Doc{i}: {' '.join(f'{x:6.3f}' for x in row)}")

    # Find most similar pair
    np.fill_diagonal(similarity_matrix, -1)  # Ignore self-similarity
    most_similar = np.unravel_index(np.argmax(similarity_matrix),
                                    similarity_matrix.shape)
    print(f"\nMost similar: Doc{most_similar[0]} and Doc{most_similar[1]}")
    print(f"  Doc{most_similar[0]}: {documents[most_similar[0]]}")
    print(f"  Doc{most_similar[1]}: {documents[most_similar[1]]}")


def sparse_efficiency_demo():
    """
    Show efficiency advantage of sparse cosine computation.
    """
    import time
    from scipy.sparse import random as sparse_random

    # Create sparse matrices (like TF-IDF vectors)
    n_docs = 1000
    n_features = 10000
    density = 0.01  # 1% non-zero (typical for text)

    X_sparse = sparse_random(n_docs, n_features, density=density, format='csr')
    X_dense = X_sparse.toarray()

    print(f"\nSparse vs Dense Computation:")
    print(f"  Matrix shape: {X_sparse.shape}")
    print(f"  Density: {density:.1%}")
    print(f"  Non-zero elements: {X_sparse.nnz:,}")
    print(f"  Total elements: {n_docs * n_features:,}")
    print(f"  Memory (sparse): {X_sparse.data.nbytes / 1024:.1f} KB")
    print(f"  Memory (dense): {X_dense.nbytes / 1024:.1f} KB")

    # sklearn automatically handles sparse matrices efficiently
    start = time.time()
    sim_sparse = sklearn_cosine(X_sparse[:100])
    sparse_time = time.time() - start

    start = time.time()
    sim_dense = sklearn_cosine(X_dense[:100])
    dense_time = time.time() - start

    print(f"  Cosine similarity (100 docs):")
    print(f"    Sparse: {sparse_time*1000:.1f} ms")
    print(f"    Dense:  {dense_time*1000:.1f} ms")


if __name__ == "__main__":
    text_similarity_example()
    sparse_efficiency_demo()
```

Cosine similarity is valuable for any domain where magnitude is arbitrary:

• Recommender systems: User-item rating vectors (users rate different numbers of items)
• Image retrieval: Feature vectors from CNNs (batch normalization makes magnitude arbitrary)
• Gene expression: Gene expression profiles across experiments
• Social networks: User feature vectors (activity levels vary)
• Embeddings: Word2Vec, GloVe, and transformer embeddings
Cosine similarity has interesting properties in high-dimensional spaces, some beneficial and some requiring caution.
Recall that Euclidean distance suffers from distance concentration in high dimensions—all points become approximately equidistant. Does cosine similarity have the same problem?
For random vectors: yes. For random high-dimensional vectors with i.i.d. components, cosine similarity concentrates around 0 (near-orthogonality), and its variance shrinks as the dimension grows, so all pairwise similarities cluster tightly around 0 and become less discriminative.
For structured data: Real data often lies on lower-dimensional manifolds, meaning cosine similarity remains meaningful because vectors are not truly random.
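A short simulation (a sketch, not from the lesson's code) shows this concentration for random i.i.d. Gaussian vectors; for real, structured data the spread typically stays much wider:

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in [2, 10, 100, 1000, 10000]:
    # 500 random pairs of i.i.d. Gaussian vectors in each dimension
    A = rng.standard_normal((500, dim))
    B = rng.standard_normal((500, dim))
    sims = np.einsum('ij,ij->i', A, B) / (
        np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
    print(f"dim={dim:6d}  mean={sims.mean():+.4f}  std={sims.std():.4f}")
# The mean stays near 0 and the spread shrinks roughly like 1/sqrt(dim)
```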
An important subtlety: cosine similarity is not invariant to translation. Shifting all vectors by a constant changes their angles!
Example: User-item ratings where 3 is "neutral". Suppose User A rates two items $(5, 1)$ and User B rates them $(1, 5)$, i.e., opposite preferences.

Without centering, cosine similarity is positive: $\frac{5 \cdot 1 + 1 \cdot 5}{\sqrt{26}\,\sqrt{26}} \approx 0.38$, simply because both users have only positive ratings.

With centering (subtract the mean rating of 3), A becomes $(2, -2)$ and B becomes $(-2, 2)$.

Now cosine similarity is $-1$, correctly indicating opposite preferences.
The adjusted cosine similarity or Pearson correlation addresses this by centering vectors before computing cosine.
```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Standard cosine similarity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def adjusted_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Adjusted cosine similarity (centers vectors before computing).

    This is equivalent to Pearson correlation when vectors are
    centered by their means.
    """
    a_centered = a - np.mean(a)
    b_centered = b - np.mean(b)

    norm_a = np.linalg.norm(a_centered)
    norm_b = np.linalg.norm(b_centered)

    if norm_a == 0 or norm_b == 0:
        return 0.0

    return np.dot(a_centered, b_centered) / (norm_a * norm_b)


def pearson_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """
    Pearson correlation coefficient.
    Mathematically equivalent to adjusted cosine similarity.
    """
    return np.corrcoef(a, b)[0, 1]


def demonstrate_centering_effect():
    """
    Show why centering matters for preference vectors.
    """
    # Two users with opposite preferences
    user_a = np.array([5, 4, 1, 2])  # Likes items 1-2
    user_b = np.array([1, 2, 5, 4])  # Likes items 3-4 (opposite)

    # User C with same preferences as A, different scale
    user_c = np.array([5, 5, 3, 3])  # Likes items 1-2, but more moderate

    print("Rating vectors:")
    print(f"  User A: {user_a} (likes items 1-2)")
    print(f"  User B: {user_b} (likes items 3-4, opposite of A)")
    print(f"  User C: {user_c} (similar to A, different scale)")

    print("\nStandard Cosine Similarity:")
    print(f"  sim(A, B) = {cosine_similarity(user_a, user_b):.4f}")  # Positive!
    print(f"  sim(A, C) = {cosine_similarity(user_a, user_c):.4f}")

    print("\nAdjusted Cosine (centered):")
    print(f"  sim(A, B) = {adjusted_cosine_similarity(user_a, user_b):.4f}")  # Negative!
    print(f"  sim(A, C) = {adjusted_cosine_similarity(user_a, user_c):.4f}")

    print("\nPearson Correlation (same as adjusted cosine):")
    print(f"  corr(A, B) = {pearson_correlation(user_a, user_b):.4f}")
    print(f"  corr(A, C) = {pearson_correlation(user_a, user_c):.4f}")

    print("\n→ Adjusted cosine correctly identifies A and B as having")
    print("  opposite preferences!")


if __name__ == "__main__":
    demonstrate_centering_effect()
```

Despite its popularity, cosine similarity is not universally appropriate. Understanding its limitations prevents misapplication.
Cosine similarity ignores magnitude by design. But in many domains, magnitude carries crucial information:
Physical measurements: A vector of [temperature, pressure, vibration] at [100°C, 2 atm, 50 Hz] is very different from [200°C, 4 atm, 100 Hz], even though they point in the same direction. The magnitudes indicate completely different operating conditions.
Financial data: Portfolio returns of [1%, 2%, -1%] vs [10%, 20%, -10%] have the same direction (same relative performance) but vastly different risk/reward profiles.
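For a concrete check of the physical-measurement case (a minimal sketch; the numbers are the ones from the example above):

```python
import numpy as np

reading_1 = np.array([100.0, 2.0, 50.0])   # 100°C, 2 atm, 50 Hz
reading_2 = np.array([200.0, 4.0, 100.0])  # 200°C, 4 atm, 100 Hz: a very different state

cosine = np.dot(reading_1, reading_2) / (
    np.linalg.norm(reading_1) * np.linalg.norm(reading_2))
euclidean = np.linalg.norm(reading_1 - reading_2)

print(f"Cosine similarity:  {cosine:.4f}")     # 1.0, "identical" by direction alone
print(f"Euclidean distance: {euclidean:.1f}")  # ≈ 111.8, the magnitude gap that matters here
```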
When vectors can have negative components, cosine similarity's interpretation becomes complex:
Example: Economic indicators with growth rates (positive/negative). Two countries with cosine similarity -1 aren't opposites in any meaningful sense.
Cosine similarity is undefined for zero vectors:
$$\text{sim}_{\cos}(\mathbf{0}, \mathbf{b}) = \frac{0}{0 \cdot |\mathbf{b}|} = \frac{0}{0}$$
In text analysis, a document with no matching terms has a zero vector, leading to undefined similarity. Implementations typically return 0, but this is a convention, not a mathematical result.
If your preprocessing hasn't normalized features, cosine similarity can miss important scale information:
Using cosine similarity on raw features with meaningful magnitudes (like [age, income, height]) discards the magnitude information. A 20-year-old earning $20K looks identical to a 60-year-old earning $60K if other features scale proportionally!

For such data, Euclidean distance (with standardization) is usually more appropriate.
| Criterion | Use Cosine | Use Euclidean |
|---|---|---|
| Magnitude meaningful? | No | Yes |
| Data type | Text, embeddings, counts | Physical measurements |
| Vector length varies? | Yes (doc length) | No (fixed format) |
| Sparsity | High (text) | Low to moderate |
| Pre-normalized? | Not necessarily | Should standardize |
| Care about scaling? | No | Yes |
Implementing KNN with cosine similarity requires some care due to the similarity-vs-distance distinction and metric properties.
One option, shown earlier, is to L2-normalize all vectors and run standard Euclidean KNN.

Advantages: normalization happens once up front, highly optimized Euclidean implementations can be reused, and tree-based acceleration structures (KD-trees, ball trees) remain available.

The other option is to use cosine distance (1 - cosine_similarity) directly:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')
```

Limitations: because cosine distance violates the triangle inequality, tree-based index structures cannot be used, and scikit-learn falls back to brute-force search.

For very large datasets, use approximate nearest neighbor methods designed for cosine similarity, such as locality-sensitive hashing (LSH).
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer


def compare_knn_approaches():
    """
    Compare different ways to implement cosine-based KNN.
    """
    # Load text data
    categories = ['comp.graphics', 'sci.med', 'rec.sport.baseball']
    newsgroups = fetch_20newsgroups(subset='train', categories=categories)
    # Target indices follow newsgroups.target_names (sorted), not the list above
    label_names = newsgroups.target_names

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
    X = vectorizer.fit_transform(newsgroups.data)
    y = newsgroups.target

    # Approach 1: Normalize + Euclidean
    X_normalized = normalize(X, norm='l2')
    knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
    knn_euclidean.fit(X_normalized, y)

    # Approach 2: Direct cosine metric
    knn_cosine = KNeighborsClassifier(n_neighbors=5, metric='cosine')
    knn_cosine.fit(X, y)

    # Test on new documents
    test_docs = [
        "The graphics card renders 3D images very fast",
        "The patient needs immediate medical attention",
        "The home run won the baseball game"
    ]
    X_test = vectorizer.transform(test_docs)
    X_test_normalized = normalize(X_test, norm='l2')

    # Predictions
    pred_euclidean = knn_euclidean.predict(X_test_normalized)
    pred_cosine = knn_cosine.predict(X_test)

    print("Test Document Predictions:")
    for i, doc in enumerate(test_docs):
        print(f"  '{doc[:50]}...'")
        print(f"    Normalized + Euclidean: {label_names[pred_euclidean[i]]}")
        print(f"    Direct Cosine:          {label_names[pred_cosine[i]]}")

    # Check if results match
    print(f"\nPredictions match: {np.array_equal(pred_euclidean, pred_cosine)}")

    # Check algorithm used
    print(f"\nAlgorithm used:")
    print(f"  Euclidean: {knn_euclidean._fit_method}")  # 'brute' here (sparse input); dense data could use kd_tree/ball_tree
    print(f"  Cosine:    {knn_cosine._fit_method}")     # 'brute' (no tree support for cosine)


def weighted_knn_cosine():
    """
    Demonstrate weighted KNN with cosine similarity.
    """
    # In weighted KNN, closer neighbors have more influence.
    # With cosine, "distance" = 1 - similarity, so smaller distance means higher similarity.
    X = np.random.randn(100, 10)
    X = normalize(X)  # Normalize for cosine
    y = np.random.randint(0, 3, 100)

    # Distance-weighted KNN (inverse distance weighting)
    knn_weighted = KNeighborsClassifier(
        n_neighbors=5,
        metric='cosine',
        weights='distance'  # Weight by inverse distance
    )
    knn_weighted.fit(X, y)

    # Uniform weighting (all neighbors equal)
    knn_uniform = KNeighborsClassifier(
        n_neighbors=5,
        metric='cosine',
        weights='uniform'
    )
    knn_uniform.fit(X, y)

    query = normalize([np.random.randn(10)])[0]

    # Get probabilities
    proba_weighted = knn_weighted.predict_proba([query])[0]
    proba_uniform = knn_uniform.predict_proba([query])[0]

    print("\nWeighted vs Uniform KNN:")
    print(f"  Query prediction probabilities:")
    print(f"    Weighted: {proba_weighted.round(3)}")
    print(f"    Uniform:  {proba_uniform.round(3)}")


if __name__ == "__main__":
    compare_knn_approaches()
    weighted_knn_cosine()
```

Cosine similarity provides a fundamentally different approach to measuring relationships between vectors—one based on orientation rather than displacement. Its magnitude-invariance makes it essential for text analysis and other domains where vector length is arbitrary.
The final page of this module explores custom distance functions—how to design domain-specific metrics that capture the true notion of similarity for your data. We'll cover weighted distances, learned metrics, edit distances for sequences, and techniques for mixed data types.
You now understand cosine similarity as a distinct paradigm from distance-based metrics. You can apply it appropriately to text and embedding data, convert between similarity and distance formulations, and implement it efficiently in KNN. This completes your toolkit of standard metrics, preparing you for custom distance design.