All the distance metrics we have studied so far—Euclidean, Manhattan, Minkowski—measure the displacement between two points. They answer the question: "How far apart are these vectors?"
But in many domains, a different question is more meaningful: "How similar are the directions these vectors point?" This is precisely what cosine similarity measures.
Cosine similarity computes the cosine of the angle between two vectors, completely ignoring their magnitudes. Two vectors pointing in the same direction have cosine similarity 1 (identical), orthogonal vectors have similarity 0 (unrelated), and opposite vectors have similarity -1 (maximally dissimilar).
This magnitude-invariance makes cosine similarity indispensable for domains where the "size" of a vector is an artifact rather than a meaningful signal—most notably in text analysis, where document length shouldn't affect similarity assessment.
By mastering this page, you will:

• Derive cosine similarity from the geometric definition of the dot product
• Understand the relationship to Euclidean distance for normalized vectors
• Recognize why magnitude-invariance is crucial for text and sparse data
• Convert between cosine similarity, cosine distance, and angular distance
• Implement efficient cosine similarity using vectorized operations
• Apply cosine similarity correctly in KNN for document retrieval and recommendations
• Understand the limitations and when Euclidean distance is preferable
Recall the geometric definition of the dot product (inner product) between two vectors:
$$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}| |\mathbf{b}| \cos\theta$$
where $\theta$ is the angle between the vectors, and $|\cdot|$ denotes the Euclidean (L²) norm.
Rearranging for the cosine of the angle:
$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}| |\mathbf{b}|}$$
This is the definition of cosine similarity.
For two non-zero vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$:
$$\text{sim}_{\cos}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}|_2 |\mathbf{b}|_2} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}$$
Since $\cos\theta \in [-1, 1]$:
| Cosine Similarity | Angle θ | Interpretation |
|---|---|---|
| 1 | 0° | Identical direction (parallel) |
| 0 | 90° | Orthogonal (no relationship) |
| -1 | 180° | Opposite direction (anti-parallel) |
For non-negative vectors (common in text/count data), cosine similarity is always in $[0, 1]$ since the angle cannot exceed 90°.
```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors.

    sim_cos(a, b) = (a · b) / (||a|| ||b||)

    Parameters:
        a: First vector
        b: Second vector

    Returns:
        Cosine similarity in range [-1, 1]
        Returns 0 if either vector is zero (undefined case)

    Example:
        >>> cosine_similarity(np.array([1, 0]), np.array([1, 0]))
        1.0   # Identical direction
        >>> cosine_similarity(np.array([1, 0]), np.array([0, 1]))
        0.0   # Orthogonal
        >>> cosine_similarity(np.array([1, 0]), np.array([-1, 0]))
        -1.0  # Opposite direction
    """
    # Compute norms
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)

    # Handle zero vectors
    if norm_a == 0 or norm_b == 0:
        return 0.0

    # Compute cosine similarity
    dot_product = np.dot(a, b)
    return dot_product / (norm_a * norm_b)


def cosine_similarity_normalized(a: np.ndarray, b: np.ndarray) -> float:
    """
    Compute cosine similarity for pre-normalized vectors (unit vectors).

    If ||a|| = ||b|| = 1, then cos(a, b) = a · b
    This is much faster when vectors are normalized in advance.
    """
    return np.dot(a, b)


def batch_cosine_similarity(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between a query vector y and all rows of X.

    Parameters:
        X: Matrix of shape (n_samples, n_features)
        y: Query vector of shape (n_features,)

    Returns:
        Array of cosine similarities of shape (n_samples,)
    """
    # Compute norms
    X_norms = np.linalg.norm(X, axis=1)
    y_norm = np.linalg.norm(y)

    # Handle zero query vector
    if y_norm == 0:
        return np.zeros(X.shape[0])

    # Compute dot products (matrix-vector multiplication)
    dot_products = X @ y

    # Normalize (small epsilon guards against zero-norm rows)
    similarities = dot_products / (X_norms * y_norm + 1e-10)

    return similarities


# Demonstration
if __name__ == "__main__":
    # Example vectors
    a = np.array([3, 4])
    b = np.array([6, 8])    # Same direction, different magnitude
    c = np.array([4, -3])   # Orthogonal to a
    d = np.array([-3, -4])  # Opposite to a

    print("Cosine Similarity Examples:")
    print(f"  a = {a}")
    print(f"  b = {b} (same direction, 2x magnitude)")
    print(f"  c = {c} (orthogonal)")
    print(f"  d = {d} (opposite)")
    print()
    print(f"  sim(a, b) = {cosine_similarity(a, b):.4f}")  # 1.0
    print(f"  sim(a, c) = {cosine_similarity(a, c):.4f}")  # 0.0
    print(f"  sim(a, d) = {cosine_similarity(a, d):.4f}")  # -1.0
```

Cosine similarity is a similarity measure, not a distance metric. Higher values mean more similar, not farther apart. This is the opposite convention from distance metrics.

• Similarity: sim(a, a) = 1 (maximum), sim(a, b) → 0 as dissimilarity increases
• Distance: d(a, a) = 0 (minimum), d(a, b) → ∞ as dissimilarity increases

To use cosine with distance-based algorithms like KNN, we must convert to a distance.
There is a beautiful mathematical relationship between cosine similarity and Euclidean distance—but only for unit vectors (vectors with norm 1).
For unit vectors $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ (with $|\hat{\mathbf{a}}| = |\hat{\mathbf{b}}| = 1$):
$$|\hat{\mathbf{a}} - \hat{\mathbf{b}}|_2^2 = |\hat{\mathbf{a}}|^2 - 2\,\hat{\mathbf{a}} \cdot \hat{\mathbf{b}} + |\hat{\mathbf{b}}|^2 = 1 - 2\,\hat{\mathbf{a}} \cdot \hat{\mathbf{b}} + 1 = 2(1 - \hat{\mathbf{a}} \cdot \hat{\mathbf{b}}) = 2(1 - \cos\theta)$$
Therefore:
$$|\hat{\mathbf{a}} - \hat{\mathbf{b}}|_2 = \sqrt{2(1 - \cos\theta)} = \sqrt{2} \cdot \sqrt{1 - \cos\theta}$$
Using the trigonometric identity $1 - \cos\theta = 2\sin^2(\theta/2)$:
$$|\hat{\mathbf{a}} - \hat{\mathbf{b}}|_2 = 2\sin(\theta/2)$$
Since $d = \sqrt{2(1 - \cos\theta)}$ is a monotonically decreasing function of $\cos\theta$, for normalized vectors minimizing Euclidean distance is equivalent to maximizing cosine similarity.

This has a profound practical implication: if you normalize your vectors first, you can use standard Euclidean KNN to get cosine-based neighbors.
```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import NearestNeighbors


def demonstrate_normalization_equivalence():
    """
    Show that Euclidean distance on normalized vectors produces
    the same ranking as cosine similarity.
    """
    # Create sample data
    np.random.seed(42)
    X = np.random.randn(100, 10)  # 100 vectors, 10 dimensions
    query = np.random.randn(10)

    # Method 1: Cosine similarity directly
    norms_X = np.linalg.norm(X, axis=1, keepdims=True)
    norm_query = np.linalg.norm(query)
    cosine_sims = (X @ query) / (norms_X.flatten() * norm_query)
    cosine_ranking = np.argsort(-cosine_sims)  # Descending (most similar first)

    # Method 2: Euclidean distance on normalized vectors
    X_normalized = X / norms_X
    query_normalized = query / norm_query
    euclidean_dists = np.linalg.norm(X_normalized - query_normalized, axis=1)
    euclidean_ranking = np.argsort(euclidean_dists)  # Ascending (closest first)

    # Compare rankings
    print("Demonstrating normalization equivalence:")
    print(f"  Top 5 by cosine similarity:      {cosine_ranking[:5]}")
    print(f"  Top 5 by Euclidean (normalized): {euclidean_ranking[:5]}")
    print(f"  Rankings identical: {np.array_equal(cosine_ranking, euclidean_ranking)}")


def knn_with_cosine():
    """
    Use sklearn's NearestNeighbors with cosine metric.
    """
    # Create sample data
    X = np.random.randn(1000, 50)
    query = np.random.randn(50)

    # Option 1: Use cosine metric directly
    nn_cosine = NearestNeighbors(n_neighbors=5, metric='cosine')
    nn_cosine.fit(X)
    distances1, indices1 = nn_cosine.kneighbors([query])

    # Option 2: Normalize and use Euclidean (equivalent ranking)
    X_norm = normalize(X)
    query_norm = normalize([query])
    nn_euclidean = NearestNeighbors(n_neighbors=5, metric='euclidean')
    nn_euclidean.fit(X_norm)
    distances2, indices2 = nn_euclidean.kneighbors(query_norm)

    print("\nKNN comparison:")
    print(f"  Direct cosine indices:        {indices1[0]}")
    print(f"  Normalized Euclidean indices: {indices2[0]}")
    print(f"  Same neighbors: {np.array_equal(indices1, indices2)}")


if __name__ == "__main__":
    demonstrate_normalization_equivalence()
    knn_with_cosine()
```

In practice, normalizing vectors and using Euclidean distance is often faster and more convenient than computing cosine similarity directly:

1. Normalize once before storing vectors
2. Use standard Euclidean KNN implementations (highly optimized)
3. Benefit from acceleration structures (KD-trees, ball trees) that work with Euclidean distance

The slight overhead of normalization is repaid by faster queries.
To use cosine with algorithms that require a distance metric (like KNN), we must convert similarity to distance. Several conventions exist:
The most common conversion:
$$d_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \text{sim}_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}| |\mathbf{b}|}$$
| Cosine Similarity | Cosine Distance | Interpretation |
|---|---|---|
| 1 | 0 | Identical |
| 0 | 1 | Orthogonal |
| -1 | 2 | Opposite |
Warning: Cosine distance is not a true metric—it violates the triangle inequality!
A proper metric based on the actual angle:
$$d_{\text{angular}}(\mathbf{a}, \mathbf{b}) = \frac{\arccos(\text{sim}_{\cos}(\mathbf{a}, \mathbf{b}))}{\pi}$$
This normalizes the angle to $[0, 1]$, where 0 corresponds to identical direction ($\theta = 0°$) and 1 to opposite direction ($\theta = 180°$).
Advantage: Angular distance satisfies the triangle inequality and is a proper metric.
Disadvantage: Computing arccos is slower than simple subtraction.
| Metric | Formula | Range | True Metric? | Computation |
|---|---|---|---|---|
| Cosine Distance | 1 - cos(θ) | [0, 2] | No | Fast (just 1 - similarity) |
| Angular Distance | arccos(cos(θ)) / π | [0, 1] | Yes | Slower (requires arccos) |
| Euclidean (normalized) | √(2(1-cos(θ))) | [0, 2] | Yes | Moderate |
Cosine distance can violate the triangle inequality. Example:

Let a = (1, 0), b = (1, 1) / √2, c = (0, 1)

• d_cos(a, c) = 1 - 0 = 1
• d_cos(a, b) = 1 - 1/√2 ≈ 0.293
• d_cos(b, c) = 1 - 1/√2 ≈ 0.293

But d_cos(a, c) = 1 > 0.293 + 0.293 = 0.586 = d_cos(a, b) + d_cos(b, c)

This means cosine distance cannot be used with data structures that rely on the triangle inequality (KD-trees, ball trees). Use brute-force KNN or approximate methods like LSH.
```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    Cosine distance: 1 - cosine_similarity
    Range: [0, 2]
    Note: NOT a true metric (violates triangle inequality)
    """
    return 1 - cosine_similarity(a, b)


def angular_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    Angular distance: arccos(cosine_similarity) / π
    Range: [0, 1]
    This IS a true metric (satisfies triangle inequality)
    """
    cos_sim = cosine_similarity(a, b)
    # Clamp to [-1, 1] to avoid arccos domain errors due to float precision
    cos_sim = np.clip(cos_sim, -1, 1)
    return np.arccos(cos_sim) / np.pi


def euclidean_normalized(a: np.ndarray, b: np.ndarray) -> float:
    """
    Euclidean distance after L2 normalization.
    Equivalent to sqrt(2 * (1 - cos_sim)) for unit vectors.
    """
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b)
    return np.linalg.norm(a_norm - b_norm)


def triangle_inequality_example():
    """
    Demonstrate triangle inequality violation for cosine distance.
    """
    a = np.array([1, 0])
    b = np.array([1, 1]) / np.sqrt(2)  # 45 degrees
    c = np.array([0, 1])

    d_ac = cosine_distance(a, c)
    d_ab = cosine_distance(a, b)
    d_bc = cosine_distance(b, c)

    print("Triangle Inequality Test (Cosine Distance):")
    print(f"  a = {a}, b = {b.round(3)}, c = {c}")
    print(f"  d(a,c) = {d_ac:.4f}")
    print(f"  d(a,b) = {d_ab:.4f}")
    print(f"  d(b,c) = {d_bc:.4f}")
    print(f"  d(a,b) + d(b,c) = {d_ab + d_bc:.4f}")
    print(f"  Triangle inequality d(a,c) <= d(a,b) + d(b,c)? {d_ac <= d_ab + d_bc}")

    print("\nTriangle Inequality Test (Angular Distance):")
    d_ac_ang = angular_distance(a, c)
    d_ab_ang = angular_distance(a, b)
    d_bc_ang = angular_distance(b, c)
    print(f"  d(a,c) = {d_ac_ang:.4f}")
    print(f"  d(a,b) + d(b,c) = {d_ab_ang + d_bc_ang:.4f}")
    print(f"  Triangle inequality holds? {d_ac_ang <= d_ab_ang + d_bc_ang + 1e-10}")


if __name__ == "__main__":
    triangle_inequality_example()
```

Cosine similarity's magnitude-invariance makes it the default choice for text analysis and other domains with sparse, high-dimensional data.
Consider comparing documents using term frequency vectors. A document repeating "machine learning" 100 times is saying the same thing as one saying it 10 times—just longer. Suppose Doc A has term counts $(100, 50, 30)$ and Doc B has counts $(10, 5, 3)$ for the same three terms. Their Euclidean distance is large:

Euclidean distance: $\sqrt{(100-10)^2 + (50-5)^2 + (30-3)^2} = \sqrt{8100 + 2025 + 729} \approx 104$
This large distance suggests they're very different, when they're actually about the same topic!
Cosine similarity: Since Doc B is a scalar multiple of Doc A, $\cos(A, B) = 1$ (identical direction).
Cosine similarity correctly identifies them as identical in topic.
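A quick numeric check of this example (a minimal sketch using the counts above):

```python
import numpy as np

doc_a = np.array([100, 50, 30])  # term counts for Doc A
doc_b = np.array([10, 5, 3])     # Doc B: same topic, one tenth the length

euclidean = np.linalg.norm(doc_a - doc_b)
cosine = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

print(f"Euclidean distance: {euclidean:.1f}")  # ≈ 104.2, suggests "very different"
print(f"Cosine similarity:  {cosine:.4f}")     # 1.0, identical direction
```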
The TF-IDF (Term Frequency–Inverse Document Frequency) representation pairs naturally with cosine similarity: term frequency captures how prominent a term is within a document, inverse document frequency downweights terms that appear across many documents, and cosine similarity compares the resulting weighted vectors without penalizing differences in document length.

This combination forms the backbone of classical information retrieval.
Text vectors are extremely sparse (most terms don't appear in most documents). Cosine similarity can be computed efficiently:
$$\text{sim}_{\cos}(\mathbf{a}, \mathbf{b}) = \frac{\sum_{i:\, a_i \neq 0 \text{ and } b_i \neq 0} a_i b_i}{|\mathbf{a}| |\mathbf{b}|}$$
Only non-zero entries in both vectors contribute to the dot product. If documents share few terms, the computation is very fast.
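As a rough illustration of this shortcut, the sketch below stores each document as a dictionary of non-zero weights and accumulates the dot product only over shared keys (the dictionary representation and the `sparse_cosine` helper are illustrative, not part of the lesson's code):

```python
import math

def sparse_cosine(a: dict, b: dict) -> float:
    """Cosine similarity for sparse vectors stored as {index: weight} dicts."""
    # Only indices present in both vectors contribute to the dot product
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    dot = sum(w * large[i] for i, w in small.items() if i in large)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights for two short documents
doc1 = {0: 3.0, 5: 1.0, 42: 2.0}
doc2 = {5: 2.0, 42: 1.0, 99: 4.0}
print(f"sparse cosine: {sparse_cosine(doc1, doc2):.4f}")  # only terms 5 and 42 overlap
```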
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine


def text_similarity_example():
    """
    Demonstrate cosine similarity for document comparison.
    """
    # Sample documents
    documents = [
        "Machine learning is a subset of artificial intelligence.",
        "AI and machine learning are transforming technology.",
        "Python is a popular programming language.",
        "Deep learning uses neural networks for AI tasks.",
    ]

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)

    print("TF-IDF Vocabulary (sample):")
    feature_names = vectorizer.get_feature_names_out()
    print(f"  {feature_names[:10]}...")
    print(f"  Total features: {len(feature_names)}")
    print(f"  Matrix shape: {tfidf_matrix.shape}")
    print(f"  Sparsity: {1 - tfidf_matrix.nnz / np.prod(tfidf_matrix.shape):.1%}")

    # Compute pairwise cosine similarities
    similarity_matrix = sklearn_cosine(tfidf_matrix)

    print("\nCosine Similarity Matrix:")
    print("         Doc0   Doc1   Doc2   Doc3")
    for i, row in enumerate(similarity_matrix):
        print(f"  Doc{i}: {' '.join(f'{x:6.3f}' for x in row)}")

    # Find most similar pair
    np.fill_diagonal(similarity_matrix, -1)  # Ignore self-similarity
    most_similar = np.unravel_index(np.argmax(similarity_matrix),
                                    similarity_matrix.shape)
    print(f"\nMost similar: Doc{most_similar[0]} and Doc{most_similar[1]}")
    print(f"  Doc{most_similar[0]}: {documents[most_similar[0]]}")
    print(f"  Doc{most_similar[1]}: {documents[most_similar[1]]}")


def sparse_efficiency_demo():
    """
    Show efficiency advantage of sparse cosine computation.
    """
    import time
    from scipy.sparse import random as sparse_random

    # Create sparse matrices (like TF-IDF vectors)
    n_docs = 1000
    n_features = 10000
    density = 0.01  # 1% non-zero (typical for text)

    X_sparse = sparse_random(n_docs, n_features, density=density, format='csr')
    X_dense = X_sparse.toarray()

    print(f"\nSparse vs Dense Computation:")
    print(f"  Matrix shape: {X_sparse.shape}")
    print(f"  Density: {density:.1%}")
    print(f"  Non-zero elements: {X_sparse.nnz:,}")
    print(f"  Total elements: {n_docs * n_features:,}")
    print(f"  Memory (sparse): {X_sparse.data.nbytes / 1024:.1f} KB")
    print(f"  Memory (dense): {X_dense.nbytes / 1024:.1f} KB")

    # sklearn automatically handles sparse matrices efficiently
    start = time.time()
    sim_sparse = sklearn_cosine(X_sparse[:100])
    sparse_time = time.time() - start

    start = time.time()
    sim_dense = sklearn_cosine(X_dense[:100])
    dense_time = time.time() - start

    print(f"  Cosine similarity (100 docs):")
    print(f"    Sparse: {sparse_time*1000:.1f} ms")
    print(f"    Dense:  {dense_time*1000:.1f} ms")


if __name__ == "__main__":
    text_similarity_example()
    sparse_efficiency_demo()
```

Cosine similarity is valuable for any domain where magnitude is arbitrary:

• Recommender systems: User-item rating vectors (users rate different numbers of items)
• Image retrieval: Feature vectors from CNNs (batch normalization makes magnitude arbitrary)
• Gene expression: Gene expression profiles across experiments
• Social networks: User feature vectors (activity levels vary)
• Embeddings: Word2Vec, GloVe, and transformer embeddings
Cosine similarity has interesting properties in high-dimensional spaces, some beneficial and some requiring caution.
Recall that Euclidean distance suffers from distance concentration in high dimensions—all points become approximately equidistant. Does cosine similarity have the same problem?
For random vectors: yes. For random high-dimensional vectors with i.i.d. components, cosine similarity concentrates around 0 (near-orthogonality), and its variance shrinks as the dimension grows, so all pairwise similarities cluster tightly around 0 and become less discriminative.
For structured data: Real data often lies on lower-dimensional manifolds, meaning cosine similarity remains meaningful because vectors are not truly random.
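A short simulation (a sketch, not from the lesson's code) shows this concentration for random i.i.d. Gaussian vectors; for real, structured data the spread typically stays much wider:

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in [2, 10, 100, 1000, 10000]:
    # 500 random pairs of i.i.d. Gaussian vectors in each dimension
    A = rng.standard_normal((500, dim))
    B = rng.standard_normal((500, dim))
    sims = np.einsum('ij,ij->i', A, B) / (
        np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
    print(f"dim={dim:6d}  mean={sims.mean():+.4f}  std={sims.std():.4f}")
# The mean stays near 0 and the spread shrinks roughly like 1/sqrt(dim)
```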
An important subtlety: cosine similarity is not invariant to translation. Shifting all vectors by a constant changes their angles!
Example: User-item ratings where 3 is "neutral". Suppose User A rates two items $(5, 1)$ and User B rates them $(1, 5)$, i.e., opposite preferences.

Without centering, cosine similarity is positive: $\frac{5 \cdot 1 + 1 \cdot 5}{\sqrt{26}\,\sqrt{26}} \approx 0.38$, simply because both users have only positive ratings.

With centering (subtract the mean rating of 3), A becomes $(2, -2)$ and B becomes $(-2, 2)$.

Now cosine similarity is $-1$, correctly indicating opposite preferences.
The adjusted cosine similarity or Pearson correlation addresses this by centering vectors before computing cosine.
```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Standard cosine similarity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def adjusted_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Adjusted cosine similarity (centers vectors before computing).

    This is equivalent to Pearson correlation when vectors are
    centered by their means.
    """
    a_centered = a - np.mean(a)
    b_centered = b - np.mean(b)

    norm_a = np.linalg.norm(a_centered)
    norm_b = np.linalg.norm(b_centered)

    if norm_a == 0 or norm_b == 0:
        return 0.0

    return np.dot(a_centered, b_centered) / (norm_a * norm_b)


def pearson_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """
    Pearson correlation coefficient.
    Mathematically equivalent to adjusted cosine similarity.
    """
    return np.corrcoef(a, b)[0, 1]


def demonstrate_centering_effect():
    """
    Show why centering matters for preference vectors.
    """
    # Two users with opposite preferences
    user_a = np.array([5, 4, 1, 2])  # Likes items 1-2
    user_b = np.array([1, 2, 5, 4])  # Likes items 3-4 (opposite)

    # User C with same preferences as A, different scale
    user_c = np.array([5, 5, 3, 3])  # Likes items 1-2, but more moderate

    print("Rating vectors:")
    print(f"  User A: {user_a} (likes items 1-2)")
    print(f"  User B: {user_b} (likes items 3-4, opposite of A)")
    print(f"  User C: {user_c} (similar to A, different scale)")

    print("\nStandard Cosine Similarity:")
    print(f"  sim(A, B) = {cosine_similarity(user_a, user_b):.4f}")  # Positive!
    print(f"  sim(A, C) = {cosine_similarity(user_a, user_c):.4f}")

    print("\nAdjusted Cosine (centered):")
    print(f"  sim(A, B) = {adjusted_cosine_similarity(user_a, user_b):.4f}")  # Negative!
    print(f"  sim(A, C) = {adjusted_cosine_similarity(user_a, user_c):.4f}")

    print("\nPearson Correlation (same as adjusted cosine):")
    print(f"  corr(A, B) = {pearson_correlation(user_a, user_b):.4f}")
    print(f"  corr(A, C) = {pearson_correlation(user_a, user_c):.4f}")

    print("\n→ Adjusted cosine correctly identifies A and B as having")
    print("  opposite preferences!")


if __name__ == "__main__":
    demonstrate_centering_effect()
```

Despite its popularity, cosine similarity is not universally appropriate. Understanding its limitations prevents misapplication.
Cosine similarity ignores magnitude by design. But in many domains, magnitude carries crucial information:
Physical measurements: A vector of [temperature, pressure, vibration] at [100°C, 2 atm, 50 Hz] is very different from [200°C, 4 atm, 100 Hz], even though they point in the same direction. The magnitudes indicate completely different operating conditions.
Financial data: Portfolio returns of [1%, 2%, -1%] vs [10%, 20%, -10%] have the same direction (same relative performance) but vastly different risk/reward profiles.
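For a concrete check of the physical-measurement case (a minimal sketch; the numbers are the ones from the example above):

```python
import numpy as np

reading_1 = np.array([100.0, 2.0, 50.0])   # 100°C, 2 atm, 50 Hz
reading_2 = np.array([200.0, 4.0, 100.0])  # 200°C, 4 atm, 100 Hz: a very different state

cosine = np.dot(reading_1, reading_2) / (
    np.linalg.norm(reading_1) * np.linalg.norm(reading_2))
euclidean = np.linalg.norm(reading_1 - reading_2)

print(f"Cosine similarity:  {cosine:.4f}")     # 1.0, "identical" by direction alone
print(f"Euclidean distance: {euclidean:.1f}")  # ≈ 111.8, the magnitude gap that matters here
```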
When vectors can have negative components, cosine similarity's interpretation becomes complex:
Example: Economic indicators with growth rates (positive/negative). Two countries with cosine similarity -1 aren't opposites in any meaningful sense.
Cosine similarity is undefined for zero vectors:
$$\text{sim}_{\cos}(\mathbf{0}, \mathbf{b}) = \frac{0}{0 \cdot |\mathbf{b}|} = \frac{0}{0}$$
In text analysis, a document with no matching terms has a zero vector, leading to undefined similarity. Implementations typically return 0, but this is a convention, not a mathematical result.
If your preprocessing hasn't normalized features, cosine similarity can miss important scale information:
Using cosine similarity on raw features with meaningful magnitudes (like [age, income, height]) discards the magnitude information. A 20-year-old earning $20K looks identical to a 60-year-old earning $60K if other features scale proportionally!

For such data, Euclidean distance (with standardization) is usually more appropriate.
| Criterion | Use Cosine | Use Euclidean |
|---|---|---|
| Magnitude meaningful? | No | Yes |
| Data type | Text, embeddings, counts | Physical measurements |
| Vector length varies? | Yes (doc length) | No (fixed format) |
| Sparsity | High (text) | Low to moderate |
| Pre-normalized? | Not necessarily | Should standardize |
| Care about scaling? | No | Yes |
Implementing KNN with cosine similarity requires some care due to the similarity-vs-distance distinction and metric properties.
One option, shown earlier, is to L2-normalize all vectors and run standard Euclidean KNN.

Advantages: normalization happens once up front, highly optimized Euclidean implementations can be reused, and tree-based acceleration structures (KD-trees, ball trees) remain available.

The other option is to use cosine distance (1 - cosine_similarity) directly:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')
```

Limitations: because cosine distance violates the triangle inequality, tree-based index structures cannot be used, and scikit-learn falls back to brute-force search.

For very large datasets, use approximate nearest neighbor methods designed for cosine similarity, such as locality-sensitive hashing (LSH).
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer


def compare_knn_approaches():
    """
    Compare different ways to implement cosine-based KNN.
    """
    # Load text data
    categories = ['comp.graphics', 'sci.med', 'rec.sport.baseball']
    newsgroups = fetch_20newsgroups(subset='train', categories=categories)
    # Target indices follow newsgroups.target_names (sorted), not the list above
    label_names = newsgroups.target_names

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
    X = vectorizer.fit_transform(newsgroups.data)
    y = newsgroups.target

    # Approach 1: Normalize + Euclidean
    X_normalized = normalize(X, norm='l2')
    knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
    knn_euclidean.fit(X_normalized, y)

    # Approach 2: Direct cosine metric
    knn_cosine = KNeighborsClassifier(n_neighbors=5, metric='cosine')
    knn_cosine.fit(X, y)

    # Test on new documents
    test_docs = [
        "The graphics card renders 3D images very fast",
        "The patient needs immediate medical attention",
        "The home run won the baseball game"
    ]
    X_test = vectorizer.transform(test_docs)
    X_test_normalized = normalize(X_test, norm='l2')

    # Predictions
    pred_euclidean = knn_euclidean.predict(X_test_normalized)
    pred_cosine = knn_cosine.predict(X_test)

    print("Test Document Predictions:")
    for i, doc in enumerate(test_docs):
        print(f"  '{doc[:50]}...'")
        print(f"    Normalized + Euclidean: {label_names[pred_euclidean[i]]}")
        print(f"    Direct Cosine:          {label_names[pred_cosine[i]]}")

    # Check if results match
    print(f"\nPredictions match: {np.array_equal(pred_euclidean, pred_cosine)}")

    # Check algorithm used
    print(f"\nAlgorithm used:")
    print(f"  Euclidean: {knn_euclidean._fit_method}")  # 'brute' here (sparse input); dense data could use kd_tree/ball_tree
    print(f"  Cosine:    {knn_cosine._fit_method}")     # 'brute' (no tree support for cosine)


def weighted_knn_cosine():
    """
    Demonstrate weighted KNN with cosine similarity.
    """
    # In weighted KNN, closer neighbors have more influence.
    # With cosine, "distance" = 1 - similarity, so smaller distance means higher similarity.
    X = np.random.randn(100, 10)
    X = normalize(X)  # Normalize for cosine
    y = np.random.randint(0, 3, 100)

    # Distance-weighted KNN (inverse distance weighting)
    knn_weighted = KNeighborsClassifier(
        n_neighbors=5,
        metric='cosine',
        weights='distance'  # Weight by inverse distance
    )
    knn_weighted.fit(X, y)

    # Uniform weighting (all neighbors equal)
    knn_uniform = KNeighborsClassifier(
        n_neighbors=5,
        metric='cosine',
        weights='uniform'
    )
    knn_uniform.fit(X, y)

    query = normalize([np.random.randn(10)])[0]

    # Get probabilities
    proba_weighted = knn_weighted.predict_proba([query])[0]
    proba_uniform = knn_uniform.predict_proba([query])[0]

    print("\nWeighted vs Uniform KNN:")
    print(f"  Query prediction probabilities:")
    print(f"    Weighted: {proba_weighted.round(3)}")
    print(f"    Uniform:  {proba_uniform.round(3)}")


if __name__ == "__main__":
    compare_knn_approaches()
    weighted_knn_cosine()
```

Cosine similarity provides a fundamentally different approach to measuring relationships between vectors—one based on orientation rather than displacement. Its magnitude-invariance makes it essential for text analysis and other domains where vector length is arbitrary.
The final page of this module explores custom distance functions—how to design domain-specific metrics that capture the true notion of similarity for your data. We'll cover weighted distances, learned metrics, edit distances for sequences, and techniques for mixed data types.
You now understand cosine similarity as a distinct paradigm from distance-based metrics. You can apply it appropriately to text and embedding data, convert between similarity and distance formulations, and implement it efficiently in KNN. This completes your toolkit of standard metrics, preparing you for custom distance design.