While distance metrics measure how different two objects are, similarity measures quantify how alike they are. This change of perspective is more than semantic—many ML applications naturally frame problems in terms of similarity rather than distance.
Consider document retrieval: you search for 'machine learning tutorials' and expect documents similar to your query. Or recommendation systems: you want items similar to your past preferences. Or clustering: you group similar data points together.
Similarity vs Distance:
| Aspect | Distance | Similarity |
|---|---|---|
| Range | $[0, \infty)$ | Typically $[0, 1]$ or $[-1, 1]$ |
| Same objects | Distance = 0 | Similarity = 1 (max) |
| Opposite objects | Distance = large | Similarity = 0 (or -1) |
| Interpretation | How far apart | How alike |
In many cases, similarity and distance are inverses, but the relationship isn't always simple.
By the end of this page, you will understand cosine similarity (the workhorse of NLP and embedding spaces), Jaccard similarity for sets, Pearson correlation, and other similarity measures. You'll know when to use each and how they relate to distance metrics.
Cosine similarity is arguably the most important similarity measure in machine learning, especially for text and embeddings. It measures the cosine of the angle between two vectors, ignoring their magnitudes.
Definition:
$$\text{sim}_{\cos}(\mathbf{x}, \mathbf{y}) = \cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2} = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2} \sqrt{\sum_{i=1}^n y_i^2}}$$
Range: $[-1, 1]$
For non-negative vectors (common in ML: word counts, TF-IDF, probabilities), the range is $[0, 1]$.
Why Cosine Similarity is Powerful:
Magnitude Invariance: A document that mentions 'machine learning' 10 times is exactly as similar to a query as the same document duplicated 100 times, because only the direction of the vector matters, not its length.
High-Dimensional Robustness: Unlike Euclidean distance, cosine similarity doesn't suffer as severely from the curse of dimensionality because it only cares about the angle.
Sparsity Friendly: In sparse vectors (like bag-of-words), only shared non-zero entries contribute to the dot product, making computation efficient.
Intuitive Interpretation: The angle captures 'alignment'—are these vectors pointing in the same conceptual direction?
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cos_sim(x, y):
    """Compute cosine similarity between two vectors."""
    dot = np.dot(x, y)
    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y)
    if norm_x == 0 or norm_y == 0:
        return 0.0  # Handle zero vectors
    return dot / (norm_x * norm_y)

# Example: Document vectors (word counts or TF-IDF)
doc1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])  # "machine learning python"
doc2 = np.array([3, 0, 0, 8, 0, 0, 0, 1, 0, 0])  # Similar topic
doc3 = np.array([0, 0, 1, 0, 3, 6, 4, 0, 0, 0])  # Different topic

print("Document Similarity Analysis")
print("=" * 40)
print(f"doc1: {doc1}")
print(f"doc2: {doc2}")
print(f"doc3: {doc3}")

print(f"cosine(doc1, doc2) = {cos_sim(doc1, doc2):.4f} (similar topics)")
print(f"cosine(doc1, doc3) = {cos_sim(doc1, doc3):.4f} (different topics)")
print(f"cosine(doc2, doc3) = {cos_sim(doc2, doc3):.4f} (different topics)")

# Magnitude invariance demonstration
doc1_scaled = doc1 * 100  # Same direction, different magnitude
print(f"cosine(doc1, doc1_scaled) = {cos_sim(doc1, doc1_scaled):.4f}")
print("Cosine is 1.0 regardless of scaling!")

# Compare with Euclidean distance
print(f"Euclidean(doc1, doc2) = {np.linalg.norm(doc1 - doc2):.4f}")
print(f"Euclidean(doc1, doc1_scaled) = {np.linalg.norm(doc1 - doc1_scaled):.4f}")
print("Euclidean is sensitive to scaling, cosine is not!")

# Using sklearn for pairwise similarities
docs = np.vstack([doc1, doc2, doc3])
sim_matrix = cosine_similarity(docs)
print("Pairwise cosine similarity matrix:")
print(sim_matrix.round(4))
```

Cosine Distance:
To convert cosine similarity to a distance metric:
$$d_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \text{sim}_{\cos}(\mathbf{x}, \mathbf{y})$$
Note: Cosine distance is NOT a true metric—it violates the triangle inequality in general. However, it's still useful for many algorithms. Angular distance $\theta / \pi$ (where $\theta = \arccos(\text{sim}_{\cos})$) is a true metric.
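To see the triangle-inequality failure concretely, here is a quick check (the three vectors are illustrative choices, not from the text above):

```python
import numpy as np

def cosine_distance(x, y):
    """Cosine distance: 1 minus cosine similarity."""
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Three unit vectors, with z sitting "between" x and y at 45 degrees
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
z = np.array([1.0, 1.0]) / np.sqrt(2)

d_xy = cosine_distance(x, y)   # 1.0 (orthogonal vectors)
d_xz = cosine_distance(x, z)   # ~0.2929
d_zy = cosine_distance(z, y)   # ~0.2929

# A metric would require d(x,y) <= d(x,z) + d(z,y), but here it fails:
print(f"d(x,y) = {d_xy:.4f}")
print(f"d(x,z) + d(z,y) = {d_xz + d_zy:.4f}")
print("Triangle inequality violated:", d_xy > d_xz + d_zy)
```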
For unit-normalized vectors ($\|\mathbf{x}\|_2 = \|\mathbf{y}\|_2 = 1$), there's a beautiful relationship: $\|\mathbf{x}-\mathbf{y}\|_2^2 = 2(1 - \text{sim}_{\cos}(\mathbf{x},\mathbf{y}))$. This is why normalizing to unit vectors makes Euclidean distance equivalent to (a function of) cosine similarity.
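This identity is easy to verify numerically. A minimal check on random unit-normalized vectors (the seed and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=5)
x /= np.linalg.norm(x)  # unit-normalize
y /= np.linalg.norm(y)

cos_sim = np.dot(x, y)            # for unit vectors, cosine similarity is just the dot product
lhs = np.linalg.norm(x - y) ** 2  # squared Euclidean distance
rhs = 2 * (1 - cos_sim)
print(lhs, rhs)  # equal up to floating-point error
```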
The Jaccard similarity (also called Jaccard index or Intersection over Union) measures the overlap between two sets. It's fundamental for comparing binary features, documents as sets of words, or any set-valued data.
Definition:
For two sets $A$ and $B$:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{intersection size}}{\text{union size}}$$
Range: $[0, 1]$
Jaccard Distance:
$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} = \frac{|A \Delta B|}{|A \cup B|}$$
where $A \Delta B$ is the symmetric difference (elements in A or B but not both).
Note: Jaccard distance IS a valid metric—it satisfies all four metric axioms.
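The symmetric-difference form of Jaccard distance can be checked directly with Python sets (the example sets are illustrative):

```python
A = {"a", "b", "c", "d"}
B = {"c", "d", "e"}

# Jaccard distance two ways: 1 - J(A,B), and |A Δ B| / |A ∪ B|
jaccard_dist = 1 - len(A & B) / len(A | B)
sym_diff_form = len(A ^ B) / len(A | B)  # ^ is symmetric difference

print(jaccard_dist, sym_diff_form)  # both 3/5 = 0.6
```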
```python
import numpy as np

def jaccard_similarity(set_a, set_b):
    """Compute Jaccard similarity between two sets."""
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    if union == 0:
        return 1.0  # Both empty sets are identical
    return intersection / union

def jaccard_binary(x, y):
    """Jaccard similarity for binary vectors (treating as sets)."""
    # Positions where either is 1
    union = np.sum((x == 1) | (y == 1))
    # Positions where both are 1
    intersection = np.sum((x == 1) & (y == 1))
    if union == 0:
        return 1.0
    return intersection / union

# Set example: documents as word sets
doc1_words = {"machine", "learning", "python", "tutorial", "data"}
doc2_words = {"machine", "learning", "tutorial", "beginner", "guide"}
doc3_words = {"cooking", "recipe", "italian", "pasta", "sauce"}

print("Jaccard Similarity for Document Sets")
print("=" * 45)
print(f"doc1: {doc1_words}")
print(f"doc2: {doc2_words}")
print(f"doc3: {doc3_words}")

j12 = jaccard_similarity(doc1_words, doc2_words)
j13 = jaccard_similarity(doc1_words, doc3_words)
j23 = jaccard_similarity(doc2_words, doc3_words)

print(f"J(doc1, doc2) = {j12:.4f} (related topics)")
print(f"J(doc1, doc3) = {j13:.4f} (unrelated)")
print(f"J(doc2, doc3) = {j23:.4f} (unrelated)")

# Detailed breakdown
intersection_12 = doc1_words & doc2_words
union_12 = doc1_words | doc2_words
print("Breakdown for (doc1, doc2):")
print(f"  Intersection: {intersection_12}")
print(f"  |Intersection| = {len(intersection_12)}")
print(f"  |Union| = {len(union_12)}")
print(f"  Jaccard = {len(intersection_12)}/{len(union_12)} = {j12:.4f}")

# Binary vector example
print("\n" + "=" * 45)
print("Jaccard for Binary Vectors (One-Hot Features)")
x = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y = np.array([1, 0, 0, 1, 1, 0, 1, 0])
z = np.array([0, 0, 1, 0, 0, 1, 0, 1])

print(f"x = {x}")
print(f"y = {y}")
print(f"z = {z}")
print(f"J(x, y) = {jaccard_binary(x, y):.4f}")
print(f"J(x, z) = {jaccard_binary(x, z):.4f}")
```

For text similarity, Jaccard treats documents as sets (presence/absence of words), while cosine uses vectors (word counts/TF-IDF). Jaccard ignores word frequency; cosine incorporates it. For short texts with few repeated words, Jaccard is often sufficient. For longer documents with varying word frequencies, cosine + TF-IDF typically performs better.
The Pearson correlation coefficient measures the linear relationship between two variables. While cosine similarity measures alignment in a geometric sense, Pearson correlation measures alignment after centering the data.
Definition:
$$\rho(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$
where $\bar{x} = \frac{1}{n}\sum_i x_i$ is the mean.
Equivalently:
$$\rho(\mathbf{x}, \mathbf{y}) = \text{cosine}(\mathbf{x} - \bar{x}\mathbf{1}, \mathbf{y} - \bar{y}\mathbf{1})$$
Pearson correlation IS cosine similarity applied to mean-centered vectors!
Range: $[-1, 1]$
Key Properties:
Mean-Centered: Unlike cosine, Pearson removes each variable's baseline. Two users whose ratings differ by a constant offset (one rates everything high, the other low) but who share the same relative preferences will have correlation 1.
Scale & Translation Invariant: $\rho(\mathbf{x}, \mathbf{y}) = \rho(a\mathbf{x} + b, c\mathbf{y} + d)$ for any constants $a,c > 0$ and any $b, d$.
Only Captures Linear Relationships: Two perfectly dependent variables with nonlinear relationship can have $\rho = 0$.
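The last property is worth seeing directly. Below, $y = x^2$ is perfectly determined by $x$, yet the Pearson correlation is exactly zero because the relationship is symmetric rather than linear (the sample points are illustrative):

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # y is completely determined by x, but nonlinearly

# Pearson = cosine of the mean-centered vectors
xc = x - x.mean()
yc = y - y.mean()
rho = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(rho)  # 0.0: Pearson completely misses the quadratic dependence
```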
| Aspect | Cosine Similarity | Pearson Correlation |
|---|---|---|
| Formula | $\frac{\mathbf{x} \cdot \mathbf{y}}{\Vert\mathbf{x}\Vert \, \Vert\mathbf{y}\Vert}$ | $\frac{(\mathbf{x}-\bar{x}) \cdot (\mathbf{y}-\bar{y})}{\Vert\mathbf{x}-\bar{x}\Vert \, \Vert\mathbf{y}-\bar{y}\Vert}$ |
| Centering | No | Yes (mean-centered) |
| Invariance | Scale only | Scale and translation |
| Zero value for | Orthogonal vectors | No linear correlation |
| Best for | TF-IDF, embeddings | Ratings, preferences |
```python
import numpy as np
from scipy.stats import pearsonr

def pearson_correlation(x, y):
    """Compute Pearson correlation coefficient."""
    x_centered = x - np.mean(x)
    y_centered = y - np.mean(y)
    numerator = np.dot(x_centered, y_centered)
    denominator = np.linalg.norm(x_centered) * np.linalg.norm(y_centered)
    if denominator == 0:
        return 0.0
    return numerator / denominator

def cosine_similarity(x, y):
    """Compute cosine similarity."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Example: User ratings (on scale 1-5)
# Different rating styles, same relative preferences
user1 = np.array([5, 4, 5, 2, 1])      # Enthusiastic rater
user2 = np.array([3, 2.5, 3, 1, 0.5])  # Conservative rater (same pattern, lower scores)
user3 = np.array([1, 2, 1, 4, 5])      # Opposite preferences

print("User Rating Similarity Analysis")
print("=" * 45)
print(f"User 1: {user1} (enthusiastic)")
print(f"User 2: {user2} (conservative, same pattern)")
print(f"User 3: {user3} (opposite preferences)")

# Pearson captures same relative preferences
print("Pearson Correlation:")
print(f"  ρ(user1, user2) = {pearson_correlation(user1, user2):.4f}")
print(f"  ρ(user1, user3) = {pearson_correlation(user1, user3):.4f}")
print(f"  ρ(user2, user3) = {pearson_correlation(user2, user3):.4f}")

# Cross-check against scipy's reference implementation
rho_scipy, _ = pearsonr(user1, user2)
print(f"  scipy pearsonr(user1, user2) = {rho_scipy:.4f}")

# Cosine doesn't account for rating bias
print("Cosine Similarity:")
print(f"  cos(user1, user2) = {cosine_similarity(user1, user2):.4f}")
print(f"  cos(user1, user3) = {cosine_similarity(user1, user3):.4f}")
print(f"  cos(user2, user3) = {cosine_similarity(user2, user3):.4f}")

print("Notice: Pearson ≈ 1.0 for users 1&2 (nearly the same relative pattern),")
print("while cosine < 1.0 due to different absolute magnitudes.")

# Demonstrate Pearson = Cosine on centered data
x_centered = user1 - np.mean(user1)
y_centered = user2 - np.mean(user2)
print("Cosine on centered vectors:")
print(f"  = {cosine_similarity(x_centered, y_centered):.4f}")
print("This equals Pearson correlation!")
```

Use Pearson correlation for: (1) User-based collaborative filtering (users have different rating 'baselines'), (2) Stock returns (centered around different means), (3) Any situation where you care about relative patterns, not absolute values. Use cosine when absolute magnitude matters or when data is already centered/normalized.
Beyond Jaccard, several other set-based similarity measures are useful in different contexts. Each emphasizes different aspects of set overlap.
Dice Coefficient (Sørensen-Dice):
$$\text{Dice}(A, B) = \frac{2|A \cap B|}{|A| + |B|}$$
Dice gives more weight to shared elements relative to set sizes. It equals the F1 score ($2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$) when treating $A$ as the predicted set and $B$ as the ground truth.
Relationship to Jaccard: $\text{Dice} = \frac{2 \cdot \text{Jaccard}}{1 + \text{Jaccard}}$
Dice is always $\geq$ Jaccard for the same sets.
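The Dice-equals-F1 claim can be verified in a few lines. Treating one set as predicted positives and the other as ground truth (the sets below are illustrative):

```python
predicted = {1, 2, 3, 4}  # set A: predicted positives
truth = {3, 4, 5}         # set B: ground-truth positives

tp = len(predicted & truth)           # true positives = |A ∩ B|
precision = tp / len(predicted)       # 2/4
recall = tp / len(truth)              # 2/3
f1 = 2 * precision * recall / (precision + recall)

dice = 2 * tp / (len(predicted) + len(truth))
print(f1, dice)  # both 4/7 ≈ 0.5714
```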
Overlap Coefficient (Szymkiewicz-Simpson):
$$\text{Overlap}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$
This coefficient equals 1 whenever one set is a subset of the other. Useful when you care about whether the smaller set is 'contained' in the larger.
Tversky Index:
$$\text{Tversky}(A, B) = \frac{|A \cap B|}{|A \cap B| + \alpha|A - B| + \beta|B - A|}$$
A generalization that allows asymmetric weighting of the two set differences: $\alpha = \beta = 1$ recovers Jaccard, and $\alpha = \beta = 0.5$ recovers Dice. Choosing $\alpha \neq \beta$ makes one set's unique elements count more than the other's, which is useful when $A$ and $B$ play different roles (e.g., query vs. document).
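The special cases of the Tversky index are quick to confirm (sets chosen for illustration):

```python
def tversky(A, B, alpha, beta):
    """Tversky index with asymmetric weights on the set differences."""
    inter = len(A & B)
    return inter / (inter + alpha * len(A - B) + beta * len(B - A))

A = {"a", "b", "c"}
B = {"b", "c", "d"}

jaccard = len(A & B) / len(A | B)                # 2/4 = 0.5
dice = 2 * len(A & B) / (len(A) + len(B))       # 4/6 ≈ 0.667

print(tversky(A, B, 1.0, 1.0), jaccard)  # alpha = beta = 1   -> Jaccard
print(tversky(A, B, 0.5, 0.5), dice)     # alpha = beta = 0.5 -> Dice
```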
```python
def jaccard(A, B):
    intersection = len(A & B)
    union = len(A | B)
    return intersection / union if union > 0 else 1.0

def dice(A, B):
    intersection = len(A & B)
    total = len(A) + len(B)
    return 2 * intersection / total if total > 0 else 1.0

def overlap(A, B):
    intersection = len(A & B)
    min_size = min(len(A), len(B))
    return intersection / min_size if min_size > 0 else 1.0

def tversky(A, B, alpha=1.0, beta=1.0):
    intersection = len(A & B)
    a_minus_b = len(A - B)
    b_minus_a = len(B - A)
    denominator = intersection + alpha * a_minus_b + beta * b_minus_a
    return intersection / denominator if denominator > 0 else 1.0

# Example sets
A = {"apple", "banana", "cherry", "date", "elderberry"}
B = {"banana", "cherry", "fig", "grape"}
C = {"banana", "cherry"}  # Subset of both A and B

print("Set Similarity Comparisons")
print("=" * 50)
print(f"A = {A}")
print(f"B = {B}")
print(f"C = {C} (subset)")

print(f"{'Measure':<15} {'A vs B':<10} {'A vs C':<10} {'B vs C':<10}")
print("-" * 50)
print(f"{'Jaccard':<15} {jaccard(A, B):<10.4f} {jaccard(A, C):<10.4f} {jaccard(B, C):<10.4f}")
print(f"{'Dice':<15} {dice(A, B):<10.4f} {dice(A, C):<10.4f} {dice(B, C):<10.4f}")
print(f"{'Overlap':<15} {overlap(A, B):<10.4f} {overlap(A, C):<10.4f} {overlap(B, C):<10.4f}")

print("Note: Overlap = 1.0 whenever one set is a subset!")
print("C ⊆ A and C ⊆ B, so Overlap(A,C) = Overlap(B,C) = 1.0")

# Verify Dice-Jaccard relationship
j = jaccard(A, B)
d = dice(A, B)
print("Verifying Dice = 2*Jaccard/(1+Jaccard):")
print(f"  2*{j:.4f}/(1+{j:.4f}) = {2*j/(1+j):.4f} = Dice = {d:.4f} ✓")
```

Kernel functions are a powerful class of similarity measures that implicitly compute inner products in high-dimensional (possibly infinite-dimensional) feature spaces. They're fundamental to Support Vector Machines, Gaussian Processes, and kernel methods generally.
Definition:
A function $k: X \times X \to \mathbb{R}$ is a valid kernel (positive semi-definite kernel) if there exists a feature map $\phi: X \to \mathcal{H}$ to some Hilbert space $\mathcal{H}$ such that:
$$k(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle_{\mathcal{H}}$$
Kernels can be viewed as similarity measures because inner products generalize the notion of 'alignment' between vectors.
Common Kernels:
Linear Kernel: $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y}$
Polynomial Kernel: $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + c)^d$
Radial Basis Function (RBF/Gaussian) Kernel: $$k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2)$$
Sigmoid Kernel: $k(\mathbf{x}, \mathbf{y}) = \tanh(\alpha \mathbf{x}^T\mathbf{y} + c)$
```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

# Sample data points
X = np.array([[1, 2], [3, 4], [5, 6]])

print("Kernel Function Demonstrations")
print("=" * 50)
print(f"Data points:\n{X}")

# Linear kernel = dot product similarity
K_linear = linear_kernel(X)
print("Linear Kernel (X @ X.T):")
print(K_linear)

# Polynomial kernel
K_poly = polynomial_kernel(X, degree=2, coef0=1)
print("Polynomial Kernel (degree=2, c=1):")
print(K_poly.round(2))

# RBF kernel - similarity decays with distance
K_rbf = rbf_kernel(X, gamma=0.1)
print("RBF Kernel (gamma=0.1):")
print(K_rbf.round(4))

# Demonstrate RBF behavior
print("RBF Kernel Behavior:")
x = np.array([[0, 0]])
distances = [0.0, 0.5, 1.0, 2.0, 5.0]
gamma = 0.5
for d in distances:
    y = np.array([[d, 0]])  # Point at distance d
    k = rbf_kernel(x, y, gamma=gamma)[0, 0]
    print(f"  Distance = {d:.1f} → RBF similarity = {k:.4f}")

# Show that similar points have high kernel value
print("Key insight: the RBF kernel is 1.0 for identical points")
print("and approaches 0 for far-apart points.")
print("It's a localized similarity measure!")
```

The power of kernels is that we never need to compute $\phi(\mathbf{x})$ explicitly—we only need the kernel values $k(\mathbf{x}, \mathbf{y})$. This 'kernel trick' allows us to work in infinite-dimensional feature spaces efficiently. SVMs, Gaussian Processes, and kernel PCA all exploit this to achieve nonlinear capabilities with linear algorithm complexity.
In practice, we often need to convert between distance and similarity representations. The relationship isn't unique—multiple valid conversions exist, each with different properties.
Common Conversion Functions:
Given a distance $d \geq 0$, we can define similarity $s$ as:
| Method | Formula | Range | Notes |
|---|---|---|---|
| Subtraction | $s = 1 - d/d_{\max}$ | $[0, 1]$ | Requires bounded distance |
| Reciprocal | $s = 1 / (1 + d)$ | $(0, 1]$ | Always valid; smooth; common choice |
| Exponential (RBF) | $s = e^{-\gamma d^2}$ | $(0, 1]$ | Rapid decay; tunable via γ |
| Gaussian | $s = e^{-d^2 / (2\sigma^2)}$ | $(0, 1]$ | Same as RBF with $\gamma = 1/(2\sigma^2)$ |
| Linear decay | $s = \max(0, 1 - \alpha d)$ | $[0, 1]$ | Cuts off at $d = 1/\alpha$ |
Similarity to Distance:
Given similarity $s \in [0, 1]$ (higher = more similar), common choices are $d = 1 - s$, $d = \sqrt{1 - s}$, or $d = -\ln s$.
Caution: Not all conversions preserve metric properties! Even if $d$ was a metric, $s = 1 - d$ may not yield a similarity that converts back to a metric.
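One conversion worth knowing: for unit vectors, $d = \sqrt{2(1 - s)}$ applied to cosine similarity equals the Euclidean (chord) distance between them, so it inherits Euclidean's metric properties. A small sketch (vectors chosen for illustration):

```python
import numpy as np

def cos_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def chord(u, v):
    """d = sqrt(2 * (1 - s)): chord length between unit vectors."""
    return np.sqrt(2 * (1 - cos_sim(u, v)))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
z = np.array([1.0, 1.0]) / np.sqrt(2)

# Equals Euclidean distance for unit vectors...
print(chord(x, y), np.linalg.norm(x - y))  # both sqrt(2)
# ...and the triangle inequality holds here, unlike for d = 1 - s
print(chord(x, y) <= chord(x, z) + chord(z, y))
```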
```python
import numpy as np

def distance_to_similarity(d, method='exponential', **kwargs):
    """Convert distance to similarity using various methods."""
    if method == 'reciprocal':
        return 1 / (1 + d)
    elif method == 'exponential':
        gamma = kwargs.get('gamma', 1.0)
        return np.exp(-gamma * d**2)
    elif method == 'linear':
        d_max = kwargs.get('d_max', d.max())
        return 1 - d / d_max
    elif method == 'linear_cutoff':
        alpha = kwargs.get('alpha', 1.0)
        return np.maximum(0, 1 - alpha * d)
    else:
        raise ValueError(f"Unknown method: {method}")

# Example distances
distances = np.array([0.0, 0.5, 1.0, 2.0, 5.0, 10.0])

print("Distance to Similarity Conversions")
print("=" * 60)
print(f"Distances: {distances}")

methods = [
    ('reciprocal', {}),
    ('exponential', {'gamma': 0.5}),
    ('exponential', {'gamma': 1.0}),
    ('linear', {'d_max': 10.0}),
    ('linear_cutoff', {'alpha': 0.2}),
]

for method, kwargs in methods:
    sim = distance_to_similarity(distances, method=method, **kwargs)
    param_str = ', '.join(f'{k}={v}' for k, v in kwargs.items())
    print(f"{method}({param_str}):")
    print(f"  {np.round(sim, 4)}")

# Converting cosine distance back
print("Cosine Distance ↔ Cosine Similarity:")
cosine_sims = np.array([1.0, 0.9, 0.5, 0.0, -0.5, -1.0])
cosine_dists = 1 - cosine_sims
print(f"  Similarity: {cosine_sims}")
print(f"  Distance:   {cosine_dists}")  # 1 - sim
print("  Note: cosine 'distance' is not a true metric!")
```

When choosing a conversion: (1) Use exponential/RBF for smooth, tunable decay—ideal for kernel methods. (2) Use reciprocal for a simple, always-valid conversion. (3) Use linear only when distance is bounded and you want uniform sensitivity. (4) Tune parameters (γ, σ) based on the typical scale of distances in your data.
Let's consolidate how similarity measures are used across major ML domains.
| Domain | Common Similarities | Why |
|---|---|---|
| Text/NLP | Cosine on TF-IDF, BM25 | Handles varying document lengths; magnitude invariant |
| Word Embeddings | Cosine similarity | Semantic similarity in direction, not magnitude |
| Recommender Systems | Pearson, Cosine, Jaccard | User/item similarity with different rating scales |
| Image Retrieval | L2 on embeddings, Cosine | CNN features are already normalized in direction |
| Near-Duplicate Detection | Jaccard + MinHash, SimHash | Efficient for large-scale set comparisons |
| Graph/Network Analysis | Jaccard on neighbors, Cosine on node vectors | Structural similarity, community detection |
| Bioinformatics | BLAST scores, Edit distance | Sequence alignment |
| Time Series | DTW, Pearson, Euclidean | Alignment-tolerant or shape-based |
Traditional ML uses fixed similarity measures chosen based on domain knowledge. Modern deep learning often learns the right similarity by training embedding networks with contrastive or triplet losses. The embeddings implicitly define a similarity (typically cosine or Euclidean distance in the embedding space) that's optimized for the task.
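The contrastive/triplet idea can be sketched in a few lines of numpy. This is a minimal illustration of the loss itself, not a training loop; the vectors and margin are arbitrary choices:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: pull the anchor toward the positive, push it from the negative."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to a similar item
    d_neg = np.linalg.norm(anchor - negative)  # distance to a dissimilar item
    return max(0.0, d_pos - d_neg + margin)    # zero once d_neg > d_pos + margin

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])    # embedding that should stay close
negative = np.array([-1.0, 0.5])   # embedding that should stay far

loss = triplet_loss(anchor, positive, negative)
print(loss)  # 0.0: the negative is already margin farther away than the positive
```

Minimizing this loss over many triplets shapes the embedding space so that Euclidean (or cosine) comparisons in it reflect task-specific similarity.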
What's Next:
We've covered vector norms, matrix norms, distance metrics, and similarity measures. The final page of this module explores norm regularization—how L1 and L2 norms on model parameters prevent overfitting, the geometric intuition behind sparsity, and practical guidelines for regularization in machine learning.
You now have a comprehensive understanding of similarity measures, from geometric cosine similarity to set-based Jaccard to statistical correlation. This toolkit enables you to choose appropriate comparisons for any data type and application, whether in classical ML or modern deep learning.