While distance metrics measure how different two objects are, similarity measures quantify how alike they are. This change of perspective is more than semantic—many ML applications naturally frame problems in terms of similarity rather than distance.
Consider document retrieval: you search for 'machine learning tutorials' and expect documents similar to your query. Or recommendation systems: you want items similar to your past preferences. Or clustering: you group similar data points together.
Similarity vs Distance:
| Aspect | Distance | Similarity |
|---|---|---|
| Range | $[0, \infty)$ | Typically $[0, 1]$ or $[-1, 1]$ |
| Same objects | Distance = 0 | Similarity = 1 (max) |
| Opposite objects | Distance = large | Similarity = 0 (or -1) |
| Interpretation | How far apart | How alike |
In many cases, similarity and distance are inverses, but the relationship isn't always simple.
By the end of this page, you will understand cosine similarity (the workhorse of NLP and embedding spaces), Jaccard similarity for sets, Pearson correlation, and other similarity measures. You'll know when to use each and how they relate to distance metrics.
Cosine similarity is arguably the most important similarity measure in machine learning, especially for text and embeddings. It measures the cosine of the angle between two vectors, ignoring their magnitudes.
Definition:
$$\text{sim}_{\cos}(\mathbf{x}, \mathbf{y}) = \cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2} = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2} \sqrt{\sum_{i=1}^n y_i^2}}$$
Range: $[-1, 1]$
For non-negative vectors (common in ML: word counts, TF-IDF, probabilities), the range is $[0, 1]$.
Why Cosine Similarity is Powerful:
Magnitude Invariance: A document that mentions 'machine learning' 10 times is exactly as similar to a query as the same document duplicated 100 times, because only the direction of the vector matters, not its length.
High-Dimensional Robustness: Unlike Euclidean distance, cosine similarity doesn't suffer as severely from the curse of dimensionality because it only cares about the angle.
Sparsity Friendly: In sparse vectors (like bag-of-words), only shared non-zero entries contribute to the dot product, making computation efficient.
Intuitive Interpretation: The angle captures 'alignment'—are these vectors pointing in the same conceptual direction?
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cos_sim(x, y):
    """Compute cosine similarity between two vectors."""
    dot = np.dot(x, y)
    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y)
    if norm_x == 0 or norm_y == 0:
        return 0.0  # Handle zero vectors
    return dot / (norm_x * norm_y)

# Example: Document vectors (word counts or TF-IDF)
doc1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])  # "machine learning python"
doc2 = np.array([3, 0, 0, 8, 0, 0, 0, 1, 0, 0])  # Similar topic
doc3 = np.array([0, 0, 1, 0, 3, 6, 4, 0, 0, 0])  # Different topic

print("Document Similarity Analysis")
print("=" * 40)
print(f"doc1: {doc1}")
print(f"doc2: {doc2}")
print(f"doc3: {doc3}")

print(f"cosine(doc1, doc2) = {cos_sim(doc1, doc2):.4f} (similar topics)")
print(f"cosine(doc1, doc3) = {cos_sim(doc1, doc3):.4f} (different topics)")
print(f"cosine(doc2, doc3) = {cos_sim(doc2, doc3):.4f} (different topics)")

# Magnitude invariance demonstration
doc1_scaled = doc1 * 100  # Same direction, different magnitude
print(f"cosine(doc1, doc1_scaled) = {cos_sim(doc1, doc1_scaled):.4f}")
print("Cosine is 1.0 regardless of scaling!")

# Compare with Euclidean distance
print(f"Euclidean(doc1, doc2) = {np.linalg.norm(doc1 - doc2):.4f}")
print(f"Euclidean(doc1, doc1_scaled) = {np.linalg.norm(doc1 - doc1_scaled):.4f}")
print("Euclidean is sensitive to scaling, cosine is not!")

# Using sklearn for pairwise similarities
docs = np.vstack([doc1, doc2, doc3])
sim_matrix = cosine_similarity(docs)
print("Pairwise cosine similarity matrix:")
print(sim_matrix.round(4))
```

Cosine Distance:
To convert cosine similarity to a distance metric:
$$d_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \text{sim}_{\cos}(\mathbf{x}, \mathbf{y})$$
Note: Cosine distance is NOT a true metric—it violates the triangle inequality in general. However, it's still useful for many algorithms. Angular distance $\theta / \pi$ (where $\theta = \arccos(\text{sim}_{\cos})$) is a true metric.
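To see the triangle-inequality failure concretely, here is a quick check (the three vectors are illustrative choices, not from the text above):

```python
import numpy as np

def cosine_distance(x, y):
    """Cosine distance: 1 minus cosine similarity."""
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Three unit vectors, with z sitting "between" x and y at 45 degrees
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
z = np.array([1.0, 1.0]) / np.sqrt(2)

d_xy = cosine_distance(x, y)   # 1.0 (orthogonal vectors)
d_xz = cosine_distance(x, z)   # ~0.2929
d_zy = cosine_distance(z, y)   # ~0.2929

# A metric would require d(x,y) <= d(x,z) + d(z,y), but here it fails:
print(f"d(x,y) = {d_xy:.4f}")
print(f"d(x,z) + d(z,y) = {d_xz + d_zy:.4f}")
print("Triangle inequality violated:", d_xy > d_xz + d_zy)
```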
For unit-normalized vectors ($\|\mathbf{x}\|_2 = \|\mathbf{y}\|_2 = 1$), there's a beautiful relationship: $\|\mathbf{x}-\mathbf{y}\|_2^2 = 2(1 - \text{sim}_{\cos}(\mathbf{x},\mathbf{y}))$. This is why normalizing to unit vectors makes Euclidean distance equivalent to (a function of) cosine similarity.
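This identity is easy to verify numerically. A minimal check on random unit-normalized vectors (the seed and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=5)
x /= np.linalg.norm(x)  # unit-normalize
y /= np.linalg.norm(y)

cos_sim = np.dot(x, y)            # for unit vectors, cosine similarity is just the dot product
lhs = np.linalg.norm(x - y) ** 2  # squared Euclidean distance
rhs = 2 * (1 - cos_sim)
print(lhs, rhs)  # equal up to floating-point error
```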
The Jaccard similarity (also called Jaccard index or Intersection over Union) measures the overlap between two sets. It's fundamental for comparing binary features, documents as sets of words, or any set-valued data.
Definition:
For two sets $A$ and $B$:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{intersection size}}{\text{union size}}$$
Range: $[0, 1]$
Jaccard Distance:
$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} = \frac{|A \Delta B|}{|A \cup B|}$$
where $A \Delta B$ is the symmetric difference (elements in A or B but not both).
Note: Jaccard distance IS a valid metric—it satisfies all four metric axioms.
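The symmetric-difference form of Jaccard distance can be checked directly with Python sets (the example sets are illustrative):

```python
A = {"a", "b", "c", "d"}
B = {"c", "d", "e"}

# Jaccard distance two ways: 1 - J(A,B), and |A Δ B| / |A ∪ B|
jaccard_dist = 1 - len(A & B) / len(A | B)
sym_diff_form = len(A ^ B) / len(A | B)  # ^ is symmetric difference

print(jaccard_dist, sym_diff_form)  # both 3/5 = 0.6
```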
```python
import numpy as np

def jaccard_similarity(set_a, set_b):
    """Compute Jaccard similarity between two sets."""
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    if union == 0:
        return 1.0  # Both empty sets are identical
    return intersection / union

def jaccard_binary(x, y):
    """Jaccard similarity for binary vectors (treating as sets)."""
    # Positions where either is 1
    union = np.sum((x == 1) | (y == 1))
    # Positions where both are 1
    intersection = np.sum((x == 1) & (y == 1))
    if union == 0:
        return 1.0
    return intersection / union

# Set example: documents as word sets
doc1_words = {"machine", "learning", "python", "tutorial", "data"}
doc2_words = {"machine", "learning", "tutorial", "beginner", "guide"}
doc3_words = {"cooking", "recipe", "italian", "pasta", "sauce"}

print("Jaccard Similarity for Document Sets")
print("=" * 45)
print(f"doc1: {doc1_words}")
print(f"doc2: {doc2_words}")
print(f"doc3: {doc3_words}")

j12 = jaccard_similarity(doc1_words, doc2_words)
j13 = jaccard_similarity(doc1_words, doc3_words)
j23 = jaccard_similarity(doc2_words, doc3_words)

print(f"J(doc1, doc2) = {j12:.4f} (related topics)")
print(f"J(doc1, doc3) = {j13:.4f} (unrelated)")
print(f"J(doc2, doc3) = {j23:.4f} (unrelated)")

# Detailed breakdown
intersection_12 = doc1_words & doc2_words
union_12 = doc1_words | doc2_words
print("Breakdown for (doc1, doc2):")
print(f"  Intersection: {intersection_12}")
print(f"  |Intersection| = {len(intersection_12)}")
print(f"  |Union| = {len(union_12)}")
print(f"  Jaccard = {len(intersection_12)}/{len(union_12)} = {j12:.4f}")

# Binary vector example
print("\n" + "=" * 45)
print("Jaccard for Binary Vectors (One-Hot Features)")
x = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y = np.array([1, 0, 0, 1, 1, 0, 1, 0])
z = np.array([0, 0, 1, 0, 0, 1, 0, 1])

print(f"x = {x}")
print(f"y = {y}")
print(f"z = {z}")
print(f"J(x, y) = {jaccard_binary(x, y):.4f}")
print(f"J(x, z) = {jaccard_binary(x, z):.4f}")
```

For text similarity, Jaccard treats documents as sets (presence/absence of words), while cosine uses vectors (word counts/TF-IDF). Jaccard ignores word frequency; cosine incorporates it. For short texts with few repeated words, Jaccard is often sufficient. For longer documents with varying word frequencies, cosine + TF-IDF typically performs better.
The Pearson correlation coefficient measures the linear relationship between two variables. While cosine similarity measures alignment in a geometric sense, Pearson correlation measures alignment after centering the data.
Definition:
$$\rho(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$
where $\bar{x} = \frac{1}{n}\sum_i x_i$ is the mean.
Equivalently:
$$\rho(\mathbf{x}, \mathbf{y}) = \text{cosine}(\mathbf{x} - \bar{x}\mathbf{1}, \mathbf{y} - \bar{y}\mathbf{1})$$
Pearson correlation IS cosine similarity applied to mean-centered vectors!
Range: $[-1, 1]$
Key Properties:
Mean-Centered: Unlike cosine, Pearson removes each variable's baseline. Two users whose ratings differ by a constant offset (one rates everything high, the other low) but who share the same relative preferences will have correlation 1.
Scale & Translation Invariant: $\rho(\mathbf{x}, \mathbf{y}) = \rho(a\mathbf{x} + b, c\mathbf{y} + d)$ for any constants $a,c > 0$ and any $b, d$.
Only Captures Linear Relationships: Two perfectly dependent variables with nonlinear relationship can have $\rho = 0$.
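The last property is worth seeing directly. Below, $y = x^2$ is perfectly determined by $x$, yet the Pearson correlation is exactly zero because the relationship is symmetric rather than linear (the sample points are illustrative):

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # y is completely determined by x, but nonlinearly

# Pearson = cosine of the mean-centered vectors
xc = x - x.mean()
yc = y - y.mean()
rho = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(rho)  # 0.0: Pearson completely misses the quadratic dependence
```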
| Aspect | Cosine Similarity | Pearson Correlation |
|---|---|---|
| Formula | $\frac{\mathbf{x} \cdot \mathbf{y}}{\Vert\mathbf{x}\Vert \, \Vert\mathbf{y}\Vert}$ | $\frac{(\mathbf{x}-\bar{x}) \cdot (\mathbf{y}-\bar{y})}{\Vert\mathbf{x}-\bar{x}\Vert \, \Vert\mathbf{y}-\bar{y}\Vert}$ |
| Centering | No | Yes (mean-centered) |
| Invariance | Scale only | Scale and translation |
| Zero value for | Orthogonal vectors | No linear correlation |
| Best for | TF-IDF, embeddings | Ratings, preferences |
```python
import numpy as np
from scipy.stats import pearsonr

def pearson_correlation(x, y):
    """Compute Pearson correlation coefficient."""
    x_centered = x - np.mean(x)
    y_centered = y - np.mean(y)
    numerator = np.dot(x_centered, y_centered)
    denominator = np.linalg.norm(x_centered) * np.linalg.norm(y_centered)
    if denominator == 0:
        return 0.0
    return numerator / denominator

def cosine_similarity(x, y):
    """Compute cosine similarity."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Example: User ratings (on scale 1-5)
# Different rating styles, same relative preferences
user1 = np.array([5, 4, 5, 2, 1])      # Enthusiastic rater
user2 = np.array([3, 2.5, 3, 1, 0.5])  # Conservative rater (same pattern, lower scores)
user3 = np.array([1, 2, 1, 4, 5])      # Opposite preferences

print("User Rating Similarity Analysis")
print("=" * 45)
print(f"User 1: {user1} (enthusiastic)")
print(f"User 2: {user2} (conservative, same pattern)")
print(f"User 3: {user3} (opposite preferences)")

# Pearson captures same relative preferences
print("Pearson Correlation:")
print(f"  ρ(user1, user2) = {pearson_correlation(user1, user2):.4f}")
print(f"  ρ(user1, user3) = {pearson_correlation(user1, user3):.4f}")
print(f"  ρ(user2, user3) = {pearson_correlation(user2, user3):.4f}")

# Cross-check against scipy's reference implementation
rho_scipy, _ = pearsonr(user1, user2)
print(f"  scipy pearsonr(user1, user2) = {rho_scipy:.4f}")

# Cosine doesn't account for rating bias
print("Cosine Similarity:")
print(f"  cos(user1, user2) = {cosine_similarity(user1, user2):.4f}")
print(f"  cos(user1, user3) = {cosine_similarity(user1, user3):.4f}")
print(f"  cos(user2, user3) = {cosine_similarity(user2, user3):.4f}")

print("Notice: Pearson ≈ 1.0 for users 1&2 (nearly the same relative pattern),")
print("while cosine < 1.0 due to different absolute magnitudes.")

# Demonstrate Pearson = Cosine on centered data
x_centered = user1 - np.mean(user1)
y_centered = user2 - np.mean(user2)
print("Cosine on centered vectors:")
print(f"  = {cosine_similarity(x_centered, y_centered):.4f}")
print("This equals Pearson correlation!")
```

Use Pearson correlation for: (1) User-based collaborative filtering (users have different rating 'baselines'), (2) Stock returns (centered around different means), (3) Any situation where you care about relative patterns, not absolute values. Use cosine when absolute magnitude matters or when data is already centered/normalized.
Beyond Jaccard, several other set-based similarity measures are useful in different contexts. Each emphasizes different aspects of set overlap.
Dice Coefficient (Sørensen-Dice):
$$\text{Dice}(A, B) = \frac{2|A \cap B|}{|A| + |B|}$$
Dice gives more weight to shared elements relative to set sizes. It equals the F1 score ($2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$) when treating $A$ as the predicted set and $B$ as the ground truth.
Relationship to Jaccard: $\text{Dice} = \frac{2 \cdot \text{Jaccard}}{1 + \text{Jaccard}}$
Dice is always $\geq$ Jaccard for the same sets.
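The Dice-equals-F1 claim can be verified in a few lines. Treating one set as predicted positives and the other as ground truth (the sets below are illustrative):

```python
predicted = {1, 2, 3, 4}  # set A: predicted positives
truth = {3, 4, 5}         # set B: ground-truth positives

tp = len(predicted & truth)           # true positives = |A ∩ B|
precision = tp / len(predicted)       # 2/4
recall = tp / len(truth)              # 2/3
f1 = 2 * precision * recall / (precision + recall)

dice = 2 * tp / (len(predicted) + len(truth))
print(f1, dice)  # both 4/7 ≈ 0.5714
```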
Overlap Coefficient (Szymkiewicz-Simpson):
$$\text{Overlap}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$
This coefficient equals 1 whenever one set is a subset of the other. Useful when you care about whether the smaller set is 'contained' in the larger.
Tversky Index:
$$\text{Tversky}(A, B) = \frac{|A \cap B|}{|A \cap B| + \alpha|A - B| + \beta|B - A|}$$
A generalization that allows asymmetric weighting of the two set differences: $\alpha = \beta = 1$ recovers Jaccard, and $\alpha = \beta = 0.5$ recovers Dice. Choosing $\alpha \neq \beta$ makes one set's unique elements count more than the other's, which is useful when $A$ and $B$ play different roles (e.g., query vs. document).
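The special cases of the Tversky index are quick to confirm (sets chosen for illustration):

```python
def tversky(A, B, alpha, beta):
    """Tversky index with asymmetric weights on the set differences."""
    inter = len(A & B)
    return inter / (inter + alpha * len(A - B) + beta * len(B - A))

A = {"a", "b", "c"}
B = {"b", "c", "d"}

jaccard = len(A & B) / len(A | B)                # 2/4 = 0.5
dice = 2 * len(A & B) / (len(A) + len(B))       # 4/6 ≈ 0.667

print(tversky(A, B, 1.0, 1.0), jaccard)  # alpha = beta = 1   -> Jaccard
print(tversky(A, B, 0.5, 0.5), dice)     # alpha = beta = 0.5 -> Dice
```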
```python
def jaccard(A, B):
    intersection = len(A & B)
    union = len(A | B)
    return intersection / union if union > 0 else 1.0

def dice(A, B):
    intersection = len(A & B)
    total = len(A) + len(B)
    return 2 * intersection / total if total > 0 else 1.0

def overlap(A, B):
    intersection = len(A & B)
    min_size = min(len(A), len(B))
    return intersection / min_size if min_size > 0 else 1.0

def tversky(A, B, alpha=1.0, beta=1.0):
    intersection = len(A & B)
    a_minus_b = len(A - B)
    b_minus_a = len(B - A)
    denominator = intersection + alpha * a_minus_b + beta * b_minus_a
    return intersection / denominator if denominator > 0 else 1.0

# Example sets
A = {"apple", "banana", "cherry", "date", "elderberry"}
B = {"banana", "cherry", "fig", "grape"}
C = {"banana", "cherry"}  # Subset of both A and B

print("Set Similarity Comparisons")
print("=" * 50)
print(f"A = {A}")
print(f"B = {B}")
print(f"C = {C} (subset)")

print(f"{'Measure':<15} {'A vs B':<10} {'A vs C':<10} {'B vs C':<10}")
print("-" * 50)
print(f"{'Jaccard':<15} {jaccard(A, B):<10.4f} {jaccard(A, C):<10.4f} {jaccard(B, C):<10.4f}")
print(f"{'Dice':<15} {dice(A, B):<10.4f} {dice(A, C):<10.4f} {dice(B, C):<10.4f}")
print(f"{'Overlap':<15} {overlap(A, B):<10.4f} {overlap(A, C):<10.4f} {overlap(B, C):<10.4f}")

print("Note: Overlap = 1.0 whenever one set is a subset!")
print("C ⊆ A and C ⊆ B, so Overlap(A,C) = Overlap(B,C) = 1.0")

# Verify Dice-Jaccard relationship
j = jaccard(A, B)
d = dice(A, B)
print("Verifying Dice = 2*Jaccard/(1+Jaccard):")
print(f"  2*{j:.4f}/(1+{j:.4f}) = {2*j/(1+j):.4f} = Dice = {d:.4f} ✓")
```

Kernel functions are a powerful class of similarity measures that implicitly compute inner products in high-dimensional (possibly infinite-dimensional) feature spaces. They're fundamental to Support Vector Machines, Gaussian Processes, and kernel methods generally.
Definition:
A function $k: X \times X \to \mathbb{R}$ is a valid kernel (positive semi-definite kernel) if there exists a feature map $\phi: X \to \mathcal{H}$ to some Hilbert space $\mathcal{H}$ such that:
$$k(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle_{\mathcal{H}}$$
Kernels can be viewed as similarity measures because inner products generalize the notion of 'alignment' between vectors.
Common Kernels:
Linear Kernel: $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y}$
Polynomial Kernel: $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + c)^d$
Radial Basis Function (RBF/Gaussian) Kernel: $$k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2)$$
Sigmoid Kernel: $k(\mathbf{x}, \mathbf{y}) = \tanh(\alpha \mathbf{x}^T\mathbf{y} + c)$
```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

# Sample data points
X = np.array([[1, 2], [3, 4], [5, 6]])

print("Kernel Function Demonstrations")
print("=" * 50)
print(f"Data points:\n{X}")

# Linear kernel = dot product similarity
K_linear = linear_kernel(X)
print("Linear Kernel (X @ X.T):")
print(K_linear)

# Polynomial kernel
K_poly = polynomial_kernel(X, degree=2, coef0=1)
print("Polynomial Kernel (degree=2, c=1):")
print(K_poly.round(2))

# RBF kernel - similarity decays with distance
K_rbf = rbf_kernel(X, gamma=0.1)
print("RBF Kernel (gamma=0.1):")
print(K_rbf.round(4))

# Demonstrate RBF behavior
print("RBF Kernel Behavior:")
x = np.array([[0, 0]])
distances = [0.0, 0.5, 1.0, 2.0, 5.0]
gamma = 0.5
for d in distances:
    y = np.array([[d, 0]])  # Point at distance d
    k = rbf_kernel(x, y, gamma=gamma)[0, 0]
    print(f"  Distance = {d:.1f} → RBF similarity = {k:.4f}")

# Show that similar points have high kernel value
print("Key insight: the RBF kernel is 1.0 for identical points")
print("and approaches 0 for far-apart points.")
print("It's a localized similarity measure!")
```

The power of kernels is that we never need to compute $\phi(\mathbf{x})$ explicitly—we only need the kernel values $k(\mathbf{x}, \mathbf{y})$. This 'kernel trick' allows us to work in infinite-dimensional feature spaces efficiently. SVMs, Gaussian Processes, and kernel PCA all exploit this to achieve nonlinear capabilities with linear algorithm complexity.
In practice, we often need to convert between distance and similarity representations. The relationship isn't unique—multiple valid conversions exist, each with different properties.
Common Conversion Functions:
Given a distance $d \geq 0$, we can define similarity $s$ as:
| Method | Formula | Range | Notes |
|---|---|---|---|
| Subtraction | $s = 1 - d/d_{\max}$ | $[0, 1]$ | Requires bounded distance |
| Reciprocal | $s = 1 / (1 + d)$ | $(0, 1]$ | Always valid; smooth; common choice |
| Exponential (RBF) | $s = e^{-\gamma d^2}$ | $(0, 1]$ | Rapid decay; tunable via γ |
| Gaussian | $s = e^{-d^2 / (2\sigma^2)}$ | $(0, 1]$ | Same as RBF with $\gamma = 1/(2\sigma^2)$ |
| Linear decay | $s = \max(0, 1 - \alpha d)$ | $[0, 1]$ | Cuts off at $d = 1/\alpha$ |
Similarity to Distance:
Given similarity $s \in [0, 1]$ (higher = more similar), common choices are $d = 1 - s$, $d = \sqrt{1 - s}$, or $d = -\ln s$.
Caution: Not all conversions preserve metric properties! Even if $d$ was a metric, $s = 1 - d$ may not yield a similarity that converts back to a metric.
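One conversion worth knowing: for unit vectors, $d = \sqrt{2(1 - s)}$ applied to cosine similarity equals the Euclidean (chord) distance between them, so it inherits Euclidean's metric properties. A small sketch (vectors chosen for illustration):

```python
import numpy as np

def cos_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def chord(u, v):
    """d = sqrt(2 * (1 - s)): chord length between unit vectors."""
    return np.sqrt(2 * (1 - cos_sim(u, v)))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
z = np.array([1.0, 1.0]) / np.sqrt(2)

# Equals Euclidean distance for unit vectors...
print(chord(x, y), np.linalg.norm(x - y))  # both sqrt(2)
# ...and the triangle inequality holds here, unlike for d = 1 - s
print(chord(x, y) <= chord(x, z) + chord(z, y))
```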
```python
import numpy as np

def distance_to_similarity(d, method='exponential', **kwargs):
    """Convert distance to similarity using various methods."""
    if method == 'reciprocal':
        return 1 / (1 + d)
    elif method == 'exponential':
        gamma = kwargs.get('gamma', 1.0)
        return np.exp(-gamma * d**2)
    elif method == 'linear':
        d_max = kwargs.get('d_max', d.max())
        return 1 - d / d_max
    elif method == 'linear_cutoff':
        alpha = kwargs.get('alpha', 1.0)
        return np.maximum(0, 1 - alpha * d)
    else:
        raise ValueError(f"Unknown method: {method}")

# Example distances
distances = np.array([0.0, 0.5, 1.0, 2.0, 5.0, 10.0])

print("Distance to Similarity Conversions")
print("=" * 60)
print(f"Distances: {distances}")

methods = [
    ('reciprocal', {}),
    ('exponential', {'gamma': 0.5}),
    ('exponential', {'gamma': 1.0}),
    ('linear', {'d_max': 10.0}),
    ('linear_cutoff', {'alpha': 0.2}),
]

for method, kwargs in methods:
    sim = distance_to_similarity(distances, method=method, **kwargs)
    param_str = ', '.join(f'{k}={v}' for k, v in kwargs.items())
    print(f"{method}({param_str}):")
    print(f"  {np.round(sim, 4)}")

# Converting cosine distance back
print("Cosine Distance ↔ Cosine Similarity:")
cosine_sims = np.array([1.0, 0.9, 0.5, 0.0, -0.5, -1.0])
cosine_dists = 1 - cosine_sims
print(f"  Similarity: {cosine_sims}")
print(f"  Distance:   {cosine_dists}")  # 1 - sim
print("  Note: cosine 'distance' is not a true metric!")
```

When choosing a conversion: (1) Use exponential/RBF for smooth, tunable decay—ideal for kernel methods. (2) Use reciprocal for a simple, always-valid conversion. (3) Use linear only when distance is bounded and you want uniform sensitivity. (4) Tune parameters (γ, σ) based on the typical scale of distances in your data.
Let's consolidate how similarity measures are used across major ML domains.
| Domain | Common Similarities | Why |
|---|---|---|
| Text/NLP | Cosine on TF-IDF, BM25 | Handles varying document lengths; magnitude invariant |
| Word Embeddings | Cosine similarity | Semantic similarity in direction, not magnitude |
| Recommender Systems | Pearson, Cosine, Jaccard | User/item similarity with different rating scales |
| Image Retrieval | L2 on embeddings, Cosine | CNN features are already normalized in direction |
| Near-Duplicate Detection | Jaccard + MinHash, SimHash | Efficient for large-scale set comparisons |
| Graph/Network Analysis | Jaccard on neighbors, Cosine on node vectors | Structural similarity, community detection |
| Bioinformatics | BLAST scores, Edit distance | Sequence alignment |
| Time Series | DTW, Pearson, Euclidean | Alignment-tolerant or shape-based |
Traditional ML uses fixed similarity measures chosen based on domain knowledge. Modern deep learning often learns the right similarity by training embedding networks with contrastive or triplet losses. The embeddings implicitly define a similarity (typically cosine or Euclidean distance in the embedding space) that's optimized for the task.
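The contrastive/triplet idea can be sketched in a few lines of numpy. This is a minimal illustration of the loss itself, not a training loop; the vectors and margin are arbitrary choices:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: pull the anchor toward the positive, push it from the negative."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to a similar item
    d_neg = np.linalg.norm(anchor - negative)  # distance to a dissimilar item
    return max(0.0, d_pos - d_neg + margin)    # zero once d_neg > d_pos + margin

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])    # embedding that should stay close
negative = np.array([-1.0, 0.5])   # embedding that should stay far

loss = triplet_loss(anchor, positive, negative)
print(loss)  # 0.0: the negative is already margin farther away than the positive
```

Minimizing this loss over many triplets shapes the embedding space so that Euclidean (or cosine) comparisons in it reflect task-specific similarity.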
What's Next:
We've covered vector norms, matrix norms, distance metrics, and similarity measures. The final page of this module explores norm regularization—how L1 and L2 norms on model parameters prevent overfitting, the geometric intuition behind sparsity, and practical guidelines for regularization in machine learning.
You now have a comprehensive understanding of similarity measures, from geometric cosine similarity to set-based Jaccard to statistical correlation. This toolkit enables you to choose appropriate comparisons for any data type and application, whether in classical ML or modern deep learning.