At the heart of machine learning lies a simple question: How similar or different are two objects? Whether you're finding nearest neighbors, clustering data points, measuring prediction errors, or retrieving similar documents, you need a principled way to quantify 'distance' or 'difference.'
A distance metric (or simply metric) is a mathematical function that tells us how far apart two points are. But not any function qualifies—metrics must satisfy specific axioms that capture our intuitive understanding of distance.
Choosing the right distance metric is often as important as choosing the right algorithm. The same K-nearest neighbors algorithm with Euclidean distance behaves completely differently than with cosine distance. The same clustering algorithm produces different results with Manhattan distance versus Mahalanobis distance.
This page develops the rigorous theory of distance metrics and explores their practical implications in machine learning.
By the end of this page, you will understand the mathematical definition of a metric space, the axioms that define valid distance functions, the most important distances in ML (Euclidean, Manhattan, Chebyshev, Minkowski), and when each is appropriate for different problem types.
A metric (or distance function) on a set $X$ is a function $d: X \times X \to \mathbb{R}$ that formalizes our intuitive notion of distance. The pair $(X, d)$ is called a metric space.
Formal Definition:
A function $d$ is a metric if and only if for all points $\mathbf{x}, \mathbf{y}, \mathbf{z} \in X$, the following four axioms hold:

1. Non-negativity: $d(\mathbf{x}, \mathbf{y}) \geq 0$
2. Identity of indiscernibles: $d(\mathbf{x}, \mathbf{y}) = 0$ if and only if $\mathbf{x} = \mathbf{y}$
3. Symmetry: $d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})$
4. Triangle inequality: $d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})$
The Triangle Inequality: Geometric Intuition
The triangle inequality captures the fundamental property that the shortest path between two points is a straight line. In a triangle with vertices at $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$, the direct side is never longer than the detour through the third vertex:

$$d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})$$
This property is essential for algorithms: it enables efficient search (if A is far from B and B is close to C, we know something about A's distance to C), makes KD-trees and ball trees work, and underlies clustering validity.
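As a quick sanity check, all four axioms can be verified numerically for the Euclidean distance on random points. This is a minimal sketch with NumPy (random triples, not a proof):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return np.linalg.norm(a - b)

rng = np.random.default_rng(0)
for _ in range(1000):
    # Three random points in R^4
    x, y, z = rng.standard_normal((3, 4))
    assert euclidean(x, y) >= 0                          # non-negativity
    assert np.isclose(euclidean(x, x), 0)                # identity of indiscernibles
    assert np.isclose(euclidean(x, y), euclidean(y, x))  # symmetry
    # triangle inequality (small tolerance for floating point)
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12

print("All four metric axioms hold on 1000 random triples.")
```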
Some useful functions don't satisfy all axioms. A pseudometric allows $d(\mathbf{x}, \mathbf{y}) = 0$ for distinct points (violating identity of indiscernibles). A quasimetric allows asymmetric distances where $d(\mathbf{x}, \mathbf{y}) \neq d(\mathbf{y}, \mathbf{x})$ (like travel time on one-way streets). ML sometimes uses these relaxations intentionally.
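To make the pseudometric relaxation concrete, here is a hypothetical example: measuring distance using only the first coordinate. It is symmetric and satisfies the triangle inequality, but distinct points can be at distance zero:

```python
import numpy as np

def first_coord_pseudometric(x, y):
    """Pseudometric: compares only the first coordinate.
    Symmetric, non-negative, satisfies the triangle inequality,
    but distinct points can have distance 0."""
    return abs(x[0] - y[0])

a = np.array([1.0, 5.0])
b = np.array([1.0, -3.0])  # distinct from a, same first coordinate
print(first_coord_pseudometric(a, b))  # 0.0 even though a != b
```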
Connection to Norms:
Every norm $\|\cdot\|$ on a vector space induces a metric:

$$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|$$
This is the 'norm-induced metric.' The norm axioms guarantee that this function satisfies all metric axioms. However, not all metrics come from norms—metrics can be defined on sets that aren't even vector spaces (like the space of strings with edit distance).
The Minkowski distance is a generalized family of distances that includes the most commonly used metrics in machine learning. For vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and parameter $p \geq 1$:
$$d_p(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_p = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
This is simply the Lp norm applied to the difference vector. The three most important special cases are:
| p | Name | Formula | Geometric Meaning |
|---|---|---|---|
| 1 | Manhattan / L1 / Taxicab | $\sum_i |x_i - y_i|$ | Grid-based distance (sum of axis-aligned steps) |
| 2 | Euclidean / L2 | $\sqrt{\sum_i (x_i - y_i)^2}$ | Straight-line distance (ordinary geometry) |
| ∞ | Chebyshev / L∞ | $\max_i |x_i - y_i|$ | Maximum difference in any coordinate |
Euclidean Distance (p=2):
The Euclidean distance is our intuitive 'straight-line' distance—what you'd measure with a ruler. It's the default choice when no specific domain knowledge suggests otherwise.
$$d_2(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_n-y_n)^2}$$
Properties:

- Induced by the L2 norm, so it satisfies all metric axioms.
- Rotation invariant: rigid rotations of the data leave all pairwise distances unchanged.
- Scale sensitive: features measured on larger scales dominate the distance.
- The quadratic penalty makes it sensitive to outliers in individual coordinates.
Manhattan Distance (p=1):
The Manhattan distance measures the total axis-aligned movement needed to go from one point to another—like navigating a city grid.
$$d_1(\mathbf{x}, \mathbf{y}) = |x_1-y_1| + |x_2-y_2| + \cdots + |x_n-y_n|$$
When to prefer Manhattan:

- When robustness to outliers matters: the linear (rather than quadratic) penalty limits the influence of any single coordinate.
- For grid-like movement or costs that accumulate per axis (logistics, circuit routing).
- In high-dimensional or sparse settings, where L1 distances are often more discriminative than L2.
Chebyshev Distance (p=∞):
The Chebyshev distance is the maximum difference in any single coordinate—the worst-case deviation.
$$d_\infty(\mathbf{x}, \mathbf{y}) = \max_{i} |x_i - y_i|$$
When to use Chebyshev:

- When only the worst-case coordinate deviation matters (tolerance checking, quality control).
- For movement where diagonal steps cost the same as axis-aligned steps (a chess king's moves).
- As a bound on the other Minkowski distances: $d_\infty \leq d_2 \leq d_1$.
```python
import numpy as np
from scipy.spatial.distance import minkowski, cityblock, chebyshev

# Define two points
x = np.array([1, 2, 3])
y = np.array([4, 0, 5])

print(f"Point x: {x}")
print(f"Point y: {y}")
print(f"Difference: {y - x}")

# Compute various Minkowski distances
d1 = np.sum(np.abs(x - y))        # Manhattan
d2 = np.sqrt(np.sum((x - y)**2))  # Euclidean
dinf = np.max(np.abs(x - y))      # Chebyshev

print("Minkowski distances:")
print(f"  p=1 (Manhattan): d₁ = |3| + |-2| + |2| = {d1}")
print(f"  p=2 (Euclidean): d₂ = √(9 + 4 + 4) = √17 = {d2:.4f}")
print(f"  p=∞ (Chebyshev): d∞ = max(3, 2, 2) = {dinf}")

# Verify with scipy
print("Scipy verification:")
print(f"  cityblock: {cityblock(x, y)}")
print(f"  minkowski p=2: {minkowski(x, y, p=2):.4f}")
print(f"  chebyshev: {chebyshev(x, y)}")

# Effect of different p values
print("Minkowski distance for various p:")
for p in [0.5, 1, 1.5, 2, 3, 5, 10, 50]:
    if p < 1:
        # Not a true metric, but still computable
        d = np.sum(np.abs(x - y)**p)**(1/p)
    else:
        d = minkowski(x, y, p=p)
    print(f"  p={p:>4}: d = {d:.4f}")

# The inequalities
print("Verifying: d∞ ≤ d₂ ≤ d₁ ≤ √n · d₂")
n = len(x)
print(f"  {dinf:.4f} ≤ {d2:.4f} ≤ {d1:.4f} ≤ {np.sqrt(n)*d2:.4f}")
```

The Euclidean distance is the default choice in ML, but it comes with important properties and limitations that practitioners must understand.
Mathematical Properties:
$$d(\mathbf{x}, \mathbf{y})^2 = \|\mathbf{x} - \mathbf{y}\|_2^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\mathbf{x}^T\mathbf{y}$$
This expansion is fundamental—it shows that squared Euclidean distance can be computed from norms and inner products, enabling efficient computation in many contexts.
In high dimensions, a surprising phenomenon occurs: all points become roughly equidistant! For random points uniformly distributed in a high-dimensional hypercube, the ratio of maximum to minimum distance approaches 1. This means nearest-neighbor methods become meaningless—the nearest neighbor is barely closer than the farthest point. Solutions include dimensionality reduction, careful feature engineering, and alternative distance metrics.
Squared Euclidean Distance:
In practice, we often use squared Euclidean distance:
$$d^2(\mathbf{x}, \mathbf{y}) = \sum_i (x_i - y_i)^2$$
Advantages:

- Avoids the square root, the most expensive part of the computation.
- Differentiable everywhere (the square root is not differentiable at 0), which simplifies optimization.
- Appears naturally in objectives such as the k-means within-cluster sum of squares.
- Monotone in the true distance, so nearest-neighbor orderings are unchanged.
Caution: Squared distance is NOT a metric (violates triangle inequality), but it preserves the essential ordering for nearest-neighbor search.
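A concrete illustration of both claims: squared distance violates the triangle inequality even in one dimension, yet it induces exactly the same nearest-neighbor ordering because squaring is monotone on non-negative values.

```python
import numpy as np

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return np.sum((x - y)**2)

# Triangle inequality fails for squared distance in 1-D:
x, y, z = np.array([0.0]), np.array([1.0]), np.array([2.0])
print(sq_dist(x, z), "vs", sq_dist(x, y) + sq_dist(y, z))  # 4.0 vs 2.0

# But nearest-neighbor ordering is preserved:
rng = np.random.default_rng(0)
query = rng.standard_normal(3)
points = rng.standard_normal((50, 3))
order_sq = np.argsort([sq_dist(query, p) for p in points])
order_eu = np.argsort([np.sqrt(sq_dist(query, p)) for p in points])
assert np.array_equal(order_sq, order_eu)
print("Squared and true Euclidean distance give identical orderings.")
```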
```python
import numpy as np

def euclidean_distance(x, y):
    """Standard Euclidean distance."""
    return np.sqrt(np.sum((x - y)**2))

def euclidean_via_inner_product(x, y):
    """Euclidean distance via the inner-product formula."""
    # d² = ||x||² + ||y||² - 2<x,y>
    return np.sqrt(np.dot(x, x) + np.dot(y, y) - 2 * np.dot(x, y))

# Test equivalence
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])

print(f"Direct computation: {euclidean_distance(x, y):.6f}")
print(f"Via inner product:  {euclidean_via_inner_product(x, y):.6f}")

# Demonstrate curse of dimensionality
print("--- Curse of Dimensionality ---")
np.random.seed(42)

for dim in [2, 10, 100, 1000]:
    # Generate random points in [0,1]^dim
    n_points = 100
    points = np.random.rand(n_points, dim)

    # Compute all pairwise distances
    distances = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            distances.append(euclidean_distance(points[i], points[j]))
    distances = np.array(distances)

    ratio = distances.max() / distances.min()
    print(f"Dim={dim:4}: min_dist={distances.min():.3f}, "
          f"max_dist={distances.max():.3f}, ratio={ratio:.2f}")

# Notice the ratio approaches 1 as dimension increases!

# Scaling sensitivity
print("--- Scale Sensitivity ---")
x = np.array([1, 100])  # Different scales
y = np.array([2, 100])
z = np.array([1, 200])

print(f"x = {x}, y = {y}, z = {z}")
print(f"d(x,y) = {euclidean_distance(x, y):.4f} (differ in small-scale feature)")
print(f"d(x,z) = {euclidean_distance(x, z):.4f} (differ in large-scale feature)")
print("Distance is dominated by the large-scale feature!")
```

The Mahalanobis distance addresses a critical limitation of Euclidean distance: it ignores the statistical structure of the data. In datasets with correlated features or different variances, Euclidean distance can be misleading.
Definition:
Given a covariance matrix $\mathbf{\Sigma}$ (typically estimated from data), the Mahalanobis distance between $\mathbf{x}$ and $\mathbf{y}$ is:
$$d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{y})}$$
From a point $\mathbf{x}$ to the distribution mean $\boldsymbol{\mu}$:
$$d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
Intuition: What Mahalanobis Distance Does
Decorrelates features: The $\mathbf{\Sigma}^{-1}$ term removes correlations between features.
Normalizes by variance: Features with high variance contribute less; features with low variance (more 'informative') contribute more.
Accounts for elliptical contours: While Euclidean distance has spherical iso-distance contours, Mahalanobis distance has elliptical contours aligned with the data distribution.
Relationship to transformation:
Mahalanobis distance equals Euclidean distance after 'whitening' the data. If $\mathbf{\Sigma} = \mathbf{L}\mathbf{L}^T$ (Cholesky decomposition), then:
$$d_M(\mathbf{x}, \mathbf{y}) = \|\mathbf{L}^{-1}(\mathbf{x} - \mathbf{y})\|_2$$
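This equivalence is easy to verify numerically. A short sketch, assuming a fixed covariance matrix rather than one estimated from data:

```python
import numpy as np

sigma = np.array([[2.0, 1.5],
                  [1.5, 3.0]])      # covariance matrix
L = np.linalg.cholesky(sigma)       # sigma = L @ L.T

x = np.array([6.0, 5.0])
y = np.array([4.0, 7.0])
diff = x - y

# Mahalanobis distance via the inverse covariance
d_mah = np.sqrt(diff @ np.linalg.inv(sigma) @ diff)

# Euclidean distance after whitening with L^{-1}
whitened = np.linalg.solve(L, diff)  # computes L^{-1} (x - y)
d_white = np.linalg.norm(whitened)

print(f"{d_mah:.6f} == {d_white:.6f}")
assert np.isclose(d_mah, d_white)
```

The `np.linalg.solve` call avoids explicitly forming $\mathbf{L}^{-1}$, which is both faster and more numerically stable than inverting the matrix.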
```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Generate correlated 2D data
np.random.seed(42)
n = 100
mean = [5, 5]
cov = [[2, 1.5],   # Feature 1 has variance 2, covariance 1.5
       [1.5, 3]]   # Feature 2 has variance 3, covariance 1.5
data = np.random.multivariate_normal(mean, cov, n)

# Estimate covariance from data
sample_cov = np.cov(data.T)
sample_mean = np.mean(data, axis=0)
cov_inv = np.linalg.inv(sample_cov)

print(f"Mean: {sample_mean}")
print(f"Covariance:\n{sample_cov}")

# Two test points
x = np.array([6, 5])  # Slightly off in x direction
y = np.array([5, 6])  # Slightly off in y direction

# Euclidean distances from mean
d_euc_x = np.linalg.norm(x - sample_mean)
d_euc_y = np.linalg.norm(y - sample_mean)

# Mahalanobis distances from mean
d_mah_x = mahalanobis(x, sample_mean, cov_inv)
d_mah_y = mahalanobis(y, sample_mean, cov_inv)

print(f"Point x = {x}")
print(f"  Euclidean to mean:   {d_euc_x:.4f}")
print(f"  Mahalanobis to mean: {d_mah_x:.4f}")

print(f"Point y = {y}")
print(f"  Euclidean to mean:   {d_euc_y:.4f}")
print(f"  Mahalanobis to mean: {d_mah_y:.4f}")

# Note: though x and y are roughly equidistant from the mean in the Euclidean
# sense, their Mahalanobis distances differ due to the different variances

# Outlier detection example
outlier = np.array([10, 10])
d_mah_outlier = mahalanobis(outlier, sample_mean, cov_inv)
print(f"Outlier = {outlier}")
print(f"  Mahalanobis distance: {d_mah_outlier:.4f}")
print("  (Typically, d_M > 3 suggests an outlier for Gaussian data)")
```

The squared Mahalanobis distance appears directly in the multivariate Gaussian PDF: $p(\mathbf{x}) \propto \exp(-\frac{1}{2} d_M^2(\mathbf{x}, \boldsymbol{\mu}))$. Points with equal Mahalanobis distance from the mean have equal probability. This makes Mahalanobis distance natural for Gaussian-modeled data.
Not all data lives in continuous vector spaces. For categorical data, strings, and sequences, we need distance metrics designed for discrete structures.
Hamming Distance:
The Hamming distance between two strings of equal length is the number of positions where corresponding characters differ.
$$d_H(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \mathbf{1}[x_i \neq y_i]$$
where $\mathbf{1}[\cdot]$ is the indicator function (1 if true, 0 if false).
Examples:

- 'karolin' vs 'kathrin': the strings differ in 3 positions, so $d_H = 3$.
- '1011101' vs '1001001': the bit strings differ in 2 positions, so $d_H = 2$.
Applications: Error-correcting codes, DNA sequence comparison, feature hashing, locality-sensitive hashing.
Levenshtein (Edit) Distance:
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) to transform one string into another. Unlike Hamming distance, it works on strings of different lengths.
$$d_L(\mathbf{x}, \mathbf{y}) = \min \{\, \text{insertions} + \text{deletions} + \text{substitutions} \,\}$$
Examples:

- 'kitten' → 'sitting': 3 edits (substitute k→s, substitute e→i, insert g).
- 'saturday' → 'sunday': 3 edits (delete a, delete t, substitute r→n).
Algorithm: Dynamic programming in O(mn) time and space, where m and n are string lengths.
```python
import numpy as np

def hamming_distance(x, y):
    """Hamming distance for equal-length strings or arrays."""
    if len(x) != len(y):
        raise ValueError("Strings must have equal length for Hamming distance")
    return sum(c1 != c2 for c1, c2 in zip(x, y))

def levenshtein_distance(s1, s2):
    """Levenshtein (edit) distance using dynamic programming."""
    m, n = len(s1), len(s2)

    # Create distance matrix
    dp = np.zeros((m + 1, n + 1), dtype=int)

    # Base cases: distance from/to empty string
    for i in range(m + 1):
        dp[i, 0] = i  # i deletions
    for j in range(n + 1):
        dp[0, j] = j  # j insertions

    # Fill the matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i, j] = dp[i-1, j-1]  # Characters match, no cost
            else:
                dp[i, j] = 1 + min(
                    dp[i-1, j],    # Deletion
                    dp[i, j-1],    # Insertion
                    dp[i-1, j-1]   # Substitution
                )
    return dp[m, n]

# Hamming distance examples
print("Hamming Distance Examples:")
print(f"  'karolin' vs 'kathrin': {hamming_distance('karolin', 'kathrin')}")
print(f"  '1011101' vs '1001001': {hamming_distance('1011101', '1001001')}")
print(f"  Binary vectors: {hamming_distance([1,0,1,1,1,0,1], [1,0,0,1,0,0,1])}")

# Levenshtein distance examples
print("Levenshtein Distance Examples:")
print(f"  'kitten' → 'sitting': {levenshtein_distance('kitten', 'sitting')}")
print(f"  'saturday' → 'sunday': {levenshtein_distance('saturday', 'sunday')}")
print(f"  'book' → 'back': {levenshtein_distance('book', 'back')}")
print(f"  '' → 'hello': {levenshtein_distance('', 'hello')} (5 insertions)")

# Normalized edit distance (for different-length strings)
s1, s2 = "algorithm", "altruistic"
raw_dist = levenshtein_distance(s1, s2)
normalized = raw_dist / max(len(s1), len(s2))
print(f"'{s1}' vs '{s2}':")
print(f"  Raw distance: {raw_dist}")
print(f"  Normalized (÷ max length): {normalized:.3f}")
```

Hamming distance is used in locality-sensitive hashing (LSH) for approximate nearest neighbors.
Edit distance is essential for spell checkers, DNA alignment (bioinformatics), and measuring similarity between categorical sequences. Many NLP tasks use edit distance variants (word edit distance, tree edit distance for parse trees).
The choice of distance metric profoundly affects algorithm behavior. There's no universally 'best' metric—the right choice depends on your data, the domain, and what notion of similarity is meaningful for your problem.
Decision Framework:
| Data Type / Situation | Recommended Metric | Rationale |
|---|---|---|
| Generic continuous data | Euclidean (L2) | Default choice; intuitive; well-studied |
| Features with different scales | Standardized Euclidean or Mahalanobis | Accounts for variance differences |
| Correlated features | Mahalanobis | Decorrelates and normalizes |
| Sparse high-dimensional data | Cosine distance | Handles magnitude invariance; see next page |
| Robust to outliers needed | Manhattan (L1) | Linear rather than quadratic penalty |
| Want worst-case bounds | Chebyshev (L∞) | Maximum component difference |
| Binary/categorical vectors | Hamming | Counts mismatches |
| Strings of different lengths | Levenshtein / Edit distance | Minimum transformations |
| Probability distributions | KL divergence / Wasserstein | See information theory |
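Most of the metrics in the table are available out of the box. For instance, SciPy's `cdist` computes pairwise distances under many named metrics; a short sketch (note that SciPy's `hamming` returns the *fraction* of mismatching coordinates, not the count):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[3.0, 4.0]])

# Continuous metrics from the table
for metric in ["euclidean", "cityblock", "chebyshev"]:
    d = cdist(X, Y, metric=metric)
    print(f"{metric:>10}: {d.ravel()}")

# Hamming on binary vectors: 2 of 4 positions differ -> 0.5
A = np.array([[1, 0, 1, 1]])
B = np.array([[1, 1, 1, 0]])
print("hamming fraction:", cdist(A, B, metric="hamming").item())
```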
Instead of choosing a fixed metric, you can learn the optimal metric from data! Metric learning algorithms (LMNN, NCA, Siamese networks) learn a transformation matrix $\mathbf{M}$ such that $d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^T\mathbf{M}(\mathbf{x}-\mathbf{y})}$ optimizes some objective (e.g., nearest neighbors of the same class are close).
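A minimal illustration of this parametric form, with an arbitrary positive semidefinite matrix $\mathbf{M} = \mathbf{A}^T\mathbf{A}$ standing in for a learned one: the learned distance is just the Euclidean distance in the space transformed by $\mathbf{A}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
M = A.T @ A  # any M = AᵀA is positive semidefinite

def learned_dist(x, y, M):
    """Generalized (Mahalanobis-form) distance parameterized by M."""
    d = x - y
    return np.sqrt(d @ M @ d)

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])

# Equivalent to Euclidean distance between the transformed points A x and A y
print(np.isclose(learned_dist(x, y, M), np.linalg.norm(A @ (x - y))))  # True
```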
High-dimensional data is ubiquitous in machine learning: images (thousands of pixels), text (vocabulary-sized vectors), genomics (millions of base pairs). Understanding how distance behaves in high dimensions is critical.
The Concentration Phenomenon:
As dimension $d \to \infty$, for random iid data:
$$\frac{d_{\max} - d_{\min}}{d_{\min}} \to 0$$
The relative difference between the nearest and farthest neighbors vanishes! All points become approximately equidistant from any query point.
Why This Happens:
For independent coordinate differences, the variance of the squared distance grows linearly with dimension: $$\text{Var}\left(\sum_i X_i\right) = \sum_i \text{Var}(X_i) = d \cdot \sigma^2$$
So the standard deviation of the squared distance grows only as $\sqrt{d}$, while its mean grows linearly in $d$. The coefficient of variation therefore shrinks as $1/\sqrt{d}$, meaning distances concentrate around their mean.
Mitigation Strategies:
Dimensionality Reduction: PCA, UMAP, t-SNE, autoencoders project to lower dimensions where distances are more meaningful.
Feature Selection: Keep only the most informative features; remove noise.
Different Metrics: Cosine similarity (which measures angle, not magnitude) often works better in high dimensions.
Fractional Norms: Some research suggests Lp norms with p < 1 (quasi-norms) may concentrate less, though they're not true metrics.
Domain-Specific Distances: Use distances designed for your data type (e.g., edit distance for sequences, graph kernels for graphs).
```python
import numpy as np

def analyze_distance_concentration(dimensions, n_points=500, n_trials=10):
    """Analyze how distance concentration varies with dimension."""
    results = []
    for d in dimensions:
        ratios = []
        for _ in range(n_trials):
            # Generate random points in [0,1]^d
            points = np.random.rand(n_points, d)
            # Pick a random query point
            query = np.random.rand(d)
            # Compute all distances to query
            distances = np.linalg.norm(points - query, axis=1)
            # Compute the contrast ratio
            ratio = (distances.max() - distances.min()) / distances.min()
            ratios.append(ratio)
        results.append({
            'dim': d,
            'mean_ratio': np.mean(ratios),
            'std_ratio': np.std(ratios)
        })
    return results

# Analyze
dimensions = [2, 5, 10, 25, 50, 100, 250, 500, 1000]
results = analyze_distance_concentration(dimensions)

print("Distance Concentration Analysis")
print("=" * 50)
print(f"{'Dimension':<12} {'Mean Ratio':<15} {'Std':<10}")
print("-" * 50)
for r in results:
    print(f"{r['dim']:<12} {r['mean_ratio']:<15.4f} {r['std_ratio']:<10.4f}")

print("Note: As dimension increases, the contrast ratio")
print("(max_dist - min_dist) / min_dist decreases,")
print("meaning all points become roughly equidistant.")
print("This is the 'curse of dimensionality' in action!")
```

What's Next:
We've covered distances—how to measure differences. The next page explores similarity measures, which flip the perspective: instead of measuring how far apart objects are, we measure how similar they are. We'll cover cosine similarity, Jaccard index, and other similarity functions crucial for retrieval, recommendation, and clustering.
You now have a rigorous understanding of distance metrics, from their axiomatic foundations to practical considerations in high-dimensional ML. This knowledge is fundamental for K-NN, clustering, anomaly detection, and any algorithm that compares data points.