At the heart of machine learning lies a simple question: How similar or different are two objects? Whether you're finding nearest neighbors, clustering data points, measuring prediction errors, or retrieving similar documents, you need a principled way to quantify 'distance' or 'difference.'
A distance metric (or simply metric) is a mathematical function that tells us how far apart two points are. But not any function qualifies—metrics must satisfy specific axioms that capture our intuitive understanding of distance.
Choosing the right distance metric is often as important as choosing the right algorithm. The same K-nearest neighbors algorithm with Euclidean distance behaves completely differently than with cosine distance. The same clustering algorithm produces different results with Manhattan distance versus Mahalanobis distance.
This page develops the rigorous theory of distance metrics and explores their practical implications in machine learning.
By the end of this page, you will understand the mathematical definition of a metric space, the axioms that define valid distance functions, the most important distances in ML (Euclidean, Manhattan, Chebyshev, Minkowski), and when each is appropriate for different problem types.
A metric (or distance function) on a set $X$ is a function $d: X \times X \to \mathbb{R}$ that formalizes our intuitive notion of distance. The pair $(X, d)$ is called a metric space.
Formal Definition:
A function $d$ is a metric if and only if for all points $\mathbf{x}, \mathbf{y}, \mathbf{z} \in X$, the following four axioms hold:

1. Non-negativity: $d(\mathbf{x}, \mathbf{y}) \geq 0$
2. Identity of indiscernibles: $d(\mathbf{x}, \mathbf{y}) = 0$ if and only if $\mathbf{x} = \mathbf{y}$
3. Symmetry: $d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})$
4. Triangle inequality: $d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})$
The Triangle Inequality: Geometric Intuition
The triangle inequality captures the fundamental property that the shortest path between two points is a straight line. In a triangle with vertices at $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$, the direct side is never longer than the detour through the third vertex:

$$d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})$$
This property is essential for algorithms: it enables efficient search (if A is far from B and B is close to C, we know something about A's distance to C), makes KD-trees and ball trees work, and underlies clustering validity.
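As a quick sanity check, all four axioms can be verified numerically for the Euclidean distance on random points. This is a minimal sketch with NumPy (random triples, not a proof):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return np.linalg.norm(a - b)

rng = np.random.default_rng(0)
for _ in range(1000):
    # Three random points in R^4
    x, y, z = rng.standard_normal((3, 4))
    assert euclidean(x, y) >= 0                          # non-negativity
    assert np.isclose(euclidean(x, x), 0)                # identity of indiscernibles
    assert np.isclose(euclidean(x, y), euclidean(y, x))  # symmetry
    # triangle inequality (small tolerance for floating point)
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12

print("All four metric axioms hold on 1000 random triples.")
```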
Some useful functions don't satisfy all axioms. A pseudometric allows $d(\mathbf{x}, \mathbf{y}) = 0$ for distinct points (violating identity of indiscernibles). A quasimetric allows asymmetric distances where $d(\mathbf{x}, \mathbf{y}) \neq d(\mathbf{y}, \mathbf{x})$ (like travel time on one-way streets). ML sometimes uses these relaxations intentionally.
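To make the pseudometric relaxation concrete, here is a hypothetical example: measuring distance using only the first coordinate. It is symmetric and satisfies the triangle inequality, but distinct points can be at distance zero:

```python
import numpy as np

def first_coord_pseudometric(x, y):
    """Pseudometric: compares only the first coordinate.
    Symmetric, non-negative, satisfies the triangle inequality,
    but distinct points can have distance 0."""
    return abs(x[0] - y[0])

a = np.array([1.0, 5.0])
b = np.array([1.0, -3.0])  # distinct from a, same first coordinate
print(first_coord_pseudometric(a, b))  # 0.0 even though a != b
```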
Connection to Norms:
Every norm $\|\cdot\|$ on a vector space induces a metric:

$$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|$$
This is the 'norm-induced metric.' The norm axioms guarantee that this function satisfies all metric axioms. However, not all metrics come from norms—metrics can be defined on sets that aren't even vector spaces (like the space of strings with edit distance).
The Minkowski distance is a generalized family of distances that includes the most commonly used metrics in machine learning. For vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and parameter $p \geq 1$:
$$d_p(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_p = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
This is simply the Lp norm applied to the difference vector. The three most important special cases are:
| p | Name | Formula | Geometric Meaning |
|---|---|---|---|
| 1 | Manhattan / L1 / Taxicab | $\sum_i |x_i - y_i|$ | Grid-based distance (sum of axis-aligned steps) |
| 2 | Euclidean / L2 | $\sqrt{\sum_i (x_i - y_i)^2}$ | Straight-line distance (ordinary geometry) |
| ∞ | Chebyshev / L∞ | $\max_i |x_i - y_i|$ | Maximum difference in any coordinate |
Euclidean Distance (p=2):
The Euclidean distance is our intuitive 'straight-line' distance—what you'd measure with a ruler. It's the default choice when no specific domain knowledge suggests otherwise.
$$d_2(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_n-y_n)^2}$$
Properties:

- Induced by the L2 norm, so it satisfies all metric axioms.
- Rotation invariant: rigid rotations of the data leave all pairwise distances unchanged.
- Scale sensitive: features measured on larger scales dominate the distance.
- The quadratic penalty makes it sensitive to outliers in individual coordinates.
Manhattan Distance (p=1):
The Manhattan distance measures the total axis-aligned movement needed to go from one point to another—like navigating a city grid.
$$d_1(\mathbf{x}, \mathbf{y}) = |x_1-y_1| + |x_2-y_2| + \cdots + |x_n-y_n|$$
When to prefer Manhattan:

- When robustness to outliers matters: the linear (rather than quadratic) penalty limits the influence of any single coordinate.
- For grid-like movement or costs that accumulate per axis (logistics, circuit routing).
- In high-dimensional or sparse settings, where L1 distances are often more discriminative than L2.
Chebyshev Distance (p=∞):
The Chebyshev distance is the maximum difference in any single coordinate—the worst-case deviation.
$$d_\infty(\mathbf{x}, \mathbf{y}) = \max_{i} |x_i - y_i|$$
When to use Chebyshev:

- When only the worst-case coordinate deviation matters (tolerance checking, quality control).
- For movement where diagonal steps cost the same as axis-aligned steps (a chess king's moves).
- As a bound on the other Minkowski distances: $d_\infty \leq d_2 \leq d_1$.
```python
import numpy as np
from scipy.spatial.distance import minkowski, cityblock, chebyshev

# Define two points
x = np.array([1, 2, 3])
y = np.array([4, 0, 5])

print(f"Point x: {x}")
print(f"Point y: {y}")
print(f"Difference: {y - x}")

# Compute various Minkowski distances
d1 = np.sum(np.abs(x - y))        # Manhattan
d2 = np.sqrt(np.sum((x - y)**2))  # Euclidean
dinf = np.max(np.abs(x - y))      # Chebyshev

print("Minkowski distances:")
print(f"  p=1 (Manhattan): d₁ = |3| + |-2| + |2| = {d1}")
print(f"  p=2 (Euclidean): d₂ = √(9 + 4 + 4) = √17 = {d2:.4f}")
print(f"  p=∞ (Chebyshev): d∞ = max(3, 2, 2) = {dinf}")

# Verify with scipy
print("Scipy verification:")
print(f"  cityblock: {cityblock(x, y)}")
print(f"  minkowski p=2: {minkowski(x, y, p=2):.4f}")
print(f"  chebyshev: {chebyshev(x, y)}")

# Effect of different p values
print("Minkowski distance for various p:")
for p in [0.5, 1, 1.5, 2, 3, 5, 10, 50]:
    if p < 1:
        # Not a true metric, but still computable
        d = np.sum(np.abs(x - y)**p)**(1/p)
    else:
        d = minkowski(x, y, p=p)
    print(f"  p={p:>4}: d = {d:.4f}")

# The inequalities
print("Verifying: d∞ ≤ d₂ ≤ d₁ ≤ √n · d₂")
n = len(x)
print(f"  {dinf:.4f} ≤ {d2:.4f} ≤ {d1:.4f} ≤ {np.sqrt(n)*d2:.4f}")
```

The Euclidean distance is the default choice in ML, but it comes with important properties and limitations that practitioners must understand.
Mathematical Properties:
$$d(\mathbf{x}, \mathbf{y})^2 = \|\mathbf{x} - \mathbf{y}\|_2^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\mathbf{x}^T\mathbf{y}$$
This expansion is fundamental—it shows that squared Euclidean distance can be computed from norms and inner products, enabling efficient computation in many contexts.
In high dimensions, a surprising phenomenon occurs: all points become roughly equidistant! For random points uniformly distributed in a high-dimensional hypercube, the ratio of maximum to minimum distance approaches 1. This means nearest-neighbor methods become meaningless—the nearest neighbor is barely closer than the farthest point. Solutions include dimensionality reduction, careful feature engineering, and alternative distance metrics.
Squared Euclidean Distance:
In practice, we often use squared Euclidean distance:
$$d^2(\mathbf{x}, \mathbf{y}) = \sum_i (x_i - y_i)^2$$
Advantages:

- Avoids the square root, the most expensive part of the computation.
- Differentiable everywhere (the square root is not differentiable at 0), which simplifies optimization.
- Appears naturally in objectives such as the k-means within-cluster sum of squares.
- Monotone in the true distance, so nearest-neighbor orderings are unchanged.
Caution: Squared distance is NOT a metric (violates triangle inequality), but it preserves the essential ordering for nearest-neighbor search.
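A concrete illustration of both claims: squared distance violates the triangle inequality even in one dimension, yet it induces exactly the same nearest-neighbor ordering because squaring is monotone on non-negative values.

```python
import numpy as np

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return np.sum((x - y)**2)

# Triangle inequality fails for squared distance in 1-D:
x, y, z = np.array([0.0]), np.array([1.0]), np.array([2.0])
print(sq_dist(x, z), "vs", sq_dist(x, y) + sq_dist(y, z))  # 4.0 vs 2.0

# But nearest-neighbor ordering is preserved:
rng = np.random.default_rng(0)
query = rng.standard_normal(3)
points = rng.standard_normal((50, 3))
order_sq = np.argsort([sq_dist(query, p) for p in points])
order_eu = np.argsort([np.sqrt(sq_dist(query, p)) for p in points])
assert np.array_equal(order_sq, order_eu)
print("Squared and true Euclidean distance give identical orderings.")
```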
```python
import numpy as np

def euclidean_distance(x, y):
    """Standard Euclidean distance."""
    return np.sqrt(np.sum((x - y)**2))

def euclidean_via_inner_product(x, y):
    """Euclidean distance via the inner-product formula."""
    # d² = ||x||² + ||y||² - 2<x,y>
    return np.sqrt(np.dot(x, x) + np.dot(y, y) - 2 * np.dot(x, y))

# Test equivalence
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])

print(f"Direct computation: {euclidean_distance(x, y):.6f}")
print(f"Via inner product:  {euclidean_via_inner_product(x, y):.6f}")

# Demonstrate curse of dimensionality
print("--- Curse of Dimensionality ---")
np.random.seed(42)

for dim in [2, 10, 100, 1000]:
    # Generate random points in [0,1]^dim
    n_points = 100
    points = np.random.rand(n_points, dim)

    # Compute all pairwise distances
    distances = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            distances.append(euclidean_distance(points[i], points[j]))
    distances = np.array(distances)

    ratio = distances.max() / distances.min()
    print(f"Dim={dim:4}: min_dist={distances.min():.3f}, "
          f"max_dist={distances.max():.3f}, ratio={ratio:.2f}")

# Notice the ratio approaches 1 as dimension increases!

# Scaling sensitivity
print("--- Scale Sensitivity ---")
x = np.array([1, 100])  # Different scales
y = np.array([2, 100])
z = np.array([1, 200])

print(f"x = {x}, y = {y}, z = {z}")
print(f"d(x,y) = {euclidean_distance(x, y):.4f} (differ in small-scale feature)")
print(f"d(x,z) = {euclidean_distance(x, z):.4f} (differ in large-scale feature)")
print("Distance is dominated by the large-scale feature!")
```

The Mahalanobis distance addresses a critical limitation of Euclidean distance: it ignores the statistical structure of the data. In datasets with correlated features or different variances, Euclidean distance can be misleading.
Definition:
Given a covariance matrix $\mathbf{\Sigma}$ (typically estimated from data), the Mahalanobis distance between $\mathbf{x}$ and $\mathbf{y}$ is:
$$d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{y})}$$
From a point $\mathbf{x}$ to the distribution mean $\boldsymbol{\mu}$:
$$d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
Intuition: What Mahalanobis Distance Does
Decorrelates features: The $\mathbf{\Sigma}^{-1}$ term removes correlations between features.
Normalizes by variance: Features with high variance contribute less; features with low variance (more 'informative') contribute more.
Accounts for elliptical contours: While Euclidean distance has spherical iso-distance contours, Mahalanobis distance has elliptical contours aligned with the data distribution.
Relationship to transformation:
Mahalanobis distance equals Euclidean distance after 'whitening' the data. If $\mathbf{\Sigma} = \mathbf{L}\mathbf{L}^T$ (Cholesky decomposition), then:
$$d_M(\mathbf{x}, \mathbf{y}) = \|\mathbf{L}^{-1}(\mathbf{x} - \mathbf{y})\|_2$$
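This equivalence is easy to verify numerically. A short sketch, assuming a fixed covariance matrix rather than one estimated from data:

```python
import numpy as np

sigma = np.array([[2.0, 1.5],
                  [1.5, 3.0]])      # covariance matrix
L = np.linalg.cholesky(sigma)       # sigma = L @ L.T

x = np.array([6.0, 5.0])
y = np.array([4.0, 7.0])
diff = x - y

# Mahalanobis distance via the inverse covariance
d_mah = np.sqrt(diff @ np.linalg.inv(sigma) @ diff)

# Euclidean distance after whitening with L^{-1}
whitened = np.linalg.solve(L, diff)  # computes L^{-1} (x - y)
d_white = np.linalg.norm(whitened)

print(f"{d_mah:.6f} == {d_white:.6f}")
assert np.isclose(d_mah, d_white)
```

The `np.linalg.solve` call avoids explicitly forming $\mathbf{L}^{-1}$, which is both faster and more numerically stable than inverting the matrix.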
```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Generate correlated 2D data
np.random.seed(42)
n = 100
mean = [5, 5]
cov = [[2, 1.5],   # Feature 1 has variance 2, covariance 1.5
       [1.5, 3]]   # Feature 2 has variance 3, covariance 1.5
data = np.random.multivariate_normal(mean, cov, n)

# Estimate covariance from data
sample_cov = np.cov(data.T)
sample_mean = np.mean(data, axis=0)
cov_inv = np.linalg.inv(sample_cov)

print(f"Mean: {sample_mean}")
print(f"Covariance:\n{sample_cov}")

# Two test points
x = np.array([6, 5])  # Slightly off in x direction
y = np.array([5, 6])  # Slightly off in y direction

# Euclidean distances from mean
d_euc_x = np.linalg.norm(x - sample_mean)
d_euc_y = np.linalg.norm(y - sample_mean)

# Mahalanobis distances from mean
d_mah_x = mahalanobis(x, sample_mean, cov_inv)
d_mah_y = mahalanobis(y, sample_mean, cov_inv)

print(f"Point x = {x}")
print(f"  Euclidean to mean:   {d_euc_x:.4f}")
print(f"  Mahalanobis to mean: {d_mah_x:.4f}")

print(f"Point y = {y}")
print(f"  Euclidean to mean:   {d_euc_y:.4f}")
print(f"  Mahalanobis to mean: {d_mah_y:.4f}")

# Note: though x and y are roughly equidistant from the mean in the Euclidean
# sense, their Mahalanobis distances differ due to the different variances

# Outlier detection example
outlier = np.array([10, 10])
d_mah_outlier = mahalanobis(outlier, sample_mean, cov_inv)
print(f"Outlier = {outlier}")
print(f"  Mahalanobis distance: {d_mah_outlier:.4f}")
print("  (Typically, d_M > 3 suggests an outlier for Gaussian data)")
```

The squared Mahalanobis distance appears directly in the multivariate Gaussian PDF: $p(\mathbf{x}) \propto \exp(-\frac{1}{2} d_M^2(\mathbf{x}, \boldsymbol{\mu}))$. Points with equal Mahalanobis distance from the mean have equal probability. This makes Mahalanobis distance natural for Gaussian-modeled data.
Not all data lives in continuous vector spaces. For categorical data, strings, and sequences, we need distance metrics designed for discrete structures.
Hamming Distance:
The Hamming distance between two strings of equal length is the number of positions where corresponding characters differ.
$$d_H(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \mathbf{1}[x_i \neq y_i]$$
where $\mathbf{1}[\cdot]$ is the indicator function (1 if true, 0 if false).
Examples:

- 'karolin' vs 'kathrin': the strings differ in 3 positions, so $d_H = 3$.
- '1011101' vs '1001001': the bit strings differ in 2 positions, so $d_H = 2$.
Applications: Error-correcting codes, DNA sequence comparison, feature hashing, locality-sensitive hashing.
Levenshtein (Edit) Distance:
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) to transform one string into another. Unlike Hamming distance, it works on strings of different lengths.
$$d_L(\mathbf{x}, \mathbf{y}) = \min \{\, \text{insertions} + \text{deletions} + \text{substitutions} \,\}$$
Examples:

- 'kitten' → 'sitting': 3 edits (substitute k→s, substitute e→i, insert g).
- 'saturday' → 'sunday': 3 edits (delete a, delete t, substitute r→n).
Algorithm: Dynamic programming in O(mn) time and space, where m and n are string lengths.
```python
import numpy as np

def hamming_distance(x, y):
    """Hamming distance for equal-length strings or arrays."""
    if len(x) != len(y):
        raise ValueError("Strings must have equal length for Hamming distance")
    return sum(c1 != c2 for c1, c2 in zip(x, y))

def levenshtein_distance(s1, s2):
    """Levenshtein (edit) distance using dynamic programming."""
    m, n = len(s1), len(s2)

    # Create distance matrix
    dp = np.zeros((m + 1, n + 1), dtype=int)

    # Base cases: distance from/to empty string
    for i in range(m + 1):
        dp[i, 0] = i  # i deletions
    for j in range(n + 1):
        dp[0, j] = j  # j insertions

    # Fill the matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i, j] = dp[i-1, j-1]  # Characters match, no cost
            else:
                dp[i, j] = 1 + min(
                    dp[i-1, j],    # Deletion
                    dp[i, j-1],    # Insertion
                    dp[i-1, j-1]   # Substitution
                )
    return dp[m, n]

# Hamming distance examples
print("Hamming Distance Examples:")
print(f"  'karolin' vs 'kathrin': {hamming_distance('karolin', 'kathrin')}")
print(f"  '1011101' vs '1001001': {hamming_distance('1011101', '1001001')}")
print(f"  Binary vectors: {hamming_distance([1,0,1,1,1,0,1], [1,0,0,1,0,0,1])}")

# Levenshtein distance examples
print("Levenshtein Distance Examples:")
print(f"  'kitten' → 'sitting': {levenshtein_distance('kitten', 'sitting')}")
print(f"  'saturday' → 'sunday': {levenshtein_distance('saturday', 'sunday')}")
print(f"  'book' → 'back': {levenshtein_distance('book', 'back')}")
print(f"  '' → 'hello': {levenshtein_distance('', 'hello')} (5 insertions)")

# Normalized edit distance (for different-length strings)
s1, s2 = "algorithm", "altruistic"
raw_dist = levenshtein_distance(s1, s2)
normalized = raw_dist / max(len(s1), len(s2))
print(f"'{s1}' vs '{s2}':")
print(f"  Raw distance: {raw_dist}")
print(f"  Normalized (÷ max length): {normalized:.3f}")
```

Hamming distance is used in locality-sensitive hashing (LSH) for approximate nearest neighbors.
Edit distance is essential for spell checkers, DNA alignment (bioinformatics), and measuring similarity between categorical sequences. Many NLP tasks use edit distance variants (word edit distance, tree edit distance for parse trees).
The choice of distance metric profoundly affects algorithm behavior. There's no universally 'best' metric—the right choice depends on your data, the domain, and what notion of similarity is meaningful for your problem.
Decision Framework:
| Data Type / Situation | Recommended Metric | Rationale |
|---|---|---|
| Generic continuous data | Euclidean (L2) | Default choice; intuitive; well-studied |
| Features with different scales | Standardized Euclidean or Mahalanobis | Accounts for variance differences |
| Correlated features | Mahalanobis | Decorrelates and normalizes |
| Sparse high-dimensional data | Cosine distance | Handles magnitude invariance; see next page |
| Robust to outliers needed | Manhattan (L1) | Linear rather than quadratic penalty |
| Want worst-case bounds | Chebyshev (L∞) | Maximum component difference |
| Binary/categorical vectors | Hamming | Counts mismatches |
| Strings of different lengths | Levenshtein / Edit distance | Minimum transformations |
| Probability distributions | KL divergence / Wasserstein | See information theory |
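Most of the metrics in the table are available out of the box. For instance, SciPy's `cdist` computes pairwise distances under many named metrics; a short sketch (note that SciPy's `hamming` returns the *fraction* of mismatching coordinates, not the count):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[3.0, 4.0]])

# Continuous metrics from the table
for metric in ["euclidean", "cityblock", "chebyshev"]:
    d = cdist(X, Y, metric=metric)
    print(f"{metric:>10}: {d.ravel()}")

# Hamming on binary vectors: 2 of 4 positions differ -> 0.5
A = np.array([[1, 0, 1, 1]])
B = np.array([[1, 1, 1, 0]])
print("hamming fraction:", cdist(A, B, metric="hamming").item())
```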
Instead of choosing a fixed metric, you can learn the optimal metric from data! Metric learning algorithms (LMNN, NCA, Siamese networks) learn a transformation matrix $\mathbf{M}$ such that $d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^T\mathbf{M}(\mathbf{x}-\mathbf{y})}$ optimizes some objective (e.g., nearest neighbors of the same class are close).
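A minimal illustration of this parametric form, with an arbitrary positive semidefinite matrix $\mathbf{M} = \mathbf{A}^T\mathbf{A}$ standing in for a learned one: the learned distance is just the Euclidean distance in the space transformed by $\mathbf{A}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
M = A.T @ A  # any M = AᵀA is positive semidefinite

def learned_dist(x, y, M):
    """Generalized (Mahalanobis-form) distance parameterized by M."""
    d = x - y
    return np.sqrt(d @ M @ d)

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])

# Equivalent to Euclidean distance between the transformed points A x and A y
print(np.isclose(learned_dist(x, y, M), np.linalg.norm(A @ (x - y))))  # True
```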
High-dimensional data is ubiquitous in machine learning: images (thousands of pixels), text (vocabulary-sized vectors), genomics (millions of base pairs). Understanding how distance behaves in high dimensions is critical.
The Concentration Phenomenon:
As dimension $d \to \infty$, for random iid data:
$$\frac{d_{\max} - d_{\min}}{d_{\min}} \to 0$$
The relative difference between the nearest and farthest neighbors vanishes! All points become approximately equidistant from any query point.
Why This Happens:
For independent coordinate differences, the variance of the squared distance grows linearly with dimension: $$\text{Var}\left(\sum_i X_i\right) = \sum_i \text{Var}(X_i) = d \cdot \sigma^2$$
So the standard deviation of the squared distance grows only as $\sqrt{d}$, while its mean grows linearly in $d$. The coefficient of variation therefore shrinks as $1/\sqrt{d}$, meaning distances concentrate around their mean.
Mitigation Strategies:
Dimensionality Reduction: PCA, UMAP, t-SNE, autoencoders project to lower dimensions where distances are more meaningful.
Feature Selection: Keep only the most informative features; remove noise.
Different Metrics: Cosine similarity (which measures angle, not magnitude) often works better in high dimensions.
Fractional Norms: Some research suggests Lp norms with p < 1 (quasi-norms) may concentrate less, though they're not true metrics.
Domain-Specific Distances: Use distances designed for your data type (e.g., edit distance for sequences, graph kernels for graphs).
```python
import numpy as np

def analyze_distance_concentration(dimensions, n_points=500, n_trials=10):
    """Analyze how distance concentration varies with dimension."""
    results = []
    for d in dimensions:
        ratios = []
        for _ in range(n_trials):
            # Generate random points in [0,1]^d
            points = np.random.rand(n_points, d)
            # Pick a random query point
            query = np.random.rand(d)
            # Compute all distances to query
            distances = np.linalg.norm(points - query, axis=1)
            # Compute the contrast ratio
            ratio = (distances.max() - distances.min()) / distances.min()
            ratios.append(ratio)
        results.append({
            'dim': d,
            'mean_ratio': np.mean(ratios),
            'std_ratio': np.std(ratios)
        })
    return results

# Analyze
dimensions = [2, 5, 10, 25, 50, 100, 250, 500, 1000]
results = analyze_distance_concentration(dimensions)

print("Distance Concentration Analysis")
print("=" * 50)
print(f"{'Dimension':<12} {'Mean Ratio':<15} {'Std':<10}")
print("-" * 50)
for r in results:
    print(f"{r['dim']:<12} {r['mean_ratio']:<15.4f} {r['std_ratio']:<10.4f}")

print("Note: As dimension increases, the contrast ratio")
print("(max_dist - min_dist) / min_dist decreases,")
print("meaning all points become roughly equidistant.")
print("This is the 'curse of dimensionality' in action!")
```

What's Next:
We've covered distances—how to measure differences. The next page explores similarity measures, which flip the perspective: instead of measuring how far apart objects are, we measure how similar they are. We'll cover cosine similarity, Jaccard index, and other similarity functions crucial for retrieval, recommendation, and clustering.
You now have a rigorous understanding of distance metrics, from their axiomatic foundations to practical considerations in high-dimensional ML. This knowledge is fundamental for K-NN, clustering, anomaly detection, and any algorithm that compares data points.